Piecing the puzzle
    Self-publishing queryable research data on the Web
    Ruben Verborgh
    Ghent University – imec
   
  
    
      Who thinks publishing
      Linked Data on the Web
      is a good idea?
    
  
  
    
      Who publishes their
 own Linked Data?
    
  
  
    
      Who enables people to
      query their Linked Data?
    
  
  
    
      My open-source pipeline
      improves the queryability of
      Linked Data on websites.
    
  
  
    
      This pipeline unlocks
      the value of your data
      for client-side applications.
    
  
  
  
  
    
      My personal website contains metadata
      about my research and publications.
    
    This includes metadata for myself, people, publications, blog posts, and articles.
    
   
  
    
      My data is published following
      the Linked Data principles.
    
    
      - 1 FOAF profile in Turtle (sketch below)
          - partly manual, mostly autogenerated
          - >6,000 RDF triples about myself, people, publications
      - 260 HTML+RDFa pages
          - autogenerated using 5 template pages
          - >13,000 RDF triples about publications, blog posts, articles
          - ±50 triples per page
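
      For illustration only, the kind of triples such a profile contains could be sketched
      in Turtle as follows (all IRIs and values below are placeholders, not the actual profile):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Hypothetical excerpt; the IRIs are placeholders, not the real profile.
<https://example.org/#me>
    a          foaf:Person ;
    foaf:name  "Ruben Verborgh" ;
    foaf:knows <https://example.org/people/colleague#me> ;
    foaf:made  <https://example.org/articles/piecing-the-puzzle.html> .
```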
 
        
       
    
   
  
    
      My data is modeled using several
      ontologies and vocabularies.
    
    
      - Friend of a Friend (FOAF)
 
      - Schema.org
 
      - Bibliographic Ontology (BIBO)
 
      - Citation Typing Ontology (CiTO)
 
      - DBpedia
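
      As a sketch of how these vocabularies can combine on a single resource
      (the IRIs are hypothetical, and the actual property choices in my data may differ):

```turtle
@prefix schema: <http://schema.org/> .
@prefix bibo:   <http://purl.org/ontology/bibo/> .
@prefix cito:   <http://purl.org/spar/cito/> .
@prefix dbr:    <http://dbpedia.org/resource/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .

# Hypothetical publication description; identifiers are placeholders.
<https://example.org/articles/piecing-the-puzzle.html>
    a            schema:Article, bibo:AcademicArticle ;
    schema:name  "Piecing the puzzle" ;
    foaf:maker   <https://example.org/#me> ;
    schema:about dbr:Linked_data ;
    cito:cites   <https://example.org/articles/another-work.html> .
```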
 
    
   
  
    
      I publish my own Linked Data because
      we need to practice what we preach.
    
    
      - How can we convince people they should publish 5-star Linked Data
        if we continue making up excuses for not doing it ourselves?
      - My data is a home-cooking version of Semantic Web Dog Food.
       
    
   
  
    
      I publish my own Linked Data because
      others already publish it—wrongly.
    
    
      I struggle to keep up with the incompleteness, inaccuracies,
      duplicates, and wrong entries of my profiles on existing research networks.
    
    
   
  
    
      But who am I generating this data for?
    
    
      - Clearly, existing research networks don’t read it.
          - …with the exception of Google Scholar.
      - Schema.org-compatible search engines are probably very happy with it.
          - They crawl and see the whole dataset.
      - We can’t build useful live applications on top of it.
          - Linked Data clients only see small parts of the data.
 
        
       
    
   
  
  
    
      The value of my Linked Data
      needs to be unlocked.
    
    I want to:
    
      - help people consume my data
 
      - lower the barrier for data reusers
 
      - enable powerful cross-dataset queries
        
      
 
    
   
  
    
      Traversal-based Linked Data querying
      cannot answer all questions adequately.
    
    
      - Completeness cannot be guaranteed.
          - Web linking is unidirectional.
      - The semantic constructs in the query are seldom identical
        to those of the data (see the query sketch after this list).
          - foaf:Person, schema:Person, or wikidata:Q5?
          - rdfs:label, dc:title, foaf:name, or schema:name?
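
      Concretely, the mismatch can look like the sketch below: a query phrased with
      Schema.org terms finds nothing in data that only uses FOAF terms
      (an illustrative query, not one of the later benchmark queries):

```sparql
PREFIX schema: <http://schema.org/>

# A client asks for people by their Schema.org terms…
SELECT ?person ?name WHERE {
  ?person a schema:Person ;
          schema:name ?name .
}
# …but a source that only states  ?x a foaf:Person ; foaf:name "…"
# yields zero matches unless reasoning or an alignment bridges the vocabularies.
```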
           
        
       
    
   
  
    
      Solving querying fully at the server side
      is too expensive for personal data.
    
    
      - Hosting a SPARQL endpoint is expensive.
 
      - A SPARQL endpoint with reasoning costs even more.
 
      - Marking up all data in two directions
        and with multiple ontologies is infeasible.
          - difficult to maintain
          - hard to express in RDFa Lite
 
        
       
    
   
  
    
      I designed a simple ETL pipeline
      to enrich and publish my website’s data.
    
    
      This process runs every night:
    
    
      - Extract RDF triples from Turtle and HTML+RDFa documents.
      - Reason over this data and its ontologies.
 
      - Publish the result in a queryable interface.
 
    
   
  
    
      Reasoning on the data and its ontologies
      makes hidden semantics explicit.
    
    
      1. Skolemize ontologies to remove blank nodes.
      2. Compute the deductive closure of the ontologies.
      3. Compute the deductive closure of the ontologies and the data.
      4. Subtract 2 from 3 to obtain only the enriched data.
      5. Remove leftover skolemized IRIs.
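
      A small sketch of what steps 2–4 contribute: given alignment axioms in the ontologies
      (the owl:inverseOf statement below is an assumed alignment; the CiTO subproperty axiom
      is part of CiTO itself), reasoning derives the missing inverse and superproperty triples,
      and the subtraction keeps only those derived data triples:

```turtle
@prefix schema: <http://schema.org/> .
@prefix cito:   <http://purl.org/spar/cito/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .

# Ontology axioms (inputs to the closure; the owl:inverseOf alignment is assumed):
schema:isPartOf       owl:inverseOf       schema:hasPart .
cito:citesAsAuthority rdfs:subPropertyOf  cito:cites .

# Original data (hypothetical IRIs):
<https://example.org/articles/a> schema:isPartOf       <https://example.org/proceedings/p> .
<https://example.org/articles/a> cito:citesAsAuthority <https://example.org/articles/b> .

# Triples that reasoning adds, and that the subtraction step keeps:
<https://example.org/proceedings/p> schema:hasPart <https://example.org/articles/a> .
<https://example.org/articles/a>    cito:cites     <https://example.org/articles/b> .
```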
 
    
   
  
    
      Reasoning expresses the same data
      in different ways for different clients.
    
    
      
      | step                        | time (s) | # triples |
      |-----------------------------|---------:|----------:|
      | extraction                  |      170 |    17,000 |
      | skolemization of ontologies |        1 |    44,000 |
      | closure of ontologies       |       39 |   145,000 |
      | closure of ontologies & data|       62 |   183,000 |
      | subtraction                 |        1 |    39,000 |
      | removal                     |        1 |    36,000 |
      | total                       |      273 |    36,000 |
      
    
   
  
    
      Reasoning fills ontological gaps
      before querying happens.
    
    
      
      | property              | # pre | # post |
      |-----------------------|------:|-------:|
      | dc:title              |   657 |    714 |
      | rdfs:label            |   473 |    714 |
      | foaf:name             |   394 |    714 |
      | schema:name           |   439 |    714 |
      | schema:isPartOf       |   263 |    263 |
      | schema:hasPart        |     0 |    263 |
      | cito:cites            |     0 |     33 |
      | cito:citesAsAuthority |    14 |     14 |
      
    
   
  
    
      The resulting data is published
      in a Triple Pattern Fragments interface.
    
    
      - A TPF server lets clients access RDF data
        only by single triple patterns (see the sketch after this list).
          - Full SPARQL queries are executed by clients.
      - TPF extends the Linked Data principles.
          - Also offer predicate- and object-based lookup.
          - Provide “dereferencing” of a URL on a different domain.
      - TPF interfaces are cheap.
          - My server costs less than $5/month.
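
      As a rough sketch of this division of labour (placeholder IRIs): a client plans and
      executes the whole SPARQL query below, but only ever asks the server for one triple
      pattern at a time and joins the bindings locally.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# The client evaluates the full query itself (IRIs are placeholders):
SELECT ?friend ?name WHERE {
  <https://example.org/#me> foaf:knows ?friend .  # pattern 1: one request to the TPF server
  ?friend foaf:name ?name .                       # pattern 2: further requests, joined client-side
}
```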
 
        
       
    
   
  
    
      TPF query clients find all results
      and find them faster.
    
    
    
      | query                    | # results, LD | # results, TPF | time (s), LD | time (s), TPF |
      |--------------------------|--------------:|---------------:|-------------:|--------------:|
      | people I know            |             0 |            196 |          5.6 |           2.1 |
      | publications I wrote     |             0 |            205 |         10.8 |           4.0 |
      | my publications          |           134 |            205 |         12.6 |           4.1 |
      | works I cite             |             0 |             33 |          4.0 |           0.5 |
      | my interests (federated) |             0 |              4 |          4.0 |           0.4 |
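
      The federated row combines my dataset with an external one; a cross-dataset query of
      that kind might look like the sketch below (placeholder IRI for me; a TPF client would
      evaluate it over my interface together with a DBpedia fragments interface):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>

# Combine my profile data with DBpedia: English abstracts of my interests.
SELECT ?interest ?abstract WHERE {
  <https://example.org/#me> foaf:topic_interest ?interest .  # from my dataset
  ?interest dbo:abstract ?abstract .                         # from DBpedia
  FILTER (lang(?abstract) = "en")
}
```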
    
    
   
  
  
    
      Open questions about
      creating Linked Data:
    
    
      - How do we select what to publish as RDF?
        What kind of data do we prioritize?
      - What data belongs in a FOAF profile,
        and what data on a webpage?
       
    
   
  
    
      Open questions about
      modeling Linked Data:
    
    
      - What ontologies should we use?
          - …on webpages?
          - …in a FOAF profile?
      - Should we describe the same concepts
        using multiple ontologies?
      - Should we use generic properties and classes
        or specific subproperties and subclasses?
       
    
   
  
    
      Open questions about
      Linked Data identifiers:
    
    
      - Should we reuse identifiers, mint our own, or both?
          - avoid owl:sameAs trouble?
          - ability to dereference on our own website?
      - Should we publish data in named RDF graphs?
          - provenance?
          - conflict resolution?
 
        
       
    
   
  
  
    
      With minimal tooling,
      querying my Linked Data
      became better, faster,
      and more flexible—
      even across datasets.
    
  
  
    
      Your website’s Linked Data
      can become queryable too.
      Just use the pipeline.
    
  
  
    
      So no more excuses ;-)
      Self-publish your Linked Data.