Piecing the puzzle
Self-publishing queryable research data on the Web
Ruben Verborgh
Ghent University – imec
Who thinks publishing
Linked Data on the Web
is a good idea?
Who publishes their
own Linked Data?
Who enables people to
query their Linked Data?
My open-source pipeline
improves the queryability of
Linked Data on websites.
This pipeline unlocks
the value of your data
for client-side applications.
My personal website contains metadata
about my research and publications.
This includes metadata for:
My data is published following
the Linked Data principles.
- 1 FOAF profile in Turtle
  - partly manual, mostly autogenerated
  - >6,000 RDF triples about myself, people, publications
- 260 HTML+RDFa pages
  - autogenerated using 5 template pages
  - >13,000 RDF triples about publications, blog posts, articles
  - (±50 triples per page)
My data is modeled using several
ontologies and vocabularies.
- Friend of a Friend (FOAF)
- Schema.org
- Bibliographic Ontology (BIBO)
- Citation Typing Ontology (CiTO)
- DBpedia
I publish my own Linked Data because
we need to practice what we preach.
- How can we convince people
  they should publish 5-star Linked Data
  if we continue making up excuses
  for not doing it ourselves?
- My data is a home-cooking version
  of Semantic Web Dog Food.
I publish my own Linked Data because
others already publish it—wrongly.
I struggle to keep up with the incompleteness,
inaccuracies, duplicates, and wrong entries of:
But who am I generating this data for?
- Clearly, existing research networks don’t read it.
  - …with the exception of Google Scholar.
- Schema.org-compatible search engines
  are probably very happy with it.
  - They crawl and see the whole dataset.
- We can’t build useful live applications on top of it.
  - Linked Data clients only see small parts of the data.
The value of my Linked Data
needs to be unlocked.
I want to:
- help people consume my data
- lower the barrier for data reusers
- enable powerful cross-dataset queries
Traversal-based Linked Data querying
cannot answer all questions adequately.
- Completeness cannot be guaranteed.
  - Web linking is unidirectional.
- The semantic constructs in the query
  are seldom identical to those of the data.
  - foaf:Person, schema:Person, or wikidata:Q5?
  - rdfs:label, dc:title, foaf:name, or schema:name?
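Both limitations can be illustrated with a toy model of link traversal. This is only a sketch under stated assumptions: the `ex:` URLs, the `DOCUMENTS` dictionary, and the `traverse` function are hypothetical stand-ins for real dereferencing, not part of any actual client.

```python
# Hypothetical documents: each URL "dereferences" to the triples published there.
DOCUMENTS = {
    "ex:paper1": [("ex:paper1", "schema:author", "ex:alice"),
                  ("ex:paper1", "dc:title", "A paper")],
    "ex:alice": [("ex:alice", "foaf:name", "Alice")],
}

def traverse(start, predicate, obj=None):
    """Follow outgoing links from `start`, collecting matching triples."""
    seen, queue, found = set(), [start], []
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        for s, p, o in DOCUMENTS.get(url, []):
            if p == predicate and (obj is None or o == obj):
                found.append((s, p, o))
            # Only objects are followed; terms without a document yield nothing.
            queue.append(o)
    return found

# Starting from the person, the paper is never reached: no link points to it.
papers_from_alice = traverse("ex:alice", "schema:author", "ex:alice")
# Starting from the paper, the same pattern does match.
papers_from_paper = traverse("ex:paper1", "schema:author", "ex:alice")
# The name exists, but under foaf:name rather than schema:name.
names = traverse("ex:alice", "schema:name")
```

In this toy setup, the first and third lookups come back empty even though the information exists, which mirrors the two problems listed above.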
Solving querying fully at the server side
is too expensive for personal data.
- Hosting a SPARQL endpoint is expensive.
- A SPARQL endpoint with reasoning even more.
- Marking up all data in two directions
  and with multiple ontologies is unfeasible.
  - difficult to maintain
  - hard to express in RDFa Lite
I designed a simple ETL pipeline
to enrich and publish my website’s data.
This process runs every night:
- Extract RDF triples from
Turtle and HTML+RDFa documents.
- Reason over this data and its ontologies.
- Publish the result in a queryable interface.
Reasoning on the data and its ontologies
makes hidden semantics explicit.
1. Skolemize ontologies to remove blank nodes.
2. Compute deductive closure of ontologies.
3. Compute deductive closure of ontologies and data.
4. Subtract 2 from 3 to obtain only the enriched data.
5. Remove leftover skolemized IRIs.
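The five steps above can be sketched with plain Python sets of triples. This is a minimal illustration, not the reasoner the pipeline actually uses: it assumes just two inference rules (subproperty and inverse property), and the `ex:` data, the `skolem:` prefix, and all function names are hypothetical.

```python
SUBPROP = "rdfs:subPropertyOf"
INVERSE = "owl:inverseOf"
SKOLEM_PREFIX = "skolem:"  # assumed prefix for skolem IRIs

def skolemize(triples):
    """Replace blank nodes (written '_:x') with skolem IRIs."""
    def sk(term):
        return SKOLEM_PREFIX + term[2:] if term.startswith("_:") else term
    return {(sk(s), sk(p), sk(o)) for s, p, o in triples}

def closure(triples):
    """Naive forward chaining until no new triples appear."""
    result = set(triples)
    while True:
        new = set()
        for s, p, o in result:
            for s2, p2, o2 in result:
                if p2 == SUBPROP and s2 == p:   # (s p o) => (s q o)
                    new.add((s, o2, o))
                if p2 == INVERSE and s2 == p:   # (s p o) => (o q s)
                    new.add((o, o2, s))
                if p2 == INVERSE and o2 == p:   # inverse, other direction
                    new.add((o, s2, s))
        if new <= result:
            return result
        result |= new

def enrich(ontology, data):
    onto = skolemize(ontology)                   # step 1
    onto_closure = closure(onto)                 # step 2
    full_closure = closure(onto | set(data))     # step 3
    enriched = full_closure - onto_closure       # step 4
    return {t for t in enriched                  # step 5
            if not any(term.startswith(SKOLEM_PREFIX) for term in t)}

# Stand-in axioms and data, for illustration only.
ontology = {("foaf:name", SUBPROP, "rdfs:label"),
            ("schema:isPartOf", INVERSE, "schema:hasPart"),
            ("_:b0", "rdfs:comment", "anonymous ontology node")}
data = {("ex:alice", "foaf:name", "Alice"),
        ("ex:paper", "schema:isPartOf", "ex:proceedings")}
enriched = enrich(ontology, data)
```

Here `enriched` holds the two original data triples plus the two inferred ones (an `rdfs:label` for the name and a `schema:hasPart` for the inverse direction), while the ontology's own triples are subtracted out.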
Reasoning expresses the same data
in different ways for different clients.
| step                      | time (s) | # triples |
| extraction                |      170 |    17,000 |
| skolemization ontologies  |        1 |    44,000 |
| closure ontologies        |       39 |   145,000 |
| closure ontologies & data |       62 |   183,000 |
| subtraction               |        1 |    39,000 |
| removal                   |        1 |    36,000 |
| total                     |      273 |    36,000 |
Reasoning fills ontological gaps
before querying happens.
| predicate             | # pre | # post |
| dc:title              |   657 |    714 |
| rdfs:label            |   473 |    714 |
| foaf:name             |   394 |    714 |
| schema:name           |   439 |    714 |
| schema:isPartOf       |   263 |    263 |
| schema:hasPart        |     0 |    263 |
| cito:cites            |     0 |     33 |
| cito:citesAsAuthority |    14 |     14 |
The resulting data is published
in a Triple Pattern Fragments interface.
- A TPF server lets clients access RDF data
  only by single triple patterns.
- Full SPARQL queries are executed by clients.
- TPF extends the Linked Data principles.
  - Also offer predicate- and object-based lookup.
  - Provide “dereferencing” of a URL on a different domain.
- TPF interfaces are cheap.
  - My server costs less than $5/month.
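The division of labor can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the TPF protocol itself (which works over HTTP with paging and metadata); the `ex:` data and both function names are hypothetical.

```python
# Hypothetical dataset held by the server.
TRIPLES = [
    ("ex:paper1", "schema:isPartOf", "ex:proceedings"),
    ("ex:paper1", "schema:author", "ex:alice"),
    ("ex:paper2", "schema:isPartOf", "ex:journal"),
    ("ex:paper2", "schema:author", "ex:bob"),
]

def match(triples, s=None, p=None, o=None):
    """Server side: answer one triple pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def authors_of_parts(triples, collection):
    """Client side: join two patterns,
    ?work schema:isPartOf <collection> . ?work schema:author ?who ."""
    results = []
    for work, _, _ in match(triples, p="schema:isPartOf", o=collection):
        for _, _, who in match(triples, s=work, p="schema:author"):
            results.append((work, who))
    return results

authors = authors_of_parts(TRIPLES, "ex:proceedings")
```

The server only ever evaluates `match`, which is cheap, while the join logic lives entirely in the client. Note that the first lookup is object-based, something plain subject dereferencing does not offer.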
TPF query clients find all results
and find them faster.
| query                    | # results (LD) | # results (TPF) | time in s (LD) | time in s (TPF) |
| people I know            |              0 |             196 |            5.6 |             2.1 |
| publications I wrote     |              0 |             205 |           10.8 |             4.0 |
| my publications          |            134 |             205 |           12.6 |             4.1 |
| works I cite             |              0 |              33 |            4.0 |             0.5 |
| my interests (federated) |              0 |               4 |            4.0 |             0.4 |
Open questions about
creating Linked Data:
- How do we select what to publish as RDF?
  What kind of data do we prioritize?
- What data belongs in a FOAF profile,
  and what data on a webpage?
Open questions about
modeling Linked Data:
- What ontologies should we use?
  - …on webpages?
  - …in a FOAF profile?
- Should we describe the same concepts
  using multiple ontologies?
- Should we use generic properties and classes
  or specific subproperties and subclasses?
Open questions about
Linked Data identifiers:
- Should we reuse identifiers, mint our own, or both?
  - avoid owl:sameAs trouble?
  - ability to dereference on own website?
- Should we publish data in named RDF graphs?
  - provenance?
  - conflict resolution?
With minimal tooling,
querying my Linked Data
became better, faster,
and more flexible—
even across datasets.
Your website’s Linked Data
can become queryable too.
Just use the pipeline.
So no more excuses ;-)
Self-publish
your Linked Data.