Piecing the puzzle

Self-publishing queryable research data on the Web

Ruben Verborgh

Ghent University – imec

Who thinks publishing
Linked Data on the Web
is a good idea?

Who publishes their
own Linked Data?

Who enables people to
query their Linked Data?

My open-source pipeline
improves the queryability of
Linked Data on websites.

This pipeline unlocks
the value of your data
for client-side applications.

Piecing the puzzle

Use case: personal website
Making our Linked Data queryable
Open questions for self-publishers

My personal website contains metadata
about my research and publications.

This includes metadata for:

40 blog posts
200 scientific publications
2 Linked Research articles
- with RDFa annotations and citations

My data is published following
the Linked Data principles.

1 FOAF profile in Turtle
- partly manual, mostly autogenerated
- >6,000 RDF triples about myself, people, publications
260 HTML+RDFa pages
- autogenerated using 5 template pages
- >13,000 RDF triples about publications, blog posts, articles
- (±50 triples per page)

My data is modeled using several
ontologies and vocabularies.

Friend of a Friend (FOAF)
Schema.org
Bibliographic Ontology (BIBO)
Citation Typing Ontology (CiTO)
DBpedia

I publish my own Linked Data because
we need to practice what we preach.

How can we convince people
they should publish 5-star Linked Data
if we continue making up excuses
for not doing it ourselves?
My data is a home-cooking version
of Semantic Web Dog Food.

I publish my own Linked Data because
others already publish it—wrongly.

I struggle to keep up with the incompleteness, inaccuracies, duplicates, and wrong entries on existing research networks.

But who am I generating this data for?

Clearly, existing research networks don’t read it.
- …with the exception of Google Scholar.
Schema.org-compatible search engines
are probably very happy with it.
- They crawl and see the whole dataset.
We can’t build useful live applications on top of it.
- Linked Data clients only see small parts of the data.

Piecing the puzzle

Use case: personal website
Making our Linked Data queryable
Open questions for self-publishers

The value of my Linked Data
needs to be unlocked.

I want to:

help people consume my data
lower the barrier for data reusers
enable powerful cross-dataset queries
- no more silos

Traversal-based Linked Data querying
cannot answer all questions adequately.

Completeness cannot be guaranteed.
- Web linking is unidirectional.
The semantic constructs in the query
are seldom identical to those of the data.
- foaf:Person, schema:Person, or wikidata:Q5?
- rdfs:label, dc:title, foaf:name, or schema:name?
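The incompleteness caused by unidirectional links can be illustrated with a minimal sketch. It assumes a toy "Web" held in a dictionary; the `ex:` identifiers and the `WEB`, `traverse`, and `citing` names are hypothetical and not part of the actual website or any library.

```python
from collections import deque

# Hypothetical mini-Web: dereferencing a URL yields the triples of that document.
# Each document only describes its own outgoing links.
WEB = {
    "ex:paperA": {("ex:paperA", "cito:cites", "ex:paperB")},
    "ex:paperB": {("ex:paperB", "dc:title", "Paper B")},
}

def traverse(seed):
    """Collect every triple reachable by following links, starting at seed."""
    seen, queue, triples = set(), deque([seed]), set()
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue  # already visited, or not dereferenceable (e.g. a literal)
        seen.add(url)
        for s, p, o in WEB[url]:
            triples.add((s, p, o))
            queue.append(o)  # follow the outgoing link
    return triples

def citing(target, triples):
    """Answer 'which works cite target?' over the triples found so far."""
    return {s for s, p, o in triples if p == "cito:cites" and o == target}

# Starting from paper B, the citation by paper A is never discovered,
# because paper B's document does not link back to paper A.
from_b = citing("ex:paperB", traverse("ex:paperB"))
from_a = citing("ex:paperB", traverse("ex:paperA"))
```

In this sketch the answer depends on the seed: the same question yields no results when traversal starts at the cited paper, and one result when it happens to start at the citing paper.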

Solving querying fully at the server side
is too expensive for personal data.

Hosting a SPARQL endpoint is expensive.
A SPARQL endpoint with reasoning even more.
Marking up all data in two directions
and with multiple ontologies is unfeasible.
- difficult to maintain
- hard to express in RDFa Lite

I designed a simple ETL pipeline
to enrich and publish my website’s data.

This process runs every night:

1. Extract RDF triples from
   Turtle and HTML+RDFa documents.
2. Reason over this data and its ontologies.
3. Publish the result in a queryable interface.

Reasoning on the data and its ontologies
makes hidden semantics explicit.

1. Skolemize ontologies to remove blank nodes.
2. Compute deductive closure of ontologies.
3. Compute deductive closure of ontologies and data.
4. Subtract the result of step 2 from that of step 3 to obtain only the enriched data.
5. Remove leftover triples that contain skolemized IRIs.
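The five reasoning steps above can be sketched on plain Python sets of triples. This is only an illustration under stated assumptions: the toy reasoner supports just three rules (rdfs:subPropertyOf, rdfs:domain, owl:inverseOf), the `urn:skolem:` prefix and all `ex:` terms are hypothetical, and the reasoner used in the actual pipeline is not shown here.

```python
SKOLEM = "urn:skolem:"  # hypothetical prefix for skolemized blank nodes

def skolemize(triples):
    """Step 1: replace blank nodes (written '_:x') by skolem IRIs."""
    def sk(term):
        return SKOLEM + term[2:] if term.startswith("_:") else term
    return {(sk(s), sk(p), sk(o)) for s, p, o in triples}

def closure(triples):
    """Steps 2 and 3: forward-chain the three toy rules until nothing new appears."""
    result = set(triples)
    while True:
        derived = set()
        for rs, rp, ro in result:        # triples acting as rules
            for s, p, o in result:       # triples the rules apply to
                if rp == "rdfs:subPropertyOf" and p == rs:
                    derived.add((s, ro, o))
                elif rp == "rdfs:domain" and p == rs:
                    derived.add((s, "rdf:type", ro))
                elif rp == "owl:inverseOf" and p == rs:
                    derived.add((o, ro, s))
                elif rp == "owl:inverseOf" and p == ro:
                    derived.add((o, rs, s))
        if derived <= result:
            return result
        result |= derived

# Toy ontology; the domain of foaf:knows is an anonymous (blank node) class.
ontology = {
    ("schema:name", "rdfs:subPropertyOf", "rdfs:label"),
    ("schema:hasPart", "owl:inverseOf", "schema:isPartOf"),
    ("foaf:knows", "rdfs:domain", "_:someClass"),
}
# Toy data as it might be extracted from a website.
data = {
    ("ex:me", "schema:name", "Ruben"),
    ("ex:article", "schema:isPartOf", "ex:journal"),
    ("ex:me", "foaf:knows", "ex:colleague"),
}

onto = skolemize(ontology)                      # step 1
onto_closure = closure(onto)                    # step 2
full_closure = closure(onto | data)             # step 3
enriched = full_closure - onto_closure          # step 4
final = {t for t in enriched                    # step 5
         if not any(term.startswith(SKOLEM) for term in t)}
```

Here `final` keeps the three original data triples and adds the inferred rdfs:label and schema:hasPart triples, while the ontology triples themselves and the type triple that points to the skolemized anonymous class are dropped.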

Reasoning expresses the same data
in different ways for different clients.

step	time (s)	# triples
extraction	170	17,000
skolemization ontologies	1	44,000
closure ontologies	39	145,000
closure ontologies & data	62	183,000
subtraction	1	39,000
removal	1	36,000
total	273	36,000

Reasoning fills ontological gaps
before querying happens.

predicate	# triples before reasoning	# triples after reasoning
`dc:title`	657	714
`rdfs:label`	473	714
`foaf:name`	394	714
`schema:name`	439	714
`schema:isPartOf`	263	263
`schema:hasPart`	0	263
`cito:cites`	0	33
`cito:citesAsAuthority`	14	14

The resulting data is published
in a Triple Pattern Fragments interface.

A TPF server lets clients access RDF data
only by single triple patterns.
- Full SPARQL queries are executed by clients.
TPF extends the Linked Data principles.
1. Also offer predicate- and object-based lookup.
2. Provide “dereferencing” of a URL on a different domain.
TPF interfaces are cheap.
- My server costs less than $5/month.
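The division of labor between a TPF server and its clients can be sketched as follows. This is a simplified in-memory illustration, not the actual TPF server or client software: the `DATA` list, the `fragment` function, and the `names_of_people_known_by` helper are hypothetical, and real TPF interfaces additionally provide paging, count metadata, and hypermedia controls, which are omitted here.

```python
# Hypothetical dataset held by the server.
DATA = [
    ("ex:me", "foaf:knows", "ex:alice"),
    ("ex:me", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:name", "Carol"),
]

def fragment(s=None, p=None, o=None):
    """Server side: answer a single triple pattern; None acts as a variable."""
    pattern = (s, p, o)
    return [t for t in DATA
            if all(q is None or q == v for q, v in zip(pattern, t))]

def names_of_people_known_by(person):
    """Client side: evaluate a two-pattern query as a nested-loop join,
    requesting one triple pattern at a time from the server."""
    names = []
    for _, _, friend in fragment(s=person, p="foaf:knows"):
        for _, _, name in fragment(s=friend, p="foaf:name"):
            names.append(name)
    return names

friends = names_of_people_known_by("ex:me")
# Object-based lookup, which subject-only dereferencing cannot offer:
incoming = fragment(o="ex:alice")
```

The server only ever filters on one pattern, which is what keeps it cheap; the join logic, and therefore most of the query cost, sits with the client.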

TPF query clients find all results
and find them faster.

query	# results (LD)	# results (TPF)	time in s (LD)	time in s (TPF)
people I know	0	196	5.6	2.1
publications I wrote	0	205	10.8	4.0
my publications	134	205	12.6	4.1
works I cite	0	33	4.0	0.5
my interests (federated)	0	4	4.0	0.4

Piecing the puzzle

Use case: personal website
Making our Linked Data queryable
Open questions for self-publishers

Open questions about
creating Linked Data:

How do we select what to publish as RDF?
What kind of data do we prioritize?
What data belongs in a FOAF profile,
and what data on a webpage?

Open questions about
modeling Linked Data:

What ontologies should we use?
- …on webpages?
- …in a FOAF profile?
Should we describe the same concepts
using multiple ontologies?
Should we use generic properties and classes
or specific subproperties and subclasses?

Open questions about
Linked Data identifiers:

Should we reuse identifiers, mint our own, or both?
- avoid owl:sameAs trouble?
- ability to dereference on own website?
Should we publish data in named RDF graphs?
- provenance?
- conflict resolution?

Piecing the puzzle

Use case: personal website
Making our Linked Data queryable
Open questions for self-publishers

With minimal tooling,
querying my Linked Data
became better, faster,
and more flexible—
even across datasets.

Your website’s Linked Data
can become queryable too.
Just use the pipeline.

So no more excuses ;-)
Self-publish
your Linked Data.

Piecing the puzzle

Self-publishing queryable research data on the Web

@RubenVerborgh, Ghent University – imec

Browse my Linked Data at data.verborgh.org.
Query my Linked Data at query.verborgh.org.
