Web Fundamentals
Linked Data Publishing

Ruben Verborgh, IDLab, Ghent University, imec


Except where otherwise noted, the content of these slides is licensed under a Creative Commons Attribution 4.0 International License.

Where does Linked Data come from,
and where does it go?

[photo of a stork delivering Linked Data]
©2010 Marianne de Wit

[diagram of the Linked Open Data cloud]
©2014 M. Schmachtenberg, C. Bizer, A. Jentzsch and R. Cyganiak


Web Fundamentals
Linked Data Publishing


Many Linked Data life cycles have been proposed.
This simple cycle consists of 5 steps.

Generation is the step in which
we convert non-RDF data to RDF.

Most representations on the Web
are generated by templates.

Templating is not always sufficient
to create 5-star Linked Data.

Linked Data can be generated in batch
through a mapping process.

Mapping can be performed with
ad-hoc scripts for a specific dataset.

R2RML is an RDF vocabulary to describe
a mapping of relational data into RDF.

RML is a generalization of R2RML
toward heterogeneous data sources.

Let us consider an example
of musical performances.

A JSON file contains a list of performances:

{ ... "Performance" :
  { "Perf_ID": "567",
    "Venue": { "Name": "Vooruit",
               "Venue_ID": "78" },
    "Location": { "longitude": "3.725379",
								  "latitude": "51.0477644" } },
    ...
}

The venues could be mapped
using the following RML document.

<#VenuesMapping>
    rml:logicalSource [
        rml:source "http://ex.com/performances.json";
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$.Performance.[*]"
    ];
    rr:subjectMap [
        rr:template "http://ex.com/venues/{Venue_ID}"
    ].

The venues could be mapped
using the following RML document.

<#VenuesMapping>
    rr:predicateObjectMap [
        rr:predicate geo:long;
        rr:objectMap [ rml:reference "longitude";
                       rr:datatype xsd:float ]
    ], [
        rr:predicate geo:lat;
        rr:objectMap [ rml:reference "latitude";
                       rr:datatype xsd:float ]
    ].

The execution of the mapping
results in an RDF dataset.

...
<http://ex.com/venues/78>
    geo:long 3.725379;
    geo:lat 51.0477644.

<http://ex.com/venues/91>
    geo:long 3.728515;
    geo:lat 51.056008.
...

The mapping can be extended
to include other resources and properties.
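
For instance, a sketch of such an extension might add venue names and a triples map for the performances themselves (schema:name and ex:venue are assumed vocabulary choices; the reference paths follow the JSON structure above):

<#VenuesMapping>
    rr:predicateObjectMap [
        rr:predicate schema:name;
        rr:objectMap [ rml:reference "Venue.Name" ]
    ].

<#PerformancesMapping>
    rml:logicalSource [
        rml:source "http://ex.com/performances.json";
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$.Performance.[*]"
    ];
    rr:subjectMap [
        rr:template "http://ex.com/performances/{Perf_ID}"
    ];
    rr:predicateObjectMap [
        rr:predicate ex:venue;
        rr:objectMap [ rr:template "http://ex.com/venues/{Venue.Venue_ID}" ]
    ].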

After Linked Data has been generated,
we can validate it using semantics.

Validation can be applied
on different data levels.

Databases only allow for
rudimentary constraint validation.

Schemas can validate conformance to
a reusable set of structural constraints.
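
For example, a minimal SHACL sketch for the venue data above (the shape IRI and the ex:Venue target class are assumptions, not part of the mapping output):

ex:VenueShape a sh:NodeShape;
    sh:targetClass ex:Venue;
    sh:property [
        sh:path geo:lat;
        sh:datatype xsd:float;
        sh:minCount 1;
        sh:maxCount 1
    ], [
        sh:path geo:long;
        sh:datatype xsd:float;
        sh:minCount 1;
        sh:maxCount 1
    ].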

Ontologies allow for more specific
content-based validation.

In the following example, type checking
identifies an incorrect triple.

The ontology defines the following constraints:

foaf:knows rdfs:domain foaf:Person;
           rdfs:range foaf:Person.
:Mathematics a :Course.
:Course owl:disjointWith foaf:Person.

This triple violates those constraints:

:Albert foaf:knows :Mathematics.

Violations across triples can be identified,
but not always automatically resolved.

The ontology defines the following constraints:

:isBiologicalFatherOf a owl:IrreflexiveProperty,
                        owl:InverseFunctionalProperty.

The triples below are inconsistent:

:Albert  :isBiologicalFatherOf :Albert.
:Albert  :isBiologicalFatherOf :Delphine.
:Jacques :isBiologicalFatherOf :Delphine.

Which ones are correct is not known.

Automated validation tells you
whether data makes sense.

By validating during the mapping process,
we detect quality issues before they occur.

As soon as Linked Data is ready,
it can be published for consumption.

There are roughly 3 ways of
publishing Linked Data on the Web.

A data dump places all dataset triples
in one or more archive files.

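A dump is typically announced with a VoID description; a minimal sketch (the dataset IRI, dump URL, and triple count are illustrative):

<http://ex.com/dataset#performances> a void:Dataset;
    void:dataDump <http://ex.com/dumps/performances.nt.gz>;
    void:triples 1200000.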

A SPARQL endpoint lets clients evaluate
arbitrary (read-only) queries on a server.

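For instance, a client could send a query such as this one (a sketch over the venue data generated earlier; prefixes elided as elsewhere in these slides):

SELECT ?venue ?lat ?long WHERE {
    ?venue geo:lat ?lat;
           geo:long ?long.
    FILTER (?lat >= 51.0 && ?lat <= 51.1)
}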

Linked Data documents provide
per-topic access to a dataset.

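As a sketch, dereferencing http://ex.com/venues/78 from the earlier example might return a small Turtle document like the following (ex:hosts and the creator IRI are illustrative):

<http://ex.com/venues/78>
    geo:long 3.725379;
    geo:lat 51.0477644;
    ex:hosts <http://ex.com/performances/567>.

<> dcterms:creator <http://ex.com/team>.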

Once Linked Data is published on the Web,
clients can evaluate queries over it.

Just like on the “human” Web,
querying goes beyond browsing.

The possibilities for query evaluation
depend on how data is made available.

Evaluating queries over a federation
of interfaces introduces new challenges.
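
As an illustration, SPARQL 1.1 federation combines multiple interfaces within a single query through SERVICE clauses (the endpoint URLs below are placeholders):

SELECT ?artist ?name WHERE {
    SERVICE <https://example.org/artists/sparql> {
        ?artist rdf:type dbo:Artist.
    }
    SERVICE <https://example.org/names/sparql> {
        ?artist foaf:name ?name.
    }
}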

Enhancements let client feedback
find its way back to the source.

Data doesn’t stop when published.
It only just begins.

Unfortunately, such feedback loops
are still rare for Linked Data.

Open challenges include:

Provenance allows modeling
the history trail of facts.
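
A minimal PROV-O sketch (the activity IRI and timestamp are illustrative) could record how a generated resource came to be:

<http://ex.com/venues/78>
    prov:wasDerivedFrom <http://ex.com/performances.json>;
    prov:wasGeneratedBy <#mappingExecution>.

<#mappingExecution> a prov:Activity;
    prov:used <#VenuesMapping>;
    prov:endedAtTime "2016-04-01T12:00:00Z"^^xsd:dateTime.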

Reverse mappings could feed edits
back to the original source.

Web Fundamentals
Linked Data Publishing

The original Semantic Web vision
features intelligent agents.

Schedule bi-weekly appointments
with a licensed physical therapist,
specialized in a particular field,
who practices near my home or my workplace.

adapted from The Semantic Web (Berners-Lee, Hendler & Lassila, Scientific American, 2001)

Do we still need the Semantic Web
with a smartphone in our pockets?

[an iPhone running Siri]

The current generation of agents
only performs preprogrammed acts.

Before Linked Data, the Semantic Web
suffered from a chicken-and-egg problem.

We have all the infrastructure,
so where are all the intelligent agents?

The SemWeb’s answer to live querying
has been “public SPARQL endpoints”.

Would you put your SQL database
publicly on the Web—even just read-only?

Most public SPARQL endpoints
have less than 95% availability.

Linked Data availability on the Web
is a serious, two-sided problem.

If we all host a private endpoint,
it is no longer a Semantic Web.

If you have operational need
for SPARQL-accessible data,
you must have your own infrastructure.

No public endpoints.
Public endpoints are for lookups and discovery;
sort of a dataset demo.

Orri Erling, OpenLink (2014)

21 years of Semantic Web research
has mostly led to intelligent servers.

Web Fundamentals
Linked Data Publishing

Possible Linked Data interfaces exist
in between the two extremes of data dumps and SPARQL endpoints.

Linked Data Fragments is a uniform view
on Linked Data interfaces.

Every Linked Data interface
offers specific fragments
of a Linked Data set.

Each type of Linked Data Fragment
is defined by three characteristics.

Linked Data Fragment

data
What triples does the fragment contain?
metadata
Do we know more about the data/fragment?
controls
How can we access more data?

Each type of Linked Data Fragment
is defined by three characteristics.

data dump

data
all dataset triples
metadata
number of triples, file size
controls
(none)

Each type of Linked Data Fragment
is defined by three characteristics.

SPARQL query result

data
triples matching the query
metadata
(none)
controls
(none)

Each type of Linked Data Fragment
is defined by three characteristics.

Linked Data document

data
triples about a topic
metadata
creator, maintainer, …
controls
links to other Linked Data documents

We designed a new trade-off mix
with low cost and high availability.

A Triple Pattern Fragments interface
is low-cost and enables clients to query.

A Triple Pattern Fragment is designed
to have a good information/cost balance.

Triple Pattern Fragment

data
matches of a triple pattern (paged)
metadata
total number of matches
controls
access to all other Triple Pattern Fragments
of the same dataset

This Triple Pattern Fragment shows
subjects born in London from DBpedia.
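
As a sketch, the fragment's URL instantiates the interface's triple-pattern template (described under controls below); percent-encoding of the parameter values is omitted for readability:

https://fragments.dbpedia.org/2016-04/en
    ?p=http://dbpedia.org/ontology/birthPlace
    &o=http://dbpedia.org/resource/London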

Triple Pattern Fragments are lightweight,
because they do not require a triple store.

Triple patterns are not the final answer.
No interface ever will be.

Web Fundamentals
Linked Data Publishing

Triple Pattern Fragment servers
enable clients to be intelligent.

controls
The HTML representation explains:
you can query by triple pattern.

Triple Pattern Fragment servers
enable clients to be intelligent.

controls
The RDF representation explains:
you can query by triple pattern.
<https://fragments.dbpedia.org/2016-04/en#dataset> hydra:search [
  hydra:template "https://fragments.dbpedia.org/2016-04/en{?s,p,o}";
  hydra:mapping
    [ hydra:variable "s"; hydra:property rdf:subject ],
    [ hydra:variable "p"; hydra:property rdf:predicate ],
    [ hydra:variable "o"; hydra:property rdf:object ]
].

Triple Pattern Fragment servers
enable clients to be intelligent.

metadata
The HTML representation explains:
this is the number of matches.

Triple Pattern Fragment servers
enable clients to be intelligent.

metadata
The RDF representation explains:
this is the number of matches.
<#fragment> void:triples 7937.

How can a client evaluate
a SPARQL query over a TPF interface?

Let’s follow the execution
of an example SPARQL query.

Find artists born in cities named Waterloo.

SELECT ?person ?city WHERE {
    ?person rdf:type dbo:Artist.
    ?person dbo:birthPlace ?city.
    ?city foaf:name "Waterloo"@en.
}

Fragment: https://fragments.dbpedia.org/2016-04/en

The client looks inside of the fragment
to see how it can access the dataset.

<https://fragments.dbpedia.org/2016-04/en#dataset> hydra:search [
  hydra:template "https://fragments.dbpedia.org/2016-04/en{?s,p,o}";
  hydra:mapping
    [ hydra:variable "s"; hydra:property rdf:subject ],
    [ hydra:variable "p"; hydra:property rdf:predicate ],
    [ hydra:variable "o"; hydra:property rdf:object ]
].

You can query the dataset by triple pattern.

The client splits the query
into the available fragments.

  1. ?person rdf:type dbo:Artist.
  2. ?person dbo:birthPlace ?city.
  3. ?city foaf:name "Waterloo"@en.

It gets the first page of all fragments
and inspects their metadata.

  1. ?person rdf:type dbo:Artist. 96,000
    • (first 100 triples)
  2. ?person dbo:birthPlace ?city. 12,000,000
    • (first 100 triples)
  3. ?city foaf:name "Waterloo"@en. 26
    • (first 100 triples)

It starts with the smallest fragment,
because it is most selective.

  1. ?person rdf:type dbo:Artist.
  2. ?person dbo:birthPlace ?city.
  3. ?city foaf:name "Waterloo"@en. 26

This process continues recursively
until all options have been tested.

  1. ?person rdf:type dbo:Artist.
  2. ?person dbo:birthPlace dbr:Waterloo,_Iowa.
  3. ?city foaf:name "Waterloo"@en.
    • dbr:Waterloo,_Iowa foaf:name "Waterloo"@en.
    • dbr:Waterloo,_London foaf:name "Waterloo"@en.
    • dbr:Waterloo,_Ontario foaf:name "Waterloo"@en.

It gets the first page of all fragments
and inspects their metadata.

  1. ?person rdf:type dbo:Artist. 96,000
    • (first 100 triples)
  2. ?person dbo:birthPlace dbr:Waterloo,_Iowa. 45
    • (first 100 triples)

It starts with the smallest fragment,
because it is most selective.

  1. ?person rdf:type dbo:Artist. 96,000
  2. ?person dbo:birthPlace dbr:Waterloo,_Iowa. 45
    • dbr:Allan_Carpenter dbo:birthPlace dbr:Waterloo,_Iowa.
    • dbr:Adam_DeVine dbo:birthPlace dbr:Waterloo,_Iowa.
    • dbr:Bonnie_Koloc dbo:birthPlace dbr:Waterloo,_Iowa.

This process continues recursively
until all options have been tested.

  1. dbr:Allan_Carpenter rdf:type dbo:Artist.
  2. ?person dbo:birthPlace dbr:Waterloo,_Iowa. 45
    • dbr:Allan_Carpenter dbo:birthPlace dbr:Waterloo,_Iowa.
    • dbr:Adam_DeVine dbo:birthPlace dbr:Waterloo,_Iowa.
    • dbr:Bonnie_Koloc dbo:birthPlace dbr:Waterloo,_Iowa.

It gets the first page of the fragment,
which provides mappings for a solution.

  1. dbr:Allan_Carpenter rdf:type dbo:Artist. 1
    • dbr:Allan_Carpenter rdf:type dbo:Artist.

We found a solution mapping.

?person  dbr:Allan_Carpenter
?city    dbr:Waterloo,_Iowa

Some paths will result in empty fragments.
They do not lead to a consistent solution.

  1. dbr:Adam_DeVine rdf:type dbo:Artist. 0

No solution mapping.

At least, according to DBpedia.
It turns out that Adam DeVine is actually an actor.

Executing this query in the browser client
only takes a couple of seconds.

Web Fundamentals
Linked Data Publishing

We evaluated Triple Pattern Fragments
for server cost and availability.

We ran the Berlin SPARQL benchmark
on Amazon EC2 virtual machines.

We evaluated Triple Pattern Fragments
for server cost and availability.

We configured the Amazon machines
to generate large loads in a Web-like setting.

The query throughput is lower,
but resilient to high client numbers.

The server traffic is higher,
but individual requests are lighter.

Caching is significantly more effective,
as clients reuse fragments for queries.

The server requires much less CPU,
allowing higher availability at lower cost.

The server enables clients to be intelligent,
so it can remain simple and lightweight.

These experiments verify the possibility
(and necessity) of new types of solutions.

Web Fundamentals
Linked Data Publishing

The Semantic Web should focus on making
each of the life cycle phases sustainable.