Where does Linked Data come from,
and where does it go?
-
Many applications capture data directly
in the format they will use afterwards.
-
In contrast, most Linked Data
is not linked at the start.
-
Most Linked Data is actually
first captured in a different, non-RDF format.
Many Linked Data life cycles have been proposed.
This simple cycle consists of 5 steps.
Generation is the step in which
we convert non-RDF data to RDF.
Most representations on the Web
are generated by templates.
-
Data resides in a back-end database.
-
The front-end Web application translates
database entries via templates into representations.
-
The data model from the database
easily maps to the target representation.
Templating is not always sufficient
to create 5-star Linked Data.
-
You can upgrade a JSON document to JSON-LD
by providing a relevant JSON-LD context.
-
A context identifies properties and structural elements,
but does not create links between resources.
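As a minimal sketch (the field names and the choice of FOAF terms are hypothetical), such a context could look like this:
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": { "@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id" }
  },
  "@id": "http://example.org/people/albert",
  "name": "Albert",
  "knows": "http://example.org/people/jacques"
}
The keys "name" and "knows" now denote FOAF properties, and the value of "knows" is interpreted as a link rather than a string.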
-
Remodelling data to RDF on the fly is tricky.
-
Linked Data transcends your application.
- Link to external concepts.
- Reuse external identifiers.
Linked Data can be generated in batch
through a mapping process.
-
A mapping processor takes input sources
and a mapping file as input.
-
The mapping file explains how input data
should be converted to the RDF model.
- selectors to locate data
- rules to transform data
-
Different processors have different features.
- source type support
- interlinking together with mapping
- …
Mapping can be performed with
ad-hoc scripts for a specific dataset.
-
You write custom code to handle your case.
-
The short-term costs might be low,
but long-term maintenance can prove difficult.
- Changes require a developer.
- The mapping is not reusable across datasets.
-
Different but related data sources result in
duplicated effort and incompatibilities.
R2RML is an RDF vocabulary to describe
a mapping of relational data into RDF.
RML is a generalization of R2RML
toward heterogeneous data sources.
-
RML abstracts different types of logical sources.
- A source is modelled as an iterator of items.
-
Sources can be databases, CSV, JSON, or XML files, Web APIs, …
- The mechanism is extensible through new vocabularies.
-
RML term maps use logical sources to create terms.
-
RML can map and interlink heterogeneous sources.
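As a sketch, an RML mapping for a hypothetical people.csv source with columns id and name could look like this:
@prefix rr:   <http://www.w3.org/ns/r2rml#>.
@prefix rml:  <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:   <http://semweb.mmlab.be/ns/ql#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

<#PersonMapping>
  rml:logicalSource [
    rml:source "people.csv";              # hypothetical input file
    rml:referenceFormulation ql:CSV       # iterate over CSV rows
  ];
  rr:subjectMap [
    rr:template "http://example.org/person/{id}"
  ];
  rr:predicateObjectMap [
    rr:predicate foaf:name;
    rr:objectMap [ rml:reference "name" ]
  ].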
The mapping can be extended
to include other resources and properties.
-
RML mappings can incorporate data
from other sources, even in different formats.
-
Parts of mapping definitions can be reused.
-
RML mappings can look up data from Web APIs.
-
Interlinking can happen at mapping time,
rather than as a separate step after mapping.
- Reuse identifiers as much as possible.
- Link data as early as possible.
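For instance, the sketch above could gain a rule that reuses DBpedia identifiers at mapping time (assuming a hypothetical city column; prefixes as before, plus dbo: for the DBpedia ontology):
<#PersonMapping> rr:predicateObjectMap [
  rr:predicate dbo:birthPlace;
  rr:objectMap [ rr:template "http://dbpedia.org/resource/{city}" ]   # link to an external identifier
].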
After Linked Data has been generated,
we can validate it using semantics.
Validation can be applied
on different data levels.
-
validation on the individual field level
- spelling mistakes
- field overloading
- simple datatypes
- …
-
validation on the structural level
- integrity
- required and prohibited structures
- …
-
validation on the semantic level
- domain and range of values
- inconsistencies
- …
Databases only allow for
rudimentary constraint validation.
-
They perform elementary value-type checks.
- field Temperature has type INTEGER
-
They can ensure referential integrity.
- Does a referenced record exist?
Schemas can validate conformance to
a reusable set of structural constraints.
-
The Shapes Constraint Language (SHACL)
allows for
the specification of validation rules in RDF.
-
Use nested shapes to describe the desired structure.
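A minimal sketch of such a shape (the ex: namespace and the concrete constraints are hypothetical):
@prefix sh:   <http://www.w3.org/ns/shacl#>.
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix ex:   <http://example.org/shapes#>.

ex:PersonShape a sh:NodeShape;
  sh:targetClass foaf:Person;
  sh:property [
    sh:path foaf:name;
    sh:minCount 1;                    # every person needs at least one name
    sh:datatype xsd:string
  ];
  sh:property [
    sh:path foaf:knows;
    sh:node [                         # nested shape: the linked resource must also have a name
      sh:property [ sh:path foaf:name; sh:minCount 1 ]
    ]
  ].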
-
Shape Expressions (ShEx)
serve similar goals,
but have a higher expressivity.
-
Both languages soften the open-world assumption
by providing expectations for a specific context.
-
This is useful for apps with specific assumptions.
Ontologies allow for more specific
content-based validation.
-
More specific types can be defined,
and their definitions can be reused.
- temperature
- days per month
- …
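As a sketch, a reusable days-per-month datatype could be defined in OWL by restricting xsd:integer (the ex: name is hypothetical):
@prefix owl:  <http://www.w3.org/2002/07/owl#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#>.
@prefix ex:   <http://example.org/>.

ex:daysPerMonth a rdfs:Datatype;
  owl:equivalentClass [
    a rdfs:Datatype;
    owl:onDatatype xsd:integer;                                       # based on integers
    owl:withRestrictions ( [ xsd:minInclusive 28 ] [ xsd:maxInclusive 31 ] )
  ].
A reasoner can then flag a value such as 32 for a property with this range.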
-
Type checking can take more factors into account.
- domain and range
- incompatible types
- …
-
Additional reasoning
can further identify problems.
In the following example, type checking
identifies an incorrect triple.
The ontology defines the following constraints:
foaf:knows rdfs:domain foaf:Person;
rdfs:range foaf:Person.
:Mathematics a :Course.
:Course owl:disjointWith foaf:Person.
This triple violates those constraints:
:Albert foaf:knows :Mathematics.
Violations across triples can be identified,
but not always automatically resolved.
The ontology defines the following constraints:
:isBiologicalFatherOf a owl:IrreflexiveProperty,
owl:InverseFunctionalProperty.
The triples below are inconsistent:
:Albert :isBiologicalFatherOf :Albert.
:Albert :isBiologicalFatherOf :Delphine.
:Jacques :isBiologicalFatherOf :Delphine.
Which ones are correct is not known.
Automated validation tells you
whether data makes sense.
-
RDFUnit assesses the quality of a dataset
by running automated tests on it.
-
Ontologies used in a dataset can be looked up
by dereferencing its concepts.
-
The constraints in the ontology are transformed
into SPARQL queries and then evaluated.
- This results in warnings and/or errors.
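As a sketch, the foaf:knows constraints from the earlier example could be checked with a query along these lines:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# report resources that are known, yet typed as something disjoint with foaf:Person
SELECT ?s ?o WHERE {
  ?s foaf:knows ?o.
  ?o a ?type.
  ?type owl:disjointWith foaf:Person.
}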
By validating during the mapping process,
we detect quality issues before they occur.
-
Batch mapping processes
can generate millions of triples.
-
If we find a quality issue only afterwards,
we have to restart the entire mapping.
-
Checking quality during mapping allows
pinpointing the cause of errors and fixing them.
- Mapping rules that fail need not be executed first.
As soon as Linked Data is ready,
it can be published for consumption.
There are roughly 3 ways of
publishing Linked Data on the Web.
A data dump places all dataset triples
in one or more archive files.
-
Dumps need to be downloaded entirely
before they can be queried.
- Dump files can be several gigabytes.
-
They offer the client full flexibility
to choose how data is processed.
-
Keeping data up to date requires effort.
- redownload the entire dump
- download and apply incremental patches
A SPARQL endpoint lets clients evaluate
arbitrary (read-only) queries on a server.
-
This gives clients direct access to
(only) the data they are interested in.
- Only very little bandwidth is required.
-
Data is always up-to-date.
-
The per-request cost for SPARQL endpoints
is much higher than for other HTTP servers.
- Few servers allow arbitrarily complicated queries.
Linked Data documents provide
per-topic access to a dataset.
-
They follow the Linked Data principles.
- The information structure resembles typical webpages.
-
Browsing up-to-date datasets is straightforward.
-
Query evaluation is possible through
link-traversal-based querying.
-
The evaluation of SPARQL queries is rather slow.
-
Completeness cannot always be guaranteed.
Once Linked Data is published on the Web,
clients can evaluate queries over it.
Just like on the “human” Web,
querying goes beyond browsing.
- Where can we find the data we need?
- How can we access that data?
- How do we combine it with other data?
The possibilities for query evaluation
depend on how data is made available.
-
Is the data available in RDF?
- Then we should discover the ontologies used.
-
Is the data linked to other data?
- Then we can / might need to involve other datasets.
-
In what interfaces is the data available?
- The client might need to evaluate (a part of) the query.
Evaluating queries over a federation
of interfaces introduces new challenges.
-
Which interface has the necessary data?
- If there are multiple, which one is the best?
-
How will the query evaluation be coordinated?
- Does one SPARQL endpoint talk to others?
- Does the client talk to all SPARQL endpoints?
-
In what order are subqueries executed?
- We should minimize the number of intermediary results.
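For example, SPARQL 1.1 federation lets a client or an endpoint delegate part of a query to another endpoint (the second endpoint URL below is hypothetical):
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name WHERE {
  ?person a dbo:Artist.                       # evaluated at the local source
  SERVICE <https://example.org/sparql> {      # delegated to a remote endpoint
    ?person foaf:name ?name.
  }
}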
Enhancements let client feedback
find its way back to the source.
Data doesn’t stop when published.
It only just begins.
-
When users query data sources,
they might spot mistakes or missing data.
-
Can users correct data?
Can they create new data?
-
Especially open data should be open to corrections.
- Feedback is a core added value of openness.
Unfortunately, such feedback loops
are still rare for Linked Data.
Open challenges include:
- How can end users edit triples?
- How can edits be reviewed?
- How do we keep track of history?
Provenance allows modeling
the history trail of facts.
-
Provenance captures entities, activities, and people
involved in producing a resource.
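A sketch using the PROV-O vocabulary (all ex: names are hypothetical):
@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix ex:   <http://example.org/>.

ex:dataset a prov:Entity;
  prov:wasGeneratedBy ex:mappingRun;     # the activity that produced it
  prov:wasDerivedFrom ex:sourceFile;     # the original data
  prov:wasAttributedTo ex:dataTeam.      # the people responsible

ex:mappingRun a prov:Activity;
  prov:used ex:mappingFile.

ex:dataTeam a prov:Agent.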
-
Provenance can be used to assess quality, reliability, or trust.
-
Tim Berners-Lee sketched an “Oh yeah?” button
you click to gain trust in information on the Web.
-
Many provenance challenges are still open.
- How to generate provenance (during mapping)?
- How to store n provenance triples per data triple?
Reverse mappings could feed edits
back to the original source.
-
If the RDF triples are generated from raw data,
edits to the triples should be ported back.
- If not, they are overwritten by the next mapping.
-
If the mapping file declaratively specifies
how a source maps to triples,
we might be able to reverse it automatically.
The original Semantic Web vision
features intelligent agents.
Schedule bi-weekly appointments
with a licensed physical therapist,
specialized in a particular field,
located near my home or my workplace.
adapted from The Semantic Web
Do we still need the Semantic Web
with a smartphone in our pockets?
-
On the surface, the current generation of smart devices delivers
much of what the Semantic Web promised.
- Just realize the intelligence
is not on your smartphone.
The current generation of agents
only performs preprogrammed acts.
-
The Semantic Web’s goal is to allow this
with unknown services and data.
-
The service inclusion process for current digital agents
is non-transparent and non-democratic, unlike the Web.
-
Even if it were democratic, it wouldn’t scale,
since the integration is hardcoded.
-
Machines should discover services and data
and use them without any prior knowledge.
Before Linked Data, the Semantic Web
suffered from a chicken-and-egg problem.
-
Applications and data were waiting for each other.
- Have you used the Semantic Web yet?
-
Now, at least the data is there on a large scale.
-
What more do we need to kickstart things?
-
We published billions of triples,
spanning many domains.
-
Many answers we seek are out there.
-
But how do we find them?
How do we access them?
The Semantic Web’s answer to live querying
has been “public SPARQL endpoints”.
-
A SPARQL endpoint allows clients
to send any SPARQL query for evaluation.
-
What could possibly go wrong?
- What if clients send expensive queries?
- What if many clients send medium queries?
- How will you mirror the server’s data?
Would you put your SQL database
publicly on the Web—even just read-only?
-
Normally, Web interfaces purposely
limit access to the underlying database.
-
RDF removes the model barrier, and thus
the need for clients to know the data schema.
- The interface is not necessary as an abstraction.
-
Yet RDF does not remove the complexity barrier.
- The interface might still be necessary as a limiter.
-
The average SPARQL endpoint
is down for 1.5 days each month.
-
Web servers usually express availability as a number of nines.
-
Building a reliable application on top of
a publicly queryable Linked Data source
is thus currently not realistic.
- Things only get worse if you need multiple sources.
Linked Data availability on the Web
is a serious, two-sided problem.
-
Public SPARQL endpoints that exist lack uptime.
-
High uptime would be possible,
but comes with a high server cost.
-
For most datasets, no public endpoint exists.
-
Publishers provide data dumps instead,
but these cannot be queried live.
If we all host a private endpoint,
it is no longer a Semantic Web.
If you have operational need
for SPARQL-accessible data,
you must have your own infrastructure.
No public endpoints.
Public endpoints are for lookups and discovery;
sort of a dataset demo.
Orri Erling, OpenLink (2014)
21 years of Semantic Web research
has mostly led to intelligent servers.
-
Without a cost model behind it,
server intelligence is not scalable.
- Not every Web resource can be created just for you.
-
Instead of trying to be intelligent ourselves,
we should enable clients to be intelligent.
-
What is a good basic set of conditions to guarantee
realistic availability of Linked Data on the Web?
Possible Linked Data interfaces exist
in between the two extremes.
Linked Data Fragments is a uniform view
on Linked Data interfaces.
Every Linked Data interface
offers specific fragments
of a Linked Data set.
Each type of Linked Data Fragment
is defined by three characteristics.
Linked Data Fragment
- data
- What triples does the fragment contain?
- metadata
- Do we know more about the data/fragment?
- controls
- How can we access more data?
Each type of Linked Data Fragment
is defined by three characteristics.
data dump
- data
- all dataset triples
- metadata
- number of triples, file size
- controls
- (none)
Each type of Linked Data Fragment
is defined by three characteristics.
SPARQL query result
- data
- triples matching the query
- metadata
- (none)
- controls
- (none)
Each type of Linked Data Fragment
is defined by three characteristics.
Linked Data document
- data
- triples about a topic
- metadata
- creator, maintainer, …
- controls
- links to other Linked Data documents
We designed a new trade-off mix
with low cost and high availability.
A Triple Pattern Fragments interface
is low-cost and enables clients to query.
A Triple Pattern Fragment is designed
to have a good information/cost balance.
Triple Pattern Fragment
- data
- matches of a triple pattern (paged)
- metadata
- total number of matches
- controls
- access to all other Triple Pattern Fragments
of the same dataset
Triple Pattern Fragments are lightweight,
because they do not require a triple store.
-
The interface can be realized with many back-ends.
- A SPARQL endpoint could serve as back-end.
-
Since queries are relatively simple,
a less expensive data infrastructure is sufficient.
-
The Header–Dictionary–Triples (HDT) format
stores triples in a compressed file.
- Triple-pattern lookups (and counts) are especially fast.
Triple patterns are not the final answer.
No interface ever will be.
-
There’s no silver bullet.
Publication and querying always involves trade-offs.
-
Triple Pattern Fragments aim to test how far
we can get with simple servers and smart clients.
-
To verify this, we need to execute the same queries
on different systems and measure the impact.
How can a client evaluate
a SPARQL query over a TPF interface?
-
Give the client a SPARQL query,
and the URL of any TPF of the dataset.
-
It uses the controls inside of the fragment
to determine how to access the dataset.
-
It reads the metadata to decide
how to plan the query.
Let’s follow the execution
of an example SPARQL query.
Find artists born in cities named Waterloo.
SELECT ?person ?city WHERE {
?person rdf:type dbpedia-owl:Artist.
?person dbpedia-owl:birthPlace ?city.
?city foaf:name "Waterloo"@en.
}
Fragment: https://fragments.dbpedia.org/2016-04/en
The client looks inside of the fragment
to see how it can access the dataset.
<https://fragments.dbpedia.org/2016-04/en#dataset> hydra:search [
hydra:template "https://fragments.dbpedia.org/2016-04/en{?s,p,o}";
hydra:mapping
[ hydra:variable "s"; hydra:property rdf:subject ],
[ hydra:variable "p"; hydra:property rdf:predicate ],
[ hydra:variable "o"; hydra:property rdf:object ]
].
You can query the dataset by triple pattern.
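For example, filling in this template for the pattern ?city foaf:name "Waterloo"@en could yield a request such as the following (values percent-encoded, variable names as in the control above):
https://fragments.dbpedia.org/2016-04/en?p=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fname&o=%22Waterloo%22%40en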
The client splits the query
into the available fragments.
-
?person rdf:type dbo:Artist.
-
?person dbo:birthPlace ?city.
-
?city foaf:name "Waterloo"@en.
It gets the first page of all fragments
and inspects their metadata.
-
?person rdf:type dbo:Artist.
96,000
-
?person dbo:birthPlace ?city.
12,000,000
-
?city foaf:name "Waterloo"@en.
26
It starts with the smallest fragment,
because it is most selective.
-
?person rdf:type dbo:Artist.
-
?person dbo:birthPlace ?city.
-
?city foaf:name "Waterloo"@en.
26
dbr:Waterloo,_Iowa foaf:name "Waterloo"@en.
dbr:Waterloo,_London foaf:name "Waterloo"@en.
dbr:Waterloo,_Ontario foaf:name "Waterloo"@en.
- …
This process continues recursively
until all options have been tested.
-
?person rdf:type dbo:Artist.
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
-
?city foaf:name "Waterloo"@en.
dbr:Waterloo,_Iowa foaf:name "Waterloo"@en.
dbr:Waterloo,_London foaf:name "Waterloo"@en.
dbr:Waterloo,_Ontario foaf:name "Waterloo"@en.
- …
It gets the first page of all fragments
and inspects their metadata.
-
?person rdf:type dbo:Artist.
96,000
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
45
It starts with the smallest fragment,
because it is most selective.
-
?person rdf:type dbo:Artist.
96,000
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
45
dbr:Allan_Carpenter dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Adam_DeVine dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Bonnie_Koloc dbo:birthPlace dbr:Waterloo,_Iowa.
- …
This process continues recursively
until all options have been tested.
-
dbr:Allan_Carpenter rdf:type dbo:Artist.
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
45
dbr:Allan_Carpenter dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Adam_DeVine dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Bonnie_Koloc dbo:birthPlace dbr:Waterloo,_Iowa.
- …
It gets the first page of the fragment,
which provides mappings for a solution.
-
dbr:Allan_Carpenter rdf:type dbo:Artist.
1
dbr:Allan_Carpenter rdf:type dbo:Artist.
We found a solution mapping.
- ?person
- dbr:Allan_Carpenter
- ?city
- dbr:Waterloo,_Iowa
Some paths will result in empty fragments.
They do not lead to a consistent solution.
-
dbr:Adam_DeVine rdf:type dbo:Artist.
0
No solution mapping.
At least, according to DBpedia.
It turns out that Adam DeVine is actually an actor.
Executing this query in the browser client
only takes a couple of seconds.
We evaluated Triple Pattern Fragments
for server cost and availability.
We configured the Amazon machines
to generate large loads in a Web-like setting.
- 1 server (4 cores)
- 1 cache
- 1–244 simultaneous clients (1 core each)
The query throughput is lower,
but resilient to high client numbers.
The server traffic is higher,
but individual requests are lighter.
Caching is significantly more effective,
as clients reuse fragments for queries.
The server requires much less CPU,
allowing higher availability at lower cost.
The server enables clients to be intelligent,
so it can remain simple and lightweight.
These experiments verify the possibility
(and necessity) of new types of solutions.
-
Processing everything on the server is costly.
Processing everything on the client isn’t Web.
-
Solutions that divide the workload
can offer new perspectives,
if we accept the trade-offs they bring.
-
Is it realistic to make all queries on the Web fast?
- Maybe we should focus on obtaining first results soon.
The Semantic Web should focus on making
each of the life cycle phases sustainable.
-
Continuously improve Linked Data
through efficient iterations of the cycle.
-
If we want to see intelligent agents,
we must stop building intelligent servers.
-
Another perspective on sustainability:
how can I enable others to act on my data
now and in the future?