Where does Linked Data come from,
and where does it go?
-
Many applications capture data directly
in the format they will use afterwards.
-
In contrast, most Linked Data
is not linked at the start.
-
Most Linked Data is actually
first captured in a different, non-RDF format.
Many Linked Data life cycles have been proposed.
This simple cycle consists of 5 steps.
Generation is the step in which
we convert non-RDF data to RDF.
Most representations on the Web
are generated by templates.
-
Data resides in a back-end database.
-
The front-end Web application translates
database entries via templates into representations.
-
The data model from the database
easily maps to the target representation.
Templating is not always sufficient
to create 5-star Linked Data.
-
You can upgrade a JSON document to JSON-LD
by providing a relevant JSON-LD context.
-
A context identifies properties and structural elements,
but does not create links between resources.
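As a minimal sketch (the field names and the choice of FOAF terms are hypothetical), such a context could look like this:
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": { "@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id" }
  },
  "@id": "http://example.org/people/albert",
  "name": "Albert",
  "knows": "http://example.org/people/jacques"
}
The keys "name" and "knows" now denote FOAF properties, and the value of "knows" is interpreted as a link rather than a string.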
-
Remodelling data to RDF on the fly is tricky.
-
Linked Data transcends your application.
- Link to external concepts.
- Reuse external identifiers.
Linked Data can be generated in batch
through a mapping process.
-
A mapping processor takes input sources
and a mapping file as input.
-
The mapping file explains how input data
should be converted to the RDF model.
- selectors to locate data
- rules to transform data
-
Different processors have different features.
- source type support
- interlinking together with mapping
- …
Mapping can be performed with
ad-hoc scripts for a specific dataset.
-
You write custom code to handle your case.
-
The short-term costs might be low,
but long-term maintenance can prove difficult.
- Changes require a developer.
- The mapping is not reusable across datasets.
-
Different but related data sources result in
duplicated effort and incompatibilities.
R2RML is an RDF vocabulary to describe
a mapping of relational data into RDF.
RML is a generalization of R2RML
toward heterogeneous data sources.
-
RML abstracts different types of logical sources.
- A source is modelled as an iterator of items.
-
Sources can be databases, CSV, JSON, or XML files, Web APIs, …
- The mechanism is extensible through new vocabularies.
-
RML term maps use logical sources to create terms.
-
RML can map and interlink heterogeneous sources.
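As a sketch, an RML mapping for a hypothetical people.csv source with columns id and name could look like this:
@prefix rr:   <http://www.w3.org/ns/r2rml#>.
@prefix rml:  <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:   <http://semweb.mmlab.be/ns/ql#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

<#PersonMapping>
  rml:logicalSource [
    rml:source "people.csv";              # hypothetical input file
    rml:referenceFormulation ql:CSV       # iterate over CSV rows
  ];
  rr:subjectMap [
    rr:template "http://example.org/person/{id}"
  ];
  rr:predicateObjectMap [
    rr:predicate foaf:name;
    rr:objectMap [ rml:reference "name" ]
  ].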
The mapping can be extended
to include other resources and properties.
-
RML mappings can incorporate data
from other sources, even in different formats.
-
Parts of mapping definitions can be reused.
-
RML mappings can look up data from Web APIs.
-
Interlinking can happen at mapping time,
rather than as a separate step after mapping.
- Reuse identifiers as much as possible.
- Link data as early as possible.
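For instance, the sketch above could gain a rule that reuses DBpedia identifiers at mapping time (assuming a hypothetical city column; prefixes as before, plus dbo: for the DBpedia ontology):
<#PersonMapping> rr:predicateObjectMap [
  rr:predicate dbo:birthPlace;
  rr:objectMap [ rr:template "http://dbpedia.org/resource/{city}" ]   # link to an external identifier
].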
After Linked Data has been generated,
we can validate it using semantics.
Validation can be applied
on different data levels.
-
validation on the individual field level
- spelling mistakes
- field overloading
- simple datatypes
- …
-
validation on the structural level
- integrity
- required and prohibited structures
- …
-
validation on the semantic level
- domain and range of values
- inconsistencies
- …
Databases only allow for
rudimentary constraint validation.
-
They perform elementary value-type checks.
- field Temperature has type INTEGER
-
They can ensure referential integrity.
- Does a referenced record exist?
Schemas can validate conformance to
a reusable set of structural constraints.
-
The Shapes Constraint Language (SHACL)
allows for
the specification of validation rules in RDF.
-
Use nested shapes to describe the desired structure.
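A minimal sketch of such a shape (the ex: namespace and the concrete constraints are hypothetical):
@prefix sh:   <http://www.w3.org/ns/shacl#>.
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix ex:   <http://example.org/shapes#>.

ex:PersonShape a sh:NodeShape;
  sh:targetClass foaf:Person;
  sh:property [
    sh:path foaf:name;
    sh:minCount 1;                    # every person needs at least one name
    sh:datatype xsd:string
  ];
  sh:property [
    sh:path foaf:knows;
    sh:node [                         # nested shape: the linked resource must also have a name
      sh:property [ sh:path foaf:name; sh:minCount 1 ]
    ]
  ].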
-
Shape Expressions (ShEx)
serve similar goals,
but have a higher expressivity.
-
Both languages soften the open-world assumption
by providing expectations for a specific context.
-
This is useful for apps with specific assumptions.
Ontologies allow for more specific
content-based validation.
-
More specific types can be defined,
and their definitions can be reused.
- temperature
- days per month
- …
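As a sketch, a reusable days-per-month datatype could be defined in OWL by restricting xsd:integer (the ex: name is hypothetical):
@prefix owl:  <http://www.w3.org/2002/07/owl#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#>.
@prefix ex:   <http://example.org/>.

ex:daysPerMonth a rdfs:Datatype;
  owl:equivalentClass [
    a rdfs:Datatype;
    owl:onDatatype xsd:integer;                                       # based on integers
    owl:withRestrictions ( [ xsd:minInclusive 28 ] [ xsd:maxInclusive 31 ] )
  ].
A reasoner can then flag a value such as 32 for a property with this range.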
-
Type checking can take more factors into account.
- domain and range
- incompatible types
- …
-
Additional reasoning
can further identify problems.
In the following example, type checking
identifies an incorrect triple.
The ontology defines the following constraints:
foaf:knows rdfs:domain foaf:Person;
rdfs:range foaf:Person.
:Mathematics a :Course.
:Course owl:disjointWith foaf:Person.
This triple violates those constraints:
:Albert foaf:knows :Mathematics.
Violations across triples can be identified,
but not always automatically resolved.
The ontology defines the following constraints:
:isBiologicalFatherOf a owl:IrreflexiveProperty,
owl:InverseFunctionalProperty.
The triples below are inconsistent:
:Albert :isBiologicalFatherOf :Albert.
:Albert :isBiologicalFatherOf :Delphine.
:Jacques :isBiologicalFatherOf :Delphine.
Which ones are correct is not known.
Automated validation tells you
whether data makes sense.
-
RDFUnit assesses the quality of a dataset
by running automated tests on it.
-
Ontologies used in a dataset can be looked up
by dereferencing its concepts.
-
The constraints in the ontology are transformed
into SPARQL queries and then evaluated.
- This results in warnings and/or errors.
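As a sketch, the foaf:knows constraints from the earlier example could be checked with a query along these lines:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# report resources that are known, yet typed as something disjoint with foaf:Person
SELECT ?s ?o WHERE {
  ?s foaf:knows ?o.
  ?o a ?type.
  ?type owl:disjointWith foaf:Person.
}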
By validating during the mapping process,
we detect quality issues before they occur.
-
Batch mapping processes
can generate millions of triples.
-
If we find a quality issue only afterwards,
we have to restart the entire mapping.
-
Checking quality during mapping allows
pinpointing the cause of errors and fixing them.
- Mapping rules that fail need not be executed first.
As soon as Linked Data is ready,
it can be published for consumption.
There are roughly 3 ways of
publishing Linked Data on the Web.
A data dump places all dataset triples
in one or more archive files.
-
Dumps need to be downloaded entirely
before they can be queried.
- Dump files can be several gigabytes.
-
They offer the client full flexibility
to choose how data is processed.
-
Keeping data up to date requires effort.
- redownload the entire dump
- download and apply incremental patches
A SPARQL endpoint lets clients evaluate
arbitrary (read-only) queries on a server.
-
This gives clients direct access to
(only) the data they are interested in.
- Only very little bandwidth is required.
-
Data is always up-to-date.
-
The per-request cost for SPARQL endpoints
is much higher than for other HTTP servers.
- Few servers allow arbitrarily complicated queries.
Linked Data documents provide
per-topic access to a dataset.
-
They follow the Linked Data principles.
- The information structure resembles typical webpages.
-
Browsing up-to-date datasets is straightforward.
-
Query evaluation is possible through
link-traversal-based querying.
-
The evaluation of SPARQL queries is rather slow.
-
Completeness cannot always be guaranteed.
Once Linked Data is published on the Web,
clients can evaluate queries over it.
Just like on the “human” Web,
querying goes beyond browsing.
- Where can we find the data we need?
- How can we access that data?
- How do we combine it with other data?
The possibilities for query evaluation
depend on how data is made available.
-
Is the data available in RDF?
- Then we should discover the ontologies used.
-
Is the data linked to other data?
- Then we can / might need to involve other datasets.
-
In what interfaces is the data available?
- The client might need to evaluate (a part of) the query.
Evaluating queries over a federation
of interfaces introduces new challenges.
-
Which interface has the necessary data?
- If there are multiple, which one is the best?
-
How will the query evaluation be coordinated?
- Does one SPARQL endpoint talk to others?
- Does the client talk to all SPARQL endpoints?
-
In what order are subqueries executed?
- We should minimize the number of intermediary results.
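For example, SPARQL 1.1 federation lets a client or an endpoint delegate part of a query to another endpoint (the second endpoint URL below is hypothetical):
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name WHERE {
  ?person a dbo:Artist.                       # evaluated at the local source
  SERVICE <https://example.org/sparql> {      # delegated to a remote endpoint
    ?person foaf:name ?name.
  }
}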
Enhancements let client feedback
find its way back to the source.
Data doesn’t stop when published.
It only just begins.
-
When users query data sources,
they might spot mistakes or missing data.
-
Can users correct data?
Can they create new data?
-
Especially open data should be open to corrections.
- Feedback is a core added value of openness.
Unfortunately, such feedback loops
are still rare for Linked Data.
Open challenges include:
- How can end users edit triples?
- How can edits be reviewed?
- How do we keep track of history?
Provenance allows modeling
the history trail of facts.
-
Provenance captures entities, activities, and people
involved in producing a resource.
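A sketch using the PROV-O vocabulary (all ex: names are hypothetical):
@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix ex:   <http://example.org/>.

ex:dataset a prov:Entity;
  prov:wasGeneratedBy ex:mappingRun;     # the activity that produced it
  prov:wasDerivedFrom ex:sourceFile;     # the original data
  prov:wasAttributedTo ex:dataTeam.      # the people responsible

ex:mappingRun a prov:Activity;
  prov:used ex:mappingFile.

ex:dataTeam a prov:Agent.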
-
Provenance can be used to assess quality, reliability, or trust.
-
Tim Berners-Lee sketched an “Oh yeah?” button
you click to gain trust in information on the Web.
-
Many provenance challenges are still open.
- How to generate provenance (during mapping)?
- How to store n provenance triples per data triple?
Reverse mappings could feed edits
back to the original source.
-
If the RDF triples are generated from raw data,
edits to the triples should be ported back.
- If not, they are overwritten by the next mapping.
-
If the mapping file declaratively specifies
how a source maps to triples,
we might be able to reverse it automatically.
The original Semantic Web vision
features intelligent agents.
Schedule bi-weekly appointments
with a licensed physical therapist,
specialized in a particular field,
located near my home or my workplace.
adapted from The Semantic Web
Do we still need the Semantic Web
with a smartphone in our pockets?
-
On the surface, the current generation of smart devices delivers
much of what the Semantic Web promised.
- Just realize the intelligence
is not on your smartphone.
The current generation of agents
only performs preprogrammed acts.
-
The Semantic Web’s goal is to allow this
with unknown services and data.
-
The service inclusion process for current digital agents
is non-transparent and non-democratic, unlike the Web.
-
Even if it were democratic, it wouldn’t scale,
since the integration is hardcoded.
-
Machines should discover services and data
and use them without any prior knowledge.
Before Linked Data, the Semantic Web
suffered from a chicken-and-egg problem.
-
Applications and data were waiting for each other.
- Have you used the Semantic Web yet?
-
Now, at least the data is there on a large scale.
-
What more do we need to kickstart things?
-
We published billions of triples,
spanning many domains.
-
Many answers we seek are out there.
-
But how do we find them?
How do we access them?
The Semantic Web’s answer to live querying
has been “public SPARQL endpoints”.
-
A SPARQL endpoint allows clients
to send any SPARQL query for evaluation.
-
What could possibly go wrong?
- What if clients send expensive queries?
- What if many clients send medium queries?
- How will you mirror the server’s data?
Would you put your SQL database
publicly on the Web—even just read-only?
-
Normally, Web interfaces purposely
limit access to the underlying database.
-
RDF removes the model barrier, and thus
the need for clients to know the data schema.
- The interface is not necessary as an abstraction.
-
Yet RDF does not remove the complexity barrier.
- The interface might still be necessary as a limiter.
-
The average SPARQL endpoint
is down for 1.5 days each month.
-
Web servers usually express availability as a number of nines.
-
Building a reliable application on top of
a publicly queryable Linked Data source
is thus currently not realistic.
- Things only get worse if you need multiple sources.
Linked Data availability on the Web
is a serious, two-sided problem.
-
Public SPARQL endpoints that exist lack uptime.
-
High uptime would be possible,
but comes with a high server cost.
-
For most datasets, no public endpoint exists.
-
Publishers provide data dumps instead,
but these cannot be queried live.
If we all host a private endpoint,
it is no longer a Semantic Web.
If you have operational need
for SPARQL-accessible data,
you must have your own infrastructure.
No public endpoints.
Public endpoints are for lookups and discovery;
sort of a dataset demo.
Orri Erling, OpenLink (2014)
21 years of Semantic Web research
has mostly led to intelligent servers.
-
Without a cost model behind it,
server intelligence is not scalable.
- Not every Web resource can be created just for you.
-
Instead of trying to be intelligent ourselves,
we should enable clients to be intelligent.
-
What is a good basic set of conditions to guarantee
realistic availability of Linked Data on the Web?
Possible Linked Data interfaces exist
in between the two extremes.
Linked Data Fragments is a uniform view
on Linked Data interfaces.
Every Linked Data interface
offers specific fragments
of a Linked Data set.
Each type of Linked Data Fragment
is defined by three characteristics.
Linked Data Fragment
- data
- What triples does the fragment contain?
- metadata
- Do we know more about the data/fragment?
- controls
- How can we access more data?
Each type of Linked Data Fragment
is defined by three characteristics.
data dump
- data
- all dataset triples
- metadata
- number of triples, file size
- controls
- (none)
Each type of Linked Data Fragment
is defined by three characteristics.
SPARQL query result
- data
- triples matching the query
- metadata
- (none)
- controls
- (none)
Each type of Linked Data Fragment
is defined by three characteristics.
Linked Data document
- data
- triples about a topic
- metadata
- creator, maintainer, …
- controls
- links to other Linked Data documents
We designed a new trade-off mix
with low cost and high availability.
A Triple Pattern Fragments interface
is low-cost and enables clients to query.
A Triple Pattern Fragment is designed
to have a good information/cost balance.
Triple Pattern Fragment
- data
- matches of a triple pattern (paged)
- metadata
- total number of matches
- controls
- access to all other Triple Pattern Fragments
of the same dataset
Triple Pattern Fragments are lightweight,
because they do not require a triple store.
-
The interface can be realized with many back-ends.
- A SPARQL endpoint could serve as back-end.
-
Since queries are relatively simple,
a less expensive data infrastructure is sufficient.
-
The Header–Dictionary–Triples (HDT) format
stores triples in a compressed file.
- Triple-pattern lookups (and counts) are especially fast.
Triple patterns are not the final answer.
No interface ever will be.
-
There’s no silver bullet.
Publication and querying always involves trade-offs.
-
Triple Pattern Fragments aim to test how far
we can get with simple servers and smart clients.
-
To verify this, we need to execute the same queries
on different systems and measure the impact.
How can a client evaluate
a SPARQL query over a TPF interface?
-
Give the client a SPARQL query,
and the URL of any TPF of the dataset.
-
It uses the controls inside of the fragment
to determine how to access the dataset.
-
It reads the metadata to decide
how to plan the query.
Let’s follow the execution
of an example SPARQL query.
Find artists born in cities named Waterloo.
SELECT ?person ?city WHERE {
?person rdf:type dbpedia-owl:Artist.
?person dbpedia-owl:birthPlace ?city.
?city foaf:name "Waterloo"@en.
}
Fragment: https://fragments.dbpedia.org/2016-04/en
The client looks inside of the fragment
to see how it can access the dataset.
<https://fragments.dbpedia.org/2016-04/en#dataset> hydra:search [
hydra:template "https://fragments.dbpedia.org/2016-04/en{?s,p,o}";
hydra:mapping
[ hydra:variable "s"; hydra:property rdf:subject ],
[ hydra:variable "p"; hydra:property rdf:predicate ],
[ hydra:variable "o"; hydra:property rdf:object ]
].
You can query the dataset by triple pattern.
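For example, filling in this template for the pattern ?city foaf:name "Waterloo"@en could yield a request such as the following (values percent-encoded, variable names as in the control above):
https://fragments.dbpedia.org/2016-04/en?p=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fname&o=%22Waterloo%22%40en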
The client splits the query
into the available fragments.
-
?person rdf:type dbo:Artist.
-
?person dbo:birthPlace ?city.
-
?city foaf:name "Waterloo"@en.
It gets the first page of all fragments
and inspects their metadata.
-
?person rdf:type dbo:Artist.
96,000
-
?person dbo:birthPlace ?city.
12,000,000
-
?city foaf:name "Waterloo"@en.
26
It starts with the smallest fragment,
because it is most selective.
-
?person rdf:type dbo:Artist.
-
?person dbo:birthPlace ?city.
-
?city foaf:name "Waterloo"@en.
26
dbr:Waterloo,_Iowa foaf:name "Waterloo"@en.
dbr:Waterloo,_London foaf:name "Waterloo"@en.
dbr:Waterloo,_Ontario foaf:name "Waterloo"@en.
- …
This process continues recursively
until all options have been tested.
-
?person rdf:type dbo:Artist.
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
-
?city foaf:name "Waterloo"@en.
dbr:Waterloo,_Iowa foaf:name "Waterloo"@en.
dbr:Waterloo,_London foaf:name "Waterloo"@en.
dbr:Waterloo,_Ontario foaf:name "Waterloo"@en.
- …
It gets the first page of all fragments
and inspects their metadata.
-
?person rdf:type dbo:Artist.
96,000
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
45
It starts with the smallest fragment,
because it is most selective.
-
?person rdf:type dbo:Artist.
96,000
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
45
dbr:Allan_Carpenter dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Adam_DeVine dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Bonnie_Koloc dbo:birthPlace dbr:Waterloo,_Iowa.
- …
This process continues recursively
until all options have been tested.
-
dbr:Allan_Carpenter rdf:type dbo:Artist.
-
?person dbo:birthPlace dbr:Waterloo,_Iowa.
45
dbr:Allan_Carpenter dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Adam_DeVine dbo:birthPlace dbr:Waterloo,_Iowa.
dbr:Bonnie_Koloc dbo:birthPlace dbr:Waterloo,_Iowa.
- …
It gets the first page of the fragment,
which provides mappings for a solution.
-
dbr:Allan_Carpenter rdf:type dbo:Artist.
1
dbr:Allan_Carpenter rdf:type dbo:Artist.
We found a solution mapping.
- ?person
- dbr:Allan_Carpenter
- ?city
- dbr:Waterloo,_Iowa
Some paths will result in empty fragments.
They do not lead to a consistent solution.
-
dbr:Adam_DeVine rdf:type dbo:Artist.
0
No solution mapping.
At least, according to DBpedia.
It turns out that Adam DeVine is actually an actor.
Executing this query in the browser client
only takes a couple of seconds.
We evaluated Triple Pattern Fragments
for server cost and availability.
We configured the Amazon machines
to generate large loads in a Web-like setting.
- 1 server (4 cores)
- 1 cache
- 1–244 simultaneous clients (1 core each)
The query throughput is lower,
but resilient to high client numbers.
The server traffic is higher,
but individual requests are lighter.
Caching is significantly more effective,
as clients reuse fragments for queries.
The server requires much less CPU,
allowing higher availability at lower cost.
The server enables clients to be intelligent,
so it can remain simple and lightweight.
These experiments verify the possibility
(and necessity) of new types of solutions.
-
Processing everything on the server is costly.
Processing everything on the client isn’t Web.
-
Solutions that divide the workload
can offer new perspectives,
if we accept the trade-offs they bring.
-
Is it realistic to make all queries on the Web fast?
- Maybe we should focus on obtaining first results soon.
The Semantic Web should focus on making
each of the life cycle phases sustainable.
-
Continuously improve Linked Data
through efficient iterations of the cycle.
-
If we want to see intelligent agents,
we must stop building intelligent servers.
-
Another perspective on sustainability:
how can I enable others to act on my data
now and in the future?