One flew over the cuckoo’s nest
The role of aggregation on a decentralized Web
Ruben Verborgh
Ghent University – imec
Centralization and decentralization
both have their merits.
Technology should serve a purpose,
not an ideology.
Cultural heritage starts decentralized.
Why do we centralize via aggregation?
- visibility and discovery
  - data has a higher chance of being found in an aggregator
- quality
  - aggregator can align data across datasets
- infrastructure
  - only the aggregator needs complex software
Having access to more data
makes disambiguation easier.
- Dick Bruna
- Bruna, Dick
- Bruna, D. (1927-)
- D Bruna (1927-2017)
…although it also increases the problem size.
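The four variants above can be reconciled with a simple normalization key. This is a minimal sketch, not a method from the talk: the function `name_key` and its rule (surname plus first initial, life dates dropped) are assumptions for illustration.

```python
import re

def name_key(raw: str) -> tuple[str, str]:
    """Reduce a name variant to a (surname, first initial) key."""
    # Drop parenthesised life dates such as "(1927-)" or "(1927-2017)".
    text = re.sub(r"\([^)]*\)", "", raw).strip()
    if "," in text:
        # "Surname, Given" order.
        surname, given = [part.strip() for part in text.split(",", 1)]
    else:
        # "Given Surname" order: the last token is taken as the surname.
        parts = text.split()
        given, surname = " ".join(parts[:-1]), parts[-1]
    return surname.lower(), given[:1].lower()

variants = [
    "Dick Bruna",
    "Bruna, Dick",
    "Bruna, D. (1927-)",
    "D Bruna (1927-2017)",
]
keys = {name_key(v) for v in variants}
print(keys)  # all four variants collapse to a single key
```

A key this coarse would also merge unrelated people who share a surname and initial, which is where the larger problem size shows up: more data gives more evidence, but also more candidates to tell apart.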
Aggregation makes interlinking
of collections easier.
- Without aggregation,
  interlinking is an n×n problem.
  - …assuming you can find all n others
- With aggregation,
  interlinking is an n×1 problem.
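To make the difference concrete, the sketch below counts alignments under one assumed reading of the two cases: without an aggregator, every collection aligns with every other collection; with one, every collection aligns once with the aggregator. The function names are illustrative.

```python
def links_without_aggregator(n: int) -> int:
    # Each of the n collections aligns with each of the n - 1 others.
    return n * (n - 1)

def links_with_aggregator(n: int) -> int:
    # Each of the n collections aligns once with the aggregator.
    return n

for n in (10, 100, 1000):
    print(n, links_without_aggregator(n), links_with_aggregator(n))
```

For 100 collections this is 9900 alignments versus 100.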
Centralized infrastructure simplifies
access to these innovations.
Pros
- no effort in software
- no effort in training people
Cons
- limited access to collection-specific knowledge
- do we get the innovations back?
I have been publishing my own metadata
since before most of these existed.
- The only correct publication record
  is the one that I publish myself.
- I am the source of truth
  of my publication metadata.
- I have one page of all my publications
  and one page for each publication.
  - All are semantically marked up.
- I publish all of this as Linked Data.
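As an illustration of what semantic markup for one publication page could look like, here is a minimal sketch that builds a JSON-LD description. The choice of the schema.org vocabulary, the function name, and all `example.org` URLs are assumptions, not the markup actually used on the speaker's site.

```python
import json

def publication_jsonld(url: str, title: str, year: int,
                       author_url: str, author_name: str) -> dict:
    """Describe one publication as a JSON-LD document (schema.org terms)."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "@id": url,  # the publication's own page is its identifier
        "name": title,
        "datePublished": str(year),
        "author": {"@type": "Person", "@id": author_url, "name": author_name},
    }

# Hypothetical values, for illustration only.
doc = publication_jsonld(
    url="https://example.org/publications/example-paper/",
    title="An example paper",
    year=2018,
    author_url="https://example.org/profile/#me",
    author_name="Example Author",
)
print(json.dumps(doc, indent=2))
```

Because each publication has its own URL as `@id`, a harvester can treat the author's page as the authoritative description of that record.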
Why spend time on this while
the aggregators already do?
- They’re doing it wrong.
  - incomplete
  - incorrect
  - inconsistent
- I don’t have time to correct all of them
  and keep them in sync.
  - They will mess it up anyway.
I want to be the source of truth.
I don’t need to be the only source.
- I have this recurring dream in which
  all of these platforms just harvest my data.
- They can have it for free; it’s CC0.
- I almost wish I could pay
  to give them the correct data.
- I’d need to set up one-off integrations (if allowed)
  and they would break every month.
What gets lost in translation
when data is aggregated?
- Every collection has its own accents
  and uniquely defining properties.
- Aggregators expect some homogeneity
  in order to realize their benefits.
- Some of the properties that received the most investment
  are not reflected adequately.
What do data producers get back
in return from aggregators?
- All of the benefits we talked about before!
What flows back to data producers
as a return from aggregators?
- Do you receive the improvements
  that were made to your metadata?
- Can you leverage the connections
  that were made with your data?
- Do you receive additional data
  that can help you improve?
Imagine all sorts of feedback
we are missing out on.
- What are people looking at most?
- What metadata fields do people use?
- What are people searching for?
This knowledge lets you, and only you, improve your data
and the experience of those who eventually use it.
Decentralization can be realized
at very different scales.
Every piece of data in decentralized apps
can come from a different place.
Multiple decentralized Web apps
share access to data stores.
Different app and storage providers
compete independently.
Hard-coded client–server contracts are
unsustainable with multiple sources.
Query-based contracts can make
decentralized Web apps more sustainable.
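A rough sketch of the contrast: instead of hard-coding one request per server, the client states what it needs as a query, and the same query is evaluated over however many sources exist. The triple representation, the `None` wildcard, and all identifiers and data below are assumptions for illustration, not an existing API.

```python
from typing import Iterable, Iterator, Optional

Triple = tuple[str, str, str]
Pattern = tuple[Optional[str], Optional[str], Optional[str]]

def match(source: Iterable[Triple], pattern: Pattern) -> Iterator[Triple]:
    """Yield triples from one source that fit the pattern (None = wildcard)."""
    for triple in source:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple

def query(sources: Iterable[Iterable[Triple]], pattern: Pattern) -> list[Triple]:
    """Evaluate one pattern over every source, dropping duplicate results."""
    seen: set[Triple] = set()
    results: list[Triple] = []
    for source in sources:
        for triple in match(source, pattern):
            if triple not in seen:
                seen.add(triple)
                results.append(triple)
    return results

# Two hypothetical data stores holding different parts of the data.
store_a = [("ex:work1", "ex:creator", "ex:person1")]
store_b = [
    ("ex:work2", "ex:creator", "ex:person1"),
    ("ex:work2", "ex:title", "A title"),
]

works = query([store_a, store_b], (None, "ex:creator", "ex:person1"))
print(works)
```

Adding a third store means appending it to the list of sources; the query itself, which is the contract, stays unchanged.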
The Paradox of Freedom:
you can only be free if you follow rules.
- Decentralization means making our own choices.
- Unless we agree on some basic things,
  no one will see the result of our choices.
- Agreement can be layered:
  - 100% agree on a small set (labeling, authorship, …)
  - 80% agree on a larger set (places, dimensions)
  - 5% agree on many smaller sets (sizes, colors, …)
We need to identify those rules
we all need to agree on.
- vocabularies
- data shapes
- interfaces
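One way to picture data shapes under layered agreement: a small core shape that every record satisfies, plus larger shapes that only some records satisfy. This is a minimal sketch; the property names and shape definitions are hypothetical, loosely following the layers mentioned earlier (labeling and authorship, then places and dimensions).

```python
# Hypothetical shapes: each is just the set of properties a record must have.
CORE_SHAPE = {"label", "author"}
EXTENDED_SHAPE = CORE_SHAPE | {"place", "dimensions"}

def conforms(record: dict, shape: set[str]) -> bool:
    """A record conforms when it provides every property the shape requires."""
    return shape.issubset(record)

# Hypothetical records from two different collections.
minimal = {"label": "Untitled", "author": "Unknown"}
detailed = {
    "label": "Untitled",
    "author": "Unknown",
    "place": "Ghent",
    "dimensions": "30x40 cm",
    "color": "blue",  # collection-specific extra, ignored by both shapes
}

print(conforms(minimal, CORE_SHAPE), conforms(minimal, EXTENDED_SHAPE))
print(conforms(detailed, CORE_SHAPE), conforms(detailed, EXTENDED_SHAPE))
```

A consumer that only relies on the core shape can use both records, while one that needs the extended shape can check up front which sources qualify.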
Lessons learned from aggregating hundreds of datasets
are highly useful to inform the discussion.
Decentralization needs replication
for realistic performance.
In addition to technological changes,
we need a shift of mindset.
- from the one
  to one of many
- from source
  to station
- from platform
  to service
Current networks are centered
around the aggregator.
We need to create network flows
to and from the aggregator.
The individual network nodes
need to become the source of truth.
Aggregators need to become part
of a larger network.
Aggregators serve as a crucial
but transparent layer in the network.
Aggregators’ main responsibility becomes
fostering a network between nodes.
The question of identity
becomes one of role.