A Perspective on Preservation of Linked Data
Richard Cyganiak, DERI, NUI Galway
How is Linked Data preservation different?
• Easier because RDF is (sometimes) self-describing
  • Representation information and context tend to be explicit and machine-processable
• Harder because it is tied to a particular technology infrastructure
  • If the domain name is lost, a dataset can no longer be LD (cf. TimBL's four principles)
  • That doesn't mean the data is no longer useful
Why think about preservation of LD?
• Can the preservation community teach us how to make data more self-describing?
• Preservation requires packaging. LD needs better data packaging.
• Preservation requires versioning. LD needs better versioning.
• LD datasets do go offline. How can we deal with that?
• Preserving the bits is not necessarily the hardest problem!
Access and formats
• Multiple methods of publishing/accessing LD:
  • Dereferenceable URIs
  • SPARQL endpoints
  • RDF dumps (triple/quad)
  • Embedding into web pages (RDFa, microdata)
• Focus on RDF dumps to keep things tractable and to maximise usefulness for non-RDF data
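Not part of the original slides: a minimal sketch of what working with archived RDF dumps looks like in practice, assuming Python with rdflib; the file names are placeholders. It loads a triple dump and a quad dump and reports their sizes, the kind of first sanity check an archive might run on ingest.

```python
from rdflib import Graph, Dataset

# Triple dump (N-Triples)
triples = Graph()
triples.parse("dataset-dump.nt", format="nt")
print(len(triples), "triples")

# Quad dump (N-Quads); a Dataset keeps the named-graph context of each quad
quads = Dataset()
quads.parse("dataset-dump.nq", format="nquads")
print(sum(1 for _ in quads.quads((None, None, None, None))), "quads")
```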
Vocabularies
• Meaning of an LD dataset depends on the vocabularies (a.k.a. ontologies) it uses
• They are the most important representation information
• Vocabularies can change and disappear too
• They need to be preserved alongside the data
• Vocabularies would be a good starting point for LD preservation
• Note: LOV already archives versions of hundreds of vocabularies (http://lov.okfn.org/)
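A sketch, again assuming Python with rdflib, of how an archive could discover which vocabularies a dump actually uses: split every predicate URI at its last '#' or '/' and collect the namespaces. Those namespaces identify the vocabularies that would need to be preserved alongside the data (for example, by fetching archived versions from LOV).

```python
from rdflib import Graph
from rdflib.namespace import split_uri

g = Graph()
g.parse("dataset-dump.nt", format="nt")  # placeholder file name

namespaces = set()
for predicate in set(g.predicates()):
    try:
        ns, _local = split_uri(predicate)  # (namespace, local name)
    except ValueError:
        continue  # URI with no splittable local name
    namespaces.add(str(ns))

for ns in sorted(namespaces):
    print(ns)
```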
Versioning
• How to package individual versions of a dataset in an explicit, machine-readable way?
• There is no strong notion of versioning in the RDF community.
• Books have editions. Software products have releases. This is important for data too: what version of Dataset X are you using?
• “Dependencies” between datasets and vocabularies, including versions?
• See also: Memento
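Since RDF prescribes no versioning scheme, here is one hedged possibility for making a version explicit in the data itself, reusing the existing owl:versionInfo and dcterms:isVersionOf terms. All URIs and values below are hypothetical.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, OWL, XSD

g = Graph()
version = URIRef("http://example.org/dataset/1.2")  # hypothetical version URI
series = URIRef("http://example.org/dataset")       # the dataset as a whole

g.add((version, OWL.versionInfo, Literal("1.2")))
g.add((version, DCTERMS.isVersionOf, series))
g.add((version, DCTERMS.issued, Literal("2013-04-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```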
Cataloging and packaging
• How can the various parts of a dataset and its surrounding information be packaged and held together in an explicit, machine-readable way?
• What metadata needs to be recorded about these packages to preserve context and make them findable?
• Potential benefit: tooling for setting up a local copy of a published/archived dataset including all its dependencies
• See also: OKFN's data packages (http://www.dataprotocols.org/en/latest/data-packages.html)
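To make the data-packages idea concrete, a sketch of generating a datapackage.json descriptor for an RDF dump plus the vocabularies it depends on. All names and values are placeholders, and the "dependencies" entry is an assumption about how dataset-to-vocabulary dependencies might be expressed, not a settled part of the spec.

```python
import json

descriptor = {
    "name": "example-dataset",
    "version": "1.2",
    "resources": [
        {"path": "dataset-dump.nt", "format": "nt"},
        {"path": "vocabularies/example-vocab.ttl", "format": "ttl"},
    ],
    # Assumed field: explicit, versioned dependencies on vocabularies
    "dependencies": {"example-vocab": "2.0"},
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```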
Existing relevant (?) standards
• VoID
  • Metadata vocabulary for describing RDF datasets
• DCAT
  • Upcoming W3C standard for data catalogs
• PROV
  • W3C standard for provenance
• DDI Discovery Vocabulary
  • Used by data archives to document statistical microdata, survey data, etc.
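A small illustration of the first of these: a VoID description that records a dataset's dump location and the vocabularies it uses, which is exactly the metadata an archive would want. The dataset and dump URIs are hypothetical; the FOAF namespace is real.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.bind("void", VOID)
ds = URIRef("http://example.org/dataset")

g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("Example dataset")))
g.add((ds, VOID.dataDump, URIRef("http://example.org/dataset-dump.nt")))
g.add((ds, VOID.vocabulary, URIRef("http://xmlns.com/foaf/0.1/")))

print(g.serialize(format="turtle"))
```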
Summary
• The most important repository for LD preservation will be one that versions vocabularies
• Focus on bulk RDF (dumps, not SPARQL endpoints or dereferenceable-URI crawling)
• Work towards good practices for making data self-describing and for metadata?
• Work towards standards and good practices for packaging, versioning, dependencies?
• Use existing standards: VoID, DCAT, PROV, Disco
• Preservation across time…
• …but also preservation across space and communities