A Sitemap extension to enable efficient interaction with large quantities of Linked Data Giovanni Tummarello, Ph.D DERI Galway
Linked Data on the Semantic Web • The “Semantic Web”, as we are starting to mean it today: the set of all RDF models that can be resolved from a URL (their source). • Size of the current Semantic Web: 50-100M documents • Most of it is produced by mapping relational databases using the “linked data” approach: • The identifier (URI) is actually a URL; we call these URI/URLs • ..minted in the namespace of the data producer.. • so that the data producer's Web server can generate a description of the entity when it is “resolved”, e.g. via HTTP • Example: http://dbpedia.org/resource/Berlin
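To make the “resolution” step concrete, here is a minimal Python sketch (not part of the original slides) of how a client dereferences a URI/URL such as the DBpedia example, asking for RDF via content negotiation; whether a given server still answers exactly this way is an assumption.

    # Minimal sketch: resolving a linked-data URI/URL with content negotiation.
    # The URI below is the DBpedia example from the slide; the exact response
    # format and redirect behaviour of the server are assumptions.
    import urllib.request

    def resolve_entity(uri):
        """Fetch an RDF description of `uri` by asking for application/rdf+xml."""
        req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
        with urllib.request.urlopen(req) as resp:
            # Linked-data servers typically 303-redirect the entity URI to a
            # data document; urllib follows the redirect automatically.
            return resp.read().decode("utf-8", errors="replace")

    if __name__ == "__main__":
        rdf_xml = resolve_entity("http://dbpedia.org/resource/Berlin")
        print(rdf_xml[:500])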
Cost of creating new documents on the SW • If you have the data, it is moderately low • From your existing DB, apply a mapping layer (e.g. D2R or Virtuoso) • Produce as many RDF files, retrievable from your URL prefix, as you have entities (a rough sketch follows) • Success? • More is needed to make your data useful (e.g. linking to OTHER URIs if your entities are not something completely “yours”) • You need to let the world know your data is there.
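A rough illustration of the “apply a layer” idea, in plain Python rather than D2R or Virtuoso: generate one small RDF/XML document per database row, to be served under your URL prefix. The database, table, and column names are invented for the sketch.

    # Illustrative sketch only (not D2R or Virtuoso): turn each row of a
    # relational table into one small RDF/XML file served under your URL prefix.
    # The database file and the "products"/"id"/"label" names are hypothetical.
    import sqlite3, pathlib
    from xml.sax.saxutils import escape

    PREFIX = "http://example.org/products/"   # the linked-data URL prefix you mint
    OUT = pathlib.Path("products_rdf")
    OUT.mkdir(exist_ok=True)

    conn = sqlite3.connect("catalogue.db")    # hypothetical existing database
    for row_id, label in conn.execute("SELECT id, label FROM products"):
        uri = f"{PREFIX}{row_id}"
        doc = (f'<?xml version="1.0"?>\n'
               f'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
               f'         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">\n'
               f'  <rdf:Description rdf:about="{escape(uri)}">\n'
               f'    <rdfs:label>{escape(str(label))}</rdfs:label>\n'
               f'  </rdf:Description>\n'
               f'</rdf:RDF>\n')
        (OUT / f"{row_id}.rdf").write_text(doc, encoding="utf-8")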
Large quantities of linked data: how to expose them? • The fact that the data is HTTP-retrievable in small bits makes it crawlable. • But data producers are very scared of this: • Millions of hits for each refresh • Each hit potentially triggers many complex queries to generate the RDF view of the entity • DoS incidents on the SW have happened (e.g. see the Geonames blog) and they are not fun. • Clearly something better must be possible • Most data producers in fact already provide full dumps of the base data • Or SPARQL endpoints
The idea: extending Sitemaps to expose data • Sitemaps: • Originally by Google, immediately adopted by the others (Yahoo, MSN, etc.) • Expose the “deep web” by providing a list of pages to be crawled • Written in XML, linked directly from robots.txt Example: <?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>http://www.example.com/</loc> <lastmod>2005-01-01</lastmod> <changefreq>monthly</changefreq> <priority>0.8</priority> </url> </urlset>
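For completeness, a small sketch (assuming the usual "Sitemap:" directive in robots.txt) of how a crawler discovers a site's sitemaps; www.example.com is just the host from the example above.

    # Minimal sketch: discover a site's sitemap(s) from its robots.txt.
    import urllib.request

    def find_sitemaps(host):
        """Return the sitemap URLs listed in http://<host>/robots.txt, if any."""
        with urllib.request.urlopen(f"http://{host}/robots.txt") as resp:
            robots = resp.read().decode("utf-8", errors="replace")
        return [line.split(":", 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith("sitemap:")]

    print(find_sitemaps("www.example.com"))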
The Semantic Sitemap Extension Example first: <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"> <sc:dataset> <sc:datasetLabel>Product Catalog for Example.org</sc:datasetLabel> <sc:dataDumpLocation>http://example.org/cataloguedump.rdf</sc:dataDumpLocation> <sc:linkedDataPrefix>http://example.org/products/</sc:linkedDataPrefix> <changefreq>monthly</changefreq> </sc:dataset> </urlset>
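A minimal sketch of how a consumer might read the <sc:dataset> entries out of such a sitemap with the Python standard library; it only extracts the label, dump location, and linked-data prefix, and it assumes a local copy of the sitemap file.

    # Minimal sketch: pull the <sc:dataset> entries out of a semantic sitemap.
    import xml.etree.ElementTree as ET

    SC = "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"

    def parse_semantic_sitemap(xml_text):
        root = ET.fromstring(xml_text)
        datasets = []
        for ds in root.iter(f"{{{SC}}}dataset"):
            def text_of(tag):
                el = ds.find(f"{{{SC}}}{tag}")
                return el.text.strip() if el is not None and el.text else None
            datasets.append({
                "label": text_of("datasetLabel"),
                "dump": text_of("dataDumpLocation"),
                "prefix": text_of("linkedDataPrefix"),
            })
        return datasets

    with open("sitemap.xml", encoding="utf-8") as f:   # hypothetical local copy
        print(parse_semantic_sitemap(f.read()))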
Other features • Location of the SPARQL endpoint of the dataset <sc:sparqlEndPoint>http://example.org/queryengine/sparql</sc:sparqlEndPoint> • A representative URI/URL <sc:sampleURI>http://example.org/products/id1234</sc:sampleURI> • Split data dumps <sc:dataFragmentDump>http://example.org/data
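Once a client has discovered <sc:sparqlEndPoint>, it can query it over the standard SPARQL protocol (an HTTP GET with a ?query= parameter). A sketch, using the hypothetical endpoint and sample URI from the slide:

    # Sketch: query a discovered SPARQL endpoint via the SPARQL protocol.
    # The endpoint URL and entity URI are the hypothetical ones from the slide.
    import urllib.parse, urllib.request

    ENDPOINT = "http://example.org/queryengine/sparql"
    QUERY = "SELECT ?p ?o WHERE { <http://example.org/products/id1234> ?p ?o } LIMIT 10"

    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
    req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))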
How it is meant to be used As a crawler: • If you are given a URL for an RDF site, check for the sitemap • If a dump is available, download that instead of crawling As a client: • If you have a dump and want an update • Check the sitemap to locate the dump, in case it has changed position • Or to locate a SPARQL endpoint
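Putting the crawler behaviour together, a sketch that prefers the advertised dump over per-entity crawling; find_sitemaps and parse_semantic_sitemap are the helper sketches shown earlier, not part of any library.

    # Sketch of the crawler behaviour above: prefer the dump advertised in the
    # semantic sitemap over hitting every entity URL one by one.
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def ingest_site(host):
        for sitemap_url in find_sitemaps(host):
            for ds in parse_semantic_sitemap(fetch(sitemap_url).decode("utf-8")):
                if ds.get("dump"):
                    return fetch(ds["dump"])      # bulk download, no per-entity hits
        return None                               # fall back to ordinary crawling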
Dumps (1) Tripledumps vs Quaddumps • The Semantic Web is a quadruple space (triple + source) • A Semantic Web site dump should therefore be in a quad format • But almost always, the only thing that really matters is a single triplestore • How do you “slice” such a dataset to obtain the individual linked data files? • The individual site owners decide how to generate the single linked data files. • Unfortunately there is no standard interpretation of SPARQL DESCRIBE • Some reasonable choices exist, but they might fail for specific use cases • Guesswork or standardization?
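One possible slicing policy (not a standard, as noted above) is to group the triples of an N-Triples dump by subject, so that each linked data file describes one entity. A naive sketch:

    # Sketch of one possible "slicing" policy: group an N-Triples dump by
    # subject, so each linked-data document describes one entity.
    from collections import defaultdict

    def slice_by_subject(ntriples_path):
        per_subject = defaultdict(list)
        with open(ntriples_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                subject = line.split()[0]        # naive: first whitespace-delimited token
                per_subject[subject].append(line)
        return per_subject                        # {<subject URI>: [triple lines]}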
Dumps (2) Compression and other details • In the case of a tripledump, one should specify the format, such as: RDF/XML, N-Triples, Turtle, N3 • In the case of a quaddump: • TriG, TriX, N-Quads • Filename archival – archives where the filenames are created by URL-encoding the source location • Compression: the dumps can be compressed, in which case one of the following formats should be specified: • tar, zip, gzip, bzip2, tar.gz, tar.bz2
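A sketch of the filename-archival idea: each source document is stored in a tar.gz archive under a filename obtained by URL-encoding its source URL. The helper name and the sample document are invented for illustration.

    # Sketch: store each source document in a tar.gz archive under a filename
    # obtained by URL-encoding its source URL (the "filename archival" option).
    import io, tarfile
    from urllib.parse import quote

    def write_quad_archive(documents, archive_path="dump.tar.gz"):
        """`documents` maps source URL -> serialized RDF bytes."""
        with tarfile.open(archive_path, "w:gz") as tar:
            for source_url, payload in documents.items():
                info = tarfile.TarInfo(name=quote(source_url, safe=""))
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

    write_quad_archive({"http://example.org/products/id1234": b"<rdf:RDF/>"})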
Who uses it? Data producers • Geonames • DBpedia • Uniprot • DBLP • … (it takes about 10 minutes to write one) Data consumers • Sindice • Next: SWSE, DBin 2.0
Implementation in action: Sindice • Can help a user or a client (e.g. Tabulator) find useful Semantic Web sources to import. • Quick to update, monitors changes, crawls (soon) • First beta target: index the currently known Semantic Web • Discovers and uses Semantic Sitemaps
Sindice scenario [diagram]: data sources (DBpedia, DBLP, GeoNames) on one side, client applications (the Tabulator, Disco, Piggy Bank, SIOC Explorer, etc.) on the other.
Semantic Sitemaps: credits • Also thanks to: Chris Bizer (Free University Berlin), Richard Cyganiak (Free University Berlin), Renaud Delbru (DERI Galway), Andreas Harth (DERI Galway), Aidan Hogan (DERI Galway), Leandro Lopez, Stefano Mazzocchi (SIMILE - MIT), Christian Morbidoni (SEMEDIA - Universita' Politecnica delle Marche), Michele Nucci (SEMEDIA - Universita' Politecnica delle Marche), Eyal Oren (DERI Galway), Leo Sauermann (DFKI)
Conclusions • Sitemaps were born on the document web, explicitly to expose databases and the “deep web” • The idea: a Semantic Sitemap extension that covers efficient handling of RDF datasets by clients and search engines • Details still need some polishing, but it already works • Full specs at http://sw.deri.org/2007/07/sitemapextension/