Linked Data: Survey of Adoption. Aidan Hogan. Day 2 Session 2. Linked Open Data …so, what’s out there?. The Web of Data !. August 2007. November 2007. February 2008. March 2008. September 2008. March 2009. July 2009. September 2010.
Linked Data: Survey of Adoption Aidan Hogan Day 2 Session 2
Linked Open Data …so, what’s out there?
The Web of Data! August 2007 November 2007 February 2008 March 2008 September 2008 March 2009 July 2009 September 2010 Images from: http://richard.cyganiak.de/2007/10/lod/; Cyganiak, Jentzsch
Publications Media User-generated Government Cross-Domain Geographic Life sciences
Anatomy of the LOD cloud: cross-domain • Freebase • ~300 million triples • General knowledge • User contributed • Acquired by Google • OpenCalais • ~4.5 million triples • Thomson Reuters export • OpenCyc • ~2 million triples • Upper ontology concepts • DBpedia • ~1 billion triples • Exports from Wikipedia • Central hub • Yago • ~19 million triples • Smaller/more precise data from Wikipedia • WordNet • ~4.5 million triples • Synonyms, etc.
User-generated Cross-Domain
Anatomy of the LOD cloud: user-generated • semanticweb.org • ~50 thousand triples • SemWeb related topics • Semantic Media Wiki! • Revyu • ~20 thousand triples • User contributed reviews • FlickrWrappr • ~56 million triples • Exports from photo site • DogFood • ~200 thousand triples • SemWeb confs. and papers • RDF ohloh • ~700 thousand triples • Exports from open-source development site
Publications User-generated
Anatomy of the LOD cloud: publications Library Exports • DBLP • ~28 million triples • Comp. Sci. publications • ePrints • ~8.4 million triples • ePrints exporter Academic Publications
User-generated Life sciences
Anatomy of the LOD cloud: life-sciences • Drug Bank • ~800 thousand triples • Detailed pharmacology for FDA-approved drugs • Sider • ~200 thousand triples • Drug side-effects • DailyMed • ~200 thousand triples • Detailed drug info from NLM • Diseasome • ~91 thousand triples • Disorders and disease • LinkedCT • ~7 million triples • Clinical trials info • UniProt • 100s of millions of triples • Info on proteins and sequences • PubMed • 800 million triples • HCLS publications
Geographic Life sciences
Anatomy of the LOD cloud: geographical • GeoNames • ~100 million triples • 10 million places with lat, long, population, subdivisions, post-codes, etc. • 2000 U.S. Census • ~1 billion triples • Population statistics per geographical location • Linked Sensor Data • ~1 billion triples • Sensor observations from 20 thousand weather observatories • Linked GeoData • ~3 billion triples • OpenStreetMap geolocations
Government Geographic
Anatomy of the LOD cloud: governmental • UK Legislation • ~2 billion triples • UK primary and secondary legislation info • NASA • ~100 thousand triples • Spacecraft, star catalogues, etc. • EuroStat • ~5 million triples • Various statistics for EU countries • UK Postcodes • ~27 million triples • Every UK postcode • GovTrack • ~13 million triples • US Congress bills, sponsorship, voting records
Media Government
Anatomy of the LOD cloud: media • Music (Various) • 100s of millions of triples • MySpace • AudioScrobbler • MusicBrainz • discogs • LastFM • Poképédia • ~115 thousand triples • Everything you ever wanted to know about Pokémon (but were afraid to ask) • BBC Programmes • ~60 million triples • Extensive info on BBC TV and radio programmes • New York Times • ~400 thousand triples • Extensive news vocabulary and cat. schemes • Linked Movie Database • ~6 million triples • Movie database • Open (smaller) version of IMDb
Publications Media User-generated Government Cross-Domain Geographic Life sciences
Graph Structure (i): Clustering Life-sciences (esp. Bio2RDF) Publications (esp. RKB) Core Image from http://blog.larkc.eu/?p=1941; C. Guéret
Graph Structure (ii): Interlinkage Interactive: http://gromgull.net/2010/01/swball/swball.svg; G.A. Grimnes
Graph Structure (iii): owl:sameAs linkage Interactive: http://inkdroid.org/empirical-cloud/; E. Summers
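An owl:sameAs link states that two URIs name the same thing, so chains of such links merge URIs from different datasets into equivalence classes. A minimal sketch of that grouping, using a union-find structure; the URIs and the function name are made up for illustration and are not taken from the slides or from the linked visualisation:

```python
from collections import defaultdict


def same_as_clusters(pairs):
    """Group URIs into the equivalence classes implied by owl:sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving: point nodes closer to the root while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two classes

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    # Sort for a stable, readable result.
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical sameAs statements spanning three datasets.
links = [
    ("http://a.example/Dublin", "http://b.example/dublin"),
    ("http://b.example/dublin", "http://c.example/2964574"),
    ("http://a.example/Galway", "http://b.example/galway"),
]
clusters = same_as_clusters(links)
```

Here the first two links share a URI, so all three Dublin URIs end up in one class even though no single link connects the first and the third.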
Licencing Image by L. Dodds
SPARQL • 66% of the datasets have a SPARQL endpoint • 35% offer an RDF dump See http://www.w3.org/wiki/SparqlEndpoints
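A SPARQL endpoint accepts queries over HTTP. As a rough sketch, assuming the GET form of the SPARQL protocol (the query text is passed URL-encoded in a `query` parameter), a client request could be built as follows. The endpoint URL is a placeholder and nothing is sent over the network:

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder endpoint, assumed for illustration only.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
""".strip()


def build_sparql_get_url(endpoint, query):
    """Return the URL for a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_get_url(ENDPOINT, QUERY)

# Round trip: decoding the URL parameter gives back the original query.
decoded = parse_qs(urlparse(url).query)["query"][0]
```

Datasets that offer only an RDF dump need to be downloaded and loaded into a local store before they can be queried this way.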
Data Overview • 207 datasets • 68 (33%) published directly by data producers • 137 (67%) published by third-parties Info from http://www4.wiwiss.fu-berlin.de/lodcloud/state/; Bizer, Jentzsch, Cyganiak
Data Overview • 207 datasets • ~28 billion triples • Highest volume from large legacy producers Info from http://www4.wiwiss.fu-berlin.de/lodcloud/state/; Bizer, Jentzsch, Cyganiak
Data Overview • 207 datasets • ~28 billion triples • ~395 million links • More links in more focused domains • (Or datasets by the same group) Info from http://www4.wiwiss.fu-berlin.de/lodcloud/state/; Bizer, Jentzsch, Cyganiak
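To put the headline figures in proportion, a quick back-of-the-envelope calculation using only the rounded numbers above gives roughly 135 million triples per dataset on average and about 14 links per thousand triples. The variable names are illustrative:

```python
datasets = 207
triples = 28e9   # ~28 billion triples
links = 395e6    # ~395 million links

# Mean dataset size; a few very large datasets dominate the total,
# so a typical dataset is likely much smaller than this.
avg_triples_per_dataset = triples / datasets

# How sparse the interlinkage is relative to the data volume.
links_per_thousand_triples = 1000 * links / triples

print(f"{avg_triples_per_dataset:,.0f} triples per dataset on average")
print(f"{links_per_thousand_triples:.1f} links per 1,000 triples")
```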
(Linked) Vocabularies Overview … • Formalised using RDFS and OWL standards introduced yesterday • (Typically OWL Full) … Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman
(Linked) Vocabularies: Dublin Core (DC) • Dublin Core • Models terms for describing documents and other resources (title, creator, date, etc.) Table from http://dublincore.org/documents/dcmi-terms/
(Linked) Vocabularies: FOAF • Friend Of A Friend • Models terms for personal information Image from http://www.deri.ie/fileadmin/images/blog/; Breslin
(Linked) Vocabularies: SIOC • Semantically Interlinked Online Communities • Models terms for online communities and presence Image from http://rdfs.org/sioc/spec/; Bojārs, Breslin et al.
(Linked) Vocabularies: SKOS • Simple Knowledge Organization System • Metavocabulary for concept schemes Image from http://www.w3.org/TR/swbp-skos-core-guide; Miles, Brickley
(Linked) Vocabularies: FOAF+SIOC+SKOS • Example of how vocabularies can interleave Image from http://sioc-project.org/node/158; Breslin
(Linked) Vocabularies: DOAP • Description Of A Project • Models terms for projects (research, software, etc.) Image from http://code.google.com/p/baetle/wiki/DoapOntology; Breslin
(Linked) Vocabularies: Music Ontology • Models terms for music artists, songs, albums etc. • (very detailed) Image from http://musicontology.com/; Raimond, Giasson
(Linked) Vocabularies: GoodRelations (i) • Models terms for e-commerce, products, offerings etc. Image from http://www.heppnetz.de/projects/goodrelations/primer/; Hepp
(Linked) Vocabularies: GoodRelations (ii) • Models terms for e-commerce, products, offerings etc.
(Linked) Vocabularies: DBpedia Ontology • Classes and properties for Wikipedia export • Cross-domain • 272 classes • 1,300 properties • (Too big to show) • Used to model structured info-boxes in Wikipedia See http://wiki.dbpedia.org/
(Linked) Vocabularies: Interlinkage Interactive: http://labs.mondeca.com/dataset/lov/; Vatant, Vandenbussche
LOD Vocabulary Usage “In order to make it as easy as possible for client applications to process your data, you should reuse terms from well-known vocabularies wherever possible. You should only define new terms yourself if you can not find required terms in existing vocabularies. ... It is common practice to mix terms from different vocabularies.” From “How to Publish Linked Data on the Web”; Bizer, Cyganiak, Heath http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
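As a small illustration of mixing terms from different vocabularies, the sketch below describes a person with FOAF terms and a paper with Dublin Core terms, then serialises the result as N-Triples. The two resources and the helper functions are invented for this example; only the vocabulary namespaces and term names are real:

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical resources, invented for this example.
person = "http://example.org/people/alice"
paper = "http://example.org/papers/42"

# Each object is tagged as a URI or a plain literal.
triples = [
    (person, RDF_TYPE, ("uri", FOAF + "Person")),
    (person, FOAF + "name", ("lit", "Alice Example")),
    (paper, DCTERMS + "title", ("lit", "A Survey of Something")),
    (paper, DCTERMS + "creator", ("uri", person)),
]


def term(kind, value):
    """Render an object term in N-Triples syntax."""
    if kind == "uri":
        return "<" + value + ">"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_ntriples(triples):
    """Render (subject, predicate, object) tuples, one statement per line."""
    return [f"<{s}> <{p}> {term(*o)} ." for s, p, o in triples]


lines = to_ntriples(triples)
print("\n".join(lines))
```

No new terms are defined here: the person and the paper get new URIs, but every class and property comes from an existing vocabulary.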
LOD Vocabulary Usage Info from http://www4.wiwiss.fu-berlin.de/lodcloud/state/; Bizer, Jentzsch, Cyganiak
LOD Vocabulary Usage • Preferential Attachment: more commonly used classes and properties are more likely to be used by others • Self-organising phenomenon/emergence • Causes power-law (long-tail) distributions... (Plots: property usage and class usage, log/log scale)
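The link between preferential attachment and long-tail distributions can be seen in a toy simulation. This is only a sketch under simple assumptions (each new usage either coins a new term with a fixed probability or reuses an existing term in proportion to its current usage); it is not the method behind the plots, and the parameter values are arbitrary:

```python
import random
from collections import Counter


def simulate_term_usage(steps, new_term_prob=0.1, seed=0):
    """Simulate term reuse where popular terms attract more reuse."""
    rng = random.Random(seed)
    uses = [0]        # one entry per usage event; values are term ids
    next_term = 1
    for _ in range(steps):
        if rng.random() < new_term_prob:
            uses.append(next_term)  # a publisher coins a new term
            next_term += 1
        else:
            # Picking a random past usage selects a term with probability
            # proportional to how often it has already been used.
            uses.append(rng.choice(uses))
    return Counter(uses)


counts = simulate_term_usage(steps=10_000)
ranked = [n for _, n in counts.most_common()]

print("distinct terms:", len(ranked))
print("top 5 usage counts:", ranked[:5])
print("terms used exactly once:", sum(1 for n in ranked if n == 1))
```

Plotting `ranked` against rank on log/log axes would be the natural next step to compare its shape with the usage plots.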
Linked Open Challenges? • ...still many open challenges (and opportunities) • Linked Data still in its infancy (<4 years old) • Publishing Linked Data • How to generate and maintain links to other datasets • Modelling issues when decoupled from applications • Economic issues: who pays for server overheads? • Revenue streams? Incentives? • Social issues: community-driven, collaborative knowledge bases • Consuming Linked Data • Scalability (tens of billions of triples) • Dealing with low data quality (Web data) • Heterogeneous data • many vocabularies • different URI naming schemes • Getting value from Linked Data through applications!!