How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus Mike Dean Principal Engineer BBN Technologies mdean@bbn.com
Assumptions • Technology – Intermediate • Familiarity with RDF and OWL • Interest in • Semantic Web usage patterns • Semantic Web Challenge
Presenter Background • Principal Engineer at BBN Technologies (1984-present) • Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005) • Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL • Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present) • Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups • Co-editor of the W3C OWL Reference • Member of the Semantic Web Challenge Advisory Board since its inception • Local co-chair for ISWC2009 • Other SemTech presentations • Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher) • Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher) • Use of SWRL for Ontology Translation (2008) • Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler)
Semantic Web Challenge • Founded in 2003 by Michel Klein and Ubbo Visser • Demonstrate the value of the Semantic Web through applications • Submissions evaluated according to a set of minimal requirements and additional desirable features • Has become an annual event at International Semantic Web Conferences • 22 submissions in 2008
2008 Billion Triples Challenge • A new Semantic Web Challenge track in 2008 • Do “something interesting” with a large subset of a billion provided triples • Co-chaired by Jim Hendler and Peter Mika • 12 real web data sets • Not a scientific sample • Enough to be interesting and probably representative • Stable snapshot • Our analysis initially arose from discussing a possible application • We now know “yes, there is enough data to support what we wanted to do” • Tools and techniques should be generally applicable to other corpora
2008 Billion Triples Corpus http://www.cs.vu.nl/~pmika/swc/btc.html
Data Set Characterization • Metrics that can impact selection/tuning of KB implementations • Statement count • Number of classes and predicates • Statements per subject/predicate/object • Degree of interconnectedness (percentage of non-literal statements, with/without rdf:type) • RDFS and OWL reasoning employed • Use of reification
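A minimal sketch of how some of these metrics could be computed, not the code used for the analysis. It assumes triples are already parsed into (subject, predicate, object) strings in N-Triples syntax, counts "classes" as distinct objects of rdf:type, and reads "without rdf:type" as "among statements whose predicate is not rdf:type". The function name `characterize` and these interpretations are assumptions.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def characterize(triples):
    """Return basic metrics for (subject, predicate, object) string triples.

    Assumption: terms use N-Triples syntax, so literals start with a
    double quote and every other term is a URI or blank node.
    """
    triples = list(triples)
    non_type = [t for t in triples if t[1] != RDF_TYPE]

    def pct_non_literal(statements):
        # Percentage of statements whose object is a resource, not a literal.
        if not statements:
            return 0.0
        resources = sum(1 for _, _, o in statements if not o.startswith('"'))
        return 100.0 * resources / len(statements)

    return {
        "statements": len(triples),
        "predicates": len({p for _, p, _ in triples}),
        # Assumption: "classes" means classes in use as rdf:type objects.
        "classes": len({o for _, p, o in triples if p == RDF_TYPE}),
        "pct_non_literal": pct_non_literal(triples),
        "pct_non_literal_without_type": pct_non_literal(non_type),
    }
```

Excluding rdf:type matters because type statements always have a resource object and can dominate the interconnectedness figure.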
Analysis • Stream processing of the compressed data set archives • Statement counts • Datatype, language, predicate, and type counts • Use of RDF, RDFS, OWL, FOAF, and other vocabularies • (May include duplicate statements) • Load each dataset into its own Parliament KB • (Eliminates duplicates within dataset) • (Both programs used code based on Peter Mika’s WARC example with the OpenRDF RIO parser and no inference) • Process the statement and resource tables • Mark each node as resource and/or literal • URI, blank node, and literal counts • Chain length statistics and histograms • (Parliament worked very well here. Each operation took 1-736 seconds.)
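The first step above (streaming counts over compressed archives) might look like the following sketch. It is not the analysis code, which was based on Peter Mika's WARC example and the OpenRDF RIO parser. Here the input is assumed to be gzip-compressed N-Triples, parsed with a simplified regular expression. The names `count_stream` and `count_gzip` are hypothetical.

```python
import gzip
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Triples pattern (an assumption, not a full grammar):
# subject, predicate, object, closing dot.
LINE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+<([^>]*)>\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)\s*\.\s*$'
)


def count_stream(lines):
    """Count statements, predicates, and rdf:type objects in one pass."""
    counts = {"statements": 0, "errors": 0,
              "predicates": Counter(), "types": Counter()}
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(line)
        if match is None:
            counts["errors"] += 1  # malformed line: record it and move on
            continue
        _subject, predicate, obj = match.groups()
        counts["statements"] += 1
        counts["predicates"][predicate] += 1
        if predicate == RDF_TYPE:
            counts["types"][obj] += 1
    return counts


def count_gzip(path):
    """Stream a gzip-compressed file without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_stream(handle)
```

Because nothing is stored per statement, duplicates are counted as often as they occur, which matches the "may include duplicate statements" caveat.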
Stream Processing • Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access • Analogous to XML SAX vs. DOM • For suitable applications, this can be a lot faster than loading statements into a KB • Streaming analysis of the 2009 corpus was performed at an overall rate of 103K statements/sec on a Mac laptop with a portable external disk • Compare to loading 10-20K statements/second on a server
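To put the two rates in perspective, here is a small worked calculation. It uses the statement count these slides give for the 2009 corpus (1,151,383,509) and assumes, purely for illustration, that the quoted server load rate of 10-20K statements/second would hold for the whole corpus.

```python
STATEMENTS_2009 = 1_151_383_509  # statement count given for the 2009 corpus


def hours_needed(statements, statements_per_second):
    """Elapsed hours to process `statements` at a constant rate."""
    return statements / statements_per_second / 3600


streaming = hours_needed(STATEMENTS_2009, 103_000)
loading_fast = hours_needed(STATEMENTS_2009, 20_000)
loading_slow = hours_needed(STATEMENTS_2009, 10_000)
print(f"streaming: {streaming:.1f} h, "
      f"loading: {loading_fast:.1f}-{loading_slow:.1f} h")
# prints "streaming: 3.1 h, loading: 16.0-32.0 h"
```

Under those assumptions, a streaming pass takes about 3 hours, while a full load would take roughly 16 to 32 hours.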
Statements • Statement (subject, predicate, object) • Resource object: rdf:type predicate or other predicate • Literal object: typed literal (rdf:datatype) or plain literal (with xml:lang, or with neither datatype nor language)
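The statement taxonomy on this slide can be sketched as a small classifier. This is an illustration only. It assumes predicate and object are strings in N-Triples syntax, and the function name and category labels are hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def classify(predicate, obj):
    """Place one statement into the taxonomy by its predicate and object."""
    if obj.startswith('"'):
        # Whatever follows the closing quote is the datatype or language tag.
        suffix = obj[obj.rfind('"') + 1:]
        if suffix.startswith("^^"):
            return "typed literal"
        if suffix.startswith("@"):
            return "plain literal with language"
        return "plain literal"
    if predicate == RDF_TYPE:
        return "rdf:type"
    return "other resource"
```

For example, `classify("<http://ex/p>", '"San Jose"@en')` returns `"plain literal with language"`.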
Resources and Literals • Node • Resource: URI or blank node • Literal
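A sketch of counting distinct nodes by kind, again assuming N-Triples style term strings. The analysis itself read Parliament's resource table. This version only looks at subjects and objects of in-memory triples, and the function names are hypothetical.

```python
from collections import Counter


def node_kind(term):
    """Return 'literal', 'blank node', or 'URI' for an N-Triples term."""
    if term.startswith('"'):
        return "literal"
    if term.startswith("_:"):
        return "blank node"
    return "URI"


def count_nodes(triples):
    """Count distinct subject and object nodes by kind."""
    nodes = set()
    for subject, _predicate, obj in triples:
        nodes.add(subject)
        nodes.add(obj)
    return Counter(node_kind(node) for node in nodes)
```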
Chain Lengths • How long are the linked-list chains used by Parliament? • How many statements share the same subject, predicate, or object? • Histograms proved unwieldy • Presenting summary statistics instead • rdf:type statements significantly impact results
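As a rough illustration of these summary statistics, the sketch below counts how many statements share each subject, predicate, or object and summarizes the resulting lengths. It works on in-memory triples and does not reproduce Parliament's internal linked lists. The `skip_type` flag and the function name are assumptions, added to show the effect of rdf:type statements.

```python
from collections import Counter
from statistics import mean, median

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def chain_stats(triples, position, skip_type=False):
    """Summarize how many statements share the term at `position`.

    position: 0 = subject, 1 = predicate, 2 = object.
    skip_type: ignore rdf:type statements to see their impact.
    """
    counts = Counter(
        t[position] for t in triples
        if not (skip_type and t[1] == RDF_TYPE)
    )
    lengths = list(counts.values())
    if not lengths:
        return None
    return {
        "chains": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }
```

Comparing `chain_stats(triples, 2)` with `chain_stats(triples, 2, skip_type=True)` shows how a few heavily used classes can inflate the maximum object chain length.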
RDF/RDFS/OWL Usage • 80,309,558 rdf:type statements in 11 data sets • 4,033,540 rdfs:subClassOf statements in 6 data sets • 2,988,396 owl:Class instances in 6 data sets • 1,492,214 rdf:_1 statements in 7 data sets • 1,042,032 owl:Restriction instances in 5 data sets • 480,771 owl:sameAs statements in 9 data sets • 299,962 rdfs:Class instances in same 6 data sets as owl:Class • 265,124 rdfs:domain statements in 6 data sets • 252,175 rdfs:range statements in 6 data sets • ~238,000 reified statements in 4 data sets • 50,482 instances of rdf:Bag in 5 data sets • 22,154 instances of owl:Ontology in 5 data sets • 14,913 owl:imports statements in 3 data sets • 83 rdf:_2000 statements in 3 data sets • 1 rdf:_10763 statement in 1 data set
Popular Vocabularies • FOAF • 29,308,169 Person instances in 7 data sets • 25,864,527 knows statements in 6 data sets • Dublin Core • 43,591,844 title statements in 7 data sets • 4,416,716 date statements in 6 data sets • Geospatial • 7,075,380 wgs84_pos:lat statements in 9 data sets • 4,436 georss:point statements in 5 data sets • SKOS • 6,619,912 subject statements in 4 data sets • 403,912 Concept instances in 4 data sets • RSS 1.0 • 2,893,750 item instances in 6 data sets • OWL-S • 92 Profiles (versions 0.9-1.2) in 3 data sets • OWL-Time • No usage?
Errors • 95,937 Java exceptions • Lots of bad languages and datatypes • Lots of namespace/URI typos/confusion • Slightly different statement counts, due to exceptions, duplicates, etc. • 1,063,616,774 statements (4% less)
Crawled Data • Webscope, Falcon, Swoogle, Watson, SWSE-1, and SWSE-2 consisted of crawled data from a wide range of sites • Included some data I published in 2002
DBpedia • Information extracted from Wikipedia pages • Example <http://dbpedia.org/resource/San_Jose%2C_California> rdfs:label "San Jose, California"@en ; dbpedia:officialName "City of San Jose"@en ; geo:lat "37.304"^^xsd:float ; geo:long "-121.873"^^xsd:float ; dbpedia:populationTotal "929936" ; dbpedia:areaLandSqMi "174.9" ; dbpedia:timezone <http://dbpedia.org/resource/Pacific_Time_Zone> ; foaf:homepage <http://www.sanjoseca.gov> ; foaf:img <http://upload.wikimedia.org/wikipedia/commons/3/3f/SJPan.jpg> ; foaf:page <http://en.wikipedia.org/wiki/San_Jose%2C_California> ; dbpedia:wikilink <http://dbpedia.org/resource/April_3> , ... ; owl:sameAs <http://sws.geonames.org/5392171/> , <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose> . • See http://dbpedia.org
Freebase • Collections of curated datasets • RDF-like data model • Data exports available, but no standard mapping to RDF until rdf.freebase.com was announced at ISWC2008 • Follows Linked Data principles • Standard RDF dump still not available • Some anomalies in the corpus mappings affected statistics • Used freebase:type rather than rdf:type • Language codes had a prepended /, e.g. “/en” • freebase.org (a different site) should be freebase.com • Example <http://www.freebase.org/guid/9202a8c04000641f800000000006809a> <http://www.freebase.org/type/object/name> "San Jose, California"@/en ; <http://www.freebase.org/type/object/type> <http://www.freebase.org/location/citytown> , <http://www.freebase.org/location/us_citytown> ; <http://www.freebase.org/location/citytown/founded> "1777-11-29" ; <http://www.freebase.org/location/location/area> "461.5" . • See http://freebase.com and http://rdf.freebase.com
Geonames • 8 million geographic names and locations • Example <http://sws.geonames.org/6484236/> a geonames:Feature ; geonames:featureClass geonames:S ; geonames:featureCode geonames:S.HTL ; geonames:inCountry <http://www.geonames.org/countries/#US> ; geonames:locationMap "http://www.geonames.org/6484236/the-fairmont-san-jose.html" ; geonames:name "The Fairmont San Jose" ; geonames:nearbyFeatures <http://sws.geonames.org/6484236/nearby.rdf> ; geonames:parentFeature <http://sws.geonames.org/5332921/> ; geo:lat "37.3326" ; geo:long "-121.8893" . • See http://geonames.org
SwetoDBLP • Metadata on publications in Computer Science (originally Databases and Logic Programming) • Example <http://dblp.uni-trier.de/rec/bibtex/conf/geos/KolasHD05> a opus:Article_in_Proceedings ; rdfs:label "Geospatial Semantic Web: Architecture of Ontologies." ; opus:author [ a rdf:Seq ; rdf:_1 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/k/Kolas:Dave.html> ; rdf:_2 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hebeler:John.html> ; rdf:_3 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html> ] ; opus:isIncludedIn <http://dblp.uni-trier.de/rec/bibtex/conf/geos/2005> ; opus:book_title "GeoS" ; opus:year "2005"^^xsd:gYear ; opus:pages "183-194" ; dcelem:relation "http://www.informatik.uni-trier.de/~ley/db/conf/geos/geos2005.html#KolasHD05" ; opus:last_modified_date "2005-11-08"^^xsd:date . <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html> a foaf:Person ; foaf:name "Mike Dean" . • See http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
WordNet • Lexical database of English, including multiple word senses and synonym sets • Example wn20instances:wordsense-semantic-adjective-1 a wn20schema:AdjectiveWordSense ; rdfs:label "semantic"@en-us ; wn20schema:adjectivePertainsTo wn20instances:wordsense-semantics-noun-1 ; wn20schema:tagCount "3"@en-us ; wn20schema:word wn20instances:word-semantic . wn20instances:word-semantic a wn20schema:Word ; wn20schema:lexicalForm "semantic"@en-us . • See http://www.w3.org/2006/03/wn/wn20/
US Census • 1 billion triples published by Joshua Tauberer in April 2007 • Highly tabular data • Example <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose> a <http://www.rdfabout.com/rdf/schema/usgovt/Town> ; dc:title "San Jose" ; dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/fruitdale> , <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/seven_trees> , ... ; dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county> ; census:details <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/censustables> ; census:households 559949 ; census:landArea "1144714122 m^2" ; census:population 1621316 ; census:waterArea "20064384 m^2" ; geo:lat "37.318892" ; geo:long "-121.928244" . • See http://www.rdfabout.com/demo/census/
2009 Corpus • All crawled data, using Falcon-S, Sindice, Swoogle, SWSE, and Watson • 1,151,383,509 statements in 116 chunks of 10 million • Represented in NQuads format • Explicit source/context for each statement • No parsing errors • See http://vmlion25.deri.ie/ • Includes sampled statistics (which I found to be highly accurate) • Sources by “Pay Level Domain”
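Since each N-Quads statement carries its source as a fourth term, grouping sources by domain can be sketched as below. This is a crude approximation: a real Pay Level Domain computation needs the Public Suffix List, whereas this sketch simply keeps the last two host labels, so a host such as `example.co.uk` would be wrongly reduced to `co.uk`. It also assumes every line has a context URI. All names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: the context is the last <...> term before the closing dot.
CONTEXT = re.compile(r'<([^>]*)>\s*\.\s*$')


def naive_pld(uri):
    """Approximate the Pay Level Domain as the last two host labels."""
    host = urlparse(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def sources_by_pld(lines):
    """Count statements per (approximate) Pay Level Domain of the context."""
    counts = Counter()
    for line in lines:
        match = CONTEXT.search(line)
        if match:
            counts[naive_pld(match.group(1))] += 1
    return counts
```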
LUBM • The Lehigh University Benchmark (LUBM) is widely used for Semantic Web benchmarking • Synthetic data generated for a specified number of universities • Example <http://www.Department0.University0.edu/FullProfessor0> a ub:FullProfessor ; ub:doctoralDegreeFrom <http://www.University241.edu> ; ub:emailAddress "FullProfessor0@Department0.University0.edu" ; ub:mastersDegreeFrom <http://www.University875.edu> ; ub:name "FullProfessor0" ; ub:researchInterest "Research20" ; ub:teacherOf <http://www.Department0.University0.edu/GraduateCourse1> , <http://www.Department0.University0.edu/Course0> , <http://www.Department0.University0.edu/GraduateCourse0> ; ub:telephone "xxx-xxx-xxxx" ; ub:undergraduateDegreeFrom <http://www.University84.edu> ; ub:worksFor <http://www.Department0.University0.edu> . • See http://swat.cse.lehigh.edu/projects/lubm/
Further Analysis • Node level comparison of the 2009 corpus • Increased factoring of rdf:type statements • How many rdf:type values are associated with each resource? • Overlap between 2008 and 2009 corpora • Analysis and reporting by Pay Level Domain rather than dataset • By vocabulary (aggregated source vs. aggregated predicate/type) • Drilldown into particular patterns, e.g. 32K element set/bag • Additional graph metrics (e.g. diameter)
2008 Billion Triples Winners • SemaPlorer: map-based exploration and visualization • SearchWebDB: inexact keyword search • MaRVIN: scalable reasoning from LarKC • i-MoCo: storage and browsing of 250M+ triples with an iPhone application • SAOR: Scalable Authoritative OWL Reasoning • Virtuoso: sophisticated storage and querying
2009 Challenge • Consider entering the Semantic Web Challenge • Submissions due October 1 • Submissions will be presented and winners named at the 8th International Semantic Web Conference (ISWC2009) October 25-29 near Washington, DC
More Information • Semantic Web Challenge • http://challenge.semanticweb.org • Analysis Code and Raw Data • 2008: http://asio.bbn.com/2008/10/btc/ • 2009: http://asio.bbn.com/2009/06/btc/ • ISWC2009 • http://iswc2009.semanticweb.org