How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus

Mike Dean, Principal Engineer, BBN Technologies, mdean@bbn.com

Assumptions • Technology – Intermediate • Familiarity with RDF and OWL • Interest in • Semantic Web usage patterns • Semantic Web Challenge

Presenter Background • Principal Engineer at BBN Technologies (1984-present) • Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005) • Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL • Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present) • Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups • Co-editor of the W3C OWL Reference • Member of the Semantic Web Challenge Advisory Board since its inception • Local co-chair for ISWC2009 • Other SemTech presentations • Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher) • Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher) • Use of SWRL for Ontology Translation (2008) • Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler)

Semantic Web Challenge • Founded in 2003 by Michel Klein and Ubbo Visser • Demonstrate the value of the Semantic Web through applications • Submissions evaluated according to a set of minimal requirements and additional desirable features • Has become an annual event at International Semantic Web Conferences • 22 submissions in 2008

2008 Billion Triples Challenge • A new Semantic Web Challenge track in 2008 • Do “something interesting” with a large subset of a billion provided triples • Co-chaired by Jim Hendler and Peter Mika • 12 real web data sets • Not a scientific sample • Enough to be interesting and probably representative • Stable snapshot • Our analysis initially arose from discussing a possible application • We now know “yes, there is enough data to support what we wanted to do” • Tools and techniques should be generally applicable to other corpora

2008 Billion Triples Corpus http://www.cs.vu.nl/~pmika/swc/btc.html

Data Set Characterization • Metrics that can impact selection/tuning of KB implementations • Statement count • Number of classes and predicates • Statements per subject/predicate/object • Degree of interconnectedness (percentage of non-literal statements, with/without rdf:type) • RDFS and OWL reasoning employed • Use of reification

Analysis • Stream processing of the compressed data set archives • Statement counts • Datatype, language, predicate, and type counts • Use of RDF, RDFS, OWL, FOAF, and other vocabularies • (May include duplicate statements) • Load each dataset into its own Parliament KB • (Eliminates duplicates within dataset) • (Both programs used code based on Peter Mika’s WARC example with the OpenRDF RIO parser and no inference) • Process the statement and resource tables • Mark each node as resource and/or literal • URI, blank node, and literal counts • Chain length statistics and histograms • (Parliament worked very well here. Each operation took 1-736 seconds.)
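The stream-processing pass above can be sketched in Python. This is a minimal illustration, not the original analysis code, which was Java built on the OpenRDF RIO parser. It assumes one N-Triples statement per line and uses a simplified regular expression rather than a full parser. The names `count_stream` and `count_gzip` are hypothetical.

```python
import gzip
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Triples pattern (assumption: one statement per line).
# Group 1: subject token, group 2: predicate URI, group 3: object text.
TRIPLE = re.compile(r'^\s*(\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

def count_stream(lines):
    """Make a single pass over N-Triples lines and return simple counters."""
    stats = {"statements": 0, "errors": 0}
    predicates, types = Counter(), Counter()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        m = TRIPLE.match(line)
        if m is None:
            stats["errors"] += 1  # count the bad line and keep going
            continue
        _, pred, obj = m.groups()
        stats["statements"] += 1  # duplicates are NOT removed here
        predicates[pred] += 1
        if pred == RDF_TYPE and obj.startswith("<"):
            types[obj[1:-1]] += 1
    return stats, predicates, types

def count_gzip(path):
    """Run the same pass directly over a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return count_stream(f)
```

Because nothing is stored per statement, duplicate statements are counted each time they appear, which is why the streaming totals can differ from the counts obtained after loading into a KB.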

Stream Processing • Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access • Analogous to XML SAX vs. DOM • For suitable applications, this can be a lot faster than loading statements into a KB • Streaming analysis of the 2009 corpus was performed at an overall rate of 103K statements/sec on a Mac laptop with a portable external disk • Compare to loading 10-20K statements/second on a server
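The SAX-style pattern described above can be illustrated with a small callback handler. This is a sketch under assumptions. It splits whitespace-separated N-Triples lines instead of using a real parser, and the names `CountingHandler`, `parse_stream`, and `statements_per_second` are hypothetical. Measured rates will vary with hardware and data.

```python
import time

class CountingHandler:
    """Receives one statement at a time, like a SAX content handler."""

    def __init__(self):
        self.count = 0

    def handle_statement(self, subject, predicate, obj):
        self.count += 1

def parse_stream(lines, handler):
    """Push each statement to the handler without keeping it in memory."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or not line.endswith("."):
            continue
        # Split into subject, predicate, and the rest (the object may contain spaces).
        parts = line[:-1].strip().split(None, 2)
        if len(parts) == 3:
            handler.handle_statement(*parts)

def statements_per_second(lines, handler):
    """Return the observed throughput of one streaming pass."""
    start = time.perf_counter()
    parse_stream(lines, handler)
    elapsed = time.perf_counter() - start
    return handler.count / elapsed if elapsed > 0 else float("inf")
```

The handler sees each statement once and then discards it, so memory use stays flat regardless of corpus size. A model-based (DOM-like) API would have to hold every statement.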

Classes and Predicates

Statements • Statement (subject, predicate, object) • Resource object • rdf:type predicate • Other predicate • Literal object • rdf:datatype • Plain literal • xml:lang • Neither datatype nor language
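The breakdown above maps onto a small classifier. This sketch assumes terms in their N-Triples surface form (URIs in angle brackets, literals in double quotes). The category labels and the function names `classify_statement` and `percentages` are hypothetical, chosen to mirror the slide.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def classify_statement(predicate, obj):
    """Assign a statement to one leaf of the breakdown on the slide."""
    if obj.startswith('"'):
        # Whatever follows the closing quote is a datatype, a language tag, or empty.
        tail = obj[obj.rfind('"') + 1:]
        if tail.startswith("^^"):
            return "literal/datatype"
        if tail.startswith("@"):
            return "literal/plain/language"
        return "literal/plain/neither"
    if predicate == RDF_TYPE:
        return "resource/rdf:type"
    return "resource/other"

def percentages(statements):
    """Return the share of each category, in percent, over (s, p, o) tuples."""
    counts = Counter(classify_statement(p, o) for _, p, o in statements)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()} if total else {}
```

Summing the two `resource/...` shares gives the percentage of non-literal statements. Dropping `resource/rdf:type` gives the same figure without rdf:type.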

Statement % (distinct values)

Resources and Literals • Node • Resource • URI • Blank Node • Literal
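Distinct node counts can be sketched the same way. This illustration assumes N-Triples surface forms and counts subjects, predicates, and objects all as nodes. That is an assumption about how the resource table is populated, not a statement about Parliament's internals. The names `node_kind` and `node_counts` are hypothetical.

```python
def node_kind(term):
    """Classify a term by its N-Triples surface form."""
    if term.startswith('"'):
        return "literal"
    if term.startswith("_:"):
        return "blank"
    return "uri"

def node_counts(statements):
    """Count distinct URIs, blank nodes, and literals over (s, p, o) tuples."""
    seen = {"uri": set(), "blank": set(), "literal": set()}
    for statement in statements:
        for term in statement:
            seen[node_kind(term)].add(term)
    return {kind: len(terms) for kind, terms in seen.items()}
```

Holding every distinct term in a set is only practical for samples. At corpus scale the same counts would come from the KB's own tables, as in the analysis described earlier.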

Node %

Chain Lengths • How long are the linked-list chains used by Parliament? • How many statements share the same subject, predicate, or object? • Histograms proved unwieldy • Presenting summary statistics instead • rdf:type statements significantly impact results
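The summary statistics can be approximated by counting how many statements share each subject, predicate, or object. This sketch treats that count as the chain length and uses the population standard deviation. Both choices are assumptions, since the slides do not say which variant was reported. The name `chain_stats` is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def chain_stats(statements, position, skip_rdf_type=False):
    """Return (mean, std dev) of statements per distinct term.

    position: 0 = subject, 1 = predicate, 2 = object.
    skip_rdf_type: leave out rdf:type statements to see their impact.
    """
    counts = Counter()
    for statement in statements:
        if skip_rdf_type and statement[1] == RDF_TYPE:
            continue
        counts[statement[position]] += 1
    lengths = list(counts.values())
    if not lengths:
        return (0.0, 0.0)
    return (mean(lengths), pstdev(lengths))
```

Running it twice per position, with and without `skip_rdf_type`, shows how much a few heavily used classes stretch the object chains.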

Mean chain lengths (std dev)

RDF/RDFS/OWL Usage • 80,309,558 rdf:type statements in 11 data sets • 4,033,540 rdfs:subClassOf statements in 6 data sets • 2,988,396 owl:Class instances in 6 data sets • 1,492,214 rdf:_1 statements in 7 data sets • 1,042,032 owl:Restriction instances in 5 data sets • 480,771 owl:sameAs statements in 9 data sets • 299,962 rdfs:Class instances in same 6 data sets as owl:Class • 265,124 rdfs:domain statements in 6 data sets • 252,175 rdfs:range statements in 6 data sets • ~238,000 reified statements in 4 data sets • 50,482 instances of rdf:Bag in 5 data sets • 22,154 instances of owl:Ontology in 5 data sets • 14,913 owl:imports statements in 3 data sets • 83 rdf:_2000 statements in 3 data sets • 1 rdf:_10763 statement in 1 data set
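Counts such as rdf:_1, rdf:_2000, and rdf:_10763 come from recognizing container membership properties among the predicates. One way to sketch that, assuming predicates are given as plain URI strings (the helper names are hypothetical):

```python
import re
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# rdf:_n with n a positive integer and no leading zero.
MEMBER = re.compile(re.escape(RDF_NS) + r"_([1-9][0-9]*)")

def membership_index(predicate_uri):
    """Return n for rdf:_n, or None for any other predicate."""
    m = MEMBER.fullmatch(predicate_uri)
    return int(m.group(1)) if m else None

def membership_histogram(predicate_uris):
    """Count statements per container index."""
    indexes = (membership_index(p) for p in predicate_uris)
    return Counter(i for i in indexes if i is not None)
```

The largest key in the histogram is the size of the biggest container seen, which is how an outlier such as a single rdf:_10763 statement would stand out.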

Popular Vocabularies • FOAF • 29,308,169 Person instances in 7 data sets • 25,864,527 knows statements in 6 data sets • Dublin Core • 43,591,844 title statements in 7 data sets • 4,416,716 date statements in 6 data sets • Geospatial • 7,075,380 wgs84_pos:lat statements in 9 data sets • 4,436 georss:point statements in 5 data sets • SKOS • 6,619,912 subject statements in 4 data sets • 403,912 Concept instances in 4 data sets • RSS 1.0 • 2,893,750 item instances in 6 data sets • OWL-S • 92 Profiles (versions 0.9-1.2) in 3 data sets • OWL-Time • No usage?


Errors • 95,937 Java exceptions • Lots of bad languages and datatypes • Lots of namespace/URI typos/confusion • Slightly different statement counts, due to exceptions, duplicates, etc. • 1,063,616,774 statements (4% less)

Crawled Data • Webscope, Falcon, Swoogle, Watson, SWSE-1, and SWSE-2 consisted of crawled data from a wide range of sites • Included some data I published in 2002

DBpedia • Information extracted from Wikipedia pages • Example <http://dbpedia.org/resource/San_Jose%2C_California> rdfs:label "San Jose, California"@en ; dbpedia:officialName "City of San Jose"@en ; geo:lat "37.304"^^xsd:float ; geo:long "-121.873"^^xsd:float ; dbpedia:populationTotal "929936" ; dbpedia:areaLandSqMi "174.9" ; dbpedia:timezone <http://dbpedia.org/resource/Pacific_Time_Zone> ; foaf:homepage <http://www.sanjoseca.gov> ; foaf:img <http://upload.wikimedia.org/wikipedia/commons/3/3f/SJPan.jpg> ; foaf:page <http://en.wikipedia.org/wiki/San_Jose%2C_California> ; dbpedia:wikilink <http://dbpedia.org/resource/April_3> , ... ; owl:sameAs <http://sws.geonames.org/5392171/> , <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose> . • See http://dbpedia.org

Freebase • Collections of curated datasets • RDF-like data model • Data exports available, but no standard mapping to RDF until rdf.freebase.com was announced at ISWC2008 • Follows Linked Data principles • Standard RDF dump still not available • Some anomalies in the corpus mappings affected statistics • Used freebase:type rather than rdf:type • Language codes had a prepended /, e.g. “/en” • freebase.org (a different site) should be freebase.com • Example <http://www.freebase.org/guid/9202a8c04000641f800000000006809a> <http://www.freebase.org/type/object/name> "San Jose, California"@/en ; <http://www.freebase.org/type/object/type> <http://www.freebase.org/location/citytown> , <http://www.freebase.org/location/us_citytown> ; <http://www.freebase.org/location/citytown/founded> "1777-11-29" ; <http://www.freebase.org/location/location/area> "461.5" . • See http://freebase.com and http://rdf.freebase.com
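For statistics that should line up with the other data sets, the anomalies listed above could be normalized before counting. This is a hypothetical sketch, not something the original analysis is stated to have done. The host rewrite only applies the slide's note that freebase.org should be freebase.com. It does not claim the result matches the URI scheme published at rdf.freebase.com.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Type predicate as it appears in the corpus example above.
FREEBASE_TYPE = "http://www.freebase.org/type/object/type"
CORPUS_PREFIX = "http://www.freebase.org/"

def normalize_language(tag):
    """Drop the prepended slash from a language code such as '/en'."""
    return tag.lstrip("/")

def normalize_predicate(predicate_uri):
    """Count the Freebase type predicate as rdf:type (analysis-only mapping)."""
    return RDF_TYPE if predicate_uri == FREEBASE_TYPE else predicate_uri

def normalize_host(uri):
    """Swap the freebase.org host for freebase.com (assumed rewrite)."""
    if uri.startswith(CORPUS_PREFIX):
        return "http://www.freebase.com/" + uri[len(CORPUS_PREFIX):]
    return uri
```

Applying `normalize_predicate` before counting would move the Freebase typing statements into the rdf:type totals. Leaving it out keeps the corpus as published.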

Geonames • 8 million geographic names and locations • Example <http://sws.geonames.org/6484236/> a geonames:Feature ; geonames:featureClass geonames:S ; geonames:featureCode geonames:S.HTL ; geonames:inCountry <http://www.geonames.org/countries/#US> ; geonames:locationMap "http://www.geonames.org/6484236/the-fairmont-san-jose.html" ; geonames:name "The Fairmont San Jose" ; geonames:nearbyFeatures <http://sws.geonames.org/6484236/nearby.rdf> ; geonames:parentFeature <http://sws.geonames.org/5332921/> ; geo:lat "37.3326" ; geo:long "-121.8893" . • See http://geonames.org

SwetoDBLP • Metadata on publications in Computer Science (originally Databases and Logic Programming) • Example <http://dblp.uni-trier.de/rec/bibtex/conf/geos/KolasHD05> a opus:Article_in_Proceedings ; rdfs:label "Geospatial Semantic Web: Architecture of Ontologies." ; opus:author [ a rdf:Seq ; rdf:_1 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/k/Kolas:Dave.html> ; rdf:_2 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hebeler:John.html> ; rdf:_3 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html> ] ; opus:isIncludedIn <http://dblp.uni-trier.de/rec/bibtex/conf/geos/2005> ; opus:book_title "GeoS" ; opus:year "2005"^^xsd:gYear ; opus:pages "183-194" ; dcelem:relation "http://www.informatik.uni-trier.de/~ley/db/conf/geos/geos2005.html#KolasHD05" ; opus:last_modified_date "2005-11-08"^^xsd:date . <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html> a foaf:Person ; foaf:name "Mike Dean" . • See http://lsdis.cs.uga.edu/projects/semdis/swetodblp/

WordNet • Lexical database of English, including multiple word senses and synonym sets • Example wn20instances:wordsense-semantic-adjective-1 a wn20schema:AdjectiveWordSense ; rdfs:label "semantic"@en-us ; wn20schema:adjectivePertainsTo wn20instances:wordsense-semantics-noun-1 ; wn20schema:tagCount "3"@en-us ; wn20schema:word wn20instances:word-semantic . wn20instances:word-semantic a wn20schema:Word ; wn20schema:lexicalForm "semantic"@en-us . • See http://www.w3.org/2006/03/wn/wn20/

US Census • 1 billion triples published by Joshua Tauberer in April 2007 • Highly tabular data • Example <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose> a <http://www.rdfabout.com/rdf/schema/usgovt/Town> ; dc:title "San Jose" ; dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/fruitdale> , <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/seven_trees> , ... ; dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county> ; census:details <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/censustables> ; census:households 559949 ; census:landArea "1144714122 m^2" ; census:population 1621316 ; census:waterArea "20064384 m^2" ; geo:lat "37.318892" ; geo:long "-121.928244" . • See http://www.rdfabout.com/demo/census/

2009 Corpus • All crawled data, using Falcon-S, Sindice, Swoogle, SWSE, and Watson • 1,151,383,509 statements in 116 chunks of 10 million • Represented in NQuads format • Explicit source/context for each statement • No parsing errors • See http://vmlion25.deri.ie/ • Includes sampled statistics (which I found to be highly accurate) • Sources by “Pay Level Domain”
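Because each N-Quads statement carries its source, grouping by source needs only the last term of each line. This sketch assumes every line ends with a context URI, as in this corpus. A plain triple would be misread. It also approximates the pay level domain by the last two host labels. A correct computation needs a public suffix list, since for example `example.co.uk` would be reduced to `co.uk`. All function names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Capture the final <...> term before the terminating dot as the context.
QUAD = re.compile(r'^(.*\S)\s+<([^<>\s]*)>\s*\.\s*$')

def context_of(line):
    """Return the context (source) URI of an N-Quads line, or None."""
    m = QUAD.match(line)
    return m.group(2) if m else None

def naive_pay_level_domain(uri):
    """Approximate the pay level domain by the last two host labels."""
    host = urlsplit(uri).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def sources_by_domain(lines):
    """Count statements per approximate pay level domain."""
    counts = Counter()
    for line in lines:
        context = context_of(line)
        if context is not None:
            counts[naive_pay_level_domain(context)] += 1
    return counts
```

Calling `sources_by_domain(...).most_common(10)` on a chunk gives a rough ranking of its largest sources.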

LUBM • The Lehigh University Benchmark (LUBM) is widely used for Semantic Web benchmarking • Synthetic data generated for a specified number of universities • Example <http://www.Department0.University0.edu/FullProfessor0> a ub:FullProfessor ; ub:doctoralDegreeFrom <http://www.University241.edu> ; ub:emailAddress "FullProfessor0@Department0.University0.edu" ; ub:mastersDegreeFrom <http://www.University875.edu> ; ub:name "FullProfessor0" ; ub:researchInterest "Research20" ; ub:teacherOf <http://www.Department0.University0.edu/GraduateCourse1> , <http://www.Department0.University0.edu/Course0> , <http://www.Department0.University0.edu/GraduateCourse0> ; ub:telephone "xxx-xxx-xxxx" ; ub:undergraduateDegreeFrom <http://www.University84.edu> ; ub:worksFor <http://www.Department0.University0.edu> . • See http://swat.cse.lehigh.edu/projects/lubm/

Statement % (distinct values)

RDF/RDFS/OWL Usage

Popular Vocabularies

Corpus Composition

Further Analysis • Node level comparison of the 2009 corpus • Increased factoring of rdf:type statements • How many rdf:type’s are associated with each resource? • Overlap between 2008 and 2009 corpora • Analysis and reporting by Pay Level Domain rather than dataset • By vocabulary (aggregated source vs. aggregated predicate/type) • Drilldown into particular patterns, e.g. 32K element set/bag • Additional graph metrics (e.g. diameter)

2008 Billion Triples Winners • SemaPlorer: map-based exploration and visualization • SearchWebDB: inexact keyword search • MaRVIN: scalable reasoning from LarKC • i-MoCo: storage and browsing of 250M+ triples with an iPhone application • SAOR: Scalable Authoritative OWL Reasoning • Virtuoso: sophisticated storage and querying

2009 Challenge • Consider entering the Semantic Web Challenge • Submissions due October 1 • Submissions will be presented and winners named at the 8th International Semantic Web Conference (ISWC2009) October 25-29 near Washington, DC

More Information • Semantic Web Challenge • http://challenge.semanticweb.org • Analysis Code and Raw Data • 2008: http://asio.bbn.com/2008/10/btc/ • 2009: http://asio.bbn.com/2009/06/btc/ • ISWC2009 • http://iswc2009.semanticweb.org
