Performing Object Consolidation on the Semantic Web Data Graph

Aidan Hogan, Andreas Harth, Stefan Decker

Introduction • Aim: To merge equivalent RDF instances in large-scale RDF datasets, a.k.a. to perform object consolidation • Background: • RDF (Resource Description Framework) is the data model used in Semantic Web technologies • Ideal for entity-centric applications where structured descriptions of entities are provided in RDF (e.g. SWSE); anything can be described in RDF • URIs are used as identifiers for entities • Ideally, URIs are used consistently across data sources to describe entities, so that information on an entity can be collected and merged from different sources

Motivation • Problem: • URIs are often not agreed upon (or not provided) for entities across sources, especially for real-world entities (e.g. sources cannot agree upon a URI for a person). Therefore, one entity may be split across many instances. • Entity-centric applications will see multiple instances as multiple entities, which is problematic! Example later…

Towards a Solution • RDF data is backed by ontologies in which certain properties may be declared Inverse Functional • Inverse Functional Properties have values unique to an entity (e.g., chat usernames unique to people, ISBN codes unique to books, etc.) • Therefore, if two instances have the same value for the same Inverse Functional Property, they are equivalent and can be merged.
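The merge rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `group_by_ifp_value`, the `IFPS` set, and the sample triples are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical list of inverse functional properties (names as used on the slides).
IFPS = {"foaf:mbox", "foaf:homepage"}

def group_by_ifp_value(triples):
    """Map each (IFP, value) pair to the set of subjects that share it."""
    groups = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in IFPS:
            groups[(predicate, obj)].add(subject)
    # Only pairs shared by two or more subjects signal an equivalence.
    return {key: subs for key, subs in groups.items() if len(subs) > 1}

# Made-up sample data: two blank nodes share an email address.
triples = [
    ("_:a", "foaf:mbox", "mailto:jane@example.org"),
    ("_:b", "foaf:mbox", "mailto:jane@example.org"),
    ("_:b", "foaf:name", "Jane"),
    ("_:c", "foaf:name", "Jane"),  # same name, but foaf:name is not an IFP
]
equivalent = group_by_ifp_value(triples)
```

Here `_:a` and `_:b` are grouped because they share a foaf:mbox value, while `_:c` is left alone: sharing a value for a property that is not inverse functional says nothing about identity.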

Example • Three sources provide data on one person – different identifiers used • Two different Inverse Functional Properties: • foaf:mbox referring to a person’s email • foaf:homepage referring to a person’s homepage

Benefit • Before consolidation: three instances, one entity. For example, an entity-centric search engine would return three results for the one person. • After consolidation: one instance, one entity.

Our Dataset • Want to perform object consolidation on the entire RDF Semantic Web data graph… • 470M statements from multiple schemas describing 72M instances from over 3M data sources • 84% of instances have no URI identifier • Majority of data is FOAF (Friend of a Friend) descriptions of people (78%), with 99.9% having no identifiers • => We need a scalable algorithm for performing object consolidation

Step 1 • Need to identify Inverse Functional Properties in the dataset • Inverse functional properties are defined in ontologies • Need to retrieve the ontologies describing properties in the dataset • Can dereference the property URIs to find the pertinent ontologies • Examples of inverse functional properties found: • foaf:mbox (email property), foaf:homepage, foaf:weblog, foaf:aimChatID and other chat ID properties, doap:homepage

Step 2 • Need to re-order data on-disk • Initially, data is in unsorted NQuads SPOC order • Subject = identifier of entity being described • Predicate = property of entity being described • Object = value of property • Context = data-source of SPO triple • Data is re-ordered to POCS order… e.g., the SPOC quad http://andreasharth.org#me foaf:name Andreas Harth http://andreasharth.org/foaf.rdf • …and sorted. Now data is grouped by predicate and then by object.
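As a rough sketch of the re-ordering step, the snippet below permutes SPOC tuples into POCS order and sorts them. It works in memory on a handful of made-up quads; the slides describe an on-disk sort over hundreds of millions of statements, which this does not attempt. The helper name `spoc_to_pocs` and the sample data are assumptions.

```python
def spoc_to_pocs(quad):
    """Permute a (subject, predicate, object, context) quad to POCS order."""
    s, p, o, c = quad
    return (p, o, c, s)

# Made-up quads in unsorted SPOC order.
quads_spoc = [
    ("http://example.org/a#me", "foaf:name", "Alice", "http://example.org/a.rdf"),
    ("_:b1", "foaf:mbox", "mailto:alice@example.org", "http://example.org/b.rdf"),
    ("http://example.org/a#me", "foaf:mbox", "mailto:alice@example.org", "http://example.org/a.rdf"),
]

# Tuples compare element by element, so sorting groups quads by predicate,
# then by object, which places shared property values next to each other.
quads_pocs = sorted(spoc_to_pocs(q) for q in quads_spoc)
```

After sorting, the two foaf:mbox quads with the same email value are adjacent, so a single sequential scan can compare their subjects.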

Step 3 • Scan data for equivalent instances • Scan the sorted POCS data looking for equivalent instances • If a predicate is an inverse functional property and the same object value appears with two different subjects, the instances identified by those subjects are equivalent and describe the same entity • Equivalence is transitive, so a “same-as table” is used to store equivalences and compute their transitive closure • Each row of the table contains equivalent identifiers • No identifier can appear in more than one row
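One way to sketch the scan and the same-as table is with a union-find (disjoint-set) structure, which keeps every identifier in exactly one row and handles transitivity implicitly. The slides do not say how the table is implemented, so union-find is a swapped-in technique here, and the names `SameAs`, `scan`, and `IFPS` as well as the sample quads are assumptions.

```python
from itertools import groupby

IFPS = {"foaf:mbox", "foaf:homepage"}  # hypothetical IFP list

class SameAs:
    """Same-as table as a disjoint-set forest: one row per equivalence class."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def rows(self):
        table = {}
        for ident in list(self.parent):
            table.setdefault(self.find(ident), set()).add(ident)
        return list(table.values())

def scan(pocs_sorted, ifps):
    """Scan sorted POCS quads; merge subjects sharing an IFP value."""
    same_as = SameAs()
    for (p, _o), group in groupby(pocs_sorted, key=lambda q: (q[0], q[1])):
        if p not in ifps:
            continue
        subjects = [q[3] for q in group]
        for other in subjects[1:]:
            same_as.union(subjects[0], other)
    return same_as

# Made-up POCS quads: three identifiers linked via email and homepage.
pocs = sorted([
    ("foaf:mbox", "mailto:x@example.org", "src1", "_:a"),
    ("foaf:mbox", "mailto:x@example.org", "src2", "_:b"),
    ("foaf:homepage", "http://example.org/x", "src2", "_:b"),
    ("foaf:homepage", "http://example.org/x", "src3", "http://example.org/x#me"),
    ("foaf:name", "X", "src1", "_:a"),
    ("foaf:name", "X", "src4", "_:d"),
])
rows = scan(pocs, IFPS).rows()
```

In this sample, `_:a` and `http://example.org/x#me` never share a value directly; they end up in the same row only because both are equivalent to `_:b`.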

Step 4 • Pick identifiers • Now we have a list of equivalent instance identifiers… we need to pick one and use it for the consolidated instance • We… • Pick URIs before blank nodes • Pick the more commonly used identifier, subject to the above restriction • Another scan of the data is performed to count the number of statements each identifier appears in (if it appears in the same-as list) • The chosen identifiers are called pivot identifiers
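The selection rule can be sketched as a sort key: URIs first, then higher statement counts. The lexical tie-break, the `_:` prefix test for blank nodes, and the names `count_statements` and `pick_pivot` are assumptions added for this illustration; the slides only state the two preferences.

```python
from collections import Counter

def is_blank(identifier):
    # Assumption: blank nodes are written with the "_:" prefix.
    return identifier.startswith("_:")

def count_statements(quads_spoc, identifiers):
    """Count the statements each same-as identifier appears in (subject or object)."""
    counts = Counter()
    for s, _p, o, _c in quads_spoc:
        for term in {s, o}:  # a statement counts once per identifier
            if term in identifiers:
                counts[term] += 1
    return counts

def pick_pivot(row, counts):
    """Prefer URIs over blank nodes, then the most used identifier.
    The final lexical tie-break is an assumption, added for determinism."""
    return min(row, key=lambda i: (is_blank(i), -counts[i], i))

# Made-up same-as row and data.
row = {"_:a", "http://example.org/x#me", "http://example.org/people/x"}
quads = [
    ("_:a", "foaf:name", "X", "src1"),
    ("_:a", "foaf:mbox", "mailto:x@example.org", "src1"),
    ("_:a", "foaf:nick", "x", "src1"),
    ("http://example.org/x#me", "foaf:name", "X", "src2"),
    ("http://example.org/people/x", "foaf:name", "X", "src3"),
    ("_:z", "foaf:knows", "http://example.org/people/x", "src3"),
]
counts = count_statements(quads, row)
pivot = pick_pivot(row, counts)
```

The blank node `_:a` appears in the most statements but still loses to a URI; among the two URIs, the one appearing in more statements is chosen.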

Step 5 • Rewrite identifiers • Data is scanned and identifiers in subject and object position are rewritten to their pivot identifiers • …one iteration complete • More than one iteration may be required: if a value of an inverse functional property is changed in one iteration, another iteration may find more equivalences
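A minimal sketch of the rewrite pass, assuming a dictionary `pivot_of` (a hypothetical name) that maps each non-pivot identifier to its pivot; predicates and contexts are left untouched, as on the slide.

```python
def rewrite(quads_spoc, pivot_of):
    """Yield quads with subject and object replaced by their pivot identifiers."""
    for s, p, o, c in quads_spoc:
        yield (pivot_of.get(s, s), p, pivot_of.get(o, o), c)

# Made-up mapping and data for illustration.
pivot_of = {"_:a": "http://example.org/x#me", "_:b": "http://example.org/x#me"}
quads = [
    ("_:a", "foaf:name", "X", "src1"),
    ("_:c", "foaf:knows", "_:b", "src2"),
]
consolidated = list(rewrite(quads, pivot_of))
```

Rewriting the object position matters for the iteration remark above: once `_:b` becomes a URI in object position, a later pass may find that value shared under an inverse functional property.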

Evaluation • Encountered issues applying the algorithm to the 480M dataset • foaf:weblog is defined as inverse functional, but is given values that are communal or shared weblogs (not unique to a person) • We removed foaf:weblog from the list of inverse functional properties • Many people give common arbitrary values for properties such as chat IDs; e.g., ask, none • We define a black-list for such values
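The two fixes above might be sketched as a filter applied before the equivalence scan. The black-list entries "ask" and "none" come from the slide; the case and whitespace normalisation, the `IFPS` contents, and the function name `counts_as_evidence` are assumptions.

```python
# foaf:weblog is deliberately absent, mirroring its removal from the IFP list.
IFPS = {"foaf:mbox", "foaf:homepage", "foaf:aimChatID"}

# Values named on the slide; a real black-list would presumably be longer.
BLACKLIST = {"ask", "none"}

def counts_as_evidence(predicate, value):
    """True if a (predicate, value) pair may be used to infer equivalence."""
    if predicate not in IFPS:
        return False
    # Assumption: compare case-insensitively and ignore surrounding whitespace.
    return value.strip().lower() not in BLACKLIST

checks = [
    counts_as_evidence("foaf:aimChatID", "ask"),
    counts_as_evidence("foaf:aimChatID", " None "),
    counts_as_evidence("foaf:aimChatID", "jane1984"),
    counts_as_evidence("foaf:weblog", "http://example.org/shared-blog"),
]
```

Without such a filter, every person who entered "none" as a chat ID would be merged into a single instance.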

Evaluation • 2,443,939 instances consolidated to 401,385 • 1 iteration required • The following table shows the number of atomic equivalences found through the main inverse functional properties

Thanks!
