Performing Object Consolidation on the Semantic Web Data Graph Aidan Hogan Andreas Harth Stefan Decker
Introduction • Aim: To merge equivalent RDF instances in large-scale RDF datasets, a.k.a. perform object consolidation • Background: • RDF (Resource Description Framework) is the data model used in Semantic Web technologies • Ideal for entity-centric applications where structured descriptions of entities are provided in RDF (e.g. SWSE); anything can be described in RDF • URIs are used as identifiers for entities • Ideally, URIs are used consistently across data sources to describe entities, so that information on entities can be collected and merged from different sources
Motivation • Problem: • URIs are often not agreed upon (or not provided) for entities across sources, especially for real-world entities (e.g. sources cannot agree upon a URI for a person). Therefore, one entity may be split across many instances. • Entity-centric applications will see multiple instances as multiple entities – problematic! Example later…
Towards a Solution • Towards a solution: • RDF data backed by ontologies in which certain properties may be described as being Inverse Functional • Inverse Functional Properties have values unique to an entity (e.g., chat usernames unique to people, ISBN code unique to books, etc.). • Therefore, if two instances have the same value for the same Inverse Functional Property, they are equivalent and can be merged.
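The merging rule above can be illustrated with a small in-memory sketch. It is not the authors' implementation: the function name `equivalent_groups`, the `example.org` identifiers, and the choice of two properties are assumptions made for illustration, and only direct (non-transitive) matches are found.

```python
from collections import defaultdict

# Assumed for illustration: two inverse functional properties named in the slides.
INVERSE_FUNCTIONAL = {"foaf:mbox", "foaf:homepage"}

def equivalent_groups(triples):
    """Return sets of subjects that share a value for the same
    inverse functional property (direct matches only)."""
    by_key = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in INVERSE_FUNCTIONAL:
            by_key[(predicate, obj)].add(subject)
    # Only groups with at least two distinct subjects indicate an equivalence.
    return [subjects for subjects in by_key.values() if len(subjects) > 1]

# Hypothetical data: two identifiers sharing one mailbox.
triples = [
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("http://example.org/alice", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:name", "Alice"),
]
groups = equivalent_groups(triples)
```

Sharing a `foaf:name` value does not trigger a merge here, because that property is not in the inverse functional set.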
Example • Three sources provide data on one person – different identifiers used • Two different Inverse Functional Properties: • foaf:mbox referring to a person’s email • foaf:homepage referring to a person’s homepage
Benefit • Before consolidation: three instances, one entity. For example, an entity-centric search engine would return three results for the one person. • After consolidation: one instance, one entity.
Our Dataset • Want to perform object consolidation on the entire RDF Semantic Web data graph… • 470M statements from multiple schemas describing 72M instances from over 3M data sources • 84% of instances have no URI identifier • Majority of data is FOAF (Friend of a Friend) descriptions of people (78%), with 99.9% having no identifiers • => We need a scalable algorithm for performing object consolidation
Step 1 • Need to identify Inverse Functional Properties in dataset • Inverse functional properties are defined in ontologies • Need to retrieve ontologies describing properties in the dataset • Can dereference the property URIs to find the pertinent ontologies • Examples of inverse functional properties found were • foaf:mbox (email property), foaf:homepage, foaf:weblog, foaf:aimChatID and other chat ID properties, doap:homepage
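As a rough sketch of Step 1, the snippet below picks out inverse functional properties from ontology triples that are assumed to be already retrieved and parsed; dereferencing the property URIs is not shown. The function name and the sample triples are hypothetical, while the `rdf:type` and `owl:InverseFunctionalProperty` URIs are the standard ones.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_IFP = "http://www.w3.org/2002/07/owl#InverseFunctionalProperty"

def inverse_functional_properties(ontology_triples):
    """Collect every property that an ontology types as
    owl:InverseFunctionalProperty."""
    return {
        subject
        for subject, predicate, obj in ontology_triples
        if predicate == RDF_TYPE and obj == OWL_IFP
    }

# Hypothetical excerpt of parsed ontology triples.
ontology = [
    ("http://xmlns.com/foaf/0.1/mbox", RDF_TYPE, OWL_IFP),
    ("http://xmlns.com/foaf/0.1/name", RDF_TYPE,
     "http://www.w3.org/2002/07/owl#DatatypeProperty"),
]
ifps = inverse_functional_properties(ontology)
```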
Step 2 • Need to re-order data on-disk • Initially, data is in NQuads format in unsorted SPOC order • Subject = identifier of entity being described • Predicate = property of entity being described • Object = value of property • Context = data-source of SPO triple • Example quad (shown here in SPOC order): http://andreasharth.org#me foaf:name Andreas Harth http://andreasharth.org/foaf.rdf • Data is re-ordered to POCS order and sorted. Now data is grouped by predicate and then by object.
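The re-ordering in Step 2 can be sketched as follows. This version sorts in memory for brevity, whereas the slides describe an on-disk sort over hundreds of millions of statements; the helper name `to_pocs` and the two `example.org` quads are assumptions added for illustration.

```python
def to_pocs(quad):
    """Re-order one (subject, predicate, object, context) quad
    into (predicate, object, context, subject) order."""
    subject, predicate, obj, context = quad
    return (predicate, obj, context, subject)

# The first quad is the example from the slide; the others are hypothetical.
quads_spoc = [
    ("http://andreasharth.org#me", "foaf:name", "Andreas Harth",
     "http://andreasharth.org/foaf.rdf"),
    ("_:b2", "foaf:mbox", "mailto:x@example.org", "http://example.org/b.rdf"),
    ("_:b1", "foaf:mbox", "mailto:x@example.org", "http://example.org/a.rdf"),
]

# Tuples sort lexicographically, so quads with the same predicate and
# object end up next to each other.
quads_pocs = sorted(to_pocs(q) for q in quads_spoc)
```

After sorting, the two `foaf:mbox` quads with the same object are adjacent, which is what lets the next step find equivalences in a single sequential scan.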
Step 3 • Scan data for equivalent instances • scan sorted POCS data looking for equivalent instances • if a predicate is an inverse functional property and has two identical values as object, the instances with identifiers as subject are equivalent and describe the same entity • equivalence is transitive and so a “same-as table” is used to store and perform transitive closure. • each row of the table contains equivalent identifiers • no identifier can appear in more than one row
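A minimal sketch of the Step 3 scan is given below. The slides do not say how the same-as table is implemented; here a union-find (disjoint-set) structure stands in for it, which is an assumption, as are the class and function names and the sample data. It keeps the two stated invariants: each row holds equivalent identifiers, and no identifier appears in more than one row.

```python
from itertools import groupby

# Assumed for illustration: two inverse functional properties from the slides.
INVERSE_FUNCTIONAL = {"foaf:mbox", "foaf:homepage"}

class SameAsTable:
    """Union-find stand-in for the same-as table."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def rows(self):
        """Return the rows holding at least two equivalent identifiers."""
        groups = {}
        for x in list(self.parent):
            groups.setdefault(self.find(x), set()).add(x)
        return [row for row in groups.values() if len(row) > 1]

def scan(sorted_pocs):
    """Scan POCS-sorted quads and record subjects that share an
    inverse functional property value."""
    table = SameAsTable()
    for (predicate, _obj), group in groupby(sorted_pocs, key=lambda q: (q[0], q[1])):
        if predicate not in INVERSE_FUNCTIONAL:
            continue
        subjects = [quad[3] for quad in group]
        for other in subjects[1:]:
            table.union(subjects[0], other)
    return table

# Hypothetical, already sorted POCS data: _:a and _:b share a mailbox,
# _:b and _:c share a homepage, so all three end up in one row.
sorted_pocs = [
    ("foaf:homepage", "http://example.org/~al", "c2", "_:b"),
    ("foaf:homepage", "http://example.org/~al", "c3", "_:c"),
    ("foaf:mbox", "mailto:al@example.org", "c1", "_:a"),
    ("foaf:mbox", "mailto:al@example.org", "c2", "_:b"),
    ("foaf:name", "Al", "c1", "_:a"),
]
rows = scan(sorted_pocs).rows()
```

The transitive step is visible in the sample data: `_:a` and `_:c` never share a value directly, yet they land in the same row through `_:b`.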
Step 4 • Pick identifiers • Now we have a list of equivalent instance identifiers… we need to pick one and use it for the consolidated instance • We… • Pick URIs before blank nodes • Pick the more commonly used identifier, subject to the above restriction • Another scan of the data is performed to count the number of statements each identifier appears in (if it appears in the same-as list). • The chosen identifiers are called pivot identifiers
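The selection rules in Step 4 might be sketched like this. The function names, the `_:` prefix test for blank nodes, and the alphabetical tie-break are assumptions made for illustration; the slides only state the two preferences (URIs first, then the more commonly used identifier).

```python
from collections import Counter

def is_blank_node(identifier):
    # Assumption: blank nodes are written with the "_:" prefix.
    return identifier.startswith("_:")

def count_usage(quads, identifiers):
    """Count the statements in which each listed identifier appears
    in subject or object position."""
    counts = Counter()
    for subject, _predicate, obj, _context in quads:
        if subject in identifiers:
            counts[subject] += 1
        if obj in identifiers and obj != subject:
            counts[obj] += 1
    return counts

def pick_pivot(row, counts):
    """Prefer URIs over blank nodes, then the higher statement count.
    Remaining ties are broken alphabetically (an added assumption)."""
    return min(row, key=lambda i: (is_blank_node(i), -counts.get(i, 0), i))

# Hypothetical data: the blank node is used more often, but the URI wins.
quads = [
    ("http://example.org/al", "foaf:name", "Al", "c1"),
    ("_:a", "foaf:knows", "http://example.org/al", "c2"),
    ("_:a", "foaf:name", "Al", "c2"),
    ("_:a", "foaf:mbox", "mailto:al@example.org", "c2"),
]
row = {"_:a", "http://example.org/al"}
counts = count_usage(quads, row)
pivot = pick_pivot(row, counts)
```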
Step 5 • Rewrite identifiers • Data is scanned and identifiers in subject and object position are rewritten to pivot identifiers • …one iteration complete • It’s possible that more than one iteration may be required… If a value of an inverse functional property is changed in one iteration, more equivalences may be found by another iteration
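The rewrite in Step 5 can be sketched as a single pass over the quads. The mapping name `pivot_of` and the sample data are hypothetical, and the sketch assumes that literal values never collide with identifier strings.

```python
def rewrite(quads, pivot_of):
    """Replace identifiers in subject and object position with their
    pivot identifier; predicates and contexts are left untouched."""
    return [
        (pivot_of.get(subject, subject), predicate, pivot_of.get(obj, obj), context)
        for subject, predicate, obj, context in quads
    ]

# Hypothetical mapping from a non-pivot identifier to its pivot.
pivot_of = {"_:a": "http://example.org/al"}
quads = [
    ("_:a", "foaf:name", "Al", "c2"),
    ("_:b", "foaf:knows", "_:a", "c3"),
]
rewritten = rewrite(quads, pivot_of)
```

Because object positions are rewritten too, the value of an inverse functional property can change, which is why a further scan over the rewritten data may reveal new equivalences.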
Evaluation • Encountered issues applying the algorithm to the 480M dataset • foaf:weblog is defined as inverse functional, but is given values which are communal or shared weblogs (not unique to a person) • we removed foaf:weblog from the list of inverse functional properties • many people define common arbitrary values for properties such as chat IDs; e.g., ask, none • we define a black-list for such values
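The two fixes described above could be applied as a filter before the equivalence scan, roughly as follows. Only the values `ask` and `none` and the removal of `foaf:weblog` come from the slides; the function name, the property set, and the case and whitespace normalisation are assumptions.

```python
# Properties treated as inverse functional before filtering (illustrative subset).
INVERSE_FUNCTIONAL = {"foaf:mbox", "foaf:homepage", "foaf:weblog", "foaf:aimChatID"}

# foaf:weblog is excluded because shared weblogs are not unique to a person.
EXCLUDED_PROPERTIES = {"foaf:weblog"}

# The two black-listed values named in the slides; a real list would be longer.
BLACKLISTED_VALUES = {"ask", "none"}

def is_usable(predicate, value):
    """Return True if this predicate/value pair may be used as
    evidence that two instances are equivalent."""
    if predicate not in INVERSE_FUNCTIONAL or predicate in EXCLUDED_PROPERTIES:
        return False
    # Assumption: compare values case-insensitively, ignoring outer whitespace.
    return value.strip().lower() not in BLACKLISTED_VALUES
```

For instance, `is_usable("foaf:aimChatID", "none")` is rejected, so two people who both entered `none` as a chat ID are not merged.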
Evaluation • 2,443,939 instances consolidated to 401,385 • 1 iteration required • The following table shows the number of atomic equivalences found through the main inverse functional properties