
Performing Object Consolidation on the Semantic Web Data Graph




  1. Performing Object Consolidation on the Semantic Web Data Graph. Aidan Hogan, Andreas Harth, Stefan Decker

  2. Introduction
     • Aim: to merge equivalent RDF instances in large-scale RDF datasets, a.k.a. to perform object consolidation
     • Background:
       • RDF (Resource Description Framework) is the data model used in Semantic Web technologies
       • It is ideal for entity-centric applications where structured descriptions of entities are provided in RDF (e.g. SWSE); anything can be described in RDF
       • URIs are used as identifiers for entities
       • Ideally, URIs are used consistently across data sources, so that information on an entity can be collected and merged from different sources

  3. Motivation
     • Problem:
       • URIs are often not agreed upon (or not provided) for entities across sources, especially for real-world entities (e.g. agreement on a single URI for a person cannot be achieved). One entity may therefore be split across many instances.
       • Entity-centric applications will see multiple instances as multiple entities, which is problematic. Example later…

  4. Towards a Solution
     • RDF data is backed by ontologies, in which certain properties may be declared inverse functional
     • Inverse functional properties have values unique to an entity (e.g. chat usernames are unique to people, ISBN codes are unique to books)
     • Therefore, if two instances have the same value for the same inverse functional property, they are equivalent and can be merged
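The merge rule on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the triples, the blank-node names and the `IFPS` set are made up for the example, and everything is held in memory, which would not work at the scale discussed later in the talk.

```python
from collections import defaultdict

# Hypothetical set of inverse functional properties (names taken from the slides).
IFPS = {"foaf:mbox", "foaf:homepage"}

def equivalent_groups(triples, ifps=IFPS):
    """Group subjects that share the same value for the same inverse functional property."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            by_value[(p, o)].add(s)
    # Only (property, value) pairs shared by two or more subjects signal an equivalence.
    return [sorted(subjects) for subjects in by_value.values() if len(subjects) > 1]

# Made-up sample data: _:a and _:b share an email address, _:c only shares a name.
triples = [
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:name", "Alice"),
    ("_:c", "foaf:name", "Alice"),  # foaf:name is not inverse functional, so no match
]
print(equivalent_groups(triples))
```

Sharing a name is not enough to merge `_:c` with the others, because only properties listed as inverse functional are considered.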

  5. Example
     • Three sources provide data on one person, each using a different identifier
     • Two different inverse functional properties:
       • foaf:mbox, referring to a person’s email
       • foaf:homepage, referring to a person’s homepage

  6. Benefit
     • Before consolidation: three instances, one entity. For example, an entity-centric search engine would return three results for the one person.
     • After consolidation: one instance, one entity.

  7. Our Dataset
     • We want to perform object consolidation on the entire RDF Semantic Web data graph…
     • 470M statements from multiple schemas, describing 72M instances from over 3M data sources
     • 84% of instances have no URI identifier
     • The majority of the data (78%) is FOAF (Friend of a Friend) descriptions of people, 99.9% of which have no identifiers
     • => We need a scalable algorithm for performing object consolidation

  8. Step 1
     • Need to identify the inverse functional properties in the dataset
       • Inverse functional properties are defined in ontologies
       • Need to retrieve the ontologies describing the properties in the dataset
       • Can dereference the property URIs to find the pertinent ontologies
     • Examples of inverse functional properties found:
       • foaf:mbox (email property), foaf:homepage, foaf:weblog, foaf:aimChatID and other chat ID properties, doap:homepage
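Once an ontology has been retrieved, finding its inverse functional properties amounts to looking for properties typed as `owl:InverseFunctionalProperty`. The sketch below assumes the ontology is already available as N-Triples text and uses a deliberately naive line splitter instead of a real RDF parser. The sample ontology excerpt is illustrative, and the dereferencing step itself is not shown.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
OWL_IFP = "<http://www.w3.org/2002/07/owl#InverseFunctionalProperty>"

def inverse_functional_properties(ntriples_text):
    """Return the URIs typed as owl:InverseFunctionalProperty in N-Triples text.

    Naive parsing: only lines of the form '<s> <p> <o> .' are considered,
    so statements with literal objects containing spaces are skipped.
    """
    found = set()
    for line in ntriples_text.splitlines():
        parts = line.strip().rstrip(".").split()
        if len(parts) == 3 and parts[1] == RDF_TYPE and parts[2] == OWL_IFP:
            found.add(parts[0].strip("<>"))
    return found

# Illustrative ontology excerpt (an assumption for this example, not a real download).
ontology = """
<http://xmlns.com/foaf/0.1/mbox> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#InverseFunctionalProperty> .
<http://xmlns.com/foaf/0.1/name> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#DatatypeProperty> .
"""
print(inverse_functional_properties(ontology))
```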

  9. Step 2
     • Need to re-order the data on disk
     • Initially the data is in NQuads format, in unsorted SPOC order:
       • Subject = identifier of the entity being described
       • Predicate = property of the entity being described
       • Object = value of the property
       • Context = data source of the SPO triple
     • Example quad, shown in SPOC order: http://andreasharth.org#me foaf:name Andreas Harth http://andreasharth.org/foaf.rdf
     • The data is re-ordered to POCS order…
     • …and sorted. The data is now grouped by predicate and then by object.
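The re-ordering step can be illustrated as follows. This is a minimal in-memory sketch, whereas the slides describe an on-disk sort, so a real run over hundreds of millions of statements would need an external sort instead of `sorted`. The first quad is the one from the slide; the two `foaf:mbox` quads and their contexts are made up.

```python
def to_pocs(quad):
    """Re-order a (subject, predicate, object, context) quad to POCS order."""
    s, p, o, c = quad
    return (p, o, c, s)

quads_spoc = [
    ("http://andreasharth.org#me", "foaf:name", "Andreas Harth",
     "http://andreasharth.org/foaf.rdf"),
    # Hypothetical quads, added so that the grouping effect is visible.
    ("_:b2", "foaf:mbox", "mailto:x@example.org", "http://example.org/b.rdf"),
    ("_:b1", "foaf:mbox", "mailto:x@example.org", "http://example.org/a.rdf"),
]

# Sorting tuples compares predicate first, then object, then context, then subject,
# so statements with the same predicate and object end up next to each other.
quads_pocs = sorted(to_pocs(q) for q in quads_spoc)
for quad in quads_pocs:
    print(quad)
```

After sorting, the two statements sharing the same `foaf:mbox` value are adjacent, which is what lets the next step find equivalences in a single sequential scan.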

  10. Step 3
     • Scan the sorted POCS data looking for equivalent instances
     • If a predicate is an inverse functional property and the same object value occurs in two statements, the instances identified by the subjects of those statements are equivalent and describe the same entity
     • Equivalence is transitive, so a “same-as table” is used to store the equivalences and perform the transitive closure
       • Each row of the table contains equivalent identifiers
       • No identifier can appear in more than one row
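One way to get a table in which each row holds equivalent identifiers and no identifier appears twice is a union-find (disjoint-set) structure. The slides do not say how the same-as table is implemented, so the `SameAs` class below is a swapped-in technique used for illustration, and the sample data is made up to mirror the three-source example from slide 5.

```python
from itertools import groupby

class SameAs:
    """Same-as table sketched as a union-find structure (an assumption, see above)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def rows(self):
        """Return the rows of the table: one sorted list per equivalence class."""
        groups = {}
        for x in list(self.parent):
            groups.setdefault(self.find(x), set()).add(x)
        return [sorted(g) for g in groups.values() if len(g) > 1]

def scan(sorted_pocs, ifps):
    """Scan sorted POCS quads and record subjects sharing an IFP value."""
    table = SameAs()
    for (p, _o), group in groupby(sorted_pocs, key=lambda q: (q[0], q[1])):
        if p not in ifps:
            continue
        subjects = [q[3] for q in group]
        for other in subjects[1:]:
            table.union(subjects[0], other)
    return table

# Made-up data: source 1 and 2 share an email, source 2 and 3 share a homepage.
pocs = sorted([
    ("foaf:mbox", "mailto:p@example.org", "http://example.org/1.rdf", "_:a"),
    ("foaf:mbox", "mailto:p@example.org", "http://example.org/2.rdf", "http://example.org/p#me"),
    ("foaf:homepage", "http://example.org/p", "http://example.org/2.rdf", "http://example.org/p#me"),
    ("foaf:homepage", "http://example.org/p", "http://example.org/3.rdf", "_:c"),
])
print(scan(pocs, {"foaf:mbox", "foaf:homepage"}).rows())
```

`_:a` and `_:c` never share a value directly, but both are equivalent to `http://example.org/p#me`, so transitivity places all three in one row.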

  11. Step 4
     • Pick identifiers
     • We now have a list of equivalent instance identifiers… we need to pick one from each row and use it for the consolidated instance
     • We…
       • Pick URIs before blank nodes
       • Pick the more commonly used identifier, subject to the restriction above
     • Another scan of the data is performed to count the number of statements each identifier appears in (if it appears in the same-as list)
     • The chosen identifiers are called pivot identifiers
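The two selection rules can be expressed as a sort key. In this sketch, the `_:` prefix test for blank nodes, the lexical tie-break and all sample data are assumptions made for the example; the slides only state the URI-first and most-common rules.

```python
from collections import Counter

def is_blank(identifier):
    """Treat identifiers written with the '_:' prefix as blank nodes (an assumption)."""
    return identifier.startswith("_:")

def count_usage(quads, identifiers):
    """Count the statements each same-as identifier appears in, as subject or object."""
    counts = Counter()
    for s, _p, o, _c in quads:
        for i in {s, o} & identifiers:
            counts[i] += 1
    return counts

def pick_pivot(row, counts):
    """URIs before blank nodes, then the most used; ties broken lexically (assumption)."""
    return min(row, key=lambda i: (is_blank(i), -counts[i], i))

# Made-up data: the blank node is the most used identifier, but a URI must win.
row = ["_:a", "http://example.org/p#me", "http://other.example/p"]
quads = [
    ("_:a", "foaf:name", "P", "http://example.org/1.rdf"),
    ("_:a", "foaf:mbox", "mailto:p@example.org", "http://example.org/1.rdf"),
    ("_:a", "foaf:knows", "http://other.example/p", "http://example.org/1.rdf"),
    ("http://other.example/p", "foaf:name", "P", "http://example.org/2.rdf"),
    ("http://example.org/p#me", "foaf:name", "P", "http://example.org/3.rdf"),
]
counts = count_usage(quads, set(row))
print(pick_pivot(row, counts))
```

Here `_:a` appears in three statements, but it is a blank node, so the choice falls to the more frequently used of the two URIs.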

  12. Step 5
     • Rewrite identifiers
     • The data is scanned, and identifiers in subject and object position are rewritten to their pivot identifiers
     • …one iteration complete
     • More than one iteration may be required: if the value of an inverse functional property is changed in one iteration, another iteration may find further equivalences
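The rewrite itself is a lookup per statement. The sketch below assumes a dictionary `pivot_of` mapping every non-pivot identifier to its pivot, built from the same-as table; the name and the sample quads are hypothetical, and literals are represented as plain strings for simplicity.

```python
def rewrite(quads, pivot_of):
    """Replace subject and object identifiers with their pivot, if they have one."""
    return [(pivot_of.get(s, s), p, pivot_of.get(o, o), c) for s, p, o, c in quads]

# Hypothetical mapping from non-pivot identifiers to the chosen pivot.
pivot_of = {"_:a": "http://example.org/p#me", "_:c": "http://example.org/p#me"}

quads = [
    ("_:a", "foaf:name", "P", "http://example.org/1.rdf"),
    ("_:x", "foaf:knows", "_:c", "http://example.org/2.rdf"),
]
consolidated = rewrite(quads, pivot_of)
for quad in consolidated:
    print(quad)
```

Because objects are rewritten too, a driver loop could re-run the scan on the rewritten data and stop once no new equivalences turn up.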

  13. Evaluation
     • We encountered issues when applying the algorithm to the 480M dataset
     • foaf:weblog is defined as inverse functional, but is given values that are communal or shared weblogs (not unique to a person)
       • We removed foaf:weblog from the list of inverse functional properties
     • Many people give common arbitrary values for properties such as chat IDs, e.g. ask, none
       • We defined a blacklist for such values
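Both fixes can be applied as a filter in front of the equivalence scan. In this sketch the blacklist contains only the two values named on the slide (the full list is not given), and normalising values by trimming and lower-casing is an assumption.

```python
# foaf:weblog is deliberately left out, as described on the slide.
IFPS = {"foaf:mbox", "foaf:homepage", "foaf:aimChatID", "doap:homepage"}

# Only the two example values from the slide; the real blacklist is not given.
BLACKLIST = {"ask", "none"}

def usable_for_consolidation(predicate, value):
    """True if a statement may be used to infer that two instances are equivalent."""
    if predicate not in IFPS:
        return False
    # Normalisation (strip + lower-case) is an assumption made for this sketch.
    return value.strip().lower() not in BLACKLIST

print(usable_for_consolidation("foaf:aimChatID", "None"))
print(usable_for_consolidation("foaf:aimChatID", "some_user_42"))
print(usable_for_consolidation("foaf:weblog", "http://blogs.example.org/shared"))
```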

  14. Evaluation
     • 2,443,939 instances consolidated to 401,385
     • 1 iteration required
     • The following table shows the number of atomic equivalences found through the main inverse functional properties

  15. Thanks!
