
Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich, Axel Polleres, Stefan Decker

Scalable and Distributed Methods for Entity Matching, Consolidation and Disambiguation over Linked Data Corpora. Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich, Axel Polleres, Stefan Decker. Presented by Joseph Park.


Presentation Transcript


  1. Scalable and Distributed Methods for Entity Matching, Consolidation and Disambiguation over Linked Data Corpora • Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich, Axel Polleres, Stefan Decker • Presented by Joseph Park

  2. Introduction • Linked Data best practices: • Use URIs as names for things (not just documents) • Make those URIs dereferenceable via HTTP • Return useful and relevant RDF content upon lookup of those URIs • Include links to other datasets • Linked Open Data project • Goal of providing dereferenceable, machine-readable data in RDF • Emphasis on reuse of URIs and inter-linkage between remote datasets • Web of Data • 30 billion published RDF triples

  3. Aims & Goals • Focus on finding equivalent entities • E.g. people, places, musicians, proteins • Two entities are equivalent if they are coreferent • Interest in identifying coreferences and merging the knowledge contributions provided by distinct parties (consolidation)

  4. owl:sameAs • owl:sameAs • A core OWL property that defines equivalences between individuals • Two individuals related by owl:sameAs are coreferent • Inferring new owl:sameAs relations: • Inverse-functional properties (e.g. :biologicalMotherOf) • Functional properties (e.g. :hasBiologicalMother) • Cardinality and max-cardinality restrictions
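The two property-based inference rules on this slide can be sketched in a few lines of Python. This is a minimal, single-machine illustration, not the distributed method the authors describe: the `ex:` names and the sample triples are hypothetical, triples are plain tuples, and cardinality restrictions are left out.

```python
from collections import defaultdict
from itertools import combinations

def infer_same_as(triples, functional, inverse_functional):
    """Return unordered pairs of terms inferred to be owl:sameAs.

    Functional property: one subject, several objects -> the objects corefer.
    Inverse-functional property: one object, several subjects -> the subjects corefer.
    """
    groups = defaultdict(set)
    for s, p, o in triples:
        if p in functional:
            groups[("fp", p, s)].add(o)   # objects sharing a subject
        if p in inverse_functional:
            groups[("ifp", p, o)].add(s)  # subjects sharing an object
    pairs = set()
    for members in groups.values():
        # Sorting gives each unordered pair a single canonical form.
        pairs.update(combinations(sorted(members), 2))
    return pairs

# Hypothetical data, made up for illustration only.
triples = [
    ("ex:Anna", "ex:biologicalMotherOf", "ex:Ben"),
    ("ex:AnnaSmith", "ex:biologicalMotherOf", "ex:Ben"),
    ("ex:Carl", "ex:hasBiologicalMother", "ex:Dora"),
    ("ex:Carl", "ex:hasBiologicalMother", "ex:DoraJones"),
    ("ex:Eve", "ex:biologicalMotherOf", "ex:Finn"),
]
same_as = infer_same_as(
    triples,
    functional={"ex:hasBiologicalMother"},
    inverse_functional={"ex:biologicalMotherOf"},
)
print(sorted(same_as))
```

Here `ex:Anna` and `ex:AnnaSmith` are paired because both are stated to be the biological mother of `ex:Ben`, and `ex:Dora` and `ex:DoraJones` are paired because `ex:Carl` can have only one biological mother. A fuller treatment would also compute the transitive closure of the resulting pairs.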

  5. Constraints to owl:sameAs

  6. Experiment • 1.118 billion quadruples • Crawled from 3.985 million web documents • 1.106 billion are unique • 947 million are unique triples • 9 machines linked by Gigabit Ethernet

  7. Baseline – owl:sameAs • Extracted 11.93 million raw owl:sameAs quadruples • Only 3.77 million unique triples • 1000 randomly chosen pairs hand-checked • Trivially same (661 times) • Same (301 times) • Different (28 times) • Unclear (10 times)

  8. Constraint Counts • No documents used owl:maxQualifiedCardinality • 434 functional properties • 57 inverse-functional properties • 109 cardinality restrictions with a value of 1 • 52.93 million memberships of inverse-functional properties • 22.14 million asserted • 11.09 million memberships of functional properties • 1.17 million asserted • 2.56 million cardinality triples • 533 thousand asserted

  9. Reasoning using constraints • Zero owl:sameAs inferences through cardinality rules • 106.8 thousand owl:sameAs through functional-property reasoning • 8.7 million owl:sameAs through inverse-functional-property reasoning • Resulted in a total of 12.03 million owl:sameAs statements

  10. Results from constraints • From the 12.03 million owl:sameAs quadruples • 1000 randomly chosen and hand-checked: • Trivially same (145 times) • Same (823 times) • Different (23 times) • Unclear (9 times)
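The two hand-checked samples (the baseline owl:sameAs sample and this constraint-extended one) are easier to compare as proportions. The sketch below only restates the counts given on the slides. Treating "trivially same" plus "same" as the share of correct pairs is an assumption made here for illustration, not a measure named in the slides.

```python
# Counts copied from the two hand-checked samples of 1000 pairs each.
baseline = {"trivially same": 661, "same": 301, "different": 28, "unclear": 10}
extended = {"trivially same": 145, "same": 823, "different": 23, "unclear": 9}

def proportions(counts):
    """Convert raw label counts into fractions of the sample."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def share_same(counts):
    """Assumed summary: fraction judged 'trivially same' or 'same'."""
    p = proportions(counts)
    return p["trivially same"] + p["same"]

for name, sample in (("baseline", baseline), ("extended", extended)):
    p = proportions(sample)
    print(f"{name}: same={share_same(sample):.3f} "
          f"different={p['different']:.3f} unclear={p['unclear']:.3f}")
```

Under that assumption, the share of pairs judged the same is similar in both samples (962 versus 968 out of 1000), while the share of non-trivial matches rises from 301 to 823 out of 1000.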

  11. Statistical concurrence • Entity concurrence—sharing of outlinks, inlinks, and attribute values • Higher score means more discriminating shared characteristics

  12. Running Example

  13. Quantifying concurrence • Observed cardinality (e.g. Card_G_ex(foaf:maker; dblp:AliceB10) = 2) • Observed inverse-cardinality (e.g. ICard_G_ex(foaf:gender; "female") = 2) • Average inverse-cardinality (e.g. AIC_G_ex(foaf:gender) = 1.5) • Can also be viewed as the average of the non-zero inverse-cardinalities • For example, for foaf:gender: 1 for "male" and 2 for "female", so (1 + 2) / 2 = 1.5
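The three measures on this slide can be sketched over a small in-memory graph. The running example graph is not included in this transcript, so the triples below are a hypothetical stand-in chosen to reproduce the values quoted on the slide. The function names `card`, `icard`, and `aic`, and the `ex:` resources, are likewise illustrative.

```python
def card(graph, p, s):
    """Observed cardinality: distinct objects of property p for subject s."""
    return len({o for (s2, p2, o) in graph if s2 == s and p2 == p})

def icard(graph, p, o):
    """Observed inverse-cardinality: distinct subjects with value o for p."""
    return len({s for (s, p2, o2) in graph if p2 == p and o2 == o})

def aic(graph, p):
    """Average inverse-cardinality: mean number of subjects per distinct
    object of p, taken over objects that occur at least once."""
    pairs = {(s, o) for (s, p2, o) in graph if p2 == p}
    objects = {o for (_, o) in pairs}
    if not objects:
        raise ValueError(f"property {p!r} does not occur in the graph")
    return len(pairs) / len(objects)

# Hypothetical graph: one paper with two makers; two females, one male.
graph = {
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", "female"),
    ("ex:Claire", "foaf:gender", "female"),
    ("ex:Bob", "foaf:gender", "male"),
}

print(card(graph, "foaf:maker", "dblp:AliceB10"))
print(icard(graph, "foaf:gender", "female"))
print(aic(graph, "foaf:gender"))
```

A low average inverse-cardinality suggests that sharing a value for that property is more discriminating. A value shared by many subjects, such as a gender, says little about whether two entities are the same.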

  14. Adjusted Average inverse-cardinality

  15. Concurrence Coefficients

  16. Coefficient Example

  17. Aggregated concurrence score • Same process as determining the probability that at least one of two independent events occurs (given the same outcome event) • $P(A \cup B) = P(A) + P(B) - P(A) \cdot P(B)$
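Read as a probabilistic sum, the formula on this slide combines two coefficients a and b as a + b − a·b, which can be folded over any number of coefficients. The sketch below assumes each coefficient lies in [0, 1] and is treated as independent. The function name `aggregate` is hypothetical, and any further adjustments the authors apply are not covered by the slides.

```python
from functools import reduce

def aggregate(coefficients):
    """Combine concurrence coefficients with the probabilistic sum a + b - a*b.

    Equivalent to 1 - prod(1 - c) over all coefficients; an empty input gives 0.
    """
    for c in coefficients:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"coefficient out of range [0, 1]: {c}")
    return reduce(lambda a, b: a + b - a * b, coefficients, 0.0)

print(aggregate([0.5, 0.5]))       # two moderately discriminating shared values
print(aggregate([0.2, 0.1, 0.05]))  # several weak ones accumulate slowly
```

The operation is commutative and associative, so the order in which shared characteristics are processed does not change the score, and the result never exceeds 1.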

  18. Results from Concurrence • Average cardinality of about 1.5 • Average inverse-cardinality of about 2.64 • Total of 636.9 million weighted concurrence pairs • Mean concurrence weight of about 0.0159 • Highly concurring entities were in many cases not coreferent

  19. Example of Concurrence
