Scalable and Distributed Methods for Entity Matching, Consolidation and Disambiguation over Linked Data Corpora
Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich, Axel Polleres, Stefan Decker
Presented by Joseph Park
Introduction
• Linked Data best practices:
  • Use URIs as names for things (not just documents)
  • Make those URIs dereferenceable via HTTP
  • Return useful and relevant RDF content upon lookup of those URIs
  • Include links to other datasets
• Linked Open Data project
  • Goal of providing dereferenceable, machine-readable data in RDF
  • Emphasis on reuse of URIs and inter-linkage between remote datasets
• Web of Data
  • 30 billion published RDF triples
Aims & Goals
• Focus on finding equivalent entities
  • E.g. people, places, musicians, proteins
• Two entities are equivalent if they are coreferent
• Interest in identifying coreferences and merging the knowledge contributions provided by distinct parties (consolidation)
owl:sameAs
• owl:sameAs
  • A core OWL property that defines equivalences between individuals
  • Two individuals related by owl:sameAs are coreferent
• Inferring new owl:sameAs relations:
  • Inverse-functional properties (e.g. :biologicalMotherOf)
  • Functional properties (e.g. :hasBiologicalMother)
  • Cardinality and max-cardinality restrictions
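The functional and inverse-functional rules listed above can be sketched in a few lines. This is a minimal in-memory illustration, not the distributed method from the paper: the function name `infer_same_as`, the `ex:` identifiers, and the sample triples are all hypothetical, and cardinality restrictions are left out.

```python
from collections import defaultdict
from itertools import combinations


def infer_same_as(triples, functional=frozenset(), inverse_functional=frozenset()):
    """Return unordered pairs of terms inferred to be owl:sameAs (hypothetical helper)."""
    groups = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            # Same property and same value: the subjects must be coreferent.
            groups[("ifp", p, o)].add(s)
        if p in functional:
            # Same subject and same property: the values must be coreferent.
            groups[("fp", p, s)].add(o)
    pairs = set()
    for members in groups.values():
        # Emit each unordered pair once, in sorted order.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical data: two names for Bob's mother via each kind of property.
triples = [
    ("ex:Ann", "ex:biologicalMotherOf", "ex:Bob"),
    ("ex:AnnSmith", "ex:biologicalMotherOf", "ex:Bob"),
    ("ex:Bob", "ex:hasBiologicalMother", "ex:Ann"),
    ("ex:Bob", "ex:hasBiologicalMother", "ex:A_Smith"),
]
pairs = infer_same_as(
    triples,
    functional={"ex:hasBiologicalMother"},
    inverse_functional={"ex:biologicalMotherOf"},
)
print(sorted(pairs))
```

The sketch treats every value as an identifier; a real implementation would have to skip literal values of functional properties, since literals cannot be related by owl:sameAs.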
Experiment
• 1.118 billion quadruples
  • Crawled from 3.985 million web documents
  • 1.106 billion are unique
  • 947 million are unique triples
• 9 machines linked by Gigabit Ethernet
Baseline – owl:sameAs
• Extracted 11.93 million raw owl:sameAs quadruples
  • Only 3.77 million unique triples
• 1000 randomly chosen pairs hand-checked:
  • Trivially same (661 times)
  • Same (301 times)
  • Different (28 times)
  • Unclear (10 times)
Constraint Counts
• No documents used owl:maxQualifiedCardinality
• 434 functional properties
• 57 inverse-functional properties
• 109 cardinality restrictions with a value of 1
• 52.93 million memberships of inverse-functional properties
  • 22.14 million asserted
• 11.09 million memberships of functional properties
  • 1.17 million asserted
• 2.56 million cardinality triples
  • 533 thousand asserted
Reasoning Using Constraints
• Zero owl:sameAs inferences through cardinality rules
• 106.8 thousand owl:sameAs inferences through functional-property reasoning
• 8.7 million owl:sameAs inferences through inverse-functional-property reasoning
• Resulted in a total of 12.03 million owl:sameAs statements
Results from Constraints
• From the 12.03 million owl:sameAs quadruples
• 1000 randomly chosen pairs hand-checked:
  • Trivially same (145 times)
  • Same (823 times)
  • Different (23 times)
  • Unclear (9 times)
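To compare the two hand-checked samples, the counts can be turned into shares of the 1000 pairs. The snippet below only restates the numbers from the baseline and constraint slides; the variable names and the `shares` helper are illustrative.

```python
# Hand-checked sample counts as reported on the slides.
baseline = {"trivially same": 661, "same": 301, "different": 28, "unclear": 10}
extended = {"trivially same": 145, "same": 823, "different": 23, "unclear": 9}


def shares(counts):
    """Convert raw counts into fractions of the sample (hypothetical helper)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in (("baseline", baseline), ("with constraints", extended)):
    s = shares(counts)
    correct = s["trivially same"] + s["same"]
    print(f"{name}: {correct:.1%} same, {s['different']:.1%} different, {s['unclear']:.1%} unclear")
```

Both samples are about 96 to 97 percent correct, but the mix shifts: with constraint reasoning, most correct pairs fall under "same" rather than "trivially same".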
Statistical Concurrence
• Entity concurrence: sharing of outlinks, inlinks, and attribute values
• Higher score means more discriminating shared characteristics
Quantifying Concurrence
• Observed cardinality (e.g. Card_G_ex(foaf:maker; dblp:AliceB10) = 2)
• Observed inverse-cardinality (e.g. ICard_G_ex(foaf:gender; "female") = 2)
• Average inverse-cardinality (e.g. AIC_G_ex(foaf:gender) = 1.5)
  • Can also be viewed as the average of the non-zero inverse-cardinalities
  • For example, for foaf:gender: 1 for "male" and 2 for "female", giving (1 + 2) / 2 = 1.5
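The three quantities can be computed directly over a small set of triples. The example graph G_ex is not shown on the slide, so the graph below is a made-up stand-in chosen to reproduce the slide's numbers; the function names `card`, `icard`, and `aic` are likewise illustrative.

```python
# Hypothetical stand-in for G_ex: a set of (subject, predicate, object) triples.
graph = {
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", "female"),
    ("ex:Claire", "foaf:gender", "female"),
    ("ex:Bob", "foaf:gender", "male"),
}


def card(graph, p, s):
    """Observed cardinality: number of distinct values of property p on subject s."""
    return len({o for (s2, p2, o) in graph if s2 == s and p2 == p})


def icard(graph, p, o):
    """Observed inverse-cardinality: number of distinct subjects with value o for p."""
    return len({s for (s, p2, o2) in graph if p2 == p and o2 == o})


def aic(graph, p):
    """Average inverse-cardinality of p over the values that actually occur with p."""
    values = {o for (_, p2, o) in graph if p2 == p}
    if not values:
        raise ValueError(f"property {p!r} does not occur in the graph")
    return sum(icard(graph, p, o) for o in values) / len(values)


print(card(graph, "foaf:maker", "dblp:AliceB10"))  # 2
print(icard(graph, "foaf:gender", "female"))       # 2
print(aic(graph, "foaf:gender"))                   # 1.5
```

Averaging only over values that occur is what makes this the mean of the non-zero inverse-cardinalities: values that never appear with the property contribute nothing.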
Aggregated Concurrence Score
• Same process as determining the probability that at least one of two independent events occurs (given the same outcome event)
• P(A ∨ B) = P(A) + P(B) − P(A)·P(B)
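Reading the slide's formula as the probabilistic sum a + b − a·b (an assumption about the intended meaning), several individual concurrence scores can be folded into one aggregate. This sketch shows only that combination step; any further weighting used in the paper is not covered, and the function names are hypothetical.

```python
from functools import reduce


def probabilistic_sum(a, b):
    """Combine two scores in [0, 1] like the probability of 'A or B' for independent events."""
    return a + b - a * b


def aggregate(scores):
    """Fold a sequence of scores with the probabilistic sum, starting from 0."""
    return reduce(probabilistic_sum, scores, 0.0)


print(aggregate([0.5, 0.5]))       # 0.75
print(aggregate([0.2, 0.3, 0.1]))  # close to 1 - 0.8 * 0.7 * 0.9
```

The fold is equivalent to one minus the product of (1 − score), so the aggregate never exceeds 1, never drops below its largest input, and does not depend on the order of the scores.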
Results from Concurrence
• Average cardinality of about 1.5
• Average inverse-cardinality of about 2.64
• Total of 636.9 million weighted concurrence pairs
• Mean concurrence weight of about 0.0159
• Highly concurring entities were in many cases not coreferent