A Novel Approach for Entity Linkage IEEE-IRI2009, Las Vegas 2009-08-11 Heiko Stoermer, Paolo Bouquet University of Trento, Italy This work is co-funded by the European Commission in the context of the Large-scale Integrated project OKKAM (GA 215032)
Outline
• Part 1: Background and Context
• Part 2: Problem, Approach, Implementation, Results
Web 2.0 seen from Outer Space
• Billions of people who create and share information and content (Web 2.0 producers)
• Intelligent (semantic-driven) mash-ups and their use in new, complex and ubiquitous services
However ...
Flood of Identifiers
• http://www.reuters.com/news/globalcoverage/barackobama
• http://www.OPENCALAIS.com/watch?v=z4W2_raF_iw
• http://en.wikipedia.org/wiki/Barack_obama
• ??
• http://www.facebook.com/home.php#/barackobama?ref=s
• http://dbpedia.org/resource/Barack_Obama
• http://farm4.static.flickr.com/3193/2437394249_824e76ed76.jpg?v=0
• http://www.linkedin.com/in/barackobama
The Flood of Identifiers
• Too many identifiers for the same thing out there …
• … not much used in content production …
• … and poorly interlinked
• How do I find out what Web users have to say about our product XYZ?
• How can I avoid advertising restaurants in Venice (FL) for a query about Venice (IT)?
• How do we collect distributed information about a specific customer or project in a complex Intranet environment?
• In short: how can we enable mash-ups based on: select * from Web where ID="…" on the Web of Data or in an enterprise-wide Intranet?
Our Wish for The Web X.0 ...
A Possible Solution -> An Entity Name System for the (Semantic) Web APIs
• Open, decentralized service
• Provides IDs for annotating any content in any application
• Supports reuse of IDs
• Maps ID schemas onto each other
• Based on HTTP
The ENS – A large "phonebook"
• Input:
  • a simple search query
  • a reference record
• Output: a re-usable entity identifier
• Under the hood:
  • large-scale entity repository
  • pre-populated
  • collaboratively growing
  • entity matching architecture
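The "phonebook" idea (a reference record goes in, a re-usable identifier comes out) can be illustrated with a toy in-memory sketch. Everything here is an assumption made for illustration: the class `ToyENS`, the method `lookup_or_create`, the `urn:example:` identifier scheme, and the exact-match rule on a "name" feature are all hypothetical and are not the actual ENS interface or its matching architecture.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyENS:
    """Hypothetical in-memory stand-in for an entity repository (not the real ENS)."""

    # Maps identifier -> stored reference record (a dict of feature name -> value).
    entities: dict = field(default_factory=dict)

    def lookup_or_create(self, record):
        """Return an existing identifier for the record, or mint a new one.

        Assumption: two records denote the same entity when their "name"
        values are equal after trimming whitespace and lower-casing.
        """
        name = record.get("name", "").strip().lower()
        for entity_id, stored in self.entities.items():
            if stored.get("name", "").strip().lower() == name:
                return entity_id  # reuse the identifier instead of creating a new one
        entity_id = f"urn:example:entity:{uuid.uuid4()}"
        self.entities[entity_id] = dict(record)
        return entity_id


ens = ToyENS()
first = ens.lookup_or_create({"name": "Barack Obama"})
second = ens.lookup_or_create({"name": "barack obama "})
other = ens.lookup_or_create({"name": "Venice"})
```

In this sketch `first` and `second` are the same identifier, which is the reuse behaviour the slide describes; a real system would replace the exact-match rule with a proper entity matching step.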
ENS Overview
Part 2
Entity Matching
• Related work under different names: merge-purge, record linkage, deduplication, entity consolidation, entity linkage ...
• New aspects:
  • unknown entity representation
  • unknown query representation
  • multi-linguality
• Our problem:
  • answer an entity search query with high top-1 success rate in very short time
Bottom-up Study
• We asked about 250 individuals from all over the world which feature names they would use to describe a certain set of entity types
• Key results:
  • "name" feature shared between all analyzed types
  • "name" feature with very high relevance for all analyzed types
Name-feature based Entity Similarity
Avoiding "Spam"
• Example:
  • Q={q1, q2}
  • E={e1, e2, e3}
• Establish fsim() for every pair (q,e)
• Select only the maximally similar pairs
• Build final score between Q and E
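The steps above can be sketched as follows. This is one possible reading, not the authors' implementation: `fsim` is assumed here to be a generic string similarity (Python's `difflib.SequenceMatcher` ratio), "maximally similar pairs" is interpreted as keeping the best-matching entity value for each query term, and the final score is assumed to be the mean of those maxima. The function name `entity_score` is hypothetical.

```python
from difflib import SequenceMatcher


def fsim(q, e):
    """Assumed similarity function: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, q.lower(), e.lower()).ratio()


def entity_score(query_terms, entity_values):
    """Score an entity E against a query Q using only the best pair per query term."""
    if not query_terms or not entity_values:
        return 0.0
    # fsim() for every pair (q, e), keeping only the maximum for each q.
    best_per_term = [
        max(fsim(q, e) for e in entity_values) for q in query_terms
    ]
    # Assumed aggregation: mean of the per-term maxima.
    return sum(best_per_term) / len(best_per_term)


Q = ["barack", "obama"]
E = ["Barack", "Obama", "president"]
spam = ["obama"] * 10  # an entity that repeats one matching value many times

score_clean = entity_score(Q, E)
score_spam = entity_score(Q, spam)
score_single = entity_score(Q, ["obama"])
```

Because only the best pair per query term counts, repeating a matching value ten times (`spam`) gives exactly the same score as stating it once, which is the kind of "spam" resistance the slide title suggests.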
Results
• Benchmark based on 67 example queries
• ~590k entities
• Top-1 improvement of ~12% over reference algorithm
• No performance penalty
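For readers unfamiliar with the metric, a top-1 success rate can be computed as the fraction of queries whose first-ranked result is the expected entity. The sketch below uses made-up toy data, not the benchmark reported here; the function name `top1_success_rate` and the identifiers `e1`, `e2`, ... are hypothetical.

```python
def top1_success_rate(ranked_results, expected):
    """Fraction of queries whose top-ranked entity equals the expected entity.

    ranked_results: dict mapping query id -> list of entity ids, best first
    expected:       dict mapping query id -> the correct entity id
    """
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for query_id, ranking in ranked_results.items()
        if ranking and ranking[0] == expected.get(query_id)
    )
    return hits / len(ranked_results)


# Toy data for illustration only.
ranked = {
    "q1": ["e1", "e7"],  # correct entity ranked first
    "q2": ["e9", "e2"],  # correct entity only ranked second
    "q3": ["e3"],        # correct entity ranked first
}
gold = {"q1": "e1", "q2": "e2", "q3": "e3"}

rate = top1_success_rate(ranked, gold)
```

With this toy input two of the three queries succeed at rank 1; comparing such a rate between two algorithms on the same query set is how a relative top-1 improvement would be derived.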
Future Work
• Improved similarity measure based on a knowledge model inferred from our study
• Evaluation in the context of the 2009 Ontology Matching Contest (entity track)
Thank You! Contact stoermer@disi.unitn.it if you are interested in using the ENS in your experiments/projects/solutions.