ESTER: Efficient Search on Text, Entities, and Relations. Holger Bast, Alexandru Chitea, Fabian Suchanek, Ingmar Weber. Presented by Krupakar Reddy Salguti.
Keyword Search vs. Semantic Search • Keyword search • Query: john lennon • Answer: documents containing the words john and lennon • Semantic search • Query: musician • Answer: documents containing an instance of musician • Combined search • Query: beatles musician • Answer: documents containing the word beatles and an instance of musician
Semantic Search: Challenges + Our System 1. Entity recognition • approach 1: let users annotate (semantic web) • approach 2: annotate (semi-)automatically • our system: uses Wikipedia links + learns from them 2. Query Processing • build a space-efficient index • which enables fast query answers • our system: as compact and fast as a standard full-text engine 3. User Interface • easy to use • yet powerful query capabilities • our system: standard interface with interactive suggestions
In the Rest of this Talk … • Efficiency • three simple ideas (which all fail) • our approach (which works) • Queries supported • essentially all SPARQL queries, and • seamless integration with ordinary full-text search • Experiments • efficiency (great) • quality (not so great yet) • Conclusions • lots of interesting + challenging open problems
Efficiency: Simple Idea 1 • Add “semantic tags” to the document • e.g., add the special word tag:musician before every occurrence of a musician in a document • Problem 1: Index blowup • e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes) • Problem 2: Limited querying capabilities • e.g., could not produce a list of musicians that occur in documents that also contain the word beatles • in particular, could not do all SPARQL queries (more on that later)
Efficiency: Simple Idea 2 • Query Expansion • e.g., replace query word musician by disjunction musician:aaron_copland OR … OR musician:zarah_leander (7,593 musicians in Wikipedia) • Problem: Inefficient query processing • one intersection per element of the disjunction needed
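The cost stated on this slide can be illustrated with a minimal sketch. The toy inverted index, the document ids, and the function name `expanded_query` are assumptions made for illustration only; they are not taken from the talk.

```python
def expanded_query(index, keyword, expansions):
    """Answer `keyword AND (e1 OR e2 OR ...)` with one intersection per expansion term."""
    keyword_docs = index.get(keyword, set())
    result, intersections = set(), 0
    for term in expansions:
        # One list intersection for every element of the disjunction.
        result |= keyword_docs & index.get(term, set())
        intersections += 1
    return result, intersections


# Hypothetical toy inverted index: word -> set of document ids.
INDEX = {
    "beatles": {1, 2, 5},
    "musician:john_lennon": {1, 3},
    "musician:aaron_copland": {4},
    "musician:zarah_leander": {6},
}
MUSICIANS = [w for w in INDEX if w.startswith("musician:")]

docs, cost = expanded_query(INDEX, "beatles", MUSICIANS)
print(docs, cost)
```

With three expansion terms the sketch performs three intersections; with the 7,593 musicians mentioned on the slide, the same loop would run 7,593 times for a single query word.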
Efficiency: Simple Idea 3 • Use a database • map semantic queries to SQL queries on suitably constructed tables • that’s what the Artificial-Intelligence / Semantic-Web people usually do • Problem: Inefficient + Lack of control • building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both • very limited control regarding efficiency aspects
Efficiency: Our Approach • Two basic operations • prefix search of a special kind • join • An index data structure • which supports these two operations efficiently • Artificial words in the documents • such that a large class of semantic queries reduces to a combination of (few of) these operations
Processing the query “beatles musician” • Document “Gitanes”: “… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …” • Document “John Lennon”, by position: 0 entity:john_lennon, 1 relation:is_a, 2 class:musician, 2 class:singer, … • Two prefix queries • beatles entity:* yields entity:john_lennon, entity:1964, entity:liverpool, etc. • entity:* . relation:is_a . class:musician yields entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc. • One join of the two result lists yields entity:john_lennon, etc.
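The two prefix queries and the join from this slide can be sketched on a toy corpus. This is a minimal illustration, not ESTER's implementation: the data layout, the exact word positions, the extra “Johann Sebastian Bach” document, and the function names `prefix_cooccurrence`, `prefix_phrase`, and `join` are all assumptions.

```python
# Toy corpus: each document is a list of (position, word) pairs.
# Artificial words (entity:..., relation:..., class:...) follow the slide;
# the concrete positions are assumed for illustration.
DOCS = {
    "Gitanes": [
        (0, "legend"), (1, "says"), (2, "that"), (3, "john"), (4, "lennon"),
        (4, "entity:john_lennon"), (5, "of"), (6, "the"), (7, "beatles"),
        (8, "smoked"), (9, "gitanes"),
    ],
    "John Lennon": [
        (0, "entity:john_lennon"), (1, "relation:is_a"),
        (2, "class:musician"), (2, "class:singer"),
    ],
    "Johann Sebastian Bach": [
        (0, "entity:johann_sebastian_bach"), (1, "relation:is_a"),
        (2, "class:musician"),
    ],
}


def prefix_cooccurrence(docs, keyword, prefix):
    """Words starting with `prefix` in documents that also contain `keyword`."""
    hits = set()
    for postings in docs.values():
        words = {w for _, w in postings}
        if keyword in words:
            hits |= {w for w in words if w.startswith(prefix)}
    return hits


def prefix_phrase(docs, prefix, *following):
    """Words starting with `prefix` at position p, with following[i] at position p+1+i."""
    hits = set()
    for postings in docs.values():
        at = {}
        for pos, w in postings:
            at.setdefault(pos, set()).add(w)
        for pos, w in postings:
            if w.startswith(prefix) and all(
                f in at.get(pos + 1 + i, ()) for i, f in enumerate(following)
            ):
                hits.add(w)
    return hits


def join(left, right):
    """Keep the words that occur in both result lists."""
    return sorted(left & right)


in_beatles_docs = prefix_cooccurrence(DOCS, "beatles", "entity:")
musicians = prefix_phrase(DOCS, "entity:", "relation:is_a", "class:musician")
print(join(in_beatles_docs, musicians))
```

The sketch scans every document, which a real engine would avoid; it only shows how the combined query decomposes into two prefix searches followed by one join on the matched entity words.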
Example as before: the document “Gitanes” contains “… John Lennon entity:john_lennon of the Beatles …”, and the two prefix queries are beatles entity:* and entity:* . relation:is_a . class:musician • Problem: entity:* has a huge number of occurrences • ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences • prefix search is efficient only for up to ≈ 1% (explanation follows) • Solution: frontier classes • classes at an “appropriate” level in the hierarchy • e.g., artist, believer, worker, vegetable, animal, …
Document “Gitanes”: “… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …” • Document “John Lennon”, by position: 0 artist:john_lennon, 0 believer:john_lennon, 1 relation:is_a, 2 class:musician, … • First figure out which frontier class to use: musician → artist (easy) • Two prefix queries • beatles artist:* yields artist:john_lennon, artist:graham_greene, artist:pete_best, etc. • artist:* . relation:is_a . class:musician yields artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc. • One join of the two result lists yields artist:john_lennon, etc.
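The step of mapping a query class to its frontier class can be sketched as a walk up the class hierarchy. The tiny single-parent hierarchy, the frontier set, and the name `frontier_class` below are hypothetical; the talk does not say how ESTER stores the taxonomy, and a real taxonomy may give a class several superclasses.

```python
# Hypothetical single-parent class hierarchy: class -> direct superclass.
SUPERCLASS = {
    "singer": "musician",
    "composer": "musician",
    "musician": "artist",
    "artist": "person",
}

# Frontier classes named on the slides (subset).
FRONTIER = {"artist", "believer", "worker"}


def frontier_class(cls):
    """Return the nearest frontier class at or above `cls`, or None if there is none."""
    while cls is not None:
        if cls in FRONTIER:
            return cls
        cls = SUPERCLASS.get(cls)
    return None


query_class = "musician"
prefix = f"{frontier_class(query_class)}:*"
print(prefix)
```

Under these assumptions the query word musician is rewritten to the prefix artist:*, which matches far fewer occurrences than entity:* would.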
The HYB Index [Bast/Weber, SIGIR’06] • Maintains lists for word ranges (not words) • e.g., one list for the range abl-abt, holding the occurrences of able, ablaze, abroad, abnormal • Looks like this for person:* • one list for person:*, holding the occurrences of person:graham_greene, person:john_lennon, person:ringo_starr, person:john_lennon
Maintains lists for word ranges (not words), e.g., able, ablaze, abroad, abnormal in the range abl-abt • Provably efficient • no more space than an inverted index (on the same data) • each query = scan of a moderate number of (compressed) items • Extremely versatile • can do all kinds of things an inverted index cannot do (efficiently) • autocompletion, faceted search, query expansion, error correction, select and join, …
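The idea of keeping one list per word range can be sketched as follows. This is a toy illustration under assumptions: the class name `ToyHyb`, the range boundaries, and the uncompressed (document id, word) pairs are invented here, whereas the slides state that the actual index stores compressed items.

```python
import bisect


class ToyHyb:
    """Toy index that keeps one list of (doc_id, word) pairs per word range."""

    def __init__(self, boundaries):
        # Sorted lower bounds of the word ranges; the first must be "" so that
        # every word falls into some block.
        self.boundaries = boundaries
        self.blocks = [[] for _ in boundaries]

    def _block_of(self, word):
        return bisect.bisect_right(self.boundaries, word) - 1

    def add(self, doc_id, word):
        self.blocks[self._block_of(word)].append((doc_id, word))

    def prefix_query(self, prefix):
        # Scan only the blocks whose range can contain words with this prefix.
        first = self._block_of(prefix)
        last = self._block_of(prefix + "\uffff")
        hits = []
        for block in self.blocks[first:last + 1]:
            hits.extend((d, w) for d, w in block if w.startswith(prefix))
        return sorted(hits)


index = ToyHyb(["", "abl", "abu"])
for doc_id, word in [(1, "able"), (2, "ablaze"), (2, "abroad"),
                     (3, "abnormal"), (3, "zebra")]:
    index.add(doc_id, word)

print(index.prefix_query("abl"))
```

Because the matched word is stored next to each document id, a prefix query returns both the documents and the completions, which is what the prefix searches on artificial words such as person:* rely on.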
Queries we can handle • We prove the following theorem: • Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations • Example: SELECT ?who WHERE { ?who is_a Musician . ?who born_in_year ?when . John_Lennon born_in_year ?when } • ESTER achieves seamless integration with full-text search • SPARQL has no means for dealing with full-text search • XQuery can handle full-text search, but is not really suitable for semantic search
Experiments: Corpus, Ontology, Index • Corpus: English Wikipedia (XML dump from Nov. 2006) • ≈ 8 GB raw XML • ≈ 2.8 million documents • ≈ 1 billion words • Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07) • ≈ 2.5 million facts • derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet) • Our index • ≈ 1.5 billion words (original + artificial) • ≈ 3.3 GB total index size; ontology-only is a mere 100 MB
Experiments: Efficiency - What Baseline? • SPARQL engines • can’t do text search • and slow for ontology-only too (on Wikipedia: seconds) • XQuery engines • extremely slow for text search (on Wikipedia: minutes) • and slow for ontology-only too (on Wikipedia: seconds) • Other prototypes which do semantic + full-text search • efficiency is hardly considered • e.g., the system of Castells/Fernandez/Vallet (TKDE’07) “… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …” • our system: ≈ 100 ms on 2.8 million documents and 2.5 million facts
Experiments: Efficiency - Stress Test 1 • Compare to ontology-only system • the YAGO engine from WWW’07 • Onto Simple: when was [person] born [1000 queries] • Onto Advanced: list all people from [profession] [1000 queries] • Onto Hard: when did people die who were born in the same year as [person] [1000 queries] • Note: comparison very unfair (for our system): 4 GB index vs. 100 MB index
Experiments: Efficiency - Stress Test 2 • Compare to text-only search engine • state-of-the-art system from SIGIR’06 • Onto+Text Easy: counties in [US state] [50 queries] • Onto+Text Hard: computer scientists [nationality] [50 queries] • Full-text query: e.g., german computer scientists • Note: hardly finds relevant documents • Note: comparison extremely unfair (for our system)
Experiments: Quality — Entity Recognition • Use Wikipedia links as hints • “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …” • “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …” • Learn other links • use words in neighborhood as features • Accuracy
Experiments: Quality - Relevance • 2 Query Sets • People associated with [American university] [100 queries] • Counties of [American state] [50 queries] • Ground truth • Wikipedia has corresponding lists • e.g., List of Carnegie Mellon University People • Precision and Recall
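Precision and recall against a ground-truth list can be computed as in the following minimal sketch. The function name `precision_recall` and the example names are placeholders; the slide does not specify how the returned entities were matched against the Wikipedia lists.

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|; recall = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Placeholder data: entities returned by the engine vs. entries of the ground-truth list.
returned = ["person_a", "person_b", "person_c", "person_d"]
ground_truth = ["person_a", "person_b", "person_e"]

p, r = precision_recall(returned, ground_truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this placeholder example two of the four returned entities are on the list, so precision is 0.5, and two of the three list entries were found, so recall is 2/3.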
Conclusions • Semantic Retrieval System ESTER • fast and scalable via reduction to prefix search and join • can handle all basic SPARQL queries • seamless integration with full-text search • standard user interface with (semantic) suggestions • Lots of interesting and challenging problems • simultaneous ranking of entities and documents • proper snippet generation and highlighting • search result quality • Source: www.mpi-inf.mpg.de/~bast/slides/xxx.ppt