
Fast Two-Sided Error-Tolerant Search

Hannah Bast, Marjan Celikik, University of Freiburg, Germany. KEYS 2010.


Presentation Transcript


  1. Fast Two-Sided Error-Tolerant Search. Hannah Bast, Marjan Celikik, University of Freiburg, Germany. KEYS 2010

  2. Motivation
     • Handling uncertainty in text search is important
     • Query side – users make mistakes when typing the query
       • Either due to mistyping
       • Or because we do not know the correct spelling (we have incomplete knowledge about the underlying data)
     Efficient Two-Sided Error-Tolerant Search

  3. Motivation
     • Handling uncertainty in text search is important
     • Query side – users make mistakes when typing the query
       • Either due to mistyping
       • Or because we do not know the correct spelling or have incomplete knowledge about the underlying data
     • Document side – mistakes in the documents
       • Those who type the documents also make mistakes
       • OCR errors

  4. State of the Art
     • Not much work has been done on fast error-tolerant search
     • There is prior work on document-side error tolerance
     • Overall, there are only a few relevant papers in the literature
     • BASELINE: replace each query word by a disjunction of similar words
     • A lot of work has been done on approximate string matching / searching

  5. BASELINE is anything but efficient
     • Example: the query
         fast AND list AND intersction
       becomes
         fast AND list AND (intersection OR interrsection OR intersession OR intersacitionn OR intrasection OR …)
     • There can be hundreds of similar words!
     • This causes expensive merging of large lists and disk I/O overhead
     • But the current state of the art is not much faster than BASELINE …
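
The BASELINE expansion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name the similarity measure, so plain Levenshtein distance is assumed here, and the function names `edit_distance`, `similar_words` and `baseline_query` are hypothetical, not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar_words(word, vocabulary, threshold):
    """All vocabulary words within `threshold` edits of `word` (assumed similarity)."""
    return sorted(w for w in vocabulary if edit_distance(word, w) <= threshold)


def baseline_query(query_words, vocabulary, threshold):
    """Replace every query word by a disjunction of its similar words."""
    parts = []
    for word in query_words:
        # Fall back to the word itself if nothing in the vocabulary is similar.
        candidates = similar_words(word, vocabulary, threshold) or [word]
        if len(candidates) == 1:
            parts.append(candidates[0])
        else:
            parts.append("(" + " OR ".join(candidates) + ")")
    return " AND ".join(parts)
```

Every similar word adds one more posting list to merge at query time, which is why the disjunction becomes expensive once a word has hundreds of variants in the vocabulary.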

  6. Our Approach - Clustering
     • Based on clustering of the vocabulary
     • A vocabulary $V$ is the set of all words in a corpus
     • The clusters may overlap, i.e. words can belong to a few clusters
     • Definition (cover)
       • Let $q$ be a keyword, $K$ a clustering of $V$, and $S_q$ the set of all words within distance threshold $T$ of $q$. An exact cover of $S_q$ is a set of clusters from $K$ with union $S_q$. An approximate cover of $S_q$ does not necessarily contain all of $S_q$

  7. Our Approach - Clustering
     • Based on clustering of the vocabulary
     • A vocabulary $V$ is the set of all words in a corpus
     • The clusters may overlap, i.e. words can belong to a few clusters
     • Definition (cover)
       • Let $q$ be a keyword, $K$ a clustering of $V$, and $S_q$ the set of all words within distance threshold $T$ of $q$. An exact cover of $S_q$ is a set of clusters from $K$ with union $S_q$. An approximate cover of $S_q$ does not necessarily contain all of $S_q$
       • The number of sets $n$ in the cover $\{C_1, \dots, C_n\}$ is called the cover index
       • Precision of a cover is defined as $|S_q \cap (C_1 \cup \dots \cup C_n)| \, / \, |C_1 \cup \dots \cup C_n|$
       • Recall of a cover is defined as $|S_q \cap (C_1 \cup \dots \cup C_n)| \, / \, |S_q|$
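
As a small worked illustration of the cover definitions, the sketch below computes cover index, precision and recall for a cover given as a list of word sets. It assumes the usual set-based reading of precision (share of covered words that are actually similar to the keyword) and recall (share of similar words that are covered); the function name `cover_stats` and the example clusters are hypothetical.

```python
def cover_stats(similar, cover):
    """Return (cover index, precision, recall) for a cover of `similar`.

    similar: set of words similar to the keyword
    cover:   list of clusters, each a set of words
    """
    covered = set().union(*cover)   # union of all clusters in the cover
    hits = similar & covered        # similar words that the cover contains
    precision = len(hits) / len(covered) if covered else 0.0
    recall = len(hits) / len(similar) if similar else 0.0
    return len(cover), precision, recall


# Hypothetical data: four similar words and two clusters.
similar = {"algorithm", "algoritm", "algoithm", "logarithm"}
cluster_a = {"algorithm", "algoritm", "algorithmic"}
cluster_b = {"algoithm", "logarithm"}

print(cover_stats(similar, [cluster_a, cluster_b]))
print(cover_stats(similar, [cluster_a]))
```

With both clusters the cover reaches full recall, but one covered word is not in the similar set, so precision drops below 1. With only the first cluster the cover index is smaller and recall is halved, which is the kind of trade-off a clustering has to balance.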

  8. Our Approach - Clustering
     • Compute a clustering so that for each $q$ we can compute a good cover:
       • (C1) with cover index as small as possible
       • (C2) with recall as large as possible
       • (C3) with precision as large as possible
       • (C4) with frequency-weighted overlap as small as possible

  9. Using the Clustering – Indexing
     • For each occurrence of a word, determine its clusters
     • Add corresponding artificial postings to the index by prepending the cluster ids, e.g. for the word house in Doc. 7012, which is in clusters 165 and 9823:
         house          Doc. 7012
         C:165:house    Doc. 7012
         C:9823:house   Doc. 7012
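
The indexing step can be sketched as a toy in-memory inverted index. This is a minimal illustration, not the index structure used in the talk: `build_index` and the `clusters_of` mapping are hypothetical names, and documents are assumed to be already tokenized.

```python
from collections import defaultdict


def build_index(docs, clusters_of):
    """Build an inverted index with one artificial posting per (cluster, word).

    docs:        dict mapping doc id -> list of words (one entry per occurrence)
    clusters_of: dict mapping word -> iterable of cluster ids it belongs to
    """
    index = defaultdict(list)
    for doc_id, words in docs.items():
        for word in words:
            # Ordinary posting for the word itself.
            index[word].append(doc_id)
            # Artificial postings: the cluster id is prepended to the word.
            for cluster_id in clusters_of.get(word, ()):
                index[f"C:{cluster_id}:{word}"].append(doc_id)
    return index


# The example from the slide: "house" occurs in Doc. 7012, clusters 165 and 9823.
index = build_index({7012: ["house"]}, {"house": [165, 9823]})
for key in sorted(index):
    print(key, index[key])
```

Because every artificial key starts with `C:<cluster id>:`, all words of one cluster share a common prefix, so a single prefix query retrieves the postings of the whole cluster.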

  10. Using the Clustering – Query Time
     • For each $q$, compute the set $S_q$ of similar words and all affected cluster ids
     • Compute a cover with minimal cover index: for the given cover recall (and precision), there is no cover with a smaller cover index (similar to the set cover problem)
     • Transform $q$ into a disjunction of prefix queries, e.g. algoritm becomes C:59:* OR C:1017:*
     • Use efficient prefix search to process the transformed query (we use the HYB index)
     [Figure: words such as algorithm, alggorithm, algoithm, algoirthm, alggorithluq, logarithm, aglorithm, algorithmica and algorithmic, each annotated with its cluster ids; clusters 59 and 1017 are highlighted as the cover for the query algoritm.]
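
The query-time steps can be sketched as follows. Note the swap: the slide asks for a cover with minimal cover index, whereas this sketch uses the well-known greedy set cover heuristic, which gives a small cover but does not guarantee the minimum. The function names and the cluster contents in the example are hypothetical; only the cluster ids 59 and 1017 and the resulting query string come from the slide.

```python
def greedy_cover(similar, clusters):
    """Pick cluster ids greedily until every similar word is covered.

    similar:  set of words similar to the keyword
    clusters: dict mapping cluster id -> set of words in that cluster
    """
    uncovered = set(similar)
    chosen = []
    while uncovered and clusters:
        # Cluster that covers the most still-uncovered words (ties: smallest id).
        best = max(sorted(clusters), key=lambda cid: len(clusters[cid] & uncovered))
        if not clusters[best] & uncovered:
            break  # the remaining words are in no cluster: approximate cover
        chosen.append(best)
        uncovered -= clusters[best]
    return chosen


def to_prefix_query(cluster_ids):
    """Turn the chosen cluster ids into a disjunction of prefix queries."""
    return " OR ".join(f"C:{cid}:*" for cid in cluster_ids)


# Hypothetical cluster contents for the query word "algoritm".
clusters = {
    59: {"algorithm", "algoritm", "alggorithm"},
    221: {"alggorithm"},
    1017: {"algoithm", "logarithm"},
}
similar = {"algorithm", "algoritm", "alggorithm", "algoithm", "logarithm"}
print(to_prefix_query(greedy_cover(similar, clusters)))
```

The transformed query has one prefix term per chosen cluster instead of one term per similar word, so its length is bounded by the cover index rather than by the size of the similar-word set.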

  11. Computing a Clustering
     • How to compute a clustering with favorable properties (C1) – (C4)?
     • It's easy to optimize for (C1) alone, but then (C2) will suffer
     • It's easy to optimize for (C1) – (C3) alone, but then (C4) will suffer, etc.
     [Figure: overlapping clusters v, x, y and z around the word algorithm, containing misspellings such as algoirtm, algoithm, a1gor1thm, aglorithmm, algortm, algoritluq and algoritw2, with the artificial postings C:v:algorithm, C:x:algorithm, C:y:algorithm and C:z:algorithm.]

  12. Experimental results
     • Average query times
     • Average number of clusters and similar words

  13. Experimental results
     • Average cover precision and recall
     • Index sizes
