Zoekmachines (Search Engines) • Gertjan van Noord 2014 • Lecture 3: tolerant retrieval
Tolerant retrieval: overview
Methods to handle imprecise queries:
• wildcard queries
• typos
• alternative spellings
Building alternative indexes
Finding the most similar terms
Sec. 3.2 Wild-card queries: *
• mon*: find docs containing any word beginning with “mon”.
• *mon: find words ending in “mon”: harder.
• mo*n: find words that start with “mo” and end with “n”.
• m*o*n: find words that start with “m”, end with “n”, and have an “o” somewhere in between.
Wildcard queries
Two steps in retrieval for wildcard queries:
• Find all terms that match the wildcard pattern
• Find all docs containing any of these terms
Three ways to do the first step: B-trees, permuterm index, k-gram index
Dictionary structures
• Hash: very efficient (lookup and construction), but cannot be used to find terms that are “close” to the key.
• Binary tree and B-tree (and tries): data structures that keep the data sorted (and balanced). Search is efficient, but construction is more costly. Words with the same prefix are close together in the sorted order → can be used for robust retrieval.
Sec. 3.2 Wild-card queries: *
• mon*: easy with a binary tree (or B-tree) lexicon: retrieve all terms in the range mon ≤ w < moo.
• *mon: maintain an additional B-tree for terms written backwards; retrieve all words in the range nom ≤ w < non.
• m*n, m*o*n: combine the B-tree and the reverse B-tree. Expensive!
Solution: the permuterm index
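The range lookups on this slide can be sketched as follows. This is only an illustration: a sorted Python list with `bisect` stands in for the B-tree, and the function names and the toy vocabulary are assumptions, not part of the lecture.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < next(prefix), e.g. mon <= w < moo."""
    if not prefix:
        return list(sorted_terms)
    # Upper bound: bump the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

def suffix_lookup(reversed_sorted_terms, suffix):
    """*mon: prefix search for 'nom' in the lexicon of reversed terms."""
    hits = prefix_range(reversed_sorted_terms, suffix[::-1])
    return sorted(term[::-1] for term in hits)

def prefix_suffix_lookup(sorted_terms, reversed_sorted_terms, prefix, suffix):
    """mo*n: intersect both range lookups (this is the expensive part)."""
    starts = set(prefix_range(sorted_terms, prefix))
    ends = set(suffix_lookup(reversed_sorted_terms, suffix))
    # The length check rules out terms where prefix and suffix would overlap.
    return sorted(t for t in starts & ends if len(t) >= len(prefix) + len(suffix))

# Hypothetical toy vocabulary.
VOCAB = sorted(["demon", "lemon", "money", "monk", "month", "moon", "moron", "salmon"])
REVERSED = sorted(term[::-1] for term in VOCAB)
```

For example, `prefix_range(VOCAB, "mon")` covers mon*, `suffix_lookup(REVERSED, "mon")` covers *mon, and `prefix_suffix_lookup(VOCAB, REVERSED, "mo", "n")` covers mo*n.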
Permuterm index and queries
Permuterm index:
• add an end symbol: cat$
• index all permuterms (in a structure like a B-tree): cat$ at$c t$ca $cat
Wildcard query processing:
• add $, rotate (if needed) until * is at the end
• examples: queries that can find (among others) cat: c*t c*at ca* ca*t *t *at; what are their permuterm forms?
Sec. 3.2.1 Permuterm index
For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
Queries:
• X: lookup on X$
• X*: lookup on $X*
• *X: lookup on X$*
• *X*: lookup on X*
• X*Y: lookup on Y$X*
• X*Y*Z: ???? Exercise!
Example: query = hel*o, so X = hel and Y = o; lookup on o$hel*
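A minimal sketch of the permuterm idea, under some assumptions: a sorted list of (rotation, term) pairs stands in for the B-tree, only queries with at most one * are handled, and all names and the toy vocabulary are illustrative rather than taken from the lecture.

```python
import bisect

def rotations(term):
    """All rotations of term + '$', e.g. cat -> cat$, at$c, t$ca, $cat."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocab):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in vocab for rot in rotations(term))

def permuterm_lookup(query, index):
    """Look up a query with zero or one '*' in the permuterm index."""
    if query.count("*") > 1:
        raise ValueError("this sketch handles at most one '*'")
    q = query + "$"
    if "*" in q:
        star = q.index("*")
        # Rotate so that '*' ends up last, then drop it: hel*o -> o$hel
        key, exact = q[star + 1:] + q[:star], False
    else:
        key, exact = q, True  # plain term X: exact lookup on X$
    hits = set()
    i = bisect.bisect_left(index, (key,))
    while i < len(index) and index[i][0].startswith(key):
        if not exact or index[i][0] == key:
            hits.add(index[i][1])
        i += 1
    return sorted(hits)

# Hypothetical toy vocabulary.
INDEX = build_permuterm(["hello", "help", "halo", "hero", "cat"])
```

With this index, `permuterm_lookup("hel*o", INDEX)` performs the prefix search on o$hel from the example on the slide.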
K-gram index
k-gram index (example k=3):
• to each dictionary term add a start and an end symbol: $kitten$
• from this string, list all trigrams: kitten: $ki kit itt tte ten en$
• make an inverted index of trigrams: $ki -> (kinkiten, kitchen, kitten, ...)
How can we find kitten?
An alternative: K-gram indexes
• Index for dictionary lookup, not for document retrieval!
• Posting lists point from k-gram to vocabulary terms
• k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...)
• bigram (digram), trigram, …
K-gram index and queries
Part of 3-gram inverted index:
$ki -> kinkiten kitchen kitten
en$ -> kinkiten kitchen kitten
che -> kitchen
ink -> kinkiten
itt -> kitten
kit -> kinkiten kitchen kitten
Wildcard query processing: kit*en becomes $kit*en$, i.e. $ki AND kit AND en$
kinkiten??? It contains all three trigrams but does not match kit*en, so postprocessing is needed!
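The trigram lookup and the postprocessing step can be sketched like this. It is an illustration under assumptions: the function names are invented here, and the postfilter uses a regular expression built from the wildcard query, which is one possible choice and not necessarily the one intended in the lecture.

```python
import re
from collections import defaultdict

def build_kgram_index(vocab, k=3):
    """Map each k-gram of '$term$' to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        padded = "$" + term + "$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

def query_kgrams(query, k=3):
    """k-grams of the padded query, taken from each piece between '*'."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams.update(piece[i:i + k] for i in range(len(piece) - k + 1))
    return grams

def kgram_candidates(query, vocab, index, k=3):
    """Terms that contain every k-gram of the query (may include false positives)."""
    candidates = set(vocab)
    for gram in query_kgrams(query, k):
        candidates &= index.get(gram, set())
    return candidates

def wildcard_lookup(query, vocab, index, k=3):
    """Postprocessing: keep only candidates that really match the wildcard query."""
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    return sorted(t for t in kgram_candidates(query, vocab, index, k)
                  if pattern.fullmatch(t))

VOCAB = ["kinkiten", "kitchen", "kitten"]
INDEX = build_kgram_index(VOCAB)
```

For kit*en the candidate set still contains kinkiten; only the final filter in `wildcard_lookup` removes it.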
Sec. 3.2 Query processing
• At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
• We still have to look up the postings for each enumerated term.
• E.g., consider the query: se*ate AND fil*er
• This may result in the execution of many Boolean AND queries.
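One way to organise this step is to take the union of the postings for all terms matching each wildcard and then intersect the unions. The sketch below assumes that strategy; the postings, document IDs and term expansions are hypothetical toy data.

```python
def docs_for_wildcard(matching_terms, postings):
    """Union of the postings of all terms that match one wildcard."""
    docs = set()
    for term in matching_terms:
        docs |= postings.get(term, set())
    return docs

def wildcard_and(expansions, postings):
    """AND over several wildcards: intersect the per-wildcard document sets."""
    result = None
    for matching_terms in expansions:
        docs = docs_for_wildcard(matching_terms, postings)
        result = docs if result is None else result & docs
    return result if result is not None else set()

# Hypothetical postings (term -> document IDs) and term expansions.
POSTINGS = {"senate": {1, 2}, "sedate": {3}, "filter": {2, 3}, "filler": {4}}
SE_ATE = ["sedate", "senate"]   # terms assumed to match se*ate
FIL_ER = ["filler", "filter"]   # terms assumed to match fil*er
```

`wildcard_and([SE_ATE, FIL_ER], POSTINGS)` then answers se*ate AND fil*er on the toy data. Grouping by wildcard first avoids running a separate AND query for every pair of expanded terms.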
Spell correction
When? If a query word (or word combination) is quite rare or not in the dictionary at all
Approach:
• Find similar term(s)
• Calculate their similarity to the query term
• Choose the most frequent ones
Finding similar words and calculating their similarity
Use the k-gram index of words and calculate the Jaccard coefficient to find the terms most similar to the query term:
|A ∩ B| / |A ∪ B|
• relative similarity: size of the set of elements (k-grams) in common, divided by the size of the set of all elements
• SET: no duplicates!
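A small sketch of the Jaccard coefficient over k-gram sets. Padding each term with $ at both ends and using k=3 are assumptions carried over from the k-gram index slides; the function names are illustrative.

```python
def kgrams(term, k=3):
    """Set of k-grams of '$term$' (a set, so duplicates are dropped)."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b, k=3):
    """|A ∩ B| / |A ∪ B| over the k-gram sets of the two terms."""
    grams_a, grams_b = kgrams(a, k), kgrams(b, k)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

For kitten and kitchen the sets share $ki, kit and en$ out of 10 distinct trigrams, so the coefficient is 0.3.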
Even more precise
Then use the Levenshtein distance to select, more precisely, the terms with the smallest edit distance to the query term
demo: http://www.miislita.com/searchito/levenshtein-edit-distance.html
26-01-12
Levenshtein distance
[Figure: edit-distance matrix with the neighbouring cells m(i, j-1), m(i-1, j-1) and m(i-1, j)]
Minimal edit distance
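The matrix on this slide corresponds to the usual dynamic-programming computation, sketched below. It assumes unit cost for insertion, deletion and substitution; the function name is illustrative.

```python
def levenshtein(a, b):
    """Minimal edit distance between strings a and b (unit costs assumed)."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i          # delete all i characters of a
    for j in range(cols):
        m[0][j] = j          # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j] + 1,                 # deletion
                m[i][j - 1] + 1,                 # insertion
                m[i - 1][j - 1] + substitution,  # match or substitution
            )
    return m[rows - 1][cols - 1]
```

Each cell m(i, j) depends only on the three neighbours m(i-1, j), m(i, j-1) and m(i-1, j-1), and the bottom-right cell holds the minimal edit distance.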
Phonetic similarity
To calculate which written (English) words are most similar in pronunciation, the SOUNDEX algorithm gives a (rather rough) measure.
Demo: http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
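A compact sketch of Soundex, assuming the common American variant (first letter kept, three digits, h and w ignored, vowels separating repeated codes). Other variants exist, and the linked demo may differ in details.

```python
# Letter groups and their Soundex digits (American Soundex assumed).
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character Soundex code: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue             # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code          # a vowel resets the previous code
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters
```

Words that get the same code, such as Robert and Rupert, are treated as sounding alike, which illustrates how rough the measure is.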