Zoekmachines (Search Engines) • Gertjan van Noord 2014 Lecture 3: tolerant retrieval

Tolerant retrieval: overview
Methods to handle imprecise queries:
• wildcard queries
• typos
• alternative spellings
Building alternative indexes
Finding the most similar terms

Sec. 3.2 Wild-card queries: *
• mon*: find docs containing any word beginning with “mon”.
• *mon: find words ending in “mon”: harder.
• mo*n: find words that start with “mo” and end with “n”.
• m*o*n: find words that start with “m”, end with “n”, and have an “o” somewhere in between.

Wildcard queries
Two steps in retrieval for wildcard queries:
• Find all terms that fall within the wildcard definition
• Find all docs containing any of these terms
Three ways to do this: B-trees, permuterm index, k-gram index

Dictionary structures
Hash: very efficient (lookup and construction), but cannot be used to find terms that are “close” to the key.
Binary tree and B-tree (and tries): data structures that keep data sorted (and balanced). Search is efficient, but construction is more costly. Words with the same prefix are close together in the sorted order → can be used for tolerant retrieval.

Sec. 3.2 Wild-card queries: *
• mon*: easy with a binary tree (or B-tree) lexicon: retrieve all terms in the range mon ≤ w < moo.
• *mon: maintain an additional B-tree for terms written backwards; retrieve all words in the range nom ≤ w < non.
• m*n, m*o*n: combine the B-tree and the reverse B-tree. Expensive!
Solution: the permuterm index
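
The range lookups above can be sketched in Python. This is only an illustration: a sorted list searched with `bisect` stands in for the B-tree, and the term list and the helper name `prefix_range` are made up for the example.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    # Upper bound: bump the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Hypothetical lexicon; the sorted list plays the role of the B-tree.
terms = sorted(["demon", "lemon", "money", "monkey", "month", "moon", "salmon"])
# Second structure: the same terms written backwards.
reversed_terms = sorted(t[::-1] for t in terms)

# mon*: range mon <= w < moo in the forward lexicon
prefix_matches = prefix_range(terms, "mon")

# *mon: range nom <= w < non in the reversed lexicon, then flip back
suffix_matches = sorted(t[::-1] for t in prefix_range(reversed_terms, "nom"))

# mo*n: intersect "starts with mo" and "ends with n"
starts = set(prefix_range(terms, "mo"))
ends = {t[::-1] for t in prefix_range(reversed_terms, "n")}
both_matches = sorted(t for t in starts & ends if len(t) >= len("mo") + len("n"))

print(prefix_matches, suffix_matches, both_matches)
```

The mo*n case shows why this gets expensive: both ranges can be large even when their intersection is small.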

Permuterm index and queries
Permuterm index:
• add an end symbol: cat$
• index all permuterms (in a structure like a B-tree): cat$ at$c t$ca $cat
Wildcard query processing:
• add $, rotate (if needed) until * is at the end
• examples: queries that can find (among others) cat: c*t c*at ca* ca*t *t *at
• permuterm form?

Sec. 3.2.1 Permuterm index
For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
Queries:
• X: lookup on X$
• X*: lookup on $X*
• *X: lookup on X$*
• *X*: lookup on X*
• X*Y: lookup on Y$X*
• X*Y*Z: ???? Exercise!
Example: query = hel*o, so X = hel, Y = o; lookup on o$hel*
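
A minimal Python sketch of the permuterm idea, assuming queries with exactly one `*`. A sorted list of (rotation, term) pairs stands in for the B-tree; the function names and the small vocabulary are invented for the example.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$', e.g. cat -> cat$, at$c, t$ca, $cat."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Sorted (rotation, original term) pairs, standing in for a B-tree."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def rotate_query(query):
    """Append '$' and rotate until the single '*' is last: hel*o -> o$hel*."""
    s = query + "$"
    star = s.index("*")  # assumption: the query contains exactly one '*'
    return s[star + 1:] + s[:star + 1]

def lookup(index, query):
    """Prefix range scan on the rotated query (without the trailing '*')."""
    prefix = rotate_query(query)[:-1]
    i = bisect_left(index, (prefix,))
    matches = set()
    while i < len(index) and index[i][0].startswith(prefix):
        matches.add(index[i][1])
        i += 1
    return sorted(matches)

index = build_permuterm(["cat", "halo", "hello", "help"])
print(rotate_query("hel*o"))   # o$hel*
print(lookup(index, "hel*o"))  # ['hello']
print(lookup(index, "c*t"))    # ['cat']
```

Queries with more than one `*` (the X*Y*Z exercise) are deliberately left out of this sketch.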

K-gram index
k-gram index (example k=3)
• to each dictionary term add a start and an end symbol: $kitten$
• from this string, list all trigrams: kitten: $ki kit itt tte ten en$
• make an inverted index of trigrams: $ki -> (kinkiten, kitchen, kitten, ...)
How can we find kitten?

An alternative: K-gram indexes
Index for dictionary lookup, not for document retrieval!
Posting lists point from k-gram to vocabulary terms
k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...)
bigram (digram), trigram, …

K-gram index and queries
Part of 3-gram inverted index:
$ki -> kinkiten kitchen kitten
en$ -> kinkiten kitchen kitten
che -> kitchen
ink -> kinkiten
itt -> kitten
kit -> kinkiten kitchen kitten
Wildcard query processing: kit*en -> $kit*en$ -> $ki AND kit AND en$
kinkiten??? It contains all three trigrams but does not match kit*en: postprocessing needed!
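
The lookup and the postprocessing step can be sketched as follows. This is an illustrative Python sketch, not the lecture's implementation: the function names are invented, and the postfilter uses a regular expression built from the wildcard query.

```python
import re
from collections import defaultdict

def kgrams(s, k=3):
    """All substrings of length k, in order."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def build_kgram_index(terms, k=3):
    """Map each k-gram of '$term$' to the set of terms containing it."""
    index = defaultdict(set)
    for t in terms:
        for g in kgrams("$" + t + "$", k):
            index[g].add(t)
    return index

def wildcard_lookup(index, query, k=3):
    """Return (candidates from the k-gram AND, candidates that really match)."""
    # Pad the query, split on '*', and collect the k-grams of each piece.
    pieces = ("$" + query + "$").split("*")
    grams = [g for p in pieces for g in kgrams(p, k)]
    if not grams:
        raise ValueError("query too short to yield any k-gram")
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Postprocessing: check each candidate against the original wildcard query.
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    matches = [t for t in candidates if pattern.fullmatch(t)]
    return sorted(candidates), sorted(matches)

index = build_kgram_index(["kinkiten", "kitchen", "kitten"])
candidates, matches = wildcard_lookup(index, "kit*en")
print(candidates)  # kinkiten survives the trigram AND ...
print(matches)     # ... but is removed by the postfilter
```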

Sec. 3.2 Query processing
• At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
• We still have to look up the postings for each enumerated term.
• E.g., consider the query: se*ate AND fil*er
• This may result in the execution of many Boolean AND queries.
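
One way to picture this step, as a rough Python sketch: each wildcard is expanded to its matching terms, the postings of those terms are merged (OR), and the two merged lists are intersected (AND). The postings dictionary is invented for the example, and the linear scan in `expand` is only a stand-in for a permuterm or k-gram lookup.

```python
import re

def expand(wildcard, dictionary):
    """Enumerate dictionary terms matching the wildcard (linear-scan stand-in)."""
    pattern = re.compile(".*".join(re.escape(p) for p in wildcard.split("*")))
    return [t for t in dictionary if pattern.fullmatch(t)]

def docs_for(wildcard, postings):
    """Union (OR) of the postings of every term matching the wildcard."""
    return set().union(*(postings[t] for t in expand(wildcard, postings)))

def wildcard_and(w1, w2, postings):
    """Documents that contain a match for w1 AND a match for w2."""
    return sorted(docs_for(w1, postings) & docs_for(w2, postings))

# Hypothetical postings: term -> set of document ids.
postings = {
    "senate": {1, 2},
    "sedate": {3},
    "seat": {5},
    "filter": {2, 3},
    "filler": {4},
}
result = wildcard_and("se*ate", "fil*er", postings)
print(result)
```

Merging the postings per wildcard first keeps this to a single AND; evaluating every pair of expanded terms separately is what leads to the many Boolean AND queries mentioned above.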

Spell correction
When? If a query word (or word combination) is quite rare or not in the dictionary at all
Approach:
• Find similar term(s)
• Calculate their similarity to the query term
• Choose the most frequent ones

Finding similar words and calculating their similarity
Use the k-gram index of words and calculate the Jaccard coefficient to find the terms most similar to the query term:
|A ∩ B| / |A ∪ B|
Relative similarity: size of the set of elements (k-grams) in common, divided by the size of the set of all elements.
SET: no duplicates!
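
A small Python sketch of the Jaccard coefficient over k-gram sets. The choice of bigrams (k=2), the `$` padding, and the candidate list are assumptions made for the example.

```python
def kgram_set(term, k=2):
    """Set of k-grams of '$term$' (a set, so duplicates count once)."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b, k=2):
    """|A ∩ B| / |A ∪ B| for the k-gram sets A and B of the two terms."""
    A, B = kgram_set(a, k), kgram_set(b, k)
    return len(A & B) / len(A | B)

# Rank hypothetical dictionary terms by similarity to a misspelled query.
query = "kiten"
dictionary = ["kitchen", "kitten", "mitten"]
ranked = sorted(dictionary, key=lambda t: jaccard(query, t), reverse=True)
print(ranked)
```

In practice the k-gram index would first narrow the dictionary down to terms sharing at least some k-grams with the query, so the coefficient is only computed for a few candidates.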

Even more precise
Then use Levenshtein distance to select more precisely the terms with the least edit distance to the query term.
Demo: http://www.miislita.com/searchito/levenshtein-edit-distance.html

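Levenshtein distance (26-01-12)
Each cell m(i, j) of the edit-distance matrix is computed from its three neighbours m(i, j-1), m(i-1, j-1) and m(i-1, j).
The last cell of the matrix gives the minimal edit distance.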
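
The matrix computation can be sketched in Python as below. It assumes the usual unit costs (1 for each insertion, deletion and substitution), which the slide does not state explicitly.

```python
def levenshtein(s, t):
    """Minimal edit distance between s and t with unit costs (assumed)."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    # First column and row: distance to or from the empty string.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(
                m[i][j - 1] + 1,                 # left neighbour: insertion
                m[i - 1][j] + 1,                 # upper neighbour: deletion
                m[i - 1][j - 1] + substitution,  # diagonal: copy or substitute
            )
    return m[-1][-1]

print(levenshtein("kitten", "sitting"))
```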

Phonetic similarity
To calculate which (English) written words are most similar in pronunciation, the SOUNDEX algorithm gives a (rather rough) measure.
Demo: http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
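
For illustration, a Python sketch of one common Soundex variant (often called American Soundex). Several variants exist, so the linked demo may differ in details such as the treatment of h and w.

```python
# Letter groups and their digits; vowels, y, h and w carry no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character code: first letter plus three digits, zero-padded."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if code and code != prev:
            result += code
        # h and w do not separate letters with the same code; vowels do.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))
```

Words that receive the same code are treated as sounding alike, which is why the measure is rough.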
