Hinrich Schütze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval
Outline • Recap • Dictionaries • Wildcard queries • Edit distance • Spelling correction • Soundex
Sec. 3.1 Dictionary data structures for inverted indexes • The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?
Sec. 3.1 A naïve dictionary • An array of struct: term (char[20], 20 bytes), document frequency (int, 4/8 bytes), pointer to postings list (Postings *, 4/8 bytes) • How do we store a dictionary in memory efficiently? • How do we quickly look up elements at query time?
Sec. 3.1 Dictionary data structures • Two main choices: • Hash table • Tree • Some IR systems use hashes, some trees
Sec. 3.1 Hashes • Each vocabulary term is hashed to an integer • (We assume you’ve seen hash tables before) • Pros: • Lookup is faster than for a tree: O(1) • Cons: • No easy way to find minor variants: • judgment/judgement • No prefix search [tolerant retrieval] • If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
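A minimal sketch of the hash-based choice, using a Python dict as the hash table (the terms and postings below are illustrative, not from a real collection). It shows the O(1) exact-match lookup and why prefix queries like judg* have no efficient answer here:

```python
# Hash-based dictionary: term -> (document frequency, postings list).
# A Python dict plays the role of the hash table.
dictionary = {
    "judgment":  {"df": 2, "postings": [1, 5]},
    "judgement": {"df": 1, "postings": [7]},
}

def lookup(term):
    # Exact-match lookup only, O(1) on average.
    # A prefix query like judg* would have to scan every key.
    return dictionary.get(term)
```

Note that the two spelling variants judgment/judgement hash to unrelated buckets, so neither variant helps you find the other.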
Sec. 3.1 Trees • Simplest: binary tree • More usual: B-trees • Trees require a standard ordering of characters and hence strings … but we standardly have one • Pros: • Solves the prefix problem (terms starting with hyp) • Cons: • Slower: O(log M) [and this requires balanced tree] • Rebalancing binary trees is expensive • But B-trees mitigate the rebalancing problem
Sec. 3.1 Tree: B-tree • Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. (Figure: a B-tree root whose children cover the key ranges a–hu, hy–m, and n–z.)
Outline • Recap • Dictionaries • Wildcard queries • Edit distance • Spelling correction • Soundex
Sec. 3.2 Wild-card queries: * • mon*: find all docs containing any word beginning “mon”. • Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo • *mon: find words ending in “mon”: harder • Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom ≤ w < non. Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
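The range retrieval mon ≤ w < moo can be sketched with a sorted list standing in for the B-tree lexicon (the vocabulary below is illustrative). The upper bound "moo" is formed by incrementing the last character of the prefix:

```python
import bisect

# A sorted vocabulary stands in for the B-tree lexicon.
vocab = sorted(["money", "monday", "month", "moon", "mouse"])

def prefix_range(prefix):
    # All terms w with prefix <= w < next(prefix),
    # e.g. mon <= w < moo for prefix "mon".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, upper)
    return vocab[lo:hi]
```

For example, prefix_range("mon") returns monday, money, and month but correctly excludes moon.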
Sec. 3.2 How to handle * in the middle of a query term • Examples: m*nchen, co*tion • We could look up m* and *nchen in the B-tree and intersect the two term sets • Expensive • The solution: transform wild-card queries so that the * occurs at the end • This gives rise to the permuterm index • Basic idea: store every rotation of each term in the dictionary, say, in a B-tree, and rotate each wildcard query so that the * occurs at the end
Permuterm index • For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree where $ is a special symbol
Permuterm index • For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, and o$hell • Queries • For X, look up X$ • For X*, look up $X* • For *X, look up X$* • For *X*, look up X* • For X*Y, look up Y$X* • Example: For hel*o, look up o$hel* • Permuterm index would better be called a permuterm tree. • But permuterm index is the more common name.
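The rotation scheme can be sketched as follows, with a Python dict plus a prefix scan standing in for the B-tree (the vocabulary is illustrative, and the query function assumes exactly one * in the query):

```python
def rotations(term):
    # All rotations of term$, e.g. hello -> hello$, ello$h, llo$he, ...
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# Permuterm index: rotation -> original term.
terms = ["hello", "help", "halo"]
permuterm = {}
for term in terms:
    for rot in rotations(term):
        permuterm[rot] = term

def wildcard(query):
    # Rotate a query of the form X*Y so the * is at the end: X*Y -> Y$X*.
    # X* and *Y are the special cases with empty Y or empty X.
    before, after = query.split("*")
    key = after + "$" + before
    return sorted({t for rot, t in permuterm.items() if rot.startswith(key)})
```

For hel*o the key becomes o$hel, which prefix-matches the stored rotation o$hell and recovers hello, exactly as in the slide's example.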
Processing a lookup in the permuterm index • Rotate query wildcard to the right • Use B-tree lookup as before • Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)
k-gram indexes • More space-efficient than permuterm index • Enumerate all character k-grams (sequences of k characters) occurring in a term • 2-grams are called bigrams. • Example: from April is the cruelest month we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$ • $ is a special word boundary symbol, as before. • Maintain an inverted index from bigrams to the terms that contain the bigram
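The k-gram enumeration above can be sketched directly (k defaults to 2 for bigrams; the $ boundary symbol is prepended and appended as on the slide):

```python
def kgrams(term, k=2):
    # Character k-grams of $term$, with $ as the word boundary symbol.
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]
```

For example, kgrams("april") yields $a ap pr ri il l$, matching the first run of bigrams in the example.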
k-gram (bigram, trigram, . . . ) indexes • Note that we now have two different types of inverted indexes • The term-document inverted index for finding documents based on a query consisting of terms • The k-gram index for finding terms based on a query consisting of k-grams
Processing wildcarded terms in a bigram index • Query mon* can now be run as: $m AND mo AND on • Gets us all terms with the prefix mon . . . • . . . but also many “false positives” like MOON. • We must postfilter these terms against the query. • Surviving terms are then looked up in the term-document inverted index. • k-gram index vs. permuterm index • k-gram index is more space efficient. • Permuterm index doesn’t require postfiltering. 
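The bigram-intersection plus postfilter step can be sketched as follows (the vocabulary is illustrative; the closing bigram n$ is dropped because mon* does not require the term to end after n):

```python
from collections import defaultdict

def bigrams(term):
    # Set of character bigrams of $term$.
    t = "$" + term + "$"
    return {t[i:i + 2] for i in range(len(t) - 1)}

# k-gram index: bigram -> set of terms containing it.
vocab = ["month", "monday", "moon", "demon"]
index = defaultdict(set)
for term in vocab:
    for g in bigrams(term):
        index[g].add(term)

def prefix_query(prefix):
    # mon* -> $m AND mo AND on (drop the closing gram n$),
    # then postfilter the false positives such as MOON.
    grams = bigrams(prefix) - {prefix[-1] + "$"}
    candidates = set.intersection(*(index[g] for g in grams))
    return sorted(t for t in candidates if t.startswith(prefix))
```

For mon*, the intersection of $m, mo, and on yields month, monday, and the false positive moon; the startswith postfilter removes moon.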
Sec. 3.2.2 Processing wild-card queries • As before, we must execute a Boolean query for each enumerated, filtered term. • Wild-cards can result in expensive query execution (very large disjunctions…) • pyth* AND prog* • If you encourage “laziness” people will respond! • Which web search engines allow wildcard queries? (Figure: mock search box — “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”)
Outline • Recap • Dictionaries • Wildcard queries • Edit distance • Spelling correction • Soundex
Distance between misspelled word and “correct” word • We will study several alternatives: • edit distance (Levenshtein distance) • weighted edit distance • k-gram overlap
Weighted edit distance • Like edit distance, but the weight of an operation depends on the characters involved. • Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q. • Therefore, replacing m by n is a smaller edit distance than replacing m by q.
Edit distance • The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2. • Levenshtein distance: The admissible basic operations are insert, delete, and replace • Levenshtein distance dog-do: 1 • Levenshtein distance cat-cart: 1 • Levenshtein distance cat-cut: 1 • Levenshtein distance cat-act: 2
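The standard dynamic program for Levenshtein distance can be sketched as follows; the weighted variant mentioned earlier is the same recurrence with per-character operation costs swapped in:

```python
def levenshtein(s1, s2):
    # m[i][j] = edit distance between s1[:i] and s2[:j].
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        m[i][0] = i                      # i deletions
    for j in range(len(s2) + 1):
        m[0][j] = j                      # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,        # delete
                          m[i][j - 1] + 1,        # insert
                          m[i - 1][j - 1] + cost)  # replace (or copy)
    return m[len(s1)][len(s2)]
```

This reproduces the distances on the slide (dog–do: 1, cat–cart: 1, cat–cut: 1, cat–act: 2) and gives 3 for the OSLO–SNOW exercise.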
Exercise • Compute the Levenshtein distance matrix for OSLO – SNOW • What are the Levenshtein editing operations that transform cat into catcat?
How do I read out the editing operations that transform OSLO into SNOW?