Information Retrieval: Tolerant Retrieval
Wild Card Queries
1. Trailing wild-card queries, e.g. fan* (matches fanatic, fancy, fantasy, etc.)
- Use a B-tree data structure on the dictionary
- Walk down the tree following f, a, n
- Retrieve all words w such that fan ≤ w < fao (i.e. all the words having the prefix "fan"); let the set of these terms be W
- Use the inverted index to retrieve documents containing the terms in W
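As a rough sketch of this prefix lookup, a sorted list with binary search stands in for the B-tree here; the lexicon is invented for illustration:

import bisect

def prefix_range(lexicon, prefix):
    # All terms w with prefix <= w < successor(prefix), e.g. fan <= w < fao
    lo = bisect.bisect_left(lexicon, prefix)
    hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return lexicon[lo:hi]

lexicon = sorted(["fan", "fanatic", "fancy", "fantasy", "farm", "fog"])
print(prefix_range(lexicon, "fan"))   # ['fan', 'fanatic', 'fancy', 'fantasy']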
Wild-card queries: *
• *tic: find words ending in "tic": harder
• Maintain an additional B-tree on the terms written backwards; we can then retrieve all words in the range cit ≤ w < ciu
• Once we have all dictionary terms that match the wild-card query, we look up the postings for each enumerated term to perform retrieval
General cases (single *)
• Consider the query se*tic (e.g. semantic, semiotic, semitic, etc.)
• Use the B-tree to get the set of terms W having the prefix "se"
• Use the reverse B-tree to get the set of terms R with the suffix "tic"
• Compute S = W ∩ R (words with prefix se and suffix tic)
• Use the inverted index to retrieve documents containing the terms in S
• B-trees handle *'s at the end of a query term
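A minimal sketch of both lookups combined, again with sorted lists standing in for the two B-trees (the lexicon is a made-up example):

import bisect

def prefix_matches(sorted_terms, prefix):
    # Terms in [prefix, successor-of-prefix), as in the B-tree walk above
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return set(sorted_terms[lo:hi])

lexicon = ["semantic", "semiotic", "semitic", "section", "serve"]
forward = sorted(lexicon)
backward = sorted(t[::-1] for t in lexicon)             # reverse B-tree stand-in

W = prefix_matches(forward, "se")                       # prefix se
R = {t[::-1] for t in prefix_matches(backward, "cit")}  # suffix tic
print(sorted(W & R))   # ['semantic', 'semiotic', 'semitic']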
General wild card queries
• How can we handle *'s in the middle of a query term? (especially multiple *'s)
• Two techniques; both make use of a specially constructed index
• Solution 1: transform every wild-card query so that the *'s occur at the end; this gives rise to the Permuterm index
• Solution 2: k-gram indexes
General Steps
• The wild-card query w is expressed as a Boolean query on the specially constructed index, and a superset of the set of dictionary terms matching w is obtained
• A post-filtering step is used to discard dictionary terms that do not match w
• The standard inverted index is then used to retrieve documents
Sol 1: Permuterm index
• In a permuterm index, the dictionary consists of all rotations (with $ marking the end) of each term, and the postings of each rotation consist of all dictionary terms containing that rotation
• For the term hello, index under:
• hello$, ello$h, llo$he, lo$hel, o$hell
• In the B-tree, all rotations of a term point to the original lexicon term
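Generating the rotations is a one-liner; note this also yields $hello, the rotation starting at the boundary symbol:

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']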
Permuterm query processing
• Rotate the query wild-card to the right
• Now use B-tree lookup as before
• Permuterm problem: ≈ quadruples lexicon size
• Query = hel*o: append $ to get hel*o$; after rotation: o$hel*
• Now traverse the B-tree seeking terms with the prefix o$hel
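Putting the last two slides together, a toy end-to-end sketch (a sorted key list stands in for the B-tree; the lexicon is invented):

import bisect
from collections import defaultdict

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# Build: every rotation points back to the original lexicon term
lexicon = ["hello", "halo", "help"]
permuterm = defaultdict(set)
for term in lexicon:
    for rot in rotations(term):
        permuterm[rot].add(term)
keys = sorted(permuterm)

# Query hel*o: append $, rotate so the * is at the end, drop the *
prefix = "o$hel"                       # from hel*o$ rotated to o$hel*
lo = bisect.bisect_left(keys, prefix)
hi = bisect.bisect_left(keys, prefix[:-1] + chr(ord(prefix[-1]) + 1))
matches = set().union(*(permuterm[k] for k in keys[lo:hi]))
print(matches)                         # {'hello'}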
Query se*m*tic
- First, collapse the query to se*tic and find candidate terms in the permuterm index using the rotation tic$se*
- Next, filter out those terms from the candidate set that do not contain m, by exhaustively checking each candidate against the original query
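A sketch of the post-filtering step, using Python's fnmatch for the wildcard check (the candidate set is invented for illustration):

from fnmatch import fnmatchcase

candidates = ["semantic", "semitic", "semiotic", "septic"]  # from tic$se* lookup
survivors = [t for t in candidates if fnmatchcase(t, "se*m*tic")]
print(survivors)   # ['semantic', 'semitic', 'semiotic']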
Sol 2: Bigram indexes
• Enumerate all k-grams (sequences of k chars) occurring in any term
• $ is a special word boundary symbol
• e.g., from the text "bigram index" we get the 2-grams (bigrams): $b, bi, ig, gr, ra, am, m$, $i, in, nd, de, ex, x$
• Maintain an "inverted" index from bigrams to the dictionary terms containing each bigram
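A sketch of the k-gram enumeration (k = 2 here):

def kgrams(term, k=2):
    # Pad with the boundary symbol $ on both sides, then slide a window
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("bigram"))  # ['$b', 'bi', 'ig', 'gr', 'ra', 'am', 'm$']
print(kgrams("index"))   # ['$i', 'in', 'nd', 'de', 'ex', 'x$']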
Bigram index example
$b → bag, big, bigram
gr → grass, group, bigram
• A k-gram index is an index in which the dictionary consists of all k-grams that occur in any word in the lexicon
• Each postings list points from the k-gram to all lexicon words containing that k-gram
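Building that index with the kgrams helper from the previous slide (restated here so the sketch runs standalone):

from collections import defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

lexicon = ["bag", "big", "bigram", "grass", "group"]
kgram_index = defaultdict(set)
for term in lexicon:
    for gram in kgrams(term):
        kgram_index[gram].add(term)

print(sorted(kgram_index["$b"]))  # ['bag', 'big', 'bigram']
print(sorted(kgram_index["gr"]))  # ['bigram', 'grass', 'group']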
Processing n-gram wild-cards
• The query pri* can be run as the Boolean query: $p AND pr AND ri
• Fast, space efficient
• Gets terms that match the AND version of our wildcard query
• Matches the words prince, pride, prior, price, priest
• But it also matches proprietary, which contains the three bigrams $p, pr, ri; we must post-filter these terms against the query
• Surviving enumerated terms are then looked up in the term-document inverted index
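A sketch of the lookup and post-filter for pri* (lexicon invented; kgrams as before):

from fnmatch import fnmatchcase
from collections import defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

lexicon = ["prince", "pride", "prior", "price", "priest", "proprietary"]
kgram_index = defaultdict(set)
for term in lexicon:
    for gram in kgrams(term):
        kgram_index[gram].add(term)

# pri* => $p AND pr AND ri (intersect the postings lists)
candidates = kgram_index["$p"] & kgram_index["pr"] & kgram_index["ri"]
print(sorted(candidates))        # includes the false positive 'proprietary'

# Post-filter against the original wildcard query
matches = [t for t in sorted(candidates) if fnmatchcase(t, "pri*")]
print(matches)                   # ['price', 'pride', 'priest', 'prince', 'prior']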
Processing wild-card queries
• As before, we must execute a Boolean query for each enumerated, filtered term
• Wild-cards can result in expensive query execution; that is why search engines usually hide these features behind an "Advanced search" button
Spell correction
• Two principal uses
• Correcting document(s) being indexed
• Retrieving matching documents when the query contains a spelling error
• Two main flavors:
• Isolated word: check each word on its own for misspelling; will not catch typos that result in correctly spelled words, e.g. form typed for from
• Context-sensitive: look at surrounding words, e.g., I flew form …
Document correction
• Primarily for OCR'ed documents
• Correction algorithms are tuned for this
• Goal: the index (dictionary) contains fewer OCR-induced misspellings
• Can use domain-specific knowledge
• E.g., OCR can confuse O and D more often than it would confuse O and I (O and I are adjacent on the QWERTY keyboard, so they are more likely to be interchanged in typing); likewise e and c, r and n, etc.
Query mis-spellings • We can either • Retrieve documents indexed by the correct spelling, OR • Return several suggested alternative queries with the correct spelling • Did you mean … ?
Isolated word correction • Makes use of a lexicon from which the correct spellings come • Two basic choices for this • A standard lexicon such as • Webster’s English Dictionary • An “industry-specific” lexicon – hand-maintained • The lexicon of the indexed corpus • E.g., all words on the web • All names, acronyms etc. • (Including the mis-spellings)
Isolated word correction • Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q • What’s “closest”? • Edit distance • Weighted edit distance • n-gram overlap
Edit distance • Given two strings S1 and S2, the minimum number of basic operations to transform one to the other • Basic operations are typically character-level • Insert • Delete • Replace • E.g., the edit distance from cat to dog is 3. • Generally found by dynamic programming.
Edit distance
• Also called "Levenshtein distance"
• The following alignment between tutor and tumour has an edit distance of 2:
t u t o - r
t u m o u r
• Another possible alignment has edit distance 3:
t u t - o - r
t u - m o u r
• The best possible alignment corresponds to the minimum edit distance
Weighted edit distance
• As above, but the weight of an operation depends on the character(s) involved
• Meant to capture keyboard errors, e.g. m is more likely to be mis-typed as n than as q
• Therefore, replacing m by n contributes a smaller edit distance than replacing m by q
• (The same ideas are usable for OCR, but with different weights)
• Requires a weight matrix as input
Minimum edit distance
• Dynamic programming algorithms can be quite useful for finding the minimum edit distance between two sequences
• Implemented by creating an edit distance matrix
• This matrix has one row for each symbol in the source string and one column for each symbol in the target string (plus a row and column for the empty prefix)
Minimum edit distance matrix
• The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j characters of the target string
• The value in each cell is computed in terms of the three cells from which we can reach it:
dist[i, j] = min( dist[i-1, j] + 1 (deletion),
                  dist[i, j-1] + 1 (insertion),
                  dist[i-1, j-1] + subst (substitution) )
• The substitution cost is 0 if the ith character of the source matches the jth character of the target, and 1 otherwise
Input: two strings, x and y. Output: the minimum edit distance between x and y.

def min_edit_distance(x, y):
    m, n = len(x), len(y)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all of x[:i]
    for j in range(n + 1):
        dist[0][j] = j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + subst)  # substitution
    return dist[m][n]
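For example, min_edit_distance("tutor", "tumour") returns 2, matching the bottom-right cell of the matrix on the next slide.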
Edit distance matrix (source tutor, target tumour):

    #  t  u  m  o  u  r
#   0  1  2  3  4  5  6
t   1  0  1  2  3  4  5
u   2  1  0  1  2  3  4
t   3  2  1  1  2  3  4
o   4  3  2  2  1  2  3
r   5  4  3  3  2  2  2
Using edit distances
• Given a query, first enumerate all dictionary terms within a preset (weighted) edit distance
- To reduce search complexity, heuristics are used:
* consider only dictionary terms beginning with the same letter
* use the permuterm index, omitting the end-of-string symbol
* omit a suffix of length l before performing the rotation
• Then look up the enumerated dictionary terms in the term-document inverted index
Edit distance to all dictionary terms? • Given a (mis-spelled) query – do we compute its edit distance to every dictionary term? • Expensive and slow • How do we cut the set of candidate dictionary terms? • We can use n-gram overlap for this
n-gram overlap • Enumerate all the n-grams in the query string as well as in the lexicon • Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams • Threshold by number of matching n-grams
Example with trigrams • Suppose the text is november • Trigrams are nov, ove, vem, emb, mbe, ber. • The query is december • Trigrams are dec, ece, cem, emb, mbe, ber. • So 3 trigrams overlap (of 6 in each term) • How can we turn this into a normalized measure of overlap?
One option – Jaccard coefficient
• A commonly-used measure of overlap
• Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and zero when they are disjoint
• X and Y don't have to be of the same size
• Always assigns a number between 0 and 1
• Now threshold to decide if you have a match
• E.g., if J.C. > 0.8, declare a match
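A quick sketch applying this to the trigram sets from the november/december example:

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

nov = {"nov", "ove", "vem", "emb", "mbe", "ber"}
dec = {"dec", "ece", "cem", "emb", "mbe", "ber"}
print(jaccard(nov, dec))   # 3 / 9 = 0.333..., below a 0.8 threshold: no match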
Matching bigrams
• Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
lo → alone, lord, sloth
or → border, lord, morbid
rd → border, card, ardent
• A standard postings "merge" will enumerate the terms meeting the threshold …
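A sketch of that threshold merge, counting how many query bigrams each postings term appears under:

from collections import Counter

postings = {
    "lo": ["alone", "lord", "sloth"],
    "or": ["border", "lord", "morbid"],
    "rd": ["border", "card", "ardent"],
}
counts = Counter(term for plist in postings.values() for term in plist)
matches = [t for t, c in counts.items() if c >= 2]
print(matches)   # ['lord', 'border']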
Context-sensitive spell correction
• Text: I flew from Heathrow to Narita.
• Consider the phrase query "flew form Heathrow"
• Since no docs matched the query phrase, we'd like to respond: Did you mean "flew from Heathrow"?
Context-sensitive correction
• Need surrounding context to catch this
• Full NLP is too heavyweight for this
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Then try all possible resulting phrases with one word "fixed" at a time:
• flew from heathrow
• fled form heathrow
• flea form heathrow
• etc.
• Suggest the alternative that has lots of hits?
Exercise
• Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form and 3 for heathrow.
• How many corrected phrases will we enumerate in this scheme?
Another approach • Break phrase query into a conjunction of biwords • Look for biwords that need only one term corrected. • Enumerate phrase matches and … rank them!
General issue in spell correction • Will enumerate multiple alternatives for “Did you mean” • Need to figure out which one (or small number) to present to the user
Computational cost • Spell-correction is computationally expensive • Avoid running routinely on every query? • Run only on queries that matched few docs
Soundex
• Class of heuristics to expand a query into phonetic equivalents
• Language specific – mainly for names
• E.g., chebyshev → tchebycheff
Soundex – typical algorithm • Turn every token to be indexed into a 4-character reduced form • Do the same with query terms • Build and search an index on the reduced forms • (when the query calls for a soundex match)
Soundex
• Keep the first letter
• Code the rest into digits as shown in the table below
• Collapse runs of letters with the same Soundex digit into one
• Eliminate all zeros
• Truncate or pad with zeros to give one initial letter and three digits

Soundex phonetic codes:
0: A, E, I, O, U, H, W, Y
1: B, F, P, V
2: C, G, J, K, Q, S, X, Z
3: D, T
4: L
5: M, N
6: R
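A sketch implementing exactly the steps above (note: fuller Soundex variants also reconcile the first letter with its own code; this follows just the slide's recipe):

def soundex(name):
    codes = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.upper()
    digits = [codes[ch] for ch in name[1:] if ch in codes]
    collapsed = []
    for d in digits:                      # collapse runs of identical digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    body = "".join(d for d in collapsed if d != "0")   # eliminate zeros
    return (name[0] + body + "000")[:4]                # pad/truncate to 4 chars

print(soundex("Dickson"), soundex("Dixon"))    # D250 D250 (match)
print(soundex("Karlson"), soundex("Carlson"))  # K642 C642 (no match)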
Soundex
• The coding scheme is based on observations like:
- vowels are interchangeable
- consonants with similar sounds are put in the same equivalence class
• Example: Dickson, Dikson and Dixon all map to the same code (D250)
• Developed by Odell and Russell in 1918 and used in the US census to match American English names
• Soundex fails on two names with different initials (e.g. Karlson vs. Carlson)
• It also fails in other cases (e.g. Rodgers vs. Rogers, which receive different codes)