Information Retrieval: Tolerant Retrieval
Wild Card Queries
1. Trailing wild-card queries, e.g. fan* (matches fanatic, fancy, fantasy, etc.)
- Use a B-tree data structure on the dictionary
- Walk down the tree following f, a, n
- Retrieve all words w such that fan ≤ w < fao (i.e. all the words having the prefix "fan"); let the set of these terms be W
- Use the inverted index to retrieve documents containing the terms in W
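As a rough sketch of this prefix lookup, a sorted list with binary search stands in for the B-tree here; the lexicon is invented for illustration:

import bisect

def prefix_range(lexicon, prefix):
    # All terms w with prefix <= w < successor(prefix), e.g. fan <= w < fao
    lo = bisect.bisect_left(lexicon, prefix)
    hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return lexicon[lo:hi]

lexicon = sorted(["fan", "fanatic", "fancy", "fantasy", "farm", "fog"])
print(prefix_range(lexicon, "fan"))   # ['fan', 'fanatic', 'fancy', 'fantasy']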
Wild-card queries: *
• *tic: find words ending in "tic": harder
• Maintain an additional B-tree on the terms written backwards; we can then retrieve all words in the range cit ≤ w < ciu
• Once we have all dictionary terms that match the wild-card query, we look up the postings for each enumerated term to perform retrieval
General cases (single *)
• Consider the query se*tic (e.g. semantic, semiotic, semitic, etc.)
• Use the B-tree to get the set of terms W having the prefix "se"
• Use the reverse B-tree to get the set of terms R with the suffix "tic"
• Compute S = W ∩ R (words with prefix se and suffix tic)
• Use the inverted index to retrieve documents containing the terms in S
• B-trees handle *'s at the end of a query term
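A minimal sketch of both lookups combined, again with sorted lists standing in for the two B-trees (the lexicon is a made-up example):

import bisect

def prefix_matches(sorted_terms, prefix):
    # Terms in [prefix, successor-of-prefix), as in the B-tree walk above
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return set(sorted_terms[lo:hi])

lexicon = ["semantic", "semiotic", "semitic", "section", "serve"]
forward = sorted(lexicon)
backward = sorted(t[::-1] for t in lexicon)             # reverse B-tree stand-in

W = prefix_matches(forward, "se")                       # prefix se
R = {t[::-1] for t in prefix_matches(backward, "cit")}  # suffix tic
print(sorted(W & R))   # ['semantic', 'semiotic', 'semitic']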
General wild card queries
• How can we handle *'s in the middle of a query term? (especially multiple *'s)
• Two techniques; both make use of a specially constructed index
• Solution 1: transform every wild-card query so that the *'s occur at the end; this gives rise to the Permuterm index
• Solution 2: k-gram indexes
General Steps
• The wild-card query w is expressed as a Boolean query on the specially constructed index, and a superset of the set of dictionary terms matching w is obtained
• A post-filtering step is used to discard dictionary terms that do not match w
• The standard inverted index is then used to retrieve documents
Sol 1: Permuterm index
• In a permuterm index, the dictionary consists of all rotations (with $ marking the end) of each term, and the postings of each rotation consist of all dictionary terms containing that rotation
• For the term hello, index under:
• hello$, ello$h, llo$he, lo$hel, o$hell
• In the B-tree, all rotations of a term point to the original lexicon term
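Generating the rotations is a one-liner; note this also yields $hello, the rotation starting at the boundary symbol:

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']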
Permuterm query processing
• Rotate the query wild-card to the right
• Now use B-tree lookup as before
• Permuterm problem: ≈ quadruples lexicon size
• Query = hel*o: append $ to get hel*o$; after rotation: o$hel*
• Now traverse the B-tree seeking terms with the prefix o$hel
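Putting the last two slides together, a toy end-to-end sketch (a sorted key list stands in for the B-tree; the lexicon is invented):

import bisect
from collections import defaultdict

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# Build: every rotation points back to the original lexicon term
lexicon = ["hello", "halo", "help"]
permuterm = defaultdict(set)
for term in lexicon:
    for rot in rotations(term):
        permuterm[rot].add(term)
keys = sorted(permuterm)

# Query hel*o: append $, rotate so the * is at the end, drop the *
prefix = "o$hel"                       # from hel*o$ rotated to o$hel*
lo = bisect.bisect_left(keys, prefix)
hi = bisect.bisect_left(keys, prefix[:-1] + chr(ord(prefix[-1]) + 1))
matches = set().union(*(permuterm[k] for k in keys[lo:hi]))
print(matches)                         # {'hello'}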
Query se*m*tic
- First, collapse the query to se*tic and find candidate terms in the permuterm index using the rotation tic$se*
- Next, filter out those terms from the candidate set that do not contain m, by exhaustively checking each candidate against the original query
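A sketch of the post-filtering step, using Python's fnmatch for the wildcard check (the candidate set is invented for illustration):

from fnmatch import fnmatchcase

candidates = ["semantic", "semitic", "semiotic", "septic"]  # from tic$se* lookup
survivors = [t for t in candidates if fnmatchcase(t, "se*m*tic")]
print(survivors)   # ['semantic', 'semitic', 'semiotic']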
Sol 2: Bigram indexes
• Enumerate all k-grams (sequences of k chars) occurring in any term
• $ is a special word boundary symbol
• e.g., from the text "bigram index" we get the 2-grams (bigrams): $b, bi, ig, gr, ra, am, m$, $i, in, nd, de, ex, x$
• Maintain an "inverted" index from bigrams to the dictionary terms containing each bigram
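A sketch of the k-gram enumeration (k = 2 here):

def kgrams(term, k=2):
    # Pad with the boundary symbol $ on both sides, then slide a window
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("bigram"))  # ['$b', 'bi', 'ig', 'gr', 'ra', 'am', 'm$']
print(kgrams("index"))   # ['$i', 'in', 'nd', 'de', 'ex', 'x$']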
Bigram index example
$b → bag, big, bigram
gr → grass, group, bigram
• A k-gram index is an index in which the dictionary consists of all k-grams that occur in any word in the lexicon
• Each postings list points from the k-gram to all lexicon words containing that k-gram
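Building that index with the kgrams helper from the previous slide (restated here so the sketch runs standalone):

from collections import defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

lexicon = ["bag", "big", "bigram", "grass", "group"]
kgram_index = defaultdict(set)
for term in lexicon:
    for gram in kgrams(term):
        kgram_index[gram].add(term)

print(sorted(kgram_index["$b"]))  # ['bag', 'big', 'bigram']
print(sorted(kgram_index["gr"]))  # ['bigram', 'grass', 'group']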
Processing n-gram wild-cards
• The query pri* can be run as the Boolean query: $p AND pr AND ri
• Fast, space efficient
• Gets terms that match the AND version of our wildcard query
• Matches the words prince, pride, prior, price, priest
• But it also matches proprietary, which contains the three bigrams $p, pr, ri; we must post-filter these terms against the query
• Surviving enumerated terms are then looked up in the term-document inverted index
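A sketch of the lookup and post-filter for pri* (lexicon invented; kgrams as before):

from fnmatch import fnmatchcase
from collections import defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

lexicon = ["prince", "pride", "prior", "price", "priest", "proprietary"]
kgram_index = defaultdict(set)
for term in lexicon:
    for gram in kgrams(term):
        kgram_index[gram].add(term)

# pri* => $p AND pr AND ri (intersect the postings lists)
candidates = kgram_index["$p"] & kgram_index["pr"] & kgram_index["ri"]
print(sorted(candidates))        # includes the false positive 'proprietary'

# Post-filter against the original wildcard query
matches = [t for t in sorted(candidates) if fnmatchcase(t, "pri*")]
print(matches)                   # ['price', 'pride', 'priest', 'prince', 'prior']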
Processing wild-card queries
• As before, we must execute a Boolean query for each enumerated, filtered term
• Wild-cards can result in expensive query execution; that is why search engines usually hide these features behind an "Advanced search" button
Spell correction
• Two principal uses
• Correcting document(s) being indexed
• Retrieving matching documents when the query contains a spelling error
• Two main flavors:
• Isolated word: check each word on its own for misspelling; will not catch typos that result in correctly spelled words, e.g. form typed for from
• Context-sensitive: look at surrounding words, e.g., I flew form …
Document correction
• Primarily for OCR'ed documents
• Correction algorithms are tuned for this
• Goal: the index (dictionary) contains fewer OCR-induced misspellings
• Can use domain-specific knowledge
• E.g., OCR can confuse O and D more often than it would confuse O and I (O and I are adjacent on the QWERTY keyboard, so they are more likely to be interchanged in typing); likewise e and c, r and n, etc.
Query mis-spellings • We can either • Retrieve documents indexed by the correct spelling, OR • Return several suggested alternative queries with the correct spelling • Did you mean … ?
Isolated word correction • Makes use of a lexicon from which the correct spellings come • Two basic choices for this • A standard lexicon such as • Webster’s English Dictionary • An “industry-specific” lexicon – hand-maintained • The lexicon of the indexed corpus • E.g., all words on the web • All names, acronyms etc. • (Including the mis-spellings)
Isolated word correction • Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q • What’s “closest”? • Edit distance • Weighted edit distance • n-gram overlap
Edit distance • Given two strings S1 and S2, the minimum number of basic operations to transform one to the other • Basic operations are typically character-level • Insert • Delete • Replace • E.g., the edit distance from cat to dog is 3. • Generally found by dynamic programming.
Edit distance
• Also called "Levenshtein distance"
• The following alignment between tutor and tumour has an edit distance of 2:
t u t o - r
t u m o u r
• Another possible alignment has edit distance 3:
t u t - o - r
t u - m o u r
• The best possible alignment corresponds to the minimum edit distance
Weighted edit distance
• As above, but the weight of an operation depends on the character(s) involved
• Meant to capture keyboard errors, e.g. m is more likely to be mis-typed as n than as q
• Therefore, replacing m by n contributes a smaller edit distance than replacing m by q
• (The same ideas are usable for OCR, but with different weights)
• Requires a weight matrix as input
Minimum edit distance
• Dynamic programming algorithms can be quite useful for finding the minimum edit distance between two sequences
• Implemented by creating an edit distance matrix
• This matrix has one row for each symbol in the source string and one column for each symbol in the target string (plus a row and column for the empty prefix)
Minimum edit distance matrix
• The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j characters of the target string
• The value in each cell is computed in terms of the three cells from which we can reach it:
dist[i, j] = min( dist[i-1, j] + 1 (deletion),
                  dist[i, j-1] + 1 (insertion),
                  dist[i-1, j-1] + subst (substitution) )
• The substitution cost is 0 if the ith character of the source matches the jth character of the target, and 1 otherwise
Input: two strings, x and y. Output: the minimum edit distance between x and y.

def min_edit_distance(x, y):
    m, n = len(x), len(y)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all of x[:i]
    for j in range(n + 1):
        dist[0][j] = j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + subst)  # substitution
    return dist[m][n]
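For example, min_edit_distance("tutor", "tumour") returns 2, matching the bottom-right cell of the matrix on the next slide.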
Edit distance matrix (source tutor, target tumour):

    #  t  u  m  o  u  r
#   0  1  2  3  4  5  6
t   1  0  1  2  3  4  5
u   2  1  0  1  2  3  4
t   3  2  1  1  2  3  4
o   4  3  2  2  1  2  3
r   5  4  3  3  2  2  2
Using edit distances
• Given a query, first enumerate all dictionary terms within a preset (weighted) edit distance
- To reduce search complexity, heuristics are used:
* consider only dictionary terms beginning with the same letter
* use the permuterm index, omitting the end-of-string symbol
* omit a suffix of length l before performing the rotation
• Then look up the enumerated dictionary terms in the term-document inverted index
Edit distance to all dictionary terms? • Given a (mis-spelled) query – do we compute its edit distance to every dictionary term? • Expensive and slow • How do we cut the set of candidate dictionary terms? • We can use n-gram overlap for this
n-gram overlap • Enumerate all the n-grams in the query string as well as in the lexicon • Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams • Threshold by number of matching n-grams
Example with trigrams • Suppose the text is november • Trigrams are nov, ove, vem, emb, mbe, ber. • The query is december • Trigrams are dec, ece, cem, emb, mbe, ber. • So 3 trigrams overlap (of 6 in each term) • How can we turn this into a normalized measure of overlap?
One option – Jaccard coefficient
• A commonly-used measure of overlap
• Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and zero when they are disjoint
• X and Y don't have to be of the same size
• Always assigns a number between 0 and 1
• Now threshold to decide if you have a match
• E.g., if J.C. > 0.8, declare a match
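A quick sketch applying this to the trigram sets from the november/december example:

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

nov = {"nov", "ove", "vem", "emb", "mbe", "ber"}
dec = {"dec", "ece", "cem", "emb", "mbe", "ber"}
print(jaccard(nov, dec))   # 3 / 9 = 0.333..., below a 0.8 threshold: no match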
Matching bigrams
• Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
lo → alone, lord, sloth
or → border, lord, morbid
rd → border, card, ardent
• A standard postings "merge" will enumerate the terms meeting the threshold …
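A sketch of that threshold merge, counting how many query bigrams each postings term appears under:

from collections import Counter

postings = {
    "lo": ["alone", "lord", "sloth"],
    "or": ["border", "lord", "morbid"],
    "rd": ["border", "card", "ardent"],
}
counts = Counter(term for plist in postings.values() for term in plist)
matches = [t for t, c in counts.items() if c >= 2]
print(matches)   # ['lord', 'border']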
Context-sensitive spell correction
• Text: I flew from Heathrow to Narita.
• Consider the phrase query "flew form Heathrow"
• Since no docs matched the query phrase, we'd like to respond: Did you mean "flew from Heathrow"?
Context-sensitive correction
• Need surrounding context to catch this
• Full NLP is too heavyweight for this
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Then try all possible resulting phrases with one word "fixed" at a time:
• flew from heathrow
• fled form heathrow
• flea form heathrow
• etc.
• Suggest the alternative that has lots of hits?
Exercise
• Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form and 3 for heathrow.
• How many corrected phrases will we enumerate in this scheme?
Another approach • Break phrase query into a conjunction of biwords • Look for biwords that need only one term corrected. • Enumerate phrase matches and … rank them!
General issue in spell correction • Will enumerate multiple alternatives for “Did you mean” • Need to figure out which one (or small number) to present to the user
Computational cost • Spell-correction is computationally expensive • Avoid running routinely on every query? • Run only on queries that matched few docs
Soundex
• Class of heuristics to expand a query into phonetic equivalents
• Language specific – mainly for names
• E.g., chebyshev → tchebycheff
Soundex – typical algorithm • Turn every token to be indexed into a 4-character reduced form • Do the same with query terms • Build and search an index on the reduced forms • (when the query calls for a soundex match)
Soundex
• Keep the first letter
• Code the rest into digits as shown in the table below
• Collapse runs of letters with the same Soundex digit into one
• Eliminate all zeros
• Truncate or pad with zeros to give one initial letter and three digits

Soundex phonetic codes:
0: A, E, I, O, U, H, W, Y
1: B, F, P, V
2: C, G, J, K, Q, S, X, Z
3: D, T
4: L
5: M, N
6: R
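A sketch implementing exactly the steps above (note: fuller Soundex variants also reconcile the first letter with its own code; this follows just the slide's recipe):

def soundex(name):
    codes = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.upper()
    digits = [codes[ch] for ch in name[1:] if ch in codes]
    collapsed = []
    for d in digits:                      # collapse runs of identical digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    body = "".join(d for d in collapsed if d != "0")   # eliminate zeros
    return (name[0] + body + "000")[:4]                # pad/truncate to 4 chars

print(soundex("Dickson"), soundex("Dixon"))    # D250 D250 (match)
print(soundex("Karlson"), soundex("Carlson"))  # K642 C642 (no match)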
Soundex
• The coding scheme is based on observations like:
- vowels are interchangeable
- consonants with similar sounds are put in the same equivalence class
• Example: Dickson, Dikson and Dixon all map to the same code (D250)
• Developed by Odell and Russell in 1918 and used in the US census to match American English names
• Soundex fails on two names with different initials (e.g. Karlson vs. Carlson)
• It also fails in other cases (e.g. Rodgers vs. Rogers, which receive different codes)