Speling Korecksion: A Survey of Techniques from Past to Present
A UCSD Research Exam by Dustin Boswell
September 20th 2004
Presentation Outline
• Introduction to Spelling Correction
  • Techniques
  • Difficulties
• Noisy Channel Model
• My Implementation
• Demonstration
• Conclusions
Goal of Spelling Correction
To assist humans in detecting mistaken tokens of text and replacing them with corrected tokens.
Sources of Spelling Mistakes
“Spelling” is too specific; we want to correct many error types:
• Physical: keyboard errors (“typos”), e.g. “the” → “hte”
• Cognitive: phonetic errors, e.g. “phonetic” → “fonetik”; word misuse, e.g. “their” → “they’re”
• Non-human: OCR errors, e.g. “iii” → “m”
General Spelling Correction Algorithm
SpellCorrect( word w ):
    if isMistake( w ):
        Candidates := getCandidates( w )
        Suggestions := filterAndRank( Candidates )
        return Suggestions
    else:
        return IS_CORRECT
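As a concrete (if toy) rendering of this skeleton, here is a runnable Python sketch; the dictionary, candidate generator, and ranking function are placeholder assumptions, not the real system:

```python
# Minimal sketch of the general pipeline above; every component
# here is an illustrative stand-in.
DICTIONARY = {"the", "spelling", "correction", "received", "standard"}

def is_mistake(word):
    return word.lower() not in DICTIONARY

def get_candidates(word):
    # Stand-in: any dictionary word of similar length is a candidate.
    return [w for w in DICTIONARY if abs(len(w) - len(word)) <= 1]

def filter_and_rank(candidates, word):
    # Stand-in ranking: prefer candidates that share the first letter.
    return sorted(candidates, key=lambda w: w[0] != word[0])

def spell_correct(word):
    if is_mistake(word):
        return filter_and_rank(get_candidates(word), word)
    return "IS_CORRECT"

print(spell_correct("speling"))  # e.g. ['spelling', ...]
```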
Recognizing Mistakes (Dictionaries)
• Use a dictionary (lexicon) to define the set of valid words.
• Recognizing whether a string belongs to a language is an old CS problem.
• Hash tables or letter tries are reasonably efficient implementations.
• Earlier work like (Peterson80) focused on compression techniques (stemming, caching, etc.).
• For hardware like cell phones this may still be relevant.
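A brief sketch of both structures in Python: the built-in set plays the hash table, and nested dicts stand in for a letter trie:

```python
# Hash-table lookup: Python sets give O(1) average membership tests.
lexicon = {"standard", "receive", "their"}
print("receive" in lexicon)            # True

# Letter trie: nested dicts, with "$" marking end-of-word.
def trie_insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

def trie_contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = {}
for w in lexicon:
    trie_insert(trie, w)
print(trie_contains(trie, "their"))    # True
print(trie_contains(trie, "standrd"))  # False
```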
Recognizing Mistakes (Letter N-grams)
• No dictionary; just look for unusual letter sequences, e.g. “standrd” is a strange sequence.
• Learn what is “usual” from a large corpus of text.
• Very space efficient.
• Doesn't really work, though: valid but unusual or foreign-language words get flagged, while nice-looking mistakes slip by.
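A rough sketch of such a detector; the toy corpus, the word padding, and the threshold are illustrative assumptions:

```python
from collections import Counter

# Train letter-trigram counts on a (toy) corpus; a real system
# would use a large text collection.
corpus = "the standard method of spelling correction detects errors"
trigrams = Counter(corpus[i:i+3] for i in range(len(corpus) - 2))

def looks_unusual(word, threshold=1):
    # Flag the word if any of its trigrams was (almost) never seen.
    padded = f" {word} "
    return any(trigrams[padded[i:i+3]] < threshold
               for i in range(len(padded) - 2))

print(looks_unusual("standard"))  # False: all trigrams occur in corpus
print(looks_unusual("standrd"))   # True: "ndr" and "drd" are unseen
```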
Generating Candidates
• We have a mistaken word w and a dictionary D.
• Find a subset Candidates(w) ⊆ D that contains all the words “similar” to w.
• Recall: Candidates must include the intended word.
• Precision: too many candidates makes the system slow.
• We need to define “similar” formally…
Soundex Algorithm
• Define the following letter groups: 1 = B F P V; 2 = C G J K Q S X Z; 3 = D T; 4 = L; 5 = M N; 6 = R; all other letters (vowels, H, W, Y) = 0.
• Then convert any string into a code as follows:
  • Save the first letter.
  • Convert the other letters to the numbers of the groups above.
  • Squeeze repeats.
  • Delete the 0's.
• Soundex(Mississippi): M0220220110 → M0202010 → M221
• Two strings are “similar” if they have the same Soundex code.
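A minimal Python sketch of these four steps; it reproduces the Mississippi example above:

```python
# Soundex following the four steps above; GROUPS maps letters to their
# group digit, with unlisted letters (vowels, h, w, y) treated as "0".
GROUPS = {}
for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")]:
    for ch in letters:
        GROUPS[ch] = digit

def soundex(word):
    word = word.lower()
    digits = [GROUPS.get(ch, "0") for ch in word[1:]]  # step 2
    squeezed = []
    for d in digits:                                   # step 3: squeeze repeats
        if not squeezed or squeezed[-1] != d:
            squeezed.append(d)
    # step 1 (saved first letter) + step 4 (delete the 0's)
    return word[0].upper() + "".join(d for d in squeezed if d != "0")

print(soundex("Mississippi"))  # M221
```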
Edit Distance - Definition
• Suppose humans make the following edits (mistakes):
  • Insertion – insert a new character somewhere in the string
  • Deletion – delete a character
  • Substitution – replace one character with another
  • Transposition – swap two adjacent characters
• The edit distance between strings x and y is the minimum number of edits required to change x into y.
Edit Distance – Example 1
Dist(receive, reciev) = 2: receive → recieve (swap “ei” to “ie”) → reciev (delete “e”)
Edit Distance – Example 2
Dist(relive, reciev) = 2: relive → recive (change “l” to “c”) → reciev (swap “ve” to “ev”)
Edit Distance - Implementation
There is an O(|x| · |y|) dynamic programming solution. One comparison takes about 10 µs on average, so checking a word against 100,000 dictionary words takes about 1 second.
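A sketch of that dynamic program in Python; since the definition above includes transpositions, this is the restricted Damerau-Levenshtein variant:

```python
def edit_distance(x, y):
    """O(|x|*|y|) DP over prefixes: d[i][j] is the distance between
    x[:i] and y[:j], allowing insert/delete/substitute/transpose."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i-1] == y[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution / match
            if (i > 1 and j > 1 and x[i-1] == y[j-2]
                    and x[i-2] == y[j-1]):
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[m][n]

print(edit_distance("receive", "reciev"))  # 2
print(edit_distance("relive", "reciev"))   # 2
```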
Edit Distance - Imperfections
Edit distance isn't a perfect model of similarity. In the two examples above:
• deleting “e” is more likely than changing “l” to “c”
• swapping “ei” to “ie” is more likely than swapping “ve” to “ev”
Yet both corrections sit at the same edit distance of 2.
Weighted Edit Distance
Make edits involving similar-sounding letters (or letters that are adjacent on the keyboard) cost less than others. For example, substituting letters within the same group might only cost 0.5 instead of 1.0.
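One possible sketch of such a cost function; the QWERTY grid and the 0.5 figure are illustrative assumptions, not a standard weighting:

```python
# A toy weighted substitution cost: keyboard-adjacent letters are cheap.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def key_pos(ch):
    for r, row in enumerate(ROWS):
        if ch in row:
            return (r, row.index(ch))
    return None

def sub_cost(a, b):
    if a == b:
        return 0.0
    pa, pb = key_pos(a), key_pos(b)
    if pa is not None and pb is not None:
        # Treat keys one step apart on the grid as adjacent (the real
        # keyboard is staggered, so this is only an approximation).
        if abs(pa[0] - pb[0]) + abs(pa[1] - pb[1]) == 1:
            return 0.5
    return 1.0

print(sub_cost("e", "r"))  # 0.5: adjacent keys
print(sub_cost("e", "m"))  # 1.0: far apart
```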
Ranking Candidates
• The single-edit assumption (true ~80% of the time):
  • all candidates have an edit distance ≤ 1
  • assign weights/probabilities to each edit
  • (Kernighan90) learns these probabilities automatically
• Ranking multiple-edit candidates:
  • weighted edit distance is the most popular approach (used by aspell in Unix)
Word Boundary Mistakes
• Run-on – “ofthe”
• Mis-split – “o fthe”
• Unintended split – “specifi cation”
Word Boundary Mistakes
Partial solution (the only method in practical use). If a word w is a mistake:
• For each division of w into w1 and w2: if w1 and w2 are both valid, return “w1 w2”.
• Concatenate w with the next (or previous) word: if the result is valid, return it.
A sketch of this procedure follows.
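A minimal sketch of this partial solution, assuming a `lexicon` set and access to the neighboring tokens:

```python
lexicon = {"of", "the", "a", "specification"}

def fix_boundary(word, prev_word=None, next_word=None):
    """Partial word-boundary repair, as described above."""
    # Try every split of a run-on word into two valid words.
    for i in range(1, len(word)):
        w1, w2 = word[:i], word[i:]
        if w1 in lexicon and w2 in lexicon:
            return f"{w1} {w2}"
    # Try re-joining with a neighboring token (unintended split).
    if next_word and word + next_word in lexicon:
        return word + next_word
    if prev_word and prev_word + word in lexicon:
        return prev_word + word
    return None

print(fix_boundary("ofthe"))                        # "of the"
print(fix_boundary("specifi", next_word="cation"))  # "specification"
```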
Word Boundary Mistakes
This method fails when other errors are involved. Examples:
• … quite a sop[h isticated one …
• … is a deter miniic statement …
In the general case there are exponentially many combinations to consider. This is an open problem.
False Positives
• A false positive is a correct word that is marked as a mistake, e.g. viagra, donatelo, tokenization.
• These tend to be proper nouns, domain jargon, etc.
• Statistics:
  • 5%–35% of mistakes involve words outside a 40,000-word dictionary.
  • The words in a New York Times newswire corpus and those in the Webster Dictionary had only 1/3 overlap.
• False positives are caused by inadequate dictionaries.
Obtaining Large Dictionaries
If standard dictionaries are too small, can we build one automatically (just filter out the infrequent words)? Word counts from a newswire corpus:

  word           count      word         count
  recieved       90         recieve      38
  survivalists   90         subsets      38
  mackovic’s     90         panjshir     38
  savr           90         angelie’s    38
  trevelyan      90         lovato       38

The misspellings “recieved” and “recieve” are exactly as frequent as legitimate rare words, so no frequency cutoff separates them.
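For concreteness, a sketch of the naive frequency-filtering idea; the tokenizer and cutoff are assumptions, and the counts above show why it fails:

```python
from collections import Counter
import re

def build_dictionary(text, min_count=50):
    """Naive automatic dictionary: keep every token seen often enough."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

# Any cutoff low enough to admit rare valid words ("trevelyan", 90)
# also admits the equally frequent misspelling "recieved" (90).
```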
False Negatives
• A false negative is a mistake that goes undetected, e.g. “Can you here me?” “Go over their.” “Cross the dessert sand.”
• Mistakes can also look like rare words, e.g. “veery”.
• Context must be used to solve this problem:
  • part-of-speech syntax constraints
  • word N-gram statistics
A sketch of the word N-gram idea follows.
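A toy sketch of the word N-gram idea; the bigram counts and confusion sets below are invented for illustration:

```python
# Toy bigram counts; a real system learns these from a large corpus.
bigram_counts = {("over", "there"): 500, ("over", "their"): 20,
                 ("you", "hear"): 300, ("you", "here"): 10}

confusion_sets = {"their": ["there"], "here": ["hear"]}

def suspect_real_word_error(prev_word, word):
    """Flag `word` if a confusable alternative fits the context better."""
    score = bigram_counts.get((prev_word, word), 0)
    for alt in confusion_sets.get(word, []):
        if bigram_counts.get((prev_word, alt), 0) > score:
            return alt   # the alternative is more likely here
    return None

print(suspect_real_word_error("over", "their"))  # "there"
```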
Lacking Features of Earlier Work
• Handling multiple errors / quantifying “error likelihood”
• Use of large dictionaries
• Use of word statistics
• Use of word context
• Quantitative ways of combining all of the above!
Noisy Channel Derivation
Let S be a word, phrase, or sentence intended by an author, and let S̃ be the produced sequence after errors occur. Our task is to find the most likely candidate given the evidence:

  S* = argmax_{S'} P(S' | S̃)
     = argmax_{S'} P(S̃ | S') · P(S') / P(S̃)   (by Bayes' rule)
     = argmax_{S'} P(S̃ | S') · P(S')           (since P(S̃) doesn't depend on S')
Noisy Channel Illustration
[Diagram: the author generates the intended text S according to the language model P(S); typos and misspellings then corrupt S into the observed S̃ according to the error model P(S̃ | S). We recover S* = argmax_{S'} P(S̃ | S') · P(S').]
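Combining the two models in code: a minimal noisy-channel ranker, with a toy unigram language model and a flat error model standing in for the learned ones:

```python
# The vocabulary, the unigram probabilities, and the flat channel
# probability are all toy stand-ins.
lm = {"actress": 1.0e-5, "acres": 2.0e-5, "caress": 5.0e-7}  # P(S')

def error_prob(observed, intended):
    # Flat channel model (Mays/Damerau/Mercer style): every candidate
    # one edit away is equally likely. Kernighan et al. instead learn
    # a probability for each specific edit.
    return 1e-3

def correct(observed, candidates):
    # S* = argmax_{S'} P(S~ | S') * P(S')
    return max(candidates, key=lambda s: error_prob(observed, s) * lm[s])

print(correct("acress", ["actress", "acres", "caress"]))  # "acres"
```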
Noisy Channel Systems (1)
First application to spelling correction: Kernighan, Church, Gale ’90.
• Language model used: unigram model
• Error model used: learned probabilities of single edits automatically
• Example: S̃ = “acress”, in context “… stellar and versatile acress whose combination of sass and …”
• 87% correction accuracy overall.
Noisy Channel Systems (2)
Another founding work: Mays, Damerau, Mercer ’91.
• Language model used: trigram model
• Error model used: all candidates (one edit away) are equally likely.
• Evaluation on artificially created word-to-word mistakes:
  • corrected 73% of sentences back to their original form
  • falsely changed 1% of original sentences to something else
Noisy Channel Systems (3)
Recent work by Brill & Moore 2000.
• Language model used: trigram model
• Error model used: learned probabilities of “substring edits”:
  • Learn probabilities of mistakes like “ph” → “f”, “ent” → “ant”
  • Partition words into aligned chunks:
    S  = ph y s i c al
    S̃  = f  i s i c le
  • Calculate P(S̃ | S) as: P(fisicle | physical) = P(f | ph) · P(i | y) · P(s | s) · … · P(le | al)
• 97% correction accuracy on a list of common mistakes
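A sketch of scoring one aligned partition in this style; the chunk probabilities below are invented, and a real system learns them from data and searches over all partitions rather than fixing one:

```python
# Toy substring-edit probabilities P(observed_chunk | intended_chunk).
edit_probs = {("ph", "f"): 0.10, ("y", "i"): 0.05, ("s", "s"): 0.95,
              ("i", "i"): 0.95, ("c", "c"): 0.95, ("al", "le"): 0.02}

def partition_prob(alignment):
    """P(S~ | S) for a fixed alignment: product of chunk-edit probs."""
    p = 1.0
    for intended, observed in alignment:
        p *= edit_probs.get((intended, observed), 1e-6)  # unseen-edit floor
    return p

# physical -> fisicle, partitioned as on the slide.
alignment = [("ph", "f"), ("y", "i"), ("s", "s"),
             ("i", "i"), ("c", "c"), ("al", "le")]
print(partition_prob(alignment))  # P(fisicle | physical) under this model
```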
Noisy Channel Systems (4)
More recent work by Toutanova & Moore 2002.
• Language model used: unigram model
• Error model used: phonetic model combined with the previous work
  • Convert all letter strings to phone strings first.
  • Learn probabilities of phonetic mistakes like “EH” → “AH”.
  • Linearly combine this model with the previous model.
• Reduced the error rate by 24%.