
Spelling Error Detection and Correction using Noisy Channel Model

This lecture covers the Noisy Channel Model for spelling error detection and correction, including the application of Bayes rule and the use of n-gram language models. It also discusses the challenges in handling spelling errors and provides strategies for finding the most likely correct word.


Presentation Transcript


  1. CPSC 503 Computational Linguistics, Lecture 4. Giuseppe Carenini.

  2. Knowledge-Formalisms Map (including probabilistic formalisms): State Machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models) for Morphology; Rule systems and probabilistic versions (e.g., (Probabilistic) Context-Free Grammars) for Syntax; Logical formalisms (First-Order Logics) for Semantics; AI planners for Pragmatics, Discourse and Dialogue.

  3. Today Sep 21: • Dealing with spelling errors • Noisy channel model • Bayes rule applied to the noisy channel model (single and multiple spelling errors) • Min Edit Distance (?) • Start n-gram models: Language Models

  4. Background knowledge: • Morphological analysis • P(x) (probability distribution) • Joint P(x,y) • Conditional P(x|y) • Bayes rule • Chain rule

  5. Spelling: the problem(s). Detection and correction across four cases:
  • Non-word, isolated: funn -> funny, fun, ... Find the most likely correct word.
  • Non-word, in context: trust funn / a lot of funn. Find the most likely correct word in this context.
  • Real-word, isolated: ?! (hardly detectable without context)
  • Real-word, in context: .. a wild dig. Is it an impossible (or very unlikely) word in this context? Find the most likely substitution word in this context.

  6. Spelling: Data. • Error rates: 0.05% - 3% - 38%, varying widely by application. • 80% of misspelled words contain a single error: insertion (toy -> tony), deletion (tuna -> tua), substitution (tone -> tony), transposition (length -> legnth). • Types of errors: typographic (more common; the user knows the correct spelling, e.g., the -> rhe) and cognitive (the user doesn't know it, e.g., piece -> peace).

  7. Noisy Channel. [Diagram: signal -> noisy channel -> noisy signal] • An influential metaphor in language processing is the noisy channel model: we observe a distorted version of the intended signal and try to recover the original. • A special case of Bayesian classification.

  8. Bayes and the Noisy Channel: Spelling (non-word, isolated). Goal: find the most likely word w given some observed (misspelled) word O.

  9. Problem: P(w|O) is hard/impossible to estimate directly (why? We would need, for every word w, counts of how often each specific misspelling O was produced for it, and no corpus is large enough for that). E.g., P(wine|winw) = ?

  10. Solution: apply Bayes rule and simplify. P(w|O) = P(O|w) P(w) / P(O). Since P(O) is the same for every candidate w, it can be dropped: choose the w that maximizes P(O|w) P(w), where P(O|w) is the likelihood (the channel model) and P(w) is the prior (the language model).
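  Reconstructed in full (standard Bayes rule; O is the observed misspelling and V the vocabulary; the slide's own rendering of the formula did not survive the transcript):

```latex
\hat{w} \;=\; \arg\max_{w \in V} P(w \mid O)
        \;=\; \arg\max_{w \in V} \frac{P(O \mid w)\, P(w)}{P(O)}
        \;=\; \arg\max_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}} \;\underbrace{P(w)}_{\text{prior}}
```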

  11. Estimate of the prior P(w) (easy): count occurrences of w in a large corpus, P(w) ≈ C(w)/N, with smoothing so that unseen words do not get probability zero. Always verify…
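  A minimal sketch of such a smoothed prior, assuming add-0.5 smoothing in the style of Kernighan et al. (the slide does not say which smoothing was used); corpus_words and vocab are hypothetical inputs:

```python
from collections import Counter

def make_prior(corpus_words, vocab):
    """Smoothed unigram prior: P(w) = (C(w) + 0.5) / (N + 0.5 * |V|)."""
    counts = Counter(corpus_words)
    n = len(corpus_words)   # total tokens in the corpus
    v = len(vocab)          # vocabulary size
    def prior(w):
        return (counts[w] + 0.5) / (n + 0.5 * v)
    return prior

# Usage: prior = make_prior(words, set(words)); prior("actress")
```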

  12. Estimate of P(O|w) is feasible (Kernighan et al. '90). • For a one-error misspelling: estimate the probability of each possible error type, e.g., insert a after c, substitute f with h. • P(O|w) equals the probability of the error that generated O from w, e.g., P(cbat|cat) = P(insert b after c).

  13. Estimate P(error type): from a large corpus, compute confusion matrices (e.g., substitution: sub[x,y] = number of times y was incorrectly used for x) and a count matrix (e.g., Count(a) = number of occurrences of a in the corpus). Fragment of the substitution matrix (rows = intended letter, columns = typed letter):

            a    b    c   ...
       a    -
       b    5    -
       c    8   15    -
       d    8
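  A sketch of how these matrices yield channel probabilities, assuming the Kernighan-style estimate P(x typed as y) ≈ sub[x,y] / Count(x); the numbers below are toy values, not real corpus counts:

```python
# Toy confusion counts: sub[(intended, typed)] = number of times `typed`
# was written where `intended` was meant (row = intended, column = typed).
sub = {("b", "a"): 5, ("c", "a"): 8, ("c", "b"): 15, ("d", "a"): 8}
count = {"b": 9000, "c": 12000, "d": 11000}  # occurrences of each letter in the corpus

def p_substitution(intended, typed):
    """Channel probability that one substitution error turned `intended` into `typed`."""
    return sub.get((intended, typed), 0) / count[intended]

print(p_substitution("c", "b"))  # 15 / 12000 = 0.00125
```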

  14. Corpus example (annotations mark the error type found at each misspelling): "… On 16 January, he sais [sub[i,y] 3] that because of astronaut safety tha [del[a,t] 4] would be no more space shuttle missions to miantain [tran[a,i] 2] and upgrade the orbiting telescope …"

  15. Final method, single error: (1) Given O, collect all the wi that could have generated O by one error. E.g., O = acress => w1 = actress (t deletion), w2 = across (substitute o with e), … How to do (1): generate all the strings that could have generated O by one error (how? sketched in code below), and keep the ones that are words. (2) For each wi compute P(O|wi) P(wi): the probability of the error generating O from wi times the word prior. (3) Sort and display the top n to the user.
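  A minimal candidate-generation sketch in the spirit of Norvig's spell-correct.html (linked on a later slide); WORDS is an assumed toy dictionary:

```python
import string

WORDS = {"actress", "across", "acres", "access", "caress", "cress"}  # assumed dictionary

def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(observed):
    """Step (1): keep only the one-error neighbors that are real words."""
    return edits1(observed) & WORDS

print(candidates("acress"))  # actress, across, acres, access, caress, cress (set order varies)
```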

  16. Example: collect all the wi that could have generated "acress" by one error. a c r e s s: count the possible deletions, transpositions, substitutions (alterations), and insertions.
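  As a rough check of the search-space size, reusing the edits1 sketch above (the 54n+25 arithmetic is Norvig's, counting over 26 letters for a word of length n):

```python
word = "acress"                              # length n = 6
n = len(word)
print(n + (n - 1) + 26 * n + 26 * (n + 1))   # 54n + 25 = 349 raw edit strings
print(len(edits1(word)))                     # distinct strings, a bit fewer after deduplication
```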

  17. Example: O = acress, in context "…stellar and versatile acress whose…". Counts taken from the 1988 AP newswire corpus (44 million words). [Slide shows a table of the candidate words with their priors and error likelihoods.]

  18. Evaluation: [Slide shows a table comparing the system's ranked suggestions (1st, 2nd, other) against the "correct" corrections.]

  19. Corpora: issues to remember. • Zero counts in the corpus: just because an event didn't happen in the corpus doesn't mean it won't happen; e.g., cress does not really have zero probability. • Getting a corpus that matches the actual use; e.g., kids don't misspell the same way that adults do.

  20. Multiple Spelling Errors. • (Before) Given O, collect all the wi that could have generated O by one error. • (Now) Given O, collect all the wi that could have generated O by 1..k errors. How (for two errors): collect all the strings that could have generated O by one error, then collect all the wi that could have generated one of those strings by one error; and so on for larger k.
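  A sketch of the two-error extension, reusing edits1 and WORDS from above; preferring one-error matches reflects the intuition that fewer errors are more likely:

```python
def candidates_2(observed):
    """Words reachable by one or two errors, nearer candidates first."""
    one = edits1(observed)
    two = {e2 for e1 in one for e2 in edits1(e1)}  # apply edits1 again to each result
    return (one & WORDS) or (two & WORDS)          # fall back to two-error words only if needed
```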

  21. Final method, multiple errors: (1) Given O, for each wi that can be generated from O by a sequence of edit operations EdOpi, save EdOpi. (2) For each wi compute P(O|wi) P(wi): the probability of the errors generating O from wi times the word prior. (3) Sort and display the top n to the user.

  22. Spelling: the problem(s), revisited.
  • Non-word, isolated: funn -> funny, funnel, ... Find the most likely correct word.
  • Non-word, in context: trust funn / a lot of funn. Find the most likely correct word in this context.
  • Real-word, isolated: ?!
  • Real-word, in context: .. a wild dig. Is it an impossible (or very unlikely) word in this context? Find the most likely substitution word in this context.

  23. Real-Word Spelling Errors. • Collect a set of common confusion sets C = {C1 .. Cn}, e.g., {(their/they're/there), (to/too/two), (weather/whether), (lave/have), ..}. • Whenever c' ∈ Ci is encountered: compute the probability of the sentence in which it appears; substitute each c ∈ Ci (c ≠ c') and compute the probability of the resulting sentence; choose the highest one (see the sketch below).
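  A minimal sketch of this loop, assuming some sentence-probability function sentence_prob (for example, an n-gram language model like those introduced below); the confusion sets here are illustrative:

```python
CONFUSION_SETS = [
    {"their", "they're", "there"},
    {"to", "too", "two"},
    {"weather", "whether"},
]

def correct_real_words(tokens, sentence_prob):
    """Replace each confusable token with its highest-probability alternative."""
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        for conf_set in CONFUSION_SETS:
            if tok in conf_set:
                # Try every member of the confusion set in this slot and keep
                # the variant whose full sentence is most probable.
                best = max(conf_set,
                           key=lambda c: sentence_prob(tokens[:i] + [c] + tokens[i+1:]))
                tokens[i] = best
    return tokens
```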

  24. Want to play with spelling correction? A minimal noisy channel model implementation (Python): http://www.norvig.com/spell-correct.html. By the way, Peter Norvig is Director of Research at Google Inc. (He will be visiting our dept. on Thurs!)

  25. Today Sep 21: • Dealing with spelling errors • Noisy channel model • Bayes rule applied to the noisy channel model (single and multiple spelling errors) • Min Edit Distance (?) • Start n-gram models: Language Models

  26. Minimum Edit Distance. • Def.: the minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another. • E.g., gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u with a).

  27. Minimum Edit Distance Algorithm. • Dynamic programming (a very common technique in NLP). • High-level description: fills in a matrix of partial comparisons; the value of a cell is computed as a "simple" function of the surrounding cells. • Output: not only the number of edit operations but also the sequence of operations.

  28. Minimum Edit Distance Algorithm: details (implemented below). Costs: del-cost = 1, ins-cost = 1, sub-cost = 2. ed[i,j] = minimum distance between the first i chars of the source and the first j chars of the target. Cell update: ed[i,j] = MIN( ed[i-1,j] + 1 (deletion), ed[i,j-1] + 1 (insertion), ed[i-1,j-1] + (2 if source char ≠ target char, else 0) (substitution or equal) ).
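  A direct sketch of this recurrence (the backtrace needed to recover the operation sequence is omitted for brevity):

```python
def min_edit_distance(source, target):
    """Dynamic-programming edit distance with del=1, ins=1, sub=2 (as on the slide)."""
    n, m = len(source), len(target)
    # ed[i][j] = min distance between first i chars of source and first j chars of target
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ed[i][0] = i          # delete all i source chars
    for j in range(1, m + 1):
        ed[0][j] = j          # insert all j target chars
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            ed[i][j] = min(ed[i - 1][j] + 1,        # deletion
                           ed[i][j - 1] + 1,        # insertion
                           ed[i - 1][j - 1] + sub)  # substitution or equal
    return ed[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with these costs
```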


  30. Min edit distance and alignment: see demo.

  31. Today Sep 21: • Dealing with spelling errors • Noisy channel model • Bayes rule applied to the noisy channel model (single and multiple spelling errors) • Min Edit Distance (?) • Start n-gram models: Language Models

  32. Key Transition. • Up to this point we've mostly been discussing words in isolation. • Now we're switching to sequences of words. • And we're going to worry about assigning probabilities to sequences of words.

  33. Knowledge-Formalisms Map (including probabilistic formalisms): State Machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models) for Morphology; Rule systems and probabilistic versions (e.g., (Probabilistic) Context-Free Grammars) for Syntax; Logical formalisms (First-Order Logics) for Semantics; AI planners for Pragmatics, Discourse and Dialogue.

  34. Only Spelling? • Assign a probability to a sentence: part-of-speech tagging, word-sense disambiguation, probabilistic parsing. • Predict the next word: speech recognition, hand-writing recognition, augmentative communication for the disabled. • But the probability of an arbitrary sentence is impossible to estimate directly from counts, hence the decomposition on the next slide.

  35. Chain Rule. Decompose: apply the chain rule, P(A,B) = P(A|B) P(B). Applied to a word sequence from position 1 to n: P(w1 w2 ... wn) = P(w1) * P(w2|w1) * P(w3|w1 w2) * ... * P(wn|w1 ... wn-1).

  36. Example: sequence "The big red dog barks". P(The big red dog barks) = P(The) * P(big|The) * P(red|The big) * P(dog|The big red) * P(barks|The big red dog). Note: P(The) is better expressed as P(The|<beginning of sentence>), written P(The|<S>).

  37. Not a satisfying solution: even for small n (e.g., 6) we would need a far too large corpus to estimate P(wn|w1 ... wn-1). Markov assumption: the entire prefix history isn't necessary. • Unigram: P(wn|w1 ... wn-1) ≈ P(wn) • Bigram: ≈ P(wn|wn-1) • Trigram: ≈ P(wn|wn-2 wn-1)

  38. Probability of a sentence with N-grams: • Unigram: P(w1 ... wn) ≈ ∏k P(wk) • Bigram: ≈ ∏k P(wk|wk-1) • Trigram: ≈ ∏k P(wk|wk-2 wk-1)

  39. Bigram: <S> The big red dog barks. • P(The big red dog barks) = P(The|<S>) * P(big|The) * P(red|big) * P(dog|red) * P(barks|dog). Trigram? (Each word would instead condition on the two preceding words, e.g., P(red|The big).)

  40. Estimates for N-grams. Bigram: P(wn|wn-1) = C(wn-1 wn) / C(wn-1). In general: P(wn|wn-N+1 ... wn-1) = C(wn-N+1 ... wn-1 wn) / C(wn-N+1 ... wn-1).
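  A minimal sketch of these maximum-likelihood estimates on a toy corpus (no smoothing, matching the "Next Time" note below):

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram model: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<S>"] + sent.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(p, sentence):
    """Probability of a sentence as a product of bigram probabilities."""
    tokens = ["<S>"] + sentence.lower().split()
    prob = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        prob *= p(prev, w)
    return prob

p = train_bigram(["the big red dog barks", "the big dog sleeps"])
print(sentence_prob(p, "the big red dog barks"))  # P(the|<S>) * P(big|the) * ... = 0.25
```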

  41. Next Time: • N-Grams (Chp. 4) • Model evaluation (Sec. 4.4) • No smoothing (skip 4.5-4.7)
