Outline • Applications: • Spelling correction • Formal Representation: • Weighted FSTs • Algorithms: • Bayesian Inference (Noisy channel model) • Methods to determine weights • Hand-coded • Corpus-based estimation • Dynamic Programming • Shortest path
Detecting and Correcting Spelling Errors • Sources of lexical/spelling errors • Speech: lexical access and recognition errors (more later) • Text: typing and cognitive • OCR: recognition errors • Applications: • Spell checking • Hand-writing recognition of zip codes, signatures, Graffiti • Issues: • Correct non-words in isolation (dg for dog, why not dig?) • Correcting non-words could lead to valid words • Homophone substitution: “parents love there children”; “Lets order a desert after dinner” • Correcting words in context
Patterns of Error • Human typists make different types of errors from OCR systems -- why? • Error classification I: performance-based: • Insertion: catt • Deletion: ct • Substitution: car • Transposition: cta • Error classification II: cognitive • People don’t know how to spell (nucular/nuclear; potatoe/potato) • Homonymous errors (their/there)
Probability: Refresher • Population: 10 Princeton students • 4 vegetarians • 3 CS majors • What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4 • That a rcs is a CS major? p(c) = 0.3 • That a rcs is a vegetarian and CS major? p(c,v) = 0.2 • That a vegetarian is a CS major? p(c|v) = 0.5 • That a CS major is a vegetarian? p(v|c) = 0.67 • That a non-CS major is a vegetarian? p(v|c’) = ??
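The conditional probabilities on this slide can be checked with exact fractions. The sketch below assumes that 2 of the 10 students are both vegetarian and CS majors, which is what p(c,v) = 0.2 implies; the variable names are illustrative only.

```python
from fractions import Fraction

N = 10       # students in the population
n_v = 4      # vegetarians
n_c = 3      # CS majors
n_cv = 2     # both (assumed from p(c,v) = 0.2)

p_v = Fraction(n_v, N)
p_c = Fraction(n_c, N)
p_cv = Fraction(n_cv, N)

# Conditional probability: p(a|b) = p(a,b) / p(b)
p_c_given_v = p_cv / p_v
p_v_given_c = p_cv / p_c

# Vegetarians who are not CS majors, divided by all non-CS majors
p_v_given_not_c = (p_v - p_cv) / (1 - p_c)

print(p_c_given_v, p_v_given_c, p_v_given_not_c)
```

Using `Fraction` avoids rounding, so the 2/3 that the slide rounds to two decimals stays exact.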
Bayes Rule and Noisy Channel model • We know the joint probabilities • p(c,v) = p(c) p(v|c) (chain rule) • p(v,c) = p(c,v) = p(v) p(c|v) • So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c): • p(c|v) = p(v|c) p(c) / p(v) • “Noisy channel” metaphor: channel corrupts the input; recover the original. • think cell-phone conversations!! • Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O): • w’ = argmax_w P(w|O) = argmax_w P(O|w) P(w), where P(w) is the source model and P(O|w) is the channel model
How do we use this model to correct spelling errors? • Simplifying assumptions • We only have to correct non-word errors • Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition) • Generate and Test Method: (Kernighan et al 1990) • Generate a word using one of substitution, deletion or insertion, transposition operations • Test if the resulting word is in the dictionary. • Example:
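The generate-and-test idea can be sketched as follows. This is a minimal illustration under the slide's one-edit assumption, not the Kernighan et al. implementation; `one_edit_candidates`, `generate_and_test` and `toy_dictionary` are hypothetical names, and the dictionary is a made-up toy set. Note that the edit applied to the typo is the inverse of the error: a typo caused by an insertion is repaired by a deletion.

```python
import string

def one_edit_candidates(typo, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + ch + b[1:] for a, b in splits if b for ch in alphabet}
    inserts = {a + ch + b for a, b in splits for ch in alphabet}
    return (deletes | transposes | substitutes | inserts) - {typo}

def generate_and_test(typo, dictionary):
    """Keep only the generated candidates that are dictionary words."""
    return sorted(one_edit_candidates(typo) & set(dictionary))

# Toy dictionary, assumed for illustration only.
toy_dictionary = {"cat", "carat", "cart", "coat", "dog", "dig"}

print(generate_and_test("caat", toy_dictionary))
print(generate_and_test("dg", toy_dictionary))
```

Several valid words usually survive the test step, which is why a ranking function is needed next.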
How do we decide which correction is most likely? • Validate the generated word in a dictionary. • But there may be multiple valid words, how to rank them? • Rank them based on a scoring function • P(w | typo) ∝ P(typo | w) * P(w) (the denominator P(typo) is the same for every candidate w, so it can be dropped) • Note there could be other scoring functions • Propose n-best solutions • Estimate the likelihood P(typo|w) and the prior P(w) • count events from a corpus to estimate these probabilities • Labeled versus Unlabeled corpus • For spelling correction, what do we need? • Word occurrence information (unlabeled corpus) • A corpus of labeled spelling errors • Approximate word replacement by local letter replacement probabilities: Confusion matrix on letters
Cat vs Carat • Estimating the Prior: Suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus • cat occurs 6500 times, so p(cat) = .00013 • carat occurs 3000 times, so p(carat) = .00006 • Estimating the likelihood: Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’ in a corrections corpus of 50K corrections ( p(typo|word)) • suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a)=.1) and ‘r’ deletion occurs 7500 times (p(-r)=.15) • Scoring function: p(word|typo) ∝ p(typo|word) * p(word) • p(cat|caat) ∝ p(+a) * p(cat) = .1 * .00013 = .000013 • p(carat|caat) ∝ p(-r) * p(carat) = .15 * .00006 = .000009 • So cat is ranked above carat as the correction for caat.
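The arithmetic on this slide can be reproduced in a few lines. The counts below are the ones quoted on the slide; the dictionary layout and the `score` helper are assumptions made for illustration.

```python
CORPUS_SIZE = 50_000_000   # words in the news corpus
CORRECTIONS = 50_000       # entries in the corrections corpus

# How often each candidate word occurs in the corpus.
word_counts = {"cat": 6500, "carat": 3000}

# How often the error that turns each word into "caat" was observed:
# 'a' inserted after 'a' (cat -> caat), 'r' deleted after 'a' (carat -> caat).
error_counts = {"cat": 5000, "carat": 7500}

def score(word):
    """Unnormalized posterior: p(typo|word) * p(word)."""
    prior = word_counts[word] / CORPUS_SIZE
    likelihood = error_counts[word] / CORRECTIONS
    return likelihood * prior

ranked = sorted(word_counts, key=score, reverse=True)
for word in ranked:
    print(word, score(word))
```

The higher prior of cat outweighs the higher error likelihood of carat, so cat comes first in `ranked`.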
Encoding One-Error Correction as WFSTs • Let Σ = {c,a,r,t}; • One-edit model: identity arcs c:c, a:a, r:r, t:t (cost 0), plus a single edit arc chosen from deletion c:ε, a:ε, r:ε, t:ε (Del), insertion ε:c, ε:a, ε:r, ε:t (Ins), or substitution c:a, c:r, c:t, a:c, a:t… (Sub) • Dictionary model: [figure: FST over the letters of the dictionary words] • One-Error spelling correction: • Input ● Edit ● Dictionary
Issues • What if there are no instances of carat in corpus? • Smoothing algorithms • Estimate of P(typo|word) may not be accurate • Training probabilities on typo/word pairs • What if there is more than one error per word?
Minimum Edit Distance • How can we measure how different one word is from another word? • How many operations will it take to transform one word into another? caat --> cat, fplc --> fireplace (*treat abbreviations as typos??) • Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1) • Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent
Computing Levenshtein Distance • Dynamic Programming algorithm • Solution for a problem is a function of the solutions of subproblems • d[i,j] contains the distance between the first i characters of s and the first j characters of t • d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion and substitution operations. • The optimal edit operations are recovered by storing back-pointers.
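A minimal sketch of the dynamic program described above. The function name and the cost parameters are illustrative; the version below returns only the distance and leaves out the back-pointers needed to recover the edit sequence.

```python
def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """d[i][j] = cheapest way to turn the first i chars of s into the first j chars of t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases: delete everything from s, or insert everything of t.
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = s[i - 1] == t[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                       # delete s[i-1]
                d[i][j - 1] + ins_cost,                       # insert t[j-1]
                d[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitute
            )
    return d[m][n]

print(edit_distance("caat", "cat"))
print(edit_distance("fplc", "fireplace"))
print(edit_distance("intention", "execution", sub_cost=2))
```

Passing `sub_cost=2` gives the variant in which a substitution costs as much as a deletion plus an insertion.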
Edit Distance Matrix • NB: errors • Cost=1 for insertions and deletions; Cost=2 for substitutions • Recompute the matrix: insertions=deletions=substitutions=1
Levenshtein Distance with WFSTs • Let Σ = {c,a,r,t}; • Edit model: identity arcs c:c, a:a, r:r, t:t (cost 0); deletion arcs c:ε, a:ε, r:ε, t:ε (Del); insertion arcs ε:c, ε:a, ε:r, ε:t (Ins); substitution arcs c:a, c:r, c:t, a:c, a:t… (Sub) • The two sentences to be compared are encoded as FSTs. • Levenshtein distance between two sentences: • Dist(s1,s2) = s1 ● Edit ● s2
Spelling Correction with WFSTs • Dictionary: FST representation of words • Isolated word spelling correction: • AllCorrections(w) = w ● Edit ● Dictionary • BestCorrection(w) = Bestpath (w ● Edit ● Dictionary) • Spelling correction in context: “parents love there children” • S = w1, w2, … wn • Spelling correction of wi • Generate possible edits for wi • Pick the edit that fits best in context • Use an n-gram language model (LM) to rank the alternatives. • “love there” vs “love their”; “there children” vs “their children” • SentenceCorrection (S) = F(S) ● Edit ● LM
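Ranking alternatives in context can be sketched with a bigram model. The log-probabilities below are invented for illustration and are not estimated from any corpus; `context_score` and `correct_in_context` are hypothetical helpers, and the sketch ignores the edit cost that a full WFST composition would also include.

```python
import math

# Hypothetical bigram log-probabilities (made up for this example).
BIGRAM_LOGP = {
    ("love", "their"): math.log(0.02),
    ("their", "children"): math.log(0.05),
    ("love", "there"): math.log(0.001),
    ("there", "children"): math.log(0.0005),
}
UNSEEN_LOGP = math.log(1e-6)  # crude back-off value for unseen bigrams

def context_score(prev_word, candidate, next_word):
    """Sum of the two bigram log-probabilities that involve the candidate."""
    left = BIGRAM_LOGP.get((prev_word, candidate), UNSEEN_LOGP)
    right = BIGRAM_LOGP.get((candidate, next_word), UNSEEN_LOGP)
    return left + right

def correct_in_context(words, i, candidates):
    """Pick the candidate for position i that fits best between its neighbours.

    Assumes 0 < i < len(words) - 1, so both neighbours exist.
    """
    return max(candidates,
               key=lambda c: context_score(words[i - 1], c, words[i + 1]))

sentence = ["parents", "love", "there", "children"]
print(correct_in_context(sentence, 2, ["there", "their"]))
```

Because "there" is itself a valid word, it has to be listed among the candidates; the language model, not the dictionary, decides against it.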
Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe. • Can humans understand ‘what is meant’ as opposed to ‘what is said/written’? • How? • http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
Summary • We can apply probabilistic modeling to NL problems like spell-checking • Noisy channel model, Bayesian method • Training priors and likelihoods on a corpus • Dynamic programming approaches allow us to solve large problems that can be decomposed into sub problems • e.g. Minimum Edit Distance algorithm • A number of Speech and Language tasks can be cast in this framework. • Generate alternatives using a generator • Select best/ Rank the alternatives using a model • If the generator and the model are encodable as FST • Decoding becomes • composition followed by search for best path.
Word Classes and Tagging • Words can be grouped into classes based on a number of criteria. • Application independent criterion • Syntactic class (Nouns, Verbs, Adjectives…) • Proper names (People names, country names…) • Dates, currencies • Application specific criterion • Product names (Ajax, Slurpee, Lexmark 3100) • Service names (7-cents plan, GoldPass) • Tagging: Categorizing words of a sentence into one of the classes.
Syntactic Classes in English: Open Class Words • Nouns: • Defined semantically: words for people, places, things • Defined syntactically: words that take determiners • Count nouns: nouns that can be counted • One book, two computers, hundred men • Mass nouns: nouns that represent homogeneous groups, can occur without articles. • snow, salt, milk, water, hair • Proper nouns; common nouns • Verbs: words for actions and processes • Hit, love, run, fly, differ, go • Adjectives: words for describing qualities and properties (modifiers) of objects • White, black, old, young, good, bad • Adverbs: words for describing modifiers of actions • Unfortunately, John walked home extremely slowly yesterday • Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)
Syntactic Classes in English: Closed Class Words • Closed Class words: • fixed set for a language • Typically high frequency words • Prepositions: relational words for describing relations among objects and events • In, on, before, by • Particles: looked up, throw out • Articles/Determiners: definite versus indefinite • Indefinite: a, an • Definite: the • Conjunctions: used to join two phrases, clauses, sentences. • Coordinating conjunctions: and, or, but • Subordinating conjunctions: that, since, because • Pronouns: shorthand to refer to objects and events. • Personal pronouns: he, she, it, they, us • Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s • Wh-pronouns: whose, what, who, whom, whomever • Auxiliary verbs: used to mark tense, aspect, polarity, mood, of an action • Tense: past, present, future • Aspect: completed or on-going • Polarity: negation • Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, do, have, may, might) • Copula: “be” connects a subject to a predicate (John is a teacher) • Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry), greetings (hello, goodbye).
Tagset • Tagset: set of tags to use; depends on the application. • Basic tags; tags with some morphology • Composition of a number of subtags • Agglutinative languages • Popular tagsets for English • Penn Treebank Tagset: 45 tags • CLAWS tagset: 61 tags • C7 tagset: 146 tags • How do we decide how many tags to use? • Application utility • Ease of disambiguation • Annotation consistency • “IN” tag in Penn Treebank tagset covers both subordinating conjunctions and prepositions • “TO” tag represents both the preposition “to” and the infinitival marker (“to read”) • Supertags: fold syntactic information into the tagset • of the order of 1000 tags
Tagging: Disambiguating Words • Three different models • ENGTWOL model (Karlsson et al. 1995) • Transformation-based model (Brill 1995) • Hidden Markov Model tagger • ENGTWOL tagger • Constraint-based tagger • 1,100 hand-written constraints to rule out invalid combinations of tags. • Use of probabilistic constraints and syntactic information • Transformation-based model • Start with the most likely assignment • Make note of the context when the most likely assignment is wrong. • Induce a transformation rule that corrects the most likely assignment to the correct tag in that context. • Rules can be seen as α → β | δ _ γ (rewrite tag α as β when it occurs between δ and γ) • Compilable into an FST
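Applying a single transformation rule of the form "rewrite α as β in a given context" might look like the sketch below. It covers rule application only, not rule induction; `apply_rule` and its parameters are hypothetical names, and the example rule (change NN to VB after TO) and the tag sequence are chosen purely for illustration.

```python
def apply_rule(tags, alpha, beta, left=None, right=None):
    """Rewrite tag alpha as beta wherever the left/right context tags match.

    A context of None means "any tag". Contexts are checked against the
    original tags, so one rewrite cannot trigger another in the same pass.
    """
    out = list(tags)
    for i, tag in enumerate(tags):
        left_ok = left is None or (i > 0 and tags[i - 1] == left)
        right_ok = right is None or (i + 1 < len(tags) and tags[i + 1] == right)
        if tag == alpha and left_ok and right_ok:
            out[i] = beta
    return out

# Initial "most likely" tags for: "is expected to race tomorrow"
initial = ["VBZ", "VBN", "TO", "NN", "NN"]
print(apply_rule(initial, "NN", "VB", left="TO"))
```

Only the NN that directly follows TO is rewritten; the final NN is left alone because its left neighbour is not TO.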
Again, the Noisy Channel Model • [Figure: Source → Noisy Channel → Decoder] • Input to channel: Part-of-speech sequence T • Output from channel: a word sequence W • Decoding task: find T’ = argmax_T P(T|W) • Using Bayes Rule: P(T|W) = P(W|T) P(T) / P(W) • And since P(W) doesn’t change for any hypothetical T • T’ = argmax_T P(W|T) P(T) • P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability
Stochastic Tagging: Markov Assumption • The tagging model is approximated using Markov assumptions. • T’ = argmax_T P(T) * P(W|T) • Markov (first-order) assumption: P(T) = Π_i P(ti|ti-1) • Independence assumption: P(W|T) = Π_i P(wi|ti) • Thus: T’ = argmax_T Π_i P(wi|ti) * P(ti|ti-1) • The probability distributions are estimated from an annotated corpus. • Maximum Likelihood Estimate • P(w|t) = count(w,t)/count(t) • P(ti|ti-1) = count(ti, ti-1)/count(ti-1) • Don’t forget to smooth the counts!! • There are other means of estimating these probabilities.
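The maximum likelihood estimates can be sketched on a toy annotated corpus. The three tagged sentences, the helper names and the add-k smoothing choice are all assumptions made for illustration; real taggers use much larger corpora and often other smoothing schemes.

```python
from collections import Counter

def count_events(tagged_sentences):
    """Collect the counts needed for P(w|t) and P(t_i|t_{i-1})."""
    emit, trans, tag_total, prev_total = Counter(), Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "BOS"  # beginning-of-sentence pseudo-tag
        for word, tag in sentence:
            emit[word, tag] += 1
            tag_total[tag] += 1
            trans[prev, tag] += 1
            prev_total[prev] += 1
            prev = tag
    return emit, trans, tag_total, prev_total

# Toy annotated corpus, invented for this example.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dogs", "NNS"), ("run", "VBP")],
]
emit, trans, tag_total, prev_total = count_events(corpus)
tagset = sorted(tag_total)

def p_emit(word, tag):
    """P(w|t) = count(w,t) / count(t)."""
    return emit[word, tag] / tag_total[tag]

def p_trans(tag, prev, k=0.0):
    """P(t_i|t_{i-1}) with optional add-k smoothing (k=0 is the plain MLE)."""
    return (trans[prev, tag] + k) / (prev_total[prev] + k * len(tagset))

print(p_emit("dog", "NN"))
print(p_trans("NN", "DT"))
print(p_trans("VBZ", "DT", k=1.0))
```

With k = 0 an unseen transition such as DT followed by VBZ gets probability zero, which is the problem the smoothing reminder on the slide refers to; with k = 1 it receives a small non-zero value.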
Best Path Search • Search for the best path pervades many Speech and NLP problems. • ASR: best path through a composition of acoustic, pronunciation and language models • Tagging: best path through a composition of lexicon and contextual model • Edit distance: best path through a search space set up by insertion, deletion and substitution operations. • In general: • Decisions/operations create a weighted search space • Search for the best sequence of decisions • Dynamic programming solution • Sometimes only the score is relevant. • Most often the path (sequence of states; derivation) is relevant.
Multi-stage decision problems • Sentence: The dog runs . • [Trellis figure; states include BOS, DT, NN, VB, NNS, VBZ, EOS] • P(DT|BOS) = 1 • P(NN|DT) = 0.9 • P(VB|DT) = 0.1 • P(NNS|NN) = 0.3 • P(VBZ|NN) = 0.7 • P( |NNS) = 0.3 • P( |VBZ) = 0.7 • P(EOS | ) = 1 • P(dog|NN) = 0.99 • P(dog|VB) = 0.01 • P(the|DT) = 0.999 • P(runs|NNS) = 0.63 • P(runs|VBZ) = 0.37 • P( | ) = 0.999 • P(NNS|VB) = 0.7 • P(VBZ|VB) = 0.3
Multi-stage decision problems • Sentence: The dog runs . • Find the state sequence through this space that maximizes the product of P(w|t)*P(t|t-1) • cost(BOS, EOS) = 1*cost(DT, EOS) • cost(DT,EOS) = max{P(the|DT)*P(NN|DT)*cost(NN,EOS), P(the|DT)*P(VB|DT)*cost(VB,EOS)} • [Trellis figure; states include BOS, DT, NN, VB, NNS, VBZ, EOS]
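The recursion cost(state, EOS) can be sketched with memoization. The probabilities are those listed for the trellis; the state whose symbol is blank in the transcript is assumed here to be a sentence-final punctuation state and is given the hypothetical name PUNCT. The function name `best_to_goal` is also an assumption.

```python
from functools import lru_cache

WORDS = ("the", "dog", "runs", ".")
PUNCT = "PUNCT"  # assumed name for the unnamed punctuation state

TRANS = {  # P(next_tag | tag)
    ("BOS", "DT"): 1.0,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
    ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
    ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
    ("NNS", PUNCT): 0.3, ("VBZ", PUNCT): 0.7,
    (PUNCT, "EOS"): 1.0,
}
EMIT = {  # P(word | tag)
    ("the", "DT"): 0.999,
    ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
    ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37,
    (".", PUNCT): 0.999,
}

@lru_cache(maxsize=None)
def best_to_goal(state, i):
    """Best (score, path) from `state` at word position i to EOS.

    BOS sits at position -1 and emits nothing; EOS must land just past
    the last word.
    """
    if state == "EOS":
        return (1.0 if i == len(WORDS) else 0.0), ("EOS",)
    if i >= len(WORDS):
        return 0.0, (state,)
    emit = 1.0 if state == "BOS" else EMIT.get((WORDS[i], state), 0.0)
    best_score, best_path = 0.0, (state,)
    for (prev, nxt), p in TRANS.items():
        if prev != state:
            continue
        score, path = best_to_goal(nxt, i + 1)
        score *= emit * p
        if score > best_score:
            best_score, best_path = score, (state,) + path
    return best_score, best_path

score, path = best_to_goal("BOS", -1)
print(path, score)
```

The cache is what makes this dynamic programming rather than plain recursion: each (state, position) pair is solved once, even though several paths reach it.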
Two ways of reasoning • Forward approach (Backward reasoning) • Compute the best way to get from a state to the goal state. • Backward approach (Forward reasoning) • Compute the best way from the source state to get to a state. • A combination of these two approaches is used in unsupervised training of HMMs. • Forward-backward algorithm (Appendix D)