
Language Modeling with N-Grams



  1. Language Modeling with N-Grams

  2. Language Modeling • A Language Model is a probabilistic model that allows us to compute the probability of a sentence. • Let w1:n denote the word sequence w1w2…wn. • What is the probability P(w1:n)?

  3. Why Language Modeling? • Determine which sequence of words is more likely • Predicting the next word given the previous words • Shannon game: • guessing the next letter given previous letters. • Applications in • Speech Recognition • Machine Translation • Context sensitive spelling check

  4. Language Modeling in Speech Recognition • Some sequences of words sound alike, but not all of them are good English sentences. • I went to a party • Eye went two a bar tea • Rudolph the red nose reindeer. • Rudolph the Red knows rain, dear. • Rudolph the Red Nose reigned here.

  5. Language Modeling in Machine Translation • Given a French sentence • On voit Jon à la télévision • And several possible English translations: • Jon appeared in TV. • In Jon appeared TV. • Jon appeared on TV. • Which one is more likely to be correct?

  6. Context Sensitive Spelling • Which is most probable? • … I think they’re okay … • … I think there okay … • … I think their okay … • Which is most probable? • … by the way, are they’re likely to … • … by the way, are there likely to … • … by the way, are their likely to …

  7. Axioms of Probability Theory • Suppose P(.) is a probability function, then 1. for any event E, 0≤P(E) ≤1. 2. P(S) = 1, where S is the sample space. 3. for any two mutually exclusive events E1 and E2, P(E1 ∪ E2) = P(E1) + P(E2) • Any function that satisfies the above three axioms is a probability function.

  8. Properties of Probability 1. P(¬E) = 1 – P(E) 2. If E1 and E2 are logically equivalent, then P(E1) = P(E2). • E1: Not all philosophers are more than six feet tall. • E2: Some philosopher is not more than six feet tall. Then P(E1) = P(E2). 3. P(E1, E2) ≤ P(E1).

  9. Conditional Probability • The probability of an event may change after knowing another event. The probability of A given B is denoted by P(A|B). • Example • P(W=space): the probability that a randomly selected word from an English text is ‘space’ • P(W=space | W’=outer): the probability of ‘space’ if the previous word is ‘outer’

  10. Chain Rule and Bayes Theorem • Chain Rule: P(A, B) = P(A)P(B|A) • Bayes Theorem: if P(E2) > 0, then P(E1|E2) = P(E2|E1)P(E1)/P(E2). This can be derived from the definition of conditional probability.
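Since the slide notes that Bayes' theorem follows from the definition of conditional probability, here is that one-step derivation written out; it uses nothing beyond what the slide states:

```latex
% Bayes' theorem from the definition of conditional probability:
% P(E1, E2) = P(E1 | E2) P(E2) = P(E2 | E1) P(E1)
\[
P(E_1 \mid E_2) \;=\; \frac{P(E_1, E_2)}{P(E_2)}
              \;=\; \frac{P(E_2 \mid E_1)\,P(E_1)}{P(E_2)},
\qquad P(E_2) > 0 .
\]
```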

  11. The n-gram Language Model • Using the Chain Rule P(A, B) = P(A)P(B|A) repeatedly:
P(w1:n) = P(w1:n-1) P(wn|w1:n-1)
= P(w1:n-2) P(wn-1|w1:n-2) P(wn|w1:n-1)
= P(w1:n-3) P(wn-2|w1:n-3) P(wn-1|w1:n-2) P(wn|w1:n-1)
= P(w1) P(w2|w1) P(w3|w1:2) P(w4|w1:3) … P(wn-1|w1:n-2) P(wn|w1:n-1)
• Can we compute P(w1:n) in the reverse order?

  12. Markov Assumption • w1:n-1 is called the history of wn • Sue swallowed the large green ______. • The statistics for the complete history are very sparse. • Markov Assumption: only the closest N−1 previous words are relevant: P(wn|w1:n-1) ≈ P(wn|wn-N+1:n-1) • Bigram: only the previous one word matters • Trigram: only the previous two words matter • Therefore P(w1:n) ≈ ∏k=1..n P(wk|wk-N+1:k-1)
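As a concrete illustration of the bigram case, here is a minimal sketch of that factorization in code; the probability table and its values are hypothetical placeholders, not numbers from the slides:

```python
from functools import reduce

# Hypothetical bigram conditional probabilities P(w_k | w_{k-1}),
# including sentence-boundary markers <s> and </s>.
bigram_prob = {
    ("<s>", "i"): 0.2,
    ("i", "went"): 0.3,
    ("went", "to"): 0.5,
    ("to", "a"): 0.4,
    ("a", "party"): 0.1,
    ("party", "</s>"): 0.6,
}

def bigram_sentence_prob(words, probs):
    """P(w_1:n) under the bigram Markov assumption:
    the product of P(w_k | w_{k-1}) over the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    pairs = zip(padded, padded[1:])
    return reduce(lambda p, pair: p * probs.get(pair, 0.0), pairs, 1.0)

print(bigram_sentence_prob("i went to a party".split(), bigram_prob))
# 0.2 * 0.3 * 0.5 * 0.4 * 0.1 * 0.6 = 0.00072
```

Real systems multiply many small probabilities, so in practice the product is computed as a sum of log probabilities to avoid underflow.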

  13. Examples: • Without Markov Assumption: • P(I went to a party) = ? • With Markov Assumption (n=3) • P(I went to a party) = ? • With Markov Assumption (n=2) • P(I went to a party) = ? • What does n=1 mean?
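For reference, the factorizations the slide asks about look like this (sentence-boundary markers omitted for brevity):

```latex
% Without the Markov assumption (full chain rule):
\[
P(\text{I went to a party}) = P(\text{I})\,P(\text{went}\mid\text{I})\,
P(\text{to}\mid\text{I went})\,P(\text{a}\mid\text{I went to})\,
P(\text{party}\mid\text{I went to a})
\]
% Trigram (n = 3): condition on at most the two previous words
\[
P(\text{I went to a party}) \approx P(\text{I})\,P(\text{went}\mid\text{I})\,
P(\text{to}\mid\text{I went})\,P(\text{a}\mid\text{went to})\,
P(\text{party}\mid\text{to a})
\]
% Bigram (n = 2): condition on the previous word only;
% n = 1 (unigram) would treat every word as independent of its history.
\[
P(\text{I went to a party}) \approx P(\text{I})\,P(\text{went}\mid\text{I})\,
P(\text{to}\mid\text{went})\,P(\text{a}\mid\text{to})\,P(\text{party}\mid\text{a})
\]
```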

  14. Parameters in N-gram Models • Suppose there are 20,000 words • very conservative assumption • Parameters • Bigram: 20,000 × 19,999 ≈ 400 million • Trigram: 20,000² × 19,999 ≈ 8 trillion • 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷ • Reliability vs. Relevance • as N increases, the N-gram becomes more relevant, but less reliable.
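A quick sanity check of those counts (a small sketch; V = 20,000 as assumed on the slide):

```python
V = 20_000  # vocabulary size assumed on the slide

# An N-gram model needs, for each of the V**(N-1) possible histories,
# a distribution over V words (V - 1 free parameters each).
for n in (2, 3, 4):
    params = V ** (n - 1) * (V - 1)
    print(f"{n}-gram: {params:.3g} parameters")
# 2-gram: 4e+08, 3-gram: 8e+12, 4-gram: 1.6e+17
```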

  15. Estimation of Probability • P(wn | w1:n-1) = P(w1:n)/P(w1:n-1) • Probabilities (subjective/objective) exist independent of data. • However, probabilities have to be estimated from data. • Maximum Likelihood Estimation • PMLE(wn | w1:n-1) = C(w1:n)/C(w1:n-1)

  16. Maximum Likelihood Estimation • MLE assigns the highest probability to the data. • Example: • training corpus: <s> a b a b </s> • MLE: P(a|b) = ½, P(b|a) = 1, P(a|<s>) = 1, P(</s>|b) = ½, so P(corpus) = 1·1·½·1·½ = ¼. • MLE is not suitable for NLP • MLE assigns 0 probability to unseen events. • One experiment shows that 23% of trigrams were previously unseen after 1.5M words.
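The toy example above can be reproduced directly from counts; a small sketch (names are illustrative):

```python
from collections import Counter

corpus = ["<s>", "a", "b", "a", "b", "</s>"]

unigram_counts = Counter(corpus[:-1])             # histories (the last token never acts as a history)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # adjacent word pairs

def p_mle(word, history):
    """P_MLE(word | history) = C(history, word) / C(history)."""
    return bigram_counts[(history, word)] / unigram_counts[history]

print(p_mle("a", "b"))     # 0.5
print(p_mle("b", "a"))     # 1.0
print(p_mle("a", "<s>"))   # 1.0
print(p_mle("</s>", "b"))  # 0.5
```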

  17. How to Estimate • p(z | xy) = ? • Suppose our training data includes … xya … … xyd … … xyd … but never xyz • Should we conclude p(a | xy) = 1/3? p(d | xy) = 2/3? p(z | xy) = 0/3? • NO! Absence of xyz might just be bad luck.

  18. Smoothing the Estimates • Should we conclude • p(a | xy) = 1/3? reduce this • p(d | xy) = 2/3? reduce this • p(z | xy) = 0/3? increase this • Discount the positive counts somewhat • Reallocate that probability to the zeroes

  19. Especially if the denominator is small … • 1/3 probably too high, 100/300 probably about right • Especially if the numerator is small … • 1/300 probably too high, 100/30000 probably about right

  20. Dealing with 0 Probability • Back-off • If the frequency count of an N-gram is 0, use the (N−1)-gram instead • Smoothing • Mix MLE with another probability distribution that is guaranteed not to give 0 probability.

  21. [Corpus count tables, courtesy of Patrick Pantel: a unigram table (438,699 tokens in total; e.g., DNS 298, do 384, NT 3313, phone 60, physical 123), a bigram table of words following “do” (384 tokens; e.g., anything 2, approach 1, for 5, have 5, not 97, offer 1, you 7), and a trigram table of words following “they do” (22 tokens: . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5).]

  22. [Same count tables as slide 21, with the relevant cells highlighted.] C(they,do,not) = 7, C(they,do) = 22, C(do,not) = 97, C(do) = 384, so: PMLE(not|they,do) = 7/22 = 0.318, PMLE(not|do) = 97/384 = 0.253, PMLE(offer|they,do) = 0/22 = 0, PMLE(have|they,do) = 2/22 = 0.091. Courtesy of Patrick Pantel

  23. Add-One Smoothing • V is the number of types we might see • the vocabulary size (unique words) • Add-One Smoothing (+1): P+1(w|h) = (C(h,w) + 1) / (C(h) + V) • Too much mass is reserved for 0-frequency N-grams • the value “1” added to every count is arbitrarily picked Courtesy of Patrick Pantel

  24. Vocabulary Size (V) = 10,543 • Counts for the context “they do” (22 tokens): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5 • P+1(not|they,do) = (7 + 1) / (22 + 10,543) ≈ 0.00076 • P+1(offer|they,do) = (0 + 1) / (22 + 10,543) ≈ 0.000095 • P+1(have|they,do) = (2 + 1) / (22 + 10,543) ≈ 0.00028 Courtesy of Patrick Pantel
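A sketch of the add-one computation using the counts and V from these slides (function and variable names are mine):

```python
# Counts of words observed after the context "they do" (from the slide)
they_do_counts = {
    ".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
    "on": 3, "open": 1, "so": 1, "under": 5,
}
V = 10_543                        # vocabulary size from the slide
N = sum(they_do_counts.values())  # 22 tokens with this context

def p_add_one(word):
    """Add-one (Laplace) estimate P+1(word | they, do)."""
    return (they_do_counts.get(word, 0) + 1) / (N + V)

for w in ("not", "offer", "have"):
    print(w, round(p_add_one(w), 6))
# not 0.000757, offer 9.5e-05, have 0.000284
```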

  25. Witten-Bell Discounting (unigrams) • T is the number of types in the training corpus (T < V) • N is the number of tokens in the training corpus • Idea: use the count of first-time events to estimate unseen events • each of the T types was seen for the first time exactly once, so “new word” events occurred T times Courtesy of Patrick Pantel

  26. Witten-Bell Discounting (unigrams) • Total mass reserved for all 0-frequency words: T / (N + T) • Where does this mass come from? Each seen word is discounted: PWB(w) = C(w) / (N + T) • Z = number of 0-frequency words = V – T • each of the Z unseen words gets T / (Z (N + T)) Courtesy of Patrick Pantel

  27. Witten-Bell Discounting (N-grams) • Condition T, N and Z on the N-gram context h • the unseen N-gram estimate is specific to a word history (context) • T(h) is the number of N-gram types with the given context • N(h) is the number of N-gram tokens with the given context • Z(h) is the number of 0-frequency N-grams with the given context Courtesy of Patrick Pantel

  28. [Same count tables as slide 21, annotated with per-context type and token counts: T(they,do) = 9, N(they,do) = 22; T(do) = 81, N(do) = 384; T() = 10,543, N() = 438,699.] Courtesy of Patrick Pantel

  29. Witten-Bell Discounting (N-grams) • For N-grams with non-zero frequency: PWB(w|h) = C(h,w) / (N(h) + T(h)) • Mass reserved for 0-frequency N-grams: T(h) / (N(h) + T(h)) • For 0-frequency N-grams: PWB(w|h) = T(h) / (Z(h) (N(h) + T(h))) Courtesy of Patrick Pantel

  30. For the context “they do” (T(they,do) = 9, N(they,do) = 22; counts: . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5; total N-gram types in the corpus: unigram 10,543, bigram 114,707, trigram 256,844): • PWB(not|they,do) = 7 / (22 + 9) ≈ 0.226 • PWB(have|they,do) = 2 / (22 + 9) ≈ 0.065 • PWB(offer|they,do) = 9 / (Z(they,do) · (22 + 9)) Courtesy of Patrick Pantel
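A sketch of the Witten-Bell computation in code. The counts, T and N come from the slides; Z is assumed here to be V − T(they,do) with V = 10,543, which the slides do not state explicitly:

```python
# Counts of words observed after the context "they do" (from the slides)
they_do_counts = {
    ".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
    "on": 3, "open": 1, "so": 1, "under": 5,
}
V = 10_543                        # vocabulary size, as on slide 24
N = sum(they_do_counts.values())  # tokens with this context: 22
T = len(they_do_counts)           # types with this context: 9
Z = V - T                         # assumed number of unseen continuations

def p_witten_bell(word):
    """Witten-Bell estimate P_WB(word | they, do)."""
    c = they_do_counts.get(word, 0)
    if c > 0:
        return c / (N + T)        # discounted seen N-gram
    return T / (Z * (N + T))      # share of the reserved mass

for w in ("not", "have", "offer"):
    print(w, round(p_witten_bell(w), 6))
# not 0.225806, have 0.064516, offer 2.8e-05
```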

  31. Good-Turing Estimation • Replace the raw count r by the adjusted count fr = (r + 1) Nr+1 / Nr, and estimate PGT(w1, …, wn) = fr / N • where • r = C(w1, …, wn) • Nr = the number of n-grams that occurred r times • This should only be used when r is small.

  32. Example • Corpus: a b a b • Observed bigrams: • b a: 1 • a b: 2 • N0 = 2 (the unseen bigrams a a and b b), N1 = 1, N2 = 1, N = 3 • Probability estimations: • f0 = N1/N0 = 0.5
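The same bookkeeping as a short sketch (variable names are mine):

```python
from collections import Counter

tokens = ["a", "b", "a", "b"]
bigrams = Counter(zip(tokens, tokens[1:]))   # {('a','b'): 2, ('b','a'): 1}

N = sum(bigrams.values())                    # 3 bigram tokens
vocab = set(tokens)
possible = len(vocab) ** 2                   # 4 possible bigram types over {a, b}

# N_r = number of bigram types that occurred exactly r times
count_of_counts = Counter(bigrams.values())
count_of_counts[0] = possible - len(bigrams)  # N0 = 2 (the unseen 'a a' and 'b b')

def adjusted_count(r):
    """Good-Turing adjusted count f_r = (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * count_of_counts.get(r + 1, 0) / count_of_counts[r]

print(adjusted_count(0))        # f0 = 1 * N1 / N0 = 0.5
print(adjusted_count(0) / N)    # P_GT of each unseen bigram ≈ 0.167
```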

  33. Backing off • Estimate the probability with a linear combination of lower-order estimates, which are less likely to be 0. • Simple linear interpolation: P(wn|wn-2:n-1) = λ1 PMLE(wn|wn-2:n-1) + λ2 PMLE(wn|wn-1) + λ3 PMLE(wn), where λ1 + λ2 + λ3 = 1
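A minimal sketch of simple linear interpolation, assuming precomputed MLE tables; the tables, their values, and the λ weights are illustrative placeholders (in practice the weights are tuned on held-out data):

```python
# Hypothetical MLE estimates at each order (illustrative numbers only)
p_trigram = {("they", "do", "not"): 0.318}
p_bigram = {("do", "not"): 0.253}
p_unigram = {"not": 0.0002}

# Interpolation weights; they must sum to 1.
l1, l2, l3 = 0.6, 0.3, 0.1

def p_interp(w, u, v):
    """P(w | u, v) as a weighted mix of trigram, bigram and unigram MLEs."""
    return (l1 * p_trigram.get((u, v, w), 0.0)
            + l2 * p_bigram.get((v, w), 0.0)
            + l3 * p_unigram.get(w, 0.0))

print(p_interp("not", "they", "do"))   # 0.6*0.318 + 0.3*0.253 + 0.1*0.0002 ≈ 0.267
```

Because the unigram term is never zero for in-vocabulary words, the interpolated estimate never assigns probability 0 to an unseen trigram.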

  34. Evaluation of Language Models • Best method: use the language model in an application, e.g., spelling check, machine translation, speech recognition, …, and measure its effect on the application’s performance • Perplexity: PP(W) = P(w1:n)^(−1/n) on held-out test data; the language model that assigns the higher probability to the test data (equivalently, the lower perplexity) is better.
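A sketch of the perplexity computation, for any model exposed as a function prob(word, history) that returns a conditional probability (the uniform stand-in model below is purely illustrative):

```python
import math

def perplexity(words, prob):
    """PP(W) = P(w_1:n)^(-1/n), computed in log space for numerical stability.
    `prob(word, history)` should return the model's conditional probability."""
    log_prob = 0.0
    for k, w in enumerate(words):
        log_prob += math.log(prob(w, words[:k]))
    return math.exp(-log_prob / len(words))

# Toy stand-in model: every word gets probability 0.1 regardless of history,
# so the perplexity comes out at (approximately) 10.
print(perplexity("i went to a party".split(), lambda w, h: 0.1))  # ≈ 10.0
```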
