Language Modeling with N-Grams
Language Modeling • A Language Model is a probabilistic model that allows us to compute the probability of a sentence. • Let w1:n denote the word sequence w1w2…wn. • What is the probability P(w1:n)?
Why Language Modeling? • Determining which sequence of words is more likely • Predicting the next word given the previous words • Shannon game: guessing the next letter given the previous letters • Applications in • Speech Recognition • Machine Translation • Context-sensitive spelling check
Language Modeling in Speech Recognition • Some sequences of words sound alike, but not all of them are good English sentences. • I went to a party • Eye went two a bar tea • Rudolph the red nose reindeer. • Rudolph the Red knows rain, dear. • Rudolph the Red Nose reigned here.
Language Modeling in Machine Translation • Given a French sentence • On voit Jon à la télévision • And several possible English translations: • Jon appeared in TV. • In Jon appeared TV. • Jon appeared on TV. • Which one is more likely to be correct?
Context Sensitive Spelling • Which is most probable? • … I think they’re okay … • … I think there okay … • … I think their okay … • Which is most probable? • … by the way, are they’re likely to … • … by the way, are there likely to … • … by the way, are their likely to …
Axioms of Probability Theory • Suppose P(·) is a probability function; then: 1. For any event E, 0 ≤ P(E) ≤ 1. 2. P(S) = 1, where S is the sample space. 3. For any two mutually exclusive events E1 and E2, P(E1 ∪ E2) = P(E1) + P(E2). • Any function that satisfies the above three axioms is a probability function.
Properties of Probability 1. P(¬E) = 1 − P(E) 2. If E1 and E2 are logically equivalent, then P(E1) = P(E2). • E1: Not all philosophers are more than six feet tall. • E2: Some philosopher is not more than six feet tall. Then P(E1) = P(E2). 3. P(E1, E2) ≤ P(E1).
Conditional Probability • The probability of an event may change after knowing another event. The probability of A given B is denoted by P(A|B). • Example • P( W=space ) the probability of a randomly selected word from an English text is ‘space’ • P( W=space | W’=outer) the probability of ‘space’ if the previous word is ‘outer’
Chain Rule and Bayes Theorem • Chain Rule: P(A, B)=P(A)P(B|A) • Bayes Theorem If P(E2)>0, then P(E1|E2)=P(E2|E1)P(E1)/P(E2) This can be derived from the definition of conditional probability.
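Spelling out that derivation: applying the chain rule in both orders gives P(E1, E2) = P(E1)P(E2|E1) = P(E2)P(E1|E2); dividing both sides by P(E2) (assumed positive) yields P(E1|E2) = P(E2|E1)P(E1)/P(E2).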
The n-gram Language Model Using the Chain Rule: P(A,B)=P(A)P(B|A) P(w1:n) =P(w1:n-1)P(wn|w1:n-1) = P(w1:n-2)P(wn-1|w1:n-2)P(wn|w1:n-1) = P(w1:n-3)P(wn-2|w1:n-3)P(wn-1|w1:n-2)P(wn|w1:n-1) = P(w1)P(w2|w1) P(w3|w1:2) P(w4|w1:3) …… P(wn-1|w1:n-2)P(wn|w1:n-1) Can we compute P(w1:n) in the reverse order?
Markov Assumption • w1:n-1 is called the history of wn • Sue swallowed the large green ______. • The statistics for the complete history are very sparse. • Markov Assumption: only the previous N−1 words are relevant: P(wn|w1:n-1) ≈ P(wn|wn-N+1:n-1) • Bigram (N=2): only the previous one word matters • Trigram (N=3): only the previous two words matter • Therefore P(w1:n) ≈ ∏k=1..n P(wk|wk-N+1:k-1)
Examples: • Without Markov Assumption: • P(I went to a party) = ? • With Markov Assumption (n=3) • P(I went to a party) = ? • With Markov Assumption (n=2) • P(I went to a party) = ? • What does n=1 mean?
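As a sketch of the decompositions asked for above (ignoring sentence-boundary markers):
• Without the Markov assumption: P(I went to a party) = P(I) P(went|I) P(to|I went) P(a|I went to) P(party|I went to a)
• Trigram (n = 3): ≈ P(I) P(went|I) P(to|I went) P(a|went to) P(party|to a)
• Bigram (n = 2): ≈ P(I) P(went|I) P(to|went) P(a|to) P(party|a)
• Unigram (n = 1): ≈ P(I) P(went) P(to) P(a) P(party), i.e., each word is treated as independent of its history.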
Parameters in N-gram Models • Suppose there are 20,000 words • very conservative assumption • Parameters • Bigram: 20,000 × 19,999 ≈ 400M • Trigram: 20,000² × 19,999 ≈ 8 trillion • 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷ • Reliability vs. Relevance • as N increases, the N-gram becomes more relevant, but less reliable.
Estimation of Probability • P(wn | w1:n-1) = P(w1:n)/P(w1:n-1) • Probabilities (subjective/objective) exist independently of data. • However, probabilities have to be estimated from data. • Maximum Likelihood Estimation • PMLE(wn | w1:n-1) = C(w1:n)/C(w1:n-1)
Maximum Likelihood Estimation • MLE assigns the highest probability to the training data. • Example: • training corpus: <s> a b a b </s> • MLE: P(a|b) = ½, P(b|a) = 1, P(a|<s>) = 1, P(</s>|b) = ½ • P(corpus) = 1 × 1 × ½ × 1 × ½ = ¼ • MLE is not suitable for NLP • MLE assigns 0 probability to unseen events. • In one experiment, 23% of trigrams were previously unseen even after training on 1.5M words.
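A minimal Python sketch of this toy bigram MLE (the corpus and probabilities match the example above; the helper name p_mle is just for illustration):

```python
from collections import Counter

# Toy corpus from the slide, with sentence-boundary markers.
corpus = ["<s>", "a", "b", "a", "b", "</s>"]

context_counts = Counter(corpus[:-1])              # C(prev): counts of history words
bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(prev, word): adjacent pairs

def p_mle(word, prev):
    """P_MLE(word | prev) = C(prev, word) / C(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p_mle("a", "<s>"))   # 1.0
print(p_mle("b", "a"))     # 1.0
print(p_mle("a", "b"))     # 0.5
print(p_mle("</s>", "b"))  # 0.5
print(p_mle("b", "b"))     # 0.0 -- an unseen bigram gets zero probability
```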
How to Estimate p(z | xy) = ? • Suppose our training data includes … xya … … xyd … … xyd … but never xyz • Should we conclude p(a | xy) = 1/3? p(d | xy) = 2/3? p(z | xy) = 0/3? • NO! Absence of xyz might just be bad luck.
Smoothing the Estimates • Should we conclude • p(a | xy) = 1/3? reduce this • p(d | xy) = 2/3? reduce this • p(z | xy) = 0/3? increase this • Discount the positive counts somewhat • Reallocate that probability to the zeroes
Especially if the denominator is small … • 1/3 is probably too high, 100/300 is probably about right • Especially if the numerator is small … • 1/300 is probably too high, 100/30000 is probably about right
Dealing with 0 Probability • Back-off • If the frequency count of the N-gram is 0, use the (N−1)-gram estimate instead • Smoothing • Mix the MLE with another probability distribution that is guaranteed not to give 0 probability.
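A minimal sketch of the back-off idea, assuming trigram/bigram/unigram count dictionaries like the ones shown on the following slides (function and argument names are illustrative; a full scheme such as Katz back-off would also discount the higher-order estimates so that the probabilities sum to 1):

```python
def p_backoff(w, u, v, tri, bi, uni, n_tokens):
    """Estimate P(w | u, v), backing off from trigram to bigram to unigram MLE."""
    if tri.get((u, v, w), 0) > 0:
        return tri[(u, v, w)] / bi[(u, v)]      # trigram MLE
    if bi.get((v, w), 0) > 0:
        return bi[(v, w)] / uni[v]              # back off to bigram MLE
    return uni.get(w, 0) / n_tokens             # back off to unigram MLE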
Example counts from a corpus (courtesy of Patrick Pantel) • Unigram counts (438,699 tokens in total), e.g.: DNS 298, DNS/WINS 2, dns1.isp.net 1, dnsadmin.exe 2, DNSName 1, DNSServer 1, do 384, NT 3313, pertinent 2, pervasiveness 1, Ph33r 3, phase 24, phased 1, phone 60, Phonebook 23, phrase 9, phrases 2, physical 123, PhysicalDisk 1 • Bigram counts for words following “do” (C(do) = 384), e.g.: anything 2, approach 1, for 5, have 5, I 1, If 4, Link 1, list 1, no 1, not 97, Novell 1, offer 1, workitem 1, you 7, your 1 • Trigram counts for words following “they do” (C(they, do) = 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5
Maximum likelihood estimates from these counts (courtesy of Patrick Pantel) • C(they, do, not) = 7, C(do, not) = 97 • PMLE(not|they,do) = 7/22 = 0.318 • PMLE(not|do) = 97/384 = 0.253 • PMLE(offer|they,do) = 0/22 = 0 • PMLE(have|they,do) = 2/22 = 0.091
Add-One Smoothing • V is the number of types we might see • the vocabulary size (unique words) • Add-One Smoothing (+1): P+1(wn|w1:n-1) = (C(w1:n) + 1) / (C(w1:n-1) + V) • Too much mass is reserved for 0-frequency N-grams • the value “1” added to each count is arbitrary Courtesy of Patrick Pantel
Add-One Smoothing example (courtesy of Patrick Pantel) • Vocabulary size V = 10,543 • Trigram counts for the context “they do” (C(they, do) = 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5 • P+1(not|they,do) = (7 + 1)/(22 + 10,543) ≈ 0.00076 • P+1(offer|they,do) = (0 + 1)/(22 + 10,543) ≈ 0.000095 • P+1(have|they,do) = (2 + 1)/(22 + 10,543) ≈ 0.00028
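A small Python sketch of these add-one computations, using the counts and vocabulary size from the slide (variable names are illustrative):

```python
# Add-one smoothing for P(w | they, do) with the counts from the slide.
V = 10543                  # vocabulary size (types)
context_count = 22         # C(they, do)
trigram_counts = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
                  "on": 3, "open": 1, "so": 1, "under": 5}

def p_add_one(word):
    """P+1(word | they, do) = (C(they, do, word) + 1) / (C(they, do) + V)."""
    return (trigram_counts.get(word, 0) + 1) / (context_count + V)

print(round(p_add_one("not"), 5))    # 8/10565  ~ 0.00076
print(round(p_add_one("offer"), 6))  # 1/10565  ~ 0.000095
print(round(p_add_one("have"), 5))   # 3/10565  ~ 0.00028
```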
Witten-Bell Discounting (unigrams) • T is the number of types in the training corpus (T < V) • N is the number of tokens in the training corpus • Idea: estimate the probability of unseen events by the probability of seeing a new word type in the training data • we saw a new word type T times while reading N tokens, so the next event is new with estimated probability T/(N + T) Courtesy of Patrick Pantel
Witten-Bell Discounting (unigrams) • Total mass reserved for all 0-frequency words: T/(N + T) • Where does this mass come from? The counts of seen words are discounted: PWB(wi) = C(wi)/(N + T) instead of C(wi)/N • Z = number of 0-frequency words = V − T • Each unseen word gets PWB(w) = T / (Z(N + T)) Courtesy of Patrick Pantel
Witten-Bell Discounting (N-grams) • Condition T, N and Z on the N-gram context (history) • the unseen N-gram estimate is specific to a word history (context) • T(h) is the number of N-gram types with the given context • N(h) is the number of N-gram tokens with the given context • Z(h) is the number of 0-frequency N-grams with the given context Courtesy of Patrick Pantel
Conditioning on the context (courtesy of Patrick Pantel) • N(they, do) = 22, N(do) = 384, N() = 438,699 (tokens with each context) • T(they, do) = 9, T(do) = 81, T() = 10,543 (types with each context)
Witten-Bell Discounting (N-grams) • For N-grams with non-zero frequency: PWB(wn|h) = C(h, wn) / (N(h) + T(h)) • Mass reserved for 0-frequency N-grams: T(h) / (N(h) + T(h)) • For 0-frequency N-grams: PWB(wn|h) = T(h) / (Z(h)(N(h) + T(h))) Courtesy of Patrick Pantel
Witten-Bell example for the context “they do” (courtesy of Patrick Pantel) • T(they, do) = 9, N(they, do) = 22, trigram counts: . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5 • Total N-gram types in the corpus: unigram 10,543; bigram 114,707; trigram 256,844 • PWB(not|they,do) = 7/(22 + 9) ≈ 0.226 • PWB(have|they,do) = 2/(22 + 9) ≈ 0.065 • PWB(offer|they,do) = 9/(10,534 × 31) ≈ 2.8 × 10⁻⁵, with Z(they, do) = V − T = 10,543 − 9 = 10,534
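A Python sketch of these Witten-Bell estimates for the “they do” context, using the counts above and taking Z = V − T (names are illustrative):

```python
# Witten-Bell estimates for P(w | they, do) using the slide's counts.
V = 10543                  # vocabulary size
trigram_counts = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
                  "on": 3, "open": 1, "so": 1, "under": 5}
N = sum(trigram_counts.values())   # tokens with this context: 22
T = len(trigram_counts)            # types seen in this context: 9
Z = V - T                          # words never seen in this context: 10534

def p_wb(word):
    """Seen words get C/(N+T); the reserved mass T/(N+T) is split evenly over the Z unseen words."""
    c = trigram_counts.get(word, 0)
    return c / (N + T) if c > 0 else T / (Z * (N + T))

print(round(p_wb("not"), 3))    # 7/31 ~ 0.226
print(round(p_wb("have"), 3))   # 2/31 ~ 0.065
print(p_wb("offer"))            # 9/(10534*31) ~ 2.8e-05
```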
Good-Turing Estimation • Adjusted count: r* = (r + 1) Nr+1 / Nr • Estimated probability of an n-gram with count r: PGT = r*/N • where • r = C(w1, …, wn) • Nr = the number of n-grams that occurred r times • N = the total number of n-gram tokens • This should only be used when r is small.
Example • Corpus: a b a b • Observed bigrams: • b a: 1 • a b: 2 • N0 = 2, N1 = 1, N2 = 1, N = 3 • Adjusted counts: • f0 = N1/N0 = 0.5 • f1 = 2N2/N1 = 2 • Probability of each unseen bigram: f0/N = 0.5/3 ≈ 0.17
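A Python sketch of these Good-Turing adjusted counts for the toy corpus, assuming the vocabulary is {a, b} so that there are 4 possible bigram types (names are illustrative):

```python
from collections import Counter

# Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r for the toy corpus.
tokens = ["a", "b", "a", "b"]
bigrams = Counter(zip(tokens, tokens[1:]))   # {('a','b'): 2, ('b','a'): 1}

vocab = set(tokens)
possible_bigrams = len(vocab) ** 2           # 4 possible bigram types
N_r = Counter(bigrams.values())              # N_1 = 1, N_2 = 1
N_r[0] = possible_bigrams - len(bigrams)     # N_0 = 2 (unseen: aa, bb)
N = sum(bigrams.values())                    # 3 bigram tokens

def adjusted_count(r):
    """r* = (r + 1) * N_{r+1} / N_r; only reliable for small r with N_{r+1} > 0."""
    return (r + 1) * N_r.get(r + 1, 0) / N_r[r]

print(adjusted_count(0))      # f0 = 1 * N_1 / N_0 = 0.5
print(adjusted_count(1))      # f1 = 2 * N_2 / N_1 = 2.0
print(adjusted_count(0) / N)  # probability of each unseen bigram ~ 0.17
```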
Backing off • Estimate the probability with a linear combination of lower-order estimates, which are less likely to be 0. • Simple linear interpolation: PLI(wn|wn-2, wn-1) = λ1 PMLE(wn|wn-2, wn-1) + λ2 PMLE(wn|wn-1) + λ3 PMLE(wn), where λ1 + λ2 + λ3 = 1
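A minimal sketch of simple linear interpolation, assuming the per-order MLE estimates are already available; the weights are arbitrary placeholders (in practice they are tuned on held-out data), and the unigram value below is made up for illustration:

```python
# Simple linear interpolation of trigram, bigram and unigram estimates.
def p_interp(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u, v) ~ l1*P(w|u,v) + l2*P(w|v) + l3*P(w); the weights sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Using the MLE values computed earlier for "not" (the unigram value is a placeholder):
print(p_interp(p_tri=7/22, p_bi=97/384, p_uni=0.001))   # ~ 0.267
```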
Evaluation of Language Models • Best method: • Use the language model in an application, e.g., spelling check, machine translation, speech recognition, … • Perplexity: PP(w1:N) = P(w1:N)^(−1/N) • the language model that assigns the higher probability to the test data (i.e., the lower perplexity) is better.
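A minimal sketch of the perplexity computation, assuming we already have some model's conditional probabilities for each word of a test sequence (the probabilities below are placeholders; lower perplexity means the model assigns higher probability to the test data):

```python
import math

def perplexity(cond_probs):
    """Perplexity = P(w_1..w_N) ** (-1/N), computed in log space for numerical stability."""
    n = len(cond_probs)
    log_prob = sum(math.log(p) for p in cond_probs)   # log P(w_1..w_N)
    return math.exp(-log_prob / n)

# Placeholder conditional probabilities for a 5-word test sentence:
print(perplexity([0.1, 0.2, 0.05, 0.1, 0.3]))   # ~ 8.0
```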