N-gram model limitations
• Q: What do we do about N-grams which were not in our training corpus?
• A: We distribute some probability mass from seen N-grams to previously unseen N-grams.
• This leads to another question: how do we do this?
Unsmoothed bigrams
• Recall that we use unigram and bigram counts to compute bigram probabilities:
• P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
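A minimal sketch of this computation (not from the text); the function name and the toy counts below are made up purely to exercise the formula.

```python
# Sketch: unsmoothed (maximum-likelihood) bigram probabilities from counts.
from collections import Counter

def bigram_prob(w_prev, w, bigram_counts, unigram_counts):
    """P(w | w_prev) = C(w_prev w) / C(w_prev); 0.0 if w_prev was never seen."""
    denom = unigram_counts[w_prev]
    if denom == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / denom

# Hypothetical toy counts, just to exercise the formula.
unigram_counts = Counter({"to": 3, "eat": 2})
bigram_counts = Counter({("to", "eat"): 2})
print(bigram_prob("to", "eat", bigram_counts, unigram_counts))  # 2/3
print(bigram_prob("eat", "to", bigram_counts, unigram_counts))  # 0.0 (unseen bigram)
```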
How many bigrams are in a text?
• Suppose a text has N words; how many bigram tokens does it contain?
• Exactly N: we assume <s> appears before the first word, so that the word in initial position also gets a bigram probability.
• Example (5 words):
  words: <s> w1 w2 w3 w4 w5
  bigrams: <s> w1, w1 w2, w2 w3, w3 w4, w4 w5
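A small sketch of the padding step (the function name and example tokens are illustrative, not from the text):

```python
# Sketch: with <s> prepended, a text of N word tokens yields exactly N bigrams.
def bigrams_with_start(words):
    padded = ["<s>"] + list(words)
    return list(zip(padded, padded[1:]))

tokens = ["w1", "w2", "w3", "w4", "w5"]
print(bigrams_with_start(tokens))
# [('<s>', 'w1'), ('w1', 'w2'), ('w2', 'w3'), ('w3', 'w4'), ('w4', 'w5')]
print(len(bigrams_with_start(tokens)) == len(tokens))  # True
```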
How many possible bigrams are there?
• With a vocabulary of V word types, there are V² possible bigrams.
Example description
• Berkeley Restaurant Project corpus
• approximately 10,000 sentences
• 1616 word types
• tables will show counts or probabilities for 7 word types, carefully chosen so that the 7-by-7 matrix is not too sparse
• notice that many counts in the first table are zero (25 zeros of 49 entries)
Unsmoothed N-grams: bigram counts (figure 6.4 from text)
Computing probabilities
• Recall the formula (we normalize by unigram counts):
• P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
• Using the unigram counts c(to) = 3256 and c(eat) = 938:
  p(eat | to) = c(to eat) / c(to) = 860 / 3256 = .26
  p(to | eat) = c(eat to) / c(eat) = 2 / 938 = .0021
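A quick check of the arithmetic, using only the counts quoted on this slide:

```python
# Verifying the two worked examples above.
c_to, c_eat = 3256, 938          # unigram counts from the slide
c_to_eat, c_eat_to = 860, 2      # bigram counts from the slide
print(round(c_to_eat / c_to, 2))   # 0.26   -> p(eat | to)
print(round(c_eat_to / c_eat, 4))  # 0.0021 -> p(to | eat)
```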
Unsmoothed N-grams: bigram probabilities (figure 6.5 from text): p(w_n | w_{n-1})
What do zeros mean?
• Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus.
• So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?
Let’s rephrase the question
• How can we ensure that none of the possible bigrams have zero counts/probabilities?
• The process of spreading the probability mass around to all possible bigrams is called smoothing.
• We start with a very simple model, called add-one smoothing.
Add-one smoothing
• Basic idea: add one to the actual counts, across the board.
• This ensures that there are no unigrams or bigrams with zero counts.
• Typically this adds too much probability mass to non-occurring bigrams.
Add-one smoothing: computing the probabilities
• Unadjusted probabilities:
• P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
• Adjusted probabilities:
• P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]
• Notes
• V is the total number of word types in the vocabulary
• In the numerator we add one to the count of each bigram, just as we do with the unigram counts
• In the denominator we add V, since we are adding one more bigram token of the form w_{n-1}w, for each w in our vocabulary
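A sketch of the adjusted formula (the function name and data structures are assumptions, not from the text); the usage example plugs in the Berkeley Restaurant Project counts quoted on the following slides.

```python
# Sketch: add-one (Laplace) smoothed bigram probabilities.
from collections import Counter

def add_one_prob(w_prev, w, bigram_counts, unigram_counts, V):
    """P*(w | w_prev) = [C(w_prev w) + 1] / [C(w_prev) + V]"""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

# Counts quoted elsewhere in these slides: V = 1616 word types,
# c(to) = 3256, c(to eat) = 860, c(eat) = 938, c(eat to) = 2.
uni = Counter({"to": 3256, "eat": 938})
bi = Counter({("to", "eat"): 860, ("eat", "to"): 2})
print(round(add_one_prob("to", "eat", bi, uni, 1616), 2))  # ~0.18
print(round(add_one_prob("eat", "to", bi, uni, 1616), 4))  # ~0.0012
```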
A simple approach to smoothing: add-one smoothed bigram counts (figure 6.6 from text)
Back to the probabilities
• Recall the formula for the adjusted probabilities:
• P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]
• With 1 added to each bigram count and V = 1616 added to each unigram count:
  p*(eat | to) = [c(to eat) + 1] / [c(to) + V] = 861 / 4872 = .18 (was .26)
  p*(to | eat) = [c(eat to) + 1] / [c(eat) + V] = 3 / 2554 = .0012 (was .0021)
  p*(eat | lunch) = [c(lunch eat) + 1] / [c(lunch) + V] = 1 / 2075 = .00048 (was 0)
  p*(eat | want) = [c(want eat) + 1] / [c(want) + V] = 1 / 2931 = .00034 (was 0)
A simple approach to smoothing: add-one smoothed bigram probabilities (figure 6.7 from text)
Discounting
• We define the discount to be the ratio of new and old counts (in our case, smoothed and unsmoothed counts).
• Discounts for add-one smoothing for this example:
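A small sketch of the discount computation, assuming the usual reconstituted-count form of add-one smoothing, c* = (c + 1) · C(w_{n-1}) / (C(w_{n-1}) + V). The counts below are hypothetical, chosen only to show the order of magnitude; they are not taken from the text's tables.

```python
# Sketch: discount d = c* / c for an add-one smoothed bigram count.
def add_one_discount(c, c_prev, V):
    """c: bigram count, c_prev: prefix-word count, V: vocabulary size."""
    c_star = (c + 1) * c_prev / (c_prev + V)   # reconstituted add-one count
    return c_star / c

# Hypothetical counts: a bigram seen 10 times after a prefix word seen 200 times.
print(round(add_one_discount(10, 200, 1616), 3))  # ~0.121, i.e. cut by a factor of ~8
```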
What does the discount tell us?
• The discount tells us where the probability mass is coming from.
• "Looking at the discount … shows us how strikingly the counts for each prefix-word have been reduced; the bigrams starting with Chinese were discounted by a factor of 8!" [p. 209]
Witten-Bell discounting
• Another approach to smoothing.
• Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211]
• “How can we compute the probability of seeing an N-gram for the first time? By counting the number of times we saw N-grams for the first time in our training corpus. This is very simple to produce since the count of “first-time” N-grams is just the number of N-gram types we saw in the data (since we had to see each type for the first time exactly once).” [p. 211]
How much probability mass can we reassign?
• The total probability mass assigned to all (as yet) unseen N-grams is T / (N + T), where
  T is the total number of observed types (not the vocabulary size)
  N is the number of tokens
• “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211]
• The formula above estimates “the probability of a new type event occurring.” [p. 211]
Distribution of probability mass
• This probability mass is distributed evenly amongst the unseen N-grams.
• Z = number of zero-count N-grams.
• p_i* = [ T / (N + T) ] / Z = T / [ Z(N + T) ]
Discounting (unigram case)
• This probability mass has to come from somewhere.
• Recall the unsmoothed probability is p_i = c_i / N (N is the number of tokens).
• The smoothed (discounted) probability for non-zero-count unigrams is p_i* = c_i / (N + T).
• We can also give smoothed counts (Z is the number of unigram types with zero counts):
  for zero-count unigrams: c_i* = (T / Z) · N / (N + T)
  for non-zero-count unigrams: c_i* = c_i · N / (N + T)
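Putting the last three slides together, a sketch of Witten-Bell discounting for unigrams; the function name and the explicit `vocabulary` argument (needed to determine Z) are assumptions made for this illustration.

```python
# Sketch: Witten-Bell discounted unigram probabilities.
from collections import Counter

def witten_bell_unigram(counts, vocabulary):
    N = sum(counts.values())                      # number of tokens
    T = sum(1 for c in counts.values() if c > 0)  # number of observed types
    Z = len(vocabulary) - T                       # number of zero-count types
    probs = {}
    for w in vocabulary:
        c = counts.get(w, 0)
        if c > 0:
            probs[w] = c / (N + T)           # discounted probability for seen types
        else:
            probs[w] = T / (Z * (N + T))     # evenly shared mass for unseen types
    return probs

# Hypothetical example with a 5-word vocabulary:
vocab = {"a", "b", "c", "d", "e"}
p = witten_bell_unigram(Counter({"a": 3, "b": 1}), vocab)
print(round(sum(p.values()), 6))  # 1.0 -- the smoothed distribution still sums to one
```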
Discounting (bigram case)
• Total probability mass being redistributed (conditioned on the first word of the bigram):
  Σ_{i : c(w_x w_i) = 0} p*(w_i | w_x) = T(w_x) / [ N(w_x) + T(w_x) ]
• Distribute this probability mass to the unseen bigrams.
• Z(w_x) is the number of bigram types with first word w_x that have zero counts.
• The smoothed (discounted) probabilities are:
  p*(w_i | w_x) = T(w_x) / [ Z(w_x) (N(w_x) + T(w_x)) ]   if c(w_x w_i) = 0
  p*(w_i | w_x) = c(w_x w_i) / [ c(w_x) + T(w_x) ]   if c(w_x w_i) > 0
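A sketch of the conditional (bigram) case under the same assumptions; here N(w_x) is approximated by summing the bigram counts that start with w_x, which equals c(w_x) whenever every occurrence of w_x is followed by another token.

```python
# Sketch: Witten-Bell discounted bigram probability p*(w | w_prev).
def witten_bell_bigram(w_prev, w, bigram_counts, vocabulary):
    followers = {v: c for (u, v), c in bigram_counts.items() if u == w_prev}
    N = sum(followers.values())   # N(w_prev): tokens observed after w_prev
    T = len(followers)            # T(w_prev): distinct types observed after w_prev
    Z = len(vocabulary) - T       # Z(w_prev): types never observed after w_prev
    c = followers.get(w, 0)
    if c > 0:
        return c / (N + T)
    return T / (Z * (N + T))

# Hypothetical usage:
bi = {("to", "eat"): 3, ("to", "go"): 1}
vocab = {"to", "eat", "go", "lunch"}
print(witten_bell_bigram("to", "eat", bi, vocab))    # 3/6 = 0.5
print(witten_bell_bigram("to", "lunch", bi, vocab))  # 2/(2*6) ≈ 0.167
```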
Witten-Bell: discounted counts
Witten-Bell smoothed (discounted) bigram counts (figure 6.9 from text).
Notice that counts which were 0 unsmoothed are < 1 smoothed; contrast with add-one smoothing.
Discount comparison
• The table shows the discounts for add-one and Witten-Bell smoothing for this example:
Training sets and test sets
• The corpus is divided into a training set and a test set.
• Test items must not appear in the training set, or they will have artificially high probability.
• We can use this setup to evaluate different systems:
  train two different systems on the same training set
  compare the performance of the systems on the same test set
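A sketch of this evaluation setup; the 90/10 split and the use of average log-probability per test bigram as the comparison score are assumptions for illustration, not something the slides prescribe.

```python
# Sketch: split a token list into training and test portions, then compare
# two already-trained models on the same test set.
import math

def split_corpus(tokens, train_fraction=0.9):
    cut = int(len(tokens) * train_fraction)
    return tokens[:cut], tokens[cut:]

def avg_logprob(test_tokens, prob_fn):
    """prob_fn(w_prev, w) should return a smoothed P(w | w_prev) > 0."""
    pairs = zip(["<s>"] + test_tokens, test_tokens)
    logs = [math.log(prob_fn(u, v)) for u, v in pairs]
    return sum(logs) / len(logs)

# Usage idea: train, say, an add-one model and a Witten-Bell model on the
# same training tokens, then compare avg_logprob(test_tokens, model_a)
# with avg_logprob(test_tokens, model_b); higher (less negative) is better.
```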