Smoothing
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004
The Sparse Data Problem
• Maximum likelihood estimation works fine for data that occur frequently in the training corpus
• Problem 1: Low-frequency n-grams
  • If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
• Problem 2: Zero counts
  • If n-gram y does not occur in the training data, does that mean it should have probability zero?
The Sparse Data Problem
• Data sparseness is a serious and frequently occurring problem
• The probability of a sequence is zero if it contains any unseen n-gram, as the sketch below illustrates
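A minimal Python sketch (not from the original slides) of the problem: maximum likelihood bigram estimates computed from a toy corpus assign probability zero to unseen bigrams. The corpus and all variable names are purely illustrative.

from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries (illustrative only)
tokens = "<s> i want chinese food </s> <s> i want english food </s>".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w_prev, w):
    # Maximum likelihood estimate: P(w | w_prev) = C(w_prev, w) / C(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_mle("i", "want"))     # 1.0 -- "want" follows "i" every time
print(p_mle("want", "thai"))  # 0.0 -- unseen bigram gets zero probability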
Add-One Smoothing
• The simplest smoothing technique
• For all n-grams, including unseen n-grams, add one to their counts
• Un-smoothed probability:
  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
• Add-one probability:
  P_{+1}(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
Add-One Smoothing
• [Slide: table comparing the original bigram counts c_i = c(w_{i-1}, w_i) with their add-one adjusted counts c_i']
Add-One Smoothing
• Pro: very simple technique
• Cons:
  • Too much probability mass is shifted towards unseen n-grams
  • The probability of frequent n-grams is underestimated
  • The probability of rare (or unseen) n-grams is overestimated
  • All unseen n-grams are smoothed in the same way
  • Using a smaller added count does not solve the problem in principle
Witten-Bell Discounting
• Probability mass is shifted around, depending on the context of words
• If P(w_i | w_{i-1}, …, w_{i-m}) = 0, then the smoothed probability P_WB(w_i | w_{i-1}, …, w_{i-m}) is higher if the sequence w_{i-1}, …, w_{i-m} occurs with many different words w_i
Witten-Bell Smoothing
• Let's consider bigrams
• T(w_{i-1}) is the number of different words (types) that occur to the right of w_{i-1}
• N(w_{i-1}) is the number of all word occurrences (tokens) to the right of w_{i-1}
• Z(w_{i-1}) is the number of bigrams in the current data set starting with w_{i-1} that do not occur in the training data
Witten-Bell Smoothing
• If c(w_{i-1}, w_i) = 0:
  P_WB(w_i | w_{i-1}) = T(w_{i-1}) / (Z(w_{i-1}) · (N(w_{i-1}) + T(w_{i-1})))
• If c(w_{i-1}, w_i) > 0:
  P_WB(w_i | w_{i-1}) = c(w_{i-1}, w_i) / (N(w_{i-1}) + T(w_{i-1}))
Witten-Bell Smoothing
• [Slide: table comparing adjusted counts; the add-one adjusted count is based on (c_i + 1), while the Witten-Bell adjusted count is proportional to T/Z when c_i = 0 and to c_i otherwise]
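A minimal sketch of Witten-Bell bigram smoothing in the formulation above, reusing the toy counts from the earlier sketches; T, N and Z are computed per context exactly as defined on the previous slides.

vocab = set(unigram_counts)

def p_witten_bell(w_prev, w):
    seen_types = {v for (u, v) in bigram_counts if u == w_prev}
    T = len(seen_types)                  # word types seen after w_prev
    N = sum(c for (u, _), c in bigram_counts.items() if u == w_prev)  # tokens after w_prev
    Z = len(vocab) - T                   # word types never seen after w_prev
    c = bigram_counts[(w_prev, w)]
    if c > 0:
        return c / (N + T)               # discounted estimate for seen bigrams
    return T / (Z * (N + T))             # mass T/(N+T) shared evenly over the Z unseen types

print(p_witten_bell("want", "chinese"))  # 0.25 -- seen after "want"
print(p_witten_bell("want", "food"))     # 0.1  -- unseen after "want", but non-zero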
Witten-Bell Smoothing
• Witten-Bell smoothing is more conservative when subtracting probability mass
• Gives rather good estimates
• Problem: if w_{i-1} and w_i did not occur in the training data, the smoothed probability is still zero
Backoff Smoothing
• Deleted interpolation
  • If the n-gram w_{i-n}, …, w_i is not in the training data, use w_{i-(n-1)}, …, w_i
  • More generally, combine evidence from different n-grams, e.g.
    P_interp(w_i | w_{i-n}, …, w_{i-1}) = λ P(w_i | w_{i-n}, …, w_{i-1}) + (1 - λ) P_interp(w_i | w_{i-(n-1)}, …, w_{i-1})
  • λ is the 'confidence' weight for the longer n-gram
  • The λ parameters are computed from held-out data
  • The λ values can be n-gram specific
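A minimal sketch of the interpolation idea for bigrams, reusing p_mle and the toy counts from the earlier sketches; the λ weight is fixed by hand here, whereas the slide notes it would normally be estimated on held-out data.

def p_unigram(w):
    return unigram_counts[w] / sum(unigram_counts.values())

def p_interpolated(w_prev, w, lam=0.7):
    # lam weights the longer (bigram) estimate; (1 - lam) goes to the unigram estimate
    return lam * p_mle(w_prev, w) + (1 - lam) * p_unigram(w)

print(p_interpolated("want", "food"))  # non-zero via the unigram term
print(p_interpolated("want", "thai"))  # still zero: "thai" is unseen even as a unigram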
Other Smoothing Approaches
• Good-Turing Discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts
• Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word
• Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is below some threshold
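A minimal sketch (not from the slides) of the core Good-Turing idea: a count c is re-estimated as c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of n-gram types seen exactly c times, and the mass N_1 / N is reserved for unseen n-grams. Real implementations additionally smooth the N_c values; this simplified version just keeps the raw count when N_{c+1} is zero.

from collections import Counter

def good_turing(ngram_counts):
    freq_of_freq = Counter(ngram_counts.values())  # N_c: number of types occurring exactly c times
    total = sum(ngram_counts.values())              # N: total number of observed tokens
    adjusted = {}
    for ngram, c in ngram_counts.items():
        if freq_of_freq[c + 1] > 0:
            adjusted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[ngram] = c                     # no higher-count bucket to estimate from
    p_unseen = freq_of_freq[1] / total              # probability mass reserved for unseen n-grams
    return adjusted, p_unseen

adjusted, p_unseen = good_turing(bigram_counts)     # toy counts from the first sketch
print(p_unseen)                                     # 0.6 on the toy corpus: most types were seen once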