Advanced Smoothing, Evaluation of Language Models
Witten-Bell Discounting • A zero ngram is just an ngram you haven’t seen yet…but every ngram in the corpus was unseen once…so... • How many times did we see an ngram for the first time? Once for each ngram type (T) • Estimate the total probability of unseen bigrams as T / (N + T) • View the training corpus as a series of events, one for each token (N) and one for each new type (T) • We can divide this probability mass equally among unseen bigrams…or we can condition the probability of an unseen bigram on the first word of the bigram (see the sketch below) • Discount values for Witten-Bell are much more reasonable than Add-One
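A minimal Python sketch of the conditional variant, assuming we already have a tokenized corpus and a fixed vocabulary; the names (`witten_bell_bigram`, `vocab`, …) are illustrative, not from the original slides:

```python
from collections import Counter, defaultdict

def witten_bell_bigram(corpus_tokens, vocab):
    """Conditional Witten-Bell smoothing for bigrams (illustrative sketch).

    For each history w1:
      N(w1) = number of bigram tokens starting with w1
      T(w1) = number of distinct continuation types seen after w1
      Z(w1) = |V| - T(w1) = number of unseen continuations
    Seen bigram:   P(w2|w1) = c(w1, w2) / (N(w1) + T(w1))
    Unseen bigram: P(w2|w1) = T(w1) / (Z(w1) * (N(w1) + T(w1)))
    """
    bigram_counts = defaultdict(Counter)
    for w1, w2 in zip(corpus_tokens, corpus_tokens[1:]):
        bigram_counts[w1][w2] += 1

    def prob(w1, w2):
        counts = bigram_counts[w1]
        N = sum(counts.values())      # tokens following w1
        T = len(counts)               # types following w1
        Z = len(vocab) - T            # unseen continuations after w1
        if N == 0:                    # history never seen: fall back to uniform
            return 1.0 / len(vocab)
        if counts[w2] > 0:
            return counts[w2] / (N + T)
        return T / (Z * (N + T)) if Z > 0 else 0.0

    return prob
```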
Good-Turing Discounting • Re-estimate the amount of probability mass for zero (or low-count) ngrams by looking at ngrams with higher counts • Estimate the adjusted count as c* = (c + 1) × Nc+1 / Nc, where Nc is the number of ngram types occurring exactly c times • E.g. N0’s adjusted count is a function of the count of ngrams that occur once, N1 (see the sketch below) • Assumes: • word bigrams follow a binomial distribution • We know the number of unseen bigrams (V×V − seen)
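A small sketch of the re-estimation step, assuming `ngram_counts` is a dict mapping each observed n-gram to its raw count; all names are illustrative:

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Good-Turing re-estimation: c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of n-gram types occurring exactly c times."""
    count_of_counts = Counter(ngram_counts.values())          # N_c
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c = count_of_counts[c]
        n_c_plus_1 = count_of_counts.get(c + 1, 0)
        # Fall back to the raw count when N_{c+1} is zero (common for high c).
        adjusted[ngram] = (c + 1) * n_c_plus_1 / n_c if n_c_plus_1 else c
    return adjusted

# Probability mass reserved for all unseen n-grams together: N_1 / N,
# where N is the total number of n-gram tokens in the training data.
```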
Interpolation and Backoff • Typically used in addition to smoothing/discounting techniques • Example: trigrams • Smoothing gives some probability mass to all the trigram types not observed in the training data • We could make a more informed decision! How? • If backoff finds an unobserved trigram in the test data, it will “back off” to bigrams (and ultimately to unigrams) • Backoff doesn’t treat all unseen trigrams alike • When we have observed a trigram, we rely solely on the trigram counts
Backoff methods (e.g. Katz ‘87) • For e.g. a trigram model • Compute unigram, bigram and trigram probabilities • In use: where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability (simplified sketch below) • E.g. “An omnivorous unicorn”
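A deliberately simplified backoff sketch, not full Katz backoff (the discounting and the back-off weights α that make the distribution sum to one are omitted for clarity); the count tables are assumed to be ordinary dicts built from the training data:

```python
def backoff_prob(w1, w2, w3,
                 trigram_counts, bigram_counts, unigram_counts, total_tokens):
    """Use the trigram estimate when the trigram was observed,
    otherwise back off to the bigram, then to the unigram."""
    if trigram_counts.get((w1, w2, w3), 0) > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:
        return bigram_counts[(w2, w3)] / unigram_counts[w2]
    return unigram_counts.get(w3, 0) / total_tokens
```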
Smoothing: Simple Interpolation • Trigram is very context specific, very noisy • Unigram is context-independent, smooth • Interpolate trigram, bigram and unigram for the best combination: P(wn|wn-2 wn-1) = λ1 P(wn|wn-2 wn-1) + λ2 P(wn|wn-1) + λ3 P(wn) • Find 0 < λi < 1 (with Σλi = 1) by optimizing on “held-out” data • Almost good enough
Smoothing: Held-out estimation • Finding parameter values • Split data into training, “held-out” and test sets • Try lots of different values for λ on the held-out data, pick the best (grid-search sketch below) • Test on the test data • Sometimes we can use tricks like EM (expectation maximization) to find the λ values • How much data for training, held-out, test? • Answer: enough test data to be statistically significant (thousands of words, perhaps)
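A sketch of linear interpolation plus a simple grid search for the λ weights on held-out data; `p_tri`, `p_bi`, `p_uni` are assumed to be conditional probability functions estimated on the training set, and all names are hypothetical:

```python
import itertools
import math

def interp_prob(w1, w2, w3, lambdas, p_tri, p_bi, p_uni):
    """Linear interpolation: P(w3|w1,w2) = l1*P_tri + l2*P_bi + l3*P_uni."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l3 * p_uni(w3)

def tune_lambdas(heldout_trigrams, p_tri, p_bi, p_uni, step=0.1):
    """Grid search over lambda weights, picking the combination that
    maximizes log-likelihood on the held-out trigrams."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:                  # weights must sum to one
            continue
        l3 = max(l3, 0.0)
        ll = sum(math.log(max(interp_prob(w1, w2, w3, (l1, l2, l3),
                                          p_tri, p_bi, p_uni), 1e-12))
                 for w1, w2, w3 in heldout_trigrams)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```

EM finds the same weights more efficiently than a grid search, but the grid search makes the held-out idea explicit: the λ values are chosen on data the model was not trained on.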
Summary • N-gram probabilities can be used to estimate the likelihood • Of a word occurring in a context (the previous N−1 words) • Of a sentence occurring at all • Smoothing techniques deal with the problem of unseen ngrams in a corpus
Practical Issues • Represent and compute language model probabilities in log format to avoid numerical underflow: p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
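A tiny sketch with made-up per-word probabilities, just to show the log-space bookkeeping:

```python
import math

probs = [0.25, 0.1, 0.05, 0.2]              # toy per-word probabilities
log_p = sum(math.log(p) for p in probs)     # add log-probs instead of multiplying
p_sentence = math.exp(log_p)                # recover the product only if needed
print(log_p, p_sentence)                    # ≈ -8.29 and 0.00025
```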
Class-based n-grams • P(wi|wi-1) = P(ci|ci-1) × P(wi|ci), where ci is the class of wi Factored Language Models
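A sketch of the class-based bigram factorization, assuming hypothetical count tables built from a class-tagged corpus (a word-to-class map, class bigram counts, class token counts, and per-word counts); none of these names come from the slides:

```python
def class_bigram_prob(w_prev, w, word2class,
                      class_bigram_counts, class_token_counts, word_counts):
    """P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c_prev, c = word2class[w_prev], word2class[w]
    p_class = class_bigram_counts[(c_prev, c)] / class_token_counts[c_prev]
    p_word_given_class = word_counts[w] / class_token_counts[c]
    return p_class * p_word_given_class
```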
Evaluating language models • We need evaluation metrics to determine how well our language models predict the next word • Intuition: one should average over the probability of new words
Some basic information theory • Evaluation metrics for language models • Information theory: measures of information • Entropy • Perplexity
Entropy • Average length of most efficient coding for a random variable • Binary encoding
Entropy • Example: betting on horses • 8 horses, each horse is equally likely to win • (Binary) Message required: 001, 010, 011, 100, 101, 110, 111, 000 • 3-bit message required
Entropy • 8 horses, some horses are more likely to win (see the sketch below) • Horse 1: ½ 0 • Horse 2: ¼ 10 • Horse 3: 1/8 110 • Horse 4: 1/16 1110 • Horses 5-8: 1/64 each 111100, 111101, 111110, 111111
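A short worked check of the two horse examples; the entropy formula is the standard one from the next slide, and the function name is just illustrative:

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i)"""
    return -sum(p * math.log2(p) for p in probs if p > 0)

equal_horses  = [1/8] * 8
biased_horses = [1/2, 1/4, 1/8, 1/16] + [1/64] * 4
print(entropy(equal_horses))    # 3.0 bits: fixed 3-bit codes are optimal
print(entropy(biased_horses))   # 2.0 bits: the variable-length codes average 2 bits
```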
Perplexity • Entropy: H • Perplexity: 2^H • Intuitively: weighted average number of choices a random variable has to make • Equally likely horses: Entropy 3, Perplexity 2^3 = 8 • Biased horses: Entropy 2, Perplexity 2^2 = 4
Entropy • Uncertainty measure (Shannon) for a random variable x: entropy = H(x) = -Σi=1..r pi log2 pi (Shannon uncertainty), where r is the number of outcomes, pi is the probability that the event is i, and lg = log2 (log base 2) • Biased coin (r = 2): -0.8 × lg 0.8 - 0.2 × lg 0.2 = 0.258 + 0.464 = 0.722 • Unbiased coin: -2 × 0.5 × lg 0.5 = 1 • Perplexity: (average) branching factor, the weighted average number of choices a random variable has to make • Formula: 2^H, directly related to the entropy value H • Examples: biased coin 2^0.722 ≈ 1.65, unbiased coin 2^1 = 2
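The same coin and horse numbers, computed directly as a perplexity sketch (again with an illustrative function name):

```python
import math

def perplexity(probs):
    """Perplexity = 2^H, with H = -sum_i p_i * log2(p_i)."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** h

print(perplexity([0.5, 0.5]))       # unbiased coin:  2^1     = 2.0
print(perplexity([0.8, 0.2]))       # biased coin:    2^0.722 ≈ 1.65
print(perplexity([1/8] * 8))        # 8 equal horses: 2^3     = 8.0
```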
Entropy and Word Sequences • Given a word sequence W = w1…wn (n is the number of words in the sequence) • Entropy for word sequences of length n in language L: H(w1…wn) = -Σ p(w1…wn) log p(w1…wn), summed over all sequences of length n in L • Entropy rate for word sequences of length n: (1/n) H(w1…wn) = -(1/n) Σ p(w1…wn) log p(w1…wn) • Entropy rate of the language: H(L) = lim n→∞ -(1/n) Σ p(w1…wn) log p(w1…wn) • Shannon-McMillan-Breiman theorem: H(L) = lim n→∞ -(1/n) log p(w1…wn) • If we select a sufficiently large n, it is possible to take a single sequence instead of summing over all possible w1…wn, since a long sequence will contain many shorter sequences
Entropy of a sequence • Finite sequence: strings from a language L • Entropy rate (per-word entropy): (1/n) H(w1…wn)
Entropy of a language • Entropy rate of language L: H(L) = lim n→∞ (1/n) H(w1…wn) • Shannon-McMillan-Breiman Theorem: • If a language is stationary and ergodic, H(L) = lim n→∞ -(1/n) log p(w1…wn) • A single sequence, if it is long enough, is representative of the language (see the sketch below)
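A closing sketch of how the Shannon-McMillan-Breiman estimate is used in practice: take one long test sequence, score it with a conditional model (for example, one of the smoothed bigram estimators sketched earlier), and average the negative log-probability per word; `prob` and `tokens` are assumed inputs:

```python
import math

def per_word_entropy(tokens, prob):
    """Shannon-McMillan-Breiman estimate on a single long sequence:
       H(L) ~ -(1/n) * log2 p(w_1 ... w_n),
    with p(w_1 ... w_n) factored by a bigram model prob(w_prev, w)."""
    log_p = sum(math.log2(prob(w_prev, w))
                for w_prev, w in zip(tokens, tokens[1:]))
    return -log_p / len(tokens)

# Perplexity of the model on the test sequence:
# PP = 2 ** per_word_entropy(tokens, prob)
```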