Language Modeling
Roadmap (for next two classes) • Review LM evaluation metrics • Entropy • Perplexity • Smoothing • Good-Turing • Backoff and Interpolation • Absolute Discounting • Kneser-Ney
Entropy and perplexity • Entropy – measures information content, in bits: H(X) = − Σx p(x) log2 p(x) • Entropy is the expected message length under an ideal code • Use log base 2 if you want to measure in bits! • Cross entropy – measures the ability of a trained model to compactly represent test data • Average negative logprob of the test data under the model • Perplexity – measures the average branching factor: perplexity = 2^(cross entropy)
Language model perplexity • Recipe: • Train a language model on training data • Get negative logprobs of the test data, compute the average • Exponentiate! • Perplexity correlates rather well with: • Speech recognition error rates • MT quality metrics • Perplexities for word-based models normally fall roughly between 50 and 1,000 • Need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact
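A minimal sketch of the recipe above, assuming per-token log2 probabilities of the test data are already available from some trained model (the numbers here are toy values):

```python
import math

def perplexity(logprobs_base2):
    """Perplexity = 2 ** (average negative log2 probability per token),
    i.e. the exponentiated cross entropy of the test data."""
    avg_neg_logprob = -sum(logprobs_base2) / len(logprobs_base2)
    return 2.0 ** avg_neg_logprob

# A model that assigns probability 1/4 to every test token has
# cross entropy of 2 bits/token and therefore perplexity 4.
print(perplexity([math.log2(0.25)] * 100))  # -> 4.0
```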
Parameter estimation • What is it?
Parameter estimation • Model form is fixed (coin unigrams, word bigrams, …) • We have observations: • H H H T T H T H H • Want to find the parameters • Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data • c(H) = 6; c(T) = 3 • P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3 • MLE picks the parameters that fit the training data best… • …but these don’t generalize well to test data – zeros! (see the sketch below)
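A minimal sketch of maximum likelihood estimation for the coin example; the estimate is just the relative frequency of each outcome:

```python
from collections import Counter

def mle(observations):
    """Maximum likelihood estimate: relative frequency of each outcome."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {outcome: c / total for outcome, c in counts.items()}

flips = list("HHHTTHTHH")
print(mle(flips))  # {'H': 0.666..., 'T': 0.333...}
# Any outcome never seen in training gets probability zero --
# the generalization problem that smoothing addresses.
```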
Smoothing • Take mass from seen events, give it to unseen events • Robin Hood for probability models • MLE sits at one end of the spectrum; the uniform distribution at the other • Need to pick a happy medium, and yet maintain a valid probability distribution
Smoothing techniques • Laplace • Good-Turing • Backoff • Mixtures • Interpolation • Kneser-Ney
Laplace • From MLE: P(w) = c(w) / N • To Laplace (add-one): P(w) = (c(w) + 1) / (N + V), where N is the number of training tokens and V is the vocabulary size
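A minimal sketch of add-one smoothing applied to bigram probabilities; the tiny corpus is a toy placeholder:

```python
from collections import Counter

def laplace_bigram(bigram_counts, unigram_counts, vocab_size):
    """P(w2 | w1) with add-one smoothing: every bigram count is bumped by 1,
    and the denominator grows by the vocabulary size to keep a distribution."""
    def prob(w1, w2):
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)
    return prob

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
p = laplace_bigram(bigrams, unigrams, vocab_size=len(unigrams))

print(p("the", "cat"))  # seen bigram: less than its MLE of 1/2
print(p("cat", "mat"))  # unseen bigram: nonzero instead of zero
```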
Good-Turing Smoothing • New idea: Use counts of things you have seen to estimate those you haven’t
Good-Turing: Josh Goodman's Intuition • Imagine you are fishing • There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass • You have caught • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish • How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? • 3/18 • Assuming so, how likely is it that the next species is trout? • Must be less than 1/18 Slide adapted from Josh Goodman, Dan Jurafsky
Some more hypotheticals • How likely is it to find a new fish in each of these places?
Good-Turing Smoothing • New idea: Use counts of things you have seen to estimate those you haven’t • Good-Turing approach: Use the frequency of singletons to re-estimate the frequency of zero-count n-grams • Notation: Nc is the frequency of frequency c, i.e. the number of n-grams which appear c times • N0: # of n-grams with count 0; N1: # of n-grams with count 1
Good-Turing Smoothing • Estimate the probability of things which occur c times using the probability of things which occur c+1 times • Discounted counts: steal mass from seen cases to provide for the unseen • MLE: P(w) = c(w) / N • GT: c* = (c + 1) Nc+1 / Nc, and P(w) = c*(w) / N; the total mass left for unseen events is N1 / N
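A minimal sketch of the Good-Turing adjusted counts on the fishing example from the earlier slide:

```python
from collections import Counter

# Counts of each species caught in the fishing example.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                 # 18 fish
freq_of_freq = Counter(catch.values())  # Nc: how many species were seen c times

def gt_adjusted_count(c):
    """c* = (c + 1) * Nc+1 / Nc  (breaks down when Nc or Nc+1 is 0 -- see below)."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(freq_of_freq[1] / N)       # P(next fish is a new species) = 3/18
print(gt_adjusted_count(1) / N)  # P(trout) = c*(1)/N ~= 0.037, less than 1/18
```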
Enough about the fish…how does this relate to language? • Name some linguistic situations where the number of new words would differ • Different languages: • Chinese has almost no morphology • Turkish has a lot of morphology • Lots of new words in Turkish! • Different domains: • Airplane maintenance manuals: controlled vocabulary • Random web posts: uncontrolled vocab
Good-Turing Smoothing • From n-gram counts to conditional probabilities: use c* from the GT estimate in place of the raw count • e.g. for bigrams: P(wi | wi-1) = c*(wi-1 wi) / c(wi-1)
Additional Issues in Good-Turing • General approach: the estimate c* for count c depends on Nc+1 • What if Nc+1 = 0? • More zero-count problems • Not uncommon: e.g. in the fish example there are no species with count 4
Modifications • Simple Good-Turing • Compute the Nc bins, then smooth the Nc values to replace zeroes • Fit a linear regression in log space: log(Nc) = a + b log(c) • What about large c’s? • Those counts should be reliable • Assume c* = c if c is large, e.g. c > k (Katz: k = 5) • Typically combined with other approaches
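A minimal sketch of that regression step, using the Nc bins from the fish example; the fit is ordinary least squares in log space:

```python
import math

def fit_log_linear(freq_of_freq):
    """Least-squares fit of log(Nc) = a + b*log(c) over the observed (c, Nc) pairs.

    A smoothed N'c = exp(a) * c**b can then replace empty or noisy Nc bins
    in the Good-Turing formula."""
    points = [(math.log(c), math.log(n)) for c, n in freq_of_freq.items() if n > 0]
    m = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    a = (sy - b * sx) / m
    return a, b

# Nc bins from the fishing example: N1=3, N2=1, N3=1, N10=1.
a, b = fit_log_linear({1: 3, 2: 1, 3: 1, 10: 1})
smoothed_N4 = math.exp(a) * 4 ** b  # an estimate for the empty N4 bin
print(round(smoothed_N4, 3))
```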
Backoff and Interpolation • Another really useful source of knowledge • If we are estimating: • trigram p(z|x,y) • but count(xyz) is zero • Use info from: • Bigram p(z|y) • Or even: • Unigram p(z) • How to combine this trigram, bigram, unigram info in a valid fashion?
Backoff vs. Interpolation • Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram • Interpolation: always mix all three
Backoff • Bigram distribution: P(z | y) = c(yz) / c(y) • But c(yz) could be zero… • What if we fell back (or “backed off”) to the unigram distribution P(z) = c(z) / N? • That count could also be zero for unseen words…
Backoff • What’s wrong with this backed-off distribution? • It doesn’t sum to one! • Need to steal mass from the seen events to pay for the backed-off ones… (see the sketch below)
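A minimal sketch of one way to make backoff sum to one: reserve mass with an absolute discount d and hand it to a renormalized unigram fallback. The tiny corpus and the value d = 0.75 are illustrative, not taken from the slides:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)
d = 0.75  # absolute discount, an illustrative value

def p_unigram(z):
    return unigrams[z] / N

def p_backoff(z, y):
    """Discounted bigram if seen; otherwise back off to the unigram,
    scaled so the whole conditional distribution still sums to one."""
    if bigrams[(y, z)] > 0:
        return (bigrams[(y, z)] - d) / unigrams[y]
    # Mass reserved by discounting the seen bigrams that follow y:
    seen = [w for (h, w) in bigrams if h == y]
    alpha = d * len(seen) / unigrams[y]
    # Renormalize the unigram over words not already seen after y:
    missing = 1.0 - sum(p_unigram(w) for w in seen)
    return alpha * p_unigram(z) / missing

print(p_backoff("cat", "the"))  # seen bigram, slightly discounted
print(p_backoff("ran", "the"))  # unseen bigram, gets backed-off mass
```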
Mixtures • Given distributions P1 and P2, pick any λ between 0 and 1 • Then λ P1 + (1 − λ) P2 is also a distribution • (Laplace is a mixture of the MLE and the uniform distribution!)
Interpolation • Simple interpolation: P'(z | x, y) = λ1 p(z | x, y) + λ2 p(z | y) + λ3 p(z), where the λs sum to 1 • Or, pick the interpolation values based on the context • Intuition: put higher weight on more frequent n-grams
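A minimal sketch of simple linear interpolation; the component probabilities and the λ values here are placeholders:

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolated trigram probability.

    p_tri, p_bi, p_uni are the MLE estimates p(z|x,y), p(z|y), p(z);
    the lambda weights (illustrative values here) must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even when the trigram estimate is zero, the result stays nonzero
# as long as the unigram probability is nonzero.
print(interpolate(p_tri=0.0, p_bi=0.02, p_uni=0.001))
```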
How to Set the Lambdas? • Use a held-out, or development, corpus • Choose the lambdas which maximize the probability of the held-out data • I.e. fix the n-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set • Can use EM to do this search (see the sketch below)
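A minimal sketch of that search as a coarse grid search over λ triples (the slide mentions EM; a grid search optimizes the same held-out likelihood and is shorter to show). The held-out probabilities below are hypothetical:

```python
import math
from itertools import product

def heldout_loglik(lambdas, heldout):
    """Log probability of held-out data; each item is the tuple of
    (trigram, bigram, unigram) MLE probabilities for one test token."""
    l1, l2, l3 = lambdas
    return sum(math.log(l1 * pt + l2 * pb + l3 * pu) for pt, pb, pu in heldout)

def best_lambdas(heldout, step=0.1):
    """Coarse grid search over lambda triples that sum to 1."""
    grid = [i * step for i in range(1, int(1 / step))]
    candidates = [(l1, l2, 1 - l1 - l2) for l1, l2 in product(grid, grid)
                  if 1 - l1 - l2 > 0]
    return max(candidates, key=lambda lam: heldout_loglik(lam, heldout))

# Hypothetical held-out probabilities for three tokens:
heldout = [(0.0, 0.02, 0.001), (0.3, 0.1, 0.01), (0.05, 0.04, 0.002)]
print(best_lambdas(heldout))
```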
Kneser-Ney Smoothing • Most commonly used modern smoothing technique • Intuition: improving backoff • “I can’t see without my reading……” • Compare P(Francisco | reading) vs. P(glasses | reading) • P(Francisco | reading) backs off to the unigram P(Francisco) • P(glasses | reading) > 0, since the bigram is observed • But the high unigram frequency of Francisco can make the backed-off estimate exceed P(glasses | reading) • However, Francisco appears in few contexts, glasses in many • So interpolate based on the # of contexts a word appears in • Words seen in more contexts are more likely to appear in new ones
Kneser-Ney Smoothing: bigrams • Modeling diversity of contexts: |{w' : c(w', w) > 0}| = the number of distinct words that precede w • So the continuation probability is P_continuation(w) = |{w' : c(w', w) > 0}| / |{(u, v) : c(u, v) > 0}| – the fraction of distinct bigram types that end in w
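A minimal sketch of an interpolated Kneser-Ney bigram estimate built from that continuation probability; the tiny corpus and the discount d = 0.75 are illustrative:

```python
from collections import Counter, defaultdict

tokens = "i can not see without my reading glasses i like reading books".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
d = 0.75  # absolute discount, an illustrative value

# Continuation counts: how many distinct contexts does each word follow?
contexts_of = defaultdict(set)
for (w1, w2) in bigrams:
    contexts_of[w2].add(w1)
total_bigram_types = len(bigrams)

def p_continuation(w):
    return len(contexts_of[w]) / total_bigram_types

def p_kn(w2, w1):
    """Interpolated Kneser-Ney bigram estimate."""
    discounted = max(bigrams[(w1, w2)] - d, 0) / unigrams[w1]
    # Normalizing weight: discount mass reserved from the seen continuations of w1.
    distinct_after_w1 = len([1 for (h, _) in bigrams if h == w1])
    lam = d * distinct_after_w1 / unigrams[w1]
    return discounted + lam * p_continuation(w2)

print(p_kn("glasses", "reading"), p_kn("books", "reading"))
```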