Language Modeling
Roadmap (for next two classes) • Review LMs • What are they? • How (and where) are they used? • How are they trained? • Evaluation metrics • Entropy • Perplexity • Smoothing • Good-Turing • Backoff and Interpolation • Absolute Discounting • Kneser-Ney
What is a language model? • Gives the probability of a transmitted sequence of symbols (Claude Shannon, Information Theory) • Lots of ties to Cryptography and Information Theory • We most often use n-gram models
Applications • What word sequence (English) does this phoneme sequence correspond to: AY D L AY K T UW R EH K AH N AY S B IY CH • Goal of LM: P(“I’d like to recognize speech”) > P(“I’d like to wreck a nice beach”)
Why n-gram LMs? • We could just count how often a sentence occurs… • …but language is too productive – infinite combos! • Break down by word – predict each word given its history • We could just count words in context… • …but even contexts get too sparse. • Just use the last n − 1 words of history – an n-gram model
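For reference, here is the decomposition the last bullet describes, written out as standard equations (the sentence w1 … wm and the bigram case n = 2 are generic illustrations, not examples from the slides):

```latex
% Chain rule: exact, but each conditional depends on an unboundedly long history
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})

% n-gram approximation (here bigram, n = 2): condition only on the last n-1 words
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})
```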
Entropy and perplexity • Entropy measures the information content in a distribution == the uncertainty • If I can predict the next word before it comes, there’s no information content • Zero uncertainty means the signal has zero information • How many bits of additional information do I need to guess the next symbol? • Perplexity is the average branching factor • If message has zero information, then branching factor is 1 • If message needs one bit, branching factor is 2 • If message needs two bits, branching factor is 4 • Entropy and perplexity measure the same thing (uncertainty / information content) with different scales
Entropy of a distribution • Start with a distribution P over events x in the event space • Entropy measures the minimum number of bits necessary to encode a message, assuming the message has distribution P • Key notion – you can use shorter codes for more common messages • (If you’ve heard of Huffman coding, here it is…)
Computing Entropy • H(P) = −Σ_x P(x) log2 P(x) • P(x): expected occurrences (relative frequency) of symbol x • −log2 P(x): ideal code length for this symbol
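A minimal sketch of this formula in Python; the function name entropy_bits and the fair-coin example are illustrative choices, not from the slides:

```python
import math

def entropy_bits(dist):
    """H(P) = -sum_x P(x) * log2 P(x): expected ideal code length in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Example: a fair coin needs 1 bit per symbol on average.
print(entropy_bits({"H": 0.5, "T": 0.5}))  # 1.0
```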
Entropy example • What binary code would I use to represent these symbols? • Sample: cabaa → counts a: 3, b: 1, c: 1
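Assuming 3/1/1 are the counts of a, b, c in the sample “cabaa”, a quick (hypothetical) numeric check of its empirical entropy and perplexity:

```python
import math
from collections import Counter

sample = "cabaa"
counts = Counter(sample)                           # {'a': 3, 'b': 1, 'c': 1}
n = len(sample)
probs = {sym: c / n for sym, c in counts.items()}  # a: 0.6, b: 0.2, c: 0.2

# Ideal code lengths -log2 p: a ≈ 0.74 bits, b and c ≈ 2.32 bits each.
H = -sum(p * math.log2(p) for p in probs.values())
print(round(H, 3), round(2 ** H, 3))               # ≈ 1.371 bits, branching factor ≈ 2.586
```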
Perplexity • Just 2^H • If entropy H measures the # of bits per symbol • Just exponentiate to get the branching factor
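Written out, the relationship between the two measures is:

```latex
\mathrm{PP}(P) = 2^{H(P)} = 2^{-\sum_{x} P(x)\,\log_2 P(x)}
```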
The Train/Test Split and Entropy • Before, we were computing H(P) = −Σ_x P(x) log2 P(x) • This scores how well we’re doing only if we know the true distribution P • In practice, we estimate parameters on training data and evaluate on test data
Cross entropy • Estimate a distribution on the training corpus; see how well it predicts the testing corpus • Let Q be the distribution we learned from the training data • Let x1 … xN be the test data • Then the cross entropy of the test data given the training data is: H = −(1/N) Σ_i log2 Q(x_i) • This is the negative average logprob • Also, the average number of bits required to encode each test data symbol using our learned distribution
Cross entropy, formally • True distribution P, assumed distribution Q • We wrote the codebook using Q, but encode messages drawn from P: H(P, Q) = −Σ_x P(x) log2 Q(x) • Let P̃ be the count-based (empirical) distribution of the test data x1 … xN; then H(P̃, Q) = −(1/N) Σ_i log2 Q(x_i)
Language model perplexity • Recipe: • Train a language model on training data • Get negative logprobs of test data, compute average • Exponentiate! • Perplexity correlates rather well with: • Speech recognition error rates • MT quality metrics • LM Perplexities for word-based models are normally between say 50 and 1000 • Need to drop perplexity by a significant fraction (not absolute amount) to make a visible impact
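A minimal sketch of this recipe in Python, assuming a hypothetical learned unigram distribution q and a tiny tokenized test set (both made up for illustration):

```python
import math

# Hypothetical learned unigram distribution (would come from training data).
q = {"i": 0.3, "like": 0.2, "speech": 0.25, "beach": 0.25}

test_tokens = ["i", "like", "speech", "speech"]

# Step 2: negative average log2-probability of the test data (cross entropy).
cross_entropy = -sum(math.log2(q[w]) for w in test_tokens) / len(test_tokens)

# Step 3: exponentiate to get perplexity.
perplexity = 2 ** cross_entropy
print(cross_entropy, perplexity)

# A real model would need smoothing so unseen test words don't get probability zero.
```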
Tasks • You get parameters • You want to produce data that conforms to this distribution • This is simulation or data generation
Tasks • You get parameters • And observations • HHHHTHTTHHHHH • You need to answer: “How likely is this data according to the model?” • This is evaluating the likelihood function
Tasks • You get observations: • HHTHTTHTHTHHTHTHTTHTHT • You need to find a set of parameters that explain them • This is parameter estimation
Parameter estimation • We keep talking about things like P(w | history) as a distribution with parameters • How do we estimate the parameters? • What’s the likelihood of these parameters?
Parameter estimation techniques • Often use the Relative Frequency Estimate • For certain distributions… • “how likely is it that I get k heads when I flip n times” (Binomial distributions) • “how likely is it that I get five 6s when I roll five dice” (Multinomial distributions) • …Relative Frequency = Maximum Likelihood Estimate (MLE) • This is the set of parameters for which the underlying distribution has the maximum likelihood (another max!) • Formalizes your intuition from the prior slide
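A minimal sketch of relative-frequency (MLE) estimation for a bigram model; the helper name mle_bigrams and the two-sentence toy corpus are made up for illustration:

```python
from collections import Counter

def mle_bigrams(sentences):
    """P_MLE(w | h) = count(h, w) / count(h), estimated by relative frequency."""
    pair_counts = Counter()
    history_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            pair_counts[(h, w)] += 1
            history_counts[h] += 1
    return {(h, w): c / history_counts[h] for (h, w), c in pair_counts.items()}

probs = mle_bigrams(["i like speech", "i like beaches"])
print(probs[("i", "like")])   # 1.0 -- "like" always follows "i" in this toy corpus
```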
Maximum Likelihood has problems :/ • Remember: P_MLE(w | h) = count(h, w) / count(h) • Two problems: • What happens if count(h, w) = 0? • We assign zero probability to an event… • Even worse, what if count(h) = 0? • Divide by zero is undefined!
Smoothing • Main goal: prevent zero numerators (zero probs) and zero denominators (divide by zeros) • Make a “sharp” distribution (where some outputs have large probabilities and others have zero probs) be “smoother” • The smoothest distribution is the uniform distribution • Constraint: • Result should still be a distribution
Smoothing techniques • Add one (Laplace) • This can help, but it generally doesn’t do a good job of estimating what’s going on
Mixtures / interpolation • Say I have two distributions P1 and P2 • Pick any number λ between 0 and 1 • Then P(x) = λ P1(x) + (1 − λ) P2(x) is a distribution • Two things to show: • (a) Sums to one: Σ_x [λ P1(x) + (1 − λ) P2(x)] = λ Σ_x P1(x) + (1 − λ) Σ_x P2(x) = λ + (1 − λ) = 1 • (b) All values are ≥ 0: P1(x) ≥ 0 and P2(x) ≥ 0 because they’re distributions, and λ ≥ 0 and (1 − λ) ≥ 0 since 0 ≤ λ ≤ 1 • So the sum is non-negative, and we’re done
Laplace as a mixture • Say we have K outcomes and N total observations. Laplace says: P(x) = (count(x) + 1) / (N + K) = [N / (N + K)] · P_MLE(x) + [K / (N + K)] · (1 / K) • Laplace is a mixture between MLE and uniform! • Mixture weight is determined by N and K
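A quick numeric check of this identity, with made-up counts; it interpolates the MLE and uniform distributions with weight N/(N+K) and compares against the add-one estimates:

```python
counts = {"a": 3, "b": 1, "c": 0}          # hypothetical outcome counts
K = len(counts)                            # number of outcomes
N = sum(counts.values())                   # total observations

lam = N / (N + K)                          # mixture weight on the MLE

for x, c in counts.items():
    laplace = (c + 1) / (N + K)            # add-one estimate
    mle = c / N                            # relative-frequency estimate
    mixture = lam * mle + (1 - lam) * (1 / K)
    print(x, laplace, mixture)             # the two columns match (up to rounding)
```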
BERP Corpus Bigrams • Original bigram probabilities
BERP Smoothed Bigrams • Smoothed bigram probabilities from the BERP corpus