Part 5: Language Model. CSE717, SPRING 2008, CUBS, Univ at Buffalo
Examples of Good & Bad Language Models. Excerpt from Herman, comic strips by Jim Unger
What’s a Language Model • A language model is a probability distribution over word sequences • P(“And nothing but the truth”) ≈ 0.001 • P(“And nuts sing on the roof”) ≈ 0
What’s a language model for? • Speech recognition • Handwriting recognition • Spelling correction • Optical character recognition • Machine translation • (and anyone doing statistical modeling)
The Equation • W* = argmax_W P(W | O) = argmax_W P(O | W) P(W) • The observation O can be image features (handwriting recognition), acoustics (speech recognition), a word sequence in another language (MT), etc.
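A minimal Python sketch of this decision rule. The candidate hypotheses and their scores are made up for illustration; in a real recognizer the observation score P(O|W) would come from the acoustic or handwriting model and P(W) from the language model.

```python
import math

# Illustrative numbers only: a plausible and an implausible hypothesis
# for the same observation O.
candidates = {
    "and nothing but the truth": {"channel_logprob": -12.0, "lm_prob": 1e-3},
    "and nuts sing on the roof": {"channel_logprob": -11.5, "lm_prob": 1e-9},
}

def decode(candidates):
    """Pick the word sequence W maximizing P(O|W) * P(W), in log space."""
    return max(
        candidates,
        key=lambda w: candidates[w]["channel_logprob"]
                      + math.log(candidates[w]["lm_prob"]),
    )

print(decode(candidates))  # the language model vetoes the implausible hypothesis
```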
How Language Models Work • Hard to compute P(“And nothing but the truth”) directly • Decompose the probability with the chain rule: P(“and nothing but the truth”) = P(“and”) P(“nothing|and”) P(“but|and nothing”) P(“the|and nothing but”) P(“truth|and nothing but the”)
The Trigram Approximation • Assume each word depends only on the previous two words: P(“the|and nothing but”) ≈ P(“the|nothing but”), P(“truth|and nothing but the”) ≈ P(“truth|but the”)
How to find probabilities? • Count from real text: Pr(“the” | “nothing but”) ≈ c(“nothing but the”) / c(“nothing but”)
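A minimal Python sketch of this relative-frequency (counting) estimate on a toy corpus; the corpus and function names are illustrative.

```python
from collections import Counter

def trigram_mle(tokens):
    """Relative-frequency estimate: P(z | x y) = c(x y z) / c(x y)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return lambda x, y, z: tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0

corpus = "and nothing but the truth and nothing but the truth".split()
p = trigram_mle(corpus)
print(p("nothing", "but", "the"))   # 1.0 in this tiny corpus
```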
Evaluation • How can you tell a good language model from a bad one? • Run a speech recognizer (or your application of choice), calculate word error rate • Slow • Specific to your recognizer
Perplexity: An Example • Data: “the whole truth and nothing but the truth” (N = 8 words) • Lexicon: L = {the, whole, truth, and, nothing, but} • Model 1: unigram, Pr(L1) = … = Pr(L6) = 1/6 • Model 2: unigram, Pr(“the”) = Pr(“truth”) = 1/4, Pr(“whole”) = Pr(“and”) = Pr(“nothing”) = Pr(“but”) = 1/8
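A small Python sketch that evaluates both models on the example sentence, using the usual definition PP = P(w1 … wN)^(−1/N).

```python
import math

data = "the whole truth and nothing but the truth".split()

# Model 1: uniform unigram over the 6-word lexicon.
model1 = {w: 1 / 6 for w in {"the", "whole", "truth", "and", "nothing", "but"}}

# Model 2: non-uniform unigram from the slide.
model2 = {"the": 1/4, "truth": 1/4,
          "whole": 1/8, "and": 1/8, "nothing": 1/8, "but": 1/8}

def perplexity(model, words):
    """PP = P(w_1 ... w_N)^(-1/N), computed in log space for stability."""
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))

print(perplexity(model1, data))  # 6.0
print(perplexity(model2, data))  # ~5.66: the better-matched model scores lower
```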
Perplexity: Is Lower Better? • Remarkable fact: the “true” model for the data has the lowest possible perplexity • The lower the perplexity, the closer we are to the true model • Perplexity correlates well with the error rate of the recognition task • Correlates better when both models are trained on the same data • Doesn’t correlate well when the training data changes
Smoothing • Count-based estimates are terrible on test data: if C(xyz) = 0, the estimated probability is 0 • P(“sing”|“nuts”) = 0 leads to infinite perplexity!
Smoothing: Add One • Add-one smoothing: P(z|xy) = (C(xyz) + 1) / (C(xy) + V), where V is the vocabulary size • Add-delta smoothing: P(z|xy) = (C(xyz) + δ) / (C(xy) + δV) • Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated
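A minimal Python sketch of add-delta smoothing for trigrams (delta = 1 recovers add-one); the toy counts are illustrative.

```python
from collections import Counter

def add_delta_prob(tri_counts, bi_counts, vocab_size, delta=1.0):
    """Add-delta estimate: P(z|xy) = (C(xyz) + delta) / (C(xy) + delta * V).
    delta = 1 gives classic add-one (Laplace) smoothing."""
    def p(x, y, z):
        return (tri_counts[(x, y, z)] + delta) / (bi_counts[(x, y)] + delta * vocab_size)
    return p

tokens = "and nothing but the truth".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
p = add_delta_prob(tri, bi, vocab_size=len(set(tokens)), delta=1.0)
print(p("nothing", "but", "sing"))   # unseen trigram now gets a nonzero probability
```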
Smoothing: Simple Interpolation • Interpolate trigram, bigram, and unigram estimates for the best combination: P(z|xy) ≈ λ P(z|xy) + μ P(z|y) + (1 − λ − μ) P(z) • Almost good enough
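A sketch of the interpolated estimate, combining three estimators supplied as functions; the weights shown are placeholders, since in practice they are tuned on held-out data.

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(z|xy) = l3*P(z|xy) + l2*P(z|y) + l1*P(z), with l3 + l2 + l1 = 1."""
    l3, l2, l1 = lambdas
    def p(x, y, z):
        # Each higher-order component may be zero; the mixture stays nonzero
        # as long as the unigram estimate is nonzero.
        return l3 * p_tri(x, y, z) + l2 * p_bi(y, z) + l1 * p_uni(z)
    return p
```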
Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87] • Discounting: reduce the probability of observed n-grams, freeing some probability mass • Redistribution: give the freed mass to events predicted by the (n−1)-gram model
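A schematic Python sketch of the back-off idea with a generic discount function; Katz's actual formulation uses Good-Turing discounted counts, and this version only illustrates the discount-then-redistribute structure.

```python
def backoff_prob(tri_counts, bi_counts, p_bigram, discount):
    """Discount seen trigrams, then hand the freed mass to the bigram model."""
    def p(x, y, z):
        c_xy = bi_counts[(x, y)]
        if c_xy == 0:
            return p_bigram(y, z)                    # context never seen: back off fully
        c_xyz = tri_counts[(x, y, z)]
        if c_xyz > 0:
            return discount(c_xyz) / c_xy            # discounted relative frequency
        # Probability mass freed by discounting the seen continuations ...
        seen = [w for (a, b, w) in tri_counts if (a, b) == (x, y)]
        freed = 1.0 - sum(discount(tri_counts[(x, y, w)]) / c_xy for w in seen)
        # ... is spread over unseen continuations in proportion to the bigram model.
        unseen = 1.0 - sum(p_bigram(y, w) for w in seen)
        return freed * p_bigram(y, z) / unseen if unseen > 0 else 0.0
    return p
```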
Linear Discount • Seen counts are discounted by a constant factor • The discount factor can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data (roughly n1/N, the fraction of training tokens that are singletons) [Ney95]
More General Formulation • Drawback of the linear discount: the counts of frequently observed events are modified the most, which goes against the “law of large numbers” • Generalization: make the discount a function of y, determined by cross-validation • Requires more data • Computation is expensive
Absolute Discounting • The discount is an absolute value subtracted from each count, rather than a multiplicative factor • Works pretty well, and is easier than linear discounting
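A sketch of absolute discounting in interpolated back-off form; D = 0.75 is only an illustrative value, and in practice it is estimated from the counts of singletons and doubletons.

```python
def absolute_discount_prob(tri_counts, bi_counts, p_bigram, D=0.75):
    """Subtract a fixed D from every nonzero trigram count and give the
    collected mass to the lower-order (bigram) distribution."""
    def p(x, y, z):
        c_xy = bi_counts[(x, y)]
        if c_xy == 0:
            return p_bigram(y, z)
        n_types = sum(1 for (a, b, _) in tri_counts if (a, b) == (x, y))
        backoff_weight = D * n_types / c_xy          # total mass taken away
        return max(tri_counts[(x, y, z)] - D, 0) / c_xy + backoff_weight * p_bigram(y, z)
    return p
```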
References
[1] Katz S., “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987
[2] Ney H., Essen U., Kneser R., “On the estimation of ‘small’ probabilities by leaving-one-out,” IEEE Trans. on PAMI 17(12):1202-1212, 1995
[3] Joshua Goodman, “The State of the Art in Language Modeling” (tutorial), research.microsoft.com/~joshuago/lm-tutorial-public.ppt