Part 5: Language Model. CSE717, SPRING 2008, CUBS, Univ at Buffalo
Examples of Good & Bad Language Models. Excerpt from Herman, comic strips by Jim Unger
What’s a Language Model • A language model is a probability distribution over word sequences • P(“And nothing but the truth”) ≈ 0.001 • P(“And nuts sing on the roof”) ≈ 0
What’s a language model for? • Speech recognition • Handwriting recognition • Spelling correction • Optical character recognition • Machine translation • (and anyone doing statistical modeling)
The Equation • W* = argmax_W P(W | O) = argmax_W P(O | W) P(W) • The observation O can be image features (handwriting recognition), acoustics (speech recognition), a word sequence in another language (MT), etc.
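A minimal Python sketch of this decision rule. The candidate hypotheses and their scores are made up for illustration; in a real recognizer the observation score P(O|W) would come from the acoustic or handwriting model and P(W) from the language model.

```python
import math

# Illustrative numbers only: a plausible and an implausible hypothesis
# for the same observation O.
candidates = {
    "and nothing but the truth": {"channel_logprob": -12.0, "lm_prob": 1e-3},
    "and nuts sing on the roof": {"channel_logprob": -11.5, "lm_prob": 1e-9},
}

def decode(candidates):
    """Pick the word sequence W maximizing P(O|W) * P(W), in log space."""
    return max(
        candidates,
        key=lambda w: candidates[w]["channel_logprob"]
                      + math.log(candidates[w]["lm_prob"]),
    )

print(decode(candidates))  # the language model vetoes the implausible hypothesis
```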
How Language Models Work • Hard to compute P(“And nothing but the truth”) directly • Decompose the probability with the chain rule: P(“and nothing but the truth”) = P(“and”) P(“nothing|and”) P(“but|and nothing”) P(“the|and nothing but”) P(“truth|and nothing but the”)
The Trigram Approximation • Assume each word depends only on the previous two words: P(“the|and nothing but”) ≈ P(“the|nothing but”), P(“truth|and nothing but the”) ≈ P(“truth|but the”)
How to find probabilities? • Count from real text: Pr(“the” | “nothing but”) ≈ c(“nothing but the”) / c(“nothing but”)
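A minimal Python sketch of this relative-frequency (counting) estimate on a toy corpus; the corpus and function names are illustrative.

```python
from collections import Counter

def trigram_mle(tokens):
    """Relative-frequency estimate: P(z | x y) = c(x y z) / c(x y)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return lambda x, y, z: tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0

corpus = "and nothing but the truth and nothing but the truth".split()
p = trigram_mle(corpus)
print(p("nothing", "but", "the"))   # 1.0 in this tiny corpus
```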
Evaluation • How can you tell a good language model from a bad one? • Run a speech recognizer (or your application of choice), calculate word error rate • Slow • Specific to your recognizer
Perplexity: An Example • Data: “the whole truth and nothing but the truth” (N = 8 words) • Lexicon: L = {the, whole, truth, and, nothing, but} • Model 1: unigram, Pr(L1) = … = Pr(L6) = 1/6 • Model 2: unigram, Pr(“the”) = Pr(“truth”) = 1/4, Pr(“whole”) = Pr(“and”) = Pr(“nothing”) = Pr(“but”) = 1/8
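A small Python sketch that evaluates both models on the example sentence, using the usual definition PP = P(w1 … wN)^(−1/N).

```python
import math

data = "the whole truth and nothing but the truth".split()

# Model 1: uniform unigram over the 6-word lexicon.
model1 = {w: 1 / 6 for w in {"the", "whole", "truth", "and", "nothing", "but"}}

# Model 2: non-uniform unigram from the slide.
model2 = {"the": 1/4, "truth": 1/4,
          "whole": 1/8, "and": 1/8, "nothing": 1/8, "but": 1/8}

def perplexity(model, words):
    """PP = P(w_1 ... w_N)^(-1/N), computed in log space for stability."""
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))

print(perplexity(model1, data))  # 6.0
print(perplexity(model2, data))  # ~5.66: the better-matched model scores lower
```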
Perplexity: Is Lower Better? • Remarkable fact: the “true” model for the data has the lowest possible perplexity • The lower the perplexity, the closer we are to the true model • Perplexity correlates well with the error rate of the recognition task • Correlates better when both models are trained on the same data • Doesn’t correlate well when the training data changes
Smoothing • Count-based estimates are terrible on test data: if C(xyz) = 0, the estimated probability is 0 • P(“sing”|“nuts”) = 0 leads to infinite perplexity!
Smoothing: Add One • Add-one smoothing: P(z|xy) = (C(xyz) + 1) / (C(xy) + V), where V is the vocabulary size • Add-delta smoothing: P(z|xy) = (C(xyz) + δ) / (C(xy) + δV) • Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated
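A minimal Python sketch of add-delta smoothing for trigrams (delta = 1 recovers add-one); the toy counts are illustrative.

```python
from collections import Counter

def add_delta_prob(tri_counts, bi_counts, vocab_size, delta=1.0):
    """Add-delta estimate: P(z|xy) = (C(xyz) + delta) / (C(xy) + delta * V).
    delta = 1 gives classic add-one (Laplace) smoothing."""
    def p(x, y, z):
        return (tri_counts[(x, y, z)] + delta) / (bi_counts[(x, y)] + delta * vocab_size)
    return p

tokens = "and nothing but the truth".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
p = add_delta_prob(tri, bi, vocab_size=len(set(tokens)), delta=1.0)
print(p("nothing", "but", "sing"))   # unseen trigram now gets a nonzero probability
```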
Smoothing: Simple Interpolation • Interpolate trigram, bigram, and unigram estimates for the best combination: P(z|xy) ≈ λ P(z|xy) + μ P(z|y) + (1 − λ − μ) P(z) • Almost good enough
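A sketch of the interpolated estimate, combining three estimators supplied as functions; the weights shown are placeholders, since in practice they are tuned on held-out data.

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(z|xy) = l3*P(z|xy) + l2*P(z|y) + l1*P(z), with l3 + l2 + l1 = 1."""
    l3, l2, l1 = lambdas
    def p(x, y, z):
        # Each higher-order component may be zero; the mixture stays nonzero
        # as long as the unigram estimate is nonzero.
        return l3 * p_tri(x, y, z) + l2 * p_bi(y, z) + l1 * p_uni(z)
    return p
```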
Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87] • Discounting: reduce the probability of observed n-grams, freeing some probability mass • Redistribution: give the freed mass to events predicted by the (n−1)-gram model
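A schematic Python sketch of the back-off idea with a generic discount function; Katz's actual formulation uses Good-Turing discounted counts, and this version only illustrates the discount-then-redistribute structure.

```python
def backoff_prob(tri_counts, bi_counts, p_bigram, discount):
    """Discount seen trigrams, then hand the freed mass to the bigram model."""
    def p(x, y, z):
        c_xy = bi_counts[(x, y)]
        if c_xy == 0:
            return p_bigram(y, z)                    # context never seen: back off fully
        c_xyz = tri_counts[(x, y, z)]
        if c_xyz > 0:
            return discount(c_xyz) / c_xy            # discounted relative frequency
        # Probability mass freed by discounting the seen continuations ...
        seen = [w for (a, b, w) in tri_counts if (a, b) == (x, y)]
        freed = 1.0 - sum(discount(tri_counts[(x, y, w)]) / c_xy for w in seen)
        # ... is spread over unseen continuations in proportion to the bigram model.
        unseen = 1.0 - sum(p_bigram(y, w) for w in seen)
        return freed * p_bigram(y, z) / unseen if unseen > 0 else 0.0
    return p
```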
Linear Discount • Seen counts are discounted by a constant factor • The discount factor can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data (roughly n1/N, the fraction of training tokens that are singletons) [Ney95]
More General Formulation • Drawback of the linear discount: the counts of frequently observed events are modified the most, which goes against the “law of large numbers” • Generalization: make the discount a function of y, determined by cross-validation • Requires more data • Computation is expensive
Absolute Discounting • The discount is an absolute value subtracted from each count, rather than a multiplicative factor • Works pretty well, and is easier than linear discounting
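A sketch of absolute discounting in interpolated back-off form; D = 0.75 is only an illustrative value, and in practice it is estimated from the counts of singletons and doubletons.

```python
def absolute_discount_prob(tri_counts, bi_counts, p_bigram, D=0.75):
    """Subtract a fixed D from every nonzero trigram count and give the
    collected mass to the lower-order (bigram) distribution."""
    def p(x, y, z):
        c_xy = bi_counts[(x, y)]
        if c_xy == 0:
            return p_bigram(y, z)
        n_types = sum(1 for (a, b, _) in tri_counts if (a, b) == (x, y))
        backoff_weight = D * n_types / c_xy          # total mass taken away
        return max(tri_counts[(x, y, z)] - D, 0) / c_xy + backoff_weight * p_bigram(y, z)
    return p
```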
References
[1] Katz S., “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987
[2] Ney H., Essen U., Kneser R., “On the estimation of ‘small’ probabilities by leaving-one-out,” IEEE Trans. on PAMI 17(12):1202-1212, 1995
[3] Joshua Goodman, “The State of the Art in Language Modeling” (tutorial), research.microsoft.com/~joshuago/lm-tutorial-public.ppt