Language Models & Smoothing • Shallow Processing Techniques for NLP • Ling570 • October 19, 2011
Announcements • Career exploration talk: Bill McNeill • Thursday (10/20): 2:30-3:30pm • Thomson 135 & Online (Treehouse URL) • Treehouse meeting: Friday 10/21: 11-12 • Thesis topic brainstorming • GP Meeting: Friday 10/21: 3:30-5pm • PCAR 291 & Online (…/clmagrad)
Roadmap • Ngram language models • Constructing language models • Generative language models • Evaluation: • Training and Testing • Perplexity • Smoothing: • Laplace smoothing • Good-Turing smoothing • Interpolation & backoff
Ngram Language Models • Independence assumptions moderate data needs • Approximate probability given all prior words • Assume finite history • Unigram: Probability of word in isolation • Bigram: Probability of word given 1 previous • Trigram: Probability of word given 2 previous • N-gram approximation: P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-N+1} … w_{n-1}) • Bigram sequence: P(w_1 … w_n) ≈ Π_k P(w_k | w_{k-1})
Berkeley Restaurant Project Sentences • can you tell me about any good cantonese restaurants close by • mid priced thai food is what i’m looking for • tell me about chez panisse • can you give me a listing of the kinds of food that are available • i’m looking for a good place to eat breakfast • when is caffe venezia open during the day
Bigram Counts • Out of 9222 sentences • E.g., “I want” occurred 827 times
Bigram Probabilities • Divide bigram counts by prefix unigram counts to get probabilities.
Bigram Estimates of Sentence Probabilities • P(<s> i want english food </s>) = P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food) = .000031
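A minimal sketch of how these bigram estimates combine into a sentence probability, in Python. The toy corpus and the resulting value are illustrative assumptions, not the actual Berkeley Restaurant data:

```python
from collections import Counter

# Toy training corpus (illustrative; the real model used 9222 sentences).
corpus = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> tell me about chez panisse </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """MLE estimate: bigram count divided by the prefix unigram count."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities over the sentence."""
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)  # unseen bigrams contribute 0
    return p

print(sentence_prob("<s> i want english food </s>"))  # 1/3 on this toy corpus
```

Note that any unseen bigram zeroes out the whole product; that is exactly the problem the smoothing methods later in this deck address.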
Kinds of Knowledge • What types of knowledge are captured by ngram models? • P(english|want) = .0011 • P(chinese|want) = .0065 • P(to|want) = .66 • P(eat|to) = .28 • P(food|to) = 0 • P(want|spend) = 0 • P(i|<s>) = .25 • World knowledge • Syntax • Discourse
Probabilistic Language Generation • Coin-flipping models • A sentence is generated by a randomized algorithm • The generator can be in one of several “states” • Flip coins to choose the next state • Flip other coins to decide which letter or word to output
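A minimal sketch of this coin-flipping generation for a bigram word model, in Python (the toy corpus is an illustrative assumption; the Shannon-style examples below were generated from English text):

```python
import random
from collections import Counter, defaultdict

# Toy corpus (illustrative assumption).
corpus = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> i want to eat </s>",
]

# For each word ("state"), the distribution over possible next words.
successors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        successors[prev][word] += 1

def generate():
    """Repeatedly 'flip coins' to pick the next word given the current state."""
    word, output = "<s>", []
    while word != "</s>":
        nexts = successors[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word != "</s>":
            output.append(word)
    return " ".join(output)

print(generate())  # e.g. "i want chinese food"
```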
Generated Language: Effects of N • 1. Zero-order approximation: • XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD • 2. First-order approximation: • OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL • 3. Second-order approximation: • ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Word Models: Effects of N • 1. First-order approximation: • REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE • 2. Second-order approximation: • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Evaluation - General • Evaluation crucial for NLP systems • Required for most publishable results • Should be integrated early • Many factors: • Data • Metrics • Prior results • …
Evaluation Guidelines • Evaluate your system • Use standard metrics • Use (standard) training/dev/test sets • Describing experiments (intrinsic vs extrinsic): • Clearly lay out experimental setting • Compare to baseline and previous results • Perform error analysis • Show utility in real application (ideally)
Data Organization • Training: • Training data: used to learn model parameters • Held-out data: used to tune additional parameters • Development (Dev) set: • Used to evaluate system during development • Avoid overfitting • Test data: Used for final, blind evaluation • Typical division of data: 80/10/10 • Tradeoffs • Cross-validation
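A minimal sketch of the typical 80/10/10 split (the proportions and seed are illustrative defaults):

```python
import random

def split_data(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle and partition examples into train/dev/test portions."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    n_train = int(len(examples) * train_frac)
    n_dev = int(len(examples) * dev_frac)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]  # untouched until final evaluation
    return train, dev, test

train, dev, test = split_data(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```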
Evaluating LMs • Extrinsic evaluation (aka in vivo): • Embed alternate models in system • See which improves overall application • MT, IR, … • Intrinsic evaluation: • Metric applied directly to model • Independent of larger application • Perplexity • Why not just extrinsic?
Perplexity • Intuition: • A better model will have tighter fit to test data • Will yield higher probability on test data • Formally: PP(W) = P(w_1 w_2 … w_N)^(-1/N) • For bigrams: PP(W) = ( Π_{i=1..N} P(w_i | w_{i-1}) )^(-1/N) • Inversely related to probability of sequence • Higher probability → lower perplexity • Can be viewed as average branching factor of model
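A minimal sketch of computing bigram perplexity, in Python. It assumes some bigram_prob(prev, word) function such as the MLE sketch earlier; working in log space avoids underflow on long test sets:

```python
import math

def perplexity(words, bigram_prob):
    """PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space.

    Here N counts the bigram predictions made over the test sequence.
    """
    log_prob = 0.0
    for prev, word in zip(words, words[1:]):
        log_prob += math.log(bigram_prob(prev, word))  # fails if P = 0: smoothing needed
    n = len(words) - 1
    return math.exp(-log_prob / n)

# With the earlier toy model: P(<s> i want english food </s>) = 1/3 over
# 5 predictions, so PP = (1/3)^(-1/5) ≈ 1.25.
```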
Perplexity Example • Alphabet: 0,1,…,9 • Equiprobable: P(X) = 1/10 • PP(W) = ( (1/10)^N )^(-1/N) = 10 • If the probability of 0 is higher, PP(W) will be lower
Thinking about Perplexity • Given some vocabulary V with a uniform distribution • I.e., P(w) = 1/|V| • Under a unigram LM, the perplexity is • PP(W) = ( Π_{i=1..N} 1/|V| )^(-1/N) = ( (1/|V|)^N )^(-1/N) = |V|
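A quick numeric check of this result, reusing the hypothetical perplexity function sketched above with a uniform model over a 10-symbol vocabulary:

```python
# Uniform model: every prediction has probability 1/|V| = 1/10.
uniform = lambda prev, word: 1 / 10

digits = list("0123456789") + ["0"]   # any test string over the alphabet
print(perplexity(digits, uniform))    # 10.0 (= |V|), up to float rounding
```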