Language Models & Smoothing • Shallow Processing Techniques for NLP • Ling570 • October 19, 2011
Announcements • Career exploration talk: Bill McNeill • Thursday (10/20): 2:30-3:30pm • Thomson 135 & Online (Treehouse URL) • Treehouse meeting: Friday 10/21: 11-12 • Thesis topic brainstorming • GP Meeting: Friday 10/21: 3:30-5pm • PCAR 291 & Online (…/clmagrad)
Roadmap • Ngram language models • Constructing language models • Generative language models • Evaluation: • Training and Testing • Perplexity • Smoothing: • Laplace smoothing • Good-Turing smoothing • Interpolation & backoff
Ngram Language Models • Independence assumptions moderate data needs • Approximate probability given all prior words • Assume finite history • Unigram: Probability of word in isolation • Bigram: Probability of word given 1 previous • Trigram: Probability of word given 2 previous • N-gram approximation: P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-N+1} … w_{n-1}) • Bigram sequence: P(w_1 … w_n) ≈ Π_k P(w_k | w_{k-1})
Berkeley Restaurant Project Sentences • can you tell me about any good cantonese restaurants close by • mid priced thai food is what i’m looking for • tell me about chez panisse • can you give me a listing of the kinds of food that are available • i’m looking for a good place to eat breakfast • when is caffe venezia open during the day
Bigram Counts • Out of 9222 sentences • E.g., “I want” occurred 827 times
Bigram Probabilities • Divide bigram counts by prefix unigram counts to get probabilities.
Bigram Estimates of Sentence Probabilities • P(<s> i want english food </s>) = P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food) = .000031
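A minimal sketch of how these bigram estimates combine into a sentence probability, in Python. The toy corpus and the resulting value are illustrative assumptions, not the actual Berkeley Restaurant data:

```python
from collections import Counter

# Toy training corpus (illustrative; the real model used 9222 sentences).
corpus = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> tell me about chez panisse </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """MLE estimate: bigram count divided by the prefix unigram count."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities over the sentence."""
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)  # unseen bigrams contribute 0
    return p

print(sentence_prob("<s> i want english food </s>"))  # 1/3 on this toy corpus
```

Note that any unseen bigram zeroes out the whole product; that is exactly the problem the smoothing methods later in this deck address.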
Kinds of Knowledge • What types of knowledge are captured by ngram models? • P(english|want) = .0011 • P(chinese|want) = .0065 • P(to|want) = .66 • P(eat|to) = .28 • P(food|to) = 0 • P(want|spend) = 0 • P(i|<s>) = .25 • World knowledge • Syntax • Discourse
Probabilistic Language Generation • Coin-flipping models • A sentence is generated by a randomized algorithm • The generator can be in one of several “states” • Flip coins to choose the next state • Flip other coins to decide which letter or word to output
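A minimal sketch of this coin-flipping generation for a bigram word model, in Python (the toy corpus is an illustrative assumption; the Shannon-style examples below were generated from English text):

```python
import random
from collections import Counter, defaultdict

# Toy corpus (illustrative assumption).
corpus = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> i want to eat </s>",
]

# For each word ("state"), the distribution over possible next words.
successors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        successors[prev][word] += 1

def generate():
    """Repeatedly 'flip coins' to pick the next word given the current state."""
    word, output = "<s>", []
    while word != "</s>":
        nexts = successors[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word != "</s>":
            output.append(word)
    return " ".join(output)

print(generate())  # e.g. "i want chinese food"
```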
Generated Language: Effects of N • 1. Zero-order approximation: • XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD • 2. First-order approximation: • OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL • 3. Second-order approximation: • ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Word Models: Effects of N • 1. First-order approximation: • REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE • 2. Second-order approximation: • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Evaluation - General • Evaluation crucial for NLP systems • Required for most publishable results • Should be integrated early • Many factors: • Data • Metrics • Prior results • …
Evaluation Guidelines • Evaluate your system • Use standard metrics • Use (standard) training/dev/test sets • Describing experiments (intrinsic vs extrinsic): • Clearly lay out experimental setting • Compare to baseline and previous results • Perform error analysis • Show utility in real application (ideally)
Data Organization • Training: • Training data: used to learn model parameters • Held-out data: used to tune additional parameters • Development (Dev) set: • Used to evaluate system during development • Avoid overfitting • Test data: Used for final, blind evaluation • Typical division of data: 80/10/10 • Tradeoffs • Cross-validation
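A minimal sketch of the typical 80/10/10 split (the proportions and seed are illustrative defaults):

```python
import random

def split_data(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle and partition examples into train/dev/test portions."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducibility
    n_train = int(len(examples) * train_frac)
    n_dev = int(len(examples) * dev_frac)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]  # untouched until final evaluation
    return train, dev, test

train, dev, test = split_data(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```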
Evaluating LMs • Extrinsic evaluation (aka in vivo): • Embed alternate models in system • See which improves overall application • MT, IR, … • Intrinsic evaluation: • Metric applied directly to model • Independent of larger application • Perplexity • Why not just extrinsic?
Perplexity • Intuition: • A better model will have tighter fit to test data • Will yield higher probability on test data • Formally: PP(W) = P(w_1 w_2 … w_N)^(-1/N) • For bigrams: PP(W) = ( Π_{i=1..N} P(w_i | w_{i-1}) )^(-1/N) • Inversely related to probability of sequence • Higher probability → lower perplexity • Can be viewed as average branching factor of model
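A minimal sketch of computing bigram perplexity, in Python. It assumes some bigram_prob(prev, word) function such as the MLE sketch earlier; working in log space avoids underflow on long test sets:

```python
import math

def perplexity(words, bigram_prob):
    """PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space.

    Here N counts the bigram predictions made over the test sequence.
    """
    log_prob = 0.0
    for prev, word in zip(words, words[1:]):
        log_prob += math.log(bigram_prob(prev, word))  # fails if P = 0: smoothing needed
    n = len(words) - 1
    return math.exp(-log_prob / n)

# With the earlier toy model: P(<s> i want english food </s>) = 1/3 over
# 5 predictions, so PP = (1/3)^(-1/5) ≈ 1.25.
```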
Perplexity Example • Alphabet: 0,1,…,9 • Equiprobable: P(X) = 1/10 • PP(W) = ( (1/10)^N )^(-1/N) = 10 • If the probability of 0 is higher, PP(W) will be lower
Thinking about Perplexity • Given some vocabulary V with a uniform distribution • I.e., P(w) = 1/|V| • Under a unigram LM, the perplexity is • PP(W) = ( Π_{i=1..N} 1/|V| )^(-1/N) = ( (1/|V|)^N )^(-1/N) = |V|
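A quick numeric check of this result, reusing the hypothetical perplexity function sketched above with a uniform model over a 10-symbol vocabulary:

```python
# Uniform model: every prediction has probability 1/|V| = 1/10.
uniform = lambda prev, word: 1 / 10

digits = list("0123456789") + ["0"]   # any test string over the alphabet
print(perplexity(digits, uniform))    # 10.0 (= |V|), up to float rounding
```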