Language Modeling
Roadmap
• Motivation: LM applications
• N-grams
• Training and testing
• Evaluation:
  • Perplexity
  • Entropy
• Smoothing (next class):
  • Laplace smoothing
  • Good-Turing smoothing
  • Interpolation & backoff
Predicting Words
• Given a sequence of words, the next word is (somewhat) predictable:
  • I'd like to place a collect …
• N-gram models: predict the next word given the previous N − 1 words
• Language models (LMs): statistical models of word sequences
• Approach:
  • Build a model of word sequences from a corpus
  • Given alternative sequences, select the most probable
Predicting Sequences
• Given an N-gram model, we can also answer questions about the probability of a whole sequence
• Comparative probabilities of sequences, e.g., ranking candidate outputs in MT (see the sketch below):
  In: Gestern habe ich meine Mutter angerufen.
  Out: Yesterday have I my mother called.
       Yesterday I have my mom called.
       Yesterday I have called my mom.
       Yesterday I has called my mom.
       Yesterday I call my mom.
       I called my mother yesterday.
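As a concrete illustration, here is a minimal reranking sketch in Python. The bigram log-probabilities are invented toy values; a real system would estimate them from a large corpus, as described in the rest of this lecture:

```python
# Toy table of bigram log-probabilities (hypothetical values, for illustration only).
BIGRAM_LOGPROB = {
    ("yesterday", "i"): -1.5, ("i", "have"): -1.3, ("have", "called"): -2.5,
    ("called", "my"): -0.5, ("my", "mom"): -1.4, ("my", "mother"): -1.2,
    ("i", "called"): -1.0, ("mother", "yesterday"): -2.0,
}
UNSEEN = -10.0  # crude penalty for bigrams not in the table

def score(sentence):
    """Sum of bigram log-probabilities over a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return sum(BIGRAM_LOGPROB.get(bigram, UNSEEN)
               for bigram in zip(words, words[1:]))

candidates = [
    "Yesterday have I my mother called",
    "Yesterday I have called my mom",
    "I called my mother yesterday",
]
print(max(candidates, key=score))  # prints: I called my mother yesterday
```

The disfluent candidates accumulate unseen or low-probability bigrams, so the fluent one wins.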
N-gram LM Applications
• Used in:
  • Speech recognition
  • Spelling correction
  • Part-of-speech tagging
  • Machine translation
  • Information retrieval
  • Language identification
Terminology
• Corpus (pl. corpora): an online collection of text or speech
  • E.g., Brown corpus: 1M words, balanced text collection
  • E.g., Switchboard: 240 hrs of speech; ~3M words
• Wordform: the full inflected or derived form of a word: cats, glottalized
• Word types: # of distinct words in a corpus
• Word tokens: total # of words in a corpus
Corpus Counts
• Estimate probabilities from counts in large collections of text/speech
• Should we count:
  • Wordform vs. lemma?
  • Case? Punctuation? Disfluencies?
  • Types vs. tokens?
Words, Counts and Prediction
• They picnicked by the pool, then lay back on the grass and looked at the stars.
  • Word types (excluding punctuation): 14
  • Word tokens (excluding punctuation): 16 (see the counting sketch below)
• I do uh main- mainly business data processing
  • Utterance (the spoken equivalent of a "sentence")
  • What about disfluencies?
    • main-: a fragment
    • uh: a filler (aka filled pause)
  • Keep or discard depending on the application: they can help prediction (uh vs. um behave differently)
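A quick sketch verifying the type/token counts above; note that lowercasing and stripping punctuation are themselves normalization decisions of the kind just discussed:

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Lowercase and strip punctuation before tokenizing (a normalization choice).
tokens = re.findall(r"[a-z]+", sentence.lower())

print(len(tokens))       # 16 word tokens
print(len(set(tokens)))  # 14 word types ("the" occurs three times)
```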
LM Task
• Training: given a corpus of text, learn probabilities of word sequences
• Testing: given the trained LM and new text,
  • determine sequence probabilities, or
  • select the most probable sequence among alternatives
Word Prediction
• Goal: given some history, what is the probability of the next word?
  • Formally, P(w|h)
  • E.g., P(call | I'd like to place a collect)
• How can we compute it? Relative frequency in a corpus (see the sketch below):
  • C(I'd like to place a collect call) / C(I'd like to place a collect)
• Issues?
  • Zero counts: language is productive!
  • Similarly for the joint probability of a length-N word sequence: the count of that sequence over the count of all length-N sequences
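A sketch of the relative-frequency computation, assuming a pre-tokenized corpus. The one-utterance corpus here is a hypothetical stand-in; with any realistic history most counts are zero, which is exactly the sparseness problem just noted:

```python
def p_next(corpus_tokens, history, word):
    """Relative-frequency estimate: P(word | history) = C(history + word) / C(history)."""
    n = len(history)
    hist_count = follow_count = 0
    for i in range(len(corpus_tokens) - n):
        if corpus_tokens[i:i + n] == history:
            hist_count += 1
            if corpus_tokens[i + n] == word:
                follow_count += 1
    return follow_count / hist_count if hist_count else 0.0

# Hypothetical tiny corpus; real corpora are vastly larger.
corpus = "i'd like to place a collect call please".split()
print(p_next(corpus, ["a", "collect"], "call"))  # 1.0 in this toy corpus
```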
Word Sequence Probability
• Notation: P(Xi = the) is written as P(the)
• Chain rule: P(w1 w2 … wn) = P(w1) * P(w2|w1) * P(w3|w1 w2) * … * P(wn|w1 … wn-1)
  • Computes the probability of a word sequence from word-prediction-by-history terms
• Issues?
  • Potentially infinite history
  • Language is infinitely productive: we cannot compute these conditional probabilities exactly
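As a worked instance of the chain rule on a three-word sequence: P(its water is) = P(its) * P(water | its) * P(is | its water). Each factor conditions on the entire preceding history, so the later factors in a long sentence condition on histories we will essentially never have seen.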
Markov Assumptions
• Exact computation requires too much data, and we may not have it all (even on the Web!)
• Approximate the probability given all prior words by assuming a finite history:
  • Unigram: probability of a word in isolation (0th order)
  • Bigram: probability of a word given 1 previous word (first-order Markov)
  • Trigram: probability of a word given 2 previous words
• N-gram approximation: P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
• Bigram sequence: P(w1 … wn) ≈ ∏k P(wk | wk-1)
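Continuing the worked example under the bigram assumption: P(is | its water) is approximated by P(is | water), so the sequence probability becomes P(its) * P(water | its) * P(is | water), and every factor now needs only bigram counts.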
Unigram Models
• P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn)
• Training: estimate P(w) from the corpus
  • Relative frequency: P(w) = C(w)/N, where N = # tokens in the corpus
  • How many parameters?
• Testing: for a sentence s, compute P(s) (see the sketch below)
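A minimal unigram MLE sketch on a toy corpus; P(w) = C(w)/N, and a sentence is scored in log space to avoid underflow:

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Relative-frequency estimates: P(w) = C(w)/N."""
    n = len(tokens)
    return {w: c / n for w, c in Counter(tokens).items()}

def logprob(model, words):
    # Product of unigram probabilities, computed in log space.
    # Raises KeyError on unseen words: the OOV problem smoothing addresses.
    return sum(math.log(model[w]) for w in words)

corpus = "the cat sat on the mat the cat ran".split()
model = train_unigram(corpus)
print(model["the"])                           # 3/9 ≈ 0.333
print(logprob(model, "the cat sat".split()))
```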
Bigram Models
• P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
  ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)
• Training:
  • Relative frequency: P(wi|wi-1) = C(wi-1 wi) / C(wi-1)
  • How many parameters?
• Testing: for a sentence s, compute P(s) (see the sketch below)
• Model with a PFA (probabilistic finite automaton):
  • Input symbols? Probabilities on arcs? States?
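A companion sketch for bigram MLE with BOS/EOS padding, applying the relative-frequency formula above to a hypothetical two-sentence corpus:

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(sentences):
    """P(wi | wi-1) = C(wi-1 wi) / C(wi-1), with BOS/EOS padding."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = [BOS] + s.split() + [EOS]
        uni.update(words[:-1])             # denominator counts C(wi-1)
        bi.update(zip(words, words[1:]))   # numerator counts C(wi-1 wi)
    return {pair: c / uni[pair[0]] for pair, c in bi.items()}

def sentence_logprob(model, sentence):
    # Product of bigram probabilities over the padded sentence, in log space.
    words = [BOS] + sentence.split() + [EOS]
    return sum(math.log(model[pair]) for pair in zip(words, words[1:]))

model = train_bigram(["i am sam", "sam i am"])
print(model[("i", "am")])   # C(i am)/C(i) = 2/2 = 1.0
print(model[("<s>", "i")])  # 1/2
```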
Trigram Models
• P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
  ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)
• Training:
  • Relative frequency: P(wi|wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
  • How many parameters?
  • How many states?
Recap
• N-grams:
  • # FSA states: |V|^(n-1)
  • # Model parameters: |V|^n (see the worked numbers below)
• Issues:
  • Data sparseness; out-of-vocabulary (OOV) items
  • Smoothing
  • Mismatches between training & test data
  • Other language models
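Worked numbers, assuming a modest vocabulary of |V| = 20,000: a bigram model has 20,000^2 = 4 × 10^8 possible parameters and a trigram model 20,000^3 = 8 × 10^12, far more distinct n-grams than any training corpus contains, which is why most counts are zero and smoothing is needed.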
Maximum Likelihood Estimation (MLE)
• MLE estimate: normalize counts from the corpus to values between 0 and 1
• For a bigram xy, normalize its count by the counts of all bigrams sharing the same first word x:
  P(wn|wn-1) = C(wn-1 wn) / Σw C(wn-1 w)
• Since Σw C(wn-1 w) = C(wn-1), this simplifies to:
  P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
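Worked toy example: for the two-sentence corpus "I am Sam" / "Sam I am" (with BOS/EOS padding), C(I am) = 2 and C(I) = 2, so P(am | I) = 2/2 = 1.0, while P(Sam | am) = C(am Sam)/C(am) = 1/2. This is exactly what the bigram training sketch above computes.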