Language Modeling
Roadmap
• Motivation: LM applications
• N-grams
• Training and testing
• Evaluation:
  • Perplexity
  • Entropy
• Smoothing (next class):
  • Laplace smoothing
  • Good-Turing smoothing
  • Interpolation & backoff
Predicting Words
• Given a sequence of words, the next word is (somewhat) predictable:
  • I'd like to place a collect …
• N-gram models: predict the next word given the previous N − 1 words
• Language models (LMs): statistical models of word sequences
• Approach:
  • Build a model of word sequences from a corpus
  • Given alternative sequences, select the most probable
Predicting Sequences
• Given an N-gram model, we can also answer questions about the probability of a whole sequence
• Comparative probabilities of sequences, e.g., ranking candidate outputs in MT (see the sketch below):
  In: Gestern habe ich meine Mutter angerufen.
  Out: Yesterday have I my mother called.
       Yesterday I have my mom called.
       Yesterday I have called my mom.
       Yesterday I has called my mom.
       Yesterday I call my mom.
       I called my mother yesterday.
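As a concrete illustration, here is a minimal reranking sketch in Python. The bigram log-probabilities are invented toy values; a real system would estimate them from a large corpus, as described in the rest of this lecture:

```python
# Toy table of bigram log-probabilities (hypothetical values, for illustration only).
BIGRAM_LOGPROB = {
    ("yesterday", "i"): -1.5, ("i", "have"): -1.3, ("have", "called"): -2.5,
    ("called", "my"): -0.5, ("my", "mom"): -1.4, ("my", "mother"): -1.2,
    ("i", "called"): -1.0, ("mother", "yesterday"): -2.0,
}
UNSEEN = -10.0  # crude penalty for bigrams not in the table

def score(sentence):
    """Sum of bigram log-probabilities over a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return sum(BIGRAM_LOGPROB.get(bigram, UNSEEN)
               for bigram in zip(words, words[1:]))

candidates = [
    "Yesterday have I my mother called",
    "Yesterday I have called my mom",
    "I called my mother yesterday",
]
print(max(candidates, key=score))  # prints: I called my mother yesterday
```

The disfluent candidates accumulate unseen or low-probability bigrams, so the fluent one wins.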
N-gram LM Applications
• Used in:
  • Speech recognition
  • Spelling correction
  • Part-of-speech tagging
  • Machine translation
  • Information retrieval
  • Language identification
Terminology
• Corpus (pl. corpora): an online collection of text or speech
  • E.g., Brown corpus: 1M words, balanced text collection
  • E.g., Switchboard: 240 hrs of speech; ~3M words
• Wordform: the full inflected or derived form of a word: cats, glottalized
• Word types: # of distinct words in a corpus
• Word tokens: total # of words in a corpus
Corpus Counts
• Estimate probabilities from counts in large collections of text/speech
• Should we count:
  • Wordform vs. lemma?
  • Case? Punctuation? Disfluencies?
  • Types vs. tokens?
Words, Counts and Prediction
• They picnicked by the pool, then lay back on the grass and looked at the stars.
  • Word types (excluding punctuation): 14
  • Word tokens (excluding punctuation): 16 (see the counting sketch below)
• I do uh main- mainly business data processing
  • Utterance (the spoken equivalent of a "sentence")
  • What about disfluencies?
    • main-: a fragment
    • uh: a filler (aka filled pause)
  • Keep or discard depending on the application: they can help prediction (uh vs. um behave differently)
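A quick sketch verifying the type/token counts above; note that lowercasing and stripping punctuation are themselves normalization decisions of the kind just discussed:

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Lowercase and strip punctuation before tokenizing (a normalization choice).
tokens = re.findall(r"[a-z]+", sentence.lower())

print(len(tokens))       # 16 word tokens
print(len(set(tokens)))  # 14 word types ("the" occurs three times)
```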
LM Task
• Training: given a corpus of text, learn probabilities of word sequences
• Testing: given the trained LM and new text,
  • determine sequence probabilities, or
  • select the most probable sequence among alternatives
Word Prediction
• Goal: given some history, what is the probability of the next word?
  • Formally, P(w|h)
  • E.g., P(call | I'd like to place a collect)
• How can we compute it? Relative frequency in a corpus (see the sketch below):
  • C(I'd like to place a collect call) / C(I'd like to place a collect)
• Issues?
  • Zero counts: language is productive!
  • Similarly for the joint probability of a length-N word sequence: the count of that sequence over the count of all length-N sequences
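A sketch of the relative-frequency computation, assuming a pre-tokenized corpus. The one-utterance corpus here is a hypothetical stand-in; with any realistic history most counts are zero, which is exactly the sparseness problem just noted:

```python
def p_next(corpus_tokens, history, word):
    """Relative-frequency estimate: P(word | history) = C(history + word) / C(history)."""
    n = len(history)
    hist_count = follow_count = 0
    for i in range(len(corpus_tokens) - n):
        if corpus_tokens[i:i + n] == history:
            hist_count += 1
            if corpus_tokens[i + n] == word:
                follow_count += 1
    return follow_count / hist_count if hist_count else 0.0

# Hypothetical tiny corpus; real corpora are vastly larger.
corpus = "i'd like to place a collect call please".split()
print(p_next(corpus, ["a", "collect"], "call"))  # 1.0 in this toy corpus
```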
Word Sequence Probability
• Notation: P(Xi = the) is written as P(the)
• Chain rule: P(w1 w2 … wn) = P(w1) * P(w2|w1) * P(w3|w1 w2) * … * P(wn|w1 … wn-1)
  • Computes the probability of a word sequence from word-prediction-by-history terms
• Issues?
  • Potentially infinite history
  • Language is infinitely productive: we cannot compute these conditional probabilities exactly
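As a worked instance of the chain rule on a three-word sequence: P(its water is) = P(its) * P(water | its) * P(is | its water). Each factor conditions on the entire preceding history, so the later factors in a long sentence condition on histories we will essentially never have seen.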
Markov Assumptions
• Exact computation requires too much data, and we may not have it all (even on the Web!)
• Approximate the probability given all prior words by assuming a finite history:
  • Unigram: probability of a word in isolation (0th order)
  • Bigram: probability of a word given 1 previous word (first-order Markov)
  • Trigram: probability of a word given 2 previous words
• N-gram approximation: P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
• Bigram sequence: P(w1 … wn) ≈ ∏k P(wk | wk-1)
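Continuing the worked example under the bigram assumption: P(is | its water) is approximated by P(is | water), so the sequence probability becomes P(its) * P(water | its) * P(is | water), and every factor now needs only bigram counts.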
Unigram Models
• P(w1 w2 … wn) ≈ P(w1) * P(w2) * … * P(wn)
• Training: estimate P(w) from the corpus
  • Relative frequency: P(w) = C(w)/N, where N = # tokens in the corpus
  • How many parameters?
• Testing: for a sentence s, compute P(s) (see the sketch below)
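A minimal unigram MLE sketch on a toy corpus; P(w) = C(w)/N, and a sentence is scored in log space to avoid underflow:

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Relative-frequency estimates: P(w) = C(w)/N."""
    n = len(tokens)
    return {w: c / n for w, c in Counter(tokens).items()}

def logprob(model, words):
    # Product of unigram probabilities, computed in log space.
    # Raises KeyError on unseen words: the OOV problem smoothing addresses.
    return sum(math.log(model[w]) for w in words)

corpus = "the cat sat on the mat the cat ran".split()
model = train_unigram(corpus)
print(model["the"])                           # 3/9 ≈ 0.333
print(logprob(model, "the cat sat".split()))
```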
Bigram Models
• P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
  ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)
• Training:
  • Relative frequency: P(wi|wi-1) = C(wi-1 wi) / C(wi-1)
  • How many parameters?
• Testing: for a sentence s, compute P(s) (see the sketch below)
• Model with a PFA (probabilistic finite automaton):
  • Input symbols? Probabilities on arcs? States?
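A companion sketch for bigram MLE with BOS/EOS padding, applying the relative-frequency formula above to a hypothetical two-sentence corpus:

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(sentences):
    """P(wi | wi-1) = C(wi-1 wi) / C(wi-1), with BOS/EOS padding."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = [BOS] + s.split() + [EOS]
        uni.update(words[:-1])             # denominator counts C(wi-1)
        bi.update(zip(words, words[1:]))   # numerator counts C(wi-1 wi)
    return {pair: c / uni[pair[0]] for pair, c in bi.items()}

def sentence_logprob(model, sentence):
    # Product of bigram probabilities over the padded sentence, in log space.
    words = [BOS] + sentence.split() + [EOS]
    return sum(math.log(model[pair]) for pair in zip(words, words[1:]))

model = train_bigram(["i am sam", "sam i am"])
print(model[("i", "am")])   # C(i am)/C(i) = 2/2 = 1.0
print(model[("<s>", "i")])  # 1/2
```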
Trigram Models
• P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
  ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)
• Training:
  • Relative frequency: P(wi|wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
  • How many parameters?
  • How many states?
Recap
• N-grams:
  • # FSA states: |V|^(n-1)
  • # Model parameters: |V|^n (see the worked numbers below)
• Issues:
  • Data sparseness; out-of-vocabulary (OOV) items
  • Smoothing
  • Mismatches between training & test data
  • Other language models
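Worked numbers, assuming a modest vocabulary of |V| = 20,000: a bigram model has 20,000^2 = 4 × 10^8 possible parameters and a trigram model 20,000^3 = 8 × 10^12, far more distinct n-grams than any training corpus contains, which is why most counts are zero and smoothing is needed.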
Maximum Likelihood Estimation (MLE)
• MLE estimate: normalize counts from the corpus to values between 0 and 1
• For a bigram xy, normalize its count by the counts of all bigrams sharing the same first word x:
  P(wn|wn-1) = C(wn-1 wn) / Σw C(wn-1 w)
• Since Σw C(wn-1 w) = C(wn-1), this simplifies to:
  P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
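Worked toy example: for the two-sentence corpus "I am Sam" / "Sam I am" (with BOS/EOS padding), C(I am) = 2 and C(I) = 2, so P(am | I) = 2/2 = 1.0, while P(Sam | am) = C(am Sam)/C(am) = 1/2. This is exactly what the bigram training sketch above computes.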