
LING / C SC 439/539 Statistical Natural Language Processing




  1. LING / C SC 439/539 Statistical Natural Language Processing, Lecture 13, 2/25/2013

  2. Recommended reading • Jurafsky & Martin • Chapter 4: N-gram language models and smoothing

  3. Outline • Generative probabilistic models • Language models • More smoothing • Evaluating language models • Probability and grammaticality • Programming assignment #3

  4. Generative probabilistic models • A generative probabilistic model is a model that defines a probability distribution over the outcomes of the random variables it represents • Let s be a string that a model generates • ∑s p(s) = 1.0 • Some examples of generative models: • Naïve Bayes • Language models • Hidden Markov Models • Probabilistic Context-Free Grammars

  5. Structure in generative models • When we specify independencies and conditional independencies in the probability distribution of a generative model, we are making assumptions about the statistical distribution of the data • Such assumptions may not actually be true! • Structured models can be viewed as: • Generating strings in a particular manner • Imposing a particular structure upon strings • Regardless of whether or not the strings, as a natural phenomenon, were actually generated through our model

  6. Graphical models • Generative models are often visualized as graphical models • Graphical model: shows the probability relationships in a set of random variables • Bayesian networks are an example of a graphical model • (But not all graphical models are Bayes nets or generative; will see these later)

  7. Naïve Bayes viewed as a generative model • To generate C and X1, X2, …, Xn: • Generate the class C • Then generate each Xi conditional upon the value of the class C • [Graphical model: C with an arrow to each of X1, X2, …, Xn]

  8. Common questions for generative models • Given a generative model, and a training corpus, how do we: 1. Estimate the parameters of the model from data? 2. Calculate the probability of a string that the model generates? 3. Find the most likely string that the model generates? 4. Perform classification?

  9. Answers to questions, for Naïve Bayes • Estimate the parameters of the model from data? • Count p(C) and p(X|C) for all X and C, then smooth • Calculate the probability of a string that the model generates? • Multiply factors in Naïve Bayes equation • Find the most likely string that the model generates? • (“String” = a set of features with particular values) • Select the class and feature values that maximize joint probability • Perform classification? • Select highest-probability class for a particular set of features
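A minimal sketch of these four answers for Naïve Bayes (not from the lecture; the toy data, function names, and the add-one smoothing choice are my own assumptions):

```python
from collections import Counter, defaultdict

# Toy training data: (class, feature values) pairs -- purely illustrative.
data = [("pos", ["good", "fun"]), ("pos", ["good", "great"]),
        ("neg", ["bad", "boring"]), ("neg", ["bad", "awful"])]

vocab = {f for _, feats in data for f in feats}
class_counts = Counter(c for c, _ in data)
feat_counts = defaultdict(Counter)
for c, feats in data:
    feat_counts[c].update(feats)

def p_class(c):
    # p(C): relative frequency of the class
    return class_counts[c] / sum(class_counts.values())

def p_feat(f, c):
    # p(X|C), smoothed here with add-one over the known vocabulary
    return (feat_counts[c][f] + 1) / (sum(feat_counts[c].values()) + len(vocab))

def joint(c, feats):
    # Probability of a "string": a class together with its feature values
    p = p_class(c)
    for f in feats:
        p *= p_feat(f, c)
    return p

def classify(feats):
    # Classification: pick the class that maximizes the joint probability
    return max(class_counts, key=lambda c: joint(c, feats))

print(classify(["good", "boring"]))   # -> 'pos' with this toy data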

  10. Outline • Generative probabilistic models • Language models • More smoothing • Evaluating language models • Probability and grammaticality • Programming assignment #3

  11. Language models • We would like to assign a probability to a sequence of words (or other types of units) • Let W = w1, w2, …, wn • Interpret this as a sequence; this is not (just) a joint distribution of N random variables • What is p(W)?

  12. Applications of language models • Machine translation: what's the most likely translation? • Que hambre tengo yo → What hunger have I / Hungry I am so / I am so hungry / Have I that hunger • Speech recognition: what’s the most likely word sequence? • Recognize speech / Wreck a nice beach • Handwriting recognition • Spelling correction • POS tagging

  13. POS tagging is similar • POS tagging of a sentence: What is the most likely tag sequence? • NN VB DT NNS • NN NN DT NNS • Let T = t1, t2, …, tn • What is P(T)? • POS tag model: like a language model, but defined over POS tags

  14. Language modeling • Language modeling is the specific task of predicting the next word in a sequence • Given a sequence of n-1 words, what is the most likely next word? • argmax over wn of p(wn | w1, w2, …, wn-1)
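A rough sketch of this argmax, approximated here with a bigram model (the toy corpus and function name are my assumptions, not part of the lecture):

```python
from collections import Counter

# Toy corpus; in practice this would be a large collection of text.
corpus = "lady gaga wore her meat dress today and her meat dress is famous".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def predict_next(prev_word):
    # argmax over w of p(w | prev_word) = C(prev_word, w) / C(prev_word)
    candidates = {w2: c / unigram_counts[prev_word]
                  for (w1, w2), c in bigram_counts.items() if w1 == prev_word}
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next("her"))   # -> "meat", the most frequent word after "her" here
```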

  15. Language modeling software • CMU-Cambridge Statistical Language Modeling toolkit http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html • SRI Language Modeling toolkit http://www.speech.sri.com/projects/srilm/

  16. Calculate the probability of a sentence W • Example: • Let W = Lady Gaga wore her meat dress today • Use a corpus to estimate probability • p(W) = count(W) / # of sentences in corpus

  17. Problem: sparse data • Zero probability sentence • Brown Corpus (~50,000 sentences) does not contain the sentence “Lady Gaga wore her meat dress today” • Even Google does not find this sentence. • However, it’s a perfectly fine sentence • There must be something wrong with our probability estimation method

  18. Intuition: count shorter sequences • Although we don’t see “Lady Gaga wore her meat dress today” in a corpus, we can find substrings: • Lady • Lady Gaga • wore • her meat dress • Lady Gaga wore her meat dress • Lady Gaga wore • wore her • dress today

  19. p(W): apply chain rule • W = w1, …, wn • p(w1, …, wn) = p(w1, ..., wn-1) * p(wn | w1, ..., wn-1) • p(w1, …, wn) = p(w1) * p(w2 | w1) * p(w3 | w1, w2) * … * p(wn | w1, ..., wn-1)

  20. Estimate probabilities from corpusLet C = “count” • p(w1) = C(w1) / # of words in corpus • p(w2 | w1) = C(w1, w2) / C(w1) • p(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2) • p(wn | w1, ..., wn-1) = C(w1, ..., wn) / C(w1,..., wn-1)
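A small sketch of these count-based (maximum likelihood) estimates, assuming a toy corpus and helper names of my own:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

def ngram_counts(tokens, n):
    # Count every length-n subsequence (n-gram) in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

def p_unigram(w1):
    # p(w1) = C(w1) / # of words in corpus
    return unigrams[(w1,)] / len(corpus)

def p_bigram(w2, w1):
    # p(w2 | w1) = C(w1, w2) / C(w1)
    return bigrams[(w1, w2)] / unigrams[(w1,)]

def p_trigram(w3, w1, w2):
    # p(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_bigram("cat", "the"), p_trigram("sat", "the", "cat"))   # 0.5 0.5
```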

  21. This isn’t any easier • By applying the chain rule, we reduce the calculation to include counts for short sequences, such as C(w1) and C(w1, w2) • But long sequences remain, such as C(w1, ..., wn), which was in the original computation we wanted to perform! p(W) = C(W) / # of sentences in corpus = C(w1, ..., wn) / # of sentences in corpus

  22. Solution: Markov assumption • Markov assumption: limited context • The previous N items matter in the determination of the current item, rather than the entire history • Example: current word depends on previous word • Let p( wn | w1, ..., wn-1 ) = p( wn | wn-1 ) • Under this model: p( today | Lady Gaga wore her meat dress ) = p( today | dress )

  23. Markov assumption is an example of conditional independence • In an Nth-order Markov model, N is the amount of previous context • Applied to language models: • The current word is conditionally dependent upon the previous N words, but conditionally independent of all words previous to those • p( wi | w1, ..., wi-1) = p( wi | wi-N, ..., wi-1)

  24. 0th-order language model • Also called a unigram model • Let p( wn | w1, ..., wn-1) = p( wn ) • Zero context generation • Each word is generated independently of others • As a graphical model (all variables are independent): [nodes w1, w2, …, wn with no edges]

  25. 1st-order language model • Also called a bigram model • Let p( wn | w1, ..., wn-1) = p( wn | wn-1 ) • As a graphical model: [a chain W1 → … → Wn-1 → Wn]

  26. 2nd-order language model • Also called a trigram model • Let p( wn | w1, ..., wn-1) = p( wn | wn-2, wn-1 ) • As a graphical model: [W1 … Wn-2 Wn-1 Wn, with each word depending on the previous two]

  27. Initial items in sequence • In an Nth-order model, the first N items do not yet have N items of previous context, so they are conditioned only on the items generated so far • For example, this doesn’t make sense: • Under a trigram model: p(w0) = p( w0 | w-2, w-1 ) • Trigram model, correctly: • p(w0) = p( w0 ) • p(w1) = p( w1 | w0 ) • p(w2) = p( w2 | w0, w1 )

  28. p(W) under the different language models • Unigram model: p(Lady Gaga wore her meat dress today) = p(Lady) * p(Gaga) * p(wore) * p(her) * p(meat) * p(dress) * p(today) • Bigram model: p(Lady Gaga wore her meat dress today) = p(Lady) * p(Gaga|Lady) * p(wore|Gaga) * p(her|wore) * p(meat|her) * p(dress|meat) * p(today|dress) • Trigram model: p(Lady Gaga wore her meat dress today) = p(Lady) * p(Gaga|Lady) * p(wore|Lady Gaga) * p(her|Gaga wore) * p(meat|wore her) * p(dress|her meat) * p(today|meat dress)
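A sketch of how the bigram factorization above might be computed, using unsmoothed MLE estimates and log space to avoid underflow (the toy corpus and names are mine, not from the lecture):

```python
import math
from collections import Counter

corpus = ("lady gaga wore her meat dress today ".split() * 3
          + "lady gaga wore a red dress yesterday".split())

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def bigram_logprob(sentence):
    # log p(W) = log p(w1) + sum over i of log p(w_i | w_{i-1})
    words = sentence.split()
    logp = math.log(unigrams[words[0]] / total)
    for w1, w2 in zip(words, words[1:]):
        logp += math.log(bigrams[(w1, w2)] / unigrams[w1])
    return logp

print(bigram_logprob("lady gaga wore her meat dress today"))
```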

  29. Summary of language models • Nth-order Markov assumption: • p( wi | w1,...,wi-1) = p( wi | wi-N, ..., wi-1) • Count occurrences of length-N word sequences (N-grams) in a corpus • Model for joint probability of sequence:

  30. p(w1, …, wn) = p(w1) * p(w2 | w1) * p(w3 | w1, w2) * … * p(wn | w1, ..., wn-1) • Example, 1st-order model: • p(w1, …, wn) = p(w1) * p(w2 | w1) * p(w3 | w2) * … * p(wn | wn-1)

  31. Toy example of a language model • Probability distributions in a bigram language model, counted from a training corpus: p(a) = 1.0, p(b|a) = 0.7, p(c|a) = 0.3, p(d|b) = 1.0, p(e|c) = 0.6, p(f|c) = 0.4 • Note that for each conditioning history, the probabilities of the possible next symbols sum to 1.0 • The language model imposes a probability distribution over all strings it generates. This model generates {abd, ace, acf}: p(a,b,d) = 1.0 * 0.7 * 1.0 = 0.7 p(a,c,e) = 1.0 * 0.3 * 0.6 = 0.18 p(a,c,f) = 1.0 * 0.3 * 0.4 = 0.12 ∑W p(W) = 1.0
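The toy model can be checked directly; a minimal sketch (the dictionary layout and names are mine):

```python
# Toy bigram model from the slide: p(a) = 1.0 as the start symbol, then
# transition probabilities from each history to its possible next symbols.
start = {"a": 1.0}
trans = {"a": {"b": 0.7, "c": 0.3},
         "b": {"d": 1.0},
         "c": {"e": 0.6, "f": 0.4}}

def prob(string):
    p = start[string[0]]
    for prev, cur in zip(string, string[1:]):
        p *= trans[prev][cur]
    return p

print({s: prob(s) for s in ["abd", "ace", "acf"]})
# abd: 0.7, ace: ~0.18, acf: ~0.12 (up to floating point)
print(sum(prob(s) for s in ["abd", "ace", "acf"]))   # ~1.0
```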

  32. Initial items in sequence: alternative • Add sentence boundary markers to training corpus. Example: <s> <s> Lady Gaga is rich . <s> <s> She likes to wear meat . <s> <s> But she does not eat meat . <s> <s> • Generation: begin with first word conditional upon context consisting entirely of sentence boundary markers. Example, trigram model: p(w0) = p( w0 | <s> <s> ) p(w1) = p( w1 | <s> w0 ) p(w2) = p( w2 | w0, w1 )
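A minimal sketch of this padding scheme (the helper name is mine; some setups also add an end-of-sentence marker, which the slide does not use):

```python
def pad(sentence, n):
    # Prepend n-1 boundary markers so every word has a full n-1 word history.
    return ["<s>"] * (n - 1) + sentence.split()

words = pad("Lady Gaga is rich .", 3)
for i in range(2, len(words)):
    print(tuple(words[i - 2:i]), "->", words[i])
# ('<s>', '<s>') -> Lady
# ('<s>', 'Lady') -> Gaga
# ('Lady', 'Gaga') -> is   ... and so on
```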

  33. Trade-offs in choice of N • Higher N: • Longer units, better approximation of a language • Sparse data problem is worse • Lower N: • Shorter units, worse approximation of a language • Sparse data problem is not as bad

  34. N-gram approximations of English • We can create a Markov (i.e., N-gram) approximation of English by randomly generating sequences according to a language model • As N grows, looks more like the original language • This was realized a long time ago: • Claude Shannon, 1948 • Invented information theory • Frederick Damerau • Ph.D., Yale linguistics, 1966 • Empirical Investigation of Statistically Generated Sentences
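A rough sketch of such random generation from a bigram model (the toy corpus and names are my own; Shannon and Damerau of course used much larger counts, and character-level models in Shannon's case):

```python
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the cat".split()

# For each word, count which words follow it in the corpus.
follow = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follow[w1][w2] += 1

def generate(start, length=10):
    # Sample each next word in proportion to its bigram count after the previous word.
    words = [start]
    for _ in range(length - 1):
        nxt = follow[words[-1]]
        if not nxt:               # dead end: no observed continuation
            break
        choices, weights = zip(*nxt.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))
```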

  35. Shannon: character N-gram approximations of English (though he uses “N” to mean “N-1”) • [Slide shows sample character sequences generated by 0th-order, 1st-order, and 2nd-order models]

  36. Damerau: 0th-order word model (lines show grammatical sequences)

  37. Damerau: 5th-order word model (lines show grammatical sequences)

  38. English can be “similar” to a Markov process: the style of actual patent claims

  39. Need to deal with sparse data • Data sparsity grows with higher N • Many possible N-grams are non-existent in corpora, even for small N • “Lady Gaga wore her rutabaga dress today” • Google count of “her rutabaga dress” is zero

  40. Zero counts cause problems for language models • If any term in the probability equation is zero, the probability of the entire sequence is zero • Toy example: bigram model p(w0, w1, w2, w3) = p(w0) * p(w1|w0) * p(w2|w1) * p(w3|w2) = .01 * 0.3 * 0 * .04 = 0.0 • Need to smooth

  41. Outline • Generative probabilistic models • Language models • More smoothing • Evaluating language models • Probability and grammaticality • Programming assignment #3

  42. Smoothing methods to be covered • Previously: • Add-one smoothing • Deleted estimation • Good-Turing smoothing • Today: • Witten-Bell smoothing • Backoff smoothing • Interpolated backoff

  43. How do we treat novel N-grams? • Simple methods that assign equal probability to all zero-count N-grams: • Add-one smoothing • Deleted estimation • Good-Turing smoothing • Assign differing probability to zero-count N-grams: • Witten-Bell smoothing • Backoff smoothing • Interpolated backoff

  44. 4. Witten-Bell smoothing • Key idea: a zero-frequency N-gram is an event that hasn’t happened yet • If p(wi | wi-k, …, wi-1) = 0, then the estimate pWB(wi | wi-k, …, wi-1) is higher if wi-k, …, wi-1 occurs with many different wi • Called “diversity smoothing”

  45. Witten-Bell smoothing • If p(wi | wi-k, …, wi-1) = 0, then the estimate pWB(wi | wi-k, …, wi-1) is higher if wi-k, …, wi-1 occurs with many different wi • Example: compare these two cases • p(C|A,B) = 0 and ABA, ABB, ABD, ABE, ABF have nonzero counts • p(Z|X,Y) = 0 and XYA, XYB have nonzero counts • We would expect that the smoothed estimate of p(C|A,B) should be higher than the smoothed estimate of p(Z|X,Y)

  46. Witten-Bell smoothing for bigrams • Let’s smooth the bigram estimates p(wi | wi-1) • T(wi-1) is the number of different words (types) that occur to the right of wi-1 • N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1 • Z(wi-1) is the number of word types that never occur to the right of wi-1 • If c(wi-1, wi) = 0: pWB(wi | wi-1) = T(wi-1) / ( Z(wi-1) * ( N(wi-1) + T(wi-1) ) ) • The total probability mass reserved for these unseen bigrams is T(wi-1) / ( N(wi-1) + T(wi-1) ), i.e. (# of types of bigrams starting with wi-1) / (# of tokens of wi-1 + # of types of bigrams starting with wi-1)

  47. Witten-Bell Smoothing • Unsmoothed: p(wi | wi-1) = c(wi-1, wi) / c(wi-1) • Smoothed: • If c(wi-1, wi) > 0: pWB(wi | wi-1) = c(wi-1, wi) / ( N(wi-1) + T(wi-1) ) • If c(wi-1, wi) = 0: pWB(wi | wi-1) = T(wi-1) / ( Z(wi-1) * ( N(wi-1) + T(wi-1) ) ) • This takes probability mass away from the non-zero-count items
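A minimal sketch of these bigram formulas (the toy corpus and names are mine; Z(wi-1) is the number of vocabulary types never seen after wi-1, as above):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the cat".split()
vocab = set(corpus)

bigrams = Counter(zip(corpus, corpus[1:]))
followers = defaultdict(set)   # word -> set of word types seen after it
tokens_after = Counter()       # word -> number of tokens seen after it
for w1, w2 in zip(corpus, corpus[1:]):
    followers[w1].add(w2)
    tokens_after[w1] += 1

def p_wb(w2, w1):
    T = len(followers[w1])     # T(w1): types seen after w1
    N = tokens_after[w1]       # N(w1): tokens seen after w1
    Z = len(vocab) - T         # Z(w1): types never seen after w1
    c = bigrams[(w1, w2)]
    if c > 0:
        return c / (N + T)          # seen bigram: slightly discounted count
    return T / (Z * (N + T))        # unseen bigram: share of the reserved mass

print(p_wb("cat", "the"), p_wb("dog", "on"))   # a seen and an unseen bigram
```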

  48. 5. Backoff smoothing • Consider p(zygote | see the) vs. p(baby | see the) • Suppose these trigrams both have zero counts: see the baby / see the zygote • And we have that: • Unigram: p(baby) > p(zygote) • Bigram: p(the baby) > p(the zygote) • Trigram: we would expect that p(see the baby) > p(see the zygote), i.e. p(baby | see the) > p(zygote | see the)

  49. Backoff smoothing • Hold out probability mass for novel events • But divide up unevenly, in proportion to the backoff probability • Unlike add-one, deleted estimation, Good-Turing • For p(Z|X, Y), the backoff probability is p(Z|Y) • For p(Z|Y), the backoff probability is p(Z)

  50. Backoff smoothing: details • For p(Z|X, Y), the backoff probability is p(Z|Y) • Novel events are types Z that were never observed after X,Y • For p(Z|Y), the backoff probability is p(Z) • Novel events are types Z that were never observed after Y • Even if Z was never observed after X,Y, it may have been observed after the shorter, more frequent context Y. • Then p(Z|Y) can be estimated without further backoff. If not, we back off further to p(Z). • For p(Z), the backoff probability for novel Z can be assigned using other methods
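A simplified sketch of the backoff idea (closer to "stupid backoff" with a fixed weight than to a properly normalized scheme such as Katz backoff, which would also require discounting; the toy corpus, weight, and names are my own):

```python
from collections import Counter

corpus = "see the baby see the cat see a baby hold the baby".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(unigrams.values())

def score(w, x, y, alpha=0.4):
    # Score p(w | x, y): use the trigram if seen, else back off to the bigram
    # p(w | y), else to the unigram p(w), down-weighting each backoff step.
    if trigrams[(x, y, w)] > 0:
        return trigrams[(x, y, w)] / bigrams[(x, y)]
    if bigrams[(y, w)] > 0:
        return alpha * bigrams[(y, w)] / unigrams[y]
    return alpha * alpha * unigrams[w] / total

print(score("baby", "see", "the"))   # seen trigram
print(score("hold", "see", "the"))   # unseen trigram, backs off to the unigram
# A word never seen at all (e.g. "zygote") still gets 0 here; the unigram level
# itself must be smoothed by other methods, as the slide notes.
```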
