
I256 Applied Natural Language Processing Fall 2009


Presentation Transcript


  1. I256 Applied Natural Language Processing, Fall 2009 Lecture 7 Practical examples of Graphical Models Language models Sparse data & smoothing Barbara Rosario

  2. Today • Exercises • Design a graphical model • Learn parameters for Naïve Bayes • Language models (n-grams) • Sparse data & smoothing methods

  3. Exercise • Let’s design a GM • Problem: topic and subtopic classification • Each document has one broad semantic topic (e.g. politics, sports, etc.) • There are several subtopics in each document • Example: a sports document can contain a part describing a match, a part describing the location of the match, and one describing the people involved

  4. Exercise • The goal is to classify the overall topic (T) of the documents and all the subtopics (STi) • Assumptions: • The subtopics STi depend on the topic T of the document • The subtopics STi are conditionally independent of each other (given T) • The words of the document wj depend on the subtopic STi and are conditionally independent of each other (given STi) • For simplicity assume as many subtopic nodes as there are words • What would a GM encoding these assumptions look like? • Variables? Edges? Joint probability distribution?

  5. Exercise • What about now if the words of the document also depend directly on the topic T? • The subtopic persons may look quite different if the overall topic is sports rather than politics • What about now if there is an ordering in the subtopics, i.e. STi depends on T and also on STi-1?

  6. Naïve Bayes for topic classification • Recall the general joint probability distribution: P(X1, ..., XN) = ∏i P(Xi | Par(Xi)) • [Figure: Naïve Bayes GM with topic node T pointing to word nodes w1, w2, ..., wn] • P(T, w1, ..., wn) = P(T) P(w1 | T) P(w2 | T) … P(wn | T) = P(T) ∏i P(wi | T) • Estimation (Training): given data, estimate P(T), P(wi | T) • Inference (Testing): compute conditional probabilities P(T | w1, w2, ..., wn)

  7. Exercise: estimation • Estimate P(wi | Tj) for each wi, Tj • Topic = sport (num words = 15) • D1: 2009 open season • D2: against Maryland Sept • D3: play six games • D3: schedule games weekends • D4: games games games • Topic = politics (num words = 19) • D1: Obama hoping rally support • D2: billion stimulus package • D3: House Republicans tax • D4: cuts spending GOP games • D4: Republicans obama open • D5: political season • P(obama | T=politics) = P(w=obama, T=politics) / P(T=politics) = (c(w=obama, T=politics)/34) / (19/34) = 2/19 • P(obama | T=sport) = P(w=obama, T=sport) / P(T=sport) = (c(w=obama, T=sport)/34) / (15/34) = 0 • P(season | T=politics) = P(w=season, T=politics) / P(T=politics) = (c(w=season, T=politics)/34) / (19/34) = 1/19 • P(season | T=sport) = P(w=season, T=sport) / P(T=sport) = (c(w=season, T=sport)/34) / (15/34) = 1/15 • P(republicans | T=politics) = P(w=republicans, T=politics) / P(T=politics) = c(w=republicans, T=politics)/19 = 2/19 • P(republicans | T=sport) = P(w=republicans, T=sport) / P(T=sport) = c(w=republicans, T=sport)/15 = 0/15 = 0
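
Below is a minimal Python sketch of this estimation step; the toy corpus and counts come from the slide, while the function and variable names are my own.

```python
from collections import Counter, defaultdict

# Toy corpus from the slide: (topic, document) pairs
docs = [
    ("sport", "2009 open season"),
    ("sport", "against Maryland Sept"),
    ("sport", "play six games"),
    ("sport", "schedule games weekends"),
    ("sport", "games games games"),
    ("politics", "Obama hoping rally support"),
    ("politics", "billion stimulus package"),
    ("politics", "House Republicans tax"),
    ("politics", "cuts spending GOP games"),
    ("politics", "Republicans obama open"),
    ("politics", "political season"),
]

# Count word occurrences per topic (lowercased, since the slide treats Obama/obama alike)
word_counts = defaultdict(Counter)   # word_counts[topic][word] = c(word, topic)
topic_counts = Counter()             # topic_counts[topic]      = c(topic), in words
for topic, text in docs:
    for w in text.lower().split():
        word_counts[topic][w] += 1
        topic_counts[topic] += 1

total_words = sum(topic_counts.values())     # 34 words overall

def p_topic(t):
    """MLE prior P(T = t) = c(t) / total number of words."""
    return topic_counts[t] / total_words

def p_word_given_topic(w, t):
    """MLE estimate P(w | T = t) = c(w, t) / c(t)."""
    return word_counts[t][w] / topic_counts[t]

print(p_word_given_topic("obama", "politics"))   # 2/19
print(p_word_given_topic("season", "sport"))     # 1/15
```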

  8. Exercise: inference • What is the topic of new documents: • Republicans obama season • games season open • democrats kennedy house

  9. Exercise: inference • Recall: Bayes decision rule • Decide Tj if P(Tj | c) > P(Tk | c) for all Tk ≠ Tj • c is the context, here the words of the document • We want to assign the topic T’ = argmaxTj P(Tj | c)

  10. Exercise: Bayes classification • We compute P(Tj | c) with Bayes rule: P(Tj | c) = P(c | Tj) P(Tj) / P(c) • Because of the dependencies encoded in this GM: P(c | Tj) = P(w1, ..., wn | Tj) = ∏i P(wi | Tj)

  11. Exercise: Bayes classification • That is, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher • New sentence: republicans obama season • T = politics? P(politics | c) ∝ P(politics) P(republicans | politics) P(obama | politics) P(season | politics) = 19/34 · 2/19 · 2/19 · 1/19 > 0 • T = sport? P(sport | c) ∝ P(sport) P(republicans | sport) P(obama | sport) P(season | sport) = 15/34 · 0 · 0 · 1/15 = 0 • Choose T = politics
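
Continuing the estimation sketch above, a minimal classification sketch (the helper names are mine; log probabilities are used only to avoid underflow, the decision is the same):

```python
import math

def log_score(topic, words):
    """log of P(T) * prod_i P(w_i | T); returns -inf if any factor is zero."""
    s = math.log(p_topic(topic))
    for w in words:
        p = p_word_given_topic(w, topic)
        if p == 0.0:
            return float("-inf")   # an unseen (word, topic) pair zeroes the whole product
        s += math.log(p)
    return s

def classify(text):
    words = text.lower().split()
    return max(topic_counts, key=lambda t: log_score(t, words))

print(classify("republicans obama season"))   # politics
print(classify("games season open"))          # sport
```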

  12. Exercise: Bayes classification • That is, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher • New sentence: democrats kennedy house • T = politics? P(politics | c) ∝ P(politics) P(democrats | politics) P(kennedy | politics) P(house | politics) = 19/34 · 0 · 0 · 1/19 = 0 • democrats, kennedy: unseen words → data sparsity • How can we address this?

  13. Today • Exercises • Design of a GM • Learn parameters • Language models (n-grams) • Sparse data & smoothing methods

  14. Language Models • Model to assign scores to sentences • Probabilities should broadly indicate likelihood of sentences • P( I saw a van) >> P( eyes awe of an) • Not grammaticality • P(artichokes intimidate zippers) ≈ 0 • In principle, “likely” depends on the domain, context, speaker… Adapted from Dan Klein’s CS 288 slides

  15. Language models • Related: the task of predicting the next word • Can be useful for • Spelling correction • I need to notified the bank • Machine translation • Speech recognition • OCR (optical character recognition) • Handwriting recognition • Augmentative communication • Computer systems to help the disabled communicate • For example, systems that let users choose words with hand movements

  16. Language Models • Model to assign scores to sentences • Sentence: w1, w2, … wn • Break sentence probability down with chain rule (no loss of generality): P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1) • Too many histories!

  17. Markov assumption: n-gram solution • Markov assumption: only the prior local context --- the last few (n) words --- affects the next word • N-gram models: assume each word depends only on a short linear history: P(wi | w1, …, wi-1) ≈ P(wi | wi-n+1, …, wi-1) • Use N-1 words to predict the next one

  18. n-gram: Unigrams (n = 1) • P(w1, …, wn) ≈ ∏i P(wi) From Dan Klein’s CS 288 slides

  19. n-gram: Bigrams (n = 2) • P(w1, …, wn) ≈ ∏i P(wi | wi-1) From Dan Klein’s CS 288 slides

  20. n-gram: Trigrams (n = 3) • P(w1, …, wn) ≈ ∏i P(wi | wi-2, wi-1) From Dan Klein’s CS 288 slides

  21. Choice of n • In principle we would like the n of the n-gram to be large • green • large green • the large green • swallowed the large green • swallowed should influence the choice of the next word (mountain is unlikely, pea more likely) • The crocodile swallowed the large green .. • Mary swallowed the large green .. • And so on…

  22. Discrimination vs. reliability • Looking at longer histories (large n) should allow us to make better predictions (better discrimination) • But it’s much harder to get reliable statistics since the number of parameters to estimate becomes too large • The larger n, the larger the number of parameters to estimate, and the more data needed to make statistically reliable estimates

  23. Language Models • N = size of the vocabulary • Unigrams: for each wi calculate P(wi): N such numbers → N parameters • Bi-grams: for each wi, wj calculate P(wi | wj): N×N parameters • Tri-grams: for each wi, wj, wk calculate P(wi | wj, wk): N×N×N parameters
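
A minimal sketch of the MLE unigram and bigram estimates described above (the corpus is a tiny placeholder; in practice it would be a large tokenized text):

```python
from collections import Counter

# Tiny placeholder corpus; stands in for a real tokenized training text.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    """MLE unigram estimate P(w) = c(w) / total tokens."""
    return unigram_counts[w] / len(corpus)

def p_bigram(w, prev):
    """MLE bigram estimate P(w | prev) = c(prev, w) / c(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("cat", "the"))   # c(the, cat) / c(the) = 2/3
```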

  24. N-grams and parameters • Assume we have a vocabulary of 20,000 words • Growth in number of parameters for n-gram models: a bigram model has 20,000² = 4 × 10⁸ parameters, a trigram model 20,000³ = 8 × 10¹², and a 4-gram model 20,000⁴ = 1.6 × 10¹⁷
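
These figures can be reproduced with a trivial calculation (vocabulary size taken from the slide):

```python
V = 20_000                       # vocabulary size assumed on the slide
for n in (1, 2, 3, 4):
    print(f"{n}-gram model: about {V**n:.1e} conditional probabilities")
```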

  25. Sparsity • Zipf’s law: most words are rare • This makes frequency-based approaches to language hard • New words appear all the time, new bigrams more often, trigrams or more, still worse! • These relative frequency estimates are the MLE (maximum likelihood estimates): choice of parameters that give the highest probability to the training corpus

  26. Sparsity • The larger the number of parameters, the more likely it is to get 0 probabilities • Note also that the sentence probability is a product over the words: if we have one 0 for an unseen event, the 0 propagates and gives us a 0 probability for the whole sentence

  27. Tackling data sparsity • Discounting or smoothing methods • Change the probabilities to avoid zeros • Remember the probability distribution has to sum to 1 • Decrease the non-zero probabilities (seen events) and give the freed-up probability mass to the zero probabilities (unseen events)

  28. Smoothing From Dan Klein’s CS 288 slides

  29. Smoothing • Put probability mass on “unseen events” • Add-one / add-delta (uniform prior) • Add-one / add-delta (unigram prior) • Linear interpolation • …
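
As an illustration of add-one (Laplace) smoothing, continuing the bigram sketch above (the vocabulary size and counts come from that toy corpus):

```python
def p_bigram_addone(w, prev, vocab_size):
    """Add-one (Laplace) smoothed bigram estimate:
    P(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

V = len(unigram_counts)                     # vocabulary of the toy corpus (6 word types)
print(p_bigram_addone("cat", "the", V))     # (2 + 1) / (3 + 6) = 1/3
print(p_bigram_addone("dog", "the", V))     # unseen bigram now gets a small non-zero mass
```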

  30. Smoothing: Combining estimators • Make a linear combination of multiple probability estimates, e.g. P(wi | wi-2, wi-1) = λ1 P(wi) + λ2 P(wi | wi-1) + λ3 P(wi | wi-2, wi-1), with λ1 + λ2 + λ3 = 1 • (Provided that we weight the contribution of each of them so that the result is another probability function) • Linear interpolation or mixture models
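
A minimal sketch of linear interpolation over the unigram and bigram estimates above (the weights are placeholder values, not tuned on held-out data):

```python
def p_interpolated(w, prev, lambdas=(0.3, 0.7)):
    """Linear interpolation of the unigram and bigram MLE estimates.
    The weights must sum to 1 so the result is still a probability distribution."""
    l_uni, l_bi = lambdas
    return l_uni * p_unigram(w) + l_bi * p_bigram(w, prev)

print(p_interpolated("cat", "the"))   # mixes P(cat) = 2/9 and P(cat | the) = 2/3
print(p_interpolated("ate", "the"))   # bigram unseen, but the unigram term keeps it > 0
```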

  31. Smoothing: Combining estimators • Back-off models • Special case of linear interpolation

  32. Smoothing: Combining estimators • Back-off models: trigram version
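
A simplified back-off sketch in the same spirit, falling back from the bigram to the unigram estimate (a proper Katz back-off also discounts and renormalizes, which is omitted here; alpha is just a placeholder):

```python
def p_backoff(w, prev, alpha=0.4):
    """Simplified back-off: use the bigram estimate when the bigram was seen,
    otherwise back off to a scaled unigram estimate.
    (A proper Katz back-off computes discounts and back-off weights so that
    the distribution sums to 1; alpha here is only a placeholder.)"""
    if bigram_counts[(prev, w)] > 0:
        return p_bigram(w, prev)
    return alpha * p_unigram(w)

print(p_backoff("cat", "the"))   # seen bigram: uses P(cat | the) = 2/3
print(p_backoff("ate", "the"))   # unseen bigram: backs off to 0.4 * P(ate)
```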

  33. Beyond N-Gram LMs • Discriminative models (n-grams are generative models) • Grammar based • Syntactic models: use tree models to capture long-distance syntactic effects • Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero • Lexical • Word forms • Unknown words • Semantic based • Semantic classes: do statistics at the semantic-class level (e.g., WordNet) • More data (Web)

  34. Summary • Given a problem (topic and subtopic classification, language models): design a GM • Learn parameters from data • But: data sparsity • Need to smooth the parameters
