I256 Applied Natural Language Processing, Fall 2009. Lecture 7: Practical examples of Graphical Models, Language models, Sparse data & smoothing. Barbara Rosario
Today • Exercises • Design a graphical model • Learn parameters for naïve Bayes • Language models (n-grams) • Sparse data & smoothing methods
Exercise • Let’s design a GM • Problem: topic and subtopic classification • Each document has one broad semantic topic (e.g., politics, sports) • There are several subtopics in each document • Example: a sports document can contain a part describing a match, a part describing the location of the match, and one describing the people involved
Exercise • The goal is to classify the overall topic (T) of a document and all its subtopics (STi) • Assumptions: • The subtopics STi depend on the topic T of the document • The subtopics STi are conditionally independent of each other (given T) • The words of the document wj depend on the subtopics STi and are conditionally independent of each other (given STi) • For simplicity, assume as many subtopic nodes as there are words • What would a GM encoding these assumptions look like? • Variables? Edges? Joint probability distribution?
Exercise • What if the words of the document also depend directly on the topic T? • The subtopic persons may be quite different if the overall topic is sports or politics • What if there is an ordering in the subtopics, i.e., STi depends on T and also on STi-1?
Naïve Bayes for topic classification • Recall the general joint probability distribution: P(X1, ..., XN) = ∏i P(Xi | Par(Xi)) • GM: topic node T with word nodes w1, ..., wn as its children • P(T, w1, ..., wn) = P(T) P(w1 | T) P(w2 | T) … P(wn | T) = P(T) ∏i P(wi | T) • Estimation (Training): Given data, estimate P(T) and P(wi | T) • Inference (Testing): Compute conditional probabilities P(T | w1, w2, ..., wn)
Exercise • Estimate P(Tj) and P(wi | Tj) for each wi, Tj • Topic = sport (num words = 15) • D1: 2009 open season • D2: against Maryland Sept • D3: play six games • D3: schedule games weekends • D4: games games games • Topic = politics (num words = 19) • D1: Obama hoping rally support • D2: billion stimulus package • D3: House Republicans tax • D4: cuts spending GOP games • D4: Republicans obama open • D5: political season • P(obama | T = politics) = P(w = obama, T = politics) / P(T = politics) = (c(w = obama, T = politics) / 34) / (19/34) = 2/19 • P(obama | T = sport) = P(w = obama, T = sport) / P(T = sport) = (c(w = obama, T = sport) / 34) / (15/34) = 0/15 = 0 • P(season | T = politics) = (c(w = season, T = politics) / 34) / (19/34) = 1/19 • P(season | T = sport) = (c(w = season, T = sport) / 34) / (15/34) = 1/15 • P(republicans | T = politics) = c(w = republicans, T = politics) / 19 = 2/19 • P(republicans | T = sport) = c(w = republicans, T = sport) / 15 = 0/15 = 0
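A minimal Python sketch of this estimation step, assuming the toy corpus above; the corpus dictionary and helper names are illustrative additions, not from the original slides:

```python
from collections import Counter

# Toy corpus from the slides: topic -> list of documents (each a list of words)
corpus = {
    "sport": [["2009", "open", "season"], ["against", "maryland", "sept"],
              ["play", "six", "games"], ["schedule", "games", "weekends"],
              ["games", "games", "games"]],
    "politics": [["obama", "hoping", "rally", "support"], ["billion", "stimulus", "package"],
                 ["house", "republicans", "tax"], ["cuts", "spending", "gop", "games"],
                 ["republicans", "obama", "open"], ["political", "season"]],
}

# Maximum likelihood estimates of the naive Bayes parameters
word_counts = {t: Counter(w for doc in docs for w in doc) for t, docs in corpus.items()}
total_words = {t: sum(c.values()) for t, c in word_counts.items()}   # 15 and 19
grand_total = sum(total_words.values())                              # 34

def p_topic(t):
    """P(T = t), estimated from word counts as in the slides."""
    return total_words[t] / grand_total

def p_word_given_topic(w, t):
    """P(w | T = t) = c(w, t) / c(t); zero for unseen words."""
    return word_counts[t][w] / total_words[t]

print(p_word_given_topic("obama", "politics"))   # 2/19
print(p_word_given_topic("season", "sport"))     # 1/15
```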
Exercise: inference • What is the topic of new documents: • Republicans obama season • games season open • democrats kennedy house
Exercise: inference • Recall the Bayes decision rule: decide Tj if P(Tj | c) > P(Tk | c) for all Tk ≠ Tj • c is the context, here the words of the document • We want to assign the topic T' = argmaxTj P(Tj | c)
Exercise: Bayes classification • We compute P(Tj | c) with Bayes rule: P(Tj | c) = P(c | Tj) P(Tj) / P(c) • Because of the dependencies encoded in this GM, P(c | Tj) = P(w1, ..., wn | Tj) = ∏i P(wi | Tj) • Since P(c) is the same for every topic, we only need to compare P(Tj) ∏i P(wi | Tj)
Exercise: Bayes classification • That is, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher • New sentence: republicans obama season • T = politics? P(politics | c) ∝ P(politics) P(republicans | politics) P(obama | politics) P(season | politics) = 19/34 · 2/19 · 2/19 · 1/19 > 0 • T = sport? P(sport | c) ∝ P(sport) P(republicans | sport) P(obama | sport) P(season | sport) = 15/34 · 0 · 0 · 1/15 = 0 • Choose T = politics
Exercise: Bayes classification • New sentence: democrats kennedy house • T = politics? P(politics | c) ∝ P(politics) P(democrats | politics) P(kennedy | politics) P(house | politics) = 19/34 · 0 · 0 · 1/19 = 0 • democrats and kennedy are unseen words: data sparsity • How can we address this?
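A minimal sketch of this inference step, reusing the p_topic and p_word_given_topic helpers from the estimation sketch above (illustrative only, not from the slides):

```python
def score(topic, words):
    """Unnormalized P(topic | words) = P(topic) * prod_i P(w_i | topic)."""
    p = p_topic(topic)
    for w in words:
        p *= p_word_given_topic(w, topic)   # 0 for unseen words -> whole product is 0
    return p

def classify(sentence):
    """Pick the topic with the highest unnormalized posterior."""
    words = sentence.lower().split()
    return max(corpus, key=lambda t: score(t, words))

print(classify("Republicans obama season"))                   # politics
print(score("politics", ["democrats", "kennedy", "house"]))   # 0.0 -> data sparsity
```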
Today • Exercises • Design of a GM • Learn parameters • Language models (n-grams) • Sparse data & smoothing methods
Language Models • Model to assign scores to sentences • Probabilities should broadly indicate likelihood of sentences • P( I saw a van) >> P( eyes awe of an) • Not grammaticality • P(artichokes intimidate zippers) ≈ 0 • In principle, “likely” depends on the domain, context, speaker… Adapted from Dan Klein’s CS 288 slides
Language models • Related: the task of predicting the next word • Can be useful for: • Spelling correction • I need to notified the bank • Machine translation • Speech recognition • OCR (optical character recognition) • Handwriting recognition • Augmentative communication • Computer systems that help people with disabilities communicate • For example, systems that let users choose words with hand movements
Language Models • Model to assign scores to sentences • Sentence: w1, w2, …, wn • Break the sentence probability down with the chain rule (no loss of generality): P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1) • Too many histories!
Markov assumption: the n-gram solution • Markov assumption: only the prior local context (the last “few” words) affects the next word • N-gram models: assume each word depends only on a short linear history: P(wi | w1, …, wi-1) ≈ P(wi | wi-n+1, …, wi-1) • Use the previous n-1 words to predict the next one
n-gram: Unigrams (n = 1) From Dan Klein’s CS 288 slides
n-gram: Bigrams (n = 2) From Dan Klein’s CS 288 slides
n-gram: Trigrams (n = 3) From Dan Klein’s CS 288 slides
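A minimal sketch of MLE n-gram estimation (here bigrams) on a toy corpus; the corpus, sentence markers, and helper names are illustrative additions, not from the slides:

```python
from collections import Counter

sentences = [["i", "saw", "a", "van"], ["i", "saw", "a", "dog"], ["a", "van", "passed"]]

# Pad with sentence-boundary markers and count unigrams and bigrams
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    padded = ["<s>"] + s + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(s):
    """P(s) under the bigram model: product of the conditional probabilities."""
    padded = ["<s>"] + s + ["</s>"]
    p = 1.0
    for prev, w in zip(padded, padded[1:]):
        p *= p_bigram(w, prev)
    return p

print(sentence_prob(["i", "saw", "a", "van"]))   # > 0
print(sentence_prob(["van", "saw", "i"]))        # 0.0: contains unseen bigrams
```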
Choice of n • In principle we would like the n of the n-gram to be large • green • large green • the large green • swallowed the large green • swallowed should influence the choice of the next word (mountain is unlikely, pea more likely) • The crocodile swallowed the large green .. • Mary swallowed the large green .. • And so on…
Discrimination vs. reliability • Looking at longer histories (large n) should allow us to make better predictions (better discrimination) • But it’s much harder to get reliable statistics, since the number of parameters to estimate becomes too large • The larger n is, the larger the number of parameters to estimate, and the more data needed for statistically reliable estimates
Language Models • N = size of the vocabulary • Unigrams: for each wi, estimate P(wi): N parameters • Bigrams: for each wi, wj, estimate P(wi | wj): N×N parameters • Trigrams: for each wi, wj, wk, estimate P(wi | wj, wk): N×N×N parameters
N-grams and parameters • Assume we have a vocabulary of 20,000 words • Growth in the number of parameters for n-gram models: • Bigram model: 20,000² = 4 × 10^8 • Trigram model: 20,000³ = 8 × 10^12 • Four-gram model: 20,000⁴ = 1.6 × 10^17
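A quick check of these numbers (a throwaway sketch, not from the slides):

```python
V = 20_000  # vocabulary size
for n in range(1, 5):
    # An n-gram model conditions each word on the previous n-1 words: V**n parameters
    print(f"{n}-gram: {V**n:.1e} parameters")
```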
Sparsity • Zipf’s law: most words are rare • This makes frequency-based approaches to language hard • New words appear all the time, new bigrams even more often, and new trigrams (and higher-order n-grams) more often still • These relative frequency estimates are the MLE (maximum likelihood estimates): the choice of parameters that gives the highest probability to the training corpus
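A small sketch illustrating Zipf's law on a real corpus, assuming NLTK and its Brown corpus are available (an added example, not part of the original slides):

```python
from collections import Counter
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

freqs = Counter(w.lower() for w in brown.words())
ranked = freqs.most_common()

# Zipf's law: frequency is roughly inversely proportional to rank,
# so rank * frequency stays in the same ballpark across ranks.
for rank in (1, 10, 100, 1000, 10000):
    word, freq = ranked[rank - 1]
    print(f"rank {rank:>5}  {word:<12} freq {freq:>6}  rank*freq {rank * freq}")

# Most words are rare: count the words that occur only once
print("words seen once:", sum(1 for _, c in ranked if c == 1), "of", len(ranked))
```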
Sparsity • The larger the number of parameters, the more likely it is to get 0 probabilities • Note also that the sentence probability is a product of the n-gram probabilities: • If we have one 0 for an unseen event, the 0 propagates and gives a 0 probability for the whole sentence
Tackling data sparsity • Discounting or smoothing methods • Change the probabilities to avoid zeros • Remember: probability distributions have to sum to 1 • Decrease the nonzero probabilities (seen events) and put the freed probability mass on the zero probabilities (unseen events)
Smoothing From Dan Klein’s CS 288 slides
Smoothing • Put probability mass on “unseen events” • Add-one / add-delta (uniform prior) • Add-one / add-delta (unigram prior) • Linear interpolation • …
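A minimal sketch of add-one (Laplace) smoothing for the bigram model, reusing the unigrams and bigrams counters from the bigram sketch above; add-delta just replaces the 1 with a smaller constant δ (the helper names are illustrative):

```python
def p_bigram_addone(w, prev, vocab_size, delta=1.0):
    """Smoothed estimate P(w | prev) = (c(prev, w) + delta) / (c(prev) + delta * V).

    With delta = 1 this is add-one (Laplace) smoothing; smaller delta values give
    add-delta smoothing. Unseen bigrams now get a small nonzero probability.
    """
    return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * vocab_size)

V = len(unigrams)  # crude vocabulary size, including the boundary markers
print(p_bigram("van", "saw"))             # 0.0 under MLE
print(p_bigram_addone("van", "saw", V))   # small but nonzero after smoothing
```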
Smoothing: Combining estimators • Make a linear combination of multiple probability estimates • (Provided that we weight the contributions so that the result is still a probability function) • Linear interpolation or mixture models: Pli(wi | wi-2, wi-1) = λ1 P(wi) + λ2 P(wi | wi-1) + λ3 P(wi | wi-2, wi-1), with λ1 + λ2 + λ3 = 1
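A minimal sketch of linear interpolation of unigram and bigram estimates, reusing the counters from the bigram sketch above (the λ weights shown are placeholders, not tuned values):

```python
def p_unigram(w):
    """MLE unigram estimate P(w)."""
    return unigrams[w] / sum(unigrams.values())

def p_interpolated(w, prev, lambdas=(0.3, 0.7)):
    """Mixture of unigram and bigram estimates; the weights must sum to 1."""
    l1, l2 = lambdas
    return l1 * p_unigram(w) + l2 * p_bigram(w, prev)

print(p_interpolated("van", "saw"))   # nonzero even though c(saw, van) = 0
print(p_interpolated("van", "a"))     # larger, since the bigram was observed
```

In practice the λ weights are not fixed by hand but set to maximize the likelihood of held-out data.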
Smoothing: Combining estimators • Back-off models • Special case of linear interpolation
Smoothing: Combining estimators • Back-off models: trigram version • Use the (discounted) trigram estimate if the trigram was seen in training; otherwise back off to the bigram estimate, and to the unigram estimate if the bigram is also unseen
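A simplified back-off sketch in the spirit of the slide (bigram version), reusing the counters and helpers above. It omits the discounting and back-off weights that a proper (e.g., Katz) back-off model needs to keep the distribution normalized, so it is only an illustration of the idea:

```python
def p_backoff(w, prev):
    """Use the bigram estimate if the bigram was seen, else back off to the unigram.

    NOTE: without discounting and back-off weights this does not sum to 1 over w;
    it only illustrates the control flow of a back-off model.
    """
    if bigrams[(prev, w)] > 0:
        return p_bigram(w, prev)
    return p_unigram(w)

print(p_backoff("van", "a"))     # seen bigram: uses the bigram estimate
print(p_backoff("van", "saw"))   # unseen bigram: falls back to the unigram estimate
```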
Beyond N-Gram LMs • Discriminative models (n-grams are generative models) • Grammar based • Syntactic models: use tree models to capture long-distance syntactic effects • Structural zeros: some n-grams are syntactically forbidden, keep their estimates at zero • Lexical • Word forms • Unknown words • Semantic based • Semantic classes: gather statistics at the level of semantic classes (e.g., WordNet) • More data (Web)
Summary • Given a problem (topic and subtopic classification, language models): design a GM • Learn parameters from data • But: data sparsity • Need to smooth the parameters