
Bayesian Generative Modeling

Jason Eisner. Summer School on Machine Learning, Lisbon, Portugal – July 2011.


Presentation Transcript


  1. Bayesian Generative Modeling. Jason Eisner. Summer School on Machine Learning, Lisbon, Portugal – July 2011.

  2. Bayesian Generative Modeling: what’s a model?

  3. Bayesian Generative Modeling: what’s a generative model?

  4. Bayesian Generative Modeling: what’s Bayesian?

  5. Task-centric view of the world. Diagram labels: Task; x; y; evaluation (loss function); e.g., p(y|x) model and decoder.

  6. Task-centric view of the world
     • Great way to track progress & compare systems
     • But may fracture us into subcommunities (our systems are incomparable & my semantics != your semantics)
     • Room for all of AI when solving any NLP task: spelling correction could get some benefit from deep semantics, unsupervised grammar induction, active learning, discourse, etc.
     • But in practice, focus on raising a single performance number
     • Within strict, fixed assumptions about the type of available data
     Do we want to build models & algs that are good for just one task?
     Diagram labels: Task; x; y; loss function; p(y|x) model.

  7. Variable-centric view of the world When we deeply understand language, what representations (type and token) does that understanding comprise?

  8. Bayesian View of the World. Diagram labels: observed data; probability distribution; hidden data.

  9. Different tasks merely change which variables are observed and which ones you care about inferring

  10. Different tasks merely change which variables are observed and which ones you care about inferring

  11. Different tasks merely change which variables are observed and which ones you care about inferring

  12. All you need is “p”
      • Science = a descriptive theory of the world
      • Write down a formula for p(everything)
      • everything = observed ∪ needed ∪ latent
      • Given observed, what might needed be?
      • Most probable settings of needed are those that give comparatively large values of ∑_latent p(observed, needed, latent)
      • Formally, we want p(needed | observed) = p(observed, needed) / p(observed). Since observed is constant, the conditional probability of needed varies with p(observed, needed), which is given above
      • (What do we do then?)
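
As a minimal sketch of the recipe on this slide, the Python below stores a joint distribution p(observed, needed, latent) as a small table, sums out latent, and renormalizes to get p(needed | observed). The table values and the labels x0, y0, h0, and so on are invented for illustration; they are not from the talk.

```python
from collections import defaultdict

# Hypothetical joint table p(observed, needed, latent).
# The eight values are made up for illustration and sum to 1.
joint = {
    ("x0", "y0", "h0"): 0.10,
    ("x0", "y0", "h1"): 0.20,
    ("x0", "y1", "h0"): 0.05,
    ("x0", "y1", "h1"): 0.05,
    ("x1", "y0", "h0"): 0.10,
    ("x1", "y0", "h1"): 0.10,
    ("x1", "y1", "h0"): 0.30,
    ("x1", "y1", "h1"): 0.10,
}


def posterior_over_needed(joint, observed):
    """Return p(needed | observed) by summing out the latent variable."""
    scores = defaultdict(float)
    for (obs, needed, _latent), p in joint.items():
        if obs == observed:
            scores[needed] += p  # sum over latent: p(observed, needed)
    p_observed = sum(scores.values())  # p(observed), constant in needed
    return {needed: s / p_observed for needed, s in scores.items()}


post = posterior_over_needed(joint, "x0")
for needed, p in sorted(post.items()):
    print(needed, round(p, 3))
```

Dividing by p(observed) only rescales the scores, so the ranking of the candidate values of needed is already determined by the summed joint probabilities.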

  13. All you need is “p”
      • Science = a descriptive theory of the world
      • Write down a formula for p(everything)
      • everything = observed ∪ needed ∪ latent
      • p can be any non-negative function you care to design
        • (as long as it sums to 1)
        • (or another finite positive number: just rescale)
      • But it’s often convenient to use a graphical model
        • Flexible modeling technique
        • Well understood
        • We know how to (approximately) compute with them

  14. Graphical model notation slide thanks to Zoubin Ghahramani

  15. Factor graphs slide thanks to Zoubin Ghahramani

  16. First, a familiar example (rather basic NLP example): Conditional Random Field (CRF) for POS tagging. Figure labels: possible tagging (i.e., assignment to remaining variables): v v v; observed input sentence (shaded): find preferred tags.

  17. First, a familiar example (rather basic NLP example): Conditional Random Field (CRF) for POS tagging. Figure labels: possible tagging (i.e., assignment to remaining variables); another possible tagging: v a n; observed input sentence (shaded): find preferred tags.

  18. Conditional Random Field (CRF). “Binary” factor that measures compatibility of 2 adjacent tags. Model reuses same parameters at this position. Sentence: find preferred tags.

  19. Conditional Random Field (CRF). Figure annotation: “can’t be adj”. Sentence: find preferred tags.

  20. Conditional Random Field (CRF). p(v a n) is proportional to the product of all factors’ values on v a n. Tagging: v a n; sentence: find preferred tags.

  21. Conditional Random Field (CRF). p(v a n) is proportional to the product of all factors’ values on v a n = … 1*3*0.3*0.1*0.2 … Tagging: v a n; sentence: find preferred tags. MRF vs. CRF?

  22. Inference: What do you know how to compute with this model? p(v a n) is proportional to the product of all factors’ values on v a n = … 1*3*0.3*0.1*0.2 … Tagging: v a n; sentence: find preferred tags. Maximize, sample, sum …
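
The three operations named on this slide (maximize, sample, sum) can be sketched by brute force on a three-word sentence. This is only an illustration: the factor tables UNARY and BINARY below are hypothetical values, not the ones behind the slide's 1*3*0.3*0.1*0.2 product, and the zero entry simply shows how a factor can rule a tag out.

```python
import itertools
import random

TAGS = ["v", "a", "n"]
WORDS = ["find", "preferred", "tags"]

# Hypothetical unary factors: compatibility of each word with each tag.
UNARY = {
    "find": {"v": 3.0, "a": 0.1, "n": 1.0},
    "preferred": {"v": 1.0, "a": 2.0, "n": 0.1},
    "tags": {"v": 0.5, "a": 0.0, "n": 2.0},  # 0.0 forbids this tag
}

# Hypothetical binary factors: compatibility of two adjacent tags.
# The same table is reused at every position.
BINARY = {
    ("v", "v"): 0.5, ("v", "a"): 2.0, ("v", "n"): 2.0,
    ("a", "v"): 0.2, ("a", "a"): 1.0, ("a", "n"): 3.0,
    ("n", "v"): 2.0, ("n", "a"): 0.3, ("n", "n"): 1.0,
}


def score(tagging):
    """Unnormalized probability: product of all factor values."""
    s = 1.0
    for word, tag in zip(WORDS, tagging):
        s *= UNARY[word][tag]
    for prev, cur in zip(tagging, tagging[1:]):
        s *= BINARY[(prev, cur)]
    return s


all_taggings = list(itertools.product(TAGS, repeat=len(WORDS)))

# Sum: the normalizer Z turns scores into probabilities.
Z = sum(score(t) for t in all_taggings)
probs = {t: score(t) / Z for t in all_taggings}

# Maximize: the single most probable tagging.
best = max(all_taggings, key=score)

# Sample: draw a tagging in proportion to its score.
rng = random.Random(0)
sampled = rng.choices(all_taggings, weights=[score(t) for t in all_taggings])[0]

print("best:", best, "p =", round(probs[best], 3))
print("sampled:", sampled)
```

Enumerating all 27 taggings is fine at this size. For a chain-structured model like this one, dynamic programming computes the same maximum and sum without enumeration.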

  23. Variable-centric view of the world When we deeply understand language, what representations (type and token) does that understanding comprise?

  24. To recover variables, model and exploit their correlations. Figure labels: lexicon (word types); semantics; sentences; discourse context; resources; inflection; cognates; transliteration; abbreviation; neologism; language evolution; entailment; correlation; tokens; N; translation; alignment; editing; quotation; speech; misspellings, typos; formatting; entanglement; annotation.

  25. How do you design the factors?
      • It’s easy to connect “English sentence” to “Portuguese sentence” …
      • … but you have to design a specific function that measures how compatible a pair of sentences is.
      • Often, you can think of a generative story in which the individual factors are themselves probabilities.
      • May require some latent variables.

  26. Directed graphical models (Bayes nets) Under any model: p(A,B,C, D,E) = p(A)p(B|A)p(C|A,B)p(D|A,B,C)p(E|A,B,C,D) Model above says: slide thanks to Zoubin Ghahramani (modified)

  27. Unigram model for generating text. Graph nodes: w1, w2, w3, … Formula: p(w1) · p(w2) · p(w3) · …

  28. Explicitly show model’s parameters θ. “θ is a vector that says which unigrams are likely.” Graph nodes: θ; w1, w2, w3, … Formula: p(θ) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

  29. “Plate notation” simplifies diagram. “θ is a vector that says which unigrams are likely.” Graph nodes: θ; w in a plate of size N1. Formula: p(θ) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

  30. Learn θ from observed words (rather than vice-versa). Graph nodes: θ; w in a plate of size N1. Formula: p(θ) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

  31. Explicitly show prior over θ (e.g., Dirichlet). “Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess.” α given; θ ~ Dirichlet(α); wi ~ θ. Graph nodes: α; θ; w in a plate of size N1. Formula: p(α) · p(θ | α) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

  32. Dirichlet Distribution • Each point on a k dimensional simplex is a multinomial probability distribution. Figure: a simplex over the outcomes dog, the, cat, with each axis running from 0 to 1. (slide thanks to Nigel Crook)

  33. Dirichlet Distribution • A Dirichlet Distribution is a distribution over multinomial distributions θ in the simplex. Figure: simplex with each axis running from 0 to 1. (slide thanks to Nigel Crook)

  34. slide thanks to Percy Liang and Dan Klein

  35. Dirichlet Distribution • Example draws from a Dirichlet Distribution over the 3-simplex. Figure panels: Dirichlet(5,5,5); Dirichlet(0.2, 5, 0.2); Dirichlet(0.5,0.5,0.5). (slide thanks to Nigel Crook)
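
To get a feel for the three settings named on this slide, the sketch below draws a few samples from each using only the Python standard library. It relies on the standard construction of a Dirichlet draw as independent Gamma(α_k, 1) variates divided by their sum. The helper name sample_dirichlet and the fixed seed are choices made here for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha).

    Uses independent Gamma(alpha_k, 1) draws normalized by their sum.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so repeated runs match
for alpha in ([5, 5, 5], [0.2, 5, 0.2], [0.5, 0.5, 0.5]):
    print("Dirichlet", alpha)
    for _ in range(3):
        theta = sample_dirichlet(alpha, rng)
        print("   ", [round(t, 3) for t in theta])
```

Large equal parameters such as (5,5,5) tend to give draws near the uniform distribution, (0.2, 5, 0.2) tends to put most mass on the middle outcome, and parameters below 1 such as (0.5,0.5,0.5) tend to give sparse draws near the corners and edges.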

  36. Explicitly show prior over θ (e.g., Dirichlet). Posterior distribution p(θ | α, w) is also a Dirichlet, just like the prior p(θ | α). “Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess.” prior = Dirichlet(α) → posterior = Dirichlet(α + counts(w)). Mean of posterior is like the max-likelihood estimate of θ, but smooth the corpus counts by adding “pseudocounts” α. (But better to use whole posterior, not just the mean.) Graph nodes: α; θ; w in a plate of size N1. Formula: p(α) · p(θ | α) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …
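
The update on this slide is short enough to sketch directly: add the observed word counts to the prior's parameters, then compare the posterior mean with the max-likelihood estimate. The three-word vocabulary, the toy corpus, and the choice of 1.0 for every pseudocount are assumptions made for illustration.

```python
from collections import Counter


def dirichlet_posterior(alpha, words):
    """Return posterior parameters alpha + counts(words).

    Words outside the prior's vocabulary are ignored in this sketch.
    """
    counts = Counter(words)
    return {w: a + counts[w] for w, a in alpha.items()}


def dirichlet_mean(params):
    """Mean of a Dirichlet: each parameter divided by the total."""
    total = sum(params.values())
    return {w: a / total for w, a in params.items()}


alpha = {"the": 1.0, "cat": 1.0, "dog": 1.0}  # hypothetical pseudocounts
corpus = ["the", "cat", "the", "the"]          # "dog" is never observed

posterior = dirichlet_posterior(alpha, corpus)
smoothed = dirichlet_mean(posterior)
mle = {w: corpus.count(w) / len(corpus) for w in alpha}

for w in alpha:
    print(w, "MLE:", round(mle[w], 3), "posterior mean:", round(smoothed[w], 3))
```

The unseen word gets probability 0 under max likelihood but a positive probability under the posterior mean, which is the smoothing effect the slide describes. The mean is still a single point estimate, so it discards the uncertainty that the whole posterior keeps.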

  37. Training and Test Documents. “Learn θ from document 1, use it to predict document 2.” What do good configurations look like if N1 is large? What if N1 is small? Graph nodes: α; θ; w in a plate of size N1 (train); w in a plate of size N2 (test).

  38. Many Documents. “Each document has its own unigram model.” Now does observing docs 1 and 3 still help predict doc 2? Only if α learns that all the θ’s are similar (low variance). And in that case, why even have separate θ’s? Graph nodes: α; θ1 with w in a plate of size N1; θ2 with w in a plate of size N2; θ3 with w in a plate of size N3.

  39. Many Documents. “Each document has its own unigram model.” α given (or tuned to maximize training or dev set likelihood); θd ~ Dirichlet(α); wdi ~ θd. Graph nodes: α, θ, w; plates: ND, D.

  40. Bayesian Text Categorization. “Each document chooses one of only K topics (unigram models).” α given; θk ~ Dirichlet(α); wdi ~ θk, but which k? Graph nodes: α, θ, w; plates: K, ND, D.

  41. Bayesian Text Categorization. “Each document chooses one of only K topics (unigram models).” β given; π ~ Dirichlet(β), a distribution over topics 1…K; zd ~ π, a topic in 1…K. α given; θk ~ Dirichlet(α); wdi ~ θzd. Allows documents to differ considerably while some still share θ parameters. And, we can infer the probability that two documents have the same topic z. Might observe some topics. Graph nodes: β, π, z; α, θ, w; plates: K, ND, D.
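
The claim that we can infer the probability that two documents share a topic can be illustrated under a strong simplification: the sketch below treats the distribution over topics and the K unigram models as fixed, known numbers, whereas the model on the slide keeps them as random variables with Dirichlet priors. All names and values (topic_prior, topics, the toy vocabulary) are hypothetical.

```python
import math


def topic_posterior(doc, topic_prior, topics):
    """Return p(z = k | doc) for each topic k, with parameters held fixed.

    p(z = k | doc) is proportional to topic_prior[k] times the product
    of topics[k][w] over the words w in doc. Computed in log space.
    """
    log_scores = []
    for k, prior in enumerate(topic_prior):
        log_score = math.log(prior)
        for w in doc:
            log_score += math.log(topics[k][w])
        log_scores.append(log_score)
    top = max(log_scores)  # subtract the max before exponentiating
    unnorm = [math.exp(s - top) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def prob_same_topic(doc_a, doc_b, topic_prior, topics):
    """Probability that two documents chose the same topic.

    With fixed parameters the two documents are independent, so this is
    the sum over k of p(z_a = k | doc_a) * p(z_b = k | doc_b).
    """
    post_a = topic_posterior(doc_a, topic_prior, topics)
    post_b = topic_posterior(doc_b, topic_prior, topics)
    return sum(a * b for a, b in zip(post_a, post_b))


# Hypothetical parameters: K = 2 topics over a 3-word vocabulary.
topic_prior = [0.5, 0.5]
topics = [
    {"goal": 0.6, "vote": 0.1, "the": 0.3},
    {"goal": 0.1, "vote": 0.6, "the": 0.3},
]

doc1 = ["goal", "the", "goal"]
doc2 = ["the", "goal"]
doc3 = ["vote", "vote", "the"]

print(topic_posterior(doc1, topic_prior, topics))
print(prob_same_topic(doc1, doc2, topic_prior, topics))
print(prob_same_topic(doc1, doc3, topic_prior, topics))
```

In the full Bayesian model the parameters are uncertain, so the documents are no longer independent and this product formula would be only an approximation at one parameter setting.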

  42. Latent Dirichlet Allocation (Blei, Ng & Jordan 2003). “Each document chooses a mixture of all K topics; each word gets its own topic.” Graph nodes: β, π, z; α, θ, w; plates: K, ND, D.
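
The quoted generative story can be written out as a short sampler. This is a sketch under stated assumptions, not the reference implementation of Blei, Ng & Jordan: the function names, the symmetric concentration values, and the toy vocabulary are all invented here, and the code only generates data; it does not do inference.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) via normalized Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, num_topics,
                    mixture_conc=0.5, word_conc=0.5, seed=0):
    """Generate documents by following the LDA generative story."""
    rng = random.Random(seed)
    # Each of the K topics is a unigram model over the vocabulary.
    topics = [sample_dirichlet([word_conc] * len(vocab), rng)
              for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Each document chooses its own mixture of all K topics.
        mixture = sample_dirichlet([mixture_conc] * num_topics, rng)
        doc = []
        for _ in range(doc_len):
            # Each word gets its own topic, drawn from the mixture ...
            z = rng.choices(range(num_topics), weights=mixture)[0]
            # ... and is then drawn from that topic's unigram model.
            w = rng.choices(vocab, weights=topics[z])[0]
            doc.append((z, w))
        corpus.append(doc)
    return topics, corpus


vocab = ["goal", "match", "vote", "party", "the"]  # hypothetical vocabulary
topics, corpus = generate_corpus(num_docs=3, doc_len=8, vocab=vocab,
                                 num_topics=2)
for doc in corpus:
    print(" ".join(f"{w}/{z}" for z, w in doc))
```

Each printed token is shown with the topic index that generated it. In practice only the words are observed, and the topic assignments, mixtures, and topics have to be inferred.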

  43. (Part of) one assignment to LDA’s variables slide thanks to Dave Blei

  44. (Part of) one assignment to LDA’s variables slide thanks to Dave Blei

  45. Latent Dirichlet Allocation: Inference? Graph nodes: π; z1, z2, z3, …; α, θ; w1, w2, w3; plates: K, D.

  46. Finite-State Dirichlet Allocation (Cui & Eisner 2006). “A different HMM for each document.” Graph nodes: z1, z2, z3, …; w1, w2, w3; plates: K, D.

  47. Variants of Latent Dirichlet Allocation
      • Syntactic topic model: A word or its topic is influenced by its syntactic position.
      • Correlated topic model, hierarchical topic model, …: Some topics resemble other topics.
      • Polylingual topic model: All versions of the same document use the same topic mixture, even if they’re in different languages. (Why useful?)
      • Relational topic model: Documents on the same topic are generated separately but tend to link to one another. (Why useful?)
      • Dynamic topic model: We also observe a year for each document. The k topics θ used in 2011 have evolved slightly from their counterparts in 2010.

  48. Dynamic Topic Model slide thanks to Dave Blei

  49. Dynamic Topic Model slide thanks to Dave Blei

  50. Dynamic Topic Model slide thanks to Dave Blei
