
LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT

This text provides a recap of using Naïve Bayes as a Directed Graphical Model (DGM), discusses parameter learning techniques, and explores the pros and cons of different models. It also introduces Latent Dirichlet Allocation (LDA) and Gibbs sampling for LDA.

Presentation Transcript


  1. LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT

  2. DGMs to describe parameter learning

  3. Recap: Naïve Bayes as a DGM
     • For each document d in the corpus (of size D):
       • Pick a label yd from Pr(Y)
       • For each word in d (of length Nd):
         • Pick a word xid from Pr(X | Y = yd)
     [Plate diagram: Y → X, with X in plate Nd nested in plate D; the document's single Y generates every X.]

  4. Recap: Naïve Bayes as a DGM
     • For each document d in the corpus (of size D):
       • Pick a label yd from Pr(Y)
       • For each word in d (of length Nd):
         • Pick a word xid from Pr(X | Y = yd)
     Not described: how do we smooth for classes? For multinomials? How many classes are there? ...
     [Plate diagram as above.]
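
To make the generative story concrete, here is a minimal Python sketch that samples a toy corpus from this model; the class names, vocabulary, and probability tables are made-up illustration values, not anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters: Pr(Y) and Pr(X | Y) over a toy vocabulary.
classes = ["sports", "politics"]
vocab = ["game", "team", "vote", "law", "win"]
pr_y = np.array([0.5, 0.5])                        # Pr(Y)
pr_x_given_y = np.array([                          # Pr(X | Y = y), one row per class
    [0.35, 0.30, 0.05, 0.05, 0.25],                # sports
    [0.05, 0.05, 0.40, 0.40, 0.10],                # politics
])

D, Nd = 3, 6                                       # 3 documents, 6 words each
for d in range(D):
    y = rng.choice(len(classes), p=pr_y)           # pick a label yd from Pr(Y)
    words = rng.choice(vocab, size=Nd, p=pr_x_given_y[y])  # pick each word from Pr(X | Y = yd)
    print(classes[y], list(words))
```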

  5. Recap: smoothing for a binomial
     Tom's example: estimate θ = P(heads) for a binomial.
     • MLE: maximize Pr(D | θ), giving
       θMLE = #heads / (#heads + #tails)
     • MAP: maximize Pr(D | θ) Pr(θ), giving
       θMAP = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)
     Smoothing = a prior over the parameter θ
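
A small numeric sketch of the two estimates, with made-up coin-flip counts and prior pseudo-counts:

```python
# Small numeric sketch of MLE vs. MAP smoothing for a coin (counts are made up).
heads, tails = 3, 1                 # observed data D
imag_heads, imag_tails = 2, 2       # "imaginary" counts contributed by the prior

theta_mle = heads / (heads + tails)
theta_map = (heads + imag_heads) / (heads + tails + imag_heads + imag_tails)

print(theta_mle)   # 0.75: unsmoothed estimate
print(theta_map)   # 0.625: pulled toward 0.5 by the prior
```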

  6. Smoothing for a binomial as a DGM
     MAP for dataset D with α1 heads and α2 tails.
     [Plate diagram: γ (the #imaginary heads and #imaginary tails) → θ → D]
     MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ under the posterior distribution.
     Also: inference in a simple graph can be intractable if the conditional distributions are complicated.

  7. Smoothing for a binomial as a DGM
     MAP for dataset D with α1 heads and α2 tails:
     θMAP = (α1 + #imaginary heads) / (α1 + α2 + #imaginary heads + #imaginary tails)
     [Plate diagram: γ (the imaginary counts) → θ → D]
     Final recap: the conjugate prior for a multinomial is called a Dirichlet.
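
The same idea carries over to the multinomial case: the Dirichlet's concentration parameters act as per-outcome imaginary counts. A minimal sketch with made-up counts and a made-up prior:

```python
import numpy as np

# Sketch of the multinomial analogue: a Dirichlet prior acts as per-outcome pseudo-counts.
# The counts and the prior's concentration parameters below are made up for illustration.
counts = np.array([5, 0, 2])          # observed counts for a 3-sided "die"
alpha = np.array([2.0, 2.0, 2.0])     # Dirichlet prior (alpha - 1 = 1 imaginary count per side)

theta_mle = counts / counts.sum()
theta_map = (counts + alpha - 1) / (counts.sum() + (alpha - 1).sum())

print(theta_mle)   # [0.714..., 0.0, 0.286...]: zero probability for the unseen side
print(theta_map)   # [0.6, 0.1, 0.3]: smoothed away from zero
```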

  8. Recap: Naïve Bayes as a DGM
     Now: let's turn Bayes up to 11 for naïve Bayes....
     [Plate diagram: Y → X, with X in plate Nd nested in plate D.]

  9. A more Bayesian Naïve Bayes
     • From a Dirichlet α:
       • Draw a multinomial π over the K classes
     • From a Dirichlet η:
       • For each class y = 1...K:
         • Draw a multinomial β[y] over the vocabulary
     • For each document d = 1..D:
       • Pick a label yd from π
       • For each word in d (of length Nd):
         • Pick a word xid from β[yd]
     [Plate diagram: α → π → Y → W ← β ← η, with W in plate Nd nested in plate D, and β in plate K.]
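
Here is a minimal sketch of this more Bayesian generative story; K, the vocabulary, and the hyperparameters α and η are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the generative story above; K, the vocabulary, and the
# hyperparameters alpha, eta are assumptions for illustration.
K, D, Nd = 2, 4, 5
vocab = ["game", "team", "vote", "law", "win"]
alpha = np.full(K, 1.0)
eta = np.full(len(vocab), 0.5)

pi = rng.dirichlet(alpha)                                  # multinomial pi over the K classes
beta = np.array([rng.dirichlet(eta) for _ in range(K)])    # beta[y] over the vocabulary, per class

for d in range(D):
    y = rng.choice(K, p=pi)                        # pick a label yd from pi
    words = rng.choice(vocab, size=Nd, p=beta[y])  # pick each word xid from beta[yd]
    print(y, list(words))
```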

  10. Pros and cons (as William sees them)
      Slide annotations: "This gets vague" and "This is precise", contrasting the written generative story with the plate diagram.
      (The generative story and plate diagram from slide 9 are repeated on this slide.)

  11. Pros and cons (as William sees them)
      Slide annotations: "This is visual" (the plate diagram) and "This is a program" (the generative story).
      (Again, the generative story and plate diagram from slide 9 are repeated.)

  12. Naïve Bayes-like DGMs covered so far (✔ marks the ones already covered):
      • Multinomial: naïve Bayes, SS (semi-supervised) naïve Bayes, mixture of multinomials
      • Gaussian: Gaussian naïve Bayes, SS Gaussian naïve Bayes, mixture of Gaussians

  13. Supervised versus unsupervised NB
      [Two plate diagrams, side by side, with the same structure α → π → Y → W ← β ← η: in the supervised model Y is observed; in the unsupervised model Y is hidden.]

  14. Unsupervised naïve Bayes (mixture of multinomials)
      • From a Dirichlet α:
        • Draw a multinomial π over the K classes
      • From a Dirichlet η:
        • For each class y = 1...K:
          • Draw a multinomial β[y] over the vocabulary
      • For each document d = 1..D:
        • Pick a label yd from π
        • For each word in d (of length Nd):
          • Pick a word xid from β[yd]
      [Plate diagram as on slide 9, but with Y unobserved.]

  15. Unsupervised naïve Bayes (mixture of multinomials)
      Learning: via EM
      • Pick π0, β[1]0, ..., β[K]0
      • For t = 1, ..., T:
        • For d = 1, ..., D:
          • Find P(Yd | πt-1, β[1]t-1, ..., β[K]t-1)
        • Maximize πt, β[1]t, ..., β[K]t on the weighted data
      [Plate diagram as before.]
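
A compact sketch of this EM loop on bag-of-words count vectors; the toy data, K, and the smoothing constant are assumptions for illustration, not the course's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[4, 3, 0, 0], [5, 2, 1, 0],       # toy doc-term count matrix (D x V), made up
              [0, 1, 4, 3], [0, 0, 3, 5]], dtype=float)
D, V = X.shape
K, T, smooth = 2, 20, 1e-2

pi = np.full(K, 1.0 / K)                         # pi^0
beta = rng.dirichlet(np.ones(V), size=K)         # beta[1]^0 ... beta[K]^0

for t in range(T):
    # E-step: P(Yd = k | pi^(t-1), beta^(t-1)) for every document d
    log_post = np.log(pi) + X @ np.log(beta).T   # (D x K) unnormalized log posteriors
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # M-step: re-estimate pi^t and beta^t on the posterior-weighted data
    pi = post.sum(axis=0) / D
    beta = post.T @ X + smooth                   # (K x V) expected counts, lightly smoothed
    beta /= beta.sum(axis=1, keepdims=True)

print(np.round(post, 2))                         # soft cluster assignments
```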

  16. LATENT DIRICHLET ALLOCATION (LDA)

  17. The LDA Topic Model

  18. Unsupervised NB vs LDA
      [Two plate diagrams, side by side. Unsupervised NB: one class prior π and one Y per doc. LDA: a different class distribution for each doc and one Y per word. Both have W in plate Nd nested in plate D, and β with prior η in plate K.]

  19. Unsupervised NB vs LDA
      [The same comparison in LDA's usual notation. Unsupervised NB: one class prior π and one Y per doc. LDA: a different class distribution θd for each doc and one Z per word: α → θd → Zdi → Wdi ← βk ← η.]

  20. LDA: Blei's motivation, starting from the BOW assumption
      • Assumptions: (1) documents are i.i.d.; (2) within a document, words are i.i.d. (bag of words)
      • For each document d = 1, ..., M:
        • Generate θd ~ D1(...)
        • For each word n = 1, ..., Nd:
          • Generate wn ~ D2( · | θd,n)
      • Now pick your favorite distributions for D1, D2
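
Choosing a Dirichlet for D1 and, for D2, a topic draw followed by a draw from that topic's word distribution gives LDA. A minimal sketch of that generative process, with made-up K, vocabulary, α, and η:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the LDA generative process with Dirichlets for D1 and the topics;
# K, the vocabulary, alpha, and eta are made-up illustration values.
K, M, Nd = 3, 4, 8
vocab = ["game", "team", "vote", "law", "gene", "cell", "win", "court"]
alpha = np.full(K, 0.5)
eta = np.full(len(vocab), 0.1)

beta = np.array([rng.dirichlet(eta) for _ in range(K)])   # beta[k]: topic k's word distribution

for d in range(M):
    theta_d = rng.dirichlet(alpha)                 # per-document topic proportions (D1)
    doc = []
    for n in range(Nd):
        z = rng.choice(K, p=theta_d)               # topic Z_{d,n} for this word position
        doc.append(rng.choice(vocab, p=beta[z]))   # word W_{d,n} drawn from beta[Z_{d,n}] (D2)
    print(doc)
```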

  21. Unsupervised NB vs LDA
      • Unsupervised NB clusters documents into latent classes
      • LDA clusters word occurrences into latent classes (topics)
      • The smoothness of βk: the same word suggests the same topic
      • The smoothness of θd: the same document suggests the same topic
      [LDA plate diagram: α → θd → Zdi → Wdi ← βk ← η, with Wdi in plate Nd nested in plate D and βk in plate K.]

  22. Mixed membership model • LDA’s view of a document

  23. LDA topics: top words w by Pr(w | Z = k)
      [Table: lists of top words for four example topics, Z = 13, Z = 22, Z = 27, and Z = 19.]

  24. SVM using 50 features: Pr(Z = k | θd)
      [Figure: classification results, 50 topic features vs. all words as SVM features.]
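
The idea of the experiment is to feed the per-document topic proportions Pr(Z = k | θd) to an SVM instead of the full bag of words. A hedged sketch of that pipeline on made-up toy documents, using scikit-learn's LatentDirichletAllocation and LinearSVC as stand-ins for the tools actually used:

```python
# Toy documents and labels are made up; this only illustrates "topic proportions as features".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

docs = ["the team won the game", "a close game for the home team",
        "the senate passed the law", "voters backed the new law"]
labels = [0, 0, 1, 1]

counts = CountVectorizer().fit_transform(docs)            # bag-of-words counts
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

clf = LinearSVC().fit(theta, labels)                      # SVM on topic proportions only
print(clf.predict(theta))
```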

  25. Gibbs Sampling for LDA

  26. LDA
      • Latent Dirichlet Allocation
      • Parameter learning:
        • Variational EM (not covered in 601-B)
        • Collapsed Gibbs sampling
      • Wait, why is sampling called "learning" here? Here's the idea....

  27. LDA
      • Gibbs sampling works for any directed model!
      • Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
      • The sequence of samples comprises a Markov chain
      • The stationary distribution of the chain is the joint distribution
      Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

  28. [Example model, shown as a plate diagram and unrolled for two instances: α → θ → Z, with X and Y downstream; unrolled: θ → Z1, Z2, with X1, Y1 and X2, Y2 below.]
      I'll assume we know the parameters for Pr(X | Z) and Pr(Y | X).

  29. Initialize all the hidden variables randomly, then repeat:
      • Pick Z1 ~ Pr(Z | x1, y1, θ)
      • Pick Z2 ~ Pr(Z | x2, y2, θ)
      • Pick θ ~ Pr(θ | z1, z2, α)   (pick from the posterior)
      • Pick Z1 ~ Pr(Z | x1, y1, θ)
      • Pick Z2 ~ Pr(Z | x2, y2, θ)
      • Pick θ ~ Pr(θ | z1, z2, α)   (pick from the posterior)
      • . . .
      In a broad range of cases these will eventually converge to samples from the true joint distribution, so we will have (a sample of) the true θ.
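
Here is a toy Gibbs sampler in the spirit of this loop, for a simplified model assumed here (θ ~ Beta, Zi ~ Bernoulli(θ), Xi | Zi ~ Bernoulli with known parameters, and only the Xi observed); it is a sketch of the alternation, not the exact model on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: theta ~ Beta(a, a); Z_i ~ Bernoulli(theta);
# X_i | Z_i ~ Bernoulli(p[Z_i]) with p known (standing in for the known Pr(X | Z)).
a = 1.0
p = np.array([0.2, 0.8])                  # known Pr(X = 1 | Z = 0) and Pr(X = 1 | Z = 1)
X = np.array([1, 1, 0, 1, 0, 1, 1, 1])    # observed data (made up)
n = len(X)

Z = rng.integers(0, 2, size=n)            # initialize the hidden variables randomly
theta_samples = []
for it in range(2000):
    theta = rng.beta(a + Z.sum(), a + n - Z.sum())      # theta ~ Pr(theta | z, a): Beta posterior
    for i in range(n):                                   # Z_i ~ Pr(Z_i | x_i, theta)
        lik = p ** X[i] * (1 - p) ** (1 - X[i])          # [Pr(x_i | Z=0), Pr(x_i | Z=1)]
        w = np.array([1 - theta, theta]) * lik
        Z[i] = rng.choice(2, p=w / w.sum())
    theta_samples.append(theta)

print(np.mean(theta_samples[500:]))       # posterior mean of theta after burn-in
```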

  30. Why does Gibbs sampling work?
      • Basic claim: when you sample x ~ P(X | y1, ..., yk), then if y1, ..., yk were sampled from the true joint, x will also be a sample from the true joint
      • So the true joint is a "fixed point": you tend to stay there if you ever get there
      • How long does it take to get there? It depends on the structure of the space of samples: how well connected the samples are by the sampling steps

  31. LDA
      • Latent Dirichlet Allocation
      • Parameter learning:
        • Variational EM (not covered in 601-B)
        • Collapsed Gibbs sampling
      • What is collapsed Gibbs sampling?

  32. Initialize all the Z's randomly, then repeat (note: θ is never sampled; it has been collapsed out):
      • Pick Z1 ~ Pr(Z1 | z2, z3, α)
      • Pick Z2 ~ Pr(Z2 | z1, z3, α)
      • Pick Z3 ~ Pr(Z3 | z1, z2, α)
      • . . .
      This converges to samples from the true joint ... and then we can estimate Pr(θ | α, sample of Z's).

  33. Initialize all the Z's randomly, then....
      • Pick Z1 ~ Pr(Z1 | z2, z3, α)
      What's this distribution?

  34. Initialize all the Z's randomly, then....
      • Pick Z1 ~ Pr(Z1 | z2, z3, α)
      Simpler case: drop the downstream X's and Y's, so the model is just α → θ → Z1, Z2, Z3. What's this distribution? It is called a Dirichlet-multinomial, and it looks like this: if there are k possible values, m Z's in total, and nk = # Z's with value k, then it turns out that
      Pr(Z1, ..., Zm | α) = [Γ(Σk αk) / Γ(m + Σk αk)] × Πk [Γ(nk + αk) / Γ(αk)]

  35. It turns out that sampling from a Dirichlet-multinomial is very easy!
      • Notation:
        • k values for the Z's
        • nk = # Z's with value k
        • Z = (Z1, ..., Zm)
        • Z(-i) = (Z1, ..., Zi-1, Zi+1, ..., Zm)
        • nk(-i) = # Z's with value k, excluding Zi
      Then the conditional for a single Zi is
      Pr(Zi = k | Z(-i), α) = (nk(-i) + αk) / (m - 1 + Σk' αk')
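
A one-step sketch of that collapsed draw, with made-up current assignments and α:

```python
import numpy as np

rng = np.random.default_rng(0)

# One collapsed draw Zi ~ Pr(Zi | Z(-i), alpha) for the simple model above.
# The current assignments and alpha are made-up values.
alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet hyperparameters, one per value of Z
Z = np.array([0, 2, 2, 1, 2, 0])      # current assignments of Z1..Zm
i = 3                                 # resample this Zi

n_minus_i = np.bincount(np.delete(Z, i), minlength=len(alpha))  # nk(-i)
probs = (n_minus_i + alpha) / (len(Z) - 1 + alpha.sum())        # Pr(Zi = k | Z(-i), alpha)
Z[i] = rng.choice(len(alpha), p=probs)
print(probs, Z[i])
```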

  36. What about with downstream evidence?
      α captures the constraints on Zi via θ (from "above", the "causal" direction); what about via β?
      Pr(Z) = (1/c) · Pr(Z | E+) · Pr(E- | Z)
      [Diagram: α → θ → Z1, Z2, Z3 → X1, X2, X3 ← β[k] ← η; the X's and β bring in evidence from "below".]

  37. Sampling for LDA
      • Notation:
        • k values for the Zd,i's
        • Z(-d,i) = all the Z's but Zd,i
        • nw,k = # Zd,i's with value k paired with Wd,i = w
        • n*,k = # Zd,i's with value k
        • nw,k(-d,i) = nw,k excluding Zd,i
        • n*,k(-d,i) = n*,k excluding Zd,i
        • n*,k d,(-i) = n*,k from doc d, excluding Zd,i
      The collapsed Gibbs update combines the two directions, Pr(Z | E+) and Pr(E- | Z):
      Pr(Zd,i = k | Z(-d,i), W, α, η) ∝ (n*,k d,(-i) + αk) × (nw,k(-d,i) + ηw) / (n*,k(-d,i) + Σw' ηw')
      The first factor is roughly the fraction of the time Z = k in doc d; the second is roughly the fraction of the time W = w in topic k.
      [LDA plate diagram as before: α → θd → Zdi → Wdi ← βk ← η.]
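
Putting the pieces together, here is a sketch of collapsed Gibbs sampling for LDA on a toy corpus, following the update above; the corpus, K, α, and η are made up, and the count-table bookkeeping is one reasonable way to implement it, not necessarily the course's:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "team", "win", "vote", "law", "court"]
docs = [["game", "team", "win", "game"],
        ["team", "win", "team", "game"],
        ["vote", "law", "court", "vote"],
        ["law", "court", "vote", "law"]]
W = [[vocab.index(w) for w in doc] for doc in docs]

K, V, D = 2, len(vocab), len(docs)
alpha, eta = 0.5, 0.1

# Count tables: n_wk (word-topic), n_k (topic totals), n_dk (doc-topic).
n_wk = np.zeros((V, K))
n_k = np.zeros(K)
n_dk = np.zeros((D, K))
Z = []
for d, doc in enumerate(W):                      # initialize all the Z's randomly
    zs = rng.integers(0, K, size=len(doc))
    Z.append(zs)
    for w, z in zip(doc, zs):
        n_wk[w, z] += 1; n_k[z] += 1; n_dk[d, z] += 1

for it in range(200):                            # Gibbs sweeps
    for d, doc in enumerate(W):
        for i, w in enumerate(doc):
            z = Z[d][i]                          # remove Zd,i from the counts
            n_wk[w, z] -= 1; n_k[z] -= 1; n_dk[d, z] -= 1
            # Pr(Zd,i = k | ...) propto (n*,k^{d,(-i)} + alpha) * (nw,k^{(-d,i)} + eta) / (n*,k^{(-d,i)} + V*eta)
            p = (n_dk[d] + alpha) * (n_wk[w] + eta) / (n_k + V * eta)
            z = rng.choice(K, p=p / p.sum())
            Z[d][i] = z                          # add the new Zd,i back
            n_wk[w, z] += 1; n_k[z] += 1; n_dk[d, z] += 1

theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)   # estimated Pr(Z = k | doc d)
beta = ((n_wk + eta) / (n_wk + eta).sum(axis=0, keepdims=True)).T    # estimated Pr(w | topic k)
print(np.round(theta, 2))
for k in range(K):
    print("topic", k, [vocab[i] for i in np.argsort(-beta[k])[:3]])  # top words per topic
```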

  38. Unsupervised NB vs LDA
      • Unsupervised NB clusters documents into latent classes
      • LDA clusters word occurrences into latent classes (topics)
      • The smoothness of βk: the same word suggests the same topic
      • The smoothness of θd: the same document suggests the same topic
      [LDA plate diagram as before.]

  39. EVEN MORE DETAIL ON LDA…

  40. Way way more detail

  41. More detail

  42. What gets learned…..

  43. Some comments on LDA • Very widely used model • Also a component of many other models

  44. LDA is not just for text
      An example: modeling urban neighborhoods
      • Grid cell in lat/long space ~ "document"
      • 4Square check-in categories ~ "words"
      • Prototypical neighborhood category ~ set of related "words" ~ topic
