Latent Dirichlet Allocation: a generative model for text


  1. Latent Dirichlet Allocation: a generative model for text David M. Blei, Andrew Y. Ng, Michael I. Jordan (2002) Presenter: Ido Abramovich

  2. Overview • Motivation • Other models • Notation and terminology • Latent Dirichlet allocation method • LDA in relation to other models • A geometric interpretation • The problems of inference and estimation • Example

  3. Motivation • What do we want to do with text corpora? Classification, novelty detection, summarization, and similarity/relevance judgments. • Given a text corpus or other collection of discrete data we wish to: • Find a short description of the data. • Preserve the essential statistical relationships.

  4. Term Frequency – Inverse Document Frequency • tf-idf (Salton and McGill, 1983) • The term frequency count is compared to an inverse document frequency count. • Results in a t×d term-by-document matrix – thus reducing each document in the corpus to a fixed-length list of numbers • Basic identification of sets of words that are discriminative for documents in the collection • Used for search engines
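
A minimal tf-idf sketch using scikit-learn (the library choice and the toy corpus are my assumptions, not part of the slides). Note that scikit-learn returns a d×t matrix, the transpose of the t×d convention above:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cell divides", "the gene encodes a protein", "the protein folds"]  # toy corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # sparse d x t matrix: one fixed-length row per document
terms = vectorizer.get_feature_names_out()    # the vocabulary indexing the columns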

  5. LSI (Deerwester et al., 1990) • Latent Semantic Indexing • Classic attempt at solving this problem in information retrieval • Uses SVD to reduce document representations • Models synonymy and polysemy • Computing SVD is slow • Non-probabilistic model
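
A corresponding LSI sketch, reducing tf-idf vectors with a truncated SVD (scikit-learn and the toy corpus are again my assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cell divides", "the gene encodes a protein", "the protein folds"]  # toy corpus
X = TfidfVectorizer().fit_transform(docs)                 # term weights per document
X_lsi = TruncatedSVD(n_components=2).fit_transform(X)     # documents in a 2-dim latent space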

  6. pLSI – Hofmann (1999) • A generative model • Models each word in a document as a sample from a mixture model. • Each word is generated from a single topic; different words in the document may be generated from different topics. • Each document is represented as a list of mixing proportions for the mixture components.
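
A sketch of pLSI's sampling view of one document, assuming toy mixing proportions p(z|d) and topic-word distributions p(w|z) (the numbers are illustrative, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
V, K, N = 6, 2, 10                          # vocabulary size, topics, document length (toy)
p_z_given_d = np.array([0.7, 0.3])          # this document's mixing proportions
p_w_given_z = rng.dirichlet(np.ones(V), K)  # one word distribution per topic
words = []
for _ in range(N):
    z = rng.choice(K, p=p_z_given_d)        # each word gets its own topic draw
    words.append(rng.choice(V, p=p_w_given_z[z]))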

  7. Exchangeability • A finite set of random variables {x_1, …, x_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N: p(x_1, …, x_N) = p(x_π(1), …, x_π(N)) • An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable

  8. bag-of-words Assumption • Word order is ignored • “bag-of-words” – exchangeability, not i.i.d. • Theorem (De Finetti, 1935) – if (x_1, x_2, …) are infinitely exchangeable, then the joint probability has a representation as a mixture: p(x_1, …, x_N) = ∫ p(θ) ∏_n p(x_n | θ) dθ for some random variable θ

  9. Notation and terminology • A word is an item from a vocabulary indexed by {1,…,V}. We represent words using unit-basis vectors: the vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. • A document is a sequence of N words denoted by w = (w_1, w_2, …, w_N), where w_n is the nth word in the sequence. • A corpus is a collection of M documents denoted by D = {w_1, w_2, …, w_M}

  10. Latent Dirichlet allocation • LDA is a generative probabilistic model of a corpus. The basic idea is that the documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.

  11. LDA – generative process • Choose N ~ Poisson(ξ) • Choose θ ~ Dir(α) • For each of the N words w_n: • Choose a topic z_n ~ Multinomial(θ) • Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
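
The process above is easy to simulate. A minimal numpy sketch (the toy sizes and seed are my choices; β is sampled randomly here, whereas in the model it is a fixed parameter to be estimated):

import numpy as np

rng = np.random.default_rng(0)
V, k = 8, 3                                  # vocabulary size, number of topics (toy)
alpha = np.ones(k)                           # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), k)          # k x V topic-word distributions

N = rng.poisson(lam=10)                      # document length ~ Poisson(xi)
theta = rng.dirichlet(alpha)                 # topic proportions ~ Dir(alpha)
z = rng.choice(k, size=N, p=theta)           # one topic per word ~ Multinomial(theta)
w = [rng.choice(V, p=beta[zn]) for zn in z]  # each word ~ p(w | z_n, beta)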

  12. Dirichlet distribution • A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex, and has the following probability density on this simplex: p(θ | α) = (Γ(Σ_i α_i) / ∏_i Γ(α_i)) θ_1^(α_1−1) ⋯ θ_k^(α_k−1), where the parameter α is a k-vector with components α_i > 0
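
A quick numeric check of the two facts above, using numpy and scipy (my choice of libraries and parameter values):

import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])                   # k = 3
theta = np.random.default_rng(0).dirichlet(alpha)   # a sample: a point on the 2-simplex
assert np.isclose(theta.sum(), 1.0)                 # components are nonnegative and sum to 1
density = dirichlet(alpha).pdf(theta)               # evaluates the density formula above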

  13. The graphical model • [Figure: plate notation for LDA – α and β are corpus-level parameters, θ is sampled once per document (outer plate, M), and z and w are sampled once per word (inner plate, N); only w is observed.]

  14. The LDA equations • Given α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is: p(θ, z, w | α, β) = p(θ | α) ∏_n p(z_n | θ) p(w_n | z_n, β) • Marginalizing over θ and summing over z gives the probability of a document (Eq. 3): p(w | α, β) = ∫ p(θ | α) (∏_n Σ_z_n p(z_n | θ) p(w_n | z_n, β)) dθ

  15. LDA and exchangeability • We assume that words are generated by topics and that those topics are infinitely exchangeable within a document. • By de Finetti’s theorem: p(w, z) = ∫ p(θ) (∏_n p(z_n | θ) p(w_n | z_n)) dθ • By marginalizing out the topic variables, we get Eq. 3 in the previous slide.

  16. Unigram model • Every word of every document is drawn independently from a single multinomial: p(w) = ∏_n p(w_n)

  17. Mixture of unigrams • Each document is generated by first choosing a topic z and then generating its N words independently from the conditional multinomial: p(w) = Σ_z p(z) ∏_n p(w_n | z)

  18. Probabilistic LSI • The document label d and a word w_n are conditionally independent given an unobserved topic z: p(d, w_n) = p(d) Σ_z p(w_n | z) p(z | d)

  19.–22. A geometric interpretation • [Figure, built up over four slides: the word simplex, in which each point is a distribution over the vocabulary, with the topic simplex – corners topic 1, topic 2 and topic 3 – embedded inside it.]

  23. Inference • We want to compute the posterior distribution of the hidden variables given a document: p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β) • Unfortunately, this is intractable to compute in general. We write Eq. (3) as: p(w | α, β) = (Γ(Σ_i α_i) / ∏_i Γ(α_i)) ∫ (∏_i θ_i^(α_i−1)) (∏_n Σ_i ∏_j (θ_i β_ij)^(w_n^j)) dθ – the coupling between θ and β inside the integral is what makes it intractable

  24. Variational inference • Approximate the true posterior by a simpler, fully factorized family with free variational parameters: q(θ, z | γ, φ) = q(θ | γ) ∏_n q(z_n | φ_n) • Choose (γ, φ) to minimize the KL divergence between q and the true posterior p(θ, z | w, α, β)
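
A sketch of the resulting per-document coordinate ascent (the updates φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ = α + Σ_n φ_n are from the paper; the function name, initialization and stopping rule here are my assumptions):

import numpy as np
from scipy.special import digamma

def variational_inference(doc, alpha, beta, max_iter=100, tol=1e-6):
    """doc: word indices of one document; alpha: (k,); beta: (k, V) topic-word probs."""
    k, N = beta.shape[0], len(doc)
    phi = np.full((N, k), 1.0 / k)                       # q(z_n): start uniform
    gamma = alpha + N / k                                # q(theta): start near the prior
    for _ in range(max_iter):
        phi = beta[:, doc].T * np.exp(digamma(gamma))    # phi_ni ∝ beta_{i,w_n} exp(Ψ(γ_i))
        phi /= phi.sum(axis=1, keepdims=True)            # normalize over topics
        new_gamma = alpha + phi.sum(axis=0)              # gamma = alpha + Σ_n phi_n
        if np.abs(new_gamma - gamma).sum() < tol:
            return new_gamma, phi
        gamma = new_gamma
    return gamma, phi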

  25. Parameter estimation • Variational EM • (E step) For each document, find the optimizing values of the variational parameters (γ, φ) with α, β fixed. • (M step) Maximize the resulting variational lower bound w.r.t. α, β for the γ and φ values found in the E step.
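
This variational EM loop is now available off the shelf; a hedged usage sketch with scikit-learn (an assumption – the paper predates this library; learning_method="batch" selects variational EM, and the toy corpus is mine):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["genes regulate development", "the protein binds dna", "markets fell sharply"]
counts = CountVectorizer().fit_transform(docs)           # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, learning_method="batch")
doc_topics = lda.fit_transform(counts)                   # per-document topic proportions (E step)
topic_word = lda.components_                             # topic-word weights (M step)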

  26. Smoothed LDA • Introduces Dirichlet smoothing on β to avoid the “zero-frequency problem” • More Bayesian approach • Inference and parameter learning similar to unsmoothed LDA

  27. Document modeling • Unlabeled data – our goal is density estimation. • Compute the perplexity of a held-out test set to evaluate the models – a lower perplexity score indicates better generalization: perplexity(D_test) = exp(− Σ_d log p(w_d) / Σ_d N_d)
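
Perplexity is just a monotone transform of the average held-out per-word log-likelihood; a minimal helper (the function and argument names are mine):

import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    # exp(- Σ_d log p(w_d) / Σ_d N_d): inverse geometric-mean per-word likelihood
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))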

  28. Document Modeling (cont.) – data used • C. elegans community abstracts • 5,225 abstracts • 28,414 unique terms • TREC AP corpus (subset) • 16,333 newswire articles • 23,075 unique terms • Held-out data – 10% • Removed terms – 50 stop words, words appearing once (AP)

  29. nematode • [Figure: held-out perplexity results on the nematode abstracts corpus.]

  30. AP • [Figure: held-out perplexity results on the AP corpus.]

  31. Document Modeling (cont.) – results • Both pLSI and the mixture of unigrams suffer from overfitting. • Mixture – peaked posteriors over topics in the training set. • Overfitting can be solved with variational Bayesian smoothing.

  32. Document Modeling (cont.) – results • Both pLSI and the mixture of unigrams suffer from overfitting. • pLSI – overfitting due to the dimensionality of the p(z|d) parameter. • As k gets larger, the chance that a training document will cover all the topics in a new document decreases.

  33. Other uses

  34. Summary • Based on the exchangeability assumption • Can be viewed as a dimensionality reduction technique • Exact inference is intractable; we approximate it instead • Can be used on other collections – images and captions, for example.
