Latent Dirichlet Allocation: a generative model for text David M. Blei, Andrew Y. Ng, Michael I. Jordan (2002) Presenter: Ido Abramovich
Overview • Motivation • Other models • Notation and terminology • The latent Dirichlet allocation method • LDA in relation to other models • A geometric interpretation • The problem of estimation • Example
Motivation • What do we want to do with text corpora? Classification, novelty detection, summarization, and similarity/relevance judgments. • Given a text corpus or other collection of discrete data we wish to: • Find a short description of the data. • Preserve the essential statistical relationships.
Term Frequency – Inverse Document Frequency • tf-idf (Salton and McGill, 1983) • The term frequency count is weighted by an inverse document frequency count. • Results in a t × d (term-by-document) matrix – thus reducing each document in the corpus to a fixed-length list of numbers. • Basic identification of sets of words that are discriminative for documents in the collection. • Used by search engines.
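As a rough sketch only (toy data, and a simplified weighting rather than the exact Salton and McGill variant), the t × d tf-idf matrix can be built like this:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative data, not from the paper).
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]

vocab = sorted({w for d in docs for w in d})
N = len(docs)

# Document frequency: number of documents containing each term.
df = {w: sum(1 for d in docs if w in d) for w in vocab}

# tf-idf matrix: one row per term, one column per document (the "t x d" matrix).
tfidf = [[Counter(d)[w] * math.log(N / df[w]) for d in docs] for w in vocab]

for w, row in zip(vocab, tfidf):
    print(w, [round(x, 3) for x in row])
```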
LSI (Deerwester et al., 1990) • Latent Semantic Indexing • Classic attempt at solving this problem in information retrieval • Uses SVD to reduce document representations • Models synonymy and polysemy • Computing SVD is slow • Non-probabilistic model
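A minimal sketch of the LSI reduction, assuming a toy term-by-document matrix X and a target dimensionality k (both illustrative):

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents); values are illustrative.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0]])
k = 2  # target latent dimensionality

# Truncated SVD: keep the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k document representations: each column is a document in the k-dimensional latent space.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
print(doc_vectors)
```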
pLSI (Hofmann, 1999) • A generative model. • Models each word in a document as a sample from a mixture model. • Each word is generated from a single topic; different words in the document may be generated from different topics. • Each document is represented as a list of mixing proportions for the mixture components.
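For concreteness (notation adapted from Hofmann's aspect model, not from the slides), the pLSI mixture for a word w in document d is:

```latex
p(w \mid d) \;=\; \sum_{z} p(w \mid z)\, p(z \mid d)
```

Here p(z | d) are the per-document mixing proportions and p(w | z) the topic-word distributions; note that p(z | d) is a parameter attached to each training document, which is the source of the overfitting discussed later.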
Exchangeability • A finite set of random variables {x_1, …, x_N} is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N: p(x_1, …, x_N) = p(x_π(1), …, x_π(N)). • An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable.
bag-of-words Assumption • Word order is ignored. • “bag-of-words” – exchangeability, not i.i.d. • Theorem (De Finetti, 1935) – if (x_1, x_2, …) are infinitely exchangeable, then the joint probability has a representation as a mixture: p(x_1, …, x_N) = ∫ p(θ) (∏_{n=1..N} p(x_n | θ)) dθ, for some random variable θ.
Notation and terminology • A word is an item from a vocabulary indexed by {1, …, V}. We represent words using unit-basis vectors. The v-th word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v. • A document is a sequence of N words denoted by w = (w_1, w_2, …, w_N), where w_n is the n-th word in the sequence. • A corpus is a collection of M documents denoted by D = {w_1, w_2, …, w_M}.
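A tiny illustration of this unit-basis (one-hot) encoding in numpy, with a hypothetical vocabulary of size V = 4:

```python
import numpy as np

V = 4          # vocabulary size (hypothetical)
v = 2          # index of the word to encode (0-based here; the slides index from 1)

w = np.zeros(V)
w[v] = 1.0     # w^v = 1, all other components are 0
print(w)       # [0. 0. 1. 0.]
```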
Latent Dirichlet allocation • LDA is a generative probabilistic model of a corpus. The basic idea is that the documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.
LDA – generative process • Choose N ~ Poisson(ξ). • Choose θ ~ Dir(α). • For each of the N words w_n: • Choose a topic z_n ~ Multinomial(θ). • Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
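A minimal sketch of this generative process in numpy; the topic count k, vocabulary size V, Poisson rate ξ, and the randomly drawn β are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 10                               # number of topics and vocabulary size (illustrative)
alpha = np.full(k, 0.5)                    # Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=k)   # topic-word distributions (random stand-in, not learned)

def generate_document(xi=8):
    N = rng.poisson(xi)                    # 1. choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. choose topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)         # 3a. choose a topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])       # 3b. choose a word w_n from p(w_n | z_n, beta)
        words.append(w)
    return words

print(generate_document())
```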
Dirichlet distribution • A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex, and has the following probability density on this simplex: p(θ | α) = (Γ(∑_{i=1..k} α_i) / ∏_{i=1..k} Γ(α_i)) θ_1^(α_1 − 1) ⋯ θ_k^(α_k − 1).
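A quick way to build intuition for α (a sketch, not from the slides): small α pushes samples toward the corners of the simplex, large α toward its centre.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw a few 3-dimensional Dirichlet samples for different symmetric alpha values.
for a in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(3, a), size=3)
    print(f"alpha = {a}:")
    print(np.round(theta, 3))
```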
LDA and exchangeability • We assume that words are generated by topics and that those topics are infinitely exchangeable within a document. • By de Finetti’s theorem: p(w, z) = ∫ p(θ) (∏_{n=1..N} p(z_n | θ) p(w_n | z_n)) dθ. • By marginalizing out the topic variables z, we obtain the marginal distribution of a document, p(w | α, β) (Eq. 3 in the paper).
A geometric interpretation • [Figure: the topic simplex, with corners topic 1, topic 2 and topic 3, embedded inside the word simplex.]
Inference • We want to compute the posterior distribution of the hidden variables given a document: p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β). • Unfortunately, this is intractable to compute in general: writing out Eq. (3), the marginal couples θ and β, p(w | α, β) = (Γ(∑_i α_i) / ∏_i Γ(α_i)) ∫ (∏_{i=1..k} θ_i^(α_i − 1)) (∏_{n=1..N} ∑_{i=1..k} ∏_{j=1..V} (θ_i β_ij)^(w_n^j)) dθ.
Parameter estimation • Variational EM • (E-step) For each document, find the optimizing values of the variational parameters (γ, φ) with α, β fixed. • (M-step) Maximize the resulting variational lower bound with respect to α, β for the γ and φ values found in the E-step.
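For reference, the per-document coordinate-ascent updates derived in the paper take the following form, where Ψ is the digamma function (these are iterated to convergence inside the E-step):

```latex
\phi_{ni} \;\propto\; \beta_{i w_n}\,
   \exp\!\Big(\Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big)\Big),
\qquad
\gamma_i \;=\; \alpha_i + \sum_{n=1}^{N}\phi_{ni}.
```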
Smoothed LDA • Introduces Dirichlet smoothing on β to avoid the “zero frequency problem” • More Bayesian approach • Inference and parameter learning similar to unsmoothed LDA
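Concretely (my summary of the smoothed model, not spelled out on the slide), each row of β is treated as a random variable with its own exchangeable Dirichlet prior governed by a single scalar η, rather than as a fixed parameter:

```latex
\beta_i \;\sim\; \mathrm{Dir}(\eta, \ldots, \eta), \qquad i = 1, \ldots, k.
```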
Document modeling • Unlabeled data – our goal is density estimation. • Compute the perplexity of a held-out test set to evaluate the models – a lower perplexity score indicates better generalization.
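For reference, perplexity as used in the paper is the exponentiated negative average log-likelihood per word over the M test documents:

```latex
\mathrm{perplexity}(D_{\text{test}})
  \;=\;
  \exp\!\left(-\,\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)
```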
Document Modeling – cont. Data used • C. elegans Community abstracts • 5,225 abstracts • 28,414 unique terms • TREC AP corpus (subset) • 16,333 newswire articles • 23,075 unique terms • Held-out data – 10% • Removed terms – 50 stop words, words appearing once (AP)
Document Modeling – cont. Results • Both pLSI and the mixture model suffer from overfitting. • Mixture – peaked posteriors in the training set. • Can solve overfitting with variational Bayesian smoothing.
Document Modeling – cont. Results • Both pLSI and the mixture model suffer from overfitting. • pLSI – overfitting due to the dimensionality of the p(z|d) parameter. • As k gets larger, the chance that a training document will cover all the topics in a new document decreases.
Summary • Based on the exchangeability assumption. • Can be viewed as a dimensionality reduction technique. • Exact inference is intractable, but we can approximate it. • Can be used on other collections – images and their captions, for example.