Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng & Michael I. Jordan
presented by Tilaye Alemu & Anand Ramkissoon
Motivation for LDA
• In lay terms:
  • document modelling
  • text classification
  • collaborative filtering
  • ...
• ...in the context of Information Retrieval
• The principal focus in this paper is on document classification within a corpus
Structure of this talk
• Part 1:
  • Theory
  • Background
  • (some) other approaches
• Part 2:
  • Experimental results
  • some details of usage
  • wider applications
LDA: conceptual features
• Generative
• Probabilistic
• Collections of discrete data
• 3-level hierarchical Bayesian model
• mixture models
• efficient approximate inference techniques
  • variational methods
  • EM algorithm for empirical Bayes parameter estimation
How to classify text documents
• Word (term) frequency
  • tf-idf
  • term-by-document matrix
  • discriminative sets of words
  • fixed-length lists of numbers
  • little statistical structure
• Dimensionality reduction techniques
  • Latent Semantic Indexing (a sketch of the pipeline follows below)
  • singular value decomposition
  • not generative
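A minimal sketch of this tf-idf + LSI pipeline, to make the contrast with LDA concrete. The library (scikit-learn) and the toy corpus are illustrative assumptions, not part of the original slides:

```python
# Hedged sketch: tf-idf term-by-document matrix, then LSI via truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell on market news",
]

# tf-idf weighted term-by-document matrix (here: documents as rows).
X = TfidfVectorizer().fit_transform(corpus)   # shape (M documents, V terms)

# Latent Semantic Indexing: rank-k SVD of the tf-idf matrix.
# Each document becomes a fixed-length vector, but the model is not generative.
lsi = TruncatedSVD(n_components=2)
X_reduced = lsi.fit_transform(X)
print(X_reduced.shape)                        # (3, 2)
```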
How to classify text documents (cont'd)
• probabilistic LSI (pLSI)
  • each word is generated by a single topic
  • each document is generated by a mixture of topics
  • a document is represented as a list of mixing proportions for topics (written out below)
• No generative model for these mixing proportions
• Number of parameters grows linearly with the size of the corpus
• Overfitting
• No clear way to classify documents outside the training set
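For reference, the pLSI mixture these bullets describe (Hofmann's model, as restated in the LDA paper). The mixing proportions p(z | d) are indexed by the training documents themselves, which is why the parameter count grows with the corpus and the model does not extend to unseen documents:

```latex
% pLSI: each word w_n of document d is drawn from a topic z;
% p(z | d) are per-document mixing proportions, one set per training document.
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
```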
A major simplifying assumption
• A document is a “bag of words”
• A corpus is a “bag of documents”
  • order is unimportant
  • exchangeability
• de Finetti representation theorem
  • any collection of exchangeable random variables has a representation as a (generally infinite) mixture distribution
A note about exchangeability
• Exchangeability does not mean that the random variables are iid
• They are iid when conditioned on an underlying latent parameter of a probability distribution
• Conditionally, the joint distribution is simple and factored (see the representation below)
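The standard statement of the de Finetti representation these two slides appeal to: conditioned on the latent parameter θ, the variables are iid and the joint factors under the integral.

```latex
% de Finetti: an exchangeable sequence is a mixture of iid sequences.
p(w_1, \ldots, w_N) = \int \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) dP(\theta)
```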
Notation
• word: the basic unit of discrete data, an item from a vocabulary indexed by {1,...,V}
  • each word is represented as a unit-basis V-vector (written out below)
• document: a sequence of N words w = (w1,...,wN)
• corpus: a collection of M documents D = (w1,...,wM)
• Each document is considered a random mixture over latent topics
• Each topic is considered a distribution over words
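The unit-basis convention, written out (superscripts index vector components, following the paper's notation):

```latex
% The v-th vocabulary word as a V-vector w with exactly one nonzero component:
w^v = 1 \quad \text{and} \quad w^u = 0 \ \text{for all } u \neq v
```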
LDA assumes a generative process for each document in the corpus (spelled out below)
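The process itself, as given in the paper:

```latex
% LDA's generative process for each document \mathbf{w} in corpus D:
\begin{enumerate}
  \item Choose the document length $N \sim \mathrm{Poisson}(\xi)$.
  \item Choose topic proportions $\theta \sim \mathrm{Dir}(\alpha)$.
  \item For each of the $N$ words $w_n$:
    \begin{enumerate}
      \item Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
      \item Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial conditioned on $z_n$.
    \end{enumerate}
\end{enumerate}
```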
Marginalize over z
• The word distribution (first equation below)
• The generative process for a whole document (second equation below)
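Both quantities, reconstructed from the paper. The word distribution sums the topic assignment z_n out; the marginal of a whole document w additionally integrates over θ:

```latex
% Word distribution: marginalise over the topic assignment z_n.
p(w_n \mid \theta, \beta) = \sum_{z_n} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta)

% Marginal of a document: additionally integrate over \theta \sim \mathrm{Dir}(\alpha).
p(\mathbf{w} \mid \alpha, \beta) =
  \int p(\theta \mid \alpha)
  \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
```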
A family of distributions on the latent variables
• The Dirichlet parameter γ and the multinomial parameters φ are the free variational parameters (the factorised family is written out below)
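The factorised family from the paper: a Dirichlet over θ and an independent multinomial over each topic assignment.

```latex
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
```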
The update equations
• Minimize the Kullback-Leibler divergence between the variational distribution and the true posterior (the resulting updates are given below)
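Minimising the KL divergence yields the coordinate-ascent updates from the paper, where Ψ is the digamma function:

```latex
% Per-document coordinate-ascent updates:
\phi_{ni} \propto \beta_{i w_n} \exp\!\left( \mathbb{E}_q[\log \theta_i \mid \gamma] \right),
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni},
\qquad
\mathbb{E}_q[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{k} \gamma_j \Big)
```

A minimal Python sketch of the resulting loop for a single document. The function name, the fixed iteration count, and the toy usage values are assumptions; only the updates and the initialisation follow the paper:

```python
# Hedged sketch of per-document variational inference for LDA.
import numpy as np
from scipy.special import digamma

def variational_inference(doc_word_ids, alpha, beta, n_iters=50):
    """Coordinate-ascent updates for one document.
    doc_word_ids : length-N integer array of vocabulary indices
    alpha        : length-k Dirichlet parameter
    beta         : (k x V) topic-word matrix, rows summing to 1
    """
    k, N = len(alpha), len(doc_word_ids)
    phi = np.full((N, k), 1.0 / k)     # phi_n initialised uniformly (as in the paper)
    gamma = alpha + N / k              # gamma_i = alpha_i + N/k (as in the paper)
    for _ in range(n_iters):           # fixed iteration count is an assumption
        # phi_ni ∝ beta_{i,w_n} * exp(E_q[log theta_i | gamma])
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, doc_word_ids].T * np.exp(e_log_theta)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

# Toy usage (hypothetical numbers):
rng = np.random.default_rng(0)
k, V = 3, 10
alpha = np.full(k, 0.1)
beta = rng.dirichlet(np.ones(V), size=k)            # k topics over V words
gamma, phi = variational_inference(np.array([0, 4, 4, 7]), alpha, beta)
print(gamma)    # posterior Dirichlet parameters for this document
```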