Fast and Accurate Inference for Topic Models James Foulds University of California, Santa Cruz Presented at eBay Research Labs
Motivation • There is an ever-increasing wealth of digital information available • Wikipedia • News articles • Scientific articles • Literature • Debates • Blogs, social media … • We would like automatic methods to help us understand this content
Motivation • Personalized recommender systems • Social network analysis • Exploratory tools for scientists • The digital humanities • …
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
As word IDs: [5 6 37 1 4 30 9 22 570 12]
As topic proportions: Foxes, Dogs, Jumping [40% 40% 20%]
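To make the two representations concrete, here is a minimal Python sketch; the vocabulary, the word IDs it produces, and the topic names are illustrative assumptions, not the talk's actual data.

```python
# Minimal sketch: a document as a high-dimensional list of word IDs,
# versus a low-dimensional vector of topic proportions.
# The vocabulary and topic names are made up for illustration.
sentence = "the quick brown fox jumps over the sly lazy dog"

# High-dimensional representation: one vocabulary ID per token.
vocab = {w: i for i, w in enumerate(sorted(set(sentence.split())))}
word_ids = [vocab[w] for w in sentence.split()]
print(word_ids)  # ten IDs, one per word

# Low-dimensional representation: proportions over a few latent topics.
topic_proportions = {"foxes": 0.4, "dogs": 0.4, "jumping": 0.2}
print(topic_proportions)  # sums to one
```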
Latent Variable Models
[Diagram: latent variables Z and parameters Φ generating the observed data X, with a plate over the data points]
dimensionality(X) >> dimensionality(Z)
Z is a bottleneck, which finds a compressed, low-dimensional representation of X
Latent Feature Models for Social Networks
[Diagram: Alice, Bob, and Claire in a small social network, each annotated with binary latent features such as Tango, Salsa, Cycling, Fishing, Running, and Waltz]
Miller, Griffiths, Jordan (2009): Latent Feature Relational Model
[Diagram: the network above encoded as a binary feature matrix Z, with one row per person (Alice, Bob, Claire) and one column per feature (Tango, Salsa, Cycling, Fishing, Running, Waltz)]
Latent Representations • Binary latent feature • Latent class • Mixed membership
Miller, Griffiths, Jordan (2009): Latent Feature Relational Model
E[Y] = σ(ZWZᵀ), where Z is the binary feature matrix above, W is a matrix of feature-interaction weights, and σ is the logistic function
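As a concrete illustration of E[Y] = σ(ZWZᵀ), here is a minimal numpy sketch; the feature assignments and random weights are assumptions for demonstration, not parameters from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Rows: Alice, Bob, Claire. Columns: hypothetical binary features
# (Tango, Salsa, Cycling, Fishing, Running, Waltz).
Z = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 1]])

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))  # feature-interaction weights

# Expected adjacency matrix: entry (i, j) is the probability of a link
# between person i and person j.
link_probs = sigmoid(Z @ W @ Z.T)
print(link_probs)
```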
Topics
[Figure: three example topics, each shown by its top 10 words: Topic 1 (reinforcement learning), Topic 2 (learning algorithms), Topic 3 (character recognition)]
Each topic is a distribution over all words in the dictionary: a vector of discrete probabilities that sums to one
Latent Dirichlet Allocation (Blei et al., 2003)
• For each topic k
• Draw its distribution over words φ(k) ~ Dirichlet(β)
• For each document d
• Draw its topic proportions θ(d) ~ Dirichlet(α)
• For each word wd,n
• Draw a topic assignment zd,n ~ Discrete(θ(d))
• Draw a word from the chosen topic wd,n ~ Discrete(φ(zd,n))
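The generative process translates almost line for line into code. A minimal sketch, with corpus sizes and hyperparameters chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 1000, 5, 50   # topics, vocabulary, documents, words per doc
alpha, beta = 0.1, 0.01       # Dirichlet hyperparameters

# For each topic k: draw its distribution over words phi(k) ~ Dirichlet(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for d in range(D):
    # For each document d: draw topic proportions theta(d) ~ Dirichlet(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)             # topic assignment z_{d,n}
        words.append(rng.choice(V, p=phi[z]))  # word w_{d,n} from topic z
    docs.append(words)
```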
LDA as Matrix Factorization
[Diagram: the documents-by-words matrix of word probabilities factorizes as θ × φᵀ]
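Continuing the sketch above, stacking the per-document θ vectors row-wise against the topic matrix makes the factorization explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(3, 0.1), size=5)    # D x K topic proportions
phi = rng.dirichlet(np.full(1000, 0.01), size=3)  # K x V topics

# Each row of theta @ phi is one document's distribution over the vocabulary,
# so LDA factorizes the D x V word-probability matrix.
word_probs = theta @ phi
assert np.allclose(word_probs.sum(axis=1), 1.0)
```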
LDA on Wikipedia
[Plot: topics learned on Wikipedia against wall-clock time (10 mins, 1 hour, 6 hours, 12 hours); 1 full iteration = 3.5 days!]
[Plot: stochastic variational inference on the same time axis]
[Plot: stochastic collapsed variational inference on the same time axis]
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Marginalize out the parameters, and perform inference on the latent variables only
• Simpler and faster, with fewer update equations
• Better mixing for Gibbs sampling
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Collapsed Gibbs sampler: resample each topic assignment zd,n from its conditional distribution

$P(z_{d,n} = k \mid \mathbf{z}^{\neg dn}, \mathbf{w}) \propto \frac{n^{\neg dn}_{k, w_{d,n}} + \beta}{n^{\neg dn}_{k} + V\beta} \left( n^{\neg dn}_{d,k} + \alpha \right)$

where n_{k,w} are the word-topic counts, n_{d,k} the document-topic counts, n_k the topic counts, and ¬dn means the current token is excluded from the counts
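A minimal sketch of one sweep of this sampler, assuming docs is a list of word-ID lists; the count arrays mirror the terms in the equation, and the variable names are mine, not the original implementation's.

```python
import numpy as np

def gibbs_sweep(docs, z, n_kw, n_dk, n_k, alpha, beta, rng):
    """One pass of collapsed Gibbs sampling for LDA.

    z[d][n] is the current topic of token n in document d;
    n_kw, n_dk, n_k are the word-topic, document-topic, and topic counts.
    """
    K, V = n_kw.shape
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k = z[d][n]
            # Remove the current assignment from the counts.
            n_kw[k, w] -= 1
            n_dk[d, k] -= 1
            n_k[k] -= 1
            # Conditional probability of each topic for this token.
            p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
            k = rng.choice(K, p=p / p.sum())
            # Add the new assignment back into the counts.
            n_kw[k, w] += 1
            n_dk[d, k] += 1
            n_k[k] += 1
            z[d][n] = k
```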
Stochastic Optimization for ML Stochastic algorithms • While (not converged) • Process a subset of the dataset, to estimate the update • Update parameters
Stochastic Optimization for ML • Stochastic gradient descent • Estimate the gradient • Stochastic variational inference (Hoffman et al., 2010, 2013) • Estimate the natural gradient of the variational parameters • Online EM (Cappé and Moulines, 2009) • Estimate E-step sufficient statistics
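All three methods instantiate the same loop. A minimal sketch of the recipe as plain stochastic gradient descent; grad_fn, data, and the fixed step size are placeholder assumptions:

```python
import numpy as np

def stochastic_optimize(params, data, grad_fn,
                        step_size=0.01, batch_size=64, n_steps=1000, seed=0):
    """Generic stochastic loop: estimate an update from a subset, then apply it."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        # Process a subset of the dataset to estimate the update...
        batch = data[rng.choice(len(data), size=batch_size)]
        # ...then update the parameters with the noisy estimate.
        params = params - step_size * grad_fn(params, batch)
    return params
```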
Goal: Build a Fast, Accurate, Scalable Algorithm for LDA • Collapsed LDA • Easy to implement • Fast • Accurate • Mixes well / propagates information quickly • Stochastic algorithms • Scalable • Quickly forgets the random initialization • Memory requirements and update time independent of the size of the data set • Can estimate topics before a single pass over the data is complete • Our contribution: an algorithm that gets the best of both worlds
Variational Bayesian Inference • An optimization strategy for performing posterior inference, i.e. estimating Pr(Z|X)
[Diagram: a family of tractable distributions Q fit to the intractable posterior P by minimizing KL(Q || P)]
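The picture corresponds to the standard variational identity: the log evidence splits into the evidence lower bound (ELBO) plus the KL divergence, so maximizing the ELBO over Q minimizes KL(Q || P).

```latex
\log p(X)
  = \underbrace{\mathbb{E}_{q(Z)}\!\left[\log \frac{p(X, Z)}{q(Z)}\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\bigl(q(Z) \,\|\, p(Z \mid X)\bigr)
```

Since log p(X) does not depend on q, the two terms trade off exactly: a tighter bound means a smaller divergence.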