Parallelized variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability
Ramesh Nallapati, William Cohen and John Lafferty
Machine Learning Department, Carnegie Mellon University
ICDM'07 HPDM workshop
Latent Dirichlet Allocation (LDA)
• A directed graphical model for topic mining from large-scale document collections
• A completely unsupervised technique
• Extracts semantically coherent multinomial distributions over the vocabulary, called topics
• Represents documents in a lower-dimensional topic space
LDA: generative process
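For reference, the standard LDA generative process (Blei, Ng and Jordan, 2003) that this slide depicts, with beta_k denoting topic k's multinomial over the vocabulary:

  For each document d:
    draw topic proportions   theta_d ~ Dirichlet(alpha)
    for each word position n in d:
      draw a topic   z_dn ~ Multinomial(theta_d)
      draw the word  w_dn ~ Multinomial(beta_{z_dn})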
LDA: topics
LDA: inference
• Intractable for exact inference
• Several approximate inference techniques are available:
  • Stochastic techniques: MCMC sampling
  • Numerical techniques: loopy belief propagation, variational inference, expectation propagation
LDA: variational inference
• The true (intractable) posterior over the latent variables is approximated by a fully factored variational posterior
• This yields a lower bound on the true data log-likelihood (restated below)
• The gap is exactly the KL divergence between the variational posterior and the true posterior
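The bound and the gap referred to above are the standard ones from variational LDA (Blei et al., 2003), restated here:

  log p(w | alpha, beta) >= E_q[ log p(theta, z, w | alpha, beta) ] - E_q[ log q(theta, z) ] = L(gamma, phi; alpha, beta)
  log p(w | alpha, beta) - L(gamma, phi; alpha, beta) = KL( q(theta, z | gamma, phi) || p(theta, z | w, alpha, beta) )

where q(theta, z | gamma, phi) = q(theta | gamma) * prod_n q(z_n | phi_n) is the fully factored variational posterior.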
LDA: variational inference
E-step and M-step update equations (a sketch of the standard updates follows).
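A sketch of the standard update equations behind the E-step and M-step labels (Blei et al., 2003; Psi denotes the digamma function):

  E-step (per document d, iterated to convergence):
    phi_dnk  proportional to  beta_{k, w_dn} * exp( Psi(gamma_dk) )    (normalized over k)
    gamma_dk = alpha_k + sum_n phi_dnk
  M-step (over the whole collection):
    beta_{kw}  proportional to  sum_d sum_n phi_dnk * 1[ w_dn = w ]
    alpha re-estimated by Newton-Raphson on the bound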
LDA: variational inference
• The main bottleneck is the E-step
• Key insight: the variational parameters gamma_d and phi_dnk can be computed independently for each document
  • Hence the E-step can be parallelized
• Two implementations:
  • Multi-processor architecture with shared memory
  • Distributed architecture with a shared disk
Parallel implementation
• Hardware:
  • Linux machine with 4 CPUs
  • Each CPU is an Intel Xeon 2.4 GHz processor
  • Shared 4 GB RAM
  • 512 KB cache
• Software:
  • David Blei's LDA implementation in C
  • Used pthreads to parallelize the code (an illustrative sketch follows)
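Not the authors' code (which builds on Blei's lda-c), but a minimal self-contained sketch of the idea: since gamma_d and phi_dnk are computed independently per document, each pthread processes a contiguous block of documents and accumulates a private sufficient-statistics buffer, so the shared model is only read during the E-step and no locking is needed. The toy sizes, the choice alpha_k = 1, and the use of gamma in place of exp(digamma(gamma)) are simplifications for the sketch, not details from the paper.

/* Illustrative sketch, not the authors' code: parallelizing the LDA
 * variational E-step over documents with pthreads.  Each thread handles a
 * contiguous block of documents and accumulates its own copy of the
 * class-word sufficient statistics, which the main thread merges afterwards. */
#include <math.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define K 50          /* number of topics (as in the experiments) */
#define V 1000        /* toy vocabulary size                      */
#define D 10000       /* toy number of documents                  */
#define LEN 100       /* words per toy document                   */
#define NTHREADS 4    /* one thread per CPU                       */

static int    docs[D][LEN];     /* word ids                                  */
static double log_beta[K][V];   /* current topic-word model (read-only here) */
static double gamma_[D][K];     /* per-document variational parameters       */

typedef struct {
    int    first, last;         /* document range [first, last)       */
    double ss[K][V];            /* thread-local sufficient statistics */
} worker_t;

static void e_step_doc(int d, double ss[K][V])
{
    double phi[K], gam[K];
    int it, n, k;

    for (k = 0; k < K; k++)                 /* standard initialization */
        gam[k] = 1.0 + (double)LEN / K;     /* assumes alpha_k = 1     */

    for (it = 0; it < 20; it++) {           /* fixed-point iterations  */
        double newgam[K];
        for (k = 0; k < K; k++) newgam[k] = 1.0;     /* alpha_k */
        for (n = 0; n < LEN; n++) {
            int w = docs[d][n];
            double norm = 0.0;
            for (k = 0; k < K; k++) {
                /* phi_dnk is proportional to beta_{k,w} * exp(digamma(gamma_dk));
                   gam[k] stands in for exp(digamma(gam[k])) to keep this libm-only */
                phi[k] = exp(log_beta[k][w] + log(gam[k]));
                norm += phi[k];
            }
            for (k = 0; k < K; k++) newgam[k] += phi[k] / norm;
        }
        for (k = 0; k < K; k++) gam[k] = newgam[k];
    }

    for (n = 0; n < LEN; n++) {             /* accumulate M-step statistics */
        int w = docs[d][n];
        double norm = 0.0;
        for (k = 0; k < K; k++) { phi[k] = exp(log_beta[k][w] + log(gam[k])); norm += phi[k]; }
        for (k = 0; k < K; k++) ss[k][w] += phi[k] / norm;
    }
    for (k = 0; k < K; k++) gamma_[d][k] = gam[k];
}

static void *worker(void *arg)
{
    worker_t *w = arg;
    for (int d = w->first; d < w->last; d++)
        e_step_doc(d, w->ss);
    return NULL;
}

int main(void)
{
    static worker_t workers[NTHREADS];      /* static => ss starts zeroed */
    pthread_t tid[NTHREADS];

    for (int d = 0; d < D; d++)             /* toy data: random documents */
        for (int n = 0; n < LEN; n++) docs[d][n] = rand() % V;
    for (int k = 0; k < K; k++)             /* toy model: uniform topics  */
        for (int v = 0; v < V; v++) log_beta[k][v] = -log((double)V);

    for (int t = 0; t < NTHREADS; t++) {    /* one document block per thread */
        workers[t].first = t * D / NTHREADS;
        workers[t].last  = (t + 1) * D / NTHREADS;
        pthread_create(&tid[t], NULL, worker, &workers[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    /* the M-step would merge workers[t].ss and re-estimate beta here */
    printf("E-step done: %d documents on %d threads\n", D, NTHREADS);
    return 0;
}

Compile with, e.g., gcc -O2 -pthread sketch.c -lm. Giving each thread a private sufficient-statistics buffer trades memory for lock-free accumulation; the shared read-only model is still a potential source of the memory contention discussed later.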
Distributed implementation
• Hardware:
  • Cluster of 96 nodes
  • Each node is a Linux machine with a Transmeta Efficeon 1.2 GHz processor
  • 1 GB RAM and 1 MB cache per node
• Software:
  • David Blei's C code forms the core
  • Perl code coordinates the worker nodes
  • rsh connections invoke the worker nodes
  • Communication through the shared disk (a sketch of the M-step merge follows)
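The coordination code itself is Perl over rsh, so the following is only a hypothetical C sketch of the M-step merge on the coordinator, under an assumed file layout: each worker writes its K x V class-word sufficient statistics as raw doubles to ss.<node>.dat on the shared disk, and the coordinator sums them, renormalizes, and writes the new model beta.dat back for the next E-step. The file names and binary format are illustrative assumptions, not the paper's.

/* Hypothetical sketch of the coordinator's M-step merge over per-node
 * sufficient-statistics files written to the shared disk. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define K 50            /* topics                                  */
#define V 100000        /* vocabulary size (as in the PubMed data) */

int main(int argc, char **argv)
{
    int nnodes = (argc > 1) ? atoi(argv[1]) : 1;
    double *ss  = calloc((size_t)K * V, sizeof *ss);   /* summed counts   */
    double *buf = malloc((size_t)K * V * sizeof *buf); /* one node's file */
    if (!ss || !buf) { perror("malloc"); return 1; }

    /* sum the per-node sufficient statistics from the shared disk */
    for (int node = 0; node < nnodes; node++) {
        char path[64];
        snprintf(path, sizeof path, "ss.%d.dat", node);
        FILE *fp = fopen(path, "rb");
        if (!fp) { perror(path); return 1; }
        if (fread(buf, sizeof *buf, (size_t)K * V, fp) != (size_t)K * V) {
            fprintf(stderr, "%s: short read\n", path);
            return 1;
        }
        fclose(fp);
        for (long i = 0; i < (long)K * V; i++) ss[i] += buf[i];
    }

    /* M-step: overwrite ss with log beta_{kw} = log( ss_{kw} / sum_w ss_{kw} ),
       with a small smoothing constant to avoid log(0) */
    for (int k = 0; k < K; k++) {
        double total = 0.0;
        for (int w = 0; w < V; w++) total += ss[(long)k * V + w] + 1e-10;
        for (int w = 0; w < V; w++)
            ss[(long)k * V + w] = log(ss[(long)k * V + w] + 1e-10) - log(total);
    }

    /* write the new model back to the shared disk for the next E-step */
    FILE *out = fopen("beta.dat", "wb");
    if (!out) { perror("beta.dat"); return 1; }
    fwrite(ss, sizeof *ss, (size_t)K * V, out);
    fclose(out);

    printf("merged %d node files into beta.dat\n", nnodes);
    free(ss); free(buf);
    return 0;
}

Because every node writes a full K x V matrix, the data the coordinator reads in the M-step grows linearly with the number of nodes; this is the "larger input file size with more nodes" effect noted in the distributed discussion.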
Data
• A subset of PubMed consisting of 300K documents
• Collection indexed using Lemur
• Stopwords removed and words stemmed
• Vocabulary size: approximately 100,000
• Generated subcollections of various sizes
  • Vocabulary size remains the same
Experiments
• Studied runtime as a function of:
  • number of threads/nodes
  • collection size
• Fixed the number of topics at 50
• Multiprocessor setting: varied the number of CPUs from 1 to 4
• Distributed setting: varied the number of nodes from 1 to 90
• LDA initialization on a collection:
  • randomly initialized LDA run for 1 EM iteration
  • resulting model used as the starting point in all experiments
• Reported average runtime per EM iteration
Results: Multiprocessor
Results: Multiprocessor case, 50,000 documents
Discussion
• The plot shows the E-step is the main bottleneck
• The speedup is not linear!
  • A speedup of only 1.85 from 1 to 4 CPUs (50,000 docs; a back-of-envelope reading follows)
  • Possibly a conflict between threads in read-accessing the model in main memory
• Create a copy of the model in memory for each thread?
  • Results in huge memory requirements
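A back-of-envelope reading of the 1.85x figure (illustrative, not from the paper): if all per-iteration overheads (the serial M-step, memory contention) are lumped into an effective serial fraction 1 - p, Amdahl's law gives

  speedup(N) = 1 / ( (1 - p) + p / N ),   so   1.85 = 1 / ( (1 - p) + p / 4 )   implies   p = 0.61 (approximately).

That is, only about 60% of the per-iteration runtime behaves as perfectly parallel work on this configuration, which is consistent with the contention hypothesis above.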
Results: Distributed
Results: Distributed case, 50,000 documents
Discussion
• Sub-linear speedups
  • Speedup of 14.5 from 1 to 50 nodes (50,000 docs)
• Speedup tapers off after an optimum number of nodes
  • Conflicts in reading from the shared disk
  • M-step: total input file size grows with the number of nodes
• The optimum number of nodes increases with collection size
  • Scaling the cluster size is desirable for larger collections
Conclusions
• The distributed version seems more desirable:
  • Scalable
  • Cheaper
• Future work: further improvements
  • Communication using RPCs
  • Loading only the part of the model corresponding to the sparse document-term matrix during the E-step
  • Loading one document at a time in the E-step
  • Document clustering before splitting the data between nodes