Scaling up LDA (Monday’s lecture)
What if you try to parallelize? Split the document/term matrix randomly, distribute it to p processors… then run “Approximate Distributed LDA” on each shard (a toy sketch follows). The aggregate-and-redistribute step this needs is a common subtask in parallel versions of LDA, SGD, …: AllReduce.
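To make the AD-LDA idea concrete, here is an illustrative toy in Python/numpy, not any particular system's code: each of the p processors (simulated sequentially below) runs collapsed-Gibbs sweeps over its own shard of documents against a stale copy of the global topic-word counts, and the count deltas are merged afterwards. The corpus, hyperparameters, and names are all made up for the sketch.

```python
import numpy as np

# Toy Approximate Distributed LDA: shard the documents, Gibbs-sample each
# shard against stale global topic-word counts, then merge the deltas.
K, V, alpha, beta = 10, 1000, 0.1, 0.01
rng = np.random.default_rng(0)
docs = [rng.integers(0, V, size=50) for _ in range(40)]   # 40 toy documents
z = [rng.integers(0, K, size=len(d)) for d in docs]       # random topic init

doc_topic = np.zeros((len(docs), K), dtype=int)
global_tw = np.zeros((K, V), dtype=int)
for d, (words, zd) in enumerate(zip(docs, z)):
    for w, k in zip(words, zd):
        doc_topic[d, k] += 1
        global_tw[k, w] += 1

def gibbs_sweep(doc_ids, topic_word, topic_tot):
    """One collapsed-Gibbs sweep over a shard, updating its local counts."""
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            k_old = z[d][i]                              # remove old assignment
            topic_word[k_old, w] -= 1; topic_tot[k_old] -= 1; doc_topic[d, k_old] -= 1
            p_k = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) / (topic_tot + V * beta)
            k_new = rng.choice(K, p=p_k / p_k.sum())     # sample the conditional
            z[d][i] = k_new                              # add new assignment
            topic_word[k_new, w] += 1; topic_tot[k_new] += 1; doc_topic[d, k_new] += 1

p = 4                                                    # simulated processors
shards = np.array_split(np.arange(len(docs)), p)
for _ in range(10):
    deltas = []
    for shard in shards:                                 # would run in parallel
        local_tw = global_tw.copy()                      # stale shared counts
        gibbs_sweep(shard, local_tw, local_tw.sum(axis=1))
        deltas.append(local_tw - global_tw)
    global_tw += sum(deltas)                             # merge everyone's changes
```

The merge step at the bottom is exactly the aggregate-and-redistribute pattern the next slides abstract as AllReduce.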
Introduction
• Common pattern:
  • do some learning in parallel
  • aggregate local changes from each processor to shared parameters
  • distribute the new shared parameters back to each processor
  • and repeat….
(in MapReduce terms: the parallel learning is a MAP, the aggregation is a REDUCE, and the redistribution is “some sort of copy”)
Introduction (cont.)
• Same pattern, but the aggregate and redistribute steps become a single AllReduce (see the sketch below)
• AllReduce is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme
(in these terms: the parallel learning is the MAP, and the combine-and-redistribute is the ALLREDUCE)
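In MPI terms, the aggregate-plus-redistribute step is a single collective operation, allreduce. A minimal runnable sketch with mpi4py, assuming an MPI installation; `local_step` is a made-up placeholder for whatever learning each worker does on its shard:

```python
# Learn-locally / AllReduce / repeat pattern, sketched with mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dim = 100
w = np.zeros(dim)                         # shared parameters, replicated everywhere

def local_step(w, rank):
    """Placeholder: compute this worker's update from its local data."""
    rng = np.random.default_rng(rank)
    return -0.01 * rng.normal(size=w.shape)   # pretend gradient step

for it in range(10):
    delta = local_step(w, comm.Get_rank())    # learn in parallel
    total = np.empty_like(delta)
    # AllReduce = reduce (sum everyone's delta) + broadcast the result
    comm.Allreduce(delta, total, op=MPI.SUM)
    w += total / comm.Get_size()              # every worker now has the same w
```

Launched as, e.g., `mpiexec -n 4 python allreduce_demo.py` (file name hypothetical), every worker ends each iteration holding the same parameter vector w.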
Gory details of VW Hadoop-AllReduce
• Spanning-tree server:
  • A separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
• Worker nodes (“fake” mappers):
  • Input for each worker is locally cached
  • Workers all connect to the spanning-tree server
  • Workers all execute the same code, which might contain AllReduce calls
  • Workers synchronize whenever they reach an AllReduce
(a toy sketch of the tree-structured reduce/broadcast follows)
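A toy Python illustration of what the spanning tree buys you: partial sums flow from the leaves to the root, and the root's total flows back down, so every worker ends up with the global aggregate after a tree-depth number of hops. The `Node` class and the tree shape below are hypothetical; real VW workers do this over TCP connections brokered by the spanning-tree server.

```python
# Tree-structured AllReduce: reduce up to the root, broadcast back down.
class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

    def reduce_up(self):
        """Phase 1: partial sums flow from the leaves toward the root."""
        return self.value + sum(c.reduce_up() for c in self.children)

    def broadcast_down(self, total):
        """Phase 2: the root's total flows back down to every node."""
        self.result = total
        for c in self.children:
            c.broadcast_down(total)

# 7 workers arranged in a binary spanning tree, each holding one number
leaves = [Node(v) for v in (3, 4, 5, 6)]
mid = [Node(1, leaves[:2]), Node(2, leaves[2:])]
root = Node(0, mid)

total = root.reduce_up()        # 0+1+2+3+4+5+6 = 21
root.broadcast_down(total)      # now every node's .result == 21
assert all(n.result == 21 for n in leaves + mid + [root])
```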
Hadoop AllReduce: don’t wait for duplicate jobs. Hadoop may speculatively launch duplicate copies of slow tasks; the spanning tree uses whichever copy connects first, so stragglers don’t stall the reduction.
2^24 features, ~100 non-zeros per example, 2.3B examples. Each example is a user/page/ad triple (and conjunctions of these features); it is positive if there was a click-through on the ad.
50M examples, an explicitly constructed kernel with 11.7M features, 3,300 non-zeros per example. Old method: SVM, 3 days (times reported are time to reach a fixed test error).
Pilfered from… NIPS 2010: “Online Learning for Latent Dirichlet Allocation”, Matthew Hoffman, David Blei & Francis Bach
(Algorithm figure from the online LDA paper: the per-document local updates use the variational parameters γ; the global topic updates use λ.)
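The λ/γ split is the heart of the Hoffman, Blei & Bach algorithm: a per-document E-step on the local parameters γ (and φ), then a small step of the global topic parameters λ toward what they would be if the whole corpus looked like that document. Below is a rough numpy sketch in that spirit; the hyperparameter values, the assumed corpus size D, and single-document minibatches are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.special import digamma

K, V, D = 10, 1000, 100000        # topics, vocab size, assumed corpus size
alpha, eta, tau0, kappa = 0.1, 0.01, 1.0, 0.7
rng = np.random.default_rng(0)
lambda_ = rng.gamma(100.0, 0.01, size=(K, V))   # global topic parameters

def online_update(word_ids, word_cts, t, n_inner=20):
    """One document's worth of online LDA: local E-step on gamma,
    then a stochastic step on the global topic parameters lambda_."""
    global lambda_
    Elog_beta = digamma(lambda_) - digamma(lambda_.sum(axis=1, keepdims=True))
    expElog_beta = np.exp(Elog_beta[:, word_ids])        # K x (distinct words in doc)
    gamma = rng.gamma(100.0, 0.01, size=K)               # local variational params
    for _ in range(n_inner):                             # document E-step
        expElog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi_norm = expElog_theta @ expElog_beta + 1e-100
        gamma = alpha + expElog_theta * (expElog_beta @ (word_cts / phi_norm))
    # sufficient statistics as if the whole corpus (D docs) looked like this one
    sstats = np.zeros((K, V))
    sstats[:, word_ids] = np.outer(expElog_theta, word_cts / phi_norm) * expElog_beta
    rho = (tau0 + t) ** (-kappa)                         # decaying step size
    lambda_ = (1 - rho) * lambda_ + rho * (eta + D * sstats)

# toy call: a document where word ids 3, 17, 42 appear 2, 1, 4 times
online_update(np.array([3, 17, 42]), np.array([2.0, 1.0, 4.0]), t=0)
```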
Recap: compute expectations over the z’s any way you want….
Technical Details
• Variational distribution: q(z_d) over the document’s whole topic-assignment vector, not factorized per word!
• Approximate q(z_d) using Gibbs: after sampling for a while, estimate the needed expectations from the samples (see the sketch below)
• Evaluate using time and “coherence”, where D(w) = # of docs containing word w
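A hedged sketch of the “approximate with Gibbs” step, assuming the expectations we need are the document’s expected topic counts and per-word topic probabilities (function and variable names are mine, not the paper’s): sample z_d for a while with the current topic-word weights held fixed, discard a burn-in, and average the remaining samples.

```python
import numpy as np

def estimate_doc_expectations(word_ids, topic_weights, alpha=0.1,
                              n_sweeps=50, burn_in=25, rng=None):
    """Gibbs-sample the topic assignments z_d of one document (topic-word
    weights held fixed) and average post-burn-in samples to estimate
    E[topic counts] and E[z_i = k] for each word position."""
    rng = rng or np.random.default_rng()
    K = topic_weights.shape[0]
    n = len(word_ids)
    z = rng.integers(0, K, size=n)                       # random init
    doc_counts = np.bincount(z, minlength=K).astype(float)
    exp_counts = np.zeros(K)
    exp_z = np.zeros((n, K))
    kept = 0
    for sweep in range(n_sweeps):
        for i, w in enumerate(word_ids):
            doc_counts[z[i]] -= 1                        # leave word i out
            p = (doc_counts + alpha) * topic_weights[:, w]
            z[i] = rng.choice(K, p=p / p.sum())
            doc_counts[z[i]] += 1
        if sweep >= burn_in:                             # average after burn-in
            kept += 1
            exp_counts += doc_counts
            exp_z[np.arange(n), z] += 1
    return exp_counts / kept, exp_z / kept

# toy use: 3 topics over a 20-word vocabulary, one 15-word document
rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(20), size=3)              # K x V topic-word probs
counts, z_probs = estimate_doc_expectations(rng.integers(0, 20, size=15),
                                             topics, rng=rng)
```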
Summary of LDA speedup tricks
• Gibbs sampler:
  • O(N*K*T) (N tokens, K topics, T sampling iterations), and K grows with N
  • Need to keep the corpus (and z’s) in memory
• You can parallelize:
  • Each worker only needs a slice of the corpus
  • But you need to synchronize K multinomials over the vocabulary
  • AllReduce would help?
• You can sparsify the sampling and topic-counts:
  • Mimno’s trick - greatly reduces memory (a sketch of the bucket decomposition follows)
• You can do the computation on-line:
  • Only need to keep K multinomials and one document’s worth of corpus and z’s in memory
• You can combine some of these methods:
  • Online sparsified LDA
  • Parallel online sparsified LDA?
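On “Mimno’s trick”: I am reading this as the SparseLDA bucket decomposition of Yao, Mimno & McCallum, in which the Gibbs conditional (n_dk + α)(n_kw + β)/(n_k + Vβ) splits into a dense smoothing bucket plus two sparse buckets that only touch topics active in this document or for this word. A rough sketch of the decomposition follows; the real implementation caches the bucket masses and updates them incrementally as counts change, which is where the speed and memory savings come from. Everything here is illustrative.

```python
import numpy as np

def sparse_sample(n_dk, n_kw, n_k, alpha, beta, V, rng):
    """Draw one topic for one word occurrence.  n_dk: this document's
    topic counts (mostly zero), n_kw: topic counts for this word (mostly
    zero), n_k: total tokens per topic."""
    denom = V * beta + n_k                    # shape (K,)
    s = alpha * beta / denom                  # "smoothing" bucket (dense, cacheable)
    r = n_dk * beta / denom                   # "document" bucket (sparse in n_dk)
    q = (alpha + n_dk) * n_kw / denom         # "topic-word" bucket (sparse in n_kw)
    u = rng.uniform(0.0, q.sum() + r.sum() + s.sum())
    for bucket in (q, r, s):                  # check the sparse buckets first
        if u < bucket.sum():
            for k in np.flatnonzero(bucket):  # walk only the non-zero entries
                u -= bucket[k]
                if u <= 0:
                    return int(k)
        u -= bucket.sum()
    return int(np.argmax(s))                  # numerical fallback

# toy call: 5 topics, a word seen only under topics 1 and 3, a short doc
rng = np.random.default_rng(0)
k = sparse_sample(n_dk=np.array([2, 0, 1, 0, 0]),
                  n_kw=np.array([0, 4, 0, 7, 0]),
                  n_k=np.array([50, 40, 60, 30, 20]),
                  alpha=0.1, beta=0.01, V=1000, rng=rng)
```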