Scaling up LDA (Monday’s lecture)
What if you try to parallelize? Split the document/term matrix randomly, distribute it to p processors… then run “Approximate Distributed LDA” on each shard (a toy sketch follows). The aggregate-and-redistribute step this needs is a common subtask in parallel versions of LDA, SGD, …: AllReduce.
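To make the AD-LDA idea concrete, here is an illustrative toy in Python/numpy, not any particular system's code: each of the p processors (simulated sequentially below) runs collapsed-Gibbs sweeps over its own shard of documents against a stale copy of the global topic-word counts, and the count deltas are merged afterwards. The corpus, hyperparameters, and names are all made up for the sketch.

```python
import numpy as np

# Toy Approximate Distributed LDA: shard the documents, Gibbs-sample each
# shard against stale global topic-word counts, then merge the deltas.
K, V, alpha, beta = 10, 1000, 0.1, 0.01
rng = np.random.default_rng(0)
docs = [rng.integers(0, V, size=50) for _ in range(40)]   # 40 toy documents
z = [rng.integers(0, K, size=len(d)) for d in docs]       # random topic init

doc_topic = np.zeros((len(docs), K), dtype=int)
global_tw = np.zeros((K, V), dtype=int)
for d, (words, zd) in enumerate(zip(docs, z)):
    for w, k in zip(words, zd):
        doc_topic[d, k] += 1
        global_tw[k, w] += 1

def gibbs_sweep(doc_ids, topic_word, topic_tot):
    """One collapsed-Gibbs sweep over a shard, updating its local counts."""
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            k_old = z[d][i]                              # remove old assignment
            topic_word[k_old, w] -= 1; topic_tot[k_old] -= 1; doc_topic[d, k_old] -= 1
            p_k = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) / (topic_tot + V * beta)
            k_new = rng.choice(K, p=p_k / p_k.sum())     # sample the conditional
            z[d][i] = k_new                              # add new assignment
            topic_word[k_new, w] += 1; topic_tot[k_new] += 1; doc_topic[d, k_new] += 1

p = 4                                                    # simulated processors
shards = np.array_split(np.arange(len(docs)), p)
for _ in range(10):
    deltas = []
    for shard in shards:                                 # would run in parallel
        local_tw = global_tw.copy()                      # stale shared counts
        gibbs_sweep(shard, local_tw, local_tw.sum(axis=1))
        deltas.append(local_tw - global_tw)
    global_tw += sum(deltas)                             # merge everyone's changes
```

The merge step at the bottom is exactly the aggregate-and-redistribute pattern the next slides abstract as AllReduce.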
Introduction
• Common pattern:
  • do some learning in parallel
  • aggregate local changes from each processor to shared parameters
  • distribute the new shared parameters back to each processor
  • and repeat….
(in MapReduce terms: the parallel learning is a MAP, the aggregation is a REDUCE, and the redistribution is “some sort of copy”)
Introduction (cont.)
• Same pattern, but the aggregate and redistribute steps become a single AllReduce (see the sketch below)
• AllReduce is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme
(in these terms: the parallel learning is the MAP, and the combine-and-redistribute is the ALLREDUCE)
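In MPI terms, the aggregate-plus-redistribute step is a single collective operation, allreduce. A minimal runnable sketch with mpi4py, assuming an MPI installation; `local_step` is a made-up placeholder for whatever learning each worker does on its shard:

```python
# Learn-locally / AllReduce / repeat pattern, sketched with mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dim = 100
w = np.zeros(dim)                         # shared parameters, replicated everywhere

def local_step(w, rank):
    """Placeholder: compute this worker's update from its local data."""
    rng = np.random.default_rng(rank)
    return -0.01 * rng.normal(size=w.shape)   # pretend gradient step

for it in range(10):
    delta = local_step(w, comm.Get_rank())    # learn in parallel
    total = np.empty_like(delta)
    # AllReduce = reduce (sum everyone's delta) + broadcast the result
    comm.Allreduce(delta, total, op=MPI.SUM)
    w += total / comm.Get_size()              # every worker now has the same w
```

Launched as, e.g., `mpiexec -n 4 python allreduce_demo.py` (file name hypothetical), every worker ends each iteration holding the same parameter vector w.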
Gory details of VW Hadoop-AllReduce
• Spanning-tree server:
  • A separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
• Worker nodes (“fake” mappers):
  • Input for each worker is locally cached
  • Workers all connect to the spanning-tree server
  • Workers all execute the same code, which might contain AllReduce calls
  • Workers synchronize whenever they reach an AllReduce
(a toy sketch of the tree-structured reduce/broadcast follows)
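A toy Python illustration of what the spanning tree buys you: partial sums flow from the leaves to the root, and the root's total flows back down, so every worker ends up with the global aggregate after a tree-depth number of hops. The `Node` class and the tree shape below are hypothetical; real VW workers do this over TCP connections brokered by the spanning-tree server.

```python
# Tree-structured AllReduce: reduce up to the root, broadcast back down.
class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

    def reduce_up(self):
        """Phase 1: partial sums flow from the leaves toward the root."""
        return self.value + sum(c.reduce_up() for c in self.children)

    def broadcast_down(self, total):
        """Phase 2: the root's total flows back down to every node."""
        self.result = total
        for c in self.children:
            c.broadcast_down(total)

# 7 workers arranged in a binary spanning tree, each holding one number
leaves = [Node(v) for v in (3, 4, 5, 6)]
mid = [Node(1, leaves[:2]), Node(2, leaves[2:])]
root = Node(0, mid)

total = root.reduce_up()        # 0+1+2+3+4+5+6 = 21
root.broadcast_down(total)      # now every node's .result == 21
assert all(n.result == 21 for n in leaves + mid + [root])
```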
Hadoop AllReduce: don’t wait for duplicate jobs. Hadoop may speculatively launch duplicate copies of slow tasks; the spanning tree uses whichever copy connects first, so stragglers don’t stall the reduction.
2^24 features, ~100 non-zeros per example, 2.3B examples. Each example is a user/page/ad triple (and conjunctions of these features); it is positive if there was a click-through on the ad.
50M examples, an explicitly constructed kernel with 11.7M features, 3,300 non-zeros per example. Old method: SVM, 3 days (times reported are time to reach a fixed test error).
Pilfered from… NIPS 2010: “Online Learning for Latent Dirichlet Allocation”, Matthew Hoffman, David Blei & Francis Bach
(Algorithm figure from the online LDA paper: the per-document local updates use the variational parameters γ; the global topic updates use λ.)
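The λ/γ split is the heart of the Hoffman, Blei & Bach algorithm: a per-document E-step on the local parameters γ (and φ), then a small step of the global topic parameters λ toward what they would be if the whole corpus looked like that document. Below is a rough numpy sketch in that spirit; the hyperparameter values, the assumed corpus size D, and single-document minibatches are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.special import digamma

K, V, D = 10, 1000, 100000        # topics, vocab size, assumed corpus size
alpha, eta, tau0, kappa = 0.1, 0.01, 1.0, 0.7
rng = np.random.default_rng(0)
lambda_ = rng.gamma(100.0, 0.01, size=(K, V))   # global topic parameters

def online_update(word_ids, word_cts, t, n_inner=20):
    """One document's worth of online LDA: local E-step on gamma,
    then a stochastic step on the global topic parameters lambda_."""
    global lambda_
    Elog_beta = digamma(lambda_) - digamma(lambda_.sum(axis=1, keepdims=True))
    expElog_beta = np.exp(Elog_beta[:, word_ids])        # K x (distinct words in doc)
    gamma = rng.gamma(100.0, 0.01, size=K)               # local variational params
    for _ in range(n_inner):                             # document E-step
        expElog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi_norm = expElog_theta @ expElog_beta + 1e-100
        gamma = alpha + expElog_theta * (expElog_beta @ (word_cts / phi_norm))
    # sufficient statistics as if the whole corpus (D docs) looked like this one
    sstats = np.zeros((K, V))
    sstats[:, word_ids] = np.outer(expElog_theta, word_cts / phi_norm) * expElog_beta
    rho = (tau0 + t) ** (-kappa)                         # decaying step size
    lambda_ = (1 - rho) * lambda_ + rho * (eta + D * sstats)

# toy call: a document where word ids 3, 17, 42 appear 2, 1, 4 times
online_update(np.array([3, 17, 42]), np.array([2.0, 1.0, 4.0]), t=0)
```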
Recap: compute expectations over the z’s any way you want….
Technical Details
• Variational distribution: q(z_d) over the document’s whole topic-assignment vector, not factorized per word!
• Approximate q(z_d) using Gibbs: after sampling for a while, estimate the needed expectations from the samples (see the sketch below)
• Evaluate using time and “coherence”, where D(w) = # of docs containing word w
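A hedged sketch of the “approximate with Gibbs” step, assuming the expectations we need are the document’s expected topic counts and per-word topic probabilities (function and variable names are mine, not the paper’s): sample z_d for a while with the current topic-word weights held fixed, discard a burn-in, and average the remaining samples.

```python
import numpy as np

def estimate_doc_expectations(word_ids, topic_weights, alpha=0.1,
                              n_sweeps=50, burn_in=25, rng=None):
    """Gibbs-sample the topic assignments z_d of one document (topic-word
    weights held fixed) and average post-burn-in samples to estimate
    E[topic counts] and E[z_i = k] for each word position."""
    rng = rng or np.random.default_rng()
    K = topic_weights.shape[0]
    n = len(word_ids)
    z = rng.integers(0, K, size=n)                       # random init
    doc_counts = np.bincount(z, minlength=K).astype(float)
    exp_counts = np.zeros(K)
    exp_z = np.zeros((n, K))
    kept = 0
    for sweep in range(n_sweeps):
        for i, w in enumerate(word_ids):
            doc_counts[z[i]] -= 1                        # leave word i out
            p = (doc_counts + alpha) * topic_weights[:, w]
            z[i] = rng.choice(K, p=p / p.sum())
            doc_counts[z[i]] += 1
        if sweep >= burn_in:                             # average after burn-in
            kept += 1
            exp_counts += doc_counts
            exp_z[np.arange(n), z] += 1
    return exp_counts / kept, exp_z / kept

# toy use: 3 topics over a 20-word vocabulary, one 15-word document
rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(20), size=3)              # K x V topic-word probs
counts, z_probs = estimate_doc_expectations(rng.integers(0, 20, size=15),
                                             topics, rng=rng)
```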
Summary of LDA speedup tricks
• Gibbs sampler:
  • O(N*K*T) (N tokens, K topics, T sampling iterations), and K grows with N
  • Need to keep the corpus (and z’s) in memory
• You can parallelize:
  • Each worker only needs a slice of the corpus
  • But you need to synchronize K multinomials over the vocabulary
  • AllReduce would help?
• You can sparsify the sampling and topic-counts:
  • Mimno’s trick - greatly reduces memory (a sketch of the bucket decomposition follows)
• You can do the computation on-line:
  • Only need to keep K multinomials and one document’s worth of corpus and z’s in memory
• You can combine some of these methods:
  • Online sparsified LDA
  • Parallel online sparsified LDA?
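On “Mimno’s trick”: I am reading this as the SparseLDA bucket decomposition of Yao, Mimno & McCallum, in which the Gibbs conditional (n_dk + α)(n_kw + β)/(n_k + Vβ) splits into a dense smoothing bucket plus two sparse buckets that only touch topics active in this document or for this word. A rough sketch of the decomposition follows; the real implementation caches the bucket masses and updates them incrementally as counts change, which is where the speed and memory savings come from. Everything here is illustrative.

```python
import numpy as np

def sparse_sample(n_dk, n_kw, n_k, alpha, beta, V, rng):
    """Draw one topic for one word occurrence.  n_dk: this document's
    topic counts (mostly zero), n_kw: topic counts for this word (mostly
    zero), n_k: total tokens per topic."""
    denom = V * beta + n_k                    # shape (K,)
    s = alpha * beta / denom                  # "smoothing" bucket (dense, cacheable)
    r = n_dk * beta / denom                   # "document" bucket (sparse in n_dk)
    q = (alpha + n_dk) * n_kw / denom         # "topic-word" bucket (sparse in n_kw)
    u = rng.uniform(0.0, q.sum() + r.sum() + s.sum())
    for bucket in (q, r, s):                  # check the sparse buckets first
        if u < bucket.sum():
            for k in np.flatnonzero(bucket):  # walk only the non-zero entries
                u -= bucket[k]
                if u <= 0:
                    return int(k)
        u -= bucket.sum()
    return int(np.argmax(s))                  # numerical fallback

# toy call: 5 topics, a word seen only under topics 1 and 3, a short doc
rng = np.random.default_rng(0)
k = sparse_sample(n_dk=np.array([2, 0, 1, 0, 0]),
                  n_kw=np.array([0, 4, 0, 7, 0]),
                  n_k=np.array([50, 40, 60, 30, 20]),
                  alpha=0.1, beta=0.01, V=1000, rng=rng)
```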