Large Scale Parallel Supervised Topic-Modeling -implementation plan-

Keisuke Kamataki, Jun Zhu, Eric Xing
Sep 27, 2010

Implementation plan

Big picture: implement in three stages, each as a separate implementation. (Distributed MedLDA can already be run with the Plan 1 implementation.)
• Plan 1: E-step and M-step are separate programs. The SVM in the M-step is not parallelized.
• Plan 2: E-step and M-step are separate programs. The SVM in the M-step is parallelized.
• Plan 3: Everything is integrated and parallelized within a single program.
* Start from Plan 1, then extend it to Plan 2, and try Plan 3 last.

Plan 1 (E-step and M-step are separated. SVM in M-step is not parallelized)

Given: many documents.
Repeat until convergence:
• E-step (single program): perform Gibbs sampling of the topic assignments z in parallel, and collect the sufficient statistics.
• M-step (single program): update α, β, η, μ on a single computer.
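The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the planned C++ code: α is taken to be a scalar symmetric Dirichlet parameter, β a K x V matrix of topic-word probabilities held fixed during the E-step, and the supervised part (η, μ) is left out entirely. Threads stand in for the parallel workers, a fixed iteration count stands in for the convergence test, and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def e_step_shard(docs, alpha, beta, n_sweeps, seed):
    """Gibbs-sample topic assignments z for one shard of documents.

    docs is a list of documents, each a list of word ids.
    Returns K x V topic-word counts (this shard's sufficient statistics).
    """
    rng = random.Random(seed)
    K, V = len(beta), len(beta[0])
    counts = [[0] * V for _ in range(K)]
    for doc in docs:
        z = [rng.randrange(K) for _ in doc]  # random initial assignments
        n_dk = [0] * K                       # per-document topic counts
        for k in z:
            n_dk[k] += 1
        for _ in range(n_sweeps):
            for i, w in enumerate(doc):
                n_dk[z[i]] -= 1              # remove the current assignment
                # Assumed conditional: p(z_i = k) ∝ (n_dk + alpha) * beta[k][w]
                weights = [(n_dk[k] + alpha) * beta[k][w] for k in range(K)]
                z[i] = rng.choices(range(K), weights=weights)[0]
                n_dk[z[i]] += 1
        for i, w in enumerate(doc):
            counts[z[i]][w] += 1
    return counts


def merge_counts(shard_counts):
    """Sum the per-shard K x V count matrices."""
    K, V = len(shard_counts[0]), len(shard_counts[0][0])
    total = [[0] * V for _ in range(K)]
    for c in shard_counts:
        for k in range(K):
            for w in range(V):
                total[k][w] += c[k][w]
    return total


def m_step_topics(counts, smoothing=0.01):
    """Re-estimate beta from the merged counts (topic part of the M-step only)."""
    beta = []
    for row in counts:
        denom = sum(row) + smoothing * len(row)
        beta.append([(c + smoothing) / denom for c in row])
    return beta


def run_em(shards, K, V, alpha=0.1, n_iters=5, n_sweeps=5):
    """Alternate a parallel E-step and a single-machine M-step."""
    beta = [[1.0 / V] * V for _ in range(K)]
    counts = None
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for it in range(n_iters):
            futures = [
                pool.submit(e_step_shard, s, alpha, beta, n_sweeps, 1000 * it + j)
                for j, s in enumerate(shards)
            ]
            counts = merge_counts([f.result() for f in futures])
            beta = m_step_topics(counts)
    return beta, counts
```

Because β is fixed during the E-step in this sketch, the shards do not interact while sampling, and only the summed counts have to be sent to the M-step.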

Plan 1 in detail
• Prepare the E-step code and the M-step code separately (probably in C++), and chain them together with a Shell/Perl/Ruby script.
• Quick to implement and easy to debug.
• Can be extended to Plan 2 and Plan 3.
• May fail to scale only when both K (the number of topics) and L (the number of labels) are large; this should be solved in Plan 2.
A good starting point.
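A minimal sketch of such a merging script follows, written in Python rather than Shell/Perl/Ruby. The executables `./estep` and `./mstep`, their flags, and the `stats.*` / `model.*` file names are all hypothetical placeholders. A fixed iteration count stands in for a real convergence check, and the workers are launched as local processes where a cluster job launcher would go.

```python
import subprocess


def estep_cmd(shard, model, stats_out):
    # Hypothetical CLI: ./estep --docs <shard> --model <model> --out <stats>
    return ["./estep", "--docs", shard, "--model", model, "--out", stats_out]


def mstep_cmd(stats_files, model_out):
    # Hypothetical CLI: ./mstep --out <model> <stats files...>
    return ["./mstep", "--out", model_out, *stats_files]


def run_plan1(shards, n_iters, launch=subprocess.Popen, run=subprocess.run):
    """Alternate parallel E-step workers with one single-machine M-step."""
    model = "model.0"  # assumed to exist before the first iteration
    for it in range(n_iters):
        stats = [f"stats.{it}.{j}" for j in range(len(shards))]
        # E-step: start one worker per document shard, then wait for all of them.
        procs = [launch(estep_cmd(s, model, out)) for s, out in zip(shards, stats)]
        for p in procs:
            if p.wait() != 0:
                raise RuntimeError("E-step worker failed")
        # M-step: one process reads every stats file and writes the next model.
        model = f"model.{it + 1}"
        run(mstep_cmd(stats, model), check=True)
    return model
```

Passing `launch` and `run` as parameters keeps the loop testable without the real binaries.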

Plan 2 (E-step and M-step are separated. SVM in M-step is parallelized)

Given: many documents.
Repeat until convergence:
• E-step (single program): perform Gibbs sampling of the topic assignments z in parallel, and collect the sufficient statistics.
• M-step (single program): update α, β, η, μ, with only the SVM that estimates η and μ run in parallel.

Plan 2 in detail
• Prepare the E-step code and the M-step code separately, and chain them together with a Shell/Perl/Ruby script (same as Plan 1).
• Almost a copy of Plan 1, except for the SVM part of the M-step.
• The SVM is parallelized, so the estimation of η and μ should be fast and scalable. (Of course, we still need to figure out how to parallelize the SVM in FB's computing environment.)
A practical extension: the only open question is how to parallelize the SVM.
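How to parallelize the SVM is left open above. One possible direction, shown here purely as an illustration and not as the chosen method, is data-parallel subgradient descent on a linear SVM: each shard sums the hinge-loss subgradients of its own examples, and a master adds them up and updates the weights. The sketch is binary and primal only, so it yields a weight vector (playing the role of η) but no dual variables μ. The function names and the parameters `lam`, `lr`, and `n_steps` are assumptions, and threads stand in for the parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_subgradient(w, shard):
    """Sum of hinge-loss subgradients over one shard of (x, y) pairs, y in {-1, +1}."""
    g = [0.0] * len(w)
    for x, y in shard:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # only margin violators contribute
            for j, xj in enumerate(x):
                g[j] -= y * xj
    return g


def train_svm(shards, dim, lam=0.01, lr=0.1, n_steps=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss with full-batch subgradient steps."""
    n = sum(len(s) for s in shards)
    w = [0.0] * dim
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(n_steps):
            # Each worker handles one shard; list() waits for all of them.
            grads = list(pool.map(lambda s: shard_subgradient(w, s), shards))
            for j in range(dim):
                g_j = lam * w[j] + sum(g[j] for g in grads) / n
                w[j] -= lr * g_j
    return w
```

Each step exchanges only one vector of length `dim` per shard, so the communication cost does not grow with the number of documents.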

Plan 3 (Everything is integrated and parallelized within a single program)

Given: many documents.
Repeat until convergence, all within a single program:
• E-step: perform Gibbs sampling of the topic assignments z in parallel, and collect the sufficient statistics.
• M-step: update α, β, η, μ in parallel.

Plan 3 in detail
• The E-step and the M-step (including the SVM) are integrated into a single code base.
• In practice, the computational efficiency and the algorithmic behavior would be almost the same as in Plan 2, but the software would be more complicated and the implementation would take much longer.
Could be elegant from a research perspective, but it should be built as an extension of Plan 2 because the software will be complex.

To do for now
• Keisuke: prepare the core code for Gibbs-sampling-based LDA and the merging script.
• Jun: derive the Gibbs sampling equations for MedLDA.
