Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce • Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang • Internet Services Research Center (ISRC), Microsoft Research, Redmond

Internet Services Research Center (ISRC) • Advancing the state of the art in online services • Dedicated to accelerating innovations in search and ad technologies • Representing a new model for moving technologies quickly from research projects to improved products and services

Dyadic Data on the Web • The Web abounds with dyadic data • Web search: term by document, query by clicked URL, web linkage, … • Advertising: query by ad, bid term by ad, user by ad, … • Social media: tag by image, user by community, friendship graph, … • Common characteristics • Good source for discovering latent relationships • High dimensionality, sparse, nonnegative, dynamic

Nonnegative Matrix Factorization (NMF) • Effective tool to uncover latent relationships in nonnegative matrices, with many applications [Berry et al., 2007; Sra & Dhillon, 2006] • Interpretable dimensionality reduction [Lee & Seung, 1999] • Document clustering [Shahnaz et al., 2006; Xu et al., 2006] • Challenge: Can we scale NMF to million-by-million matrices?

NMF Algorithm [Lee & Seung, 2000]
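The formulas on this slide did not survive extraction. As a hedged reconstruction, the sketch below uses the multiplicative update rules commonly attributed to Lee & Seung (2000) for the squared-error objective, A ≈ WH with H ← H ∘ (WᵀA) / (WᵀWH) and W ← W ∘ (AHᵀ) / (WHHᵀ). The function names, the random initialization, and the small `eps` guard against division by zero are assumptions made for illustration; the code is plain single-machine Python and is not the distributed algorithm.

```python
import random

def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    Yt = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf(A, k, iterations=50, seed=0, eps=1e-9):
    """Factor a nonnegative m-by-n matrix A into W (m-by-k) and H (k-by-n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive random start (an assumption; other initializations exist).
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H .* (W^T A) ./ (W^T W H)
        Wt = transpose(W)
        num = matmul(Wt, A)
        den = matmul(matmul(Wt, W), H)
        H = [[h * x / (y + eps) for h, x, y in zip(hr, xr, yr)]
             for hr, xr, yr in zip(H, num, den)]
        # W <- W .* (A H^T) ./ (W H H^T)
        Ht = transpose(H)
        num = matmul(A, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * x / (y + eps) for w, x, y in zip(wr, xr, yr)]
             for wr, xr, yr in zip(W, num, den)]
    return W, H
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, W and H stay nonnegative without any explicit projection step.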

Parallel NMF [Robila & Maciak, 2006] • Parallelism on multi-core machines • Partition along the long dimension for parallelism • Assuming all matrices can be held in shared memory

Distributed NMF • Data Partition: A, W and H across machines
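To illustrate why partitioning A, W and H makes the update distributable, the sketch below simulates one update of H in plain Python. It assumes the standard multiplicative rule H ← H ∘ (WᵀA) / (WᵀWH), with A stored as (i, j, value) triples, W stored by rows and H stored by columns. Under those assumptions, WᵀA is a sum over nonzero cells grouped by column, WᵀW is a sum of small k-by-k outer products over rows, and each column of H can then be updated independently. The function names and the way the work is split are hypothetical and are not claimed to match the authors' actual MapReduce jobs.

```python
from collections import defaultdict

def wt_a_columns(nonzeros, W):
    """X = W^T A by columns: map each nonzero to (j, A_ij * w_i), then sum per j."""
    groups = defaultdict(list)
    for i, j, a_ij in nonzeros:                      # "map" over cells of A
        groups[j].append([a_ij * w for w in W[i]])   # "shuffle" by column j
    # "reduce": add up the k-vectors that share a column index
    return {j: [sum(v) for v in zip(*vecs)] for j, vecs in groups.items()}

def gram(W):
    """C = W^T W as a sum of per-row k-by-k outer products."""
    k = len(W[0])
    C = [[0.0] * k for _ in range(k)]
    for w_i in W:                                    # rows may live on any machine
        for a in range(k):
            for b in range(k):
                C[a][b] += w_i[a] * w_i[b]
    return C

def update_h_columns(nonzeros, W, H_cols, eps=1e-9):
    """H_cols maps column j to its k-vector h_j; returns the updated columns."""
    X = wt_a_columns(nonzeros, W)
    C = gram(W)
    k = len(C)
    new_cols = {}
    for j, h_j in H_cols.items():                    # independent per column
        x_j = X.get(j, [0.0] * k)                    # empty column of A gives zeros
        y_j = [sum(C[a][b] * h_j[b] for b in range(k)) for a in range(k)]
        new_cols[j] = [h * x / (y + eps) for h, x, y in zip(h_j, x_j, y_j)]
    return new_cols
```

Only the k-by-k matrix C has to be shared with every worker in this sketch, and k is small compared with m and n. The update of W is symmetric, with the roles of rows and columns swapped.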

Computing DNMF: The Big Picture

[Diagram: full DNMF pipeline with stages Map-I to Map-V and Reduce-I, Reduce-II, Reduce-III, Reduce-V]

[Diagram: Map-I, Map-II, Reduce-I, Reduce-II]

[Diagram: Map-III, Map-IV, Reduce-III]

[Diagram: Map-V, Reduce-V]

Experimental Evaluation • Synthesized data on a sandbox cluster • No interference from other jobs • Performance with various parameters • Real-world data on a commercial cluster • Real-world scalability

Synthesized Data on Sandbox Cluster • A Hadoop cluster with 8 workers in total • Worker: Pentium-IV CPU, 1 or 2 cores, 1 to 2 GB memory, 150 GB hard drive • V: Number of workers in cluster • Matrix simulator • Generate m-by-n matrix with sparsity δ • k: factorization dimensionality • Defaults:
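The matrix simulator itself is not shown on the slide. A minimal sketch of one possible generator follows, assuming that δ denotes the fraction of nonzero cells (so the matrix holds about m×n×δ nonzeros) and that values are drawn uniformly from (0, 1]. The function name, the value distribution, and the triple format are all assumptions.

```python
import random

def simulate_matrix(m, n, delta, seed=0):
    """Return a sparse nonnegative m-by-n matrix as (row, col, value) triples.

    delta is taken to be the fraction of nonzero cells (an assumption).
    """
    rng = random.Random(seed)
    nnz = round(m * n * delta)
    # Pick distinct cell positions without materializing the dense matrix.
    cells = rng.sample(range(m * n), nnz)
    # 1.0 - random() lies in (0, 1], so every stored value is strictly positive.
    return [(c // n, c % n, 1.0 - rng.random()) for c in cells]
```

For example, `simulate_matrix(100, 50, 0.01)` yields 50 triples, each with a row index below 100 and a column index below 50.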

Computation Breakdown • dominates the computation • is lightweight • The sparser, the faster

Performance w.r.t. Parameters • Linear in m×n×δ • Linear in factorization dimension k • Sub-ideal speedup w.r.t. cluster size V

Scalability on Real-world Data • User-by-Website matrix • Browsed URLs of opt-in users, represented by UID • URLs trimmed to site level • http://www.cnn.com/breakingnews --> www.cnn.com • Experiments on Microsoft SCOPE • SCOPE: Structured Computations Optimized for Parallel Execution [Chaiken et al., VLDB’08]
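The trimming step can be sketched with the Python standard library. The example below reduces each URL to its host name and counts visits per (UID, site) pair to form sparse matrix cells. Using visit counts as cell values and the function names `site_of` and `user_by_site` are assumptions; the slide does not say how cells were weighted.

```python
from collections import Counter
from urllib.parse import urlsplit

def site_of(url):
    """Trim a URL to site level, e.g. the host name without path or port."""
    return urlsplit(url).hostname

def user_by_site(visits):
    """Build sparse user-by-website cells from (uid, url) pairs.

    Cell values are visit counts here, which is an illustrative assumption.
    """
    counts = Counter()
    for uid, url in visits:
        site = site_of(url)
        if site:                     # skip URLs without a parsable host
            counts[(uid, site)] += 1
    return counts
```

Applied to the slide's example, `site_of("http://www.cnn.com/breakingnews")` returns `"www.cnn.com"`.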

Executions w.r.t. Iterations • Observations • Longer total elapsed time • Shorter time per iteration • Reason • Overlapped computation across iterations • [Plot: normalized elapsed time vs. iterations]

Scalability w.r.t. Matrix Size • 3 hours per iteration; 20 iterations take around 20 × 3 × 0.72 ≈ 43 hours • Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values

Conclusion • NMF is an effective tool to uncover latent structures in dyadic data, which is abundant on the Web • NMF is amenable to MapReduce • Distributed NMF solves the scalability challenge • Applications down the road

Q&A Thank You!
