Learning with Hadoop – A case study on MapReduce based Data Mining

Evan Xiang, HKUST

Outline
• Hadoop Basics
• Case Study
  • Word Count
  • Pairwise Similarity
  • PageRank
  • K-Means Clustering
  • Matrix Factorization
  • Cluster Coefficient
• Resource Entries to ML labs
• Advanced Topics
• Q&A

Introduction to Hadoop
• Hadoop Map/Reduce is
  • a Java-based software framework for easily writing applications
  • which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware
  • in a reliable, fault-tolerant manner.

Hadoop Cluster Architecture
[Figure: a Client talks to the job submission node (JobTracker) and the HDFS master (NameNode); each slave node runs a TaskTracker and a DataNode.]
From Jimmy Lin’s slides

Hadoop HDFS

Hadoop Cluster Rack Awareness

Hadoop Development Cycle
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
[Figure: the steps cycle between "You" and the "Hadoop Cluster".]
From Jimmy Lin’s slides

Divide and Conquer
[Figure: "Work" is partitioned into w1, w2, w3; each piece goes to a "worker", which produces r1, r2, r3; the partial results are combined into the "Result".]
From Jimmy Lin’s slides

High-level MapReduce pipeline

Detailed Hadoop MapReduce data flow

Word Count with MapReduce
[Figure: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". Map emits per-document values for each term; Shuffle and Sort: aggregate values by keys; Reduce combines the values for each term.]
From Jimmy Lin’s slides
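The word count flow can be imitated in a few lines of plain Python. This is only a minimal single-process sketch of the map, shuffle, and reduce phases, not Hadoop code: the function names (`map_word_count`, `shuffle`, `reduce_word_count`) are hypothetical, the tokenizer is an assumption, and unlike the figure it does not drop words such as "in" and "the".

```python
from collections import defaultdict
import re

def map_word_count(doc_id, text):
    """Map: emit (word, 1) for every token in one document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1

def shuffle(pairs):
    """Shuffle and sort: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_word_count(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)

# Toy collection taken from the slide.
docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
mapped = [pair for doc_id, text in docs.items()
          for pair in map_word_count(doc_id, text)]
word_counts = dict(reduce_word_count(w, c) for w, c in shuffle(mapped).items())
```

In a real job the shuffle is done by the framework across machines; only the map and reduce functions are written by the user.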

Calculating document pairwise similarity
• Trivial solution
  • loads each vector O(N) times
  • loads each term O(df_t^2) times, where df_t is the document frequency of term t
• Goal: a scalable and efficient solution for large collections
From Jimmy Lin’s slides

Better Solution
• Each term contributes only if it appears in both documents of a pair
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
From Jimmy Lin’s slides

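Decomposition
• Each term contributes only if it appears in both documents of a pair
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
[Figure labels: map generates the per-term partial scores; reduce sums them.]
From Jimmy Lin’s slides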
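The similarity formula on these slides did not survive extraction. The bullets are consistent with an inner-product similarity over term weights, which, as an assumption about the original notation, can be written as:

$$
\mathrm{sim}(d_i, d_j) \;=\; \sum_{t} w_{t,d_i}\, w_{t,d_j} \;=\; \sum_{t \,\in\, d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
$$

Here $w_{t,d}$ is the weight of term $t$ in document $d$ (zero when the term is absent). A term that appears in $\mathrm{df}_t$ documents produces one product for each of the $\mathrm{df}_t(\mathrm{df}_t - 1)/2$ document pairs that share it, which is where the $O(\mathrm{df}_t^2)$ partial scores come from.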

Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by terms
(c) Reduce: combine into posting lists
From Jimmy Lin’s slides

Inverted Indexing with MapReduce
[Figure: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". Map emits (term, doc, tf) entries, e.g. one → (1, 1), two → (1, 1), fish → (1, 2), red → (2, 1), blue → (2, 1), fish → (2, 2), cat → (3, 1), hat → (3, 1). Shuffle and Sort: aggregate values by keys. Reduce yields the posting lists, e.g. fish → (1, 2), (2, 2).]
From Jimmy Lin’s slides
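A minimal in-memory sketch of the same indexing flow follows. The names `map_index` and `build_index` are hypothetical, the tokenizer is an assumption, and the shuffle is simulated with a dictionary rather than performed by Hadoop.

```python
from collections import Counter, defaultdict
import re

def map_index(doc_id, text):
    """Map: emit (term, (doc_id, tf)) for each distinct term in a document."""
    tf = Counter(re.findall(r"[a-z]+", text.lower()))
    for term, count in tf.items():
        yield term, (doc_id, count)

def build_index(docs):
    """Shuffle by term, then reduce each group into a sorted posting list."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_index(doc_id, text):
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in postings.items()}

# Toy collection taken from the slide.
docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
index = build_index(docs)
```

Each posting is a `(doc_id, term_frequency)` pair, so `index["fish"]` holds one entry for Doc 1 and one for Doc 2.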

Indexing (3-doc toy collection)
[Figure: three documents, "Clinton Obama Clinton", "Clinton Cheney", and "Clinton Barack Obama", are indexed into posting lists with term frequencies: Clinton (2, 1, 1), Cheney (1), Barack (1), Obama (1, 1).]
From Jimmy Lin’s slides

Pairwise Similarity
(a) Generate pairs
(b) Group pairs
(c) Sum pairs
[Figure: partial scores are generated from the posting lists of Clinton, Cheney, Barack, and Obama, grouped by document pair, and summed.]
How to deal with the long list?
From Jimmy Lin’s slides
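The three steps (generate pairs, group pairs, sum pairs) can be sketched over an inverted index as below. This is a simplified illustration, not the slide's implementation: the document IDs in `index` are an assumed reading of the garbled toy collection, raw term frequencies stand in for the term weights, and `map_pairs` and `pairwise_similarity` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(term, postings):
    """Generate pairs: one partial score per pair of documents sharing the term."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2

def pairwise_similarity(index):
    """Group the partial scores by document pair and sum them."""
    scores = defaultdict(int)
    for term, postings in index.items():
        for pair, partial in map_pairs(term, postings):
            scores[pair] += partial
    return dict(scores)

# Assumed posting lists: term -> [(doc_id, term_frequency), ...]
index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
}
similarity = pairwise_similarity(index)
```

A term with a very long posting list generates a quadratic number of pairs, which is the problem the slide's closing question points at.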

PageRank
• PageRank: an information propagation model
• Requires intensive access to the neighborhood list

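PageRank with MapReduce
[Figure: Map reads each node with its adjacency list (n1: n2, n4; n2: n3, n5; n3: n4; n4: n5; n5: n1, n2, n3) and sends messages to the listed neighbors; the messages are grouped by destination node n1 to n5; Reduce processes each group.]
How to maintain the graph structure?
From Jimmy Lin’s slides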
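One common answer to the question of maintaining the graph structure is to have the mapper pass each node's adjacency list through alongside the rank mass, so the reducer can reattach it. The sketch below simulates one such iteration in plain Python. It is an illustration under assumptions: the damping factor of 0.85, the tagged `("graph", ...)` / `("mass", ...)` records, and all function names are hypothetical, and dangling nodes are not handled because the example graph has none.

```python
from collections import defaultdict

def map_pagerank(node, rank, neighbors):
    """Map: pass the adjacency list through and send rank mass to each neighbor."""
    yield node, ("graph", neighbors)  # keeps the graph structure for the next round
    for neighbor in neighbors:
        yield neighbor, ("mass", rank / len(neighbors))

def reduce_pagerank(node, values, n_nodes, damping):
    """Reduce: recover the adjacency list and sum the incoming mass."""
    neighbors, mass = [], 0.0
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        else:
            mass += payload
    return node, (1.0 - damping) / n_nodes + damping * mass, neighbors

def pagerank_iteration(graph, ranks, damping=0.85):
    """Simulate one MapReduce round: map, group by key, reduce."""
    grouped = defaultdict(list)
    for node, neighbors in graph.items():
        for key, value in map_pagerank(node, ranks[node], neighbors):
            grouped[key].append(value)
    new_ranks = {}
    for node, values in grouped.items():
        _, rank, _ = reduce_pagerank(node, values, len(graph), damping)
        new_ranks[node] = rank
    return new_ranks

# Five-node graph as read from the slide's figure.
graph = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
ranks = {node: 1.0 / len(graph) for node in graph}
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
```

Each iteration is one full MapReduce job, so the adjacency lists are rewritten every round; that is the cost of keeping the graph structure available.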

K-Means Clustering

K-Means Clustering with MapReduce
[Figure: Mapper_i-1, Mapper_i, and Mapper_i+1 each hold centroids 1 to 4 and send their assigned samples to Reducer_i-1, Reducer_i, and Reducer_i+1.]
• Each Mapper loads a set of data samples and assigns each sample to its nearest centroid.
• Each Mapper needs to keep a copy of the centroids.
• How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering. [McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
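A single K-Means iteration in this style can be sketched as follows. This is a minimal single-process illustration: the function names are hypothetical, the initial centroids are hand-picked rather than produced by Canopy Clustering, and the reducer computing the mean of its assigned samples is the standard K-Means update, assumed here because the slide text does not spell it out.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def map_assign(points, centroids):
    """Map: each mapper holds a copy of the centroids and labels its own samples."""
    for point in points:
        yield nearest(point, centroids), point

def reduce_centroid(points):
    """Reduce: the new centroid is the mean of the samples assigned to it."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def kmeans_iteration(splits, centroids):
    """Simulate one MapReduce round over several input splits."""
    grouped = defaultdict(list)
    for split in splits:  # one split per mapper
        for cid, point in map_assign(split, centroids):
            grouped[cid].append(point)
    # A centroid with no assigned samples is kept unchanged.
    return [reduce_centroid(grouped[c]) if grouped[c] else centroids[c]
            for c in range(len(centroids))]

splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]  # hand-picked for illustration
new_centroids = kmeans_iteration(splits, centroids)
```

The driver would repeat `kmeans_iteration` until the centroids stop moving, redistributing the updated centroids to every mapper before each round.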

Matrix Factorization for Link Prediction
• In this task, we observe a sparse matrix $X \in \mathbb{R}^{m \times n}$ with entries $x_{ij}$. Let $R = \{(i, j, r) : r = x_{ij},\ x_{ij} \neq 0\}$ denote the set of observed links in the system. In order to predict the unobserved links in $X$, we model the users and the items by a user factor matrix $U \in \mathbb{R}^{k \times m}$ and an item factor matrix $V \in \mathbb{R}^{k \times n}$. The goal is to approximate the link matrix by the product of the factor matrices, $X \approx U^\top V$, which can be learnt by minimizing the approximation error over the observed links in $R$.
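The objective on this slide did not survive extraction. A common form, assuming squared error on the observed links and an L2 regularizer with weight $\lambda$ (the regularizer is an assumption, not confirmed by the slide text), is:

$$
\min_{U, V} \; \sum_{(i, j, r) \in R} \left( r - u_i^\top v_j \right)^2 \;+\; \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
$$

where $u_i$ is the $i$-th column of $U$ and $v_j$ is the $j$-th column of $V$.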

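Solving Matrix Factorization via Alternating Least Squares
• Given X and V, update U: each user factor $u_i$ is obtained by solving a $k \times k$ linear system built from a matrix $A$ and a vector $b$.
• Similarly, given X and U, we can update V in alternation.
[Figure: the $m \times n$ matrix X, the $k \times n$ matrix V, and, for user factor $u_i$, the $k \times k$ matrix A and the length-$k$ vector b.]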
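The update formula is missing from the extracted slide. Under the usual regularized least-squares formulation (the weight $\lambda$ is an assumption), the update for one user factor with $V$ held fixed would read:

$$
A = \sum_{j \,:\, x_{ij} \neq 0} v_j v_j^\top + \lambda I_k, \qquad
b = \sum_{j \,:\, x_{ij} \neq 0} x_{ij}\, v_j, \qquad
u_i = A^{-1} b
$$

Both sums run only over the items $j$ that user $i$ is linked to, so $A$ is $k \times k$ and $b$ has length $k$, matching the shapes in the figure. The update for $v_j$ is symmetric, with the roles of users and items exchanged.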

MapReduce for ALS
Stage 1
• Mapper_i: group the rating data in X by item j; group the features in V by item j
• Reducer_i: receives the ratings for item j and the features for item j; aligns ratings and features for item j, and makes a copy of V_j for each observed x_ij
Stage 2
• Mapper_i: group the rating data in X by user i (each record pairs a user i with a copy of V_j)
• Reducer_i: standard ALS: calculate A and b, and update U_i
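The two stages can be imitated in memory as below. This is a rough sketch under stated assumptions, not the author's implementation: the function names are hypothetical, the regularization weight `lam` is an assumption, grouping is done with dictionaries instead of a real shuffle, and a small Gauss-Jordan routine stands in for a linear algebra library.

```python
from collections import defaultdict

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def stage1_join(ratings, V):
    """Stage 1: group ratings by item j and attach a copy of V_j to each observed x_ij."""
    by_item = defaultdict(list)
    for i, j, x in ratings:
        by_item[j].append((i, x))
    for j, observed in by_item.items():
        for i, x in observed:
            yield i, (x, V[j])

def stage2_update(joined, k, lam):
    """Stage 2: group by user i, accumulate A and b, and solve for u_i."""
    by_user = defaultdict(list)
    for i, record in joined:
        by_user[i].append(record)
    U = {}
    for i, records in by_user.items():
        A = [[lam if r == c else 0.0 for c in range(k)] for r in range(k)]
        b = [0.0] * k
        for x, v in records:
            for r in range(k):
                b[r] += x * v[r]
                for c in range(k):
                    A[r][c] += v[r] * v[c]
        U[i] = solve(A, b)
    return U

# Toy data: ratings are (user i, item j, x_ij); V maps item j to its k=2 features.
V = {0: [1.0, 0.0], 1: [0.0, 1.0]}
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0)]
U = stage2_update(stage1_join(ratings, V), k=2, lam=1.0)
```

Copying V_j once per observed rating trades extra intermediate data for a Stage 2 reducer that needs no lookups into V.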

Cluster Coefficient
• In graph mining, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and is used to determine whether a graph is a small-world network. [D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]
• How to maintain the Tier-2 neighbors?

Cluster Coefficient with MapReduce
[Figure: Stage 1 and Stage 2, each with a Mapper_i and a Reducer_i; the pipeline ends with "Calculate the cluster coefficient".]
A BFS-based method needs three stages, but we actually need only two!
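The slide does not say what each stage does, so the following is only one plausible two-stage arrangement, labeled as an assumption: Stage 1 delivers every node's adjacency list to its neighbors (so each node knows its Tier-2 neighbors), and Stage 2 counts the links among each node's neighbors. The function names are hypothetical and the graph is treated as undirected.

```python
from collections import defaultdict

def stage1_collect(graph):
    """Stage 1: every node sends its adjacency list to each of its neighbors."""
    inbox = defaultdict(dict)
    for node, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor][node] = set(neighbors)
    return inbox  # grouped by receiving node

def stage2_coefficient(graph, inbox):
    """Stage 2: count links among each node's neighbors and normalize."""
    result = {}
    for node, neighbors in graph.items():
        k = len(neighbors)
        if k < 2:
            result[node] = 0.0  # fewer than two neighbors: no possible links
            continue
        own = set(neighbors)
        # Each link between two neighbors is seen from both ends, hence the // 2.
        links = sum(len(own & adj) for adj in inbox[node].values()) // 2
        result[node] = 2.0 * links / (k * (k - 1))
    return result

# Small undirected graph: a triangle a-b-c plus a pendant node d attached to a.
graph = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
coefficients = stage2_coefficient(graph, stage1_collect(graph))
```

For node `a`, only one of the three possible links among its neighbors exists (b-c), so its local coefficient is 1/3.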

Resource Entries to ML labs
• Mahout
  • Apache’s scalable machine learning libraries
• Jimmy Lin’s Lab
  • iSchool at the University of Maryland
• Jimeng Sun & Yan Rong’s Collections
  • IBM TJ Watson Research Center
• Edward Chang & Yi Wang
  • Google Beijing

Advanced Topics in Machine Learning with MapReduce • Probabilistic Graphical models • Gradient based optimization methods • Graph Mining • Others…

Some Advanced Tips
• Design your algorithm in a divide-and-conquer manner
• Make your functional units loosely dependent
• Carefully manage your memory and disk storage
• Discussions…

Q&A
• Why not MPI?
  • Hadoop is cheap in everything…D.P.T.H…
• What are the advantages of Hadoop?
  • Scalability!
• How do you guarantee the model equivalence?
  • Guarantee equivalent/comparable function logic
• How can you beat a “large memory” solution?
  • Clever use of sequential disk access
