Learning with Hadoop – A case study on MapReduce based Data Mining. Evan Xiang, HKUST. Outline. Hadoop Basics Case Study Word Count Pairwise Similarity PageRank K-Means Clustering Matrix Factorization Cluster Coefficient Resource Entries to ML labs Advanced Topics Q&A.
Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST
Outline • Hadoop Basics • Case Study • Word Count • Pairwise Similarity • PageRank • K-Means Clustering • Matrix Factorization • Cluster Coefficient • Resource Entries to ML labs • Advanced Topics • Q&A
Introduction to Hadoop • Hadoop Map/Reduce is • a Java-based software framework for easily writing applications • which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware • in a reliable, fault-tolerant manner.
Hadoop Cluster Architecture • Client • Job submission node: JobTracker • HDFS master: NameNode • Slave nodes (three shown): each runs a TaskTracker and a DataNode From Jimmy Lin’s slides
Hadoop Development Cycle (between you and the Hadoop cluster) • 1. Scp data to cluster • 2. Move data into HDFS • 3. Develop code locally • 4. Submit MapReduce job • 4a. Go back to Step 3 • 5. Move data out of HDFS • 6. Scp data from cluster From Jimmy Lin’s slides
Divide and Conquer • Partition the “work” into w1, w2, w3 • Each “worker” processes one partition and produces a partial result r1, r2, r3 • Combine the partial results into the “result” From Jimmy Lin’s slides
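The partition / worker / combine pattern can be sketched in a few lines of Python. This is only an illustration on a single machine, not Hadoop code: the task (a sum of squares), the function names and the use of a thread pool are all assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n_parts):
    """Split the work into n_parts slices (w1, w2, w3, ...)."""
    return [work[i::n_parts] for i in range(n_parts)]

def worker(chunk):
    """Each worker computes a partial result (r_i) from its own slice only."""
    return sum(x * x for x in chunk)

def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)

def sum_of_squares(work, n_parts=3):
    chunks = partition(work, n_parts)
    # Hypothetical stand-in for cluster workers: a local thread pool.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)

result = sum_of_squares(list(range(10)))
```

The workers never talk to each other; all coordination happens in `partition` and `combine`, which is the property MapReduce relies on.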
Word Count with MapReduce • Input: Doc 1 “one fish, two fish”; Doc 2 “red fish, blue fish”; Doc 3 “cat in the hat” • Map: emit a count per word and document: Doc 1: one 1, two 1, fish 2; Doc 2: red 1, blue 1, fish 2; Doc 3: cat 1, hat 1 • Shuffle and Sort: aggregate values by keys • Reduce: sum the counts per word: blue 1, cat 1, fish 4, hat 1, one 1, red 1, two 1 From Jimmy Lin’s slides
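The three phases of word count can be simulated locally in Python. This is a minimal sketch, not Hadoop API code: `map_phase`, `shuffle` and `reduce_phase` are hypothetical names, and unlike the slide the sketch keeps every token, including “in” and “the”.

```python
from collections import defaultdict
import re

def map_phase(doc_id, text):
    """Mapper: emit (word, 1) for every token in one document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1

def shuffle(pairs):
    """Shuffle and sort: aggregate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reducer: sum all counts seen for one word."""
    return word, sum(counts)

def word_count(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(w, c) for w, c in sorted(shuffle(pairs).items()))

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
counts = word_count(docs)
```

Here `counts["fish"]` is 4 because the word occurs twice in each of the first two documents.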
Calculating document pairwise similarity • Trivial Solution • load each vector $O(N)$ times • load each term $O(df_t^2)$ times • Goal: a scalable and efficient solution for large collections From Jimmy Lin’s slides
Better Solution • Each term contributes to a pair's score only if it appears in both documents of the pair • Load weights for each term once • Each term contributes $O(df_t^2)$ partial scores From Jimmy Lin’s slides
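Decomposition • Each term contributes to a pair's score only if it appears in both documents of the pair • Load weights for each term once • Each term contributes $O(df_t^2)$ partial scores • Map: generate the partial scores of each term • Reduce: sum the partial scores of each document pair From Jimmy Lin’s slides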
Standard Indexing • (a) Map: tokenize each doc • (b) Shuffle: group values by terms • (c) Reduce: combine the values of each term into a posting list From Jimmy Lin’s slides
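Inverted Indexing with MapReduce • Input: Doc 1 “one fish, two fish”; Doc 2 “red fish, blue fish”; Doc 3 “cat in the hat” • Map: emit a (docid, tf) posting per term: Doc 1: one (1, 1), two (1, 1), fish (1, 2); Doc 2: red (2, 1), blue (2, 1), fish (2, 2); Doc 3: cat (3, 1), hat (3, 1) • Shuffle and Sort: aggregate values by keys • Reduce: blue [(2, 1)]; cat [(3, 1)]; fish [(1, 2), (2, 2)]; hat [(3, 1)]; one [(1, 1)]; red [(2, 1)]; two [(1, 1)] From Jimmy Lin’s slides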
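Building the inverted index differs from word count only in what the mapper emits and what the reducer keeps. The sketch below runs locally and uses hypothetical function names; it also indexes “in” and “the”, which the slide leaves out.

```python
from collections import Counter, defaultdict
import re

def map_phase(doc_id, text):
    """Mapper: emit (term, (doc_id, tf)) for each distinct term of one document."""
    tf = Counter(re.findall(r"[a-z]+", text.lower()))
    for term, freq in tf.items():
        yield term, (doc_id, freq)

def build_index(docs):
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            postings[term].append(posting)   # shuffle: group postings by term
    # Reducer: combine the postings of each term into a list sorted by doc id.
    return {term: sorted(plist) for term, plist in postings.items()}

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
index = build_index(docs)
```

Unlike word count, the reducer does not sum anything: `index["fish"]` keeps one posting per document, `[(1, 2), (2, 2)]`.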
Indexing (3-doc toy collection) • Documents: “Clinton Obama Clinton”, “Clinton Cheney”, “Clinton Barack Obama” • Index (term frequencies in each posting list): Clinton: 2, 1, 1 • Cheney: 1 • Barack: 1 • Obama: 1, 1 From Jimmy Lin’s slides
Pairwise Similarity • Input: the posting lists of Clinton, Cheney, Barack and Obama • (a) Generate pairs • (b) Group pairs • (c) Sum pairs • How to deal with the long list? From Jimmy Lin’s slides
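The generate / group / sum steps can be sketched over the toy index as follows. Two things here are assumptions rather than content from the slides: the documents are labelled 1 to 3 in the order they are listed, and raw term frequencies are used as the term weights.

```python
from collections import defaultdict
from itertools import combinations

# Postings as (doc_id, weight); doc ids 1..3 and tf weights are assumed labels.
index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
    "obama": [(1, 1), (3, 1)],
}

def map_pairs(term, postings):
    """(a) Generate pairs: one partial score per pair of docs sharing the term."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in map_pairs(term, postings):
            grouped[pair].append(partial)                         # (b) group pairs
    return {pair: sum(parts) for pair, parts in grouped.items()}  # (c) sum pairs

sims = pairwise_similarity(index)
```

A posting list of length $df_t$ produces $df_t(df_t-1)/2$ pairs, so a very frequent term such as “clinton” dominates the cost, which is the long-list problem raised above.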
PageRank • PageRank – an information propagation model • Requires intensive access to each node's neighbor list
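PageRank with MapReduce • Map: each node sends its PageRank mass along its out-links: n1 → n2, n4; n2 → n3, n5; n3 → n4; n4 → n5; n5 → n1, n2, n3 • Shuffle: group the contributions by destination node (n1 to n5) • Reduce: each node sums the contributions it receives • How to maintain the graph structure? From Jimmy Lin’s slides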
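One common answer to the graph-structure question is to have the mapper emit each node's adjacency list alongside the PageRank mass, so the reducer can rebuild the node for the next iteration. The sketch below simulates this locally. The five-node graph is read off the slide's figure, while the damping factor of 0.85, the message tags and the function names are assumptions; the sketch also assumes every node has at least one out-link.

```python
from collections import defaultdict

graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

def map_phase(node, rank, out_links):
    yield node, ("links", out_links)       # pass the graph structure along
    share = rank / len(out_links)          # assumes no dangling nodes
    for dest in out_links:
        yield dest, ("mass", share)

def reduce_phase(values, n_nodes, damping):
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload            # recover this node's adjacency list
        else:
            total += payload               # sum the incoming PageRank mass
    return (1 - damping) / n_nodes + damping * total, out_links

def pagerank_iteration(graph, ranks, damping=0.85):
    grouped = defaultdict(list)            # shuffle: group by destination node
    for node, out_links in graph.items():
        for key, value in map_phase(node, ranks[node], out_links):
            grouped[key].append(value)
    new_graph, new_ranks = {}, {}
    for node, values in grouped.items():
        new_ranks[node], new_graph[node] = reduce_phase(values, len(graph), damping)
    return new_graph, new_ranks

ranks = {node: 1.0 / len(graph) for node in graph}
for _ in range(20):
    graph, ranks = pagerank_iteration(graph, ranks)
```

Each iteration is one MapReduce job; the output of the reducers has the same shape as the input of the mappers, so the jobs can be chained.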
K-Means Clustering with MapReduce • Mappers (Mapper_i-1, Mapper_i, Mapper_i+1): each Mapper loads a set of data samples and assigns each sample to its nearest centroid • Each Mapper needs to keep a copy of the centroids (1 to 4 in the figure) • Reducers (Reducer_i-1, Reducer_i, Reducer_i+1): process the samples grouped by centroid • How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering. [McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
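A single K-Means iteration in this style can be sketched as below. The slide does not spell out the reducer, so averaging the samples of each centroid is assumed here as the standard K-Means update; the initial centroids are hard-coded rather than produced by Canopy Clustering, and all names are hypothetical.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def map_phase(samples, centroids):
    """Mapper: holds a full copy of the centroids, emits (centroid_id, sample)."""
    for point in samples:
        yield nearest(point, centroids), point

def reduce_phase(points):
    """Reducer (assumed): the new centroid is the mean of its assigned samples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_iteration(splits, centroids):
    grouped = defaultdict(list)
    for samples in splits:                        # one split per mapper
        for cid, point in map_phase(samples, centroids):
            grouped[cid].append(point)
    # A centroid that received no samples keeps its old position.
    return [reduce_phase(grouped[c]) if c in grouped else centroids[c]
            for c in range(len(centroids))]

splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(5):
    centroids = kmeans_iteration(splits, centroids)
```

Only the centroids are broadcast to every mapper; the samples themselves are read once per iteration from their own split.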
Matrix Factorization for Link Prediction • In this task, we observe a sparse matrix $X \in \mathbb{R}^{m \times n}$ with entries $x_{ij}$. Let $R = \{(i, j, r) : r = x_{ij},\ x_{ij} \neq 0\}$ denote the set of observed links in the system. In order to predict the unobserved links in $X$, we model the users and the items by a user factor matrix $U \in \mathbb{R}^{k \times m}$ and an item factor matrix $V \in \mathbb{R}^{k \times n}$. The goal is to approximate the link matrix $X$ by the product of the factor matrices, $X \approx U^\top V$, which can be learnt by minimizing the approximation error over the observed links in $R$.
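A common choice for this objective, stated here as an assumption rather than as the slide's exact formula, is the regularized squared error over the observed links, where $u_i$ and $v_j$ are the $i$-th column of $U$ and the $j$-th column of $V$, and $\lambda \ge 0$ is a hypothetical regularization weight:

```latex
\min_{U, V} \; \sum_{(i, j, r) \in R} \left( r - u_i^\top v_j \right)^2
  + \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
```

Only the observed entries in $R$ enter the sum, which is what keeps the computation proportional to the number of links rather than to $m \times n$.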
Solving Matrix Factorization via Alternating Least Squares • Given X and V, update U • Similarly, given X and U, we can alternately update V • Figure: row i of X (length n) and V (k × n) give a k × k matrix A and a length-k vector b, from which ui is solved
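With $V$ fixed, each user factor has a closed-form least-squares solution. The form below assumes a squared-error objective with an L2 regularization weight $\lambda$ (an assumption, not taken from the slide); $A$ and $b$ are the $k \times k$ matrix and length-$k$ vector of the figure, and $v_j$ is the $j$-th column of $V$:

```latex
A_i = \sum_{j : (i, j, r) \in R} v_j v_j^\top + \lambda I_k, \qquad
b_i = \sum_{j : (i, j, r) \in R} r \, v_j, \qquad
u_i = A_i^{-1} b_i
```

The update for $v_j$ with $U$ fixed is symmetric: swap the roles of users and items and sum over the users $i$ that have an observed link to item $j$.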
MapReduce for ALS • Stage 1, Mapper_i: group the rating data in X by item j; group the features in V by item j • Stage 1, Reducer_i: receives the ratings and the features for item j; aligns them, and makes a copy of Vj for each observed xij • Stage 2, Mapper_i: group the rating data in X (now carrying Vj) by user i • Stage 2, Reducer_i: standard ALS: calculate A and b, and update Ui
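The two stages can be simulated locally as below. This is a sketch under several assumptions: the regularization weight `lam`, the toy ratings and item features, the dict-based representation of V and all function names are hypothetical, and the small linear system is solved with a hand-written Gaussian elimination to keep the example dependency-free.

```python
from collections import defaultdict

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small k only)."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][c] * x[c] for c in range(r + 1, k))) / M[r][r]
    return x

def stage1(ratings, V):
    """Key by item j, then attach a copy of V_j to every observed x_ij."""
    by_item = defaultdict(list)
    for i, j, r in ratings:               # mapper: key = item j
        by_item[j].append((i, r))
    for j, rows in by_item.items():       # reducer: align ratings with V_j
        for i, r in rows:
            yield i, (r, V[j])

def stage2(aligned, k, lam):
    """Key by user i, accumulate A and b, and solve for u_i."""
    by_user = defaultdict(list)
    for i, value in aligned:              # mapper: key = user i
        by_user[i].append(value)
    U = {}
    for i, rows in by_user.items():       # reducer: standard ALS update
        A = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for r, v in rows:
            for p in range(k):
                b[p] += r * v[p]
                for q in range(k):
                    A[p][q] += v[p] * v[q]
        U[i] = solve(A, b)
    return U

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0)]   # (user, item, rating)
V = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}                # item -> k features
U = stage2(stage1(ratings, V), k=2, lam=0.1)
```

Copying Vj once per observed rating costs extra shuffle volume, but it means no reducer in Stage 2 has to hold the whole of V in memory.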
Cluster Coefficient • In graph mining, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph); it is used to determine whether a graph is a small-world network. [D. J. Watts and S. H. Strogatz: "Collective dynamics of 'small-world' networks", Nature 393 (6684): 440–442, June 1998] • How to maintain the Tier-2 neighbors?
Cluster Coefficient with MapReduce • Stage 1: Mapper_i, Reducer_i • Stage 2: Mapper_i, Reducer_i: calculate the cluster coefficient • A BFS-based method needs three stages, but actually we only need two!
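The slide does not say what each stage does, so the following is only one possible two-stage decomposition, assumed for illustration. Stage 1 sends every node's neighbor list to each of its neighbors, which gives each node its tier-2 information; Stage 2 counts the links among a node's neighbors and computes the local coefficient $2e / (d(d-1))$ for an undirected graph. The graph, the message tags and the function names are hypothetical.

```python
from collections import defaultdict

def stage1(adjacency):
    """Send each node's neighbor list to itself and to all of its neighbors."""
    inbox = defaultdict(list)
    for u, neighbors in adjacency.items():
        inbox[u].append(("self", set(neighbors)))
        for v in neighbors:
            inbox[v].append(("neighbor", set(neighbors)))
    return inbox                      # messages grouped by destination node

def stage2(inbox):
    """Count the links among each node's neighbors and compute the coefficient."""
    coefficients = {}
    for node, messages in inbox.items():
        own = next(n for kind, n in messages if kind == "self")
        degree = len(own)
        if degree < 2:
            coefficients[node] = 0.0  # convention: undefined cases count as 0
            continue
        # Every link between two neighbors is seen from both ends, hence // 2.
        links = sum(len(n & own) for kind, n in messages if kind == "neighbor") // 2
        coefficients[node] = 2.0 * links / (degree * (degree - 1))
    return coefficients

# Undirected toy graph: triangle a-b-c plus a pendant node d attached to a.
adjacency = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
cc = stage2(stage1(adjacency))
```

Node a has three neighbors with one link among them (b to c), so its coefficient is 1/3, while b and c each sit in a complete neighborhood and get 1.0.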
Resource Entries to ML labs • Mahout: Apache’s scalable machine learning libraries • Jimmy Lin’s Lab: iSchool at the University of Maryland • Jimeng Sun & Yan Rong’s Collections: IBM TJ Watson Research Center • Edward Chang & Yi Wang: Google Beijing
Advanced Topics in Machine Learning with MapReduce • Probabilistic Graphical models • Gradient based optimization methods • Graph Mining • Others…
Some Advanced Tips • Design your algorithm in a divide-and-conquer manner • Make your functional units loosely dependent • Carefully manage your memory and disk storage • Discussions…
Q&A • Why not MPI? • Hadoop is cheap in everything…D.P.T.H… • What are the advantages of Hadoop? • Scalability! • How do you guarantee the model equivalence? • Guarantee equivalent/comparable functional logic • How can you beat a “large memory” solution? • Clever use of sequential disk access