1 / 49

MapReduce for Machine Learning on Multicore

MapReduce for Machine Learning on Multicore. Cheng-Tao Chu and Sang Kyun Kim et al. Stanford University, Stanford CA Presented by Inna Rytsareva. Outline. What is MapReduce? Problem Description and Formalization Statistical Query Model and Summation Form

gwenda
Download Presentation

MapReduce for Machine Learning on Multicore

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MapReduce for Machine Learning on Multicore Cheng-Tao Chu and Sang Kyun Kimet al. Stanford University, Stanford CA Presented by Inna Rytsareva

  2. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  3. Motivation • Problem: lots of data • Example: 20+ billion web pages x 20KB = 400+ terabytes • One computer can read 30-35 MB/sec from disk • ~four months to read the web • ~1,000 hard drives just to store the web • Even more to do something with the data

  4. Motivation • Solution: spread the work over many machines • Same problem with 1000 machines, < 3 hours • programming work • communication and coordination • recovering from machine failure • status reporting • debugging • optimization • locality • repeat for every problem you want to solve

  5. Cluster Computing • Many racks of computers, thousands of machines per cluster • Limited bisection bandwidth between racks http://upload.wikimedia.org/wikipedia/commons/d/d3/IBM_Blue_Gene_P_supercomputer.jpg

  6. MapReduce • A simple programming model that applies to many large-scale computing problems • Hide messy details in MapReduce runtime library: • automatic parallelization • load balancing • network and disk transfer optimization • handling of machine failures • robustness

  7. Programming model • Input & Output: each a set of key/value pairs • Programmer specifies two functions: map() reduce() • map (in_key, in_value) -> list(out_key, intermediate_value) • Processes input key/value pair • Produces set of intermediate pairs • reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one)

  8. Example • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good.

  9. Map Output • Worker 1: • (the 1), (weather 1), (is 1), (good 1). • Worker 2: • (today 1), (is 1), (good 1). • Worker 3: • (good 1), (weather 1), (is 1), (good 1).

  10. Reduce Input • Worker 1: • (the 1) • Worker 2: • (is 1), (is 1), (is 1) • Worker 3: • (weather 1), (weather 1) • Worker 4: • (today 1) • Worker 5: • (good 1), (good 1), (good 1), (good 1)

  11. Reduce Output • Worker 1: • (the 1) • Worker 2: • (is 3) • Worker 3: • (weather 2) • Worker 4: • (today 1) • Worker 5: • (good 4)

  12. http://4.bp.blogspot.com/_j6mB7TMmJJY/STAYW9gC-NI/AAAAAAAAAGY/lLKo7sBp5i8/s1600-h/P1.pnghttp://4.bp.blogspot.com/_j6mB7TMmJJY/STAYW9gC-NI/AAAAAAAAAGY/lLKo7sBp5i8/s1600-h/P1.png

  13. Fault tolerance • On worker failure: • Detect failure via periodic heartbeats (ping) • Re-execute completed and in-progress map tasks • Re-execute in progress reduce tasks • Task completion committed through master • Master failure: • Could handle, but don't yet (master failure unlikely)

  14. MapReduce Transparencies • Parallelization • Fault-tolerance • Locality optimization • Load balancing

  15. Suitable for your task if • Have a cluster • Working with large dataset • Working with independent data (or assumed) • Can be cast into map and reduce

  16. References • Original paper J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, 2008, pp. 107-113. (http://labs.google.com/papers/mapreduce.html) • On wikipedia (http://en.wikipedia.org/wiki/MapReduce) • Hadoop – MapReduce in Java (http://lucene.apache.org/hadoop/) • Starfish - MapReduce in Ruby (http://rufy.com/starfish/)

  17. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  18. Motivations • Industry-wide shift to multicore • No good framework for parallelize ML algorithms • Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms for multicore processors http://upload.wikimedia.org/wikipedia/commons/a/af/E6750bs8.jpg

  19. Idea

  20. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  21. Valiant Model [Valiant’84] • x is the input • y is a function of x that we want to learn • In Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function

  22. Statistical Query Model [Kearns’98] • A restriction on Valiant model • A learning algorithm uses some aggregates over the examples, not the individual examples • Given a function f(x,y) over instances (data points x and labels y), a statistical oracle will return an estimate of the expectation of f(x,y) • Any model that computes gradients or sufficient statistics over f(x,y) fits this model • Typically this is achieved by summing over the data.

  23. Summation Form • Aggregate over the data: • Divide the data set into pieces • Compute aggregates on each cores • Combine all results at the end

  24. Example: Linear Regression Model: Goal: Solution: Given m examples: (x1, y1), (x2, y2), …, (xm, ym) We write a matrix X with x1, …, xmas rows, and row vector Y=(y1, y2, …ym). Then the solution is • Parallel computation:

  25. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  26. Lighter Weight MapReduce for Multicore

  27. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Conclusion and Discussion Map-Reduce for Machine Learning on Multicore

  28. Locally Weighted Linear Regression (LWLR) • Mappers: one sets compute subgroups of A, the other set compute subgroups b • Two reducers for computing A and b • Finally compute the solution Solve: When wi == 1, this is least squares.

  29. Naïve Bayes (NB) • Goal: estimate P(xj=k|y=1) and P(xj=k|y=0) and P(y) • Computation: count the occurrence of (xj=k, y=1) and (xj=k, y=0), count the occurrence of (y=1) and (y=0) • Mappers: count a subgroup of training samples • Reducer: aggregate the intermediate counts, and calculate the final result

  30. Gaussian Discriminative Analysis (GDA) • Goal: classification of x into classes of y • assuming each class is a Gaussian Mixture model with different means but same covariance. • Computation: • Mappers: compute for a subgroup of training samples • Reducer: aggregate intermediate results

  31. K-means • Computing the Euclidean distance between sample vectors and centroids • Recalculating the centroids • Divide the computation to subgroups to be handled by map-reduce

  32. Neural Network (NN) • Back-propagation, 3-layer network • Input, middle, 2 output nodes • Goal: compute the weights in the NN by back propagation • Mapper: propagate its set of training data through the network, and propagate errors to calculate the partial gradient for weights • Reducer: sums the partial gradients and does a batch gradient descent to update the weights

  33. Principal Components Analysis (PCA) • Compute the principle eigenvectors of the covariance matrix • Clearly, we can compute the summation form using map-reduce • Express the mean vector as a sum

  34. Other Algorithms • Logistic Regression • Independent Component Analysis • Support Vector Machine • Expectation Maximization (EM)

  35. Time Complexity Basically: Linear speed up with increasing number of cores

  36. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  37. Setup • Compare map-reduce version and sequential version • 10 data sets • Machines: • Dual-processor Pentium-III 700MHz, 1GB RAM • 16-way Sun Enterprise 6000

  38. Dual-Processor SpeedUps

  39. SpeedUp for 2-16 processors Bold – average Error Bars – max/min Dashed - variance

  40. Multicore Simulator Results • Multicore simulator over the sensor dataset • Better results – reported for NN & LR • NN • 16 cores 15.5x • 32 cores 29x • 64 cores 54x • LR • 16 cores 15x • 32 cores 29.5x • 64 cores 53x • Could be because of less communication cost

  41. Conclusion • Parallelize summation forms • NO change in the underlying algorithm • NO approximation • Use map-reduce on a single machine

  42. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  43. Apache Mahout • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • Why Mahout? • Community • Documentation and Examples • Scalability • the Apache License • Not-specific research-oriented http://dictionary.reference.com/browse/mahout

  44. Focus: Scalable • Goal: Be as fast and efficient as the possible given the intrinsic design of the algorithm • Some algorithms won’t scale to massive machine clusters • Others fit logically on a MapReduce framework like Apache Hadoop • Still others will need other distributed programming models • Most Mahout implementations are MapReduce enabled • Work in Progress

  45. Sampling of Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

  46. Focus: Machine Learning Applications Examples Genetic Freq. Pattern Mining Classification Clustering Recommenders Utilities Lucene/Vectorizer Math Vectors/Matrices/SVD Collections (primitives) Apache Hadoop http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

  47. Resources • “Mahout in Action” • Owen, Anil, Dunning and Friedman • http://awe.sm/5FyNe • “Introducing Apache Mahout” • http://www.ibm.com/developerworks/java/library/j-mahout/ • “Taming Text” by Ingersoll, Morton, Farris • “Programming Collective Intelligence” by Toby Segaran • “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank • “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer

  48. Outline • What is MapReduce? • Problem Description and Formalization • Statistical Query Model and Summation Form • Architecture (inspired by MapReduce) • Adopted ML Algorithms • Experiments • Future of MapReduce for Machine Learning • Discussion Map-Reduce for Machine Learning on Multicore

  49. Discussion • What are other alternatives to MapReduce? • What to do if “summation form” is not applicable? • Does the dataset quality effect implementation and performance of parallel machine learning algorithms? • Multicore processors… future? Predicting Structural and Functional Sites in Proteins by Searching for Maximum-Weight Cliques

More Related