Large Scale Machine Learning based on MapReduce & GPU Lanbo Zhang
Motivation • Massive data challenges • More and more data to process (Google: 20,000 terabytes per day) • Data arrives faster and faster • Solutions • Invent faster ML algorithms: online algorithms • Stochastic gradient descent vs. batch gradient descent • Parallelize learning processes: MapReduce, GPU, etc.
MapReduce • A programming model invented by Google • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), December 2004, San Francisco, CA • The objective • To support distributed computing on large data sets on clusters of computers • Features • Automatic parallelization and distribution • Fault tolerance • I/O scheduling • Status and monitoring
User Interface • Users need to implement two functions • map (in_key, in_value) -> list(out_key, intermediate_value) • reduce (out_key, list(intermediate_value)) -> list(out_value) • Example: Count word occurrences

Map (String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce (String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
MapReduce for Machine Learning • C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in NIPS 2006 • Algorithms that can be expressed in summation form can be parallelized in the MapReduce framework (a worked example follows this list) • Locally weighted linear regression (LWLR) • Logistic regression (LR): Newton-Raphson method • Naive Bayes (NB)
PCA • Linear SVM • EM: mixture of Gaussians (M-step)
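As a concrete illustration of the summation form (a standard reconstruction; the equations on the original slides are not reproduced here), consider locally weighted linear regression. The solution is built from two sums over the training examples, so each mapper computes partial sums over its shard of the data and a single reducer adds the partial sums and solves the resulting linear system:

A = \sum_{i=1}^{m} w_i \, x_i x_i^{\top}, \qquad b = \sum_{i=1}^{m} w_i \, x_i y_i, \qquad \theta = A^{-1} b

A = \sum_{p=1}^{P} A_p, \qquad b = \sum_{p=1}^{P} b_p, \qquad \text{where mapper } p \text{ computes } (A_p, b_p) \text{ over its data shard.}

The same pattern covers the other algorithms on these slides: whatever statistic the update rule needs (a gradient, a Hessian, sufficient statistics for an M-step) is decomposed into per-shard partial sums computed by mappers and combined by the reducer.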
Time complexity • P: number of cores • P': speedup of matrix inversion and eigen-decomposition on multicore
Experimental Results • Speedup from 1 to 16 cores over all datasets
Apache Hadoop • http://hadoop.apache.org/ • An open-source implementation of MapReduce • An excellent tutorial • http://hadoop.apache.org/common/docs/current/mapred_tutorial.html (with the famous WordCount example) • Very helpful if you need to quickly develop a simple Hadoop program • A comprehensive book • Tom White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009. http://oreilly.com/catalog/9780596521981 • Topics: the Hadoop Distributed File System, Hadoop I/O, how to set up a Hadoop cluster, how to develop a Hadoop application, administration, etc. • Helpful if you want to become a Hadoop expert
Key User Interfaces of Hadoop • Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Implement the map function to define your map routines • Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Implement the reduce function to define your reduce routines • Class JobConf • The primary interface for configuring job parameters, which include but are not limited to: • Input and output paths (Hadoop Distributed File System) • Number of Mappers and Reducers • Job name • …
Apache Mahout • http://lucene.apache.org/mahout/ • A library of parallelized machine learning algorithms implemented on top of Hadoop • Applications • Clustering • Classification • Batch based collaborative filtering • Frequent itemset mining • …
Mahout in progress • Algorithms already implemented • K-Means, Fuzzy K-Means, Naive Bayes, Canopy clustering, Mean Shift, Dirichlet process clustering, Latent Dirichlet Allocation, Random Forests, etc. • Algorithms to be implemented • Stochastic gradient descent, SVM, NN, PCA, ICA, GDA, EM, etc.
GPU for Large-Scale ML • Graphics Processing Unit (GPU) is a specialized processor that offloads 3D or 2D graphics rendering from the CPU • GPUs’ highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms
Logical Organization to Programmers • Each block can have up to 512 threads, which can synchronize with one another • Millions of blocks can be issued
Programming Environment: CUDA • Compute Unified Device Architecture (CUDA) • A parallel computing architecture developed by NVIDIA • The computing engine in GPUs is accessible to software developers through an industry-standard programming language (C with NVIDIA extensions)
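To make the block/thread organization from the previous two slides concrete, here is a minimal, hedged CUDA sketch (not taken from the slides): the kernel is launched over a grid of blocks, each block holds a fixed number of threads (at most 512 per block on GPUs of that generation), and every thread derives its global index from blockIdx and threadIdx.

#include <cstdio>
#include <cuda_runtime.h>

// Minimal kernel: each thread scales one element of a vector.
__global__ void scale_kernel(float *x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against overshoot
        x[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    int threadsPerBlock = 256;                       // <= 512 on 2008-era hardware
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_kernel<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
    cudaFree(d_x);
    return 0;
}

The same pattern (compute a global index, guard against overshoot, process one element) underlies the mining kernels sketched later in this deck.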
SVM on GPUs • Catanzaro, B., Sundaram, N., and Keutzer, K. 2008. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th international Conference on Machine Learning (Helsinki, Finland, July 05 - 09, 2008). ICML '08, vol. 307.
SVM Training • Quadratic Program • The SMO algorithm
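For reference, the quadratic program is the standard soft-margin SVM dual (a textbook reconstruction, since the slide's own formula is not reproduced here):

\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0

SMO solves it by repeatedly selecting two multipliers that violate the KKT conditions, optimizing them analytically while keeping the equality constraint satisfied, and iterating until convergence; the per-iteration work is dominated by updating every point's optimality indicator, which is what the GPU parallelizes.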
SVM Training on GPU • Each thread computes, for one training point, the quantity needed by the SMO update (one thread per point)
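A hedged CUDA sketch of such a per-point computation (the variable names, the precomputed kernel rows, and the exact update are assumptions based on the usual SMO formulation, not code from the paper): after the two working-set multipliers change, every point's optimality indicator f_i can be updated independently by one thread.

// One thread per training point: update its optimality indicator f_i after the SMO
// step changed alpha_high and alpha_low. Launched e.g. as
// update_f_kernel<<<(n + 255) / 256, 256>>>(...). Hedged sketch, not the paper's code.
__global__ void update_f_kernel(float *f, const float *K_high, const float *K_low,
                                float delta_alpha_high, float y_high,
                                float delta_alpha_low, float y_low, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        f[i] += delta_alpha_high * y_high * K_high[i]    // K(x_high, x_i), precomputed row
              + delta_alpha_low  * y_low  * K_low[i];    // K(x_low,  x_i), precomputed row
}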
SVM Classification • The SVM classification task involves finding which side of the hyperplane a point lies on • Each thread evaluates the kernel function for one point
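A simplified, hedged sketch of the classification step (one thread per query point looping over the support vectors with an RBF kernel; the paper batches this work as dense matrix operations, so the names and layout below are only illustrative assumptions):

// One thread classifies one query point z_j: decision[j] = sum_i y_i*alpha_i*K(x_i, z_j) + b.
// Data is row-major: sv[i*d + k] and query[j*d + k]. alpha_y[i] stores y_i * alpha_i.
__global__ void classify_kernel(const float *sv, const float *alpha_y,
                                const float *query, float *decision,
                                int n_sv, int n_query, int d, float gamma, float b) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_query) return;
    float acc = 0.0f;
    for (int i = 0; i < n_sv; ++i) {
        float dist2 = 0.0f;
        for (int k = 0; k < d; ++k) {
            float diff = sv[i * d + k] - query[j * d + k];
            dist2 += diff * diff;
        }
        acc += alpha_y[i] * expf(-gamma * dist2);    // RBF kernel evaluation
    }
    decision[j] = acc + b;                           // sign gives the predicted class
}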
GPUMiner: Parallel Data Mining on Graphics Processors • Wenbin Fang et al. Parallel Data Mining on Graphics Processors. Technical Report HKUST-CS08-07, Oct 2008 • Three components • The CPU-based storage and buffer manager to handle I/O and data transfer between CPU and GPU • The GPU-CPU co-processing parallel mining module • The GPU-based mining visualization module • Two mining algorithms implemented • K-Means clustering • Apriori (frequent pattern mining algorithm)
The bitmap technique • Use a bitmap to represent the association between data objects and clusters (for K-Means), and the association between items and transactions (for Apriori) • Supports efficient row-wise and column-wise operations by exploiting the thread parallelism of the GPU • Use a summary vector that stores the number of ones in each row/column to accelerate counting
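A hedged sketch of the counting idea (the names and memory layout are assumptions, not GPUMiner's actual code): each row of the bitmap is an array of 32-bit words, and a thread can count the ones in its row with the hardware population-count intrinsic.

// bitmap is row-major: row r occupies words_per_row unsigned 32-bit words.
// One thread counts the ones in one row (e.g. the size of one cluster, or the
// support of one item).
__global__ void count_ones_kernel(const unsigned int *bitmap, int *counts,
                                  int n_rows, int words_per_row) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n_rows) return;
    int total = 0;
    for (int w = 0; w < words_per_row; ++w)
        total += __popc(bitmap[r * words_per_row + w]);  // hardware popcount per word
    counts[r] = total;
}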
K-means • Three functions executed on GPU in parallel • makeBitmap_kernel • computeCentriod_kernel • findCluster_kernel
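As an illustration of what a findCluster-style kernel does (a generic sketch with assumed names and data layout, not GPUMiner's implementation): one thread takes one data point, computes its squared distance to every centroid, and records the index of the nearest one.

// One thread assigns one point to its nearest centroid (squared Euclidean distance).
// points: n x d row-major, centroids: k x d row-major.
__global__ void find_cluster_kernel(const float *points, const float *centroids,
                                    int *assignment, int n, int k, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = 3.402823e38f;   // ~FLT_MAX
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        float dist2 = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[i * d + j] - centroids[c * d + j];
            dist2 += diff * diff;
        }
        if (dist2 < best) { best = dist2; best_c = c; }
    }
    assignment[i] = best_c;
}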
Apriori • Finds the frequent itemsets among a large number of transactions • The trie-based implementation • Uses a trie to store candidates and their supports • Uses a bitmap to store the item-transaction matrix • Obtains item supports by counting 1s in the bitmap • The 1-bit counting and intersection operations are implemented as GPU programs
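To see how the bitmap supports candidate counting (again a hedged sketch with assumed names): the support of a candidate such as {a, b} is obtained by AND-ing the transaction rows of item a and item b word by word and counting the surviving ones.

// Support of a 2-itemset {a, b}: threads AND the two item rows word by word,
// popcount the result, and accumulate into *support (host zero-initializes it).
__global__ void itemset_support_kernel(const unsigned int *row_a,
                                       const unsigned int *row_b,
                                       int *support, int words_per_row) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w < words_per_row)
        atomicAdd(support, __popc(row_a[w] & row_b[w]));
}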
Experiments • Settings • GPU: NVIDIA GTX 280, 30 multiprocessors x 8 scalar processors (240 cores) • CPU: Intel Core 2 Quad
Result: K-Means • Baseline: the UVirginia implementation • 35x faster than the four-threaded CPU-based counterpart
Result: Apriori • Baselines • CPU-based Apriori • Best implementation of FIMI’03
Conclusion • Both MapReduce and GPU are feasible strategies for large-scale parallelized machine learning • MapReduce aims at parallelization over computer clusters • The hardware architecture of GPUs makes them a natural choice for parallelized ML