Map-Reduce for Machine Learning on Multicore
C. Chu, S.K. Kim, Y. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun (NIPS 2006)
Shimin Chen, Big Data Reading Group
Motivations • Industry-wide shift to multicore • No good framework for parallelizing ML algorithms • Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms on multicore processors
Idea: Statistical Query Model → Summation Form → Map-Reduce
Outline • Introduction • Statistical Query Model and Summation Form • Architecture (inspired by Map-Reduce) • Adopted ML Algorithms • Experiments • Conclusion
Valiant Model [Valiant’84] • x is the input • y is a function of x that we want to learn • In the Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function
Statistical Query Model [Kearns’98] • A restriction of the Valiant model • The learning algorithm uses only aggregates over the examples, not the individual examples • More precisely, the learning algorithm interacts with a statistical query oracle • The learning algorithm asks for a statistic f(x, y) • The oracle returns an estimate of the expectation of f(x, y) over the data distribution
Summation Form • Aggregate over the data: Σᵢ f(xᵢ, yᵢ) • Divide the data set into pieces • Compute the partial aggregates on each core • Combine all partial results at the end (a sketch of this pattern follows)
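The paper describes this pattern only abstractly; the following is a minimal sketch of the divide/aggregate/combine structure, using Python's multiprocessing pool as a stand-in for the paper's multicore framework (the function names and setup are illustrative, not from the paper).

```python
# Minimal sketch of the summation form: split the data, compute a partial
# aggregate on each worker (the "map"), then combine the partial results
# (the "reduce"). A process pool stands in for the paper's engine.
import numpy as np
from multiprocessing import Pool

def partial_sum(chunk):
    # map: aggregate over one piece of the data
    return chunk.sum(axis=0)

def summation_form(data, num_workers=4):
    chunks = np.array_split(data, num_workers)
    with Pool(num_workers) as pool:
        partials = pool.map(partial_sum, chunks)
    return sum(partials)  # reduce: combine the per-worker aggregates

if __name__ == "__main__":
    x = np.random.randn(10_000, 3)
    print(summation_form(x))  # matches x.sum(axis=0) up to rounding
```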
Example: Linear Regression using Least Squares • Model: y ≈ θᵀx • Goal: find θ minimizing Σᵢ (θᵀxᵢ − yᵢ)² • Given m examples (x1, y1), (x2, y2), …, (xm, ym), write a matrix X with x1, …, xm as rows and a column vector Y = (y1, y2, …, ym)ᵀ; then the solution is θ = (XᵀX)⁻¹ XᵀY • Summation form: XᵀX = Σᵢ xᵢxᵢᵀ and XᵀY = Σᵢ xᵢyᵢ • Parallel computation: cut the data into pieces of m/num_processors examples, compute the partial sums on each processor, and add them up
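A minimal sketch of this parallel computation (not the paper's code; the chunking, sequential loop, and solver call are illustrative assumptions): each piece contributes partial sums for XᵀX and XᵀY, which are added before solving the normal equations.

```python
# Sketch: least squares via the summation form. Each data piece contributes
# partial sums x_i x_i^T and x_i y_i; the reduce step adds them and solves.
import numpy as np

def normal_eq_map(X_chunk, y_chunk):
    # map: partial sums over one piece of the data
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def parallel_least_squares(X, y, num_pieces=4):
    A = np.zeros((X.shape[1], X.shape[1]))
    b = np.zeros(X.shape[1])
    for Xc, yc in zip(np.array_split(X, num_pieces),
                      np.array_split(y, num_pieces)):
        Ac, bc = normal_eq_map(Xc, yc)  # would run on separate cores
        A += Ac                         # reduce: add the partial sums
        b += bc
    return np.linalg.solve(A, b)        # theta = (X^T X)^{-1} X^T Y
```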
Outline • Introduction • Statistical Query Model and Summation Form • Architecture (inspired by Map-Reduce) • Adopted ML Algorithms • Experiments • Conclusion
Locally Weighted Linear Regression (LWLR) • Solve A θ = b, where A = Σᵢ wᵢ xᵢxᵢᵀ and b = Σᵢ wᵢ xᵢyᵢ (when all wᵢ = 1, this reduces to ordinary least squares) • Mappers: one set computes partial sums for A, the other set computes partial sums for b • Two reducers, one aggregating A and one aggregating b • Finally solve the linear system for θ
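A minimal sketch of the two summations; for brevity both partial sums come from a single helper here, whereas the slide describes two separate sets of mappers. Function names are illustrative assumptions.

```python
# Sketch: LWLR via the summation form. A = sum_i w_i x_i x_i^T and
# b = sum_i w_i x_i y_i are accumulated from per-piece partial sums.
import numpy as np

def lwlr_map(X_chunk, y_chunk, w_chunk):
    Xw = X_chunk * w_chunk[:, None]        # weight each row
    return Xw.T @ X_chunk, Xw.T @ y_chunk  # partial A, partial b

def lwlr_solve(X, y, w, num_pieces=4):
    parts = [lwlr_map(Xc, yc, wc)
             for Xc, yc, wc in zip(np.array_split(X, num_pieces),
                                   np.array_split(y, num_pieces),
                                   np.array_split(w, num_pieces))]
    A = sum(p[0] for p in parts)  # reducer for A
    b = sum(p[1] for p in parts)  # reducer for b
    return np.linalg.solve(A, b)  # with all w_i == 1 this is least squares
```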
Naïve Bayes (NB) • Goal: estimate P(xj = k | y = 1) and P(xj = k | y = 0) • Computation: count the occurrences of (xj = k, y = 1) and (xj = k, y = 0), count the occurrences of (y = 1) and (y = 0), then compute the ratios • Mappers: count over a subgroup of training examples • Reducer: aggregate the intermediate counts and calculate the final estimates
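A minimal counting sketch, assuming discrete feature values; the mapper/reducer helpers and the unsmoothed ratio are illustrative assumptions (the paper gives no code).

```python
# Sketch: Naive Bayes via counting. Mappers count (feature j = k, label y)
# and label occurrences over their subgroup; the reducer merges the counts
# and turns them into conditional probability estimates.
from collections import Counter

def nb_map(X_chunk, y_chunk):
    counts, label_counts = Counter(), Counter()
    for x, y in zip(X_chunk, y_chunk):
        label_counts[y] += 1
        for j, k in enumerate(x):
            counts[(j, k, y)] += 1
    return counts, label_counts

def nb_reduce(partials):
    counts, label_counts = Counter(), Counter()
    for c, lc in partials:
        counts.update(c)
        label_counts.update(lc)
    # P(x_j = k | y) = count(x_j = k, y) / count(y), unsmoothed
    return {key: n / label_counts[key[2]] for key, n in counts.items()}
```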
Gaussian Discriminative Analysis (GDA) • Goal: classification of x into classes of y, assuming each class-conditional distribution of x is Gaussian with its own mean but a shared covariance matrix • Computation: the class priors, per-class means, and shared covariance are all expressed as sums over the training examples • Mappers: compute the partial sums for a subset of training examples • Reducer: aggregate the intermediate results into the final parameter estimates
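A minimal sketch of the summation form for GDA's parameters, using the identity Σ = (1/m)(Σᵢ xᵢxᵢᵀ − Σ_c N_c μ_c μ_cᵀ) for the shared covariance; the helper names and the exact statistics returned are illustrative assumptions.

```python
# Sketch: GDA parameter estimation from per-piece partial sums.
# Each mapper returns, per class: count, sum of x, and sum of x x^T.
import numpy as np

def gda_map(X_chunk, y_chunk):
    stats = {}
    for c in np.unique(y_chunk):
        Xc = X_chunk[y_chunk == c]
        stats[c] = (len(Xc), Xc.sum(axis=0), Xc.T @ Xc)
    return stats

def gda_reduce(partials, d):
    counts, sums = {}, {}
    outer, m = np.zeros((d, d)), 0
    for stats in partials:
        for c, (n, s, o) in stats.items():
            counts[c] = counts.get(c, 0) + n
            sums[c] = sums.get(c, np.zeros(d)) + s
            outer += o
            m += n
    mu = {c: sums[c] / counts[c] for c in counts}
    priors = {c: counts[c] / m for c in counts}
    # shared covariance: (1/m) * (sum_i x_i x_i^T - sum_c N_c mu_c mu_c^T)
    sigma = (outer - sum(counts[c] * np.outer(mu[c], mu[c]) for c in counts)) / m
    return priors, mu, sigma
```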
K-means • Compute the Euclidean distance between sample vectors and centroids • Recalculate the centroids • Divide the computation into subgroups of samples handled by map-reduce
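A minimal sketch of one k-means iteration in this style, assuming mappers assign their samples to the nearest centroid and emit per-centroid partial sums and counts; the helper names are illustrative.

```python
# Sketch: one k-means iteration. Mappers assign samples to the nearest
# centroid and emit per-centroid partial sums and counts; the reducer
# recomputes the centroids from the combined sums.
import numpy as np

def kmeans_map(X_chunk, centroids):
    dists = ((X_chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)          # nearest centroid per sample
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for c in range(k):
        sums[c] = X_chunk[labels == c].sum(axis=0)
        counts[c] = (labels == c).sum()
    return sums, counts

def kmeans_reduce(partials):
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums / np.maximum(counts, 1)[:, None]  # updated centroids
```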
Expectation Maximization (EM) • E-step: computes posterior probabilities (or expected counts) for each training example • M-step: combines these values to update the parameters • Both steps can be parallelized using map-reduce
Neural Network (NN) • Back-propagation on a 3-layer network: input layer, one hidden layer, two output nodes • Goal: compute the weights of the NN by back-propagation • Mapper: propagates its subset of training data through the network and back-propagates the errors to compute the partial gradient for the weights • Reducer: sums the partial gradients and performs a batch gradient-descent update of the weights
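A minimal sketch of the sum-the-partial-gradients structure. To stay short, a squared-error linear model stands in for the 3-layer back-propagation network; only the map-reduce pattern mirrors the slide, and the helper names are assumptions.

```python
# Sketch: batch gradient descent where each mapper computes the partial
# gradient over its data piece and the reducer sums them before one update.
# A linear squared-error model replaces the 3-layer network for brevity.
import numpy as np

def grad_map(weights, X_chunk, y_chunk):
    err = X_chunk @ weights - y_chunk
    return X_chunk.T @ err            # partial gradient for this piece

def batch_gradient_step(weights, X, y, lr=1e-3, num_pieces=4):
    partials = [grad_map(weights, Xc, yc)
                for Xc, yc in zip(np.array_split(X, num_pieces),
                                  np.array_split(y, num_pieces))]
    grad = sum(partials)              # reducer: sum the partial gradients
    return weights - lr * grad        # one batch gradient-descent update
```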
Principal Components Analysis (PCA) • Compute the principal eigenvectors of the covariance matrix • The covariance matrix is itself built from sums over the examples (Σᵢ xᵢxᵢᵀ and Σᵢ xᵢ for the mean), so the summation form can be computed with map-reduce
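A minimal sketch, assuming the covariance is formed from the two sums above and the eigenvectors are taken on the reducer; helper names are illustrative.

```python
# Sketch: PCA via the summation form. Mappers emit partial sums of x_i x_i^T
# and x_i; the reducer forms the covariance and takes its top eigenvectors.
import numpy as np

def pca_map(X_chunk):
    return X_chunk.T @ X_chunk, X_chunk.sum(axis=0), len(X_chunk)

def pca_reduce(partials, num_components=2):
    outer = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    m = sum(p[2] for p in partials)
    mu = total / m
    cov = outer / m - np.outer(mu, mu)   # covariance from the two sums
    _, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    return eigvecs[:, -num_components:]  # principal eigenvectors
```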
Other Algorithms • Logistic Regression • Independent Component Analysis • Support Vector Machine
Outline • Introduction • Statistical Query Model and Summation Form • Architecture (inspired by Map-Reduce) • Adopted ML Algorithms • Experiments • Conclusion
Setup • Compare the map-reduce versions against the sequential versions • 10 data sets • Machines: • Dual-processor Pentium-III 700MHz, 1GB RAM • 16-way Sun Enterprise 6000 • (these are SMP machines, not multicore)
2-16 Processor Speedups • More data in the paper
Multicore Simulator Results • The paper has a paragraph on this • Basically, it says the results are better than on the multiprocessor machines • Likely because of lower communication cost between cores
Conclusion • Write ML algorithms in summation form and parallelize the summations • Use map-reduce on a single (multicore) machine