Big Data Analytics with LexisNexis/HPCC System: Optimizing HPCC Performance for BIG Matrix Calculations Shujia Zhou, Zhiguang Wang, Ran Qi, Yelena Yesha IAB Meeting Research Report June 13, 2013
Introduction • Numerical linear algebra procedures are among the primary building blocks of machine learning algorithms. Conversely, machine learning problems are posing new mathematical and algorithmic questions. • Big data analytics requires matrix calculations in a cluster for matrices with billions of rows • LexisNexis is adding machine learning capabilities to its HPCC/ECL system • It is highly desirable to increase the performance of HPCC/ECL matrix calculations in a multicore cluster
Problems • Increasing the partition size increases the efficiency of computation within a node, but it also increases the across-node communication cost. Therefore, there is an optimal partition size • It is challenging to find such an optimal partition size by trial and error, in particular for matrices with billions of rows
Objectives • Optimize the balance between computational load and communication cost when partitioning matrices • Develop a performance model • Use MPI to implement a distributed matrix multiplication algorithm, measure the parameters of the performance model, and derive the optimal matrix partition size • Replace MPI with HPCC/ECL in the above method
Matrix Decomposition Assume the number of nodes is 4 and the calculation is A * B = C. Each input matrix is divided into 4 chunks: A–D for the first matrix and E–H for the second. The final result C is assembled from the products of these chunks.
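As a concrete illustration of the decomposition, here is a minimal C sketch of one node's block product. It assumes square n x n matrices with n divisible by the node count p, each node holding one row block of the first matrix and one column block of the second; all names are illustrative, not taken from the original implementation.

/* Minimal sketch of one node's block product (illustrative;
   assumes n divisible by p and row-major storage). The node
   holds an (n/p) x n row block of the first matrix and an
   n x (n/p) column block of the second; multiplying them
   yields one (n/p) x (n/p) block of the result. */
void multiply_block(const double *a_rows, const double *b_cols,
                    double *c_block, int n, int p)
{
    int nb = n / p;                       /* block edge length */
    for (int i = 0; i < nb; i++)          /* rows of A block   */
        for (int j = 0; j < nb; j++) {    /* cols of B block   */
            double sum = 0.0;
            for (int k = 0; k < n; k++)   /* full inner dim    */
                sum += a_rows[i * n + k] * b_cols[k * nb + j];
            c_block[i * nb + j] = sum;
        }
}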
Data Distribution and Calculation Root node 0 distributes the data chunks A–H to the 4 nodes, keeping A and E on its own node. The chunks end up distributed as follows: node 0 holds A, E; node 1 holds B, F; node 2 holds C, G; node 3 holds D, H. Each node then computes a partial multiplication result from the data it received.
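A hedged sketch of how this distribution step might look in MPI; the original code is not shown, and the buffer names plus the assumption that the root has packed B's column blocks contiguously are ours:

#include <mpi.h>

/* Sketch of the distribution step. MPI_Scatter sends chunk i
   to rank i, so the root (rank 0) automatically keeps the
   first chunk of each matrix, matching the A/E example above.
   Assumes the root has packed B's column blocks contiguously. */
void distribute_chunks(const double *A, const double *B_packed,
                       double *a_rows, double *b_cols, int n, int p)
{
    int nb = n / p;
    MPI_Scatter(A, nb * n, MPI_DOUBLE,        /* row blocks of A */
                a_rows, nb * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(B_packed, n * nb, MPI_DOUBLE, /* col blocks of B */
                b_cols, n * nb, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}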
Data Rolling across Nodes E.g.: chunks A, B, C, D are rolled to their neighboring nodes, which then compute with the newly received chunk. Before rolling, the nodes hold (A,E), (B,F), (C,G), (D,H); after one roll they hold (D,E), (A,F), (B,G), (C,H). Keep rolling the data across all the nodes; once every chunk of the first matrix has visited every node, all the partial multiplication results have been obtained.
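The rolling step maps naturally onto a ring exchange. A minimal sketch of one roll, our own, using MPI_Sendrecv_replace to avoid a second buffer:

#include <mpi.h>

/* One "roll": pass the current row block of the first matrix
   to the next rank and receive the previous rank's block in
   place. After p-1 rolls, every node has seen all row blocks. */
void roll_block(double *a_rows, int n, int p, int rank)
{
    int nb   = n / p;
    int next = (rank + 1) % p;      /* neighbor to send to   */
    int prev = (rank + p - 1) % p;  /* neighbor to recv from */
    MPI_Sendrecv_replace(a_rows, nb * n, MPI_DOUBLE,
                         next, 0,   /* destination, send tag */
                         prev, 0,   /* source, receive tag   */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

After s rolls, rank r holds the row block that originated on rank (r - s + p) % p, which tells each node which block of the result its current product belongs to.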
Data Collection Root node 0 gathers all the results from the slave nodes to form the final result.
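The collection step is a single gather in MPI; a minimal sketch under the same layout assumptions as above:

#include <mpi.h>

/* Gather each node's column block of the result (n * n/p
   doubles) on the root in rank order. C_packed is only
   significant at the root, which may still need to unpack
   the blocks into plain row-major order. */
void collect_result(const double *c_cols, double *C_packed,
                    int n, int p)
{
    int nb = n / p;
    MPI_Gather(c_cols, n * nb, MPI_DOUBLE,
               C_packed, n * nb, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}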
Performance Model • p: number of nodes (1 thread per node) • n: square matrix size • T: time cost (unit: second) • Model: T = Ccalc * n^3 / p + Ccomm * n^2 * p, where Ccalc and Ccomm are computation and communication coefficients fitted from measurements • Setting dT/dp = 0 gives the optimal node number p* = sqrt(Ccalc * n / Ccomm); this form reproduces the optimal node numbers reported on the following slides
Experimental Results (1) • Software implementation: C and MPI • Hardware: UMBC Bluegrit cluster • Configuration: p = 2, 3, 4, 5, 6, 7, 8, 9; n = 360, 504, 720, 1008, 1080 • Fitted coefficients: Ccalc = 1.801e-8, Ccomm = 9.605e-9
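Plugging these fitted coefficients into the model above gives the optimal node counts directly. A small check program (ours, not part of the original code) that evaluates p* for the tested sizes and for the billion-row case:

#include <math.h>
#include <stdio.h>

/* Evaluate p* = sqrt(Ccalc * n / Ccomm), the minimizer of
   T = Ccalc*n^3/p + Ccomm*n^2*p, for the fitted coefficients. */
int main(void)
{
    const double Ccalc = 1.801e-8, Ccomm = 9.605e-9;
    const double sizes[] = { 360.0, 720.0, 1080.0, 1e9 };
    for (int i = 0; i < 4; i++) {
        double p_opt = sqrt(Ccalc * sizes[i] / Ccomm);
        printf("n = %.0f -> optimal p ~ %.0f\n", sizes[i], p_opt);
    }
    return 0;
}

This prints roughly 26, 37, 45, and 43,302; the slides below report 25, 37, 43, and 43,302, the small differences presumably coming from per-size curve fits.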
Experimental Results (2) • Goodness of fit: SSE: 1.17, R-square: 0.9957, RMSE: 0.1755 [Figure: real data vs. fitted data]
Experimental Results (3) • Matrix size: 360 • x axis: node number • y axis: time in seconds • The fitting curve (red) indicates the optimal node number is 25
Experimental Results (4) • Matrix size: 720 • x axis: node number • y axis: time in seconds • The fitting curve (red) indicates the optimal node number is 37
Experimental Results (5) • Matrix size: 1080 • x axis: node number • y axis: time in seconds • The fitting curve (red) indicates the optimal node number is 43
Experimental Results (6) • Predict the optimal node number for a matrix with a billion rows and columns • Matrix size: 10^9 • x axis: node number • y axis: time in seconds • The fitting curve indicates the optimal node number is 43,302
Next Step • Replace MPI with HPCC/ECL and measure the parameters of the performance model • Find a way of distributing the data to each node; this requires knowing the node id, etc. • Find out whether HPCC/ECL allows data exchange across nodes during the calculation • In each node, use a BLAS library to speed up matrix calculations if possible (see the sketch below)
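For the BLAS item, the per-node triple loop sketched earlier could be delegated to a single dgemm call. A sketch against the standard CBLAS interface (linking to any implementation such as OpenBLAS; the block layout assumptions are the same as before and are ours):

#include <cblas.h>

/* Same block product as before, delegated to BLAS:
   C_block = 1.0 * A_rows * B_cols + 0.0 * C_block, with
   A_rows (nb x n), B_cols (n x nb), C_block (nb x nb),
   all row-major. */
void multiply_block_blas(const double *a_rows, const double *b_cols,
                         double *c_block, int n, int p)
{
    int nb = n / p;
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nb, nb, n,          /* M, N, K       */
                1.0, a_rows, n,     /* alpha, A, lda */
                b_cols, nb,         /* B, ldb        */
                0.0, c_block, nb);  /* beta, C, ldc  */
}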
Summary • We have developed a performance model for matrix multiplication in a cluster • We developed an MPI program, measured its performance, and obtained the fitting parameters of the performance model • We found the optimal node numbers for matrix sizes of 360, 720, and 1080, and predicted the optimum for a matrix of size 10^9
Acknowledgement • This project is sponsored by LexisNexis through NSF CHMPR. We would like to thank John Holt and Flavio Villanustre for helpful discussions