Shujia Zhou, Zhiguang Wang, Ran Qi, Yelena Yesha IAB Meeting Research Report June 13, 2013

Big Data Analytics with LexisNexis/HPCC System: Optimizing HPCC Performance for Big Matrix Calculations

Introduction
• Numerical linear algebra procedures are among the primary building blocks of machine learning algorithms. Conversely, machine learning problems are posing new mathematical and algorithmic questions.
• Big data analytics requires matrix calculations in a cluster for matrices with billions of rows.
• LexisNexis is adding machine learning capabilities to its HPCC/ECL system.
• It is highly desirable to increase the performance of HPCC/ECL matrix calculations in a multicore cluster.

Problems
• Increasing the partition size increases the efficiency of computation within a node, but it also increases the communication cost across nodes. Therefore, there is an optimal partition size.
• Finding this optimal partition size by trial and error is challenging, in particular for matrices with billions of rows.

Objectives
• Optimize the balance between computational load and communication cost in partitioning matrices.
• Develop a performance model.
• Use MPI to implement a distributed matrix calculation algorithm, measure the parameters of the performance model, and derive the optimal matrix partition size.
• Replace MPI with HPCC/ECL in the above method.

The Detail of the Partition Algorithm

Matrix Decomposition
• Assume the number of nodes is 4. The calculation is A * B = C.
• Each of the two input matrices can be divided into 4 chunks.
[Figure: the two chunked input matrices are multiplied to give the final result.]

Data Distribution and Calculation
• Root node 0 distributes the data chunks A~H to the 4 nodes (the root node keeps A and E on its own node).
• After distribution, the chunk pairs AE, BF, CG, and DH each reside on one of nodes 0 to 3.
• Each node obtains its multiplication result based on the data received from the root node. In the figure, the calculated results are marked in red.

Data Rolling across Nodes
• Example: roll A, B, C, and D to their neighbors and calculate the results. The red letters are the chunks passed to the neighbors.
• Before rolling the data, the products are AE, BF, CG, and DH; after rolling the data, they are AF, BG, CH, and DE.
• Color key in the figure: blue means the calculation is done, red means it is in progress, and black means it has not started.
• Keep rolling the data across all the nodes until all the multiplication results are obtained (shown as a table on the original slide).

Data Collection
• Root node 0 gathers all the results from the slave nodes (1, 2, and 3) to form the final result.
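The distribute, roll, and gather steps above can be sketched as a single-process simulation. This is a minimal illustration and not the authors' C/MPI code: it assumes the first matrix is split into row blocks and the second into column blocks (the slides only say each matrix is divided into chunks), and the function names `matmul`, `row_blocks`, `col_blocks`, and `ring_matmul` are hypothetical.

```python
def matmul(X, Y):
    """Plain product of two list-of-lists matrices."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def row_blocks(M, p):
    """Split M into p blocks of consecutive rows (p must divide len(M))."""
    size = len(M) // p
    return [M[i * size:(i + 1) * size] for i in range(p)]

def col_blocks(M, p):
    """Split M into p blocks of consecutive columns (p must divide the width)."""
    size = len(M[0]) // p
    return [[row[j * size:(j + 1) * size] for row in M] for j in range(p)]

def ring_matmul(A, B, p):
    """Simulate p nodes: each keeps one column block of B and rolls row blocks of A."""
    n = len(A)
    size = n // p
    a_blocks = row_blocks(A, p)   # assumed analogue of chunks A..D
    b_blocks = col_blocks(B, p)   # assumed analogue of chunks E..H
    # Distribution: node i starts with row block i of A and column block i of B.
    a_held = list(range(p))
    # partial[node][r] holds (row block r of A) * (column block `node` of B).
    partial = [[None] * p for _ in range(p)]
    for _ in range(p):
        # Calculation: every node multiplies the A block it currently holds.
        for node in range(p):
            r = a_held[node]
            partial[node][r] = matmul(a_blocks[r], b_blocks[node])
        # Rolling: node i passes its A block to node (i + 1) % p.
        a_held = [a_held[(node - 1) % p] for node in range(p)]
    # Collection: the root assembles the size-by-size blocks into C.
    C = [[0] * n for _ in range(n)]
    for node in range(p):
        for r in range(p):
            for i in range(size):
                for j in range(size):
                    C[r * size + i][node * size + j] = partial[node][r][i][j]
    return C

# Usage: the simulated 4-node result matches a direct product.
A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
assert ring_matmul(A, B, 4) == matmul(A, B)
```

In this sketch the blocks of the second matrix never move, so only one operand travels around the ring; each node sees every block of the first matrix exactly once after p calculation steps.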

The Performance Model and Results

Performance Model
• p: number of nodes (1 thread per node)
• n: size of the square matrix
• T: time cost (unit: seconds)

Experimental Results (1)
• Software implementation: C and MPI
• Hardware: UMBC Bluegrit cluster
• Configuration: p = 2, 3, 4, 5, 6, 7, 8, 9; n = 360, 504, 720, 1008, 1080
• Coefficients: Ccalc = 1.801e-008; Ccomm = 9.605e-009

Experimental Results (2)
• Goodness of fit: SSE = 1.17; R-square = 0.9957; RMSE = 0.1755
[Figure: measured times ("Real Data") plotted against the fitted model ("Fitting Data").]
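As a quick consistency check on these figures, the sketch below assumes that RMSE is the degrees-of-freedom-adjusted standard error, sqrt(SSE / (N - m)), as reported by common curve-fitting tools. It also assumes N = 40 timings (one per (p, n) pair in the configuration) and m = 2 fitted coefficients (Ccalc and Ccomm). Neither assumption is stated on the slides.

```python
import math

sse = 1.17            # reported sum of squared errors
num_points = 8 * 5    # assumed: one timing per (p, n) pair, 8 values of p, 5 of n
num_coeffs = 2        # Ccalc and Ccomm

# Degrees-of-freedom-adjusted root mean squared error (assumed definition).
rmse = math.sqrt(sse / (num_points - num_coeffs))
print(round(rmse, 4))  # 0.1755
```

Under these assumptions the computed value reproduces the reported RMSE, which suggests the fit used all 40 configurations and two free parameters.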

Experimental Results (3)
• Matrix size: 360
• x axis: number of nodes
• y axis: time in seconds
• The fitted line (red) indicates that the optimal number of nodes is 25.

Experimental Results (4)
• Matrix size: 720
• x axis: number of nodes
• y axis: time in seconds
• The fitted line (red) indicates that the optimal number of nodes is 37.

Experimental Results (5)
• Matrix size: 1080
• x axis: number of nodes
• y axis: time in seconds
• The fitted line (red) indicates that the optimal number of nodes is 43.

Experimental Results (6)
• Predict the optimal number of nodes for a matrix with a billion rows and columns
• Matrix size: 10^9
• x axis: number of nodes
• y axis: time in seconds
• The fitted line indicates that the optimal number of nodes is 43,302.
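The slides list the fitted coefficients, but the model formula itself did not survive in this transcript. As a rough cross-check, the sketch below assumes a two-term form, T(n, p) = Ccalc * n^3 / p + Ccomm * n^2 * p, in which the computation term shrinks with p and the communication term grows with p. This form is an assumption, not the authors' confirmed model, and the function names are hypothetical. Setting dT/dp = 0 gives p_opt = sqrt(Ccalc * n / Ccomm).

```python
import math

C_CALC = 1.801e-8   # reported Ccalc
C_COMM = 9.605e-9   # reported Ccomm

def predicted_time(n, p):
    """Assumed model: computation cost falls with p, communication cost rises with p."""
    return C_CALC * n**3 / p + C_COMM * n**2 * p

def optimal_nodes(n):
    """Continuous minimiser of predicted_time over p for a given matrix size n."""
    return math.sqrt(C_CALC * n / C_COMM)

for n in (360, 720, 1080, 10**9):
    print(n, round(optimal_nodes(n), 1))
```

Under this assumed form the continuous optimum is about 26, 37, 45, and 43,302 nodes for n = 360, 720, 1080, and 10^9. That agrees with the reported 37 and 43,302 and is close to, but not the same as, the reported 25 and 43, so the authors' actual model may contain terms not captured here.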

Memory Issue

Next Steps
• Replace MPI with HPCC/ECL and measure the parameters of the performance model.
• Find a way of distributing the data to each node; this requires finding the node ID, among other details.
• Find out whether HPCC/ECL allows data exchange across nodes during the calculation.
• In each node, use a BLAS library to speed up matrix calculations if possible.

Summary
• We have developed a performance model for matrix multiplication in a cluster.
• We developed an MPI program, measured its performance, and obtained the fitting parameters of the performance model.
• We found the optimal number of nodes for matrix sizes from 360 and 720 up to 1 billion.

Acknowledgement
• This project is sponsored by LexisNexis through NSF CHMPR. We would like to thank John Holt and Flavio Villanustre for helpful discussions.
