Presentation Transcript


  1. Big Data Analytics with LexisNexis/HPCC System: Optimizing HPCC Performance for Big Matrix Calculations. Shujia Zhou, Zhiguang Wang, Ran Qi, Yelena Yesha. IAB Meeting Research Report, June 13, 2013

  2. Introduction
  • Numerical linear algebra procedures are among the primary building blocks of machine learning algorithms. Conversely, machine learning problems are posing new mathematical and algorithmic questions.
  • Big data analytics requires matrix calculations in a cluster for matrices with billions of rows.
  • LexisNexis is adding machine learning capabilities in its HPCC/ECL system.
  • It is highly desirable to increase the performance of matrix calculations of HPCC/ECL in a multicore cluster.

  3. Problems
  • Increasing the partition size will increase the efficiency of computation within a node; however, it also increases the across-node communication cost. Therefore, there is an optimal partition size.
  • It is challenging to apply a trial-and-error approach to find such an optimal partition size, in particular for matrices with billions of rows.

  4. Objectives
  • Optimize the balance between computational load and communication cost when partitioning matrices
  • Develop a performance model
  • Use MPI to implement a distributed matrix-calculation algorithm, measure the parameters of the performance model, and derive the optimal matrix partition size
  • Replace MPI with HPCC/ECL in the above method

  5. Details of the Partition Algorithm

  6. Matrix Decomposition
  • Assume the number of nodes is 4 and the calculation is A × B = C.
  • Matrix A can be divided into 4 chunks and matrix B can be divided into 4 chunks; combining the partial products gives the final result C (a layout sketch follows below).
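The transcript does not say how the chunks are laid out. One layout consistent with the distribution and rolling steps on the following slides is to split A into p row blocks and B into p column blocks; the sketch below assumes that layout, row-major storage, p dividing n, and hypothetical helper names.

```c
/* Hypothetical chunk layout (an assumption, not stated in the slides):
 * A is split into p row blocks, B into p column blocks; row-major storage,
 * p divides n. A row block of A is already contiguous; a column block of B
 * is packed into a contiguous buffer before it is sent. */
#include <stddef.h>

/* Pointer to row block r of A (n/p consecutive rows of n elements each). */
double *row_block(double *A, size_t n, size_t p, size_t r) {
    return A + r * (n / p) * n;
}

/* Pack column block c of B (n/p consecutive columns) into buf (n x n/p). */
void copy_col_block(const double *B, double *buf, size_t n, size_t p, size_t c) {
    size_t w = n / p;                      /* block width                */
    for (size_t i = 0; i < n; i++)         /* every row of B             */
        for (size_t j = 0; j < w; j++)     /* columns inside this block  */
            buf[i * w + j] = B[i * n + c * w + j];
}
```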

  7. Data Distribution and Calculation
  • Root node 0 distributes the data chunks A–H to the 4 nodes, keeping A and E on its own node; the remaining chunk pairs (B, F), (C, G), and (D, H) go to the other three nodes.
  • Each node then computes its multiplication result from the chunks it received (see the MPI sketch below).
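A minimal MPI sketch of this scatter step (not the authors' code), assuming root 0 holds the full A and the column blocks of B already packed in chunk order (e.g., with copy_col_block above); MPI_Scatter leaves the first block pair, A and E in the 4-node example, on the root itself.

```c
/* Scatter one row block of A and one packed column block of B to every node.
 * The root's own share stays on node 0, matching "root keeps A and E". */
#include <mpi.h>

void distribute(double *A, double *Bpacked, double *myA, double *myB,
                int n, int p) {
    int block = (n / p) * n;  /* elements per row block / packed column block */
    MPI_Scatter(A,       block, MPI_DOUBLE, myA, block, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Scatter(Bpacked, block, MPI_DOUBLE, myB, block, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
}
```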

  8. Data Rolling across Nodes
  • Roll chunks A, B, C, D one hop to the neighboring nodes and compute the next partial results: before rolling the nodes hold (A, E), (B, F), (C, G), (D, H); after one roll they hold (D, E), (A, F), (B, G), (C, H).
  • Keep rolling the data across all the nodes; after the final roll, all the multiplication results have been obtained (a minimal MPI sketch of this step follows below).
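A minimal MPI sketch of the rolling loop (an illustration under the same assumed layout, not the authors' code): each node keeps its packed column block of B, computes one block of its strip of C, and passes its current row block of A to the next node on the ring, matching the before/after pairs above.

```c
#include <mpi.h>

/* Multiply the currently held row block of A (originating from block row
 * `arow`) by the fixed column block of B, writing one (n/p) x (n/p) block
 * of this node's n x (n/p) strip of C. */
static void block_multiply(const double *Ablk, const double *Bblk,
                           double *Cstrip, int n, int p, int arow) {
    int b = n / p;                               /* block dimension */
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += Ablk[i * n + k] * Bblk[k * b + j];
            Cstrip[(arow * b + i) * b + j] = s;
        }
}

void roll_and_multiply(double *myA, const double *myB, double *Cstrip,
                       int n, int p, int rank) {
    int block = (n / p) * n;          /* elements in one row block of A  */
    int next  = (rank + 1) % p;       /* we pass our A block forward...  */
    int prev  = (rank + p - 1) % p;   /* ...and receive the previous one */
    for (int step = 0; step < p; step++) {
        block_multiply(myA, myB, Cstrip, n, p, (rank - step + p) % p);
        if (step < p - 1)             /* p - 1 rolls in total            */
            MPI_Sendrecv_replace(myA, block, MPI_DOUBLE, next, 0, prev, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```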

  9. Data Collection
  • Root node 0 gathers all the results from the slave nodes to assemble the final result (sketch below).
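A minimal sketch of the gather step (again an assumption about the exact buffers, not the authors' code): every node sends its n × (n/p) strip of C and the root receives the strips back to back in rank order; interleaving them into a single row-major C is left to the root and not shown.

```c
#include <mpi.h>

/* Root 0 gathers each node's n x (n/p) strip of C. Cgathered must hold
 * n * n doubles on the root and may be NULL on the other nodes. */
void collect(double *Cstrip, double *Cgathered, int n, int p) {
    int strip = n * (n / p);
    MPI_Gather(Cstrip, strip, MPI_DOUBLE,
               Cgathered, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
```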

  10. The Performance Model and Results

  11. Performance Model
  • p: number of nodes (1 thread per node)
  • n: size of the square matrix
  • T: time cost (unit: seconds)
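The explicit formula was on the slide image and is not in the transcript. One form consistent with the fitted coefficients on the next slide and with the reported optimal node numbers (for example, 43,302 nodes at n = 10^9) is

$$ T(n,p) = C_{\mathrm{calc}}\,\frac{n^{3}}{p} + C_{\mathrm{comm}}\,n^{2}p, $$

where the first term models the per-node computation and the second the across-node communication. Minimizing T over p then gives the optimal node count $p^{*} = \sqrt{(C_{\mathrm{calc}}/C_{\mathrm{comm}})\,n}$.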

  12. Experimental Results (1)
  • Software implementation: C and MPI
  • Hardware: UMBC Bluegrit cluster
  • Configuration: p = 2, 3, 4, 5, 6, 7, 8, 9; n = 360, 504, 720, 1008, 1080
  • Coefficients: Ccalc = 1.801e-08, Ccomm = 9.605e-09

  13. Experimental Results (2)
  • Goodness of fit: SSE = 1.17, R-square = 0.9957, RMSE = 0.1755
  • (Plot: real data vs. fitted data)

  14. Experimental Results (3)
  • Matrix size: 360; x-axis: node number; y-axis: time in seconds
  • The fitting line (red) indicates the optimal node number is 25

  15. Experimental Results (4)
  • Matrix size: 720; x-axis: node number; y-axis: time in seconds
  • The fitting line (red) indicates the optimal node number is 37

  16. Experimental Results (5)
  • Matrix size: 1080; x-axis: node number; y-axis: time in seconds
  • The fitting line (red) indicates the optimal node number is 43

  17. Experimental Results (6)
  • Predict the optimal node number for a matrix with a billion rows and columns
  • Matrix size: 10^9; x-axis: node number; y-axis: time in seconds
  • The fitting line indicates the optimal node number is 43,302
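Under the model form sketched after slide 11 (an assumption), this prediction follows directly from the fitted coefficients:

$$ p^{*} = \sqrt{\frac{C_{\mathrm{calc}}}{C_{\mathrm{comm}}}\,n} = \sqrt{\frac{1.801\times10^{-8}}{9.605\times10^{-9}}\times 10^{9}} \approx \sqrt{1.875\times10^{9}} \approx 43{,}302, $$

which matches the reported optimum; the same expression gives roughly 26, 37, and 45 for n = 360, 720, and 1080, close to the fitted values of 25, 37, and 43.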

  18. Memory Issue

  19. Next Steps
  • Replace MPI with HPCC/ECL and measure the parameters of the performance model
  • Find a way to distribute the data to each node; this requires knowing the node ID, etc.
  • Determine whether HPCC/ECL allows data exchange across nodes during the calculation
  • In each node, use the BLAS library to speed up matrix calculations if possible (see the sketch after this list)
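As a sketch of the last bullet, the per-node block product could be handed to a BLAS dgemm call. This assumes a CBLAS implementation (e.g., OpenBLAS or ATLAS) is available on each node and reuses the row-block / packed-column-block layout from the MPI sketches above; it is a drop-in replacement for the hand-written block_multiply loop, not HPCC/ECL code.

```c
#include <cblas.h>

/* Compute one (n/p) x (n/p) block of C from a full row block of A and a
 * packed column block of B, using BLAS instead of a triple loop. */
void block_multiply_blas(const double *Ablk, const double *Bblk,
                         double *Cblk, int n, int p) {
    int b = n / p;
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                b, b, n,          /* C block is b x b, inner dimension n  */
                1.0, Ablk, n,     /* row block of A, leading dimension n  */
                     Bblk, b,     /* packed column block of B, ld b       */
                0.0, Cblk, b);    /* output block, leading dimension b    */
}
```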

  20. Summary
  • We have developed a performance model for matrix multiplication in a cluster
  • We developed an MPI program, measured its performance, and obtained the fitting parameters of the performance model
  • We found the optimal node numbers for matrix sizes from 360 and 720 up to one billion

  21. Acknowledgement
  • This project is sponsored by LexisNexis through NSF CHMPR. We would like to thank John Holt and Flavio Villanustre for helpful discussions.
