
Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture

Fei Chen, Kevin B. Theobald, Guang R. Gao. CAPSL, Electrical and Computer Engineering, University of Delaware. Cluster 2004, Thursday, September 23rd, 2004.


Presentation Transcript


  1. Implementing Parallel CG Algorithm on the EARTH Multi-threaded Architecture. Fei Chen, Kevin B. Theobald, Guang R. Gao. CAPSL, Electrical and Computer Engineering, University of Delaware. Cluster 2004, Thursday, September 23rd, 2004

  2. Outline • Introduction • Algorithm • Results • Conclusion • Future Work

  3. Introduction: The Conjugate Gradient (CG) method is the most popular iterative method for solving large systems of linear equations Ax = b [1].
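
For readers unfamiliar with the method, the following is a minimal textbook sketch of the CG iteration in plain Python, assuming A is symmetric positive-definite and stored as a dense list of rows. The function names (`matvec`, `dot`, `conjugate_gradient`) and the tolerance are illustrative choices, not taken from the presentation.

```python
def matvec(A, x):
    """Dense matrix-vector multiply (MVM): the dominant cost per iteration."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Vector-vector product (VVP)."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive-definite A (textbook CG)."""
    x = [0.0] * len(b)
    r = list(b)                    # residual r = b - A x, with x = 0
    p = list(r)                    # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:    # stop once the residual norm is small
            break
        Ap = matvec(A, p)          # one MVM per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Each iteration performs one MVM and a handful of VVPs and vector updates, which is why the MVM dominates the running time for large matrices.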

  4. Introduction (continued): The matrix A is usually large and sparse. Previous studies showed that matrix-vector multiply (MVM) accounts for 95% of CPU time and vector-vector products (VVP) for the remaining 5% [2]. Parallel CG Algorithm: • Distribute A and x among the nodes • Local MVM on each node (A1x1, A2x2, ..., Anxn) • Scale variables reduction and broadcast • Calculate new local vectors • Redistribute new local vectors
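
The parallel steps listed on this slide can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the matrix is split by rows, the "nodes" are list entries rather than real processes, and the reductions and redistribution are plain Python sums and concatenations. The names `split_rows` and `parallel_cg` are hypothetical.

```python
def split_rows(n, p):
    """Return (start, stop) row ranges dividing n rows among p nodes."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for k in range(p):
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def parallel_cg(A, b, p, tol=1e-10, max_iter=1000):
    """Simulate row-distributed CG on p nodes inside one process."""
    bounds = split_rows(len(b), p)
    # Distribute: node k holds rows lo..hi of A and the matching vector pieces.
    x = [[0.0] * (hi - lo) for lo, hi in bounds]
    r = [list(b[lo:hi]) for lo, hi in bounds]
    d = [list(rk) for rk in r]
    rs_old = sum(v * v for rk in r for v in rk)      # reduction + broadcast
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        # Redistribute: every node obtains the full search direction.
        d_full = [v for dk in d for v in dk]
        # Local MVM: node k multiplies only its own rows.
        q = [[sum(a * v for a, v in zip(A[i], d_full)) for i in range(lo, hi)]
             for lo, hi in bounds]
        # Scalar reduction + broadcast of the dot product d . q.
        dq = sum(u * v for dk, qk in zip(d, q) for u, v in zip(dk, qk))
        alpha = rs_old / dq
        # Calculate new local vectors on each node.
        for k in range(p):
            x[k] = [xi + alpha * di for xi, di in zip(x[k], d[k])]
            r[k] = [ri - alpha * qi for ri, qi in zip(r[k], q[k])]
        rs_new = sum(v * v for rk in r for v in rk)  # reduction + broadcast
        beta = rs_new / rs_old
        for k in range(p):
            d[k] = [ri + beta * di for ri, di in zip(r[k], d[k])]
        rs_old = rs_new
    return [v for xk in x for v in xk]
```

In a real cluster implementation the concatenation that builds `d_full` is the communication step whose cost the blocking methods in the following slides try to reduce.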

  5. Introduction (continued): EARTH (Efficient Architecture for Running Threads) architecture [3]. EARTH supports fibers, which are non-preemptive and scheduled in response to dataflow-like synchronization operations. Data and control dependences among those fibers are programmed explicitly in Threaded-C using EARTH operations [4].

  6. Algorithm: Design objectives: • Find a matrix blocking method that reduces the overall communication cost. • Overlap communication and computation to further reduce the communication cost. We propose a two-dimensional pipelined method combined with the EARTH multi-threading technique.

  7-13. Algorithm (continued): Horizontal Blocking Method. [Figure, shown over seven animation frames: matrix and vector blocks labeled 1 to 4.] P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = 0. Inter-phase communication cost: Ct = P (P - 1) (N / P) = N (P - 1). Overall communication cost: C = Cn + Ct = N (P - 1) ≈ NP.
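
The inter-phase count for horizontal blocking can be checked with a small simulation. This is a sketch under the assumption that each node owns a contiguous block of rows and the matching piece of the vector, and that every node must collect all other pieces before its local MVM; the function name `horizontal_blocking_mvm` is hypothetical.

```python
def horizontal_blocking_mvm(A, x_blocks):
    """Simulate y = A x with A split into row blocks, one block per node.

    Returns the result blocks and the number of vector elements sent
    between nodes.
    """
    p = len(x_blocks)
    transferred = 0
    # Inter-phase exchange: every node collects the other nodes' pieces of x.
    for dst in range(p):
        for src in range(p):
            if src != dst:
                transferred += len(x_blocks[src])
    x_full = [v for block in x_blocks for v in block]
    # Local MVM: no communication is needed inside this phase (Cn = 0).
    y_blocks, row = [], 0
    for block in x_blocks:
        rows = A[row:row + len(block)]
        y_blocks.append([sum(a * v for a, v in zip(r, x_full)) for r in rows])
        row += len(block)
    return y_blocks, transferred
```

With N = 8 and P = 4, each of the 4 nodes receives 2 elements from each of the other 3 nodes, so the counter reports 24 = N (P - 1).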

  14-28. Algorithm (continued): Vertical Blocking Method (Pipelined). [Figure, shown over fifteen animation frames: matrix and vector blocks labeled 1 to 4, with a row of markers stepping from 1 to 4 across the frames.] P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = (N / P) * P * P = NP. Inter-phase communication cost: Ct = 0. Overall communication cost: C = Cn + Ct = NP.
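
One way to picture the vertical (column) blocking is a ring in which a partial result segment of N / P elements is passed from node to node, each node adding the contribution of its own column block. The sketch below simulates that schedule sequentially. The ring order, the starting node, and the name `vertical_pipelined_mvm` are assumptions made for illustration and may differ from the authors' schedule. Because the accumulator starts on a contributing node, this particular schedule needs P - 1 transfers per segment and reports N (P - 1), which stays within the slide's estimate of NP.

```python
def vertical_pipelined_mvm(A, x, p):
    """Simulate y = A x with A split into p column blocks, one per node.

    Returns y and the number of vector elements passed between nodes.
    Assumes p divides len(x).
    """
    n = len(x)
    w = n // p                        # width of one block
    y = [0.0] * n
    transferred = 0
    for seg in range(p):              # result segment owned by node `seg`
        rows = range(seg * w, (seg + 1) * w)
        acc = [0.0] * w
        node = (seg + 1) % p          # accumulator starts on the owner's successor
        for hop in range(p):
            cols = range(node * w, (node + 1) * w)
            # The current node adds the contribution of its own column block.
            for i, row in enumerate(rows):
                acc[i] += sum(A[row][c] * x[c] for c in cols)
            if hop < p - 1:
                transferred += w      # pass the partial segment along the ring
                node = (node + 1) % p
        y[seg * w:(seg + 1) * w] = acc  # the last visitor is the owner
    return y, transferred
```

In a real pipeline the P segments travel around the ring concurrently, so all nodes are busy at every step; the loop over `seg` here serializes them only to keep the simulation simple.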

  29-40. Algorithm (continued): Two-dimensional Pipelined Method. [Figure, shown over twelve animation frames: matrix and vector blocks labeled 1 to 4, with a row of markers stepping from 1 to 2 across the frames.] P is the number of nodes and N is the size of the matrix. Inner-phase communication cost: Cn = (N / P) * sqrt(P) * P = N sqrt(P). Inter-phase communication cost: Ct = (N / P) * sqrt(P) * P = N sqrt(P). Overall communication cost: C = Cn + Ct = 2 N sqrt(P).
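
The three cost formulas from the slides can be compared side by side with a few lines of Python. The function name `communication_costs` and the dictionary keys are illustrative; the formulas themselves are the ones stated on the slides, counted in vector elements per MVM.

```python
import math


def communication_costs(n, p):
    """Per-MVM communication cost, in vector elements, for each blocking method."""
    return {
        "horizontal": n * (p - 1),                 # Cn = 0, Ct = N (P - 1)
        "vertical": n * p,                         # Cn = NP, Ct = 0
        "two_dimensional": 2 * n * math.sqrt(p),   # Cn = Ct = N sqrt(P)
    }
```

For example, `communication_costs(1024, 16)` gives 15360 for horizontal, 16384 for vertical, and 8192.0 for two-dimensional blocking. At P = 4 the two-dimensional and vertical costs coincide (both 4N), so by these formulas the two-dimensional method only pays off on more than four nodes.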

  41. Algorithm (continued): Multi-threading Technique. [Figure: matrix and vector blocks labeled 1 to 4.] There is no data dependence between the two halves of the MVM. When the first half finishes on the Execution Unit (EU) and the request to send the first half of the result vector is written to the Event Queue (EQ), the second half can start on the EU immediately. The EARTH system dedicates the Synchronization Unit (SU) to handling communication requests across the network, so communication and computation can be overlapped.
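
The overlap described on this slide can be mimicked in Python with a worker thread and a queue. This is only a structural analogy, not Threaded-C and not the EARTH runtime: the main thread plays the role of the EU, a `queue.Queue` stands in for the EQ, a worker thread stands in for the SU, and `send` is a hypothetical callback supplied by the caller. In CPython the two threads do not compute in parallel, so the sketch shows the ordering of events rather than a speedup.

```python
import queue
import threading


def matvec_rows(A, x, lo, hi):
    """Multiply rows lo..hi-1 of A by x."""
    return [sum(a * v for a, v in zip(A[i], x)) for i in range(lo, hi)]


def overlapped_mvm(A, x, send):
    """Compute A x in two halves, handing each half to a sender thread."""
    requests = queue.Queue()          # stand-in for the Event Queue (EQ)

    def sync_unit():                  # stand-in for the Synchronization Unit (SU)
        while True:
            item = requests.get()
            if item is None:          # sentinel: no more requests
                break
            send(*item)

    su = threading.Thread(target=sync_unit)
    su.start()
    mid = len(A) // 2
    first = matvec_rows(A, x, 0, mid)
    requests.put((0, first))          # post the send request without waiting
    second = matvec_rows(A, x, mid, len(A))  # runs while `first` is being sent
    requests.put((mid, second))
    requests.put(None)
    su.join()
    return first + second
```

A caller might pass a `send(offset, values)` function that writes to a socket; the key point is that posting the request returns immediately, so the second half of the MVM does not wait for the first half's communication.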

  42. Scalability Results: Test platforms: • Chiba City at Argonne National Laboratory (ANL), a cluster with 256 dual-CPU nodes connected by Fast Ethernet • SEMi, a MANNA machine simulator [5]. We used the same matrices as the NAS parallel CG benchmark [6].

  43. Scalability Results (continued): Threaded-C implementation scalability results on Chiba City

  44. Scalability Results (continued): NAS CG (MPI) benchmark scalability results on Chiba City

  45. Scalability Results (continued): Scalability comparison with NAS Parallel CG on Chiba City

  46. Scalability Results (continued): Threaded-C implementation scalability results on SEMi

  47. Conclusion: • With the two-dimensional pipelined method, the overall communication cost is reduced to 2 / sqrt(P) of that of a one-dimensional blocking method (vertical or horizontal). • The underlying EARTH system, an adaptive, event-driven multi-threaded execution model, makes it possible to overlap communication and computation in our implementation. • A notable scalability improvement was achieved by implementing the two-dimensional pipelined method on the EARTH multi-threaded architecture.
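
The 2 / sqrt(P) factor follows directly from the cost formulas given earlier, taking the one-dimensional cost as approximately NP. The subscripts 2D and 1D are notation introduced here for illustration.

```latex
\frac{C_{\text{2D}}}{C_{\text{1D}}}
  = \frac{2N\sqrt{P}}{NP}
  = \frac{2}{\sqrt{P}},
\qquad \text{e.g. } P = 64:\quad \frac{2}{\sqrt{64}} = \frac{1}{4}.
```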

  48. Future Work: • Port the EARTH runtime system to clusters with Myrinet interconnects. • Provide a set of simple programming interfaces to reduce users' coding effort. • Investigate how to use the two-dimensional pipelined method and EARTH system support to improve the performance of parallel scientific computing tools.

  49. References: [1] Jonathan R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. [2] P. Kloos, P. Blaise, F. Mathey, OpenMP and MPI Programming with a CG Algorithm, page 5, CEA, http://www.epcc.ed.ac.uk/ewomp2000/Presentations/KloosSlides.pdf. [3] Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren, A Study of the EARTH-MANNA Multi-threaded System, International Journal of Parallel Programming, 24(4):319-347, August 1996. [4] Kevin B. Theobald, EARTH: An Efficient Architecture for Running Threads, PhD thesis, McGill University, Montreal, Quebec, May 1999. [5] Kevin B. Theobald, SEMi: A Simulator for EARTH, MANNA, and i860 (Version 0.23), CAPSL Technical Memo 27, March 1, 1999, ftp://ftp.capsl.udel.edu/pub/doc/memos. [6] R. C. Agarwal, B. Alpern, L. Carter, F. G. Gustavson, D. J. Klepacki, R. Lawrence, M. Zubair, High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2, IBM Systems Journal, Vol. 34, No. 2, 1995.
