
High Performance LU Factorization for Non-dedicated Clusters


Presentation Transcript


  1. High Performance LU Factorization for Non-dedicated Clusters and the future Grid Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo)

  2. Background • Computing nodes on clusters and the Grid are shared by multiple applications • To obtain good performance, HPC applications must cope with • Background processes • Dynamically changing sets of available nodes • Large latencies on the Grid

  3. Performance limiting factor: background processes • Other processes may run in the background • Network daemons, interactive shells, etc. • Many typical applications are written in a synchronous style • In such applications, the delay of a single node degrades the overall performance

  4. Performance limiting factor: large latencies on the Grid • In future Grid environments, bandwidth will be sufficient for HPC applications • Large latencies (>100 ms) will remain an obstacle • Synchronous applications suffer from large latencies

  5. Available nodes change dynamically • Many HPC applications assume that the set of computing nodes is fixed • If applications support dynamically changing nodes, we can harness computing resources more efficiently!

  6. Goal of this work • An LU factorization algorithm that • Tolerates background processes & large latencies (by overlapping multiple iterations) • Supports dynamically changing nodes (via a data mapping for dynamically changing nodes) • Is written in the Phoenix model → A fast HPC application on non-dedicated clusters and the Grid

  7. Outline of this talk • The Phoenix model • Our LU Algorithm • Overlapping multiple iterations • Data mapping for dynamically changing nodes • Performance of our LU and HPL • Related work • Summary

  8. Phoenix model [Taura et al. 03] • A message passing model for dynamically changing environments • Concept of virtual nodes • Virtual nodes are the destinations of messages (figure: virtual nodes mapped onto physical nodes)
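  As a rough illustration of the virtual-node idea (this is not the Phoenix API; the table, the send() helper and the node counts are invented for the example), a sender addresses messages to virtual node IDs, and a routing table maps each virtual node to whichever physical node currently hosts it:

    # Illustrative sketch only: virtual nodes as message destinations.
    # The mapping table, send() helper and node counts are hypothetical.
    virtual_to_physical = {v: v % 4 for v in range(16)}   # 16 virtual nodes on 4 physical nodes

    def send(virtual_dst, payload):
        physical = virtual_to_physical[virtual_dst]
        # A real runtime would hand `payload` to the transport layer for `physical`.
        print(f"message for virtual node {virtual_dst} routed to physical node {physical}")

    # When physical nodes join or leave, entries of virtual_to_physical are
    # reassigned; senders keep addressing the same virtual nodes.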

  9. Overview of our LU • Like typical implementations, • Based on message passing • The matrix is decomposed into small blocks • A block is updated by its owner node • Unlike typical implementations, • Asynchronous data-driven style for overlapping multiple iterations • Cyclic-like data mapping for any & dynamically changing number of nodes • (Currently, pivoting is not performed)

  10. LU factorization

    for (k=0; k<B; k++) {
      Ak,k = fact(Ak,k);                      // diagonal block
      for (i=k+1; i<B; i++)
        Ai,k = update_L(Ai,k, Ak,k);          // L part
      for (j=k+1; j<B; j++)
        Ak,j = update_U(Ak,j, Ak,k);          // U part
      for (i=k+1; i<B; i++)
        for (j=k+1; j<B; j++)
          Ai,j = Ai,j - Ai,k x Ak,j;          // trail part
    }
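  The slide's pseudocode is a standard right-looking blocked LU without pivoting. Below is a minimal, sequential NumPy sketch of that computation for reference (fact/update_L/update_U are inlined; the block size nb is assumed to divide the matrix size). It is an illustration only, not the authors' distributed implementation:

    # Minimal sequential sketch of the blocked LU on the slide (no pivoting).
    # Assumes nb divides n and A needs no pivoting (e.g. diagonally dominant).
    import numpy as np

    def blocked_lu(A, nb):
        n = A.shape[0]
        B = n // nb
        blk = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]   # view of block Ai,j
        for k in range(B):
            # Diagonal block: unblocked in-place LU of Ak,k (the slide's fact)
            Akk = blk(k, k)
            for p in range(nb - 1):
                Akk[p+1:, p] /= Akk[p, p]
                Akk[p+1:, p+1:] -= np.outer(Akk[p+1:, p], Akk[p, p+1:])
            L_kk = np.tril(Akk, -1) + np.eye(nb)
            U_kk = np.triu(Akk)
            for i in range(k+1, B):      # L part (update_L): Ai,k <- Ai,k * Uk,k^-1
                blk(i, k)[:] = np.linalg.solve(U_kk.T, blk(i, k).T).T
            for j in range(k+1, B):      # U part (update_U): Ak,j <- Lk,k^-1 * Ak,j
                blk(k, j)[:] = np.linalg.solve(L_kk, blk(k, j))
            for i in range(k+1, B):      # trail part: Ai,j <- Ai,j - Ai,k * Ak,j
                for j in range(k+1, B):
                    blk(i, j)[:] -= blk(i, k) @ blk(k, j)
        return A   # strictly lower part holds L (unit diagonal), upper part holds U

    # Quick check:
    # A = np.random.rand(8, 8) + 8*np.eye(8); LU = blocked_lu(A.copy(), nb=2)
    # L = np.tril(LU, -1) + np.eye(8); U = np.triu(LU); assert np.allclose(L @ U, A)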

  11. Naïve implementation and its problem • Iterations are separated • Not tolerant to latencies/background processes! (figure: number of executable tasks over time for the k-th, (k+1)-th and (k+2)-th iterations, broken down into diagonal/L/U/trail tasks)

  12. Latency Hiding Techniques • Overlapping iterations hides latencies • Computation of the diagonal/L/U parts is advanced • If the computations of the trail parts are kept separate, only two adjacent iterations are overlapped • There is room for further improvement

  13. Overlapping multiple iterations for more tolerance • We overlap multiple iterations • by computing all blocks, including trail parts, asynchronously • A data-driven style & prioritized task scheduling are used

  14. Prioritized task scheduling • We assign a priority to the update task of each block • The k-th update of block Ai,j has a priority of min(i-S, j-S, k) (a smaller number means higher priority), where S is the desired overlap depth • We can control the overlapping by changing the value of S (see the sketch below)
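  A minimal sketch of how such a prioritized ready-queue could look; the heap, the task representation and the helper names are invented for the example, while the priority rule is the one from the slide:

    # Priority rule min(i-S, j-S, k): smaller value = higher priority.
    # Blocks within S of the diagonal are boosted so later panels can start
    # early; S controls how many iterations are overlapped.
    import heapq

    S = 5   # desired overlap depth

    def priority(i, j, k, s=S):
        return min(i - s, j - s, k)

    ready = []   # min-heap of (priority, block id, task)

    def push_task(i, j, k, task):
        heapq.heappush(ready, (priority(i, j, k), (i, j, k), task))

    def run_next():
        _, (i, j, k), task = heapq.heappop(ready)
        task()   # perform the k-th update of block Ai,j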

  15. Typical data mapping and its problem • Two-dimensional block-cyclic distribution (figure: matrix blocks distributed over nodes P0–P5) • Good load balance and small communication volume, but • The number of nodes must be fixed and factorable into two small numbers • How can dynamically changing nodes be supported?
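  For reference, the conventional mapping the slide criticizes fits in a few lines: with a fixed P x Q process grid, block (i, j) always lives on process (i mod P, j mod Q). The helper below is only illustrative.

    # Standard 2D block-cyclic owner computation (illustrative helper).
    def owner_2d_block_cyclic(i, j, P, Q):
        # Process grid is P x Q; returns the linear rank of the owning node.
        return (i % P) * Q + (j % Q)

    # Example with a 3 x 2 grid (6 nodes, like P0..P5 in the slide's figure):
    # owner_2d_block_cyclic(4, 5, P=3, Q=2)  -> 3, i.e. block A4,5 lives on P3
    # The grid shape (P, Q) is baked in, which is why the node count must be
    # fixed and factorable into two small numbers.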

  16. Our data mapping for dynamically changing nodes • Blocks are reassigned via a random permutation (figure: original matrix A00–A77 → random permutation → permuted matrix) • The permutation is common among all nodes
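  One plausible reading of this mapping, sketched below under stated assumptions (the shared seed, the helper names and the exact owner formula are not from the slides): every node derives the same pseudo-random permutation of block indices from a common seed, and the permuted block coordinates are then dealt out cyclically over however many nodes currently exist.

    # Hedged sketch of a permutation-based, cyclic-like mapping in the spirit
    # of the slide; not the authors' exact scheme.
    import random

    def make_permutation(num_block_rows, seed=42):
        perm = list(range(num_block_rows))
        random.Random(seed).shuffle(perm)   # identical on every node (shared seed)
        return perm

    def owner(i, j, perm, num_nodes):
        pi, pj = perm[i], perm[j]                    # permuted block coordinates
        return (pi * len(perm) + pj) % num_nodes     # cyclic over the current node count

    # Example: an 8x8 block matrix (A00..A77), first on 6 nodes, then on 7:
    # perm = make_permutation(8)
    # owner(3, 5, perm, 6), owner(3, 5, perm, 7)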

  17. Dynamically joining nodes • A new node sends a steal message to one of the existing nodes • The receiver abandons some of its virtual nodes and sends the corresponding blocks to the new node • The new node takes over those virtual nodes and blocks • For better load balance, the stealing process is repeated
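  A toy sketch of that joining protocol (message passing is replaced by direct calls; the Node class, the "give away half" policy and the number of stealing rounds are assumptions made only for illustration):

    # Toy model of a node joining by stealing virtual nodes and their blocks.
    import random

    class Node:
        def __init__(self, name, virtual_nodes=(), blocks=None):
            self.name = name
            self.virtual_nodes = set(virtual_nodes)
            self.blocks = dict(blocks or {})          # virtual node -> list of blocks

        def handle_steal(self, newcomer, fraction=0.5):
            # Abandon roughly `fraction` of my virtual nodes and hand over
            # the blocks mapped to them.
            give = list(self.virtual_nodes)[:max(1, int(len(self.virtual_nodes) * fraction))]
            for v in give:
                self.virtual_nodes.discard(v)
                newcomer.virtual_nodes.add(v)
                newcomer.blocks[v] = self.blocks.pop(v, [])

    def join(newcomer, existing_nodes, rounds=3):
        # Repeat the stealing from (here: randomly chosen) victims for better balance.
        for _ in range(rounds):
            random.choice(existing_nodes).handle_steal(newcomer)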

  18. Experimental environments (1) • 112-node IBM BladeCenter cluster • Dual 2.4GHz Xeon: 70 nodes + dual 2.8GHz Xeon: 42 nodes • 1 CPU per node is used • The slower CPUs (2.4GHz) determine the overall performance • Gigabit Ethernet

  19. Experimental environments (2) • High Performance Linpack (HPL) by Petitet et al. • GOTO BLAS by Kazushige Goto (UT-Austin) • Ours (S=0): no explicit overlap • Ours (S=1): overlap with an adjacent iteration • Ours (S=5): overlap multiple (5) iterations

  20. Scalability • Ours (S=5) achieves 190 GFlops with 108 nodes • 65 times speedup • Matrix size N=61440 • Block size NB=240 • Overlap depth S=0 or 5 (speedup annotations in the graph: x65, x72)

  21. Tolerance to background processes (1) • We run LU/HPL with background processes • We run 3 background processes per randomly chosen node • The background processes are short-term • They move to other random nodes every 10 secs

  22. Tolerance to background processes (2) • HPL slows down heavily • Ours (S=0) and Ours (S=1) also suffer • By overlapping multiple iterations (S=5), our LU becomes more tolerant! • 108 nodes for computation • N=46080 (slowdown annotations in the graph: -16%, -26%, -31%, -36%)

  23. Tolerance to large latencies (1) • We emulate the future Grid environment with high bandwidth & large latencies • Experiments are done on a cluster • Large latencies are emulated by software • +0ms, +200ms, +500ms

  24. Tolerance to large latencies (2) • S=0 suffers by 28% • Overlapping of iterations makes our LU more tolerant • Both S=1 and S=5 work well • 108 nodes for computation • N=46080 (slowdown annotations in the graph: -19%, -20%, -28%)

  25. Performance with joining nodes (1) • 16 nodes at first, then 48 nodes are added dynamically (64 in total)

  26. Performance with joining nodes (2) • Flexibility in the number of nodes is useful for obtaining higher performance (x1.9 faster in the graph) • Compared with Fixed-64, Dynamic suffers from migration overhead etc. • N=30720 • S=5

  27. Related Work • Dyn-MPI [Weatherly et al. 03]: an extended MPI library that supports dynamically changing nodes

  28. Summary • An LU implementation suitable for non-dedicated clusters and the Grid • Scalable • Supports dynamically changing nodes • Tolerates background processes & large latencies

  29. Future Work • Perform pivoting • More data dependencies are introduced • Is our LU still tolerant? • Improve dynamic load balancing • Choose better target nodes for stealing • Take CPU speeds into account • Apply our approach to other HPC applications • CFD applications

  30. Thank you!
