1 / 115

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator. Kengo Nakajima nakajima@cc.u-tokyo.ac.jp Supercomputing Division, Information Technology Center, The University of Tokyo Japan Agency for Marine-Earth Science and Technology. April 22, 2008. Overview.

jaron
Download Presentation

Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator Kengo Nakajima nakajima@cc.u-tokyo.ac.jp Supercomputing Division, Information Technology Center, The University of Tokyo Japan Agency for Marine-Earth Science and Technology April 22, 2008

  2. 08-APR22 Overview • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  3. Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%) 08-APR22

  4. Scalar ProcessorsCPU-Cache-Memory: Hierarchical Structure CPU Register FAST Small Capacity (MB) Expensive Large (100M of Transistors) Cache SLOW Main Memory Large Capacity (GB) Cheap 08-APR22

  5. Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%) • Vector Processor • Very high sustained/peak performance ratio • e.g. 35 % for FEM applications on the Earth Simulator • requires … • very special tuning • sufficiently long loops (= large-scale problem size) for certain performance • Suitable for simple computation. 08-APR22

  6. Vector ProcessorsVector Register & Fast Memory • Parallel Processing for Simple DO Loops. • Suitable for Simple & Large Computation Vector Processor do i= 1, N A(i)= B(i) + C(i) enddo Vector Register Very FAST Main Memory 08-APR22

  7. Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%) • Vector Processor • Very high sustained/peak performance ratio • e.g. 35 % for FEM applications on the Earth Simulator • requires … • very special tuning • sufficiently long loops (= large-scale problem size) for certain performance • Suitable for simple computation. 08-APR22

  8. IBM SP3 (LBNL) Hitachi SR11000/J2 (U.Tokyo) 1.50 9.20 0.623 8.00 .072-.076 (4.8-5.1) .880-.973 (9.6-10.6) .122 (8.1) 1.34 (14.5) 0.413 0.870 1.00 12.0 16.3 4.7 * Earth Simulator Hitachi SR8000/MPP (U.Tokyo) Peak Performance GFLOPS 8.00 1.80 Measured Memory BW (STREAM) (GB/sec/PE) 26.6 2.85 Estimated Performance/Core GFLOPS (% of peak) 2.31-3.24 (28.8-40.5) .291-.347 (16.1-19.3) Measured Performance/Core 2.93 (36.6) .335 (18.6) BYTE/FLOP 3.325 1.583 Comm. BW (GB/sec/Node) 12.3 1.60 * IBM p595 J.T.Carter et al. MPI Latency (msec) 5.6-7.7 6-20 08-APR22

  9. Typical Behavior … 40 % of peak 8 % of peak IBM-SP3: Performance is good for small problems due to cache effect. Earth Simulator: Performance is good for large scale problems due to long vector length. 08-APR22

  10. Parallel ComputingStrong Scaling (Fixed Prob. Size) Ideal Ideal Performance Performance PE# PE# IBM-SP3: Super-scalar effect for small number of PE’s. Performance decreases for many PE’s due to comm. overhead. Earth Simulator: Performance decreases for many PE’s due to comm. overhead and small vector length. 08-APR22

  11. Improvement of Memory Performance on IBM SP3 ⇒ Hitachi SR11000/J2 Memory Performance (BWDTH, latency etc.) ■ Flat-MPI/DCRS □ Hybrid/DCRS 2.3 GHz 8.0 GB/sec 18M L3 cache/PE IBM SP-3 POWER3 375 MHz 1.0 GB/sec 8M L2 cache/PE SR11000/J2 POWER5+ 08-APR22

  12. 08-APR22 My History with Vector Computers • Cray-1S (1985-1988) • Mitsubishi Research Institute (MRI) • Cray-YMP (1988-1995) • MRI, University of Texas at Austin • Fujitsu VP, VPP series (1985-1999) • JAERI, PNC • NEC SX-4 (1997-2001) • CCSE/JAERI • Hitachi SR2201 (1997-2004) • University of Tokyo, CCSE/JAERI • Hitachi SR8000 (2000-2007) • University of Tokyo • Earth Simulator (2002-)

  13. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  14. 08-APR22 GeoFEM: FY.1998-2002http://geofem.tokyo.rist.or.jp/ • Parallel FEM platform for solid earth simulation. • parallel I/O, parallel linear solvers, parallel visualization • solid earth: earthquake, plate deformation, mantle/core convection, etc. • Part of national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator. • Strong collaborations between HPC and natural science (solid earth) communities.

  15. 08-APR22 System Configuration ofGeoFEM

  16. 08-APR22 Results on Solid Earth Simulation Complicated Plate Model around Japan Islands Magnetic Field of the Earth : MHD code Dh=5.00 Dh=1.25 Simulation of Earthquake Generation Cycle in Southwestern Japan T=100 T=200 T=300 T=400 T=500 Transportation by Groundwater Flow through Heterogeneous Porous Media TSUNAMI !!

  17. 08-APR22 Results by GeoFEM

  18. 08-APR22 Features of FEM applications (1/2) • Local “element-by-element” operations • sparse coefficient matrices • suitable for parallel computing • HUGE “indirect” accesses • IRREGULAR sparse matrices • memory intensive do i= 1, N jS= index(i-1)+1 jE= index(i) do j= jS, jE in = item(j) Y(i)= Y(i) + AMAT(j)*X(in) enddo enddo

  19. 08-APR22 Features of FEM applications (2/2) • In parallel computation … • comm. with ONLY neighbors (except “dot products” etc.) • amount of messages are relatively small because only values on domain-boundary are exchanged. • communication (MPI) latency is critical

  20. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  21. 08-APR22 Earth Simulator (ES)http://www.es.jamstec.go.jp/ • 640×8= 5,120 Vector Processors • SMP Cluster-Type Architecture • 8 GFLOPS/PE • 64 GFLOPS/Node • 40 TFLOPS/ES • 16 GB Memory/Node, 10 TB/ES • 640×640 Crossbar Network • 16 GB/sec×2 • Memory BWTH with 32 GB/sec. • 35.6 TFLOPS for LINPACK (2002-March) • 26 TFLOPS for AFES (Climate Simulation)

  22. 08-APR22 Motivations • GeoFEM Project (FY.1998-2002) • FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES) • Implicit Linear Solvers • Hybrid vs. Flat MPI Parallel Programming Model

  23. 08-APR22 Memory Memory P E P E P E P E P E P E P E P E Memory Memory P E P E P E P E P E P E P E P E Flat MPI vs. Hybrid Flat-MPI:Each PE -> Independent Hybrid:Hierarchal Structure

  24. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  25. 08-APR22 Direct/Iterative Methods for Linear Equations • Direct Methods • Gaussian Elimination/LU Factorization. • compute A-1 directly. • Robust for wide range of applications. • More expensive than iterative methods (memory, CPU) • Not suitable for parallel and vector computation due to its global operations. • Iterative Methods • CG, GMRES, BiCGSTAB • Less expensive than direct methods, especially in memory. • Suitable for parallel and vector computing. • Convergence strongly depends on problems, boundary conditions (condition number etc.) • Preconditioning is required

  26. 08-APR22 Preconditioning for Iterative Methods • Convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A. • A preconditioner M transforms the linear system into one with more favorable spectral properties • In "ill-conditioned" problems, "condition number" (ratio of max/min eigenvalue if A is symmetric) is large. • M transforms original equation Ax=b into A'x=b' where A'=M-1A, b'=M-1b • ILU (Incomplete LU Factorization) or IC (Incomplete Cholesky Factorization) are well-known preconditioners.

  27. 08-APR22 Strategy in GeoFEM • Iterative method is the ONLY choice for large-scale parallel computing. • Preconditioning is important • general methods, such as ILU(0)/IC(0), cover wide range of applications. • problem specific methods.

  28. 08-APR22 • Background • Vector/Scalar Processors • GeoFEM Project • Earth Simulator • Preconditioned Iterative Linear Solvers • Optimization Strategy on the Earth Simulator • BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications • Matrix Assembling • Summary & Future Works

  29. 08-APR22 Block IC(0)-CG Solver on the Earth Simulator • 3D Linear Elastic Problems (SPD) • Parallel Iterative Linear Solver • Node-based Local Data Structure • Conjugate-Gradient Method (CG): SPD • Localized Block IC(0) Preconditioning (Block Jacobi) • Modified IC(0): Non-diagonal components of original [A] are kept • Additive Schwartz Domain Decomposition(ASDD) • http://geofem.tokyo.rist.or.jp/ • Hybrid Parallel Programming Model • OpenMP+MPI • Re-Ordering for Vector/Parallel Performance • Comparison with Flat MPI

  30. 08-APR22 Memory Memory P E P E P E P E P E P E P E P E Memory Memory P E P E P E P E P E P E P E P E Flat MPI vs. Hybrid Flat-MPI:Each PE -> Independent Hybrid:Hierarchal Structure

  31. 08-APR22 PE#1 PE#1 PE#0 4 5 6 12 15 6 7 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25 PE#0 1 11 2 3 14 13 4 5 17 17 17 18 18 18 19 19 19 16 16 16 20 20 20 10 1 2 3 7 8 9 10 12 12 12 13 13 13 14 14 14 11 10 12 11 11 11 15 15 15 8 9 11 12 10 9 11 12 7 7 7 8 8 8 9 9 9 6 6 6 10 10 10 5 9 6 3 8 8 6 4 4 5 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 PE#3 PE#2 1 2 7 7 1 2 3 PE#3 PE#2 Local Data StructureNode-based Partitioninginternal nodes - elements - external nodes

  32. 08-APR22 1 SMP node => 1 domain for Hybrid Programming ModelMPI communication among domains

  33. 08-APR22 Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis • Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP node and vector processors.

  34. 08-APR22 ILU(0)/IC(0) Factorization do i= 2, n do k= 1, i-1 if ((i,k)∈ NonZero(A)) then aik := aik/akk endif do j= k+1, n if ((i,j)∈ NonZero(A)) then aij := aij - aik*akj endif enddo enddo enddo

  35. 08-APR22 ILU/IC Preconditioning M = (L+D)D-1(D+U) Mz= r D-1(D+U)= z1 Forward Substitution (L+D)z1= r : z1= D-1(r-Lz1) Backward Substitution (I+ D-1 U)zNEW = z1 z= z - D-1Uz need to solve this equation

  36. 08-APR22 ILU/IC Preconditioning do i= 1, N WVAL= R(i) do j= 1, INL(i) WVAL= WVAL - AL(i,j) * Z(IAL(i,j)) enddo Z(i)= WVAL / D(i) enddo do i= N, 1, -1 SW = 0.0d0 do j= 1, INU(i) SW= SW + AU(i,j) * Z(IAU(i,j)) enddo Z(i)= Z(i) – SW / D(i) enddo Forward Substitution (L+D)z= r : z= D-1(r-Lz) M =(L+D)D-1(D+U) L,D,U: A Dependency… You need the most recent value of “z” of connected nodes. Vectorization/parallelization is difficult. Backward Substitution (I+ D-1 U)znew= zold : z= z - D-1Uz

  37. 08-APR22 Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis • Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP node and vector processors. • Re-Ordering for Highly Parallel/Vector Performance • Local operation and no global dependency • Continuous memory access • Sufficiently long loops for vectorization

  38. 08-APR22 ILU/IC Preconditioning do i= 1, N WVAL= R(i) do j= 1, INL(i) WVAL= WVAL - AL(i,j) * Z(IAL(i,j)) enddo Z(i)= WVAL / D(i) enddo do i= N, 1, -1 SW = 0.0d0 do j= 1, INU(i) SW= SW + AU(i,j) * Z(IAU(i,j)) enddo Z(i)= Z(i) – SW / D(i) enddo Forward Substitution (L+D)z= r : z= D-1(r-Lz) M =(L+D)D-1(D+U) L,D,U: A Dependency… You need the most recent value of “z” of connected nodes. Vectorization/parallelization is difficult. Reordering: Directly connected nodes do not appear in RHS. Backward Substitution (I+ D-1 U)znew= zold : z= z - D-1Uz

  39. 08-APR22 Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis • Explicit ordering is required for unstructured grids in order to achieve higher performance in factorization processes on SMP node and vector processors. • Re-Ordering for Highly Parallel/Vector Performance • Local operation and no global dependency • Continuous memory access • Sufficiently long loops for vectorization • 3-Way Parallelism for Hybrid Parallel Programming • Inter Node : MPI • Intra Node : OpenMP • Individual PE : Vectorization

  40. 08-APR22 Vector SMP Parallel Vector SMP Parallel These processes can be substituted by traditional multi-coloring (MC). Re-Ordering Technique for Vector/Parallel ArchitecturesCyclic DJDS(RCM+CMC) Re-Ordering(Doi, Washio, Osoda and Maruyama (NEC)) 1. RCM (Reverse Cuthil-Mckee) 2. CMC (Cyclic Multicolor) 3. DJDS re-ordering 4. Cyclic DJDS for SMP unit

  41. 08-APR22 Reordering = Coloring • COLOR: Unit of independent sets. • Elements grouped in the same “color” are independent from each other, thus parallel/vector operation is possible. • Many colors provide faster convergence, but shorter vector length.: Trade-Off !! RCM (Reverse CM) Red-Black (2 colors) 4 colors

  42. 08-APR22 2D-Storage long vector length, many ZERO’s 1D-Storage (CRS) memory saved, short vector length Large Scale Sparse Matrix Storage for Unstructured Grids

  43. 08-APR22 Re-Ordering in each Coloraccording to Non-Zero Off-Diag. Component # Elements on the same color are independent, thereforeintra-hyperplane re-ordering does not affect results DJDS : Descending-Order Jagged Diagonal Storage

  44. 08-APR22 iS+1 iv0+1 iE Cyclic DJDS (MC/CM-RCM)Cyclic Re-Ordering for SMP unitsLoad-balancing among PEs do iv= 1, NCOLORS !$omp parallel do do ip= 1, PEsmpTOT iv0= STACKmc(PEsmpTOT*(iv-1)+ip- 1) do j= 1, NLhyp(iv) iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1) iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip ) !cdir nodep do i= iv0+1, iv0+iE-iS k= i+iS - iv0 kk= IAL(k) (Important Computations) enddo enddo enddo enddo npLX1= NLmax * PEsmpTOT INL(0:NLmax*PEsmpTOT*NCOLORS)

  45. 08-APR22 Difference betweenFlat MPI & Hybrid • Most of the efforts of re-ordering are for vectorization. • If you have a long vector, just divide it and distribute the segments to PEs on SMP nodes. • Source codes of Hybrid and Flat MPI are not so different. • Flat MPI corresponds to Hybrid where PE/SMP node=1. • In other words, Flat MPI code is sufficiently complicated.

  46. 08-APR22 Cyclic DJDS (MC/CM-RCM)for Forward/Backward Substitution in BILU Factorization

  47. 08-APR22 z Uniform Distributed Force in z-dirrection @ z=Zmin Uy=0 @ y=Ymin Ux=0 @ x=Xmin Nz-1 Ny-1 y Uz=0 @ z=Zmin Nx-1 x Simple 3D Cubic Model

  48. 08-APR22 Effect of Ordering

  49. 08-APR22 Long Loops Continuous Access Short Loops Irregular Access Short Loops Continuous Access Effect of Re-Ordering

  50. 08-APR22 Matrix Storage, Loops do iv= 1, NVECT iv0= STACKmc(iv-1) do j= 1, NLhyp(iv) iS= index_L(NL*(iv-1)+ j-1) iE= index_L(NL*(iv-1)+ j ) do i= iv0+1, iv0+iE-iS k= i+iS - iv0 kk= item_L(k) Z(i)= Z(i) - AL(k)*Z(kk) enddo iS= STACKmc(iv-1) + 1 iE= STACKmc(iv ) do i= iS, iE Z(i)= Z(i)/DD(i) enddo enddo enddo DJDS • DJDS (Descending order Jagged Diagonal Storage) with long innermost loops is suitable for vector processors. DCRS do i= 1, N SW= WW(i,Z) isL= index_L(i-1)+1 ieL= index_L(i) do j= isL, ieL k= item_L(j) SW= SW - AL(j)*Z(k) enddo Z(i)= SW/DD(i) enddo • Reduction type loop of DCRS is more suitable for cache-based scalar processor because of its localized operation.

More Related