Preconditioned Iterative Linear Solvers for Unstructured Grids on the Earth Simulator
Kengo Nakajima (nakajima@cc.u-tokyo.ac.jp)
Supercomputing Division, Information Technology Center, The University of Tokyo
Japan Agency for Marine-Earth Science and Technology
April 22, 2008
Overview
• Background
  • Vector/Scalar Processors
  • GeoFEM Project
  • Earth Simulator
• Preconditioned Iterative Linear Solvers
• Optimization Strategy on the Earth Simulator
• BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
• Matrix Assembling
• Summary & Future Work
Large-Scale Computing • Scalar Processor • Big gap between clock rate and memory bandwidth. • Very low sustained/peak performance ratio (<10%)
Scalar Processors: CPU-Cache-Memory Hierarchical Structure • CPU registers and cache: FAST, small capacity (MB), expensive, large (100M+ transistors) • Main memory: SLOW, large capacity (GB), cheap
Large-Scale Computing
• Scalar Processor
  • Big gap between clock rate and memory bandwidth.
  • Very low sustained/peak performance ratio (<10%)
• Vector Processor
  • Very high sustained/peak performance ratio, e.g. 35% for FEM applications on the Earth Simulator
  • Requires very special tuning and sufficiently long loops (= large problem size) to reach that performance
  • Suitable for simple computation
Vector Processors: Vector Register & Fast Memory
• Parallel processing of simple DO loops.
• Suitable for simple & large computation.
• Vector registers and very fast main memory.

  do i= 1, N
    A(i)= B(i) + C(i)
  enddo
Platform comparison:

                                          Earth Simulator  Hitachi SR8000/MPP  IBM SP3 (LBNL)  Hitachi SR11000/J2
                                                           (U.Tokyo)                           (U.Tokyo)
Peak Performance (GFLOPS/PE)              8.00             1.80                1.50            9.20
Measured Memory BW, STREAM (GB/sec/PE)    26.6             2.85                0.623           8.00
Estimated Perf./Core, GFLOPS (% of peak)  2.31-3.24        .291-.347           .072-.076       .880-.973
                                          (28.8-40.5)      (16.1-19.3)         (4.8-5.1)       (9.6-10.6)
Measured Perf./Core, GFLOPS (% of peak)   2.93 (36.6)      .335 (18.6)         .122 (8.1)      1.34 (14.5)
BYTE/FLOP                                 3.325            1.583               0.413           0.870
Comm. BW (GB/sec/Node)                    12.3             1.60                1.00            12.0
MPI Latency (μsec)                        5.6-7.7          6-20                16.3            4.7 *

* IBM p595, J.T. Carter et al.
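The memory-bandwidth entries above are STREAM-type measurements. As a rough, hedged illustration only (not the actual STREAM benchmark; the array size and timer here are arbitrary choices), such a measurement boils down to timing a long "triad" loop:

  program triad_bw
    ! Illustrative STREAM-like "triad" kernel for sustained memory bandwidth.
    implicit none
    integer, parameter :: n = 20000000            ! large enough to defeat any cache
    double precision, allocatable :: a(:), b(:), c(:)
    double precision :: s, sec, bw
    integer(8) :: t0, t1, rate
    integer :: i

    allocate(a(n), b(n), c(n))
    b= 1.0d0;  c= 2.0d0;  s= 3.0d0

    call system_clock(t0, rate)
    do i= 1, n                                    ! simple, long, vectorizable loop
      a(i)= b(i) + s*c(i)
    enddo
    call system_clock(t1)

    sec= dble(t1-t0)/dble(rate)
    bw = 3.0d0*8.0d0*dble(n)/sec/1.0d9            ! 3 arrays x 8 bytes each, in GB/sec
    write(*,'(a,f8.2,a)') ' triad bandwidth: ', bw, ' GB/sec'
    write(*,'(a,f6.2)')   ' check value   : ', a(1)   ! keep the loop from being optimized away
    deallocate(a, b, c)
  end program triad_bw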
Typical Behavior (performance vs. problem size; roughly 40% of peak on the Earth Simulator, 8% of peak on the IBM SP3). IBM SP3: performance is good for small problems due to the cache effect. Earth Simulator: performance is good for large-scale problems due to long vector length.
Parallel Computing: Strong Scaling (Fixed Problem Size). (Plots: performance vs. number of PEs, compared with the ideal line.) IBM SP3: super-linear speed-up (cache effect) for small numbers of PEs; performance decreases for many PEs due to communication overhead. Earth Simulator: performance decreases for many PEs due to communication overhead and short vector length.
Improvement of Memory Performance: IBM SP3 ⇒ Hitachi SR11000/J2 (bandwidth, latency, etc.). (Plots: ■ Flat-MPI/DCRS, □ Hybrid/DCRS.) IBM SP3 (POWER3): 375 MHz, 1.0 GB/sec, 8 MB L2 cache/PE. SR11000/J2 (POWER5+): 2.3 GHz, 8.0 GB/sec, 18 MB L3 cache/PE.
My History with Vector Computers • Cray-1S (1985-1988): Mitsubishi Research Institute (MRI) • Cray-YMP (1988-1995): MRI, University of Texas at Austin • Fujitsu VP, VPP series (1985-1999): JAERI, PNC • NEC SX-4 (1997-2001): CCSE/JAERI • Hitachi SR2201 (1997-2004): University of Tokyo, CCSE/JAERI • Hitachi SR8000 (2000-2007): University of Tokyo • Earth Simulator (2002-)
Background: GeoFEM Project
GeoFEM: FY.1998-2002 (http://geofem.tokyo.rist.or.jp/) • Parallel FEM platform for solid earth simulation: parallel I/O, parallel linear solvers, parallel visualization. • Solid earth: earthquake, plate deformation, mantle/core convection, etc. • Part of a national project by STA/MEXT for large-scale earth science simulations using the Earth Simulator. • Strong collaboration between the HPC and natural science (solid earth) communities.
System Configuration of GeoFEM
Results on Solid Earth Simulation (examples): complicated plate model around the Japan Islands; magnetic field of the Earth (MHD code); simulation of the earthquake generation cycle in southwestern Japan (tsunami); transportation by groundwater flow through heterogeneous porous media.
Results by GeoFEM
Features of FEM applications (1/2)
• Local "element-by-element" operations: sparse coefficient matrices, suitable for parallel computing.
• HUGE "indirect" accesses: IRREGULAR sparse matrices, memory intensive.

  do i= 1, N                            ! sparse matrix-vector product (CRS)
    jS= index(i-1)+1
    jE= index(i)
    do j= jS, jE
      in= item(j)                       ! indirect access to the column index
      Y(i)= Y(i) + AMAT(j)*X(in)
    enddo
  enddo
Features of FEM applications (2/2) • In parallel computation … • communication with ONLY neighbors (except for "dot products" etc.) • the amount of data per message is relatively small, because only values on the domain boundary are exchanged • communication (MPI) latency is critical
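As a hedged sketch of this neighbor-only communication pattern (not the actual GeoFEM interface; the argument list and buffer layout are illustrative assumptions), the boundary exchange is just a set of non-blocking sends and receives to the neighboring domains:

  subroutine halo_exchange(maxlen, NEIBPETOT, NEIBPE, ns, nr, sendbuf, recvbuf)
    ! Exchange boundary values with neighboring domains only (point-to-point).
    use mpi
    implicit none
    integer, intent(in) :: maxlen                          ! maximum message length
    integer, intent(in) :: NEIBPETOT                       ! number of neighboring domains
    integer, intent(in) :: NEIBPE(NEIBPETOT)               ! MPI ranks of the neighbors
    integer, intent(in) :: ns(NEIBPETOT), nr(NEIBPETOT)    ! send/recv lengths per neighbor
    double precision, intent(in)  :: sendbuf(maxlen, NEIBPETOT)
    double precision, intent(out) :: recvbuf(maxlen, NEIBPETOT)
    integer :: req(2*NEIBPETOT), istat(MPI_STATUS_SIZE, 2*NEIBPETOT)
    integer :: k, nreq, ierr

    nreq= 0
    do k= 1, NEIBPETOT                                     ! post receives first
      nreq= nreq + 1
      call MPI_Irecv(recvbuf(1,k), nr(k), MPI_DOUBLE_PRECISION, NEIBPE(k), 0, MPI_COMM_WORLD, req(nreq), ierr)
    enddo
    do k= 1, NEIBPETOT                                     ! then send boundary values
      nreq= nreq + 1
      call MPI_Isend(sendbuf(1,k), ns(k), MPI_DOUBLE_PRECISION, NEIBPE(k), 0, MPI_COMM_WORLD, req(nreq), ierr)
    enddo
    call MPI_Waitall(nreq, req, istat, ierr)               ! only small boundary data moves
  end subroutine halo_exchange

Because only boundary values travel, the messages stay small and latency dominates, which is why MPI latency appears in the hardware table earlier.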
Background: Earth Simulator
Earth Simulator (ES) (http://www.es.jamstec.go.jp/) • 640×8 = 5,120 vector processors • SMP cluster-type architecture • 8 GFLOPS/PE, 64 GFLOPS/node, 40 TFLOPS total • 16 GB memory/node, 10 TB total • 640×640 crossbar network, 16 GB/sec × 2 • Memory bandwidth: 32 GB/sec per PE • 35.6 TFLOPS for LINPACK (March 2002) • 26 TFLOPS for AFES (climate simulation)
Motivations • GeoFEM Project (FY.1998-2002) • FEM-type applications with complicated unstructured grids (not LINPACK, FDM …) on the Earth Simulator (ES) • Implicit Linear Solvers • Hybrid vs. Flat MPI Parallel Programming Model
Flat MPI vs. Hybrid (schematic: SMP nodes, each with its own memory and 8 PEs) • Flat MPI: each PE is an independent MPI process. • Hybrid: hierarchical structure; one MPI process per SMP node, with threads inside the node.
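A minimal hedged sketch of the hybrid model (illustrative only, not the GeoFEM source): one MPI process per SMP node with OpenMP threads inside it. Removing the OpenMP directive and running one MPI process per PE gives the flat MPI model.

  program hybrid_sketch
    use mpi
    implicit none
    integer, parameter :: n = 1000000
    double precision, allocatable :: x(:), y(:)
    integer :: i, ierr, myrank

    call MPI_Init(ierr)                          ! inter-node parallelism: MPI
    call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
    allocate(x(n), y(n))
    x= 1.0d0;  y= 2.0d0

  !$omp parallel do                              ! intra-node parallelism: OpenMP threads
    do i= 1, n                                   ! each PE vectorizes this simple loop
      y(i)= y(i) + 2.0d0*x(i)
    enddo
  !$omp end parallel do

    if (myrank == 0) write(*,*) 'y(1)=', y(1)
    deallocate(x, y)
    call MPI_Finalize(ierr)
  end program hybrid_sketch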
Preconditioned Iterative Linear Solvers
Direct/Iterative Methods for Linear Equations
• Direct Methods
  • Gaussian elimination / LU factorization: solve Ax=b exactly (conceptually, applying A⁻¹); see the sketch below.
  • Robust for a wide range of applications.
  • More expensive than iterative methods (memory, CPU).
  • Not suitable for parallel and vector computation because of their global operations.
• Iterative Methods
  • CG, GMRES, BiCGSTAB, …
  • Less expensive than direct methods, especially in memory.
  • Suitable for parallel and vector computing.
  • Convergence strongly depends on the problem and boundary conditions (condition number etc.).
  • Preconditioning is required.
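A hedged sketch of what a direct method does (dense storage, no pivoting, purely for brevity; this is not how a production sparse direct solver is written):

  subroutine lu_solve(n, A, b, x)
    ! Direct solve: in-place LU factorization followed by forward/backward substitution.
    implicit none
    integer, intent(in) :: n
    double precision, intent(inout) :: A(n,n)      ! overwritten by the L and U factors
    double precision, intent(in)    :: b(n)
    double precision, intent(out)   :: x(n)
    double precision :: y(n)
    integer :: i, j, k

    do k= 1, n-1                                   ! factorization A = LU
      do i= k+1, n
        A(i,k)= A(i,k)/A(k,k)
        do j= k+1, n
          A(i,j)= A(i,j) - A(i,k)*A(k,j)
        enddo
      enddo
    enddo
    do i= 1, n                                     ! forward substitution: L y = b
      y(i)= b(i) - dot_product(A(i,1:i-1), y(1:i-1))
    enddo
    do i= n, 1, -1                                 ! backward substitution: U x = y
      x(i)= (y(i) - dot_product(A(i,i+1:n), x(i+1:n)))/A(i,i)
    enddo
  end subroutine lu_solve

The factorization sweeps over the whole matrix (a global operation) and, for a dense matrix, costs O(n³) work and O(n²) memory, which is the expense referred to above.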
Preconditioning for Iterative Methods
• The convergence rate of iterative solvers strongly depends on the spectral properties (eigenvalue distribution) of the coefficient matrix A.
• A preconditioner M transforms the linear system into one with more favorable spectral properties.
• In "ill-conditioned" problems the "condition number" (ratio of max to min eigenvalue if A is symmetric) is large.
• M transforms the original equation Ax=b into A'x=b', where A'=M⁻¹A and b'=M⁻¹b.
• ILU (Incomplete LU factorization) and IC (Incomplete Cholesky factorization) are well-known preconditioners.
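A hedged sketch of preconditioned CG showing where M⁻¹ enters the iteration (dense storage and a simple diagonal M are used here only to keep the example short; the solvers in this talk use sparse storage and IC(0)/ILU(0) for M):

  subroutine pcg_diag(n, A, b, x, tol, maxit)
    ! Preconditioned CG for SPD A with M = diag(A) (Jacobi preconditioner).
    ! The line z = r/d is the only place the preconditioner appears; an IC(0)/ILU(0)
    ! preconditioner replaces it by a forward/backward substitution.
    implicit none
    integer, intent(in) :: n, maxit
    double precision, intent(in)    :: A(n,n), b(n), tol
    double precision, intent(inout) :: x(n)
    double precision :: r(n), z(n), p(n), q(n), d(n)
    double precision :: rho, rho_old, alpha, beta, bnrm
    integer :: i, it

    do i= 1, n
      d(i)= A(i,i)                       ! M = diag(A)
    enddo
    r= b - matmul(A, x)
    bnrm= sqrt(dot_product(b, b))
    rho_old= 1.0d0
    p= 0.0d0

    do it= 1, maxit
      z= r/d                             ! z = M^{-1} r
      rho= dot_product(r, z)
      beta= rho/rho_old
      p= z + beta*p
      q= matmul(A, p)
      alpha= rho/dot_product(p, q)
      x= x + alpha*p
      r= r - alpha*q
      if (sqrt(dot_product(r, r)) <= tol*bnrm) exit
      rho_old= rho
    enddo
  end subroutine pcg_diag

Swapping the diagonal scaling for the IC(0) forward/backward substitution shown later in this talk gives the basic structure of the BIC(0)-CG solver.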
Strategy in GeoFEM • Iterative methods are the ONLY choice for large-scale parallel computing. • Preconditioning is important: • general methods, such as ILU(0)/IC(0), cover a wide range of applications • problem-specific methods
Optimization Strategy on the Earth Simulator: BIC(0)-CG Solvers for Simple 3D Linear Elastic Applications
Block IC(0)-CG Solver on the Earth Simulator
• 3D linear elastic problems (SPD)
• Parallel iterative linear solver
  • Node-based local data structure
  • Conjugate Gradient method (CG) for SPD matrices
  • Localized Block IC(0) preconditioning (block Jacobi)
  • Modified IC(0): off-diagonal components of the original [A] are kept
  • Additive Schwarz Domain Decomposition (ASDD)
  • http://geofem.tokyo.rist.or.jp/
• Hybrid parallel programming model: OpenMP + MPI
• Re-ordering for vector/parallel performance
• Comparison with flat MPI
Local Data Structure: Node-based Partitioning (internal nodes, elements, external nodes). (Diagram: a small 2D mesh partitioned into four domains, PE#0-PE#3, with local node numbering in each domain.)
1 SMP node => 1 domain for the Hybrid Programming Model; MPI communication among domains.
Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis: explicit ordering is required for unstructured grids in order to achieve high performance in the factorization processes on SMP nodes and vector processors.
ILU(0)/IC(0) Factorization

  do i= 2, n
    do k= 1, i-1
      if ((i,k) ∈ NonZero(A)) then
        aik := aik / akk
      endif
      do j= k+1, n
        if ((i,j) ∈ NonZero(A)) then
          aij := aij - aik*akj
        endif
      enddo
    enddo
  enddo
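A hedged concrete version of this pseudocode (a dense array plus an explicit non-zero pattern mask, purely for illustration; a real implementation works on the sparse storage directly):

  subroutine ilu0_pattern(n, a, nz)
    ! ILU(0)-style incomplete factorization: the loops follow the pseudocode above,
    ! but fill-in is suppressed by updating only positions that are non-zero in the
    ! original pattern 'nz', which is what "(i,j) in NonZero(A)" means.
    implicit none
    integer, intent(in) :: n
    double precision, intent(inout) :: a(n,n)    ! overwritten by the incomplete L/U factors
    logical, intent(in) :: nz(n,n)               ! original sparsity pattern of A
    integer :: i, j, k

    do i= 2, n
      do k= 1, i-1
        if (nz(i,k)) then
          a(i,k)= a(i,k)/a(k,k)
          do j= k+1, n
            if (nz(i,j)) a(i,j)= a(i,j) - a(i,k)*a(k,j)
          enddo
        endif
      enddo
    enddo
  end subroutine ilu0_pattern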
ILU/IC Preconditioning
M = (L+D)D⁻¹(D+U)
The equation Mz = r must be solved in every iteration, in two steps:
• Forward substitution: (L+D)z1 = r, i.e. z1 = D⁻¹(r - Lz1)
• Backward substitution: D⁻¹(D+U)z = z1, i.e. (I + D⁻¹U)z = z1, so z = z1 - D⁻¹Uz
ILU/IC Preconditioning
M = (L+D)D⁻¹(D+U), where L, D, U come from A.
Dependency: you need the most recent value of "z" at connected nodes, so vectorization/parallelization is difficult.

Forward substitution, (L+D)z = r : z = D⁻¹(r - Lz)
  do i= 1, N
    WVAL= R(i)
    do j= 1, INL(i)
      WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
    enddo
    Z(i)= WVAL / D(i)
  enddo

Backward substitution, (I + D⁻¹U)z_new = z_old : z = z - D⁻¹Uz
  do i= N, 1, -1
    SW= 0.0d0
    do j= 1, INU(i)
      SW= SW + AU(i,j) * Z(IAU(i,j))
    enddo
    Z(i)= Z(i) - SW / D(i)
  enddo
Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis: explicit ordering is required for unstructured grids in order to achieve high performance in the factorization processes on SMP nodes and vector processors. • Re-Ordering for Highly Parallel/Vector Performance • Local operation and no global dependency • Continuous memory access • Sufficiently long loops for vectorization
ILU/IC Preconditioning (with re-ordering)
M = (L+D)D⁻¹(D+U), where L, D, U come from A.
Dependency: you need the most recent value of "z" at connected nodes, so vectorization/parallelization is difficult.
Reordering: directly connected nodes do not appear on the right-hand side.

Forward substitution, (L+D)z = r : z = D⁻¹(r - Lz)
  do i= 1, N
    WVAL= R(i)
    do j= 1, INL(i)
      WVAL= WVAL - AL(i,j) * Z(IAL(i,j))
    enddo
    Z(i)= WVAL / D(i)
  enddo

Backward substitution, (I + D⁻¹U)z_new = z_old : z = z - D⁻¹Uz
  do i= N, 1, -1
    SW= 0.0d0
    do j= 1, INU(i)
      SW= SW + AU(i,j) * Z(IAU(i,j))
    enddo
    Z(i)= Z(i) - SW / D(i)
  enddo
Basic Strategy for Parallel Programming on the Earth Simulator • Hypothesis: explicit ordering is required for unstructured grids in order to achieve high performance in the factorization processes on SMP nodes and vector processors. • Re-Ordering for Highly Parallel/Vector Performance • Local operation and no global dependency • Continuous memory access • Sufficiently long loops for vectorization • 3-Way Parallelism for the Hybrid Parallel Programming Model • Inter-node: MPI • Intra-node: OpenMP • Individual PE: vectorization
Re-Ordering Technique for Vector/Parallel Architectures: Cyclic DJDS (RCM+CMC) Re-Ordering (Doi, Washio, Osoda and Maruyama (NEC)) • 1. RCM (Reverse Cuthill-McKee), sketched below • 2. CMC (Cyclic Multicolor) • 3. DJDS re-ordering • 4. Cyclic DJDS for SMP units • These re-ordering processes can be substituted by traditional multi-coloring (MC).
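A hedged sketch of step 1, a Cuthill-McKee style level-set ordering (the usual within-level sort by node degree is omitted for brevity, a connected graph is assumed, and xadj/adjncy/perm are illustrative names):

  subroutine rcm_levelset(n, xadj, adjncy, seed, perm)
    ! Level-set (breadth-first) ordering from a seed node, then reversed: RCM-style.
    ! xadj/adjncy is a CRS-like adjacency structure; perm(new) = old node number.
    implicit none
    integer, intent(in)  :: n, seed
    integer, intent(in)  :: xadj(n+1), adjncy(*)
    integer, intent(out) :: perm(n)
    logical :: visited(n)
    integer :: head, tail, i, j, v, w

    visited= .false.
    perm(1)= seed;  visited(seed)= .true.
    head= 1;  tail= 1
    do while (head <= tail)
      v= perm(head);  head= head + 1
      do j= xadj(v), xadj(v+1)-1               ! breadth-first sweep over neighbors
        w= adjncy(j)
        if (.not. visited(w)) then
          tail= tail + 1
          perm(tail)= w
          visited(w)= .true.
        endif
      enddo
    enddo
    do i= 1, n/2                               ! reverse the ordering: Reverse Cuthill-McKee
      v= perm(i);  perm(i)= perm(n+1-i);  perm(n+1-i)= v
    enddo
  end subroutine rcm_levelset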
Reordering = Coloring • COLOR: unit of independent sets. • Elements grouped in the same "color" are independent of each other, so parallel/vector operation is possible (a coloring sketch follows below). • Many colors give faster convergence but shorter vector length: a trade-off! (Figure: RCM (Reverse CM), Red-Black (2 colors), 4 colors.)
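A hedged sketch of a simple greedy multi-coloring (illustrative names only; real orderings such as CM-RCM control the number of colors more carefully, which is exactly the trade-off noted above):

  subroutine greedy_coloring(n, xadj, adjncy, color, ncolor)
    ! Each node gets the smallest color not used by an already-colored neighbor,
    ! so nodes of the same color are mutually independent and can be processed
    ! in parallel / vectorized.
    implicit none
    integer, intent(in)  :: n
    integer, intent(in)  :: xadj(n+1), adjncy(*)
    integer, intent(out) :: color(n), ncolor
    logical :: used(n)
    integer :: i, j, c, w

    color= 0
    ncolor= 0
    do i= 1, n
      used= .false.
      do j= xadj(i), xadj(i+1)-1               ! mark colors of already-colored neighbors
        w= adjncy(j)
        if (color(w) > 0) used(color(w))= .true.
      enddo
      c= 1
      do while (used(c))                       ! smallest free color
        c= c + 1
      enddo
      color(i)= c
      ncolor= max(ncolor, c)
    enddo
  end subroutine greedy_coloring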
Large-Scale Sparse Matrix Storage for Unstructured Grids • 2D storage: long vector length, but many ZEROs. • 1D storage (CRS): memory saved, but short vector length.
Re-Ordering within each Color according to the Number of Non-Zero Off-Diagonal Components • Elements of the same color are independent, therefore intra-hyperplane re-ordering does not affect the results. • DJDS: Descending-Order Jagged Diagonal Storage (a construction sketch follows below).
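A hedged sketch of how DJDS storage can be built from a CRS matrix for a single color (array names and the O(n²) selection sort are illustrative simplifications):

  subroutine crs_to_djds(n, nnz, iaCRS, jaCRS, aCRS, perm, nd, m, jaJD, aJD)
    ! perm(new) = old row; nd = number of jagged diagonals; m(j) = rows in diagonal j.
    ! Entry (new row i, diagonal j) is stored at position sum(m(1:j-1)) + i.
    implicit none
    integer, intent(in)  :: n, nnz
    integer, intent(in)  :: iaCRS(n+1), jaCRS(nnz)
    double precision, intent(in)  :: aCRS(nnz)
    integer, intent(out) :: perm(n), nd, m(*), jaJD(nnz)
    double precision, intent(out) :: aJD(nnz)
    integer :: cnt(n), i, j, k, p, pos, inew

    do i= 1, n
      cnt(i)= iaCRS(i+1) - iaCRS(i)            ! non-zeros per row
      perm(i)= i
    enddo
    do i= 1, n-1                               ! order rows by descending non-zero count
      p= i
      do j= i+1, n
        if (cnt(perm(j)) > cnt(perm(p))) p= j
      enddo
      k= perm(i);  perm(i)= perm(p);  perm(p)= k
    enddo
    nd= cnt(perm(1))                           ! longest row = number of jagged diagonals
    do j= 1, nd                                ! m(j): rows with at least j non-zeros
      m(j)= 0
      do i= 1, n
        if (cnt(perm(i)) >= j) m(j)= m(j) + 1
      enddo
    enddo
    pos= 0
    do j= 1, nd                                ! store the j-th non-zero of each row contiguously
      do inew= 1, m(j)                         ! rows are sorted, so these are rows 1..m(j)
        i= perm(inew)
        k= iaCRS(i) + j - 1                    ! j-th non-zero of old row i
        pos= pos + 1
        jaJD(pos)= jaCRS(k)
        aJD(pos) = aCRS(k)
      enddo
    enddo
  end subroutine crs_to_djds

With this layout, the innermost loop of a matrix-vector or substitution kernel runs over all rows that still have a j-th non-zero, giving the long unit-stride loops shown in the DJDS kernel at the end of this section.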
Cyclic DJDS (MC/CM-RCM): Cyclic Re-Ordering for SMP Units, Load Balancing among PEs

  do iv= 1, NCOLORS
  !$omp parallel do private(iv0,j,iS,iE,i,k,kk)
    do ip= 1, PEsmpTOT
      iv0= STACKmc(PEsmpTOT*(iv-1)+ip-1)
      do j= 1, NLhyp(iv)
        iS= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip-1)
        iE= INL(npLX1*(iv-1)+PEsmpTOT*(j-1)+ip  )
  !cdir nodep
        do i= iv0+1, iv0+iE-iS
          k= i+iS-iv0
          kk= IAL(k)
          (Important Computations)
        enddo
      enddo
    enddo
  enddo

  npLX1= NLmax * PEsmpTOT
  INL(0:NLmax*PEsmpTOT*NCOLORS)
Difference between Flat MPI & Hybrid • Most of the re-ordering effort is for vectorization. • If you have a long vector, you can simply divide it and distribute the segments to the PEs of an SMP node. • The source codes of the Hybrid and Flat MPI versions are not so different: Flat MPI corresponds to Hybrid with 1 PE per SMP node. • In other words, the Flat MPI code is already sufficiently complicated.
Cyclic DJDS (MC/CM-RCM) for Forward/Backward Substitution in BILU Factorization
Simple 3D Cubic Model ((Nx-1)×(Ny-1)×(Nz-1) elements) • Uniform distributed force in the z-direction @ z=Zmax • Ux=0 @ x=Xmin • Uy=0 @ y=Ymin • Uz=0 @ z=Zmin
Effect of Ordering
Effect of Re-Ordering (figure: three cases compared) • Long loops, continuous access • Short loops, irregular access • Short loops, continuous access
Matrix Storage, Loops
• DJDS (Descending-order Jagged Diagonal Storage) with long innermost loops is suitable for vector processors.

  DJDS
  do iv= 1, NVECT
    iv0= STACKmc(iv-1)
    do j= 1, NLhyp(iv)
      iS= index_L(NL*(iv-1)+j-1)
      iE= index_L(NL*(iv-1)+j  )
      do i= iv0+1, iv0+iE-iS
        k= i+iS-iv0
        kk= item_L(k)
        Z(i)= Z(i) - AL(k)*Z(kk)
      enddo
    enddo
    iS= STACKmc(iv-1) + 1
    iE= STACKmc(iv  )
    do i= iS, iE
      Z(i)= Z(i)/DD(i)
    enddo
  enddo

• The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operation.

  DCRS
  do i= 1, N
    SW= WW(i,Z)
    isL= index_L(i-1)+1
    ieL= index_L(i)
    do j= isL, ieL
      k= item_L(j)
      SW= SW - AL(j)*Z(k)
    enddo
    Z(i)= SW/DD(i)
  enddo