Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures. Ananth Grama and Ahmed Sameh Department of Computer Science Purdue University. http://www.cs.purdue.edu/people/{sameh/ayg}. Linear Solvers Grant Kickoff Meeting, 9/26/06.
Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures • Project Overview • Identify sweet spots in the algorithm-architecture-programming model space for efficient sparse linear system solvers. • Design a new class of highly scalable sparse solvers suited to petascale HPC systems. • Develop analytical frameworks for performance prediction and projection. • Methodology • Design generalized sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics. • Evaluate architectural features and their impact on scalable solver performance. • Evaluate performance and productivity aspects of programming models -- PGAS languages (CAF, UPC) and MPI. • Challenges and Impact • Generalizing the space of parallel sparse linear system solvers. • Analysis and implementation on parallel platforms. • Performance projection to the petascale. • Guidance for architecture and programming model design / performance envelope. • Benchmarks and libraries for HPCS. • Milestones / Schedule • Final deliverable: comprehensive evaluation of the scaling properties of existing (and novel) sparse solvers. • Six-month target: comparative performance of solvers on multicore SMPs and clusters. • 12-month target: evaluation of these solvers on the Cray X1, BlueGene, and JS20/21 for CAF/UPC/MPI implementations.
Introduction • A critical aspect of High Productivity is the identification of points/regions in the algorithm/architecture/programming model space that are amenable to implementation on petascale systems. • This project aims at identifying such points for commonly used sparse linear system solvers, and at developing more robust novel solvers. • These novel solvers emphasize reduction in memory/remote accesses at the expense of (possibly) higher FLOP counts – yielding much better actual performance.
Project Rationale • Sparse system solvers govern the overall performance of many CSE applications on HPC systems. • The design of HPC architectures and programming models should be influenced by their suitability for such solvers and related kernels. • The extreme degree of concurrency required by novel architectural models calls for a fundamental re-examination of conventional sparse solvers.
Typical Computational Kernels for PDEs: time integration, Newton iteration at each time step, and linear system solvers at each Newton iteration.
NESSIE – Nanoelectronics Simulation Environment (E. Polizzi, 1998-2005)
• Mathematical methodologies: finite element method, mode decomposition, multi-scale, non-linear numerical schemes, …
• Numerical parallel algorithms: linear solver (SPIKE), eigenpair solver (TraceMin), preconditioning strategies, …
• Applications (transport/electrostatics; multi-scale, multi-physics, multi-method): 3D CNTs, 3D molecular devices, 3D Si nanowires, 3D III/V devices, 2D MOSFETs
Simulation, Model Reduction and Real-Time Control of Structures
Project Goals • Develop generalizations of direct and iterative solvers – e.g., the SPIKE polyalgorithm. • Implement such generalizations on various architectures (multicore, multicore SMPs, multicore SMP aggregates) and programming models (PGAS languages, messaging APIs). • Analytically quantify performance and project to petascale platforms. • Compare relative performance, identify architecture/programming model features, and guide algorithm/architecture/programming model co-design.
Background • Personnel: • Ahmed Sameh, Samuel Conte Professor of Computer Science, has worked on development of parallel numerical algorithms for four decades. • Ananth Grama, Professor and University Scholar, has worked both on numerical aspects of sparse solvers, as well as analytical frameworks for parallel systems. • (To be named – Postdoctoral Researcher)* will be primarily responsible for implementation and benchmarking. *We have identified three candidates for this position and will shortly be hiring one of them.
Background… • Technical • We have built extensive infrastructure for parallel sparse solvers – including the SPIKE parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners. • We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 Blade servers, BlueGene/L, and the Cray X1.
Background… • Technical… • We have initiated installation of Co-Array Fortran and Unified Parallel C on our machines and begun porting our toolkits to these PGAS languages. • We have extensive experience in the analysis of performance and scalability of parallel algorithms, including development of the isoefficiency metric for scalability.
Technical Highlights The SPIKE Toolkit
SPIKE: Introduction
• Engineering problems usually produce large sparse linear systems.
• Banded (or banded with low-rank perturbations) structure is often obtained after reordering (e.g., RCM).
• SPIKE partitions the banded matrix into a block tridiagonal form.
• Each partition is associated with one CPU or one node (multilevel parallelism).
SPIKE: Introduction …
• The system AX = F (A of order n, banded) is transformed into SX = diag(A1^-1, …, Ap^-1)F, where the coupling blocks Bj, Cj are of order m with m << n.
• The spikes V1, V2, V3, … and W2, W3, W4, … give rise to a reduced system of order (p-1) x 2m; once it is solved, the full solution is retrieved.
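For context, the standard notation behind this transformation can be summarized as follows (a restatement of well-known SPIKE definitions, with D denoting the block-diagonal part of A):

$$
D = \operatorname{diag}(A_1,\ldots,A_p), \qquad
S = D^{-1}A, \qquad
V_j = A_j^{-1}\begin{bmatrix}0\\ B_j\end{bmatrix}, \qquad
W_j = A_j^{-1}\begin{bmatrix}C_j\\ 0\end{bmatrix}.
$$

The spikes V_j and W_j are the only nonzero off-diagonal block columns of S; their top and bottom m x m tips define the reduced system.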
SPIKE: A Hybrid Algorithm. Different choices are possible depending on the properties of the matrix and/or the platform architecture: • The spikes can be computed: explicitly (fully or partially), on the fly, or approximately. • The diagonal blocks can be solved: directly (dense LU, Cholesky, or sparse counterparts) or iteratively (with a preconditioning strategy). • The reduced system can be solved: directly (recursive SPIKE), iteratively (with a preconditioning scheme), or approximately (truncated SPIKE).
The SPIKE Algorithm: Hierarchy of Computational Modules (systems dense within the band)
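To make the spike/reduced-system mechanics concrete, the following is a minimal two-partition sketch in Python/NumPy of the explicit SPIKE variant. All sizes and matrix values are hypothetical, dense blocks are used for brevity, and this illustrates the algebra only, not the toolkit's implementation.

```python
import numpy as np

# Two-partition explicit SPIKE sketch on a small block-tridiagonal system.
rng = np.random.default_rng(0)
nb, m = 8, 2                                   # partition size and coupling size (hypothetical)

A1 = 4 * np.eye(nb) + 0.1 * rng.standard_normal((nb, nb))   # diagonal block 1
A2 = 4 * np.eye(nb) + 0.1 * rng.standard_normal((nb, nb))   # diagonal block 2
B1 = 0.1 * rng.standard_normal((m, m))          # couples bottom of partition 1 to top of partition 2
C2 = 0.1 * rng.standard_normal((m, m))          # couples top of partition 2 to bottom of partition 1

# Assemble the full matrix A and a right-hand side f for reference.
A = np.zeros((2 * nb, 2 * nb))
A[:nb, :nb], A[nb:, nb:] = A1, A2
A[nb - m:nb, nb:nb + m] = B1
A[nb:nb + m, nb - m:nb] = C2
f = rng.standard_normal(2 * nb)
f1, f2 = f[:nb], f[nb:]

# Spikes: V1 = A1^{-1} [0; B1],  W2 = A2^{-1} [C2; 0], plus modified right-hand sides.
rhs_V = np.zeros((nb, m)); rhs_V[-m:, :] = B1
rhs_W = np.zeros((nb, m)); rhs_W[:m, :] = C2
V1 = np.linalg.solve(A1, rhs_V)
W2 = np.linalg.solve(A2, rhs_W)
g1 = np.linalg.solve(A1, f1)
g2 = np.linalg.solve(A2, f2)

# Reduced system couples the bottom m unknowns of partition 1 with the top m of partition 2.
S = np.block([[np.eye(m), V1[-m:, :]],
              [W2[:m, :], np.eye(m)]])
y = np.linalg.solve(S, np.concatenate([g1[-m:], g2[:m]]))
x1b, x2t = y[:m], y[m:]

# Retrieve the full solution from the reduced one.
x = np.concatenate([g1 - V1 @ x2t, g2 - W2 @ x1b])
print(np.allclose(A @ x, f))                    # should print True
```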
SPIKE Hybrids
1. SPIKE version: R = recursive, E = explicit, F = on-the-fly, T = truncated
2. Factorization: without pivoting: L = LU, U = LU & UL, A = alternate LU & UL; with pivoting: P = LU
3. Solution improvement: 0 = direct solver only, 2 = iterative refinement, 3 = outer BiCGStab iterations
(For example, RL0 denotes recursive SPIKE with LU factorization without pivoting and a direct solve only; TL3 denotes truncated SPIKE with LU without pivoting and outer BiCGStab iterations.)
SPIKE "on-the-fly"
• Does not require generating the spikes explicitly
• Ideally suited for banded systems with large sparse bands
• The reduced system is solved iteratively, with or without a preconditioning strategy
For p = 4 partitions, the reduced system has the structure

$$
S=\begin{bmatrix}
I & E_1 & 0 & 0 & 0 & 0\\
F_2 & I & 0 & G_2 & 0 & 0\\
H_2 & 0 & I & E_2 & 0 & 0\\
0 & 0 & F_3 & I & 0 & G_3\\
0 & 0 & H_3 & 0 & I & E_3\\
0 & 0 & 0 & 0 & F_4 & I
\end{bmatrix}
$$

where the off-diagonal blocks E_j, F_j, G_j, H_j are the m x m tips of the spikes.
Numerical Experiments (dense within the band) • Computing platforms: a 4-node Linux Xeon cluster and a 512-processor IBM SP • Performance comparison with LAPACK and ScaLAPACK
SPIKE: Scalability (figure: speed improvement of SPIKE (RL0) vs. ScaLAPACK on the IBM SP; b = 401, one right-hand side)
SPIKE partitioning – 4-processor example (factorizations without pivoting)
• Partitioning 1 (four partitions A1, …, A4 with coupling blocks B1, C2, B2, C3, B3, C4): processor 1 factors A1 with LU, processors 2 and 3 use both LU and UL, processor 4 uses UL.
• Partitioning 2 (three partitions A1, A2, A3 with coupling blocks B1, C2, B2, C3): processor 1 factors A1 with LU; processors 2 and 3 share A2, performing UL (p = 2) and LU (p = 3); processor 4 factors A3 with UL.
SPIKE: Small Number of Processors
• ScaLAPACK needs at least 4-8 processors to perform as well as LAPACK.
• SPIKE benefits from the LU-UL strategy and realizes speed improvement over LAPACK on 2 or more processors.
• n = 960,000; b = 201; the system is diagonally dominant.
• Platform: 4-node Xeon Intel Linux cluster with InfiniBand interconnect; two 3.2 GHz processors per node; 4 GB of memory per node; 1 MB cache; 64-bit arithmetic.
• Software: Intel Fortran, Intel MPI, Intel MKL libraries for LAPACK and ScaLAPACK.
General sparse systems (banded structure obtained after RCM reordering)
• Science and engineering problems often produce large sparse linear systems
• Banded (or banded with low-rank perturbations) structure is often obtained after reordering
• While most ILU preconditioners depend on reordering strategies that minimize the fill-in in the factorization stage, we propose to:
• extract a narrow banded matrix, via a reordering-dropping strategy, to be used as a preconditioner (see the sketch below);
• make use of an improved SPIKE "on-the-fly" scheme that is ideally suited for banded systems that are sparse within the band.
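A minimal sketch of the reordering-dropping step, assuming SciPy's RCM routine as a stand-in for the reordering stage (MC64-style scaling is omitted); the function name banded_preconditioner, the half-bandwidth k, and the toy matrix are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def banded_preconditioner(A, k):
    """Reorder A with RCM and keep only entries within half-bandwidth k (hypothetical helper)."""
    A = sp.csr_matrix(A)
    perm = reverse_cuthill_mckee(A, symmetric_mode=False)   # RCM ordering
    A_perm = A[perm, :][:, perm].tocoo()                    # symmetrically permuted matrix
    keep = np.abs(A_perm.row - A_perm.col) <= k             # drop entries outside the band
    M = sp.coo_matrix((A_perm.data[keep], (A_perm.row[keep], A_perm.col[keep])),
                      shape=A_perm.shape).tocsc()
    return perm, M

# Toy usage on a random sparse matrix (values are illustrative only).
A = sp.random(2000, 2000, density=5e-3, format="csr", random_state=0) + 10 * sp.eye(2000)
perm, M = banded_preconditioner(A, k=50)
print(M.nnz, A.nnz)
```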
Sparse parallel direct solvers and banded systems: N = 432,000, b = 177, nnz = 7,955,116, fill-in of the band: 10.4% (* memory swap – too much fill-in)
Multilevel Parallelism: SPIKE calling MKL-PARDISO for banded systems that are sparse within the band
• Each node (1-4) applies PARDISO to its diagonal block Aj, while SPIKE handles the coupling blocks Bj, Cj across nodes.
• This SPIKE hybrid scheme exhibits better performance than other parallel direct sparse solvers used alone.
SPIKE-MKL "on-the-fly" for systems that are sparse within the band
• N = 432,000, b = 177, nnz = 7,955,116, sparsity of the band: 10.4%
• MUMPS: time (2 nodes) = 21.35 s; time (4 nodes) = 39.6 s
• For narrow banded systems, SPIKE will consider the matrix dense within the band. Reordering schemes for minimizing the bandwidth can be used if necessary.
• N = 471,800, b = 1455, nnz = 9,499,744, sparsity of the band: 1.4%
• Good scalability using the "on-the-fly" SPIKE scheme
A Preconditioning Approach • Reorder the matrix to bring most of the elements within a band via • HSL’s MC64 (to maximize sum or product of the diagonal elements) • RCM (Reverse Cuthill-McKee) or MD (Minimum Degree) • Extract a band from the reordered matrix to use as preconditioner. • Use an iterative method (e.g. Bicgstab) to solve the linear system (outer solver). • Use SPIKE to solve the system involving the preconditioner (inner solver).
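A minimal end-to-end sketch of this outer/inner scheme, using SciPy's BiCGStab as the outer solver and a sparse LU factorization of the extracted band in place of the SPIKE inner solve; the model problem, half-bandwidth, and sizes are all hypothetical.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, k = 5000, 5                       # problem size and preconditioner half-bandwidth (hypothetical)
rng = np.random.default_rng(0)

# Model problem: a banded matrix plus a sprinkle of off-band entries.
diags = [4.0 * np.ones(n)] + [0.3 * rng.standard_normal(n - d) for d in range(1, k + 1)]
A = sp.diags(diags, [0] + list(range(1, k + 1)), format="csr")
A = A + A.T + sp.random(n, n, density=1e-4, random_state=1, format="csr")

# Extract the band |i - j| <= k as the preconditioner M.
Acoo = A.tocoo()
keep = np.abs(Acoo.row - Acoo.col) <= k
M = sp.coo_matrix((Acoo.data[keep], (Acoo.row[keep], Acoo.col[keep])), shape=A.shape).tocsc()

# Inner solver: a sparse LU of M stands in for SPIKE applied to the banded preconditioner.
M_lu = spla.splu(M)
M_op = spla.LinearOperator(A.shape, matvec=M_lu.solve)

b = rng.standard_normal(n)
x, info = spla.bicgstab(A, b, M=M_op)   # outer BiCGStab iteration
print(info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```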
Matrix DW8192 (Square Dielectric Waveguide): order N = 8192, NNZ = 41,746, condest(A) = 1.39e+07, sparsity = 0.0622%, (A + A') indefinite
GMRES + ILU preconditioner vs. BiCGStab with banded preconditioner
• GMRES without preconditioner: > 5000 iterations
• GMRES + ILU (no fill-in): fails
• GMRES + ILU (drop tolerance 10^-3): fails
• GMRES + ILU (drop tolerance 10^-4): 16 iters., ||r|| ~ 10^-5; NNZ(L) = 612,889, NNZ(U) = 414,661
• SPIKE - BiCGStab (preconditioner of bandwidth 70): 16 iters., ||r|| ~ 10^-7; NNZ = 573,440
Comparisons on a 4-node Xeon Linux Cluster (2 CPUs/node), matrix DW8192
• SuperLU: 1 node -- 3.56 s; 2 nodes -- 1.87 s; 4 nodes -- 3.02 s (memory limitation)
• MUMPS: 1 node -- 0.26 s; 2 nodes -- 0.36 s; 4 nodes -- 0.34 s
• SPIKE - RL3 (bandwidth = 129): 1 node -- 0.28 s; 2 nodes -- 0.27 s; 4 nodes -- 0.20 s; BiCGStab required only 6 iterations; norm of rel. res. ~ 10^-6
• The sparse matvec kernel needs fine-tuning
Matrix BMW7st_1 (stiffness matrix): order N = 141,347, NNZ = 7,318,399, after RCM reordering
Comparisons on a 4-node Xeon Linux Cluster (2 CPUs/node), matrix BMW7st_1 (diagonally dominant)
• SuperLU: 1 node -- 26.56 s; 2 nodes -- 21.17 s; 4 nodes -- 45.03 s
• MUMPS: 1 node -- 9.71 s; 2 nodes -- 8.03 s; 4 nodes -- 8.05 s
• SPIKE - TL3 (bandwidth = 101): 1 node -- 2.38 s; 2 nodes -- 1.62 s; 4 nodes -- 1.33 s; BiCGStab required only 5 iterations; norm of rel. res. ~ 10^-6
• The sparse matvec kernel needs fine-tuning
Driven Cavity Example
• Square domain: -1 ≤ x, y ≤ 1
• Boundary conditions: ux = uy = 0 on x = ±1 and on y = -1; ux = 1, uy = 0 on y = 1
• Picard iterations; Q2/Q1 elements; linearized equations (Oseen problem)
• E = As = (A + A^T)/2
• Viscosity: 0.1, 0.02, 0.01, 0.002
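For reference, a standard mixed (Q2/Q1) discretization of the linearized Oseen problem leads to a saddle-point system of the generic form below; this is a textbook restatement for context, with A_s denoting the symmetric part of the velocity block, and is not claimed to be the slide's exact notation.

$$
\begin{bmatrix} A & B^{T} \\ B & 0 \end{bmatrix}
\begin{bmatrix} u \\ p \end{bmatrix}
=
\begin{bmatrix} f \\ g \end{bmatrix},
\qquad
A_s = \tfrac{1}{2}\left(A + A^{T}\right).
$$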
MA48 as a preconditioner for GMRES on a uniprocessor ** For drop tolerance = 10^-4, total time ~ 117 sec. ** No convergence for a drop tolerance of 10^-2
SPIKE - PARDISO for solving systems involving the banded preconditioner • On 4 nodes (8 CPUs) of a Xeon-Intel cluster: bandwidth of the extracted preconditioner = 1401; total time = 16 sec.; number of GMRES iterations = 131; 2-norm of residual = 10^-7 • Speed improvement over the sequential procedure with drop tolerance = 10^-4: 117/16 ~ 7.3
Technical Highlights… • Analysis of Scaling Properties • In early work, we developed the isoefficiency metric for scalability. • With the likely scenario of utilizing up to 100K processing cores, this work becomes critical. • Isoefficiency quantifies how the efficiency of a parallel system (a parallel program together with the underlying architecture) behaves as the number of processors is increased.
Technical Highlights… • Isoefficiency Analysis • For a fixed problem size, the efficiency of a parallel program decreases as the number of processors increases. • For scalable parallel systems, increasing the problem size (for a fixed number of processors) increases efficiency.
Technical Highlights… • Isoefficiency is the rate at which problem size must be increased w.r.t. number of processors, to maintain constant efficiency. • This rate is critical, since it is ultimately limited by total memory size. • Isoefficiency is a key indicator of a program’s ability to scale to very large machine configurations. • Isoefficiency analysis will be used extensively for performance projection and scaling properties.
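For reference, the standard definitions behind the metric (W denotes the problem size measured as serial work, T_o(W, p) the total parallel overhead):

$$
E = \frac{W}{W + T_o(W,p)} = \frac{1}{1 + T_o(W,p)/W},
\qquad
W = K\,T_o(W,p), \quad K = \frac{E}{1-E}.
$$

For example, adding n numbers on p processors incurs T_o(W, p) = Θ(p log p), so efficiency is maintained only if the problem size grows as W = Θ(p log p); this growth rate is the isoefficiency function.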
Architecture • We target the following currently available architectures • IBM JS20/21 and BlueGene/L platforms • Cray X1/XT3 • AMD Opteron multicore SMP and SMP clusters • Intel Xeon multicore SMP and SMP clusters • These platforms represent a wide range of currently available architectural extremes. _______________________________________________ Intel, AMD, and IBM JS21 platforms are available locally at Purdue. We intend to get access to the BlueGene at LLNL and the Cray XT3 at ORNL.
Implementation • Current implementations are MPI based. • The SPIKE toolkit will be ported to • POSIX threads and OpenMP • UPC and CAF • Titanium and X10 (if releases are available) • These implementations will be comprehensively benchmarked across platforms.