ScaLAPACK Tutorial Susan Blackford University of Tennessee http://www.netlib.org/scalapack/
Outline • Introduction • ScaLAPACK Project Overview • Basic Linear Algebra Subprograms (BLAS) • Linear Algebra PACKage (LAPACK) • Basic Linear Algebra Communication Subprograms (BLACS) • Parallel BLAS (PBLAS) • Design of ScaLAPACK http://www.netlib.org/scalapack
Outline continued • Contents of ScaLAPACK • Applications • Performance • Example Programs • Issues of Heterogeneous Computing • HPF Interface to ScaLAPACK http://www.netlib.org/scalapack
Outline continued • Future Directions • Conclusions • Hands-On Exercises http://www.netlib.org/scalapack
Introduction http://www.netlib.org/scalapack
High-Performance Computing Today • In the past decade, the world has experienced one of the most exciting periods in computer development. • Microprocessors have become smaller, denser, and more powerful. • The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering. http://www.netlib.org/scalapack
Growth of Microprocessor Performance • (Chart: performance in Mflop/s, 1980-1996; microprocessors (8087, 80287, 6881, 80387, R2000, i860, RS6000/540, RS6000/590, Alpha) versus supercomputers (Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90))
The Maturation of Highly Parallel Technology • Affordable parallel systems now out-perform the best conventional supercomputers. • Performance per dollar is particularly favorable. • The field is thinning to a few very capable systems. • Reliability is greatly improved. • Third-party scientific and engineering applications are appearing. • Business applications are appearing. • Commercial customers, not just research labs, are acquiring systems. http://www.netlib.org/scalapack
Architecture Alternatives • Distributed, private memories using message passing to communicate among processors • Single address space implemented with physically distributed memories. • Both approaches require: • Attention to locality of access. • Scalable interconnection technology. http://www.netlib.org/scalapack
Directions • Move toward shared memory • SMPs and Distributed Shared Memory • Shared address space w/deep memory hierarchy • Clustering of shared memory machines for scalability • Efficiency of message passing and data parallel programming • Helped by standards efforts such as MPI and HPF (PVM & OpenMP) http://www.netlib.org/scalapack
Challenges in Developing Distributed Memory Libraries • How to integrate software? Until recently no standards: many parallel languages, various parallel programming models • Assumptions about the parallel environment: granularity, topology, overlapping of communication/computation, development tools • Where is the data? Who owns it? Optimal data distribution • Who determines data layout: determined by the user? determined by the library developer? Allow dynamic data distribution; load balancing http://www.netlib.org/scalapack
ScaLAPACK Project Overview http://www.netlib.org/scalapack
ScaLAPACK Team • Susan Blackford, UT • Jaeyoung Choi, Soongsil University, Korea (UT) • Andy Cleary, LLNL (UT) • Tony Chan, UCLA • Ed D'Azevedo, ORNL • Jim Demmel, UC-B • Inder Dhillon, IBM (UC-B) • Jack Dongarra, UT/ORNL • Victor Eijkhout, UT • Sven Hammarling, NAG (UT) • Mike Heath, UIUC • Greg Henry, Intel • Antoine Petitet, UT • Padma Raghavan, UT • Dan Sorensen, Rice U • Ken Stanley, UC-B • David Walker, Cardiff (ORNL) • Clint Whaley, UT • Plus others not funded by DARPA • scalapack@cs.utk.edu http://www.netlib.org/scalapack
Scalable Parallel Library for Numerical Linear Algebra • ScaLAPACK Software Library for Dense, Banded, Sparse • University of Tennessee • Oak Ridge National Laboratory • University of California, Berkeley • P_ARPACK Large Sparse Non-Symmetric Eigenvalues • Rice University • CAPSS Direct Sparse Systems Solvers • University of Illinois, Urbana-Champaign • University of Tennessee • ParPre Parallel Preconditioners for Iterative Methods • University of California, Los Angeles • University of Tennessee http://www.netlib.org/scalapack
NLA - Software Development • BLAS, LINPACK, EISPACK • LAPACK • From hand-held calculators to most powerful vector supercomputers • RISC, Vector, and SMP target • In use by CRAY, IBM, SGI, HP-Convex, SUN, Fujitsu, Hitachi, NEC, NAG, IMSL, KAI • Still under development • High accuracy routines, parameterizing for memory hierarchies, annotated libraries http://www.netlib.org/scalapack
NLA - ScaLAPACK • Follow-on to LAPACK • Designed for distributed parallel computing (MPP & Clusters) • First math software package to do this • Numerical software that will work on a heterogeneous platform • Latency-tolerant algorithms on systems with deep memory hierarchies • “Out of Core” and fault-tolerant implementations • In use by Cray, IBM, HP-Convex, Fujitsu, NEC, NAG, IMSL • Tailor performance & provide support • Preparing final release. http://www.netlib.org/scalapack
Goals - Port LAPACK to Distributed-Memory Environments. • Efficiency • Optimized compute and communication engines • Block-partitioned algorithms (Level 3 BLAS) utilize the memory hierarchy and yield good node performance • Reliability • Whenever possible, use LAPACK algorithms and error bounds. • Scalability • As the problem size and number of processors grow • Replace LAPACK algorithms that did not scale; feed new ones back into LAPACK • Portability • Isolate machine dependencies to the BLAS and the BLACS • Flexibility • Modularity: build a rich set of linear algebra tools: BLAS, BLACS, PBLAS • Ease-of-Use • Calling interface similar to LAPACK http://www.netlib.org/scalapack
ScaLAPACK Team • Susan Blackford, UT • Jaeyoung Choi, Soongsil University • Andy Cleary, LLNL • Ed D'Azevedo, ORNL • Jim Demmel, UC-B • Inder Dhillon, IBM (UC-B) • Jack Dongarra, UT/ORNL • Ray Fellers, UC-B • Sven Hammarling, NAG • Greg Henry, Intel • Osni Marques, LBNL/NERSC • Caroline Papadopoulos, UCSD • Antoine Petitet, UT • Ken Stanley, UC-B • Francoise Tisseur, U Manchester • David Walker, Cardiff U • Clint Whaley, UT • scalapack@cs.utk.edu http://www.netlib.org/scalapack
Programming Style • SPMD Fortran 77 using an object-based design • Built on various modules: • PBLAS for interprocessor communication, layered on the BLACS (which run over PVM, MPI, IBM SP, CRI T3, Intel, TMC) and the BLAS; provides the right level of notation • LAPACK for software expertise/quality: software approach, numerical methods
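To make the SPMD style concrete, here is a minimal sketch of setting up a BLACS process grid (assuming a 2 x 3 grid, so at least 6 processes, and omitting error handling); the resulting context ICTXT is the handle that subsequent ScaLAPACK and PBLAS calls use:

      PROGRAM GRIDSETUP
      INTEGER            ICTXT, IAM, NPROCS, NPROW, NPCOL, MYROW, MYCOL
*     How many processes are there, and which one are we?
      CALL BLACS_PINFO( IAM, NPROCS )
*     Get the default system context
      CALL BLACS_GET( -1, 0, ICTXT )
*     Map the processes onto a 2 x 3 grid in row-major order
      NPROW = 2
      NPCOL = 3
      CALL BLACS_GRIDINIT( ICTXT, 'Row-major', NPROW, NPCOL )
*     Each process learns its coordinates in the grid
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
*     ... ScaLAPACK / PBLAS calls using ICTXT go here ...
      CALL BLACS_GRIDEXIT( ICTXT )
      CALL BLACS_EXIT( 0 )
      END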
Overall Structure of Software • Each global data object is assigned an array descriptor. • The array descriptor • Contains information required to establish mapping between a global array entry and its corresponding process and memory location. • Is differentiated by the DTYPE_ (first entry) in the descriptor. • Provides a flexible framework to easily specify additional data distributions or matrix types. • Using concept of context http://www.netlib.org/scalapack
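As a sketch (assuming the context ICTXT, grid size NPROW, and row coordinate MYROW from a BLACS grid setup such as the one above, a 1000 x 1000 matrix, and 64 x 64 blocks), the DESCINIT tool routine fills in the nine descriptor entries, and NUMROC gives the number of locally stored rows, which serves as the local leading dimension:

      INTEGER            M, N, MB, NB, LLD, INFO
      INTEGER            DESCA( 9 )
      INTEGER            NUMROC
      EXTERNAL           NUMROC
      M  = 1000
      N  = 1000
      MB = 64
      NB = 64
*     Local leading dimension = number of matrix rows stored locally
      LLD = MAX( 1, NUMROC( M, MB, MYROW, 0, NPROW ) )
*     Fill DESCA: DTYPE_, CTXT_, M_, N_, MB_, NB_, RSRC_, CSRC_, LLD_
      CALL DESCINIT( DESCA, M, N, MB, NB, 0, 0, ICTXT, LLD, INFO )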
PBLAS • Similar to the BLAS in functionality and naming. • Built on the BLAS and BLACS • Provide global view of matrix CALL DGEXXX ( M, N, A( IA, JA ), LDA,... ) CALL PDGEXXX( M, N, A, IA, JA, DESCA,... ) http://www.netlib.org/scalapack
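For example, a distributed matrix multiply C := A*B reads much like the serial DGEMM call, except that global indices and descriptors replace the leading dimension; this sketch assumes the local arrays A, B, C and the descriptors DESCA, DESCB, DESCC have been set up as above, with conforming global sizes M, N, K:

      DOUBLE PRECISION   ALPHA, BETA
      ALPHA = 1.0D0
      BETA  = 0.0D0
*     Global view: C(1:M,1:N) := A(1:M,1:K) * B(1:K,1:N); each process
*     passes its local pieces plus the descriptors
      CALL PDGEMM( 'No transpose', 'No transpose', M, N, K, ALPHA,
     $             A, 1, 1, DESCA, B, 1, 1, DESCB, BETA,
     $             C, 1, 1, DESCC )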
ScaLAPACK Structure • (Diagram: ScaLAPACK and the PBLAS form the global layer; LAPACK, the BLAS, the BLACS, and the underlying message-passing layer (PVM/MPI/...) form the local layer) http://www.netlib.org/scalapack
Parallelism in ScaLAPACK • Level 3 BLAS block operations: all the reduction routines • Pipelining: QR algorithm, triangular solvers, classic factorizations • Redundant computations: condition estimators, QR-based eigensolvers • Static work assignment: bisection • Task parallelism: sign function eigenvalue computations • Divide and conquer: tridiagonal and band solvers, symmetric eigenvalue problem, and sign function • Cyclic reduction: reduced system in the band solver http://www.netlib.org/scalapack
Heterogeneous Computing • Software intended to be used in this context • Machine precision and other machine-specific parameters • Communication of floating-point numbers between processors • Repeatability: run-to-run differences, e.g. the order in which quantities are summed • Coherency: within-run differences, e.g. arithmetic varies from process to process • Iterative convergence across clusters of processors • Defensive programming required • Important for the “Computational Grid” http://www.netlib.org/scalapack
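One common defensive idiom, sketched below under the assumption that the BLACS context ICTXT is already set up, is to make every process agree on a single, most pessimistic machine epsilon before using it in convergence tests; here LAPACK's DLAMCH supplies the local value and the BLACS combine DGAMX2D takes the maximum over the grid (ScaLAPACK's PDLAMCH tool routine packages the same idea):

      INTEGER            IDUM
      DOUBLE PRECISION   EPS
      DOUBLE PRECISION   DLAMCH
      EXTERNAL           DLAMCH
*     Machine epsilon as seen by this process
      EPS = DLAMCH( 'Epsilon' )
*     Replace it by the largest value over all processes in the grid,
*     so every process tests convergence against the same epsilon
      CALL DGAMX2D( ICTXT, 'All', ' ', 1, 1, EPS, 1, IDUM, IDUM, -1,
     $              -1, -1 )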
Prototype Codes http://www.netlib.org/scalapack/prototype/ • PBLAS (version 2.0 ALPHA) • Packed Storage routines for LLT, SEP, GSEP • Out-of-Core Linear Solvers for LU, LLT, and QR • Matrix Sign Function for Eigenproblems • SuperLU and SuperLU_MT • HPF Interface to ScaLAPACK • Distributed memory SuperLU coming soon! http://www.netlib.org/scalapack
Out of Core Software Approach • High-level I/O Interface • ScaLAPACK uses a 'right-looking' variant for the LU, QR, and Cholesky factorizations. • A 'left-looking' variant is used for the out-of-core factorizations to reduce I/O traffic. • Requires two in-core column panels. http://www.netlib.org/scalapack
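In outline, the left-looking organization is roughly the schematic below; READ_PANEL, APPLY_UPDATE, FACTOR_PANEL, and WRITE_PANEL are hypothetical stand-ins for the package's I/O and update routines (not actual ScaLAPACK names), and PANELJ, PANELK are the two in-core column panels:

*     Schematic only; the routine names are hypothetical placeholders
      DO J = 1, NPANELS
*        Bring panel J into core
         CALL READ_PANEL( J, PANELJ )
*        Left-looking: apply the updates from every previously factored
*        panel K < J before factoring panel J, rereading each one
         DO K = 1, J - 1
            CALL READ_PANEL( K, PANELK )
            CALL APPLY_UPDATE( PANELK, PANELJ )
         END DO
*        Factor the fully updated panel in core, then write it back out
         CALL FACTOR_PANEL( PANELJ )
         CALL WRITE_PANEL( J, PANELJ )
      END DO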
Out-of-Core Performance http://www.netlib.org/scalapack
HPF Version • Interface provided for a subset of routines • Linear solvers (general, pos. def., least squares) • Eigensolver (pos. def.) • matrix multiply, & triangular solve
HPF Version • Given these declarations: REAL, DIMENSION(1000,1000) :: A, B, C !HPF$ DISTRIBUTE (CYCLIC(64), CYCLIC(64)) :: A, B, C • Calls could be made as simple as: CALL LA_GESV(A, B) CALL LA_POSV(A, B) CALL LA_GELS(A, B) CALL LA_SYEV(A, W, Z) CALL LA_GEMM(A, B, C) CALL LA_TRSM(A,B) • Fortran 90 version follows these ideas
ScaLAPACK - Ongoing Work • Increasing flexibility and usability • Algorithm redistribution in PBLAS • Removal of alignment restrictions • Algorithmic blocking • Increased functionality • Divide and conquer for SEP and SVD • “Holy Grail” for Symmetric Eigenvalue Problem • Sparse Direct Solver • C++ and Java interface
Direct Sparse Solvers • CAPSS is a package to solve Ax=b on a message passing multiprocessor; the matrix A is SPD and associated with a mesh in 2 or 3D. (Version for Intel & MPI) • MFACT -- multifrontal sparse solver http://www.cs.utk.edu/~padma/mfact.html • SuperLU - sequential and parallel implementations of Gaussian elimination with partial pivoting. http://www.netlib.org/scalapack
Sparse Gaussian Elimination • SuperLU_MT, supernodal SMP version • Designed to exploit memory hierarchy • Serial supernode-panel organization permits "BLAS 2.5" performance • Up to 40% of machine peak on large sparse matrices on IBM RS6000/590 and MIPS R8000; 25% on Alpha 21164 http://www.netlib.org/scalapack
SuperLU • Several data structures redefined and redesigned for parallel access • Barrier-less, nondeterministic implementation • Task queue for load balance (over 95% on most large problems) • For an n=16614 matrix from a 3D flow calculation, 66 nonzeros/row http://www.netlib.org/scalapack
Parallel Sparse Eigenvalue Solvers • P_ARPACK (D. Sorensen et al.) • Designed to compute a few eigenvalues and corresponding eigenvectors of a general matrix. • Appropriate for large sparse or structured matrices A • This software is based on an Arnoldi process called the Implicitly Restarted Arnoldi Method. • Reverse Communication Interface. http://www.netlib.org/scalapack
ParPre • Library of parallel preconditioners for iterative solution methods for linear systems of equations http://www.netlib.org/scalapack
Netlib downloads for ScaLAPACK material • ScaLAPACK -- 4230 • ScaLAPACK HPF version -- 228 • SuperLU -- 78 • ARPACK -- 1129 • PARPACK -- 292 • CAPSS -- 479 • PARPRE -- 102 • manpages -- 3292 • This is a count of the downloads for the archive, tar file, or prebuilt libraries. http://www.netlib.org/scalapack
Related Projects http://www.netlib.org/scalapack
Java • Java likely to be a dominant language. • Provides for machine-independent code. • C++-like language • No pointers, gotos, overloading of arithmetic operators, or explicit memory deallocation • Portability achieved via an abstract machine • Java is a convenient user interface builder which allows one to quickly develop customized interfaces.
JAVA • Investigating the suitability of Java for math software libraries. • http://math.nist.gov/javanumerics/ • JAMA (Java Matrix Package) • BLAS, LU, QR, eigenvalue routines, SVD • LAPACK to Java translator • http://www.cs.utk.edu/f2j/download.html • Involved in the community discussion
LAPACK to JAVA • Allows Java programmers access to BLAS/LAPACK routines. • Translator to go from LAPACK to Java byte code • f2j: formal compiler of a subset of Fortran 77 sufficient for the BLAS & LAPACK • Plan to enable all of LAPACK • Compiler provides quick, reliable translation. • Focus on LAPACK Fortran • Simple - no COMMON, EQUIVALENCE, SAVE, or I/O
Parameterized Libraries • Architecture features for performance • Cache size • Latency • Bandwidth • Latency tolerant algorithms • Latency reduction • Granularity management • High levels of concurrency • Issues of precision • Grid aware
Motivation for Network Enabled Solvers • Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms • (Diagram: the client sends a request to an agent, which chooses among the available computational resources and returns its reply) • Client-Server Design • Non-hierarchical system • Load Balancing • Fault Tolerance • Heterogeneous Environment Supported
NetSolve -- References • Software and documentation can be obtained via the WWW or anonymous ftp: • http://www.cs.utk.edu/netsolve/ • Comments/questions to netsolve@cs.utk.edu.
ATLAS atlas@cs.utk.edu
What is ATLAS • A package that adapts to differing architectures via code generation + timing • Initially, supply the BLAS • Package contains: • Code generators • Sophisticated timers • Robust search routines • Currently provided: • Level 3 BLAS: generated GEMM (1-2 hours install time per precision), recursive GEMM-based L3 BLAS (Antoine Petitet) • Next release: real matvec, various L1, threading
Why ATLAS is needed • BLAS require many man-hours/platform • Only done if financial incentive is there • Many platforms will never have an optimal version • Lags behind hardware • May not be affordable by everyone • Improves Vendor code • Operations may be important, but not general enough for standard • Allows for portably optimal codes
Algorithmic approach for Lvl 3 • (Diagram: C (M x N) = A (M x K) * B (K x N), partitioned into NB x NB blocks) • Only generated code is on-chip multiply • All BLAS operations written in terms of generated on-chip multiply • All transpose cases coerced through data copy to 1 case of on-chip multiply • Only 1 case generated per platform
Code generation strategy • Code is iteratively generated & timed until optimal case is found. We try: • Differing NBs • Breaking false dependencies • M, N and K loop unrolling • On-chip multiply optimizes for: • TLB access • L1 cache reuse • FP unit usage • Memory fetch • Register reuse • Loop overhead minimization
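As a rough sketch, the generated on-chip multiply corresponds to the innermost block of an NB-blocked loop nest like the one below (C := C + A*B, with M, N, K, NB and the arrays assumed declared elsewhere); the generator's search picks NB and unrolls and register-blocks the three innermost loops for the target machine:

      INTEGER            I, J, KK, I0, J0, K0
*     Each (I0, J0, K0) iteration multiplies one pair of NB x NB blocks
*     that are intended to stay resident in the L1 cache
      DO J0 = 1, N, NB
         DO K0 = 1, K, NB
            DO I0 = 1, M, NB
*              The "on-chip" multiply of a single block triple
               DO J = J0, MIN( J0+NB-1, N )
                  DO KK = K0, MIN( K0+NB-1, K )
                     DO I = I0, MIN( I0+NB-1, M )
                        C( I, J ) = C( I, J ) + A( I, KK )*B( KK, J )
                     END DO
                  END DO
               END DO
            END DO
         END DO
      END DO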