The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev
Jim Demmel, UC Berkeley
21 June 2006, PARA 06
Outline • Motivation for new Sca/LAPACK • Challenges (or research opportunities…) • Goals of new Sca/LAPACK • Highlights of progress
Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >60M web hits @ Netlib (incl. CLAPACK, LAPACK95)
Impact (with NERSC, LBNL) • Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000) • Built on ScaLAPACK
Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >60M web hits @ Netlib (incl. CLAPACK, LAPACK95) • Many ways to improve them, based on • Own algorithmic research • Enthusiastic participation of research community • User/vendor survey • Opportunities and demands of new architectures, programming languages • New releases planned (NSF support)
Participants • UC Berkeley: • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… • U Tennessee, Knoxville • Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak • Other Academic Institutions • UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara • TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions • CERFACS, LBL • Industrial Partners • Cray, HP, Intel, MathWorks, NAG, SGI
Challenges • For all large scale computing, not just linear algebra! • Example …
Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • 256 Threads/multicore chip by 2010
Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • 256 Threads/multicore chip by 2010 • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency
Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • 256 Threads/multicore chip by 2010 • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency • Heterogeneity (performance and semantics) • Asynchrony • Unreliability
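To make these gaps concrete, here is a minimal back-of-the-envelope sketch using the standard cost model time = #flops × t_flop + #words × (1/bandwidth) + #messages × latency. All machine constants below are illustrative assumptions, not measurements of any particular system.

```python
# Illustrative machine constants (assumptions, not measurements):
t_flop = 1e-9   # seconds per floating point operation
t_word = 1e-8   # seconds per word moved (inverse bandwidth)
t_msg  = 1e-6   # seconds per message (latency)

def model_time(flops, words, messages):
    """Standard cost model: time = flops*t_flop + words*t_word + messages*t_msg."""
    return flops * t_flop + words * t_word + messages * t_msg

# Moving n words in one message vs. n one-word messages:
n = 10_000
print(model_time(0, n, 1))   # bandwidth-bound: ~1.0e-4 s
print(model_time(0, n, n))   # latency-bound:   ~1.0e-2 s, about 100x slower
```

Since t_flop << t_word << t_msg, and the gaps grow exponentially, algorithms must increasingly be organized to minimize words and messages moved, not flops.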
What do users want? • High performance, ease of use, … • Survey results at www.netlib.org/lapack-dev • Small but interesting sample • What matrix sizes do you care about? • 1000s: 34% • 10,000s: 26% • 100,000s or 1Ms: 26% • How many processors, on distributed memory? • >10: 34%, >100: 31%, >1000: 19% • Do you use more than double precision? • Sometimes or frequently: 16% • Would Automatic Memory Allocation help? • Very useful: 72%, Not useful: 14%
Goals of next Sca/LAPACK • Better algorithms • Faster, more accurate • Expand contents • More functions, more parallel implementations • Automate performance tuning • Improve ease of use • Better software engineering • Increased community involvement
Goal 1: Better Algorithms • Faster • But provide “usual” accuracy, stability • … Or accurate for an important subclass • More accurate • But provide “usual” speed • … Or at any cost
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS 19) • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang • Novel Data Layouts: • Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30)
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Faster and more accurate than previous algorithms • SIAM SIAG/LA Prize in 2006 • New sequential, first parallel versions out in 2006
Parallel Runtimes of Eigensolvers (2.4 GHz Xeon Cluster + Ethernet) [performance plot]
Accuracy of Eigensolvers [plots of $\|QQ^T - I\| / (n\varepsilon)$ and $\max_i \|Tq_i - \lambda_i q_i\| / (n\varepsilon)$]
Accuracy of Eigensolvers: Old vs New Grail [plots of the same metrics, $\|QQ^T - I\| / (n\varepsilon)$ and $\max_i \|Tq_i - \lambda_i q_i\| / (n\varepsilon)$]
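For concreteness, a minimal sketch of how the two accuracy metrics on these plots are computed for a symmetric tridiagonal T with computed eigenpairs (λ_i, q_i). This is not the MRRR code itself: scipy's eigh_tridiagonal stands in for the eigensolver under test, and the Frobenius norm for the orthogonality measure is an assumption.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 500
rng = np.random.default_rng(0)
d = rng.standard_normal(n)          # diagonal of T
e = rng.standard_normal(n - 1)      # off-diagonal of T
lam, Q = eigh_tridiagonal(d, e)     # eigenvalues lam, eigenvector columns of Q

T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
eps = np.finfo(float).eps

orth  = np.linalg.norm(Q @ Q.T - np.eye(n)) / (n * eps)      # ||QQ^T - I|| / (n eps)
resid = max(np.linalg.norm(T @ Q[:, i] - lam[i] * Q[:, i])   # max_i ||T q_i - lam_i q_i||
            for i in range(n)) / (n * eps)                   #   / (n eps)
print(f"orthogonality: {orth:.2f}  worst residual: {resid:.2f}")
```

Values of order 1 on both metrics mean the solver is as accurate as the problem size and machine precision allow.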
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Faster and more accurate than previous algorithms • SIAM SIAG/LA Prize in 2006 • New sequential, first parallel versions out in 2006 • Both DC and MR are important
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS19) • Up to 10x faster HQR: • Byers / Mathias / Braman • SIAM SIAG/LA Prize in 2003 • Sequential version out in 2006 • More on performance later
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS 19) • LAPACK Working Note (LAWN) #173 • On 26 real test matrices, speedups up to 14.7x, 4.4x average
Comparison of ScaLAPACK QR and new parallel multishift QZ • Execution times in seconds for 4096 × 4096 random problems $Ax = \lambda x$ and $Ax = \lambda Bx$, on processor grids of 1–16 processors • Note: work(QZ) > 2 × work(QR), yet time(parallel QZ) << time(parallel QR)! • Times include the cost of computing eigenvalues and transformation matrices • Adlerborn / Kågström / Kressner, SIAM PP’2006; also MS 19 • A small sketch of the two problem types appears below
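A minimal scipy sketch of the two problems being timed, at toy size rather than the 4096 × 4096 scale above: the standard problem via QR-based eig, and the generalized problem via the QZ decomposition.

```python
import numpy as np
from scipy.linalg import eig, qz

n = 300
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

w_qr = eig(A, right=False)                 # standard problem: A x = lambda x

AA, BB, Q, Z = qz(A, B, output='complex')  # QZ: A = Q @ AA @ Z^H, B = Q @ BB @ Z^H,
w_qz = np.diag(AA) / np.diag(BB)           # with AA, BB triangular; generalized
                                           # eigenvalues are the diagonal ratios
```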
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS 19) • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang • Full nonsymmetric eigenproblem: n=1500: 3.43x faster • HQR: 5x faster, Reduction: 14% faster • Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster • Sequential versions out in 2006
Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS19) • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang • Novel Data Layouts • Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30) • SIAM Review Article 2004
Novel Data Layouts and Algorithms • Still merges multiple elimination steps into a few BLAS 3 operations • MS 15/23/30 – Novel Data Formats • Rectangular Packed Format: good speedups for packed storage of symmetric matrices
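For background, a minimal sketch of the conventional packed storage that the rectangular full packed format improves on: it halves the memory for a symmetric matrix, but its ragged columns defeat BLAS 3 blocking, which is the motivation for RFP. The indexing formula is the standard lower-triangular, column-major one (LAPACK's 'PP' layout); the helper names are illustrative.

```python
import numpy as np

n = 5
A = np.arange(n * n, dtype=float).reshape(n, n)
A = (A + A.T) / 2                      # make A symmetric

def pack_lower(A):
    """Stack the lower-triangular columns: n(n+1)/2 words instead of n^2."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def packed_index(i, j, n):
    """Position of A[i, j] (i >= j, 0-based) in the packed vector."""
    return i + j * (2 * n - j - 1) // 2

ap = pack_lower(A)
assert ap[packed_index(4, 2, n)] == A[4, 2]
print(len(ap), "words instead of", n * n)
```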
Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n²) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package
Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Kahan, Riedy, Hida, Li • “Promise” the right answer for O(n²) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN #165)
More Accurate: Solve Ax=b [Plot: error of conventional Gaussian elimination vs. extra-precise iterative refinement, over a range from $\varepsilon$ to $1/\varepsilon$, where $\varepsilon = n^{1/2}\,2^{-24}$]
Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n²) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN #165) • “Guarantees” based on condition number estimates • Condition estimate $< 1/(n^{1/2}\varepsilon)$ ⇒ reliable answer and tiny error bounds • No bad bounds in 6.2M tests • Can condition estimators lie?
Can condition estimators lie? • Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization • Demmel / Diament / Malajovich (FCM 2001) • But what if matrix multiply costs O(n²)? • More later
Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n²) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN #165) • “Guarantees” based on condition number estimates • Get tiny componentwise bounds too • Each $x_i$ accurate • Slightly different condition number • Extends to Least Squares (Li) • Release in 2006 • A refinement sketch follows below
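A minimal sketch of iterative refinement with extra-precise residuals, as an illustration only: the real routines use the extra-precise BLAS of LAWN #165, whereas here np.longdouble (80-bit extended on most x86 platforms) stands in for the wider accumulation.

```python
import numpy as np
import scipy.linalg as la

def refine(A, b, max_iter=5):
    lu, piv = la.lu_factor(A)                    # factor once: O(n^3)
    x = la.lu_solve((lu, piv), b)
    Aw = A.astype(np.longdouble)                 # wider copies for residuals
    bw = b.astype(np.longdouble)
    for _ in range(max_iter):                    # each sweep is only O(n^2)
        r = np.asarray(bw - Aw @ x, dtype=np.float64)   # extra-precise residual
        dx = la.lu_solve((lu, piv), r)
        x = x + dx
        if np.max(np.abs(dx)) <= np.finfo(np.float64).eps * np.max(np.abs(x)):
            break                                # converged to working accuracy
    return x
```

This is the regime the slide describes: when the condition estimate is below $1/(n^{1/2}\varepsilon)$, the corrections shrink rapidly and the error bounds are reliable.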
Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Promise the right answer for O(n²) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Drmac / Veselic (MS 3) • LAWNs #169, #170 • Can be arbitrarily more accurate on tiny singular values • Yet faster than QR iteration, within 2x of DC
Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Promise the right answer for O(n²) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package • Quad, Quad-double, ARPREC, MPFR, … • Using Fortran 95 modules
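A minimal sketch of the "arbitrary precision versions of everything" idea, with mpmath standing in for "your favorite multiple precision package" (the slide names quad, quad-double, ARPREC, and MPFR; any of them would play the same role).

```python
from mpmath import mp, matrix, lu_solve, randmatrix

mp.dps = 50                       # 50 decimal digits of working precision

n = 4
A = randmatrix(n, n)              # random n x n matrix with multiprecision entries
b = matrix([1] * n)
x = lu_solve(A, b)                # LU solve carried out in 50-digit arithmetic
print(x[0])
```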
Iterative Refinement: for speed (MS 3) • What if double precision is much slower than single? • Cell processor in PlayStation 3: 256 GFlops single, 25 GFlops double • Pentium SSE2: single twice as fast as double • Given Ax=b in double precision: factor in single, do refinement in double • If $\kappa(A) < 1/\varepsilon_{\mathrm{single}}$, runs at the speed of single • 1.9x speedup on an Intel-based laptop • Applies to many algorithms, if the speed difference is large (see the sketch below)
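A minimal sketch of the trick, under the stated assumption $\kappa(A) < 1/\varepsilon_{\mathrm{single}}$: the O(n³) factorization runs in the fast single-precision format, and O(n²) refinement in double recovers double-precision accuracy. scipy's LU stands in for whatever fast single-precision factorization the target machine provides.

```python
import numpy as np
import scipy.linalg as la

def solve_mixed(A, b, max_iter=10):
    lu, piv = la.lu_factor(A.astype(np.float32))   # O(n^3) work done in single
    x = la.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                              # O(n^2) residual in double
        dx = la.lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += dx
        if np.linalg.norm(dx) <= np.finfo(np.float64).eps * np.linalg.norm(x):
            break
    return x
```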
Exploiting GPUs • Numerous emerging co-processors: Cell, SSE, Grape, GPU, “physics coprocessor,” … • When can we exploit them? • Little help if memory is the bottleneck • Various attempts to use GPUs for dense linear algebra • Bisection on GPUs for the symmetric tridiagonal eigenproblem: evaluate Count(x) = #(eigenvalues < x) for many x • Very little memory traffic • Speedups up to 100x (Volkov) • A sketch of the Count(x) kernel follows below
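A minimal numpy sketch of the Count(x) kernel (the Sturm count): run the LDLᵀ pivot recurrence of T − xI and count negative pivots, in lockstep across many shifts at once, which is what makes it embarrassingly parallel and nearly free of memory traffic. Production codes also guard against zero pivots, which is omitted here.

```python
import numpy as np

def sturm_count(a, b, shifts):
    """Count eigenvalues of tridiag(b, a, b) below each shift x:
    d_1 = a_1 - x, d_i = (a_i - x) - b_{i-1}^2 / d_{i-1}; count negative d_i."""
    shifts = np.asarray(shifts, dtype=float)
    d = a[0] - shifts                      # one pivot per shift, all in lockstep
    count = (d < 0).astype(int)
    for i in range(1, len(a)):
        d = (a[i] - shifts) - b[i - 1] ** 2 / d
        count += d < 0
    return count

a = np.array([2.0, 2.0, 2.0])              # T = tridiag(-1, 2, -1), eigenvalues
b = np.array([-1.0, -1.0])                 # 2 - sqrt(2), 2, 2 + sqrt(2)
print(sturm_count(a, b, [0.5, 1.5, 2.5, 3.5]))   # -> [0 1 2 3]
```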
Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible
Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: Stewart, Langou • More generalized SVDs: Bai / Wang / Drmac (MS 3) • More generalized Sylvester/Lyapunov equations: Kågström / Jonsson / Granat (MS 19) • Structured eigenproblems • Selected matrix polynomials: Mehrmann, Higham, Tisseur • O(n²) version of roots(p): Gu, Chandrasekaran, Bindel et al.
New algorithm for roots(p) • To find roots of polynomial p, roots(p) does eig(C(p)) on the companion matrix
$$C(p) = \begin{bmatrix} -p_1 & -p_2 & \cdots & -p_d \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix}$$
• Costs O(n³), stable, reliable • O(n²) alternatives: Newton, Jenkins-Traub, Laguerre, … Stable? Reliable? • New: exploit “semiseparable” structure of C(p): low rank of any submatrix of the upper triangle of C(p) is preserved under QR iteration • Complexity drops from O(n³) to O(n²), stable in practice • Related work: Van Barel (MS 3), Gemignani, Bini, Pan, et al. • Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel
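A minimal numpy sketch of the O(n³) baseline described above: build C(p) and take its eigenvalues (np.roots works essentially this way). The O(n²) semiseparable QR iteration that is the slide's actual contribution is not shown.

```python
import numpy as np

def companion_roots(p):
    """Roots of p[0]*x^d + p[1]*x^(d-1) + ... + p[d] via eig(C(p))."""
    p = np.asarray(p, dtype=float)
    p = p / p[0]                       # normalize to a monic polynomial
    d = len(p) - 1                     # degree
    C = np.zeros((d, d))
    C[0, :] = -p[1:]                   # first row: -p_1 ... -p_d
    C[1:, :-1] = np.eye(d - 1)         # ones on the subdiagonal
    return np.linalg.eigvals(C)        # O(n^3) QR iteration

print(np.sort(companion_roots([1, -6, 11, -6])))   # (x-1)(x-2)(x-3) -> [1. 2. 3.]
```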
Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: Stewart, Langou (see the sketch below) • More generalized SVDs: Bai / Wang / Drmac (MS 3) • More generalized Sylvester/Lyapunov equations: Kågström / Jonsson / Granat (MS 19) • Structured eigenproblems • Selected matrix polynomials: Mehrmann, Higham, Tisseur • O(n²) version of roots(p): Gu, Chandrasekaran, Bindel et al. • How should we prioritize missing functions?
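As an illustration of what updating of factorizations buys, a minimal sketch using scipy's qr_update, which plays the role of the routines discussed above (this is scipy's API, not the proposed Sca/LAPACK one): a rank-1 update of a QR factorization costs O(n²) instead of the O(n³) of refactoring from scratch.

```python
import numpy as np
from scipy.linalg import qr, qr_update

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
u = rng.standard_normal(6)
v = rng.standard_normal(6)

Q, R = qr(A)                            # O(n^3) factorization, done once
Q1, R1 = qr_update(Q, R, u, v)          # O(n^2) rank-1 update: Q1 R1 = A + u v^T
assert np.allclose(Q1 @ R1, A + np.outer(u, v))
```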
Goal 3 – Automate Performance Tuning • Widely used in performance tuning of kernels • ATLAS (PhiPAC) – BLAS – www.netlib.org/atlas • FFTW – Fast Fourier Transform – www.fftw.org • Spiral – signal processing – www.spiral.net • OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski • Integrated into PETSc
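A minimal sketch of the empirical approach these projects share: time a kernel over candidate tuning parameters on the actual machine and keep the fastest. The blocked kernel and the candidate block sizes here are illustrative assumptions, not any project's actual search space.

```python
import time
import numpy as np

def blocked_matmul(A, B, nb):
    """Blocked matrix multiply; nb is the tuning parameter being searched."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for k in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def time_kernel(nb, A, B, reps=3):
    best = float("inf")
    for _ in range(reps):                      # keep the best of a few runs
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        best = min(best, time.perf_counter() - t0)
    return best

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)
timings = {nb: time_kernel(nb, A, B) for nb in (32, 64, 128, 256)}
print("selected block size:", min(timings, key=timings.get))
```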