Enhancing LAPACK and ScaLAPACK for Future Computing Challenges

Explore the future of LAPACK and ScaLAPACK, widely used linear algebra libraries, addressing challenges, goals, and progress. Learn about the motivation, participants, challenges in large-scale computing, and user requirements.


Presentation Transcript


  1. The Future of LAPACK and ScaLAPACK • www.netlib.org/lapack-dev • Jim Demmel, UC Berkeley • 21 June 2006, PARA 06

  2. Outline • Motivation for new Sca/LAPACK • Challenges (or research opportunities…) • Goals of new Sca/LAPACK • Highlights of progress

  3. Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >60M web hits @ Netlib (incl. CLAPACK, LAPACK95)

  4. Impact (with NERSC, LBNL) • Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000) • ScaLAPACK

  5. Motivation • LAPACK and ScaLAPACK are widely used • Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, … • >60M web hits @ Netlib (incl. CLAPACK, LAPACK95) • Many ways to improve them, based on • Own algorithmic research • Enthusiastic participation of research community • User/vendor survey • Opportunities and demands of new architectures, programming languages • New releases planned (NSF support)

  6. Participants • UC Berkeley: • Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads… • U Tennessee, Knoxville • Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak • Other Academic Institutions • UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara • TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb • Research Institutions • CERFACS, LBL • Industrial Partners • Cray, HP, Intel, MathWorks, NAG, SGI

  7. Challenges • For all large scale computing, not just linear algebra! • Example …

  8. Parallelism in the Top500

  9. Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • 256 Threads/multicore chip by 2010

  10. Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • 256 Threads/multicore chip by 2010 • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency

  11. Challenges • For all large scale computing, not just linear algebra! • Example … your laptop • 256 Threads/multicore chip by 2010 • Exponentially growing gaps between • Floating point time << 1/Memory BW << Memory Latency • Floating point time << 1/Network BW << Network Latency • Heterogeneity (performance and semantics) • Asynchrony • Unreliability

  12. What do users want? • High performance, ease of use, … • Survey results at www.netlib.org/lapack-dev • Small but interesting sample • What matrix sizes do you care about? • 1000s: 34% • 10,000s: 26% • 100,000s or 1Ms: 26% • How many processors, on distributed memory? • >10: 34%, >100: 31%, >1000: 19% • Do you use more than double precision? • Sometimes or frequently: 16% • Would Automatic Memory Allocation help? • Very useful: 72%, Not useful: 14%

  13. Goals of next Sca/LAPACK • Better algorithms • Faster, more accurate • Expand contents • More functions, more parallel implementations • Automate performance tuning • Improve ease of use • Better software engineering • Increased community involvement

  14. Goal 1: Better Algorithms • Faster • But provide “usual” accuracy, stability • … Or accurate for an important subclass • More accurate • But provide “usual” speed • … Or at any cost

  15. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS 19) • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang • Novel Data Layouts: • Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30)

  16. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Faster and more accurate than previous algorithms • SIAM SIAG/LA Prize in 2006 • New sequential, first parallel versions out in 2006

  17. Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)

  18. Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)

  19. Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)

  20. Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)

  21. Flop Count Ratios of Eigensolvers (2.2 GHz Opteron + ACML)

  22. Run Time Ratios of Eigensolvers (2.2 GHz Opteron + ACML)

  23. MFlop Rates of Eigensolvers (2.2 GHz Opteron + ACML)

  24. Parallel Runtimes of Eigensolvers (2.4 GHz Xeon Cluster + Ethernet)

  25. Accuracy of Eigensolvers • Orthogonality: ||QQ^T – I|| / (n ε) • Residual: max_i ||T q_i – λ_i q_i|| / (n ε)

  26. Accuracy of Eigensolvers: Old vs New Grail • Orthogonality: ||QQ^T – I|| / (n ε) • Residual: max_i ||T q_i – λ_i q_i|| / (n ε)
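A minimal sketch, assuming SciPy, of how these two metrics can be computed for a symmetric tridiagonal T; the 'stemr' driver wraps LAPACK's MRRR-based routine. The matrix, sizes, and choice of norm below are illustrative, not the test set used in the talk.

```python
# Rough illustration: accuracy metrics for a random symmetric tridiagonal T.
import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 1000
d = np.random.rand(n)                      # diagonal of T
e = np.random.rand(n - 1)                  # off-diagonal of T
lam, Q = eigh_tridiagonal(d, e, lapack_driver='stemr')   # MRRR-based driver

T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
eps = np.finfo(float).eps

# Orthogonality: || Q Q^T - I || / (n eps)   (Frobenius norm used here)
orth = np.linalg.norm(Q @ Q.T - np.eye(n)) / (n * eps)
# Residual: max_i || T q_i - lam_i q_i || / (n eps)
resid = np.max(np.linalg.norm(T @ Q - Q * lam, axis=0)) / (n * eps)
print(orth, resid)
```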

  27. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Faster and more accurate than previous algorithms • SIAM SIAG/LA Prize in 2006 • New sequential, first parallel versions out in 2006 • Both DC and MR are important

  28. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS19) • Up to 10x faster HQR: • Byers / Mathias / Braman • SIAM SIAG/LA Prize in 2003 • Sequential version out in 2006 • More on performance later

  29. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS 19) • LAPACK Working Note (LAWN) #173 • On 26 real test matrices, speedups up to 14.7x, 4.4x average

  30. Comparison of ScaLAPACK QR and new parallel multishift QZ • Execution times in secs for 4096 x 4096 random problems Ax = λx and Ax = λBx, using processor grids including 1-16 processors. Note: work(QZ) > 2 * work(QR) but Time(// QZ) << Time(// QR)!! Times include cost for computing eigenvalues and transformation matrices. Adlerborn-Kågström-Kressner, SIAM PP’2006, also MS19

  31. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS 19) • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang • Full nonsymmetric eigenproblem: n=1500: 3.43x faster • HQR: 5x faster, Reduction: 14% faster • Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster • Sequential versions out in 2006

  32. Goal 1a – Faster Algorithms (Highlights) • MRRR algorithm for symmetric eigenproblem / SVD: • Parlett / Dhillon / Voemel / Marques / Willems (MS 19) • Up to 10x faster HQR: • Byers / Mathias / Braman • Extensions to QZ: • Kågström / Kressner / Adlerborn (MS19) • Faster Hessenberg, tridiagonal, bidiagonal reductions: • van de Geijn/Quintana-Orti, Howell / Fulton, Bischof / Lang • Novel Data Layouts • Gustavson / Kågström / Elmroth / Jonsson / Wasniewski (MS 15/23/30) • SIAM Review Article 2004

  33. Novel Data Layouts and Algorithms • Still merges multiple elimination steps into a few BLAS 3 operations • MS 15/23/30 – Novel Data Formats • Rectangular Packed Format: good speedups for packed storage of symmetric matrices
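For context, a sketch (not from the slides) of the conventional column-major packed storage that these new formats improve on: the triangle of a symmetric matrix is stored in a 1-D array of n(n+1)/2 entries, which saves memory but defeats level-3 BLAS; rectangular packed formats rearrange the same data so BLAS 3 kernels apply. The helper name below is illustrative.

```python
# Conventional 'U' (upper, column-major) packed storage, 0-based indexing:
# A[i, j] with i <= j lives at AP[i + j*(j+1)/2].
import numpy as np

def pack_upper(A):
    n = A.shape[0]
    ap = np.empty(n * (n + 1) // 2, dtype=A.dtype)
    for j in range(n):
        for i in range(j + 1):
            ap[i + j * (j + 1) // 2] = A[i, j]
    return ap                              # n(n+1)/2 entries instead of n^2
```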

  34. Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n^2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package

  35. Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Kahan, Riedy, Hida, Li • “Promise” the right answer for O(n^2) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN#165)

  36. More Accurate: Solve Ax=b • Plot comparing conventional Gaussian elimination with extra-precise iterative refinement • ε = n^(1/2) · 2^(-24)

  37. Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n^2) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN#165) • “Guarantees” based on condition number estimates • Condition estimate < 1/(n^(1/2) ε) ⇒ reliable answer and tiny error bounds • No bad bounds in 6.2M tests • Can condition estimators lie?

  38. Can condition estimators lie? • Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization • Demmel/Diament/Malajovich (FCM2001) • But what if matrix multiply costs O(n^2)? • More later

  39. Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • “Promise” the right answer for O(n^2) additional cost • Iterative refinement with extra-precise residuals • Extra-precise BLAS needed (LAWN#165) • “Guarantees” based on condition number estimates • Get tiny componentwise bounds too • Each x_i accurate • Slightly different condition number • Extends to Least Squares (Li) • Release in 2006
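A minimal sketch of the refinement loop, assuming NumPy/SciPy, with longdouble standing in for the extended-precision XBLAS used by the real routines; the function name and stopping test are illustrative.

```python
# Iterative refinement with extra-precise residuals: factorization and solves
# stay in double; only the residual is accumulated in higher precision.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine(A, b, max_iter=10):
    lu, piv = lu_factor(A)                       # O(n^3), done once
    x = lu_solve((lu, piv), b)
    A_hi = A.astype(np.longdouble)
    b_hi = b.astype(np.longdouble)
    eps = np.finfo(np.float64).eps
    for _ in range(max_iter):
        r = (b_hi - A_hi @ x).astype(np.float64)   # extra-precise residual
        dx = lu_solve((lu, piv), r)                # each correction is O(n^2)
        x = x + dx
        if np.linalg.norm(dx) <= eps * np.linalg.norm(x):
            break
    return x
```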

  40. Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Promise the right answer for O(n^2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Drmac / Veselic (MS 3) • LAWNS # 169, 170 • Can be arbitrarily more accurate on tiny singular values • Yet faster than QR iteration, within 2x of DC
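For intuition, a bare-bones one-sided (Hestenes) Jacobi SVD in Python. This sketches only the classical method, not the Drmac/Veselic algorithm, which adds preconditioning, pivoting, and the refinements that make it competitive in speed; all names here are illustrative.

```python
# One-sided Jacobi SVD: rotate pairs of columns of A until all pairs are
# orthogonal; singular values are then the column norms.  Assumes full column
# rank; singular values are returned unsorted.
import numpy as np

def jacobi_svd(A, tol=1e-15, max_sweeps=30):
    U = np.array(A, dtype=float)
    m, n = U.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                      # pair already orthogonal
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                J = np.array([[c, s], [-s, c]])   # plane rotation in cols p, q
                U[:, [p, q]] = U[:, [p, q]] @ J
                V[:, [p, q]] = V[:, [p, q]] @ J
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    return U / sigma, sigma, V        # A ≈ (U/sigma) @ diag(sigma) @ V.T
```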

  41. Goal 1b – More Accurate Algorithms • Iterative refinement for Ax=b, least squares • Promise the right answer for O(n^2) additional cost • Jacobi-based SVD • Faster than QR, can be arbitrarily more accurate • Arbitrary precision versions of everything • Using your favorite multiple precision package • Quad, Quad-double, ARPREC, MPFR, … • Using Fortran 95 modules

  42. Iterative Refinement: for speed (MS 3) • What if double precision much slower than single? • Cell processor in Playstation 3 • 256 GFlops single, 25 GFlops double • Pentium SSE2: single twice as fast as double • Given Ax=b in double precision • Factor in single, do refinement in double • If κ(A) < 1/ε_single, runs at speed of single • 1.9x speedup on Intel-based laptop • Applies to many algorithms, if difference large
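A minimal sketch of the single/double scheme, in NumPy/SciPy rather than the actual LAPACK or Cell code: the O(n^3) factorization is done once in single precision, and each O(n^2) refinement step uses double-precision residuals.

```python
# Mixed-precision solve: factor in single, refine in double.  Converges to
# double accuracy whenever A is not too ill-conditioned.  Illustrative names.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_mixed(A, b, max_iter=30):
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    lu, piv = lu_factor(A.astype(np.float32))        # O(n^3) work in single
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    eps = np.finfo(np.float64).eps
    for _ in range(max_iter):
        r = b - A @ x                                # residual in double, O(n^2)
        dx = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += dx
        if np.linalg.norm(dx) <= eps * np.linalg.norm(x):
            break
    return x
```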

  43. Exploiting GPUs • Numerous emerging co-processors • Cell, SSE, Grape, GPU, “physics coprocessor,” … • When can we exploit them? • Little help if memory is bottleneck • Various attempts to use GPUs for dense linear algebra • Bisection on GPUs for symmetric tridiagonal eigenproblem • Evaluate Count(x) = #(evals < x) for many x • Very little memory traffic • Speedups up to 100x (Volkov)
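Count(x) here is the classical Sturm-sequence (negcount) computation for a symmetric tridiagonal matrix: O(n) work per shift with essentially one pass over the data, which is why many independent shifts map well onto a GPU. A CPU-side sketch, with illustrative names and a crude pivot safeguard:

```python
# Count(x) = #(eigenvalues of T < x) for T = tridiag(e, d, e), plus bisection.
import numpy as np

def count_below(d, e, x):
    count = 0
    q = 1.0
    tiny = np.finfo(float).tiny
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = (d[i] - x) - off / q
        if abs(q) < tiny:          # crude safeguard (LAPACK uses a pivmin)
            q = -tiny
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    # k-th smallest eigenvalue (0-based); [lo, hi] must bracket the spectrum,
    # e.g. from Gershgorin disks.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) <= k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```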

  44. Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible

  45. Missing Drivers in Sca/LAPACK

  46. More missing drivers

  47. Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: • Stewart, Langou • More generalized SVDs: • Bai, Wang, Drmac (MS 3) • More generalized Sylvester/Lyapunov eqns: • Kågström, Jonsson, Granat (MS19) • Structured eigenproblems • Selected matrix polynomials: • Mehrmann, Higham, Tisseur • O(n^2) version of roots(p) • Gu, Chandrasekaran, Bindel et al

  48. New algorithm for roots(p) • To find roots of polynomial p, roots(p) does eig(C(p)), where the companion matrix C(p) has first row [-p1 -p2 … -pd] and ones on the subdiagonal • Costs O(n^3), stable, reliable • O(n^2) alternatives: Newton, Jenkins-Traub, Laguerre, … • Stable? Reliable? • New: exploit “semiseparable” structure of C(p) • Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration • Complexity drops from O(n^3) to O(n^2), stable in practice • Related work: Van Barel (MS3), Gemignani, Bini, Pan, et al • Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel
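To make the baseline concrete: the O(n^3) approach the slide starts from, sketched with NumPy (numpy.roots and MATLAB's roots do essentially this). The new algorithm keeps the same companion matrix but exploits its semiseparable structure inside QR iteration to reach O(n^2).

```python
# Baseline roots(p) via the companion matrix and dense QR iteration: O(d^3).
# p excludes the leading coefficient, i.e. the polynomial is
# x^d + p1*x^(d-1) + ... + pd.
import numpy as np

def roots_via_companion(p):
    p = np.asarray(p, dtype=float)
    d = len(p)
    C = np.zeros((d, d))
    C[0, :] = -p                       # first row: negated coefficients
    C[1:, :-1] = np.eye(d - 1)         # ones on the subdiagonal
    return np.linalg.eigvals(C)

print(roots_via_companion([-3.0, 2.0]))   # x^2 - 3x + 2 = (x-1)(x-2)  ->  ~[2., 1.]
```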

  49. Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible • New functions (highlights) • Updating / downdating of factorizations: • Stewart, Langou • More generalized SVDs: • Bai, Wang, Drmac (MS 3) • More generalized Sylvester/Lyapunov eqns: • Kågström, Jonsson, Granat (MS19) • Structured eigenproblems • Selected matrix polynomials: • Mehrmann, Higham, Tisseur • O(n^2) version of roots(p) • Gu, Chandrasekaran, Bindel et al • How should we prioritize missing functions?

  50. Goal 3 – Automate Performance Tuning • Widely used in performance tuning of Kernels • ATLAS (PhiPAC) – BLAS - www.netlib.org/atlas • FFTW – Fast Fourier Transform – www.fftw.org • Spiral – signal processing - www.spiral.net • OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski • Integrated into PETSc
