Adaptive Strassen and ATLAS’s DGEMM

Presentation Transcript


  1. Adaptive Strassen and ATLAS’s DGEMM Paolo D’Alberto (CMU) and Alexandru Nicolau (UCI) HPC Asia

  2. The Problem: Matrix Computations • The evolution of systems is modeled by matrix computations. • The prediction and evaluation of such models (of complex systems) is fundamental in scientific computing. • For example, the solution of linear equations or of least-squares systems.

  3. The Problem: BLAS • The Basic Linear Algebra Subprograms (BLAS) is an interface describing a set of basic matrix and vector computations • Historically, the BLAS was a set of algorithms • Libraries implementing the BLAS are the backbone of today’s high-performance computations • For ScaLAPACK • ESSL, PHiPAC and ATLAS

  4. The Problem: ATLAS • Implementations of BLAS 3 are based on matrix multiplication (MM) • In practice, ATLAS automatically generates a custom-tailored MM: • It probes the system • It tailors an MM kernel to the specific system • It uses the MM as a basic routine for the other BLAS-3 routines

  5. Matrix Multiplication (basics) • C = A * B, with each matrix divided into quadrants: C = [C0 C1; C2 C3], A = [A0 A1; A2 A3], B = [B0 B1; B2 B3] • C0 = A0B0 + A1B2 • C1 = A0B1 + A1B3 • C2 = A2B0 + A3B2 • C3 = A2B1 + A3B3
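
The four quadrant formulas on this slide can be checked with a short sketch. It is not ATLAS code: the helper names (`matmul`, `matadd`, `quadrants`, `block_matmul`) are hypothetical, matrices are plain Python lists of lists, and the size is assumed to be square and even.

```python
def matmul(A, B):
    """Classic product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matadd(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def quadrants(M):
    """Split an even-size square matrix into M0, M1, M2, M3."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def block_matmul(A, B):
    """One level of the quadrant formulas: 8 half-size products, 4 additions."""
    A0, A1, A2, A3 = quadrants(A)
    B0, B1, B2, B3 = quadrants(B)
    C0 = matadd(matmul(A0, B0), matmul(A1, B2))
    C1 = matadd(matmul(A0, B1), matmul(A1, B3))
    C2 = matadd(matmul(A2, B0), matmul(A3, B2))
    C3 = matadd(matmul(A2, B1), matmul(A3, B3))
    # Stitch the quadrants back together row by row.
    return ([r + s for r, s in zip(C0, C1)] +
            [r + s for r, s in zip(C2, C3)])
```

The eight half-size products here are what Strassen's algorithm reduces to seven.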

  6. The Problem: MM • ATLAS uses this classic matrix multiply • For square matrices of size n×n, the algorithm takes O(n^3) operations • It achieves 80-90% of peak performance • We consider Strassen’s algorithm for large problems • Because it reduces the number of computations (thus shortening the execution time) • We investigate the effects on single-processor systems
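
To make "reduces the number of computations" concrete, the sketch below counts scalar multiplications only. It is an illustration with hypothetical function names, and it ignores the extra matrix additions that each Strassen step introduces.

```python
def classic_mults(n):
    """Scalar multiplications in the classic n x n product."""
    return n ** 3

def strassen_mults(n, levels):
    """Scalar multiplications when `levels` Strassen steps are applied
    and the classic product is used on the resulting leaves.
    Assumes n is divisible by 2**levels."""
    if n % (2 ** levels) != 0:
        raise ValueError("n must be divisible by 2**levels")
    leaf = n // (2 ** levels)
    # Each step turns one product into 7 products of half the size.
    return 7 ** levels * classic_mults(leaf)

if __name__ == "__main__":
    for levels in range(4):
        ratio = strassen_mults(1024, levels) / classic_mults(1024)
        print(levels, ratio)
```

Each recursion level multiplies the count by 7/8, which is why the saving only matters once the problem is large enough to outweigh the added matrix additions.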

  7. The Problem: Strassen’s • Strassen’s for power-of-two-size matrices (n = 2^k): O(n^(log2 7)) • For even-size matrices, one recursive step is always applicable • Otherwise • Dynamic and static padding • Peeling: • For odd-size matrices [Hauss 97 & Luo 2004]:
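
For reference, a minimal sketch of the textbook Strassen recursion for power-of-two sizes follows. It uses the standard seven-product formulation rather than the authors' implementation. The function names and the `leaf` parameter (the size at which the classic product takes over, assumed to be at least 1) are hypothetical.

```python
def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def classic(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def strassen(A, B, leaf=1):
    """Strassen's recursion for n x n matrices, n a power of two."""
    n = len(A)
    if n <= leaf:
        return classic(A, B)
    h = n // 2
    A0, A1 = [r[:h] for r in A[:h]], [r[h:] for r in A[:h]]
    A2, A3 = [r[:h] for r in A[h:]], [r[h:] for r in A[h:]]
    B0, B1 = [r[:h] for r in B[:h]], [r[h:] for r in B[:h]]
    B2, B3 = [r[:h] for r in B[h:]], [r[h:] for r in B[h:]]
    # Seven half-size products instead of eight.
    M1 = strassen(add(A0, A3), add(B0, B3), leaf)
    M2 = strassen(add(A2, A3), B0, leaf)
    M3 = strassen(A0, sub(B1, B3), leaf)
    M4 = strassen(A3, sub(B2, B0), leaf)
    M5 = strassen(add(A0, A1), B3, leaf)
    M6 = strassen(sub(A2, A0), add(B0, B1), leaf)
    M7 = strassen(sub(A1, A3), add(B2, B3), leaf)
    C0 = add(sub(add(M1, M4), M5), M7)
    C1 = add(M3, M5)
    C2 = add(M2, M4)
    C3 = add(add(sub(M1, M2), M3), M6)
    return ([r + s for r, s in zip(C0, C1)] +
            [r + s for r, s in zip(C2, C3)])
```

Raising `leaf` stops the recursion earlier, which corresponds to handing small sub-problems to a tuned classic kernel.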

  8. Odd-Size Square Matrices • A and B have size (2n+1)×(2n+1); A0 and B0 are 2n×2n sub-matrices of A and B • A0 * B0 is an even-size problem • Strassen is applied once more
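
Peeling can be sketched as follows, assuming the even-size core A0, B0 is the top-left 2n×2n block (the transcript does not say where the core sits). The names `peeled_multiply` and `mul_even` are hypothetical. `mul_even` stands in for whatever routine handles the even-size product, for example a Strassen step.

```python
def classic(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def peeled_multiply(A, B, mul_even=classic):
    """Multiply odd-size square matrices by peeling the last row and column."""
    if len(A) % 2 == 0:
        raise ValueError("peeling is meant for odd-size matrices")
    m = len(A) - 1                       # even core size 2n
    A0 = [row[:m] for row in A[:m]]
    B0 = [row[:m] for row in B[:m]]
    a_col = [row[m] for row in A[:m]]    # last column of A, corner excluded
    a_row, a_s = A[m][:m], A[m][m]       # last row of A and its corner
    b_col = [row[m] for row in B[:m]]
    b_row, b_s = B[m][:m], B[m][m]
    core = mul_even(A0, B0)              # the even-size problem
    C = []
    for i in range(m):
        # Core product plus the rank-one correction a_col * b_row.
        row = [core[i][j] + a_col[i] * b_row[j] for j in range(m)]
        # Last column: A0 * b_col + a_col * b_s.
        row.append(sum(A0[i][t] * b_col[t] for t in range(m)) + a_col[i] * b_s)
        C.append(row)
    # Last row: a_row * B0 + a_s * b_row, then the corner element.
    last = [sum(a_row[t] * B0[t][j] for t in range(m)) + a_s * b_row[j]
            for j in range(m)]
    last.append(sum(a_row[t] * b_col[t] for t in range(m)) + a_s * b_s)
    C.append(last)
    return C
```

The fix-up work outside the core is only O(n^2), but it is irregular, which is part of what the balanced division described later tries to avoid.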

  9. Our Approach: balanced division • For any matrix size, we apply a balanced Strassen’s division process • This reduces the number of computations further than peeling (the odd/even split) or padding • Balanced division = balanced workload • Thus, predictable performance • Balanced sized operands • Better data cache utilization

  10. Balanced Division • Near-square matrices: m = n+p with min|n-p| • An m×m matrix A is divided into quadrants A0 (n×n), A1 (n×p), A2 (p×n), A3 (p×p); likewise B • The quadrants are near square matrices • At any step of the recursion, all sub-matrices are near square matrices
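
The size rule m = n + p with |n - p| minimal can be sketched as below. Taking n as the larger half (the ceiling) is an assumption read off the slide's figure labels. The function names are hypothetical.

```python
def balanced_sizes(m):
    """Split m into n + p with |n - p| <= 1 (n is the larger part)."""
    n = (m + 1) // 2
    p = m - n
    return n, p

def balanced_split(M):
    """Quadrants of a square matrix under the balanced division:
    M0 is n x n, M1 is n x p, M2 is p x n, M3 is p x p."""
    n, _ = balanced_sizes(len(M))
    return ([r[:n] for r in M[:n]], [r[n:] for r in M[:n]],
            [r[:n] for r in M[n:]], [r[n:] for r in M[n:]])

def shape(M):
    return (len(M), len(M[0]) if M else 0)

if __name__ == "__main__":
    M = [[0] * 5 for _ in range(5)]
    print([shape(Q) for Q in balanced_split(M)])
```

Because n and p differ by at most one, every quadrant has row and column counts that differ by at most one, which is the "near square" property the slide refers to.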

  11. Balanced Matrices (new matrix addition and multiplication) • The balanced division with Strassen’s recursion needs a new definition of matrix addition (MA) • because it adds matrices of different sizes • We generalize the operations such that: • The algorithm is correct • The extra control for the irregular sizes is negligible and needed only for matrix additions
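
The transcript does not give the generalized addition itself. One plausible reading, shown here purely as an assumption, is that the smaller operand is treated as if it were extended with zeros to the shape of the larger one, without ever storing those zeros. The name `gadd` and its `sign` parameter are hypothetical.

```python
def shape(M):
    return (len(M), len(M[0]) if M else 0)

def gadd(X, Y, sign=1):
    """Return X + sign * Y for matrices of possibly different shapes.

    ASSUMPTION: missing entries count as zero, so the result has the
    element-wise maximum of the two shapes. The only extra control is
    the bounds check inside the addition itself."""
    rx, cx = shape(X)
    ry, cy = shape(Y)
    rows, cols = max(rx, ry), max(cx, cy)
    return [[(X[i][j] if i < rx and j < cx else 0) +
             sign * (Y[i][j] if i < ry and j < cy else 0)
             for j in range(cols)]
            for i in range(rows)]
```

For example, adding an n×n quadrant A0 and a p×p quadrant A3 with p = n - 1 yields an n×n result whose last row and column come from A0 alone.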

  12. Experimental Results • We considered 14 systems • We hand-coded the MA for each specific system • We measured the performance of ATLAS’s MM and MA • We specified an adaptive recursion point size for each system • We encoded the recursion point in the algorithm • We measured the relative performance of Strassen vs. ATLAS • We report the details for three systems shortly
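
How a recursion point might follow from measured MM and MA rates can be illustrated with a simple cost model. This is an assumption for illustration, not the authors' formula. It assumes the classic n×n product costs 2n^3 flops at a measured rate `mm_rate`, and that one Strassen step adds 18 half-size matrix additions (the count for Strassen's original formulation) at a measured rate `ma_rate` in element additions per second. All names are hypothetical.

```python
def recursion_point(mm_rate, ma_rate, additions=18):
    """Break-even size n above which one Strassen step is predicted to pay off.

    Saved by one step:  2n^3/mm_rate - 7 * 2(n/2)^3/mm_rate = n^3 / (4 * mm_rate)
    Added by one step:  additions * (n/2)^2 / ma_rate
    Setting saved = added gives n = additions * mm_rate / ma_rate."""
    return additions * mm_rate / ma_rate

def one_step_pays_off(n, mm_rate, ma_rate, additions=18):
    """True when the model predicts a Strassen step beats the classic product."""
    return n > recursion_point(mm_rate, ma_rate, additions)
```

Under this model, the faster the MM kernel is relative to MA, the larger the recursion point. That is consistent with the slide's statement that the recursion point is specified per system.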

  13. [Figure-only slide; no text in the transcript]

  14. Opteron • [Plot labels: Strassen + ATLAS; ATLAS’s; Performance (the higher the better)]

  15. 8600 PA-RISC • [Plot labels: Strassen + ATLAS; ATLAS’s; Performance]

  16. ALPHA • [Plot labels: Strassen + ATLAS; ATLAS’s; Performance]

  17. Conclusions • Our approach uses the balanced division as Strassen’s does • We performed exhaustive performance testing • Some architectures do not offer a practical opportunity for Strassen’s • We use benchmarking of ATLAS’s MM and MA for specific code tuning • In the spirit of adaptive software packages • We speed up ATLAS’s MM without introducing any overhead • Due to data layout or extra control

  18. Future work • The algorithm extends to rectangular matrices • We will characterize its performance • Parallel formulation and performance • Power management • MM and MA compose the application; however, they have different architecture utilization • Adaptation to hardware configurations (e.g., XScale)
