Adaptive Strassen and ATLAS’s DGEMM. Paolo D’Alberto (CMU) and Alexandru Nicolau (UCI). The Problem: Matrix Computations. The evolution of systems is modeled by matrix computations. The prediction and evaluation of such models (of complex systems) are fundamental in scientific computing.
Adaptive Strassen and ATLAS’s DGEMM
Paolo D’Alberto (CMU) and Alexandru Nicolau (UCI)
HPC Asia
The Problem: Matrix Computations
• The evolution of systems is modeled by matrix computations
• The prediction and evaluation of such models (of complex systems) are fundamental in scientific computing
• Examples include the solution of systems of linear equations and of least-squares problems
The Problem: BLAS
• The Basic Linear Algebra Subprograms (BLAS) are an interface describing a set of basic matrix and vector computations
• Historically, the BLAS was a set of algorithms
• Libraries implementing the BLAS are the backbone of today’s high-performance computing
• For example, ScaLAPACK is built on top of the BLAS
• Implementations include ESSL, PHiPAC, and ATLAS
The Problem: ATLAS
• Implementations of the BLAS-3 routines are based on matrix multiplication (MM)
• In practice, ATLAS automatically generates a custom-tailored MM:
• It probes the system
• It tailors an MM kernel to that specific system
• It uses the MM as the basic routine for the other BLAS-3 routines
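Matrix Multiplication (basics)
\[
\begin{pmatrix} C_0 & C_1 \\ C_2 & C_3 \end{pmatrix}
=
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
\]
\[
C_0 = A_0 B_0 + A_1 B_2, \quad
C_1 = A_0 B_1 + A_1 B_3, \quad
C_2 = A_2 B_0 + A_3 B_2, \quad
C_3 = A_2 B_1 + A_3 B_3
\]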
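The four quadrant formulas on this slide can be checked with a small pure-Python sketch. The helper names (`matmul`, `matadd`, `split`, `block_matmul`) are hypothetical and not taken from ATLAS; matrices are plain lists of rows and are assumed to be square with size at least 2.

```python
def matmul(A, B):
    """Classic product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matadd(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M, r, c):
    """Return the quadrants M0, M1, M2, M3 of M, cut at row r and column c."""
    return ([row[:c] for row in M[:r]], [row[c:] for row in M[:r]],
            [row[:c] for row in M[r:]], [row[c:] for row in M[r:]])


def block_matmul(A, B):
    """Multiply square matrices (size >= 2) through their 2x2 block form."""
    h = len(A) // 2
    A0, A1, A2, A3 = split(A, h, h)
    B0, B1, B2, B3 = split(B, h, h)
    # Eight block products and four block additions, as on the slide.
    C0 = matadd(matmul(A0, B0), matmul(A1, B2))
    C1 = matadd(matmul(A0, B1), matmul(A1, B3))
    C2 = matadd(matmul(A2, B0), matmul(A3, B2))
    C3 = matadd(matmul(A2, B1), matmul(A3, B3))
    # Stitch the quadrants back together: top rows first, then bottom rows.
    return [l + r for l, r in zip(C0, C1)] + [l + r for l, r in zip(C2, C3)]
```

The classic scheme needs eight block products per level, which is the count Strassen’s algorithm reduces to seven.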
The Problem: MM
• ATLAS uses this classic matrix multiply
• For square matrices of size n × n, the algorithm takes O(n^3) operations
• It achieves 80-90% of peak performance
• We use Strassen’s algorithm for large problems because it reduces the number of computations (thus shortening the execution time)
• We investigate the effects on single-processor systems
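To make "reduces the number of computations" concrete, the sketch below counts scalar multiplications only. It assumes the size is `leaf` times a power of two and that the classic algorithm takes over at size `leaf`; the function names are hypothetical, and the extra matrix additions that Strassen’s algorithm needs are deliberately not counted.

```python
def classic_mults(n):
    """Scalar multiplications of the classic algorithm on n x n matrices."""
    return n ** 3


def strassen_mults(n, leaf=1):
    """Scalar multiplications when every Strassen step replaces one product
    of size n by seven products of size n/2, down to size `leaf`."""
    if n <= leaf:
        return classic_mults(n)
    if n % 2:
        raise ValueError("this sketch only handles leaf * 2**k sizes")
    return 7 * strassen_mults(n // 2, leaf)
```

For example, `classic_mults(8)` is 512, while `strassen_mults(8)` is 343 and `strassen_mults(8, leaf=4)` is 448.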
The Problem: Strassen’s
• For matrices whose size is a power of two, Strassen’s algorithm takes O(n^(log_2 7)) operations
• For even-size matrices, one recursive step is always applicable
• Otherwise, the options are:
  • Dynamic or static padding
  • Peeling, for odd-size matrices [Hauss 97 & Luo 2004]
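A minimal sketch of the textbook Strassen recursion for power-of-two sizes follows, using the quadrant names A0..A3 and B0..B3 from the earlier slide. It is a pure-Python illustration with hypothetical helper names, not the authors’ implementation, which calls ATLAS’s DGEMM at the leaves.

```python
def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def classic(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def quads(M, h):
    """Quadrants M0, M1, M2, M3 of a square matrix cut at index h."""
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, leaf=1):
    """Strassen's recursion; sizes must be leaf * 2**k in this sketch."""
    n = len(A)
    if n <= leaf:
        return classic(A, B)          # recursion point: fall back to classic MM
    if n % 2:
        raise ValueError("odd size: padding or peeling would be needed")
    h = n // 2
    A0, A1, A2, A3 = quads(A, h)
    B0, B1, B2, B3 = quads(B, h)
    # Seven recursive products instead of eight.
    M1 = strassen(add(A0, A3), add(B0, B3), leaf)
    M2 = strassen(add(A2, A3), B0, leaf)
    M3 = strassen(A0, sub(B1, B3), leaf)
    M4 = strassen(A3, sub(B2, B0), leaf)
    M5 = strassen(add(A0, A1), B3, leaf)
    M6 = strassen(sub(A2, A0), add(B0, B1), leaf)
    M7 = strassen(sub(A1, A3), add(B2, B3), leaf)
    # Recombine: 18 matrix additions/subtractions in total per step.
    C0 = add(sub(add(M1, M4), M5), M7)
    C1 = add(M3, M5)
    C2 = add(M2, M4)
    C3 = add(add(sub(M1, M2), M3), M6)
    return [l + r for l, r in zip(C0, C1)] + [l + r for l, r in zip(C2, C3)]
```

The `leaf` parameter plays the role of a recursion point: below it, the classic multiply is used.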
Odd-Size Square Matrices
[Figure: A and B are (2n+1) × (2n+1) matrices; A0 and B0 are 2n × 2n sub-matrices of A and B.]
A0 * B0 is an even-size problem, so Strassen is applied once more.
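Peeling can be sketched as follows, assuming A0 and B0 are the leading 2n × 2n blocks and the last row and column are the peeled part. The fix-up terms are standard block algebra; the function names and the pluggable `core_mm` parameter are hypothetical, and in practice `core_mm` would be a Strassen routine for even sizes.

```python
def classic(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def peeled_matmul(A, B, core_mm=classic):
    """Multiply (2n+1) x (2n+1) matrices: even-size core product plus fix-ups."""
    k = len(A) - 1                                   # even core size 2n
    A0 = [row[:k] for row in A[:k]]
    B0 = [row[:k] for row in B[:k]]
    a_col = [A[i][k] for i in range(k)]              # last column of A (without corner)
    a_row, alpha = A[k][:k], A[k][k]                 # last row of A and its corner
    b_col = [B[i][k] for i in range(k)]
    b_row, beta = B[k][:k], B[k][k]

    C = core_mm(A0, B0)                              # the even-size problem
    for i in range(k):                               # rank-one update: a_col * b_row
        for j in range(k):
            C[i][j] += a_col[i] * b_row[j]
    for i in range(k):                               # last column: A0 * b_col + a_col * beta
        C[i].append(sum(A0[i][t] * b_col[t] for t in range(k)) + a_col[i] * beta)
    last = [sum(a_row[t] * B0[t][j] for t in range(k)) + alpha * b_row[j]
            for j in range(k)]                       # last row: a_row * B0 + alpha * b_row
    last.append(sum(a_row[t] * b_col[t] for t in range(k)) + alpha * beta)
    C.append(last)
    return C
```

The fix-ups cost O(n^2) operations, so the even-size core product dominates for large matrices.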
Our Approach: balanced division
• For any matrix size, we apply a balanced Strassen division process
• This reduces the number of computations further than the odd/even-size approach (or padding)
• Balanced division = balanced workload
• Thus, predictable performance
• Balanced-size operands
• Better data cache utilization
Balanced Division Matrices
Near square: m = n + p, with n and p chosen to minimize |n - p|
[Figure: A and B, of size m, are each divided into quadrants A0, A1, A2, A3 and B0, B1, B2, B3, with sides n and p.]
The quadrants are near-square matrices. At any step of the recursion, all sub-matrices are near-square matrices.
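The split m = n + p with minimal |n - p| can be written in a few lines. Placing the larger half first (top-left quadrant) is an assumption of this sketch, as the slide text does not say which half is larger; the function names are hypothetical.

```python
def balanced_split(m):
    """Return (n, p) with m == n + p and |n - p| <= 1 (larger part first)."""
    n = (m + 1) // 2
    return n, m - n


def quadrant_shapes(rows, cols):
    """Shapes of the four quadrants after one balanced division step."""
    r0, r1 = balanced_split(rows)
    c0, c1 = balanced_split(cols)
    return [(r0, c0), (r0, c1), (r1, c0), (r1, c1)]
```

For a 7 × 7 matrix, `quadrant_shapes(7, 7)` gives `[(4, 4), (4, 3), (3, 4), (3, 3)]`: every quadrant differs from square by at most one row or column.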
Balanced Matrices (new matrix addition and multiplication)
• The balanced division with Strassen’s recursion needs a new definition of matrix addition (MA)
• This is because matrices of different sizes must be added
• We generalize the operations such that:
• The algorithm is correct
• The extra control for the irregular sizes is negligible and is needed only for matrix additions
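One possible reading of the generalized operations is sketched below. It assumes that adding matrices of different sizes treats the smaller operand as extended with zeros, and that a product with mismatched inner dimensions uses only the common part. These conventions, the helper names, and the pure-Python style are illustrative assumptions, not the authors’ code.

```python
def shape(M):
    return len(M), len(M[0])


def gadd(X, Y, sign=1):
    """X + sign * Y, with the smaller operand treated as zero-extended."""
    (rx, cx), (ry, cy) = shape(X), shape(Y)

    def get(M, r, c, i, j):
        return M[i][j] if i < r and j < c else 0

    return [[get(X, rx, cx, i, j) + sign * get(Y, ry, cy, i, j)
             for j in range(max(cx, cy))] for i in range(max(rx, ry))]


def classic(A, B):
    """Classic product over the common inner dimension."""
    k = min(len(A[0]), len(B))
    return [[sum(row[t] * B[t][j] for t in range(k)) for j in range(len(B[0]))]
            for row in A]


def quadrants(M):
    """Balanced division: the larger half goes to the top-left quadrant."""
    r, c = shape(M)
    r0, c0 = (r + 1) // 2, (c + 1) // 2
    return ([row[:c0] for row in M[:r0]], [row[c0:] for row in M[:r0]],
            [row[:c0] for row in M[r0:]], [row[c0:] for row in M[r0:]])


def balanced_strassen(A, B, leaf=1):
    (m, ka), (kb, q) = shape(A), shape(B)
    k = min(ka, kb)
    if min(m, k, q) <= leaf:
        return classic(A, B)                  # recursion point
    A = [row[:k] for row in A]                # keep only the common inner part
    B = B[:k]
    A0, A1, A2, A3 = quadrants(A)
    B0, B1, B2, B3 = quadrants(B)

    def mm(X, Y):
        return balanced_strassen(X, Y, leaf)

    M1 = mm(gadd(A0, A3), gadd(B0, B3))
    M2 = mm(gadd(A2, A3), B0)
    M3 = mm(A0, gadd(B1, B3, -1))
    M4 = mm(A3, gadd(B2, B0, -1))
    M5 = mm(gadd(A0, A1), B3)
    M6 = mm(gadd(A2, A0, -1), gadd(B0, B1))
    M7 = mm(gadd(A1, A3, -1), gadd(B2, B3))
    C0 = gadd(gadd(gadd(M1, M4), M5, -1), M7)
    C1 = gadd(M3, M5)
    C2 = gadd(M2, M4)
    C3 = gadd(gadd(gadd(M1, M2, -1), M3), M6)
    m1, q1 = m - (m + 1) // 2, q - (q + 1) // 2
    C3 = [row[:q1] for row in C3[:m1]]        # drop the zero-extended border
    return [l + r for l, r in zip(C0, C1)] + [l + r for l, r in zip(C2, C3)]
```

In this sketch the irregular sizes are handled in `gadd` and in slicing, while every product at the recursion point is an ordinary multiply on conformal operands.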
Experimental Results
• We considered 14 systems
• We hand-coded the MA for each specific system
• We measured the performance of ATLAS’s MM and MA
• We specified an adaptive recursion point size for each system
• We encoded the recursion point in the algorithm
• We measured the relative performance of Strassen vs. ATLAS
• We report the details for three systems shortly
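How a recursion point might be derived from measured MM and MA rates can be sketched with a simple cost model. The model assumes the classic MM costs 2n^3 floating-point operations, and that one Strassen step costs seven half-size products plus 18 half-size matrix additions (the textbook count). The slides do not state the authors’ exact formula, so the function names and the rates used in the example are hypothetical.

```python
def classic_time(n, mm_rate):
    """Predicted time of classic MM: 2 n^3 operations at mm_rate ops/s."""
    return 2 * n ** 3 / mm_rate


def one_step_time(n, mm_rate, ma_rate, additions=18):
    """Predicted time of one Strassen step: 7 half-size MMs plus the MAs."""
    h = n / 2
    return 7 * classic_time(h, mm_rate) + additions * h * h / ma_rate


def recursion_point(mm_rate, ma_rate, additions=18):
    """Break-even size: solving classic_time == one_step_time for n
    gives n = additions * mm_rate / ma_rate."""
    return additions * mm_rate / ma_rate
```

With assumed rates of 4e9 operations per second for MM and 1e8 additions per second for MA, `recursion_point(4e9, 1e8)` is 720: under this model a Strassen step pays off only above that size, and a faster MM relative to MA pushes the recursion point higher.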
[Figure: Opteron. Series: Strassen + ATLAS; ATLAS’s performance (the higher the better).]
[Figure: 8600 PA-RISC. Series: Strassen + ATLAS; ATLAS’s performance.]
[Figure: ALPHA. Series: Strassen + ATLAS; ATLAS’s performance.]
Conclusions
• Our approach uses a balanced division, as Strassen’s algorithm does
• We tested performance exhaustively
• Some architectures do not offer a practical opportunity for Strassen’s algorithm
• We use benchmarking of ATLAS’s MM and MA for specific code tuning, in the spirit of adaptive software packages
• We speed up ATLAS’s MM without introducing any overhead due to data layout or extra control
Future work
• The algorithm extends to rectangular matrices
• We will characterize its performance
• Parallel formulation and performance
• Power management
• MM and MA together make up the application; however, they utilize the architecture differently
• Adaptation of hardware configurations (e.g., XScale)