Enduring Linear Algebra Software Libraries
Fred Gustavson
Adjunct Prof., Umeå U., Umeå, Sweden
IBM Research, Emeritus, Yorktown Heights, NY
E-mail: fg2935@hotmail.com
2010 CScADS: Languages and Compilers for LA Libraries. Snowbird, Utah, August 9, 2010
Fundamental "Triangle"
[Triangle diagram with vertices A, C, H]
• A: Algorithms
• C: Compilers
• H: Hardware
Algorithm and Architecture
• The key to performance is to understand the interaction between the algorithm and the architecture.
• A significant improvement in performance can be obtained by matching the algorithm to the architecture, or vice versa.
• This matching is a cost-effective way of providing a given level of performance.
• Multi-core puts more of the burden on the algorithm part of the triangle.
• This is especially hard for the designers of library software.
Blocking
• TLB Blocking -- minimize TLB misses
• Cache Blocking -- minimize cache misses
• Register Blocking -- minimize loads and stores
• The general idea of blocking is to move data into high-speed storage and use it multiple times, so as to amortize the cost of moving it (see the sketch after this list).
• Cache Blocking -- reduces traffic between memory and cache
• Register Blocking -- reduces traffic between cache and the CPU
• TLB Blocking -- covers the current working set of a problem
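As an illustration, here is a minimal sketch of cache blocking applied to C = C + A*B in C, assuming column-major storage with leading dimension n and a hypothetical block size NB chosen so that three NB x NB blocks fit in cache together:

```c
#include <stddef.h>

#define NB 64  /* hypothetical block size: pick it so three NB x NB
                  blocks fit in cache together                        */

/* C = C + A*B for n x n column-major matrices (leading dimension n
 * for simplicity).  The three outer loops walk over blocks; each
 * block brought into cache is reused NB times, amortizing the cost
 * of moving it from memory. */
void gemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t ii = 0; ii < n; ii += NB)
                for (size_t j = jj; j < jj + NB && j < n; ++j)
                    for (size_t k = kk; k < kk + NB && k < n; ++k) {
                        double bkj = B[k + j * n];  /* B(k,j) */
                        for (size_t i = ii; i < ii + NB && i < n; ++i)
                            C[i + j * n] += A[i + k * n] * bkj;
                    }
}
```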
Some Facts on Cache Blocking
• A very important algorithmic technique
• First used by ESSL and the Cedar Project
• The Cray-2 was the impetus for the Level 3 BLAS
• Multi-core may modify the Level 3 BLAS standard
• The gap between memory speed and many fast cores is too great for the current standard to remain viable
Standard Fortran and C Matrices
• A has size M rows by N columns, with LDA >= M
• Columns are stride one and rows have stride LDA
• This is a one-dimensional layout, whereas A is 2-D
• A^T has size N rows by M columns, with LDAT >= N
• In A^T, the rows of A are stride one and its columns have stride LDAT
• This is a one-dimensional layout, whereas A^T is 2-D
• Both A and A^T contain the same information
• However, two copies are necessary
• Matrix A is 2-D, but it is stored 1-D in Fortran and C (see the sketch after this list)
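A minimal sketch of this addressing in C (the function names are illustrative, not from any library):

```c
#include <stddef.h>

/* Column-major (Fortran-style) addressing: element a(i,j) of an
 * M x N matrix with leading dimension lda >= M lives at the 1-D
 * offset i + j*lda; one linear offset carries the whole "2-D" map. */
static inline double get_a(const double *a, size_t lda,
                           size_t i, size_t j)
{
    return a[i + j * lda];   /* column j is contiguous, stride 1 */
}

/* The separately stored transpose: A(i,j) = A^T(j,i) sits at offset
 * j + i*ldat, so row i of A is contiguous in this copy.  Same
 * information, but a second copy is required. */
static inline double get_a_from_at(const double *at, size_t ldat,
                                   size_t i, size_t j)
{
    return at[j + i * ldat];
}
```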
Generalization of the Standard Format
• Each a(i,j) is now a rectangular or square submatrix A(I:I+MB-1, J:J+NB-1)
• All submatrices are contiguous; LDA = MB
• Simple and non-simple layouts
• The algorithms are identical in nature
• Provably the best approximation to 2-D in Fortran and C
• Left-over blocks are full, of size MB*NB
• Very important: rectangular or square submatrices can be transposed in place
(Addressing in this layout is sketched after this list.)
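A sketch of the address computation for the simple square-block layout; the block ordering and the divisibility assumption are my choices for brevity, not prescribed by the slide:

```c
#include <stddef.h>

/* Offset of element (i,j) in a simple square-block (NDS) layout:
 * the matrix is an (M/MB) x (N/NB) grid of contiguous MB x NB
 * blocks, stored here in column-major block order with column-major
 * storage inside each block (so LDA = MB within a block).  This
 * sketch assumes M and N are exact multiples of MB and NB. */
static inline size_t nds_offset(size_t M, size_t MB, size_t NB,
                                size_t i, size_t j)
{
    size_t mblocks = M / MB;          /* blocks per block column    */
    size_t bi = i / MB, bj = j / NB;  /* which block                */
    size_t ii = i % MB, jj = j % NB;  /* position inside the block  */
    return (bj * mblocks + bi) * (MB * NB) + jj * MB + ii;
}
```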
Proof Outline on the Role of _GEMM
• Sketch of a proof that matrix factorization is almost all matrix multiplication:
a) Perform n = N/NB rank-NB linear transformations on A to get, say, U; here PA = LU
b) Each of these n composed rank-NB linear transformations is a matrix multiply, by definition
c) These n transformations preserve the solution properties of Ax = b: Ax = b if and only if Ux = L^{-1}Pb, by the principle of equivalent matrices
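A short flop count makes the dominance of step (b) concrete; this is the standard estimate, not taken verbatim from the slides:

```latex
% Blocked LU does about (2/3)N^3 flops in total.  Each of the
% n = N/NB panel factorizations costs O(N \cdot NB^2) flops, so the
% total non-GEMM (panel) work is O(n \cdot N \cdot NB^2) = O(N^2 \cdot NB):
\[
  \frac{\text{GEMM flops}}{\text{total flops}}
  \;=\; 1 - O\!\left(\frac{N^2\,NB}{N^3}\right)
  \;=\; 1 - O\!\left(\frac{NB}{N}\right)
  \;\longrightarrow\; 1
  \qquad \text{as } N/NB \to \infty .
\]
```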
Block-Based Algorithms à la LAPACK
• N coordinate transformations are represented as n = N/NB composed rank-NB coordinate transformations
• View the algorithm as a series of kernel routines (a Cholesky sketch follows this list):
• c(i,j) = c(i,j) - a(i,k)*b(k,j) : GEMM, SYRK : C = C - A*B
• b(i,j) = b(i,j)/a(j,j) : TRSM : B = B*A^{-1}
• L*U = P*A : factor kernel
• L*L^T = A : Cholesky kernel
• Q*R = A : QR kernel
• LAPACK treats factor kernels as a series of NB level-2 operations
• Factor kernels can be written as level-3 kernels
• Recursion is helpful
• Register-based programming
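As an illustration of "a series of kernel routines", here is a sketch of a left-looking block Cholesky driver in C. The three kernels are hypothetical stand-ins for POTRF/TRSM/SYRK-style routines, and blk is assumed to return a pointer to contiguous block (i,j) of the block layout described earlier:

```c
#include <stddef.h>

/* Hypothetical level-3 kernels, each on one contiguous NB x NB block: */
void factor_diag(double *Ajj);                 /* L*L^T = Ajj, in place */
void trsm(const double *Ljj, double *Aij);     /* Aij = Aij * Ljj^{-T}  */
void syrk_update(const double *Lik, const double *Ljk,
                 double *Aij);                 /* Aij -= Lik * Ljk^T    */

/* Left-looking block Cholesky (lower, L*L^T = A) as a series of
 * kernel calls over an n_blocks x n_blocks grid of blocks. */
void blocked_cholesky(size_t n_blocks, double *(*blk)(size_t, size_t))
{
    for (size_t j = 0; j < n_blocks; ++j) {
        for (size_t k = 0; k < j; ++k)          /* SYRK on diag block */
            syrk_update(blk(j, k), blk(j, k), blk(j, j));
        factor_diag(blk(j, j));                 /* factor kernel      */
        for (size_t i = j + 1; i < n_blocks; ++i) {
            for (size_t k = 0; k < j; ++k)      /* GEMM updates       */
                syrk_update(blk(i, k), blk(j, k), blk(i, j));
            trsm(blk(j, j), blk(i, j));         /* TRSM kernel        */
        }
    }
}
```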
Dimension Theory
• Maintains data locality in every neighborhood of an object
• Dimension is the number of coordinates necessary to describe any neighborhood of any point of an object
• Using fewer coordinates destroys data locality
• Cache thrashing results
• Fortran and C use a 1-D layout for all n-D arrays
• NDS blocks map perfectly into cache
• But they are still 1-D
• Use register blocking to overcome the problems at the L0 cache (the registers)
Overlapping Communication and Computation
• A very important architecture feature
• Showed that distributed parallel matrix multiplication could be done at the peak rate
• The 512-node Intel Touchstone Delta achieved peak performance using this algorithm
• A key feature of the Cell architecture
(A double-buffering sketch follows this list.)
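A minimal sketch of the underlying double-buffering pattern, using nonblocking MPI; compute_block is a hypothetical local GEMM, and the ring exchange is my illustration of the idea, not the algorithm of the cited paper:

```c
#include <mpi.h>

void compute_block(const double *a, const double *b, double *c, int n);
                  /* hypothetical local GEMM: C += A*B on n x n blocks */

/* Double buffering: while the current block of B is being multiplied,
 * the next block is already moving around a ring of processors, so
 * communication is hidden behind computation.  (Reading a send buffer
 * during MPI_Isend is permitted as of MPI-3.) */
void ring_gemm(const double *a, double *bufB[2], double *c,
               int n, int steps, int left, int right)
{
    int cur = 0;
    for (int s = 0; s < steps; ++s) {
        MPI_Request rq[2];
        if (s + 1 < steps) {                  /* start next transfer   */
            MPI_Irecv(bufB[1 - cur], n * n, MPI_DOUBLE, left,  0,
                      MPI_COMM_WORLD, &rq[0]);
            MPI_Isend(bufB[cur],     n * n, MPI_DOUBLE, right, 0,
                      MPI_COMM_WORLD, &rq[1]);
        }
        compute_block(a, bufB[cur], c, n);    /* overlaps the transfer */
        if (s + 1 < steps)
            MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);
        cur = 1 - cur;                        /* swap buffers          */
    }
}
```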
Lookahead Idea for Factorization
• Overlap the Schur complement update, aka matrix multiplication, with the previous factor step
• PA = (L_1 U_1)(L_2 U_2) … (L_n U_n) = L_1 (U_1 L_2) … (U_{n-1} L_n) U_n
• The L part is factor-and-scale; the U part is the Schur complement update
• The factor step provides the A and B operands of the update's GEMM part
• With this use of the associative law, the A and B parts of the GEMM are available early, aka lookahead
• The factorization is then almost 100% update
• This makes the factorization almost perfectly parallel (see the sketch after this list)
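A sketch of the lookahead schedule using OpenMP sections; factor_panel and update are hypothetical task routines for a blocked LU on nblk block columns:

```c
void factor_panel(int k);    /* factor-and-scale block column k        */
void update(int k, int j);   /* rank-NB update of column j from step k */

/* The update feeding the next panel runs first, so factoring panel
 * k+1 proceeds concurrently with the bulk of step k's Schur
 * complement update (which is almost pure GEMM). */
void lu_with_lookahead(int nblk)
{
    factor_panel(0);
    for (int k = 0; k < nblk - 1; ++k) {
        update(k, k + 1);            /* critical path: feed next factor */
        #pragma omp parallel sections
        {
            #pragma omp section
            factor_panel(k + 1);     /* lookahead                       */
            #pragma omp section
            for (int j = k + 2; j < nblk; ++j)
                update(k, j);        /* trailing updates, pure GEMM     */
        }
    }
}
```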
Some History behind this Panel
• Worked loosely with David Padua since 2000
• David's team started the Hierarchically Tiled Arrays research project
• In 2010 David and I started on an LA compiler study
• We continue that work here at Snowbird
• Collaborated with Keshav Pingali since 2000
• Discussed the limitations of compilers with regard to arrays
• A major paper on cache-oblivious algorithms resulted
• Kamen Yotov's thesis was the start of an LA compiler
• Working loosely with Dongarra's team at ICL
• Contributed software to LAPACK 3.2 and PLASMA
References
• ESSL Guide and Reference (cache blocking). IBM Pub. No. SA22-7272-00, Feb. 1986.
• R. C. Agarwal, F. G. Gustavson, M. Zubair. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of Research and Development, Vol. 38, No. 5, Sep. 1994, pp. 563-576.
• R. C. Agarwal, F. G. Gustavson, M. Zubair. A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer using overlapped communication. IBM Journal of Research and Development, Vol. 38, No. 6, Nov. 1994, pp. 673-682.
• F. G. Gustavson. High-performance linear algebra algorithms using new generalized data structures for matrices. IBM Journal of Research and Development, Vol. 47, No. 1, Jan. 2003.
• E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, Vol. 46, No. 1, Mar. 2004, pp. 3-45.