140 likes | 296 Views
Enduring Linear Algebra Software Libraries. Fred Gustavson Adjunct Prof. Umea U. Umea, Sweden IBM Research, Emeritus, Ykt. Hgts., NY E-mail: fg2935@hotmail.com.
E N D
Enduring Linear Algebra Software Libraries Fred Gustavson Adjunct Prof. Umea U. Umea, Sweden IBM Research, Emeritus, Ykt. Hgts., NY E-mail: fg2935@hotmail.com 2010 CScADS: Languages and Compilers for LA Libraries. Snowbird, Utah August 9, 2010
Fundamental "Triangle" A C H A: Algorithms H: Hardware C: Compilers
Algorithm and Architecture • The key to performance is to understand the algorithm and architecture interaction. • A significant improvement in performance can be obtained by matching the algorithm to the architecture or vice-versa. • A cost-effective way of providing a given level of performance. • Multi-core puts more of the burden on the algorithm part of the triangle • Especially hard for the designers of Library Software
Blocking • TLB Blocking -- minimize TLB misses • Cache Blocking -- minimize cache misses • Register Blocking -- minimize load/stores • The general idea of blocking is to get the information to a high-speed storage and use it multiple times so as to amortize the cost of moving the data. • Cache Blocking -- Reduces traffic between memory and cache • Register Blocking -- Reduces traffic between cache and CPU • TLB Blocking – Covers the current working set of a problem
Some Facts on Cache Blocking • A very important algorithmic technique • First used by ESSL and the Cedar Project • Cray 2 was impetus for Level 3 BLAS • Multi-core may modify the L3 BLAS standard • The gap between memory speed and many fast cores is too great to allow the current standard to be viable
Standard Fortran and C Matrices • A has size M rows by N cols with LDA >= M • Cols are stride one and rows are LDA • This is a one dimensional layout whereas A is 2-D • AT has size N rows by M cols with LDAT >= N • Rows are stride one and cols are LDAT • This is a one dimensional layout whereas AT is 2-D • Both A & AT contain the same information • However, two copies are necessary • Matrix A is 2-D but 1-D in Fortran and C
Generalization of Standard Format • Each a(i,j) is now a rectangular or square sub matrix A(I:I+MB-1,J:J+NB-1) • All sub matrices are contiguous; LDA = MB • Simple and non-simple layouts • Algorithms are identical in nature • Provable best approximation to 2-D in Fortran and C • Left over blocks are full; size MB*NB • Very important • Can transpose rectangular or square sub matrices in place
Proof Outline on role of _GEMM • Sketch of a proof that matrix factorization is almost all matrix multiplication a) Perform n = N/NB rank NB linear transformations on A to get say U; here PA=LU b) Each of these n composed NB linear transformations is matrix multiply by definition c) These n transformations preserve the solution properties of Ax = b if and only if Ux = L-1b by the principle of equivalent matrices IBM Thomas J. Watson Research Center Yorktown Heights, New York
Blocked Based Algorithms a la LAPACK • N coordinate transformations represented as n = N/NB composed rank NB coordinate transformations • View as a series of kernel algorithms • c(i, j)=c(i, j) - a(i, k)*b(k, j) : GEMM, SYRK : C=C-A*B • b(i, j)=b(i, j)/a(j, j) : TRSM : B = B*A-1 • L*U=P*A : Factor Kernel • L*LT=A : Cholesky Kernel • Q*R=A : QR Kernel • LAPACK treats factor kernels as a series of NB level two operations • Factor kernels can be written as level 3 kernels • Recursion is helpful • Register based programming
Dimension Theory • Maintains data locality in every neighborhood of an object • Number of coordinates necessary to describe any neighborhood of any point of an object • Using less coordinates destroys data locality • Cache thrashing • Fortran and C use 1-D layout for all n-D arrays • NDS blocks map perfectly into cache • But still 1-D • Use register blocking to overcomes problems in L0 cache
Overlapping Communication and Computation • Very Important Architecture Feature • Showed that Distributed Parallel Matrix Multiplication could be done at the Peak Rate • 512 Touchstone Delta Machine achieved peak performance using this algorithm • Key feature of Cell Architecture
Look ahead Idea for Factorization • Overlap Schur Complement Update aka matrix multiplication with the previous factor step • PA = (L1U1)(L2U2)…Ln = L1(U1L2)…(Un-1Ln) • L part is factor and scale and U part is SC update • factor step provide the A and B operands of the update GEMM part • with this use of the associative law the A & B of parts of GEMM is done early aka lookahead • factorization is almost 100% Update • makes factorization almost perfectly parallel
Some History behind this Panel • Worked loosely with David Padua since 2000 • Hierarchical Tiled Research Project started by David’s team • In 2010 David and I started on a LA compiler study • We have a continuation of this at Snowbird • Collaborated with Keshav Pingali since 2000 • Discussed limitations of Compilers re Arrays • Major paper produced on Cache Oblivious Algs. • Kamen Yotov’s thesis start of LA Compiler • Working loosely with Dongarra’s team at ICL • Software for LAPACK 3.2 and PLASMA contributed
References • ESSL Guide and Reference ; (cache blocking) Pub. No. SA22-7272-00 Feb. 1986. • R. C. Agarwal, F. G. Gustavson, M. Zubair. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of R. & D., Vol. 38, No. 5, Sep. 1994, pp. 563,576. • R. C. Agarwal, F. G. Gustavson, M. Zubair. A High Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer Using Overlapped Communication. IBM Journal of R. & D., Vol. 38, No. 6, Nov. 1994, pp. 673-682. • F. Gustavson. High Performance Linear Algebra Algorithms using New Generalized Data Structures for Matrices. IBM J. of R. & D., Vol. 47, No. 1, Jan. 2003. • E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kagstrom. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review, Vol. 46, No. 1, Mar. 2004, pp. 3-45.