
Enduring Linear Algebra Software Libraries



  1. Enduring Linear Algebra Software Libraries Fred Gustavson Adjunct Prof. Umea U. Umea, Sweden IBM Research, Emeritus, Ykt. Hgts., NY E-mail: fg2935@hotmail.com 2010 CScADS: Languages and Compilers for LA Libraries. Snowbird, Utah August 9, 2010

  2. Fundamental "Triangle" [diagram: a triangle with vertices A, C, H] • A: Algorithms • H: Hardware • C: Compilers

  3. Algorithm and Architecture • The key to performance is to understand the algorithm and architecture interaction. • A significant improvement in performance can be obtained by matching the algorithm to the architecture, or vice versa. • This is a cost-effective way of providing a given level of performance. • Multi-core puts more of the burden on the algorithm part of the triangle • This is especially hard for the designers of library software

  4. Blocking • TLB Blocking -- minimize TLB misses • Cache Blocking -- minimize cache misses • Register Blocking -- minimize load/stores • The general idea of blocking is to get the information to a high-speed storage and use it multiple times so as to amortize the cost of moving the data. • Cache Blocking -- Reduces traffic between memory and cache • Register Blocking -- Reduces traffic between cache and CPU • TLB Blocking – Covers the current working set of a problem
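As a concrete illustration of these ideas, here is a minimal sketch, in C, of cache and register blocking for C = C + A*B. The row-major layout, square sizes, and the NB value are illustrative assumptions, not code from the talk.

```c
/* Hedged sketch of cache + register blocking for C = C + A*B.
 * Assumptions (not from the slides): row-major storage, square
 * N x N matrices, N a multiple of NB, NB tuned so three NB x NB
 * blocks fit in cache. */
#include <stddef.h>

#define NB 64  /* assumed block size; tune to the target cache */

void gemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)          /* cache blocking:      */
        for (size_t kk = 0; kk < n; kk += NB)      /* each NB x NB block   */
            for (size_t jj = 0; jj < n; jj += NB)  /* is loaded once and   */
                                                   /* reused NB times      */
                for (size_t i = ii; i < ii + NB; ++i)
                    for (size_t k = kk; k < kk + NB; ++k) {
                        /* register blocking: a(i,k) stays in a register
                         * across the whole inner loop, saving loads */
                        double aik = A[i*n + k];
                        for (size_t j = jj; j < jj + NB; ++j)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}
```

The block loops amortize the cost of moving each block from memory to cache; the scalar aik amortizes the cost of moving data from cache to the CPU, matching the memory-to-cache and cache-to-CPU traffic reductions listed above.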

  5. Some Facts on Cache Blocking • A very important algorithmic technique • First used by ESSL and the Cedar Project • The Cray-2 was the impetus for the Level 3 BLAS • Multi-core may modify the Level 3 BLAS standard • The gap between memory speed and many fast cores is too great to allow the current standard to remain viable

  6. Standard Fortran and C Matrices • A has size M rows by N cols with LDA >= M • Cols are stride one and rows have stride LDA • This is a one-dimensional layout, whereas A is 2-D • AT has size N rows by M cols with LDAT >= N • Rows are stride one and cols have stride LDAT • This is a one-dimensional layout, whereas AT is 2-D • Both A & AT contain the same information • However, two copies are necessary • Matrix A is 2-D but is stored 1-D in Fortran and C
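A minimal sketch of this standard column-major layout addressed from C, assuming 0-based indexing; the helper name elem is mine, not from the slides:

```c
/* Element a(i,j) of an M x N matrix with leading dimension
 * LDA >= M lives at offset i + j*LDA: columns are stride-1,
 * rows have stride LDA. */
#include <stdio.h>

static double *elem(double *A, int lda, int i, int j)
{
    return &A[i + (size_t)j * lda];
}

int main(void)
{
    enum { M = 3, N = 2, LDA = 4 };      /* LDA > M leaves padding rows */
    double A[LDA * N] = {0};
    *elem(A, LDA, 2, 1) = 5.0;           /* a(2,1), 0-based */
    printf("%f\n", A[2 + 1 * LDA]);      /* prints 5.000000 */
    return 0;
}
```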

  7. Generalization of Standard Format • Each a(i,j) is now a rectangular or square submatrix A(I:I+MB-1, J:J+NB-1) • All submatrices are contiguous; LDA = MB • Simple and non-simple layouts • Algorithms are identical in nature • Provably the best approximation to 2-D in Fortran and C • Left-over blocks are stored full, of size MB*NB • Very important • Can transpose rectangular or square submatrices in place
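One way to compute addresses in such a blocked layout, as a sketch under stated assumptions (block-column-major ordering of the blocks, MB dividing M; the function name and conventions are illustrative, one simple variant among the layouts described):

```c
/* Address of a(i,j) in a blocked layout: the matrix is stored as
 * contiguous MB x NB submatrices, each column-major with LDA = MB. */
#include <stddef.h>

size_t blocked_offset(size_t M, size_t MB, size_t NB, size_t i, size_t j)
{
    size_t mblocks = M / MB;            /* assumes MB divides M; left-over
                                           blocks would be padded full     */
    size_t bi = i / MB, bj = j / NB;    /* which block                     */
    size_t ri = i % MB, rj = j % NB;    /* position inside the block       */
    size_t block = bj * mblocks + bi;   /* block-column-major block order  */
    return block * (MB * NB)            /* skip preceding contiguous blocks */
         + rj * MB + ri;                /* column-major inside, LDA = MB   */
}
```

Note that each MB*NB block is one contiguous run of memory, which is why the blocks map perfectly into cache lines and pages.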

  8. Proof Outline on the Role of _GEMM • Sketch of a proof that matrix factorization is almost all matrix multiplication: a) Perform n = N/NB rank-NB linear transformations on A to get, say, U; here PA = LU b) Each of these n composed rank-NB linear transformations is a matrix multiply by definition c) These n transformations preserve the solution properties of Ax = b, since Ax = b if and only if Ux = L^-1 Pb, by the principle of equivalent matrices
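In symbols, a hedged reconstruction of steps a) through c); the block Gauss-transform notation G_k is mine, not the slide's:

```latex
% n = N/NB rank-NB steps, each a matrix multiplication by a
% block Gauss transform G_k, compose to give PA = LU.
\[
  G_n \cdots G_2 G_1 \,(PA) = U, \qquad L = (G_n \cdots G_1)^{-1},
\]
\[
  Ax = b \iff (PA)\,x = Pb \iff Ux = L^{-1}Pb .
\]
```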

  9. Block-Based Algorithms a la LAPACK • N coordinate transformations represented as n = N/NB composed rank-NB coordinate transformations • View as a series of kernel algorithms • c(i,j) = c(i,j) - a(i,k)*b(k,j) : GEMM, SYRK : C = C - A*B • b(i,j) = b(i,j)/a(j,j) : TRSM : B = B*A^-1 • L*U = P*A : Factor Kernel • L*L^T = A : Cholesky Kernel • Q*R = A : QR Kernel • LAPACK treats factor kernels as a series of NB level-2 operations • Factor kernels can be written as level-3 kernels • Recursion is helpful • Register-based programming
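As a sketch of how these kernels compose, here is a right-looking blocked Cholesky (A = L*L^T, lower triangle, column-major) built from a factor kernel, a TRSM kernel, and a SYRK-style update. The naive kernels below are illustrative stand-ins, not ESSL or LAPACK code.

```c
/* Hedged sketch of a blocked Cholesky driven by level-3 kernels. */
#include <math.h>

#define NB 2  /* tiny block size for illustration only */

/* factor kernel: unblocked Cholesky of an nb x nb diagonal block */
static void chol_kernel(int n, double *A, int lda)
{
    for (int j = 0; j < n; ++j) {
        A[j + j*lda] = sqrt(A[j + j*lda]);
        for (int i = j + 1; i < n; ++i)
            A[i + j*lda] /= A[j + j*lda];
        for (int k = j + 1; k < n; ++k)       /* rank-1 trailing update */
            for (int i = k; i < n; ++i)
                A[i + k*lda] -= A[i + j*lda] * A[k + j*lda];
    }
}

/* TRSM kernel: overwrite B (m x n) with B * L^{-T}, L lower triangular */
static void trsm_kernel(int m, int n, const double *L, int ldl,
                        double *B, int ldb)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double s = B[i + j*ldb];
            for (int k = 0; k < j; ++k)
                s -= B[i + k*ldb] * L[j + k*ldl];
            B[i + j*ldb] = s / L[j + j*ldl];
        }
}

/* SYRK-style kernel: C = C - A*A^T on the lower triangle, A is n x k */
static void syrk_kernel(int n, int k, const double *A, int lda,
                        double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i)
            for (int p = 0; p < k; ++p)
                C[i + j*ldc] -= A[i + p*lda] * A[j + p*lda];
}

/* driver: n assumed a multiple of NB for brevity */
void chol_blocked(int n, double *A, int lda)
{
    for (int j = 0; j < n; j += NB) {
        chol_kernel(NB, &A[j + j*lda], lda);            /* factor kernel */
        if (j + NB < n) {
            int m = n - j - NB;
            trsm_kernel(m, NB, &A[j + j*lda], lda,      /* panel solve   */
                        &A[j+NB + j*lda], lda);
            syrk_kernel(m, NB, &A[j+NB + j*lda], lda,   /* SC update     */
                        &A[j+NB + (j+NB)*lda], lda);
        }
    }
}
```

For large n, almost all of the flops land in syrk_kernel, which is exactly the GEMM-rich Schur complement update the proof outline above predicts.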

  10. Dimension Theory • Maintains data locality in every neighborhood of an object • The dimension is the number of coordinates necessary to describe any neighborhood of any point of an object • Using fewer coordinates destroys data locality • Cache thrashing • Fortran and C use a 1-D layout for all n-D arrays • NDS blocks map perfectly into cache • But they are still 1-D • Use register blocking to overcome problems in the L0 cache (the registers)

  11. Overlapping Communication and Computation • A very important architecture feature • Showed that distributed parallel matrix multiplication could be done at the peak rate • The 512-node Touchstone Delta machine achieved peak performance using this algorithm • A key feature of the Cell architecture
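A hedged sketch of the double-buffering pattern behind this result, using non-blocking MPI on a ring of ranks. The names, the block distribution, and the naive local multiply are illustrative assumptions, not the paper's actual algorithm:

```c
/* While the current block of B circulates to the neighboring rank,
 * multiply with the block already in hand; if the local GEMM takes
 * longer than the transfer, communication is fully hidden. */
#include <mpi.h>

void ring_gemm(double *Ablk, double *Bbuf[2], double *C,
               int n, int steps, int left, int right, MPI_Comm comm)
{
    int cur = 0;
    for (int s = 0; s < steps; ++s) {
        MPI_Request req[2];
        int nxt = 1 - cur;
        if (s + 1 < steps) {              /* start moving the next block now */
            MPI_Irecv(Bbuf[nxt], n*n, MPI_DOUBLE, right, 0, comm, &req[0]);
            MPI_Isend(Bbuf[cur], n*n, MPI_DOUBLE, left,  0, comm, &req[1]);
        }
        /* compute with the block we already have; the network moves
         * the next one concurrently */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i*n + j] += Ablk[i*n + k] * Bbuf[cur][k*n + j];
        if (s + 1 < steps)
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        cur = nxt;                        /* swap buffers */
    }
}
```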

  12. Lookahead Idea for Factorization • Overlap the Schur complement update, aka matrix multiplication, with the previous factor step • PA = (L_1U_1)(L_2U_2)…(L_nU_n) = L_1(U_1L_2)…(U_{n-1}L_n)U_n • The L part is factor-and-scale and the U part is the SC update • The factor step provides the A and B operands of the update's GEMM part • With this use of the associative law, the A and B parts of GEMM are available early, aka lookahead • Factorization is almost 100% update • This makes factorization almost perfectly parallel
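A minimal sketch of lookahead scheduling with OpenMP tasks (compile with -fopenmp); factor_panel and update_block are illustrative stubs standing in for the real panel factorization and GEMM update, not code from the talk:

```c
/* At step k, the next panel is updated and factored ahead of time,
 * so its factorization overlaps the GEMM-rich bulk of step k's
 * Schur complement update. */
#include <stdio.h>

static void factor_panel(int k)        { printf("factor panel %d\n", k); }
static void update_block(int k, int j) { printf("update (%d,%d)\n", k, j); }

void lu_lookahead(int nblocks)
{
    #pragma omp parallel
    #pragma omp single
    {
        factor_panel(0);
        for (int k = 0; k < nblocks; ++k) {
            if (k + 1 < nblocks) {
                update_block(k, k + 1);   /* bring the next panel up to date */
                #pragma omp task          /* lookahead: runs alongside the   */
                factor_panel(k + 1);      /* updates spawned below           */
            }
            for (int j = k + 2; j < nblocks; ++j) {
                #pragma omp task
                update_block(k, j);       /* the GEMM-rich Schur update */
            }
            #pragma omp taskwait          /* join before the next step  */
        }
    }
}

int main(void) { lu_lookahead(4); return 0; }
```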

  13. Some History behind this Panel • Worked loosely with David Padua since 2000 • The Hierarchically Tiled Arrays research project was started by David's team • In 2010 David and I started on an LA compiler study • We have a continuation of this at Snowbird • Collaborated with Keshav Pingali since 2000 • Discussed the limitations of compilers regarding arrays • A major paper was produced on cache-oblivious algorithms • Kamen Yotov's thesis was the start of an LA compiler • Working loosely with Dongarra's team at ICL • Contributed software to LAPACK 3.2 and PLASMA

  14. References • ESSL Guide and Reference (cache blocking). Pub. No. SA22-7272-00, Feb. 1986. • R. C. Agarwal, F. G. Gustavson, M. Zubair. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of R. & D., Vol. 38, No. 5, Sep. 1994, pp. 563-576. • R. C. Agarwal, F. G. Gustavson, M. Zubair. A High Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer Using Overlapped Communication. IBM Journal of R. & D., Vol. 38, No. 6, Nov. 1994, pp. 673-682. • F. Gustavson. High Performance Linear Algebra Algorithms using New Generalized Data Structures for Matrices. IBM Journal of R. & D., Vol. 47, No. 1, Jan. 2003. • E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kagstrom. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review, Vol. 46, No. 1, Mar. 2004, pp. 3-45.
