
CS 267 Dense Linear Algebra: Possible Class Projects

Project ideas: tune existing codes in the LAPACK and ScaLAPACK libraries, experiment with new algorithms and architectures, and add missing functionality where it would have the most impact.


Presentation Transcript


  1. CS 267 Dense Linear Algebra: Possible Class Projects
     James Demmel
     www.cs.berkeley.edu/~demmel/cs267_Spr09

  2. Kinds of class projects
     • Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
       • Possible impact: help many people run faster
     • Add missing functionality to these libraries
       • Possible impact: lots of users want it
     • Experiment with algorithms on new architectures
       • Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
     • Experiment with new software approaches
       • Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
     • Experiment with new algorithms
       • Possible impact: Find a better one!

  3. Challenges to Libraries (and parallel SW in general)
     • Minimizing communication costs
       • Cost of bandwidth and latency (to main memory or over a network) growing exponentially compared to arithmetic (a simple cost model is sketched below)
     • Heterogeneous platforms
       • Different communication costs depending on destination
         • Same chip vs. different socket vs. different board …
       • CPU + GPU
         • Perform different operations at very different rates
     • Dynamic scheduling & load balancing
       • Can't always assume each core/processor makes constant progress on your task
       • May be faster to grab the next available task than to use a predesigned "perfectly balanced" schedule
       • OS may give or take away resources on the fly
     • Fault tolerance – how to recover when one processor fails
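The communication bullet is usually made concrete with a latency-bandwidth-flop model: time ≈ alpha·(#messages) + beta·(#words moved) + gamma·(#flops), with alpha >> beta >> gamma and the gaps widening over time. The toy Python sketch below applies that model to blocked matrix multiply; the hardware constants are made-up placeholders, and the n^3/sqrt(M) word count is the classical communication lower bound for matmul with a fast memory of M words.

    # Toy alpha-beta-gamma cost model (illustrative constants, not real hardware)
    alpha, beta, gamma = 1e-6, 1e-9, 1e-11   # s/message, s/word, s/flop

    def blocked_matmul_time(n, M):
        """Estimated time for an n x n matmul blocked for a fast memory of M words."""
        flops = 2.0 * n**3
        words = n**3 / M**0.5        # communication lower bound, attained by blocking
        messages = words / M         # at best, each message fills the fast memory
        return alpha * messages + beta * words + gamma * flops

    for n in (1024, 8192):
        print(n, blocked_matmul_time(n, M=2**20))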

  4. Strassen's Matmul on Multicore or GPU
     • Why no Strassen in most libraries?
       • See "Baleful Effect of Benchmarks…" by Prof. Kahan
     • Likely to be faster for modest-to-large matrix sizes
       • Where is the crossover?
       • May want a hybrid: switch to the O(n^3) algorithm below certain (smaller) sizes (see the sketch below)
       • Autotuning?
     • Lots of "blocking" opportunities, as for standard matmul
       • What is the least amount of data movement possible?
     • How well does it work for the rectangular matmuls in LU, QR and Cholesky?
       • Do we need to modify LU, QR or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
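As a starting point, here is a minimal NumPy sketch of the hybrid the slide suggests: recurse with Strassen's seven multiplies above a crossover size and fall back to the O(n^3) BLAS call (NumPy's @) below it. The cutoff value is a tunable placeholder, and the sketch assumes square matrices with even dimensions above the cutoff for brevity.

    import numpy as np

    def strassen(A, B, cutoff=128):
        """Recursive Strassen with fallback to the classical algorithm below 'cutoff'."""
        n = A.shape[0]
        if n <= cutoff or n % 2:
            return A @ B                      # O(n^3) BLAS call below the crossover
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty((n, n), dtype=np.result_type(A, B))
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

Measuring where this actually beats a plain A @ B, and whether the crossover moves on a GPU, is exactly the autotuning question raised on the slide.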

  5. Review: Alternative recursive GE formulation   A = L * U

     function [L,U] = RLU(A)                        … assume A is m by n
       if (n = 1)
         L = A / A(1,1),  U = A(1,1)
       else
         [L1,U1] = RLU( A(1:m, 1:n/2) )             … do left half of A
                                                    … let L11 denote the top n/2 rows of L1
         A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                                    … update top n/2 rows of right half of A
         A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n) - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                                    … update rest of right half of A
         [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )       … do right half of A
       return [ L1, [0;L2] ]  and  [ U1, [ A(.,.) ; U2 ] ]

     • Toledo (1997)
     • Described without pivoting for simplicity
     • "Do left half of matrix, then right half" (a runnable NumPy transcription follows below)
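A direct NumPy transcription of the pseudocode above, useful for checking understanding; it assumes m >= n and, matching the slide's simplification, does no pivoting, so the leading blocks of A must stay nonsingular.

    import numpy as np

    def rlu(A):
        """Recursive LU without pivoting (Toledo-style), A of shape m x n with m >= n."""
        A = np.array(A, dtype=float)                  # work on a copy
        m, n = A.shape
        if n == 1:
            return A / A[0, 0], A[:1, :1].copy()
        h = n // 2
        L1, U1 = rlu(A[:, :h])                        # do left half of A
        L11 = L1[:h, :]                               # top h rows of L1 (unit lower triangular)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])   # update top h rows of right half
        A[h:, h:] -= L1[h:, :] @ A[:h, h:]            # update rest of right half (Schur complement)
        L2, U2 = rlu(A[h:, h:])                       # do right half of A
        L = np.hstack([L1, np.vstack([np.zeros((h, n - h)), L2])])
        U = np.block([[U1, A[:h, h:]],
                      [np.zeros((n - h, h)), U2]])
        return L, U

    # Quick check: L @ U reproduces A for a well-conditioned test matrix
    A = np.random.default_rng(0).random((6, 4)) + 4 * np.eye(6, 4)
    L, U = rlu(A)
    print(np.allclose(L @ U, A))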

  6. Register-file resident Linear Algebra on GPUs
     • Vasily's results for LU, QR and Cholesky on the GPU target single large matrices, too large to fit in the "fast memory" (shared memory + registers) of the GPU
     • There is also demand for solving many smaller problems in parallel, e.g. A(i) * x(i) = b(i) for many different A(1),…,A(k) and b(1),…,b(k) (see the batched sketch below)
     • Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor
       • e.g. a single-precision square matrix of dimension n = 128
     • Question: does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
     • Question: do we need BLAS3 code versions on such small matrices, or is BLAS2 enough?
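A CPU reference point for the batched workload the slide describes, using NumPy's batched solve; the batch count k and size n here are illustrative, and a register-file-resident GPU kernel would be checked against something like this.

    import numpy as np

    k, n = 1024, 128                        # illustrative batch count and matrix size
    rng = np.random.default_rng(0)
    A = rng.standard_normal((k, n, n)).astype(np.float32)
    b = rng.standard_normal((k, n, 1)).astype(np.float32)

    x = np.linalg.solve(A, b)               # one small LU factorization per matrix

    # Residual check for a few of the systems
    print(np.linalg.norm(A[:4] @ x[:4] - b[:4], axis=(1, 2)))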

  7. Extend Vasily's GPU analysis, code to ATI
     • Vasily's Best Student Paper Award from SC08 had two parts:
       • Analyzed bottlenecks, speedup possibilities in the NVIDIA architecture
       • Applied the lessons to a reorganization of LU, QR, Cholesky
     • What about the ATI GPU?
       • Both of the above aspects are interesting
       • An ATI GPU is available in ParLab
     • What are the pros and cons of the ATI and NVIDIA architectures? Others?
     • Do we need to reorganize algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
     • Other BLAS-like operations on the GPU
       • Needed for finite-element analysis

  8. Missing Drivers in Sca/LAPACK

  9. More missing drivers

  10. Missing matrix types in ScaLAPACK
      • Symmetric, Hermitian, triangular
        • Band, Packed
      • Positive Definite
        • Packed
      • Orthogonal, Unitary
        • Packed

  11. Tuning the data layout
      • The layout depends on the block size b and the processor grid Pr x Pc (see the mapping sketch below)
      • Simple layouts are easy for the user, but bad for performance
      • Speedups from using a 2D processor grid range from 2x to 8x
      • Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
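The layout being tuned is ScaLAPACK's 2D block-cyclic distribution. The sketch below shows the index-to-process mapping that the block size b and the Pr x Pc grid define; the parameter values are just examples.

    def owner(i, j, b, Pr, Pc):
        """Process (pr, pc) that owns global matrix entry (i, j) in a 2D
        block-cyclic layout with block size b on a Pr x Pc process grid."""
        return (i // b) % Pr, (j // b) % Pc

    b, Pr, Pc = 64, 6, 10                   # e.g. 60 processes arranged as a 6 x 10 grid
    print(owner(0, 0, b, Pr, Pc))           # (0, 0)
    print(owner(70, 650, b, Pr, Pc))        # block row 1, block column 10 -> process (1, 0)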

  12. Cost of tuning the data layout, compared to runtime
      • The cost of redistributing the matrix to the optimal layout is small
      • Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
      • Possible project: build a "wrapper" that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user (a sketch of the decision logic follows below)
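A hypothetical sketch of the wrapper's decision logic: compare running in the current layout against redistributing to each candidate layout (and possibly back) plus running there. The functions estimate_solve_time and estimate_redistribute_time are placeholders the project would supply, by modeling or by benchmarking.

    def choose_layout(current_layout, candidate_layouts,
                      estimate_solve_time, estimate_redistribute_time,
                      convert_back=True):
        """Return the layout with the best total estimated time, counting redistribution."""
        best = (estimate_solve_time(current_layout), current_layout)   # no redistribution
        for layout in candidate_layouts:
            t = (estimate_redistribute_time(current_layout, layout)
                 + estimate_solve_time(layout))
            if convert_back:
                t += estimate_redistribute_time(layout, current_layout)
            best = min(best, (t, layout))
        return best[1]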

  13. Parallel Eigenvalue Algorithms on GPU
      • Harder to use all BLAS3 than when solving Ax=b or least squares
      • Symmetric eigenvalue problem for A = A^T (SVD is similar):
        • Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero on the main diagonal and right above and below it)
        • Find eigenvalues Λ = diag(λ1,…,λn) and orthogonal eigenvectors U of T, i.e. T = U Λ U^T
          • Good parallel algorithms exist; cheaper than the first step
        • Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
      • A = Q T Q^T is the proposed challenge
        • Use "Successive Band Reduction" (Sun, Bischof et al.)
        • Go from A to a wide band matrix B via A = V B V^T, V orthogonal
          • All BLAS3, fast on the GPU
        • Go from B to tridiagonal T via B = W T W^T, W orthogonal
          • BLAS1 and BLAS2; do it on the CPU
        • Find T = U Λ U^T as above; then A = (VWU) Λ (VWU)^T (a small SciPy reference for this two-phase structure follows below)
      • Prospect of minimizing communication in theory
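On the CPU, SciPy already exposes the two-phase structure the slide describes (orthogonal reduction to tridiagonal form, then a tridiagonal eigensolver), so a small script like the one below is a handy correctness reference while reorganizing the first phase for the GPU. The matrix here is just a random symmetric test case.

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal

    rng = np.random.default_rng(0)
    n = 300
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                       # random symmetric test matrix

    # Phase 1: orthogonal reduction A = Q T Q^T; for symmetric A the Hessenberg
    # form is tridiagonal (this is the phase the project reorganizes for the GPU)
    T, Q = hessenberg(A, calc_q=True)
    d, e = np.diag(T), np.diag(T, -1)

    # Phase 2: tridiagonal eigenproblem T = U Lambda U^T (cheap relative to phase 1)
    lam, U = eigh_tridiagonal(d, e)

    # Eigenvectors of A are Q @ U; check the residual ||A V - V Lambda|| / ||A||
    V = Q @ U
    print(np.linalg.norm(A @ V - V * lam) / np.linalg.norm(A))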

  14. Experiment with PLASMA for Multicore
      • PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
      • icl.cs.utk.edu/plasma/

  15. Fork-Join vs. Dynamic Execution on Multicore (source: Jack Dongarra)
      • Fork-join (parallel BLAS): each parallel step waits for its slowest task before the next step begins
      • DAG-based (dynamic scheduling): tasks start as soon as their inputs are ready, saving the time lost at the joins
      • Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads
      (The slide's timeline figure comparing the two schedules is not reproduced here.)

  16. Experiment with PLASMA for Multicore
      • Perform analogous experiments with UPC, Titanium or other PGAS languages
      • PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
        • icl.cs.utk.edu/plasma/
      • Experiment with PLASMA
        • Implement other factorizations (a serial sketch of the tiled Cholesky DAG follows below)
        • Compare performance
          • to LAPACK with parallel BLAS
          • to ScaLAPACK
        • Evaluate expressiveness for eigenvalue problems
      • Study the interaction of its scheduler with the higher-level scheduler being designed in ParLab
        • Can PLASMA "gracefully" accept and give up resources?
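To make the DAG concrete before touching PLASMA's scheduler, here is a serial NumPy/SciPy sketch of the right-looking tiled Cholesky whose per-tile calls (POTRF, TRSM, SYRK/GEMM) become the DAG nodes that PLASMA schedules dynamically; the tile size nb is a tuning parameter and the test matrix is illustrative.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tiled_cholesky(A, nb):
        """Right-looking tiled Cholesky, written serially. In PLASMA each per-tile
        call below becomes one DAG node; the edges are the tile read/write
        dependences, and the runtime fires a task as soon as its inputs are ready."""
        n = A.shape[0]
        L = np.tril(A).astype(float)
        for k in range(0, n, nb):
            ke = min(k + nb, n)
            # POTRF: factor the diagonal tile
            L[k:ke, k:ke] = cholesky(L[k:ke, k:ke], lower=True)
            # TRSM: update the tiles below the diagonal tile
            for i in range(ke, n, nb):
                ie = min(i + nb, n)
                L[i:ie, k:ke] = solve_triangular(L[k:ke, k:ke],
                                                 L[i:ie, k:ke].T, lower=True).T
            # SYRK (j == i) and GEMM (j < i): trailing-submatrix update
            for i in range(ke, n, nb):
                ie = min(i + nb, n)
                for j in range(ke, i + 1, nb):
                    je = min(j + nb, n)
                    L[i:ie, j:je] -= L[i:ie, k:ke] @ L[j:je, k:ke].T
        return np.tril(L)

    # Quick check on a small SPD test matrix
    rng = np.random.default_rng(0)
    M = rng.standard_normal((512, 512))
    A = M @ M.T + 512 * np.eye(512)
    L = tiled_cholesky(A, nb=64)
    print(np.allclose(L @ L.T, A))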

  17. Investigate role of the "Dense Motif" in ParLab Apps
      • An initial study (shown on the slide) found dense linear algebra in the Image, Speech, and Music applications
      • Determine what is really needed
        • Functions, problem sizes, performance requirements
      • What do we still need to optimize?
