CS 267 Dense Linear Algebra: Possible Class Projects
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
Kinds of class projects
• Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  • Possible impact: help many people run faster
• Add missing functionality to these libraries
  • Possible impact: lots of users want it
• Experiment with algorithms on new architectures
  • Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
• Experiment with new software approaches
  • Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
• Experiment with new algorithms
  • Possible impact: Find a better one!
Challenges to Libraries (and parallel SW in general)
• Minimizing communication costs
  • The cost of bandwidth and latency (to main memory or over a network) is growing exponentially relative to arithmetic
• Heterogeneous platforms
  • Different communication costs depending on destination: same chip vs. different socket vs. different board …
  • CPU + GPU perform different operations at very different rates
• Dynamic scheduling and load balancing
  • Can’t always assume each core/processor makes constant progress on your task
  • It may be faster to grab the next available task than to use a predesigned “perfectly balanced” schedule
  • The OS may give or take away resources on the fly
• Fault tolerance: how to recover when one processor fails
Strassen’s Matmul on Multicore or GPU
• Why is there no Strassen in most libraries?
  • See “Baleful Effect of Benchmarks…” by Prof. Kahan
• Likely to be faster for modest-to-large matrix sizes
  • Where is the crossover?
  • May want a hybrid: switch to the O(n^3) algorithm below a certain (smaller) size
  • Autotuning?
• Lots of “blocking” opportunities, as for standard matmul
  • What is the least amount of data movement possible?
• How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  • Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies matrices of different sizes)?
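For concreteness, here is a minimal NumPy sketch of the hybrid strategy suggested above: recurse with Strassen’s seven products and fall back to the library O(n^3) routine below a crossover size. The crossover value of 128 and the restriction to square inputs (odd dimensions also fall back) are illustrative assumptions, not tuned or measured choices.

    import numpy as np

    def strassen(A, B, crossover=128):
        """Strassen matmul for square A, B with a hybrid crossover to BLAS."""
        n = A.shape[0]
        if n <= crossover or n % 2 != 0:
            return A @ B                      # fall back to the O(n^3) library routine
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
        # Strassen's seven recursive products
        M1 = strassen(A11 + A22, B11 + B22, crossover)
        M2 = strassen(A21 + A22, B11,       crossover)
        M3 = strassen(A11,       B12 - B22, crossover)
        M4 = strassen(A22,       B21 - B11, crossover)
        M5 = strassen(A11 + A12, B22,       crossover)
        M6 = strassen(A21 - A11, B11 + B12, crossover)
        M7 = strassen(A12 - A22, B21 + B22, crossover)
        C = np.empty((n, n), dtype=M1.dtype)
        C[:m, :m] = M1 + M4 - M5 + M7
        C[:m, m:] = M3 + M5
        C[m:, :m] = M2 + M4
        C[m:, m:] = M1 - M2 + M3 + M6
        return C

    # usage: C = strassen(A, B); np.allclose(C, A @ B) should hold up to roundoff

Sweeping the crossover parameter (and comparing against plain A @ B) is exactly the kind of autotuning experiment the slide asks about.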
Review: Alternative recursive GE formulation, A = L * U
• Toledo (1997)
• Described without pivoting for simplicity
• “Do left half of matrix, then right half”

    function [L,U] = RLU (A)     … assume A is m by n
    if (n = 1)
        L = A/A(1,1), U = A(1,1)
    else
        [L1,U1] = RLU( A(1:m, 1:n/2) )            … do left half of A
                                                  … let L11 denote top n/2 rows of L1
        A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                                  … update top n/2 rows of right half of A
        A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                              - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                                  … update rest of right half of A
        [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )      … do right half of A
        return [ L1, [0; L2] ] and [ U1, [ A(.,.) ; U2 ] ]
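As a working reference, the following is a hedged NumPy transcription of the recursive formulation above. Like the slide, it omits pivoting, so it assumes m ≥ n and nonsingular leading blocks; the names (rlu, L11, …) follow the pseudocode rather than any library API.

    import numpy as np
    from scipy.linalg import solve_triangular

    def rlu(A):
        """Recursive LU without pivoting: A (m x n, m >= n) = L @ U."""
        A = np.array(A, dtype=float)          # work on a float copy
        m, n = A.shape
        if n == 1:
            return A / A[0, 0], A[:1, :1].copy()
        k = n // 2
        # Do the left half: A[:, :k] = L1 @ U1
        L1, U1 = rlu(A[:, :k])
        L11 = L1[:k, :]                       # top k rows of L1 (unit lower triangular)
        # Update the top k rows of the right half: A12 := L11^{-1} * A12
        A[:k, k:] = solve_triangular(L11, A[:k, k:], lower=True)
        # Update the rest of the right half (Schur complement)
        A[k:, k:] -= L1[k:, :] @ A[:k, k:]
        # Do the right half
        L2, U2 = rlu(A[k:, k:])
        # Assemble L = [L1, [0; L2]] and U = [U1, [A12; U2]]
        L = np.zeros((m, n))
        L[:, :k] = L1
        L[k:, k:] = L2
        U = np.zeros((n, n))
        U[:k, :k] = U1
        U[:k, k:] = A[:k, k:]
        U[k:, k:] = U2
        return L, U

    # usage: L, U = rlu(A); np.allclose(L @ U, A) should hold for well-conditioned A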
Register-file resident Linear Algebra on GPUs
• Vasily’s results for LU, QR, and Cholesky on the GPU target single large matrices, too large to fit in just the “fast memory” (shared memory + registers) of the GPU
• There is also demand for solving many smaller problems in parallel, e.g. A(i) * x(i) = b(i) for many different A(1),…,A(k) and b(1),…,b(k)
• Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register file of each multiprocessor
  • e.g. a single-precision square matrix of dimension n = 128
• Question: Does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
• Question: Do we need BLAS3 code versions for such small matrices, or is BLAS2 enough?
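A CPU-side sketch of the batched pattern described above, using NumPy’s stacked solve to process many small independent systems at once. The sizes below are illustrative (n = 128 matches the slide); mapping one small factorization into each multiprocessor’s register file is exactly the part a GPU version of this project would have to supply.

    import numpy as np

    k, n = 512, 128                            # many independent systems, each small
    rng = np.random.default_rng(0)
    A = rng.standard_normal((k, n, n)) + n * np.eye(n)   # well-conditioned batch
    b = rng.standard_normal((k, n, 1))

    # One LU solve per matrix in the batch; a GPU version would give each
    # small system to one multiprocessor and keep it entirely on chip.
    x = np.linalg.solve(A, b)                  # shape (k, n, 1)

    print(np.max(np.abs(A @ x - b)))           # residual check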
Extend Vasily’s GPU analysis and code to ATI
• Vasily’s Best Student Paper Award from SC08 had two parts:
  • Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  • Applied the lessons to a reorganization of LU, QR, and Cholesky
• What about ATI GPUs?
  • Both of the above aspects are interesting
  • An ATI GPU is available in the ParLab
• What are the pros and cons of the ATI and NVIDIA architectures? Others?
  • Do we need to reorganize the algorithms differently for each, or does one algorithm (perhaps with different block sizes or other parameters) work for both (which would be simpler)?
• Other BLAS-like operations on the GPU
  • Needed for finite-element analysis
Missing matrix types in ScaLAPACK
• Symmetric, Hermitian, triangular
  • Band, Packed
• Positive Definite
  • Packed
• Orthogonal, Unitary
  • Packed
Tuning the data layout
• The layout depends on the block size b and the processor grid Pr x Pc
• Simple layouts are easy for the user, but bad for performance
• Speedups from using a 2D processor grid range from 2x to 8x
• Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
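To make the tuning space concrete, here is a small sketch of the 2D block-cyclic indexing that the block size b and the Pr x Pc grid define. It assumes 0-based global indices and that the first block is anchored at process (0, 0); ScaLAPACK also allows a source-process offset, which is ignored here.

    def owner(i, j, b, Pr, Pc):
        """Process grid coordinates (pr, pc) that own global entry (i, j)."""
        return (i // b) % Pr, (j // b) % Pc

    def local_index(i, b, P):
        """Local index of global index i on its owning process (one dimension)."""
        return (i // b // P) * b + (i % b)

    # usage: owner(1000, 2000, b=64, Pr=6, Pc=10) tells you which of 60 processes
    # stores that entry; changing b, Pr, Pc is the tuning knob discussed above.

A layout-tuning wrapper (next slide) would essentially search over (b, Pr, Pc) and redistribute the matrix accordingly.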
Cost of tuning the data layout, compared to runtime
• The cost of redistributing the matrix to the optimal layout is small
• Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
• Possible project: build a “wrapper” that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user
Parallel Eigenvalue Algorithms on GPU
• Harder to use all BLAS3 than for solving Ax=b or least squares
• Symmetric eigenvalue problem for A = A^T (SVD is similar)
  • Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero only on the main diagonal and immediately above and below it)
  • Find eigenvalues Λ = diag(λ1,…,λn) and orthogonal eigenvectors U of T = U Λ U^T
    • Good parallel algorithms exist; cheaper than the first step
  • Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
• A = Q T Q^T is the proposed challenge
  • Use “Successive Band Reduction” (Sun, Bischof et al.)
  • Go from A to a wide band matrix B via A = V B V^T, V orthogonal
    • All BLAS3, fast on GPU
  • Go from B to tridiagonal T via B = W T W^T, W orthogonal
    • BLAS1 and BLAS2; do it on the CPU
  • Find T = U Λ U^T as above, then A = (VWU) Λ (VWU)^T
• Prospect of minimizing communication in theory
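The two-phase structure above can be prototyped on the CPU in a few lines. The sketch below uses SciPy’s general Hessenberg reduction as a stand-in for the reduction step (for symmetric A the Hessenberg form is tridiagonal, though this routine does not exploit symmetry the way successive band reduction would), then solves the tridiagonal problem and accumulates the eigenvectors as Q U.

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                          # symmetric test matrix

    # Phase 1: A = Q T Q^T with Q orthogonal, T (numerically) tridiagonal.
    # On a GPU this is the step to replace with BLAS3-rich band reduction.
    T, Q = hessenberg(A, calc_q=True)
    d, e = np.diag(T), np.diag(T, 1)

    # Phase 2: T = U Lambda U^T (cheap, CPU-friendly).
    lam, U = eigh_tridiagonal(d, e)

    V = Q @ U                                   # eigenvectors of A
    print(np.max(np.abs(A @ V - V * lam)))      # residual should be tiny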
Experiment with PLASMA for Multicore
• PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  • icl.cs.utk.edu/plasma/
Fork-Join vs. Dynamic Execution on Multicore (source: Jack Dongarra)
• Fork-Join approach: parallel BLAS
• DAG-based approach: dynamic scheduling saves time
• Experiments on Intel’s quad-core Clovertown with 2 sockets / 8 threads
[Figure: execution timelines of the T/A/B/C tile tasks under fork-join vs. DAG-based scheduling]
Experiment with PLASMA for Multicore
• PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  • icl.cs.utk.edu/plasma/
• Experiment with PLASMA
  • Implement other factorizations (see the tiled Cholesky sketch below for the kind of task decomposition involved)
  • Compare performance
    • to LAPACK with parallel BLAS
    • to ScaLAPACK
  • Evaluate expressiveness for eigenvalue problems
  • Study the interaction of its scheduler with the higher-level scheduler being designed in the ParLab
    • Can PLASMA “gracefully” accept and give up resources?
• Perform analogous experiments with UPC, Titanium, or other PGAS languages
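To illustrate the kind of task decomposition PLASMA schedules, here is a NumPy sketch of right-looking tiled Cholesky. The per-tile POTRF/TRSM/SYRK/GEMM kernels are the nodes PLASMA would put into a DAG and run as their dependencies complete; this sketch simply executes them in loop order, and PLASMA’s real interface is C, so nothing here mirrors its actual API.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tiled_cholesky(A, nb):
        """Lower Cholesky factor of SPD A, computed tile by tile (n divisible by nb)."""
        n = A.shape[0]
        nt = n // nb
        L = np.array(A, dtype=float)
        tile = lambda i, j: L[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
        for k in range(nt):
            # POTRF: factor the diagonal tile
            tile(k, k)[:] = cholesky(tile(k, k), lower=True)
            for i in range(k + 1, nt):
                # TRSM: L[i,k] := A[i,k] * L[k,k]^{-T}
                tile(i, k)[:] = solve_triangular(tile(k, k), tile(i, k).T, lower=True).T
            for i in range(k + 1, nt):
                for j in range(k + 1, i + 1):
                    # SYRK (i == j) / GEMM: trailing-matrix update
                    tile(i, j)[:] -= tile(i, k) @ tile(j, k).T
        return np.tril(L)

    # usage: L = tiled_cholesky(M, 128); np.allclose(L @ L.T, M) should hold
    # for symmetric positive definite M whose size is a multiple of 128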
Investigate the role of the “Dense Motif” in ParLab Apps
• An initial study showed Dense Linear Algebra appearing in the Image, Speech, and Music applications
• Determine what is really needed
  • Functions, problem sizes, performance requirements
• What do we still need to optimize?