700 likes | 1.08k Views
CS294, Lecture #4 Fall, 2011 Communication-Avoiding Algorithms www.cs.berkeley.edu/~odedsc/CS294. Communication Avoiding Algorithms for Dense Linear Algebra. Jim Demmel. Based on: Bootcamp 2011, CS267, Summer-school 2010, many papers. Outline for today.
E N D
CS294, Lecture #4 Fall, 2011 Communication-Avoiding Algorithms www.cs.berkeley.edu/~odedsc/CS294 Communication AvoidingAlgorithmsforDense Linear Algebra Jim Demmel Based on: Bootcamp 2011, CS267, Summer-school 2010, many papers
Outline for today Recall communication lower bounds What a “matching upper bound”, i.e. CA-algorithm, might have to do Summary of what is known so far Case Studies of CA algorithms Case study I: Matrix multiplication Case study II: LU, QR Case study III: Cholesky decomposition
Communication lower bounds Applies to algorithms satisfying technical conditions discussed last time “smells like 3-nested loops” For each processor: M = size of fast memory of that processor G = #operations (multiplies) performed by that processor #words_moved to/from fast memory Ω ( max ( G / M1/2 , #inputs + #outputs ) ) = #messages to/from fast memory #words_moved max_message_size ≥ = Ω ( G / M3/2 )
Attaining these lower bounds • Depends on what processor, memory refer to • Sequential vsParallel_distributed_memoryvsParallel_shared_memory • Simple vs Hierarchical vs Messier… • DRAM+cachevs multiple levels of cache • Uniprocessors connected over network vs multicore processors connected over network vs multicore/multiboard/multirack/multisite • Uniform communication in network vs slower if farther away • Parallel machine with local memory hierarchies • Register file vs shared memory on GPUs • Homogeneous vs Heterogeneous • All flop_rates/bandwidths/latencies/mem_sizes same vs different • Use minimum fast_memoryvs all available fast_memory • Big design space (even just for matmul)
Summary of dense sequential algorithms attaining communication lower bounds • Algorithms shown minimizing # Messages use (recursive) block layout • Not possible with columnwise or rowwise layouts • Many references (see reports), only some shown, plus ours • Cache-oblivious are underlined, Green are ours, ? is unknown/future work
Summary of dense parallel algorithms attaining communication lower bounds • Assume nxn matrices on P processors • MinimumMemory per processor = M = O(n2/ P) • Recall lower bounds: • #words_moved = ( (n3/ P) / M1/2 ) = ( n2 / P1/2 ) • #messages = ( (n3/ P) / M3/2 ) = ( P1/2 )
Summary of dense parallel algorithms attaining communication lower bounds • Assume nxn matrices on P processors (conventional approach) • MinimumMemory per processor = M = O(n2/ P) • Recall lower bounds: • #words_moved = ( (n3/ P) / M1/2 ) = ( n2 / P1/2 ) • #messages = ( (n3/ P) / M3/2 ) = ( P1/2 )
Summary of dense parallel algorithms attaining communication lower bounds • Assume nxn matrices on P processors(conventional approach) • Minimum Memory per processor = M = O(n2/ P) • Recall lower bounds: • #words_moved = ( (n3/ P) / M1/2 ) = ( n2 / P1/2 ) • #messages = ( (n3/ P) / M3/2 ) = ( P1/2 )
Summary of dense parallel algorithms attaining communication lower bounds • Assume nxn matrices on P processors (better) • MinimumMemory per processor = M = O(n2/ P) • Recall lower bounds: • #words_moved = ( (n3/ P) / M1/2 ) = ( n2 / P1/2 ) • #messages = ( (n3/ P) / M3/2 ) = ( P1/2 )
Can we do even better? • Assume nxn matrices on P processors • Why just one copy of data: M = O(n2/ P) per processor? • Increase M to reduce lower bounds: • #words_moved = ( (n3/ P) / M1/2 ) = ( n2 / P1/2 ) • #messages = ( (n3/ P) / M3/2 ) = ( P1/2 )
Other algorithms of interest Variations on pivoting to find “best” rows and/or columns Best = most independent PAPT = LDLT with symmetric pivoting A symmetric, P permutation, L lower triangular, D (block) diagonal Best row/column pairs first PAPT = LTLT with symmetric pivoting A symmetric, P permutation, L lower triangular, T tridiagonal Best row/column pairs first AP = QR with column pivoting Best columns first PrAPcT=LU with complete pivoting Best rows and columns first
Naïve Matrix Multiply {implements C = C + A*B} for i = 1 to n for j = 1 to n for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) Algorithm has 2*n3 = O(n3) Flops and operates on 3*n2 words of memory q potentially as large as 2*n3 / 3*n2 = O(n) A(i,:) C(i,j) C(i,j) B(:,j) = + *
Naïve Matrix Multiply {implements C = C + A*B} for i = 1 to n {read row i of A into fast memory} for j = 1 to n {read C(i,j) into fast memory} {read column j of B into fast memory} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory} A(i,:) C(i,j) C(i,j) B(:,j) = + *
Naïve Matrix Multiply Number of slow memory references on unblocked matrix multiply m = n3 to read each column of B n times + n2 to read each row of A once + 2n2 to read and write each element of C once = n3 + 3n2 So q = f / m = 2n3 / (n3 + 3n2) 2 for large n, no improvement over matrix-vector multiply Inner two loops are just matrix-vector multiply, of row i of A times B Similar for any other order of 3 loops A(i,:) C(i,j) C(i,j) B(:,j) = + *
Matrix-multiply, optimized several ways Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops
Recall: finding implementations for an algorithm that reduce communication • Allowed: • Reordering arithmetic operations, preserving the dependencies. • Reordering summations, using associativity. • Reordering multiplications, using commutativity.(irrelevant for matrix multiply and other bilinear algorithms). • Not allowed (i.e., considered as different algorithms) • Other reorderingseven if by doing so we preserve the arithmetic correctness(e.g., using distributivity, as in Strassen’s algorithms) .
Blocked (Tiled) Matrix Multiply Consider A,B,C to be n/b-by-n/b matrices of b-by-b subblocks where b is called the block size for i = 1 to n/b for j = 1 to n/b {read block C(i,j) into fast memory} for k = 1 to n/b {read block A(i,k) into fast memory} {read block B(k,j) into fast memory} C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks} {write block C(i,j) back to slow memory} A(i,k) C(i,j) C(i,j) = + * B(k,j)
Blocked (Tiled) Matrix Multiply Recall: m is #words_moved between slow and fast memory matrix has nxn elements, and n/b x n/b blocks each of size bxb So: m = n3/bread each block of B (n/b)3times ((n/b)3* b2 = n3/b) + n3/bread each block of A (n/b)3times + 2n2 read and write each block of C once = 2n3/b + 2n2 So we can reduce communication by increasing the blocksize b Limit: 3b2 ≤ M = fast memory size m ≥ 2 * 31/2 *n3 / M1/2
What if there are more than 2 levels of memory? • Recall goal is to minimize communication between all levels • The tiled algorithm requires finding a good block size • Machine dependent • Need to “block” b x b matrix multiply in inner most loop • 1 level of memory 3 nested loops (naïve algorithm) • 2 levels of memory 6 nested loops • 3 levels of memory 9 nested loops … • Cache Oblivious Algorithms offer an alternative • Treat nxn matrix multiply as a set of smaller problems • Eventually, these will fit in cache • Will minimize # words moved between every level of memory hierarchy (between L1 and L2 cache, L2 and L3, L3 and main memory etc.) – at least asymptotically
Recursive Matrix Multiplication (RMM) (1/2) • For simplicity: square matrices with n = 2m • C = = A · B = · · = • True when each Aijetc 1x1 or n/2 x n/2 A11 A12 A21 A22 B11 B12 B21 B22 C11 C12 C21 C22 A11·B11 +A12·B21 A11·B12 +A12·B22 A21·B11 +A22·B21 A21·B12 +A22·B22 func C = RMM (A, B, n) if n = 1, C = A * B, else { C11 = RMM (A11 , B11 , n/2) + RMM (A12 , B21 , n/2) C12 = RMM (A11 , B12 , n/2) + RMM (A12 , B22 , n/2) C21 = RMM (A21 , B11 , n/2) + RMM (A22 , B21 , n/2) C22 = RMM (A21 , B12 , n/2) + RMM (A22 , B22 , n/2) } return
Recursive Matrix Multiplication (2/2) func C = RMM (A, B, n) if n=1, C = A * B, else { C11 = RMM (A11 , B11 , n/2) + RMM (A12 , B21 , n/2) C12 = RMM (A11 , B12 , n/2) + RMM (A12 , B22 , n/2) C21 = RMM (A21 , B11 , n/2) + RMM (A22 , B21 , n/2) C22 = RMM (A21 , B12 , n/2) + RMM (A22 , B22 , n/2) } return A(n) = # arithmetic operations in RMM( . , . , n) = 8 · A(n/2) + 4(n/2)2 if n > 1, else 1 = 2n3 … same operations as usual, in different order M(n) = # words moved between fast, slow memory by RMM( . , . , n) = 8 · M(n/2) + 4(n/2)2 if 3n2 > Mfast, else 3n2 = O( n3 / (Mfast )1/2 +n2 ) … same as blocked matmul
Recursion: Cache Oblivious Algorithms • Recursion for general A (mxn) * B (nxp) • Case1: m>= max{n,p}: split A horizontally: • Case 2 : n>= max{m,p}: split A vertically and B horizontally • Case 3: p>= max{m,n}: split B vertically • Attains lower bound in O() sense 1 2 Case 1 Case 2 Case 3
Experience with Cache-Oblivious Algorithms • In practice, need to cut off recursion well before 1x1 blocks • Call “Micro-kernel” for small blocks, eg 16 x 16 • Implementing a high-performance Cache-Oblivious code is not easy • Using fully recursive approach with highly optimized recursive micro-kernel, Pingali et al report that they never got more than 2/3 of peak. • Issues with Cache Oblivious (recursive) approach • Recursive Micro-Kernels yield less performance than iterative ones using same scheduling techniques • Pre-fetching is needed to compete with best code: not well-understood in the context of Cache Oblivous codes Unpublished work, presented at LACSI 2006
How hard is hand-tuning matmul, anyway? • Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09 • Students given “blocked” code to start with • Still hard to get close to vendor tuned performance (ACML) • For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
Parallel matrix-matrix multiplication • Consider distributed memory machines • Each processor has its own private memory • Communication by sending messages over a network • Examples: MPI, UPC, Titanium • First question: how is matrix initially distributed across different processors?
Different Parallel Data Layouts for Matrices (not all!) 1) 1D Column Blocked Layout 2) 1D Column Cyclic Layout 4) Row versions of the previous layouts b 3) 1D Column Block Cyclic Layout Generalizes others 6) 2D Row and Column Block Cyclic Layout 5) 2D Row and Column Blocked Layout
p0 p1 p2 p3 p4 p5 p6 p7 Matrix Multiply with 1D Column Block Layout • Assume matrices are n x n and n is divisible by p • A(i) refers to the n by n/p block column that processor i owns (similiarly for B(i) and C(i)) • B(i,j) is the n/p by n/p sublock of B(i) • in rows j*n/p through (j+1)*n/p • Algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Sj A(j)*B(j,i) May be a reasonable assumption for analysis, not for code
Matmul for 1D Column Block layout on a Processor Ring • Pairs of adjacent processors can communicate simultaneously Copy A(myproc) into Tmp C(myproc) = C(myproc) + Tmp*B(myproc , myproc) for j = 1 to p-1 Send Tmp to processor myproc+1 mod p Receive Tmp from processor myproc-1 mod p C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc) • Need to be careful about talking to neighboring processors • May want double buffering in practice for overlap • Ignoring deadlock details in code • Time of inner loop = 2*(a + b*n2/p) + 2*n*(n/p)2g • a = latency, b = latency, g = time_per_flop
Matmul for 1D layout on a Processor Ring • Time of inner loop = 2*(a + b*n2/p) + 2*n*(n/p)2g • Total Time = 2*n* (n/p)2g + (p-1) * Time of inner loop • 2*(n3/p)g + 2*p*a + 2*b*n2 • (Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast: • Perfect speedup for arithmetic • A(myproc) must move to each other processor, costs at least (p-1)*cost of sending n*(n/p) words • Parallel Efficiency = 2*n3g/ (p * Total Time) = 1/(1+ (a/g) * (p2/n3) + (b/g) * (p/n) ) = 1/ (1 + O(p/n)) • Grows to 1 as n/p increases (or a/g and b/g shrink)
MatMul with 2D Layout • Consider processors in 2D grid (physical or logical) • Processors communicate with 4 nearest neighbors • Assume p processors form square s x s grid, s = p1/2 p(0,0) p(0,1) p(0,2) p(0,0) p(0,1) p(0,2) p(0,0) p(0,1) p(0,2) = * p(1,0) p(1,1) p(1,2) p(1,0) p(1,1) p(1,2) p(1,0) p(1,1) p(1,2) p(2,0) p(2,1) p(2,2) p(2,0) p(2,1) p(2,2) p(2,0) p(2,1) p(2,2)
Cannon’s Algorithm … C(i,j) = C(i,j) + S A(i,k)*B(k,j) … assume s = sqrt(p) is an integer forall i=0 to s-1 … “skew” A left-circular-shift row i of A by i … so that A(i,j) overwritten by A(i,(j+i)mod s) forall i=0 to s-1 … “skew” B up-circular-shift column i of B by i … so that B(i,j) overwritten by B((i+j)mod s), j) for k=0 to s-1 … sequential forall i=0 to s-1 and j=0 to s-1 … all processors in parallel C(i,j) = C(i,j) + A(i,j)*B(i,j) left-circular-shift each row of A by 1 up-circular-shift each column of B by 1 k
Cannon’s Matrix Multiplication C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)
Initial Step to Skew Matrices in Cannon • Initial blocked input • After skewing before initial block multiplies A(0,0) A(0,1) A(0,2) B(0,0) B(0,1) B(0,2) A(1,0) A(1,1) A(1,2) B(1,0) B(1,1) B(1,2) A(2,0) A(2,1) A(2,2) B(2,0) B(2,1) B(2,2) A(0,0) A(0,1) A(0,2) B(0,0) B(1,1) B(2,2) A(1,1) A(1,2) A(1,0) B(1,0) B(2,1) B(0,2) A(2,2) A(2,0) A(2,1) B(2,0) B(0,1) B(1,2)
A(0,0) A(0,1) A(0,2) B(0,0) B(1,1) B(2,2) A(1,1) A(1,2) A(1,0) B(1,0) B(2,1) B(0,2) A(2,2) A(2,0) A(2,1) B(2,0) B(0,1) B(1,2) Skewing Steps in CannonAll blocks of A must multiply all like-colored blocks of B • First step • Second • Third A(0,2) A(0,0) A(0,1) B(1,0) B(2,1) B(0,2) A(1,0) A(1,1) B(2,0) B(0,1) B(1,2) A(1,2) A(2,0) A(2,2) B(0,0) B(1,1) B(2,2) A(2,1) A(0,0) B(2,0) B(0,1) B(1,2) A(0,2) A(0,1) A(1,0) B(0,0) B(1,1) B(2,2) A(1,1) A(1,2) A(2,0) A(2,1) A(2,2) B(1,0) B(2,1) B(0,2)
Cost of Cannon’s Algorithm forall i=0 to s-1 … recall s = sqrt(p) left-circular-shift row i of A by i … cost ≤ s*(a + b*n2/p) forall i=0 to s-1 up-circular-shift column i of B by i … cost ≤ s*(a + b*n2/p) for k=0 to s-1 forall i=0 to s-1 and j=0 to s-1 C(i,j) = C(i,j) + A(i,j)*B(i,j) … cost = 2*(n/s)3 = 2*n3/p3/2 left-circular-shift each row of A by 1 … cost = a + b*n2/p up-circular-shift each column of B by 1 … cost = a + b*n2/p • Total Time = 2*(n3/p)g+ 4*s* + 4**n2/s …. attains lower bound! • Parallel Efficiency = 2*n3g/ (p * Total Time) • = 1/( 1 + (a/g)* 2*(s/n)3 + (b/g)* 2*(s/n) ) • = 1/(1 + O(sqrt(p)/n)) • Grows to 1 as n/s = n/sqrt(p) = sqrt(data per processor) grows • Better than 1D layout, which had Efficiency = 1/(1 + O(p/n))
Pros and Cons of Cannon • So what if it’s “optimal”, is it fast? • Yes: Local computation one call to (optimized) matrix-multiply • Hard to generalize for • p not a perfect square • A and B not square • Dimensions of A, B not perfectly divisible by s=sqrt(p) • A and B not “aligned” in the way they are stored on processors • block-cyclic layouts • Can you show optimal Cannon-like implementation (up to a constant), that can deal with items 1 and 3 above? • Can you show optimal Cannon-like implementation (up to a constant), that can deal with item 2 above?(hint: what do you get if you use rectangle blocks with aspect ratios as the that of the input/output matrices?) • Memory hog (extra copies of local matrices)
SUMMA Algorithm • SUMMA = Scalable Universal Matrix Multiply • Slightly less efficient than Cannon, but simpler and easier to generalize • Can accommodate any layout, dimensions, alignment • Uses broadcast of submatrices instead of circular shifts • Sends log p times as much data as Cannon • Can use much less extra memory than Cannon, but send more messages • Similar ideas appeared many times • Used in practice in PBLAS = Parallel BLAS • www.netlib.org/lapack/lawns/lawn{96,100}.ps
SUMMA B(k,j) k j k * = C(i,j) i A(i,k) • i, j represent all rows, columns owned by a processor • k is a block of b 1 rows or columns • C(i,j) = C(i,j) + SkA(i,k)*B(k,j) • Assume a pr by pc processor grid (pr = pc = 4 above) • Need not be square
SUMMA B(k,j) k j k * = C(i,j) i A(i,k) For k=0 to n-1 … or n/b-1 where b is the block size … = # cols in A(i,k) and # rows in B(k,j) for all i = 1 to pr… in parallel owner of A(i,k) broadcasts it to whole processor row for all j = 1 to pc… in parallel owner of B(k,j) broadcasts it to whole processor column Receive A(i,k) into Acol Receive B(k,j) into Brow C_myproc = C_myproc + Acol * Brow
SUMMA performance • To simplify analysis only, assume s = sqrt(p) For k=0 to n/b-1 for all i = 1 to s … s = sqrt(p) owner of A(i,k) broadcasts it to whole processor row … time = log s *( a + b * b*n/s), using a tree for all j = 1 to s owner of B(k,j) broadcasts it to whole processor column … time = log s *( a + b * b*n/s), using a tree Receive A(i,k) into Acol Receive B(k,j) into Brow C_myproc = C_myproc + Acol * Brow … time = 2*(n/s)2*b • Total time = 2*(n3/p) g+ a* log p * n/b + b * log p * n2 /s
SUMMA performance • Total time = 2*(n3/p) g + a * log p * n/b + b * log p * n2 /s • Parallel Efficiency = • 1/(1 + (a/g)* log p * p / (2*b*n2) + (b/g)* log p * s/(2*n) ) • Same b term as Cannon, except for log p factor • log p grows slowly so this is ok • Latency (a/g) term can be larger, depending on b • When b=1, get (a/g)* log p * n • As b grows to n/s, term shrinks to • (a/g)* log p * s (log p times Cannon) • Temporary storage grows like 2*b*n/s • Can change b to tradeoff latency cost with memory
Matrix multiplication – summary • Cannon and SUMMA both (nearly) attain communication lower bound • Assuming 1 copy of data is allowed (plus a constant amount of buffer space) • Depend on assumptions about initial, final data layouts • Called “2D algorithms” to reflect layout • When more memory is available, get 2.5D algorithm • Future talk (Edgar Solomonik) • When processors heterogeneous, get another algorithm • Future talk (Grey Ballard and Andrew Gearhart) • Since Matmul naturally decomposes into smaller analogous problems, hierarchical machines can be accommodated
A brief history of (Dense) Linear Algebra software (1/5) • In the beginning was the do-loop… • Libraries like EISPACK (for eigenvalue problems) • Then the BLAS (1) were invented (1973-1977) • Standard library of 15 operations (mostly) on vectors • “AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc • Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC • Goals • Common “pattern” to ease programming, readability, self-documentation • Robustness, via careful coding (avoiding over/underflow) • Portability + Efficiency via machine specific implementations • Why BLAS 1 ? They do O(n1) ops on O(n1) data • Used in libraries like LINPACK (for linear systems) • Source of the name “LINPACK Benchmark” (not the code!)
A brief history of (Dense) Linear Algebra software (2/5) • But the BLAS-1 weren’t enough • Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes • “Computational intensity” = #flops / #mem_refs = (2n)/(3n) = 2/3 • Too low to run near peak speed (time for mem_refs dominates) • Hard to vectorize (“SIMD’ize”) on supercomputers of the day (1980s) • So the BLAS-2 were invented (1984-1986) • Standard library of 25 operations (mostly) on matrix/vector pairs • “GEMV”: y = α·A·x + β·x, “GER”: A = A + α·x·yT, “TRSV”: y = T-1·x • Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC • Why BLAS 2 ? They do O(n2) ops on O(n2) data • So computational intensity still just ~(2n2)/(n2) = 2 • OK for vector machines, but not for machine with caches
A brief history of (Dense) Linear Algebra software (3/5) • The next step: BLAS-3 (1987-1988) • Standard library of 9 operations (mostly) on matrix/matrix pairs • “GEMM”: C = α·A·B + β·C, “SYRK”: C = α·A·AT + β·C, “TRSM”: C = T-1·B • Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC • Why BLAS 3 ? They do O(n3) ops on O(n2) data • So computational intensity (2n3)/(4n2) = n/2 – big at last! • Tuning opportunities machines with caches, other mem. hierarchy levels • How much BLAS1/2/3 code so far (all at www.netlib.org/blas) • Source: 142 routines, 31K LOC, Testing: 28K LOC • Reference (unoptimized) implementation only • Ex: 3 nested loops for GEMM • Lots more optimized code • Most computer vendors provide own optimized versions • Motivates “automatic tuning” of the BLAS
A brief history of (Dense) Linear Algebra software (4/5) • LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now) • Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1 • How do we reorganize GE to use BLAS-3 ? (details later) • Contents of LAPACK (summary) • Algorithms we can turn into (nearly) 100% BLAS 3 • Linear Systems: solve Ax=b for x • Least Squares: choose x to minimize ||r||2S ri2where r=Ax-b • Algorithms we can only make up to ~50% BLAS 3 (so far) • “Eigenproblems”: Findland x where Ax = l x • Singular Value Decomposition (SVD): ATAx=2x • Error bounds for everything • Lots of variants depending on A’s structure (banded, A=AT, etc) • How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack) • Source: 1582 routines, 490K LOC, Testing: 352K LOC • Ongoing development (at UCB, UTK and elsewhere)
A brief history of (Dense) Linear Algebra software (5/5) • Is LAPACK parallel? • Only if the BLAS are parallel (possible in shared memory) • ScaLAPACK – “Scalable LAPACK” (1995 – now) • For distributed memory – uses MPI • More complex data structures, algorithms than LAPACK • Only (small) subset of LAPACK’s functionality available • All at www.netlib.org/scalapack