Explore block-matrix algorithms for matrix multiplication, commonly used in graphics, neural nets, and other linear-algebra tasks. Delve into compute-to-memory ratios and what they imply for efficient parallel processing.
EE 193: Parallel Computing, Fall 2017, Tufts University • Instructor: Joel Grodstein (joel.grodstein@tufts.edu) • Lecture 7: Matrix multiply
Goals • Primary goals: • Understand block-matrix algorithms
Matrix multiplication • Matrix multiplication is the poster child for parallel computation. Some uses: • Computer graphics (e.g., linear algebra for transforming points between coordinate spaces) • Anything that uses linear algebra • Neural nets (perhaps the biggest use of GPGPUs) • And that's good – because matrix multiplication is easy, right? • It's easy to get it correct. • But not so easy to make it fast.
Matrix times vector • Let's start simple: a (4x4 matrix) * (4x1 vector). Write the eqns. Write the code. [figure: A (4x4) · x = y] • for r=0…N-1 • for k=0…N-1 • y[r] += A[r,k]*x[k]; • If N were big, what do you think about the temporal and spatial locality of this code?
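To make the loop concrete, here is a minimal runnable C++ sketch (flat row-major storage; the function name matvec and its signature are mine, not from the lecture):

    #include <vector>

    // Minimal sketch: y = A*x, with the N-by-N matrix A stored row-major in a flat array.
    void matvec(const std::vector<float>& A, const std::vector<float>& x,
                std::vector<float>& y, int N) {
        for (int r = 0; r < N; ++r) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[r*N + k] * x[k];  // consecutive k touch consecutive addresses
            y[r] = sum;
        }
    }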
Matrix times vector • Temporal: • each element of A is used only once. • each element of x is used N times. • this is only useful if x fits into cache (hopefully easy) • Spatial: • the inner loop accesses A in row-major order • for r=0…N-1 • for k=0…N-1 • y[r] += A[r,k]*x[k]; [figure: A (4x4) · x = y]
Data for HSX and Pascal • Look at Haswell server compute and memory
Xeon compute-to-memory ratio • 125G operands/sec • Max DRAM bandwidth = 25G floats/sec • Note that 125 > 25. So • If we are reading our data from main memory (not from caches)… • then we cannot keep all FPUs in all cores busy • Note that 125/25 = 5 • If every piece of data gets used once (from memory) and four more times (from cache)… • then we can keep all FPUs in all cores busy • Now let's look at a consequence of this
Compute/memory ratio • for i=0…N-1 • for k=0…N-1 • y[i] += A[i,k]*x[k]; • How many computations are there? N² FMACs. • How many floats are read/written? N² + 2N. • The compute/memory ratio is roughly 1. • We want every piece of data to be reused 4 times – but it's only used once, period • So a Xeon cannot get data in from memory fast enough to keep the FPUs busy. • Even perfect caches cannot fix this! [figure: A (4x4) · x = y]
What does it all mean? • Big-data problems do not fit in your registers. • They're unlikely to fit in L1. • They may fit in L2… or L3… or main memory… • If they fit into L1, all of this is irrelevant • since the BW and latency from L1 to RF is so good. • If they fit into main memory… • A chain is no stronger than its weakest link. • Cache can buffer, reuse, even out flow, etc. • If the compute/memory ratio is <5, then HSX cannot use all of its computes • But if the average DRAM bandwidth is the weak link, there is no way to fix that.
A slightly different problem • for i=0…N-1 • for k=0…N-1 • y[i] += A[i,k]*x[k]; • Related problem: what if A is fixed, and we want to stream in lots of x vectors? • Consider the work for one vector x producing one vector y, assuming A is already in cache. • How many computations are there? N² FMACs. • How many floats are read/written? 2N (read x, write y). • The compute/memory ratio is N/2. • So HSX can deal with this for N≥10. • Note that computer graphics typically has only N=4. But the graphics pipeline has more than just one matrix multiply. [figure: A (4x4) · x = y]
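A sketch of the streaming case, assuming A is already cache-resident and M input vectors are stored back to back (all names here are illustrative):

    #include <vector>

    // Apply one fixed N-by-N matrix A to M input vectors packed into xs;
    // results go into ys. A is reused M times; each x and y is touched once.
    void matvec_stream(const std::vector<float>& A, const std::vector<float>& xs,
                       std::vector<float>& ys, int N, int M) {
        for (int v = 0; v < M; ++v) {
            const float* x = &xs[v*N];
            float* y = &ys[v*N];
            for (int r = 0; r < N; ++r) {
                float sum = 0.0f;
                for (int k = 0; k < N; ++k)
                    sum += A[r*N + k] * x[k];
                y[r] = sum;
            }
        }
    }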
More on matrix*vector • for i=0…N-1 • for k=0…N-1 • y[i] += A[i,k]*x[k]; • Keep looking at the same problem: A is fixed, and we want to stream in lots of x vectors. • In this example, do we have any need for caches? • We need someplace to store the A matrix. • Our code may not wind up accessing the various x vectors exactly in address-sequential order; a cache can store the ones we skip until we use them. [figure: A (4x4) · x = y]
Matrix times matrix • Let's move on to a bigger problem: matrix*matrix. [figure: C = A·B (4x4 matrices)] • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c];
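The same triple loop as runnable C++ (a sketch, not a tuned kernel; C is assumed zero-initialized):

    #include <vector>

    // Naive r,c,k matrix multiply on flat row-major arrays.
    void matmul_rck(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int N) {
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                for (int k = 0; k < N; ++k)
                    C[r*N + c] += A[r*N + k] * B[k*N + c];  // B is walked with stride N
    }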
Matrix times matrix • Let's move on to a bigger problem: matrix*matrix. [figure: C = A·B (4x4 matrices)] • How many computations are there? N³ FMACs. • How many floats are read/written? 3N². • The compute/memory ratio is N/3. So, in principle, this should be easy! • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c];
Matrix access patterns • Access pattern for A? • row major, repeat each row N times • row will mostly be accessed from a small, fast cache • Access pattern for B? • column major, repeat entire matrix N times • accessing from main memory, and slowly • Access pattern for C? • row major, repeat each element N times • extremely fast access • Can we fix this? [figure: C = A·B] • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c];
Our trick: swap the 2nd and 3rd loops. • Any change in the end-result computation? • We still execute the final line with the same r,c,k triplets, just in a different order. And adding the same numbers in a different order doesn't change the result (up to floating-point rounding). [figure: C = A·B] • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c];
In-class exercise • Draw matrix multiply for 2x2 matrices on the board • Divide the class into two groups • one group works on the original loop ordering, and the other on the swapped ordering • Each group manually runs the code, and notes which A*B products got summed into which locations in C. • At the end, compare to see if they both added the same product pairs
Matrix access patterns • Access pattern for A? • row major, repeat each element N times • extremely fast access • Access pattern for B? • row major, repeat entire matrix N times • but if we cannot fit B into the L3, we must still keep reloading it from memory • Access pattern for C? • row major, repeat each row N times • row will mostly be accessed from a small, fast cache • Can we fix this? [figure: C = A·B] • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c];
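The interchanged r,k,c ordering in C++; only the loop nest changes, so the same products are still summed:

    #include <vector>

    // Loop-interchanged multiply: B is now walked row-major (stride 1),
    // and A[r,k] is hoisted out of the inner loop.
    void matmul_rkc(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int N) {
        for (int r = 0; r < N; ++r)
            for (int k = 0; k < N; ++k) {
                float a = A[r*N + k];
                for (int c = 0; c < N; ++c)
                    C[r*N + c] += a * B[k*N + c];
            }
    }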
[animation over four slides: with the r,k,c ordering, r steps 0→3; for each r, k steps 0→3, and row r of C accumulates A[r,k] times row k of B] • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c];
Let's go parallel. Assign each thread to just a few r values. We want to assign each thread to one core, and have each thread only access as much memory as can fit in L2. Does it work? • Each thread must now access only its own rows of A and C. • Each thread still accesses every element of B, once per value of r. • Conclusion: each thread must be able to hold all of B in its cache. Other than that, it parallelizes quite well (see the sketch below). • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; [figure: C = A·B]
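One way the row-partitioned version might look with C++ threads (a sketch under the slide's assumptions; each thread owns a distinct band of rows, so no two threads ever write the same element of C):

    #include <thread>
    #include <vector>

    void matmul_row_threads(const std::vector<float>& A, const std::vector<float>& B,
                            std::vector<float>& C, int N, int nthreads) {
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back([&A, &B, &C, N, nthreads, t] {
                int lo = t * N / nthreads, hi = (t + 1) * N / nthreads;  // this thread's rows
                for (int r = lo; r < hi; ++r)
                    for (int k = 0; k < N; ++k)
                        for (int c = 0; c < N; ++c)
                            C[r*N + c] += A[r*N + k] * B[k*N + c];
            });
        for (auto& th : pool) th.join();
    }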
Block-matrix algorithms • What happens if B doesn't fit into our relevant cache? Is this a real concern? • YES. Matrices can be big, we want to be able to do linear algebra on small CPUs (e.g., for Folding@home), and GPUs have small caches. • Let's try to change our algorithm to work even on small caches and big matrices.
Baby step • Consider the two programs: • for a=0..1 • for b=0..3 • i=4*a+b; • do something with i; • for i=0..7 • do something with i; • Do they get the same results? • Unless the "do something" is very tricky, yes they do.
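The "baby step" as a runnable check (the squaring stands in for "do something"):

    #include <cassert>

    int main() {
        int sum1 = 0, sum2 = 0;
        for (int a = 0; a <= 1; ++a)
            for (int b = 0; b <= 3; ++b) {
                int i = 4*a + b;   // i visits 0..7 exactly once
                sum1 += i*i;       // "do something with i"
            }
        for (int i = 0; i <= 7; ++i)
            sum2 += i*i;
        assert(sum1 == sum2);      // both loops did the same work
    }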
Interpretation: matrix blocks [figure: the 4x4 matrix A divided into four 2x2 blocks] • Break the 4x4 matrix into 4 2x2 blocks • Block 0,0 • Block 0,1 • Block 1,0 • Block 1,1
Do we believe that this has the same result as before? • Yes. But it is getting a bit confusing. [figure: C = A·B] • for rb=0…Nb-1 (pick the row) • for ri=0..Ni-1 • for kb=0..Nb-1 (pick k) • for ki=0..Ni-1 • for cb=0..Nb-1 (pick the column) • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c];
Now swap some loops: pull the three block loops (rb, kb, cb) outside the three within-block loops (ri, ki, ci). [figure: C = A·B] • Before: for rb / for ri / for kb / for ki / for cb / for ci • After: for rb / for kb / for cb / for ri / for ki / for ci (next slide) • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c];
What? Doesn't it only take two numbers to specify a block??? [figure: C = A·B] • for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 (pick the block) • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 (pick the offset within a block) • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c];
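The six-loop block version as C++ (a sketch assuming the block edge Ni divides N evenly):

    #include <vector>

    void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, int N, int Ni) {
        int Nb = N / Ni;                              // number of blocks per dimension
        for (int rb = 0; rb < Nb; ++rb)               // pick the three blocks...
            for (int kb = 0; kb < Nb; ++kb)
                for (int cb = 0; cb < Nb; ++cb)
                    for (int ri = 0; ri < Ni; ++ri)   // ...then the offsets within them
                        for (int ki = 0; ki < Ni; ++ki)
                            for (int ci = 0; ci < Ni; ++ci) {
                                int r = rb*Ni + ri, k = kb*Ni + ki, c = cb*Ni + ci;
                                C[r*N + c] += A[r*N + k] * B[k*N + c];
                            }
    }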
Yes, it only takes two numbers to specify a block. But A, B and C each use a different index pair: C is indexed by (rb,cb), A by (rb,kb), and B by (kb,cb). [animation over three slides: frames step through successive (rb,cb,kb) triplets – (0,0,0), (0,0,1), … – highlighting the three blocks each triplet touches] • for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 (pick the three blocks) • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c];
Why is this useful? • We only work on three blocks at a time • We can pick the block size so that three blocks will fit into cache • Why isn't this a free lunch? • Instead of traversing entire rows of the matrix, we only traverse rows of the block → we take less advantage of spatial locality. • The cache turns over after each block triplet → our compute/mem ratio is really Ni/3 rather than N/3. Basically, we're not getting as much temporal reuse. • for each triplet of three blocks specified by rb, cb, kb • for each element combination ri, ci, ki within the blocks • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; [figure: C = A·B]
Look at blocks another way • Multiply two 16x16 matrices, viewed as a 4x4 grid of 4x4 blocks. [animation over several slides: frames highlight output element (4,0), then (4,1), then (5,0), and finally all of block (1,0)]
Look at blocks another way • Interpretation: block[1,0] = row[1] ∙ column[0] (block row 1 of A times block column 0 of B) • The output block depends only on the input blocks. • We reused the same rows/columns over & over.
Block-matrix and threads • We have a block-matrix algorithm • We even understand it • It should make good use of caches, and thus be quite fast. • Next problems: • ordering the inner and outer loops • multithreading • Simplifying assumption: • Pick the block size so that it fits into the L2 (or lower) cache • Use the L3 (i.e., ring cache) to get "some" reuse of blocks without going to main memory
for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 (pick the offset within a block) • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; • Does it matter what order we put the three inner loops? After all, blocks fit into the L2. • Probably. It would be nice to get a lot of hits from the L1, to make sure our cores don't wait for memory.
for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 (pick the block) • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; • Does it matter what order we put the three outer loops? After all, there's enough computation that the initial block-loading time doesn't matter. • Perhaps. But that depends on a lot of factors, and it's easy enough to pick a good loop ordering. That will let us reload blocks from the L3 rather than from main memory. • What order to use? • rb/kb/cb looks good, for the same reasons as before.
for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 (pick the block) • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 (pick the offset within a block; this part goes in a thread) • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; • How many threads should we use? • Start with one thread per block triplet (i.e., one thread per rb/kb/cb triplet). We can change that if we want to.
for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 (pick the block) • thread (thread_func, rb, kb, cb); (the three within-block loops go inside the thread) • Start with one thread per block triplet (i.e., one thread per rb/kb/cb triplet). • How many threads are there? Nb³. • If we have 1Kx1K matrices and 64x64 blocks, then how many threads are there? • 1K/64 = 16. So Nb=16, and there are 16³ = 4K threads • What do you think about the code above? • Way too many threads to launch all at once!
for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • thread (thread_func, rb, kb, cb); (pick the block) • Another issue: do we need a critical section to avoid multiple threads writing the same data values? • Yes. All threads with the same rb and cb try to write the same locations of the output matrix. • How many threads are fighting each other? • Nb (i.e., all the different values of kb) • If we use a critical section to protect the "C[r,c] += A[r,k]*B[k,c]", how will that affect performance? • Basically, no two threads can run at the same time any more • If we make each thread keep its own private work area and then merge them, how will that work? • Each thread will need to keep an entire block as its own work space. That's potentially a lot of space. And the merging is (#threads)·Ni² flops, which is non-trivial (see the sketches below).
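Sketches of the two alternatives above (the mutex name and helper functions are mine). Option 1 serializes every accumulate; option 2 gives each thread a private Ni×Ni scratch block and takes the lock only once, at merge time:

    #include <mutex>
    #include <vector>

    std::mutex c_lock;

    // Option 1: wrap every accumulate in a critical section (correct but serial).
    void accumulate_locked(std::vector<float>& C, float prod, int r, int c, int N) {
        std::lock_guard<std::mutex> g(c_lock);
        C[r*N + c] += prod;
    }

    // Option 2: a thread sums its block triplet into 'scratch' privately,
    // then merges the Ni*Ni partial results into C under the lock just once.
    void merge_block(std::vector<float>& C, const std::vector<float>& scratch,
                     int N, int Ni, int rb, int cb) {
        std::lock_guard<std::mutex> g(c_lock);
        for (int ri = 0; ri < Ni; ++ri)
            for (int ci = 0; ci < Ni; ++ci)
                C[(rb*Ni + ri)*N + (cb*Ni + ci)] += scratch[ri*Ni + ci];
    }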
for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 (pick the block) • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 (pick the offset within a block; this goes in a thread) • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; • Any ideas? • What if we only launch a few at a time? With 4 cores, maybe launch 4 at a time? • do { • launch 4 threads; • wait for them to finish; • } until you've gone through every one of the Nb³ triplets (in Nb³/4 groups) • Does it fix the problem of having too many threads at once? • Yes • Do we still need a critical section? • Probably not: the threads in one group share rb and kb but have different cb, so no two of them write the same block of C.
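The "launch a few at a time" metacode in C++ threads (a sketch; thread_func is the per-triplet worker from the earlier slides, declared but not shown here):

    #include <thread>
    #include <vector>

    void thread_func(int rb, int kb, int cb);  // per-triplet worker, defined elsewhere

    void run_in_batches(int Nb, int ncores) {
        std::vector<std::thread> batch;
        for (int rb = 0; rb < Nb; ++rb)
            for (int kb = 0; kb < Nb; ++kb)
                for (int cb = 0; cb < Nb; ++cb) {
                    batch.emplace_back(thread_func, rb, kb, cb);
                    if ((int)batch.size() == ncores) {
                        for (auto& t : batch) t.join();  // wait for the slowest
                        batch.clear();
                    }
                }
        for (auto& t : batch) t.join();  // any leftover partial batch
    }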
Results on Norbert • for kb=0…Nb-1 • for rb=0..Nb-1 • for cb=0..Nb-1 • Spawn a thread to compute C[r,c] += A[r,k]*B[k,c]; • (for every ri,ci,ki in the three blocks) • if (we've spawned as many threads as cores) • wait for them all to finish; • Very slow. Why? Launching a thread takes .01 ms; 32K launches is about .3 s total. • Another problem, too. • 1Kx1K matrices, 32x32 blocks → Nb=32. • How many threads? 32x32x32 = 32K. • How many groups of 32 threads (there are 32 cores)? 1024. • Different threads go at different rates, so we have 1024 iterations of "wait for the slowest thread to finish." Hugely inefficient.
What to do? • for kb=0…Nb-1 • for rb=0..Nb-1 • for cb=0..Nb-1 • Spawn a thread to compute C[r,c] += A[r,k]*B[k,c]; • (for every r,c,k in the three blocks) • if (we've spawned as many threads as cores) • wait for them all to finish; • What are your thoughts on how to fix this? • We launched too many threads • We kept having the fastest thread wait for the slowest • You get to take your best shot in HW #3. • But you can brainstorm together here first
Brainstorming • Brainstorm ideas and write some metacode on the board • Questions to ask: • you don't want to launch a million threads. Roughly how many threads do you want to launch? • Can you divide up the work between threads so that you don't need a critical section? • You must initialize the output matrix to 0 before threads start summing partial products into it. Can you distribute this work among threads also?
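As a starting point for the brainstorm (one possible shape, not the HW #3 answer; the partitioning rule here is hypothetical): give each thread its own set of output blocks (rb,cb) and let it run the whole kb loop itself, so no two threads ever touch the same part of C, and each thread zeroes its own blocks:

    #include <vector>

    // Hypothetical partitioning: thread t owns every output block (rb,cb)
    // with (rb*Nb + cb) % nthreads == t. No critical section is needed.
    void worker(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int N, int Ni, int t, int nthreads) {
        int Nb = N / Ni;
        for (int rb = 0; rb < Nb; ++rb)
            for (int cb = 0; cb < Nb; ++cb) {
                if ((rb*Nb + cb) % nthreads != t) continue;      // not my block
                for (int ri = 0; ri < Ni; ++ri)                  // zero my output block
                    for (int ci = 0; ci < Ni; ++ci)
                        C[(rb*Ni + ri)*N + cb*Ni + ci] = 0.0f;
                for (int kb = 0; kb < Nb; ++kb)                  // accumulate every kb myself
                    for (int ri = 0; ri < Ni; ++ri)
                        for (int ki = 0; ki < Ni; ++ki)
                            for (int ci = 0; ci < Ni; ++ci)
                                C[(rb*Ni + ri)*N + cb*Ni + ci] +=
                                    A[(rb*Ni + ri)*N + kb*Ni + ki] * B[(kb*Ni + ki)*N + cb*Ni + ci];
            }
    }

Launch nthreads of these (one per core) and join them once; every thread does roughly the same amount of work, so there is no repeated "wait for the slowest thread" barrier.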
SMT • Onto the next problem • We've assumed 1 thread/core until now. Any problem with that? • SMT is often a good thing. It allows one thread to compute while the other thread is stalled anyway. • If we write the 1st thread really well, perhaps it will never stall anyway? • Let's try to use hyperthreading. • Can we spawn 2 threads/core instead of 1?
C++ threads can really suck • Let's say we divvy up the work on a block triplet among two threads in the same core • But wait – we have a big problem • The C++ thread library does not let you assign threads to cores; it lets the O/S do its best • There's no guarantee our two threads, using the same block triplet, will be assigned to the same core. • In fact, even if we only have 1 thread/core, a stupid enough O/S might put all the threads on 1 core.
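There is a nonportable escape hatch on Linux: std::thread exposes its underlying pthread via native_handle(), and pthread_setaffinity_np can pin it to a core (a sketch; error handling omitted, compile with g++ -pthread):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Pin a std::thread to one core (Linux/glibc only).
    void pin_to_core(std::thread& t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }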