Solving Common Problems With GPU
A Case Study with Sparse Matrices
Sean Baxter (324)
ARTS Talk
17 Mar 2011
www.github.com/seanbaxter/sparse/
Why GPU
• 1972 – Intel 8008 – 10000nm – 3,500 transistors
• 1985 – Intel 386 – 1000nm – 275k transistors
• 2000 – Pentium 4 – 180nm – 42m transistors
• 2011 – Sandy Bridge – 32nm – 995m transistors
• Now is the time to move to a new architecture
• CMOS scaling is a victim of its own success
Running out of Atoms
• 28nm is the half-distance between the start of repeating features
• The radius of a Si atom is 0.11nm
• Only ~250 Si atoms fit between the start of repeating features
• We will soon run out of atoms
• Optimize the die to see better performance
Optimize the Die
• CPU pipeline
• Architecture conforms to the programmer
• Redundant parallel circuitry for program correctness with long pipelines
• GPU pipeline
• Programmer conforms to the hardware
• Latency hiding allows long pipelines with great efficiency
GeForce GTX 580 Architecture
• 16 Streaming Multiprocessors (SMs)
• 1.5 TFlops
• 192 GB/s bandwidth
• SMs are like independent processors
• Use coarse-grained parallelism to balance work over SMs
SM Architecture
• NV warp = 32 threads, ATI warp = 64 threads
• Warps are serviced by the instruction scheduler
• Process as many warps as feasible on the SM to hide instruction and memory latencies
• The warp is the fundamental unit of execution
• Threads are a convenient abstraction
• Load balance to warps
SM Architecture
• SM supports 1536 threads in flight
• SM is divided among up to 8 resident blocks
• 48k shared memory divided among blocks
• 32768 32-bit registers divided among blocks
• High occupancy means better latency hiding
• Limited to 20 regs/thread (32768 / 1536 ≈ 21) and 32 bytes shared/thread (48k / 1536) for 100% occupancy
Memory Bandwidth
• AMD Phenom II: 14.6 GB/s
• Intel Sandy Bridge Quad: 21.3 GB/s
• Itanium 2 (NASA Columbia): 6.4 GB/s
• NVIDIA GeForce GTX 580: 192 GB/s
Rule of thumb: the GPU has 10x the memory bandwidth and 50x the arithmetic throughput of the CPU
Memory Architecture
• 128-byte memory segments
• Memory requests are collected into segment transactions
• Requests to the same segment from a warp produce a single "coalesced" request
• Coalesced requests are the most important optimization for bandwidth-bound kernels
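To make the transaction rules concrete, here is a hypothetical pair of CUDA kernels (not from the talk; names and the stride parameter are mine). In the first, a warp's 32 consecutive 4-byte loads fall into one or two 128-byte segments and coalesce; in the second, a large stride scatters the warp's loads across many segments.

// Hypothetical illustration of coalescing, not from the talk's library.
__global__ void CopyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        out[i] = in[i];           // warp touches 1-2 128-byte segments
}

__global__ void CopyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        out[i] = in[i * stride];  // up to 32 segment transactions per warp
}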
Hitting Peak Bandwidth
• Issue one coalesced request every 16 cycles
• For large structures (> 16 bytes) spread loads over multiple threads
• Use the texture cache for un-coalesced reads with coherence
• Peak bandwidth is achievable – keep tinkering until you hit it
Obvious Parallelization
• Consider a CSR-encoded MxN sparse matrix
• Launch M threads (one per row)
• Each thread reads an interval from the Row array
• Each thread dynamically loops over all non-zero values in the row, reads from the Col array, and uses that to index into the dense vector texture
• Accumulate products and write
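A minimal sketch of this obvious one-thread-per-row kernel, assuming plain CSR arrays and a plain global-memory read of the dense vector in place of the texture fetch (all names are illustrative):

// Naive CSR SpMV: one thread per matrix row.
__global__ void SpMVNaive(const int* rows, const int* cols,
                          const float* vals, const float* vec,
                          float* out, int numRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if(row >= numRows) return;

    // Each thread reads its interval from the Row array.
    int begin = rows[row];
    int end = rows[row + 1];

    // Dynamic loop over the row's non-zeros; Col indexes the dense vector.
    float sum = 0.0f;
    for(int i = begin; i < end; ++i)
        sum += vals[i] * vec[cols[i]];

    out[row] = sum;   // accumulate products and write
}

The next two slides explain why this straightforward mapping performs poorly on real hardware.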
Bad With Memory
• Consider memory transactions on GPU hardware
• Threads 0-31 (computing rows 0-31) run concurrently
• If the average row density is 9 values, the warp reads over an interval of 32*9 = 288 values
• 288 values span 10 segments
• Memory throughput is 1/10th of peak
Bad With Threads
• Threads 0-31 (computing rows 0-31) run concurrently
• All threads take as long as the thread with the most work
• If all threads in the warp have rows with 5 values, except one thread with 40 values, all threads wait through 40 values
• Runs at 1/8th efficiency!
The Moral
• SMs are not coarse-grained parallel processors
• Schedule for warp execution and coalesced transactions
Scan
• The essential GPU utility
• Not a callable routine – more a technique
• Facilitates communication between threads
• Load balances jagged problems
• Allows allocation on a device with no dynamic memory
Scan
• At its simplest, adds up a sequence of numbers:
• Inclusive scan transforms a sequence: (3, 2, 1, 4, 5) -> (3, 5, 6, 10, 15)
• Exclusive scan transforms a sequence: (3, 2, 1, 4, 5) -> (0, 3, 5, 6, 10)
• The sequential algorithm is trivial
• The parallel algorithm is complex and work-inefficient: O(n log n)
• In the GPGPU world, that's progress
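For reference, the trivial sequential versions are a single O(n) loop each (host-side sketch; function names are mine). The parallel version below trades this work efficiency for parallelism.

// Sequential inclusive scan: out[i] = in[0] + ... + in[i].
void InclusiveScan(const int* in, int* out, int n) {
    int sum = 0;
    for(int i = 0; i < n; ++i) { sum += in[i]; out[i] = sum; }
}

// Sequential exclusive scan: out[i] = in[0] + ... + in[i - 1].
void ExclusiveScan(const int* in, int* out, int n) {
    int sum = 0;
    for(int i = 0; i < n; ++i) { out[i] = sum; sum += in[i]; }
}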
// Basic 32-element scan in cs_5_0 HLSL
#define NUM_THREADS 32
#define sync GroupMemoryBarrierWithGroupSync

Buffer<uint> readbuf : register(t0);
RWBuffer<uint> writebuf : register(u0);

groupshared uint sharedArray[NUM_THREADS];

[numthreads(NUM_THREADS, 1, 1)]
void WarpScan(uint tid : SV_GroupIndex) {
    uint x = readbuf[tid];
    sharedArray[tid] = x;
    sync();

    [unroll]
    for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
        uint left = (NUM_THREADS - 1) & (tid - offset);
        uint y = sharedArray[left];
        sync();

        if(offset <= tid) x += y;
        sharedArray[tid] = x;
        sync();
    }
    writebuf[tid] = x;
}
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_resource_buffer (uint,uint,uint,uint) t0
dcl_uav_typed_buffer (uint,uint,uint,uint) u0
dcl_input vThreadIDInGroupFlattened
dcl_temps 3
dcl_tgsm_structured g0, 4, 32
dcl_thread_group 32, 1, 1
ld_indexable(buffer)(uint,uint,uint,uint) r0.x, vThreadIDInGroupFlattened.xxxx, t0.xyzw
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
iadd r1.xyzw, vThreadIDInGroupFlattened.xxxx, l(-1, -2, -4, -8)
and r1.xyzw, r1.xyzw, l(31, 31, 31, 31)
ld_structured r0.y, r1.x, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
sync_g_t
uge r2.xyzw, vThreadIDInGroupFlattened.xxxx, l(1, 2, 4, 8)
movc r0.x, r2.x, r0.y, r0.x
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
ld_structured r0.y, r1.y, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
movc r0.x, r2.y, r0.y, r0.x
sync_g_t
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
ld_structured r0.y, r1.z, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
movc r0.x, r2.z, r0.y, r0.x
sync_g_t
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
ld_structured r0.y, r1.w, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
movc r0.x, r2.w, r0.y, r0.x
sync_g_t
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r0.x
sync_g_t
iadd r0.y, vThreadIDInGroupFlattened.x, l(-16)
and r0.y, r0.y, l(31)
ld_structured r0.y, r0.y, l(0), g0.xxxx
iadd r0.y, r0.x, r0.y
uge r0.z, vThreadIDInGroupFlattened.x, l(16)
movc r0.x, r0.z, r0.y, r0.x
store_uav_typed u0.xyzw, vThreadIDInGroupFlattened.xxxx, r0.xxxx
ret
// Approximately 38 instruction slots used
Segmented Scan
• Sum from left-to-right within segments
• Same as the code above, but with a modified predicate:

[unroll]
for(uint offset = 1; offset < NUM_THREADS; offset <<= 1) {
    uint left = (NUM_THREADS - 1) & (tid - offset);
    uint y = sharedArray[left];
    sync();

    if(offset <= delta) x += y;
    sharedArray[tid] = x;
    sync();
}

• delta is the distance from tid to the start of its segment
Segmented Scan
• Brackets mark the segments (shown as colors on the original slide):
[2 1 2] [0 3 4 5] [12 10] [4 5 2 1]
• Segmented scan performs a complete scan within each segment:
[2 3 5] [0 3 7 12] [12 22] [4 9 11 12]
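A host-side reference may help pin down the semantics. This sketch assumes segments are described by head flags, one per element, true at each segment start (names are mine):

// Sequential segmented inclusive scan over head-flagged data.
void SegmentedScanRef(const int* in, const bool* head, int* out, int n) {
    int sum = 0;
    for(int i = 0; i < n; ++i) {
        if(head[i]) sum = 0;   // restart the running total at a segment head
        sum += in[i];
        out[i] = sum;
    }
}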
Sparse Matrix
• Sparse matrix * vector is a very obvious segmented scan problem
• Each segment corresponds to a matrix row; its length is the number of non-zero values in that row
• Each element is the product of a non-zero matrix value and its corresponding component from the dense vector
Sparse Matrix
• Consider data in CSR (Compressed Sparse Row) format:
• Row: (0 3 5 6 9 14)
• Col: (5 4 3 1 2 3 4 6 3 1 2 1 5 4)
• Index vec from Col to get the vector values for the matrix values, and multiply into the matrix values:
• Mat*vec: (x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14)
Run a segmented scan – the last value in each segment is the dot product of a sparse matrix row and the dense vector. The Row array gives the segments:

(x1 x2 x3) (x4 x5) (x6) (x7 x8 x9) (x10 x11 x12 x13 x14)

scans to

x1, x1+x2, x1+x2+x3
x4, x4+x5
x6
x7, x7+x8, x7+x8+x9
x10, x10+x11, x10+x11+x12, x10+x11+x12+x13, x10+x11+x12+x13+x14
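A sketch of how the segment heads fall out of the CSR Row array (helper name is mine): flagging the first product of each row makes the segmented scan above produce each row's dot product as the last value of its segment.

// Derive segment-head flags over the nz products from the Row array.
void RowsToHeadFlags(const int* rows, int numRows, int nz, bool* head) {
    for(int i = 0; i < nz; ++i) head[i] = false;
    for(int r = 0; r < numRows; ++r)
        if(rows[r] < nz)
            head[rows[r]] = true;  // first non-zero of each non-empty row
}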
Sparse Matrix Challenges
• Sparse matrix * vector should be bandwidth-limited – each component requires only a multiply-and-add
• To achieve peak bandwidth, we need to issue a coalesced read every 16 cycles
• Parallel scan has poor work efficiency
• Matrix data is not in a convenient order for processing multiple values per thread
Fermi SM Programming Environment
• Shared memory is the primary mechanism for inter-thread communication
• Intra-warp threads are implicitly synchronized
• Inter-warp communication requires a __syncthreads() call
• Incurs a pipeline flush
• Inter-block communication requires global memory access and __threadfence or a new kernel launch
Parallel hierarchy
• Prioritize computation:
• Thread sequential (90%)
• Fast sequential algorithms
• Runs at high occupancy
• Compute information-dense values and store in shared memory
• Intra-warp communication (8%)
• Slower parallel algorithms
• Fast shared mem communication with a volatile pointer
• Runs at high occupancy – no pipeline issues
Parallel hierarchy (2)
• Prioritize computation (2):
• Inter-warp communication (1.9%)
• Slower parallel algorithms
• Sync between shared mem accesses requires a pipeline flush
• May include divergent branches (such as in multiscan)
• Inter-block communication (0.1%)
• If all blocks are running concurrently, __threadfence can sync; otherwise a new kernel launch is required
• Must share data through global memory
A simple model
• Thread sequential work is 'vertical'
• Load multiple values per thread and process from top to bottom
• Thread parallel work is 'horizontal'
• Combine information-dense values from the vertical stage from left to right with scan
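A minimal CUDA sketch of this model under Fermi-era warp-synchronous assumptions (modern CUDA would use __shfl_up_sync): each thread reduces its own values top to bottom, then an intra-warp scan through a volatile shared pointer combines the per-thread totals left to right, mirroring the HLSL loop shown earlier. All names are illustrative.

#define WARP_SIZE 32           // NV warp width
#define VALUES_PER_THREAD 4

// s points at this warp's WARP_SIZE-entry window of shared memory.
__device__ float VerticalThenHorizontal(volatile float* s, unsigned lane,
                                        const float* vals) {
    // Vertical: fast sequential reduction inside the thread.
    float total = 0.0f;
    #pragma unroll
    for(int i = 0; i < VALUES_PER_THREAD; ++i) total += vals[i];

    // Horizontal: intra-warp inclusive scan of the per-thread totals.
    s[lane] = total;
    #pragma unroll
    for(unsigned offset = 1; offset < WARP_SIZE; offset *= 2) {
        unsigned left = (WARP_SIZE - 1) & (lane - offset); // wrapped read
        float y = s[left];
        if(lane >= offset) total += y;
        s[lane] = total;
    }
    return total;   // inclusive scan value for this lane
}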
Sparse Matrix on CUDA
• www.github.com/seanbaxter/sparse/
• My open source SpMxV library
• Performance 1.6-4.3x faster than CUSPARSE
• Full pre-conditioned CG solver in a few weeks
• Super fast radix sort included in next update
• Hits 197 GB/s effective throughput on a GTX 570 (141 GB/s peak)
Sparse Matrix on CUDA
• Put the vector in the texture cache – cache misses are the main cause of performance loss
pwtk.mtx (wind tunnel)
• Height = 217,918
• nz = 11,634,424
• nz / h = 53.389
• Max throughput = 196 GB/s
• Matrix is dense and well banded
pdb1HYS.mtx (Protein)
• height = 36,417
• nz = 4,344,765
• nz / h = 119.306
• Max throughput = 197 GB/s
• Row density brings high throughput
scircuit.mtx (Circuit)
• height = 170,998
• nz = 958,936
• nz / h = 5.608
• Max throughput = 105 GB/s
• High matrix bandwidth and low density cause texture cache misses, impairing performance
Reformat the Matrix
• Uses a special matrix encoding to accelerate scans by baking offsets and flags into the unused top bits of the col indices
• SpMxV is performed thousands of times to solve a CG problem – reformatting is slow, but is only done once, and doubles multiplication throughput
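The exact encoding is internal to the library; the following only illustrates the idea of baking flags into the unused high bits of a column index (the bit positions and names are my assumptions, not the library's layout):

// Illustrative packing of scan flags into a column index.
#define FIRST_FLAG 0x80000000u
#define LAST_FLAG  0x40000000u
#define COL_MASK   0x3fffffffu

unsigned PackColumn(unsigned col, bool first, bool last) {
    return (col & COL_MASK) | (first ? FIRST_FLAG : 0u)
                            | (last  ? LAST_FLAG  : 0u);
}

The kernel recovers the real index with col & COL_MASK and tests the flags for free while it already has the value in a register.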
Strided Order vs Thread Order
• Warps making multiple coalesced reads receive data in strided order
• WARP_SIZE = 8, VALUES_PER_THREAD = 4
• Threads don't hold sequential values!
Each row below is one coalesced read of 8 values; thread tid receives column tid:

a0 a1 a2 a3 a4 a5 a6 b0
b1 b2 c0 c1 d0 d1 d2 d3
d4 d5 e0 e1 e2 e3 e4 f0
f1 f2 g0 g1 g2 g3 g4 g5
Transposed Format
Transpose each group's data so that coalesced reads put values in thread order:

a0 a4 b1 d0 d4 e2 f1 g2
a1 a5 b2 d1 d5 e3 f2 g3
a2 a6 c0 d2 e0 e4 g0 g4
a3 b0 c1 d3 e1 f0 g1 g5

With data in thread order we can perform a sequential scan within threads, then a parallel scan between them
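A sketch of the resulting load pattern, using the slide's WARP_SIZE = 8 and VALUES_PER_THREAD = 4 (names are mine): read r is coalesced across lanes, and after the loop each thread holds a sequential run of the original values.

#define WARP_SIZE 8
#define VALUES_PER_THREAD 4

// tile holds one warp's 32 values in the transposed layout shown above.
__device__ void LoadThreadOrder(const float* tile, unsigned lane,
                                float vals[VALUES_PER_THREAD]) {
    #pragma unroll
    for(int r = 0; r < VALUES_PER_THREAD; ++r)
        // Consecutive lanes read consecutive addresses: coalesced.
        vals[r] = tile[r * WARP_SIZE + lane];
    // vals[] now holds original elements VALUES_PER_THREAD * lane ..
    // VALUES_PER_THREAD * lane + 3, i.e. a sequential thread-order run.
}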
Locating scan buckets

tid 0: a0 a1 a2 a3
tid 1: a4 a5 a6 b0
tid 2: b1 b2 c0 c1
tid 3: d0 d1 d2 d3
tid 4: d4 d5 e0 e1
tid 5: e2 e3 e4 f0
tid 6: f1 f2 g0 g1
tid 7: g2 g3 g4 g5

• Locate the start and end of each bucket (matrix row) within each thread
• A left underline (on the original slide) marks the first value of a bucket within a thread; a right underline marks the last
• Summing within threads with sequential scan is fast, provided each thread doesn't need to calculate matrix geometry
Encoding scan buckets
• Encode first-value and last-value bits into the column indices
• scanOffset is the position in shared memory at which to store each thread's first "last" value – these are dot-product fragments

tid 0: TF FF FF FT   scanOffset = 0
tid 1: TF FF FT TT   scanOffset = 1
tid 2: TF FT TF TT   scanOffset = 3
tid 3: TF FF FF FT   scanOffset = 5
tid 4: TF FT TF FT   scanOffset = 6
tid 5: TF FF FT TT   scanOffset = 8
tid 6: TF FT TF FT   scanOffset = 10
tid 7: TF FF FF FT   scanOffset = 12
Sequential Scan Results

tid 0: a0, a0+a1, a0+a1+a2, a0+a1+a2+a3    s[0] = a0+a1+a2+a3
tid 1: a4, a4+a5, a4+a5+a6, b0             s[1] = a4+a5+a6, s[2] = b0
tid 2: b1, b1+b2, c0, c0+c1                s[3] = b1+b2, s[4] = c0+c1
tid 3: d0, d0+d1, d0+d1+d2, d0+d1+d2+d3    s[5] = d0+d1+d2+d3
tid 4: d4, d4+d5, e0, e0+e1                s[6] = d4+d5, s[7] = e0+e1
tid 5: e2, e2+e3, e2+e3+e4, f0             s[8] = e2+e3+e4, s[9] = f0
tid 6: f1, f1+f2, g0, g0+g1                s[10] = f1+f2, s[11] = g0+g1
tid 7: g2, g2+g3, g2+g3+g4, g2+g3+g4+g5    s[12] = g2+g3+g4+g5

• Stream in data and compute products
• Use sequential segmented scan (i.e. just add the current value to the previous total)
• Write to shared memory at scanOffset, and increment scanOffset, if the LAST flag is set
• Clear the preceding total before adding if the FIRST flag is set
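In code, the per-thread pass might look like the following sketch, reusing the illustrative flag masks from the packing sketch earlier (products[] holds this thread's four matrix-value * vector-value products; all names are mine):

#define VALUES_PER_THREAD 4
#define FIRST_FLAG 0x80000000u
#define LAST_FLAG  0x40000000u

// Sequential segmented scan within one thread; s is shared memory.
__device__ void ThreadSequentialScan(const unsigned* flags,
                                     const float* products,
                                     volatile float* s,
                                     unsigned scanOffset) {
    float x = 0.0f;
    #pragma unroll
    for(int i = 0; i < VALUES_PER_THREAD; ++i) {
        if(flags[i] & FIRST_FLAG) x = 0.0f;  // new row: clear previous total
        x += products[i];                    // running dot-product fragment
        if(flags[i] & LAST_FLAG)             // row's last value in thread:
            s[scanOffset++] = x;             // emit the fragment
    }
}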
Parallel Scan
• A parallel segmented scan merges the partial dot products
• There are at most 2 * WARP_SIZE partial dot products, so each thread handles 2 elements in the parallel scan: tid and WARP_SIZE + tid
• Bake delta offsets into the unused bits of two of the column indices (recall the bucket encoding above)
Parallel Scan
For convenience, let:

A0 = a0+a1+a2+a3    A1 = a4+a5+a6
B0 = b0             B1 = b1+b2
C0 = c0+c1
D0 = d0+d1+d2+d3    D1 = d4+d5
E0 = e0+e1          E1 = e2+e3+e4
F0 = f0             F1 = f1+f2
G0 = g0+g1          G1 = g2+g3+g4+g5

The 2 * WARP_SIZE sharedArray is:
s = [A0 A1 B0 B1 C0 D0 D1 E0 E1 F0 F1 G0 G1 XX XX XX]

tid 0: A0 E1   deltaX = 0, deltaY = 1
tid 1: A1 F0   deltaX = 1, deltaY = 0
tid 2: B0 F1   deltaX = 0, deltaY = 1
tid 3: B1 G0   deltaX = 1, deltaY = 0
tid 4: C0 G1   deltaX = 0, deltaY = 1
tid 5: D0 XX   deltaX = 0, deltaY = 0
tid 6: D1 XX   deltaX = 1, deltaY = 0
tid 7: E0 XX   deltaX = 0, deltaY = 0
Parallel Scan
• After the parallel scan, the shared array holds:

s[0] = A0         s[8]  = E0+E1
s[1] = A0+A1      s[9]  = F0
s[2] = B0         s[10] = F0+F1
s[3] = B0+B1      s[11] = G0
s[4] = C0         s[12] = G0+G1
s[5] = D0         s[13] = XX
s[6] = D0+D1      s[14] = XX
s[7] = E0         s[15] = XX

• The completed dot products are at indices 1, 3, 4, 6, 8, 10, and 12
• Bake these indices into the unused high bits of the column indices
• Each thread writes up to 1 value to global memory
The final pass
• Blocks are a constant size
• The last row in each block may spill over into the next block, causing two or more partial dot products to be written to global memory
• These are summed by a simple final pass
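The talk doesn't show the final pass itself; one plausible minimal sketch simply accumulates the spilled partials into the output vector (the names and the atomicAdd approach are my assumptions, not the library's actual mechanism):

// partialSums[i] is a spilled partial dot product for row partialRows[i];
// y already holds the non-spilled results.
__global__ void FinalPass(const float* partialSums, const int* partialRows,
                          float* y, int numPartials) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < numPartials)
        atomicAdd(&y[partialRows[i]], partialSums[i]);  // fold into result
}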
Performance Analysis
• All global memory loads are coalesced
• Most of the sum is computed with the efficient sequential segmented scan, as opposed to the inefficient parallel scan
• There is no branching in the kernel when computing the sum
• Segmented scan flags and offsets are baked into the col indices to accelerate the inner loops
• Low register and shared mem usage delivers high SM occupancy
Performance Analysis
• The exceptionally high memory bandwidth of GPUs makes them the obvious choice for iterative algorithms like CG
• Switched fabric in clusters results in high latency, slowing progress and reducing the effectiveness of parallelism
• With a GTX 580, expect 12.5 billion double-precision dot-product components per second
• A 50-million-element DP matrix can be multiplied 250 times per second with a single card
Optimization Wrap-up
• GPUs are not coarse-grained parallel systems
• Prioritize for coalesced r/w
• Favor sequential operations
• Use intra-warp reduction
• Maintain high SM occupancy to avoid pipeline stalls by reducing register usage and warp-divergent branches
• Keep optimizing until 1/16th of instructions are global memory ops