Fast matrix multiplication with CUDA
Overview
• Platform
  • GEFORCE 8800GT, 512MB
  • Core: G92, Shader frequency: 1.5 GHz, Mem frequency: 900 MHz
• Performance
  • Tuned for 4k x 4k matrix, 192 GFlops
• Revisiting the Tiled version
• Using large tiles
  • Base algorithm
  • Optimized algorithm
• Tools and tips
University of Central Florida
The Tiled version
• Tile Size: 16 x 16
• 256 threads / block
• 14 regs, 2076 bytes smem / block
• Occupancy: 2/3
[Figure: threads T0 … T255 laid out as a 16 x 16 tile, 16 threads per row]
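The 2/3 occupancy follows from the register budget. A minimal sketch of the arithmetic, assuming the compute capability 1.1 limits of the G92 (8192 registers, 16384 bytes of shared memory, 768 resident threads and at most 8 blocks per multiprocessor). These limits and the helper names `blocksPerSM` and `occupancy` are not on the slide, and allocation granularity is ignored:

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits for compute capability 1.1 (G92).
// These values are not stated on the slides.
constexpr int kRegsPerSM = 8192;
constexpr int kSmemPerSM = 16384;      // bytes
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM = 8;

// Number of blocks that fit on one multiprocessor (hypothetical helper,
// ignores register and shared-memory allocation granularity).
int blocksPerSM(int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && smemPerBlock > 0);
    int byRegs = kRegsPerSM / (threadsPerBlock * regsPerThread);
    int bySmem = kSmemPerSM / smemPerBlock;
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    return std::min({byRegs, bySmem, byThreads, kMaxBlocksPerSM});
}

// Occupancy = resident threads / maximum resident threads.
double occupancy(int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    int blocks = blocksPerSM(threadsPerBlock, regsPerThread, smemPerBlock);
    return static_cast<double>(blocks * threadsPerBlock) / kMaxThreadsPerSM;
}
```

With the slide's numbers, `blocksPerSM(256, 14, 2076)` is limited to 2 by registers (a third block would need 3 x 3584 > 8192 registers), so 512 of the 768 thread slots are filled.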
The Tiled version – Memory access
• Every half warp accesses contiguous memory locations.
• Memory accesses are fully coalesced.
[Figure: global memory access pattern of threads T0 … T255, one half warp per tile row]
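One way to check the coalescing claim is to compute the addresses a half warp touches. The sketch below uses a simplified form of the compute capability 1.0/1.1 rule for 32-bit loads (thread k reads the k-th word of a 64-byte aligned segment). The function names, the (tx, ty) thread mapping and the aligned base address are illustrative assumptions:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified compute capability 1.0/1.1 rule for 32-bit loads: thread k of a
// half warp must read the k-th word of a 64-byte aligned segment.
bool isCoalesced(const std::array<std::uint64_t, 16>& byteAddr) {
    if (byteAddr[0] % 64 != 0) return false;
    for (int k = 0; k < 16; ++k)
        if (byteAddr[k] != byteAddr[0] + 4u * k) return false;
    return true;
}

// Byte addresses read by the half warp with row index ty when thread (tx, ty)
// loads element [row0 + ty][col0 + tx] of a row-major float matrix of width w.
std::array<std::uint64_t, 16> tileRowAddresses(std::uint64_t base, int w,
                                               int row0, int col0, int ty) {
    assert(w > 0 && row0 >= 0 && col0 >= 0 && ty >= 0);
    std::array<std::uint64_t, 16> addr{};
    for (int tx = 0; tx < 16; ++tx)
        addr[tx] = base + 4ull * (static_cast<std::uint64_t>(row0 + ty) * w + col0 + tx);
    return addr;
}
```

For a 16 x 16 tile whose origin is a multiple of 16 in both dimensions and a matrix width that is a multiple of 16, every tile row passes the check, whereas a half warp walking down a column (stride of one matrix row) does not.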
The Tiled version – Bank conflicts
• No bank conflicts.
[Figure: shared memory access of threads T0 … T255; one operand is spread over the 16 banks, the other is read as a broadcast]
The Tiled version - Bottlenecks
• If memory bandwidth and ALUs are both fully used:
  • 14.4G float/s, 168G MAD/s
  • 11.67 MAD/float
• With 16 x 16 tiles (W = matrix width):
  • Total W^3/8 loads, 8 MAD/float
  • Too many loads!
  • Solution: large tiles.
• “Psub += As[ty][k] * Bs[k][tx]”
  • Extra instructions (mov, add) between the MADs:
      mov.b32 $r12, s[$ofs4+0x0000]
      mov.b32 $r7, s[$ofs4+0x0040]
      mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
      add.b32 $ofs4, $ofs3, 0x0000019c
      mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
      mov.b32 $r12, s[$ofs4+0x0000]
      mov.b32 $r11, s[$ofs4+0x0040]
      mad.rn.f32 $r7, s[$ofs1+0x0014], $r7, $r13
      add.b32 $ofs4, $ofs3, 0x0000021c
      mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7
  • 77 GFlops (4k x 4k)
[Figure: a tile of width TWidth and height THeight; elements of one input are reused THeight times, elements of the other TWidth times]
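The balance figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 256-bit double-data-rate memory bus and the 112 stream processors are the published 8800 GT specifications rather than numbers from the slide, and `madsPerLoad` is one way to generalize the reuse argument to a TWidth x THeight tile:

```cpp
#include <cassert>

// Published GeForce 8800 GT figures (assumptions, not on the slide):
// 256-bit memory bus, double data rate, 112 stream processors.
constexpr double kBusBytes = 256 / 8;      // bytes per transfer
constexpr double kMemClockHz = 900e6;      // from the slide
constexpr double kShaderClockHz = 1.5e9;   // from the slide
constexpr int kStreamProcessors = 112;

// Peak floats per second the memory system can deliver.
double peakFloatsPerSec() {
    double bytesPerSec = kBusBytes * kMemClockHz * 2.0;  // DDR: 2 transfers per clock
    return bytesPerSec / sizeof(float);
}

// Peak multiply-adds per second, one MAD per stream processor per clock.
double peakMadsPerSec() {
    return kStreamProcessors * kShaderClockHz;
}

// MADs performed per float loaded for a tileWidth x tileHeight tile: each step
// of the inner product loads tileHeight + tileWidth floats and feeds
// tileWidth * tileHeight MADs.
double madsPerLoad(int tileWidth, int tileHeight) {
    assert(tileWidth > 0 && tileHeight > 0);
    return static_cast<double>(tileWidth) * tileHeight / (tileWidth + tileHeight);
}
```

The hardware needs about 11.67 MADs per loaded float to keep the ALUs busy, while `madsPerLoad(16, 16)` gives only 8, so the 16 x 16 kernel is limited by memory traffic.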
Using Large Tiles
• Each thread:
  • 17 loads / iteration
  • W/16 iterations
• Total W^3/15 loads, 15 MAD/load
[Figure: tile dimensions 16 x 256 and 16 x 16; 256 threads, 16 Psubs/thread; labels "Stored in registers" and "Stored in shared memory"]
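The load count can be checked by multiplying it out. A small sketch, assuming a square W x W product with W a multiple of 16; the function names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Global-memory loads issued by one thread for a W x W product:
// 17 loads per iteration (1 element of A + 16 elements of B), W/16 iterations.
std::int64_t loadsPerThread(std::int64_t w) {
    assert(w > 0 && w % 16 == 0);
    return 17 * (w / 16);
}

// Each thread produces 16 elements of C, so W*W/16 threads are needed.
std::int64_t totalLoads(std::int64_t w) {
    return loadsPerThread(w) * (w * w / 16);
}

// Total MADs of the product (W^3) divided by total loads.
double madsPerLoadLargeTile(std::int64_t w) {
    return static_cast<double>(w * w * w) / static_cast<double>(totalLoads(w));
}
```

The exact ratio is 256/17, roughly 15.06, which the slide rounds to 15 MAD/load. That is above the 11.67 MAD/float the hardware needs.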
Using Large Tiles - Algorithm
• For each sub tile in A & B:
  • Read the sub tile from A to shared memory. 1 number / thread.
  • For each of the 16 numbers in B:
    • Read one number from B into a register.
    • Perform one MAD for each Psub.
• To remove extra instructions for offset calculation, we want the sub tile of A to be stored in column-major format in shared memory.
• But …
[Figure: matrices A, B and C with threads T0 T1 T2 … T255]
Using Large Tiles - Algorithm
• Solution 1:
  • Transpose A to column-major format first.
  [Figure: A, Shared A and C; banks B0 … B15, threads T0 … T15 and T0 T1 T2 … T255]
• Solution 2:
  • Read A in row-major format, write to shared memory in column-major format.
  • Bank conflicts when writing to shared memory!
  [Figure: A, Shared A and C; banks B0 … B15, threads T0 … T15 and T0 T1 T2 … T255]
Using Large Tiles - Algorithm
• Solution 3:
  • Pad Shared A with one empty row.
  • No bank conflicts. No need to transpose A.
  • 164 GFlops (4k x 4k).
[Figure: A, padded Shared A and C; the bank sequence B0 … B15 shifts by one from row to row; threads T0 … T15 and T0 T1 T2 … T255]
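Why the padding removes the conflicts can be seen from the bank index of each write. The sketch below assumes 16 banks of 32-bit words, a half warp of threads tx = 0 … 15 sharing one ty, and a transposed store to word tx * stride + ty. Modelling the padding as a stride of 17 words instead of 16 is an assumption about how the padded array is declared:

```cpp
#include <array>
#include <cassert>

constexpr int kBanks = 16;      // shared-memory banks on G92
constexpr int kHalfWarp = 16;   // threads serviced together

// Worst-case number of threads of one half warp that hit the same bank when
// thread (tx, ty) writes word tx * stride + ty, i.e. ashare[tx][ty] with
// `stride` words per row. A result of 1 means conflict-free.
int maxBankConflict(int stride, int ty) {
    assert(stride > 0 && ty >= 0);
    std::array<int, kBanks> hits{};   // zero-initialized counters
    int worst = 0;
    for (int tx = 0; tx < kHalfWarp; ++tx) {
        int bank = (tx * stride + ty) % kBanks;
        hits[bank] += 1;
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

With a stride of 16 every thread of the half warp lands in bank ty, a 16-way conflict (Solution 2). With a stride of 17 the bank becomes (tx + ty) mod 16, which is different for every tx (Solution 3).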
Using Large Tiles - code
    for (int i = 0; i < MATRIX_WIDTH/16; i++) {
        // each thread stores one element of the A sub tile, transposed
        ashare[tx][ty] = A[0];
        __syncthreads();
    #pragma unroll  // 150 GFlops (4k x 4k) without unroll
        for (int k = 0; k < 16; k++) {
            b = B[k * MATRIX_WIDTH];        // one number of B per thread
            comp16(b, &ashare[k][0], c);    // one MAD for each of the 16 Psubs
        }
        A += 16;
        B += 16 * MATRIX_WIDTH;
        __syncthreads();
    }
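The slides call `comp16` without showing it. A hypothetical definition, together with a CPU emulation of what a single thread computes, might look like the sketch below. Here `largeTileColumn`, the row-major layout and the padded `ashare[16][17]` declaration are assumptions made for illustration, not the author's kernel:

```cpp
#include <cassert>
#include <vector>

// Hypothetical definition of comp16: the slides call it but do not show it.
// Accumulates b * a[i] into the 16 partial sums (Psubs) of one thread.
inline void comp16(float b, const float* a, float* c) {
    for (int i = 0; i < 16; ++i) c[i] += b * a[i];
}

// CPU emulation of one "thread": computes the 16 elements
// C[row0 .. row0+15][col] of C = A * B for square row-major matrices of
// width w (w and row0 multiples of 16).
void largeTileColumn(const std::vector<float>& A, const std::vector<float>& B,
                     int w, int row0, int col, float* c) {
    assert(w % 16 == 0 && row0 % 16 == 0);
    float ashare[16][17];                   // column-major A sub tile, padded
    for (int i = 0; i < 16; ++i) c[i] = 0.0f;
    for (int t = 0; t < w / 16; ++t) {      // W/16 iterations
        // In the kernel each of the 256 threads stores one element;
        // here a double loop fills the whole sub tile.
        for (int ty = 0; ty < 16; ++ty)
            for (int tx = 0; tx < 16; ++tx)
                ashare[tx][ty] = A[(row0 + ty) * w + t * 16 + tx];
        for (int k = 0; k < 16; ++k) {
            float b = B[(t * 16 + k) * w + col];  // one number of B
            comp16(b, &ashare[k][0], c);          // one MAD per Psub
        }
    }
}
```

Because the sub tile of A is stored transposed, `&ashare[k][0]` points at 16 consecutive words, so the unrolled MADs can use fixed shared-memory offsets with no extra address arithmetic. In device code the same `comp16` body would carry a `__device__` qualifier.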
Using Large Tiles - optimized
    do {
        ashare[tx][ty] = a;
        __syncthreads();
        a = A[0];
        // bb holds the four B values used now; b receives the next four
        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        b[0] = B[4 * MATRIX_WIDTH]; b[1] = B[5 * MATRIX_WIDTH];
        b[2] = B[6 * MATRIX_WIDTH]; b[3] = B[7 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i ++) comp16(bb[i], &ashare[i][0], c);
        …
        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        b[0] = B[12 * MATRIX_WIDTH]; b[1] = B[13 * MATRIX_WIDTH];
        b[2] = B[14 * MATRIX_WIDTH]; b[3] = B[15 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i ++) comp16(bb[i], &ashare[i + 8][0], c);
        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        A += 16;
        B += 16 * MATRIX_WIDTH;
        // first four B values of the next sub tile
        b[0] = B[0 * MATRIX_WIDTH]; b[1] = B[1 * MATRIX_WIDTH];
        b[2] = B[2 * MATRIX_WIDTH]; b[3] = B[3 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i ++) comp16(bb[i], &ashare[i + 12][0], c);
        __syncthreads();
    } while( A < Alast );
    ... // last iteration
Using Large Tiles - Performance
• Execution time is measured as the computation time on GPU.
Using Large Tiles – Performance 2
• Gflops (comp): excluding CPU-GPU data transfer time.
• Gflops (total): including CPU-GPU data transfer time.
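The two GFlops figures differ only in which time goes into the denominator. A sketch of the calculation, assuming the usual convention of 2 flops per MAD and W^3 MADs for a W x W product (the slides do not spell the formula out, and the function names are illustrative):

```cpp
#include <cassert>

// Floating-point operations of a W x W matrix product: W^3 MADs, 2 flops each
// (the usual convention; assumed here, not stated on the slides).
double matmulFlops(double w) { return 2.0 * w * w * w; }

// Gflops (comp): pass only the kernel time.
// Gflops (total): also pass the CPU-GPU transfer time.
double gflops(double w, double kernelSeconds, double transferSeconds = 0.0) {
    assert(kernelSeconds > 0.0 && transferSeconds >= 0.0);
    return matmulFlops(w) / (kernelSeconds + transferSeconds) / 1e9;
}
```

Under this convention, 192 GFlops on a 4k x 4k product corresponds to roughly 0.72 s of kernel time.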
Tools - CUDA GPU Occupancy Calculator
Tools - decuda
• Developed by Wladimir J. van der Laan
  • a PhD candidate at the Institute of Mathematics and Computing Science of the University of Groningen.
• http://www.cs.rug.nl/~wladimir/decuda/
Tools – CUDA Visual Profiler
• http://forums.nvidia.com/index.php?showtopic=57443
• GPU Time
• CPU Time
• Occupancy
• Profiler counters:
  • gld_incoherent : Number of non-coalesced global memory loads
  • gld_coherent : Number of coalesced global memory loads
  • gst_incoherent : Number of non-coalesced global memory stores
  • gst_coherent : Number of coalesced global memory stores
  • local_load : Number of local memory loads
  • local_store : Number of local memory stores
  • branch : Number of branch events (instruction and/or sync stack)
  • divergent_branch : Number of divergent branches within a warp
  • instructions : Number of dynamic instructions (in fetch)
  • warp_serialize : Number of threads in a warp that serialize based on address (GRF or constant)
  • cta_launched : Number of CTAs launched on the PM TPC
Tips
• Get usage of reg, smem, cmem, and lmem:
  • nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../../common/inc -DUNIX -O3 --ptxas-options=-v
• Compile with -maxrregcount to cap the number of registers per thread.