GPU Computing: Pervasive Massively Multithreaded Processors



Presentation Transcript


  1. GPU Computing: Pervasive Massively Multithreaded Processors Michael C. Shebanow, Sr. Architecture Manager, GPUs

  2. Agenda • GPU Computing • Tesla Products • CUDA • SM: Thread Multiprocessor • Application Performance Model

  3. GPU Computing • GPU = graphics processing unit • NVIDIA’s latest products = 88xx/98xx series • Accelerates Graphics • GPU Computing • Using GPUs for general purpose computing other than graphics

  4. GPU Computing is Pervasive • Huge #s of deployed parallel computing engines • NVIDIA has shipped >70M CUDA-capable GPUs to date • NVIDIA is currently shipping more than 1M CUDA-capable GPUs per week • Wide range of products, from laptops to servers • Coming soon: cell phones, PDAs, …

  5. GPU Computing Key Concepts • Hardware (HW) thread management • HW thread launch and monitoring • HW thread switching • Tens of thousands of lightweight, concurrent threads • Real threads: PC, private registers, … • SIMT execution model • Multiple memory scopes • Per-thread private memory • Per-thread-block shared memory • Global memory • Using threads to hide memory latency • Coarse grain thread synchronization

  6. SIMT Multithreaded Execution • SIMT: Single-Instruction Multi-Thread; executes one instruction across many independent threads • Warp: a set of 32 parallel threads that execute a SIMT instruction • SIMT provides easy single-thread scalar programming with SIMD efficiency • Hardware implements zero-overhead warp and thread scheduling • SIMT threads can execute independently • SIMT warp diverges and converges when threads branch independently • Best efficiency and performance when threads of a warp execute together [Figure: SIMT instruction scheduler interleaving instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96]
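  As an illustration of SIMT divergence, here is a minimal sketch (hypothetical kernel, not from the slides): threads in the same warp take different branches depending on their lane, so the hardware executes the two paths one after the other and reconverges afterwards; a branch that is uniform across the warp would avoid this cost.

  // Hypothetical kernel illustrating SIMT branch divergence (illustrative only).
  __global__ void divergentScale(float* data)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      int lane = threadIdx.x % 32;       // position of this thread within its warp
      if (lane % 2 == 0)
          data[i] = data[i] * 2.0f;      // half of each warp executes this path...
      else
          data[i] = data[i] + 1.0f;      // ...and the other half executes this path, serialized
      // A condition that is uniform across the warp (e.g. based on blockIdx.x)
      // would let all 32 threads take the same path and avoid the serialization.
  }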

  7. Multiple Memory Scopes • Per-thread private memory • Each thread has its own local memory • Stacks, other private data • Per-thread-block shared memory • Small memory close to the processor, low latency • Allocated per thread block • Main memory • GPU frame buffer • Can be accessed by any thread in any thread block [Figure: threads with per-thread local memory, thread blocks with per-block shared memory, and sequential kernels (Kernel 0, Kernel 1) sharing per-device global memory]
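  A minimal sketch of the three scopes in CUDA C (the kernel and variable names are assumptions, and a block size of 256 is assumed): temp is per-thread private storage, tile lives in per-block shared memory, and in/out point to per-device global memory.

  // Hypothetical kernel touching all three memory scopes (block size assumed to be 256).
  __global__ void scopesDemo(const float* in, float* out)   // in, out: per-device global memory
  {
      __shared__ float tile[256];                   // per-thread-block shared memory
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      float temp = 2.0f * in[i];                    // per-thread private value (register/local)
      tile[threadIdx.x] = temp;                     // stage it where the whole block can see it
      __syncthreads();                              // wait until every thread has written its slot
      out[i] = temp + tile[(threadIdx.x + 1) % 256];   // combine private and shared data
  }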

  8. Hiding LOAD Latency • Principle: Little's Law: N = λL • N = "number in flight" • λ = arrival rate • L = memory latency • Arrival rate λ is the product of: • Desired execution rate (IPC) • Density of LOAD instructions (%) • N = # of threads needed to cover latency L
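  A worked example with illustrative numbers (assumptions, not from the slide): if the target execution rate is 1 instruction per clock, 25% of instructions are LOADs, and memory latency is L = 400 clocks, then λ = 1.0 × 0.25 = 0.25 loads/clock and N = λL = 0.25 × 400 = 100 loads must be in flight. With one outstanding LOAD per thread, roughly 100 threads are needed to cover the latency.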

  9. Hiding LOAD Latency w/ Fewer Threads • Use batching • Group independent LOADs together • Modified law: N = λL / B • B = batch size
  // batch size 3 example
  float *d_A, *d_B, *d_C;
  float a, b, c, result;
  a = *d_A;
  b = *d_B;
  c = *d_C;
  result = a * b + c;
  • The values 'a', 'b', and 'c' are loaded independently before being used • The implication is that one thread can issue 3 loads before the first use ('a' in this case) causes a stall
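  Continuing the illustrative numbers from the previous example (assumptions, not slide data): with a batch of B = 3 independent LOADs per thread, the same 100 in-flight loads can be sustained by N = λL / B = 100 / 3 ≈ 34 threads.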

  10. Thread Synchronization • Barrier synchronization among threads of a block • Fast single-instruction barrier in Tesla GPUs • void __syncthreads(); • Synchronizes all threads in a thread block • Once all threads have reached this point, kernel execution resumes normally • Use before reading shared memory written by another thread in the same block • Global synchronization between dependent kernels • Waits for all thread blocks of the kernel grid to complete • Fast synchronization and kernel launch in Tesla GPUs
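  A minimal sketch of the barrier in use (hypothetical kernel, assuming a block size of 256): each thread stages one input in shared memory, and __syncthreads() guarantees all writes are visible before any thread reads data written by another thread in the same block.

  // Hypothetical block-wide sum reduction; __syncthreads() separates the phases.
  __global__ void blockSum(const float* in, float* blockSums)
  {
      __shared__ float partial[256];               // assumes blockDim.x == 256
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      partial[threadIdx.x] = in[i];                // each thread stages one input
      __syncthreads();                             // all writes visible before reading
      for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
          if (threadIdx.x < stride)
              partial[threadIdx.x] += partial[threadIdx.x + stride];
          __syncthreads();                         // barrier between reduction steps
      }
      if (threadIdx.x == 0)
          blockSums[blockIdx.x] = partial[0];      // one partial sum per thread block
  }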

  11. Tesla Series Products

  12. The Tesla 8-Series Processor: NVIDIA's 1st Generation CUDA Processor • 681 million transistors • 518 Gigaflops • 128 cores (SPs), 12288 threads max • 384-bit 800 MHz GDDR3 • 76 GB/sec peak

  13. The Tesla 10-Series Processor: NVIDIA's 2nd Generation CUDA Processor • 1.4 billion transistors • 1 Teraflop • 240 cores (SPs), 30720 threads max • 512-bit 800 MHz GDDR3 • 102 GB/sec peak

  14. Tesla C1060 Computing Processor

  15. Tesla S1070 1U System

  16. Example Speedups • 146X: Interactive visualization of volumetric white matter connectivity • 36X: Ionic placement for molecular dynamics simulation on GPU • 17X: Simulation in Matlab using .mex file CUDA function • 100X: Astrophysics N-body simulation • 18X: Transcoding HD video stream to H.264 • 149X: Financial simulation of LIBOR model with swaptions • 47X: Ultrasound medical imaging for cancer diagnostics • 20X: Highly optimized object oriented molecular dynamics • 24X: GLAME@lab, an M-script API for linear algebra operations on GPU • 30X: Cmatch exact string matching to find similar proteins and gene sequences

  17. Example: Fluid Simulation • CUDA port of: Jos Stam, "Stable Fluids", in SIGGRAPH 99 Conference Proceedings, Annual Conference Series, August 1999, 121-128.

  18. CUDA N-Body Simulation • 16K bodies at 44 FPS on a GeForce 8800 GTX GPU • 44 frames/s × 16K² interactions/frame ≈ 10B interactions/s • × 20 FLOPS/interaction ≈ 240 GFLOP/s • ≈ 50x a tuned CPU implementation on an Intel Core 2 Duo • Highly parallel, high arithmetic intensity

  19. N-Body Physics on CUDA • All-pairs gravitational N-body physics of 16,384 stars • 240 GFLOPS on NVIDIA GeForce 8800 – see GPU Gems 3

  20. CUDA

  21. CUDA Programming Model • Minimal extension of the C and C++ languages • Write a serial program that calls parallel kernels • Serial portions execute on the host CPU • A kernel executes as parallel threads on the GPU device • Kernels may be simple functions or full programs • Many threads execute each kernel [Figure: compilation flow: a C/CUDA application is compiled by NVCC into CPU code plus virtual PTX code; a PTX-to-target compiler then generates physical target code for G80, …, GTX]
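  For example (a typical invocation, not shown on the slide), a file such as vecAdd.cu containing both host and kernel code can be compiled with the NVCC driver as nvcc -o vecAdd vecAdd.cu, which performs the split into CPU code and PTX described above.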

  22. CUDA Thread Model • Thread • Computes result elements • threadIdx is the thread id number • Thread Block • Computes a block of result data • 1 to 512 threads per thread block • blockIdx is the block id number • Grid of Blocks • Computes many result blocks • 1 to many blocks per grid • Sequential Grids • Compute sequential problem steps [Figure: hierarchy of a thread, a thread block of threads t0 t1 t2 … tm, and a grid of blocks Block 0 through Block n]

  23. CUDA: Basics
  • Declaration specifiers to indicate where things live
  __global__ void KernelFunc(...);  // kernel callable from host
  __device__ void DeviceFunc(...);  // function callable on device
  __device__ int GlobalVar;         // variable in device memory
  __shared__ int SharedVar;         // in per-block shared memory
  • Extend function invocation syntax for parallel kernel launch
  KernelFunc<<<500, 128>>>(...);    // 500 blocks, 128 threads each
  • Special variables for thread identification in kernels
  dim3 threadIdx; dim3 blockIdx; dim3 blockDim;
  • Intrinsics that expose specific operations in kernel code
  __syncthreads();                  // barrier synchronization

  24. CUDA: Extended Features • Standard mathematical functions: sinf, powf, atanf, ceil, etc. • Built-in vector types: float4, int4, uint4, etc. for dimensions 1..4 • Atomics: atomicAdd(int *pmem, int value), etc.; add, sub, min, max, and, or, xor ... • Texture accesses in kernels:
  texture<float,2> my_texture;  // declare texture reference
  float4 texel = texfetch(my_texture, u, v);
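  A minimal sketch combining a built-in vector type with an atomic update (hypothetical histogram kernel; the names, bin count, and the assumption that input values lie in [0, 1) are mine, not the deck's): each thread classifies one float4 element and uses atomicAdd so that concurrent increments to the same bin are not lost.

  // Hypothetical histogram kernel: float4 vector loads plus atomicAdd on integer bins.
  // Assumes input component values lie in [0, 1) and numBins > 0.
  __global__ void histogram4(const float4* in, int* bins, int numBins)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      float4 v = in[i];                              // built-in vector type
      float avg = 0.25f * (v.x + v.y + v.z + v.w);   // collapse the four components
      int bin = min((int)(avg * numBins), numBins - 1);
      atomicAdd(&bins[bin], 1);                      // concurrent increments are not lost
  }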

  25. CUDA Memory Management
  // allocate host memory
  unsigned int numBytes = N * sizeof(float);
  float* h_A = (float*) malloc(numBytes);
  // allocate device memory
  float* d_A = 0;
  cudaMalloc((void**)&d_A, numBytes);
  // copy data from host to device
  cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
  // copy data from device back to host
  cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
  // free device memory
  cudaFree(d_A);

  26. Example: Vector Addition Kernel (Device Code)
  // Compute vector sum C = A+B
  // Each thread performs one pair-wise addition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      C[i] = A[i] + B[i];
  }
  int main()
  {
      // Run N/256 blocks of 256 threads each
      vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
  }

  27. Example: Vector Addition Kernel (Host Code)
  // Compute vector sum C = A+B
  // Each thread performs one pair-wise addition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      C[i] = A[i] + B[i];
  }
  int main()
  {
      // Run N/256 blocks of 256 threads each
      vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
  }

  28. Example: Host code for vecAdd
  // allocate and initialize host (CPU) memory
  float *h_A = …, *h_B = …;
  // allocate device (GPU) memory
  float *d_A, *d_B, *d_C;
  cudaMalloc( (void**) &d_A, N * sizeof(float));
  cudaMalloc( (void**) &d_B, N * sizeof(float));
  cudaMalloc( (void**) &d_C, N * sizeof(float));
  // copy host memory to device
  cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
  cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
  // execute the kernel on N/256 blocks of 256 threads each
  vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

  29. Example #2: Adding matrices with 2D grids
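  The code for this slide is not in the transcript; below is a minimal sketch of what a 2D-grid matrix addition might look like (all names and the launch configuration are assumptions, following the 1D vecAdd pattern above).

  // Hypothetical 2D-grid kernel: each thread adds one matrix element.
  __global__ void matAdd(const float* A, const float* B, float* C, int width, int height)
  {
      int col = threadIdx.x + blockDim.x * blockIdx.x;
      int row = threadIdx.y + blockDim.y * blockIdx.y;
      if (col < width && row < height) {
          int i = row * width + col;               // flatten (row, col) to a 1D index
          C[i] = A[i] + B[i];
      }
  }

  // Possible launch: 16x16-thread blocks, enough blocks to cover the matrix:
  //   dim3 block(16, 16);
  //   dim3 grid((width + 15) / 16, (height + 15) / 16);
  //   matAdd<<<grid, block>>>(d_A, d_B, d_C, width, height);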

  30. The SM: Thread Multiprocessor

  31. Tesla 10-Series Architecture • Scales parallel performance 2X beyond Tesla 8-series • 240 multithreaded thread processors at 1.5 GHz, 1 TFLOPS peak

  32. T10 Multithreaded Multiprocessor • Scalar register-based ISA • Multithreaded Instruction Unit • 1024 threads, hardware multithreaded • 32 SIMT warps of 32 threads • Independent thread execution • Hardware thread scheduling • 8 SP Thread Processors • IEEE 754 32-bit floating point • 32-bit and 64-bit integer • 16K 32-bit registers • 2 SFU Special Function Units • RCP, RSQRT, EXP2, LOG2, SIN, COS • 2 SFUs per SM yields ¼ instruction throughput • Accuracy ranges from 22.5 to 24.0 bits • 1 DP Double Precision Unit • IEEE 754 64-bit floating point • Fused multiply-add • Full-speed denormalized operands and results • 16KB Shared Memory • Concurrent threads share data • Low latency load/store

  33. SM Conceptual Block Diagram

  34. Application Performance

  35. Limiter Theory • The SM is a form of queuing system • Use "limiter theory" to predict SM performance • There are three types of limits on the performance of the SM: • Bandwidth resource limiters • Per-thread-block space limiters • Per-thread space limiters • The most constraining limiter is called the critical limiter • Critical limiter = min(all limiters)

  36. Bandwidth Limiters • Thread blocks arrive at some rate λTB • Threads are composed of some distribution of operations • Each arriving thread block of S threads contributes a distribution of operations to be performed • Per operation type, the offered load, or BW DEMAND, is the product λTB × S × δ of: • Thread block arrival rate λTB • # of threads S in a block • Operation density δ in each thread • BW CAPACITY is an upper bound on BW DEMAND
  // Compute vector sum C = A+B
  // Each thread performs one pair-wise addition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      C[i] = A[i] + B[i];
  }
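  As an illustration (assumed numbers, apart from the 102 GB/sec figure quoted on the Tesla 10-series slide): each vecAdd thread issues two 4-byte LOADs and one 4-byte STORE, i.e. 12 bytes of memory traffic per thread. With S = 256 threads per block, BW DEMAND = λTB × 256 × 12 bytes/s. Capping this at the 102 GB/sec memory BW CAPACITY limits λTB to roughly 33M thread blocks per second, or about 8.5G threads per second, no matter how much math throughput is available.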

  37. Space Limiters • SM also has space resources. Examples: • Finite limit on warp count • Finite limit on register file space • Finite limit on shared memory size • Space resources: • Allocated on thread block launch • Deallocated on thread block completion • Consumption computed using Little’s Law (N = λL) • Thread Latency (L) • Complex computation • Varies with memory behavior
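  A worked example using the T10 per-SM resources from slide 32 (the kernel's per-thread register count and per-block shared memory usage are assumptions): a kernel that uses 32 registers per thread and runs 256-thread blocks consumes 8K registers and, say, 4 KB of shared memory per block. Registers then allow 16K / 8K = 2 resident blocks, shared memory allows 16 KB / 4 KB = 4 blocks, and the warp limit allows 32 / 8 = 4 blocks; the register file is the critical space limiter, capping occupancy at 512 of the 1024 possible threads.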

  38. Limitations of Limiter Theory • Limiter theory assumes uniform workloads • Breaks down when "traffic jam" behavior occurs • Limiter theory is an acceptable 1st-order approximation
  // Compute vector sum C = A+B
  // Each thread performs one pair-wise addition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      C[i] = A[i] + B[i];
  }

  39. Implications of Limiter Theory • Kernel code has to pay careful attention to operation "mix" • Math-to-memory operation ratios, for example • Do not bottleneck on one function unit while the other units idle • Ideal: all units equally critical • Don't "freeway jam" kernel code • Making thread blocks so large that only a few execute on the SM at a time is a bad idea • "Bunching" operations of a similar type in one section of a kernel aggravates the problem • Ideal: lots of small thread blocks with a uniform distribution of operation densities • Focus on space resource consumption • Ideal: use only as many resources as necessary to "load the SM"

  40. Final Thoughts • Threads are free • A common mistake in GPU Computing kernels is to make threads do too much • Keep them short and sweet • Example: one thread per vector element • HW provides LOTS of them (10s of thousands) • HW launch => near zero overhead to create them • HW context switching => near zero overhead scheduling • Barriers are cheap • Single instruction • HW synchronization of thread blocks • Partition kernel code into producer-consumer stages • DON'T use spin locks! • Partition on results, not sources

  41. Questions? MShebanow@nvidia.com http://www.nvidia.com/CUDA http://www.nvidia.com/TESLA
