Introduction To GPUs • Bálint Joó • Jefferson Lab • Parallelism In Experimental Nuclear Physics Workshop • CNU, Jan 6, 2011
Introduction • General Purpose Graphics Processing Units (GPUs) offer a disruptive price/performance improvement in throughput-oriented computing • Primary GPU manufacturers: • Advanced Micro Devices • Radeon & FireStream • NVIDIA • Gaming cards: GeForce GTX series • HPC cards: Tesla series • Programming GPUs • AMD Stream SDK (now OpenCL, formerly Brook/Brook++) • NVIDIA CUDA • OpenCL (supports AMD, NVIDIA, CPUs, etc.) • This talk: focus on NVIDIA GPUs (mostly Fermi) + CUDA
CPU/GPU Configurations • Host CPU: 4-6 cores/socket typical; peak SP flops/socket: O(100) Gflops (e.g. Intel Nehalem 4-core @ 3 GHz: 96 SP Gflops) • Host memory: 48 GB common; B/W to host memory: O(10) GB/s (e.g. 3-channel DDR3-1333: 32 GB/s) • GPU device: O(100) cores typical; peak SP flops/device: O(1000) Gflops (e.g. Tesla C2050: 1.03 SP Tflops) • Device memory: up to 6 GB; B/W to device memory: O(100) GB/s (e.g. Tesla C2050: 144 GB/s) • Host-device link: PCIe Gen2 x16; B/W: 8 GB/s per direction, 16 GB/s bidirectional
Anatomy of a Fermi GPU • NVIDIA GPU consists of Streaming Multiprocessors (SMs) • SMs provide: • registers (32K 32-bit) • CUDA cores (32 per SM) – 1 SP mul-add per clock. • 64 KB Shared Memory (configured as memory/L1 cache) • Special Function units (for fast sin/cos/exp etc) • Hardware barrier within SM. • texture caches, thread dispatch logic etc.
Anatomy of a Fermi GPU • Example: NVIDIA Tesla C2050 • 14 SMs → 448 CUDA cores • CUDA cores @ 1.15 GHz → 515 G mul-adds/s → 1030 SP Gflops • 3 GB GDDR5 on-device memory • 144 GB/sec memory bandwidth
Programming GPUs with CUDA • CUDA provides facilities for programming the accelerator: • a thread execution model • a memory hierarchy • extensions to C for writing 'kernels' • A run-time API for: • querying device attributes (e.g. compute capability; see the sketch after this list) • memory management (allocation, movement) • launching kernels • managing 'task parallelism' (CUDA streams) • The CUDA Toolkit provides tools: • compiler, debugger, profiler • CUDA driver (kernel level) for making all this happen
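As an illustration, here is a minimal sketch of querying device attributes through the run-time API; the standalone program and the particular fields printed are our own choices, not from the talk:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int count = 0;
  cudaGetDeviceCount(&count);                 // how many CUDA devices are visible
  for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);      // fill in the attribute structure
    printf("Device %d: %s, compute capability %d.%d, %d SMs, %zu bytes global memory\n",
           dev, prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.totalGlobalMem);
  }
  return 0;
}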
The CUDA Thread Model • user 'kernels' execute in a 'grid' of 'blocks' of 'threads' • each block has an ID in the grid • each thread has an ID in its block • blocks are 'independent' • no synchronization between blocks • threads within a block may cooperate (see the sketch below) • use shared memory • fast synchronization • in hardware, blocks are mapped to SMs
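A made-up kernel to illustrate the thread model: each block stages its chunk of the input in shared memory, synchronizes, and writes the chunk out reversed. It assumes the array length is a multiple of BLOCK_SIZE and the kernel is launched with BLOCK_SIZE threads per block.

#define BLOCK_SIZE 256

// Each block reverses its own BLOCK_SIZE-long chunk of 'in' into 'out',
// staging the data through shared memory and using the block-wide barrier.
__global__ void reverse_per_block( float *out, const float *in )
{
  __shared__ float buf[BLOCK_SIZE];
  int local  = threadIdx.x;                        // ID within the block
  int global = blockIdx.x * blockDim.x + local;    // ID within the grid

  buf[local] = in[global];
  __syncthreads();                                 // barrier: whole block has loaded
  out[global] = buf[BLOCK_SIZE - 1 - local];
}

// launch: one block per BLOCK_SIZE elements, e.g.
//   reverse_per_block<<< n / BLOCK_SIZE, BLOCK_SIZE >>>( d_out, d_in );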
CUDA Memories • Registers – automatic variables in kernels are mapped to registers • Fermi hardware places a limit of 63 registers per thread • Shared memory – shared by the threads within a thread block • shared memory is 'banked' (like CPU N-way caches) • Global device memory • accessed through 'device' pointers • Constant cache – fast read-only memory for constants • Texture cache – fast read-only memory for data with spatial locality • Host memory • host pointers cannot be directly accessed by kernels • data must be copied from host memory to a device memory • host memory can be mapped into the GPU address space (zero copy) and accessed through a device pointer (see the sketch below)
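A sketch putting several of these memory spaces side by side; the kernel, array names, and sizes are illustrative only:

#include <cuda_runtime.h>

#define N 1024

__constant__ float coeffs[16];                    // constant memory: fast, read-only in kernels

__global__ void scale( float *z, const float *x ) // x, z: global device memory
{
  __shared__ float tile[256];                     // shared memory: per-block scratch
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  float r = x[i] * coeffs[0];                     // automatic variable 'r' lives in a register
  tile[threadIdx.x] = r;
  __syncthreads();
  z[i] = tile[threadIdx.x];
}

// Host side (illustrative):
//   float h_coeffs[16] = { 2.0f /* ... */ };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));       // fill constant memory
//   cudaSetDeviceFlags(cudaDeviceMapHost);                        // enable mapped (zero-copy) host memory
//   cudaHostAlloc(&h_buf, N*sizeof(float), cudaHostAllocMapped);  // page-locked, mappable host memory
//   cudaHostGetDevicePointer(&d_buf, h_buf, 0);                   // device pointer to that host memory
//   scale<<< N/256, 256 >>>( d_z, d_x );                          // launch with 256 threads per block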
Example: Kernel to Add Two Vectors • Include cuda.h to access the CUDA API (may also need cuda_runtime.h) • __global__ marks a function as a kernel • each thread generates a global thread ID and uses it for its device memory accesses

#include <cuda.h>
#include <cstdio>
#include <iostream>

#define N 20

// Kernel to add vectors 'x' and 'y' into 'z'
// vectors are of length N elements
__global__ void add( float *z, float *x, float *y )
{
  // Compute global thread ID from:
  //  - local id within the block (threadIdx)
  //  - id of block within grid (blockIdx)
  // threadIdx and blockIdx are predefined and can be up to 3d
  int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

  if( thread_id < N ) {
    z[ thread_id ] = x[ thread_id ] + y[ thread_id ];  // device memory accesses
  }
}
Example: Host Code

int main(int argc, char *argv[])
{
  float host_x[N], host_y[N], host_z[N];
  float* device_x;
  float* device_y;
  float* device_z;

  // Set up input on host
  for(int i=0; i < N; i++) {
    host_x[i]=(float)i;
    host_y[i]=(float)(2*i);
  }

  // Allocate arrays in device global memory
  cudaMalloc( &device_x, N*sizeof(float) );
  cudaMalloc( &device_y, N*sizeof(float) );
  cudaMalloc( &device_z, N*sizeof(float) );

  // Copy host data to device arrays (via PCIe bus)
  cudaMemcpy(device_x, host_x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(device_y, host_y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Set up a (1-d) grid of blocks
  dim3 n_blocks;
  dim3 threads_per_block;
  n_blocks.x = 1;
  threads_per_block.x = N;

  // LAUNCH KERNEL!!! (grid dimensions come first, then block dimensions)
  add<<< n_blocks, threads_per_block >>>( device_z, device_x, device_y );

  // Copy the answer back to the host
  cudaMemcpy( host_z, device_z, N*sizeof(float), cudaMemcpyDeviceToHost );

  cudaFree( device_x ); cudaFree( device_y ); cudaFree( device_z );
}
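One practical addition not shown on the slide: every runtime call returns a cudaError_t, and kernel launches report errors asynchronously, so a checking macro like the following (our own helper, not part of CUDA) is commonly wrapped around the calls above:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap runtime calls so failures are reported with file/line information
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage:
//   CUDA_CHECK( cudaMalloc(&device_x, N*sizeof(float)) );
//   add<<< n_blocks, threads_per_block >>>( device_z, device_x, device_y );
//   CUDA_CHECK( cudaGetLastError() );       // catches launch-configuration errors
//   CUDA_CHECK( cudaDeviceSynchronize() );  // catches errors raised during kernel execution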
Warps & Divergence All threads in warp exit if-else together 2nd 16 threads go other way first 16 wait 16 threads go one way other 16 wait • Threads mapped to hardware in groups of 32 threads at a time • these groups are called 'warps' • Threads within a warp proceed in lock-step • if threads within a warp take different execution paths one gets 'thread-divergence’ • Divergence reduces performance as divergent branches are serialized • eg: __global__ void add( float *z, float *x, float *y ) { int thread_id = threadIdx.x + blockIdx.x * blockDim.x; if( thread_id % 32 < 15 ) { z[ thread_id ] = x[ thread_id ] + y[ thread_id ]; } else { z[ thread_id ] = x[ thread_id ] - y[ thread_id ]; } }
Read/Write Coalescing • Pre-Fermi: memory transactions are issued for a half-warp (16 threads) at a time • Under the right circumstances, the reads for the 16 threads may be combined into "bursts": this is called "read coalescing" • For compute capability 1.2 & 1.3 the coalescing rules are simple: the words accessed by the threads in a half-warp must lie in the same segment of size: • 32 bytes if all threads access 8-bit words • 64 bytes if all threads access 16-bit words • 128 bytes if all threads access 32-bit or 64-bit words • For compute capability < 1.2 the rules are much more restrictive • required alignment, sequential access, etc. • Fermi coalescing is different yet again • memory accesses are cached; the cache line length is 128 bytes • a single memory request serves a whole warp when it is 128-byte aligned and all addresses in the warp fall within the 128-byte line (a coalesced vs strided sketch follows below)
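A sketch of the two extremes (kernel names are our own): consecutive threads touching consecutive words coalesce well, while strided access spreads a warp across many segments/cache lines:

// Coalesced: thread i reads word i, so a warp touches 32 consecutive floats
// (one 128-byte segment / cache line when the array is suitably aligned).
__global__ void copy_coalesced( float *out, const float *in )
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  out[i] = in[i];
}

// Strided: thread i reads word i*stride, so a warp touches many different
// segments / cache lines and needs many separate memory transactions.
__global__ void copy_strided( float *out, const float *in, int stride )
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  out[i] = in[i * stride];
}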
[Figure: Coalescing 'double's (c.c. 1.3) – threads 0-15 of a half-warp each access an 8-byte double spanning two 128-byte segments, shown both aligned and misaligned relative to the 128-byte alignment boundaries. The aligned case would be coalesced even for compute capability < 1.2. For the misaligned case, compute capability >= 1.2 breaks the access into just 2 transactions, one per segment, whereas compute capability < 1.2 would have issued 16 separate transactions.]
Using Shared Memory • CUDA devices contain on-chip fast-access shared memory • Fermi: shared memory can be configured as addressable memory / L1 cache • In CUDA one declares it with the __shared__ qualifier, e.g.: __shared__ float data[17][2]; • Shared memory is banked • compute capability 2.x: 32 banks • compute capability 1.x: 16 banks • successive 32-bit words are assigned to successive banks • [Figure: the 17x2 array laid out across banks – element [0][0] in bank 0, [0][1] in bank 1, [1][0] in bank 2, and so on, wrapping around at the last bank.] (a small shared-memory kernel sketch follows below)
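A typical use of shared memory, sketched with our own kernel name: a per-block partial sum where each block reduces 256 elements through shared memory (launch with 256 threads per block, input length a multiple of 256):

// Per-block partial sum: each block reduces 256 elements of 'in' into one
// entry of 'partial' using shared memory and the block-wide barrier.
__global__ void block_sum( float *partial, const float *in )
{
  __shared__ float buf[256];
  int local  = threadIdx.x;
  int global = blockIdx.x * blockDim.x + local;

  buf[local] = in[global];
  __syncthreads();

  // Tree reduction within the block: the number of active threads halves each step
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (local < s) buf[local] += buf[local + s];
    __syncthreads();
  }

  if (local == 0) partial[blockIdx.x] = buf[0];
}

// launch, e.g.:
//   block_sum<<< n/256, 256 >>>( d_partial, d_in );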
Bank Conflicts • As long as all requests come from separate banks there are no conflicts and the requests can be satisfied simultaneously • If multiple requests hit the same bank there are bank conflicts • the requests are serviced serially • similar to n-way cache bank conflicts • Broadcast special case: if several threads hit the same word there is no conflict • [Figure: no conflict – threads tid=0..4 access words in distinct banks; conflict – tid=0 and tid=4 hit the same bank; broadcast – tid=1 and tid=2 access the same word.] (a padding sketch for avoiding conflicts follows below)
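A common trick when threads walk down a column of a shared-memory tile is to pad the row length so column neighbours land in different banks; a sketch with our own kernel name, assuming a square matrix whose dimension is a multiple of TILE:

#define TILE 32

// Transpose a TILE x TILE block through shared memory.
// The '+1' padding makes each row 33 words long, so the elements of a column
// fall into different banks and the column reads below do not conflict.
__global__ void transpose_tile( float *out, const float *in, int width )
{
  __shared__ float tile[TILE][TILE + 1];    // padded to avoid bank conflicts

  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * width + x];
  __syncthreads();

  int xt = blockIdx.y * TILE + threadIdx.x; // transposed block coordinates
  int yt = blockIdx.x * TILE + threadIdx.y;
  out[yt * width + xt] = tile[threadIdx.x][threadIdx.y];
}

// launch, e.g.:
//   dim3 grid(width/TILE, width/TILE), block(TILE, TILE);
//   transpose_tile<<< grid, block >>>( d_out, d_in, width );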
CUDA Streams • CUDA provides a form of task parallelism: streams • Streams are command queues: enqueue tasks and wait for them to finish • Classic use: overlap computation with host-device memory copies • Host code (kern1 runs in the default stream 0, kern2 in stream1, the asynchronous copy in stream2):

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

kern1<<< size, Nb >>>( ... );                    // default stream 0
kern2<<< size, Nb, Smem, stream1 >>>( ... );     // stream1
cudaMemcpyAsync( ..., stream2 );                 // stream2

cudaStreamSynchronize(0);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
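A minimal sketch of the classic overlap pattern: the work is split into chunks, each chunk's copy-in, kernel, and copy-out are enqueued in its own stream, and page-locked (pinned) host memory is used so cudaMemcpyAsync can really run asynchronously. The kernel 'process' and the sizes are hypothetical; N is assumed to be a multiple of NCHUNK*256.

const int NCHUNK = 4;
const int CHUNK  = N / NCHUNK;

float *h_buf, *d_buf;
cudaMallocHost(&h_buf, N * sizeof(float));   // pinned host memory
cudaMalloc(&d_buf, N * sizeof(float));

cudaStream_t streams[NCHUNK];
for (int i = 0; i < NCHUNK; i++) cudaStreamCreate(&streams[i]);

// While chunk i is being copied in one stream, another chunk's kernel can run
for (int i = 0; i < NCHUNK; i++) {
  size_t off = (size_t)i * CHUNK;
  cudaMemcpyAsync(d_buf + off, h_buf + off, CHUNK * sizeof(float),
                  cudaMemcpyHostToDevice, streams[i]);
  process<<< CHUNK / 256, 256, 0, streams[i] >>>( d_buf + off );   // hypothetical kernel
  cudaMemcpyAsync(h_buf + off, d_buf + off, CHUNK * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[i]);
}

for (int i = 0; i < NCHUNK; i++) {
  cudaStreamSynchronize(streams[i]);
  cudaStreamDestroy(streams[i]);
}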
What else is there to help me? • Thrust: the STL of CUDA • uses the C++ type system & template tricks to hide a lot of the finicky memcpy work (a small example follows below) • http://code.google.com/p/thrust/ • Lots of tools and libraries for BLAS, LAPACK, FFTs, etc. • http://developer.nvidia.com/object/gpucomputing.html • Prefer Python to C++? • Check out PyCUDA • Your favorite piece of software may already have been ported to CUDA (talks later today…)
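A small Thrust sketch to give the flavour (our own example; the host-to-device copy happens implicitly through the vector types, with no explicit cudaMemcpy):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <iostream>

int main()
{
  thrust::host_vector<float> h(1000);
  thrust::sequence(h.begin(), h.end());         // h = 0, 1, 2, ...

  thrust::device_vector<float> d = h;           // host-to-device copy, hidden by the type

  float sum = thrust::reduce(d.begin(), d.end(), 0.0f);  // parallel reduction on the GPU
  std::cout << "sum = " << sum << std::endl;
  return 0;
}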
A word about OpenCL • CUDA is NVIDIA-proprietary • Other multi/many-core devices exist: • AMD GPUs, multi-core CPUs, Intel MIC (coming soon?) • OpenCL is a vendor-neutral standard for programming heterogeneous devices • Similar concepts to CUDA (groups of work-items ≈ blocks of threads) • Code is a lot more 'noisy' than CUDA code • lots of boilerplate to manage devices, work queues, etc. • JIT compilation means a lot of code just to set up kernel invocations • Productivity features are coming, e.g.: http://code.google.com/p/clutil/ • Code still needs to be tuned to the hardware • Compiler support is now maturing (NVIDIA, AMD, Apple, Intel, ...)
Conclusions • GPUs offer phenomenal power for throughput computations • Careful tuning is required to achieve good performance • host/device memory hierarchies • compute resources (e.g. registers, shared memory) • host/device PCIe hierarchy, page-locked memory • Latencies are typically high: hidden by allocating many threads • Success stories: • Lattice QCD: QUDA library (Clark et al.) • v0.2: ~250 Gflops/GPU for multi-GPU (GTX 480s) • v0.3: ~310 Gflops/GPU for single GPU (GTX 480) • Signal processing in astronomy (Clark et al.): ~1 Tflop/GPU • Other applications (Keeneland Project, Vetter)
Further Learning Resources • CUDA Zone: http://www.nvidia.com/cuda • Books: • Sanders, J. and Kandrot, E. – "CUDA by Example: An Introduction to General-Purpose GPU Programming" • Kirk, D. B. and Hwu, W-m. W. – "Programming Massively Parallel Processors: A Hands-on Approach" • Stanford University CUDA Course • lecture videos on iTunesU: search for 'CUDA' • UIUC CUDA Course (by Wen-mei W. Hwu) • http://courses.engr.illinois.edu/ece498/al/index.html