Introduction To GPUs • Bálint Joó • Jefferson Lab • Parallelism In Experimental Nuclear Physics Workshop • CNU, Jan 6, 2011
Introduction • General Purpose Graphics Processing Units (GPUs) offer a disruptive price/performance improvement in throughput-oriented computing • Primary GPU manufacturers: • Advanced Micro Devices • Radeon & FireStream • NVIDIA • Gaming cards: GeForce GTX series • HPC cards: Tesla series • Programming GPUs • AMD Stream SDK (now OpenCL, formerly Brook/Brook++) • NVIDIA CUDA • OpenCL (supports AMD, NVIDIA, CPUs, etc.) • This talk: focus on NVIDIA GPUs (mostly Fermi) + CUDA
CPU/GPU Configurations • Host CPU: 4-6 cores/socket typical; peak SP flops/socket: O(100) Gflops (e.g. Intel Nehalem 4-core @ 3 GHz: 96 SP Gflops) • Host memory: 48 GB common; B/W to host memory: O(10) GB/s (e.g. 3-channel DDR3-1333: 32 GB/s) • GPU device: O(100) cores typical; peak SP flops/device: O(1000) Gflops (e.g. Tesla C2050: 1.03 SP Tflops) • Device memory: up to 6 GB; B/W to device memory: O(100) GB/s (e.g. Tesla C2050: 144 GB/s) • Host-device link: PCIe Gen2 x16; B/W: 8 GB/s per direction, 16 GB/s bidirectional
Anatomy of a Fermi GPU • NVIDIA GPU consists of Streaming Multiprocessors (SMs) • SMs provide: • registers (32K 32-bit) • CUDA cores (32 per SM) – 1 SP mul-add per clock. • 64 KB Shared Memory (configured as memory/L1 cache) • Special Function units (for fast sin/cos/exp etc) • Hardware barrier within SM. • texture caches, thread dispatch logic etc.
Anatomy of a Fermi GPU • Example: NVIDIA Tesla C2050 • 14 SMs → 448 CUDA cores • CUDA cores @ 1.15 GHz → 515 G mul-adds/s → 1030 SP Gflops • 3 GB GDDR5 on-device memory • 144 GB/sec memory bandwidth
Programming GPUs with CUDA • CUDA provides facilities for programming the accelerator: • a thread execution model • a memory hierarchy • extensions to C for writing 'kernels' • A run-time API for: • querying device attributes (e.g. compute capability; see the sketch after this list) • memory management (allocation, movement) • launching kernels • managing 'task parallelism' (CUDA streams) • The CUDA Toolkit provides tools: • compiler, debugger, profiler • CUDA driver (kernel level) for making all this happen
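As an illustration, here is a minimal sketch of querying device attributes through the run-time API; the standalone program and the particular fields printed are our own choices, not from the talk:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int count = 0;
  cudaGetDeviceCount(&count);                 // how many CUDA devices are visible
  for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);      // fill in the attribute structure
    printf("Device %d: %s, compute capability %d.%d, %d SMs, %zu bytes global memory\n",
           dev, prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.totalGlobalMem);
  }
  return 0;
}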
The CUDA Thread Model • user 'kernels' execute in a 'grid' of 'blocks' of 'threads' • each block has an ID in the grid • each thread has an ID in its block • blocks are 'independent' • no synchronization between blocks • threads within a block may cooperate (see the sketch below) • use shared memory • fast synchronization • in hardware, blocks are mapped to SMs
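A made-up kernel to illustrate the thread model: each block stages its chunk of the input in shared memory, synchronizes, and writes the chunk out reversed. It assumes the array length is a multiple of BLOCK_SIZE and the kernel is launched with BLOCK_SIZE threads per block.

#define BLOCK_SIZE 256

// Each block reverses its own BLOCK_SIZE-long chunk of 'in' into 'out',
// staging the data through shared memory and using the block-wide barrier.
__global__ void reverse_per_block( float *out, const float *in )
{
  __shared__ float buf[BLOCK_SIZE];
  int local  = threadIdx.x;                        // ID within the block
  int global = blockIdx.x * blockDim.x + local;    // ID within the grid

  buf[local] = in[global];
  __syncthreads();                                 // barrier: whole block has loaded
  out[global] = buf[BLOCK_SIZE - 1 - local];
}

// launch: one block per BLOCK_SIZE elements, e.g.
//   reverse_per_block<<< n / BLOCK_SIZE, BLOCK_SIZE >>>( d_out, d_in );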
CUDA Memories • Registers – automatic variables in kernels are mapped to registers • Fermi hardware places a limit of 63 registers per thread • Shared memory – shared by the threads within a thread block • shared memory is 'banked' (like CPU N-way caches) • Global device memory • accessed through 'device' pointers • Constant cache – fast read-only memory for constants • Texture cache – fast read-only memory for data with spatial locality • Host memory • host pointers cannot be directly accessed by kernels • data must be copied from host memory to a device memory • host memory can be mapped into the GPU address space (zero copy) and accessed through a device pointer (see the sketch below)
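A sketch putting several of these memory spaces side by side; the kernel, array names, and sizes are illustrative only:

#include <cuda_runtime.h>

#define N 1024

__constant__ float coeffs[16];                    // constant memory: fast, read-only in kernels

__global__ void scale( float *z, const float *x ) // x, z: global device memory
{
  __shared__ float tile[256];                     // shared memory: per-block scratch
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  float r = x[i] * coeffs[0];                     // automatic variable 'r' lives in a register
  tile[threadIdx.x] = r;
  __syncthreads();
  z[i] = tile[threadIdx.x];
}

// Host side (illustrative):
//   float h_coeffs[16] = { 2.0f /* ... */ };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));       // fill constant memory
//   cudaSetDeviceFlags(cudaDeviceMapHost);                        // enable mapped (zero-copy) host memory
//   cudaHostAlloc(&h_buf, N*sizeof(float), cudaHostAllocMapped);  // page-locked, mappable host memory
//   cudaHostGetDevicePointer(&d_buf, h_buf, 0);                   // device pointer to that host memory
//   scale<<< N/256, 256 >>>( d_z, d_x );                          // launch with 256 threads per block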
Example: Kernel to Add Two Vectors • Include cuda.h to access the CUDA API (may also need cuda_runtime.h) • __global__ marks a function as a kernel • each thread generates a global thread ID and uses it for its device memory accesses

#include <cuda.h>
#include <cstdio>
#include <iostream>

#define N 20

// Kernel to add vectors 'x' and 'y' into 'z'
// vectors are of length N elements
__global__ void add( float *z, float *x, float *y )
{
  // Compute global thread ID from:
  //  - local id within the block (threadIdx)
  //  - id of block within grid (blockIdx)
  // threadIdx and blockIdx are predefined and can be up to 3d
  int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

  if( thread_id < N ) {
    z[ thread_id ] = x[ thread_id ] + y[ thread_id ];  // device memory accesses
  }
}
Example: Host Code

int main(int argc, char *argv[])
{
  float host_x[N], host_y[N], host_z[N];
  float* device_x;
  float* device_y;
  float* device_z;

  // Set up input on host
  for(int i=0; i < N; i++) {
    host_x[i]=(float)i;
    host_y[i]=(float)(2*i);
  }

  // Allocate arrays in device global memory
  cudaMalloc( &device_x, N*sizeof(float) );
  cudaMalloc( &device_y, N*sizeof(float) );
  cudaMalloc( &device_z, N*sizeof(float) );

  // Copy host data to device arrays (via PCIe bus)
  cudaMemcpy(device_x, host_x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(device_y, host_y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Set up a (1-d) grid of blocks
  dim3 n_blocks;
  dim3 threads_per_block;
  n_blocks.x = 1;
  threads_per_block.x = N;

  // LAUNCH KERNEL!!! (grid dimensions come first, then block dimensions)
  add<<< n_blocks, threads_per_block >>>( device_z, device_x, device_y );

  // Copy the answer back to the host
  cudaMemcpy( host_z, device_z, N*sizeof(float), cudaMemcpyDeviceToHost );

  cudaFree( device_x ); cudaFree( device_y ); cudaFree( device_z );
}
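One practical addition not shown on the slide: every runtime call returns a cudaError_t, and kernel launches report errors asynchronously, so a checking macro like the following (our own helper, not part of CUDA) is commonly wrapped around the calls above:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap runtime calls so failures are reported with file/line information
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage:
//   CUDA_CHECK( cudaMalloc(&device_x, N*sizeof(float)) );
//   add<<< n_blocks, threads_per_block >>>( device_z, device_x, device_y );
//   CUDA_CHECK( cudaGetLastError() );       // catches launch-configuration errors
//   CUDA_CHECK( cudaDeviceSynchronize() );  // catches errors raised during kernel execution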
Warps & Divergence All threads in warp exit if-else together 2nd 16 threads go other way first 16 wait 16 threads go one way other 16 wait • Threads mapped to hardware in groups of 32 threads at a time • these groups are called 'warps' • Threads within a warp proceed in lock-step • if threads within a warp take different execution paths one gets 'thread-divergence’ • Divergence reduces performance as divergent branches are serialized • eg: __global__ void add( float *z, float *x, float *y ) { int thread_id = threadIdx.x + blockIdx.x * blockDim.x; if( thread_id % 32 < 15 ) { z[ thread_id ] = x[ thread_id ] + y[ thread_id ]; } else { z[ thread_id ] = x[ thread_id ] - y[ thread_id ]; } }
Read/Write Coalescing • Pre-Fermi: memory transactions are issued for a half-warp (16 threads) at a time • Under the right circumstances, the reads for the 16 threads may be combined into "bursts": this is called "read coalescing" • For compute capability 1.2 & 1.3 the coalescing rules are simple: the words accessed by the threads in a half-warp must lie in the same segment of size: • 32 bytes if all threads access 8-bit words • 64 bytes if all threads access 16-bit words • 128 bytes if all threads access 32-bit or 64-bit words • For compute capability < 1.2 the rules are much more restrictive • required alignment, sequential access, etc. • Fermi coalescing is different yet again • memory accesses are cached; the cache line length is 128 bytes • a single memory request serves a whole warp when it is 128-byte aligned and all addresses in the warp fall within the 128-byte line (a coalesced vs strided sketch follows below)
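A sketch of the two extremes (kernel names are our own): consecutive threads touching consecutive words coalesce well, while strided access spreads a warp across many segments/cache lines:

// Coalesced: thread i reads word i, so a warp touches 32 consecutive floats
// (one 128-byte segment / cache line when the array is suitably aligned).
__global__ void copy_coalesced( float *out, const float *in )
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  out[i] = in[i];
}

// Strided: thread i reads word i*stride, so a warp touches many different
// segments / cache lines and needs many separate memory transactions.
__global__ void copy_strided( float *out, const float *in, int stride )
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  out[i] = in[i * stride];
}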
[Figure: Coalescing 'double's (c.c. 1.3) – threads 0-15 of a half-warp each access an 8-byte double spanning two 128-byte segments, shown both aligned and misaligned relative to the 128-byte alignment boundaries. The aligned case would be coalesced even for compute capability < 1.2. For the misaligned case, compute capability >= 1.2 breaks the access into just 2 transactions, one per segment, whereas compute capability < 1.2 would have issued 16 separate transactions.]
Using Shared Memory • CUDA devices contain on-chip fast-access shared memory • Fermi: shared memory can be configured as addressable memory / L1 cache • In CUDA one declares it with the __shared__ qualifier, e.g.: __shared__ float data[17][2]; • Shared memory is banked • compute capability 2.x: 32 banks • compute capability 1.x: 16 banks • successive 32-bit words are assigned to successive banks • [Figure: the 17x2 array laid out across banks – element [0][0] in bank 0, [0][1] in bank 1, [1][0] in bank 2, and so on, wrapping around at the last bank.] (a small shared-memory kernel sketch follows below)
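A typical use of shared memory, sketched with our own kernel name: a per-block partial sum where each block reduces 256 elements through shared memory (launch with 256 threads per block, input length a multiple of 256):

// Per-block partial sum: each block reduces 256 elements of 'in' into one
// entry of 'partial' using shared memory and the block-wide barrier.
__global__ void block_sum( float *partial, const float *in )
{
  __shared__ float buf[256];
  int local  = threadIdx.x;
  int global = blockIdx.x * blockDim.x + local;

  buf[local] = in[global];
  __syncthreads();

  // Tree reduction within the block: the number of active threads halves each step
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (local < s) buf[local] += buf[local + s];
    __syncthreads();
  }

  if (local == 0) partial[blockIdx.x] = buf[0];
}

// launch, e.g.:
//   block_sum<<< n/256, 256 >>>( d_partial, d_in );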
Bank Conflicts • As long as all requests come from separate banks there are no conflicts and the requests can be satisfied simultaneously • If multiple requests hit the same bank there are bank conflicts • the requests are serviced serially • similar to n-way cache bank conflicts • Broadcast special case: if several threads hit the same word there is no conflict • [Figure: no conflict – threads tid=0..4 access words in distinct banks; conflict – tid=0 and tid=4 hit the same bank; broadcast – tid=1 and tid=2 access the same word.] (a padding sketch for avoiding conflicts follows below)
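A common trick when threads walk down a column of a shared-memory tile is to pad the row length so column neighbours land in different banks; a sketch with our own kernel name, assuming a square matrix whose dimension is a multiple of TILE:

#define TILE 32

// Transpose a TILE x TILE block through shared memory.
// The '+1' padding makes each row 33 words long, so the elements of a column
// fall into different banks and the column reads below do not conflict.
__global__ void transpose_tile( float *out, const float *in, int width )
{
  __shared__ float tile[TILE][TILE + 1];    // padded to avoid bank conflicts

  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * width + x];
  __syncthreads();

  int xt = blockIdx.y * TILE + threadIdx.x; // transposed block coordinates
  int yt = blockIdx.x * TILE + threadIdx.y;
  out[yt * width + xt] = tile[threadIdx.x][threadIdx.y];
}

// launch, e.g.:
//   dim3 grid(width/TILE, width/TILE), block(TILE, TILE);
//   transpose_tile<<< grid, block >>>( d_out, d_in, width );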
CUDA Streams • CUDA provides a form of task parallelism: streams • Streams are command queues: enqueue tasks and wait for them to finish • Classic use: overlap computation with host-device memory copies • Host code (kern1 runs in the default stream 0, kern2 in stream1, the asynchronous copy in stream2):

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

kern1<<< size, Nb >>>( ... );                    // default stream 0
kern2<<< size, Nb, Smem, stream1 >>>( ... );     // stream1
cudaMemcpyAsync( ..., stream2 );                 // stream2

cudaStreamSynchronize(0);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
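A minimal sketch of the classic overlap pattern: the work is split into chunks, each chunk's copy-in, kernel, and copy-out are enqueued in its own stream, and page-locked (pinned) host memory is used so cudaMemcpyAsync can really run asynchronously. The kernel 'process' and the sizes are hypothetical; N is assumed to be a multiple of NCHUNK*256.

const int NCHUNK = 4;
const int CHUNK  = N / NCHUNK;

float *h_buf, *d_buf;
cudaMallocHost(&h_buf, N * sizeof(float));   // pinned host memory
cudaMalloc(&d_buf, N * sizeof(float));

cudaStream_t streams[NCHUNK];
for (int i = 0; i < NCHUNK; i++) cudaStreamCreate(&streams[i]);

// While chunk i is being copied in one stream, another chunk's kernel can run
for (int i = 0; i < NCHUNK; i++) {
  size_t off = (size_t)i * CHUNK;
  cudaMemcpyAsync(d_buf + off, h_buf + off, CHUNK * sizeof(float),
                  cudaMemcpyHostToDevice, streams[i]);
  process<<< CHUNK / 256, 256, 0, streams[i] >>>( d_buf + off );   // hypothetical kernel
  cudaMemcpyAsync(h_buf + off, d_buf + off, CHUNK * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[i]);
}

for (int i = 0; i < NCHUNK; i++) {
  cudaStreamSynchronize(streams[i]);
  cudaStreamDestroy(streams[i]);
}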
What else is there to help me? • Thrust: the STL of CUDA • uses the C++ type system & template tricks to hide a lot of the finicky memcpy work (a small example follows below) • http://code.google.com/p/thrust/ • Lots of tools and libraries for BLAS, LAPACK, FFTs, etc. • http://developer.nvidia.com/object/gpucomputing.html • Prefer Python to C++? • Check out PyCUDA • Your favorite piece of software may already have been ported to CUDA (talks later today…)
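A small Thrust sketch to give the flavour (our own example; the host-to-device copy happens implicitly through the vector types, with no explicit cudaMemcpy):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <iostream>

int main()
{
  thrust::host_vector<float> h(1000);
  thrust::sequence(h.begin(), h.end());         // h = 0, 1, 2, ...

  thrust::device_vector<float> d = h;           // host-to-device copy, hidden by the type

  float sum = thrust::reduce(d.begin(), d.end(), 0.0f);  // parallel reduction on the GPU
  std::cout << "sum = " << sum << std::endl;
  return 0;
}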
A word about OpenCL • CUDA is NVIDIA-proprietary • Other multi/many-core devices exist: • AMD GPUs, multi-core CPUs, Intel MIC (coming soon?) • OpenCL is a vendor-neutral standard for programming heterogeneous devices • Similar concepts to CUDA (groups of work-items ≈ blocks of threads) • Code is a lot more 'noisy' than CUDA code • lots of boilerplate to manage devices, work queues, etc. • JIT compilation means a lot of code just to set up kernel invocations • Productivity features are coming, e.g.: http://code.google.com/p/clutil/ • Code still needs to be tuned to the hardware • Compiler support is now maturing (NVIDIA, AMD, Apple, Intel, ...)
Conclusions • GPUs offer phenomenal power for throughput computations • Careful tuning is required to achieve good performance • host/device memory hierarchies • compute resources (e.g. registers, shared memory) • host/device PCIe hierarchy, page-locked memory • Latencies are typically high: hidden by allocating many threads • Success stories: • Lattice QCD: QUDA library (Clark et al.) • v0.2: ~250 Gflops/GPU for multi-GPU (GTX 480s) • v0.3: ~310 Gflops/GPU for single GPU (GTX 480) • Signal processing in astronomy (Clark et al.): ~1 Tflop/GPU • Other applications (Keeneland Project, Vetter)
Further Learning Resources • CUDA Zone: http://www.nvidia.com/cuda • Books: • Sanders, J. and Kandrot, E. – "CUDA by Example: An Introduction to General-Purpose GPU Programming" • Kirk, D. B. and Hwu, W-m. W. – "Programming Massively Parallel Processors: A Hands-on Approach" • Stanford University CUDA Course • lecture videos on iTunesU: search for 'CUDA' • UIUC CUDA Course (by Wen-mei W. Hwu) • http://courses.engr.illinois.edu/ece498/al/index.html