Using CUDA for High Performance Scientific Computing • Dana Schaa • NUCAR Research Group • Northeastern University
Outline • What is CUDA? • Concepts and Terminology • Program Design with CUDA • Example Programs • Ideal Characteristics for Graphics Processing
CUDA and nVidia • CUDA = Compute Unified Device Architecture • CUDA is a programming interface • Portions of the code are targeted to run on an nVidia GPU • Provides an extension to C • Only functions from the C standard library are supported • nvcc - nVidia CUDA compiler
GeForce 8800 Architecture • [Diagram: the GeForce 8800 as an array of multiprocessors (mp), each containing processors (uP) and 16KB of shared memory, attached to 768MB of device memory that holds the local, global, and other memory spaces]
CUDA Terminology • A kernel is a portion of a program that is executed many times (independently) and on different data • The host is the CPU that is executing the code (i.e. the CPU of the system that the nVidia board is plugged into) • The device is the nVidia board itself
CUDA Terminology (2) • A thread is an instance of a computational kernel • Threads are arranged into SIMD groups, called warps • The warp size for the 8800 Series is 32 • A block is a group of threads • Each block is assigned to a single multiprocessor and stays there until it finishes • A grid is a group of blocks • All of the threads in a grid perform the same task • Blocks within a grid cannot run different operations
CUDA Terminology (3) • [Diagram: a grid made up of blocks, each block made up of warps (W)]
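As a rough illustration of how the terminology maps onto code, the sketch below launches a grid of blocks in which each thread handles one array element. The names `scaleKernel` and `scaleOnDevice`, and the choice of 256 threads per block, are assumptions made for this example and do not come from the slides.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Kernel: runs on the device, once per thread (hypothetical example kernel).
__global__ void scaleKernel(float *data, float factor, int n)
{
    // Position of this thread within the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the last block may have spare threads
        data[i] *= factor;
}

// Host code: copies data to the device, launches the grid, copies back.
// Returns true if every CUDA call succeeded.
bool scaleOnDevice(float *hostData, float factor, int n)
{
    assert(n > 0);
    float *devData = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&devData, bytes) != cudaSuccess)
        return false;

    bool ok = cudaMemcpy(devData, hostData, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threadsPerBlock = 256;  // assumed block size
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(devData, factor, n);
        // This copy waits for the kernel to finish before reading results.
        ok = cudaMemcpy(hostData, devData, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devData);
    return ok;
}
```

Calling `scaleOnDevice(values, 2.0f, 1000)` on a host array would launch 4 blocks of 256 threads; the 24 surplus threads in the last block do nothing because of the bounds check.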
Program Design - Threads • More threads per block are better for time slicing • Minimum: 64, Ideal: 192-256 • More threads per block means fewer registers per thread • Kernel invocation may fail if the kernel compiles to more registers than are available • Threads within a block can be synchronized • Important for SIMD efficiency • On the 8800, a block holds at most 512 threads and a grid at most 65535 x 65535 blocks
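Because a launch can fail for resource reasons, it can help to check the launch result and to look at how many registers the compiler assigned. The following is a minimal sketch under stated assumptions: `addOne`, `registersPerThread`, and `launchAddOne` are hypothetical names, and `cudaDeviceSynchronize` assumes a reasonably recent CUDA runtime.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to launch.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

// Registers per thread that the compiler assigned to addOne, or -1 on error.
int registersPerThread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, addOne);
    return (err == cudaSuccess) ? attr.numRegs : -1;
}

// Launches addOne on device memory and returns the first error seen, if any.
cudaError_t launchAddOne(int *devData, int n, int threadsPerBlock)
{
    assert(n > 0 && threadsPerBlock > 0);
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(devData, n);

    // Launch-time failures (bad block size, too few resources) appear here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;

    // Errors raised while the kernel was running appear here.
    return cudaDeviceSynchronize();
}
```

A block size in the recommended 192-256 range should launch cleanly for a kernel this small, while an oversized request (for example more threads per block than the device allows) comes back as an error code instead of silently doing nothing.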
Program Design - Blocks • There should be at least as many blocks as multiprocessors • The number of blocks should be at least 100 to scale to future generations • Blocks within a grid cannot be synchronized • A multiprocessor can only switch between blocks if its registers and shared memory are partitioned among them
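One way to check the first guideline is to ask the runtime how many multiprocessors the device has and compare that with the planned block count. This is only a sketch: the helper names `multiprocessorCount`, `blocksFor`, and `keepsDeviceBusy` are invented for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Number of multiprocessors on the given device, or -1 if the query fails.
int multiprocessorCount(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.multiProcessorCount;
}

// Blocks needed to cover n elements, rounding up.
int blocksFor(int n, int threadsPerBlock)
{
    assert(n > 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// True if the launch gives every multiprocessor at least one block.
bool keepsDeviceBusy(int n, int threadsPerBlock, int device)
{
    int mps = multiprocessorCount(device);
    return mps > 0 && blocksFor(n, threadsPerBlock) >= mps;
}
```

For example, `blocksFor(1000, 256)` yields 4 blocks, which would leave most multiprocessors on a large device idle; a bigger problem size or a smaller block size raises the block count.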
Program Design: On-Chip Memory • Registers • 8192 registers per multiprocessor, shared among its resident threads (32 per thread at 256 threads) • Shared Memory - per block • 16KB per multiprocessor • Data should be accessed as 32-bit words to take advantage of concurrent bank accesses • Access is as fast as register access if there are no bank conflicts
[Figure: shared memory access patterns a, b, and c] • a - no conflict • b - no conflict • c - conflict, must be serialized
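To make the idea of a bank conflict concrete, here is a small host-side model of strided access. It assumes the layout documented for this GPU generation (16 banks of 32-bit words, with conflicts evaluated per half-warp of 16 threads); those numbers and the function name `conflictDegree` are assumptions, not taken from the slides, and the model ignores the broadcast case where several threads read the same word.

```cuda
#include <cassert>

// Assumed parameters for this GPU generation (not stated on the slides).
const int NUM_BANKS = 16;  // shared memory banks, each one 32-bit word wide
const int HALF_WARP = 16;  // threads whose accesses are serviced together

// Thread t of a half-warp accesses 32-bit word t * strideWords.
// Returns the largest number of threads that land in the same bank:
// 1 means conflict-free, k means the access is serialized into k steps.
// Simplification: strideWords >= 1, so no two threads touch the same word.
int conflictDegree(int strideWords)
{
    assert(strideWords >= 1);
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < HALF_WARP; ++t) {
        int bank = (t * strideWords) % NUM_BANKS;  // consecutive words, consecutive banks
        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model a stride of 1 or any odd stride is conflict-free, a stride of 2 gives a two-way conflict, and a stride of 16 sends every thread to the same bank.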
Program Design: Off-Chip Memory • Local (per thread) and Global (per grid) Memories • 200-300 cycle latency • Global memory is accessible from the host • Local memory is used for data and variables that can't fit in registers • 768MB (includes constant and texture memories) • 64-128 bit accesses • Are the different memories variable in size?
Program Design: Off-Chip Memory (2) • Host-Device • Transfers are much slower than intra-device transfers • Small transfers should be batched • Intermediate data should be kept on device
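Batching can be as simple as packing several small host arrays into one staging buffer and issuing a single copy. The sketch below assumes two equally sized arrays; `copyBatched` is a hypothetical helper written for this example.

```cuda
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Packs two small host arrays into one staging buffer and sends them to the
// device with a single cudaMemcpy instead of one transfer per array.
// On success, *devOut points to a[0..n) followed by b[0..n) in device memory;
// the caller releases it with cudaFree.
bool copyBatched(const float *a, const float *b, int n, float **devOut)
{
    assert(n > 0);
    std::vector<float> staging(2 * n);
    std::memcpy(&staging[0], a, n * sizeof(float));
    std::memcpy(&staging[n], b, n * sizeof(float));

    size_t bytes = staging.size() * sizeof(float);
    if (cudaMalloc((void **)devOut, bytes) != cudaSuccess)
        return false;
    if (cudaMemcpy(*devOut, &staging[0], bytes,
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(*devOut);
        *devOut = NULL;
        return false;
    }
    return true;
}
```

Kernels can then index the two arrays at offsets 0 and n within the same device buffer, and intermediate results can stay on the device between kernel launches.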
Program Design: Memory • [Diagram: GeForce 8800 memory layout, with 16KB of shared memory inside each multiprocessor (mp) and the local, global, and other memory spaces in the 768MB device memory]
Program Design: Memory (2) • Typical memory access pattern • Copy data from global memory to shared memory • Synchronize threads • Manipulate the data in shared memory • Synchronize threads • Copy data back to global memory from shared memory
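The five steps above can be sketched as a kernel that reverses each block-sized tile of an array. The tile size of 256, the names `reverseTiles` and `reverseTilesOnDevice`, and the requirement that the array length be a multiple of the tile size are all assumptions chosen to keep the example short.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

#define TILE 256  // threads per block and elements per tile (assumed value)

// Reverses the order of the elements inside each TILE-sized chunk.
__global__ void reverseTiles(float *data)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int g = blockIdx.x * TILE + t;

    tile[t] = data[g];       // 1. copy from global to shared memory
    __syncthreads();         // 2. wait until the whole tile is loaded

    if (t < TILE / 2) {      // 3. manipulate the data in shared memory
        float tmp = tile[t];
        tile[t] = tile[TILE - 1 - t];
        tile[TILE - 1 - t] = tmp;
    }
    __syncthreads();         // 4. wait until every swap has finished

    data[g] = tile[t];       // 5. copy back from shared to global memory
}

// Host helper: n must be a positive multiple of TILE.
bool reverseTilesOnDevice(float *hostData, int n)
{
    assert(n > 0 && n % TILE == 0);
    float *dev = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return false;
    bool ok = cudaMemcpy(dev, hostData, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseTiles<<<n / TILE, TILE>>>(dev);
        ok = cudaMemcpy(hostData, dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Both synchronization points sit outside the conditional so that every thread in the block reaches them. The swap is done by whole warps (threads 0-127), so it does not split any warp across the two paths.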
Program Design: Control Flow • Since the hardware is SIMD, control flow instructions can cause thread execution paths to diverge • Divergent execution paths must be serialized (costly) • if, switch, and while statements should be avoided if threads from the same warp will take different paths • The compiler may remove if statements in favor of predicated instructions to prevent divergence
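The difference between a divergent branch and a warp-uniform one can be seen by changing only the branch condition. The kernels below are an illustrative sketch with hypothetical names; for branches this small the compiler may well use predication, so they show the pattern and are not a benchmark.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Even and odd threads of the same warp take different paths (divergent).
__global__ void branchPerThread(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// All threads of a warp take the same path; only whole warps differ.
__global__ void branchPerWarp(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Hypothetical host helper: runs one block of n threads and copies back.
bool runBranchDemo(bool perWarp, float *hostOut, int n)
{
    assert(n > 0 && n <= 256);
    float *dev = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return false;
    if (perWarp)
        branchPerWarp<<<1, n>>>(dev);
    else
        branchPerThread<<<1, n>>>(dev);
    bool ok = cudaMemcpy(hostOut, dev, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

With 64 threads, the per-thread version alternates 1, 2, 1, 2 across each warp, whereas the per-warp version writes 1 for threads 0-31 and 2 for threads 32-63, so no warp has to execute both paths.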
Ideal CUDA Programs • High intrinsic parallelism • per-pixel or per-element operations • FFT, matrix multiply • most image processing applications • Minimal communication (if any) between threads • limited synchronization • Few control flow statements • High ratio of arithmetic to memory operations