Using CUDA for High Performance Scientific Computing

Dana Schaa • NUCAR Research Group • Northeastern University

Outline • What is CUDA? • Concepts and Terminology • Program Design with CUDA • Example Programs • Ideal Characteristics for Graphics Processing

CUDA and nVidia • CUDA = Compute Unified Device Architecture • CUDA is a programming interface • Portions of the code are targeted to run on an nVidia GPU • Provides an extension to C • Only functions from the C standard library are supported • nvcc - nVidia CUDA compiler

GeForce 8800 Architecture • [Figure: GeForce 8800 block diagram. An array of multiprocessors (mp), each containing processors (uP) and 16KB of shared memory, attached to 768MB of device memory that holds the local, global, and other memory spaces]

CUDA Terminology • A kernel is a portion of a program that is executed many times, independently, on different data • The host is the CPU that is executing the code (i.e. the CPU of the system that the nVidia board is plugged into) • The device is the nVidia board itself

CUDA Terminology (2) • A thread is an instance of a computational kernel • Threads are arranged into SIMD groups, called warps • The warp size for the 8800 Series is 32 • A block is a group of threads • One block at a time is assigned to a multiprocessor • A grid is a group of blocks • All of the threads in a grid perform the same task • Blocks within a grid cannot run different operations
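To make the thread, block, and grid terms concrete, here is a minimal sketch of a vector addition. The kernel name vecAdd, the helper addOnDevice, and the block size of 256 are illustrative choices, not taken from the slides.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: every thread is one instance of this function and handles one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Global index = position of the block in the grid * block size + position in the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host helper: adds two equally sized, non-empty host vectors on the device.
bool addOnDevice(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    c.resize(n);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // The grid is a group of blocks; each block is a group of threads.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);

    // This copy waits for the kernel and also reports any error it raised.
    cudaError_t err = cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err == cudaSuccess;
}

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f), c;
    if (!addOnDevice(a, b, c)) { printf("CUDA error\n"); return 1; }
    printf("c[0] = %f, c[999] = %f\n", c[0], c[999]);
    return 0;
}
```

Every thread in the grid runs the same vecAdd code; only the index computed from blockIdx and threadIdx differs.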

CUDA Terminology (3) • [Figure: a grid drawn as a 4 x 4 array of blocks, each block containing three warps (W)]

Program Design - Threads • More threads per block are better for time slicing • Minimum: 64, Ideal: 192-256 • More threads per block means fewer registers per thread • Kernel invocation may fail if the kernel compiles to more registers than are available • Threads within a block can be synchronized • Important for SIMD efficiency • The maximum number of threads allowed per grid is 64K^3
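Because a kernel invocation can fail at launch time, it is worth checking for errors after each launch. The sketch below assumes the CUDA runtime API and a hypothetical helper, launchWithBlockSize. It provokes a failure with an oversized block rather than with register pressure, because that failure is easy to trigger reliably; the detection code is the same in both cases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: only the launch configuration matters for this sketch.
__global__ void noop() {}

// Hypothetical helper: launches one block of the given size and reports any error.
cudaError_t launchWithBlockSize(int threadsPerBlock) {
    noop<<<1, threadsPerBlock>>>();
    // Launch-time failures (bad configuration, too many resources requested)
    // are reported by cudaGetLastError, which also clears the error state.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    // Errors raised while the kernel runs surface at the next synchronization.
    return cudaDeviceSynchronize();
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }

    // A block size in the range the slides recommend.
    cudaError_t ok = launchWithBlockSize(256);
    printf("256 threads per block: %s\n", cudaGetErrorString(ok));

    // One thread more than the device allows per block: the launch is rejected.
    cudaError_t bad = launchWithBlockSize(prop.maxThreadsPerBlock + 1);
    printf("%d threads per block: %s\n", prop.maxThreadsPerBlock + 1,
           cudaGetErrorString(bad));
    return 0;
}
```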

Program Design - Blocks • There should be at least as many blocks as multiprocessors • The number of blocks should be at least 100 to scale to future generations • Blocks within a grid cannot be synchronized • Blocks can only be swapped by partitioning registers and shared memory among them
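The two block-count guidelines can be checked against a real device before launching. This is a minimal sketch; the GridCheck struct and the checkGrid helper are hypothetical names introduced here, and the multiprocessor count comes from the runtime's device properties.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result type: block count plus the outcome of both guidelines.
struct GridCheck {
    int blocks;                  // blocks needed to cover n elements
    bool coversMultiprocessors;  // at least one block per multiprocessor
    bool scalesToFuture;         // at least 100 blocks
};

// Hypothetical helper: evaluates a launch configuration for n elements.
GridCheck checkGrid(int n, int threadsPerBlock, int multiprocessors) {
    GridCheck g;
    g.blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    g.coversMultiprocessors = g.blocks >= multiprocessors;
    g.scalesToFuture = g.blocks >= 100;
    return g;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }
    GridCheck g = checkGrid(1 << 20, 256, prop.multiProcessorCount);
    printf("%d multiprocessors, %d blocks, covers: %d, scales: %d\n",
           prop.multiProcessorCount, g.blocks,
           g.coversMultiprocessors, g.scalesToFuture);
    return 0;
}
```

With a small problem, for example 1024 elements in blocks of 256 threads, only 4 blocks are produced, so a device with more than 4 multiprocessors would leave some of them idle.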

Program Design: On-Chip Memory • Registers • 32 registers per processor • Shared Memory - per block • 16KB per multiprocessor • Data should be in 32-bit increments to take advantage of concurrent accesses • Access is as fast as register access when there are no bank conflicts

[Figure: three shared memory access patterns, labeled a, b, and c] • a - no conflict • b - no conflict • c - conflict, must be serialized
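Whether an access pattern conflicts can be worked out with simple arithmetic. The sketch below is host-only code and rests on assumptions that are not stated in the slides: shared memory is split into 16 banks of 32-bit words, consecutive words fall in consecutive banks, and 16 threads access shared memory together. The helper conflictDegree is a hypothetical name.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical helper: thread t reads 32-bit word t * strideWords (stride >= 1).
// Returns the largest number of threads that land in the same bank, which is
// the number of serialized passes needed. A result of 1 means no conflict.
int conflictDegree(int strideWords, int threads = 16, int banks = 16) {
    std::vector<int> hits(banks, 0);
    for (int t = 0; t < threads; ++t) {
        int word = t * strideWords;
        hits[word % banks]++;  // assumption: bank index = word index mod bank count
    }
    return *std::max_element(hits.begin(), hits.end());
}

int main() {
    const int strides[] = {1, 2, 3, 16};
    for (int s : strides) {
        printf("stride %2d -> %2d-way access\n", s, conflictDegree(s));
    }
    return 0;
}
```

Under these assumptions, strides 1 and 3 are conflict-free, stride 2 is a 2-way conflict, and stride 16 sends every thread to the same bank.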

Program Design: Off-Chip Memory • Local (per thread) and Global (per grid) Memories • 200-300 cycle latency • Global memory is accessible from the host • Local memory is used for data and variables that cannot fit in registers • 768MB (includes constant and texture memories) • 64-128 bit accesses • Are the different memories variable in size?

Program Design: Off-Chip Memory (2) • Host-Device • Transfers are much slower than intra-device transfers • Small transfers should be batched • Intermediate data should be kept on device
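These guidelines can be sketched as one upload, two kernels that share a device buffer, and one download. The kernel names scale and offset and the helper scaleThenOffset are illustrative, not from the slides.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

__global__ void offset(float* x, float o, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += o;
}

// Hypothetical helper: computes data[i] * s + o in place for a non-empty vector.
bool scaleThenOffset(std::vector<float>& data, float s, float o) {
    const int n = static_cast<int>(data.size());
    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    // One large host-to-device transfer instead of many small ones.
    cudaMemcpy(d, data.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, s, n);
    // The intermediate result stays in device memory between the two kernels;
    // launches on the same stream run in order.
    offset<<<blocks, threads>>>(d, o, n);

    // One transfer back once all device work is done.
    cudaError_t err = cudaMemcpy(data.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}

int main() {
    std::vector<float> data(4096, 2.0f);
    if (!scaleThenOffset(data, 3.0f, 1.0f)) { printf("CUDA error\n"); return 1; }
    printf("data[0] = %f\n", data[0]);
    return 0;
}
```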

Program Design: Memory • [Figure: GeForce 8800 block diagram. An array of multiprocessors (mp), each containing processors (uP) and 16KB of shared memory, attached to 768MB of device memory that holds the local, global, and other memory spaces]

Program Design: Memory (2) • Typical memory access pattern • Copy data from global memory to shared memory • Synchronize threads • Manipulate the data in shared memory • Synchronize threads • Copy data back to global memory from shared memory
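The five steps above can be sketched with a kernel that reverses each block-sized tile of an array. The tile size of 256, the kernel name reverseTiles, and the helper reverseEachTile are assumptions made for illustration; the input length is assumed to be a non-zero multiple of the tile size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // threads per block, and elements per tile

__global__ void reverseTiles(const float* in, float* out) {
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int g = blockIdx.x * TILE + t;

    tile[t] = in[g];   // 1. copy from global memory to shared memory
    __syncthreads();   // 2. wait until the whole tile is loaded

    // 3. manipulate the data in shared memory: the first half of the
    //    threads each swap one pair of elements.
    if (t < TILE / 2) {
        float tmp = tile[t];
        tile[t] = tile[TILE - 1 - t];
        tile[TILE - 1 - t] = tmp;
    }
    __syncthreads();   // 4. wait until every swap is finished

    out[g] = tile[t];  // 5. copy back from shared memory to global memory
}

// Hypothetical helper: reverses every TILE-sized chunk of v in place.
bool reverseEachTile(std::vector<float>& v) {
    const int n = static_cast<int>(v.size());
    if (n == 0 || n % TILE != 0) return false;
    const size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, v.data(), bytes, cudaMemcpyHostToDevice);
    reverseTiles<<<n / TILE, TILE>>>(dIn, dOut);
    cudaError_t err = cudaMemcpy(v.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}

int main() {
    std::vector<float> v(2 * TILE);
    for (int i = 0; i < 2 * TILE; ++i) v[i] = static_cast<float>(i);
    if (!reverseEachTile(v)) { printf("CUDA error\n"); return 1; }
    printf("v[0] = %f, v[%d] = %f\n", v[0], TILE, v[TILE]);
    return 0;
}
```

The second synchronization matters here because the value a thread copies out in step 5 may have been written by a different thread in step 3.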

Program Design: Control Flow • Since the hardware is SIMD, control flow instructions can cause thread execution paths to diverge • Divergent execution paths must be serialized (costly) • if, switch, and while statements should be avoided if threads from the same warp will take different paths • The compiler may remove if statements in favor of predicated instructions to prevent divergence
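As an illustration of branch granularity, the sketch below contrasts a branch that splits neighboring threads of one warp with a branch whose condition is uniform across each warp. The kernel names and the helper runBranchDemo are hypothetical. The two kernels apply the operations to different elements; the point is only where the branch boundary falls.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Even and odd threads of the same warp take different paths, so the
// hardware has to execute both paths one after the other.
__global__ void divergentBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) x[i] *= 2.0f;
    else                      x[i] += 1.0f;
}

// The condition depends only on the warp index, so all threads of a warp
// take the same path and no warp diverges at this branch.
__global__ void warpAlignedBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) x[i] *= 2.0f;
    else                                   x[i] += 1.0f;
}

// Hypothetical helper: runs one of the two kernels on x using a single block.
// Assumes x is non-empty and no larger than the device's block size limit.
bool runBranchDemo(std::vector<float>& x, bool warpAligned) {
    const int n = static_cast<int>(x.size());
    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, x.data(), bytes, cudaMemcpyHostToDevice);
    if (warpAligned) warpAlignedBranch<<<1, n>>>(d, n);
    else             divergentBranch<<<1, n>>>(d, n);
    cudaError_t err = cudaMemcpy(x.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}

int main() {
    std::vector<float> a(64, 3.0f), b(64, 3.0f);
    if (!runBranchDemo(a, false) || !runBranchDemo(b, true)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("divergent:    a[0] = %f, a[1] = %f\n", a[0], a[1]);
    printf("warp-aligned: b[0] = %f, b[32] = %f\n", b[0], b[32]);
    return 0;
}
```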

MatrixMul example

MatrixMul example (2)

Ideal CUDA Programs • High intrinsic parallelism • per-pixel or per-element operations • FFT, matrix multiply • most image processing applications • Minimal communication (if any) between threads • limited synchronization • Few control flow statements • High ratio of arithmetic to memory operations
