Intermediate GPGPU Programming in CUDA Supada Laosooksathit
NVIDIA Hardware Architecture [Figure: hardware diagram, including host memory]
Recall • 5 steps for CUDA Programming • Initialize device • Allocate device memory • Copy data to device memory • Execute kernel • Copy data back from device memory
Initialize Device Calls • To select the device associated with the host thread • cudaSetDevice(device) • This function must be called before any __global__ function call; otherwise device 0 is selected automatically. • To get the number of devices • cudaGetDeviceCount(&deviceCount) • To retrieve a device's properties • cudaGetDeviceProperties(&deviceProp, device)
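The three calls above can be sketched in a short host program (minimal sketch; real code should check every return code for errors):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);         // number of CUDA devices
    printf("Found %d device(s)\n", deviceCount);

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);  // properties of device 0
    printf("Device 0: %s\n", deviceProp.name);

    cudaSetDevice(0);  // must precede any __global__ launch;
                       // otherwise device 0 is selected implicitly
    return 0;
}
```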
Hello World Example • Allocate host and device memory
Hello World Example • Host code
Hello World Example • Kernel code
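A minimal kernel of the kind shown on these slides might look as follows (a sketch, not the deck's exact code; device-side printf requires compute capability 2.0 or later):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();  // wait so device output is flushed
    return 0;
}
```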
To Try CUDA Programming • SSH to 138.47.102.111 • Set environment vals in .bashrc in your home directory export PATH=$PATH:/usr/local/cuda/bin export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH • Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK • Compile the following directories • NVIDIA_GPU_Computing_SDK/shared/ • NVIDIA_GPU_Computing_SDK/C/common/ • The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo • Hello World • Print out block and thread IDs • Vector Add • C = A + B
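The vector-add demo follows the five steps from the recap slide. A sketch (sizes and names are illustrative, not the demo's actual source):

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // C = A + B, one element per thread
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; i++) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                                // 2. allocate device memory
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // 3. copy to device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);    // 4. execute kernel
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // 5. copy back

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```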
Specifications of a Device • For more details • deviceQuery in CUDA SDK • Appendix F in Programming Guide 4.0
Demo • deviceQuery • Show hardware specifications in details
Memory Optimizations • Reduce the time of memory transfer between host and device • Use asynchronous memory transfer (CUDA streams) • Use zero copy • Reduce the number of transactions between on-chip and off-chip memory • Memory coalescing • Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer • Regular memory transfer is synchronous: cudaMemcpy blocks the host until the copy completes
Reduce Time of Host-Device Memory Transfer • CUDA streams • Allow kernel execution to overlap with memory copies
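A sketch of the overlap pattern with two streams (buffer names, `chunk` sizes, and the `process` kernel are illustrative; host buffers must be page-locked, e.g. via cudaMallocHost, for the async copies to overlap):

```cuda
cudaStream_t stream[2];
for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

for (int s = 0; s < 2; s++) {
    int off = s * chunk;
    // copy in, compute, and copy out are queued per stream; work in
    // stream[0] can overlap with transfers in stream[1]
    cudaMemcpyAsync(dIn + off, hIn + off, chunkBytes,
                    cudaMemcpyHostToDevice, stream[s]);
    process<<<grid, block, 0, stream[s]>>>(dIn + off, dOut + off);
    cudaMemcpyAsync(hOut + off, dOut + off, chunkBytes,
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();  // wait for both streams to drain
for (int s = 0; s < 2; s++) cudaStreamDestroy(stream[s]);
```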
GPU Timers • CUDA Events • An API that records timestamps on the GPU clock • Accurate for timing kernel executions • CUDA timer calls • Libraries implemented in the CUDA SDK
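The CUDA Events API can time a kernel like this (a sketch; `myKernel`, `grid`, `block`, and `dData` are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // timestamp before the kernel
myKernel<<<grid, block>>>(dData);
cudaEventRecord(stop, 0);                // timestamp after the kernel
cudaEventSynchronize(stop);              // wait until the kernel finishes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
```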
Demo • simpleStreams
Reduce Time of Host-Device Memory Transfer • Zero copy • Allows device pointers to access page-locked host memory directly • Page-locked host memory is allocated by cudaHostAlloc()
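A zero-copy sketch: the kernel dereferences a device pointer that maps the page-locked host buffer, so no explicit cudaMemcpy is needed (`myKernel`, `grid`, `block`, and `bytes` are illustrative):

```cuda
cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped memory before first use

float *hData, *dData;
cudaHostAlloc(&hData, bytes, cudaHostAllocMapped);  // page-locked, mapped
cudaHostGetDevicePointer(&dData, hData, 0);         // device-side alias

myKernel<<<grid, block>>>(dData);  // reads/writes host memory directly
cudaDeviceSynchronize();
cudaFreeHost(hData);
```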
Demo • Zero copy
Reduce Number of On-chip and Off-chip Memory Transactions • Threads in a warp access global memory • Memory coalescing • Combine the warp's accesses into as few memory transactions as possible
Memory Coalescing • Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)
Memory Coalescing • Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing • Memory addresses are not aligned in the same segment
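The access patterns above can be contrasted with two illustrative kernels (a sketch, assuming `in` and `out` are suitably sized device arrays):

```cuda
// Coalesced: consecutive threads read consecutive 4-byte words, so a
// warp's 32 accesses fall in the same aligned segment.
__global__ void coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Not coalesced: with a large stride, each thread's access lands in a
// different segment, multiplying the number of memory transactions.
__global__ void strided(float *out, const float *in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```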
Shared Memory • 16 banks for compute capability 1.x, 32 banks for compute capability 2.x • Helps utilize memory coalescing • Bank conflicts may occur • Two or more threads in a warp access different addresses in the same bank • In compute capability 1.x, no broadcast • In compute capability 2.x, the same word is broadcast to all threads that request it
Bank Conflicts [Figure: threads mapped to shared-memory banks — left, no bank conflict (threads 0–3 each access a different bank); right, 2-way bank conflict (two threads access the same bank)]
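A common trick for avoiding bank conflicts is to pad a shared-memory tile with one extra column so that column-wise accesses hit different banks. A sketch (the transpose kernel and `TILE` size are illustrative, assuming n is a multiple of TILE):

```cuda
#define TILE 32

__global__ void transposeTile(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column breaks the conflict pattern

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y]; // column read, no conflicts
}
```

Without the `+ 1` padding, every thread of a warp reading a column would hit the same bank, serializing the accesses.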
Matrix Multiplication Example • Reduce accesses to global memory • A is read only (B.width/BLOCK_SIZE) times • B is read only (A.height/BLOCK_SIZE) times
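The tiling idea can be sketched as follows (illustrative kernel, assuming square n×n matrices with n a multiple of BLOCK_SIZE): each block stages one tile of A and one tile of B in shared memory, so each global-memory word is loaded once per block that needs it rather than once per thread.

```cuda
#define BLOCK_SIZE 16

__global__ void matMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; t++) {
        // each thread loads one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                     // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with this tile
    }
    C[row * n + col] = sum;
}
```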
Demo • Matrix Multiplication • With and without shared memory • Different block sizes
Control Flow • if, switch, do, for, while • Branch divergence in a warp • Threads in a warp take different execution paths • The different paths are serialized • Increases the number of instructions executed by that warp
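Two illustrative kernels contrast a divergent branch with a warp-uniform one (a sketch; the even/odd split is hypothetical):

```cuda
// Divergent: within every warp, even and odd lanes take different
// paths, so the two branches execute serially.
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Uniform: branching on a warp-aligned quantity means all 32 threads
// of a warp take the same path, so nothing is serialized.
__global__ void uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```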
Summary • 5 steps for CUDA Programming • NVIDIA Hardware Architecture • Memory hierarchy: global memory, shared memory, register file • Specifications of a device: block, warp, thread, SM
Summary • Memory optimization • Reduce overhead due to host-device memory transfer with CUDA streams and zero copy • Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory) • Try to avoid bank conflicts in shared memory • Control flow • Try to avoid branch divergence in a warp
References • http://docs.nvidia.com/cuda/cuda-c-programming-guide/ • http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ • http://www.developer.nvidia.com/cuda-toolkit