Intermediate GPGPU Programming in CUDA Supada Laosooksathit
NVIDIA Hardware Architecture
(Diagram: NVIDIA hardware architecture, showing the device and its connection to host memory.)
Recall • The 5 steps of CUDA programming (sketched in code below) • Initialize device • Allocate device memory • Copy data to device memory • Execute kernel • Copy data back from device memory
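The five steps map onto runtime API calls roughly as follows. A minimal sketch, assuming a hypothetical vecAdd kernel (it also matches the Vector Add demo later in these slides); error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    cudaSetDevice(0);                                     // 1. initialize device
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);                     // 2. allocate device memory
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // 3. copy data to device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);   // 4. execute kernel
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // 5. copy data back

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}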
Initialize Device Calls • To select the device associated with the host thread • cudaSetDevice(device) • This must be called before any __global__ function call; otherwise device 0 is selected automatically • To get the number of devices • cudaGetDeviceCount(&devicecount) • To retrieve a device's properties • cudaGetDeviceProperties(&deviceProp, device)
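A short sketch of these three calls together; the printed fields are examples of what cudaDeviceProp exposes:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int devicecount = 0;
    cudaGetDeviceCount(&devicecount);          // number of CUDA-capable devices

    cudaSetDevice(0);                          // select device 0 for this host thread

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);   // fill in device 0's properties
    printf("%d device(s); device 0: %s, %d multiprocessors\n",
           devicecount, deviceProp.name, deviceProp.multiProcessorCount);
    return 0;
}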
Hello World Example • Allocate host and device memory
Hello World Example • Host code
Hello World Example • Kernel code
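A minimal sketch of the whole example; device-side printf assumes compute capability 2.0 or higher:

#include <stdio.h>

__global__ void hello(void) {
    // Each thread prints its own coordinates
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello<<<2, 4>>>();          // 2 blocks of 4 threads
    cudaDeviceSynchronize();    // wait for the kernel so the output is flushed
    return 0;
}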
To Try CUDA Programming • SSH to 138.47.102.111 • Set these environment variables in .bashrc in your home directory: export PATH=$PATH:/usr/local/cuda/bin export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH • Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK • Compile the following directories • NVIDIA_GPU_Computing_SDK/shared/ • NVIDIA_GPU_Computing_SDK/C/common/ • The sample code is in NVIDIA_GPU_Computing_SDK/C/src/
Demo • Hello World • Print out block and thread IDs • Vector Add • C = A + B
Specifications of a Device • For more details • deviceQuery in the CUDA SDK • Appendix F of the CUDA C Programming Guide 4.0
Demo • deviceQuery • Show hardware specifications in details
Memory Optimizations • Reduce the time of memory transfer between host and device • Use asynchronous memory transfer (CUDA streams) • Use zero copy • Reduce the number of transactions between on-chip and off-chip memory • Memory coalescing • Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer • Regular (synchronous) memory transfer, as sketched below
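For reference, a sketch of the synchronous pattern, reusing the names from the vector-add sketch above:

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host blocks until the copy completes
vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);   // the launch itself is asynchronous...
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // ...but this copy waits for it, then blocks the host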
Reduce Time of Host-Device Memory Transfer • CUDA streams • Allow kernel execution to overlap with memory copies (see the sketch below)
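A sketch splitting the work across two streams. It assumes the host buffer was allocated page-locked (e.g., with cudaMallocHost), which is required for copies to actually overlap with kernels, and a hypothetical kernel process:

__global__ void process(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

void runTwoStreams(float *h_buf, float *d_buf, int n) {
    cudaStream_t s[2];
    int chunk = n / 2;                          // assume n is even
    size_t chunkBytes = chunk * sizeof(float);
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
    for (int i = 0; i < 2; ++i) {
        int off = i * chunk;
        // Async calls return immediately; work in different streams may overlap
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunkBytes,
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_buf + off, d_buf + off, chunkBytes,
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                    // wait for both streams to drain
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}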
GPU Timers • CUDA events • An API that records timestamps using the GPU clock • Accurate for timing kernel executions • CUDA timer calls • Timing libraries implemented in the CUDA SDK
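A sketch of timing a kernel with CUDA events, again reusing the vector-add names; the timestamps are recorded on the GPU, so the result reflects device execution time:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                   // enqueue a GPU timestamp
vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                  // block the host until `stop` is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);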
Demo • simpleStreams
Reduce Time of Host-Device Memory Transfer • Zero copy • Allows device pointers to access page-locked host memory directly • Page-locked host memory is allocated by cudaHostAlloc()
Demo • Zero copy
Reduce Number of On-chip and Off-chip Memory Transactions • Threads in a warp access global memory together • Memory coalescing • The hardware copies a whole aligned segment of words in a single transaction
Memory Coalescing • Threads in a warp access global memory in a straightforward way (one 4-byte word per thread): the accesses coalesce into a single transaction
Memory Coalescing • Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing • Memory addresses are not aligned in the same segment, so the warp's access is split into multiple transactions (contrast sketched below)
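A sketch contrasting a coalesced access pattern with a strided one:

__global__ void coalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];          // consecutive threads read consecutive 4-byte words:
                             // the warp's accesses fall in one segment
}

__global__ void strided(const float *in, float *out, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];          // threads scatter across segments:
                             // up to one transaction per thread
}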
Shared Memory • 16 banks on compute capability 1.x, 32 banks on compute capability 2.x • Helps with utilizing memory coalescing (stage data in shared memory, then access global memory in a coalesced pattern) • Bank conflicts may occur when two or more threads in a warp access the same bank • On compute capability 1.x there is no broadcast; on compute capability 2.x the same data is broadcast to all threads that request it
Bank Conflicts
(Diagrams: no bank conflict, where threads 0-3 each access a distinct bank 0-3; 2-way bank conflict, where pairs of threads access the same bank.)
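A sketch of both cases in kernel code, assuming 32 banks of 4-byte words (compute capability 2.x) and at most 512 threads per block:

__global__ void bankDemo(float *out) {
    __shared__ float s[1024];
    int tid = threadIdx.x;
    s[tid] = (float)tid;            // stride-1 write: one bank per thread, no conflict
    __syncthreads();

    float noConflict = s[tid];      // stride 1: conflict-free
    float twoWay     = s[2 * tid];  // stride 2: threads tid and tid+16 hit the same bank,
                                    // a 2-way conflict that serializes the access
    out[tid] = noConflict + twoWay;
}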
Matrix Multiplication Example • Tile the matrices in shared memory to reduce accesses to global memory • A is read (B.width/BLOCK_SIZE) times from global memory instead of B.width times • B is read (A.height/BLOCK_SIZE) times instead of A.height times • A sketch of the tiled kernel follows
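A sketch of the tiled kernel for C = A x B, assuming square N x N matrices with N divisible by BLOCK_SIZE:

#define BLOCK_SIZE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / BLOCK_SIZE; ++t) {
        // Each thread stages one element of each tile; these loads are coalesced
        As[threadIdx.y][threadIdx.x] = A[row * N + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();                          // tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)      // inner products read shared memory only
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                          // finish with this tile before reloading
    }
    C[row * N + col] = acc;
}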
Demo • Matrix Multiplication • With and without shared memory • Different block sizes
Control Flow • if, switch, do, for, while • Branch divergence occurs when threads in a warp issue different instruction sets • The different execution paths are serialized • This increases the number of instructions executed by that warp (see the sketch below)
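A sketch of the difference: in the first kernel, even and odd threads of the same warp take different paths and both paths execute serially; in the second, the branch is uniform across each 32-thread warp:

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] *= 2.0f;         // even/odd threads diverge within a warp
    else            x[i] += 1.0f;
}

__global__ void warpUniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;  // all threads of a warp take the same path
    else                   x[i] += 1.0f;
}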
Summary • 5 steps for CUDA Programming • NVIDIA Hardware Architecture • Memory hierarchy: global memory, shared memory, register file • Specifications of a device: block, warp, thread, SM
Summary • Memory optimization • Reduce the overhead of host-device memory transfer with CUDA streams or zero copy • Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (via shared memory) • Try to avoid bank conflicts in shared memory • Control flow • Try to avoid branch divergence in a warp
References • http://docs.nvidia.com/cuda/cuda-c-programming-guide/ • http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ • http://www.developer.nvidia.com/cuda-toolkit