GPU Programming with CUDA – Optimisation
Mike Griffiths
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/
Hardware Model
[Diagram: the host CPU with its DRAM and I/O, connected over the PCIe bus to the GPU with its GDRAM; the main program code runs on the CPU, while GPU kernel code runs on the GPU's streaming multiprocessors (SMs), each with its own shared memory.]
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching
Data Transfer
• CPU (host) and GPU (device) have separate, dedicated memory
• All data read or written on the device must be copied across the PCIe bus
• This is a very expensive operation
• Optimisation technique: minimise data copies
• Keep data resident on the device
• It may be worth moving some computation to the GPU even if it is not computationally expensive
• It might be quicker to re-calculate data on the device than to copy it
Data Transfer Example

Original code:
  Loop over timesteps
    inexpensive_routine_on_host(data_on_host)
    copy data from host to device
    expensive_routine_on_device(data_on_device)
    copy data from device to host
  End loop over timesteps

Optimised code:
  copy data from host to device
  Loop over timesteps
    inexpensive_routine_on_device(data_on_device)
    expensive_routine_on_device(data_on_device)
  End loop over timesteps
  copy data from device to host

• Port the inexpensive routine to the device
• Minimise transfers by moving the copies out of the loop
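A minimal CUDA C sketch of the optimised version above. The routine names follow the pseudocode; the array size N, the timestep count and the launch configuration are illustrative assumptions, not part of the original slides.

  #include <cuda_runtime.h>

  __global__ void inexpensive_routine_on_device(float *data) { /* ... */ }
  __global__ void expensive_routine_on_device(float *data)   { /* ... */ }

  void run(float *data_on_host, int N, int num_timesteps) {
      float *data_on_device;
      cudaMalloc(&data_on_device, N * sizeof(float));

      // Single host-to-device copy, hoisted out of the timestep loop
      cudaMemcpy(data_on_device, data_on_host, N * sizeof(float),
                 cudaMemcpyHostToDevice);

      for (int t = 0; t < num_timesteps; t++) {
          // Data stays resident on the device between kernel calls
          inexpensive_routine_on_device<<<(N + 255) / 256, 256>>>(data_on_device);
          expensive_routine_on_device<<<(N + 255) / 256, 256>>>(data_on_device);
      }

      // Single device-to-host copy after the loop
      cudaMemcpy(data_on_host, data_on_device, N * sizeof(float),
                 cudaMemcpyDeviceToHost);
      cudaFree(data_on_device);
  }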
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching
Exposing Parallelism
• GPU performance relies on the use of many threads
• The degree of parallelism must be much higher than on the CPU
• Ideally you need many more threads than cores
• Effort must be made to expose as much parallelism as possible
• This may require re-engineering your problem
• If significant sections of code are serial then GPU acceleration will be limited
• Amdahl's Law
Memory Latency
• Access to GPU memory has several hundred cycles of latency
• When a thread is waiting for data it is stalled
• GPUs have very fast context switching
• Stalled threads can be swapped out for active threads
• This switching hides memory latency, provided other threads have computation to perform
• It requires many threads, ideally performing large amounts of computation
• Optimisation technique: have lots of threads with high arithmetic intensity
• Arithmetic intensity is defined as the ratio of arithmetic computation to memory accesses (see the sketch below)
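As a rough illustration (hypothetical kernels, not from the slides), the first kernel below performs roughly one arithmetic operation per element moved through global memory, so it is bound by memory bandwidth; the second performs many operations per element, giving the scheduler far more computation with which to hide memory latency.

  // Low arithmetic intensity: roughly one multiply per element of global
  // memory traffic, so performance is limited by memory bandwidth.
  __global__ void scale(const float *in, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = 2.0f * in[i];
  }

  // Higher arithmetic intensity: many multiply-adds per element loaded,
  // so stalled warps can be hidden behind the computation of other warps.
  __global__ void poly(const float *in, float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          float x = in[i];
          float acc = 0.0f;
          for (int k = 0; k < 64; k++)
              acc = acc * x + 1.0f;   // 64 multiply-adds per element loaded
          out[i] = acc;
      }
  }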
Exposing Parallelism Example

Original code:
  Loop over i from 1 to 512
    Loop over j from 1 to 512
      independent iteration

1D decomposition (512 threads ✖):
  Calc i from thread/block ID
  Loop over j from 1 to 512
    independent iteration

2D decomposition (262,144 threads ✔):
  Calc i & j from thread/block ID
  independent iteration
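A minimal CUDA C sketch of the 2D decomposition above; do_iteration is a hypothetical stand-in for the independent iteration, indices run from 0 to 511 rather than 1 to 512, and the 16 x 16 block size is an illustrative choice rather than something from the slides.

  __device__ void do_iteration(int i, int j) { /* independent work for (i, j) */ }

  __global__ void kernel_2d(void) {
      // Each of the 512 x 512 = 262,144 threads handles one (i, j) pair
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < 512 && j < 512)
          do_iteration(i, j);
  }

  // Host-side launch: a 2D grid of 2D blocks covering the whole iteration space
  void launch_kernel_2d(void) {
      dim3 block(16, 16);               // 256 threads per block (illustrative)
      dim3 grid(512 / 16, 512 / 16);    // 32 x 32 blocks
      kernel_2d<<<grid, block>>>();
  }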
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching
Memory Coalescing
• GPUs have high peak memory bandwidth
• Maximum bandwidth is achieved when data is accessed in large requests rather than many small requests
• Large requests must be formed from multiple threads
• Otherwise memory accesses are serialised, degrading performance
• Memory coalescing: consecutive threads accessing consecutive memory locations
• Optimisation technique: coalesced memory accesses reduce the number of requests and achieve higher bandwidth
Coalescing Example

  index = blockIdx.x*blockDim.x + threadIdx.x;
  output[index] = 2*input[index];

Consecutive threads are those with consecutive threadIdx.x values.
Question: do consecutive threads access consecutive memory locations?
Coalescing Example

  index = blockIdx.x*blockDim.x + threadIdx.x;
  output[index] = 2*input[index];

Consecutive threads are those with consecutive threadIdx.x values.
Question: do consecutive threads access consecutive memory locations?
Yes: consecutive threadIdx.x values lead to consecutive index values, so the accesses are coalesced.
Coalescing Example 2

  i = blockIdx.x*blockDim.x + threadIdx.x;
  for (j=0; j<N; j++)
      output[i][j] = 2*input[i][j];

  j = blockIdx.x*blockDim.x + threadIdx.x;
  for (i=0; i<N; i++)
      output[i][j] = 2*input[i][j];

Question: do consecutive threads access consecutive memory locations?
Coalescing Example 2

  i = blockIdx.x*blockDim.x + threadIdx.x;
  for (j=0; j<N; j++)
      output[i][j] = 2*input[i][j];

  j = blockIdx.x*blockDim.x + threadIdx.x;
  for (i=0; i<N; i++)
      output[i][j] = 2*input[i][j];

Question: do consecutive threads access consecutive memory locations?
First version, no: consecutive threadIdx.x values correspond to consecutive i values, which index different rows and are N elements apart in C's row-major layout.
Second version, yes: consecutive threadIdx.x values correspond to consecutive j values, which are adjacent within a row.
Memory Coalescing in 2D
• What about 2D or 3D decompositions?
• Exactly the same principle applies
• It is always threadIdx.x that increments across consecutive threads, so it should map to the fastest-varying array index
• E.g. matrix addition:

  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  c[i][j] = a[i][j] + b[i][j];
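A complete, hypothetical version of the matrix-addition example, using flattened indexing because dynamically sized 2D arrays cannot be passed to a kernel directly; the block shape and launch helper are illustrative assumptions.

  __global__ void mat_add(const float *a, const float *b, float *c, int n) {
      int j = blockIdx.x * blockDim.x + threadIdx.x;   // column: fastest-varying index
      int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
      // Consecutive threadIdx.x values touch adjacent elements of a row,
      // so the global memory accesses coalesce.
      if (i < n && j < n)
          c[i * n + j] = a[i * n + j] + b[i * n + j];
  }

  // Illustrative host-side launch for n x n matrices already resident on the device
  void launch_mat_add(const float *d_a, const float *d_b, float *d_c, int n) {
      dim3 block(32, 8);
      dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
      mat_add<<<grid, block>>>(d_a, d_b, d_c, n);
  }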
Performance Inhibitors
• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching
Code Branching
• On NVIDIA GPUs there are fewer instruction scheduling units than cores
• Threads are scheduled in groups of 32, called a warp
• Threads within a warp execute the same instruction in lock-step
• Single Instruction Multiple Data (SIMD)
• CUDA C kernels are free to specify branches
• BUT if threads within a warp diverge, the warp has to execute every code path taken by any of its threads
• Optimisation technique: avoid divergent branching within a warp wherever possible
Branching Example
You want to split your threads into two groups:

  i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i%2 == 0)
      ...
  else
      ...

  i = blockIdx.x*blockDim.x + threadIdx.x;
  if ((i/32)%2 == 0)
      ...
  else
      ...

The first version diverges within every warp, since odd and even threads sit side by side; the second branches at warp granularity, so all 32 threads of a warp take the same path.
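A minimal kernel sketch of the warp-aligned split above; group_a, group_b and the kernel name are hypothetical stand-ins for the two branches.

  __device__ void group_a(int i) { /* work for the first group */ }
  __device__ void group_b(int i) { /* work for the second group */ }

  __global__ void split_work(void) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // i/32 is the warp index, so all 32 threads of a warp take the same
      // branch and no warp has to execute both code paths
      if ((i / 32) % 2 == 0)
          group_a(i);
      else
          group_b(i);
  }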
CUDA Profiling
• Set the COMPUTE_PROFILE environment variable to 1
• A log file will be created at runtime, e.g. cuda_profile_0.log
• It contains timing information for kernels and data transfers
• It is possible to output more metrics (see doc/Compute_Profiler.txt)

Example log:
  # CUDA_PROFILE_LOG_VERSION 2.0
  # CUDA_DEVICE 0 Tesla M1060
  # CUDA_CONTEXT 1
  # TIMESTAMPFACTOR fffff6e2e9ee8858
  method,gputime,cputime,occupancy
  method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
  method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
  method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
  method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
  ...
Conclusions
• GPUs offer higher floating-point and memory bandwidth performance than CPUs
• A number of factors can inhibit execution performance
• A number of techniques can be applied to circumvent these
• Some techniques may require re-engineering your problem
• If your application can't be adapted, GPU performance will not be good!
• It is important to have an understanding of the application, the architecture and the programming model