GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/
Overview
• Dynamic Parallelism (CUDA 5+)
• GPU Object Linking (CUDA 5+)
• Unified Memory (CUDA 6+)
• Other Developer Tools
Dynamic Parallelism
[Diagram: the CPU launches Kernel A; Kernel A then launches Kernels B, C and D directly on the GPU]
• Before CUDA 5, kernels could only be launched from the host
• Limited ability to perform recursive functions
• Dynamic Parallelism allows kernels to be launched from the device
• Improved load balancing
• Deep recursion
An Example

//Host Code
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

//Kernel Code
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}
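A minimal compilable sketch of the launch-from-device pattern above (the kernel names, sizes and data are illustrative, not from the slides; dynamic parallelism requires a compute capability 3.5+ GPU and compilation with something like `nvcc -arch=sm_35 -rdc=true`):

```cuda
#include <cstdio>

// Child kernel: launched from the device rather than the host
__global__ void childKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

// Parent kernel: a single thread launches the child directly on the GPU
__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<n / 256, 256>>>(data);
}

int main()
{
    const int n = 1024;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    parentKernel<<<1, 1>>>(data, n);
    cudaDeviceSynchronize();   // wait for parent and child to finish
    cudaFree(data);
    return 0;
}
```

The child launch queues work from within device code, so no round trip to the CPU is needed between the two kernels.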
GPU Object Linking
[Diagram: a.cu, b.cu and c.cu are compiled separately to a.o, b.o and c.o, then linked with Main.cpp into Program.exe]
• CUDA 4 required a single source file for a single kernel
• No linking of compiled device code
• CUDA 5.0+ allows different object files to be linked
• Kernels and host code can be built independently
GPU Object Linking
[Diagram: a.cu and b.cu are built into a static library ab.culib, which is linked with foo.cu and Main.cpp into Program.exe, and with bar.cu and Main2.cpp into Program2.exe]
• Objects can also be built into static libraries
• Shared by different sources
• Much better code reuse
• Reduces compilation time
• Closed-source device libraries
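A sketch of what separate device compilation looks like in practice (file and function names are illustrative; the key ingredient is the `-rdc=true` flag, which emits relocatable device code that the device linker can resolve across object files):

```cuda
// kernels.cu -- device code compiled on its own
__global__ void scale(float *v, float s)
{
    v[threadIdx.x] *= s;
}

// launcher.cu -- a separate translation unit that only declares the kernel
__global__ void scale(float *v, float s);   // declaration, definition is in kernels.cu

void launch_scale(float *v, float s)
{
    scale<<<1, 256>>>(v, s);
}

// Build each object independently, then link:
//   nvcc -rdc=true -c kernels.cu
//   nvcc -rdc=true -c launcher.cu
//   nvcc kernels.o launcher.o -o program
```

Without relocatable device code, the kernel definition and every call to it would have to live in the same .cu file, which is exactly the CUDA 4 restriction the slide describes.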
Unified Memory
[Diagram: left, the CPU with System Memory and the GPU with GPU Memory as separate spaces; right, the CPU and GPU sharing a single Unified Memory space]
• Developer view is that GPU and CPU have separate memory
• Memory must be explicitly copied
• Deep copies required for complex data structures
• Unified Memory changes that view
• Single pointer to data accessible anywhere
• Simpler code porting
Unified Memory Example

//CPU-only code
void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

//CUDA 6 code with Unified Memory
void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
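A self-contained sketch of the same idea with a trivial kernel in place of the sort (kernel name and sizes are illustrative; requires CUDA 6+ and a Kepler-class GPU or newer). The point is that one `cudaMallocManaged` pointer is read and written by both host and device with no `cudaMemcpy`:

```cuda
#include <cstdio>

__global__ void increment(char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 256;
    char *data;
    cudaMallocManaged(&data, n);       // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i)
        data[i] = 0;                   // host writes through the same pointer

    increment<<<1, n>>>(data, n);
    cudaDeviceSynchronize();           // host must wait before touching the data again

    printf("data[0] = %d\n", data[0]); // host reads the result directly
    cudaFree(data);
    return 0;
}
```

The `cudaDeviceSynchronize()` call mirrors the slide's example: after a kernel has touched managed memory, the host must synchronise before accessing it.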
Other Developer Tools
• XT and drop-in libraries
  • cuFFT and cuBLAS optimised for multi-GPU (on the same node)
• GPUDirect
  • Direct transfer between GPUs (cutting out the host)
  • Support for direct transfer via InfiniBand (over a network)
• Developer tools
  • Remote development using Nsight Eclipse
  • Enhanced Visual Profiler