GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/

Overview
• Dynamic Parallelism (CUDA 5+)
• GPU Object Linking (CUDA 5+)
• Unified Memory (CUDA 6+)
• Other Developer Tools

Dynamic Parallelism
[Figure: CPU and GPU, with Kernels A, B, C and D]
• Before CUDA 5, kernels (and so all GPU threads) had to be launched from the host
    • Limited ability to perform recursive functions
• Dynamic Parallelism allows kernels to be launched from threads already running on the device
    • Improved load balancing
    • Deep recursion

An Example

//Host Code: kernels are launched from the CPU
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

//Kernel Code: with Dynamic Parallelism a kernel can launch further kernels
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}
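
To make the pattern above concrete, here is a minimal sketch of a device-side launch. The kernel names `parent` and `child`, the segment sizes and the helper `run_dynamic_parallelism_demo` are illustrative assumptions, not part of the original slides. Dynamic Parallelism needs a GPU of compute capability 3.5 or higher and relocatable device code, so a build line along the lines of `nvcc -arch=sm_70 -rdc=true demo.cu -lcudadevrt` is assumed (pick the architecture that matches your GPU).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child kernel: each thread doubles one element of its segment.
__global__ void child(float *segment, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) segment[i] *= 2.0f;
}

// Parent kernel: one thread per segment launches a child grid from the device.
__global__ void parent(float *data, int segments, int segLen)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < segments)
        child<<<(segLen + 127) / 128, 128>>>(data + s * segLen, segLen);
}

// Hypothetical driver: returns the number of wrong elements, or -1 on a CUDA error.
int run_dynamic_parallelism_demo()
{
    const int segments = 4, segLen = 256, n = segments * segLen;
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    parent<<<1, segments>>>(dev, segments, segLen);
    // The parent grid only completes once all of its child grids have completed.
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f) ++wrong;
    return wrong;
}

int main()
{
    int wrong = run_dynamic_parallelism_demo();
    printf("mismatches: %d\n", wrong);
    return wrong != 0;
}
```

Here the host launches only one kernel; the amount of follow-on work is decided on the GPU, which is what enables the load balancing and recursion mentioned earlier.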

GPU Object Linking
[Figure: a.cu, b.cu and c.cu are compiled to a.o, b.o and c.o, then linked with Main.cpp into Program.exe]
• CUDA 4 required a single source file for a single kernel
    • No linking of compiled device code
• CUDA 5.0+ allows different object files to be linked
    • Kernels and host code can be built independently

GPU Object Linking
[Figure: a.cu and b.cu are compiled to a.o and b.o and archived as ab.culib; Main.cpp + foo.cu and Main2.cpp + bar.cu each link against ab.culib to give Program.exe and Program2.exe]
• Objects can also be built into static libraries
    • Shared by different sources
• Much better code reuse
• Reduces compilation time
• Closed source device libraries
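
A rough sketch of what separate compilation looks like in practice follows. It is written as one file so that it compiles as-is, with comments marking where the split into two sources would go; the file names, the `scale` function and the `run_linking_demo` helper are hypothetical. The `nvcc` options shown in the comments (`--device-c`, `--lib`) are the standard separate-compilation flags, but the exact architecture flag is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ---- would live in a.cu (the reusable device code) ----
__device__ float scale(float x) { return 2.0f * x; }

// ---- would live in main.cu (a consumer of that device code) ----
// Only the declaration is visible here; with relocatable device code the
// definition is resolved when the objects are device-linked.
extern __device__ float scale(float x);

__global__ void apply(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

// Hypothetical driver: returns 0 if the linked device function ran correctly.
int run_linking_demo()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    apply<<<(n + 127) / 128, 128>>>(dev, n);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f) return 1;
    return 0;
}

int main()
{
    printf("result: %d\n", run_linking_demo());
    return 0;
}

// Assumed build steps once the code is split into a.cu and main.cu:
//   nvcc --device-c a.cu main.cu            (produces a.o and main.o)
//   nvcc a.o main.o -o program              (device link + host link)
// Or, packaging the device code as a static library first:
//   nvcc --lib a.o --output-file libab.a
//   nvcc main.o libab.a -o program
```

Because `a.o` (or the library) is built once, several programs can link against it without recompiling the device code, which is where the reuse and compile-time savings come from.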

Unified Memory
[Figure: CPU with System Memory and GPU with GPU Memory, alongside CPU and GPU sharing a single Unified Memory]
• Developer view is that GPU and CPU have separate memory
    • Memory must be explicitly copied
    • Deep copies required for complex data structures
• Unified Memory changes that view
    • Single pointer to data accessible anywhere
    • Simpler code porting
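
Unified Memory Example

//CPU code
void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

//CUDA 6 code with Unified Memory
void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);        // one pointer, valid on CPU and GPU
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare); // sort runs as a kernel on the GPU
    cudaDeviceSynchronize();            // wait for the GPU before the CPU reads data
    use_data(data);
    cudaFree(data);                     // managed memory is released with cudaFree, not free
}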
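
The `sortfile` example leaves the kernel and its launch configuration elided, so here is a small self-contained sketch of the same allocate, touch on CPU, touch on GPU, read on CPU cycle. The `increment` kernel and the `run_unified_memory_demo` helper are made up for illustration; only `cudaMallocManaged`, `cudaDeviceSynchronize` and `cudaFree` come from the slide's topic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: add 1 to every byte of the managed buffer.
__global__ void increment(char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical driver: returns the number of wrong bytes, or -1 on a CUDA error.
int run_unified_memory_demo()
{
    const int n = 1024;
    char *data = nullptr;
    if (cudaMallocManaged(&data, n) != cudaSuccess) return -1;

    // The CPU writes through the same pointer the GPU will use.
    for (int i = 0; i < n; ++i) data[i] = 'a';

    increment<<<(n + 255) / 256, 256>>>(data, n);

    // Kernel launches are asynchronous: wait before the CPU reads the data.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != 'b') ++wrong;

    cudaFree(data);
    return wrong;
}

int main()
{
    printf("mismatches: %d\n", run_unified_memory_demo());
    return 0;
}
```

Note that there is no `cudaMemcpy` anywhere: the runtime migrates the data, and the only explicit step the programmer adds is the synchronisation.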

Other Developer Tools
• XT and Drop-in libraries
    • cuFFT and cuBLAS optimised for multi GPU (on the same node)
• GPUDirect
    • Direct transfer between GPUs (cut out the host)
    • To support direct transfer via InfiniBand (over a network)
• Developer Tools
    • Remote development using Nsight Eclipse
    • Enhanced Visual Profiler
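
As a rough illustration of the "direct transfer between GPUs" point, the sketch below copies a buffer from device 0 to device 1 with the CUDA runtime's peer-to-peer calls. The helper name `run_peer_copy_demo`, the buffer size and the choice of devices 0 and 1 are assumptions; whether the copy really bypasses host memory depends on the hardware, which is why `cudaDeviceCanAccessPeer` is queried first. On a machine with fewer than two GPUs the sketch simply reports that it was skipped.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical driver: 0 = copy verified, 1 = skipped (fewer than 2 GPUs), -1 = error.
int run_peer_copy_demo()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return 1;

    const size_t bytes = 1 << 20;
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1 directly?

    char *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    if (cudaMalloc(&src, bytes) != cudaSuccess) return -1;
    cudaMemset(src, 7, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // enable the direct path if supported

    cudaSetDevice(1);
    if (cudaMalloc(&dst, bytes) != cudaSuccess) {
        cudaSetDevice(0);
        cudaFree(src);
        return -1;
    }

    // Copy device 0 -> device 1; the runtime stages through the host
    // only when no direct peer path is available.
    cudaError_t err = cudaMemcpyPeer(dst, 1, src, 0, bytes);

    char first = 0;
    cudaMemcpy(&first, dst, 1, cudaMemcpyDeviceToHost);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);

    return (err == cudaSuccess && first == 7) ? 0 : -1;
}

int main()
{
    printf("peer copy result: %d\n", run_peer_copy_demo());
    return 0;
}
```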