CUDA - 101 Basics
Overview • What is CUDA? • Data Parallelism • Host-Device model • Thread execution • Matrix multiplication
What is CUDA? • Compute Unified Device Architecture • Programming interface to the GPU • Supports C/C++ and Fortran natively • Third-party wrappers for Python, Java, MATLAB, etc. • Various libraries available • cuBLAS, cuFFT and many more… • https://developer.nvidia.com/gpu-accelerated-libraries
Data Parallel programming [Diagram: inputs i1, i2, i3, …, iN are each fed through the same kernel to produce outputs o1, o2, o3, …, oN]
Data parallel algorithm • Dot product: C = A . B [Diagram: each pair (Ai, Bi) is fed through the kernel independently to produce Ci, so all elements can be computed in parallel]
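A minimal CUDA C sketch of the per-element stage of this idea (the kernel and variable names are illustrative, and a single block of N threads is assumed): each thread multiplies its own pair of elements, and summing those products to finish the dot product would be a separate reduction step not shown on the slide.

// Illustrative sketch: thread i computes C[i] = A[i] * B[i].
// Summing the per-element products (the reduction) is not shown here.
__global__ void pairwiseProduct(const float *A, const float *B, float *C, int N)
{
    int i = threadIdx.x;        // one thread per element pair, single block assumed
    if (i < N)
        C[i] = A[i] * B[i];
}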
Host-Device model • CPU (Host) • GPU (Device)
Threads • A thread is an instance of the kernel program • Independent in a data parallel model • Each can be executed on a different core • Host tells the device to run a kernel program • And how many threads to launch (see the sketch below)
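A small sketch of what that looks like in CUDA C (the kernel name, d_data and N are illustrative): the same kernel code runs once per thread, and the host decides how many threads to launch.

// Each of the N launched threads is an independent instance of this kernel.
__global__ void scale(float *data, float factor)
{
    data[threadIdx.x] *= factor;     // each thread updates its own element
}

// Host side: run the kernel with 1 block of N threads
// (assumes N fits within the per-block thread limit).
// scale<<<1, N>>>(d_data, 2.0f);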
CPU-only Matrix Multiplication • For all elements of P, execute this code (a C sketch follows below)
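The code on the original slide is not reproduced here; a plain C sketch of the same idea, using the M(i, j) = M[i + j * width] linearization introduced on the next slide, would look roughly like this:

// CPU-only version: for every element of P, take the dot product of a row of M
// and a column of N (all matrices are width x width, stored as flat arrays).
void MatrixMulOnHost(const float *M, const float *N, float *P, int width)
{
    for (int i = 0; i < width; ++i) {
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[i + k * width] * N[k + j * width];
            P[i + j * width] = sum;
        }
    }
}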
Memory Indexing in C (and CUDA) • A 2D matrix is stored as a flat 1D array: M(i, j) = M[i + j * width]
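The same mapping can be written as a small helper if that makes the code easier to read (ELEM is a hypothetical name, not from the slides):

// The slide's 2D-to-1D mapping: element (i, j) lives at offset i + j * width.
#define ELEM(M, i, j, width) ((M)[(i) + (j) * (width)])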
CUDA program flow • Allocate input and output memory on host • Do the same for device • Transfer input data from host -> device • Launch kernel on device • Transfer output data from device -> host
Allocating Device memory • Host tells the device when to allocate and free device memory • Functions called from the host program • cudaMalloc(memory reference, size) • cudaFree(memory reference)
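In practice the "memory reference" passed to cudaMalloc is the address of a device pointer; a short fragment (d_M, width and size are illustrative names):

size_t size = width * width * sizeof(float);
float *d_M = NULL;

cudaMalloc((void **)&d_M, size);   // allocate width x width floats on the device
/* ... use d_M in kernel launches ... */
cudaFree(d_M);                     // release the device memory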
Transfer Data to/from device • Again, host tells device when to transfer data • cudaMemcpy(target, source, size, flag)
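The flag gives the copy direction; a fragment using illustrative names (h_M/h_P are host buffers, d_M/d_P are device buffers, size is the byte count from the previous fragment):

cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);   // input: host -> device
/* ... launch the kernel ... */
cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);   // output: device -> host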
CUDA version - 2 [Diagram: host memory vs. device memory] • Allocate matrix M on device • Transfer M from host -> device • Allocate matrix N on device • Transfer N from host -> device • Allocate matrix P on device • Execute kernel on device • Transfer P from device -> host • Free device memories for M, N and P
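A host-side sketch that follows these steps (function and variable names are illustrative; MatrixMulKernel is the kernel sketched on the next slide):

void MatrixMulOnDevice(const float *h_M, const float *h_N, float *h_P, int width)
{
    size_t size = width * width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate M and N on the device and transfer them from the host.
    cudaMalloc((void **)&d_M, size);
    cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_N, size);
    cudaMemcpy(d_N, h_N, size, cudaMemcpyHostToDevice);

    // Allocate the output matrix P on the device.
    cudaMalloc((void **)&d_P, size);

    // Execute the kernel: one block of width x width threads
    // (assumes width * width stays within the per-block thread limit).
    dim3 threads(width, width);
    MatrixMulKernel<<<1, threads>>>(d_M, d_N, d_P, width);

    // Transfer P back to the host and free the device memories.
    cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
}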
Matrix Multiplication Kernel • Kernel specifies the function to be executed on the device • Parameters = device memories, width • Thread = each element of output matrix P • Dot product of M's row and N's column • Write dot product at current location
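A sketch of such a kernel (again illustrative, using the M(i, j) = M[i + j * width] convention): each thread computes the dot product for its own element of P and writes it at its position.

__global__ void MatrixMulKernel(const float *M, const float *N, float *P, int width)
{
    int tx = threadIdx.x;    // this thread's position within the block
    int ty = threadIdx.y;

    // Dot product of M's row and N's column for element (tx, ty).
    float value = 0.0f;
    for (int k = 0; k < width; ++k)
        value += M[tx + k * width] * N[k + ty * width];

    // Write the dot product at the current location of P.
    P[tx + ty * width] = value;
}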
Extensions : Thread indexing • All threads execute the same code • But they need to work on separate memory data • threadIdx.x & threadIdx.y • These variables automatically receive the corresponding values for their threads
Thread Grid • Represents group of all threads to be executed for a particular kernel • Two level hierarchy • Grid is composed of Blocks • Each Block is composed of threads
Thread Grid [Diagram: a width x width grid of thread indices, from (0, 0) through (width-1, 0) and (0, width-1) to (width-1, width-1)]
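When more than one block is used, a thread combines its block index and thread index to find the element it owns; a fragment (illustrative only, since the matrix example above used a single block and needed only threadIdx):

int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x position in the grid
int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y position in the grid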
Conclusion • Sample code and tutorials • CUDA nodes? • Programming guide • http://docs.nvidia.com/cuda/cuda-c-programming-guide/ • SDK • https://developer.nvidia.com/cuda-downloads • Available for Windows, Mac and Linux • Lots of sample programs