CUDA - 101 Basics
Overview • What is CUDA? • Data Parallelism • Host-Device model • Thread execution • Matrix multiplication
What is CUDA? • Compute Unified Device Architecture • Programming interface to the GPU • Supports C/C++ and Fortran natively • Third-party wrappers for Python, Java, MATLAB, etc. • Various libraries available • cuBLAS, cuFFT and many more… • https://developer.nvidia.com/gpu-accelerated-libraries
Data Parallel programming [Diagram: inputs i1, i2, i3, …, iN are each fed through the same kernel to produce outputs o1, o2, o3, …, oN]
Data parallel algorithm • Dot product: C = A . B [Diagram: each pair (Ai, Bi) is fed through the kernel independently to produce Ci, so all elements can be computed in parallel]
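A minimal CUDA C sketch of the per-element stage of this idea (the kernel and variable names are illustrative, and a single block of N threads is assumed): each thread multiplies its own pair of elements, and summing those products to finish the dot product would be a separate reduction step not shown on the slide.

// Illustrative sketch: thread i computes C[i] = A[i] * B[i].
// Summing the per-element products (the reduction) is not shown here.
__global__ void pairwiseProduct(const float *A, const float *B, float *C, int N)
{
    int i = threadIdx.x;        // one thread per element pair, single block assumed
    if (i < N)
        C[i] = A[i] * B[i];
}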
Host-Device model • CPU (Host) • GPU (Device)
Threads • A thread is an instance of the kernel program • Independent in a data parallel model • Each can be executed on a different core • Host tells the device to run a kernel program • And how many threads to launch (see the sketch below)
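A small sketch of what that looks like in CUDA C (the kernel name, d_data and N are illustrative): the same kernel code runs once per thread, and the host decides how many threads to launch.

// Each of the N launched threads is an independent instance of this kernel.
__global__ void scale(float *data, float factor)
{
    data[threadIdx.x] *= factor;     // each thread updates its own element
}

// Host side: run the kernel with 1 block of N threads
// (assumes N fits within the per-block thread limit).
// scale<<<1, N>>>(d_data, 2.0f);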
CPU-only Matrix Multiplication • For all elements of P, execute this code (a C sketch follows below)
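The code on the original slide is not reproduced here; a plain C sketch of the same idea, using the M(i, j) = M[i + j * width] linearization introduced on the next slide, would look roughly like this:

// CPU-only version: for every element of P, take the dot product of a row of M
// and a column of N (all matrices are width x width, stored as flat arrays).
void MatrixMulOnHost(const float *M, const float *N, float *P, int width)
{
    for (int i = 0; i < width; ++i) {
        for (int j = 0; j < width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[i + k * width] * N[k + j * width];
            P[i + j * width] = sum;
        }
    }
}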
Memory Indexing in C (and CUDA) • A 2D matrix is stored as a flat 1D array: M(i, j) = M[i + j * width]
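The same mapping can be written as a small helper if that makes the code easier to read (ELEM is a hypothetical name, not from the slides):

// The slide's 2D-to-1D mapping: element (i, j) lives at offset i + j * width.
#define ELEM(M, i, j, width) ((M)[(i) + (j) * (width)])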
CUDA program flow • Allocate input and output memory on host • Do the same for device • Transfer input data from host -> device • Launch kernel on device • Transfer output data from device -> host
Allocating Device memory • Host tells the device when to allocate and free device memory • Functions called from the host program • cudaMalloc(memory reference, size) • cudaFree(memory reference)
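In practice the "memory reference" passed to cudaMalloc is the address of a device pointer; a short fragment (d_M, width and size are illustrative names):

size_t size = width * width * sizeof(float);
float *d_M = NULL;

cudaMalloc((void **)&d_M, size);   // allocate width x width floats on the device
/* ... use d_M in kernel launches ... */
cudaFree(d_M);                     // release the device memory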
Transfer Data to/from device • Again, host tells device when to transfer data • cudaMemcpy(target, source, size, flag)
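The flag gives the copy direction; a fragment using illustrative names (h_M/h_P are host buffers, d_M/d_P are device buffers, size is the byte count from the previous fragment):

cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);   // input: host -> device
/* ... launch the kernel ... */
cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);   // output: device -> host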
CUDA version - 2 [Diagram: host memory vs. device memory] • Allocate matrix M on device • Transfer M from host -> device • Allocate matrix N on device • Transfer N from host -> device • Allocate matrix P on device • Execute kernel on device • Transfer P from device -> host • Free device memories for M, N and P
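A host-side sketch that follows these steps (function and variable names are illustrative; MatrixMulKernel is the kernel sketched on the next slide):

void MatrixMulOnDevice(const float *h_M, const float *h_N, float *h_P, int width)
{
    size_t size = width * width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate M and N on the device and transfer them from the host.
    cudaMalloc((void **)&d_M, size);
    cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_N, size);
    cudaMemcpy(d_N, h_N, size, cudaMemcpyHostToDevice);

    // Allocate the output matrix P on the device.
    cudaMalloc((void **)&d_P, size);

    // Execute the kernel: one block of width x width threads
    // (assumes width * width stays within the per-block thread limit).
    dim3 threads(width, width);
    MatrixMulKernel<<<1, threads>>>(d_M, d_N, d_P, width);

    // Transfer P back to the host and free the device memories.
    cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
}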
Matrix Multiplication Kernel • Kernel specifies the function to be executed on the device • Parameters = device memories, width • Thread = each element of output matrix P • Dot product of M's row and N's column • Write dot product at current location
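A sketch of such a kernel (again illustrative, using the M(i, j) = M[i + j * width] convention): each thread computes the dot product for its own element of P and writes it at its position.

__global__ void MatrixMulKernel(const float *M, const float *N, float *P, int width)
{
    int tx = threadIdx.x;    // this thread's position within the block
    int ty = threadIdx.y;

    // Dot product of M's row and N's column for element (tx, ty).
    float value = 0.0f;
    for (int k = 0; k < width; ++k)
        value += M[tx + k * width] * N[k + ty * width];

    // Write the dot product at the current location of P.
    P[tx + ty * width] = value;
}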
Extensions : Thread indexing • All threads execute the same code • But they need to work on separate memory data • threadIdx.x & threadIdx.y • These variables automatically receive the corresponding values for their threads
Thread Grid • Represents group of all threads to be executed for a particular kernel • Two level hierarchy • Grid is composed of Blocks • Each Block is composed of threads
Thread Grid [Diagram: a width x width grid of thread indices, from (0, 0) through (width-1, 0) and (0, width-1) to (width-1, width-1)]
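When more than one block is used, a thread combines its block index and thread index to find the element it owns; a fragment (illustrative only, since the matrix example above used a single block and needed only threadIdx):

int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x position in the grid
int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y position in the grid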
Conclusion • Sample code and tutorials • CUDA nodes? • Programming guide • http://docs.nvidia.com/cuda/cuda-c-programming-guide/ • SDK • https://developer.nvidia.com/cuda-downloads • Available for Windows, Mac and Linux • Lots of sample programs