
CUDA - 101


Presentation Transcript


  1. CUDA - 101 Basics

  2. Overview • What is CUDA? • Data Parallelism • Host-Device model • Thread execution • Matrix-multiplication

  3. GPU revisited!

  4. What is CUDA? • Compute Unified Device Architecture • Programming interface to the GPU • Supports C/C++ and Fortran natively • Third-party wrappers for Python, Java, MATLAB, etc. • Various libraries available • cuBLAS, cuFFT and many more… • https://developer.nvidia.com/gpu-accelerated-libraries

  5. CUDA computing stack

  6. CUDA computing stack

  7. CUDA computing stack

  8. CUDA computing stack

  9. Data Parallel programming • A kernel maps each input element i1, i2, … iN to a corresponding output element o1, o2, … oN

  10. Data parallel algorithm • Dot product: C = A · B • The kernel processes every pair (Ai, Bi) in parallel, producing Ci (a sketch follows below)
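A minimal sketch of the per-element step the diagram suggests (names and signature are hypothetical, not from the slides): each thread combines one pair A[i], B[i] into C[i]; a full dot product would still need the partial results summed afterwards.

    __global__ void elementwiseProduct(const float *A, const float *B, float *C, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element index
        if (i < N)                                       // guard: the grid may overshoot N
            C[i] = A[i] * B[i];                          // one output element per thread
    }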

  11. Host-Device model CPU (Host) GPU (Device)

  12. Threads • A thread is an instance of the kernel program • Threads are independent of each other in a data parallel model • Each can be executed on a different core • The host tells the device to run a kernel program • and how many threads to launch (see the launch example below)
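A hedged example of how the host expresses the thread count at launch time, assuming the elementwiseProduct sketch above and device arrays d_A, d_B, d_C of length N already set up:

    int threadsPerBlock = 256;                                   // a common block size choice
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // enough blocks to cover all N elements
    elementwiseProduct<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, N);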

  13. Matrix-Multiplication

  14. CPU-only Matrix Multiplication • For all elements of P, execute this code (listing sketched below)
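The slide's code listing did not survive the transcript; below is a plausible CPU-only version, assuming square width × width matrices stored in flat arrays with the indexing convention of the next slide (i = column, j = row):

    void matrixMulCPU(const float *M, const float *N, float *P, int width)
    {
        for (int row = 0; row < width; ++row) {
            for (int col = 0; col < width; ++col) {
                float sum = 0.0f;
                for (int k = 0; k < width; ++k)                      // dot product of M's row and N's column
                    sum += M[k + row * width] * N[col + k * width];
                P[col + row * width] = sum;                          // P(col, row)
            }
        }
    }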

  15. Memory Indexing in C (and CUDA) M(i, j) = M[i + j * width]
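Reading i as the column (x) index and j as the row (y) index — an assumption the slide does not state — a width of 4 puts element (i = 1, j = 2) at flat index 1 + 2 * 4 = 9. Wrapped in a hypothetical helper macro:

    #define MAT_AT(M, i, j, width) ((M)[(i) + (j) * (width)])   // i = column, j = row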

  16. CUDA version - I

  17. CUDA program flow • Allocate input and output memory on host • Do the same for device • Transfer input data from host -> device • Launch kernel on device • Transfer output data from device -> host

  18. Allocating Device memory • The host tells the device when to allocate and free memory on the device • Functions called from the host program • cudaMalloc(pointer to device memory pointer, size in bytes) • cudaFree(device memory pointer)
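A minimal sketch for one width × width float matrix (variable names hypothetical); cudaMalloc returns the device pointer through its first argument:

    float *d_M = NULL;
    size_t bytes = width * width * sizeof(float);
    cudaMalloc((void **)&d_M, bytes);   // allocate on the device
    /* ... use d_M ... */
    cudaFree(d_M);                      // release when done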

  19. Transfer Data to/from device • Again, the host tells the device when to transfer data • cudaMemcpy(destination, source, size, direction flag)
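The flag is the copy direction (a cudaMemcpyKind value). For example, assuming h_M and h_P are host buffers and d_M and d_P their device counterparts of the same size:

    cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);   // input: host -> device
    cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);   // output: device -> host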

  20. CUDA version - 2 • Allocate matrix M on device • Transfer M from host -> device • Allocate matrix N on device • Transfer N from host -> device • Allocate matrix P on device • Execute kernel on device • Transfer P from device -> host • Free device memory for M, N and P
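Putting the steps together, a hedged host-side sketch (names hypothetical, error checking omitted). A single block of width × width threads matches the simple kernel on the next slide, which limits width to 32 on current hardware (1024 threads per block):

    void matrixMulOnDevice(const float *h_M, const float *h_N, float *h_P, int width)
    {
        size_t bytes = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        cudaMalloc((void **)&d_M, bytes);                        // allocate M on device
        cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);     // transfer M host -> device
        cudaMalloc((void **)&d_N, bytes);                        // allocate N on device
        cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);     // transfer N host -> device
        cudaMalloc((void **)&d_P, bytes);                        // allocate P on device

        dim3 block(width, width);                                // one thread per element of P
        matrixMulKernel<<<1, block>>>(d_M, d_N, d_P, width);     // execute kernel on device

        cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);     // transfer P device -> host
        cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);             // free device memory
    }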

  21. Matrix Multiplication Kernel • The kernel specifies the function to be executed on the device • Parameters: device memory pointers and width • One thread per element of the output matrix P • Each thread computes the dot product of M's row and N's column • and writes it at its own location (sketch below)
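The kernel listing itself is missing from the transcript; a sketch consistent with the description (one thread per element of P, a single block, and the indexing convention from slide 15):

    __global__ void matrixMulKernel(const float *M, const float *N, float *P, int width)
    {
        int col = threadIdx.x;                    // x -> column of P
        int row = threadIdx.y;                    // y -> row of P

        float sum = 0.0f;
        for (int k = 0; k < width; ++k)           // dot product of M's row and N's column
            sum += M[k + row * width] * N[col + k * width];

        P[col + row * width] = sum;               // write at this thread's location
    }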

  22. Extensions : Function qualifiers
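The qualifier table did not survive the transcript; the three CUDA C function qualifiers are (example declarations, names hypothetical):

    __global__ void myKernel(float *data);   // runs on the device, launched from the host
    __device__ float helper(float x);        // runs on the device, callable from device code
    __host__   float onHost(float x);        // runs on the host (the default for plain C functions)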

  23. Extensions : Thread indexing • All threads execute the same code • But they need to work on separate data in memory • threadIdx.x & threadIdx.y • These built-in variables automatically receive the corresponding values for each thread

  24. Thread Grid • Represents group of all threads to be executed for a particular kernel • Two level hierarchy • Grid is composed of Blocks • Each Block is composed of threads
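A hedged example of a two-level launch (the 16 × 16 tile size is an arbitrary choice). The simple single-block kernel above would then need to derive its position from both levels, as shown in the comments:

    dim3 block(16, 16);                                    // threads per block
    dim3 grid((width + 15) / 16, (width + 15) / 16);       // blocks per grid, covering all of P
    matrixMulKernel<<<grid, block>>>(d_M, d_N, d_P, width);

    // Inside the kernel, the global position combines block and thread indices:
    //   int col = blockIdx.x * blockDim.x + threadIdx.x;
    //   int row = blockIdx.y * blockDim.y + threadIdx.y;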

  25. Thread Grid • Threads indexed (x, y) from (0, 0) to (width - 1, width - 1), one per element of P

  26. Conclusion • Sample code and tutorials • CUDA nodes? • Programming guide • http://docs.nvidia.com/cuda/cuda-c-programming-guide/ • SDK • https://developer.nvidia.com/cuda-downloads • Available for Windows, Mac and Linux • Lots of sample programs

  27. Questions?
