1 / 51

Basic CUDA Programming

Basic CUDA Programming. Computer Architecture 2014 (Prof. Chih -Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang (chenyen.yang@gmail.com). From Graphics to General Purpose Processing – CPU vs GPU. CPU : general purpose computation (SISD) GPU : data-parallel computation (SIMD).

tobias
Download Presentation

Basic CUDA Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Basic CUDA Programming Computer Architecture 2014 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang (chenyen.yang@gmail.com)

  2. From Graphics to General Purpose Processing – CPU vs GPU • CPU:general purpose computation (SISD) • GPU:data-parallel computation (SIMD)

  3. What is CUDA? • Compute Unified Device Architecture • Hardware or software? • A programming model • A parallel computing platform

  4. Heterogeneous computing: CPU+GPUCooperation Host-Device Architecture: CPU (host) GPU w/ local DRAM (device)

  5. Heterogeneous computing:NVIDIA CUDA Compiler (NVCC) • NVCC separates CPU and GPU source code into two parts. • For host codes, NVCC invokes typical C compiler like GCC, Intel C compiler, or MS C compiler. • All the device codes are compiled by NVCC. • The extension of device source files should be “.cu”. • All executable with CUDA code requires: • CUDA core library (cuda) • CUDA runtime library (cudart)

  6. Heterogeneous computing:CUDA Code Execution (1/2)

  7. Heterogeneous computing:CUDA Code Execution (2/2)

  8. Heterogeneous computing:NIVIDA G80 series • Texture Processor Cluster (TPC) • Streaming Multiprocessor (SM) • Streaming Processor (SP) / CUDA core • Special Function Unit (SFU) • For NV G8800(G80), the number of SPs is 128.

  9. Texture Texture Texture Texture Texture Texture Texture Texture Texture Host Input Assembler Thread Execution Manager Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Load/store Load/store Load/store Load/store Load/store Load/store Global Memory Heterogeneous computing:NIVIDA G80 series –CUDA mode

  10. CUDA Programming Model (1/7) • Define • Programming model • Memory model • Help developers map the current applications or algorithms onto CUDA devices more easily and clearly. • NVIDIA GPUs have different architecture compared with common CPUs. It is important to follow CUDA’s programming model to obtain higher performance of program execution.

  11. CUDA Programming Model (2/7) • C/C++ for CUDA • Subset of C with extensions • C++ templates for GPU code • CUDA goals: • Scale code to 100s of cores and 1000s of parallel threads. • Facilitate heterogeneous computing: CPU + GPU • CUDA defines: • Programming model • Memory model

  12. CUDA Programming Model (3/7) • CUDA Kernels and Threads: • Parallel portions of an application are executed on the device as kernels. And only one kernel is executed at a time. • All the threads execute the same kernel at a time. • Differences between CUDA and GPU threads • CUDA threads are extremely lightweight • Very little creation overhead • Fast switching • CUDA uses 1000s of threads to achieve efficiency • Multi-core CPUs can only use a few

  13. CUDA Programming Model (4/7) Arrays of Parallel Threads: • A CUDA kernel is executed by an array of threads • All threads run the same code • Each thread has an ID that it uses to compute memory addresses and make control decisions

  14. CUDA Programming Model (5/7) Thread Batching: • Kernel launches a grid of thread blocks • Threads within a block cooperate via shared memory • Threads in different blocks cannot cooperate • Allows programs to transparently scale to different GPUs

  15. CUDA Programming Model (6/7) CUDA Programming Model: • A kernel is executed by a grid of thread blocks • Block can be 1D or 2D. • A thread block is a batch of threads • Thread can be 1D or 2D or 3D. • Data can be shared through shared memory • Execution synchronization • But threads from different blocks can’t cooperate.

  16. CUDA Programming Model (7/7) Memory Model: • Registers • Per thread • Data lifetime = thread lifetime • Local memory • Per thread off-chip memory (physically in device DRAM) • Data lifetime = thread lifetime • Shared memory • Per thread block on-chip memory • Data lifetime = block lifetime • Global (device) memory • Accessible by all threads as well as host (CPU) • Data lifetime = from allocation to deallocation • Host (CPU) memory • Not directly accessible by CUDA threads

  17. CUDA C Basic

  18. GPU Memory Allocation/Release • Memory allocation on GPU • cudaMalloc(void **pointer, size_tnbytes) • Preset value for specific memory area • cudaMemset(void *pointer, int value, size_t count) • Release memory allocation • cudaFree(void *pointer) int n = 1024; intnbytes = 1024*sizeof(int); int *d_a = 0; cudaMalloc( (void**)&d_a, nbytes ); cudaMemset( d_a, 0, nbytes); cudaFree(d_a);

  19. Data Copies • cudaMemcpy(void *dst, void *src, size_tnbytes, enumcudaMemcpyKind direction); • direction specifies locations (host or device) of srcand dst • Blocks CPU thread: returns after the copy is complete • Doesn’t start copying until previous CUDA calls complete • enumcudaMemcpyKind • cudaMemcpyHostToDevice • cudaMemcpyDeviceToHost • cudaMemcpyDeviceToDevice

  20. Function Qualifiers • __global__ : invoked from within host (CPU) code, cannot be called from device (GPU) code must return void • __device__ : called from other GPU functions, cannot be called from host (CPU) code • __host__ : can only be executed by CPU, called from host

  21. Variable Qualifiers (GPU code) • __device__ • Stored in device memory (large capacity, high latency, uncached) • Allocated with cudaMalloc (__device__ qualifier implied) • Accessible by all threads • Lifetime: application • __shared__ • Stored in on-chip shared memory (SRAM, low latency) • Allocated by execution configuration or at compile time • Accessible by all threads in the same thread block • Lifetime: duration of thread block • Unqualified variables: • Scalars and built-in vector types are stored in registers • Arrays may be in registers or local memory (registers are not addressable)

  22. CUDA Built-in Device Variables • All __global__ and __device__ functions have access to these automatically defined variables • dim3 gridDim; • Dimensions of the grid in blocks (at most 2D) • dim3 blockDim; • Dimensions of the block in threads • dim3 blockIdx; • Block index within the grid • dim3 threadIdx; • Thread index within the block

  23. Executing Code on the GPU • Kernels are C functions with some restrictions • Can only access GPU memory • Must have void return type • No variable number of arguments (“varargs”) • Not recursive • No static variables • Function arguments automatically copied from CPU to GPU memory

  24. Launching Kernels • Modified C function call syntax: • kernel<<<dim3 grid, dim3 block>>>(…); • Execution configuration (“<<< >>>”): • Grid dimensions: x and y • Thread-block dimensions: x, y, and z dim3 grid(16,16); dim3 block(16,16); kernel1<<<grid, block>>>(…); kernel2<<<32, 512>>>(…);

  25. Data Decomposition • Often want each thread in kernel to access a different element of an array. blockIdx.x blockDim.x = 5 threadIdx.x 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 idx = blockIdx.x*blockDim.x + threadIdx.x

  26. Data Decomposition Example:Increment Array Elements • Increment N-element vector a by scalar b CPU programvoid increment_cpu(float *a, float *b, int N) { for(intidx=0;idx<N;idx++) a[idx]=a[idx]+b;} void main() { … increment_cpu(a,b,N); } CUDA program__global__ void increment_gpu(float *a, float *b, int N) { intidx = blockIdx.x*blockDim.x+threadIdx.x; if(idx<N) a[idx]=a[idx]+b;} void main() { … dim3 dimBlock(blocksize); dim3 dimGrid(ceil(N/(float)blocksize)); increment_gpu<<<dimGrid,dimBlock>>>(a,b,N); }

  27. LAB0: Setup CUDA Environment & Device Query

  28. CUDA Environment Setup • Check your NVIDIA GPU • Compute capability • GPU’s generation • https://developer.nvidia.com/cuda-gpus • Download CUDA Development files • https://developer.nvidia.com/cuda-toolkit-archive • CUDA driver • CUDA toolkit • PC Room ED417B used version 2.3. • CUDA SDK (optional) • Install CUDA • Test CUDA • Device Query (check in the sample codes)

  29. Setup CUDA for MS Visual Studio(WindowsXP/VS2005~2010) In PC Room ED417B : • CUDA device: NV 8400GS • CUDA toolkit version: 2.0 or 2.3 • VS20005 or VS2008 • From scratch • http://forums.nvidia.com/index.php?showtopic=30273 • CUDA VS Wizard • http://sourceforge.net/projects/cudavswizard/ • Modified from existing project

  30. Setup CUDA for MS Visual Studio 2012 (Win7/VS2012) • MS Visual Studio 2012 installed • Download and install CUDA toolkit 5.0 • https://developer.nvidia.com/cuda-toolkit-50-archive • Follow the instructions on this blog • http://blog.norture.com/2012/10/gpu-parallel-programming-in-vs2012-with-nvidia-cuda/ • You can start to write a CUDA program by creating new project from VS2012’s template. • If you have previous version of VS projects, it is recommended to re-create the project by using VS2012’s new template.

  31. CUDA Device Query Example:(CUDA toolkit 6.0)

  32. Lab1: First CUDA Program Yet Another Random Number Sequence Generator

  33. Yet Another Random Number Sequence Generator • Implemented by CPU and GPU • Functionality: • Given an random integer array A holding 8192elements • Generated by rand() • Re-generate random number by multiplying itself 256 times without regard to overflow occurred. • B[i]new = B[i]old*A[i] (CPU) • C[i]new= C[i]old*A[i] (GPU) • Check the consistency between the results of CPU and GPU.

  34. Data Manipulation between Host and Device • cudaError_tcudaMalloc( void** devPtr, size_tcount ) • Allocates count bytes of linear memory on the device and return in *devPtr as a pointer to the allocated memory • cudaError_tcudaMemcpy( void* dst, const void* src, size_tcount, enumcudaMemcpyKindkind) • Copies count bytes from memory area pointed to by src to the memory area pointed to by dst • kind indicates the type of memory transfer • cudaMemcpyHostToHost • cudaMemcpyHostToDevice • cudaMemcpyDeviceToHost • cudaMemcpyDeviceToDevice • cudaError_tcudaFree( void* devPtr ) • Frees the memory space pointed to by devPtr

  35. Now, go and finish your first CUDA program !!!

  36. Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip • Open project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj ) • main.cu • Random input generation, output validation, result reporting • device.cu • Lunch GPU kernel, GPU kernel code • parameter.h • Fill in appropriate APIs • GPU_kernel() in device.cu

  37. Lab2: Make the Parallel Code Faster Yet Another Random Number Sequence Generator

  38. Parallel Processing in CUDA • Parallel code can be partitioned into blocks and threads • cuda_kernel<<<nBlk, nTid>>>(…) • Multiple tasks will be initialized, each with different block id and thread id • The tasks are dynamically scheduled • Tasks within the same block will be scheduled on the same stream multiprocessor • Each task take care of single data partition according to its block id and thread id

  39. Locate Data Partition by Built-in Variables • Built-in Variables • gridDim • x, y • blockIdx • x, y • blockDim • x, y, z • threadIdx • x, y, z

  40. Data Partition for Previous Example When processing 64 integer data: cuda_kernel<<<2, 2>>>(…) int total_task = gridDim.x * blockDim.x ; int task_sn = blockIdx.x * blockDim.x + threadIdx.x ; int length = SIZE / total_task ; int head = task_sn * length ;

  41. Processing Single Data Partition

  42. Parallelize Your Program !!!

  43. Partition kernel into threads • Increase nTid from 1 to 512 • Keep nBlk = 1 • Group threads into blocks • Adjust nBlk and see if it helps • Maintain total number of threads below 512, and make sure that 8192 can be divisible by that number. • e.g. nBlk * nTid < 512

  44. Lab3: Resolve Memory Contention Yet Another Random Number Sequence Generator

  45. Parallel Memory Architecture • Memory is divided into banks to achieve high bandwidth • Each bank can service one address per cycle • Successive 32-bit words are assigned to successive banks

  46. Lab2 Review When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)

  47. How about Interleave Accessing? When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)

  48. Implementation of Interleave Accessing cuda_kernel<<<1, 4>>>(…) • head = task_sn • stripe = total_task

  49. Improve Your Program !!!

  50. Modify original kernel code in interleaving manner • cuda_kernel() in device.cu • Adjusting nBlk and nTid as in Lab2 and examine the effect • Maintain total number of threads below 512, and make sure that 8192 can be divisible by that number. • e.g. nBlk * nTid < 512

More Related