Lecture 8: CUDA
CUDA
• A scalable parallel programming model for GPUs and multicore CPUs
• Provides facilities for heterogeneous programming
• Allows the GPU to be used as both a graphics processor and a computing processor
Pollack’s Rule
• Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity
• Power consumption increase is roughly linearly proportional to the increase in complexity: power consumption ∝ complexity
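As an illustrative calculation (the numbers are chosen for the example, not taken from the slides): quadrupling one core's complexity yields about √4 = 2 times the performance but about 4 times the power. Two cores of the original complexity instead give about 2 times the performance for only 2 times the power, which is the argument for multicore and many-core designs such as GPUs.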
CUDA
• SPMD (Single Program Multiple Data) programming model
• Programmer writes code for a single thread and GPU runs thread instances in parallel
• Extends C and C++
CUDA
Three key abstractions:
• A hierarchy of thread groups
• Shared memories
• Barrier synchronization
CUDA provides fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism.
CUDA
• Kernel: a function designed to be executed by many threads
• Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and through shared-memory access
• Grid: a set of thread blocks that execute the same kernel function
A minimal sketch tying the three terms together follows.
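In the sketch below, the kernel name, N, and the block size of 256 are illustrative choices, not from the lecture:

    #define N 4096

    // Kernel: executed by many threads, one array element per thread
    __global__ void vadd(float *a, float *b, float *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < N) c[i] = a[i] + b[i];
    }

    // Launch: a grid of N/256 thread blocks, each block with 256 threads
    // vadd<<<N/256, 256>>>(da, db, dc);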
CUDA

    __global__ void mykernel(int a, …)
    {
        ...
    }

    int main()
    {
        ...
        int nblocks = N / 512;               // max. 512 threads per block
        mykernel<<<nblocks, 512>>>(aa, …);
        ...
    }
CUDA
• Thread management is performed by hardware
• Max. 512 threads per block
• The number of blocks can exceed the number of processors
• Blocks execute independently and in any order
• Threads of a block can communicate through shared memory
• Atomic memory operations are available on global memory
CUDA Memory Types
• Local memory: private to a thread
• Shared memory: shared by all threads of the block (__shared__)
• Device memory: shared by all threads of an application (__device__)
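A small sketch of how the three kinds of storage appear in code (the kernel is illustrative and assumes it is launched with up to 64 threads per block):

    __device__ int total;                    // device memory: one copy per application

    __global__ void block_sum(int *inp)
    {
        __shared__ int x[64];                // shared memory: one copy per thread block
        int v = inp[blockIdx.x * blockDim.x + threadIdx.x];  // v is per-thread (local)
        x[threadIdx.x] = v;
        __syncthreads();
        if (threadIdx.x == 0) {
            int s = 0;
            for (int i = 0; i < blockDim.x; i++) s += x[i];
            atomicAdd(&total, s);            // all blocks accumulate into device memory
        }
    }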
CUDA

    __global__ void mykernel(float *a, …)
    {
        ...
    }

    int main()
    {
        ...
        int nbytes = N * sizeof(float);
        float *ha = (float*)malloc(nbytes);  // host array
        float *da = 0;                       // device array
        cudaMalloc((void**)&da, nbytes);
        cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
        mykernel<<<N/blocksize, blocksize>>>(da, …);
        cudaMemcpy(ha, da, nbytes, cudaMemcpyDeviceToHost);
        cudaFree(da);
        free(ha);
        ...
    }
CUDA Synchronization
• Barrier: threads wait until all threads in the block arrive at the barrier (__syncthreads())
• A thread that reaches the barrier increments the barrier count and is marked as waiting by the scheduler. When all threads of the block have arrived at the barrier, the scheduler releases all waiting threads.
CUDA

    #define blocksize 512   // threads per block; must match the launch configuration

    __global__ void shift_reduce(int *inp, int N, int *tot)
    {
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        __shared__ int x[blocksize];
        x[tid] = (i < N) ? inp[i] : 0;       // load one element per thread
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s = s / 2) {
            if (tid < s)
                x[tid] += x[tid + s];        // pairwise reduction in shared memory
            __syncthreads();
        }
        if (tid == 0)
            atomicAdd(tot, x[tid]);          // block result added to the global total
    }
CUDA SPMD (Single Program Multiple Data) programming model
• All threads execute the same program
• Threads coordinate with barrier synchronization
• Threads of a block express fine-grained data parallelism and thread parallelism
• Independent blocks of a grid express coarse-grained data parallelism
• Independent grids express coarse-grained task parallelism (see the sketch below)
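One way to express independent grids is to launch kernels into separate CUDA streams; a hedged sketch follows, in which the kernels taskA and taskB and the launch shapes are illustrative:

    __global__ void taskA(float *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] += 1.0f;
    }

    __global__ void taskB(float *b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        b[i] *= 2.0f;
    }

    void launch_tasks(float *da, float *db)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Two independent grids in different streams may execute concurrently
        taskA<<<8, 256, 0, s1>>>(da);
        taskB<<<8, 256, 0, s2>>>(db);

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }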
CUDA Scheduler
• Hardware management and scheduling of threads and thread blocks
• Scheduler has minimal runtime overhead
CUDA Multithreading
• Memory and texture fetches have latencies of hundreds of processor clock cycles
• While one thread is waiting for a load or texture fetch, the processor can execute another thread
• Thousands of independent threads can keep many processors busy
CUDA GPU Multiprocessor Architecture
• Lightweight thread creation
• Zero-overhead thread scheduling
• Fast barrier synchronization
Each thread has its own:
• Private registers
• Private per-thread memory
• PC (program counter)
• Thread execution state
These features support very fine-grained parallelism.
CUDA GPU Multiprocessor Architecture
Each SP core:
• contains scalar integer and floating-point units
• is hardware multithreaded and supports up to 64 threads
• is pipelined and executes one instruction per thread per clock
• has a large register file (RF) of 1024 32-bit registers
• has its registers partitioned among the assigned threads
Programs declare their register demand and the compiler optimizes register allocation. Example: at 32 registers per thread, 256 threads per block; fewer registers per thread allow more threads, and more registers per thread allow fewer threads. A sketch of steering this trade-off from source code follows.
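CUDA exposes the register/thread trade-off through the __launch_bounds__ qualifier; the kernel and the limit of 256 below are illustrative assumptions, not from the lecture:

    // Ask the compiler to keep per-thread register use low enough that
    // blocks of up to 256 threads can be resident; it may spill registers
    // to local memory to honor the bound.
    __global__ void __launch_bounds__(256) scale256(float *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] *= 2.0f;
    }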
CUDA Single Instruction Multiple Thread (SIMT)
• SIMT: a processor architecture that applies one instruction to multiple independent threads in parallel
• Warp: the set of parallel threads that execute the same instruction together in a SIMT architecture
• Warp size is 32 threads (4 threads per SP, executed in 4 clock cycles)
• Threads in a warp start at the same program address, but they can branch and execute independently
• Individual threads may be inactive due to independent branching
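A sketch of the branching behavior described above (the kernel is illustrative): when threads of one warp take different paths, the hardware executes each path in turn with the non-participating lanes inactive.

    __global__ void diverge(int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Lanes 0..15 of each warp take one path, lanes 16..31 the other;
        // the warp executes both paths one after the other, masking off
        // the inactive lanes on each path.
        if ((threadIdx.x & 31) < 16)
            out[i] = 1;
        else
            out[i] = 2;
    }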
CUDA SIMT Warp Execution
• There are 4 thread lanes per SP
• An issued warp instruction executes in 4 processor cycles
• The instruction scheduler selects a warp every 4 clocks
The controller:
• collects thread programs into warps
• allocates a warp
• allocates registers for the warp threads (it can start a warp only when it can allocate the requested register count)
• starts warp execution
• frees the registers when all threads of the warp exit
CUDA Streaming Processor (SP)
• Has 1024 32-bit registers (RF)
• Can perform 32-bit and 64-bit integer operations: arithmetic, comparison, conversion, logic operations
• Can perform 32-bit floating-point operations: add, multiply, min, max, multiply-add, etc.
SFU (Special Function Unit)
• Pipelined unit
• Generates one 32-bit floating-point special-function result per cycle: square root, sin, cos, 2^x, log2(x)
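In CUDA code these special functions are reachable through the fast device intrinsics; a hedged sketch (whether a given call is serviced by the SFU depends on the hardware and compiler flags):

    __global__ void special(float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = x[i];
        // Fast intrinsics for the transcendental operations listed above
        y[i] = __sinf(v) + __cosf(v) + __expf(v) + __log2f(v) + rsqrtf(v);
    }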
CUDA Memory System
• Global memory: in external DRAM
• Shared memory: on chip
• Per-thread local memory: in external DRAM
• Constant memory: in external DRAM, cached on chip
• Texture memory: in external DRAM, cached on chip
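A minimal sketch of declaring and filling constant memory (the names and sizes are illustrative):

    __constant__ float coeffs[16];   // constant memory: read-only in kernels, cached

    __global__ void apply(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] *= coeffs[i % 16];
    }

    // Host side, before the launch:
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));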
Project: Performance Measurement, Evaluation and Prediction of Multicore and GPU Systems
Multicore systems:
• CPU performance (instruction execution time, pipelining, etc.)
• Cache performance
• Performance using algorithmic structures
GPU systems (NVIDIA CUDA):
• GPU core performance (instruction execution time, pipelining, etc.)
• Global and shared memory performance
• Performance using algorithmic structures
• GPU performance in the MATLAB environment