CUDA: All you wanted to know about it, but were afraid to ask! Paulo Ivson Netto Santos Waldemar Celes Filho Nov 2007
What is GPGPU? • General Purpose computation using the GPU • Applications other than 3D graphics • GPU accelerates the critical path of the application • Data-parallel algorithms leverage GPU attributes • Large data arrays, streaming throughput • Fine-grain SIMD parallelism • Floating-point (FP) computation • Applications – see GPGPU.org • Game effects (FX), physics, image processing • Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting, etc.
Importance of Data Parallelism • GPUs are designed for graphics • Highly parallel tasks • Data-parallel processing • GPU architecture is ALU-heavy • Multiple pipelines, multiple ALUs per pipe • Large memory latency • HUGE memory bandwidth • Hide memory latency with more computation
CPU vs GPU Design Strategies and Tactics • CPU strategy: make a few threads run fast • Tactics – minimize latency • Big cache – built for hits • Instruction/data prefetch • Speculative execution • Limited by “perimeter” – communication bandwidth • GPU strategy: make many threads run fast • Tactics – maximize throughput • Small cache – built for misses • Parallelism (1000s of threads) • Pipelining • Limited by “area” – compute capability
What does a GPU look like? From a graphics point of view
GeForce 7800 GTX Parallelism • 8 Vertex Engines • Triangle Setup/Raster • Z-Cull • 24 Pixel Shaders fed by a Shader Instruction Dispatch • 16 Raster Operation Pipelines • Fragment Crossbar • 4 Memory Partitions
G80 replaces the pipeline model • The future of GPUs is programmable processing • So – build the architecture around the processor • (Diagram: host and data assembler feed vertex, geometry, and pixel thread issue units; a unified array of streaming processors (SP) with texture filter units (TF) and L1 caches; L2 caches and framebuffer (FB) partitions below)
Work Distribution for Graphics • Vertices are serially distributed to all the SMs • The SPA processes vertices in parallel • Vertices are serially gathered from the SMs • And sent to Primitive Setup • Pixels are serially distributed in parallel tiles • The SPA processes pixels in parallel • Pixels are sent to ROP/FB
Common GPGPU Constraints • Dealing with graphics API • Working with the corner cases of the graphics API • Addressing modes • Limited texture size/dimension • Shader capabilities • Limited outputs • Instruction sets • Lack of Integer & bit ops • Communication limited • Between pixels • Scatter a[i] = p
Just what is CUDA anyway? • “Compute Unified Device Architecture” • General purpose programming model • User kicks off batches of threads on the GPU • GPU is viewed as a dedicated super-threaded co-processor • Targeted software stack • Compute oriented drivers, language, and tools • Driver for loading computation programs into GPU • Standalone driver - optimized for computation • Interface designed for compute - graphics free API • Data sharing with OpenGL buffer objects • Guaranteed maximum download & readback speeds • Explicit GPU memory management • Debugging support on the CPU!
CUDA Performance • CUDA/G80 speedups over a dual-core CPU range from 10x to 197x across the applications below • Rigid Body Physics Solver • Matrix Numerics – BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS • Wave Equation – FDTD: 1.2 Gcells/s, FFT: 52 GFLOPS (GFLOPS as defined by benchFFT) • Biological Sequence Match – SSEARCH: 5.2 Gcells/s • Finance – Black-Scholes: 4.7 GOptions/s
GPU: A Highly Multithreaded Coprocessor • The GPU is viewed as a compute device that: • Is a coprocessor to the CPU or host • Has its own DRAM (device memory) • Runs many threads in parallel • Identify data-parallel portions of an application • Execute them on the device as kernels • Which run in parallel on many threads • Differences between GPU and CPU threads • GPU threads are extremely lightweight • Very little creation overhead • GPU needs 1000s of threads for full efficiency • Multi-core CPU needs only a few
Thread Batching: Grids and Blocks • Grid of thread blocks • Corresponds to one kernel • All threads access global memory • Thread block • A batch of threads that can cooperate with each other • Share data through a low-latency shared memory • Barrier synchronization for hazard-free shared memory accesses • Threads from different blocks cannot cooperate • (Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block a 2D array of threads) Courtesy: NVIDIA
Block and Thread IDs • Threads and blocks have IDs • Each thread can decide what data to work on • Block ID: 1D or 2D • Thread ID: 1D, 2D, or 3D • Simplifies addressing of multidimensional data • Image processing • Solving PDEs on volumes • … Courtesy: NVIDIA
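As a minimal sketch of how a kernel might use 2D block and thread IDs to address an image (kernel name, image layout, and launch shape are illustrative, not from the talk):

// Each thread brightens one pixel of a width x height image stored row-major in global memory
__global__ void brighten(float* image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 2D block index + 2D thread index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // give a 2D pixel coordinate

    if (x < width && y < height)                     // guard against partially filled blocks
        image[y * width + x] += 0.1f;
}

Launched with, e.g., dim3 block(16, 16) and dim3 grid((width + 15) / 16, (height + 15) / 16), every pixel is covered even when the image size is not a multiple of the block size.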
CUDA Device Memory Overview • Each thread can: • R/W per-thread registers • R/W per-thread local memory • R/W per-block shared memory • R/W per-grid global memory • Read only per-grid constant memory • Read only per-grid texture memory • The host can R/W global, constant, and texture memories • (Diagram: per-thread registers and local memory, per-block shared memory, and grid-wide global, constant, and texture memories accessible from the host)
Global, Constant, and Texture Memories • Global memory • Communicating data between host and device • Visible to all threads • Texture and constant memories • Read-only data initialized by the host • Visible to all threads Courtesy: NVIDIA
A Common Programming Pattern • Local and global memory reside in DRAM • Much slower access than shared memory • Profitable way of performing computation • Block data and computation to take advantage of fast shared memory • Partition data into data subsets that fit into shared memory • Handle each data subset with one thread block by: • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element • Copying results from shared memory to global memory
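A minimal sketch of this load-sync-compute-write pattern, using a simple 3-point averaging filter (the kernel and its assumptions – TILE equal to the block size, n a multiple of TILE – are illustrative, not from the talk):

#define TILE 256   // must match blockDim.x at launch

__global__ void smooth(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];                 // one halo element on each side

    int i = blockIdx.x * TILE + threadIdx.x;         // global element index
    int t = threadIdx.x + 1;                         // local index, shifted past the left halo

    tile[t] = in[i];                                 // cooperative load: one element per thread
    if (threadIdx.x == 0)                            // first thread also loads the two halo cells
    {
        tile[0]        = (i > 0)        ? in[i - 1]    : 0.0f;
        tile[TILE + 1] = (i + TILE < n) ? in[i + TILE] : 0.0f;
    }
    __syncthreads();                                 // all loads complete before anyone reads

    out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;   // compute from the on-chip copy
}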
A Common Programming Pattern • Texture and constant memory also reside in device memory (DRAM) • Much slower access than shared memory • But… cached! • Highly efficient access for read-only data • Carefully divide data according to access patterns • R/O, no structure → constant memory • R/O, array structured → texture memory • R/W, shared within a block → shared memory • R/W registers that spill → local memory • R/W inputs/results → global memory
Or not... so many things still missing! • How to code? • API, SDK, etc • How does it actually work in the GPU? • HW details that make all the difference • How to get the best of it? • Tips and tricks to get those GFLOPs!
Extended C __device__ float filter[N]; __global__ void convolve (float *image) { __shared__ float region[M]; ... region[threadIdx.x] = image[i]; __syncthreads(); ... image[j] = result; } // Allocate GPU memory void *myimage; cudaMalloc(&myimage, bytes); // 100 blocks, 10 threads per block convolve<<<100, 10>>>((float*)myimage); • Declspecs • global, device, shared, local, constant • Keywords • threadIdx, blockIdx • Intrinsics • __syncthreads • Runtime API • Memory, symbol, execution management • Function launch
Extended C: compilation flow • Integrated source (foo.cu) is processed by cudacc (EDG C/C++ frontend) • Device path: Open64 Global Optimizer → GPU assembly (foo.s) → OCG → G80 SASS (foo.sass) • Host path: CPU host code (foo.cpp) → gcc / cl
CUDA Device Memory Allocation • cudaMalloc() • Allocates the device global memory • Requires two parameters • Address of a pointer to the allocated object • Size of the allocated object • cudaFree() • Frees the object from device global memory
CUDA Device Memory Allocation • Code example: • Allocate a 256 * 256 single-precision float array • Use the “d” suffix to indicate a device data structure float* elementsd; int size = 256 * 256 * sizeof(float); cudaMalloc( (void**)&elementsd, size ); cudaFree( elementsd );
CUDA Host-Device Data Transfer • cudaMemcpy() • Memory data transfer • Requires four parameters • Pointer to destination • Pointer to source • Number of bytes copied • Type of transfer • Host to Host • Host to Device • Device to Host • Device to Device
CUDA Host-Device Data Transfer(cont.) • Code example: • Transfer a 64 * 64 single precision float array • elements is in host memory • elementsd is in device memory • cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants cudaMemcpy( elementsd, elements, size, cudaMemcpyHostToDevice ); cudaMemcpy( elements, elementsd, size, cudaMemcpyDeviceToHost );
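Putting the last few slides together, a host-side sketch of the full allocate → upload → launch → read back → free sequence (the kernel name process is hypothetical and error checking is omitted):

// assumes: __global__ void process(float* data); declared elsewhere
int    n         = 64 * 64;
int    size      = n * sizeof(float);
float* elements  = (float*)malloc(size);   // host array, filled elsewhere
float* elementsd = NULL;                   // device array

cudaMalloc((void**)&elementsd, size);                           // allocate on the GPU
cudaMemcpy(elementsd, elements, size, cudaMemcpyHostToDevice);  // upload input

process<<<n / 256, 256>>>(elementsd);                           // 16 blocks of 256 threads

cudaMemcpy(elements, elementsd, size, cudaMemcpyDeviceToHost);  // read back results
cudaFree(elementsd);                                            // release device memory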
CUDA Function Declarations • __global__ defines a kernel function • Must return void • __device__ and __host__ can be used together • __host__ is optional
CUDA Function Declarations • __device__ functions cannot have their address taken • For functions executed on the device: • No recursion • No static variable declarations inside the function • No variable number of arguments
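A small sketch of how the qualifiers combine in practice (function names and bodies are illustrative):

__device__ float clampf(float x, float lo, float hi)   // runs on the GPU, callable only from device code
{
    return fminf(fmaxf(x, lo), hi);
}

__global__ void clampAll(float* data, int n)            // kernel: launched from the host, must return void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clampf(data[i], 0.0f, 1.0f);
}

__host__ __device__ float square(float x)               // compiled for both the CPU and the GPU
{
    return x * x;
}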
Calling a Kernel – Thread Creation • Kernel functions are called with an execution configuration • Calls to a kernel function are asynchronous • But only one kernel active at a time per GPU • Implicit synchronizations • Second kernel launch • Memory read backs • Explicit synchronizations • cudaThreadSynchronize() __global__ void KernelFunc(...); dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
Some Additional API Features math functions, thread and block ids, etc
Application Programming Interface • The API is an extension to the C programming language • It consists of: • Language extensions • To target portions of the code for execution on the device • A runtime library split into: • A common component providing built-in vector types and a subset of the C runtime library in both host and device codes • A host component to control and access one or more devices from the host • A device component providing device-specific functions
Language Extensions: Built-in Variables • dim3 gridDim; • Dimensions of the grid in blocks • Grids are at most 2D! gridDim.z is unused • dim3 blockDim; • Dimensions of the block in threads • uint3 blockIdx; • Block index within the grid • uint3 threadIdx; • Thread index within the block
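As an illustration of how these variables combine, a sketch in which each thread strides over an array larger than the total number of threads launched (kernel name and logic are illustrative):

__global__ void scale(float* data, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;                // total number of threads in the grid
    int start  = blockIdx.x * blockDim.x + threadIdx.x; // this thread's unique global index

    for (int i = start; i < n; i += stride)             // each thread handles every stride-th element
        data[i] *= factor;
}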
Common Runtime Component • Provides: • Built-in vector types • A subset of the C runtime library supported in both host and device codes
Built-in Vector Types • [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4] • Structures accessed with x, y, z, w fields: uint4 param; int y = param.y; • dim3 • Based on uint3 • Used to specify dimensions
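A short illustration of the vector types and dim3 in host code (the values are arbitrary):

float4 p     = make_float4(1.0f, 2.0f, 3.0f, 1.0f);     // make_<type> helpers build the vector types
float  len   = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);

uint2  texel = make_uint2(10u, 20u);

dim3   block(16, 16);    // unspecified components default to 1,
dim3   grid(64, 64);     // so this describes a 64 x 64 x 1 grid of 16 x 16 x 1 blocks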
Mathematical Functions • pow, sqrt, cbrt, hypot • exp, exp2, expm1 • log, log2, log10, log1p • sin, cos, tan, asin, acos, atan, atan2 • sinh, cosh, tanh, asinh, acosh, atanh • ceil, floor, trunc, round • Etc. • When executed on the host, a given function uses the C runtime implementation if available • These functions are only supported for scalar types, not vector types
Host Runtime Component • Provides functions to deal with: • Device management (including multi-device systems) • Memory management • Error handling • Initializes the first time a runtime function is called • A host thread can invoke a kernel on only one device • Multiple host threads required to run on multiple devices
Memory Management • Device memory allocation • cudaMalloc(), cudaFree() • Memory copy from host to device, device to host, device to device • cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol() • Memory addressing • cudaGetSymbolAddress() • Used to transfer data to constant memory
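For instance, a read-only coefficient table in constant memory could be filled from the host with cudaMemcpyToSymbol (the symbol name and size are illustrative):

__constant__ float coeffs[16];          // read-only table visible to every thread

void uploadCoeffs(const float* hostCoeffs)
{
    // copy 16 floats from host memory into the constant-memory symbol
    cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
}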
Device Mathematical Functions • Some mathematical functions (e.g. sinf(x)) have a less accurate, but faster device-only version (e.g. __sinf(x)) • __powf • __logf, __log2f, __log10f • __expf • __sinf, __cosf, __tanf
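A sketch of the trade-off inside a kernel (the kernel itself is illustrative); nvcc also offers a -use_fast_math switch that substitutes the fast versions globally:

__global__ void wave(float* out, const float* phase, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // out[i] = sinf(phase[i]);    // full-precision version
        out[i] = __sinf(phase[i]);     // faster, lower-accuracy intrinsic (device only)
    }
}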
Device Synchronization Function • void __syncthreads(); • Synchronizes all threads in a block • Once all threads have reached this point, execution resumes normally • Avoid RAW/WAR/WAW hazards when accessing shared or global memory • Allowed in conditional constructs only if the conditional is uniform across the entire thread block
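A classic illustration of why the barrier is needed (similar in spirit to the SDK samples, not code from the talk): reversing the elements handled by one block in shared memory.

__global__ void reverseBlock(float* data)
{
    extern __shared__ float s[];                 // size supplied in the execution configuration

    int t = threadIdx.x;
    int r = blockDim.x - 1 - t;                  // mirrored index within the block

    s[t] = data[blockIdx.x * blockDim.x + t];    // stage the block's elements in shared memory
    __syncthreads();                             // every write to s[] must finish before any read

    data[blockIdx.x * blockDim.x + t] = s[r];    // write each element back in reversed order
}

Launched as reverseBlock<<<blocks, threads, threads * sizeof(float)>>>(d_data); without the barrier a thread could read a slot its neighbor has not written yet.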
Graphics Interoperability one last API bit...
Overview • Interface to exchange data between OpenGL / D3D and CUDA without reading it back to the host • Buffer objects can be mapped into the CUDA address space and then used as global memory • Textures can be accessed by casting them to buffer objects • Data can be accessed as any other global data in the device code • Useful for • Frame post-processing • Visualization • Physical Simulation • …
OpenGL Interoperability • Mapping a GL buffer object into CUDA cudaError_t cudaGLMapBufferObject( void **devPtr, GLuint bufObj ) • Unmapping a GL buffer object from CUDA cudaError_t cudaGLUnmapBufferObject( GLuint bufObj )
OpenGL Interoperability • Example (from simpleGL in the SDK) float *dptr; cudaGLMapBufferObject( (void**)&dptr, vbo ); dim3 grid( 1, 1, 1); dim3 block( num_threads, 1, 1); kernel<<< grid, block>>>(dptr); cudaGLUnmapBufferObject( vbo );
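Note that, in the CUDA releases of this period, a buffer object has to be registered with CUDA once before it can be mapped (and unregistered when no longer needed); roughly:

cudaGLRegisterBufferObject( vbo );     // once, after the GL buffer object is created

// ... map, launch kernels, unmap as in the example above ...

cudaGLUnregisterBufferObject( vbo );   // once, before the buffer object is deleted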
Practical Code Example AKA: breaking the inertia with a simple, illustrative (= useless) example
Matrix Multiplication • Illustrates the basic features of • Global Memory usage • Memory transfer API • Thread allocation • Local, register usage • Thread ID usage • Only example, not efficient! • i.e. Leave shared memory usage for later
A Matrix Data Type • NOT part of CUDA • 2D matrix • single-precision float elements • width * height elements • data stored in a contiguous array pointed to by elements typedef struct { int width; int height; float* elements; } Matrix;
Square Matrix Multiplication • P = M * N, all of size WIDTH x WIDTH • Without blocking • One thread handles one element of P • (Diagram: matrices M, N, and P, each WIDTH x WIDTH)
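A minimal sketch of the kernel this slide describes, with one thread per element of P and a single thread block (the body is an assumption following the classic tutorial structure, not the speakers' code):

__global__ void MatrixMulKernel(float* M, float* N, float* P, int width)
{
    int row = threadIdx.y;                 // with a single block, thread IDs address P directly
    int col = threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < width; ++k)        // dot product of one row of M with one column of N
        sum += M[row * width + k] * N[k * width + col];

    P[row * width + col] = sum;
}

Launched as MatrixMulKernel<<<1, dim3(width, width)>>>(Md, Nd, Pd, width); the single-block launch limits width to what one block allows (512 threads on G80), which is exactly why a blocked, shared-memory version is needed later.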