GPGPU II
COMP 5411: ADVANCED COMPUTER GRAPHICS, FALL 2013
Plan
• Previously
  • Basic principles
  • Memory model
  • Data structures
  • Optimizations
  • Used OpenGL/DirectX graphics APIs
Next
• APIs specific for GPGPU
  • Direct use of processors (no VS, PS, ...)
  • Generic programming languages (not graphics specific)
• Support
  • Easier to program in general
  • Easier data scatter
  • Inter-processor communication (CUDA)
Graphics Programming Model
[Pipeline diagram: Application → Vertex Program → Rasterization → Fragment Program → Display, with Memory accessible along the pipeline]
GPGPU Programming Model
[Diagram: GPGPU app → Set up data → Compute Program, reading and writing Memory]
Pixel Shader Programming Model
[Diagram: a Fragment Program reads per-thread Input Registers, per-shader Texture data and Constants, uses Temp Registers, and writes per-context Output Registers to FB Memory]
GPGPU Constraints
• Dealing with the graphics API
  • Addressing modes: limited texture size/dimension
  • Shader capabilities: limited outputs
  • Instruction sets: lack of integer & bit ops (pre-DX10)
• Communication limited
  • Between pixels
  • Scatter: a[i] = p
GPGPU Difficulties
• Writing a GPGPU program is cumbersome
  • Defining a texture to represent data
  • Render-to-texture for writing data to textures
  • Tying shader variables to C++ variables (defining uniform variables)
• Goal: focus on the algorithm instead of the implementation
Overview
• Brook for GPUs
• CUDA
• Also see: Stream SDK, from AMD
Brook: general purpose streaming language
• Stream programming model
  • Enforces data-parallel computing: streams
  • Encourages arithmetic intensity: kernels
• C with stream extensions
• GPU = streaming coprocessor
system outline
• .br: Brook source files
• brcc: source-to-source compiler
• brt: Brook run-time library
streams
• Collection of records requiring similar computation
  • particle positions, voxels, FEM cells, ...

float3 positions<200>;
float3 velocityfield<100,100,100>;

• Similar to arrays, but...
  • index operations disallowed: position[i]
  • encourages data parallelism
kernels
• Functions applied to streams
• Similar to a for_all construct

kernel void foo (float a<>, float b<>, out float result<>) {
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;
foo(a, b, c);

// equivalent to:
for (i = 0; i < 100; i++)
  c[i] = a[i] + b[i];
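For comparison, here is a minimal sketch of the same element-wise addition written as a CUDA kernel (CUDA is covered later in this deck). The guard parameter n and the launch shown in the comment are illustrative assumptions, not part of the Brook example.

// Sketch: the Brook kernel above expressed in CUDA; one thread per element.
__global__ void foo(const float *a, const float *b, float *result, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
  if (i < n)          // guard: the grid may have more threads than elements
    result[i] = a[i] + b[i];
}
// Launched e.g. as: foo<<<(n + 255) / 256, 256>>>(a, b, c, n);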
kernels (cont.)
• Same foo kernel as on the previous slide
• No dependencies between stream elements
• Encourages high arithmetic intensity
kernels
• ray-triangle intersection example:

kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                 RayState oldraystate<>,
                                 GridTrilist trilist[],
                                 out Hit candidatehit<>) {
  float idx, det, inv_det;
  float3 edge1, edge2, pvec, tvec, qvec;
  if (oldraystate.state.y > 0) {
    idx = trilist[oldraystate.state.w].trinum;
    edge1 = tris[idx].v1 - tris[idx].v0;
    edge2 = tris[idx].v2 - tris[idx].v0;
    pvec = cross(ray.d, edge2);
    det = dot(edge1, pvec);
    inv_det = 1.0f / det;
    tvec = ray.o - tris[idx].v0;
    candidatehit.data.y = dot(tvec, pvec) * inv_det;
    qvec = cross(tvec, edge1);
    candidatehit.data.z = dot(ray.d, qvec) * inv_det;
    candidatehit.data.x = dot(edge2, qvec) * inv_det;
    candidatehit.data.w = idx;
  } else {
    candidatehit.data = float4(0, 0, 0, -1);
  }
}
reductions
• Compute a single value from a stream

reduce void sum (float a<>, reduce float r<>) {
  r += a;
}

float a<100>;
float r;
sum(a, r);

// equivalent to:
r = a[0];
for (int i = 1; i < 100; i++)
  r += a[i];
reductions
• Associative operations only: (a+b)+c = a+(b+c)
  • sum, multiply, max, min, OR, AND, XOR
  • matrix multiply
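To see why associativity is required, consider how a parallel reduction actually combines elements: in tree order rather than strictly left to right. Below is a minimal CUDA sketch of a block-level tree sum; the name blockSum and the fixed 256-thread block size are assumptions for illustration.

// Sketch: block-level tree reduction in shared memory.
// Correct only because + is associative: partial sums are combined
// in tree order, not in stream order. Assumes blockDim.x == 256.
__global__ void blockSum(const float *in, float *out, int n) {
  __shared__ float buf[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // pad tail with identity
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0)
    out[blockIdx.x] = buf[0];                  // one partial sum per block
}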
reductions
• multi-dimensional reductions

reduce void sum (float a<>, reduce float r<>) {
  r += a;
}

float a<20>;
float r<5>;
sum(a, r);

// equivalent to:
for (int i = 0; i < 5; i++) {
  r[i] = a[i*4];
  for (int j = 1; j < 4; j++)
    r[i] += a[i*4 + j];
}
stream repeat & stride
• Kernel arguments of different shape
• Resolved by repeat and stride

kernel void foo (float a<>, float b<>, out float result<>);

float a<20>;
float b<5>;
float c<10>;
foo(a, b, c);

// expands to:
foo(a[0],  b[0], c[0])
foo(a[2],  b[0], c[1])
foo(a[4],  b[1], c[2])
foo(a[6],  b[1], c[3])
foo(a[8],  b[2], c[4])
foo(a[10], b[2], c[5])
foo(a[12], b[3], c[6])
foo(a[14], b[3], c[7])
foo(a[16], b[4], c[8])
foo(a[18], b[4], c[9])
matrix vector multiply

kernel void mul (float a<>, float b<>, out float result<>) {
  result = a*b;
}

reduce void sum (float a<>, reduce float result<>) {
  result += a;
}

float matrix<20,10>;
float vector<1, 10>;
float tempmv<20,10>;
float result<20, 1>;

mul(matrix, vector, tempmv);  // element-wise multiply; vector rows repeated
sum(tempmv, result);          // reduce each row of tempmv to one value
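For comparison, a minimal CUDA sketch of the same matrix-vector product: the mul and sum stages are fused into one kernel, with one thread per output row. The names matVec, rows, and cols are illustrative.

// Sketch: fused matrix-vector multiply; matrix is rows x cols, row-major.
__global__ void matVec(const float *matrix, const float *vector,
                       float *result, int rows, int cols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < rows) {
    float acc = 0.0f;
    for (int col = 0; col < cols; ++col)  // mul and sum in one pass
      acc += matrix[row * cols + col] * vector[col];
    result[row] = acc;
  }
}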
Brook performance
• 2-3x faster than CPU implementations (ATI Radeon 9800 XT, NVIDIA GeForce 6800)
• GPUs still lose to cache-friendly SSE code and super-optimized libraries (ATLAS, FFTW)
• Compared against a 3 GHz Pentium 4 running:
  • Intel Math Library
  • FFTW
  • Custom cache-blocked segment C code
Brook for GPUs
• Simple programming framework
• Operates with streams and kernels
• For GPGPU applications
• Hides all graphics intricacies
CUDA
• "Compute Unified Device Architecture"
• General purpose programming model
  • User kicks off batches of threads on the GPU
  • GPU = dedicated super-threaded co-processor
• Driver for loading computation programs onto the GPU
  • Standalone driver, optimized for computation
  • Interface designed for compute: graphics-free API
  • Data sharing with OpenGL buffer objects
  • Guaranteed maximum download & readback speeds
  • Explicit GPU memory management
Extended C
• Declspecs: global, device, shared, local, constant
• Keywords: threadIdx, blockIdx
• Intrinsics: __syncthreads()
• Runtime API: memory, symbol, execution management; function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];
  __syncthreads();
  ...
  image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Overview
• CUDA programming model: basic concepts and data types
• CUDA application programming interface
CUDA Programming Model
• Same concepts as other GPGPU languages
  • Streams and kernels
  • Kernels run as threads
• Differences between GPU and CPU threads
  • GPU threads are extremely lightweight: very little creation overhead
  • GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  • All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  • Synchronizing their execution, for hazard-free shared memory accesses
  • Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block a 2D array of threads. Courtesy: NVIDIA]
Block and Thread IDs
• Threads and blocks have IDs, so each thread can decide what data to work on
  • Block ID: 1D or 2D
  • Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the indexing sketch below)
  • Image processing
  • Solving PDEs on volumes
  • ...
[Diagram: a grid of blocks indexed (x, y); within Block (1, 1), threads indexed (x, y)]
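Here is a minimal sketch of that indexing: a thread combines its block ID and thread ID into a 2D pixel address. The kernel name invert and the row-major image layout are assumptions for illustration.

// Sketch: 2D block/thread IDs mapped to a pixel coordinate.
__global__ void invert(unsigned char *image, int width, int height) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
  int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
  if (x < width && y < height)                     // image edge guard
    image[y * width + x] = 255 - image[y * width + x];
}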
CUDA Device Memory Space Overview
• Each thread can:
  • R/W per-thread registers
  • R/W per-thread local memory
  • R/W per-block shared memory
  • R/W per-grid global memory
  • Read only per-grid constant memory
  • Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Diagram: per-thread registers and local memory, per-block shared memory, and device-wide global, constant, and texture memories accessible from the host]
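A minimal sketch of where each memory space appears in CUDA source; the names scale, coeffs, and tile are illustrative, and the kernel assumes a single 128-thread block for simplicity.

__constant__ float coeffs[16];            // per-grid constant memory, set by host

__global__ void scale(float *globalBuf) { // global memory, passed by pointer
  __shared__ float tile[128];             // per-block shared memory
  float tmp = globalBuf[threadIdx.x];     // tmp lives in a per-thread register
  tile[threadIdx.x] = tmp * coeffs[0];
  __syncthreads();                        // all threads see the shared tile
  globalBuf[threadIdx.x] = tile[threadIdx.x];
}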
Global, Constant, and Texture Memories (Long Latency Accesses)
• Global memory
  • Main means of communicating R/W data between host and device
  • Contents visible to all threads
• Texture and constant memories
  • Constants initialized by host
  • Contents visible to all threads
(Courtesy: NVIDIA)
CUDA Device Memory Allocation
• cudaMalloc()
  • Allocates an object in device global memory
  • Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
• cudaFree()
  • Frees an object from device global memory
  • Takes a pointer to the freed object
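A minimal sketch of the two calls described above; the 64-element size and the names allocDemo and d_data are illustrative.

#include <cuda_runtime.h>

void allocDemo(void) {
  float *d_data = NULL;
  // Two parameters: address of the pointer, and size in bytes.
  cudaMalloc((void **)&d_data, 64 * sizeof(float));
  // ... use d_data in kernels ...
  cudaFree(d_data);  // one parameter: pointer to the object being freed
}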
CUDA Host-Device Data Transfer
• cudaMemcpy()
  • Memory data transfer
  • Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer
  • Transfer types: host to host, host to device, device to host, device to device
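A minimal sketch of a round-trip transfer using the four parameters listed above; the names copyDemo, h_data, and d_data are illustrative.

#include <cuda_runtime.h>

void copyDemo(float *h_data, float *d_data, int n) {
  size_t bytes = n * sizeof(float);
  // destination, source, byte count, type of transfer
  cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
  // ... launch kernels that read and write d_data ...
  cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
}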
CUDA Function Declarations
• __global__ defines a kernel function
  • Must return void
• __device__ and __host__ can be used together
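A minimal sketch of the qualifiers in use; the function names are illustrative.

__device__ float square(float x) {            // device code, device-callable
  return x * x;
}

__host__ __device__ float clamp01(float x) {  // compiled for both host and device
  return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

__global__ void kernelFunc(float *data) {     // kernel: returns void
  data[threadIdx.x] = clamp01(square(data[threadIdx.x]));
}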
CUDA Function Declarations (cont.)
• For functions executed on the device:
  • No recursion
  • No static variable declarations inside the function
  • No variable number of arguments
Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);      // 5000 thread blocks
dim3 DimBlock(4, 8, 8);     // 256 threads per block
size_t SharedMemBytes = 64; // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous: the launch returns immediately, and explicit synchronization is needed to block until completion (see the sketch below)
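A minimal sketch of the launch above followed by the explicit blocking mentioned in the last bullet; cudaDeviceSynchronize() is the runtime call that waits for completion. The kernel body and the launchDemo wrapper are illustrative (d_data is assumed to hold at least 5000 floats).

#include <cuda_runtime.h>

__global__ void KernelFunc(float *data) {
  if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
    data[blockIdx.y * gridDim.x + blockIdx.x] = 1.0f;  // one write per block
}

void launchDemo(float *d_data) {
  dim3 DimGrid(100, 50);       // 5000 thread blocks
  dim3 DimBlock(4, 8, 8);      // 256 threads per block
  size_t SharedMemBytes = 64;  // 64 bytes of dynamic shared memory
  KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(d_data);
  cudaDeviceSynchronize();     // block the host until the kernel completes
}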
CUDA Performance
• CUDA on a G80: 10x to 197x advantage over a dual-core CPU across applications
  • Rigid body physics solver
  • Matrix numerics: BLAS1 60+ GB/s, BLAS3 100+ GFLOPS
  • Wave equation: FDTD 1.2 Gcells/s, FFT 52 GFLOPS (GFLOPS as defined by benchFFT)
  • Biological sequence match: SSEARCH 5.2 Gcells/s
  • Finance: Black-Scholes 4.7 GOptions/s
CUDA Highlights: Easy and Lightweight
• The API is an extension to the ANSI C programming language: low learning curve (compared to graphics APIs)
• The hardware is designed to enable a lightweight runtime and driver: high performance
Summary
• Brook for GPUs
  • For GPGPU
  • Very simple (basically C plus stream types)
• CUDA (and Stream)
  • Performant
  • For GPGPU and graphics
  • Closer to hardware
Overall Summary
• Overview of GPGPU (last class)
  • Basic principles
  • Memory model
  • Data structures
  • Optimizations
• APIs specific for GPGPU (today)
  • Easier scatter
  • Inter-processor communication
Acknowledgement
• Ian Buck and David Kirk, for some of the slides on their respective languages