COMP 5411: ADVANCED COMPUTER GRAPHICS FALL 2013



Presentation Transcript


  1. GPGPU II COMP 5411: ADVANCED COMPUTER GRAPHICS, FALL 2013

  2. Plan • Previously • Basic principles • Memory model • Data structures • Optimizations • Used OpenGL/DirectX graphics APIs

  3. Next • APIs specific for GPGPU • Direct use of processors (no VS, PS, ...) • Generic programming languages (not graphics specific) • Support • Easier to program in general • Easier to scatter data • Inter-processor communication (CUDA)

  4. Graphics Programming Model [Diagram: Application → Vertex Program → Rasterization → Fragment Program → Display, with Memory accessible along the pipeline]

  5. GPGPU Programming Model [Diagram: GPGPU app → Set up data → Compute Program ↔ Memory]

  6. Pixel Shader Programming Model [Diagram: Input Registers → Fragment Program → Output Registers → FB Memory; the Fragment Program also reads Textures and Constants and uses Temp Registers; resources are scoped per thread, per shader, and per context]

  7. GPGPU Constraints • Dealing with graphics API • Addressing modes • Limited texture size/dimension • Shader capabilities • Limited outputs • Instruction sets • Lack of integer & bit ops (pre-DX10) • Communication limited • Between pixels • Scatter: a[i] = p

  8. GPGPU Difficulties • Writing a GPGPU program is cumbersome • Defining a texture to represent data • Render-to-texture for writing data to textures • Tying shader variables to C++ variables • Defining uniform variables • Goal: focus on the algorithm instead of the implementation

  9. Overview • Brook for GPUs • CUDA • Also see: Stream SDK (from AMD)

  10. Brook for GPUs

  11. Brook: general purpose streaming language • Stream programming model • Enforces data-parallel computing: streams • Encourages arithmetic intensity: kernels • C with stream extensions • GPU = streaming coprocessor

  12. system outline • .br: Brook source files • brcc: source-to-source compiler • brt: Brook run-time library

  13. streams • streams: collections of records requiring similar computation • particle positions, voxels, FEM cells, …

      float3 positions<200>;
      float3 velocityfield<100,100,100>;

  • similar to arrays, but index operations disallowed: position[i] • encourage data parallelism

  14. kernels • kernels: functions applied to streams • similar to a for_all construct

      kernel void foo (float a<>, float b<>, out float result<>) {
          result = a + b;
      }

      float a<100>;
      float b<100>;
      float c<100>;
      foo(a, b, c);

      // equivalent sequential code:
      for (i = 0; i < 100; i++)
          c[i] = a[i] + b[i];

  15. kernels • kernels: functions applied to streams • similar to a for_all construct

      kernel void foo (float a<>, float b<>, out float result<>) {
          result = a + b;
      }

  • no dependencies between stream elements • encourages high arithmetic intensity
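
For comparison, the same elementwise add written as a CUDA kernel (CUDA is covered later in this deck); a minimal sketch, where the kernel name add and the launch configuration are chosen for illustration:

      __global__ void add(const float *a, const float *b, float *c, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // one stream element per thread
          if (i < n)                                      // guard against surplus threads
              c[i] = a[i] + b[i];
      }

      // launch with enough threads to cover all n elements:
      // add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

Like the Brook kernel, each invocation sees one element and writes one output, so there are no dependencies between elements.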

  16. kernels • ray triangle intersection

      kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                       RayState oldraystate<>,
                                       GridTrilist trilist[],
                                       out Hit candidatehit<>) {
          float idx, det, inv_det;
          float3 edge1, edge2, pvec, tvec, qvec;
          if (oldraystate.state.y > 0) {
              idx = trilist[oldraystate.state.w].trinum;
              edge1 = tris[idx].v1 - tris[idx].v0;
              edge2 = tris[idx].v2 - tris[idx].v0;
              pvec = cross(ray.d, edge2);
              det = dot(edge1, pvec);
              inv_det = 1.0f / det;
              tvec = ray.o - tris[idx].v0;
              candidatehit.data.y = dot(tvec, pvec) * inv_det;
              qvec = cross(tvec, edge1);
              candidatehit.data.z = dot(ray.d, qvec) * inv_det;
              candidatehit.data.x = dot(edge2, qvec) * inv_det;
              candidatehit.data.w = idx;
          } else {
              candidatehit.data = float4(0, 0, 0, -1);
          }
      }

  17. reductions • reductions: compute a single value from a stream

      reduce void sum (float a<>, reduce float r<>) {
          r += a;
      }

      float a<100>;
      float r;
      sum(a, r);

      // equivalent sequential code:
      r = a[0];
      for (int i = 1; i < 100; i++)
          r += a[i];

  18. reductions • reductions • associative operations only (a+b)+c = a+(b+c) • sum, multiply, max, min, OR, AND, XOR • matrix multiply

  19. reductions • multi-dimension reductions

      reduce void sum (float a<>, reduce float r<>) {
          r += a;
      }

      float a<20>;
      float r<5>;
      sum(a, r);

      // equivalent sequential code (each output element reduces 4 inputs):
      for (int i = 0; i < 5; i++) {
          r[i] = a[i*4];
          for (int j = 1; j < 4; j++)
              r[i] += a[i*4 + j];
      }

  20. stream repeat & stride • kernel arguments of different shape • resolved by repeat and stride

      kernel void foo (float a<>, float b<>, out float result<>);
      float a<20>;
      float b<5>;
      float c<10>;
      foo(a, b, c);

      // expands to (a strided, b repeated):
      foo(a[0],  b[0], c[0])
      foo(a[2],  b[0], c[1])
      foo(a[4],  b[1], c[2])
      foo(a[6],  b[1], c[3])
      foo(a[8],  b[2], c[4])
      foo(a[10], b[2], c[5])
      foo(a[12], b[3], c[6])
      foo(a[14], b[3], c[7])
      foo(a[16], b[4], c[8])
      foo(a[18], b[4], c[9])

  21. matrix vector multiply

      kernel void mul (float a<>, float b<>, out float result<>) {
          result = a * b;
      }

      reduce void sum (float a<>, reduce float result<>) {
          result += a;
      }

      float matrix<20,10>;
      float vector<1,10>;
      float tempmv<20,10>;
      float result<20,1>;
      mul(matrix, vector, tempmv);
      sum(tempmv, result);

  [Diagram: the vector V is repeated across the rows of the matrix M to form the 20×10 temporary T]

  22. matrix vector multiply (cont.) [Diagram: the 20×10 temporary T is reduced along each row by sum into the 20×1 result R]

  23. Brook performance • 2-3x faster than CPU implementation (ATI Radeon 9800 XT, NVIDIA GeForce 6800) • GPUs still lose against cache-friendly SSE code and super-optimized libraries (ATLAS, FFTW) • Compared against a 3 GHz P4 running: Intel Math Library, FFTW, custom cache-blocked segment C code

  24. Brook for GPUs • Simple programming framework • Operates with streams and kernels • GPGPU applications • Hides all graphics intricacies

  25. CUDA

  26. CUDA • “Compute Unified Device Architecture” • General purpose programming model • User kicks off batches of threads on the GPU • GPU = dedicated super-threaded co-processor • Driver for loading computation programs into the GPU • Standalone driver, optimized for computation • Interface designed for compute: graphics-free API • Data sharing with OpenGL buffer objects • Guaranteed maximum download & readback speeds • Explicit GPU memory management

  27. Extended C • Declspecs: global, device, shared, local, constant • Keywords: threadIdx, blockIdx • Intrinsics: __syncthreads • Runtime API: memory, symbol, execution management • Function launch

      __device__ float filter[N];

      __global__ void convolve (float *image) {
          __shared__ float region[M];
          ...
          region[threadIdx.x] = image[i];
          __syncthreads();
          ...
          image[j] = result;
      }

      // Allocate GPU memory
      void *myimage;
      cudaMalloc(&myimage, bytes);

      // 100 blocks, 10 threads per block
      convolve<<<100, 10>>>(myimage);

  28. Overview • CUDA programming model – basic concepts and data types • CUDA application programming interface

  29. CUDA Programming Model • Same as other languages • Streams and kernels • Kernels run as threads • Differences between GPU and CPU threads • GPU threads are extremely lightweight • Very little creation overhead • GPU needs 1000s of threads for full efficiency • Multi-core CPU needs only a few

  30. Thread Batching: Grids and Blocks [Diagram: the host launches Kernel 1 on Grid 1, a 2×3 array of blocks, and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5×3 array of threads] • A kernel is executed as a grid of thread blocks • All threads share data memory space • A thread block is a batch of threads that can cooperate with each other by: • Synchronizing their execution for hazard-free shared memory accesses • Efficiently sharing data through a low-latency shared memory • Two threads from two different blocks cannot cooperate (Courtesy: NVIDIA)

  31. Block and Thread IDs [Diagram: Grid 1 as a 2×3 array of blocks; Block (1,1) expanded into a 5×3 array of threads] • Threads and blocks have IDs, so each thread can decide what data to work on • Block ID: 1D or 2D • Thread ID: 1D, 2D, or 3D • Simplifies memory addressing when processing multidimensional data • Image processing • Solving PDEs on volumes • …
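
A minimal sketch of this ID arithmetic (the kernel name invert and the row-major image layout are illustrative): each thread combines its block and thread IDs to find the one pixel it owns.

      __global__ void invert(unsigned char *img, int width, int height) {
          int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column from block + thread ID
          int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
          if (x < width && y < height)                    // guard partial blocks at the edges
              img[y * width + x] = 255 - img[y * width + x];
      }

      // launched with a 2D grid of 2D blocks, e.g.:
      // dim3 block(16, 16);
      // dim3 grid((width + 15) / 16, (height + 15) / 16);
      // invert<<<grid, block>>>(d_img, width, height);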

  32. CUDA Device Memory Space Overview [Diagram: each thread has registers and local memory; each block has shared memory; the whole grid shares global, constant, and texture memory, all reachable from the host] • Each thread can: • R/W per-thread registers • R/W per-thread local memory • R/W per-block shared memory • R/W per-grid global memory • Read-only per-grid constant memory • Read-only per-grid texture memory • The host can R/W global, constant, and texture memories
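
A sketch of how these spaces appear in source code, assuming the kernel is launched with 256 threads per block (the names coeff, scale, and tile are illustrative):

      __constant__ float coeff[16];          // per-grid constant memory, read-only in kernels

      __global__ void scale(float *gdata) {  // gdata points into global memory (from cudaMalloc)
          __shared__ float tile[256];        // per-block shared memory
          float t = gdata[threadIdx.x];      // t lives in a per-thread register
          tile[threadIdx.x] = t * coeff[0];  // stage the value through shared memory
          __syncthreads();                   // make shared writes visible to the whole block
          gdata[threadIdx.x] = tile[255 - threadIdx.x];
      }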

  33. Global, Constant, and Texture Memories (Long-Latency Accesses) • Global memory • Main means of communicating R/W data between host and device • Contents visible to all threads • Texture and constant memories • Constants initialized by host • Contents visible to all threads (Courtesy: NVIDIA)

  34. CUDA Device Memory Allocation • cudaMalloc() • Allocates an object in device global memory • Requires two parameters • Address of a pointer to the allocated object • Size of the allocated object • cudaFree() • Frees an object from device global memory • Pointer to the freed object

  35. CUDA Host-Device Data Transfer • cudaMemcpy() • Memory data transfer • Requires four parameters • Pointer to destination • Pointer to source • Number of bytes copied • Type of transfer • Host to Host • Host to Device • Device to Host • Device to Device
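
Putting the last two slides together, a typical allocate / copy / compute / copy-back round trip looks like this (a sketch; the sizes and the commented-out kernel launch are illustrative):

      int n = 1024;
      size_t bytes = n * sizeof(float);
      float *h_data = (float *)malloc(bytes);      // host memory
      float *d_data;
      cudaMalloc((void **)&d_data, bytes);         // device global memory

      cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device
      // KernelFunc<<<n / 256, 256>>>(d_data);     // launch (see slide 38)
      cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host

      cudaFree(d_data);                            // release device memory
      free(h_data);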

  36. CUDA Function Declarations • __device__ functions: executed on the device, callable only from the device • __global__ defines a kernel function: executed on the device, callable from the host • Must return void • __host__ functions: executed on the host, callable from the host • __device__ and __host__ can be used together
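
A sketch of the qualifiers in use; combining __host__ and __device__ compiles the same function for both sides (the function names are illustrative):

      __host__ __device__ float square(float x) { return x * x; }  // callable from host and device

      __device__ float DeviceFunc(float x) {     // device-only helper
          return square(x) + 1.0f;
      }

      __global__ void KernelFunc(float *out) {   // kernel: runs on the device, launched from the host
          out[threadIdx.x] = DeviceFunc((float)threadIdx.x);
      }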

  37. CUDA Function Declarations (cont.) • For functions executed on the device: • No recursion • No static variable declarations inside the function • No variable number of arguments

  38. Calling a Kernel Function – Thread Creation • A kernel function must be called with an execution configuration:

      __global__ void KernelFunc(...);
      dim3 DimGrid(100, 50);      // 5000 thread blocks
      dim3 DimBlock(4, 8, 8);     // 256 threads per block
      size_t SharedMemBytes = 64; // 64 bytes of shared memory
      KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

  • Kernel launches are asynchronous: control returns to the host immediately (use cudaDeviceSynchronize() to block until completion)
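
In practice the grid size is usually derived from the problem size with a ceiling division so that partial blocks are still covered (a sketch; n, d_data, and the block size are illustrative):

      int n = 100000;                    // number of elements to process
      int threadsPerBlock = 256;
      int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
      KernelFunc<<<blocks, threadsPerBlock>>>(d_data, n);        // shared-memory size defaults to 0
      cudaDeviceSynchronize();           // block the host until the kernel finishes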

  39. CUDA Performance • CUDA/G80 advantage over a dual-core CPU: 10x to 197x, depending on workload • Rigid body physics solver • Matrix numerics: BLAS1 60+ GB/s, BLAS3 100+ GFLOPS • Wave equation: FDTD 1.2 Gcells/s, FFT 52 GFLOPS (GFLOPS as defined by benchFFT) • Biological sequence match: SSEARCH 5.2 Gcells/s • Finance: Black-Scholes 4.7 GOptions/s

  40. CUDA Highlights: Easy and Lightweight • The API is an extension to the ANSI C programming language → low learning curve (compared to graphics APIs) • The hardware is designed to enable a lightweight runtime and driver → high performance

  41. Summary • Brook for GPUs • For GPGPU • Very simple (basically C plus stream types) • CUDA (and Stream) • Performant • For GPGPU and graphics • Closer to hardware

  42. Overall Summary • Overview of GPGPU (last class) • Basic principles • Memory model • Data structures • Optimizations • APIs specific for GPGPU (today) • Easier scatter • Inter-processor communication

  43. Acknowledgement • Ian Buck and David Kirk, for some of the slides on their respective languages
