GPU
• Precision, Power, Programmability
• CPU: x60/decade, 6 GFLOPS, 6 GB/sec
• GPU: x1000/decade, 20 GFLOPS, 25 GB/sec
• Arithmetic heavy (read OR write): faster hardware
• Parallelization
• Multi-billion $ entertainment market drives innovation
• 32-bit Floating point
• Programmable (graphics, physics, general purpose data-flow)
• Can’t simply “port” CPU code to GPU
David Luebke et al., GPGPU, SIGGRAPH 2004
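To put the per-decade figures above on a yearly scale, here is a small worked calculation. It assumes constant compound growth, which the slide does not state; the function name `annual_growth` is made up for this sketch.

```python
def annual_growth(factor_per_decade: float) -> float:
    """Yearly multiplier that compounds to the given factor over 10 years."""
    return factor_per_decade ** (1.0 / 10.0)

cpu_rate = annual_growth(60.0)    # slide figure: CPU x60 per decade
gpu_rate = annual_growth(1000.0)  # slide figure: GPU x1000 per decade

print(f"CPU: x{cpu_rate:.2f} per year")
print(f"GPU: x{gpu_rate:.2f} per year")
```

Under that assumption, CPU throughput grows by roughly half again each year, while GPU throughput roughly doubles.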
History of the 3D graphics industry
• 60s:
  • Line drawings, hidden lines, parametric surfaces (B-splines…)
  • Automated drafting & machining for car, airplane, and ship manufacturers
• 70s:
  • Mainframes, Vector tubes (HP…)
  • Software: Solids (CSG), Ray Tracing, Z-buffer for hidden lines
• 80s:
  • Graphics workstations ($50K-$1M): Frame buffers, rasterizers, GL, PHIGS
  • VR: CAVEs and head-mounted displays
  • CAD/CAM & GIS: CATIA, SDRC, PTC
  • Sun, HP, IBM, SGI, E&S, DEC
• 90s:
  • PCs ($2K): Graphics boards, OpenGL, Java3D
  • CAD+Videogames+Animations: AutoCAD, SolidWorks…, Alias-Wavefront
  • Intel, many board vendors
• 00s:
  • Laptops, PDAs, Cell Phones: Parallel graphics chips
  • Everything will be graphics, 3D, animated, interactive
  • Nvidia, Sony, Nokia
History of GPU
• Pre-GPU Graphics Acceleration
  • SGI, Evans & Sutherland. Introduced concepts like vertex transformation and texture mapping. Very expensive!
• First-Generation GPU (up to 1998)
  • Nvidia TNT2, ATI Rage, Voodoo3. Vertex transformation on CPU, limited set of math operations.
• Second-Generation GPU (1999-2000)
  • GeForce 256, GeForce2, Radeon 7500, Savage3D. Transformation & Lighting. More configurable, still not programmable.
• Third-Generation GPU (2001)
  • GeForce3, GeForce4 Ti, Xbox, Radeon 8500. Vertex programmability, pixel-level configurability.
• Fourth-Generation GPU (2002-)
  • GeForce FX series, Radeon 9700 and on. Vertex-level and pixel-level programmability.
Architecture
• Application
• Vertex Shader: outputs transformed vertices, normals, colors
• Geometry Shader
• Rasterizer: outputs fragments (surfels per pixel)
• Fragment Shader: reads texture; outputs pixel color, depth, stencil
• Compositor
• Display
Buffers
• Color: 8-bit index to color table, float/16-bit true color…
• Depth: 24-bit or float (0 at back plane)
• Back and front: display front, update back, swap
• Stereo: Shutter glasses, HMD. Alternate frames
• Auxiliary: off-screen working space. Helps reduce passes.
• Stencil: 8 bits (left-over of depth buffer). <, >… mask, ++
• Accumulation: sum, scale (supersampling, blur)
• P-buffer, superbuffers: Render to texture
Fragment operations
• Depth tests: <, <=, >, >=, ==, Zdepth-interval
• Stencil test: mask?, counter, parity
• Alpha tests: compare to reference alpha
• Alpha blending: +, max, min, replace, blend
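The tests listed above can be sketched for a single fragment arriving at a single pixel. This is a CPU-side illustration only, not any graphics API: the names `DEPTH_FUNCS` and `process_fragment`, the dictionary layout, and the choice of a plain "blend" mode are all assumptions made for the example. The default depth function is `>` on the assumption that depth 0 sits at the back plane, so a larger depth means a closer fragment.

```python
import operator

# Selectable depth comparisons (illustrative table, not an API).
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq,
}

def process_fragment(pixel, fragment, depth_func=">", alpha_ref=0.0):
    """Run alpha test, depth test, then alpha blending for one fragment.

    pixel:    {"color": (r, g, b), "depth": d}
    fragment: {"color": (r, g, b), "alpha": a, "depth": d}
    Returns the stored pixel unchanged if either test rejects the fragment.
    """
    # Alpha test: reject fragments whose alpha does not exceed the reference.
    if fragment["alpha"] <= alpha_ref:
        return pixel
    # Depth test: compare the incoming depth with the stored depth.
    if not DEPTH_FUNCS[depth_func](fragment["depth"], pixel["depth"]):
        return pixel
    # Alpha blending ("blend" mode): weight source and destination by alpha.
    a = fragment["alpha"]
    color = tuple(a * s + (1.0 - a) * d
                  for s, d in zip(fragment["color"], pixel["color"]))
    return {"color": color, "depth": fragment["depth"]}

background = {"color": (0.0, 0.0, 0.0), "depth": 0.2}
near_red = {"color": (1.0, 0.0, 0.0), "alpha": 0.5, "depth": 0.6}
far_red = {"color": (1.0, 0.0, 0.0), "alpha": 0.5, "depth": 0.1}
print(process_fragment(background, near_red))  # passes both tests, gets blended
print(process_fragment(background, far_red))   # fails the depth test
```

The stencil test is left out to keep the sketch short; it would be one more rejection step of the same shape.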
Data Parallelism in GPUs
• Data flow: vertices → fragments → pixels
• Parallelism at each stage
• No shared or static data (except textures)
• ALU-heavy (multiple ALUs per stage in pipe)
• Fight memory latency with more computation
GPGPU
• Stream: collection of records (pixels, vertices…)
  • Stored in Textures (a computational grid)
• Kernel: Function applied to each element in stream
  • Transform, evolve (no dependency between records)
• Matrix algebra
• Image/volume processing
• Physical simulation
• Global illumination
  • Ray tracing
  • Photon mapping
  • Radiosity
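The stream/kernel model above can be mimicked on the CPU in a few lines. This is a sketch only: a nested list stands in for the texture grid, and `run_kernel` and `saxpy_kernel` are names invented for the example. The point it illustrates is that the kernel sees one record at a time and never depends on its neighbours, so every call could run in parallel.

```python
def run_kernel(kernel, stream):
    """Apply kernel to every record of a 2D stream independently (a map)."""
    return [[kernel(record) for record in row] for row in stream]

def saxpy_kernel(record, a=2.0):
    """Example kernel: each record is an (x, y) pair, the result is a*x + y."""
    x, y = record
    return a * x + y

# A 2x2 "texture" whose texels each hold one (x, y) record.
grid = [[(1.0, 1.0), (2.0, 0.0)],
        [(0.0, 3.0), (0.5, 0.5)]]

result = run_kernel(saxpy_kernel, grid)
print(result)
```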
Computational Resources
• Programmable parallel processors
  • Vertex & Fragment pipelines
• Rasterizer
  • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  • Read-only memory interface
• Render to texture (or Copy to texture)
  • Write-only memory interface
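Because a texture is read-only during a pass and the render target is write-only, an iterative computation usually needs two buffers whose roles are swapped after every pass. This pattern is commonly called ping-pong; the slide does not name it, so treat the sketch below as one possible reading. The buffers are 1D lists for brevity, and `smooth_pass` and `iterate` are hypothetical names.

```python
def smooth_pass(src, dst):
    """One pass: read only from src (the "texture"), write only to dst."""
    n = len(src)
    for i in range(n):
        left = src[max(i - 1, 0)]        # clamp addresses at the borders
        right = src[min(i + 1, n - 1)]
        dst[i] = (left + src[i] + right) / 3.0

def iterate(values, passes):
    """Run several passes, swapping the read and write buffers each time."""
    src = list(values)
    dst = [0.0] * len(src)
    for _ in range(passes):
        smooth_pass(src, dst)
        src, dst = dst, src              # last pass's output is the next input
    return src

print(iterate([0.0, 3.0, 0.0], passes=1))
```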
Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather (A[i,j] = x;)
  • Can change the location of current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory
• Vertex Texture Fetch
  • Random access memory for vertices
  • Arguably still not gather
Fragment Processor
• May be invoked at each pixel by drawing a full screen quad
• Fully programmable (SIMD)
• Processes 4-vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather (x = A[i+1,j];) and some scatter
  • RAM read (texture), but no RAM write
  • Output address fixed to a specific pixel
  • But can change that address
• Typically more useful than vertex processor
  • More fragment pipelines than vertex pipelines
  • Gather
  • Direct output (fragment processor is at end of pipeline)
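The difference between gather and scatter comes down to which side computes the address. A minimal CPU-side sketch follows, using 1D lists instead of 2D grids; the function names are invented for the example.

```python
def gather_shift(a):
    """Gather: output i reads from a computed input address (i + 1).

    The write address is fixed (i), as for a fragment tied to its pixel.
    """
    n = len(a)
    return [a[min(i + 1, n - 1)] for i in range(n)]  # clamp at the border

def scatter(values, addresses, size, fill=0):
    """Scatter: each input writes to a computed output address.

    The read address is fixed, as for a vertex that can only move itself.
    """
    out = [fill] * size
    for v, addr in zip(values, addresses):
        out[addr] = v
    return out

print(gather_shift([10, 20, 30]))
print(scatter([1, 2, 3], [2, 0, 1], size=3))
```

In the gather version every output location is written exactly once, which is why it maps well onto one fragment per pixel. In the scatter version two inputs may target the same address, so the result can depend on write order.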
Branching
• Not supported or expensive
• Avoid, replace by math
• Depth test
• Stencil test
• Occlusion query (conditional execution)
• Pre-computation (region of interest, use to set stencil mask)
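As an illustration of "replace by math", a conditional select can be rewritten as a 0/1 mask followed by a weighted sum. The sketch below is plain Python with made-up function names; shading languages typically offer built-ins for the same idea. Note that both alternatives are always evaluated, which is the price paid for avoiding the branch.

```python
def select_branching(x, threshold, a, b):
    """Reference version with an explicit branch."""
    if x >= threshold:
        return a
    return b

def select_math(x, threshold, a, b):
    """Branch-free version: build a 0/1 mask, then blend the two results."""
    mask = float(x >= threshold)         # 1.0 when the condition holds, else 0.0
    return mask * a + (1.0 - mask) * b   # both a and b are always computed

for x in (0.2, 0.5, 0.9):
    print(x, select_branching(x, 0.5, 5.0, 7.0), select_math(x, 0.5, 5.0, 7.0))
```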