GPU

GPU
• Precision, Power, Programmability
• CPU: x60/decade, 6 GFLOPS, 6 GB/sec
• GPU: x1000/decade, 20 GFLOPS, 25 GB/sec
• Arithmetic heavy (read OR write): faster hardware
• Parallelization
• Multi-billion $ entertainment market drives innovation
• 32-bit floating point
• Programmable (graphics, physics, general purpose data-flow)
• Can’t simply “port” CPU code to GPU
David Luebke et al. GPGPU, SIGGRAPH 2004
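The per-decade growth factors on this slide are easier to compare as yearly rates. A quick sketch of the conversion (the helper name `annual_growth` is ours, not from the slides):

```python
def annual_growth(per_decade_factor: float) -> float:
    """Convert a growth factor per decade into the equivalent yearly factor."""
    return per_decade_factor ** (1.0 / 10.0)

cpu_yearly = annual_growth(60.0)    # CPU: x60 per decade (figure from the slide)
gpu_yearly = annual_growth(1000.0)  # GPU: x1000 per decade (figure from the slide)

print(f"CPU: x{cpu_yearly:.2f} per year")
print(f"GPU: x{gpu_yearly:.2f} per year")
```

Under these figures, CPU throughput grows by roughly half each year, while GPU throughput roughly doubles.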

History of the 3D graphics industry
• 60s:
  • Line drawings, hidden lines, parametric surfaces (B-splines…)
  • Automated drafting & machining for car, airplane, and ship manufacturers
• 70s:
  • Mainframes, vector tubes (HP…)
  • Software: Solids (CSG), ray tracing, Z-buffer for hidden lines
• 80s:
  • Graphics workstations ($50K-$1M): frame buffers, rasterizers, GL, PHIGS
  • VR: CAVEs and head-mounted displays
  • CAD/CAM & GIS: CATIA, SDRC, PTC
  • Sun, HP, IBM, SGI, E&S, DEC
• 90s:
  • PCs ($2K): graphics boards, OpenGL, Java3D
  • CAD + videogames + animations: AutoCAD, SolidWorks…, Alias-Wavefront
  • Intel, many board vendors
• 00s:
  • Laptops, PDAs, cell phones: parallel graphics chips
  • Everything will be graphics, 3D, animated, interactive
  • Nvidia, Sony, Nokia

History of GPU
• Pre-GPU Graphics Acceleration
  • SGI, Evans & Sutherland. Introduced concepts like vertex transformation and texture mapping. Very expensive!
• First-Generation GPU (up to 1998)
  • Nvidia TNT2, ATI Rage, Voodoo3. Vertex transformation on CPU, limited set of math operations.
• Second-Generation GPU (1999-2000)
  • GeForce 256, GeForce2, Radeon 7500, Savage3D. Transformation & Lighting. More configurable, still not programmable.
• Third-Generation GPU (2001)
  • GeForce3, GeForce4 Ti, Xbox, Radeon 8500. Vertex programmability, pixel-level configurability.
• Fourth-Generation GPU (2002 onward)
  • GeForce FX series, Radeon 9700 and on. Vertex-level and pixel-level programmability.

Architecture
• Application
• Vertex Shader: outputs transformed vertices, normals, colors
• Geometry Shader
• Rasterizer: outputs fragments (surfels per pixel)
• Fragment Shader: reads texture; outputs pixel color, depth, stencil
• Compositor
• Display

Buffers
• Color: 8-bit index to color table, float/16-bit true color…
• Depth: 24-bit or float (0 at back plane)
• Back and front: display front, update back, swap
• Stereo: shutter glasses, HMD. Alternate frames
• Auxiliary: off-screen working space. Helps reduce passes.
• Stencil: 8 bits (left-over of depth buffer). <, >… mask, ++
• Accumulation: sum, scale (supersampling, blur)
• P-buffer, superbuffers: render to texture
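The "display front, update back, swap" cycle can be sketched in a few lines. This is a toy model only: the class `DoubleBuffer` and its methods are hypothetical names for illustration, and real swaps are done by the graphics driver rather than by exchanging Python lists.

```python
class DoubleBuffer:
    """Toy front/back buffer pair; pixels are plain integers."""

    def __init__(self, width, height, clear=0):
        self.front = [[clear] * width for _ in range(height)]  # what is displayed
        self.back = [[clear] * width for _ in range(height)]   # what is being drawn

    def draw(self, x, y, value):
        # Rendering only ever touches the back buffer.
        self.back[y][x] = value

    def swap(self):
        # Exchange roles: the finished frame becomes visible in one step.
        self.front, self.back = self.back, self.front


fb = DoubleBuffer(2, 2)
fb.draw(0, 0, 255)
shown_before = fb.front[0][0]  # the display still shows the old frame
fb.swap()
shown_after = fb.front[0][0]   # the new frame is now visible
```

Because drawing never touches the front buffer, the viewer never sees a half-drawn frame.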

Fragment operations
• Depth tests: <, <=, >, >=, ==, Zdepth-interval
• Stencil test: mask?, counter, parity
• Alpha tests: compare to reference alpha
• Alpha blending: +, max, min, replace, blend
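To make the order of these per-fragment operations concrete, here is a minimal sketch in plain Python. It assumes an alpha test that discards fragments at or below a reference alpha, a configurable depth comparison, and the common "source over destination" blend; the function `process_fragment`, the dictionary layout, and the depth convention (smaller passes under `<`) are illustrative assumptions, not taken from the slides.

```python
import operator

# Depth comparison functions, keyed by the symbols used on the slide.
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq,
}

def process_fragment(pixel, frag, depth_func="<", alpha_ref=0.0):
    """Run the alpha test, then the depth test, then blend into the pixel.

    pixel: dict with 'color' (r, g, b) and 'depth'
    frag:  dict with 'color' (r, g, b), 'alpha' and 'depth'
    Returns the new pixel, or the old one if the fragment is rejected.
    """
    if frag["alpha"] <= alpha_ref:            # alpha test: discard the fragment
        return pixel
    if not DEPTH_FUNCS[depth_func](frag["depth"], pixel["depth"]):
        return pixel                          # depth test failed
    a = frag["alpha"]
    blended = tuple(a * s + (1.0 - a) * d     # blend: source over destination
                    for s, d in zip(frag["color"], pixel["color"]))
    return {"color": blended, "depth": frag["depth"]}


background = {"color": (0.0, 0.0, 0.0), "depth": 1.0}
red = {"color": (1.0, 0.0, 0.0), "alpha": 0.5, "depth": 0.5}
result = process_fragment(background, red)
hidden = process_fragment(result, {"color": (0.0, 1.0, 0.0), "alpha": 1.0, "depth": 0.9})
```

Here `result` holds a half-red pixel at depth 0.5, and the second fragment is rejected by the depth test, so `hidden` is unchanged from `result`.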

Data Parallelism in GPUs
• Data flow: vertices > fragments > pixels
• Parallelism at each stage
• No shared or static data (except textures)
• ALU-heavy (multiple ALUs per stage in pipe)
• Fight memory latency with more computation

GPGPU
• Stream: collection of records (pixels, vertices…)
  • Stored in textures (a computational grid)
• Kernel: function applied to each element in stream
  • Transform, evolve (no dependency between records)
• Matrix algebra
• Image/volume processing
• Physical simulation
• Global illumination
  • Ray tracing
  • Photon mapping
  • Radiosity
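The stream/kernel model can be illustrated on the CPU with a plain map over records. This is only a sketch of the programming model: `run_kernel` and `saxpy_kernel` are hypothetical names, and a real GPU would run the kernel on many records in parallel rather than in a Python loop.

```python
def run_kernel(kernel, stream):
    """Apply `kernel` to every record independently, as in one GPGPU pass."""
    return [kernel(record) for record in stream]

def saxpy_kernel(record, a=2.0):
    """Per-record kernel: depends only on its own input, never on neighbours."""
    x, y = record
    return a * x + y

# A stream is a collection of records; here each record is an (x, y) pair.
stream = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
output = run_kernel(saxpy_kernel, stream)
print(output)
```

Because no record depends on another, the records may be processed in any order, which is what lets the hardware process them in parallel.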

Computational Resources
• Programmable parallel processors
  • Vertex & fragment pipelines
• Rasterizer
  • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  • Read-only memory interface
• Render to texture (or copy to texture)
  • Write-only memory interface

Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather (A[i,j]=x;)
  • Can change the location of current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory
• Vertex Texture Fetch
  • Random access memory for vertices
  • Arguably still not gather

Fragment Processor
• May be invoked at each pixel by drawing a full-screen quad
• Fully programmable (SIMD)
• Processes 4-vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather (x=A[i+1,j];) and some scatter
  • RAM read (texture), but no RAM write
  • Output address fixed to a specific pixel
  • But can change that address
• Typically more useful than vertex processor
  • More fragment pipelines than vertex pipelines
  • Gather
  • Direct output (fragment processor is at end of pipeline)
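The difference between gather (`x = A[i+1,j]`) and scatter (`A[i,j] = x`) is easiest to see side by side. The sketch below emulates both on nested Python lists; the function names and the clamp-at-edge behavior are illustrative assumptions, not part of the slides.

```python
def gather_shift(A):
    """Gather: each output cell (i, j) is fixed, but may read from anywhere.

    Here every cell reads its neighbour A[i+1][j], clamped at the last row.
    """
    rows, cols = len(A), len(A[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            src_i = min(i + 1, rows - 1)   # assumed clamp at the grid edge
            out[i][j] = A[src_i][j]        # x = A[i+1, j]
    return out

def scatter_write(A, writes):
    """Scatter: each element chooses the address it writes to, A[i][j] = x."""
    out = [row[:] for row in A]
    for i, j, x in writes:
        out[i][j] = x
    return out


A = [[1, 2], [3, 4], [5, 6]]
gathered = gather_shift(A)
scattered = scatter_write(A, [(0, 1, 9)])
```

In the gather version the write address is decided by the loop (the "pixel"), which matches a fragment program with a fixed output location; in the scatter version the data decides where it lands.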

Branching
• Not supported or expensive
• Avoid, replace by math
• Depth test
• Stencil test
• Occlusion query (conditional execution)
• Pre-computation (region of interest, use to set stencil mask)
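"Replace by math" usually means computing both outcomes and selecting one with a 0/1 mask. A minimal sketch, with `step` and `mix` written to mirror the shader built-ins of the same names (the two example functions are hypothetical):

```python
def step(edge, x):
    """Return 1.0 when x >= edge, else 0.0 (mirrors the shader `step`)."""
    return float(x >= edge)

def mix(a, b, t):
    """Linear interpolation: a when t == 0, b when t == 1."""
    return a * (1.0 - t) + b * t

def branchy(x):
    # Version with an explicit branch.
    if x >= 0.5:
        return x * 2.0
    return x + 1.0

def branchless(x):
    # Both outcomes are computed; the mask selects one of them.
    mask = step(0.5, x)
    return mix(x + 1.0, x * 2.0, mask)


samples = [0.0, 0.25, 0.5, 0.75, 1.0]
results = [(branchy(x), branchless(x)) for x in samples]
```

The trade-off is that both sides are always evaluated, so this only pays off when each side is cheap compared with the cost of a branch.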
