Lesson 7: GPU & GPGPU

Overview
• Traditional Graphics Pipeline
• Programmable Graphics Pipeline
• Vertex Shader
• Fragment (Pixel) Shader
• Brief Intro to Cg
• GPGPU (General-Purpose GPU)

Generation I: 3dfx Voodoo (1996)
• One of the first true 3D game cards
• Worked by supplementing a standard 2D video card
• Did not do vertex transformations: these were done on the CPU
• Did do texture mapping and z-buffering
http://accelenation.com/?ac.id.123.2
[Diagram labels: Vertex Transforms, Primitive Assembly, Rasterization and Interpolation, Raster Operations, Frame Buffer; CPU connected to GPU over PCI]

Generation II: GeForce 256 / Radeon 7500 (1999-2001)
• Main innovation: shifting the transformation and lighting calculations to the GPU
• Allowed multi-texturing, enabling bump maps, light maps, and other effects
• Faster AGP bus instead of PCI
http://accelenation.com/?ac.id.123.5
[Diagram labels: Vertex Transforms, Primitive Assembly, Rasterization and Interpolation, Raster Operations, Frame Buffer; all on the GPU, attached over AGP]

Generation III: GeForce3 / Radeon 8500 (2001)
• For the first time, allowed a limited amount of programmability in the vertex pipeline
• Also allowed volume texturing and multi-sampling (for antialiasing)
http://accelenation.com/?ac.id.123.7
[Diagram labels: Vertex Transforms (small vertex shaders), Primitive Assembly, Rasterization and Interpolation, Raster Operations, Frame Buffer; all on the GPU, attached over AGP]

Generation IV: Radeon 9700 / GeForce FX (2002)
• This is the first generation of fully programmable graphics cards
• Different versions have different resource limits on fragment/vertex programs
http://accelenation.com/?ac.id.123.8
[Diagram labels: Vertex Transforms (programmable vertex shader), Primitive Assembly, Rasterization and Interpolation (programmable fragment processor), Raster Operations, Frame Buffer; attached over AGP]

Traditional Graphics Pipeline
• A simplified graphics pipeline
• Note that pipe widths vary
• Many caches, FIFOs, and so on are not shown
[Diagram]
CPU: Application
  → Vertices (3D)
GPU: Transform & Light
  → Xformed, Lit Vertices (2D)
GPU: Assemble Primitives
  → Screenspace triangles (2D)
GPU: Rasterize
  → Fragments (pre-pixels)
GPU: Shade
  → Final Pixels (Color, Depth)
Video Memory (Textures), with a render-to-texture feedback path
Graphics State is supplied to the GPU stages.

Pipeline: Transform
• Transform & light
  • Transform from “world space” to “image space”
  • Compute per-vertex lighting

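ModelView Transformation
• Vertices are mapped from object space to world space (M) and then to eye space (V)
• M = model transformation (scene)
• V = view transformation (camera)
Each matrix transform is applied to each vertex in the input stream. Think of this as a kernel operator. With column vectors, M is applied first:
\begin{pmatrix} X' \\ Y' \\ Z' \\ W' \end{pmatrix} = V \, M \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}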
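
The "kernel operator" view of the modelview transform can be sketched on the CPU as follows. This is a minimal illustration, not how any driver implements it: the helper names (mat_mul, mat_vec, translate, transform_vertices) are made up for this example, and a column-vector convention is assumed, so the model matrix is applied first.

```python
# Sketch: apply model and view transforms to every vertex in a stream.
# All function names here are illustrative, not part of any graphics API.

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translate(tx, ty, tz):
    """Build a 4x4 translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def transform_vertices(model, view, vertices):
    """Run the same 'kernel' (one matrix multiply) on each input vertex."""
    modelview = mat_mul(view, model)  # model is applied first, then view
    return [mat_vec(modelview, [x, y, z, 1.0]) for (x, y, z) in vertices]

model = translate(1.0, 0.0, 0.0)   # place the object in the scene
view = translate(0.0, 0.0, -5.0)   # move the scene in front of the camera
print(transform_vertices(model, view, [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]))
```

Each vertex is processed independently of the others, which is what makes this stage easy to run in parallel.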

Lighting
Lighting information is combined with normals and other parameters at each vertex in order to create new colors.
Color(v) = emissive + ambient + diffuse + specular
Each term on the right-hand side is a function of the vertex color, position, normal, and material properties.
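
One way to read the four-term sum is the sketch below. It assumes a single point light, a Blinn-Phong style specular term, and a material stored as a plain dictionary. None of these choices come from the slides, and the names light_vertex and material are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def light_vertex(position, normal, material, light_pos, light_color, eye_pos):
    """Return an RGB color: emissive + ambient + diffuse + specular.

    Assumption: one point light and a Blinn-Phong half-vector specular term.
    """
    n = normalize(normal)
    l = normalize([lp - p for lp, p in zip(light_pos, position)])
    diffuse_factor = max(dot(n, l), 0.0)
    specular_factor = 0.0
    if diffuse_factor > 0.0:
        # Only surfaces facing the light receive a highlight.
        e = normalize([ep - p for ep, p in zip(eye_pos, position)])
        h = normalize([a + b for a, b in zip(l, e)])
        specular_factor = max(dot(n, h), 0.0) ** material["shininess"]
    color = []
    for i in range(3):
        emissive = material["emissive"][i]
        ambient = material["ambient"][i]
        diffuse = material["diffuse"][i] * light_color[i] * diffuse_factor
        specular = material["specular"][i] * light_color[i] * specular_factor
        color.append(emissive + ambient + diffuse + specular)
    return color

material = {
    "emissive": (0.0, 0.0, 0.0),
    "ambient": (0.1, 0.1, 0.1),
    "diffuse": (0.5, 0.0, 0.0),
    "specular": (0.2, 0.2, 0.2),
    "shininess": 20,
}
print(light_vertex((0, 0, 0), (0, 0, 1), material, (0, 0, 5), (1, 1, 1), (0, 0, 5)))
```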

Pipeline: Rasterizer
• Rasterizer
  • Convert geometric representation (vertex) to image representation (fragment)
    • Fragment = image fragment
    • Pixel + associated data: color, depth, stencil, etc.
  • Interpolate per-vertex quantities across pixels
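
The two jobs of the rasterizer (generating fragments and interpolating per-vertex quantities) can be sketched with barycentric weights. This is a simplified CPU illustration that assumes 2D screen-space vertices and samples at pixel centers. Real hardware also handles clipping, fill rules, and perspective correction, which are omitted here. The names edge and rasterize are made up for this example.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return a list of ((x, y), color) fragments covered by the triangle.

    Colors c0..c2 are per-vertex values interpolated across the pixels.
    """
    area = edge(v0, v1, v2)
    fragments = []
    if area == 0:
        return fragments  # degenerate triangle covers no pixels
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)        # sample at the pixel center
            w0 = edge(v1, v2, p) / area   # weight of vertex 0
            w1 = edge(v2, v0, p) / area   # weight of vertex 1
            w2 = edge(v0, v1, p) / area   # weight of vertex 2
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                color = tuple(w0 * a + w1 * b + w2 * c
                              for a, b, c in zip(c0, c1, c2))
                fragments.append(((x, y), color))
    return fragments

frags = rasterize((0, 0), (4, 0), (0, 4),
                  (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 4, 4)
for pixel, color in frags[:3]:
    print(pixel, color)
```

Each fragment carries its pixel position plus the interpolated data, which is what the shading stage consumes next.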

Pipeline: Shade
• Fragment processors (multiple in parallel)
  • Compute a color for each pixel
  • Optionally read colors from textures (images)

The Modern Graphics Pipeline
• Programmable vertex processor!
• Programmable pixel processor!
[Diagram]
CPU: Application
  → Vertices (3D)
GPU: Vertex Processor (replaces Transform & Light)
  → Xformed, Lit Vertices (2D)
GPU: Assemble Primitives
  → Screenspace triangles (2D)
GPU: Rasterize
  → Fragments (pre-pixels)
GPU: Fragment Processor (replaces Shade)
  → Final Pixels (Color, Depth)
Video Memory (Textures), with a render-to-texture feedback path
Graphics State is supplied to the GPU stages.

The Current Graphics Pipeline
• Programmable primitive assembly!
• More flexible memory access!
[Diagram]
CPU: Application
  → Vertices (3D)
GPU: Vertex Processor
  → Xformed, Lit Vertices (2D)
GPU: Geometry Processor (replaces Assemble Primitives)
  → Screenspace triangles (2D)
GPU: Rasterize
  → Fragments (pre-pixels)
GPU: Fragment Processor
  → Final Pixels (Color, Depth)
Video Memory (Textures), with a render-to-texture feedback path
Graphics State is supplied to the GPU stages.

NVIDIA GeForce 6800 3D Pipeline
[Block diagram labels: Vertex; Triangle Setup; Z-Cull; Shader Instruction Dispatch; Fragment; L2 Tex; Fragment Crossbar; Composite; four Memory Partitions]

Precision
• 32-bit IEEE floating-point throughout the pipeline
  • Framebuffer
  • Textures
  • Fragment processor
  • Vertex processor
  • Interpolants

Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
  • Can change the location of the current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory
• Latest GPUs: Vertex Texture Fetch
  • Random access memory for vertices
  • Gather (but not from the vertex stream itself)

Vertex processor capabilities
• 4-vector FP32 operations
• Condition codes + true data-dependent control flow
  • Conditional branches, subroutine calls, jump table
  • Useful for avoiding extra work, e.g.:
    • Don’t do animation or skinning if the vertex will be clipped
    • Do displacement mapping only for vertices near the silhouette
• Transcendental arithmetic instructions (e.g. COS)
• User clip-plane support
• Texture reads (up to 4 textures, unlimited lookups)

Vertex processor limitations
• No arbitrary memory write
• No “vertex kill”
  • Can put vertex off-screen
  • Can make degenerate primitives
• Only 32-bit texture formats supported

Fragment Processor
• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
  • RAM read (texture fetch), but no RAM write
  • Output address fixed to a specific pixel
• Typically more useful than the vertex processor
  • More fragment pipelines than vertex pipelines
  • Direct output (fragment processor is at the end of the pipeline)
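
The gather/scatter distinction can be made concrete with a small CPU-side sketch. The "texture" here is just a Python list and the function names are hypothetical. The point is only that a fragment program chooses which addresses it reads, but not the address it writes.

```python
def blur_fragment(texture, x):
    """Gather: read from computed addresses, return one value for pixel x.

    The output location x is fixed by the rasterizer, not by this code.
    """
    left = texture[max(x - 1, 0)]
    right = texture[min(x + 1, len(texture) - 1)]
    return (left + texture[x] + right) / 3.0

def run_fragment_pass(texture):
    """Invoke the fragment 'kernel' once per output pixel."""
    return [blur_fragment(texture, x) for x in range(len(texture))]

def scatter_on_cpu(values, addresses, size):
    """Scatter: the write address is computed at run time.

    A fragment program of this era cannot express this loop directly.
    """
    out = [0.0] * size
    for value, address in zip(values, addresses):
        out[address] = value
    return out

print(run_fragment_pass([0.0, 3.0, 6.0]))
print(scatter_on_cpu([1.0, 2.0], [2, 0], 3))
```

Algorithms that need scatter are therefore usually rewritten as gathers, or routed through the vertex processor, which can move the output position.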

Fragment processor: texture mapping
• Texture reads are just another instruction
  • Allows computed texture coordinates, nested to arbitrary depth
    • This is a big difference between NVIDIA and ATI right now
  • Allows multiple uses of a single texture unit
• Optional LOD control: can specify filter extent
• Think of it as a memory-read instruction, with optional user-controlled filtering

Fragment processor capabilities
• Dynamic branching
• Conditional fragment-kill instruction
• Read access to window-space position
• Read/write access to fragment Z (but not stencil)
• Multiple render targets
• Built-in derivative instructions
  • Partial derivatives w.r.t. screen-space x or y
  • Useful for anti-aliasing shaders
• FP32, FP16, and fixed-point data

Fragment processor limitations
• Dynamic branching is less efficient than in the vertex processor
  • Especially for non-coherent branching (coherent regions smaller than roughly 30x30 pixels)
  • Can do a lot with condition codes
• No indexed reads from registers
  • I.e., no indexed arrays
  • Must use texture reads instead
• No arbitrary memory write

GPU vendor differences
• Note: this slide will be dated almost instantly
• NVIDIA: as described in previous slides
• ATI hardware today (Radeon X1900 XT is the current high-end part):
  • No vertex texture fetch (but good render-to-vertex-array)
  • Far fewer levels of computed texture coordinates
  • Better at fine-grained (less coherent) dynamic branching
• ATI Xenos (Xbox 360 chip):
  • Unified shader model: vertex proc == pixel proc
  • Scatter support: shaders can write to arbitrary memory locations

Cg: C for Graphics
• Cg is a high-level GPU programming language
• Designed by NVIDIA and Microsoft
• Competes with the (quite similar) GL Shading Language, a.k.a. GLslang

Programming in assembly is painful

Assembly:
    …
    FRC R2.y, C11.w;
    ADD R3.x, C11.w, -R2.y;
    MOV H4.y, R2.y;
    ADD H4.x, -H4.y, C4.w;
    MUL R3.xy, R3.xyww, C11.xyww;
    ADD R3.xy, R3.xyww, C11.z;
    TEX H5, R3, TEX2, 2D;
    ADD R3.x, R3.x, C11.x;
    TEX H6, R3, TEX2, 2D;
    …

Cg:
    …
    L2weight = timeval - floor(timeval);
    L1weight = 1.0 - L2weight;
    ocoord1 = floor(timeval)/64.0 + 1.0/128.0;
    ocoord2 = ocoord1 + 1.0/64.0;
    L1offset = f2tex2D(tex2, float2(ocoord1, 1.0/128.0));
    L2offset = f2tex2D(tex2, float2(ocoord2, 1.0/128.0));
    …

• Easier to read and modify
• Cross-platform
• Combine pieces
• etc.

Some points in the design space
• CPU languages
  • C: close to the hardware; general purpose
  • C++, Java, Lisp: require memory management
• RenderMan: specialized for shading
• Real-time shading languages
  • Stanford shading language
  • Creative Labs shading language

Design strategy
• Start with C (and a bit of C++)
  • Minimizes number of decisions
  • Gives you known mistakes instead of unknown ones
• Allow subsetting of the language
• Add features desired for GPUs
  • To support the GPU programming model
  • To enable high performance
• Tweak to make it fit together well

How are GPUs different from CPUs?
• GPU is a stream processor
  • Multiple programmable processing units
  • Connected by data flows
[Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer, with Textures as a separate memory]

How are GPUs different from CPUs?
• Greater variation in basic capabilities
  • Most processors don’t yet support branching
  • Vertex processors don’t support texture mapping
  • Some processors support additional data types
• Compiler can’t hide these differences
  • Least-common-denominator is too restrictive
  • Cg exposes differences via language profiles (lists of capabilities and data types)
  • Over time, profiles will converge

How are GPUs different from CPUs?
• Optimized for 4-vector arithmetic
  • Useful for graphics: colors, vectors, texcoords
  • Easy way to get high performance/cost
• C philosophy says: expose these HW data types
  • Cg has vector data types and operations, e.g. float2, float3, float4
  • Makes it obvious how to get high performance
  • Cg also has matrix data types, e.g. float3x3, float3x4, float4x4

How are GPUs different from CPUs?
• No support for pointers
  • Arrays are first-class data types in Cg
• No integer data type
  • Cg adds “bool” data type for boolean operations
  • This change isn’t obvious except when declaring variables

Cg basic data types
• All profiles:
  • float
  • bool
• All profiles with texture lookups:
  • sampler1D, sampler2D, sampler3D, samplerCUBE
• NV_fragment_program profile:
  • half: half-precision float
  • fixed: fixed point [-2,2)

Cg Example
• The following fragment program implements a (very) simple toon shader
  • Flat 3-tone shading
    • Highlight
    • Base color
    • Shadow
  • Black silhouettes

Cg Example – part 1

    // In:
    //   eye_space position = TEX7
    //   eye space T = (TEX4.x, TEX5.x, TEX6.x) denormalized
    //   eye space B = (TEX4.y, TEX5.y, TEX6.y) denormalized
    //   eye space N = (TEX4.z, TEX5.z, TEX6.z) denormalized
    fragout frag program main(vf30 In) {
        float m = 30;                              // power
        float3 hiCol   = float3( 1.0, 0.1, 0.1 );  // lit color
        float3 lowCol  = float3( 0.3, 0.0, 0.0 );  // dark color
        float3 specCol = float3( 1.0, 1.0, 1.0 );  // specular color

        // Get eye-space eye vector.
        float3 e = normalize( -In.TEX7.xyz );

        // Get eye-space normal vector.
        float3 n = normalize(float3(In.TEX4.z, In.TEX5.z, In.TEX6.z));

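Cg Example – part 2

        // 0 near the silhouette (normal nearly perpendicular to the eye vector), else 1.
        float edgeMask = (dot(e, n) > 0.4) ? 1 : 0;

        // Light and half-angle vectors for a fixed eye-space light position.
        float3 lpos = float3(3,3,3);
        float3 l = normalize(lpos - In.TEX7.xyz);
        float3 h = normalize(l + e);

        // Threshold the specular and diffuse terms to get flat tones.
        float specMask = (pow(dot(h, n), m) > 0.5) ? 1 : 0;
        float hiMask = (dot(l, n) > 0.4) ? 1 : 0;

        float3 ocol1 = edgeMask * (lerp(lowCol, hiCol, hiMask) + (specMask * specCol));

        fragout O;
        O.COL = float4(ocol1.x, ocol1.y, ocol1.z, 1);
        return O;
    }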
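
To see which inputs land in which tone, the same arithmetic can be traced on the CPU. The sketch below is a Python port written for this purpose, not part of the original example. The function name toon_shade and the light_pos parameter are additions, the position and normal are passed directly instead of through TEX4 to TEX7, and the dot product is clamped to zero before pow, which is an assumption about how negative bases should be handled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def lerp(a, b, t):
    return [x + t * (y - x) for x, y in zip(a, b)]

def toon_shade(position, normal, light_pos=(3.0, 3.0, 3.0)):
    """CPU port of the toon fragment program; inputs are in eye space."""
    m = 30                       # specular power
    hi_col = [1.0, 0.1, 0.1]     # lit color
    low_col = [0.3, 0.0, 0.0]    # dark color
    spec_col = [1.0, 1.0, 1.0]   # specular color

    e = normalize([-p for p in position])   # eye vector (eye is at origin)
    n = normalize(normal)
    edge_mask = 1.0 if dot(e, n) > 0.4 else 0.0

    l = normalize([lp - p for lp, p in zip(light_pos, position)])
    h = normalize([a + b for a, b in zip(l, e)])
    # Assumption: clamp before pow so a negative base never appears.
    spec_mask = 1.0 if max(dot(h, n), 0.0) ** m > 0.5 else 0.0
    hi_mask = 1.0 if dot(l, n) > 0.4 else 0.0

    base = lerp(low_col, hi_col, hi_mask)
    rgb = [edge_mask * (b + spec_mask * s) for b, s in zip(base, spec_col)]
    return rgb + [1.0]

print(toon_shade((0.0, 0.0, -5.0), (0.0, 0.0, 1.0)))                     # facing eye and light
print(toon_shade((0.0, 0.0, -5.0), (1.0, 0.0, 0.0)))                     # silhouette
print(toon_shade((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), (0.0, -10.0, -5.0))) # grazing light
```

Because every mask is a hard threshold, the output takes only a handful of distinct colors, which is what gives the flat toon look.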

GPGPU
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
  • Programmability
  • Precision
  • Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation

Motivation: Computational Power
• GPUs are fast…
  • 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
  • NVIDIA GeForce 7800: 165 GFLOPS
  • 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s (memory bandwidth)
  • ATI Radeon X850 XT Platinum Edition: 37.8 GB/s (memory bandwidth)
• GPUs are getting faster, faster
  • CPUs: 1.4× annual growth
  • GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
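
The annual growth factors compound, which is why the gap widens so quickly. The short calculation below uses only the factors quoted above. The five-year horizon and the function names are arbitrary choices for illustration, and it assumes the rates stay constant, which the slides do not claim.

```python
import math

def compound_growth(annual_factor, years):
    """Total speedup after compounding an annual factor for some years."""
    return annual_factor ** years

def years_to_open_gap(fast_factor, slow_factor, gap):
    """Years until the faster-growing part leads by `gap` times,
    starting from equal performance."""
    return math.log(gap) / math.log(fast_factor / slow_factor)

for label, factor in [("CPU", 1.4), ("GPU pixels", 1.7), ("GPU vertices", 2.3)]:
    print(label, "after 5 years:", round(compound_growth(factor, 5), 2), "x")

print("Years for a 10x pixel-rate lead over CPUs:",
      round(years_to_open_gap(1.7, 1.4, 10), 1))
```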

Motivation: Flexible and Precise
• Modern GPUs are deeply programmable
  • Programmable pixel, vertex, video engines
  • Solidifying high-level language support
• Modern GPUs support high precision
  • 32-bit floating point throughout the pipeline
  • High enough for many (not all) applications

Motivation: The Potential of GPGPU
• The power and flexibility of GPUs make them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

Problems: Difficult To Use
• GPUs designed for & driven by video games
  • Programming model unusual
  • Programming idioms tied to computer graphics
  • Programming environment tightly constrained
• Underlying architectures are:
  • Inherently parallel
  • Rapidly evolving (even in basic feature set!)
  • Largely secret
• Can’t simply “port” CPU code!

GPGPU
• Why use the GPU for general-purpose computing?
• How is it programmed?
