GPU Architecture & Cg • Mark Colbert • PhD Candidate • UCF Graphics Group / MCL • colbert@cs.ucf.edu • © 2006 University of Central Florida
welcome • Assumptions • Experienced in C/C++ • Basic OpenGL/DirectX Knowledge • Some Graphics Knowledge • Familiarity with Geometric Transformations • Linear Algebra • Purpose • Introduction to the Programmable GPU • Extremely Fast Paced • For the Geeky at Heart
overview • GPU Architecture • GPU Pipeline • Introduction to Cg • Implications for GPGPU
GPU • Graphics Processing Unit • Parallelized SIMD Architecture • Denoted as Pipes • 24 fragment pipes on nVidia 7800 • Each Pipe Handles 4 Vector Operations
rules of the game • Not a Generalized Vector Processor • Cannot read and write to the same areas of memory • Limited output capability • Currently, very expensive to output to arbitrary locations in memory
notation • Vertex • A data structure for a point in a mesh, containing position, normal, texture coordinates and more… • Fragment • A pixel, possibly a sub-pixel, of a rasterized image • Shaders • Small programs run on the GPU at specific stages of the GPU pipeline
memory constructs • Buffered Objects • Uniform Registers/State Table • Interpolated Registers • Temporary Registers • Textures
memory constructs • Buffered Objects • CPU Generated Streams of Data • Limited Modifiability • Example • Vertex Data of a Mesh
memory constructs • Uniform Registers/State Table • Constant Data through the Pipeline • Only Necessarily Constant for 1 Polygon • 32 general purpose registers • State Table Specific Registers • Projection/Model View Matrices • Lights • … and more
memory constructs • Interpolated Registers • Per Vertex Data of a Polygon • Stores Information Interpolated Across Polygon • 10 General Purpose Interpolated Registers
memory constructs • Temporary Registers • Standard Notion of Registers • Temporary Registers for In Shader Calculations
memory constructs • Textures • Closest to Random Access Memory • Expensive to Access • Multiple Dependent Accesses Extremely Expensive
GPU pipeline • CPU: Program/API → Driver • Bus • GPU: GPU Front End → Vertex Processing → Primitive Assembly → Rasterization & Interpolation → Fragment Processing → Raster Operations → Framebuffer
GPU pipeline • Program/API • Program • Your Program • API • Either OpenGL or DirectX Interface
GPU pipeline • Driver • Black-box • Implementations are Company Secrets • Largest Bottleneck in many GPU programs
GPU pipeline • GPU Front End • Receives commands & data from driver • PCI Express helps at this stage
GPU pipeline • Vertex Processing • Normally performs transformations • Programmable • Inputs: vertex data (POSITION, NORMAL, BINORMAL*, TANGENT*, TEXCOORD[0-7], COLOR[0-1], PSIZE), shader, textures • Outputs: data for rasterization (POSITION, PSIZE); data for interpolation (FOG, TEXCOORD[0-7], COLOR[0-1])
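The transformation step above can be sketched on the CPU. This is only an illustration, not the deck's code: `Float4`, `Float4x4`, `vertexMain` and `translation` are hypothetical stand-ins for Cg's `float4`, `float4x4` and a vertex program, and the matrix is assumed to use the column-vector convention.

```cpp
// Hypothetical CPU stand-ins for Cg's float4 and float4x4 types.
struct Float4 { float x, y, z, w; };
struct Float4x4 { Float4 row[4]; };

// Four-component dot product, as the Cg intrinsic dot(a, b) would compute.
float dot(const Float4& a, const Float4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Matrix-vector product, treating v as a column vector:
// each output component is the dot product of one matrix row with v.
Float4 mul(const Float4x4& m, const Float4& v) {
    return { dot(m.row[0], v), dot(m.row[1], v), dot(m.row[2], v), dot(m.row[3], v) };
}

// A minimal "vertex program": transform an object-space POSITION by a
// combined model-view-projection matrix (assumed to be supplied as a uniform).
Float4 vertexMain(const Float4x4& modelViewProj, const Float4& position) {
    return mul(modelViewProj, position);
}

// Example matrix for illustration: translate by (tx, ty, tz).
// A real pipeline would also multiply in a projection matrix.
Float4x4 translation(float tx, float ty, float tz) {
    return {{ {1, 0, 0, tx}, {0, 1, 0, ty}, {0, 0, 1, tz}, {0, 0, 0, 1} }};
}
```

Calling `vertexMain(translation(1, 2, 3), Float4{0, 1, 0, 1})` moves the vertex at (0, 1, 0) to (1, 3, 3) and leaves w at 1.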
GPU pipeline • Primitive Assembly • Compiles Vertices into Points, Lines and/or Polygons • Links elements and sets up the rasterizer
GPU pipeline • Rasterization & Interpolation • Rasterization • For each fragment, determine the respective sub-areas of the triangle (Barycentric Coordinates), or the equivalent for other primitives • Interpolation • Rasterizer inputs: Primitive Type (from the Primitive Assembler), data for rasterization (POSITION, PSIZE) • Rasterizer outputs: rasterized data (DEPTH), Barycentric Coordinates • Interpolator inputs: Barycentric Coordinates, data for interpolation (FOG, TEXCOORD[0-7], COLOR[0-1]) • Interpolator outputs: interpolated data (TEXCOORD[0-7], COLOR[0-1])
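A rough sketch of how barycentric coordinates drive interpolation, under stated assumptions: the code works in 2D screen space, ignores perspective correction, and all names (`Float2`, `Bary`, `barycentric`, `interpolate`, `inside`) are hypothetical helpers rather than anything from the GPU or an API.

```cpp
struct Float2 { float x, y; };
struct Bary { float a, b, c; };  // weights for vertices v0, v1, v2

// Twice the signed area of triangle (p, q, r).
float signedArea2(Float2 p, Float2 q, Float2 r) {
    return (q.x - p.x) * (r.y - p.y) - (r.x - p.x) * (q.y - p.y);
}

// Barycentric weights of fragment position f in triangle (v0, v1, v2).
// Each weight is the area of the sub-triangle opposite a vertex divided
// by the area of the whole triangle, so the three weights sum to 1.
Bary barycentric(Float2 v0, Float2 v1, Float2 v2, Float2 f) {
    float total = signedArea2(v0, v1, v2);
    return { signedArea2(f, v1, v2) / total,
             signedArea2(v0, f, v2) / total,
             signedArea2(v0, v1, f) / total };
}

// A fragment lies inside the triangle when no weight is negative.
bool inside(Bary w) {
    return w.a >= 0.0f && w.b >= 0.0f && w.c >= 0.0f;
}

// Interpolate one per-vertex attribute (for example one TEXCOORD component).
float interpolate(Bary w, float attr0, float attr1, float attr2) {
    return w.a * attr0 + w.b * attr1 + w.c * attr2;
}
```

For a triangle with vertices (0,1), (-1,-1), (1,-1), the fragment at (0,-1) gets weights (0, 0.5, 0.5), so an attribute with per-vertex values 0, 1, 0 interpolates to 0.5.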
GPU pipeline • Fragment Processing • Programmable • Inputs: rasterized data (DEPTH), interpolated data (TEXCOORD[0-7], COLOR[0-1]), shader, textures • Outputs: data for raster ops (DEPTH, COLOR[0-3])
GPU pipeline • Raster Operations • Depth Checking • Check framebuffer to see if lesser depth already exists (Z-Buffer) • Limited Programmability • Blending • Use alpha channel to combine the fragment color with colors already in the framebuffer • Limited Programmability
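The two raster operations can be sketched together on the CPU. The depth comparison ("less than") and the blend equation (source alpha over destination) are assumptions chosen for illustration; real hardware lets the application pick among several fixed modes. `Framebuffer`, `Color` and `rasterOp` are hypothetical names.

```cpp
#include <vector>

struct Color { float r, g, b, a; };

// A tiny software framebuffer: one color and one depth value per pixel.
struct Framebuffer {
    int width, height;
    std::vector<Color> color;
    std::vector<float> depth;
    Framebuffer(int w, int h)
        : width(w), height(h),
          color(w * h, Color{0, 0, 0, 0}),
          depth(w * h, 1.0f) {}  // 1.0 is the farthest depth
};

// Apply depth checking, then alpha blending, for one fragment.
// Returns true if the fragment was written to the framebuffer.
bool rasterOp(Framebuffer& fb, int x, int y, Color src, float z) {
    int i = y * fb.width + x;
    // Depth check: reject the fragment if an equal or lesser depth already exists.
    if (z >= fb.depth[i]) return false;
    // Blend: weight the incoming color by its alpha and the stored color by (1 - alpha).
    Color& dst = fb.color[i];
    float keep = 1.0f - src.a;
    dst.r = src.a * src.r + keep * dst.r;
    dst.g = src.a * src.g + keep * dst.g;
    dst.b = src.a * src.b + keep * dst.b;
    dst.a = src.a + keep * dst.a;
    fb.depth[i] = z;
    return true;
}
```

An opaque fragment replaces what is stored, a fragment behind the stored depth is discarded, and a half-transparent fragment in front mixes evenly with the stored color.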
example • Program/API • Code Snippet:
    ….
    glBegin(GL_TRIANGLES);
    glTexCoord2f(1,0); glVertex3f(0,1,0);
    glTexCoord2f(0,1); glVertex3f(-1,-1,0);
    glTexCoord2f(0,0); glVertex3f(1,-1,0);
    glEnd();
    …
example (continued) • [Pipeline diagram repeated across several slides, with illustrations labeled: 01001001100…., viewing frustum, screen space, framebuffer]
quick architecture notes • Limits in Shader Size • Pixel Shader 3.0 Spec • Vertex Program – 65535 asm instructions • Fragment Program – 65535+ asm instructions • MIMD • Branches are supported with a large overhead • Rasterizer & Interpolator • Programmable in DirectX 10 • Geometry Shaders • Unified Shading Architecture • Xbox 360 – ATI • Pool of processors with load balancing
higher level shading languages • Vectorized languages for designing shader programs • Easy way out of tedious assembly coding • Not Perfect • Results Are Sometimes Clearly Not Optimized • Examples • Cg • GLSL • HLSL
Cg • nVidia’s Solution • Nearly Identical to HLSL • C++ Based • New Intrinsic Classes • New Intrinsic Functions • Semantics
Cg • Intrinsic Classes • Vectorized Primitives • e.g. float2, float3, float4 • 16-bit Floating Point Constructs • half, half2, half3, half4 • Not enabled in ARB shaders • Fixed Precision Decimals • fixed, fixed2, fixed3, fixed4 • Not enabled in ARB shaders
Cg • Intrinsic Classes (cont’d) • Membership Access • Constructor • e.g. float4 v = float4(a,b,c,d); • Array Operator • e.g. v[0], v[1], v[2], or v[3] • Swizzle Operator • Re-order/Build Vectors • e.g. v.xyz, v.xxxz, v.yyx, v.yx, v.xyzw • Replaceable with rgba instead of xyzw
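C++ has no swizzle syntax, but the behavior can be imitated to show what the operator does. The sketch below is an emulation only: `Float4`, `componentIndex` and `swizzle` are hypothetical, and the pattern is passed as a string where Cg would use member syntax such as `v.xxxz`.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for Cg's float4, with the array operator.
struct Float4 {
    float v[4];
    float operator[](std::size_t i) const { return v[i]; }
};

// Map a swizzle letter to a component index; xyzw and rgba are interchangeable.
int componentIndex(char c) {
    switch (c) {
        case 'x': case 'r': return 0;
        case 'y': case 'g': return 1;
        case 'z': case 'b': return 2;
        case 'w': case 'a': return 3;
        default: return -1;
    }
}

// Build a new 4-component vector by picking components of v in the order
// given by the pattern. swizzle(v, "xxxz") imitates the Cg expression v.xxxz.
// The array reference forces the pattern to be exactly four letters long.
Float4 swizzle(const Float4& v, const char (&pattern)[5]) {
    Float4 out{};
    for (int i = 0; i < 4; ++i) {
        int idx = componentIndex(pattern[i]);
        if (idx < 0) throw std::invalid_argument("unknown swizzle component");
        out.v[i] = v.v[idx];
    }
    return out;
}
```

With `v` holding (1, 2, 3, 4), `swizzle(v, "xxxz")` yields (1, 1, 1, 3) and `swizzle(v, "abgr")` reverses the vector to (4, 3, 2, 1).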
Cg • Intrinsic Classes (cont’d) • Matrices • Compounded Vector Classes • e.g. float4x4 • Constructed with multiple vectors • float4 v = float4(a,b,c,d); float4x4 m = float4x4(v,v,v,v); • Samplers • Texture Data Type • sampler1D, sampler2D, samplerRECT, sampler3D • samplerRECT – Same as sampler2D but uses pixel locations as texture coordinates instead of normalized coordinates in [0,1]
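The difference between the two coordinate systems comes down to a scale by the texture size. A minimal sketch, assuming a texture of `width` by `height` texels and texel centers at half-integer pixel positions; the helper names are hypothetical.

```cpp
struct Float2 { float x, y; };

// Convert normalized [0,1] coordinates (sampler2D style) to pixel
// coordinates (samplerRECT style) for a width x height texture.
Float2 toRectCoords(Float2 uv, int width, int height) {
    return { uv.x * width, uv.y * height };
}

// Convert pixel coordinates back to normalized [0,1] coordinates.
Float2 toNormalizedCoords(Float2 px, int width, int height) {
    return { px.x / width, px.y / height };
}

// Pixel coordinates of the center of texel (i, j), assuming centers sit
// at half-integer positions.
Float2 texelCenter(int i, int j) {
    return { i + 0.5f, j + 0.5f };
}
```

For a 256 x 128 texture, the normalized coordinate (0.5, 0.25) corresponds to the pixel coordinate (128, 32).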
Cg • Intrinsic Functions • Many have direct correspondence to assembly instructions or good approximations • Linear Algebra Functions • dot(a,b) – Dot Product • mul – Matrix-Matrix, Vector-Matrix, or Matrix-Vector multiplication • Texture Lookup Functions • tex*(sampler* texture, float* texCoord) • * - The dimensionality of the texture
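To make the texture lookup family concrete, here is a CPU sketch of what a 2D lookup does conceptually. It assumes nearest filtering and clamp-to-edge addressing, which on real hardware are configurable texture state; `Texture2D` and `tex2DNearest` are hypothetical names, not Cg functions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Float4 { float x, y, z, w; };

// A texture as a row-major grid of 4-component texels.
struct Texture2D {
    int width, height;
    std::vector<Float4> texels;  // width * height entries
};

// Look up the texel containing normalized coordinate (u, v).
// Coordinates outside [0,1] are clamped to the nearest edge texel.
Float4 tex2DNearest(const Texture2D& tex, float u, float v) {
    int i = static_cast<int>(std::floor(u * tex.width));
    int j = static_cast<int>(std::floor(v * tex.height));
    i = std::max(0, std::min(i, tex.width - 1));
    j = std::max(0, std::min(j, tex.height - 1));
    return tex.texels[j * tex.width + i];
}
```

On a 2 x 2 texture, the coordinate (0.75, 0.25) falls in the second texel of the first row, and out-of-range coordinates snap to the border texels.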
Cg • Intrinsic Functions (cont’d) • Geometric Intrinsics • distance, faceforward, length, normalize, reflect, refract • A good chunk of math.h • Most implemented as Taylor series expansions with two coefficients
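Three of the geometric intrinsics are simple enough to write out. The sketch below mirrors their usual definitions on a hypothetical `Float3` type; `reflect` assumes the normal `n` is already unit length.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

float dot(Float3 a, Float3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Euclidean length of a vector.
float length(Float3 v) {
    return std::sqrt(dot(v, v));
}

// Scale a vector to unit length (undefined for the zero vector).
Float3 normalize(Float3 v) {
    float len = length(v);
    return { v.x / len, v.y / len, v.z / len };
}

// Reflect incident vector i about unit normal n: i - 2 * dot(n, i) * n.
Float3 reflect(Float3 i, Float3 n) {
    float d = 2.0f * dot(n, i);
    return { i.x - d * n.x, i.y - d * n.y, i.z - d * n.z };
}
```

A ray travelling down and to the right, (1, -1, 0), reflected off a floor with normal (0, 1, 0), leaves as (1, 1, 0).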
Cg • Semantics • Binds variables to GPU Memory Constructs • Uniform Registers • In declaration, use keyword uniform in front of variable type • Vertex Data/Interpolated Registers • float* varName : SEMANTIC • Only used as main function parameter or global variable • Textures • Same as uniform variable
FX Composer • Program for quick shader design • Uses Cg as underlying shading language • Additional Semantic Bindings • NOTE: Uses DirectX as base, so uses vector-matrix multiplication notation
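The note about multiplication order matters because the same transform is stored transposed under the two conventions. A small sketch, using hypothetical `Float4` and `Float4x4` types: multiplying a row vector by a matrix gives the same result as multiplying the transposed matrix by a column vector.

```cpp
struct Float4 { float v[4]; };
struct Float4x4 { float m[4][4]; };  // m[row][column]

// Matrix-vector: v is a column vector on the right, out[r] = sum_c m[r][c] * v[c].
Float4 mulMatVec(const Float4x4& a, const Float4& v) {
    Float4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out.v[r] += a.m[r][c] * v.v[c];
    return out;
}

// Vector-matrix: v is a row vector on the left, out[c] = sum_r v[r] * m[r][c].
Float4 mulVecMat(const Float4& v, const Float4x4& a) {
    Float4 out{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            out.v[c] += v.v[r] * a.m[r][c];
    return out;
}

// Swap rows and columns.
Float4x4 transpose(const Float4x4& a) {
    Float4x4 t{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            t.m[c][r] = a.m[r][c];
    return t;
}
```

So a matrix written for the column-vector convention has to be transposed (or the operands of `mul` swapped) before it is used with vector-matrix notation.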
FX Composer • Walkthrough Example
GPGPU • General Purpose GPU Processing • Key Notes • Goal is to exploit the fragment processor • Each pixel represents a compacted 4-component element of data • Best suited to gathering algorithms • Vertex shader needed to re-order output • Possibly Optimal in Unified Shading Architecture
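The gathering pattern can be sketched on the CPU: data is packed four values per texel, and each output texel is computed independently by reading (gathering) from a read-only input, with results written to a separate output. The 3-texel averaging kernel and all names here (`pack`, `average3`, `runPass`) are hypothetical and chosen only to illustrate the access pattern.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };

// Pack a flat array into 4-component texels, padding the last texel with zeros.
std::vector<Float4> pack(const std::vector<float>& data) {
    std::vector<Float4> texels((data.size() + 3) / 4, Float4{0, 0, 0, 0});
    for (std::size_t i = 0; i < data.size(); ++i) {
        Float4& t = texels[i / 4];
        float* lanes[4] = {&t.x, &t.y, &t.z, &t.w};
        *lanes[i % 4] = data[i];
    }
    return texels;
}

// The "fragment program": output texel i gathers input texels i-1, i, i+1
// (clamped at the borders) and averages them component-wise.
Float4 average3(const std::vector<Float4>& in, int i) {
    int last = static_cast<int>(in.size()) - 1;
    const Float4& a = in[std::max(i - 1, 0)];
    const Float4& b = in[i];
    const Float4& c = in[std::min(i + 1, last)];
    return { (a.x + b.x + c.x) / 3.0f, (a.y + b.y + c.y) / 3.0f,
             (a.z + b.z + c.z) / 3.0f, (a.w + b.w + c.w) / 3.0f };
}

// One pass: the input is only read and the output is only written, matching
// the rule that a shader cannot read and write the same memory.
std::vector<Float4> runPass(const std::vector<Float4>& in) {
    std::vector<Float4> out(in.size());
    for (int i = 0; i < static_cast<int>(in.size()); ++i)
        out[i] = average3(in, i);  // each output location is fixed in advance
    return out;
}
```

A scatter, where the kernel decides where to write, does not fit this loop, which is why the deck notes that re-ordering output needs the vertex shader.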