Status – Week 276 Victor Moya
Hardware Pipeline • Command Processor. • Vertex Shader. • Rasterization. • Pixel Shader. • Fragment Operations and Tests.
Command Processor • Receives commands from the CPU (driver, OpenGL/Direct3D). • Fetches data from memory: vertex data (DMA). • Updates and stores OpenGL/Direct3D render state.
Vertex Shader • Transforms and lights vertex streams. • Vertex shader program (from GPU memory?). • Vertex shader constants (from GPU memory?). • Inputs: vertex data 16x4D • Outputs: vertex data 14x4D
Rasterization • Includes: • Clipping • Divide by w • Affine transform • Primitive assembly • Culling • Setup • Fragment generation. • Receives vertices and produces fragments. • Uses OpenGL/Direct3D render state. • Input: vertex (15x4D). • Output: fragments (10x4D).
Pixel Shader • Shades fragments: calculates texture addresses, reads textures, performs color operations. • Pixel Shader program and constants (from GPU memory?). • Texture read: TMU (texture sample, filter unit, texture cache, GPU memory). • Optional: • Modify depth coordinate (1 Z output). • Render to texture (up to 4 color outputs). • Input: fragment (12x4D). • Output: color (2x4D).
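The "divide by w" and "affine transform" steps listed above can be sketched as follows. This is a minimal illustration, assuming OpenGL-style normalized device coordinates in [-1, 1]; the function names and the 640x480 viewport are hypothetical and not taken from the slides.

```python
def perspective_divide(clip):
    """Divide clip-space (x, y, z, w) by w to get normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def viewport_transform(ndc, width, height):
    """Affine map from NDC [-1, 1] to window coordinates; depth is mapped to [0, 1]."""
    x, y, z = ndc
    return ((x + 1.0) * 0.5 * width,
            (y + 1.0) * 0.5 * height,
            (z + 1.0) * 0.5)


# Hypothetical clip-space vertex and viewport size.
ndc = perspective_divide((2.0, -1.0, 0.0, 2.0))
win = viewport_transform(ndc, 640, 480)
```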
Fragment Operations and Tests • Includes (OpenGL): • Fog. • Color Sum. • Ownership Test. • Scissor Test. • Alpha Test. • Stencil Test. • Depth Test. • Blend. • Logic Operation. • Accesses framebuffer (GPU memory). Updates framebuffer. • Framebuffer: color, Z and stencil. • OpenGL/Direct3D render state defines operations. • Input: color. • Output: FB updated.
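As a rough sketch of how a fragment passes through some of these tests before the framebuffer is updated, the example below chains a scissor test, an alpha test and a depth test. The dictionary layout, the `state` keys and the fixed comparison functions (alpha "greater", depth "less") are assumptions for illustration; real render state makes these configurable, and the remaining stages (fog, stencil, blend, logic operation) are omitted.

```python
def process_fragment(frag, framebuffer, state):
    """Return True if the fragment passes all tests and updates the framebuffer."""
    x, y = frag["x"], frag["y"]

    # Scissor test: discard fragments outside the scissor rectangle.
    sx, sy, sw, sh = state["scissor"]
    if not (sx <= x < sx + sw and sy <= y < sy + sh):
        return False

    # Alpha test (assumed "greater" comparison against a reference value).
    if frag["color"][3] <= state["alpha_ref"]:
        return False

    # Depth test (assumed "less" comparison against the stored Z).
    if frag["z"] >= framebuffer["z"][y][x]:
        return False

    # All tests passed: update Z and color.
    framebuffer["z"][y][x] = frag["z"]
    framebuffer["color"][y][x] = frag["color"]
    return True


# Hypothetical 2x2 framebuffer cleared to far depth and black.
fb = {"z": [[1.0, 1.0], [1.0, 1.0]],
      "color": [[(0, 0, 0, 0)] * 2 for _ in range(2)]}
state = {"scissor": (0, 0, 2, 2), "alpha_ref": 0.5}

near = {"x": 1, "y": 0, "z": 0.25, "color": (1.0, 0.0, 0.0, 1.0)}
far = {"x": 1, "y": 0, "z": 0.75, "color": (0.0, 1.0, 0.0, 1.0)}
passed_near = process_fragment(near, fb, state)
passed_far = process_fragment(far, fb, state)  # hidden behind `near`
```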
Vertex Shader • The command processor sends a vertex stream to the vertex shaders. • A vertex buffer stores data read from DMA. • A vertex cache (~ 10 vertices) can be used to avoid executing the vertex shader for the same vertex twice. • The vertex stream is grouped in primitives and sent to the rasterizer.
Vertex Shader Architecture • SIMD architecture. Registers are 128 bits wide, with four 32-bit fields. • Instruction set: typical arithmetic instructions (vector mul, add), some special instructions (ARL, DST), some complex mathematical instructions (EXP, COS), and support for branching, loops and procedures. • 3 different sources of data: • Input stream (~ 16 registers). • Constants (~ 256 registers). • Temporaries (~ 16 registers). • 2 different destinations: • Output stream (~ 15 registers). • Temporaries (~ 16 registers). • Conditional registers (NV30) and boolean constants (R300, DX9) for conditional ‘execution’.
Vertex Shader: NV20 • Exposes programmability of a small part of the geometry pipeline. • Vertex load & store, format conversion, primitive assembly, clipping, triangle setup occur completely in parallel, in pipeline fashion. • Uses 4-wide fine-grained SIMD floating point to provide the necessary performance, and runs multiple execution threads to maintain efficiency and provide a very simple programming model.
NV20: Introduction • Independent vertices. • IEEE single precision FP. • 4 component vectors (x, y, z, w). • Input registers can have their components arbitrarily rearranged/replicated (swizzled). • Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
NV20: Input Attributes • Input Attributes: • 16 quad-float vertex source attribute registers. • Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size. • Default 0.0 for second and third components, 1.0 for the fourth. • Attributes are persistent. • Only one vertex attribute may be read per program instruction. • Constant memory: • 96 quad floats. • Can only be loaded before vertices are processed. • Only one constant may be read by one program instruction. • The program may not write to constants.
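The effect of the vertex cache can be sketched with a small FIFO keyed by vertex index. This is an assumption-laden illustration: the slides do not state the replacement policy, and `VertexCache`, `shade` and the index list are hypothetical names and data.

```python
from collections import OrderedDict


class VertexCache:
    """Small FIFO cache of shaded vertices, keyed by vertex index (assumed policy)."""

    def __init__(self, shader, capacity=10):
        self.shader = shader
        self.capacity = capacity
        self.entries = OrderedDict()
        self.shader_runs = 0  # counts how often the shader actually executed

    def fetch(self, index, vertex):
        if index in self.entries:
            return self.entries[index]  # hit: reuse the shaded result
        self.shader_runs += 1
        result = self.shader(vertex)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[index] = result
        return result


def shade(v):
    # Stand-in for a vertex shader program: scale the position by 2.
    return (v[0] * 2, v[1] * 2)


# Two triangles sharing an edge: vertices 1 and 2 are referenced twice.
vertices = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
indices = [0, 1, 2, 2, 1, 3]
cache = VertexCache(shade)
shaded = [cache.fetch(i, vertices[i]) for i in indices]
```

With six indices but only four distinct vertices, the shader runs four times in this sketch.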
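Swizzling, scalar replication and write masks can be illustrated with plain tuples. This is a minimal sketch of the semantics described above, not the hardware implementation; the helper names are hypothetical.

```python
COMPONENTS = "xyzw"


def swizzle(reg, pattern):
    """Rearrange or replicate components, e.g. 'wzyx' or 'xxxx'."""
    return tuple(reg[COMPONENTS.index(c)] for c in pattern)


def dp3(a, b):
    """3-component dot product; the scalar result is replicated to all components."""
    s = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return (s, s, s, s)


def masked_write(dest, value, mask):
    """Write only the components named in the mask; others keep their old value."""
    return tuple(v if c in mask else d
                 for c, d, v in zip(COMPONENTS, dest, value))


r0 = (1, 2, 3, 4)
reversed_r0 = swizzle(r0, "wzyx")
dot = dp3(r0, (4, 5, 6, 7))
out = masked_write((0, 0, 0, 1), dot, "xz")
```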
NV20: Input Attributes • Integer address register: • Loaded using ARL. • Indexed constant reads with out-of-range reads returning (0,0,0,0). • Read/Write register file: • 12 quad floats. • Three reads and one write per instruction. • Initialized to (0,0,0,0) per vertex. • Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
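Indexed constant reads through the address register can be sketched as below. The 96-entry constant memory and the (0,0,0,0) result for out-of-range reads come from the slides; the use of floor when converting to an integer in `arl` is an assumption, as are the function names.

```python
import math

NUM_CONSTANTS = 96
ZERO = (0.0, 0.0, 0.0, 0.0)


def arl(value):
    """Load the integer address register from a float (assumed to round down)."""
    return math.floor(value)


def read_constant(constants, base, a0=0):
    """Read c[base + a0]; out-of-range reads return (0, 0, 0, 0)."""
    index = base + a0
    if 0 <= index < len(constants):
        return constants[index]
    return ZERO


# Hypothetical constant memory: c[i] = (i, i, i, i).
constants = [(float(i),) * 4 for i in range(NUM_CONSTANTS)]
a0 = arl(2.7)
in_range = read_constant(constants, 10, a0)      # reads c[12]
out_of_range = read_constant(constants, 95, a0)  # c[97] does not exist
```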
NV20: Output attributes • Standard mapping for the fixed function pipeline at the homogeneous clip space point. • Position for clipping. • Vertex color output clamped to the range 0.0 to 1.0. • Fog distance, point size. • 8 texture coordinates. • All instruction writes have an optional 4-component write mask. • Initialized to (0.0, 0.0, 0.0, 1.0).
NV20: Instruction Set. • No branching. • Constant latency: issue any instruction per clock and execute all instructions with the same latency. All operands are immediately available, limiting the size of registers and memory banks.
NV20: Hardware Implementation • Two blocks: vertex attribute buffer (VAB) and the floating point core.
NV20: VAB • The VAB is responsible for vertex attribute persistence. • 16 input attributes. • When a write to an address is received, the register is set to the defaults (0.0, 0.0, 0.0, 1.0) and the valid data then overwrites the corresponding components. • The VAB drains into a number of input buffers (IB) that are used to feed the FP core in a round robin fashion. • Dirty bits are maintained in the VAB so only changed attributes are updated when the same buffer is again the drain target. • The transfer of a vertex is triggered by a write to address 0 (vertex position). • To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.
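The default-fill, persistence and dirty-bit behaviour can be sketched as follows. This is a simplified model that assumes a single input buffer (the real VAB tracks dirty bits per drain target and feeds several buffers round robin); the class and attribute names are hypothetical.

```python
DEFAULTS = (0.0, 0.0, 0.0, 1.0)


class VertexAttributeBuffer:
    """Simplified VAB model with one drain target."""

    def __init__(self, num_attributes=16):
        self.attributes = [DEFAULTS] * num_attributes
        self.dirty = [False] * num_attributes
        self.drained = []  # (snapshot of attributes, indices that changed)

    def write(self, address, data):
        # Components not supplied fall back to the defaults (0, 0, 0, 1).
        self.attributes[address] = tuple(data) + DEFAULTS[len(data):]
        self.dirty[address] = True
        if address == 0:  # a write to the position triggers the vertex transfer
            self.drain()

    def drain(self):
        changed = [i for i, d in enumerate(self.dirty) if d]
        self.drained.append((list(self.attributes), changed))
        self.dirty = [False] * len(self.dirty)


vab = VertexAttributeBuffer()
vab.write(3, (1.0, 0.5))       # two components given, the rest defaulted
vab.write(0, (1.0, 2.0, 3.0))  # first vertex
vab.write(0, (4.0, 5.0, 6.0))  # second vertex reuses attribute 3
```

In this sketch the second vertex only needs attribute 0 updated, because attribute 3 persists from the first vertex.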
NV20: Floating Point Core • Processes the instruction set. • Multithreaded vector processor operating on quad-float data. • Vertex data read from input buffers and transformed into output buffers (OB). • Same latency for vector and special function units. • Multiple vertex threads are used to hide this latency. • SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE. • Special FU: RCP, RSQ, LOG, EXP, LIT. • VU is approximately IEEE (no denormalized numbers or exceptions, rounding always toward negative infinity). • 1 instruction per clock and all input/output options have no performance penalty. • All input vectors are available with no latency.
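Why multiple vertex threads hide the fixed instruction latency can be seen in a toy issue model. The assumptions here are not from the slides: each instruction in a thread depends on the previous one, one instruction issues per clock, and the latency of 6 clocks and the thread counts are hypothetical.

```python
def cycles_to_finish(num_threads, instructions_per_thread, latency):
    """Clocks until all threads finish, issuing at most one instruction per clock."""
    ready_at = [0] * num_threads  # clock at which each thread may issue again
    remaining = [instructions_per_thread] * num_threads
    clock = 0
    finish = 0
    next_thread = 0
    while any(remaining):
        # Round robin: pick the first ready thread starting at next_thread.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] and ready_at[t] <= clock:
                remaining[t] -= 1
                ready_at[t] = clock + latency
                finish = max(finish, ready_at[t])
                next_thread = (t + 1) % num_threads
                break
        clock += 1
    return finish


single = cycles_to_finish(num_threads=1, instructions_per_thread=4, latency=6)
threaded = cycles_to_finish(num_threads=6, instructions_per_thread=4, latency=6)
```

In this model one thread needs 24 clocks for 4 instructions, while six threads complete 24 instructions in 29 clocks, because the issue slot is never left idle.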
Vertex Shader: R300 • 4 vertex shader units. • 1 scalar unit, 1 vector unit. • Registers: • ALU Registers: • Constants: 256 read only vectors. • Temporary: 12 read/write vectors. • Input: 16 read only vectors. • Output: 15 write only vectors. • Flow Control Registers: • Integer Constant: 16 read only vectors. • Address: 1 read/write vector. • Loop Counter: 1 scalar. • Boolean Constant: 16 read only bits.
R300: Instructions • Shaders up to 256 instructions long. • Up to 64K executed instructions per vertex. • ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT. • Control Flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN. • Address Instructions: ARL, ARR. • Graphic Instructions: DST, LIT. • Instructions based on DX9 VS2.0.
NV30: Overview • Supports all VS1 instructions and features. • Beyond VS2? • Condition codes. • Branches and subroutines. • Modifiers: absolute. • User clip support (new output registers CLP0-CLP5). • New instructions. • More registers.
NV30: Overview • Up to 256 instructions per program. • Up to 64K executed instructions per vertex. • 16 temporary registers. • 2 vector address registers. • 256 program parameters (constants).
NV30: Condition Codes • 4 component register: • LT: less than zero. • EQ: equal to zero. • GT: greater than zero. • UN: unordered, for comparisons involving NaN. • Instructions optionally update condition code state: • “C” suffix: DP4C, MOVC. • “CC” pseudo register for updating condition codes. • Condition code used in: • Branches and procedure call/return. • Result masking.
NV30: Modifiers • Source: • Swizzle • Negate • Absolute • Target: • Masking • Conditional masking
NV30: Branching and subroutines • BRA • Unconditional. • Conditional: BRA label (LE.xyww) • Computed (indirect): BRA [A1.z] (GT.x) • Call & return for subroutines. • CAL & RET. • Same options as with branches. • Four levels of subroutine execution. • No parameter stack.
NV30: Clipping • New output registers: o[CLP0]..o[CLP5]. • GL_CLIP_PLANEn enabled. • Clip coordinate n interpolated across the primitive. • Only the portion of the primitive where the clip coordinate is greater than zero is rasterized. • Hardware performs fast trivial reject if all clip coordinates of a primitive are negative.
NV30: New Instructions • ARL: now supports loading the 4-component A0 and A1 integer registers. • ARR: like ARL except rounds rather than truncates before storing integer result in an address register. • BRA, CAL, RET: branching instructions. • COS, SIN: high precision trigonometric functions. • FLR, FRC: floor and fraction of floating point values. • EX2, LG2: high-precision exponentiation and logarithm functions. • ARA: adds pairs of components of an address register, useful for looping and other operations. • SEQ, SFL, SGT, SLE, SNE, STR: add six “set on” instructions similar to SLT and SGE. • SSG: “set sign” operation generates a vector holding –1.0 for negative operand components, 0 for zero components, and +1.0 for positive components.
NV30: Instruction List • Add & multiply instructions: ADD, DP3, DP4, DPH, MAD, MOV, SUB. • Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG, RCP, RSQ, SIN. • Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR. • Branching instructions: BRA, CAL, RET. • Address register instructions: ARL, ARR, ARA. • Graphics-oriented instructions: DST, LIT, RCC, SSG. • Minimum/maximum instructions: MAX, MIN.
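The per-component condition codes and their use for result masking can be sketched as below. The four states come from the slide; the way compound tests such as LE, GE and NE combine those states is an assumption, and the function names are hypothetical.

```python
import math


def condition_code(value):
    """Classify one component as LT, EQ, GT or UN (unordered, for NaN)."""
    if math.isnan(value):
        return "UN"
    if value < 0:
        return "LT"
    if value > 0:
        return "GT"
    return "EQ"


def update_cc(result):
    """What a 'C'-suffixed instruction (e.g. MOVC) would record for its result."""
    return tuple(condition_code(v) for v in result)


# Assumed mapping from a test name to the states that satisfy it.
TESTS = {
    "LT": {"LT"}, "EQ": {"EQ"}, "GT": {"GT"},
    "LE": {"LT", "EQ"}, "GE": {"GT", "EQ"}, "NE": {"LT", "GT", "UN"},
}


def conditional_write(dest, value, cc, test):
    """Write only the components whose condition code passes the test."""
    return tuple(v if c in TESTS[test] else d
                 for d, v, c in zip(dest, value, cc))


cc = update_cc((-1.0, 0.0, 2.0, float("nan")))
masked = conditional_write((9, 9, 9, 9), (1, 2, 3, 4), cc, "LE")
```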
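The trivial reject rule can be sketched as follows. How a vertex program computes the value written to o[CLPn] is up to the program; the dot product of a plane equation with the vertex position used here is an assumption, as are the function names and the sample plane.

```python
def clip_coordinate(plane, position):
    """Assumed clip coordinate: 4-component dot product of plane and position."""
    return sum(p * v for p, v in zip(plane, position))


def trivially_rejected(clip_coordinates):
    """A primitive is rejected outright if every vertex is on the negative side."""
    return all(c < 0 for c in clip_coordinates)


plane = (1.0, 0.0, 0.0, 0.0)  # hypothetical plane keeping the x > 0 half-space

outside = [(-1.0, 0.0, 0.0, 1.0), (-2.0, 1.0, 0.0, 1.0), (-3.0, 0.0, 1.0, 1.0)]
straddling = [(-1.0, 0.0, 0.0, 1.0), (1.0, 1.0, 0.0, 1.0), (2.0, 0.0, 1.0, 1.0)]

rejected = trivially_rejected([clip_coordinate(plane, v) for v in outside])
kept = not trivially_rejected([clip_coordinate(plane, v) for v in straddling])
```

The straddling triangle is kept, and only the part where the interpolated clip coordinate is greater than zero would be rasterized.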
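The component-wise behaviour of SSG, FLR and FRC can be sketched with plain tuples. This is an illustration of the descriptions above only; treating the fraction as x minus floor(x) is an assumption, and the helper names are hypothetical.

```python
import math


def ssg(v):
    """Set sign: -1.0 for negative, 0.0 for zero, +1.0 for positive components."""
    return tuple(-1.0 if c < 0 else (1.0 if c > 0 else 0.0) for c in v)


def flr(v):
    """Floor of each component."""
    return tuple(float(math.floor(c)) for c in v)


def frc(v):
    """Fraction of each component, assumed to be x - floor(x)."""
    return tuple(c - math.floor(c) for c in v)


signs = ssg((-2.5, 0.0, 3.0, 0.5))
floors = flr((1.5, -1.5, 2.0, -0.25))
fractions = frc((1.5, -1.5, 2.0, -0.25))
```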
Others • Antialiasing • Anisotropic Filtering (textures). • Line Antialiasing. • Edge Antialiasing. • Full Screen Antialiasing (FSAA): • Supersampling. • MultiSampling. • TBDR: Tile Based Deferred Rendering (STMicro PowerVR). • HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.