Status – Week 242 Victor Moya
Summary • Current status. • Tests. • XBox documentation. • Post Vertex Shader geometry. • Rasterization.
Current Status • Basic Command Processor. • Read/Write GPU registers. • Read/Write GPU memory. • GPU commands. • No DMA/AGP data access. • Basic Memory Controller. • 1 transaction served per cycle. • Memory module access latency accounted for. • Transmission latency accounted for. • 3 buses (req/write + data): CP, StreamerFetch, StreamerLoader.
Current Status • Shader (Vertex Shader). • Multithreaded. • F/D/E/W pipeline. • Variable execution latency. • Dependency checking currently works on full registers; it should be component based. • Problems with the ‘ending’ instruction (it requires something to fetch after it and takes many cycles). • No branches (the support code exists but the instructions are not implemented). • No texture access (memory).
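The difference between full-register and component-based dependency checking can be illustrated with a small sketch. It is not the simulator's code: the 4-bit mask encoding and the names `mask`, `full_register_hazard` and `component_hazard` are assumptions made for the example.

```python
# Hypothetical sketch of read-after-write hazard checks in a shader pipeline.
# Each register has four components (x, y, z, w) tracked as one bit each.
COMPONENT_BIT = {"x": 1, "y": 2, "z": 4, "w": 8}


def mask(components):
    """Turn a component string such as 'xz' into a 4-bit mask."""
    bits = 0
    for c in components:
        bits |= COMPONENT_BIT[c]
    return bits


def full_register_hazard(pending_writes, reg, read_components):
    """Full-register check: any pending write to the register blocks the read."""
    return pending_writes.get(reg, 0) != 0


def component_hazard(pending_writes, reg, read_components):
    """Component check: block only if a component being read is still pending."""
    return (pending_writes.get(reg, 0) & mask(read_components)) != 0


# Example state: writes to r0.x and r0.y are still in flight.
pending = {"r0": mask("xy")}
```

With this state, an instruction reading only `r0.w` stalls under the full-register check but can issue under the component-based one.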
Current Status • Streamer. • Pipelined: • Hit: Fetch/OCache/Insert/Commit • Miss: Fetch/OCache/IRQInsert/IRQRead/AttrLoad/Sh/Store/Commit. • Stream and index based modes implemented. • No pre T&L cache (should be added to Streamer Loader?). • Supports out of order vertexes (shader or memory). • Doesn’t support data from the AGP.
Current Status • Streamer: • Streamer Loader pipeline should be (in hardware): • Insert in the IRQ. • Load from IRQ. • Setup Input: start address + address increment for each active attribute. • Attribute Load: request attribute to MC, increment address generators. • Issue to Shader. • IRQ should be implemented with a pre T&L cache.
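The Setup Input and Attribute Load steps above can be sketched as follows, assuming stream (sequential) mode only. The class and function names, the attribute names and the addresses are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of per-attribute address generators (stream mode only).
class AddressGenerator:
    """Holds the next address of one attribute stream and its increment."""

    def __init__(self, start, increment):
        self.address = start
        self.increment = increment

    def next(self):
        """Return the current address and advance to the next vertex."""
        addr = self.address
        self.address += self.increment
        return addr


def setup_input(streams):
    """Setup Input: one generator per active attribute.

    streams maps an attribute name to (start_address, address_increment).
    """
    return {attr: AddressGenerator(start, inc)
            for attr, (start, inc) in streams.items()}


def attribute_load(generators):
    """Attribute Load: the memory requests for one vertex, then advance."""
    return [(attr, gen.next()) for attr, gen in generators.items()]
```

Calling `attribute_load` once per vertex yields the list of requests that would be sent to the Memory Controller for that vertex. An index-based mode would instead compute `start + index * increment` for each attribute.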
Current Status • Comments: • Currently the signal latency/bandwidth is specified with raw numbers. Alternatives: • Use constants. Store in a single ‘signal definition’ file for all units or in separate units (must be shared between the two boxes connected by the signal). • Use some kind of Architecture Description for signal delays, bandwidth, data bus width (to be used in memory transmission calculations and similar). • Currently most units only support single issue/fetch/process. Should be ‘generalized’ to multiissue/fetch/process and parametrized.
Current Status • Signal Trace Analyzer -> Carlos.
Tests • OpenGL test trace: • Used glutSolidSphere with (1, 100, 100) as parameters: • 100 batches. • 2 triangle strips (200 triangles). • 98 quad strips (9800 quads). • 20000 vertices. • Added a lighting shader replacing the normal model view + project matrix transformation: one green light at infinity with diffuse and specular components. • 10 shader instructions.
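The batch, primitive and vertex counts above are consistent with each other, as the short check below shows. It assumes that every strip goes once around the sphere (100 primitives per strip) and that no vertices are shared between batches; the variable names are illustrative.

```python
# Cross-check of the trace statistics for a sphere with 100 slices and 100 stacks.
SLICES = 100
STACKS = 100

triangle_strips = 2              # the two polar caps
quad_strips = STACKS - 2         # the remaining bands

batches = triangle_strips + quad_strips
triangles = triangle_strips * SLICES
quads = quad_strips * SLICES

# A triangle strip with n triangles needs n + 2 vertices;
# a quad strip with n quads needs 2n + 2 vertices.
vertices = (triangle_strips * (SLICES + 2)
            + quad_strips * (2 * SLICES + 2))

print(batches, triangles, quads, vertices)
```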
Tests • Light shader:
  //
  // i0      Vertex position
  // i2      Vertex normal
  //
  // c0 - c3 Model View-Project matrix
  // c4      Light direction
  // c5      Light half vector
  // c6.x    Material shininess
  // c7      Light ambient color
  // c8      Light diffuse color
  // c9      Light specular color
  //
  // o0      Vertex position (transformed)
  // o1      Vertex color
  //

  // Vertex Model View-Project transformation
  dp4 o0.x, c0, i0
  dp4 o0.y, c1, i0
  dp4 o0.z, c2, i0
  dp4 o0.w, c3, i0

  // Compute diffuse and specular dot products and
  // use LIT to compute lighting coefficients
  dp3 r0.x, i2, c4
  dp3 r0.y, i2, c5
  mov r0.w, c6.x
  lit r0, r0

  // Accumulate color contributions
  mad r1, r0.y, c8, c7
  mad o1, r0.z, c9, r1

  // Finish shader.
  end
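To make the data flow of the light shader easier to follow, here is a rough Python emulation of it. It is a sketch, not the simulator's implementation: `lit` follows the usual vertex-program LIT definition (result = 1, max(N·L, 0), specular term, 1) with the exponent clamped to ±128, and it assumes a positive shininess. The function names and the layout of `c` as a list of ten 4-component vectors are assumptions.

```python
# Rough emulation of the light shader; c is a list of ten 4-component constants.
def dp(a, b, n):
    """n-component dot product (dp3 / dp4)."""
    return sum(a[i] * b[i] for i in range(n))


def lit(src):
    """Lighting coefficients from (N.L, N.H, -, shininess)."""
    n_dot_l = max(src[0], 0.0)
    n_dot_h = max(src[1], 0.0)
    power = max(-128.0, min(128.0, src[3]))
    specular = n_dot_h ** power if n_dot_l > 0.0 else 0.0
    return [1.0, n_dot_l, specular, 1.0]


def light_shader(i0, i2, c):
    """Return (o0, o1): transformed position and vertex color."""
    # dp4 o0.{x,y,z,w}, c0..c3, i0
    o0 = [dp(c[k], i0, 4) for k in range(4)]
    # dp3 r0.x, i2, c4 / dp3 r0.y, i2, c5 / mov r0.w, c6.x
    r0 = [dp(i2, c[4], 3), dp(i2, c[5], 3), 0.0, c[6][0]]
    # lit r0, r0
    r0 = lit(r0)
    # mad r1, r0.y, c8, c7   (diffuse * coefficient + ambient)
    r1 = [r0[1] * c[8][k] + c[7][k] for k in range(4)]
    # mad o1, r0.z, c9, r1   (specular * coefficient + r1)
    o1 = [r0[2] * c[9][k] + r1[k] for k in range(4)]
    return o0, o1
```

For a vertex whose normal faces away from the light, both coefficients are zero and the output color reduces to the ambient term `c7`.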
Tests • Results: • Simulated cycles: ~350K. • Simulation time: ~30s. • Signal trace size: ~150MB.
Tests • Bugs: • TraceReader::parseFP() failed to correctly read a negative number with a 0 before the decimal point. • GPU_CLAMP was using ‘<’ and ‘>’ when it should be using ‘<=’ and ‘>=’. • ShaderDecodeExecute allowed later instructions of the same thread to execute after an instruction that was blocked by a data dependency.
Tests • Changes: • Now ShaderDecodeExecute ignores any instruction received after an end instruction. • Added QUAD and QUADSTRIP support to the simulator (GPU.h, Rasterizer, Drawer). • Vertex color is clamped to 0.0 – 1.0 before being sent to OpenGL (Drawer). The correct behaviour would be to clamp color attributes when they exit the shader. • Added glNormal3f and glFrustum OpenGL functions to the TraceReader and OGLtoAGPTransaction.
Tests • Changes: • OGLtoAGPTransaction now supports a third vertex attribute: normal. • OGLtoAGPTransaction now supports a ‘special’ shader mode (the one used for the light test). No support for OpenGL lighting is implemented.
Tests • Further tests: • Try to implement a sphere using Icosahedron subdivision to create a triangle strip mesh to test the index stream mode.
XBox Documentation • Interesting information about the Vertex Shader architecture and the T&L pipeline down to the Primitive Assembly Cache and the Triangle Setup. • Includes estimated sizes and clock latencies for most of the operations.
XBox Documentation • T&L pipeline (diagram):
  Memory (200 MHz)
    → cache line (raw vertex data)
  Pre T&L Cache (4 KB, 4-way set associative, 128 32-B cache lines)
    → raw vertex
  Vertex Shader
    → transformed and lit vertex
  Post T&L Cache (16 – 24 entry FIFO)
    → transformed and lit vertex
  Primitive Assembly (3 vertices)
    → 3 transformed and lit vertices
  Triangle Setup
  Rasterization
XBOX • Differences with our simulator: • No Pre T&L cache. • The Post T&L cache seems to be accessed by the Primitive Assembly Cache. However, we push the vertex to the Rasterizer (or whatever lies after the shader). • Sending the shaded vertex to the primitive assembly takes multiple cycles (2+) depending on the number of attributes used by the vertex.
XBOX Vertex Shader • Registers: • 16 input registers. • 12 temporary registers. • 192 constant registers. • 1 address register. • 11 output registers.
XBOX Vertex Shader • Instructions: • Shader Operations: • 13 MAC opcodes. • 7 ILU (inverse logic unit) opcodes. • 136 microcode instructions. Each instruction can: • Read three registers with swizzle and negation. • Compute one MAC op and one ILU op. • Write up to one output register and two temporary registers with masking. • Shader types: • Normal vertex shaders. • Read/write vertex shaders. • Vertex state shaders.
XBOX Vertex Shaders • Timing: • The clock speed is 250 MHz. • For normal shaders, instructions take between one-half cycle and one cycle to complete. • For read/write and vertex state shaders, instructions take between one cycle and six cycles to complete.
XBOX Vertex Shaders • Multithreaded: • Two copies of the vertex shader pipeline (2 VS). • Each copy can run up to three threads (3 active threads per shader). • Read/write vertex shaders and vertex state shaders run single threaded, on a single pipeline. • Stalling: • Instructions take six cycles to compute their outputs. • Bypasses: ALU, ILU and MLU bypasses. • Three cycles latency with bypasses. • Bypass allows swizzling and negation of the result.
Post Vertex Shader • (based on the 3DLabs OpenGL2 overview). • Primitive assembly. • User clipping. • Frustum clipping. • Perspective projection. • Viewport Mapping. • Polygon offset. • Polygon mode. • Shade mode. • Culling.
Post Vertex Shader • Primitive Assembly: • Get the three vertexes of a triangle. • Triangles: keep the last three vertexes, generate a primitive with each new group of three vertexes. • Triangle strip: keep the last three vertexes, generate a primitive with each new vertex (after the second). • Triangle fan: keep the first vertex and the last two vertexes, generate a primitive with each new vertex (after the second). • Similar with other primitives.
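The three assembly rules above can be sketched in a few lines. This is an illustrative sketch, not the simulator's code: the function name and mode strings are assumptions, vertices are represented by arbitrary values (indices here), and the winding-order alternation that OpenGL applies to strips is ignored.

```python
# Hypothetical sketch of triangle assembly for lists, strips and fans.
def assemble(mode, vertices):
    """Return the list of triangles generated from a vertex sequence."""
    triangles = []
    if mode == "triangles":
        # One primitive for each complete group of three vertexes.
        for i in range(0, len(vertices) - 2, 3):
            triangles.append(tuple(vertices[i:i + 3]))
    elif mode == "strip":
        # One primitive per new vertex after the second, using the last three.
        for i in range(2, len(vertices)):
            triangles.append((vertices[i - 2], vertices[i - 1], vertices[i]))
    elif mode == "fan":
        # One primitive per new vertex after the second: first vertex + last two.
        for i in range(2, len(vertices)):
            triangles.append((vertices[0], vertices[i - 1], vertices[i]))
    else:
        raise ValueError("unknown primitive mode: " + mode)
    return triangles
```

In hardware the same rules only need storage for three vertexes, since each rule looks at the first vertex and the two or three most recent ones.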
Post Vertex Shader • User clipping: • At least 6 user clip planes. • Define a clip volume. • glClipPlane(enum p, double eqn[4]). • (p1 p2 p3 p4) · (x y z w) >= 0 • Frustum clipping: • View volume. • -w <= x <= w • -w <= y <= w • -w <= z <= w
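Both clip tests above are simple per-vertex predicates, sketched below with hypothetical function names. In OpenGL the user plane test is applied in eye coordinates and the view volume test in clip coordinates; the sketch takes whatever 4-component vertex it is given.

```python
# Per-vertex inside/outside tests for user clip planes and the view volume.
def inside_user_plane(plane, vertex):
    """True when (p1 p2 p3 p4) . (x y z w) >= 0."""
    return sum(p * c for p, c in zip(plane, vertex)) >= 0.0


def inside_frustum(vertex):
    """True when -w <= x, y, z <= w."""
    x, y, z, w = vertex
    return -w <= x <= w and -w <= y <= w and -w <= z <= w
```

A primitive whose vertexes all pass can be accepted as is, and one whose vertexes all fail the same plane can be rejected; only the mixed case needs real clipping.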
Post Vertex Shader • Clipping: • Clip polygon => add new vertexes => tessellate. • Clip triangle => add new vertexes => retessellate. • Use rasterization in homogeneous coordinates: just add more clipping edges. • Guard Band Clipping (scissor).
Post Vertex Shader • Divide by w. • Viewport transformation. • Scale to screen/window coordinate system. • glViewport(x, y, w, h) • glDepthRange(clampd n, clampd f) • xw = (px/2)*xd + ox • yw = (py/2)*yd + oy • zw = [(f-n)/2]*zd + (n + f)/2 • ox = x + w/2 • oy = y + h/2 • px = w • py = h
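The divide by w and the viewport equations above combine into one small function. This is a sketch with an assumed name (`to_window`); its parameters follow glViewport(x, y, w, h) and glDepthRange(n, f), and the clip-space w is named `cw` to keep it apart from the viewport width `w`.

```python
# Perspective divide followed by the viewport transformation.
def to_window(clip, x, y, w, h, n=0.0, f=1.0):
    """Map a clip-space vertex (cx, cy, cz, cw) to window coordinates."""
    cx, cy, cz, cw = clip
    # Divide by w: normalized device coordinates.
    xd, yd, zd = cx / cw, cy / cw, cz / cw
    # Viewport centre and size.
    ox, oy = x + w / 2, y + h / 2
    px, py = w, h
    xw = (px / 2) * xd + ox
    yw = (py / 2) * yd + oy
    zw = ((f - n) / 2) * zd + (n + f) / 2
    return xw, yw, zw
```

For a 640x480 viewport at the origin, the clip-space point (0, 0, 0, 1) lands at the centre of the window with depth 0.5.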
Post Vertex Shader • Back face culling: • Can be calculated using the area of the triangle (determinant of the three vertexes in homogeneous coordinates). • Negative or positive area. • Can also be used to cull zero area triangles.
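The determinant test can be sketched as below. The sketch assumes each vertex is given as (x, y, w), that all w values are positive, and that counter-clockwise triangles are front facing; the function names are hypothetical.

```python
# Facing test from the 3x3 determinant of the (x, y, w) rows of a triangle.
def det3(v0, v1, v2):
    """Determinant of the matrix whose rows are the three (x, y, w) vertexes."""
    (x0, y0, w0), (x1, y1, w1), (x2, y2, w2) = v0, v1, v2
    return (x0 * (y1 * w2 - y2 * w1)
            - y0 * (x1 * w2 - x2 * w1)
            + w0 * (x1 * y2 - x2 * y1))


def facing(v0, v1, v2):
    """Classify a triangle as 'front', 'back' or 'zero' (no area)."""
    d = det3(v0, v1, v2)
    if d > 0:
        return "front"
    if d < 0:
        return "back"
    return "zero"
```

When every w is 1 the determinant equals twice the signed screen-space area, which is why the same value also detects zero area triangles.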
Post Vertex Shader • Discard degenerate triangles: • If two or more vertexes are the same (the comparison could be index based or a full vertex comparison) the triangle can be discarded.
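A minimal sketch of the degenerate test, with a hypothetical function name. The same comparison serves both variants: pass vertex indices for the index-based check, or full attribute tuples for the full vertex comparison.

```python
# Degenerate triangle test: two or more identical vertexes.
def is_degenerate(a, b, c):
    """True when at least two of the three vertexes (or indices) are equal."""
    return a == b or b == c or a == c


def discard_degenerate(triangles):
    """Keep only the triangles with three distinct vertexes."""
    return [t for t in triangles if not is_degenerate(*t)]
```

The index-based comparison is cheaper but misses triangles whose vertexes have different indices and identical data.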
Rasterization • Alternatives: • Scanline incremental interpolation (DDA). • Rasterization in homogeneous coordinates. • Two phases: • Triangle setup. • Set interpolation registers. • Fragment generation. • Incrementally update the interpolants.
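The two phases can be illustrated with an edge-function rasterizer, used here in place of scanline DDA because it keeps the example short. It is a sketch under stated assumptions: 2D screen-space vertexes in counter-clockwise order, samples at integer coordinates, no fill rule for shared edges, and no attribute interpolants. All names are hypothetical.

```python
# Two-phase rasterization sketch: setup computes edge equations once,
# fragment generation only adds increments while walking a bounding box.
def triangle_setup(v0, v1, v2):
    """Return (a, b, c) per edge, with E(x, y) = a*x + b*y + c >= 0 inside."""
    edges = []
    for (xa, ya), (xb, yb) in ((v0, v1), (v1, v2), (v2, v0)):
        a = ya - yb
        b = xb - xa
        c = xa * yb - xb * ya
        edges.append((a, b, c))
    return edges


def generate_fragments(edges, x_min, y_min, x_max, y_max):
    """Walk the box and return the covered (x, y) positions."""
    fragments = []
    # Evaluate every edge function once, at the first sample.
    row = [a * x_min + b * y_min + c for a, b, c in edges]
    for y in range(y_min, y_max + 1):
        e = list(row)
        for x in range(x_min, x_max + 1):
            if all(value >= 0 for value in e):
                fragments.append((x, y))
            # Step in x: add the 'a' coefficient of each edge.
            e = [value + a for value, (a, _, _) in zip(e, edges)]
        # Step in y: add the 'b' coefficient of each edge.
        row = [value + b for value, (_, b, _) in zip(row, edges)]
    return fragments
```

Depth, color and other interpolants could be handled the same way: compute their plane equations during setup and update them with one addition per step.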