Status – Week 243 Victor Moya
Summary • Current status. • Tests. • XBox documentation. • Post Vertex Shader geometry. • Rasterization.
Current Status • Basic Command Processor. • Read/Write GPU registers. • Read/Write GPU memory. • GPU commands. • No DMA/AGP data access. • Basic Memory Controller. • 1 transaction per cycle served. • Memory module access latency accounted for. • Transmission latency accounted for. • 3 buses (req/write + data): CP, StreamerFetch, StreamerLoader.
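A minimal sketch of the memory controller timing model listed above, assuming a simple FIFO of requests and illustrative latency values (the slide gives no numbers). One transaction is served per cycle, and its data is ready after the module access latency plus the transmission latency. The class and method names are hypothetical.

```python
from collections import deque


class BasicMemoryController:
    """Toy timing model: serves at most one transaction per cycle."""

    def __init__(self, access_latency, transmission_latency):
        # Both latencies are in cycles; the values used are assumptions.
        self.access_latency = access_latency
        self.transmission_latency = transmission_latency
        self.pending = deque()  # requests waiting to be served, oldest first
        self.cycle = 0

    def request(self, bus, address):
        # bus is one of "CP", "StreamerFetch", "StreamerLoader".
        self.pending.append((bus, address))

    def clock(self):
        """Advance one cycle; return (bus, address, ready_cycle) or None."""
        served = None
        if self.pending:
            bus, address = self.pending.popleft()
            ready = self.cycle + self.access_latency + self.transmission_latency
            served = (bus, address, ready)
        self.cycle += 1
        return served
```

With an access latency of 10 cycles and a transmission latency of 4, two requests queued at cycle 0 would be ready at cycles 14 and 15.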
Current Status • Shader (Vertex Shader). • Multithreaded. • F/D/E/W pipeline. • Variable execution latency. • Dependency checking is currently done per full register; it should be component based. • Problems with the ‘ending’ instruction (requires something to fetch after it and takes many cycles). • No branches (support code exists, but the instructions are not implemented). • No texture access (memory).
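The difference between full-register and component-based dependency checking can be sketched as follows. This is only an illustration: the representation of pending writes as a dictionary from register name to a set of components is an assumption, not the simulator's data structure.

```python
def full_register_hazard(pending_writes, reg, read_components):
    """Full-register check: any pending write to the register stalls the read.

    read_components is ignored on purpose; that is the limitation.
    """
    return reg in pending_writes


def component_hazard(pending_writes, reg, read_components):
    """Component-based check: stall only if a component being read is pending."""
    return bool(pending_writes.get(reg, set()) & set(read_components))


# Example: an in-flight instruction is writing r0.xy.
pending = {"r0": {"x", "y"}}
```

Here a read of `r0.zw` stalls under the full-register check but not under the component-based one, since the write mask and the read swizzle do not overlap.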
Current Status • Streamer. • Pipelined: • Hit: Fetch/OCache/Insert/Commit • Miss: Fetch/OCache/IRQInsert/IRQRead/AttrLoad/Sh/Store/Commit. • Stream and index based modes implemented. • No pre T&L cache (should be added to Streamer Loader?). • Supports out of order vertexes (shader or memory). • Doesn’t support data from the AGP.
Current Status • Streamer: • Streamer Loader pipeline should be (in hardware): • Insert in the IRQ. • Load from IRQ. • Setup Input: start address + address increment for each active attribute. • Attribute Load: request attribute to MC, increment address generators. • Issue to Shader. • IRQ should be implemented with a pre T&L cache.
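The Setup Input and Attribute Load steps above can be sketched with one address generator per active attribute. This is a sketch for stream mode only, under the assumption that the increment is a byte stride; all names are hypothetical.

```python
class AttributeAddressGenerator:
    """Holds the next address for one attribute and its per-vertex increment."""

    def __init__(self, start_address, increment):
        self.address = start_address
        self.increment = increment  # assumed to be a stride in bytes

    def next_address(self):
        current = self.address
        self.address += self.increment
        return current


def setup_input(active_attributes):
    """Setup Input: active_attributes maps name -> (start_address, increment)."""
    return {name: AttributeAddressGenerator(start, inc)
            for name, (start, inc) in active_attributes.items()}


def attribute_load(generators):
    """Attribute Load: return the addresses to request from the MC for one
    vertex, then advance every address generator."""
    return {name: gen.next_address() for name, gen in generators.items()}
```

Each call to `attribute_load` corresponds to one vertex; in index mode the address would instead be derived from the index, which this sketch does not cover.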
Current Status • Comments: • Currently the signal latency/bandwidth is specified with raw numbers. Alternatives: • Use constants. Store in a single ‘signal definition’ file for all units or in separate units (must be shared between the two boxes connected by the signal). • Use some kind of Architecture Description for signal delays, bandwidth, data bus width (to be used in memory transmission calculations and similar). • Currently most units only support single issue/fetch/process. They should be ‘generalized’ to multi-issue/fetch/process and parameterized.
Tests • Signal Tracer Analyzer -> Carlos. • OpenGL test trace: • Sphere. • Using glutSolidSphere: 2 triangle fans, n quad strips. • Trying to implement a sphere using icosahedron subdivision to create a triangle strip mesh to test the index mode. A lighting shader will be added later. • As many vertexes/polygons as we want (~10000 in the currently generated trace).
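A minimal sketch of icosahedron subdivision, assuming the usual approach of splitting each triangle into four and pushing the new midpoints onto the unit sphere. It produces an indexed triangle list (suitable for exercising index mode); converting that list into triangle strips is not covered here. The function names are hypothetical and this is not the trace generator itself.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def icosphere(levels):
    """Return (vertices, triangles) for a unit sphere built from an icosahedron."""
    t = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio
    raw = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
           (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
           (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    vertices = [normalize(v) for v in raw]
    triangles = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
                 (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
                 (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
                 (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]

    for _ in range(levels):
        midpoint_cache = {}  # edge (low index, high index) -> new vertex index

        def midpoint(a, b):
            key = (min(a, b), max(a, b))
            if key not in midpoint_cache:
                mid = tuple((p + q) / 2.0
                            for p, q in zip(vertices[a], vertices[b]))
                vertices.append(normalize(mid))
                midpoint_cache[key] = len(vertices) - 1
            return midpoint_cache[key]

        subdivided = []
        for a, b, c in triangles:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            subdivided += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        triangles = subdivided

    return vertices, triangles
```

Because midpoints are shared between neighbouring triangles, n levels give 10·4^n + 2 vertices and 20·4^n triangles; five levels yield 10242 vertices, which is in the range of the ~10000 mentioned above.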
Tests • Changes needed: • Add support for glNormal3f, GL_TRIANGLE_FAN and GL_QUAD_STRIP to the TraceReader/Library/Driver. • Add support for triangle fans and quad strips (?) to the CP and the fake rasterizer (Shader and Streamer don’t care about that).
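One way to support fans and quad strips in a rasterizer that only understands triangles is to decompose them, following the OpenGL vertex ordering for GL_TRIANGLE_FAN and GL_QUAD_STRIP. The sketch below works on vertex indices; the function names are hypothetical.

```python
def fan_to_triangles(indices):
    """GL_TRIANGLE_FAN: every triangle shares the first vertex."""
    return [(indices[0], indices[i], indices[i + 1])
            for i in range(1, len(indices) - 1)]


def quad_strip_to_triangles(indices):
    """GL_QUAD_STRIP: each pair of new vertices closes one quad.

    Quad n uses vertices (2n, 2n+1, 2n+3, 2n+2) in that winding order;
    it is split into two triangles that keep the same winding.
    """
    triangles = []
    for i in range(0, len(indices) - 3, 2):
        v0, v1, v2, v3 = indices[i:i + 4]
        triangles.append((v0, v1, v3))
        triangles.append((v0, v3, v2))
    return triangles
```

For example, a fan of 5 vertices yields 3 triangles and a quad strip of 6 vertices yields 2 quads (4 triangles).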
XBox Documentation • Interesting information about the Vertex Shader architecture and the T&L pipeline down to the Primitive Assembly Cache and the Triangle Setup. • Includes estimated sizes and clock latencies for most of the operations.
Memory (200 MHz) → [cache line, raw vertex data] → Pre T&L Cache (4 KB, 4-way set associative, 128 32-B cache lines) → [raw vertex] → Vertex Shader → [transformed and lit vertex] → Post T&L Cache (16–24 entry FIFO) → [transformed and lit vertex] → Primitive Assembly (3 vertices) → [3 transformed and lit vertices] → Triangle Setup → Rasterization
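The Post T&L cache in the pipeline above is described as a 16 to 24 entry FIFO. A minimal sketch of such a cache follows, assuming it is looked up by vertex index and that a hit does not change the replacement order; both are assumptions, as are the names.

```python
from collections import OrderedDict


class PostTnLCache:
    """FIFO cache of shaded vertices, keyed by vertex index (assumed)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.fifo = OrderedDict()  # vertex index -> shaded vertex, oldest first

    def lookup(self, index):
        """Return the cached shaded vertex, or None on a miss."""
        return self.fifo.get(index)

    def insert(self, index, shaded_vertex):
        if index in self.fifo:
            return  # already cached; FIFO order is not refreshed
        if len(self.fifo) >= self.entries:
            self.fifo.popitem(last=False)  # evict the oldest entry
        self.fifo[index] = shaded_vertex
```

On a hit the vertex can skip the shader entirely, which is the benefit of index mode with good vertex reuse.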
XBOX • Differences with our simulator: • We have no Pre T&L cache. • In the XBox the Post T&L cache seems to be accessed by the Primitive Assembly Cache. However, we push the vertex to the Rasterizer (or whatever lies after the shader). • Sending the shaded vertex to the primitive assembly takes multiple cycles (2+) depending on the number of attributes used by the vertex.
XBOX Vertex Shader • Registers: • 16 input registers. • 12 temporary registers. • 192 constant registers. • 1 address register. • 11 output registers.
XBOX Vertex Shader • Instructions: • Shader Operations: • 13 MAC opcodes. • 7 ILU (inverse logic unit) opcodes. • 136 microcode instructions. Each instruction can: • Read three registers with swizzle and negation. • Compute one MAC op and one ILU op. • Write up to one output register and two temporary registers with masking. • Shader types: • Normal vertex shaders. • Read/write vertex shaders. • Vertex state shaders.
XBOX Vertex Shaders • Timing: • The clock speed is 250 MHz. • For normal shaders, instructions take between one-half cycle and one cycle to complete. • For read/write and vertex state shaders, instructions take between one cycle and six cycles to complete.
XBOX Vertex Shaders • Multithreaded: • Two copies of the vertex shader pipeline (2 VS). • Each copy can run up to three threads (3 active threads per shader). • Read/write vertex shaders and vertex state shaders run single threaded, on a single pipeline. • Stalling: • Instructions take six cycles to compute their outputs. • Bypasses: ALU, ILU and MLU bypasses. • Three cycles of latency with bypasses. • Bypasses allow swizzling and negation of the result.
Post Vertex Shader • Divide by w. • Can be avoided/delayed if rasterization is performed in homogeneous coordinates (Olano & Greer). • Viewport transformation. • Scale to screen/window coordinate system. • Primitive Assembly: • Get the three vertexes of a triangle.
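The divide by w and the viewport transformation can be sketched as below, assuming the OpenGL conventions (normalized device coordinates in [-1, 1] and a depth range of [near, far]). Function and parameter names are hypothetical.

```python
def perspective_divide(clip):
    """Clip coordinates (x, y, z, w) -> normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def viewport_transform(ndc, x0, y0, width, height, near=0.0, far=1.0):
    """Scale NDC in [-1, 1] to window coordinates and to the depth range."""
    xn, yn, zn = ndc
    return (x0 + (xn + 1.0) * width / 2.0,
            y0 + (yn + 1.0) * height / 2.0,
            near + (zn + 1.0) * (far - near) / 2.0)
```

For a 640x480 viewport at the origin, the NDC point (0, 0, 0) lands at the window centre (320, 240) with depth 0.5.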
Post Vertex Shader • Back face culling: • Can be calculated using the area of the triangle (determinant of the three vertices in homogeneous coordinates). • The sign of the area (negative or positive) gives the facing. • Can also be used to cull zero-area triangles. • Clipping: • Using rasterization in homogeneous coordinates: just add more clipping edges. • Triangle clipping: ?
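A sketch of the determinant test for back-face and zero-area culling. Each vertex is given as (x, y, w) in homogeneous coordinates; the choice that a positive determinant means front facing (counter-clockwise front faces, as in the OpenGL default) is an assumption, and the names are hypothetical.

```python
def signed_area_homogeneous(v0, v1, v2):
    """Determinant of the 3x3 matrix whose rows are (x, y, w) per vertex.

    With w = 1 for all vertices this is twice the signed screen-space area.
    """
    (x0, y0, w0), (x1, y1, w1), (x2, y2, w2) = v0, v1, v2
    return (x0 * (y1 * w2 - y2 * w1)
            - y0 * (x1 * w2 - x2 * w1)
            + w0 * (x1 * y2 - x2 * y1))


def should_cull(v0, v1, v2, front_is_positive=True):
    """Cull zero-area triangles and triangles facing away from the viewer."""
    det = signed_area_homogeneous(v0, v1, v2)
    if det == 0:
        return True  # zero area: nothing to rasterize
    return (det < 0) == front_is_positive
```

Swapping two vertices of a triangle flips the sign of the determinant, so the same triangle submitted with the opposite winding is culled.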
Post Vertex Shader • Discard degenerate triangles: • If two or more vertices are the same (could be an index-based or a full vertex comparison) the triangle can be discarded.
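Both comparison options can be sketched in a few lines. The index-based test is cheaper but misses duplicated vertices stored under different indices; the full comparison catches those at the cost of comparing every attribute. The names and the tuple representation of a vertex are assumptions.

```python
def is_degenerate_by_index(i0, i1, i2):
    """Index-based test: two or more equal indices."""
    return i0 == i1 or i1 == i2 or i0 == i2


def is_degenerate_by_vertex(v0, v1, v2):
    """Full comparison: vertices are tuples of attribute values."""
    return v0 == v1 or v1 == v2 or v0 == v2
```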
Rasterization • Alternatives: • Scanline incremental interpolation (DDA). • Rasterization in homogeneous coordinates. • Two phases: • Triangle setup. • Set interpolation registers. • Fragment generation. • Incrementally update the interpolants.
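The two phases can be sketched with edge functions evaluated incrementally over the triangle's bounding box. This is only one possible approach and not necessarily the one planned; it assumes integer vertex coordinates, counter-clockwise triangles, and no fill rule (samples on an edge count as inside). The names are hypothetical.

```python
def triangle_setup(v0, v1, v2):
    """Setup: per-edge coefficients (a, b, c) with E(x, y) = a*x + b*y + c,
    plus the bounding box to traverse. E >= 0 on the inner side of each
    edge for a counter-clockwise triangle."""
    edges = []
    for (xa, ya), (xb, yb) in ((v0, v1), (v1, v2), (v2, v0)):
        edges.append((ya - yb, xb - xa, xa * yb - xb * ya))
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    return edges, (min(xs), min(ys), max(xs), max(ys))


def generate_fragments(edges, bbox):
    """Fragment generation: step over the bounding box, updating the edge
    values by additions only (a per step in x, b per step in y)."""
    x_min, y_min, x_max, y_max = bbox
    row = [a * x_min + b * y_min + c for a, b, c in edges]
    fragments = []
    for y in range(y_min, y_max + 1):
        values = list(row)
        for x in range(x_min, x_max + 1):
            if all(value >= 0 for value in values):
                fragments.append((x, y))
            values = [value + a for value, (a, _, _) in zip(values, edges)]
        row = [value + b for value, (_, b, _) in zip(row, edges)]
    return fragments
```

The three edge values, once divided by their sum, are barycentric weights, so the same incremental scheme extends to interpolating vertex attributes.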