DirectX 9 & Radeon 9700 Performance Optimizations

Richard Huddy RHuddy@ati.com

DirectX 9 and Radeon 9700 considerations • Resources • Sorting and Clearing • Vertex Buffers and Index Buffers • Render States • How to draw primitives • Vertex Data • Vertex Shaders • Pixel Shaders • Textures • Targets (both Z and color) • Miscellaneous

General resource management • Create your most important resources first (that’s targets, shaders, textures, VB’s, IB’s etc) • “Most important” is “most frequently used” • Never call Create in your main loop • So create the main colour and Z buffers before you do anything else… • The “main buffer” is the one through which the largest number of pixels pass…

Sorting • Sort roughly front to back • There’s a staggering amount of hardware devoted to making this highly efficient • Sort by vertex shader …or… • Sort by pixel shader, or • sort by texture • When you change VS or PS it’s good to go back to that shader as soon as possible… • Short shaders are faster^2 when sorted
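One possible way to combine these sorting hints is a single comparison function: group draws by shader and texture first, then go roughly front to back inside each group. The sketch below is only an illustration. The `DrawItem` record, its field names and the priority order of the keys are assumptions, not something the slides prescribe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record; the field names are illustrative only.
struct DrawItem {
    std::uint16_t vertexShaderId;
    std::uint16_t pixelShaderId;
    std::uint16_t textureId;
    float viewDepth;  // distance from the camera, larger means farther away
};

// Orders draws so that identical shaders and textures end up adjacent,
// and draws that share all three go front to back.
inline bool drawsBefore(const DrawItem& a, const DrawItem& b) {
    if (a.vertexShaderId != b.vertexShaderId) return a.vertexShaderId < b.vertexShaderId;
    if (a.pixelShaderId != b.pixelShaderId) return a.pixelShaderId < b.pixelShaderId;
    if (a.textureId != b.textureId) return a.textureId < b.textureId;
    return a.viewDepth < b.viewDepth;
}

// Sorts the frame's draw list in place before it is submitted.
inline void sortDrawList(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(), drawsBefore);
}
```

Whether shader, texture or depth should be the leading key depends on where the application is actually limited, so the order above is a starting point to measure against rather than a rule.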

Clearing • Ideally use Clear once per frame (not less) • Always clear the whole render target • Don’t track dirty regions at all • Always clear colour, Z and stencil together unless you can just clear Z/stencil • Most importantly don’t force us to preserve stencil • Don’t use 2 triangles to clear… • Using Clear() is the way to get all the fancy Z buffer hardware working for you

Vertex Buffers • Use the standard DirectX8/9 VB handling algorithm with NOOVERWRITE etc • Try to always use DISCARD at the start of the frame on dynamic VB’s • Specify write-only whenever possible • Use the default pool whenever possible • Roughly 2 – 4 MB for best performance • This allows large batches • And gives the driver sufficient granularity
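The DISCARD / NOOVERWRITE pattern boils down to a small piece of bookkeeping: append with NOOVERWRITE while the data fits, and restart at offset 0 with DISCARD at the start of a frame or when the buffer is full. The sketch below models only that decision so it can run without a device. `DynamicBufferCursor` and `LockRequest` are hypothetical names, and the two constants are stand-ins for `D3DLOCK_NOOVERWRITE` and `D3DLOCK_DISCARD` from d3d9.h.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins mirroring D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD;
// real code would take the constants from d3d9.h instead.
enum LockFlag : std::uint32_t {
    kLockNoOverwrite = 0x1000,
    kLockDiscard = 0x2000
};

// Where to write and which flag to pass to the lock call.
struct LockRequest {
    std::uint32_t offsetBytes;
    std::uint32_t flags;
};

// Hypothetical helper tracking the write cursor of one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::uint32_t capacityBytes)
        : capacity_(capacityBytes), cursor_(0), discardNext_(true) {}

    // Call once per frame so the first lock of the frame uses DISCARD.
    void beginFrame() { discardNext_ = true; }

    // Reserves sizeBytes and reports the offset and lock flag to use.
    LockRequest reserve(std::uint32_t sizeBytes) {
        assert(sizeBytes <= capacity_);
        const bool wouldOverflow = sizeBytes > capacity_ - cursor_;
        if (discardNext_ || wouldOverflow) {
            // Start again at the beginning of a fresh buffer.
            discardNext_ = false;
            cursor_ = sizeBytes;
            return LockRequest{0, kLockDiscard};
        }
        // Append after data the GPU may still be reading.
        const std::uint32_t offset = cursor_;
        cursor_ += sizeBytes;
        return LockRequest{offset, kLockNoOverwrite};
    }

private:
    std::uint32_t capacity_;
    std::uint32_t cursor_;
    bool discardNext_;
};
```

In a real renderer the returned offset and flag would be handed to the vertex buffer's lock call, and the buffer itself would be created as dynamic and write-only in the default pool, as the bullets above suggest.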

Index Buffers • Treat Index Buffers exactly as if they were vertex buffers – except that you always choose the smallest element possible • i.e. Use 32 bit indices only if you need to • Use 16 bit indices whenever you can • All recent ATI hardware treats Index Buffers as ‘first class citizens’ • They don’t have to be copied about before the chip gets access • So keep them out of system memory
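The "smallest element possible" rule can be made mechanical. A minimal sketch, assuming the only criterion is whether every vertex index fits in 16 bits; the function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the index size in bytes for a mesh with vertexCount vertices:
// 2 (16-bit) while the largest index still fits in 16 bits, otherwise 4.
inline std::size_t indexSizeBytes(std::size_t vertexCount) {
    return vertexCount <= 65536 ? sizeof(std::uint16_t) : sizeof(std::uint32_t);
}

// Total size of an index buffer holding indexCount indices for that mesh.
inline std::size_t indexBufferBytes(std::size_t vertexCount, std::size_t indexCount) {
    return indexCount * indexSizeBytes(vertexCount);
}
```

A device may report a lower maximum vertex index than the format allows, so real code would also check the device capabilities before choosing.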

Updating Index and Vertex Buffers • IBs and VBs which are optimally located need to be updated with sequential DWORD writes. • AGP memory and LVM both benefit from this treatment…

Handling Render States • Prefer minimal state blocks • ‘minimal’ means you should weed out any redundant state changes where possible • If 5% of state changes are redundant that’s OK • If 50% are redundant then get it fixed! • The expensive state changes: • Switching between VS and FF • Switching Vertex Shader • Changing Texture
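To find out whether you are nearer the 5% or the 50% case, you can shadow the device state in the application and count how many requested changes are redundant. The sketch below assumes a hypothetical `StateCache` wrapper keyed by plain integers; it does not call any DirectX API itself.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical wrapper that remembers the last value set for each state.
class StateCache {
public:
    // Returns true when the change should be forwarded to the device,
    // false when the state already holds this value (a redundant change).
    bool shouldApply(std::uint32_t state, std::uint32_t value) {
        const auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        ++applied_;
        return true;
    }

    // Fraction of all requested changes that turned out to be redundant.
    double redundantFraction() const {
        const std::uint64_t total = applied_ + redundant_;
        return total == 0 ? 0.0 : static_cast<double>(redundant_) / static_cast<double>(total);
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::uint64_t applied_ = 0;
    std::uint64_t redundant_ = 0;
};
```

Typical use would be `if (cache.shouldApply(state, value)) { /* forward to the device */ }`, with `redundantFraction()` logged once per frame during development.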

How to draw primitives • DrawIndexedPrimitive( strip or list ) • Indexing is a big win on real world data • Long strips beat everything else • Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than ~20% means use lists) • Make sure your VB’s and IB’s are in optimal memory for best performance • Give the card hundreds of polys per call • Small batches kill performance
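The strip-versus-list rule of thumb can be written down as a small decision function. This sketch reads "more than ~20%" as the share of degenerate triangles among all triangles submitted in the strip, which is one interpretation of the slide; the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

enum class Topology { Strip, List };

// Picks lists when stitching strips together would make more than
// about 20% of the submitted triangles degenerate.
inline Topology chooseTopology(std::size_t realTriangles, std::size_t degenerateTriangles) {
    const std::size_t total = realTriangles + degenerateTriangles;
    if (total == 0) return Topology::List;
    // degenerate / total > 0.20, written with integers to avoid rounding.
    return degenerateTriangles * 5 > total ? Topology::List : Topology::Strip;
}

// A single strip of n triangles (degenerates included) needs n + 2 indices.
inline std::size_t stripIndexCount(std::size_t trianglesInStrip) {
    return trianglesInStrip == 0 ? 0 : trianglesInStrip + 2;
}

// A list always needs 3 indices per triangle.
inline std::size_t listIndexCount(std::size_t triangles) {
    return triangles * 3;
}
```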

Vertex data • Don’t scatter it around • Fewer streams give better cache behaviour • Compress it if you can • 16 bits or less per component • Even if it costs you 1 or 2 ops in the shader… • Try to avoid spilling into AGP • Because AGP has high latency • pow2 sizes help – 32 bytes is best • Work the cache on the GPU • Avoid random access patterns where possible by reordering vertex data before the main loop… • That’s at app start up or at authoring time
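As an illustration of "16 bits or less per component" and the 32-byte target, here is one possible vertex layout together with a pack and unpack pair for signed normalized 16-bit values. The layout and names are assumptions chosen for the example; the division in `unpackSnorm16` corresponds to the extra op or two the shader would spend.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical 32-byte vertex: full-precision position, everything else packed.
struct PackedVertex {
    float position[3];        // 12 bytes
    std::int16_t normal[4];   //  8 bytes, signed normalized, w unused
    std::int16_t uv[2];       //  4 bytes, signed normalized
    std::uint8_t color[4];    //  4 bytes
    std::uint8_t padding[4];  //  4 bytes to reach a power-of-two size
};
static_assert(sizeof(PackedVertex) == 32, "vertex is expected to be 32 bytes");

// Maps a float in [-1, 1] to a signed 16-bit integer; out-of-range input is clamped.
inline std::int16_t packSnorm16(float value) {
    const float clamped = std::max(-1.0f, std::min(1.0f, value));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Inverse mapping, as the vertex fetch or the shader would apply it.
inline float unpackSnorm16(std::int16_t packed) {
    return static_cast<float>(packed) / 32767.0f;
}
```

Texture coordinates outside [-1, 1] would need their own scale factor passed as a shader constant, which this sketch leaves out.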

Compiling and Linking shaders • Do this all “up front” • It may not be obvious to you - but you have to actually use a shader to force its complete instantiation in DirectX 9 • So, if you’re not careful you may get linking happening in your main loop • And linking may be time consuming • Draw a little of everything before you start for real. Think of this as priming the caches…

Vertex shaders I • Shorter shaders are faster – no surprises here… • Avoid all unnecessary writes • This includes the output registers of the VS • So use the write masks aggressively • Pack constants as much as possible • Prefer locality of reference on constants too… • Be aware of the expansion of macros but prefer them anyway if they match exactly what you want • Pack your shader constant updates • You should optimise the algorithm and leave the object-code optimisation to the driver/runtime

Vertex shaders II • Branches and conditionals are fast so use them aggressively • That’s not like the CPU where branches are slow… • Longer shaders allow better batching • Shorter shaders are also more cache friendly • i.e. it’s usually faster to switch to the previous shader than to any other • But the shorter your shaders are… • …the more of them fit into the cache.

Vertex shaders III • API Change: • Now you don’t “mov” to the address register, you use “mova” • And this performs round to nearest, not floor • And now A0 is a 4d register • A0.x, A0.y, A0.z, A0.w

Pixel shaders I • API change to accommodate MET’s: • You now have to explicitly write to oC0, oC1, oC2 and oC3 to set the output colour • And the write has to be with a mov instruction • If you write to oC[n] you must write to all elements from oC[0] to oC[n-1] • i.e. Writes must be contiguous starting at oC0 • But the writes can happen in any order • You can also write to oDepth to update the Z buffer but note that this kills the early Z cull… (this replaces ps1.3 texdepth)

Pixel shaders II • Shorter is much faster • It’s much easier to be pixel limited than vertex limited • Short shaders are more cache friendly • Be aggressive with write masks • Think dual-issue (“+”) even though it’s gone from the API (so split colour and alpha out) • Generally prefer to spend cycles on shader ops rather than using texture lookups • Because memory latency is the enemy here

Pixel shaders III • Dual issue? • But that’s not in the 2.0 shader spec… • But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed • And that means it has dual issue hardware ready for you to use

Pixel shaders IV • Example : Diffuse + specular lighting

…
dp3 r0, r1, r0          // N.H
dp3 r2, r1, r2          // N.L
mul r2, r2, r3          // * color
mul r2, r2, r4          // * texture
mul r0.r, r0.r, r0.r    // spec^2
mul r0.r, r0.r, r0.r    // spec^4
mul r0.r, r0.r, r0.r    // spec^8
mad r0.rgb, r0.r, r5, r2
…
Total: 8 instructions

…
dp3 r0, r1, r0          // N.H
dp3 r2.r, r1, r2        // N.L
mul r6.a, r0.r, r0.r    // spec^2
mul r2.rgb, r2.r, r3    // * color
mul r6.a, r6.a, r6.a    // spec^4
mul r2.rgb, r2, r4      // * texture
mul r6.a, r6.a, r6.a    // spec^8
mad r0.rgb, r6.a, r5, r2
…
Optimized to 5 “DI” instructions

Pixel shaders V • Texture instructions • Avoid TEXDEPTH to retain the early Z-reject • If you do choose to use TEXKILL then use it as early as possible. [But, the positioning of TEXKILL within texture loading code is unimportant] • Register usage • Minimize total number of registers used • No problems with dependency

Vertex and Pixel shaders • If you’re fed up with writing assembler, and don’t feel excited by the opportunity to code 256 VS ops and 96 PS ops then… • …maybe you should consider HLSL? • In most cases it is as good as hand written assembler • And much faster to author… • Perfect for prototyping • And for release code where you use D3DX

Textures I • API addition • SetSamplerState() • Handles the now-decoupled texture sampler setup. • You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways • Texture coordinates are now just iterated floats • Samplers handle clamp, wrap, bias and filter modes • You have 8 texture coordinates • And 16 texture samplers • texld r11, t7, s15 (all register numbers are max)

Textures II • Use compressed textures • Do you need a good compressor? • Use smaller textures • Use 16 bit textures in preference to 32 bit • Use textures with few components • Use an L8 or A8 format if that’s what you want • Pack textures together • e. g. If you’re using two 2D textures then consider using a single RGBA texture • Texture performance is bandwidth limited
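Since texture performance is bandwidth limited, it helps to see how quickly the footprint shrinks with narrower formats. A small sketch of the arithmetic for the top mip level only; the function name is hypothetical and the bits-per-texel values are supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by the top mip level of an uncompressed texture.
inline std::size_t textureBytes(std::size_t width, std::size_t height, std::size_t bitsPerTexel) {
    return width * height * bitsPerTexel / 8;
}
```

For a 1024 x 1024 texture this gives 4 MB at 32 bits per texel, 2 MB at 16 bits, and 1 MB for an 8-bit format such as L8 or A8.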

Textures III • Filtering modes • Use trilinear filtering to improve texture cache coherency • Only use anisotropic or tri-linear filtering when they make sense - they are more expensive • Avoid using anisotropic filtering with bumpmapping • Avoid using tri-linear anisotropic filtering unless the quality win justifies it • More costly filtering is more affordable with longer pixel shaders

Targets • Always clear the whole of the target • Present(): • WASSTILLDRAWING makes a comeback • Please use it! • Because using it properly will gain you CPU cycles - and that’s typically your scarcest resource

Depth Buffer I • Never lock depth buffers • Clearing depth buffers • Clear the whole surface • When stencil is present clear both depth and stencil simultaneously • If possible disable depth buffering when alpha blending (i.e. drawing HUD’s) • Use as few depth buffers as possible… • i.e. re-use them across multiple render targets

Depth Buffer II • Efficiently use Hyper-Z • Render front to back • Make Znear, Zfar close to active depth range of the scene • The EQUAL and NOT EQUAL depth tests require exact compares which kill the early Z comparisons. Avoid them!

Occlusion query • New to DirectX 9 • In GL you have HP_occlusion_query and NV_occlusion_query to avoid the need for locks • Not free, but much cheaper than Lock() • Supported on all ATI hardware since the Radeon 8500 • CreateQuery(OCCLUSION, ppQuery) • Issue(Begin/End) • GetData() returns S_OK to signal completion - but please don’t spin waiting for the answer…

AGP 8X • Is fast at ~2GB per second • But has high latency compared to LVM • And is 10 times slower than LVM • Radeon 9700 has up to 20GB per sec of bandwidth available when talking to LVM • (LVM = Local Video Memory)

User clip planes • User clip planes are much more efficient than texkill because: • They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels • It’s important always to kill data at the earliest stage possible in the pipeline • Plus, clipping is essentially a geometric operation • All hardware which supports ps1.4 supports user clip planes in hardware

Sky box. First or last? • Draw it last because: • That’s a rough front to back sort • In this case you know that most sky pixels will fail the Z test. • Draw it first because: • That way you don’t need any Z tests • In this case you know that most sky pixels would pass the Z test

So, here is our target: • DX9 style mainstream graphics (per frame): • > 500K triangles • < 500 DrawIndexedPrimitive() calls • < 500 VertexBuffer switches • < 200 different textures • < 200 State change groups • Few calls to SetRenderTarget - aim for 0 to 4... • 1 pass per poly is typical, but 2 is sometimes smart • Runs at monitor refresh rate • Which gives more than 40 million polys per second • And everything goes through the programmable pipeline • No occurrences of Lock(0), DrawPrimitive(), DPUP()
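The throughput figure follows from the per-frame budget once a refresh rate is chosen. The sketch below assumes an 85 Hz monitor, which is not stated on the slide; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Triangles per second for a given per-frame budget and refresh rate.
inline std::int64_t trianglesPerSecond(std::int64_t trianglesPerFrame, std::int64_t refreshHz) {
    return trianglesPerFrame * refreshHz;
}

// Average triangles per DrawIndexedPrimitive() call for the same budget.
inline std::int64_t averageBatchSize(std::int64_t trianglesPerFrame, std::int64_t drawCalls) {
    assert(drawCalls > 0);
    return trianglesPerFrame / drawCalls;
}
```

With 500K triangles per frame at 85 Hz this comes to 42.5 million triangles per second, and 500 draw calls gives an average batch of 1000 triangles. At 60 Hz the same budget yields 30 million, so the "more than 40 million" figure implies a refresh rate of at least 80 Hz.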

Questions… ? Richard Huddy RHuddy@ati.com
