1. Revving Up Shader Performance Shanon Drone
Development Engineer
XNA Developer Connection, Microsoft
2. Key Takeaways Know the shader hardware on a low level
Pixel threads, GPRs, fetch efficiency, latency, etc.
Learn from the Xbox 360
Architect shaders accordingly
Shader optimization is not always intuitive
Some things that seem to make sense actually hurt performance
At some point, you must take a trial-and-error approach
In other words…
You must write tests
Real-world measurements are the only way to really know what works
Analogy: To optimize shader performance, you need to know how the hardware works “under the hood”, not just the high-level API and shader programming language.
Analogy: In this analogy, you are not the race car driver; rather you are the mechanic, or even the mechanical engineer. The car is a finished product on race day, so this talk is about the performance mods you make while building the engine.
3. Under the Hood Help the shader compiler by writing shaders for how hardware works “under the hood”
Modern GPUs are highly parallel to hide latency
Lots of pipelines (maybe even a shared shader core)
Works on units larger than single vertices and pixels
Rasterization goes to tiles, then to quads, then to (groups of) pixels
Texture caches are still small
32 KB on Xbox 360 and similar on PC hardware
Shaders use 6–10 textures with 10–20 MB traffic
Rasterization order affects texture cache usage
Lots of other subtleties affect performance
Xbox 360 has 48 ALUs
X1300 had 4 shader processors
X1600 had 12 shader processors
X1800 XT has 16 TMUs, 16 ALUs, 16 ROPs, and 8 vertex units
X1900 XT has 16 TMUs, 48 ALUs, 16 ROPs, and 8 vertex units <- Note: 3× the ALU performance of the texture pipelines. Not unified.
Intel GMA 3000 has unified shader core, including dynamic load balancing
The NVIDIA D3D10 part, the G80, will have 48 pixel processors and unknown # of vertex processors. Not unified.
4. Pixel and Vertex “Vectors” GPUs work on groups of vertices and pixels
Xbox 360 works on a “vector” of 64 vertices or pixels
Meaning each instruction is executed 64 times at once
Indexed constants cause “constant waterfalling”
For example: “c[a0]” can translate to 64 different constants per one instruction
When all pixels take the same branch…
…performance can be good
But in most cases, both code paths will execute
Author shaders accordingly
“Early out” code paths may not work out like expected
Check assembly output
On ATI X1800, thread size is 16 pixels. Up to 512 threads means at least 8192 GPRs.
GeForce 7800 may use 24 pixel threads.
NVIDIA doesn’t release specific details, but suggested that:
- NV40 does 880 pixel blocks
- G70 does ~220 pixel blocks
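A toy counting model of the "constant waterfalling" described on this slide (not a hardware simulation; it only shows why divergent indices inside one 64-wide vector serialize the constant read). The 64-wide vector size is the Xbox 360 figure from the slides; the function name is made up:

```python
# Toy model of constant waterfalling: an indexed constant read like c[a0]
# must be serviced once per distinct index within a 64-wide pixel/vertex
# vector, so one instruction can turn into up to 64 serialized reads.

def waterfall_reads(indices):
    """Serialized constant reads needed for one 64-wide vector."""
    assert len(indices) == 64
    return len(set(indices))

# Best case: every lane indexes the same constant -> one read.
print(waterfall_reads([7] * 64))          # 1
# Worst case: every lane indexes a different constant -> 64 reads.
print(waterfall_reads(list(range(64))))   # 64
```

The same counting argument is why "when all pixels take the same branch, performance can be good": coherence across the vector is what keeps one instruction costing one issue.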
5. Why GPRs Matter GPU can process many ALU ops per clock cycle
But fetch results can take hundreds of clock cycles
Due to cache misses, texel size, filtering, etc.
Threads hide fetch latency
While one “thread” is waiting for fetch results, the GPU can start processing another thread
Max number of threads is limited by the GPR pool
Example: 64 pixels × 32 GPRs = 2048 GPRs per thread
The GPU can quickly run out of GPRs
Fewer GPRs used generally translates to more threads
6. Minimize GPR Usage Xbox 360 has 24,576 GPRs
64-unit vector × 128 register banks × 3 SIMD units
Sounds like a lot, but many shaders are “GPR bound”
The issue is the same for PC hardware
Tweak shaders to use the least number of GPRs
Maybe even at the expense of additional ALUs or unintuitive control flow
Unrolling loops usually requires more ALUs…
…as does lumping tfetches at the beginning of the shader
Ditto for über-shaders, even with static control flow
Rules get tricky, so we must try out many variations
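The GPR budget arithmetic on slides 5 and 6 can be sketched as a quick back-of-the-envelope calculation. This only reproduces the slides' numbers; real hardware also caps the in-flight thread count independently of the register pool:

```python
# Xbox 360 GPR pool per the slides: 64 (vector width) x 128 (register banks)
# x 3 (SIMD units) = 24,576 GPRs. Each in-flight thread of 64 pixels holds
# vector_width * gprs_per_pixel registers, so fewer GPRs per pixel means
# more threads available to hide fetch latency.

VECTOR_WIDTH = 64
GPR_POOL = VECTOR_WIDTH * 128 * 3  # 24,576

def max_threads(gprs_per_pixel):
    return GPR_POOL // (VECTOR_WIDTH * gprs_per_pixel)

print(GPR_POOL)         # 24576
print(max_threads(32))  # 12 threads at 32 GPRs/pixel
print(max_threads(8))   # 48 threads at 8 GPRs/pixel
```

Dropping from 32 to 8 GPRs per pixel quadruples the number of latency-hiding threads, which is the whole argument for trading a few extra ALU ops for lower register pressure.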
7. Vertex Shader Performance Lots of reasons to care about vertex shader performance:
Modern games have many passes (z-pass, tiling, etc.)
Non-shared cores devote less hardware for vertices
For shared cores, resources could be used for pixels
For ALU bound shaders…
Constant waterfalling can be the biggest problem
Matrix order can affect ALU optimization
Many vertex shaders are fetch bound
Especially lightweight shaders
One fetch can cost 32x more cycles than an ALU
8. VFetch Goal is to minimize the number of fetches
Which is why vertex compression is so important
Vertex data should be aligned to 32 or 64 bytes
Multiple streams will multiply the fetch cost
Vertex declaration should match fetch order
Shader compiler has ZERO info about vertex components
Shaders are patched at run time with the vertex declaration
Shader patching can’t optimize out unnecessary vfetches
9. Mega vs. Mini Fetches A quick refresher…
GPU reads vertex data in groups of bytes
Typically 32 bytes (a “mega” or “full” fetch)
Additional fetches within the 32-byte range should be free (a “mini” fetch)
On Xbox 360:
A “vfetch_full” pulls in 32 bytes worth of data
At two fetches per clock cycle, one fetch across 64 vertices costs 32 cycles
Without 32 cycles worth of ALU ops, the shader is fetch bound
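The fetch-cost arithmetic above can be checked with a small sketch. The 2-fetches-per-clock rate and 64-wide vertex vector are the slide's Xbox 360 numbers; the helper name is illustrative:

```python
# Fetch-bound check from the slide's numbers: the hardware issues 2 fetches
# per clock, so one vfetch_full across a 64-vertex vector costs 64 / 2 = 32
# cycles. A shader with fewer ALU cycles than fetch cycles is fetch bound.
# This mirrors the cycle counts in slides 11-13 (e.g. ALU 12 vs vertex 64).

VECTOR_SIZE = 64
FETCHES_PER_CLOCK = 2
CYCLES_PER_FULL_FETCH = VECTOR_SIZE // FETCHES_PER_CLOCK  # 32

def bound_by(num_full_fetches, alu_cycles):
    fetch_cycles = num_full_fetches * CYCLES_PER_FULL_FETCH
    return "fetch" if fetch_cycles > alu_cycles else "alu"

print(bound_by(2, 12))  # 'fetch': 64 fetch cycles dominate 12 ALU cycles
print(bound_by(1, 40))  # 'alu'
```

This is why the two-stream and two-full-fetch variants on the next slides report 64 vertex cycles against only 12 ALU cycles, while the mini-fetch variant halves the vertex cost.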
10. Vfetch Recommendations Compress vertices
Normals, Tangent, BiNormal -> 11:11:10
Texture coords -> 16:16
Put all non-lighting components first
So depth-only shaders do just one fetch
FLOAT32 Position[3];
UINT8 BoneWeights[4];
UINT8 BoneIndices[4];
UINT16 DiffuseTexCoords[2];
UINT16 NormalMapCoords[2];
DEC3N Normal;
DEC3N Tangent;
DEC3N BiNormal;
All in one stream, of course
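To make the 11:11:10 compression suggestion concrete, here is a hypothetical packer in Python. The exact DEC3N / 11:11:10 bit layout and signedness vary by platform, so treat this as an illustration of the 12-bytes-to-4 savings, not the actual Xbox 360 encoding:

```python
# Sketch of 11:11:10 normal compression: map each component from [-1, 1] to
# an unsigned-normalized integer and pack x, y, z into one 32-bit word.
# A FLOAT32x3 normal (12 bytes) shrinks to 4 bytes, cutting fetch traffic.

def pack_11_11_10(x, y, z):
    def unorm(v, bits):
        scale = (1 << bits) - 1
        return round((max(-1.0, min(1.0, v)) * 0.5 + 0.5) * scale)
    return unorm(x, 11) | (unorm(y, 11) << 11) | (unorm(z, 10) << 22)

def unpack_11_11_10(word):
    def snorm(u, bits):
        scale = (1 << bits) - 1
        return (u / scale) * 2.0 - 1.0
    return (snorm(word & 0x7FF, 11),
            snorm((word >> 11) & 0x7FF, 11),
            snorm(word >> 22, 10))

packed = pack_11_11_10(0.0, 0.0, 1.0)
print(unpack_11_11_10(packed))  # close to (0.0, 0.0, 1.0)
```

With quantization error on the order of 1/1023 per component, this precision is generally considered ample for normals, tangents, and binormals.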
11. Fetch From Two Streams Cycles/64 vertex vector: ALU 12, vertex 64, sequencer 12
3 GPRs, 31 threads
// Fetch position
vfetch_full r2.xyz1, r0.x, vf0,
Offset=0,
DataFormat=FMT_32_32_32_FLOAT
// Fetch diffuse texcoord
vfetch_full r0.xy0_, r0.x, vf2,
Offset=0,
DataFormat=FMT_32_32_FLOAT
mul r1, r2.y, c1
mad r1, r2.x, c0.wyxz, r1.wyxz
mad r1, r2.z, c2.zywx, r1.wyxz
mad r2, r2.w, c3, r1.wyxz
mul r1, r2.y, c5.wzyx
mad r1, r2.x, c4.xzwy, r1.wyxz
mad r1, r2.z, c6.yzxw, r1.wyxz
mad oPos, r2.w, c7, r1.zxyw
nop
12. 2 Fetches From Same Stream Cycles/64 vertex vector: ALU 12, vertex 64, sequencer 12
3 GPRs, 31 threads
// Fetch position
vfetch_full r2.xyz1, r0.x, vf0,
Offset=0,
DataFormat=FMT_32_32_32_FLOAT
// Fetch diffuse texcoord
vfetch_full r0.xy0_, r0.x, vf0,
Offset=10,
DataFormat=FMT_32_32_FLOAT
mul r1, r2.y, c1
mad r1, r2.x, c0.wyxz, r1.wyxz
mad r1, r2.z, c2.zywx, r1.wyxz
mad r2, r2.w, c3, r1.wyxz
mul r1, r2.y, c5.wzyx
mad r1, r2.x, c4.xzwy, r1.wyxz
mad r1, r2.z, c6.yzxw, r1.wyxz
mad oPos, r2.w, c7, r1.zxyw
nop
13. 1 Fetch From Single Stream Cycles/64 vertex vector: ALU 12, vertex 32, sequencer 12
3 GPRs, 31 threads
// Fetch position
vfetch_full r2.xyz1, r0.x, vf0,
Offset=0,
DataFormat=FMT_32_32_32_FLOAT
// Fetch diffuse texcoord
vfetch_mini r0.xy0_, r0.x, vf0,
Offset=5,
DataFormat=FMT_32_32_FLOAT
mul r1, r2.y, c1
mad r1, r2.x, c0.wyxz, r1.wyxz
mad r1, r2.z, c2.zywx, r1.wyxz
mad r2, r2.w, c3, r1.wyxz
mul r1, r2.y, c5.wzyx
mad r1, r2.x, c4.xzwy, r1.wyxz
mad r1, r2.z, c6.yzxw, r1.wyxz
mad oPos, r2.w, c7, r1.zxyw
nop
14. Triple the Fetch Cost //DepthOnlyVS.hlsl
struct VS_INPUT
{
float4 Position : POSITION;
float4 BoneIndices : BLENDINDICES;
float4 BoneWeights : BLENDWEIGHT;
float2 DiffuseTexCoords : TEXCOORD0;
};
VS_OUTPUT DepthOnlyVS( VS_INPUT In )
{
…
}
15. Triple the Fetch Cost Cycles/64 vertex vector: ALU 38, vertex 96, sequencer 22
7 GPRs, 27 threads
vfetch_full r6.xyz1, r0.x, vf0,
Offset=0
DataFormat=FMT_32_32_32_FLOAT // FLOAT3 POSITION
vfetch_full r1, r0.x, vf0,
Offset=8,
DataFormat=FMT_8_8_8_8 // UBYTE4 BLENDINDICES
vfetch_mini r2,
Offset=9,
DataFormat=FMT_8_8_8_8 // USHORT4N BLENDWEIGHT
vfetch_full r0.xy__, r0.x, vf0,
Offset=6,
DataFormat=FMT_32_32_FLOAT // FLOAT2 TEXCOORD
mul r1, r1.wzyx, c255.x
movas r0._, r1.x
dp4 r3.x, c[8+a0].zxyw, r6.zxyw
…
This shader does 3 times the fetch cost it needs to.
The vertex decl (and vertex layout to match) should be rearranged so that this shader needs just one vfetch_full.
16. One-third the Fetch Cost Cycles/64 vertex vector: ALU 38, vertex 32, sequencer 22
7 GPRs, 27 threads
vfetch_full r6.xyz1, r0.x, vf0,
Offset=0
DataFormat=FMT_32_32_32_FLOAT // FLOAT3 POSITION
vfetch_mini r1
Offset=3,
DataFormat=FMT_8_8_8_8 // UBYTE4 BLENDINDICES
vfetch_mini r2,
Offset=4,
DataFormat=FMT_8_8_8_8 // USHORT4N BLENDWEIGHT
vfetch_mini r0.xy__
Offset=5,
DataFormat=FMT_32_32_FLOAT // FLOAT2 TEXCOORD
mul r1, r1.wzyx, c255.x
movas r0._, r1.x
dp4 r3.x, c[8+a0].zxyw, r6.zxyw
…
This is much better.
17. Depth-Only Rendering GPUs have perf improvements for depth-buffering
Hierarchical-Z
Double-speed, depth-only rendering
Depth-only rendering is often still fill-bound
A few triangles can cover tens of thousands to hundreds of thousands of quads
True for z-prepass, vis-testing, shadow rendering
For vis-testing, use tighter bounding objects and proper culling
Since we’re fill-bound, consider doing pixels
I.e.: Give up the double-speed benefit
Lay down something useful to spare an additional pass
Velocity, focal plane, etc.
18. Pixel Shader Performance Most calls will be fill-bound
Pixel shader optimization is some combination of:
Minimizing ALUs
Minimizing GPRs
Reducing control flow overhead
Improving texture cache usage
Avoiding expensive work
Also, trying to balance the hardware
Fetches versus ALUs versus GPRs
A big challenge is getting the shader compiler to do exactly what we want
19. Minimizing ALUs Minor modifications to an expression can change the number of ALUs
The shader compiler produces slightly different results
Play around to try different things out
Avoid math on constants
Reducing just one multiply has saved 3 ALU ops
Using [isolate] can dramatically change results
Especially for the ALU ops around texture fetches
Verify shader compiler output
Get comfortable with assembly
Compare with expectations given your HLSL code
Finally, start tweaking to get what you want
20. Minimizing GPRs Minimizing ALUs usually saves on GPRs as well
Unrolling loops consumes more GPRs
Conversely, using loops can save GPRs
Lumping tfetches to top of shader costs GPRs
Both for calculated tex coords
And fetch results
Xbox 360 extensions can save ALUs
Like tfetch with offsets
Shader compiler can be told to make do with a user-specified max number of GPRs
21. Control Flow Flattening or preserving loops can have a huge effect on shader performance
One game shaved 4 ms off of a 30 ms scene by unrolling just one loop—its main pixel shader
Unrolling allowed for much more aggressive optimization by the compiler
However, the DepthOfField shader saves 2 ms by not preserving the loop
Using the loop reduced GPR usage dramatically
Recommendation is to try both ways
Non-branching control flow is still overhead
Gets reduced as ALU count goes down
22. “Early Out” Shaders “Early out” shaders may not really do anything
On Xbox 360, HLSL “clip” doesn’t really kill a pixel
But rather just invalidates the output
Remaining texture fetches may be spared…
…but all remaining ALU instructions still execute
Write a test for other hardware to see if early outs actually improve performance
Otherwise, assume they don’t
For any gain, all 64 pixels would need to be killed
Dynamic control flow via an if-else block should get close to what you intend
23. Dynamic Branching Dynamic branching can help or hurt performance
All pixels in a thread must actually take the branch
Otherwise, both branches need to be executed
Use branching to skip fetches and calculations
Like whenever the alpha will be zero
But beware of multiple code paths executing
if…else statements result in additional overhead
The ?: operator turns into a single instruction
Avoid static branching masquerading as dynamic
Do not use numerical constants for control flow; use booleans instead
Special-case various simple code paths, which results in less control flow and fewer GPRs used
24. Thread Size and Branching
25. Thread Size and Branching 43% of pixel threads take the non-lighting path
14% of pixel threads take the lighting path
43% of pixel threads take the soft shadow path
26. Thread Size and Branching 54% of pixel threads take the non-lighting path
20% of pixel threads take the lighting path
26% of pixel threads take the soft shadow path
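Slides 24–26 illustrate how thread size changes the fraction of threads that can take a single branch. A toy simulation of that effect (the scene layout and all names here are invented, not the talk's test scene):

```python
# Toy simulation of the thread-size effect on dynamic branching: every pixel
# in a thread must take the same branch for the skip to pay off. Classify
# each pixel of a small "screen" into a path, tile the screen into square
# thread blocks, and count the blocks that are fully coherent.

def coherent_fraction(classify, width, height, block):
    coherent = total = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            classes = {classify(x, y)
                       for y in range(by, by + block)
                       for x in range(bx, bx + block)}
            total += 1
            coherent += (len(classes) == 1)
    return coherent / total

# Example scene: a vertical lit/unlit boundary at x = 30.
scene = lambda x, y: "lit" if x >= 30 else "unlit"

print(coherent_fraction(scene, 64, 64, 4))   # 0.9375: small threads
print(coherent_fraction(scene, 64, 64, 16))  # 0.75: bigger threads
```

Only blocks straddling the boundary pay for both paths, and larger blocks straddle it more often, which is why thread size matters when you budget for dynamic branching.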
27. Texture Cache Usage Fetches have latency that becomes a bottleneck
Can be a challenge to fetch 6–10 textures per pixel and many MB of texture traffic through a 32 KB cache
Age-old recommendations still apply
Compare measured texture traffic to ideal traffic
Consider a 1280x720x32-bit post-processing pass
1280x720x32 bits = 3.686 MB of ideal texture traffic
But measured result may claim 7.0+ MB
Triangle rasterization can and will affect texture cache usage
In the case above, it’s the only explanation
Pixels are processed in an order that causes texels to be evicted from the cache and re-fetched
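The traffic comparison above is simple arithmetic. Here MB means 10^6 bytes, matching the slide's 3.686 MB figure, and the 7.0 MB measured value is the slide's own example:

```python
# Ideal texture traffic for one full-screen pass over a 1280x720 32-bit
# texture, versus a measured number. A measured/ideal ratio well above 1.0
# points at cache evictions and re-fetches caused by rasterization order.

def ideal_traffic_mb(width, height, bits_per_texel):
    return width * height * (bits_per_texel // 8) / 1e6

ideal = ideal_traffic_mb(1280, 720, 32)
measured = 7.0  # the slide's example measurement

print(round(ideal, 3))             # 3.686 MB
print(round(measured / ideal, 1))  # ~1.9x overfetch
```

Anything much above 1.0x that filtering and texel size can't explain is worth attacking with the rasterization-order techniques on the next slide.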
28. Rasterization Test Use an MxN grid instead of a full-screen quad
Smaller primitives confine rasterization to better match usage patterns of the texture cache
Prevent premature evictions from the texture cache
Ideal grid size varies for different conditions
Number of textures, texel width, etc.
And surely for different hardware, too
Write a test that lets you try different grid configurations for each shader
For the DepthOfField shader, an 8x1 grid works best
For a different shader, 20x13 worked best
In all cases, 1x1 seems to be pretty bad
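One possible shape for such a grid test, sketched in Python; the helper name and the (x, y, u, v) layout are illustrative, not from the talk:

```python
# Sketch of the MxN grid idea: instead of one full-screen quad, emit an
# m-by-n grid of smaller quads so rasterization walks the screen in
# cache-friendlier chunks. Each quad is described by two opposite corners
# as (x, y, u, v) in [0, 1]; u/v equal x/y for a post-processing pass.

def make_grid_quads(m, n):
    quads = []
    for j in range(n):
        for i in range(m):
            x0, x1 = i / m, (i + 1) / m
            y0, y1 = j / n, (j + 1) / n
            quads.append(((x0, y0, x0, y0), (x1, y1, x1, y1)))
    return quads

print(len(make_grid_quads(8, 1)))    # 8 quads, the DepthOfField winner
print(len(make_grid_quads(20, 13)))  # 260 quads
```

Parameterizing m and n like this makes it cheap to sweep grid configurations per shader and per target GPU, which is exactly the kind of test the slide asks for.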
29. Conditional Processing The shader compiler doesn’t know when intermediate results might be zero
Diffuse alpha, N·L, specular, bone weight, lightmap contribution, etc.
Pixel is in shadow
Some constant value is zero (or one)
Try making expressions conditional when using these values
Experiment to see if even small branches pay off
Use a texture mask to mask off areas where expensive calculations can be avoided
For PCF, take a few samples to see if you need to do more
30. Multiple Passes Multiple passes cause lots of re-fetching of textures (normal maps, etc.)
However, separate passes are better for the tcache
Resolve and fetch cost of scene and depth textures adds up
Tiling overhead may be worth it to save passes
Alpha blending between passes eats bandwidth
ALU power is 4x that of texture hardware
Branching in shader can handle multiple lights
Consider multiple render targets
Meshes are often transformed many times
Try to skin meshes just once for all N passes
Consider memexport or StreamOut for skinning
31. Xbox 360 Platform Fixed hardware offers lots of low-level manipulations to tweak performance
Consistent hardware characteristics, like fetch rules
Shader GPR allocation
Tfetch with offsets
Ability to view/author shader microcode
Access to hardware performance counters
Custom formats (7e3) and blend modes
Predicated tiling
Custom tools
Microcode-aware shader compiler, with custom extensions and attributes
PIX for Xbox 360
It’s still important to measure real-world performance
32. Windows Platform Hardware can obviously vary a lot
Hard to get low-level knowledge of what the rules are
Driver performs additional compiling of shaders
Some things to check on include:
Effect of thread size on dynamic branching
Effect of GPR usage
32-bit versus 16-bit float performance
64-bit render target and texture performance
Multiple render target performance
Z-prepass effectiveness
Hardware support for shadowmap filtering
If possible, consider shader authoring on an Xbox 360 dev kit
Get ready for D3D 10
33. The HLSL Shader Compiler Don’t expect the shader compiler to solve all your problems
It can’t be perfect (and it’s not)
Garbage in == garbage out
It can’t know what you’re really trying to do
It’s easy to trick the compiler, especially with constants
It can’t know the situation the shader will run in
Texture cache usage, 64-bit textures, filtering, etc.
Vertex declarations
Interpolators
Alpha-blending
Neighboring vertices and pixels
Besides, what does the driver do with your shader once it gets it?
Other things it can’t really know are control flow, constant waterfalling, cache misses, fetch strides, bandwidth, mip levels, etc.
34. Shader Compiler The shader compiler can generate variants that perform dramatically different
Loops and branching versus flat control flow
Grouping tfetches versus minimizing GPR count
Isolated operations versus intertwining instructions
Reactions to a subtle difference in an expression
Be accountable for shader compiler output
Always verify the output of the shader compiler
Know what you want, then try to get shader compiler to play along
Rough-count number of hypothetical instructions for verification
35. Controlling the HLSL Compiler Options to affect compiler output include:
Compiler switches and HLSL attributes to mandate control flow
Manually unrolling loops
Rearranging HLSL code
The Xbox 360 compiler has a few more switches and attributes:
/Xmaxtempreg to limit GPR usage
[isolate]ing blocks can have a huge effect on code generation
Especially around fetches
Ditto for [noExpressionOptimizations]
Compiler output can often be improved by massaging the actual HLSL code
36. Massaging HLSL Code Changing one simple operation can improve or degrade output
The input changes, so the rules change, so code generation changes
But new code is not always better
Send weird cases into developer support
The one operation may be as simple as an add
Or moving an expression to earlier/later in the shader
Or needless math on constants:
rcp r0.x, c20.x
mul r0.xyz, c5.xyz, c6.w
movs r0.y, c0.z
cndeq r0, c6.x, r0, r1
Always verify results in assembly
37. Where’s the HLSL? Make sure your art pipeline lets you access/view/tweak HLSL shaders
Many engines assemble shader fragments dynamically
Meaning there’s not complete HLSL source lying around for every variation of every shader used in-game
You must solve this problem
Recommendation is to spit out exact HLSL immediately after compilation
Save the HLSL to a sequentially named file
Then add PIX output to your scene with the ID of each shader used
That way, you can trace draw calls back to an HLSL file that you can experiment with
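One possible shape for that dump-and-trace workflow, sketched in Python with entirely made-up names (the talk prescribes the idea, not this API):

```python
# Sketch of the "Where's the HLSL?" workflow: after the engine assembles a
# shader variant, dump the exact HLSL to a sequentially named file and keep
# an id so PIX draw-call annotations can point back at the source file.

import os

class ShaderDumper:
    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.next_id = 0
        os.makedirs(out_dir, exist_ok=True)

    def dump(self, hlsl_source):
        shader_id = self.next_id
        self.next_id += 1
        path = os.path.join(self.out_dir, "shader_%04d.hlsl" % shader_id)
        with open(path, "w") as f:
            f.write(hlsl_source)
        # Emit shader_id alongside each draw call (e.g. via PIX markers)
        # so captures can be traced back to this file.
        return shader_id, path

dumper = ShaderDumper("dumped_shaders")
sid, path = dumper.dump("float4 main() : COLOR { return 0; }")
print(sid, path)
```

The point is that every in-game variant, however it was assembled from fragments, lands on disk as compilable HLSL you can tweak in isolation.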
38. Test Things Out Optimizing shaders means trying out a lot of different things under real-world conditions
So it’s imperative to test things out
Shaders need to be buildable from the command line
And be able to be dropped into a test framework
And hooked up to performance tools like PIX
Get comfortable with shader compiler output
Verify that the assembly looks close to expected
Like GPR usage, control flow, tfetch placement
Isolate shaders and exaggerate their effect
Draw 100 full-screen quads instead of one
Draw objects off-screen to eliminate fill cost
Then, start tweaking…