1. Revving Up Shader Performance Shanon Drone
Development Engineer
XNA Developer Connection, Microsoft
2. Key Takeaways Know the shader hardware on a low level
Pixel threads, GPRs, fetch efficiency, latency, etc.
Learn from the Xbox 360
Architect shaders accordingly
Shader optimization is not always intuitive
Some things that seem to make sense actually hurt performance
At some point, you must take a trial-and-error approach
In other words…
You must write tests
Real-world measurements are the only way to really know what works
Analogy: To optimize shader performance, you need to know how the hardware works “under the hood”, not just the high-level API and shader programming language.
Analogy: In this analogy, you are not the race car driver; rather you are the mechanic, or even the mechanical engineer. The car is a finished product on race day, so this talk is about the performance mods you make while building the engine.
3. Under the Hood Help the shader compiler by writing shaders for how hardware works “under the hood”
Modern GPUs are highly parallel to hide latency
Lots of pipelines (maybe even a shared shader core)
Works on units larger than single vertices and pixels
Rasterization goes to tiles, then to quads, then to (groups of) pixels
Texture caches are still small
32 KB on Xbox 360 and similar on PC hardware
Shaders use 6–10 textures with 10–20 MB traffic
Rasterization order affects texture cache usage
Lots of other subtleties affect performance
Xbox 360 has 48 ALUs
X1300 had 4 shader processors
X1600 had 12 shader processors
X1800 XT has 16 TMUs, 16 ALUs, 16 ROPs, and 8 vertex units
X1900 XT has 16 TMUs, 48 ALUs, 16 ROPs, and 8 vertex units <- Note: 3× the ALU performance of the texture pipelines. Not unified.
Intel GMA 3000 has unified shader core, including dynamic load balancing
The NVIDIA D3D10 part, the G80, will have 48 pixel processors and unknown # of vertex processors. Not unified.
4. Pixel and Vertex “Vectors” GPUs work on groups of vertices and pixels
Xbox 360 works on a “vector” of 64 vertices or pixels
Meaning each instruction is executed 64 times at once
Indexed constants cause “constant waterfalling”
For example: “c[a0]” can translate to 64 different constants per one instruction
When all pixels take the same branch…
…performance can be good
But in most cases, both code paths will execute
Author shaders accordingly
“Early out” code paths may not work out like expected
Check assembly output
On ATI X1800, thread size is 16 pixels. Up to 512 threads means at least 8192 GPRs.
GeForce 7800 may use 24 pixel threads.
NVIDIA doesn’t release specific details, but suggested that:
- NV40 does 880 pixel blocks
- G70 does ~220 pixel blocks
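A toy counting model of the "constant waterfalling" described on this slide (not a hardware simulation; it only shows why divergent indices inside one 64-wide vector serialize the constant read). The 64-wide vector size is the Xbox 360 figure from the slides; the function name is made up:

```python
# Toy model of constant waterfalling: an indexed constant read like c[a0]
# must be serviced once per distinct index within a 64-wide pixel/vertex
# vector, so one instruction can turn into up to 64 serialized reads.

def waterfall_reads(indices):
    """Serialized constant reads needed for one 64-wide vector."""
    assert len(indices) == 64
    return len(set(indices))

# Best case: every lane indexes the same constant -> one read.
print(waterfall_reads([7] * 64))          # 1
# Worst case: every lane indexes a different constant -> 64 reads.
print(waterfall_reads(list(range(64))))   # 64
```

The same counting argument is why "when all pixels take the same branch, performance can be good": coherence across the vector is what keeps one instruction costing one issue.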
5. Why GPRs Matter GPU can process many ALU ops per clock cycle
But fetch results can take hundreds of clock cycles
Due to cache misses, texel size, filtering, etc.
Threads hide fetch latency
While one “thread” is waiting for fetch results, the GPU can start processing another thread
Max number of threads is limited by the GPR pool
Example: 64 pixels × 32 GPRs = 2048 GPRs per thread
The GPU can quickly run out of GPRs
Fewer GPRs used generally translates to more threads
6. Minimize GPR Usage Xbox 360 has 24,576 GPRs
64-unit vector × 128 register banks × 3 SIMD units
Sounds like a lot, but many shaders are “GPR bound”
The issue is the same for PC hardware
Tweak shaders to use the least number of GPRs
Maybe even at the expense of additional ALUs or unintuitive control flow
Unrolling loops usually requires more ALUs…
…as does lumping tfetches at the beginning of the shader
Ditto for über-shaders, even with static control flow
Rules get tricky, so we must try out many variations
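The GPR budget arithmetic on slides 5 and 6 can be sketched as a quick back-of-the-envelope calculation. This only reproduces the slides' numbers; real hardware also caps the in-flight thread count independently of the register pool:

```python
# Xbox 360 GPR pool per the slides: 64 (vector width) x 128 (register banks)
# x 3 (SIMD units) = 24,576 GPRs. Each in-flight thread of 64 pixels holds
# vector_width * gprs_per_pixel registers, so fewer GPRs per pixel means
# more threads available to hide fetch latency.

VECTOR_WIDTH = 64
GPR_POOL = VECTOR_WIDTH * 128 * 3  # 24,576

def max_threads(gprs_per_pixel):
    return GPR_POOL // (VECTOR_WIDTH * gprs_per_pixel)

print(GPR_POOL)         # 24576
print(max_threads(32))  # 12 threads at 32 GPRs/pixel
print(max_threads(8))   # 48 threads at 8 GPRs/pixel
```

Dropping from 32 to 8 GPRs per pixel quadruples the number of latency-hiding threads, which is the whole argument for trading a few extra ALU ops for lower register pressure.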
7. Vertex Shader Performance Lots of reasons to care about vertex shader performance:
Modern games have many passes (z-pass, tiling, etc.)
Non-shared cores devote less hardware for vertices
For shared cores, resources could be used for pixels
For ALU bound shaders…
Constant waterfalling can be the biggest problem
Matrix order can affect ALU optimization
Many vertex shaders are fetch bound
Especially lightweight shaders
One fetch can cost 32x more cycles than an ALU
8. VFetch Goal is to minimize the number of fetches
Which is why vertex compression is so important
Vertex data should be aligned to 32 or 64 bytes
Multiple streams will multiply the fetch cost
Vertex declaration should match fetch order
Shader compiler has ZERO info about vertex components
Shaders are patched at run time with the vertex declaration
Shader patching can’t optimize out unnecessary vfetches
9. Mega vs. Mini Fetches A quick refresher…
GPU reads vertex data in groups of bytes
Typically 32 bytes (a “mega” or “full” fetch)
Additional fetches within the 32-byte range should be free (a “mini” fetch)
On Xbox 360:
A “vfetch_full” pulls in 32 bytes worth of data
At two fetches per clock cycle, one fetch across 64 vertices costs 32 cycles
Without 32 cycles worth of ALU ops, the shader is fetch bound
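The fetch-cost arithmetic above can be checked with a small sketch. The 2-fetches-per-clock rate and 64-wide vertex vector are the slide's Xbox 360 numbers; the helper name is illustrative:

```python
# Fetch-bound check from the slide's numbers: the hardware issues 2 fetches
# per clock, so one vfetch_full across a 64-vertex vector costs 64 / 2 = 32
# cycles. A shader with fewer ALU cycles than fetch cycles is fetch bound.
# This mirrors the cycle counts in slides 11-13 (e.g. ALU 12 vs vertex 64).

VECTOR_SIZE = 64
FETCHES_PER_CLOCK = 2
CYCLES_PER_FULL_FETCH = VECTOR_SIZE // FETCHES_PER_CLOCK  # 32

def bound_by(num_full_fetches, alu_cycles):
    fetch_cycles = num_full_fetches * CYCLES_PER_FULL_FETCH
    return "fetch" if fetch_cycles > alu_cycles else "alu"

print(bound_by(2, 12))  # 'fetch': 64 fetch cycles dominate 12 ALU cycles
print(bound_by(1, 40))  # 'alu'
```

This is why the two-stream and two-full-fetch variants on the next slides report 64 vertex cycles against only 12 ALU cycles, while the mini-fetch variant halves the vertex cost.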
10. Vfetch Recommendations Compress vertices
Normals, Tangent, BiNormal -> 11:11:10
Texture coords -> 16:16
Put all non-lighting components first
So depth-only shaders do just one fetch
FLOAT32 Position[3];
UINT8 BoneWeights[4];
UINT8 BoneIndices[4];
UINT16 DiffuseTexCoords[2];
UINT16 NormalMapCoords[2];
DEC3N Normal;
DEC3N Tangent;
DEC3N BiNormal;
All in one stream, of course
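To make the 11:11:10 compression suggestion concrete, here is a hypothetical packer in Python. The exact DEC3N / 11:11:10 bit layout and signedness vary by platform, so treat this as an illustration of the 12-bytes-to-4 savings, not the actual Xbox 360 encoding:

```python
# Sketch of 11:11:10 normal compression: map each component from [-1, 1] to
# an unsigned-normalized integer and pack x, y, z into one 32-bit word.
# A FLOAT32x3 normal (12 bytes) shrinks to 4 bytes, cutting fetch traffic.

def pack_11_11_10(x, y, z):
    def unorm(v, bits):
        scale = (1 << bits) - 1
        return round((max(-1.0, min(1.0, v)) * 0.5 + 0.5) * scale)
    return unorm(x, 11) | (unorm(y, 11) << 11) | (unorm(z, 10) << 22)

def unpack_11_11_10(word):
    def snorm(u, bits):
        scale = (1 << bits) - 1
        return (u / scale) * 2.0 - 1.0
    return (snorm(word & 0x7FF, 11),
            snorm((word >> 11) & 0x7FF, 11),
            snorm(word >> 22, 10))

packed = pack_11_11_10(0.0, 0.0, 1.0)
print(unpack_11_11_10(packed))  # close to (0.0, 0.0, 1.0)
```

With quantization error on the order of 1/1023 per component, this precision is generally considered ample for normals, tangents, and binormals.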
11. Fetch From Two Streams Cycles/64 vertex vector: ALU 12, vertex 64, sequencer 12
3 GPRs, 31 threads
// Fetch position
vfetch_full r2.xyz1, r0.x, vf0,
Offset=0,
DataFormat=FMT_32_32_32_FLOAT
// Fetch diffuse texcoord
vfetch_full r0.xy0_, r0.x, vf2,
Offset=0,
DataFormat=FMT_32_32_FLOAT
mul r1, r2.y, c1
mad r1, r2.x, c0.wyxz, r1.wyxz
mad r1, r2.z, c2.zywx, r1.wyxz
mad r2, r2.w, c3, r1.wyxz
mul r1, r2.y, c5.wzyx
mad r1, r2.x, c4.xzwy, r1.wyxz
mad r1, r2.z, c6.yzxw, r1.wyxz
mad oPos, r2.w, c7, r1.zxyw
nop
12. 2 Fetches From Same Stream Cycles/64 vertex vector: ALU 12, vertex 64, sequencer 12
3 GPRs, 31 threads
// Fetch position
vfetch_full r2.xyz1, r0.x, vf0,
Offset=0,
DataFormat=FMT_32_32_32_FLOAT
// Fetch diffuse texcoord
vfetch_full r0.xy0_, r0.x, vf0,
Offset=10,
DataFormat=FMT_32_32_FLOAT
mul r1, r2.y, c1
mad r1, r2.x, c0.wyxz, r1.wyxz
mad r1, r2.z, c2.zywx, r1.wyxz
mad r2, r2.w, c3, r1.wyxz
mul r1, r2.y, c5.wzyx
mad r1, r2.x, c4.xzwy, r1.wyxz
mad r1, r2.z, c6.yzxw, r1.wyxz
mad oPos, r2.w, c7, r1.zxyw
nop
13. 1 Fetch From Single Stream Cycles/64 vertex vector: ALU 12, vertex 32, sequencer 12
3 GPRs, 31 threads
// Fetch position
vfetch_full r2.xyz1, r0.x, vf0,
Offset=0,
DataFormat=FMT_32_32_32_FLOAT
// Fetch diffuse texcoord
vfetch_mini r0.xy0_, r0.x, vf0,
Offset=5,
DataFormat=FMT_32_32_FLOAT
mul r1, r2.y, c1
mad r1, r2.x, c0.wyxz, r1.wyxz
mad r1, r2.z, c2.zywx, r1.wyxz
mad r2, r2.w, c3, r1.wyxz
mul r1, r2.y, c5.wzyx
mad r1, r2.x, c4.xzwy, r1.wyxz
mad r1, r2.z, c6.yzxw, r1.wyxz
mad oPos, r2.w, c7, r1.zxyw
nop
14. Triple the Fetch Cost //DepthOnlyVS.hlsl
struct VS_INPUT
{
float4 Position : POSITION;
float4 BoneIndices : BLENDINDICES;
float4 BoneWeights : BLENDWEIGHT;
float2 DiffuseTexCoords : TEXCOORD0;
};
VS_OUTPUT DepthOnlyVS( VS_INPUT In )
{
…
}
15. Triple the Fetch Cost Cycles/64 vertex vector: ALU 38, vertex 96, sequencer 22
7 GPRs, 27 threads
vfetch_full r6.xyz1, r0.x, vf0,
Offset=0
DataFormat=FMT_32_32_32_FLOAT // FLOAT3 POSITION
vfetch_full r1, r0.x, vf0,
Offset=8,
DataFormat=FMT_8_8_8_8 // UBYTE4 BLENDINDICES
vfetch_mini r2,
Offset=9,
DataFormat=FMT_8_8_8_8 // USHORT4N BLENDWEIGHT
vfetch_full r0.xy__, r0.x, vf0,
Offset=6,
DataFormat=FMT_32_32_FLOAT // FLOAT2 TEXCOORD
mul r1, r1.wzyx, c255.x
movas r0._, r1.x
dp4 r3.x, c[8+a0].zxyw, r6.zxyw
…
This shader does 3 times the fetch cost it needs to.
The vertex decl (and vertex layout to match) should be rearranged so that this shader needs just one vfetch_full.
16. One-third the Fetch Cost Cycles/64 vertex vector: ALU 38, vertex 32, sequencer 22
7 GPRs, 27 threads
vfetch_full r6.xyz1, r0.x, vf0,
Offset=0
DataFormat=FMT_32_32_32_FLOAT // FLOAT3 POSITION
vfetch_mini r1
Offset=3,
DataFormat=FMT_8_8_8_8 // UBYTE4 BLENDINDICES
vfetch_mini r2,
Offset=4,
DataFormat=FMT_8_8_8_8 // USHORT4N BLENDWEIGHT
vfetch_mini r0.xy__
Offset=5,
DataFormat=FMT_32_32_FLOAT // FLOAT2 TEXCOORD
mul r1, r1.wzyx, c255.x
movas r0._, r1.x
dp4 r3.x, c[8+a0].zxyw, r6.zxyw
…
This is much better.
17. Depth-Only Rendering GPUs have perf improvements for depth-buffering
Hierarchical-Z
Double-speed, depth-only rendering
Depth-only rendering is often still fill-bound
A few triangles can cover tens of thousands to hundreds of thousands of quads
True for z-prepass, vis-testing, shadow rendering
For vis-testing, use tighter bounding objects and proper culling
Since we’re fill-bound, consider doing pixels
I.e.: Give up the double-speed benefit
Lay down something useful to spare an additional pass
Velocity, focal plane, etc.
18. Pixel Shader Performance Most calls will be fill-bound
Pixel shader optimization is some combination of:
Minimizing ALUs
Minimizing GPRs
Reducing control flow overhead
Improving texture cache usage
Avoiding expensive work
Also, trying to balance the hardware
Fetches versus ALUs versus GPRs
A big challenge is getting the shader compiler to do exactly what we want
19. Minimizing ALUs Minor modifications to an expression can change the number of ALUs
The shader compiler produces slightly different results
Play around to try different things out
Avoid math on constants
Reducing just one multiply has saved 3 ALU ops
Using [isolate] can dramatically change results
Especially for the ALU ops around texture fetches
Verify shader compiler output
Get comfortable with assembly
Compare with expectations given your HLSL code
Finally, start tweaking to get what you want
20. Minimizing GPRs Minimizing ALUs usually saves on GPRs as well
Unrolling loops consumes more GPRs
Conversely, using loops can save GPRs
Lumping tfetches to top of shader costs GPRs
Both for calculated tex coords
And fetch results
Xbox 360 extensions can save ALUs
Like tfetch with offsets
Shader compiler can be told to make do with a user-specified max number of GPRs
21. Control Flow Flattening or preserving loops can have a huge effect on shader performance
One game shaved 4 ms off of a 30 ms scene by unrolling just one loop—its main pixel shader
Unrolling allowed for much more aggressive optimization by the compiler
However, the DepthOfField shader saves 2 ms by not preserving the loop
Using the loop reduced GPR usage dramatically
Recommendation is to try both ways
Non-branching control flow is still overhead
Gets reduced as ALU count goes down
22. “Early Out” Shaders “Early out” shaders may not really do anything
On Xbox 360, HLSL “clip” doesn’t really kill a pixel
But rather just invalidates the output
Remaining texture fetches may be spared…
…but all remaining ALU instructions still execute
Write a test for other hardware to see if early outs actually improve performance
Otherwise, assume they don’t
For any gain, all 64 pixels would need to be killed
Dynamic control flow via an if-else block should get close to what you intend
23. Dynamic Branching Dynamic branching can help or hurt performance
All pixels in a thread must actually take the branch
Otherwise, both branches need to be executed
Use branching to skip fetches and calculations
Like whenever the alpha will be zero
But beware of multiple code paths executing
if…else statements result in additional overhead
The ?: operator turns into a single instruction
Avoid static branching masquerading as dynamic
Do not use numerical constants for control flow; use booleans instead
Special-case various simple code paths, which results in less control flow and fewer GPRs used
24. Thread Size and Branching
25. Thread Size and Branching 43% of pixel threads take the non-lighting path
14% of pixel threads take the lighting path
43% of pixel threads take the soft shadow path
26. Thread Size and Branching 54% of pixel threads take the non-lighting path
20% of pixel threads take the lighting path
26% of pixel threads take the soft shadow path
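Slides 24–26 illustrate how thread size changes the fraction of threads that can take a single branch. A toy simulation of that effect (the scene layout and all names here are invented, not the talk's test scene):

```python
# Toy simulation of the thread-size effect on dynamic branching: every pixel
# in a thread must take the same branch for the skip to pay off. Classify
# each pixel of a small "screen" into a path, tile the screen into square
# thread blocks, and count the blocks that are fully coherent.

def coherent_fraction(classify, width, height, block):
    coherent = total = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            classes = {classify(x, y)
                       for y in range(by, by + block)
                       for x in range(bx, bx + block)}
            total += 1
            coherent += (len(classes) == 1)
    return coherent / total

# Example scene: a vertical lit/unlit boundary at x = 30.
scene = lambda x, y: "lit" if x >= 30 else "unlit"

print(coherent_fraction(scene, 64, 64, 4))   # 0.9375: small threads
print(coherent_fraction(scene, 64, 64, 16))  # 0.75: bigger threads
```

Only blocks straddling the boundary pay for both paths, and larger blocks straddle it more often, which is why thread size matters when you budget for dynamic branching.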
27. Texture Cache Usage Fetches have latency that becomes a bottleneck
Can be a challenge to fetch 6–10 textures per pixel and many MB of texture traffic through a 32 KB cache
Age-old recommendations still apply
Compare measured texture traffic to ideal traffic
Consider a 1280x720x32-bit post-processing pass
1280x720x32 bits = 3.686 MB of ideal texture traffic
But measured result may claim 7.0+ MB
Triangle rasterization can and will affect texture cache usage
In the case above, it’s the only explanation
Pixels are processed in an order that causes texels to be evicted from the cache and re-fetched
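The traffic comparison above is simple arithmetic. Here MB means 10^6 bytes, matching the slide's 3.686 MB figure, and the 7.0 MB measured value is the slide's own example:

```python
# Ideal texture traffic for one full-screen pass over a 1280x720 32-bit
# texture, versus a measured number. A measured/ideal ratio well above 1.0
# points at cache evictions and re-fetches caused by rasterization order.

def ideal_traffic_mb(width, height, bits_per_texel):
    return width * height * (bits_per_texel // 8) / 1e6

ideal = ideal_traffic_mb(1280, 720, 32)
measured = 7.0  # the slide's example measurement

print(round(ideal, 3))             # 3.686 MB
print(round(measured / ideal, 1))  # ~1.9x overfetch
```

Anything much above 1.0x that filtering and texel size can't explain is worth attacking with the rasterization-order techniques on the next slide.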
28. Rasterization Test Use an MxN grid instead of a full-screen quad
Smaller primitives confine rasterization to better match usage patterns of the texture cache
Prevent premature evictions from the texture cache
Ideal grid size varies for different conditions
Number of textures, texel width, etc.
And surely for different hardware, too
Write a test that lets you try different grid configurations for each shader
For the DepthOfField shader, an 8x1 grid works best
For a different shader, 20x13 worked best
In all cases, 1x1 seems to be pretty bad
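One possible shape for such a grid test, sketched in Python; the helper name and the (x, y, u, v) layout are illustrative, not from the talk:

```python
# Sketch of the MxN grid idea: instead of one full-screen quad, emit an
# m-by-n grid of smaller quads so rasterization walks the screen in
# cache-friendlier chunks. Each quad is described by two opposite corners
# as (x, y, u, v) in [0, 1]; u/v equal x/y for a post-processing pass.

def make_grid_quads(m, n):
    quads = []
    for j in range(n):
        for i in range(m):
            x0, x1 = i / m, (i + 1) / m
            y0, y1 = j / n, (j + 1) / n
            quads.append(((x0, y0, x0, y0), (x1, y1, x1, y1)))
    return quads

print(len(make_grid_quads(8, 1)))    # 8 quads, the DepthOfField winner
print(len(make_grid_quads(20, 13)))  # 260 quads
```

Parameterizing m and n like this makes it cheap to sweep grid configurations per shader and per target GPU, which is exactly the kind of test the slide asks for.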
29. Conditional Processing The shader compiler doesn’t know when intermediate results might be zero
Diffuse alpha, N·L, specular, bone weight, lightmap contribution, etc.
Pixel is in shadow
Some constant value is zero (or one)
Try making expressions conditional when using these values
Experiment to see if even small branches pay off
Use a texture mask to mask off areas where expensive calculations can be avoided
For PCF, take a few samples to see if you need to do more
30. Multiple Passes Multiple passes cause lots of re-fetching of textures (normal maps, etc.)
However, separate passes are better for the tcache
Resolve and fetch cost of scene and depth textures adds up
Tiling overhead may be worth it to save passes
Alpha blending between passes eats bandwidth
ALU power is 4x that of texture hardware
Branching in shader can handle multiple lights
Consider multiple render targets
Meshes are often transformed many times
Try to skin meshes just once for all N passes
Consider memexport or StreamOut for skinning
31. Xbox 360 Platform Fixed hardware offers lots of low-level manipulations to tweak performance
Consistent hardware characteristics, like fetch rules
Shader GPR allocation
Tfetch with offsets
Ability to view/author shader microcode
Access to hardware performance counters
Custom formats (7e3) and blend modes
Predicated tiling
Custom tools
Microcode-aware shader compiler, with custom extensions and attributes
PIX for Xbox 360
It’s still important to measure real-world performance
32. Windows Platform Hardware can obviously vary a lot
Hard to get low-level knowledge of what the rules are
Driver performs additional compiling of shaders
Some things to check on include:
Effect of thread size on dynamic branching
Effect of GPR usage
32-bit versus 16-bit float performance
64-bit render target and texture performance
Multiple render target performance
Z-prepass effectiveness
Hardware support for shadowmap filtering
If possible, consider shader authoring on an Xbox 360 dev kit
Get ready for D3D 10
33. The HLSL Shader Compiler Don’t expect the shader compiler to solve all your problems
It can’t be perfect (and it’s not)
Garbage in == garbage out
It can’t know what you’re really trying to do
It’s easy to trick the compiler, especially with constants
It can’t know the situation the shader will run in
Texture cache usage, 64-bit textures, filtering, etc.
Vertex declarations
Interpolators
Alpha-blending
Neighboring vertices and pixels
Besides, what does the driver do with your shader once it gets it?
Other things it can’t really know are control flow, constant waterfalling, cache misses, fetch strides, bandwidth, mip levels, etc.
34. Shader Compiler The shader compiler can generate variants that perform dramatically different
Loops and branching versus flat control flow
Grouping tfetches versus minimizing GPR count
Isolated operations versus intertwining instructions
Reactions to a subtle difference in an expression
Be accountable for shader compiler output
Always verify the output of the shader compiler
Know what you want, then try to get shader compiler to play along
Rough-count number of hypothetical instructions for verification
35. Controlling the HLSL Compiler Options to affect compiler output include:
Compiler switches and HLSL attributes to mandate control flow
Manually unrolling loops
Rearranging HLSL code
The Xbox 360 compiler has a few more switches and attributes:
/Xmaxtempreg to limit GPR usage
[isolate]ing blocks can have a huge effect on code generation
Especially around fetches
Ditto for [noExpressionOptimizations]
Compiler output can often be improved by massaging the actual HLSL code
36. Massaging HLSL Code Changing one simple operation can improve or degrade output
The input changes, so the rules change, so code generation changes
But new code is not always better
Send weird cases into developer support
The one operation may be as simple as an add
Or moving an expression to earlier/later in the shader
Or needless math on constants:
rcp r0.x, c20.x
mul r0.xyz, c5.xyz, c6.w
movs r0.y, c0.z
cndeq r0, c6.x, r0, r1
Always verify results in assembly
37. Where’s the HLSL? Make sure your art pipeline lets you access/view/tweak HLSL shaders
Many engines assemble shader fragments dynamically
Meaning there’s not complete HLSL source lying around for every variation of every shader used in-game
You must solve this problem
Recommendation is to spit out exact HLSL immediately after compilation
Save the HLSL to a sequentially named file
Then add PIX output to your scene with the ID of each shader used
That way, you can trace draw calls back to an HLSL file that you can experiment with
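One possible shape for that dump-and-trace workflow, sketched in Python with entirely made-up names (the talk prescribes the idea, not this API):

```python
# Sketch of the "Where's the HLSL?" workflow: after the engine assembles a
# shader variant, dump the exact HLSL to a sequentially named file and keep
# an id so PIX draw-call annotations can point back at the source file.

import os

class ShaderDumper:
    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.next_id = 0
        os.makedirs(out_dir, exist_ok=True)

    def dump(self, hlsl_source):
        shader_id = self.next_id
        self.next_id += 1
        path = os.path.join(self.out_dir, "shader_%04d.hlsl" % shader_id)
        with open(path, "w") as f:
            f.write(hlsl_source)
        # Emit shader_id alongside each draw call (e.g. via PIX markers)
        # so captures can be traced back to this file.
        return shader_id, path

dumper = ShaderDumper("dumped_shaders")
sid, path = dumper.dump("float4 main() : COLOR { return 0; }")
print(sid, path)
```

The point is that every in-game variant, however it was assembled from fragments, lands on disk as compilable HLSL you can tweak in isolation.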
38. Test Things Out Optimizing shaders means trying out a lot of different things under real-world conditions
So it’s imperative to test things out
Shaders need to be buildable from the command line
And be able to be dropped into a test framework
And hooked up to performance tools like PIX
Get comfortable with shader compiler output
Verify that the assembly looks close to expected
Like GPR usage, control flow, tfetch placement
Isolate shaders and exaggerate their effect
Draw 100 full-screen quads instead of one
Draw objects off-screen to eliminate fill cost
Then, start tweaking…