Advanced D3D10 Rendering Emil Persson May 24, 2007
Overview
• Introduction to D3D10
• Rendering techniques in D3D10
• Optimizations
Introduction
• Best D3D revision yet!
• Clean and powerful API
• Lots of new features
  • SM 4.0
  • New geometry shader
  • Stream Out
  • Texture arrays
  • Render to volume texture
  • MSAA individual sample access
  • Constant buffers
  • Sampler state decoupled from texture unit
  • Dual-source blending
  • Etc…
Clean API
• Vista only
• Everything is mandatory (almost)
• No legacy hardware support
  • Clean starting point for future evolution of the API
  • Limited market short-term
• Some old features deprecated
  • Fixed function
  • Assembly shaders
  • Alpha test
  • Triangle fans
  • Point sprites
  • Clip planes
Dealing with deprecated features
• Fixed function
  • Write a few über-shaders
• Assembly shaders
  • Convert to HLSL
• Alpha test
  • Use discard or clip() in pixel shader
  • Use alpha-to-coverage
• Triangle fans
  • Seldom used anyway, usually just for a quad
  • Convert to triangle list or strip
• Point sprites
  • Expand point to 2 triangles in GS
• Clip planes
  • Use clip distance and/or cull distance
SM 4.0
• Geometry shader
  • Processes a full primitive (point, line, triangle)
  • Has access to adjacency information (optional)
    • Useful for silhouette detection, shadow volume extrusion etc.
  • May output multiple primitives
    • Output limitation is 1024 floats
  • May output nothing (to kill primitive)
SM 4.0
• Unlimited instruction count
  • Very long shaders may have lower throughput though
• Integer and bitwise instructions
• Indexable temporaries
  • Allows for local arrays
  • May be used to emulate a stack
• Useful system generated values
  • SV_VertexID
  • SV_PrimitiveID
  • SV_InstanceID
  • SV_Position (like VPOS, but now .zw are defined too)
  • SV_IsFrontFace (like VFACE)
  • SV_RenderTargetArrayIndex
  • SV_ViewportArrayIndex
  • SV_ClipDistance
  • SV_CullDistance
SM 4.0
• Integer & bitwise instructions
  • Signed and unsigned
  • No idiv though, just udiv
  • Same registers as floats
    • Can alias without conversion with asint(), asuint(), asfloat() etc.
• Integer texture sample values
  • Syntax: Texture2D <uint4> myTex;
• Access to individual samples in MSAA surface
  • Allows for custom AA resolve
  • Syntax: Texture2DMS <float4, 4> myTex;
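The aliasing that asuint() and asfloat() provide can be mimicked on the CPU. The sketch below is a C++ analogue, not HLSL; the helper names `as_uint` and `as_float` are made up for illustration, and it assumes 32-bit IEEE-754 floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// CPU-side analogue of HLSL asuint(): reinterpret the bits of a float as an
// unsigned integer. No numeric conversion takes place.
inline std::uint32_t as_uint(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// CPU-side analogue of HLSL asfloat(): the inverse reinterpretation.
inline float as_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

For example, `as_uint(1.0f)` yields the bit pattern 0x3F800000, whereas a numeric cast of 1.0f to an integer yields 1.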
Pixel center
• Half pixel offset is gone!
  • Affects SV_Position as well
  • Now matches OpenGL
[Figure: pixel center location, DX10 vs. DX9]
Pixel center
• Pixels and texels align
• TexCoord = SV_Position.xy / float2(width, height)
[Figure: texel centers in screenspace]
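The alignment can be checked with a little arithmetic. The following is a CPU-side C++ sketch, assuming a render target and a texture of the same size; the struct and function names are hypothetical and only mirror the formula on the slide.

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };

// D3D10 convention: pixel (px, py) has its center at (px + 0.5, py + 0.5),
// and that is the value SV_Position.xy carries in the pixel shader.
inline Float2 sv_position_of_pixel(int px, int py) {
    return { px + 0.5f, py + 0.5f };
}

// TexCoord = SV_Position.xy / float2(width, height)
inline Float2 texcoord_from_position(Float2 pos, int width, int height) {
    return { pos.x / width, pos.y / height };
}

// Normalized coordinate of the center of texel (tx, ty).
inline Float2 texel_center(int tx, int ty, int width, int height) {
    return { (tx + 0.5f) / width, (ty + 0.5f) / height };
}
```

For a 4x2 target, pixel (1, 0) gets the texture coordinate (0.375, 0.25), which is exactly the center of texel (1, 0), so no half-pixel correction term is needed.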
The small batch problem
• D3D10 designed to minimize batch overhead
  • Pulls work from draw time to creation time
    • Validation
    • Shader input/output configuration
  • Immutable State Objects
    • Input layout
    • Rasterizer state
    • Sampler state
    • Depth stencil state
    • Blend state
The small batch problem
• D3D10 also provides tools to reduce draw calls
  • Improved instancing interface
  • Geometry shader
  • More shader resources
  • Constant indexing in PS
  • Render target arrays
  • Texture arrays
Rendering techniques in D3D10
Global Illumination
Global Illumination
• Probes on a volume grid across the scene
• Each probe captures light environment into a tiny “cubemap”
• Probes are converted to Spherical Harmonics coefficients
• Indirect lighting is computed using interpolated SH coefficients
• Do the same in probe passes to get multiple light bounces
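The probe-to-SH step and the interpolation can be sketched on the CPU. Everything specific here is an assumption made for illustration, not the method used in the talk: only the first four SH coefficients (bands 0 and 1) are kept, each "cubemap" is reduced to one radiance value per face, and the reconstruction returns radiance without the cosine convolution a real irradiance lookup would apply.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
using SH4 = std::array<float, 4>;  // bands 0 and 1 only (assumption)

// First four real SH basis functions for a unit direction d.
inline SH4 sh_basis(Vec3 d) {
    return { 0.282095f, 0.488603f * d.y, 0.488603f * d.z, 0.488603f * d.x };
}

// Project a 1x1-per-face "cubemap" onto SH. Each face covers 4*pi/6 steradians.
// Face order (hypothetical): +x, -x, +y, -y, +z, -z.
inline SH4 project_probe(const std::array<float, 6>& face_radiance) {
    static const Vec3 dirs[6] = {
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}
    };
    const float solid_angle = 4.0f * 3.14159265f / 6.0f;
    SH4 c{};
    for (int f = 0; f < 6; ++f) {
        const SH4 b = sh_basis(dirs[f]);
        for (int i = 0; i < 4; ++i)
            c[i] += face_radiance[f] * b[i] * solid_angle;
    }
    return c;
}

// Interpolate the coefficients of two neighbouring probes.
inline SH4 lerp_sh(const SH4& a, const SH4& b, float t) {
    SH4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] + (b[i] - a[i]) * t;
    return r;
}

// Reconstruct the (low-frequency) radiance arriving from direction d.
inline float evaluate(const SH4& c, Vec3 d) {
    const SH4 b = sh_basis(d);
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += c[i] * b[i];
    return sum;
}
```

A probe that sees constant radiance 1 in every direction reconstructs to roughly 1 everywhere, while a probe lit only from +x reconstructs brighter toward +x than toward -x. A shader version would interpolate between the eight surrounding grid probes rather than two.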
Global Illumination
• Awful lot of work
  • Each probe is 6 slices. We need loads of probes.
  • Sample scene has over 300 probes
• Solution
  • Use geometry shader to reduce work
  • Distribute work across multiple frames
    • Sample updates 40 cubes per frame
    • Scatter updates to hide artifacts
  • Skip over “empty” space probes
Global Illumination
• The Geometry Shader advantage
  • 40 cubes x 6 faces x n draw calls = Pain
    • DX9 style unrealistic even for simple scenes
  • Update multiple slices per pass with GS
    • GS output limit is 1024 floats
    • Keep number of interpolators down to maximize primitive count
    • Managed to update 5 probes (30 slices) per pass
    • 8 passes is more manageable than 240 ...
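The numbers on this slide follow from simple budget arithmetic. The C++ sketch below assumes one output triangle per slice for each input triangle and that every scalar in the output vertex counts against the 1024-float limit; the function names are hypothetical.

```cpp
#include <cassert>

constexpr int kGsOutputLimitFloats = 1024;  // per GS invocation

// Largest output vertex (in floats) that still lets one input triangle be
// replicated to `slices` render target slices.
constexpr int max_floats_per_vertex(int slices, int verts_per_primitive = 3) {
    return kGsOutputLimitFloats / (slices * verts_per_primitive);
}

// Most slices one invocation can feed for a given output vertex size.
constexpr int max_slices(int floats_per_vertex, int verts_per_primitive = 3) {
    return kGsOutputLimitFloats / (floats_per_vertex * verts_per_primitive);
}

// Passes needed to update `cubes` cubemaps at `cubes_per_pass` each.
constexpr int passes_needed(int cubes, int cubes_per_pass) {
    return (cubes + cubes_per_pass - 1) / cubes_per_pass;
}
```

Under those assumptions, 30 slices x 3 vertices = 90 output vertices, which leaves at most 11 floats per vertex (90 x 11 = 990). That is why the interpolator count has to stay small. Forty cubes at 5 per pass gives 8 passes, compared with 40 x 6 = 240 when every face is rendered separately.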
Post tone-mapping resolve
• D3D10 allows for custom AA resolves
  • Can drastically improve HDR AA quality
• Standard resolve occurs before tone-mapping
• Ideally resolve should be done after tone-mapping
[Figure: standard resolve vs. custom resolve]
Post-tonemapping resolve
    Texture2DMS<float4, SAMPLES> tHDR;

    float4 main(float4 pos: SV_Position) : SV_Target
    {
        int3 coord;
        coord.xy = (int2) pos.xy;
        coord.z = 0;

        // Tone-map individual samples and sum it up
        float4 sum = 0;
        [unroll]
        for (int i = 0; i < SAMPLES; i++)
        {
            float4 c = tHDR.Load(coord, i);
            sum.rgb += 1.0 - exp2(-exposure * c.rgb);
        }
        // Average
        sum *= (1.0 / SAMPLES);

        // sRGB
        sum.rgb = pow(sum.rgb, 1.0 / 2.2);

        return sum;
    }
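Why the order matters can be seen on a single edge pixel. The following CPU-side C++ sketch uses the same tone-mapping curve as the shader above, on one channel and without the gamma step; the function names and the sample values are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Same curve as in the shader: 1 - 2^(-exposure * c).
inline float tone_map(float c, float exposure) {
    return 1.0f - std::exp2(-exposure * c);
}

// Standard resolve: average the HDR samples, then tone-map the average.
inline float resolve_then_tone_map(const std::vector<float>& samples, float exposure) {
    float sum = 0.0f;
    for (float c : samples) sum += c;
    return tone_map(sum / samples.size(), exposure);
}

// Custom resolve: tone-map every sample, then average the results.
inline float tone_map_then_resolve(const std::vector<float>& samples, float exposure) {
    float sum = 0.0f;
    for (float c : samples) sum += tone_map(c, exposure);
    return sum / samples.size();
}
```

Take a 4x MSAA pixel half covered by a dark surface (0.0) and half by a very bright one (16.0), with exposure 1. Averaging first gives 8.0, which tone-maps to about 0.996, so the edge pixel is nearly as bright as the bright surface and the anti-aliasing gradient is lost. Tone-mapping first gives about 0.5, the expected halfway value.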
Optimizations
Geometry shader
• GS optimizations
  • Input/output usually the bottleneck
  • Reduce outputs with frustum and/or backface culling
  • Keep input small by packing data
    • TexCoord could be 2x16 bits in a uint
    • Or use for instance asuint(normal.w)
  • Merge to full float4 vectors
    • Don’t do 2x float2
  • Keep output small
    • Could be faster to trade for some work in PS
    • Pass just position, don’t interpolate both lightVec and viewVec
    • Or even back-project SV_Position.xyz to world space in PS
    • Small output means more work fits within 1024 floats limit
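Packing a texture coordinate into 2x16 bits can be sketched as follows. This is a CPU-side C++ illustration, and the encoding is an assumption: the slide does not say whether the 16-bit halves are normalized integers or half floats, so the sketch uses unsigned normalized values in [0, 1]. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Pack (u, v) in [0, 1] into one 32-bit value: u in the low 16 bits,
// v in the high 16 bits, both stored as unsigned normalized integers.
inline std::uint32_t pack_texcoord(float u, float v) {
    auto to_unorm16 = [](float x) -> std::uint32_t {
        x = std::fmin(std::fmax(x, 0.0f), 1.0f);  // clamp to [0, 1]
        return static_cast<std::uint32_t>(std::lround(x * 65535.0f));
    };
    return to_unorm16(u) | (to_unorm16(v) << 16);
}

// Inverse operation, as the shader would do after reading the uint.
inline void unpack_texcoord(std::uint32_t packed, float& u, float& v) {
    u = (packed & 0xFFFFu) / 65535.0f;
    v = (packed >> 16) / 65535.0f;
}
```

The round trip loses at most half a step of 1/65535 per component, which is usually well below texel size. In a shader the same unpacking is two bitwise operations and two multiplies, traded against one fewer input register.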
GS frustum and backface culling
    // Transform to clip space
    float4 pos[3];
    pos[0] = mul(mvp, In[0].pos);
    pos[1] = mul(mvp, In[1].pos);
    pos[2] = mul(mvp, In[2].pos);

    // Use frustum culling to improve performance
    float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
    float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
    float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
    float4 t = t0 * t1 * t2;

    [branch]
    if (!any(t))
    {
        // Use backface culling to improve performance
        float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
        float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

        [branch]
        if (d1.x * d0.y > d0.x * d1.y || min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
        {
            // Output primitive here ...
        }
    }
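The two tests above can be checked on the CPU with a scalar port. This C++ sketch follows the shader logic step by step; the struct and function names are hypothetical. The frustum test rejects a triangle only when all three vertices are outside the same left, right, bottom or top plane, and the backface test keeps a triangle whenever any vertex has negative w, since the winding test is unreliable in that case.

```cpp
#include <algorithm>
#include <cassert>

struct Float4 { float x, y, z, w; };

// Equivalent of `any(t0 * t1 * t2)`: true when all three clip-space vertices
// lie outside the same x or y frustum plane.
inline bool outside_frustum_xy(const Float4 p[3]) {
    bool left = true, right = true, bottom = true, top = true;
    for (int i = 0; i < 3; ++i) {
        left   = left   && (p[i].x < -p[i].w);
        right  = right  && (p[i].x >  p[i].w);
        bottom = bottom && (p[i].y < -p[i].w);
        top    = top    && (p[i].y >  p[i].w);
    }
    return left || right || bottom || top;
}

// Winding test on edges pre-multiplied by w (avoids the divide); triangles
// with a vertex behind the eye are conservatively kept.
inline bool front_facing_or_behind_eye(const Float4 p[3]) {
    const float d0x = p[1].x * p[0].w - p[0].x * p[1].w;
    const float d0y = p[1].y * p[0].w - p[0].y * p[1].w;
    const float d1x = p[2].x * p[0].w - p[0].x * p[2].w;
    const float d1y = p[2].y * p[0].w - p[0].y * p[2].w;
    const float min_w = std::min(std::min(p[0].w, p[1].w), p[2].w);
    return d1x * d0y > d0x * d1y || min_w < 0.0f;
}

// True when the geometry shader would emit the triangle.
inline bool should_emit(const Float4 p[3]) {
    return !outside_frustum_xy(p) && front_facing_or_behind_eye(p);
}
```

With all w equal to 1, a clockwise triangle such as (0,0), (0,1), (1,0) passes, the same triangle with reversed winding is rejected, and a copy shifted two units to the right is rejected by the frustum test.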
Miscellaneous optimizations
• Pre-baked constant buffers
  • Don’t update per-material constants in DX9 style
• PS doesn’t need to return float4 anymore
  • Use float3 if you only care about RGB
  • May reduce instruction count
• Use GS to reduce draw calls
  • Single pass render-to-cubemap
  • Update multiple render targets per pass
The new shader compiler
• SM4 shader compiler preserves semantics better
  • This means more responsibility for you guys
  • Be careful about your assumptions
• Periodically check the resulting assembly
  • D3D10DisassembleShader()
  • Use GPUShaderAnalyzer for performance critical shaders
The new shader compiler
• Example:
HLSL code:
    float4 main(float4 t: TEXCOORD0) : SV_Target
    {
        if (t.x > t.y)
            return t.xyzw;
        else
            return t.wzyx;
    }
DX9 assembly:
    add r0.x, -v0.x, v0.y
    cmp oC0, r0.x, v0.wzyx, v0
DX10 assembly:
    lt r0.x, v0.y, v0.x
    if_nz r0.x    // <--- Did you really want a branch here?
      mov o0.xyzw, v0.xyzw
      ret
    else
      mov o0.xyzw, v0.wzyx
      ret
    endif
The new shader compiler
• Use [branch], [flatten], [unroll] & [loop] to control output code
• This is not for everyone
  • Poor use could reduce performance
  • Make sure you know what you’re doing
  • Only use if you’re familiar with assembly code
  • Verify that you get the code you expect
  • Always benchmark both options
New DX10 assembly (using [flatten]):
    lt r0.x, v0.y, v0.x
    movc o0.xyzw, r0.xxxx, v0.xyzw, v0.wzyx
    ret
Questions? emil.persson@amd.com