Vertex Shader Tricks New Ways to Use the Vertex Shader to Improve Performance Bill Bilodeau Developer Technology Engineer, AMD. Topics Covered. Overview of the DX11 front-end pipeline Common bottlenecks Advanced Vertex Shader Features Vertex Shader Techniques Samples and Results.
Vertex Shader Tricks: New Ways to Use the Vertex Shader to Improve Performance. Bill Bilodeau, Developer Technology Engineer, AMD
Topics Covered • Overview of the DX11 front-end pipeline • Common bottlenecks • Advanced Vertex Shader Features • Vertex Shader Techniques • Samples and Results
DX11 Front-End Pipeline • VS – vertex data • HS – control points • Tessellator • DS – generated vertices • GS – primitives • Write to UAV at all stages, starting with DX11.1 [Diagram: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, ..., Stream Out on the graphics hardware, with CB, SRV, or UAV resources available to the shader stages]
Bottlenecks - VS • VS Attributes • Limit outputs to 4 attributes (AMD) • This applies to all shader stages (except PS) • VS Texture Fetches • Too many texture fetches can add latency • Especially dependent texture fetches • Group fetches together for better performance • Hide latency with ALU instructions
Bottlenecks - VS • Use the caches wisely • Avoid large vertex formats that waste pre-VS cache space • DrawIndexed() allows for reuse of processed vertices saved in the post-VS cache • Vertices with the same index only need to get processed once [Diagram: Input Assembler, Pre-VS Cache (hides latency), Vertex Shader, Post-VS Cache (vertex reuse)]
Bottlenecks - GS • GS • Can add or remove primitives • Adding new primitives requires storing new vertices • Going off chip to store data can be a bandwidth issue • Using the GS means another shader stage • This means more competition for shader resources • Better if you can do everything in the VS
Advanced Vertex Shader Features • SV_VertexID, SV_InstanceID • UAV output (DX11.1) • NULL vertex buffer • VS can create its own vertex data
SV_VertexID • Can use the vertex ID to decide what vertex data to fetch • Fetch from an SRV, or procedurally create a vertex
VSOut VertexShader(uint id : SV_VertexID)
{
    float3 vertex = g_VertexBuffer[id];
    …
}
UAV buffers • Write to UAVs from a Vertex Shader • New feature in DX11.1 (UAV at any stage) • Can be used instead of stream-out for writing vertex data • Triangle output not limited to strips • You can use whatever format you want • Can output anything useful to a UAV
NULL Vertex Buffer • DX11/DX10 allows this • Just set the number of vertices in Draw() • VS will execute without a vertex buffer bound • Can be used for instancing • Call Draw() with the total number of vertices • Bind mesh and instance data as SRVs
Vertex Shader Techniques • Full Screen Triangle • Vertex Shader Instancing • Merged Instancing • Vertex Shader UAVs
Full Screen Triangle • For post-processing effects • Triangle has better performance than quad • Fast and easy with VS generated coordinates • No IB or VB is necessary • Something you should be using for full screen effects [Figure: triangle with clip space coordinates (-1, -1, 0), (-1, 3, 0), (3, -1, 0)]
Full Screen Triangle: C++ code
// Null VB, IB
pd3dImmediateContext->IASetVertexBuffers( 0, 0, NULL, NULL, NULL );
pd3dImmediateContext->IASetIndexBuffer( NULL, (DXGI_FORMAT)0, 0 );
pd3dImmediateContext->IASetInputLayout( NULL );
// Set Shaders
pd3dImmediateContext->VSSetShader( g_pFullScreenVS, NULL, 0 );
pd3dImmediateContext->PSSetShader( … );
pd3dImmediateContext->PSSetShaderResources( … );
pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
// Render 3 vertices for the triangle
pd3dImmediateContext->Draw(3, 0);
Full Screen Triangle: HLSL Code
VSOutput VSFullScreenTest(uint id : SV_VERTEXID)
{
    VSOutput output;
    // generate clip space position
    output.pos.x = (float)(id / 2) * 4.0 - 1.0;
    output.pos.y = (float)(id % 2) * 4.0 - 1.0;
    output.pos.z = 0.0;
    output.pos.w = 1.0;
    // texture coordinates
    output.tex.x = (float)(id / 2) * 2.0;
    output.tex.y = 1.0 - (float)(id % 2) * 2.0;
    // color
    output.color = float4(1, 1, 1, 1);
    return output;
}
Clip space coordinates for ids 0, 1, 2: (-1, -1, 0), (-1, 3, 0), (3, -1, 0)
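To check the arithmetic in the shader above, here is a minimal CPU-side sketch in C++ that mirrors the same expressions for ids 0, 1 and 2. The struct and function names (`FullScreenVertex`, `fullScreenVertex`) are hypothetical and exist only for illustration; they are not part of the original sample.

```cpp
#include <cassert>
#include <cstdint>

// CPU-side mirror of the vertex shader arithmetic, for checking values only.
struct FullScreenVertex {
    float x, y;  // clip space position (the shader sets z = 0, w = 1)
    float u, v;  // texture coordinates
};

// Hypothetical helper: same expressions as the HLSL, evaluated on the CPU.
FullScreenVertex fullScreenVertex(std::uint32_t id) {
    assert(id < 3 && "the draw call issues exactly three vertices");
    const float a = static_cast<float>(id / 2);  // 0, 0, 1 for ids 0, 1, 2
    const float b = static_cast<float>(id % 2);  // 0, 1, 0 for ids 0, 1, 2
    return FullScreenVertex{
        a * 4.0f - 1.0f,  // x: -1, -1,  3
        b * 4.0f - 1.0f,  // y: -1,  3, -1
        a * 2.0f,         // u:  0,  0,  2
        1.0f - b * 2.0f   // v:  1, -1,  1
    };
}
```

The triangle overshoots the viewport, and the texture coordinates overshoot by the same proportion, so after interpolation the visible region x, y in [-1, 1] receives u, v in [0, 1] with v = 0 at the top edge.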
VS Instancing: Point Sprites • Often done in the GS, but can be faster in the VS • Create an SRV point buffer and bind it to the VS • Call Draw or DrawIndexed to render the full triangle list • Read the location from the point buffer and expand it to the vertex location in the quad • Can be used for particles or Bokeh DOF sprites • Don’t use DrawInstanced for a small mesh
Point Sprites: C++ Code
pd3dImmediateContext->IASetIndexBuffer( g_pParticleIndexBuffer, DXGI_FORMAT_R32_UINT, 0 );
pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
// 6 indices (two triangles) per particle
pd3dImmediateContext->DrawIndexed( g_particleCount * 6, 0, 0 );
Point Sprites: HLSL Code
VSInstancedParticleDrawOut VSIndexBuffer(uint id : SV_VERTEXID)
{
    VSInstancedParticleDrawOut output;
    uint particleIndex = id / 4;
    uint vertexInQuad = id % 4;
    // calculate the position of the vertex
    float3 position;
    position.x = (vertexInQuad % 2) ? 1.0 : -1.0;
    position.y = (vertexInQuad & 2) ? -1.0 : 1.0;
    position.z = 0.0;
    position.xy *= PARTICLE_RADIUS;
    position = mul( position, (float3x3)g_mInvView ) + g_bufPosColor[particleIndex].pos.xyz;
    output.pos = mul( float4(position, 1.0), g_mWorldViewProj );
    output.color = g_bufPosColor[particleIndex].color;
    // texture coordinate
    output.tex.x = (vertexInQuad % 2) ? 1.0 : 0.0;
    output.tex.y = (vertexInQuad & 2) ? 1.0 : 0.0;
    return output;
}
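The slides do not show how the particle index buffer is filled. Since the shader derives the particle from `id / 4` and the corner from `id % 4`, and the draw call submits 6 indices per particle, one plausible layout is sketched below in C++. The function name `buildParticleIndices` and the corner order `0, 1, 2, 2, 1, 3` are assumptions, not taken from the original sample; the corner order in particular has to match the cull mode in use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a triangle-list index buffer (32-bit indices) with 6 indices per
// particle that reference 4 unique vertex ids per particle.
// Hypothetical helper; the original sample's buffer setup is not shown.
std::vector<std::uint32_t> buildParticleIndices(std::uint32_t particleCount) {
    // Guard against overflow of particle * 4 in 32 bits.
    assert(particleCount <= 0xFFFFFFFFu / 4);
    // Corner order within one quad (assumed; adjust for your winding).
    static const std::uint32_t kQuadCorners[6] = {0, 1, 2, 2, 1, 3};

    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(particleCount) * 6);
    for (std::uint32_t p = 0; p < particleCount; ++p) {
        for (std::uint32_t corner : kQuadCorners) {
            // The shader recovers p as id / 4 and corner as id % 4.
            indices.push_back(p * 4 + corner);
        }
    }
    return indices;
}
```

With this layout each quad has only four distinct vertex ids, so the two shared corners can be served from the post-VS cache instead of being shaded again, which is consistent with the deck's point about DrawIndexed() and vertex reuse.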
Point Sprite Performance [Chart: point sprite rendering performance on AMD Radeon R9 290X and Nvidia Titan]
Point Sprite Performance • DrawIndexed() is the fastest method • Draw() is slower but doesn’t need an IB • Don’t use DrawInstanced() for creating sprites on either AMD or NVIDIA hardware • DrawInstanced() is not recommended for meshes with a small number of vertices
Merge Instancing • Combine multiple meshes that can be instanced many times • Better than normal instancing which renders only one mesh • Instance nearby meshes for smaller bounding box • Each mesh is a page in the vertex data • Fixed vertex count for each mesh • Meshes smaller than page size use degenerate triangles
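Merge Instancing [Diagram: mesh vertex data is stored as fixed-length pages (Mesh Data 0, Mesh Data 1, Mesh Data 2, ...), each holding Vertex 0, Vertex 1, Vertex 2, Vertex 3, ... and padded with zeroed entries that form degenerate triangles; mesh instance data stores a mesh index per instance (Instance 0: Mesh Index 2, Instance 1: Mesh Index 0, ...)]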
Merged Instancing using VS • Use the vertex ID to look up the mesh to instance • All meshes are the same size, so (id / SIZE) can be used as an offset to the mesh • Faster than using DrawInstanced()
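The lookup described above can be sketched on the CPU as follows. This is an illustration under assumptions, not the original shader: it interprets `id / SIZE` as the instance index, reads that instance's mesh index, and uses `id % SIZE` as the vertex offset inside the fixed-length page. The types (`Float3`, `InstanceData`), the `offset` field standing in for a per-instance transform, and the page size of 6 are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

// Hypothetical per-instance record; the slides only show a mesh index.
struct InstanceData {
    std::uint32_t meshIndex;  // which fixed-size page to draw
    Float3 offset;            // placeholder for a per-instance transform
};

// Vertices per page (SIZE in the slides); the value here is illustrative.
constexpr std::uint32_t kPageSize = 6;

// Mirrors what a vertex shader could do with SV_VertexID and two SRVs.
Float3 fetchMergedVertex(std::uint32_t id,
                         const std::vector<Float3>& meshVertexData,
                         const std::vector<InstanceData>& instances) {
    const std::uint32_t instanceIndex = id / kPageSize;  // which instance
    const std::uint32_t vertexInPage = id % kPageSize;   // which vertex in it
    assert(instanceIndex < instances.size());

    const InstanceData& inst = instances[instanceIndex];
    const std::size_t src =
        static_cast<std::size_t>(inst.meshIndex) * kPageSize + vertexInPage;
    assert(src < meshVertexData.size());

    const Float3& v = meshVertexData[src];
    return Float3{v.x + inst.offset.x, v.y + inst.offset.y, v.z + inst.offset.z};
}
```

A single Draw() of `instanceCount * SIZE` vertices then covers every instance, and meshes shorter than a page rely on their degenerate padding triangles producing no pixels.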
Merge Instancing Performance • Instancing performance test by Cloud Imperium Games for Star Citizen • Renders 13.5M triangles (~40M verts) • DrawInstanced version calls DrawInstanced() and uses instance data in a vertex buffer • Soft Instancing version uses vertex instancing with Draw() calls and fetches instance data from an SRV [Chart: frame time in ms on AMD Radeon R9 290X and Nvidia GTX 780]
Vertex Shader UAVs • Random access Read/Write in a VS • Can be used to store transformed vertex data for use in multi-pass algorithms • Can be used for passing constant attributes between any shader stage (not just from VS)
Skinning to UAV • Skin vertex data then output to UAV • Instance the skinned UAV data multiple times • Can also be used for non-instanced data • Multiple passes can reuse the transformed vertex data (e.g., shadow map rendering) • Performance is about the same as stream-out, but you can do more …
Bounding Box to UAV • Can calculate and store Bbox in the VS • Use a UAV to store the min/max values (6) • InterlockedMin/InterlockedMax determine min and max of the bbox • Need to use integer values with atomics • Use the stored bbox in later passes • GPU physics (collision) • Tile based processing
Bounding Box: HLSL Code
void UAVBBoxSkinVS(VSSkinnedIn input, uint id : SV_VERTEXID)
{
    // skin the vertex
    . . .
    // output the max and min for the bounding box
    int x = (int) (vSkinned.Pos.x * FLOAT_SCALE); // convert to integer
    int y = (int) (vSkinned.Pos.y * FLOAT_SCALE);
    int z = (int) (vSkinned.Pos.z * FLOAT_SCALE);
    InterlockedMin(g_BBoxUAV[0], x);
    InterlockedMin(g_BBoxUAV[1], y);
    InterlockedMin(g_BBoxUAV[2], z);
    InterlockedMax(g_BBoxUAV[3], x);
    InterlockedMax(g_BBoxUAV[4], y);
    InterlockedMax(g_BBoxUAV[5], z);
    . . .
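The same fixed-point min/max pattern can be emulated on the CPU to see what the six UAV slots have to contain before and after the pass. The sketch below is an assumption-laden illustration: `kFloatScale` stands in for FLOAT_SCALE with an arbitrary value, the compare-exchange loops stand in for InterlockedMin/InterlockedMax, and the initial values (INT_MAX for the minimums, INT_MIN for the maximums) are a step the slides do not show but that the technique needs in some form.

```cpp
#include <atomic>
#include <cassert>
#include <climits>

// Stand-in for FLOAT_SCALE; the actual value used in the sample is unknown.
constexpr float kFloatScale = 1000.0f;

struct AtomicBBox {
    // Slots 0..2 hold min x/y/z, slots 3..5 hold max x/y/z, as in the slide.
    std::atomic<int> v[6];
    AtomicBBox() {
        for (int i = 0; i < 3; ++i) v[i].store(INT_MAX);  // mins start high
        for (int i = 3; i < 6; ++i) v[i].store(INT_MIN);  // maxes start low
    }
};

// CPU emulation of InterlockedMin using a compare-exchange loop.
void atomicMin(std::atomic<int>& target, int value) {
    int current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {}
}

// CPU emulation of InterlockedMax using a compare-exchange loop.
void atomicMax(std::atomic<int>& target, int value) {
    int current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {}
}

// Folds one position into the box, converting to fixed point first.
void addPoint(AtomicBBox& box, float x, float y, float z) {
    const int q[3] = {static_cast<int>(x * kFloatScale),
                      static_cast<int>(y * kFloatScale),
                      static_cast<int>(z * kFloatScale)};
    for (int i = 0; i < 3; ++i) {
        atomicMin(box.v[i], q[i]);
        atomicMax(box.v[i + 3], q[i]);
    }
}

// Converts a stored fixed-point value back to a float coordinate.
float decode(int fixedPoint) { return static_cast<float>(fixedPoint) / kFloatScale; }
```

Because the integer cast truncates toward zero, the recovered box can differ from the exact one by up to 1 / FLOAT_SCALE per axis, so a later pass may want to pad the box slightly.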
Particle System UAV • Single pass GPU-only particle system • In the VS: • Generate sprites for rendering • Do Euler integration and update the particle system state to a UAV
Particle System: HLSL Code
uint particleIndex = id / 4;
uint vertexInQuad = id % 4;
// calculate the new position of the vertex
float3 oldPosition = g_bufPosColor[particleIndex].pos.xyz;
float3 oldVelocity = g_bufPosColor[particleIndex].velocity.xyz;
// Euler integration to find new position and velocity
float3 acceleration = normalize(oldVelocity) * ACCELLERATION;
float3 newVelocity = acceleration * g_deltaT + oldVelocity;
float3 newPosition = newVelocity * g_deltaT + oldPosition;
g_particleUAV[particleIndex].pos = float4(newPosition, 1.0);
g_particleUAV[particleIndex].velocity = float4(newVelocity, 0.0);
// Generate sprite vertices
. . .
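As a worked example of the integration step, the following C++ sketch repeats the same update on the CPU: acceleration points along the current velocity, the velocity is advanced first, and the position is then advanced with the new velocity. The names (`Vec3`, `Particle`, `integrate`) and the parameter `accelMagnitude`, which stands in for the shader's ACCELLERATION constant, are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Particle { Vec3 pos; Vec3 velocity; };

// One integration step, mirroring the order of operations in the shader.
Particle integrate(const Particle& p, float accelMagnitude, float dt) {
    const Vec3& v = p.velocity;
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    // normalize() of a zero vector is not meaningful, so require motion.
    assert(len > 0.0f);

    // acceleration = normalize(oldVelocity) * magnitude
    const Vec3 accel{v.x / len * accelMagnitude,
                     v.y / len * accelMagnitude,
                     v.z / len * accelMagnitude};

    Particle out;
    // newVelocity = acceleration * dt + oldVelocity
    out.velocity = Vec3{accel.x * dt + v.x, accel.y * dt + v.y, accel.z * dt + v.z};
    // newPosition = newVelocity * dt + oldPosition (uses the updated velocity)
    out.pos = Vec3{out.velocity.x * dt + p.pos.x,
                   out.velocity.y * dt + p.pos.y,
                   out.velocity.z * dt + p.pos.z};
    return out;
}
```

For a particle at the origin moving at (2, 0, 0) with an acceleration magnitude of 1 and a time step of 0.5, the step yields a velocity of (2.5, 0, 0) and a position of (1.25, 0, 0).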
Conclusion • Vertex shader “tricks” can be more efficient than more commonly used methods • Use SV_VertexID for smarter instancing • Sprites • Merge Instancing • UAVs add lots of freedom to vertex shaders • Bounding box calculation • Single pass VS particle system
Demos • Particle System • UAV Skinning • Bbox
Acknowledgements • Merge Instancing • Emil Persson, “Graphics Gems for Games” SIGGRAPH 2011 • Brendan Jackson, Cloud Imperium • Thanks to • Nick Thibieroz, AMD • Raul Aguaviva (particle system UAV), AMD • Alex Kharlamov, AMD
Questions • bill.bilodeau@amd.com