Performance tips for Windows Store apps using DirectX and C++. Max McMullen Principal Development Lead – Direct3D Microsoft Corporation 4-102. Agenda. Overview Measuring rendering performance Power efficient GPU characteristics Optimizing for power efficient GPUs. Overview.
Performance tips for Windows Store apps using DirectX and C++ Max McMullen Principal Development Lead – Direct3D Microsoft Corporation 4-102
Agenda • Overview • Measuring rendering performance • Power efficient GPU characteristics • Optimizing for power efficient GPUs
Optimizing for the Windows 8/RT OS • New form factors and platforms require new optimizations • Windows uses DirectX to get every pixel on screen • Direct3D 11.1 provides new APIs to optimize rendering
Use optimized Windows 8/RT platforms • All Windows Store apps use DirectX for rendering • WWA and XAML make optimized use of Direct2D and Direct3D 11.1 • Direct2D and Direct2D Effects fully leverage Direct3D 11.1 • But sometimes you really need to use Direct3D itself…
What you should know • Basics of building a C++ Windows Store app • Direct3D fundamentals
How do you measure rendering performance? • Many useful tools for Windows performance optimization: • Visual Studio Performance Profiler, Visual Studio Graphics Diagnostics, hardware partner tools… • Two primary tools used to optimize Direct3D usage in the Windows 8/RT OS: • Basic: FPS/time measurement in app/microbenchmarks • Advanced: GPUView
Frames per second (FPS) • Quick but sometimes misleading • C++/DirectX Windows Store apps sync to the display refresh • Measure render time, not present time • Call ID3D11DeviceContext::Flush instead of IDXGISwapChain::Present • Infrequent output: file output • Frequent output: look at FPSCounter.cpp in the GeometryRealization sample
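The "measure render time" advice can be sketched as a small accumulator. This is a minimal, hypothetical helper (the class name and methods are assumptions, not code from the talk or from FPSCounter.cpp); it only averages the per-frame times you feed it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: accumulates per-frame render times and reports
// the average. Not taken from the GeometryRealization sample.
class FrameTimeStats {
public:
    // Record the render time of one frame, in milliseconds.
    void AddSample(double milliseconds) { samples_.push_back(milliseconds); }

    // Average render time in milliseconds; 0 if nothing was recorded.
    double AverageMs() const {
        if (samples_.empty()) return 0.0;
        double total = 0.0;
        for (double s : samples_) total += s;
        return total / static_cast<double>(samples_.size());
    }

    // Frames per second implied by the average render time.
    double AverageFps() const {
        const double avg = AverageMs();
        return avg > 0.0 ? 1000.0 / avg : 0.0;
    }

private:
    std::vector<double> samples_;
};
```

In an app, each sample would presumably come from timestamps (for example std::chrono::steady_clock) taken around the frame's draw calls and the Flush call, rather than around Present, which waits for the display refresh.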
GPUView • Part of the Windows Performance Toolkit • ETW Logging of CPU and GPU work • Measures graphics performance • FPS, startup time, glitching, render time, latency • Enables detailed analysis of CPU and GPU workloads and interdependencies
GPUView – Record and Analyze • Install • x86: Windows Performance Toolkit • ARM: Windows Kits\8.0\Windows Performance Toolkit\Redistributables\WPTarm-arm_en-us.msi • Record • Run log.cmd to start • Perform action • Run log.cmd to stop • Analyze • Data captured in merged.etl, load in GPUView
GPUView - Interface [screenshot with labeled regions: GPU Hardware Queue, Flip Queue, CPU Queues, CPU Threads]
GPUView Interface: GPU Hardware Queue • The GPU Hardware Queue shows command buffers rendering on the GPU. • CPU Queue command buffers are moved to the GPU Hardware Queue when the hardware is ready to receive more commands.
What to expect with power efficient GPUs • Feature level 9_1 or 9_3 • Limited available bandwidth • Both immediate render and tiled render GPUs • Limited shader instruction throughput
Feature Level 9.x (FL9.1, FL9.3) • Real-time render limitations generally occur before reaching these maximums
GPU Memory Bandwidth • Baseline requirement: 1.9 GB/sec benchmarked • 7.5 I/O operations per screen pixel at 1366x768, 32 bpp, 60 Hz
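The 1.9 GB/sec baseline appears to follow directly from the numbers on the slide: resolution times bytes per pixel times refresh rate times I/O operations per pixel. The sketch below works through that arithmetic; the function name is hypothetical, and reading the baseline as this product is an assumption that happens to match the quoted figure.

```cpp
#include <cstdint>

// Bytes per second needed to touch every screen pixel `opsPerPixel`
// times per frame. Assumption: the slide's baseline is this product.
constexpr double RequiredBandwidthBytesPerSec(std::uint32_t width,
                                              std::uint32_t height,
                                              std::uint32_t bytesPerPixel,
                                              double refreshHz,
                                              double opsPerPixel) {
    return static_cast<double>(width) * height * bytesPerPixel *
           refreshHz * opsPerPixel;
}
```

For 1366x768 at 4 bytes per pixel, 60 Hz, and 7.5 operations per pixel this gives 1,888,358,400 bytes per second, or roughly 1.9 GB/sec.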
[Diagram: immediate render GPU. Labels: shader cores, graphics memory, memory bus]
[Diagram: tiled render GPU. Labels: shader cores, graphics memory, memory bus]
Shader instruction throughput • Fill rates on GPUs depend on a number of factors • Memory bandwidth • Blend mode • Shader cores • Shader complexity • etc. • Power efficient GPUs become shader throughput bound at approximately 4 pixel shader instructions
Bandwidth optimization: basics • Render opaque objects front-to-back with z-buffering • Disable alpha blending for opaque objects • Use geometry to trim large transparent areas
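The front-to-back ordering for opaque objects can be sketched as a sort over a draw list. The DrawItem struct and SortForRendering function are hypothetical names for illustration. Drawing transparent items afterwards, farthest first, is a common convention assumed here and is not stated on the slide.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw-list entry; viewDepth is distance from the camera.
struct DrawItem {
    int id;
    float viewDepth;
    bool opaque;
};

// Opaque items first, nearest first, so the z-buffer rejects hidden pixels
// early. Transparent items follow, farthest first (assumed convention).
inline void SortForRendering(std::vector<DrawItem>& items) {
    auto firstTransparent = std::stable_partition(
        items.begin(), items.end(),
        [](const DrawItem& d) { return d.opaque; });
    std::sort(items.begin(), firstTransparent,
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });
    std::sort(firstTransparent, items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth > b.viewDepth;
              });
}
```

Alpha blending would then be disabled for the first group and enabled only for the second.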
Bandwidth optimization: compress resources • Direct3D supports texture compression at all feature levels • BC1: 4 bits/pixel for RGB formats, a 6x compression ratio • BC2 and BC3: 8 bits/pixel for RGBA formats, a 4x compression ratio • Smaller resources also mean faster downloads of your app
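The compression ratios above follow from the block layout: BC formats store each 4x4 pixel block in a fixed number of bytes (8 for BC1, 16 for BC2 and BC3). A minimal sketch of the size calculation, with hypothetical names, for a single mip level:

```cpp
#include <cstddef>

// Bytes per 4x4 block for the block-compressed formats on the slide.
constexpr std::size_t kBc1BytesPerBlock = 8;   // 4 bits per pixel
constexpr std::size_t kBc3BytesPerBlock = 16;  // 8 bits per pixel (also BC2)

// Size in bytes of one mip level of a block-compressed texture.
// Dimensions are rounded up to whole 4x4 blocks.
constexpr std::size_t BlockCompressedBytes(std::size_t width,
                                           std::size_t height,
                                           std::size_t bytesPerBlock) {
    return ((width + 3) / 4) * ((height + 3) / 4) * bytesPerBlock;
}
```

A 256x256 texture takes 32,768 bytes in BC1 against 196,608 bytes as 24-bit RGB (6x), and 65,536 bytes in BC3 against 262,144 bytes as 32-bit RGBA (4x).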
Bandwidth optimization: quantize resources • Use the 16 bit formats added to Direct3D 11.1: • DXGI_FORMAT_B5G6R5_UNORM • DXGI_FORMAT_B5G5R5A1_UNORM • DXGI_FORMAT_B4G4R4A4_UNORM
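To illustrate what quantizing to a 16-bit format means, the sketch below packs an 8-bit-per-channel color into the DXGI_FORMAT_B5G6R5_UNORM bit layout (blue in the low 5 bits, green in the middle 6, red in the high 5). The function name is hypothetical, and it truncates the low bits rather than rounding, which is a simplification.

```cpp
#include <cstdint>

// Pack 8-bit R, G, B into a 16-bit B5G6R5 texel:
// bits 0-4 blue, bits 5-10 green, bits 11-15 red.
// Truncates the low bits of each channel (no rounding or dithering).
constexpr std::uint16_t PackB5G6R5(std::uint8_t r, std::uint8_t g,
                                   std::uint8_t b) {
    return static_cast<std::uint16_t>(((r >> 3) << 11) |
                                      ((g >> 2) << 5) |
                                      (b >> 3));
}
```

Each texel then occupies 2 bytes instead of 4, halving the bandwidth needed to read the texture.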
Bandwidth optimization: flip present • Must use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL • OS automatically uses “fullscreen” flips when: • Swapchain buffer dimensions match the desktop resolution • Swapchain format is DXGI_FORMAT_B8G8R8A8_UNORM* • App is the only content onscreen • Buffer dimensions need to be converted correctly from device independent pixels (DIPs) • Just create the swapchain with zero width and height to get the right size
using namespace Windows::Graphics::Display;

// Convert a length in device independent pixels (DIPs) to physical pixels.
float ConvertDipsToPixels(float dips)
{
    static const float dipsPerInch = 96.0f;
    // Round to the nearest whole pixel.
    return floor(dips * DisplayProperties::LogicalDpi / dipsPerInch + 0.5f);
}
…
Platform::Agile<Windows::UI::Core::CoreWindow> m_window;
float swapchainWidth = ConvertDipsToPixels(m_window->Bounds.Width);
float swapchainHeight = ConvertDipsToPixels(m_window->Bounds.Height);
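To check the rounding behavior outside a Windows Store app, the same formula can be written with the DPI passed in as a parameter. This is a portable sketch; the two-argument signature is an assumption for illustration and not part of the original code.

```cpp
#include <cmath>

// Portable variant of the conversion above: the logical DPI is a
// parameter instead of being read from DisplayProperties::LogicalDpi.
inline float ConvertDipsToPixels(float dips, float logicalDpi) {
    constexpr float dipsPerInch = 96.0f;
    // Adding 0.5 before floor rounds to the nearest whole pixel.
    return std::floor(dips * logicalDpi / dipsPerInch + 0.5f);
}
```

At 96 DPI one DIP equals one pixel, while at 144 DPI a 100 DIP window edge becomes 150 pixels.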
Bandwidth optimization: tiled render GPUs • Minimize command buffer flushes • Don’t map resources in use by the GPU, use DISCARD and NO_OVERWRITE • Minimize scene flushes • Visit RenderTargets only once per frame • Don’t update resources in use by the GPU from the CPU, use DISCARD and NO_OVERWRITE with ID3D11DeviceContext1::CopySubresourceRegion1 • Use scissors when updating small portions of a RenderTarget
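One common way to apply the DISCARD and NO_OVERWRITE advice to a dynamic buffer is to append data while there is room and request fresh memory only when the buffer is full. The sketch below models just that decision in portable C++; the enum, struct, and class names are hypothetical. In Direct3D the two modes would correspond to mapping with D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD.

```cpp
#include <cstddef>

// Which map mode the caller should request for this write.
enum class MapMode { NoOverwrite, Discard };

struct Allocation {
    MapMode mode;        // map mode to use
    std::size_t offset;  // byte offset to write at
};

// Hypothetical cursor for a dynamic buffer with a fixed capacity in bytes.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity) : capacity_(capacity) {}

    // Reserve `size` bytes. Assumes size <= capacity.
    Allocation Allocate(std::size_t size) {
        if (offset_ + size > capacity_) {
            // No room left: discard the old contents and restart at 0, so
            // the GPU can keep reading the previous memory undisturbed.
            offset_ = size;
            return {MapMode::Discard, 0};
        }
        // Room left: append without touching data the GPU may be reading.
        Allocation a{MapMode::NoOverwrite, offset_};
        offset_ += size;
        return a;
    }

private:
    std::size_t capacity_;
    std::size_t offset_ = 0;
};
```

Neither mode forces the CPU to wait for the GPU to finish with the resource, which is what avoids the extra command buffer flushes.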
Bandwidth optimization: tiled render GPUs • New Direct3D APIs provide hints to avoid unnecessary copies • Rendering artifacts if used incorrectly
Bandwidth optimization: Discard* APIs

m_swapChain->Present(1, 0);               // present the image on the display
ComPtr<ID3D11View> view;
m_renderTargetView.As(&view);             // get the view on the RT
m_d3dContext->DiscardView(view.Get());    // discard the view

• Use ID3D11DeviceContext1::DiscardView and ID3D11DeviceContext1::DiscardResource to prevent unnecessary tile copies • Artifacts if used incorrectly
[Diagram: tiled render GPU. Labels: shader cores, graphics memory, memory bus]
Shader instruction throughput • Power efficient GPUs have limited throughput for full precision • Minimum precision hints increase throughput when precision doesn’t matter • Specifies minimum rather than actual precision • min16float, min16int, min10float • Don’t change precision often • 20-25% improvement in practice with min16float
Minimum precision

static const float brightThreshold = 0.5f;
Texture2D sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    float3 brightColor = 0;
    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= 4.0f;
    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);
    return float4(brightColor, 1.0f);
}
Minimum precision

static const min16float brightThreshold = (min16float)0.5;
Texture2D<min16float4> sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;
    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= (min16float)4.0;
    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);
    return float4(brightColor, 1.0f);
}
Minimum precision – bad usage

static const min16float brightThreshold = (min16float)0.5;
Texture2D<min16float4> sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;
    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    // Bad: switches to a different minimum precision in the middle of min16float math
    brightColor /= (min10float)4.0;
    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);
    return float4(brightColor, 1.0f);
}
Wrap-up • Optimize! • Use the right tools and techniques to measure performance • Tune for power efficient GPUs’ unique performance characteristics • Direct3D 11.1 and Windows 8 provide the APIs to fully leverage power efficient GPUs
Build 2012 Talk: 3-113 Graphics with the Direct3D11.1 API made easy • Build 2012 Talk: 3-109 Developing a Windows Store app using C++ and DirectX • Visual Studio 2012 Remote Debugging: http://blogs.msdn.com/b/dsvc/archive/2012/10/26/windows-rt-windows-store-app-debugging.aspx • FPS Counter in GeometryRealization sample: http://code.msdn.microsoft.com/windowsapps/Geometry-Realization-963be8b7#content • GPUView: http://msdn.microsoft.com/en-us/library/windows/desktop/jj585574(v=vs.85).aspx • Direct3D11.1: http://msdn.microsoft.com/en-us/library/windows/desktop/hh404562(v=vs.85).aspx
Resources • Develop: http://msdn.microsoft.com/en-US/windows/apps/br229512 • Design: http://design.windows.com/ • Samples: http://code.msdn.microsoft.com/windowsapps/Windows-8-Modern-Style-App-Samples • Videos: http://channel9.msdn.com/Windows Please submit session evals by using the Build Windows 8 app or at http://aka.ms/BuildSessions