Direct3D and the future of graphics APIs. Dave Oldcorn, AMD; Dan Baker, Oxide Games; Johan Andersson, EA / DICE
NITROUS AND DX12. Dan Baker, Partner, Oxide Games
Haven’t we been here before? • Goal of DX9 • Remember state blocks? • Goal of DX10 • Large state groups • Goal of DX11 • Deferred contexts • Are we actually getting faster, or are CPUs just faster? • Quite possibly no perf improvements due to API features in 10 years • Maybe adding features isn’t the answer…
Deeply rooted problem • Coding design philosophies clash with the real world • OOP, data hiding and polymorphic design clash with task-driven, data-parallel code • Evident in language trends: a striking disconnect between what is considered good code and what is fast • Gap has always been there, but has grown in recent years • 15 years ago, processors were often bound by computation • Now, usually bound by cache misses, serialization, pipeline stalls, etc. • Multi-core CPUs are ineffectively utilized • ‘Heavy iron’, e.g. big objects and opaque memory, is a dead end for performance • The revolt is beginning in high-performance graphics APIs, but will spread
But… How much faster? • Biggest problem with industry today: Acceptance • Only 1 secret in API design: That it can be done. • And isn’t that hard • And our code isn’t that ugly • Star Swarm already demonstrating what is possible on a PC
D3D12 features that Nitrous uses • True de-coupled multi-core rendering • Expecting near-linear thread scheduling • Manual hazard tracking • Hazards have been resolved already • Memory heaps • Bigger chunks of memory and pool grouping make management simpler • Descriptor tables • Table exposure allows a cheaper way of binding textures • Allows texture bindings to be shared between non-adjacent batches
What’s different now? Then
What’s different now? Now [Diagram labels: “Start here”; “If ready, exit here to prep for release”]
In the spirit of contributing • Oxide is proud to announce that we have a prototype of Nitrous running on D3D12 • *PR DISCLAIMER* This is not an official announcement regarding D3D12 support • Porting from other modern APIs is much simpler than porting from D3D11 to D3D12
EXPECTED RESULTS • CPU Driver overhead largely put to rest • Huge increases in driver reliability • Huge decreases in frame latency, expecting median frame latency to be 1.5 frames • Increased perceptual responsiveness • Never a dropped frame or stall due to driver API issues • *Other OS events could cause stalls • Driver should be far smaller, simpler to implement, IHVs can spend more time on optimizations
Direct3D12 and the future of graphics APIs. Dave Oldcorn, Direct3D12 Driver Architect, AMD
The problem • Mismatch between existing Direct3D and hardware capabilities • Lots of CPU cores, but only one stream of data • State communication in small chunks • “Hidden” work • Hard to predict from any one given call what the overhead might be • Implicit memory management • Hardware evolving away from classical register programming
API landscape • Gap between PC ‘raw’ 3D APIs and the hardware has opened up • Very high level APIs are now ubiquitous; easy to access even for casual developers, plenty of choice • The PC APIs sit in a middle ground [Diagram labels: axis “Capability, ease of use, distance from 3D engine”; Application; Game Engines (Unreal, Frostbite, BlitzTech, Unity, CryEngine); Flash / Silverlight; OpenGL; D3D11; D3D9; D3D7/8; Opportunity; Console APIs; Metal (register level access)]
Sequential API • Sequential API: state for a given draw comes from an arbitrary previous time • Some states must be reconciled on the CPU (“delayed validation”) • All contributing state needs to be visible • The GPU isn’t like this; it uses command buffers • Must save and restore state at start and end [Diagram labels: State contributing to draw; API input]
Threading a sequential API • Sequential API threading • Simple producer / consumer model • Extra latency • Buffering has a cost • More threading would mean dividing tasks on a finer grain • Bottlenecked on application or driver thread • Difficult to extract parallelism (Amdahl’s Law) [Diagram labels: Application simulation; ...; Prebuild Thread 0; Prebuild Thread 1; Application Render Thread; Application Driver Thread; Runtime / Driver; GPU Execution Queue; Queued Buffer 0; Queued Buffer 1; Queued Buffer 2]
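The Amdahl’s Law point above can be made concrete with a small sketch. The function name and the example fractions are illustrative assumptions, not figures from the talk; the formula itself is the standard one, speedup = 1 / ((1 - p) + p / n).

```cpp
#include <cassert>

// Amdahl's Law: upper bound on speedup when only part of the work
// can be spread across threads.
// parallel_fraction: share of the frame's CPU time that can run in parallel (0..1).
// threads: number of threads available (>= 1).
// Both parameter names are hypothetical, chosen for this sketch.
double amdahl_speedup(double parallel_fraction, int threads) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(threads >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / threads);
}
```

For instance, if half of the frame’s CPU time were stuck on a single render or driver thread, `amdahl_speedup(0.5, 8)` stays below 2x no matter how the other half is divided, which is the bottleneck the slide describes.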
Command buffer API • GPUs only listen to command buffers • Let the app build them • Command Lists, at the API level • Solves sequential API CPU issues [Diagram labels: Application simulation; ...; Thread 1; Thread 0; Build Cmd Buffer; Application; Build Cmd Buffer; Runtime / Driver; GPU Execution Queue; Queued Buffer 0; Queued Buffer 1]
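A minimal sketch of the model on this slide, assuming a toy `CommandList` type rather than the real Direct3D 12 interfaces: each thread records into its own list without touching shared state, and the lists are queued in a fixed order afterwards. All names here are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an API command list: a sequence of recorded commands.
struct CommandList {
    std::vector<std::string> commands;
};

// Records draws into one list. Nothing shared is touched while recording.
void record_draws(CommandList& list, std::size_t first_draw, std::size_t draw_count) {
    for (std::size_t i = 0; i < draw_count; ++i) {
        list.commands.push_back("draw " + std::to_string(first_draw + i));
    }
}

// Builds `list_count` command lists on separate threads, then "submits" them
// in list order by appending each one to a single execution queue.
std::vector<std::string> build_and_submit(std::size_t list_count, std::size_t draws_per_list) {
    std::vector<CommandList> lists(list_count);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < list_count; ++t) {
        workers.emplace_back([&lists, t, draws_per_list] {
            record_draws(lists[t], t * draws_per_list, draws_per_list);
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }

    std::vector<std::string> queue;
    for (const CommandList& list : lists) {
        queue.insert(queue.end(), list.commands.begin(), list.commands.end());
    }
    return queue;
}
```

Submission order is decided by the application after recording, so the result is deterministic even though the recording threads run in whatever order the scheduler picks.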
Better scheduling • App has much more control over scheduling work • Both CPU side and GPU • Threads don’t really share many resources • Many more options for streaming assets [Diagram: “D3D11: CB building threads tend to interfere” (Create thread; Driver thread) versus “D3D12: CB building threads more independent” (Create thread; Build threads); “GPU load still added but only after queuing” (Create work; Render work; GPU executes)]
Pipeline objects • Pipeline objects get rid of JIT compilation and enable LTCG (link-time code generation) for GPUs • Decouple interface and implementation • We’re aware that this is a hairpin bend for many graphics engines to negotiate • Many engines don’t think in terms of predicting state up front • The benefits are worth it [Diagram: simplified dataflow through pipeline: Index Process, VS, Primitive Generation, Rasteriser, PS, Rendertarget Output]
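One way an engine might approach "predicting state up front" is to key prebuilt pipeline objects on the full state description, so the expensive compile happens once per distinct combination rather than at draw time. The sketch below is an assumption about engine-side structure, not the Direct3D 12 API; `PipelineDesc`, `PipelineObject` and `PipelineCache` are hypothetical names, and a real description covers far more state than three fields.

```cpp
#include <map>
#include <tuple>

// Hypothetical, heavily reduced pipeline description.
struct PipelineDesc {
    int vertex_shader_id;
    int pixel_shader_id;
    bool blend_enabled;

    bool operator<(const PipelineDesc& other) const {
        return std::tie(vertex_shader_id, pixel_shader_id, blend_enabled) <
               std::tie(other.vertex_shader_id, other.pixel_shader_id, other.blend_enabled);
    }
};

// Hypothetical handle to a compiled pipeline.
struct PipelineObject {
    int handle;
};

class PipelineCache {
public:
    // Returns the object for `desc`, compiling it only the first time it is seen.
    const PipelineObject& get(const PipelineDesc& desc) {
        auto it = cache_.find(desc);
        if (it == cache_.end()) {
            ++compile_count_;  // stands in for the expensive whole-pipeline compile
            it = cache_.emplace(desc, PipelineObject{compile_count_}).first;
        }
        return it->second;
    }

    int compile_count() const { return compile_count_; }

private:
    std::map<PipelineDesc, PipelineObject> cache_;
    int compile_count_ = 0;
};
```

Calling `get` for every known combination at load time moves all compiles out of the frame; draws then only perform lookups.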
Render object binding mismatch • Hardware uses tables in video memory • BUT still programmed like a register solution • So one bind becomes: • Allocate a new chunk of video memory • Create a new copy of the entire table • Update the one entry • Write the register with the new table base address [Diagram labels: On-chip root table (1 per stage) with entries SR (pointer to table; here, textures) and CB (pointer to table; constant buffers); GPU Memory SRD table (pointer to, + params of, resource); GPU Memory resource]
Descriptor Tables • Several tables of each type of resource • Easy to divide up by frequency • Tables can be of arbitrary size; dynamically indexed to provide bindless textures • Changing a table pointer is cheap • Updating a descriptor in a table is not [Diagram labels: On-chip table with entries SR.T[0] to SR.T[3], UAV, Samp, CB.T[0], CB.T[1]; GPU Memory SRD tables: textures table 0 (SR.T[0][0], SR.T[0][1], SR.T[0][2]) and constbuf table 1 (CB.T[1][0], CB.T[1][1])]
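The cost difference between the register-style bind and a table pointer change can be modelled with a toy counter of descriptors written. This is a sketch under stated assumptions: `Descriptor`, `RegisterStyleBinder` and `TableStyleBinder` are hypothetical types, and real hardware costs involve memory allocation and cache effects that a simple count does not capture.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Descriptor = int;  // hypothetical stand-in for a hardware resource descriptor

// Register-style binding: changing one slot makes a fresh copy of the
// whole table, patches the one entry, then switches to the new table.
struct RegisterStyleBinder {
    std::vector<Descriptor> current_table;
    std::size_t descriptors_written = 0;

    void bind(std::size_t slot, Descriptor descriptor) {
        std::vector<Descriptor> fresh = current_table;  // copy of the entire table
        fresh[slot] = descriptor;                       // update the one entry
        descriptors_written += fresh.size();
        current_table = std::move(fresh);               // "write the new base address"
    }
};

// Table-style binding: tables are filled once up front; per-draw work
// is only repointing at an existing table.
struct TableStyleBinder {
    std::vector<std::vector<Descriptor>> tables;
    std::size_t active_table = 0;
    std::size_t descriptors_written = 0;

    std::size_t create_table(std::vector<Descriptor> contents) {
        descriptors_written += contents.size();
        tables.push_back(std::move(contents));
        return tables.size() - 1;
    }

    void set_table(std::size_t index) { active_table = index; }  // pointer change only
};
```

With a 16-entry table, 100 register-style binds write 1600 descriptors in this model, while two prebuilt tables cost 32 writes in total however many times the active table is switched.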
New visible limits • More draws in does not automatically mean more triangles out • You will not see full rendering rates with triangles averaging 1 pixel each. • Wireframe mode should look different to filled rendering
New visible limits • Feeding the GPU much more efficiently means exploring interesting new limits that weren’t visible before • 10k/frame of anything is ~1µs per thing. • GPU pipeline depth is likely to be 1-10µs (1k-10k cycles). • Specific limit: context registers • Shader tables are NOT in the context • Compute doesn’t bottleneck on context
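The rounded figures on this slide can be checked with simple arithmetic. The frame rates and the 1 GHz clock used below are assumptions chosen to match the slide’s orders of magnitude; the talk does not state them, and the function names are hypothetical.

```cpp
// Time available per item, in microseconds, if `items_per_frame` items
// must all be handled within one frame at `frames_per_second`.
double per_item_budget_us(double frames_per_second, double items_per_frame) {
    const double frame_time_us = 1000000.0 / frames_per_second;
    return frame_time_us / items_per_frame;
}

// Number of clock cycles that elapse in `microseconds` at `clock_ghz`.
double cycles_in(double microseconds, double clock_ghz) {
    return microseconds * clock_ghz * 1000.0;
}
```

At an assumed 100 frames per second, 10k items leave exactly 1µs each; at 60 frames per second the budget is about 1.67µs, still the same order of magnitude. At an assumed 1 GHz clock, 1µs to 10µs corresponds to 1k to 10k cycles, so the per-item budget is comparable to the quoted pipeline depth.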
Application in charge • Application is arbiter of correct rendering • This is a serious responsibility • The benefits of D3D12 aren’t readily available without this condition • Applications must be warning-free on the debug layer • Different opportunities for driver intervention
Application in charge • No driver thread in play • App can target much lower latency • BUT implies app has to be ready with new GPU work [Diagram: per-frame timelines (Frame 1 to Frame 3) for App Render, Driver and GPU. D3D11: no dead GPU time after 1st frame (but extra latency); first work sent to driver; driver buffers Present, so no future dead time. No buffered Present reveals dead time on the GPU.]
Use command buffers sparingly • Each API command list maps to a single hardware command buffer • Starting / ending a command list has an overhead • Writes full 3D state, may flush caches or idle GPU • We think a good rule of thumb will be to target around 100 command buffers/frame • Use the multiple submission API where possible [Diagram: multiple applications running on system. Application 0 queue: CB0, CB1, CB2; Application 1 queue: CB0; GPU executes: CB0, CB1, CB0, CB2]
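One simple way to respect a command buffer budget is to decide the batch sizes before recording. The helper below is a hypothetical sketch, not part of any API: it splits a frame’s draws into at most a target number of contiguous, near-equal batches, one per command list.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Splits `draw_count` draws into at most `max_lists` contiguous batches of
// near-equal size. Returns the number of draws in each batch.
std::vector<std::size_t> plan_command_lists(std::size_t draw_count, std::size_t max_lists) {
    std::vector<std::size_t> batches;
    if (draw_count == 0 || max_lists == 0) {
        return batches;
    }
    const std::size_t list_count = std::min(draw_count, max_lists);
    const std::size_t base = draw_count / list_count;
    const std::size_t remainder = draw_count % list_count;
    for (std::size_t i = 0; i < list_count; ++i) {
        // The first `remainder` batches take one extra draw each.
        batches.push_back(base + (i < remainder ? 1 : 0));
    }
    return batches;
}
```

With the slide’s rule of thumb, `plan_command_lists(10000, 100)` yields 100 batches of 100 draws, so the fixed start and end cost of each list is spread over many draws.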
All-new • There’s a learning curve here for all of us • In the main it’s a shallow one • Compared at least to the general problem of multithreaded rendering • Multithreading is always hard • Simpler design means fewer bugs and more predictable performance
What AMD plan to deliver • An early preview driver “soon” • Release driver for Direct3D12 launch • Continuous engagement • With Microsoft • With ISVs • Bring your opinions to us and to Microsoft.
DX12 and Frostbite. Johan Andersson, Technical Director
DX12 and Frostbite • PC is very important for EA and we’ve been pushing hard to improve graphics capabilities on Windows • Excited to be working with Microsoft and the IHVs on Direct3D again! • Good & very healthy collaboration between Microsoft, the IHVs and us game/engine developers • DX12 is a really big step forward from DX11 or GL4
DX12 features and Frostbite • Key DX12 features that are a great fit for Frostbite: • Efficient parallel command buffers • Descriptor tables • Pipeline objects • Explicit resource synchronization • Explicit memory management • DX12 is still in development, so we are actively working with Microsoft & the IHVs to help make sure all of it fits together and is efficient
DX12 platforms • DX12 support on Windows 7 & most existing PC hardware is critical for us • Huge user base still on Windows 7 • Gamers would see major benefits without upgrading • DX12 support on Xbox One is critical for us • Will lead to improved performance & quality for future Xbox One titles • Almost all of our games are cross-platform Gen4/PC • Easier development – renderer is shared between Windows & Xbox One • Looking forward to DX12 on mobile/tablets • Power efficiency & low overhead are really key • Need larger user base to target on Windows for mobile
DX12 and Frostbite • We are building a DX12 renderer for Frostbite! • Will work on GPUs from all vendors – benefits a wide set of gamers • Expected benefits over DX11: • More stable and consistent performance • Higher overall performance • Move our design target – richer & more detailed game worlds • Thinner drivers – easier to work with / less of a black box • More control for us developers – new techniques & optimizations • Really happy that the full Windows & Xbox ecosystems are moving to low-level graphics APIs!