Inside Xbox One • Martin Fuller • Xbox Advanced Technology Group • AMD AND MICROSOFT GAME DEVELOPER DAY - June 2 2014, STOCKHOLM
NDA • This is a non-NDA event • That means there is a limit to how much I can say, go easy!
CPU • AMD Jaguar (x64): 8 cores arranged in 2 clusters of 4 cores each • 1.75 GHz • Dual issue • Out of order execution • Speculative execution • Store-to-load forwarding • SSE4.2 and AVX • (Dot product!) • 16 x 256-bit wide floating point registers • Hardware pre-fetch
Memory • 8 GiB of DDR3 at 68 GiB/s • Low latency • Not enough bandwidth to touch all of memory in a frame; use RAM as a super fast cache • 48-bit virtual address space • 256 TiB • Tricky to fragment! • Synced between CPU and GPU • 4 MiB of L2 cache • 2 MiB per cluster • MOESI protocol for cache coherency • 16-way set associative • Per core, up to eight cache requests in flight at once
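The "not enough bandwidth to touch all of memory in a frame" point follows directly from the figures on the slide. A minimal sketch of the arithmetic, using only the quoted 8 GiB and 68 GiB/s; the function names are made up for illustration and the result is a peak figure that ignores contention:

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted on the slide.
constexpr double kDramSizeGiB = 8.0;
constexpr double kDramBandwidthGiBPerSec = 68.0;

// Peak amount of data DRAM can move during one frame at the given frame rate.
inline double dram_gib_per_frame(double fps) {
    assert(fps > 0.0);
    return kDramBandwidthGiBPerSec / fps;
}

// Fraction of the 8 GiB that could be touched once per frame at peak bandwidth.
inline double dram_fraction_per_frame(double fps) {
    return dram_gib_per_frame(fps) / kDramSizeGiB;
}

// Size in TiB of a virtual address space with the given number of address
// bits (valid for 40 to 63 bits): 2^bits bytes divided by 2^40.
constexpr std::uint64_t virtual_space_tib(unsigned address_bits) {
    return (std::uint64_t{1} << address_bits) >> 40;
}

static_assert(virtual_space_tib(48) == 256, "48 address bits cover 256 TiB");
```

At 60 fps this gives roughly 1.13 GiB per frame, about 14% of memory; at 30 fps roughly 2.27 GiB, about 28%.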
CPU – Recommendations • Store to load forwarding saves the dreaded LHS stall • But not spilling out registers is even better • The branch predictor is not a crystal ball • Branchless tricks learnt in Xbox 360 era can still apply • Hardware data pre-fetch is awesome • Only works with arrays • Avoid aliasing load/stores on 2KiB alignments • This causes a false positive that delays load execution • Go wide with SSE and leverage all cores • No brainer
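As a reminder of what "branchless tricks" can look like, here is a small sketch in portable C++. It is not taken from the talk: `select_branchless` and `count_below` are hypothetical helpers, and whether they beat a well-predicted branch on Jaguar has to be measured per case.

```cpp
#include <cstddef>
#include <cstdint>

// Branchless select: returns a when cond is true, otherwise b.
// The mask is all ones for true and all zeros for false.
inline std::int32_t select_branchless(bool cond, std::int32_t a, std::int32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    const std::uint32_t bits = (static_cast<std::uint32_t>(a) & mask) |
                               (static_cast<std::uint32_t>(b) & ~mask);
    return static_cast<std::int32_t>(bits);
}

// Counts array elements below a threshold without a data-dependent branch:
// the comparison result (0 or 1) is added directly to the counter.
// Walking a plain array linearly also suits the hardware pre-fetcher.
inline std::size_t count_below(const float* data, std::size_t n, float threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += static_cast<std::size_t>(data[i] < threshold);
    }
    return count;
}
```

Compilers often generate conditional moves or SIMD compares for code like this on their own, so check the disassembly before hand-tuning.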
GPU • AMD GCN 768-SPU • 853 MHz • 32 MiB of ESRAM at 109 GiB/s • 4 Move Engines • 3 hardware display planes • Resolution independent • Frame rate independent • Exact sRGB this time! • (oh, and it's free) • Hardware video encode and decode • HDMI 1.4a in and out
Move Engines • More than just DMA copy • Memory set • Texture swizzle • JPEG decompress • LZ compress and decompress
ESRAM • 32 MiB of general purpose RAM • Not like EDRAM on Xbox 360 • 109 GiB/s • Sometimes faster in practice! • Zero contention • Not shared with CPU, SRAs or video out • ESRAM makes everything better • Render targets • Textures • Geometry • Compute tasks
ESRAM – Sometimes faster in practice? • ESRAM can handle concurrent read/writes: Increasing effective bandwidth above 109 GiB/s • Operations that can take advantage of this: • Read modify write operations • Depth buffer / HTILE update • Alpha blending • Oh, and concurrently DMA’ing resources in/out of ESRAM while also rendering • How much effective bandwidth can titles achieve? • The current record holder achieved 141 GiB/s from ESRAM (this is a post processing pass in a real title) • Of course all titles combine ESRAM’s >= 109 GiB/s with DRAM’s 68 GiB/s
ESRAM – The Four Stages of Adoption • 1. Statically allocate a small number of render targets in ESRAM • 2. Alias the same memory for re-use later • 3. Partial residency: put the top strip of render targets (sky) in DRAM, the rest in ESRAM • 4. Asynchronously DMA resources in/out of ESRAM • Launch titles were at 1 - 2 • 2nd wave of titles are now starting to tackle points 3 and/or 4 • 3rd+ wave will get really good at this!
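The partial residency idea (top strip in DRAM, the rest in ESRAM) can be sketched as a simple row split. This is a simplified illustration and not the Xbox One API: it assumes a linear, untiled render target, and `ResidencySplit` and `split_by_rows` are hypothetical names. Real allocations work on pages and tiled surfaces.

```cpp
#include <cassert>
#include <cstdint>

// How many rows of a render target live in each memory pool.
struct ResidencySplit {
    std::uint32_t dram_rows;   // top strip (for example sky) left in DRAM
    std::uint32_t esram_rows;  // remaining rows placed in ESRAM
};

// Puts as many rows as fit within the ESRAM budget into ESRAM, counted
// from the bottom of the target, and leaves the top strip in DRAM.
inline ResidencySplit split_by_rows(std::uint32_t width, std::uint32_t height,
                                    std::uint32_t bytes_per_pixel,
                                    std::uint64_t esram_budget_bytes) {
    assert(width > 0 && bytes_per_pixel > 0);
    const std::uint64_t row_bytes = std::uint64_t{width} * bytes_per_pixel;
    std::uint64_t rows_that_fit = esram_budget_bytes / row_bytes;
    if (rows_that_fit > height) {
        rows_that_fit = height;
    }
    const std::uint32_t esram_rows = static_cast<std::uint32_t>(rows_that_fit);
    return ResidencySplit{height - esram_rows, esram_rows};
}
```

For a 1920x1080 target at 4 bytes per pixel and a 6 MiB budget, this leaves the top 261 rows in DRAM and places the lower 819 rows in ESRAM.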
ESRAM – Memory Maps! • It’s like 8-bit days all over again! (Sort of) • Plan the asynchronous moves • Move resources in/out asynchronously while also rendering • New memory map at each stage of the render pipeline • Don’t forget, swizzle textures on DMA
Maxing out the GPU • Are you bandwidth limited? • Have you maxed out the fixed function hardware? • Do you have spare compute resource? • Then use async compute! • Titles have barely scratched the surface yet: • Watch this space!
The usual GPU recommendations • Use ESRAM • First for depth / stencil • Then colour targets • Then everything else • Sort by state / shader / use hardware instancing • (Batch batch batch!) • Always swizzle textures • Be wary of using too many general purpose registers • Keep an eye on occupancy in PIX, we normally recommend >= 4 • Avoid reading DRAM via the CPU-coherent bus • There is no hardware integer divide
Graphics API • DX11 was designed for the desktop (a long time ago, 2008!) • Abstracts a variety of different GPU architectures • Manages VRAM residency for you • Over subscribing VRAM is a serious performance pitfall • Handles hazards • Developers can handle these at a higher level => less cost • Xbox One will run vanilla DX11 PC code • Easy port • Extensions available for low level access
Graphics API • DX11.X • Some DX12 features available right now on Xbox: • Turn off hazard tracking • Simple fence API • Deferred contexts re-implemented • New resource descriptor model • Draw bundles • (Xbox specific, not the DX12 API)
DRAM - Contention • The CPU cannot saturate DRAM bandwidth on its own, the GPU can! • Significant performance degradation from DRAM contention • Fancy CPU features don’t help if memory starved • 10. Use ESRAM as much as possible • 20. Leave DRAM for the CPU and DMA • 30. goto 10;
DRAM – Love your bandwidth • Hardware data cache pre-fetch units are awesome • Manual pre-fetch is near pointless once hardware pre-fetch is spinning • Wasting bandwidth if only operating on small arrays • Write combined memory pages and SSE streaming store instructions bypass the cache • No load - halves the bandwidth consumed by the CPU • Pack your data! • Expanding / compressing data is cheap (CPU & GPU) • F16C (half <-> float) CPU instructions • Store to load forwarding avoids LHS stalls • Swizzle your textures • Move engines can swizzle on copy
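To illustrate "pack your data", here is a minimal sketch that stores a value in the range 0 to 1 in 16 bits instead of 32, halving the memory traffic for that field. Note that this is a swapped-in technique: it uses 16-bit fixed-point (unorm) quantisation in portable C++, not the F16C half-float instructions named on the slide, and the helper names are made up.

```cpp
#include <cmath>
#include <cstdint>

// Quantises a float to a 16-bit unsigned normalised value.
// Inputs outside [0, 1] are clamped; NaN is mapped to 0.
inline std::uint16_t pack_unorm16(float x) {
    x = (x > 0.0f) ? x : 0.0f;  // also catches NaN, since NaN > 0 is false
    x = (x < 1.0f) ? x : 1.0f;
    return static_cast<std::uint16_t>(std::lround(x * 65535.0f));
}

// Expands a 16-bit unsigned normalised value back to a float in [0, 1].
inline float unpack_unorm16(std::uint16_t v) {
    return static_cast<float>(v) / 65535.0f;
}
```

The round trip loses at most half a quantisation step (about 0.0000076), which is acceptable for many colour, weight and normalised coordinate fields but not for data that needs the full float range.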
Audio • Custom audio hardware • Very fast • Lots of features • Kinda cool! • Nuff said
3x Operating Systems • ERA • Exclusive Resource Allocation • Only one active at a time • Custom OS • (Games!) • SRA • Shared Resource Allocation • Win8 core • (Apps) • Hypervisor • SRA and ERA use different virtual address spaces
PLM (Program Lifetime Management) • ERA can be in one of several states • Full screen • Full resources (even with snapped app up) • Constrained (Windowed) • Slightly less CPU and GPU resource • No input • Same amount of memory • Suspended • Zero CPU and GPU resource • No input • Same amount of memory • Limited time to save after receiving a suspend message
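The three states on this slide can be summarised as a small lookup table. This is only a sketch of the rules as listed above; `EraState`, `EraResources` and `resources_for` are hypothetical names, not part of the platform API, and input in the full screen state is inferred from the other two states being marked "No input".

```cpp
// PLM states of the ERA (game) partition, as listed on the slide.
enum class EraState { FullScreen, Constrained, Suspended };

// What the title can rely on in each state.
struct EraResources {
    bool receives_input;  // does the title get input?
    bool runs;            // any CPU and GPU time at all?
    bool full_cpu_gpu;    // the full CPU and GPU allocation?
    bool keeps_memory;    // same amount of memory as full screen?
};

inline EraResources resources_for(EraState state) {
    switch (state) {
        case EraState::FullScreen:
            return EraResources{true, true, true, true};
        case EraState::Constrained:  // windowed: slightly less CPU and GPU
            return EraResources{false, true, false, true};
        case EraState::Suspended:    // zero CPU and GPU
            return EraResources{false, false, false, true};
    }
    return EraResources{false, false, false, false};  // unreachable for valid states
}
```

Memory is retained in every state, so the title only needs to save within the limited time after the suspend message, before it loses all CPU time.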
Kinect 2.0 • Hardware: • Higher resolution colour and depth • Better ranges • New – infrared! • Microphone array • No tilt motor • Software: • Improved skeletal tracking • Improved biometrics
Streaming install • 6x Blu-ray = ~26 MiB/s • To install a 50 GiB Blu-ray at ~26 MiB/s = ~33 minutes • Too long to wait… bored now… • Game must start after an initial payload has been installed. • When running, the title can hint as to what to install next. • No direct access to Blu-ray. • Could be digital download • It’s obvious but I’ll say it anyway – compress your assets!
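The install-time estimate on this slide can be checked with a few lines of arithmetic. A minimal sketch, assuming the standard 1x Blu-ray rate of 36 Mbit/s (4.5 MB/s); the function names are illustrative only:

```cpp
#include <cassert>

// Sustained read rate in MiB/s for a Blu-ray drive of the given speed
// multiple, assuming 1x = 36 Mbit/s = 4.5 * 10^6 bytes per second.
inline double bluray_mib_per_sec(double speed_multiple) {
    assert(speed_multiple > 0.0);
    return speed_multiple * 4.5e6 / (1024.0 * 1024.0);
}

// Minutes needed to copy size_gib GiB at rate_mib_per_sec MiB/s.
inline double install_minutes(double size_gib, double rate_mib_per_sec) {
    assert(rate_mib_per_sec > 0.0);
    return size_gib * 1024.0 / rate_mib_per_sec / 60.0;
}
```

A 6x drive works out to a little under 26 MiB/s, and 50 GiB at 26 MiB/s takes just under 33 minutes, which matches the figures above and is why the game has to start from an initial payload.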
The Cloud • Cloud compute: • Developer’s code is hosted and executed in Windows Azure • Game code execution automatically scales based upon usage • Live services: • Stats, analytics, matchmaking & storage. • Secure!
Challenges • Is your code 64-bit compliant? • Can you scale to 6 cores? • Adopt new DX11.X API extensions • Manage your own resource hazards • Make sure you use ESRAM effectively • Package content for streaming install • Game design considerations • Quick save on ERA termination • Kinect, Smartglass • Cloud services
Thank You! – Questions? • (That I’m allowed to answer)