Deferred Lighting and Post Processing on PLAYSTATION®3 Matt Swoboda PhyreEngine™ Team Sony Computer Entertainment Europe (SCEE) R&D
Where Are We Now? • PS3 into its 3rd year • Many developers on their 2nd generation engines • Solved the basic problems • SPUs STILL underused • But it’s improving
But... • GPU now the most common bottleneck • Usually limited by fragment operations • Many titles spend more than 1/3 of their time in post processing • Most developers want to do even more fragment work • More / heavier post processing effects • Better lighting techniques / more lights / softer shadows • Longer shaders • Features ported from PC / other console hardware
“We fixed the vertex bottleneck...” • Many possible solutions to improve geometry performance beyond just “optimising the shader” • LOD • Occlusion culling & visibility culling • Move large vertex operations to SPU, e.g. skinning • SPU triangle culling
What About Pixels? • Fragment operations / post processing rarely optimised like geometry operations • Throw whole operation at the GPU • Same operation done for every pixel • Spatial optimisation / branching considered too slow • SPU not considered: “too slow”, “uses too much bandwidth”
SPU pixel processing • Yes, the SPU is fast enough to process pixels • Won’t beat the GPU in a brute force race • GPU specialises in rasterising triangles and sampling textures – has dedicated hardware • SPU is a general purpose processor • Use flexibility to your advantage • Choose different code branches and fast paths
Post Processing Effects on SPU A Whirlwind Tour
What to do on SPU • Options: • Offload whole processes from GPU to SPU • Or use SPU and GPU together to do one process
Depth Of Field Pre-Process • High quality depth of field requires a long fragment shader • Read depth samples and colour samples in a kernel / disc • Check depths against centre pixel depth • Weight colours by depth check results • Wasteful for “most” of the screen • All depth checks pass (out of focus) or all fail (in focus) • All fail == pass through original buffer • All pass == use pre-blurred buffer – separable Gaussian blur • Categorise the screen for these cases on SPU
Depth Of Field Classification Results • Post process depth buffer • Classify by min/max depth • Green: fully in focus • Blue: fully out of focus • Red: neither fully in nor fully out of focus
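The classification above can be sketched in C++ as follows. This is a minimal illustration, not PhyreEngine code: it assumes a linear depth buffer, square tiles, and a focus range [focusNear, focusFar]; the names TileClass, classifyDepthRange and classifyScreen are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile categories matching the colours on the slide.
enum class TileClass { InFocus, OutOfFocus, Mixed };

// Classify one tile from its depth range. Assumes a pixel is "in focus"
// when focusNear <= depth <= focusFar. A tile that straddles the whole
// focus range is reported as Mixed, which is conservative.
inline TileClass classifyDepthRange(float minDepth, float maxDepth,
                                    float focusNear, float focusFar) {
    if (minDepth >= focusNear && maxDepth <= focusFar) return TileClass::InFocus;
    if (maxDepth < focusNear || minDepth > focusFar)   return TileClass::OutOfFocus;
    return TileClass::Mixed;
}

// Walk the depth buffer in tileSize x tileSize tiles, find each tile's
// min/max depth, and classify it. Tiles are returned in row-major order.
inline std::vector<TileClass> classifyScreen(const std::vector<float>& depth,
                                             int width, int height, int tileSize,
                                             float focusNear, float focusFar) {
    assert(tileSize > 0 && width % tileSize == 0 && height % tileSize == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int tilesX = width / tileSize;
    const int tilesY = height / tileSize;
    std::vector<TileClass> result;
    result.reserve(static_cast<std::size_t>(tilesX) * tilesY);
    for (int ty = 0; ty < tilesY; ++ty) {
        for (int tx = 0; tx < tilesX; ++tx) {
            float lo = depth[static_cast<std::size_t>(ty * tileSize) * width + tx * tileSize];
            float hi = lo;
            for (int y = 0; y < tileSize; ++y) {
                for (int x = 0; x < tileSize; ++x) {
                    const std::size_t i =
                        static_cast<std::size_t>(ty * tileSize + y) * width + tx * tileSize + x;
                    lo = std::min(lo, depth[i]);
                    hi = std::max(hi, depth[i]);
                }
            }
            result.push_back(classifyDepthRange(lo, hi, focusNear, focusFar));
        }
    }
    return result;
}
```

In this sketch InFocus tiles would pass the original buffer through, OutOfFocus tiles would use the pre-blurred buffer, and only Mixed tiles would need the full depth of field shader.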
Depth Of Field Pre-Process Results • Pre-process only on SPU, blur operations on GPU • Goal: minimise overall frame time and latency • Large blur w.r.t. depth • 15 ms+ on GPU alone • 1.5-2 ms on SPU + 3 ms on GPU
Screen Tile Classification • Categorise the screen using the range of depth values within a tile • Powerful technique with many applications • Full screen effect optimisation: DOF, SSAO, ... • Soft particles • Affecting lights • Occluder information
Screen Space Ambient Occlusion (SSAO) • Generate an ambient occlusion approximation using the depth buffer alone • Perform a large kernel-based series of depth comparisons and sum the results • Downsample output to ½ size for performance • Output normals for bilateral upsampling
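A rough C++ sketch of a depth-only occlusion kernel with half-size output is shown below. It is an illustration under stated assumptions, not the PhyreEngine implementation: the square kernel, the bias and range thresholds, and the function name ssaoHalfRes are all hypothetical, and the normal output used for bilateral upsampling is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a half-resolution occlusion buffer: 1.0 = unoccluded, 0.0 = fully
// occluded. A neighbour counts as an occluder when it is closer to the camera
// than the centre pixel by more than `bias` but less than `range`.
inline std::vector<float> ssaoHalfRes(const std::vector<float>& depth,
                                      int width, int height,
                                      int radius, float bias, float range) {
    assert(radius >= 1 && width % 2 == 0 && height % 2 == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int outW = width / 2;
    const int outH = height / 2;
    std::vector<float> ao(static_cast<std::size_t>(outW) * outH);
    for (int oy = 0; oy < outH; ++oy) {
        for (int ox = 0; ox < outW; ++ox) {
            const int cx = ox * 2;  // one output pixel per 2x2 input block
            const int cy = oy * 2;
            const float centre = depth[static_cast<std::size_t>(cy) * width + cx];
            int occluded = 0;
            int total = 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    // Clamp sample coordinates to the screen edges.
                    const int sx = std::min(std::max(cx + dx, 0), width - 1);
                    const int sy = std::min(std::max(cy + dy, 0), height - 1);
                    const float diff =
                        centre - depth[static_cast<std::size_t>(sy) * width + sx];
                    if (diff > bias && diff < range) ++occluded;
                    ++total;
                }
            }
            ao[static_cast<std::size_t>(oy) * outW + ox] =
                1.0f - static_cast<float>(occluded) / static_cast<float>(total);
        }
    }
    return ao;
}
```

The inner loop does the same comparison for every sample with no texture filtering, which is the kind of regular work that suits the SPU.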
SPU Screen Space Ambient Occlusion Results • GPU version: 10 ms+ • SPU version: 6 ms on 2 SPUs • Used in “Donkey Trader” PhyreEngine game template
Deferred Shading Overview • Rasterise geometry information to multiple “GBuffers” (geometry buffers) • Apply lighting and shading in a post process
Deferred Lighting on SPU • The SPU can handle the deferred lighting process • The GPU renders the geometry to GBuffers • SPU and GPU execute in parallel • Total time: max(geometry, lighting)
Deferred Lighting on SPU: Implementation (1) • Process each pixel once • Work out which lights affect each pixel • Apply the N affecting lights in a loop • Process the screen in tiles • Use classification techniques per tile to optimise
Deferred Lighting on SPU: Implementation (2) • Calculate affecting lights per tile • Build a frustum around the tile using the min and max depth values in that tile • Perform frustum check with each light’s bounding volume • Compare light direction with tile average normal value • Choose fast paths based on tile contents • No lights affect the tile? Use fast path • Check material values to see if any pixels are marked as lit
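The per-tile light test can be sketched as a sphere-versus-frustum check. The C++ below is a minimal illustration under assumptions that are not in the slides: lights are bounded by view-space spheres, the camera looks down +z, and projX / projY are the perspective scale terms (ndc = proj * view.xy / view.z). All type and function names are hypothetical, and the normal-direction check is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct PointLight { Vec3 centre; float radius; };  // view-space bounding sphere

// Tile bounds in normalised device coordinates plus its view-space depth range.
struct Tile { float ndcX0, ndcX1, ndcY0, ndcY1; float minZ, maxZ; };

// Plane through the camera origin with unnormalised inward normal n:
// true if the sphere lies entirely on the outside of it.
inline bool sphereOutsidePlane(const Vec3& n, const Vec3& c, float r) {
    const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    return (n.x * c.x + n.y * c.y + n.z * c.z) < -r * len;
}

// Conservative test: may report a light that only touches a frustum corner,
// but never rejects a light that does reach the tile.
inline bool lightAffectsTile(const Tile& t, const PointLight& l,
                             float projX, float projY) {
    const Vec3& c = l.centre;
    const float r = l.radius;
    // Near and far planes come from the tile's min and max depth.
    if (c.z + r < t.minZ || c.z - r > t.maxZ) return false;
    if (sphereOutsidePlane(Vec3{ projX, 0.0f, -t.ndcX0}, c, r)) return false;  // left
    if (sphereOutsidePlane(Vec3{-projX, 0.0f,  t.ndcX1}, c, r)) return false;  // right
    if (sphereOutsidePlane(Vec3{0.0f,  projY, -t.ndcY0}, c, r)) return false;  // bottom
    if (sphereOutsidePlane(Vec3{0.0f, -projY,  t.ndcY1}, c, r)) return false;  // top
    return true;
}

// Indices of the lights that survive culling for this tile.
inline std::vector<int> affectingLights(const Tile& t,
                                        const std::vector<PointLight>& lights,
                                        float projX, float projY) {
    std::vector<int> result;
    for (std::size_t i = 0; i < lights.size(); ++i)
        if (lightAffectsTile(t, lights[i], projX, projY))
            result.push_back(static_cast<int>(i));
    return result;
}
```

An empty result corresponds to the "no lights affect the tile" fast path; otherwise the per-pixel loop runs over just the returned indices.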
Deferred Lighting on SPU: Implementation (3) • Choose whether to process MSAA per tile • If no sample pairs in the tile differ, light only one sample from each pair; otherwise light both samples separately • Typically only a few tiles need both MSAA samples lit
[Figure: tiles requiring MSAA]
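A small C++ sketch of the per-tile MSAA decision follows. It assumes 2x MSAA with the two samples of each pixel stored next to each other as packed 32-bit GBuffer values; that layout and the names tileNeedsMsaa and lightTile are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// samples holds two packed GBuffer values per pixel:
// [pixel0 sample0, pixel0 sample1, pixel1 sample0, pixel1 sample1, ...].
// Returns true if any pair in the tile differs.
inline bool tileNeedsMsaa(const std::vector<std::uint32_t>& samples) {
    assert(samples.size() % 2 == 0);
    std::uint32_t diff = 0;
    for (std::size_t i = 0; i < samples.size(); i += 2)
        diff |= samples[i] ^ samples[i + 1];  // accumulate without branching
    return diff != 0;
}

// Light a tile: one evaluation per pixel when all pairs match, two otherwise.
// `light` stands in for the real lighting function.
template <typename LightFn>
std::vector<std::uint32_t> lightTile(const std::vector<std::uint32_t>& samples,
                                     LightFn light) {
    std::vector<std::uint32_t> out(samples.size());
    const bool perSample = tileNeedsMsaa(samples);
    for (std::size_t i = 0; i < samples.size(); i += 2) {
        out[i] = light(samples[i]);
        out[i + 1] = perSample ? light(samples[i + 1]) : out[i];
    }
    return out;
}
```

Because the decision is made once per tile, the inner loop stays free of per-pixel branching on sample equality.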
Deferred Lighting on SPU: Results • 3 shadow casting lights, 100 point lights • 2x MSAA, 720p • Lighting performed per sample • Apply tone mapping on SPU • Virtually free • Performance: > 60 fps, 3 SPUs for 11 ms each • No MSAA: 2 SPUs for 11 ms
Deferred Lighting on SPU: Issues • Potential latency • Must keep GPU busy while SPU process is running • Render something else or add a frame of latency • Main memory requirements • Shadows • Requires “random” texture access – not ideal for SPU • Can render shadows on GPU to a full screen buffer and use it on SPU
Flavours of Deferred Lighting on SPU • Full deferred render on SPU • Input all GBuffers, output final composited result • Light pre-pass render on SPU • Input normal and depth only; calculate light result; sample in 2nd geometry pass • Light tile classification data output? • SPU outputs information per tile about affecting lights • Do lighting calculations on GPU
Volumetric Lighting • Also known as “god rays” or “light beams” • Simulates the effect of light illuminating dust particles in the air • Numerous fakes exist • Artist-placed geometry • Artist-placed particles • Better: generate using the shadow map • Works in a “general case”
Volumetric Lighting • Ray march through the shadow map • Trace one ray per pixel in screen space • Sample the depth buffer to determine the end of the ray • Sample the shadow map at N points along the ray • N ~= 50 • Attenuate and sum up the number of samples that passed • Blur and add noise
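The ray march can be sketched in C++ as below. This is a simplified illustration: it assumes the ray start (camera) and end (surface found from the depth buffer) have already been transformed into light space, uses point sampling, and leaves out attenuation, blur and noise. ShadowMap, LightSpacePoint, isLit and volumetricLight are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ShadowMap {
    int size;                  // square map, size x size texels
    std::vector<float> depth;  // light-space depth of the nearest occluder
};

struct LightSpacePoint { float u, v, depth; };  // u, v in [0, 1]

// Point-sampled shadow test: true if the point is not behind an occluder.
inline bool isLit(const ShadowMap& sm, const LightSpacePoint& p) {
    const int x = std::min(std::max(static_cast<int>(p.u * sm.size), 0), sm.size - 1);
    const int y = std::min(std::max(static_cast<int>(p.v * sm.size), 0), sm.size - 1);
    return p.depth <= sm.depth[static_cast<std::size_t>(y) * sm.size + x];
}

// March from `start` to `end` and return the fraction of samples that are
// lit, in [0, 1]. steps defaults to the N ~= 50 quoted on the slide.
inline float volumetricLight(const ShadowMap& sm,
                             const LightSpacePoint& start,
                             const LightSpacePoint& end,
                             int steps = 50) {
    assert(steps > 0);
    int lit = 0;
    for (int i = 0; i < steps; ++i) {
        const float t = (static_cast<float>(i) + 0.5f) / static_cast<float>(steps);
        const LightSpacePoint p = {start.u + (end.u - start.u) * t,
                                   start.v + (end.v - start.v) * t,
                                   start.depth + (end.depth - start.depth) * t};
        if (isLit(sm, p)) ++lit;
    }
    return static_cast<float>(lit) / static_cast<float>(steps);
}
```

Each pixel reads the shadow map at N scattered positions, which is why the sampling cost dominates this effect.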
Volumetric Lighting • Effect is a bit too slow to be practical on GPU: ~5 ms • Do it on SPU instead • Parallelises with GPU easily • Result needed late in the render at compositing stage • Only needs depth and shadow map inputs • Problem: must randomly sample from the shadow map
Texture sampling on SPU • “Random access” texture sampling is bad for SPU • It’s bad for GPU, too, but sometimes you just have to do it • GPU: • Fast access from texture cache; cache miss is slow • Dedicated hardware handles lookups, filtering and wrapping • SPU: • Fast access from “texture cache” (SPU local memory) • Slow access on cache miss (DMA from main memory) • Cache lookups slow (no dedicated hardware) • Must manually handle filtering and wrapping (again, slow)
Texture sampling on SPU • Either: • Make the texture entirely fit in SPU local memory • Problem solved! • Still inefficient: random accesses reduce register parallelism • Or • Write a very good software cache • Locate potential cache misses early - long before you need the values • Avoid branches in sampling code
Volumetric Lighting on SPU • Volumetric light result will be blurred • Don’t need full shadow map accuracy • No filtering on texture samples needed • Downsample shadow map from 1024x1024, 32-bit to 256x256, 16-bit • 256 x 256 x 2 bytes = 128 KB – fits in SPU local memory • Fast enough to sample on SPU
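The size arithmetic and the downsample step can be sketched in C++ as follows. The slides do not say how the 4x4 source texels are combined, so keeping the farthest depth in each block is an assumption, as are the [0, 1] depth range and the name downsampleShadowMap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 256 x 256 texels at 16 bits each is 128 KB.
static_assert(256u * 256u * sizeof(std::uint16_t) == 128u * 1024u,
              "downsampled shadow map should occupy 128 KB");

// Reduce a square float shadow map by `factor` in each dimension and quantise
// to 16 bits. Assumes depths in [0, 1]; each output texel keeps the farthest
// depth of its source block (an assumed choice, not taken from the slides).
inline std::vector<std::uint16_t> downsampleShadowMap(const std::vector<float>& src,
                                                      int srcSize, int factor) {
    assert(factor > 0 && srcSize % factor == 0);
    assert(src.size() == static_cast<std::size_t>(srcSize) * srcSize);
    const int dstSize = srcSize / factor;
    std::vector<std::uint16_t> dst(static_cast<std::size_t>(dstSize) * dstSize);
    for (int dy = 0; dy < dstSize; ++dy) {
        for (int dx = 0; dx < dstSize; ++dx) {
            float farthest = 0.0f;
            for (int y = 0; y < factor; ++y)
                for (int x = 0; x < factor; ++x)
                    farthest = std::max(
                        farthest,
                        src[static_cast<std::size_t>(dy * factor + y) * srcSize +
                            dx * factor + x]);
            const float clamped = std::min(std::max(farthest, 0.0f), 1.0f);
            dst[static_cast<std::size_t>(dy) * dstSize + dx] =
                static_cast<std::uint16_t>(clamped * 65535.0f + 0.5f);
        }
    }
    return dst;
}
```

For the case on the slide the call would be downsampleShadowMap(src, 1024, 4), giving a 256x256 map.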
Volumetric Lighting on SPU: Results • Takes ~11 ms on 1 SPU
Shadow Mapping on SPU (1) • Needs the full-size shadow map • 1024x1024x32-bit == 4 MB: won’t fit in SPU local memory • We’ll have to write that “very good software cache”, then • Pre-process the shadow map on SPU • Calculate min and max depth for each tile • Store in a low resolution depth hierarchy map • Output high resolution shadow map as cache tiles
Shadow Mapping on SPU (2) • Software cache with 32 entries • Each entry is a shadow map tile • Branchless determination of cache entry index for tile index • Locate cache misses early • While detiling depth data – work out required shadow tiles • Pull in all cache-missed tiles • Sample shadow map during lighting calculations • All required shadow tiles are now definitely in cache – lookup is branchless • It’s still quite slow • The tile must be located in the cache per pixel
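A two-phase cache along these lines might look like the C++ sketch below. It is an illustration only: the direct-mapped layout, the 16x16 tile size, and the names ShadowTileCache, prefetch and sample are assumptions, and a std::copy from a vector stands in for the DMA transfer from main memory. It also assumes that no two tiles required in the same batch map to the same entry.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kCacheEntries = 32;       // entry count from the slide
constexpr int kTileTexels   = 16 * 16;  // assumed tile size

struct ShadowTileCache {
    std::array<int, kCacheEntries> tag;  // tile index held per entry, -1 = empty
    std::array<std::array<float, kTileTexels>, kCacheEntries> data;
    int misses = 0;

    ShadowTileCache() { tag.fill(-1); }

    // Branchless mapping from tile index to cache entry (direct-mapped).
    static int entryFor(int tileIndex) { return tileIndex & (kCacheEntries - 1); }

    // Phase 1: resolve every miss up front, long before sampling.
    void prefetch(const std::vector<int>& requiredTiles,
                  const std::vector<float>& mainMemory) {
        for (int tile : requiredTiles) {
            const std::size_t offset = static_cast<std::size_t>(tile) * kTileTexels;
            assert(tile >= 0 && offset + kTileTexels <= mainMemory.size());
            const int e = entryFor(tile);
            if (tag[e] != tile) {  // miss: fetch the tile into its entry
                std::copy(mainMemory.begin() + offset,
                          mainMemory.begin() + offset + kTileTexels,
                          data[e].begin());
                tag[e] = tile;
                ++misses;
            }
        }
    }

    // Phase 2: the tile is known to be resident, so there is no miss check.
    float sample(int tileIndex, int texel) const {
        return data[entryFor(tileIndex)][texel];
    }
};
```

Splitting the work this way keeps the only branches in the prefetch pass, so the per-pixel sampling code is a pair of indexed loads.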
Shadow Mapping on SPU (3) • Optimise via special cases to win back performance • Use the low resolution shadow tile map • Always in SPU local memory • If pixel shadow Z > tile max Z: definitely in shadow • If pixel shadow Z < tile min Z: definitely not in shadow • Check low resolution map before triggering cache fetches • Classify whole screen tiles as in or out of shadow • Don’t need to sample high resolution shadow map at all for those tiles
[Figure: tiles requiring high resolution shadow samples]
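The min/max early-out can be sketched in C++ as follows, assuming larger shadow Z means farther from the light. TileRange, ShadowResult, buildTileRanges and classifyPixel are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct TileRange { float minZ, maxZ; };
enum class ShadowResult { Lit, Shadowed, NeedsSample };

// Build the low resolution min/max map: one TileRange per
// tileSize x tileSize block of the shadow map, in row-major order.
inline std::vector<TileRange> buildTileRanges(const std::vector<float>& shadowMap,
                                              int size, int tileSize) {
    assert(tileSize > 0 && size % tileSize == 0);
    assert(shadowMap.size() == static_cast<std::size_t>(size) * size);
    const int tiles = size / tileSize;
    std::vector<TileRange> ranges;
    ranges.reserve(static_cast<std::size_t>(tiles) * tiles);
    for (int ty = 0; ty < tiles; ++ty) {
        for (int tx = 0; tx < tiles; ++tx) {
            float lo = shadowMap[static_cast<std::size_t>(ty * tileSize) * size + tx * tileSize];
            float hi = lo;
            for (int y = 0; y < tileSize; ++y) {
                for (int x = 0; x < tileSize; ++x) {
                    const float d = shadowMap[
                        static_cast<std::size_t>(ty * tileSize + y) * size + tx * tileSize + x];
                    lo = std::min(lo, d);
                    hi = std::max(hi, d);
                }
            }
            ranges.push_back(TileRange{lo, hi});
        }
    }
    return ranges;
}

// Resolve a pixel from the low resolution map where possible; only the
// NeedsSample case has to touch the high resolution tile.
inline ShadowResult classifyPixel(const TileRange& t, float pixelShadowZ) {
    if (pixelShadowZ > t.maxZ) return ShadowResult::Shadowed;  // behind every occluder
    if (pixelShadowZ < t.minZ) return ShadowResult::Lit;       // in front of every occluder
    return ShadowResult::NeedsSample;
}
```

If every pixel in a screen tile resolves to Lit or Shadowed, no high resolution shadow tile needs to be fetched for that tile.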
Conclusion • New additions to your toolbox: • Tile-based classification techniques on SPU • Deferred lighting on SPU • Texture sampling on SPU • Rendering is no longer just a GPU problem • Use general purpose nature of the SPU to your advantage • Rethink fragment processing optimisation strategies • Make the GPU work smarter, not harder
Conclusion • Some titles are already using SPU post processing • Killzone 2 • PhyreEngine™ is here to help • (If you’re a registered PS3 developer) it’s on DevNet now • Not just an engine: also a reference • Comes with full source • Download it, learn from it, steal bits of the code • Check out the PhyreEngine™ SPU Post Processing Library