Allegorithmic Substance

Allegorithmic Substance Threaded Middleware

Procedural textures on multi-core
• Other than frame rate and features, what else can you do with extra CPU power?
• We’ll look at Allegorithmic’s middleware, Substance.

Procedural textures are valuable for modern games, which:
• Have a LOT of textures.
• Want shorter loading times (faster starts, teleportations or zooms).
• Need to reduce the texture footprint on disc, for download, and/or in RAM.
• Can benefit from more flexible and reusable assets.

Introducing Substance
• In Q2 2007 Allegorithmic started a complete reengineering of ProFX2, both authoring tool and engine, named Substance.
• Unit tests were run very early to ensure that Substance could target streaming.
• Cross-platform: PC, PS3, Xbox, etc.
• Expected linear multi-thread scalability.

What is Substance?
• Substance is a middleware product composed of two elements.
• Substance Authoring Tool lets you:
• create procedural textures
• create texture packages of a few kilobytes!
• A cooker compiles generic data into binaries optimized for a specific platform or user.
• Substance Engine:
• generates bitmap textures on the fly.

Lower FPS?
• More textures, not lower FPS
• Substance consumes idle cycles, not frames
• Graphics bitrates follow Moore's law
• Higher poly count → bigger worlds
• Higher filter rate → larger textures
• Desired texture volume grows faster than RAM
• Streaming is a necessity
• But HDD net bitrate does not keep up. Bottleneck!
• Modern gameplay entails sudden bitrate bursts
• This is worsened by HDD seeks and causes stalls.

No, a stable and high FPS.
• Even when masked, a stall is actually an FPS drop
• Substance works in Random Access Memory
• The gamer zooms or teleports:
• Give 4 cores and a GPU to Substance
• Sacrifice 1 or 2 frames
• Substance generates and caches 1-2M new texels.
• The stall does not hinder gameplay.
• Substance diminishes stalls
• Substance helps to maintain a high FPS.

Performance issue: streaming in games
• DVD or HDD net bitrate is 2 or 6 MB/s, respectively
• Our aim: add a stable 4 MB/s without the GPU
• Requires billions of intermediate pixels/s.
• Can CPUs compete with GPUs?
• Opportunity: cores are still under-exploited in most game engines.
• Texture processing is privileged in the new multi-core architectures.

The architecture was designed with these issues in mind:
• Homogeneous CPU and GPU versions
• Streaming (~1-10 CPU cycles per pixel)
• SIMD & MT for the multi-core generations
• No cache or threading pollution
• Fine-grained jobs and lockless sync.
• Low memory footprint
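
A back-of-the-envelope check of the per-pixel budget above. This is only a sketch: the 3 GHz clock is an assumption (the slides give no clock speed), and the function name is illustrative.

```python
def pixels_per_second(clock_hz, cycles_per_pixel, cores=1):
    """Upper bound on throughput if every cycle does useful pixel work."""
    return clock_hz * cores / cycles_per_pixel

CLOCK_HZ = 3.0e9  # assumed clock speed, not a figure from the slides

best = pixels_per_second(CLOCK_HZ, 1)    # 1 cycle per pixel
worst = pixels_per_second(CLOCK_HZ, 10)  # 10 cycles per pixel
print(f"{worst / 1e9:.1f} to {best / 1e9:.1f} G pixels/s per core")
```

Under that assumption, a budget of 1-10 cycles per pixel is what puts "billions of intermediate pixels/s" within reach of a few cores.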

The theoretical benefit was calculated
• New architectures come with enhanced SIMD. Expected x10 compared to std C++
• Tricks and algorithmic changes could give another x10 on some filters, like DXT
• We were confident that our image processing could be threaded well, partly because we generate textures asynchronously
• Hence the CPU version of ProFX2 could be accelerated by a factor of x25-x100

This is the approach taken to address the issue:
• Simple inner-loop tests showed that optimized SSE2-4 code could give a boost of x10
• Find a data layout coherent with micro-parallelism (SIMD and pipeline), low-level threading, cache and memory handling.
• OpenMP is then used to test strategies before designing a specific MT HAL

Here’s the code that was developed to make this possible:
• A SIMD HAL is ready for PC, Xbox, PS3.
• OpenMP easily gives an 85% MT linearity.
• Our MT HAL is converging towards a model of lockless synchronization, 95% expected.
• The cooker precomputes data that will help synchronization and MT efficiency.
• Our API exposes asynchronous commands. Perfect for sharing cores with a game loop!
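
The asynchronous-command idea in the last bullet can be sketched as follows. This is not the Substance API: `AsyncTextureEngine`, `request`, `generate_texture` and `run_frames` are hypothetical names, and the checkerboard stands in for real procedural generation.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_texture(name, size):
    """Stand-in for procedural generation: an 8x8-cell checkerboard."""
    return [[255 * ((x // 8 + y // 8) % 2) for x in range(size)]
            for y in range(size)]

class AsyncTextureEngine:
    """Hypothetical wrapper that runs generation on worker threads."""

    def __init__(self, workers):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def request(self, name, size):
        # Returns a Future immediately; the caller is never blocked here.
        return self._pool.submit(generate_texture, name, size)

    def shutdown(self):
        self._pool.shutdown(wait=True)

def run_frames(engine, max_frames=100000):
    """Toy game loop: keeps 'rendering' while the texture is generated."""
    pending = engine.request("bricks", 64)
    texture, frames = None, 0
    while texture is None and frames < max_frames:
        frames += 1                  # one rendered frame
        if pending.done():           # non-blocking poll, once per frame
            texture = pending.result()
    if texture is None:              # fallback so the sketch always finishes
        texture = pending.result()
    return texture, frames
```

The game loop only polls, so a late texture costs nothing more than a few frames shown with the previous texture.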

The compositing graph: node-based image processing
• Authoring Tool: non-linear editing
• Engine: efficient high-level structure
• Graph (DAG) contains 3 types of nodes:
• Sources: procedural noise, bitmaps, SVGs
• Filters: blend, HSL, TRS, warp, blur, etc.
• Outputs: coherent diffuse & normal maps, etc.
• Main advantages:
• Libraries, capsules: instantiation of subgraphs
• Complex variants: fast to create and compute
• Dynamic custom branches (e.g. aging textures)
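
A minimal sketch of evaluating such a graph, assuming a plain dictionary representation and Python's standard `graphlib` for the ordering. The node names and the tiny filters are illustrative and are not Substance nodes.

```python
from graphlib import TopologicalSorter

def checker(size=4):                      # source node
    return [[(x + y) % 2 for x in range(size)] for y in range(size)]

def constant(value, size=4):              # source node
    return [[value] * size for _ in range(size)]

def blend(a, b):                          # filter node: 50/50 mix
    return [[(pa + pb) / 2 for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def invert(a):                            # filter node
    return [[1 - p for p in row] for row in a]

# node name -> (function, names of input nodes)
GRAPH = {
    "checker": (checker, []),
    "grey": (lambda: constant(0.5), []),
    "blend": (blend, ["checker", "grey"]),
    "diffuse": (invert, ["blend"]),        # output
    "height": (lambda a: a, ["blend"]),    # second output, same upstream data
}

def evaluate(graph, outputs):
    """Compute every node once, inputs before the nodes that use them."""
    deps = {name: set(inputs) for name, (_, inputs) in graph.items()}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        fn, inputs = graph[name]
        results[name] = fn(*(results[i] for i in inputs))
    return {name: results[name] for name in outputs}
```

Both outputs read the same `blend` result, which is computed once and shared rather than rebuilt per output.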

Threading strategies
• High-level threading:
• Task decomposition: 1 node (filter) per thread
• Graph splitting ensures task independence
• Low-level threading:
• Data decomposition: 1 strip of blocks per thread
• Dispatcher ensures non-conflicting areas
• Pixel-to-pixel filters are concatenated.
• Streamed R/W, no L2 cache pollution
• Temporary blocks in private L1 double buffers
• Intermediate images never allocated
• Lockless, reactive sync; cache friendly
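
The low-level strategy (strips per thread, concatenated pixel-to-pixel filters) can be sketched roughly as below. This illustrates the decomposition only: the helper names are hypothetical, and CPython threads will not show the speedup a native SIMD implementation gets.

```python
from concurrent.futures import ThreadPoolExecutor

def fuse(*pixel_filters):
    """Concatenate pixel-to-pixel filters so the image is traversed once."""
    def fused(p):
        for f in pixel_filters:
            p = f(p)
        return p
    return fused

def split_rows(height, strips):
    """Non-overlapping [start, stop) row ranges that cover the image."""
    step = max(1, -(-height // strips))   # ceiling division
    return [(r, min(r + step, height)) for r in range(0, height, step)]

def process(image, pixel_filter, strips=4):
    out = [None] * len(image)

    def work(bounds):
        start, stop = bounds
        for y in range(start, stop):      # each worker writes only its own rows
            out[y] = [pixel_filter(p) for p in image[y]]

    with ThreadPoolExecutor(max_workers=strips) as pool:
        list(pool.map(work, split_rows(len(image), strips)))
    return out
```

Because the filters are fused, no intermediate image exists between them, and because strips never overlap, workers need no locks on the output.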

Threading subgraphs (1/11): by nodes (high level)

Threading subgraphs (2/11): by nodes, caching

Threading subgraphs (3/11): by nodes

Threading subgraphs (4/11): by strips (low level)

Threading subgraphs (5/11): remove from cache

Threading subgraphs (6/11): by strips

Threading subgraphs (11/11): update cache, and finished

Expect more streaming bandwidth
• Substance generates 4 MB/s of compressed textures
• Add this to classical streaming
• 50+ MB/s loading with 4 cores and 1 GPU

Here’s how close we got to the theoretical best performance:
• DXT compression at 2G pixels/s (the same as high-end GPUs of 2007).
• 8-bit SVG (cooked) rendering at 20G pixels/s; 8G pixels/s with 4-sub-sample anti-aliasing.
• In most cases 4 cores give a x3.8 boost
• Some filters are more problematic, but solutions have been designed in detail and will be implemented between Q2 and Q4 2008.

Here’s the new performance profile:
• Substance and ProFX2 figures are for one core.
• 4 cores: 3.8 times more fill rate.
• ProFX2: SVG GPU
• Substance: SVG CPU
• SVG AA: 2G pixels/s per core

This is future-proof
• The cooker precomputes whatever helps to linearise computations.
• Scalable code: SSE4 added in one day thanks to the SIMD HAL
• Scalable threading: our two strategies scale
• A few functions dispatch virtual CPU "shaders"
• 64-core ready ↔ code a new dispatcher?
• Multiplatform design.
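
One rough way to reason about the 64-core question is Amdahl's law. That model is an assumption brought in here for illustration, not one stated in the slides; only the x3.8 on 4 cores comes from the measurements.

```python
def parallel_fraction(speedup, cores):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) to recover p."""
    return (1 - 1 / speedup) / (1 - 1 / cores)

def amdahl_speedup(p, cores):
    """Speedup predicted by Amdahl's law for parallel fraction p."""
    return 1 / ((1 - p) + p / cores)

p = parallel_fraction(3.8, 4)   # x3.8 measured on 4 cores
print(f"parallel fraction: {p:.3f}")
print(f"projected 64-core speedup: {amdahl_speedup(p, 64):.1f}x")
```

If the serial part stayed fixed, the projection would be roughly x30 on 64 cores rather than x64, which is one reading of why a new dispatcher might be needed at that scale.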

What’s next?

Procedural diffuse map

Coherent procedural normal map

Complex procedural environment map

This scene is made entirely of procedural textures

Future sources of bandwidth
• SIMD code can be better pipelined in ASM.
• Our cooker can optimize a lot of things.
• The authoring tool will have an RT profiler
• Artists gaining experience with Substance will also optimize their packages better.
• Artist feedback will also help us to improve the expressiveness of each filter
• ~30-50 filters per texture, the main performance divisor.

Here’s how you can best take advantage of procedural textures
• Anticipate texture generation requests.
• Predict visibility (HOM, PVS).
• Create mipmaps. Access levels just in time (JIT).
• Cache the useful texels.
• Adapt texture resolution to workload.
• Use texture variants and fewer tiling or detail textures. Show a higher texel/pixel ratio.
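
Two of the tips above, just-in-time mip levels and caching, can be sketched together. This is a sketch under assumptions: `mip_level` and `TexelCache` are hypothetical names, textures are assumed square with power-of-two sizes, and a least-recently-used policy is just one possible choice.

```python
import math
from collections import OrderedDict

def mip_level(texture_size, screen_pixels):
    """Coarsest mip level that still gives about one texel per screen pixel."""
    coarsest = int(math.log2(texture_size))
    if screen_pixels <= 0:
        return coarsest
    level = math.floor(math.log2(texture_size / screen_pixels))
    return max(0, min(level, coarsest))

class TexelCache:
    """Least-recently-used cache of generated mip levels."""

    def __init__(self, capacity, generate):
        self.capacity = capacity          # assumed >= 1
        self.generate = generate          # callback: (name, level) -> texels
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, name, level):
        key = (name, level)
        if key in self.entries:
            self.entries.move_to_end(key)         # mark as recently used
            return self.entries[key]
        self.misses += 1
        value = self.generate(name, level)        # generated only on demand
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        return value
```

A renderer would call `cache.get(name, mip_level(size, on_screen_pixels))` for each visible surface, so distant objects only ever trigger cheap, low-resolution generation.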

What do you think?
• Have you tried something like this?
• Have you rejected trying something like this?