
Optimizing Game Architectures with Task-based Parallelism


Presentation Transcript


  1. Optimizing Game Architectures with Task-based Parallelism Brad Werth Intel Senior Software Engineer

  2. Parallelism in games is no longer optional The unending quest for realism in games is causing game content and gameplay to become increasingly complex. More complicated scenes + more complicated behavior = increased computation. CPUs and GPUs are no longer competing on clock speed, but on degree of parallelism. High-end games require threading. You can't go home again.

  3. Threaded architectures for games are challenging to design Techniques for threading individual computations/systems are well-known, but... • the techniques often have inefficient interactions. • games rely on middleware to provide some functionality – more potential conflict. • the moment-to-moment workload can change dramatically. • the variety of CPU topologies complicates tuning. Task-based parallelism is a viable way out of this mess. But first, let's gaze into the abyss...

  4. A threaded game architecture – full of pain and oversubscription [Diagram: Particles, Jobs, Bones, Physics, Sound systems, each with its own unknown number of threads, adding up to who-knows-what] • Particles array could be partitioned. • One-off jobs run on "job threads". • Physics threads are created by middleware. • Sound mixing is on a dedicated thread. • Bones/skinning is a Directed Acyclic Graph.

  5. Tasks are an efficient method for handling all of this parallel work With a task scheduler, and with all of this work decomposed into tasks, then... • one thread pool can process all work. • oversubscription will be avoided by using the same threads for all parallel work. • the game will scale well to different core topologies without painful tuning. Tasks can do it!

  6. Task-based parallelism is agile threading A thread is... • run on the OS. • able to be pre-empted. • expected to wait on events. • most efficient with some oversubscription. • optimized for a specific core topology. A task is... • run on a thread pool. • run to completion. • heavily penalized for blocking. • efficient by avoiding oversubscription. • able to adapt to any number of threads/cores.

  7. Tasks are the uninterrupted portions of threaded work [Diagram: threaded work – Texture Lookup, Processing, Setup, Data Parallelism – broken at its interruption points into tasks]

  8. Tasks can be arranged in a dependency graph [Diagram: the same tasks – Texture Lookup, Processing, Setup, Data Parallelism – connected as a dependency graph]

  9. Dependency graph can be mapped to a thread pool Lots of work means lots of tasks which fill in the gaps in the thread pool. The decomposition of tasks and mapping to threads is the job of the task scheduler.

  10. Task schedulers have similar ingredients but different flavors Cilk scheduler has been extremely influential. Most have task queues per thread to avoid contention (often multiple queues per thread). Cache-aware distribution of work is a key performance feature. Most prevent direct manipulation of queues. The APIs vary in some ways: • Constructive schedulers define tasks a priori. • Reductive schedulers subdivide tasks in flight. • Event-driven schedulers trigger off of I/O. • Computation schedulers are triggered manually.

  11. Threading Building Blocks is Intel's Open Source task-based scheduler TBB is a reductive, computation scheduler designed to... • be cross-platform (Windows*, OS X*, Linux, Xbox 360*). • simplify data parallelism coding. • provide scalability and high performance. TBB has a high-level API for easy parallelism and a low-level API for control. The API is not so low-level that it exposes threads or queues. *Other names and brands may be claimed as the property of others.

  12. Enough! Let's look at code This talk shows code solutions to threaded game architecture problems. Common threading patterns in games are decomposed into tasks, using the TBB API. The code is available: http://software.intel.com/file/14997

  13. Start with the easy stuff – turn independent loops into tasks The TBB high-level API provides parallel_for(). Behold, the humble for loop: for(int i = 0; i < ELEMENT_MAX; ++i) { doSomethingWith(element[i]); }

  14. Using parallel_for() is a 2-step process; step 1 is objectify the loop class DoSomethingContext { public: void operator()( const tbb::blocked_range<int> &range ) const { for(int i = range.begin(); i != range.end(); ++i) { doSomethingWith(element[i]); } } };

  15. parallel_for() step 2: invoke the objectified loop with a range tbb::parallel_for( tbb::blocked_range<int>(0, ELEMENT_MAX), *pDoSomethingContext ); For more general task decomposition problems, we need a low-level API...
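Putting both steps together, here is a self-contained sketch of the pattern (a minimal guess at the surrounding code: element, ELEMENT_MAX, and doSomethingWith() are hypothetical stand-ins for the game's per-element work, and classic TBB headers are assumed):

    // Minimal sketch of slides 13-15, assuming classic TBB (task_scheduler_init era).
    #include "tbb/task_scheduler_init.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    const int ELEMENT_MAX = 10000;
    static float element[ELEMENT_MAX];

    static void doSomethingWith(float &value) { value *= 2.0f; }

    // Step 1: objectify the loop body. parallel_for requires a const operator().
    class DoSomethingContext {
    public:
        void operator()( const tbb::blocked_range<int> &range ) const {
            for(int i = range.begin(); i != range.end(); ++i) {
                doSomethingWith(element[i]);
            }
        }
    };

    int main() {
        tbb::task_scheduler_init init;   // creates the thread pool (classic TBB)
        DoSomethingContext body;
        // Step 2: invoke the objectified loop over the full index range.
        tbb::parallel_for( tbb::blocked_range<int>(0, ELEMENT_MAX), body );
        return 0;
    }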

  16. TBB low-level API: work trees with blocking calls and/or continuations [Diagram: work trees – roots spawn tasks, tasks spawn more tasks; blocking calls (spawn & wait) travel down the tree, continuations travel up]
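To make the blocking style concrete before the later patterns, here is a hedged sketch of the basic work-tree idiom they all build on (ChildTask and doChildWork() are hypothetical placeholders):

    // Sketch of the basic blocking work-tree pattern (classic TBB low-level task API).
    #include "tbb/task.h"

    static void doChildWork(int iChild) { /* ... some parallel work ... */ }

    class ChildTask : public tbb::task {
    public:
        explicit ChildTask(int iChild) : m_iChild(iChild) {}
        tbb::task *execute() {          // runs to completion on a pool thread
            doChildWork(m_iChild);
            return NULL;
        }
    private:
        int m_iChild;
    };

    void runTwoChildrenAndWait() {
        // Root is an empty task that exists only to be waited on.
        tbb::task *pRoot = new( tbb::task::allocate_root() ) tbb::empty_task;
        // Ref count = number of children + 1 for the wait itself.
        pRoot->set_ref_count(3);
        tbb::task *pA = new( pRoot->allocate_child() ) ChildTask(0);
        tbb::task *pB = new( pRoot->allocate_child() ) ChildTask(1);
        pRoot->spawn(*pA);
        pRoot->spawn_and_wait_for_all(*pB);   // the blocking call goes "down" the tree
        pRoot->destroy(*pRoot);               // the empty_task never runs, so destroy it
    }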

  17. Work trees can implement common game threading patterns The TBB low-level API creates and processes trees of work – each node is a task. Work trees of tasks can be made to process: • Callbacks • Promises • Synchronized callbacks • Long, low priority operations • Directed acyclic graph We'll look at how these patterns can be decomposed into tasks using the TBB low-level API.

  18. Callbacks – send it off and never wait Callbacks are function pointers executed on another thread. Execution begins immediately. No waiting on individual callbacks - can wait in aggregate. void doCallback(FunctionPointer fFunc, void *pParam);

  19. Code and tree: Callback [Tree: a shared root; each callback is allocated as an additional child of the root and spawned] void doCallback(FunctionPointer fCallback, void *pParam) { // allocation with "placement new" syntax CallbackTask *pCallbackTask = new( s_pCallbackRoot->allocate_additional_child_of( *s_pCallbackRoot ) ) CallbackTask(fCallback, pParam); s_pCallbackRoot->spawn(*pCallbackTask); }
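Slide 19 assumes a CallbackTask type and a shared root task that are not shown. One plausible shape for those supporting pieces, sketched here as an assumption rather than the sample's actual code:

    // Hedged sketch of the pieces slide 19 assumes: a function-pointer task type and
    // a long-lived root that all callbacks attach to, so they can be waited on in aggregate.
    #include "tbb/task.h"

    typedef void (*FunctionPointer)(void *);

    class CallbackTask : public tbb::task {
    public:
        CallbackTask(FunctionPointer fCallback, void *pParam)
            : m_fCallback(fCallback), m_pParam(pParam) {}
        tbb::task *execute() {
            m_fCallback(m_pParam);   // run the user's function on a pool thread
            return NULL;
        }
    private:
        FunctionPointer m_fCallback;
        void *m_pParam;
    };

    static tbb::task *s_pCallbackRoot = NULL;

    void initCallbackSystem() {
        s_pCallbackRoot = new( tbb::task::allocate_root() ) tbb::empty_task;
        s_pCallbackRoot->set_ref_count(1);   // the extra count keeps the root alive for waiting
    }

    // Waiting "in aggregate": returns when every callback issued so far has finished.
    void waitForAllCallbacks() {
        s_pCallbackRoot->wait_for_all();
        s_pCallbackRoot->set_ref_count(1);   // re-arm the root for the next batch
    }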

  20. Callbacks are simple and powerful, but have limits No waiting! Callbacks are run on demand. No waiting? Callback has to report its own completion. No waiting?! Need special case code to run on 1-core system. If this is a deal-breaker, there are other options...

  21. Promises – come back for it later [Tree: each promise gets its own root; the promise task spawns beneath it, and the caller can wait on the root] Promises are an evolution of Callbacks. Like Callbacks: • Promises are function pointers executed on another thread. • Execution begins immediately. Unlike Callbacks: • Promises provide a method for efficient waiting. Promise *doPromise(FunctionPointer fFunc, void *pParam);

  22. Code and tree: Promise setup void doPromise(FunctionPointer fCallback, void *pParam, Promise *pPromise) { // allocation with "placement new" syntax tbb::task *pParentTask = new( tbb::task::allocate_root() ) tbb::empty_task(); pPromise->setRoot(pParentTask); PromiseTask *pPromiseTask = new( pParentTask->allocate_child() ) PromiseTask(fCallback, pParam, pPromise); pParentTask->set_ref_count(2); pParentTask->spawn(*pPromiseTask); }
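The setup code assumes a Promise class and a PromiseTask type that the slide does not show; here is a hedged sketch of shapes that make it work (the real sample may differ):

    // Hedged sketch of the types slide 22 assumes.
    #include "tbb/task.h"
    #include "tbb/spin_mutex.h"

    typedef void (*FunctionPointer)(void *);

    class Promise {
    public:
        Promise() : m_pRoot(NULL) {}
        void setRoot(tbb::task *pRoot) { m_pRoot = pRoot; }
        void waitUntilDone();            // shown on the next slide
    private:
        tbb::task *m_pRoot;              // parent task the caller can wait on
        tbb::spin_mutex m_tMutex;        // guards the root against a double destroy
    };

    class PromiseTask : public tbb::task {
    public:
        PromiseTask(FunctionPointer fCallback, void *pParam, Promise *pPromise)
            : m_fCallback(fCallback), m_pParam(pParam), m_pPromise(pPromise) {}
        tbb::task *execute() {
            m_fCallback(m_pParam);       // do the promised work
            return NULL;                 // parent's ref count drops to 1 => done
        }
    private:
        FunctionPointer m_fCallback;
        void *m_pParam;
        Promise *m_pPromise;
    };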

  23. Code and tree: Promise execution void Promise::waitUntilDone() { if(m_pRoot != NULL) { tbb::spin_mutex::scoped_lock tLock(m_tMutex); if(m_pRoot != NULL) { m_pRoot->wait_for_all(); m_pRoot->destroy(*m_pRoot); m_pRoot = NULL; } } }
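A typical call site for the slide-22 doPromise() might look like this (doPhysicsStep and the surrounding frame code are hypothetical; the Promise declarations are those sketched above):

    // Hypothetical usage of the Promise pattern from slides 21-23.
    static void doPhysicsStep(void *pParam) { /* ... expensive work ... */ }

    void gameFrame() {
        Promise tPhysicsPromise;
        doPromise(doPhysicsStep, NULL, &tPhysicsPromise);   // kicks off immediately

        // ... other frame work runs in parallel here ...

        // Blocks only if the physics step has not finished yet; if it does block,
        // this thread helps execute outstanding tasks instead of sleeping.
        tPhysicsPromise.waitUntilDone();
    }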

  24. Promises seem almost too good to be true Blocking wait only if result is not available when requested. If wait blocks, the current thread actively contributes to completion. 2 files, 3 classes, ~150 lines of code. Robust Promise systems can also: • Cancel jobs in progress • Get partial progress updates

  25. Synchronized Call – wait until all threads call it exactly once [Tree: a root spawns one task per worker thread and waits; each task runs the callback, then tests and waits until all the others have run it] Synchronized Calls can be useful for: • Initialization of thread-specific data • Coordination with some middleware • Instrumentation and profiling Trivial if you have direct access to threads, but trickier with a task-based system. void doSynchronizedCallback( FunctionPointer fFunc, void *pParam);

  26. Code and tree: Synchronized Call setup void doSynchronizedCallback(FunctionPointer fCallback, void *pParam, int iThreads) { tbb::atomic<int> tAtomicCount; tAtomicCount = iThreads; tbb::task *pRootTask = new(tbb::task::allocate_root()) tbb::empty_task; tbb::task_list tList; for(int i = 0; i < iThreads; i++) { tbb::task *pSynchronizeTask = new( pRootTask->allocate_child() ) SynchronizeTask(fCallback, pParam, &tAtomicCount); tList.push_back(*pSynchronizeTask); } pRootTask->set_ref_count(iThreads + 1); pRootTask->spawn_and_wait_for_all(tList); pRootTask->destroy(*pRootTask); }
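The setup code assumes a SynchronizeTask type; here is a hedged sketch consistent with the execute() shown on the next slide:

    // Hedged sketch of the SynchronizeTask type slides 26-27 assume.
    #include "tbb/task.h"
    #include "tbb/atomic.h"

    typedef void (*FunctionPointer)(void *);

    class SynchronizeTask : public tbb::task {
    public:
        SynchronizeTask(FunctionPointer fCallback, void *pParam,
                        tbb::atomic<int> *pAtomicCount)
            : m_fCallback(fCallback), m_pParam(pParam), m_pAtomicCount(pAtomicCount) {}

        tbb::task *execute();            // shown on the next slide

    private:
        FunctionPointer m_fCallback;
        void *m_pParam;
        tbb::atomic<int> *m_pAtomicCount;   // counts tasks that have not yet run the callback
    };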

  27. Code and tree: Synchronized Call execution tbb::task *SynchronizeTask::execute() { m_fCallback(m_pParam); m_pAtomicCount->fetch_and_decrement(); while(*m_pAtomicCount > 0) { // yield while waiting tbb::this_tbb_thread::yield(); } return NULL; }

  28. Synchronized Calls are useful, but not efficient Don't make Synchronized Calls in the middle of other work – every task spins until the last thread arrives. The performance penalty is negated only if the work queue is already empty, since the threads would otherwise be idle.

  29. Long, low priority operation – hide some time-slicing [Tree: a parent root; ordinary tasks test-and-clear a shared flag and spawn the low-priority task as an extra child; the low-priority task sets the flag again] Many games have long operations that run in parallel to the main computation: • Asset loading/decompression • Sound mixing • Texture tweaking • AI pathfinding It's not necessary to create a new thread to handle these operations! Use the time-honored technique of time-slicing.

  30. Code and tree: Long, low priority operation tbb::task *BaseTask::execute() { if(s_tLowPriorityTaskFlag.compare_and_swap(false, true) == true) { // allocation with "placement new" syntax tbb::task *pLowPriorityTask = new( this->allocate_additional_child_of( *s_pLowPriorityRoot ) ) LowPriorityTask(); spawn(*pLowPriorityTask); } // spawn other children ... }
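The slide does not show the low-priority task itself. Given the warning on the next slide that such a task cannot naively respawn itself, one plausible shape (an assumption, not the sample's actual code) is to do a bounded slice of work and re-arm the shared flag so a future BaseTask respawns it:

    // Hedged sketch of a time-sliced low-priority task, consistent with the flag
    // logic on slide 30. doSliceOfLowPriorityWork() is a hypothetical placeholder
    // that returns true while more work remains.
    #include "tbb/task.h"
    #include "tbb/atomic.h"

    tbb::atomic<bool> s_tLowPriorityTaskFlag;   // shared flag from slide 30 (a global here for the sketch)

    static bool doSliceOfLowPriorityWork() { /* ... a few milliseconds of work ... */ return false; }

    class LowPriorityTask : public tbb::task {
    public:
        tbb::task *execute() {
            bool bMoreWork = doSliceOfLowPriorityWork();
            if(bMoreWork) {
                // Don't respawn ourselves (that could starve real work on a quiet
                // pool); just re-arm the flag so the next BaseTask spawns a new slice.
                s_tLowPriorityTaskFlag = true;
            }
            return NULL;
        }
    };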

  31. Long, low priority operations are tricky to get right Task-based schedulers won't swap out a task that runs a long time. A low-priority task can't naively reschedule itself, or it will create an infinite loop. Even if the scheduler is designed with priority in mind, priority only matters when a thread runs dry. This approach doesn't guarantee any minimum frequency of execution.

  32. Directed Acyclic Graph – everyone's favorite paradigm Directed Acyclic Graphs are popular for executing workflows and kernels in games. Interfaces vary, but the general pattern is: construct a DAG, then execute it and wait. How can work trees represent a DAG?

  33. Tree: Directed Acyclic Graph [Diagram: the DAG expressed as work trees – several roots, each spawning further tasks, with spawn edges linking the trees together]
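The talk points to the downloadable sample for the DAG code; as a stand-in, here is a hedged sketch of one way a work-tree scheduler can drive a DAG using per-node predecessor counts (an assumption built on the classic TBB ref-count API, not necessarily the sample's approach):

    // Hedged sketch: executing a DAG of function-pointer nodes with classic TBB tasks.
    // Each node is an additional child of a dummy root (so the caller can wait on the
    // whole graph); a node spawns a successor only when its last predecessor finishes.
    #include "tbb/task.h"
    #include "tbb/atomic.h"
    #include <vector>

    typedef void (*FunctionPointer)(void *);

    class DagNodeTask : public tbb::task {
    public:
        DagNodeTask(FunctionPointer fWork, void *pParam)
            : m_fWork(fWork), m_pParam(pParam) { m_tPredecessorCount = 0; }

        void addSuccessor(DagNodeTask *pSuccessor) {
            m_tSuccessors.push_back(pSuccessor);
            ++pSuccessor->m_tPredecessorCount;
        }

        tbb::task *execute() {
            m_fWork(m_pParam);
            // Release successors; the one whose last input just completed gets spawned.
            for(size_t i = 0; i < m_tSuccessors.size(); ++i) {
                if(m_tSuccessors[i]->m_tPredecessorCount.fetch_and_decrement() == 1)
                    spawn(*m_tSuccessors[i]);
            }
            return NULL;
        }

        bool isSource() const { return m_tPredecessorCount == 0; }

    private:
        FunctionPointer m_fWork;
        void *m_pParam;
        tbb::atomic<int> m_tPredecessorCount;       // unfinished inputs
        std::vector<DagNodeTask*> m_tSuccessors;    // outgoing edges
    };

    // Nodes are allocated as additional children of a dummy root (an empty_task from
    // allocate_root() with set_ref_count(1), as in the Callback example).
    DagNodeTask *makeDagNode(tbb::task *pRoot, FunctionPointer fWork, void *pParam) {
        return new( pRoot->allocate_additional_child_of(*pRoot) ) DagNodeTask(fWork, pParam);
    }

    // Wire up edges with addSuccessor(), spawn every source node, then wait on the root.
    void executeDag(const std::vector<DagNodeTask*> &tNodes, tbb::task *pRoot) {
        for(size_t i = 0; i < tNodes.size(); ++i) {
            if(tNodes[i]->isSource())
                pRoot->spawn(*tNodes[i]);
        }
        pRoot->wait_for_all();   // returns when every node has run
    }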

  34. Directed Acyclic Graph gets the job done The DAGs created by this approach are destroyed by waiting on them. Persistent DAGs are possible, for re-use across several frames. A scheduler could be DAG-based to begin with, making this trivial. Remember, get the code from: http://software.intel.com/file/14997

  35. Soon, rendering may also be decomposable into tasks DirectX* 11 is designed for use on multi-core CPUs. Multiple threads can draw to local DirectX contexts ("devices"), and those draw calls are aggregated once per frame. All those draw calls can be done as tasks! All the threads can be initialized with a DirectX context using Synchronized Calls! This is an extremely positive development; Intel will produce lots of samples to help promote it to the industry. *Other names and brands may be claimed as the property of others.
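As a rough illustration of the deferred-context model described above (a hedged sketch against the D3D11 API, not code from the talk), per-task recording and once-per-frame submission could look like this:

    // Hedged sketch: each task records into its own D3D11 deferred context and
    // returns a command list; the render thread replays the lists on the immediate
    // context once per frame.
    #include <d3d11.h>

    // Record some draw work from a task. In practice the deferred context would be
    // created once per worker thread (e.g. via a Synchronized Call), not per call.
    ID3D11CommandList *recordDrawTask(ID3D11Device *pDevice)
    {
        ID3D11DeviceContext *pDeferred = NULL;
        ID3D11CommandList *pCommandList = NULL;
        if(FAILED(pDevice->CreateDeferredContext(0, &pDeferred)))
            return NULL;

        // ... issue draw calls on pDeferred exactly as on the immediate context ...

        pDeferred->FinishCommandList(FALSE, &pCommandList);
        pDeferred->Release();
        return pCommandList;
    }

    // Once per frame, the render thread submits the recorded lists in order.
    void submitCommandList(ID3D11DeviceContext *pImmediate, ID3D11CommandList *pList)
    {
        pImmediate->ExecuteCommandList(pList, FALSE);
        pList->Release();
    }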

  36. Our sample architecture can be handled by tasks top-to-bottom [Diagram: the Particles, Jobs, Bones, Physics, Sound systems from slide 4, now all checked off against a single thread pool] • Particles partitioning handled by parallel_for(). • One-off jobs using Callbacks or Promises. • Physics uses job threads via Synchronized Calls. • Sound mixing is time-sliced as a Low-Priority job. • Bones/skinning DAG uses the job threads, too.

  37. TBB has other helpful features we didn't cover Beyond the high-level and low-level threading APIs, TBB has: • Atomic variables • Scalable memory allocators • Efficient thread-safe containers (vector, hash, etc.) • High-precision time intervals • Core count detection • Tunable thread pool sizes • Hardware thread abstraction

  38. Using task parallelism will ensure continued game performance Task-based parallelism scales performance on varying architectures. Break loops into tasks for the maximum performance benefit. Use tasks to implement a game's preferred threading paradigms.

  39. Want more? Here's more http://www.threadingbuildingblocks.org bradley.j.werth@intel.com
