Physics in Parallel: Simulation on 7th Generation Hardware David Wu Pseudo Interactive
Why are we here? • The 7th generation is approaching. • We are no longer next gen • We are all scrambling to adapt to the new stuff, so that we can stay • on the bleeding edge • and push the envelope • and take things to the next level.
What’s Next Gen? • Multiple Processors • not entirely new, but more than before. • Parallelism • not entirely new, but more than before. • Physics • not entirely new, but more than before.
Take-Away • So much to cover • General Principles • Useful Concepts • Techniques • Tips • Bad Jokes • Goal is to save you time during the transition to… • Next Gen
Format for presentation • Every year we discover new ways to communicate information.
Patterns • A description of a recurrent problem and of the core of possible solutions • Difficult to write • Too pretentious • Inviting criticism
Gems • Valuable bits of information • Too 6th Gen
Blog • Free Form • Continuity not required • Subjective/opinionated is okay • Arbitrary Tangents are okay • Catchy Title need not match article • No quality bar • This sounds 7th Gen to me.
Disclaimer • My information sources: • press releases • patents • other blogs on the net • random probabilistic guesses • Much of the information is probably wrong.
1-Mar-05 Multi-threaded programming • I participated in some in-depth discussions on this topic; after weeks of debate, the conclusion was: • “Multi-threaded programming is hard”
2-Mar-05 What is 7th Gen Hardware? • Fast • Many parallel processors • Very High peak Flops • In order execution
2-Mar-05 What is 7th Gen Hardware? • High memory latency • Not enough Bandwidth • Moderate clock speed improvements • Not enough Memory • CPU-GPU convergence
3-Mar-05 Hardware usually sucks • Is Multi-Processor Revolutionary? • It is kind of here already • Hyper-Threading • Dual Processor • Sega Saturn • not entirely new, but more than before.
3-Mar-05 Hardware usually sucks • Hardware advances require years of preparatory hype: • 3D Accelerators • Online • SIMD • “Not with a bang but with a whimper”
3-Mar-05 Hardware usually sucks • The big problem with hardware advances is software. • We don’t like to do things that are hard. • If there is a big enough payoff we do it. • This time there is a big enough payoff.
4-Mar-05 Types of Parallelism • Task Parallelism • Render+physics • Data Parallelism • collision detection on two objects at a time • Instruction Parallelism • multiple elements in a vector • Use all three
5-Mar-05 Techniques • Pipeline • Work Crew • Forking
5-Mar-05 Pipeline – Task Parallelism • Subdivide problem into discrete tasks • Solve tasks in parallel, spreading them across multiple processors.
5-Mar-05 Pipeline – Task Parallelism • Thread 0: collision detection (Frame 3, then Frame 4) • Thread 1: Logic/AI (Frame 2, then Frame 3) • Thread 2: Integration (Frame 1, then Frame 2)
5-Mar-05 Pipeline • Similar to CPU/GPU parallelism • CPU: Frame 3, then Frame 4 • GPU: Frame 2, then Frame 3
5-Mar-05 Pipeline: notes • Dependencies explicit • Communication explicit • e.g. through a FIFO (a sketch follows these notes) • Avoids deadlock issues • Avoids most race conditions • Load balancing is not great • Does not reduce latency vs. the single threaded case
5-Mar-05 Pipeline: notes • Feedback between tasks is difficult • Best for open loop tasks • Secondary dynamics, e.g. a pony tail • Effects • Suitable for specialized hardware, because task requirements are cleanly divided.
5-Mar-05 Pipeline: notes • Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design. • Adds bandwidth overhead and memory use overhead to SMP systems that would otherwise communicate via the cache.
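Since the notes above lean on FIFO communication between stages, here is a minimal sketch of the kind of queue one pipeline stage could push into while the next stage pops. This is illustrative only, not the engine's code; StageQueue and Work are hypothetical names, and a shipping version would likely be bounded or lock-free.

```cpp
// Minimal sketch of inter-stage FIFO communication (hypothetical names).
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename Work>
class StageQueue {
public:
    void push(Work w) {                  // called by the upstream stage
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(w));
        }
        cv_.notify_one();                // wake the downstream stage
    }
    Work pop() {                         // called by the downstream stage;
        std::unique_lock<std::mutex> lock(m_);   // blocks until work arrives
        cv_.wait(lock, [this] { return !q_.empty(); });
        Work w = std::move(q_.front());
        q_.pop();
        return w;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Work> q_;
};
```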
5-Mar-05 Work Crew • Component-wise division of the system: • Collision Detection • Integration • Particle System • Fluid Simulation • Audio • Rendering • AI/Logic • IO
5-Mar-05 Work Crew – Task Parallelism • Similar to pipeline but without explicit ordering. • Dependencies are handled on a case by case basis. • e.g. particles that do not affect game play might not need to be deterministic, so they can run without explicit synchronization. • Components without interdependencies can run asynchronously, e.g. kinematics and AI.
5-Mar-05 Work Crew • Suitable for some external processes such as IO, Gamepad, Sound, Sockets. • Suitable for decoupled systems: • particle simulations that do not affect game play • Fluid dynamics • Visual damage simulation • Cloth simulation
5-Mar-05 Work Crew • Scalability is limited by the number of discrete tasks • Load balancing is limited by the asymmetric nature of the components and their requirements. • Higher risk of deadlocks • Higher risk of race conditions
5-Mar-05 Work Crew • May require double buffering of some data to avoid race conditions. • Poor data coherency • Good code coherency
5-Mar-05 Forking – Data Parallelism • Perform the same task on multiple objects in parallel. • Thread “forks” into multiple threads across multiple processors • All threads repeatedly grab pending objects indiscriminately and execute the task on them • When finished, threads combine back into the original thread.
5-Mar-05 Forking • Fork: Object A on Thread 2, Object B on Thread 0, Object C on Thread 1 • Combine
5-Mar-05 Forking • Task assignment can often be done using simple interlocked primitives: • e.g. int i = InterlockedIncrement(&nextTodo); • OpenMP adds compiler support for this via pragmas
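A sketch of that task-assignment loop, using std::atomic as a portable stand-in for InterlockedIncrement. Object and doTask are hypothetical names; note that fetch_add returns the pre-increment value, whereas InterlockedIncrement returns the post-increment one.

```cpp
// Forked workers grabbing pending objects via an interlocked counter.
#include <atomic>
#include <vector>

struct Object { /* per-object simulation state */ };
void doTask(Object& o);                  // the per-object work, defined elsewhere

std::atomic<int> nextTodo{0};            // shared cursor into the todo list

void workerThread(std::vector<Object*>& objects) {
    for (;;) {
        int i = nextTodo.fetch_add(1);   // atomic claim: no two threads get the same i
        if (i >= (int)objects.size())
            break;                       // every object has been claimed
        doTask(*objects[i]);
    }
}

// The OpenMP equivalent hides the counter behind a pragma:
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) doTask(*objects[i]);
```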
5-Mar-05 Forking • Externally Synchronous • external callers don’t have to worry about being thread safe • thread safety requirements are limited to the scope of the code within the forked section. • This is a big deal. • good for isolated engine components and middleware
5-Mar-05 Forking – Example • AI running in thread 0 • AI calls RayQuery() for a line-of-sight check • RayQuery forks into 6 threads, computes the ray query, and then returns the results through thread 0 • AI, running in thread 0, uses the result.
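A sketch of what makes such a query externally synchronous. The real RayQuery forks into 6 threads; this hypothetical version simplifies by launching one job per root child of the 8-wide tree and joining before returning, so the AI caller never has to be thread safe.

```cpp
// Externally synchronous forked ray query (hypothetical names and layout).
#include <future>
#include <vector>

struct Ray  { /* origin, direction, length */ };
struct Node;                              // k-DOP tree node, defined elsewhere
struct Hit  { float t = 1e30f; /* prim, normal, ... */ };

Hit traverseSubtree(const Node* n, const Ray& r);   // sequential traversal

Hit RayQuery(const Node* const rootChildren[8], const Ray& ray) {
    std::vector<std::future<Hit>> jobs;
    for (int c = 0; c < 8; ++c)           // fork: one job per root child
        jobs.push_back(std::async(std::launch::async,
                                  traverseSubtree, rootChildren[c], ray));
    Hit best;
    for (auto& j : jobs) {                // combine: keep the nearest hit
        Hit h = j.get();
        if (h.t < best.t) best = h;
    }
    return best;                          // the caller sees a plain synchronous call
}
```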
5-Mar-05 Forking • Minimizes Latency for a given task • Good data and code coherency • Potentially high synchronization overhead, depending on the coupling. • Highly scalable if you have many tasks with few dependencies • Ideal for Collision detection.
5-Mar-05 Forking - Batches • Reduces inter-thread communication • Reduces the potential for load balancing • Improves instruction level parallelism • Fork: Objects 0..10 on Thread 0, Objects 11..20 on Thread 1, Objects 21..30 on Thread 2 • Combine
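The same worker loop as before, but claiming a batch per interlocked operation instead of a single object. kBatch is a hypothetical tuning constant; Object and doTask are as in the earlier sketch.

```cpp
// Batched task grabbing: fewer interlocked operations, coherent runs of objects.
#include <algorithm>
#include <atomic>
#include <vector>

const int kBatch = 10;                   // hypothetical batch size
std::atomic<int> nextTodo{0};

void workerThread(std::vector<Object*>& objects) {
    for (;;) {
        int first = nextTodo.fetch_add(kBatch);     // one claim per batch
        if (first >= (int)objects.size())
            break;
        int last = std::min(first + kBatch, (int)objects.size());
        for (int i = first; i < last; ++i)          // contiguous run keeps data
            doTask(*objects[i]);                    // hot and exposes ILP
    }
}
```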
6-Mar-05 Our Approach • 1) Collision Detection: Forked • 2) AI/Logic: Single threaded • 2a) Engine calls: Forked • 2b) Damage Effects: Contractor Queue, all extra threads • 3) Integration: Forked • 4) Rendering: Forked/Pipeline • Audio: Whatever
7-Mar-05 Multithreaded programming is Hard • Solutions that directly expose multiple threads to leaf code are a bad idea. • Sequential, single threaded, synchronous code is the fastest to write and debug • In order to meet schedules most leaf code will stay this way.
7-Mar-05 Notes on Collision detection • All collision prims are stored in a global search tree. • Bounding k-DOP tree with 8 children per node. • The most common case is when 0 or 1 children need to be traversed • 8 children results in fewer branches • 8 children allows better prefetching
7-Mar-05 Collision detection • Each moving object is a “task” • Each object is independently queried vs. all other objects in the tree. • Results are output to a global list of contacts and collisions • To avoid duplicates, moving object vs. moving object collisions are only processed if the active moving object’s memory address is less than or equal to the other object’s.
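A sketch of that duplicate-avoidance test (hypothetical names; the address comparison is exactly the ordering trick described above):

```cpp
// Each moving-vs-moving pair is processed by only one of the two queries.
struct Object { /* collision state */ };
bool isMoving(const Object* o);          // defined elsewhere

void processPair(Object* active, Object* other) {
    if (isMoving(other) && !(active <= other))
        return;                          // the other object's query owns this pair
    // ... narrow-phase tests, append contacts to the global list ...
}
```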
7-Mar-05 Collision detection • Threads pop objects off of the todo list one by one using interlocked access until they are all processed. • Each query takes O(lg N) time. • Very little data contention • output operations are rare and quick • task allocation uses InterlockedIncrement • On 2 CPUs with many objects I see an 80% performance increase. • Hopefully scalable to many CPUs
7-Mar-05 Collision detection • We try to keep collision code and data in the cache as much as possible • We try to finish Collision detection as soon as possible because there are dependencies on it • All threads attack the problem at once
8-Mar-05 Notes on Integration The process that steps objects forward in time, in a manner consistent with all contacts and constraints.
8-Mar-05 Integration • Each batch of coupled objects is a task. • Each Batch is solved independently • Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.
8-Mar-05 Integration • When a dynamic object does not interact with other dynamic objects, its batch contains only that object. • When dynamic objects interact, they are coupled; their solutions are dependent on each other and they must be solved together.
8-Mar-05 Integration • In some cases, objects can be artificially decoupled. • e.g. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A. • In this case, A can first be solved independently, and the resulting dynamics can be fed into the solution for B. • This creates an ordering dependency. • A must be solved before B.
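A sketch of the decoupled solve order under those assumptions. solveBatch and makeKinematic are hypothetical stand-ins for the real solver API, not the engine's actual functions.

```cpp
// Artificial decoupling: solve the heavy object first, then feed its
// motion into the light object's solve as a fixed boundary condition.
#include <initializer_list>

struct Body { /* mass, velocities, constraints */ };
void solveBatch(std::initializer_list<Body*> batch, float dt);  // solver, elsewhere
void makeKinematic(Body& b);             // pin the solved motion for this step

void solveDecoupled(Body& heavyA, Body& lightB, float dt) {
    solveBatch({&heavyA}, dt);           // 2000 kg object: B's forces ignored
    makeKinematic(heavyA);               // A's result becomes an input, not an unknown
    solveBatch({&lightB}, dt);           // 1 kg object reacts to A, never vice versa
}
```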
8-Mar-05 Integration • When objects are moved they must be updated in the global collision tree. • Transactions need to be atomic; this is accomplished with locks/critical sections • Ditto for the VSD tree • Task allocation is slightly more complex due to dependencies • Despite all this we see a 75% performance increase on 2 CPUs with many objects.
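A sketch of such an atomic tree update behind a critical section. All names are hypothetical, and std::mutex stands in for the platform's critical section.

```cpp
// Moving an object must update the shared collision tree atomically.
#include <mutex>

struct Transform { /* position, orientation */ };
struct Object    { Transform pose; /* ... */ };
struct CollisionTree {
    void remove(Object*);
    void insert(Object*);
};

CollisionTree g_collisionTree;           // the global search tree
std::mutex    g_collisionTreeLock;       // guards structural updates

void commitMove(Object& obj, const Transform& newPose) {
    std::lock_guard<std::mutex> lock(g_collisionTreeLock);
    g_collisionTree.remove(&obj);        // remove + reinsert must look atomic
    obj.pose = newPose;                  // to every other thread
    g_collisionTree.insert(&obj);
}
```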
8-Mar-05 Integration • We use a discrete Newton solver, which works okay with our task discretization • i.e. one thread per batch • If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations
9-Mar-05 Transactions • With fine grained data parallelism, we require many lightweight atomic transactions. • For this we use either: • Interlocked primitives • Critical Sections • Spin Locks
9-Mar-05 Transactions • Whenever possible, interlocked primitives are used. • Interlocked primitives are simple atomic transactions on single words • If the transaction is short, a spin lock is used. • Otherwise a critical section is used. • A spin lock is like a critical section, except that it spins rather than sleeps when blocking
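A minimal spin lock sketch built on std::atomic_flag, to make the contrast concrete: the same interface as a critical section, but a blocked thread burns cycles instead of sleeping.

```cpp
// Spin lock: spins rather than sleeps when blocking.
#include <atomic>

class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire))
            ;                            // spin: retry instead of yielding to the OS
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```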
9-Mar-05 CPUs are difficult • There are some processor-specific nuances to consider when writing your own locks: • Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. • Otherwise the processor might preload old data that changes right before the lock is released.
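A sketch of where that fence belongs in a hand-rolled lock. With relaxed atomics, the acquire fence between taking the lock and reading shared data is what stops the CPU from satisfying the read with a stale preload; on PowerPC-style hardware this is the isync/lwsync mentioned above. The names here are illustrative, not from the talk.

```cpp
// The load fence must sit between lock acquisition and the guarded reads.
#include <atomic>

std::atomic<int> lockWord{0};
int sharedData = 0;                      // protected by lockWord

void acquireAndRead() {
    while (lockWord.exchange(1, std::memory_order_relaxed))
        ;                                // spin until we own the lock
    std::atomic_thread_fence(std::memory_order_acquire);   // the load fence / isync
    int v = sharedData;                  // cannot be a stale pre-lock preload now
    (void)v;
    std::atomic_thread_fence(std::memory_order_release);
    lockWord.store(0, std::memory_order_relaxed);           // release the lock
}
```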