Physics in Parallel: Simulation on 7th Generation Hardware David Wu Pseudo Interactive
Why are we here? • The 7th generation is approaching. • We are no longer next gen • We are all scrambling to adapt to the new stuff, so that we can stay • on the bleeding edge • and push the envelope • and take things to the next level.
What’s Next Gen? • Multiple Processors • not entirely new, but more than before. • Parallelism • not entirely new, but more than before. • Physics • not entirely new, but more than before.
Take-Away • So much to cover • General Principles • Useful Concepts • Techniques • Tips • Bad Jokes • Goal is to save you time during the transition to… • Next Gen
Format for presentation • Every year we discover new ways to communicate information.
Patterns • A description of a recurrent problem and of the core of possible solutions • Difficult to write • Too pretentious • Inviting criticism
Gems • Valuable bits of information • Too 6th Gen
Blog • Free Form • Continuity not required • Subjective/opinionated is okay • Arbitrary Tangents are okay • Catchy Title need not match article • No quality bar • This sounds 7th Gen to me.
Disclaimer • My information sources: • press releases • patents • other blogs on the net • random probabilistic guesses • Much of the information is probably wrong.
1-Mar-05 Multi-threaded programming • I participated in some in-depth discussions on this topic; after weeks of debate, the conclusion was: • “Multi-threaded programming is hard”
2-Mar-05 What is 7th Gen Hardware? • Fast • Many parallel processors • Very High peak Flops • In order execution
2-Mar-05 What is 7th Gen Hardware? • High memory latency • Not enough Bandwidth • Moderate clock speed improvements • Not enough Memory • CPU-GPU convergence
3-Mar-05 Hardware usually sucks • Is Multi-Processor Revolutionary? • It is kind of here already • Hyper-Threading • Dual Processor • Sega Saturn • not entirely new, but more than before.
3-Mar-05 Hardware usually sucks • Hardware advances require years of preparatory hype: • 3D Accelerators • Online • SIMD • “Not with a bang but with a whimper”
3-Mar-05 Hardware usually sucks • The big problem with hardware advances is software. • We don’t like to do things that are hard. • If there is a big enough payoff we do it. • This time there is a big enough payoff.
4-Mar-05 Types of Parallelism • Task Parallelism • Render+physics • Data Parallelism • collision detection on two objects at a time • Instruction Parallelism • multiple elements in a vector • Use all three
5-Mar-05 Techniques • Pipeline • Work Crew • Forking
5-Mar-05 Pipeline – Task Parallelism • Subdivide problem into discrete tasks • Solve tasks in parallel, spreading them across multiple processors.
5-Mar-05 Pipeline – Task Parallelism • Thread 0: collision detection (Frame 3, then Frame 4) • Thread 1: Logic/AI (Frame 2, then Frame 3) • Thread 2: Integration (Frame 1, then Frame 2)
5-Mar-05 Pipeline • Similar to CPU/GPU parallelism • CPU: Frame 3, then Frame 4 • GPU: Frame 2, then Frame 3
5-Mar-05 Pipeline: notes • Dependencies explicit • Communication explicit • e.g. through a FIFO (a sketch follows these notes) • Avoids deadlock issues • Avoids most race conditions • Load balancing is not great • Does not reduce latency vs. the single threaded case
5-Mar-05 Pipeline: notes • Feedback between tasks is difficult • Best for open loop tasks • Secondary dynamics, e.g. a pony tail • Effects • Suitable for specialized hardware, because task requirements are cleanly divided.
5-Mar-05 Pipeline: notes • Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design. • Adds bandwidth overhead and memory use overhead to SMP systems that would otherwise communicate via the cache.
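Since the notes above lean on FIFO communication between stages, here is a minimal sketch of the kind of queue one pipeline stage could push into while the next stage pops. This is illustrative only, not the engine's code; StageQueue and Work are hypothetical names, and a shipping version would likely be bounded or lock-free.

```cpp
// Minimal sketch of inter-stage FIFO communication (hypothetical names).
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename Work>
class StageQueue {
public:
    void push(Work w) {                  // called by the upstream stage
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(w));
        }
        cv_.notify_one();                // wake the downstream stage
    }
    Work pop() {                         // called by the downstream stage;
        std::unique_lock<std::mutex> lock(m_);   // blocks until work arrives
        cv_.wait(lock, [this] { return !q_.empty(); });
        Work w = std::move(q_.front());
        q_.pop();
        return w;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Work> q_;
};
```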
5-Mar-05 Work Crew • Component-wise division of the system: • Collision Detection • Integration • Particle System • Fluid Simulation • Audio • Rendering • AI/Logic • IO
5-Mar-05 Work Crew – Task Parallelism • Similar to pipeline but without explicit ordering. • Dependencies are handled on a case by case basis. • e.g. particles that do not affect game play might not need to be deterministic, so they can run without explicit synchronization. • Components without interdependencies can run asynchronously, e.g. kinematics and AI.
5-Mar-05 Work Crew • Suitable for some external processes such as IO, Gamepad, Sound, Sockets. • Suitable for decoupled systems: • particle simulations that do not affect game play • Fluid dynamics • Visual damage simulation • Cloth simulation
5-Mar-05 Work Crew • Scalability is limited by the number of discrete tasks • Load balancing is limited by the asymmetric nature of the components and their requirements. • Higher risk of deadlocks • Higher risk of race conditions
5-Mar-05 Work Crew • May require double buffering of some data to avoid race conditions. • Poor data coherency • Good code coherency
5-Mar-05 Forking – Data Parallelism • Perform the same task on multiple objects in parallel. • Thread “forks” into multiple threads across multiple processors • All threads repeatedly grab pending objects indiscriminately and execute the task on them • When finished, threads combine back into the original thread.
5-Mar-05 Forking • Fork: Object A on Thread 2, Object B on Thread 0, Object C on Thread 1 • Combine
5-Mar-05 Forking • Task assignment can often be done using simple interlocked primitives: • e.g. int i = InterlockedIncrement(&nextTodo); • OpenMP adds compiler support for this via pragmas
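A sketch of that task-assignment loop, using std::atomic as a portable stand-in for InterlockedIncrement. Object and doTask are hypothetical names; note that fetch_add returns the pre-increment value, whereas InterlockedIncrement returns the post-increment one.

```cpp
// Forked workers grabbing pending objects via an interlocked counter.
#include <atomic>
#include <vector>

struct Object { /* per-object simulation state */ };
void doTask(Object& o);                  // the per-object work, defined elsewhere

std::atomic<int> nextTodo{0};            // shared cursor into the todo list

void workerThread(std::vector<Object*>& objects) {
    for (;;) {
        int i = nextTodo.fetch_add(1);   // atomic claim: no two threads get the same i
        if (i >= (int)objects.size())
            break;                       // every object has been claimed
        doTask(*objects[i]);
    }
}

// The OpenMP equivalent hides the counter behind a pragma:
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) doTask(*objects[i]);
```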
5-Mar-05 Forking • Externally Synchronous • external callers don’t have to worry about being thread safe • thread safety requirements are limited to the scope of the code within the forked section. • This is a big deal. • good for isolated engine components and middleware
5-Mar-05 Forking – Example • AI running in thread 0 • AI calls RayQuery() for a line-of-sight check • RayQuery forks into 6 threads, computes the ray query, and then returns the results through thread 0 • AI, running in thread 0, uses the result.
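A sketch of what makes such a query externally synchronous. The real RayQuery forks into 6 threads; this hypothetical version simplifies by launching one job per root child of the 8-wide tree and joining before returning, so the AI caller never has to be thread safe.

```cpp
// Externally synchronous forked ray query (hypothetical names and layout).
#include <future>
#include <vector>

struct Ray  { /* origin, direction, length */ };
struct Node;                              // k-DOP tree node, defined elsewhere
struct Hit  { float t = 1e30f; /* prim, normal, ... */ };

Hit traverseSubtree(const Node* n, const Ray& r);   // sequential traversal

Hit RayQuery(const Node* const rootChildren[8], const Ray& ray) {
    std::vector<std::future<Hit>> jobs;
    for (int c = 0; c < 8; ++c)           // fork: one job per root child
        jobs.push_back(std::async(std::launch::async,
                                  traverseSubtree, rootChildren[c], ray));
    Hit best;
    for (auto& j : jobs) {                // combine: keep the nearest hit
        Hit h = j.get();
        if (h.t < best.t) best = h;
    }
    return best;                          // the caller sees a plain synchronous call
}
```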
5-Mar-05 Forking • Minimizes Latency for a given task • Good data and code coherency • Potentially high synchronization overhead, depending on the coupling. • Highly scalable if you have many tasks with few dependencies • Ideal for Collision detection.
5-Mar-05 Forking - Batches • Reduces inter-thread communication • Reduces the potential for load balancing • Improves instruction level parallelism • Fork: Objects 0..10 on Thread 0, Objects 11..20 on Thread 1, Objects 21..30 on Thread 2 • Combine
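The same worker loop as before, but claiming a batch per interlocked operation instead of a single object. kBatch is a hypothetical tuning constant; Object and doTask are as in the earlier sketch.

```cpp
// Batched task grabbing: fewer interlocked operations, coherent runs of objects.
#include <algorithm>
#include <atomic>
#include <vector>

const int kBatch = 10;                   // hypothetical batch size
std::atomic<int> nextTodo{0};

void workerThread(std::vector<Object*>& objects) {
    for (;;) {
        int first = nextTodo.fetch_add(kBatch);     // one claim per batch
        if (first >= (int)objects.size())
            break;
        int last = std::min(first + kBatch, (int)objects.size());
        for (int i = first; i < last; ++i)          // contiguous run keeps data
            doTask(*objects[i]);                    // hot and exposes ILP
    }
}
```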
6-Mar-05 Our Approach • 1) Collision Detection: Forked • 2) AI/Logic: Single threaded • 2a) Engine calls: Forked • 2b) Damage Effects: Contractor Queue, all extra threads • 3) Integration: Forked • 4) Rendering: Forked/Pipeline • Audio: Whatever
7-Mar-05 Multithreaded programming is Hard • Solutions that directly expose multiple threads to leaf code are a bad idea. • Sequential, single threaded, synchronous code is the fastest to write and debug • In order to meet schedules most leaf code will stay this way.
7-Mar-05 Notes on Collision detection • All collision prims are stored in a global search tree. • Bounding k-DOP tree with 8 children per node. • The most common case is when 0 or 1 children need to be traversed • 8 children results in fewer branches • 8 children allows better prefetching
7-Mar-05 Collision detection • Each moving object is a “task” • Each object is independently queried vs. all other objects in the tree. • Results are output to a global list of contacts and collisions • To avoid duplicates, moving object vs. moving object collisions are only processed if the active moving object’s memory address is less than or equal to the other object’s.
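A sketch of that duplicate-avoidance test (hypothetical names; the address comparison is exactly the ordering trick described above):

```cpp
// Each moving-vs-moving pair is processed by only one of the two queries.
struct Object { /* collision state */ };
bool isMoving(const Object* o);          // defined elsewhere

void processPair(Object* active, Object* other) {
    if (isMoving(other) && !(active <= other))
        return;                          // the other object's query owns this pair
    // ... narrow-phase tests, append contacts to the global list ...
}
```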
7-Mar-05 Collision detection • Threads pop objects off of the todo list one by one using interlocked access until they are all processed. • Each query takes O(lg N) time. • Very little data contention • output operations are rare and quick • task allocation uses InterlockedIncrement • On 2 CPUs with many objects I see an 80% performance increase. • Hopefully scalable to many CPUs
7-Mar-05 Collision detection • We try to keep collision code and data in the cache as much as possible • We try to finish Collision detection as soon as possible because there are dependencies on it • All threads attack the problem at once
8-Mar-05 Notes on Integration The process that steps objects forward in time, in a manner consistent with all contacts and constraints.
8-Mar-05 Integration • Each batch of coupled objects is a task. • Each Batch is solved independently • Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.
8-Mar-05 Integration • When a dynamic object does not interact with other dynamic objects, its batch contains only that object. • When dynamic objects interact, they are coupled; their solutions are dependent on each other and they must be solved together.
8-Mar-05 Integration • In some cases, objects can be artificially decoupled. • e.g. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A. • In this case, A can first be solved independently, and the resulting dynamics can be fed into the solution for B. • This creates an ordering dependency. • A must be solved before B.
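A sketch of the decoupled solve order under those assumptions. solveBatch and makeKinematic are hypothetical stand-ins for the real solver API, not the engine's actual functions.

```cpp
// Artificial decoupling: solve the heavy object first, then feed its
// motion into the light object's solve as a fixed boundary condition.
#include <initializer_list>

struct Body { /* mass, velocities, constraints */ };
void solveBatch(std::initializer_list<Body*> batch, float dt);  // solver, elsewhere
void makeKinematic(Body& b);             // pin the solved motion for this step

void solveDecoupled(Body& heavyA, Body& lightB, float dt) {
    solveBatch({&heavyA}, dt);           // 2000 kg object: B's forces ignored
    makeKinematic(heavyA);               // A's result becomes an input, not an unknown
    solveBatch({&lightB}, dt);           // 1 kg object reacts to A, never vice versa
}
```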
8-Mar-05 Integration • When objects are moved they must be updated in the global collision tree. • Transactions need to be atomic; this is accomplished with locks/critical sections • Ditto for the VSD tree • Task allocation is slightly more complex due to dependencies • Despite all this we see a 75% performance increase on 2 CPUs with many objects.
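A sketch of such an atomic tree update behind a critical section. All names are hypothetical, and std::mutex stands in for the platform's critical section.

```cpp
// Moving an object must update the shared collision tree atomically.
#include <mutex>

struct Transform { /* position, orientation */ };
struct Object    { Transform pose; /* ... */ };
struct CollisionTree {
    void remove(Object*);
    void insert(Object*);
};

CollisionTree g_collisionTree;           // the global search tree
std::mutex    g_collisionTreeLock;       // guards structural updates

void commitMove(Object& obj, const Transform& newPose) {
    std::lock_guard<std::mutex> lock(g_collisionTreeLock);
    g_collisionTree.remove(&obj);        // remove + reinsert must look atomic
    obj.pose = newPose;                  // to every other thread
    g_collisionTree.insert(&obj);
}
```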
8-Mar-05 Integration • We use a discrete Newton solver, which works okay with our task discretization • i.e. one thread per batch • If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations
9-Mar-05 Transactions • With fine grained data parallelism, we require many lightweight atomic transactions. • For this we use either: • Interlocked primitives • Critical Sections • Spin Locks
9-Mar-05 Transactions • Whenever possible, interlocked primitives are used. • Interlocked primitives are simple atomic transactions on single words • If the transaction is short, a spin lock is used. • Otherwise a critical section is used. • A spin lock is like a critical section, except that it spins rather than sleeps when blocking
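A minimal spin lock sketch built on std::atomic_flag, to make the contrast concrete: the same interface as a critical section, but a blocked thread burns cycles instead of sleeping.

```cpp
// Spin lock: spins rather than sleeps when blocking.
#include <atomic>

class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire))
            ;                            // spin: retry instead of yielding to the OS
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```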
9-Mar-05 CPUs are difficult • There are some processor-specific nuances to consider when writing your own locks: • Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. • Otherwise the processor might preload old data that changes right before the lock is released.
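A sketch of where that fence belongs in a hand-rolled lock. With relaxed atomics, the acquire fence between taking the lock and reading shared data is what stops the CPU from satisfying the read with a stale preload; on PowerPC-style hardware this is the isync/lwsync mentioned above. The names here are illustrative, not from the talk.

```cpp
// The load fence must sit between lock acquisition and the guarded reads.
#include <atomic>

std::atomic<int> lockWord{0};
int sharedData = 0;                      // protected by lockWord

void acquireAndRead() {
    while (lockWord.exchange(1, std::memory_order_relaxed))
        ;                                // spin until we own the lock
    std::atomic_thread_fence(std::memory_order_acquire);   // the load fence / isync
    int v = sharedData;                  // cannot be a stale pre-lock preload now
    (void)v;
    std::atomic_thread_fence(std::memory_order_release);
    lockWord.store(0, std::memory_order_relaxed);           // release the lock
}
```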