Emergent Game Technologies Gamebryo Element Engine: Threading for Performance
Goals for Cross-Platform Threading • Play well with others • Take advantage of platform-specific performance features • For engines/middleware, be adaptable to the needs of customers
Write Once, Use Everywhere • Underlying multi-threaded primitives are replicated on all platforms • Define cross-platform wrappers for these • Processing models can be applied on different architectures • Define cross-platform systems for these • Typical developer writes once, yet code performs well on all platforms
Emergent's Gamebryo Element • A foundation for easing cross-platform and multi-core development • Modular, customizable • Suite of content pipeline tools • Supports PC, Xbox, PS3 and Wii • Booth # 5716 - North Hall
Cross-Platform Threading Requires Common Primitives • Threads • Something that executes code • Sub-issues: local storage, priorities • Data Locks / Critical Sections • Manage contention for a resource • Atomic Operations • An operation that is guaranteed to complete without interruption from another thread
Choosing a Processing Model • Architectural features drive choice • Cache coherence • Prefetch on Xbox • SPUs on PS3 • Many processing units • General purpose GPU • Stream Processing fits these properties • Provide infrastructure to compute this way • Shift engine work to this model
Stream Processing (Formal) • Wikipedia: Given a set of input and output data (streams), the principle essentially defines a series of compute-intensive operations (kernel functions) to be applied for each element in the stream. • (Diagram: Input 1 and Input 2 flow through Kernel 1, then Kernel 2, to the Output.)
Generalized Stream Processing • Improve for general purpose computing • Partition streams into chunks • Kernels have access to entire chunk • Parameters for kernels (fixed inputs) • Advantages • Reduce need for strict data locality • Enables loops, non-SIMD processing • Maps better onto hardware
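The generalized model above can be sketched in a few lines: a stream is partitioned into chunks, and a kernel runs once per chunk with access to the entire chunk plus fixed parameters. Names here are illustrative, not Floodgate's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A kernel sees a whole chunk at once, plus a fixed parameter; ordinary
// loops and non-SIMD code are fine in this generalized model.
typedef void (*Kernel)(const float* in, float* out, std::size_t count,
                       float fixedParam);

void ScaleKernel(const float* in, float* out, std::size_t count, float scale)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * scale;
}

// Apply a kernel chunk by chunk. Each chunk is independent, so in a real
// scheduler each invocation could run on a different core or SPU.
void RunChunked(Kernel k, const std::vector<float>& in,
                std::vector<float>& out, std::size_t chunkSize,
                float fixedParam)
{
    for (std::size_t start = 0; start < in.size(); start += chunkSize)
    {
        std::size_t count = std::min(chunkSize, in.size() - start);
        k(&in[start], &out[start], count, fixedParam);
    }
}
```

Chunk size is the tuning knob: it is chosen to fit cache lines, DMA transfer sizes, or SPU local store, which is how the same code maps well onto different hardware.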
Morphing+Skinning Example • (Diagram: Morph Target 1 Vertices, Morph Target 2 Vertices, and Morph Weights feed the Morph Kernel (MK), producing Vertex Locations; these plus Bone Matrices and Blend Weights feed the Skinning Kernel (SK), producing Skin Vertices.)
Morphing+Skinning Example • (Diagram: the vertex streams are partitioned into two parts; MK Instances 1 and 2 each morph one part, with Morph Weights fixed, and SK Instances 1 and 2 each skin one part, with Blend Weights and Matrices fixed, producing Skin Vertices Parts 1 and 2.)
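The morph step's inner loop might look like the sketch below (illustrative names and a flat float stream; a real Floodgate kernel would fetch these streams through the workload interface, as in the Times2 example later).

```cpp
// Illustrative morph kernel body: blends two morph-target streams into
// output vertex positions using fixed morph weights. Because each output
// element depends only on the matching input elements, the vertex streams
// can be partitioned and the kernel run once per partition.
void MorphTwoTargets(const float* target1, const float* target2,
                     float w1, float w2, float* outVerts, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
        outVerts[i] = w1 * target1[i] + w2 * target2[i];
}
```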
Floodgate • Cross-platform stream processing library • Optimized per-platform implementation • Documented API for customer use • The engine uses the same API for built-in functionality • Skinning, Morphing, Particles, Instance Culling, ...
Floodgate Basics • Stream: A buffer of varying or fixed data • A pointer, length, stride, locking • Kernel: An operation to perform on streams of data • Code implementing the “Execute” function • Task: Wraps a kernel and its I/O streams • Workflow: A collection of Tasks processed as a unit
Kernel Example: Times2 // Include Kernel Definition macros #include <NiSPKernelMacros.h> // Declare the Times2Kernel NiSPDeclareKernel(Times2Kernel)
Kernel Example: Times2 #include "Times2Kernel.h" NiSPBeginKernelImpl(Times2Kernel) { // Get the input stream float *pInput = kWorkload.GetInput<float>(0); // Get the output stream float *pOutput = kWorkload.GetOutput<float>(0); // Process data NiUInt32 uiBlockCount = kWorkload.GetBlockCount(); for (NiUInt32 ui = 0; ui < uiBlockCount; ui++) { pOutput[ui] = pInput[ui] * 2; } } NiSPEndKernelImpl(Times2Kernel)
Life of a Workflow • 1. Obtain Workflow from Floodgate • 2. Add Task(s) to Workflow • 3. Set Kernel • 4. Add Input Streams • 5. Add Output Streams • 6. Submit Workflow • … Do something else … • 7. Wait or Poll when results are needed
Example Workflow // Setup input and output streams from existing buffers NiTSPStream<float> inputStream(SomeInputBuffer, MAX_BLOCKS); NiTSPStream<float> outputStream(SomeOutputBuffer, MAX_BLOCKS); // Get a Workflow and setup a new task for it NiSPWorkflow* pWorkflow = NiStreamProcessor::Get()->GetFreeWorkflow(); NiSPTask* pTask = pWorkflow->AddNewTask(); // Set the kernel and streams pTask->SetKernel(&Times2Kernel); pTask->AddInput(&inputStream); pTask->AddOutput(&outputStream); // Submit workflow for execution NiStreamProcessor::Get()->Submit(pWorkflow); // Do other operations... // Wait for workflow to complete NiStreamProcessor::Get()->Wait(pWorkflow);
Floodgate Internals • Partitioning streams for Tasks • Task Dependency Analysis • Platform specific Workflow preparation • Platform specific execution • Platform specific synchronization
Overview of Workflow Analysis • Task dependencies are defined by streams • Sort tasks into stages of execution • Tasks that use results from other tasks run in later stages • Stage N+1 tasks depend on output of Stage N tasks • Tasks in a given stage can run concurrently • Once a stage has completed, the next stage can run
Analysis: Workflow with many Tasks • (Diagram: Tasks 1–7 read and write Streams A–I; several tasks consume streams produced by others, creating dependencies, and a Sync point gates the final task.)
Analysis: Dependency Graph • (Diagram: the same Tasks sorted into Stages 0–3; Sync Tasks separate the stages, and tasks within a stage share no dependencies.)
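The stage-sorting idea above can be sketched as a single pass over the tasks, assuming producers appear before their consumers in the submission order (illustrative structures, not Floodgate's internals):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// Illustrative stage assignment: a task that reads a stream written by an
// earlier task is placed one stage after that producer; tasks with no such
// dependency land in stage 0. Tasks sharing a stage can run concurrently.
struct TaskDesc
{
    std::vector<int> inputs;   // stream ids this task reads
    std::vector<int> outputs;  // stream ids this task writes
};

std::vector<int> AssignStages(const std::vector<TaskDesc>& tasks)
{
    std::map<int, int> producerStage;  // stream id -> stage of its producer
    std::vector<int> stage(tasks.size(), 0);
    for (std::size_t t = 0; t < tasks.size(); ++t)
    {
        int s = 0;
        for (std::size_t i = 0; i < tasks[t].inputs.size(); ++i)
        {
            std::map<int, int>::const_iterator p =
                producerStage.find(tasks[t].inputs[i]);
            if (p != producerStage.end())
                s = std::max(s, p->second + 1);  // must run after producer
        }
        stage[t] = s;
        for (std::size_t o = 0; o < tasks[t].outputs.size(); ++o)
            producerStage[tasks[t].outputs[o]] = s;
    }
    return stage;
}
```

The scheduler then runs each stage's tasks in parallel and synchronizes before starting the next stage.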
Performance Notes • Data is broken into blocks -> Locality • Good cache performance • Optimize size for prefetch or DMA transfers • Fits in limited local storage (PS3) • Easily adapt to #cores • Can manage interplay with other systems • Kernels encapsulate processing • Good target for optimization, platform-specific • Clean solution without #if
Usability Notes • Automatically manage data dependency and simplify synchronization • Hide nasty platform-specific details • Prefetch, DMA transfers, processor detection, ... • Learn one API, use it across platforms • Productivity gains • Helps us produce quality documentation and samples • Eases debugging
Exploiting Floodgate in the Engine • Find tasks that operate on a single object • Skinning, morphing, particle systems, ... • Move these to Floodgate: Mesh Modifiers • Launch at some point during execution • After updating animation and bounds • After determining visibility • After physics finishes ... • Finish them when needed • Culling • Render • etc
Same applications, new performance ... • The big win is out-of-the-box performance • The same results could be achieved only with much more developer time • Hides details on different platforms (esp. PS3) • Skinning Objects: 42fps before, 62fps after • Morphing Objects: 12fps before, 38fps after
Example CPU Utilization, Morphing • (Charts: Before and After)
Thread profiling, Morphing: Before • Some parallelization through a hand-coded parallel update • Note the high overhead, with roughly 85% of time spent in serial execution
Thread profiling, Morphing: After • Automatic parallelism in the engine • 4 threads for Floodgate (4 CPUs) • Roughly 50% of the old serial time is replaced with 4x parallelism
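The rough arithmetic behind that last bullet follows Amdahl's law (a back-of-the-envelope model of the profiling claim only, not a figure from the slides): if a fraction p of the original serial time now runs across n threads, overall speedup is 1 / ((1 - p) + p / n), so p = 0.5 and n = 4 predicts about 1.6x.

```cpp
// Amdahl's law: overall speedup when a fraction p of the work is spread
// perfectly across n processing units and the remainder stays serial.
double AmdahlSpeedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

The measured fps gains above are larger than this model predicts, since the rework also removed overhead and improved locality, not just added threads.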
New Issues • Within the engine, resource usage peaks at certain times • e.g. Between visibility culling and rendering • Application-level work might fill in the empty spaces • Physics, global illumination, ... • What about single processor machines? • What about variable sized output? • Instance culling, for example
Ongoing Improvements • Improved workflow scheduling • Mechanisms to enhance application control • Optimizing when tasks change • Stream lengths change • Inputs/outputs are changed • More platform specific improvements • Off-loading more engine work
Using Floodgate in a game • Identify stream processing opportunities • Places where lots of data is processed with local access patterns • Places where work can be prepared early but results are not needed until later • Re-factor to use Floodgate • Depending on the task, this could take as little as a few hours • The hard part is enforcing locality
Future-proof? • Both CPUs and GPUs can function as stream processors • Easily extends to more processing units • Potential snags lie in application changes
Questions? • Ask Stephen! • Visit Emergent's booth at the show. • Booth 5716, North Hall, opposite Intel on the central aisle