
Emergent Game Technologies Gamebryo Element Engine






Presentation Transcript


  1. Emergent Game Technologies Gamebryo Element Engine: Thread for Performance

  2. Goals for Cross-Platform Threading • Play well with others • Take advantage of platform-specific performance features • For engines/middleware, be adaptable to the needs of customers

  3. Write Once, Use Everywhere • Underlying multi-threaded primitives are replicated on all platforms • Define cross-platform wrappers for these • Processing models can be applied on different architectures • Define cross-platform systems for these • Typical developer writes once, yet code performs well on all platforms

  4. Emergent's Gamebryo Element • A foundation for easing cross-platform and multi-core development • Modular, customizable • Suite of content pipeline tools • Supports PC, Xbox, PS3 and Wii • Booth # 5716 - North Hall

  5. Cross-Platform Threading Requires Common Primitives • Threads • Something that executes code • Sub issues: local storage, priorities • Data Locks / Critical sections • Manage contention for a resource • Atomic operations • An operation that is guaranteed to complete without interruption from another thread

  6. Choosing a Processing Model • Architectural features drive choice • Cache coherence • Prefetch on Xbox • SPUs on PS3 • Many processing units • General purpose GPU • Stream Processing fits these properties • Provide infrastructure to compute this way • Shift engine work to this model

  7. Stream Processing (Formal) • Wikipedia: Given a set of input and output data (streams), the principle essentially defines a series of compute-intensive operations (kernel functions) to be applied for each element in the stream. [Diagram: Input 1 and Input 2 flow through Kernel 1, then Kernel 2, to the Output]

  8. Generalized Stream Processing • Improve for general-purpose computing • Partition streams into chunks • Kernels have access to entire chunk • Parameters for kernels (fixed inputs) • Advantages • Reduce need for strict data locality • Enables loops, non-SIMD processing • Maps better onto hardware

  9. Morphing+Skinning Example [Diagram: Morph Target 1 Vertices, Morph Target 2 Vertices, and Morph Weights feed the Morph Kernel (MK), producing Vertex Locations; these plus Bone Matrices and Blend Weights feed the Skinning Kernel (SK), producing Skin Vertices]

  10. Morphing+Skinning Example [Diagram: the same pipeline partitioned into two chunks — MK Instances 1 and 2 and SK Instances 1 and 2 each process one partition (Morph Target 1/2 Vertices Part 1 and Part 2, Verts Part 1 and Part 2, Skin Vertices Part 1 and Part 2), with the Weights, Morph Weights, and Matrices as fixed inputs]

  11. Floodgate • Cross platform stream processing library • Optimized per-platform implementation • Documented API for customer use • Engine uses the same API for built in functionality • Skinning, Morphing, Particles, Instance Culling, ...

  12. Floodgate Basics • Stream: A buffer of varying or fixed data • A pointer, length, stride, locking • Kernel: An operation to perform on streams of data • Code implementing the “Execute” function • Task: Wraps a kernel and its I/O streams • Workflow: A collection of Tasks processed as a unit

  13. Kernel Example: Times2
  // Include Kernel Definition macros
  #include <NiSPKernelMacros.h>
  // Declare the Times2Kernel
  NiSPDeclareKernel(Times2Kernel)

  14. Kernel Example: Times2
  #include "Times2Kernel.h"
  NiSPBeginKernelImpl(Times2Kernel)
  {
      // Get the input stream
      float* pInput = kWorkload.GetInput<float>(0);
      // Get the output stream
      float* pOutput = kWorkload.GetOutput<float>(0);
      // Process data
      NiUInt32 uiBlockCount = kWorkload.GetBlockCount();
      for (NiUInt32 ui = 0; ui < uiBlockCount; ui++)
      {
          pOutput[ui] = pInput[ui] * 2;
      }
  }
  NiSPEndKernelImpl(Times2Kernel)

  15. Life of a Workflow • 1. Obtain Workflow from Floodgate • 2. Add Task(s) to Workflow • 3. Set Kernel • 4. Add Input Streams • 5. Add Output Streams • 6. Submit Workflow • … Do something else … • 7. Wait or Poll when results are needed

  16. Example Workflow
  // Set up input and output streams over existing buffers
  NiTSPStream<float> inputStream(SomeInputBuffer, MAX_BLOCKS);
  NiTSPStream<float> outputStream(SomeOutputBuffer, MAX_BLOCKS);
  // Get a Workflow and set up a new Task for it
  NiSPWorkflow* pWorkflow = NiStreamProcessor::Get()->GetFreeWorkflow();
  NiSPTask* pTask = pWorkflow->AddNewTask();
  // Set the kernel and streams
  pTask->SetKernel(&Times2Kernel);
  pTask->AddInput(&inputStream);
  pTask->AddOutput(&outputStream);
  // Submit the workflow for execution
  NiStreamProcessor::Get()->Submit(pWorkflow);
  // Do other operations ...
  // Wait for the workflow to complete
  NiStreamProcessor::Get()->Wait(pWorkflow);

  17. Floodgate Internals • Partitioning streams for Tasks • Task Dependency Analysis • Platform specific Workflow preparation • Platform specific execution • Platform specific synchronization

  18. Overview of Workflow Analysis • Task dependencies defined by streams • Sort tasks into stages of execution • Tasks that use results from other tasks run in later stages • Stage N+1 tasks depend on output of Stage N tasks • Tasks in a given stage can run concurrently • Once a stage has completed, the next stage can run

  19. Analysis: Workflow with many Tasks [Diagram: seven Tasks connected by Streams A–I, with a Sync point at the end]

  20. Analysis: Dependency Graph [Diagram: the Tasks from the previous slide sorted into Stages 0–3, with Sync Tasks between stages]

  21. Performance Notes • Data is broken into blocks -> locality • Good cache performance • Optimize block size for prefetch or DMA transfers • Fits in limited local storage (PS3) • Easily adapts to the number of cores • Can manage interplay with other systems • Kernels encapsulate processing • Good target for platform-specific optimization • Clean solution without #if

  22. Usability Notes • Automatically manage data dependency and simplify synchronization • Hide nasty platform-specific details • Prefetch, DMA transfers, processor detection, ... • Learn one API, use it across platforms • Productivity gains • Helps us produce quality documentation and samples • Eases debugging

  23. Exploiting Floodgate in the Engine • Find tasks that operate on a single object • Skinning, morphing, particle systems, ... • Move these to Floodgate: Mesh Modifiers • Launch at some point during execution • After updating animation and bounds • After determining visibility • After physics finishes ... • Finish them when needed • Culling • Render • etc

  24. Same applications, new performance ... • The big win is out-of-the-box performance • The same results could be achieved by hand, but only with much developer time • Hides details of different platforms (esp. PS3) • Frame rates, before -> after: Skinning Objects 42fps -> 62fps; Morphing Objects 12fps -> 38fps

  25. Example CPU Utilization, Morphing [Screenshots: CPU utilization before and after]

  26. Thread profiling, Morphing (Before) • Some parallelization through hand-coded parallel update • Note the high overhead and roughly 85% serial execution

  27. Thread profiling, Morphing (After) • Automatic parallelism in engine • 4 threads for Floodgate (4 CPUs) • Roughly 50% of the old serial time replaced with 4x parallelism

  28. New Issues • Within the engine, resource usage peaks at certain times • e.g. Between visibility culling and rendering • Application-level work might fill in the empty spaces • Physics, global illumination, ... • What about single processor machines? • What about variable sized output? • Instance culling, for example

  29. Ongoing Improvements • Improved workflow scheduling • Mechanisms to enhance application control • Optimizing when tasks change • Stream lengths change • Inputs/outputs are changed • More platform specific improvements • Off-loading more engine work

  30. Using Floodgate in a game • Identify stream processing opportunities • Places where lots of data is processed with local access patterns • Places where work can be prepared early but results are not needed until later • Re-factor to use Floodgate • Depending on task, could be as little as a few hours. • Hard part is enforcing locality

  31. Future-proofed? • Both CPUs and GPUs can function as stream processors • Easily extends to more processing units • Potential snags are in application changes

  32. Questions? • Ask Stephen! • Visit Emergent's booth at the show. • Booth 5716, North Hall, opposite Intel on the central aisle
