Massive Parallelism in AI: Throughput versus Realtime
Pierre Pontevia, 10th March 2010
Agenda
• Where are we today?
• The pathfinding challenge: from throughput to realtime
• MASAI: the premises of a massively parallel AI solution
Where are we today?
• Parallel programming has become a reality for game developers since the arrival of "next gen" consoles (2005-2006)
• Since then, many new languages and programming models have been proposed to tackle parallelism better
• New hardware is being announced, shaping the future of consoles
• So this is a good moment to see how parallelism could be revisited for the games of tomorrow, with a special focus on pathfinding
As a start, the 13 dwarves should help us find the right parallel pattern
• The 13 dwarves are an initiative from the University of California, Berkeley to help achieve high parallelism
• A dwarf is an algorithmic method that captures a pattern of computation and communication
• The first exercise is to identify which dwarves match the problems involved in pathfinding
Recent languages and programming models provide guidance for parallel implementation
• Data parallelism for homogeneous architectures
  • OpenMP
  • TBB
  • Ct
• Data parallelism for heterogeneous architectures
  • CUDA
  • OpenCL
  • DirectCompute
  • SPURS
  • RapidMind
• PC clusters
  • MPI
  • MapReduce
• Concurrent programming
  • PPL, Asynchronous Agents
  • Grand Central Dispatch
However, specific constraints in video games affect parallel design…
• Memory resource constraints
  • How much scratch memory does each solver require?
• Concurrent memory access
  • Computations are done on data that can change significantly from frame to frame
• Data lifetime / persistence
  • Things are volatile by nature
• Reactivity / time delay / frequency constraints
  • When do you really need the result of your computation?
• Interruptibility
  • The system can change its mind: 80% of the path goals are never reached
…and even more constraints when you develop middleware
• Multiple cohabiting models
  • Several middleware packages with several threading models
  • Not blocking is not enough: fine-tuning issues remain
  • SPURS everywhere?
• Multiple hardware targets
  • A PC is different from an Xbox 360 console, which is different from a PlayStation® 3 (PS3) console
• Multiple mutually exclusive programming languages
A gap analysis of existing solutions shows that no single solution fits the video game context perfectly
• No model really treats memory as a limiting resource in the design of parallel solutions
• No model takes time into account as a dimension of the problem
• All the approaches are very throughput-oriented
Pathfinding in a nutshell
• Path Planning: LOW FREQUENCY (0.1 Hz)
  • Input: topology, current position, destination
  • Output: valid path
• Path Smoothing: MEDIUM FREQUENCY (2 Hz)
  • Input: current position, destination
  • Output: target point
• DA(*) & Steering: HIGH FREQUENCY (10 Hz)
  • Input: current position, target point
  • Output: new target point
(*) DA: Dynamic Avoidance
[Diagram: a path between points A and B]
Pathfinding is made of different solvers with different characteristics
• 3 categories of solvers:
  • A*, graph traversal: low frequency / large input and work memory
  • Trajectory smoothing: medium frequency / optional
  • DA / steering: high frequency / critical
[Chart: work memory requirements versus frequency. A* and graph traversals: > 500 K of work memory. DA, steering and smoothing: < 5 K. Frequency axis ticks: 0.2, 3, 10.]
There are 2 kinds of data parallelism in pathfinding
• Number of characters: the work of every solver grows linearly with the number of characters
• Size of graph: graph-traversal solvers can use a Dwarf 9 (graph traversal) pattern
A first approach could be a single-frame batch paradigm (throughput), compatible with most programming models
[Diagram elements: Pathfinder (Entity 1); Framework with four queues (Path Request Queue, Target Request Queue, DA Request Queue, Steering Request Queue); Middleware Queue holding the Search Path, Select Target, Compute DA and Compute Steering tasks; PPM Queue holding the Compute Kernels. PPM = Parallel Programming Model.]
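The single-frame batch idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Framework` class, its `post` and `run_frame` methods and the four placeholder kernels are hypothetical names, and a thread pool stands in for whatever parallel programming model is actually used.

```python
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four solver kernels named on the slide.
def search_path(ctx):      return ("path", ctx)
def select_target(ctx):    return ("target", ctx)
def compute_da(ctx):       return ("da", ctx)
def compute_steering(ctx): return ("steering", ctx)

KERNELS = {
    "path": search_path,
    "target": select_target,
    "da": compute_da,
    "steering": compute_steering,
}

class Framework:
    """One request queue per solver type, fully drained once per frame."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def post(self, kind, ctx):
        self.queues[kind].append(ctx)

    def run_frame(self, pool):
        results = []
        for kind, kernel in KERNELS.items():
            batch = list(self.queues[kind])
            self.queues[kind].clear()
            # Every request of the same kind is handed to the pool as one batch.
            results.extend(pool.map(kernel, batch))
        return results

if __name__ == "__main__":
    fw = Framework()
    for entity in (1, 2):
        for kind in KERNELS:
            fw.post(kind, entity)
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(len(fw.run_frame(pool)), "requests solved in this frame")
```

Everything posted before `run_frame` is solved inside that frame, which is exactly the throughput-oriented behaviour discussed here.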
Each task request has a context composed of character data, global data, and potentially customized objects
[Diagram elements: Character Context, Customizable (*), Global Data, Output. (*) LPF, Obstacle Avoidance]
However, as the number of solvers can be limited by memory, a throughput-maximization approach to parallelization can be capped by Amdahl's law
[Diagram: timelines of the Compute Path, Compute Target Point, Compute DA Target Point and Compute Steering solvers on Thread 1 and Thread 2, in three cases: serial with no memory limitation, parallel with no memory limitation, and parallel in a memory-constrained environment.]
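A small numeric sketch of this cap, under stated assumptions: the thread count, memory budget, per-solver work memory and parallel fraction below are made-up figures, and `usable_workers` is a hypothetical helper. Memory limits how many solvers can run at once, and Amdahl's law then gives the best speedup for that many workers.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

def usable_workers(threads, memory_budget_kb, work_memory_kb):
    """Concurrent solvers are limited by threads AND by scratch memory."""
    return max(1, min(threads, memory_budget_kb // work_memory_kb))

if __name__ == "__main__":
    threads = 6            # assumed hardware thread count
    budget_kb = 1024       # assumed scratch memory budget
    astar_kb = 512         # assumed work memory of one A* solver
    fraction = 0.9         # assumed parallelizable share of the frame's AI work

    workers = usable_workers(threads, budget_kb, astar_kb)
    print("solvers that fit in memory:", workers)
    print("speedup without memory limit:", round(amdahl_speedup(fraction, threads), 2))
    print("speedup with memory limit:", round(amdahl_speedup(fraction, workers), 2))
```

With these illustrative numbers only two A* solvers fit in memory, so the remaining threads cannot contribute to the speedup no matter how well the work parallelizes.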
To avoid that, the pathfinding solution needs to find more task parallelism along the time dimension
Moving from "How to solve all the work within a frame"
To "How to distribute work across several frames"
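One common way to spread a search across frames is to make it resumable. The sketch below is an assumption-laden illustration rather than the presenter's implementation: `astar_steps` is a hypothetical generator that pauses after a fixed budget of node expansions, and `run_over_frames` simulates a game loop that resumes it once per frame.

```python
import heapq

def astar_steps(graph, start, goal, heuristic, budget):
    """Resumable A*. Yields None each time `budget` nodes have been expanded,
    then yields the final path (or [] if the goal is unreachable)."""
    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    expanded = 0
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if cost > best_cost[node]:
            continue  # stale heap entry
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            yield path[::-1]
            return
        for neighbour, step in graph[node]:
            new_cost = cost + step
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                parent[neighbour] = node
                heapq.heappush(
                    open_heap, (new_cost + heuristic(neighbour), new_cost, neighbour)
                )
        expanded += 1
        if expanded % budget == 0:
            yield None  # budget spent: give control back until the next frame
    yield []

def run_over_frames(search):
    """Resume the search once per simulated frame; return (path, frames used)."""
    frames = 0
    for result in search:
        frames += 1
        if result is not None:
            return result, frames

if __name__ == "__main__":
    graph = {"A": [("B", 1)], "B": [("C", 1)], "C": [("D", 1)],
             "D": [("E", 1)], "E": []}
    print(run_over_frames(astar_steps(graph, "A", "E", lambda n: 0, budget=2)))
```

A fixed expansion budget only approximates a fixed time slice; a production version would more likely budget on measured time or on memory touched.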
A good illustration is to describe pathfinding as a statechart with 4 orthogonal regions
[Statechart: top-level states Stopped and Active. Active contains four orthogonal regions: Path Planning (No Path / Searching Path), Target Selection (No Target / Selecting Target), DA Target (No DA Target / Computing DA Target) and Steering (No Steering / Computing Steering). Events shown: New Destination, Pos Updated, New Pos, Path Found, Path Not Found, Path Updated, Target Found, Target Updated, DA Target Found, DA Target Computed, DA Target Updated, Steering Computed, Has Arrived.]
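The statechart idea can be sketched as four small state machines that react to the same events independently. The transition table below is only one plausible reading of the flattened diagram, not a transcription of it, and every identifier (`REGIONS`, `TRANSITIONS`, `CHAIN`, `PathfindingStatechart`) is hypothetical.

```python
REGIONS = ("path_planning", "target_selection", "da_target", "steering")

# (region, event) -> new state. Each region is either "idle" or "computing".
# ASSUMPTION: this table is an illustrative guess at the diagram's transitions.
TRANSITIONS = {
    ("path_planning", "new_destination"): "computing",
    ("path_planning", "path_found"): "idle",
    ("target_selection", "path_updated"): "computing",
    ("target_selection", "target_found"): "idle",
    ("da_target", "target_updated"): "computing",
    ("da_target", "da_target_found"): "idle",
    ("steering", "da_target_updated"): "computing",
    ("steering", "steering_computed"): "idle",
}

# ASSUMPTION: a completed solver re-broadcasts an "updated" event downstream.
CHAIN = {
    "path_found": "path_updated",
    "target_found": "target_updated",
    "da_target_found": "da_target_updated",
}

class PathfindingStatechart:
    def __init__(self):
        self.state = {region: "idle" for region in REGIONS}

    def dispatch(self, event):
        # Every region sees every event; regions without a rule ignore it.
        for region in REGIONS:
            new_state = TRANSITIONS.get((region, event))
            if new_state is not None:
                self.state[region] = new_state
        follow_up = CHAIN.get(event)
        if follow_up is not None:
            self.dispatch(follow_up)

    def pending(self):
        """Regions that currently have a solver request outstanding."""
        return [r for r in REGIONS if self.state[r] == "computing"]

if __name__ == "__main__":
    chart = PathfindingStatechart()
    chart.dispatch("new_destination")
    print(chart.pending())
    chart.dispatch("path_found")
    print(chart.pending())
```

Because the regions are independent, several of them can be "computing" at once, and each outstanding computation is free to finish in a later frame.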
It is still compatible with the previous approach, but multiframe (no longer capped by Amdahl's law)
[Diagram: one copy of the pathfinding statechart per entity (three shown), each with the DA Target, Steering, Path Planning and Target Selection regions, alongside the Framework's Path, Target, DA and Steering request queues and the Middleware Queue holding the Search Path, Select Target, Compute DA and Compute Steering tasks.]
But now we have 3 new problems
• Problem 1: How to guarantee that high-frequency steering solvers return their values on time?
• Problem 2: How to deal with multiframe volatility and dynamicity of data?
• Problem 3: What computation triggering logic do we want?
Problem 1 is a scheduling problem for realtime systems
• Problem 1 can be reworded as follows: "How to guarantee, for each pathfinding solver request, a deadline compatible with the frequency of the solver"
• This is very close to the definition of realtime software found on Wikipedia: "In computer science, real-time computing (RTC), or "reactive computing", is the study of hardware and software systems that are subject to a "real-time constraint", i.e., operational deadlines from event to system response"
• The good news is that there is a substantial literature on realtime scheduling!
To answer problem 1 we restate pathfinding solvers in a realtime formalism…
• Realtime formalism: a task x is defined by 4 parameters
  • x.s: starting time
  • x.d: deadline
  • x.e: execution requirement
  • x.p: execution period
• Adapting to pathfinding solvers:
  • Need to assume all tasks are periodic:
    • Easy for smoothing, steering or DA solvers
    • Trickier for A* and other graph-traversal solvers
  • Need an estimate of the duration of each core solver job:
    • Again quite simple for smoothing, steering or DA solvers
    • Much harder for A* and other graph-traversal solvers: graph-traversal tasks need to be decomposed into subtasks of constant duration
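The four parameters map directly onto a small record type. The sketch below is illustrative: `PeriodicTask` and `is_feasible` are hypothetical names, time is counted in abstract scheduling quanta, and the feasibility test is the standard utilization condition for periodic tasks whose deadline equals their period (total weight e/p at most the number of processors, each weight at most 1).

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class PeriodicTask:
    name: str
    s: int  # starting time, in scheduling quanta
    d: int  # relative deadline (assumed equal to the period here)
    e: int  # execution requirement per period, in quanta
    p: int  # period, in quanta

    @property
    def weight(self):
        """Share of one processor the task needs: e / p."""
        return Fraction(self.e, self.p)

def is_feasible(tasks, processors):
    """Utilization test for implicit-deadline periodic tasks."""
    return (all(t.weight <= 1 for t in tasks)
            and sum(t.weight for t in tasks) <= processors)

if __name__ == "__main__":
    # Made-up figures: one quantum of work per period for each solver.
    tasks = [
        PeriodicTask("steering", s=0, d=2, e=1, p=2),
        PeriodicTask("smoothing", s=0, d=3, e=1, p=3),
        PeriodicTask("path_chunk", s=0, d=6, e=1, p=6),
    ]
    print(is_feasible(tasks, processors=1))
```

Modelling an A* search as a periodic `path_chunk` task only works once the search has been cut into pieces of roughly constant cost, which is the decomposition called for above.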
…and select a scheduling algorithm
• P-fairness scheduling scheme (S.K. Baruah, N.K. Cohen, C.G. Plaxton, D.A. Varvel):
  • Defines a notion of proportionate progress called P-fairness
  • Uses it to define an efficient algorithm solving the periodic scheduling problem
• Cache-aware P-fair based scheduling scheme (J.H. Anderson, J.M. Calandrino, U.M. Devi):
  • Extends the P-fairness approach to avoid co-scheduling threads that would worsen the performance of shared caches
• Task-grouping P-fair based scheduling scheme (J.H. Anderson, J.M. Calandrino):
  • Extends the P-fairness approach to encourage grouping of tasks that share a common working set
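To give a feel for proportionate progress, here is a minimal sketch of earliest-pseudo-deadline-first (EPDF) scheduling. EPDF is a simplified relative of the published P-fair algorithms, not the algorithm from the papers cited above: it uses the same subtask windows but omits their tie-breaking rules, and is known to be optimal only on one or two processors. The function names and the task set are illustrative assumptions.

```python
from fractions import Fraction

def subtask_window(e, p, i):
    """Window [release, deadline) of the i-th subtask (i >= 1) of a task
    with weight e/p: release = floor((i-1)p/e), deadline = ceil(ip/e)."""
    release = ((i - 1) * p) // e
    deadline = -((-i * p) // e)
    return release, deadline

def epdf_schedule(tasks, processors, horizon):
    """tasks: dict name -> (e, p). Returns, per time slot, the names that run."""
    next_subtask = {name: 1 for name in tasks}
    schedule = []
    for t in range(horizon):
        eligible = []
        for name, (e, p) in tasks.items():
            release, deadline = subtask_window(e, p, next_subtask[name])
            if release <= t:
                eligible.append((deadline, name))
        # Earliest pseudo-deadline first; ties are broken by name.
        chosen = [name for _, name in sorted(eligible)[:processors]]
        for name in chosen:
            next_subtask[name] += 1
        schedule.append(chosen)
    return schedule

def max_abs_lag(tasks, schedule):
    """Largest |weight * t - quanta received| seen over the schedule.
    A schedule is P-fair when this stays strictly below 1."""
    worst = Fraction(0)
    done = {name: 0 for name in tasks}
    for t, slot in enumerate(schedule, start=1):
        for name in slot:
            done[name] += 1
        for name, (e, p) in tasks.items():
            worst = max(worst, abs(Fraction(e, p) * t - done[name]))
    return worst

if __name__ == "__main__":
    tasks = {"steering": (1, 2), "smoothing": (1, 3), "path": (1, 6)}
    plan = epdf_schedule(tasks, processors=1, horizon=6)
    print(plan, max_abs_lag(tasks, plan))
```

The lag check is the point of the exercise: each solver receives processor time at a steady rate instead of in bursts, which is what lets a high-frequency steering solver meet its deadline while a long path search makes progress in the background.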
Answering problem 2 (volatile data) requires a better description of memory models
• Programming models differ in the way they manage memory space
  • Homogeneous models: unified memory
  • Heterogeneous models: host / device space
• Today only homogeneous models offer transparent memory management
• For heterogeneous models, the developer still has to do a lot of work
Programming models differ in the way they manage memory space (cont'd)
[Diagram elements: Framework, Request Queue, Task, OpenCL Queue and Compute Kernels, split between a Host Memory Space and a Device Memory Space.]
There is a need for a locking mechanism between the framework and the kernel
It also requires a better description of user data
• There are 3 types of user data:
  • Read-only memory (e.g. navmesh in a static world)
    • The system needs to know when user data is available and when it is garbage
  • Read/write memory (e.g. navmesh in a dynamic world)
    • Same as the read-only approach, extended to secure the data modification stages
  • Work memory (e.g. open and closed sets for an A* solver)
    • Located where the solver is actually called
Data lifecycle states are introduced to handle R/O and R/W data volatility and dynamicity
[Statechart: Data Lifecycle States. Labels, in extraction order: UNLOAD Notification, Data Removed, End Notifying Data Removed, Notifying Data To Be Inserted, Data in Insertion, Data Ready, Data in Removal, Data Inserted, Ready for Insertion, LOAD Notification, On Dependency Insertion / Removal, Dependency Inserted / Removed, Data Locked.]
CRITICAL when data are not owned by the middleware
Problem 3 (triggering logic) requires choosing between a pull and a push triggering mechanism
• To limit computations over time, it is important to decide whether we want a pull or a push triggering model
• In a push model, the system polls all the characters to get a new steering policy
• In a pull model, the system gets update requirements from the game engine and only performs computations on the related characters
• The pull model controls the amount of computation better, but is not really compatible with a realtime approach
• The push model offers opportunities for optimization from a cache and task-grouping point of view
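The two triggering models, as they are defined on this slide, differ only in which characters get visited on a tick. A minimal sketch with hypothetical names (`push_update`, `pull_update`, and a toy `compute_steering` callback):

```python
def push_update(characters, compute_steering):
    """Push model (slide's definition): visit every character on every tick."""
    return {name: compute_steering(state) for name, state in characters.items()}

def pull_update(characters, requested, compute_steering):
    """Pull model (slide's definition): only compute for the characters
    the game engine has asked about."""
    return {name: compute_steering(characters[name]) for name in requested}

if __name__ == "__main__":
    calls = []

    def compute_steering(state):
        # Toy solver: record the call and step one unit along x.
        calls.append(state)
        return (state[0] + 1, state[1])

    characters = {"orc": (0, 0), "elf": (5, 5), "troll": (9, 1)}

    push_update(characters, compute_steering)
    print("push computed", len(calls), "characters")

    calls.clear()
    pull_update(characters, ["elf"], compute_steering)
    print("pull computed", len(calls), "characters")
```

The push loop has a fixed, predictable shape, which is what makes periodic scheduling and cache-friendly grouping possible; the pull loop does less work but its load depends on what the engine happens to request.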
Guidelines for a new parallel programming model for realtime AI
• Extends to the full AI the rationale described in the previous slides
• Data / message flow based system
• Realtime P-fair scheduling algorithm
• Compatible with heterogeneous programming models
• Push triggering mechanism
Introducing the concept of Working Unit (WU)
• A WU receives requests to process
• A WU communicates with another WU ONLY through strongly typed requests
• Requests are explicitly exposed in the WU interface
• A request can be synchronous or asynchronous (2 different implementations of the request)
• A WU is responsible for the Host<->Device serialization of its context
[Diagram elements: Working Unit, Owner / Children, Event Handler, Incoming Requests Queues, Requests Interface, Context, Context Accessors, Context Serializer, Host Code, Device Code.]
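A Working Unit along these lines could look roughly like the sketch below. It is an assumption-heavy illustration: the `WorkingUnit` API (`call`, `post`, `process`), the `CanGoRequest` type and the toy `can_go` handler are all hypothetical, and Host<->Device serialization of the context is left out entirely.

```python
import queue
from dataclasses import dataclass

@dataclass(frozen=True)
class CanGoRequest:
    """A strongly typed request: can a character go from start to goal?"""
    start: tuple
    goal: tuple

def can_go(request):
    # Toy handler (assumption): reachable if within 10 cells, Manhattan distance.
    dx = abs(request.start[0] - request.goal[0])
    dy = abs(request.start[1] - request.goal[1])
    return dx + dy <= 10

class WorkingUnit:
    def __init__(self, handlers):
        self._handlers = dict(handlers)  # request type -> handler function
        self._incoming = queue.Queue()   # asynchronous requests wait here

    def _handler_for(self, request):
        try:
            return self._handlers[type(request)]
        except KeyError:
            raise TypeError(
                f"unsupported request: {type(request).__name__}"
            ) from None

    def call(self, request):
        """Synchronous request: answered immediately on the caller's thread."""
        return self._handler_for(request)(request)

    def post(self, request, on_done):
        """Asynchronous request: queued now, answered by a later process()."""
        self._handler_for(request)  # reject unknown request types early
        self._incoming.put((request, on_done))

    def process(self, max_requests):
        """Serve up to max_requests queued requests; return how many ran."""
        handled = 0
        while handled < max_requests and not self._incoming.empty():
            request, on_done = self._incoming.get()
            on_done(self._handler_for(request)(request))
            handled += 1
        return handled

if __name__ == "__main__":
    wu = WorkingUnit({CanGoRequest: can_go})
    print(wu.call(CanGoRequest((0, 0), (3, 4))))
    wu.post(CanGoRequest((0, 0), (20, 0)), print)
    wu.process(max_requests=8)
```

Keying handlers on the request type is one simple way to make the interface explicit: a unit can only be asked the questions it has declared, and the `max_requests` bound gives a scheduler a handle on how much work one call may do.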
The system works on a mixture of events and requests
[Diagram elements: Game Engine with Worlds (World 1, …), Entities (Entity 1, Entity 2, …), Brains (Brain 1, Brain 2, …), Pathfinders (PF 1, PF 2, …), a Geometry Mgr and a Pathdata Mgr; the World, Entity, Brain and Pathfinding Update Queues and the IsVisible and CanGo Queues; the World Update, Entity Update, Brain Update, Pathfinding, IsVisible and CanGo WUs. Arrows are labelled Request or Event.]
The underlying architecture would rely on an event broadcaster and communicating components
Communicating Component (CC) = Working Unit for parallelism
[Diagram elements: two groups of SearchPath, SelectTarget, ComputeDA and Steering CCs, each group with a Local Events Broadcaster, and one Global Events Broadcaster.]
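A broadcaster of this kind is essentially a publish/subscribe hub. The sketch below is illustrative only: `EventBroadcaster`, its `escalate` flag and the idea that a local broadcaster forwards selected events to a global parent are assumptions about how the two levels in the diagram relate.

```python
from collections import defaultdict

class EventBroadcaster:
    def __init__(self, parent=None):
        self._subscribers = defaultdict(list)  # event name -> callbacks
        self._parent = parent                  # e.g. the global broadcaster

    def subscribe(self, event_name, callback):
        self._subscribers[event_name].append(callback)

    def broadcast(self, event_name, payload=None, escalate=False):
        # Deliver locally first, then optionally forward to the parent.
        for callback in list(self._subscribers[event_name]):
            callback(payload)
        if escalate and self._parent is not None:
            self._parent.broadcast(event_name, payload)

if __name__ == "__main__":
    global_events = EventBroadcaster()
    entity_events = EventBroadcaster(parent=global_events)

    # A SelectTarget component reacts to its own entity's new path.
    entity_events.subscribe("path_updated", lambda p: print("select target on", p))
    # Something global (say, a debug view) listens to every entity.
    global_events.subscribe("path_updated", lambda p: print("global saw", p))

    entity_events.broadcast("path_updated", ["A", "B"], escalate=True)
```

Keeping most traffic on the local broadcaster means components that belong to the same entity talk to each other without touching shared global state.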
Open challenges
• Customized objects vs. data / services model
• Interruptibility
• Multi-platform
• Scheduling algorithm performance
• And many more…
Multiplatform
• Too many programming languages!
  • C++
  • C for OpenCL
  • C for CUDA
  • C99 for SPURS
  • HLSL 5 for DirectX
  • …
• Which standards will emerge?
• Which standards will be chosen in future consoles?