Generalizing Plans to New Environments in Multiagent Relational MDPs. Carlos Guestrin, Daphne Koller, Stanford University
Multiagent Coordination Examples • Search and rescue • Factory management • Supply chain • Firefighting • Network routing • Air traffic control. Common challenges: • Multiple, simultaneous decisions • Exponentially-large spaces • Limited observability • Limited communication
Real-time Strategy Game • Peasants collect resources and build • Footmen attack enemies • Buildings train peasants and footmen • [Figure: peasant, footman, and building units]
Scaling up by Generalization • Exploit similarities between world elements • Generalize plans: from a set of worlds to a new, unseen world • Avoid the need to replan • Tackle larger problems • Formalize the notion of "similar" elements • Compute generalizable plans
Relational Models and MDPs • Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, … • Relations: Collects, Builds, Trains, Attacks, … • Instances: Peasant1, Peasant2, Footman1, Enemy1, … • Value functions at the class level: objects of the same class make the same contribution to the value function • Factored MDP analogue of PRMs [Koller, Pfeffer '98]
Relational MDPs • Class-level transition probabilities depend on: attributes, actions, and attributes of related objects • Class-level reward function • Instantiation (world): number of objects and relations • Yields a well-defined MDP • [Figure: class-level DBN for the Peasant class P (action AP) and the Gold class G it Collects from, with next-state variables P', G'; see the schematic below]
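Reading the diagram, a schematic class-level transition model (the notation here is an illustration, not copied from the paper) is

\[
P_{\text{Peasant}}\bigl(P' \mid P,\ G,\ A_P\bigr),
\]

i.e., the next state P' of a Peasant depends on its own attributes P, the attributes G of the Gold object it Collects from, and its action A_P; every Peasant instance shares this single conditional distribution.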
Planning in a World • Long-term planning by solving MDP • # states exponential in number of objects • # actions exponential • Efficient approximation by exploiting structure! • RMDP world is a factored MDP
Roadmap to Generalization • Solve 1 world • Compute generalizable value function • Tackle a new world
World is a Factored MDP • [Figure: dynamic Bayesian network with columns for state, dynamics, decisions, and rewards; state variables P, G, F, H, E, actions AP, AF, reward R, next-state variables P', G', F', E'] • Transition factors involve only small sets of variables, e.g. P(F'|F,G,H,AF)
Long-term Utility = Value of MDP [Manne `60] • Value computed by linear programming (standard form sketched below): • One variable V(x) for each state x • One constraint for each state x and action a • Number of states and actions exponential!
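For reference, a standard rendering of this exact LP (the discount factor \gamma and state-relevance weights \alpha(x) are assumed here; they are not given on the slide):

\[
\begin{aligned}
\min_{V}\ & \sum_x \alpha(x)\, V(x) \\
\text{s.t.}\ & V(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x') \qquad \forall x, a
\end{aligned}
\]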
Approximate Value Functions • Linear combination of restricted-domain functions [Bellman et al. `63] [Tsitsiklis & Van Roy `96] [Koller & Parr `99,`00] [Guestrin et al. `01] (see the form below) • Each Vo depends on the state of an object and its related objects: state of footman, status of barracks • Must find Vo giving a good approximate value function
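In symbols, the approximation is a weighted sum of local terms (the weights w_o are the free parameters; notation assumed):

\[
V(x) \;\approx\; \sum_{o} w_o\, V_o(x_o),
\]

where x_o is the state of object o and its related objects (e.g., the state of a footman together with the status of the barracks).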
Single LP Solution for Factored MDPs [Schweitzer and Seidmann ‘85] • Variables for each Vo , for each object • Polynomially many LP variables • One constraint for every state and action • Exponentially many LP constraints • Vo , Qo depend on small sets of variables/actions • Exploit structure as in variable elimination [Guestrin, Koller, Parr `01]
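Substituting the linear form into the exact LP gives the single approximate LP (a standard rendering in the spirit of Schweitzer and Seidmann; only the weights w_o are variables):

\[
\begin{aligned}
\min_{w}\ & \sum_x \alpha(x) \sum_o w_o\, V_o(x_o) \\
\text{s.t.}\ & \sum_o w_o\, V_o(x_o) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a) \sum_o w_o\, V_o(x'_o) \qquad \forall x, a
\end{aligned}
\]

The number of variables is polynomial, but there is still one constraint per state-action pair; the next slides collapse these constraints using the factored structure of the V_o and Q_o.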
Representing Exponentially Many Constraints • Exponentially many linear constraints are equivalent to a single nonlinear constraint: 0 ≥ max_{x,a} [ R(x,a) + γ Σ_{x'} P(x'|x,a) V_w(x') − V_w(x) ], where V_w is the linear approximation
Variable Elimination • Can use variable elimination to maximize over the state space [Bertele & Brioschi '72] (Python sketch below):
max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
= max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ] ]
= max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ]
• [Figure: factor graph over variables A, B, C, D] • Here we need only 23, instead of 63, sum operations • As in Bayes nets, maximization is exponential in the tree-width
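To make the elimination step concrete, here is a minimal, self-contained Python sketch (not the authors' code; the binary domains and the factor values are made up for illustration) that maximizes a sum of local functions by eliminating one variable at a time:

```python
# Maximize f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) by variable elimination,
# as on the slide. All variables are assumed binary for illustration.
from itertools import product

def eliminate_max(factors, var):
    """Max out `var`: combine all factors mentioning it into one new factor."""
    touching = [f for f in factors if var in f["scope"]]
    rest = [f for f in factors if var not in f["scope"]]
    new_scope = sorted({v for f in touching for v in f["scope"]} - {var})
    new_table = {}
    for assignment in product([0, 1], repeat=len(new_scope)):
        ctx = dict(zip(new_scope, assignment))
        best = max(
            sum(f["table"][tuple(dict(ctx, **{var: val})[v] for v in f["scope"])]
                for f in touching)
            for val in (0, 1)
        )
        new_table[assignment] = best
    return rest + [{"scope": new_scope, "table": new_table}]

def max_sum(factors, order):
    """Maximize the sum of all factors by eliminating variables in `order`."""
    for var in order:
        factors = eliminate_max(factors, var)
    # All variables eliminated: remaining factors have empty scope.
    return sum(f["table"][()] for f in factors)

# Example: four local functions over A, B, C, D (arbitrary illustrative values).
make = lambda scope, table: {"scope": scope, "table": table}
factors = [
    make(["A", "B"], {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 3}),
    make(["A", "C"], {(0, 0): 2, (0, 1): 0, (1, 0): 1, (1, 1): 1}),
    make(["C", "D"], {(0, 0): 0, (0, 1): 4, (1, 0): 1, (1, 1): 0}),
    make(["B", "D"], {(0, 0): 3, (0, 1): 1, (1, 0): 0, (1, 1): 2}),
]
print(max_sum(factors, order=["D", "C", "B", "A"]))  # same as maxing over all 16 assignments
```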
Representing the Constraints • Functions are factored; use variable elimination to represent the constraints compactly • Number of constraints exponentially smaller
Roadmap to Generalization • Solve 1 world • Compute generalizable value function • Tackle a new world
Generalization • Sample a set of worlds • Solve a linear program for these worlds: • Obtain class value functions • When faced with new problem: • Use class value function • No re-planning needed
Worlds and RMDPs • Meta-level MDP: defined by the distribution P(ω) over worlds • Meta-level LP: a single LP whose objective and constraints range over the worlds
Class-level Value Functions • Approximate solution to the meta-level MDP • Linear approximation • Value function defined at the class level (schematic form below) • All instances of a class use the same local value function
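A plausible rendering of the class-level value function (notation assumed): for a world \omega with objects o grouped into classes C,

\[
V_\omega(x) \;=\; \sum_{C} \sum_{o \in C(\omega)} V_C(x_o),
\]

so every instance of a class C contributes through the same local function V_C.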
Class-level LP • Constraints for each world represented by the factored LP • Number of worlds exponential or infinite • Sample worlds from P(ω)
Theorem • Exponentially (infinitely) many worlds: do we need exponentially many samples? NO! • With enough sampled worlds (far fewer than the number of worlds), the value function is within ε, with probability at least 1−δ • Rmax is the maximum class reward • Proof method related to [de Farias, Van Roy '02]
LP with sampled worlds • Solve LP for sampled worlds • Use Factored LP for each world • Obtain class-level value function • New world: instantiate value function and act
Learning Classes of Objects • Which classes of objects have the same value function? • Plan for sampled worlds individually • Use the resulting value functions as "training data" • Find objects with similar values • Include features of the world • Used decision-tree regression in experiments (illustrative sketch below)
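As a purely illustrative sketch of the "value functions as training data" idea (not the authors' code; the features, data, and use of scikit-learn are assumptions), one could fit a decision-tree regressor to per-object values and group objects by leaf:

```python
# Hypothetical example: group objects into classes by their learned local values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per (object, sampled world): simple features of the object in that world.
# Columns: [num_neighbors, is_server, is_leaf_node]   (made-up features)
X = np.array([
    [3, 1, 0],   # server-like machine
    [1, 0, 1],   # leaf machine
    [2, 0, 0],   # intermediate machine
    [3, 1, 0],
    [1, 0, 1],
    [2, 0, 0],
])
# Target: the object's local value contribution found when solving that sampled world.
y = np.array([4.1, 1.2, 2.3, 3.9, 1.1, 2.5])

tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)

# Objects assigned to the same leaf get the same class-level value function.
leaf_ids = tree.apply(X)
print(leaf_ids)  # three groups, analogous to the Server/Intermediate/Leaf classes
```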
Summary of Generalization Algorithm • Model domain as Relational MDPs • Pick local object value functions Vo • Learn classes by solving some instances • Sample set of worlds • Factored LP computes class-level value function
A New World • When faced with a new world ω, the value function is instantiated from the class-level components • The Q function decomposes the same way (see below) • At each state, choose the action maximizing Q(x,a) • Number of actions is exponential! • But each QC depends only on a few objects
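Correspondingly, the instantiated Q function decomposes over the objects of each class (schematic form, notation assumed):

\[
Q_\omega(x, a) \;=\; \sum_{C} \sum_{o \in C(\omega)} Q_C(x_o, a_o),
\]

where each Q_C touches only object o, its related objects, and their actions; this is what makes the coordination-graph maximization on the next slides tractable.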
Local Q-function Approximation • [Figure: machines M1–M4 in a ring; Q3 is associated with Agent 3, which observes only X2 and X3] • Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4) • Limited observability: agent i only observes the variables in Qi • Must choose a joint action maximizing Σi Qi
Maximizing Σi Qi: Coordination Graph • Use variable elimination for maximization [Bertele & Brioschi '72]:
max_{A1,A2,A3,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
= max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
= max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3) ]
• [Figure: coordination graph over agents A1–A4; example rule on one edge: if A2 attacks and A3 defends, then A4 gets $10] • Limited communication for optimal action choice • Communication bandwidth = induced width of the coordination graph
Summary of Algorithm • Model domain as Relational MDPs • Factored LP computes class-level value function • Reuse class-level value function in new world
Experimental Results • SysAdmin problem • [Figure: network topologies: star, unidirectional ring, ring of rings, server]
Classes of Objects Discovered • Learned 3 classes: Server, Intermediate, and Leaf
Results [with Gearhart and Kanodia] • 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy • Reward for dead enemy • About 1 million state/action pairs • Solve with Factored LP • Some factors are exponential • Coordination graph for action selection
Generalization • 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy • Reward for dead enemy • About 3 trillion state/action pairs • Instantiate the generalizable value function • At run time, factors are polynomial • Coordination graph for action selection
The 3 aspects of this talk • Scaling up collaborative multiagent planning • Exploiting structure • Generalization • Factored representation and algorithms • Relational MDP, Factored LP, coordination graph • Freecraft as a benchmark domain
Conclusions • RMDP: compact representation for a set of similar planning problems • Solve a single instance with factored MDP algorithms • Tackle sets of problems with class-level value functions: efficient sampling of worlds; learn classes of value functions • Generalization to new domains: avoid replanning; solve larger, more complex MDPs