Multiagent Coordination, Planning, Learning and Generalization with Factored MDPs
Carlos Guestrin, Daphne Koller — Stanford University
Includes collaborations with: Geoffrey Gordon1, Michail Lagoudakis2, Ronald Parr2, Relu Patrascu4, Dale Schuurmans4, Shobha Venkataraman3
1Carnegie Mellon University 2Duke University 3Stanford University 4University of Waterloo
Multiagent Coordination Examples • Search and rescue • Factory management • Supply chain • Firefighting • Network routing • Air traffic control
Common challenges: • Multiple, simultaneous decisions • Limited observability • Limited communication
Network Management Problem
Administrators must coordinate to maximize global reward
[Figure: two-time-slice model over machines M1–M4. Each machine i has status Si and load Li at time t (Si’, Li’ at time t+1) and an administrator action Ai; status depends on neighboring machines, and reward Ri is received when a process terminates successfully.]
Joint Decision Space • Represent as MDP: • Action space: joint action a= {a1,…, an} for all agents • State space: joint state x of entire system • Reward function: total reward r • Action space is exponential in # agents • State space is exponential in # variables • Global decision requires complete observation
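To make the exponential blow-up concrete, here is a minimal sketch (not from the talk; agent count and action encoding are made up) that enumerates the joint action space for a handful of agents with binary actions:

```python
from itertools import product

n_agents = 4                  # hypothetical: 4 SysAdmin agents
local_actions = [0, 1]        # e.g., 0 = do nothing, 1 = reboot

# Joint action a = (a1, ..., an): one choice per agent.
joint_actions = list(product(local_actions, repeat=n_agents))
print(len(joint_actions))     # 2**4 = 16; grows as |A|**n with the number of agents
```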
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Long-Term Utilities • One-step utility: SysAdmin Ai receives reward ($) if a process completes • Total utility: sum of rewards • Optimal action requires long-term planning • Long-term utility Q(x,a): expected reward, given current state x and action a • Optimal action at state x is the one maximizing Q: a*(x) = argmax_a Q(x,a)
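The slide leaves the formula implicit; a standard Bellman-style way to write the long-term utility (the discount factor γ is my addition, not spelled out on the slide) is:

```latex
% Long-term utility of taking action a in state x and acting optimally afterwards
Q(\mathbf{x}, \mathbf{a}) \;=\; R(\mathbf{x}, \mathbf{a})
  \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\,
        \max_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}')
% Optimal action at state x
\mathbf{a}^*(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a})
```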
Local Q Function Approximation [Guestrin, Koller, Parr ‘01]
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
• Q3 is associated with Agent 3, who observes only X2 and X3
• Limited observability: agent i only observes the variables in Qi
• Must choose actions to maximize ∑i Qi
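As a rough illustration of this decomposition, the sketch below represents each local Qi as a function of a few agents’ actions and state variables and evaluates their sum; the functional forms and numbers are invented for illustration only:

```python
# Minimal sketch of a local-Q decomposition (illustrative values, not from the talk).
# Each local term sees only a small subset of agents and state variables.

def Q1(a1, a4, x1, x4): return 1.0 * x1 + 0.5 * (a1 == 1)
def Q2(a1, a2, x1, x2): return 0.8 * x2 - 0.2 * (a1 == a2)
def Q3(a2, a3, x2, x3): return 1.2 * x3 + 0.3 * (a2 == 1)
def Q4(a3, a4, x3, x4): return 0.6 * x4 + 0.1 * (a3 != a4)

def global_Q(a, x):
    """Approximate Q(x, a) as the sum of the local terms."""
    a1, a2, a3, a4 = a
    x1, x2, x3, x4 = x
    return (Q1(a1, a4, x1, x4) + Q2(a1, a2, x1, x2)
            + Q3(a2, a3, x2, x3) + Q4(a3, a4, x3, x4))

print(global_Q((1, 0, 1, 0), (1, 1, 0, 1)))
```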
Maximizing ∑i Qi: Coordination Graph [Guestrin, Koller, Parr ‘01]
[Figure: coordination graph over agents A1–A11; an edge connects agents whose local Qi functions share an action.]
• Limited communication for optimal action choice
• Comm. bandwidth = induced width of coord. graph
• Trees don’t increase communication requirements
• Cycles require graph triangulation
Variable Coordination Structure [Guestrin, Venkataraman, Koller ‘02]
• With whom should I coordinate? It depends!
• Real-world: coordination structure must be dynamic
• Exploit context-specific independence
• Obtain a coordination structure that changes with the state
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Where do the Qi come from?
• Long-term planning requires a Markov Decision Process: # states exponential, # actions exponential
• Efficient approximation by exploiting structure!
• Use function approximation to find the Qi:
Q(X1, …, X4, A1, …, A4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
Dynamic Decision Diagram
[Figure: two-time-slice diagram with state variables X1–X4 (state dynamics), actions A1–A4 (decisions), and rewards R1–R4; each next-step variable depends on only a few parents, e.g. P(X1’|X1, X4, A1).]
Long-term Utility = Value of MDP [Manne `60]
• Value computed by linear programming:
• One variable V(x) for each state
• One constraint for each state x and action a
• Number of states and actions exponential!
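For reference, the exact LP alluded to here is the standard one; the state-relevance weights α(x) and discount γ below are my notation, not spelled out on the slide:

```latex
\min_{V} \;\; \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x})
\quad \text{s.t.} \quad
V(\mathbf{x}) \;\ge\; R(\mathbf{x}, \mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\, V(\mathbf{x}')
\qquad \forall\, \mathbf{x}, \mathbf{a}
```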
Decomposable Value Functions
Linear combination of restricted-domain functions [Bellman et al. `63] [Tsitsiklis & Van Roy `96] [Koller & Parr `99,`00] [Guestrin et al. `01]
• Each hi is the status of a small part of the complex system: status of a machine and its neighbors, load on a machine
• Must find weights w giving a good approximate value function
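In symbols (my notation; x[Ci] denotes the small subset of state variables in the scope of hi):

```latex
% Approximate value function as a weighted sum of restricted-domain basis functions
V(\mathbf{x}) \;\approx\; \hat{V}_{\mathbf{w}}(\mathbf{x})
  \;=\; \sum_{i} w_i \, h_i\!\left(\mathbf{x}[C_i]\right)
```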
Single LP Solution for Factored MDPs [Schweitzer and Seidmann ‘85] [de Farias and Van Roy ‘01]
• One variable wi for each basis function → polynomially many LP variables
• One constraint for every state and action → exponentially many LP constraints
• hi, Qi depend on small sets of variables/actions
• Exploit structure as in variable elimination [Guestrin, Koller, Parr `01]
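Substituting the linear value function into the exact LP gives the approximate LP this slide refers to (same notation as above; the slide does not write it out). Because each hi, reward, and transition factor depends on only a few variables, the exponentially many constraints can be encoded compactly with a variable-elimination-style construction, as the last bullet notes.

```latex
\min_{\mathbf{w}} \;\; \sum_{i} w_i \sum_{\mathbf{x}} \alpha(\mathbf{x})\, h_i(\mathbf{x})
\quad \text{s.t.} \quad
\sum_{i} w_i\, h_i(\mathbf{x}) \;\ge\;
  R(\mathbf{x}, \mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) \sum_{i} w_i\, h_i(\mathbf{x}')
\qquad \forall\, \mathbf{x}, \mathbf{a}
```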
Summary of Algorithm • Pick local basis functions hi • Single LP to compute local Qi’s in factored MDP • Coordination graph computes maximizing action
Multiagent Policy Quality
Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99]
[Figure: policy quality of the LP single-basis and LP pair-basis solutions versus the distributed reward and distributed value baselines.]
Multiagent Running Time
[Figure: running time for single-basis and pair-basis approximations on the 'star' and 'ring of rings' network topologies.]
Solve Very Large MDPs
Solved MDPs with: 500 agents; over 10^150 actions; and an astronomically large number of states (the exact state count, several hundred digits long, was written out on the slide).
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Reinforcement Learning
• Do we know the model? NO: reinforcement learning
• Training data: <x, a, x’, r>, collected while acting in the world
• Model-free approach: learn the Q-function directly [Guestrin, Lagoudakis, Parr `02]
• Model-based approach: learn a factored MDP [Guestrin, Patrascu, Schuurmans `02]
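As a rough sketch of the model-free direction, here is a generic LSTD-Q weight fit with linear features — a plain single-agent sketch, not the coordinated LSPI algorithm from the talk; the feature map phi and policy pi are hypothetical placeholders:

```python
import numpy as np

def lstdq_weights(samples, phi, pi, k, gamma=0.95, reg=1e-6):
    """Fit linear Q-function weights from <x, a, x', r> samples (generic LSTD-Q).

    phi(x, a) -> length-k feature vector; pi(x) -> action of the policy being
    evaluated. Regularization keeps the system solvable with few samples.
    """
    A = reg * np.eye(k)
    b = np.zeros(k)
    for x, a, x_next, r in samples:
        f = phi(x, a)
        f_next = phi(x_next, pi(x_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)   # Q_w(x, a) ≈ w · phi(x, a)
```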
Power Grid – Multiagent LSPI
[Figure: performance on the power grid domain of Schneider et al. ‘99; lower is better.]
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Hierarchical and Relational Models [Guestrin, Gordon ‘02] [Guestrin, Koller ‘02]
• Classes of objects, instances, relations
• Value functions at the class level
• Factored MDP equivalents of: OOBNs [Koller, Pfeffer ‘97], PRMs [Koller, Pfeffer ‘98]
[Figure: relational model with Server and Client classes.]
Generalization
• Sample a set of scenarios
• Solve a linear program over these scenarios to obtain class value functions
• When faced with a new problem: use the class value function; no re-planning needed
Theorem [number-of-samples bound shown on slide]:
Exponentially (infinitely) many worlds → need exponentially many samples? NO!
Value function within ε, with prob. at least 1−δ.
Proof method related to [de Farias, Van Roy ‘02]
Classes of Objects Discovered • Learned 3 classes
[Figure: network tree with nodes labeled Server, Intermediate, and Leaf.]
Roadmap for Multiagents
• Multiagent Coordination and Planning
• Variable Coordination Structure
• Loopy Approximate Linear Programming
• Hierarchical Factored MDPs
• Coordinated Reinforcement Learning
• Relational MDPs
Conclusions • Multiagent planning algorithm: • Limited Communication • Limited Observability • Unified view of function approximation and multiagent communication • Single LP solution is simple and very efficient • Efficient reinforcement learning • Generalization to new domains • Exploit structure to reduce computation costs! • Solve very large MDPs efficiently
Maximizing ∑i Qi: Coordination Graph [Guestrin, Koller, Parr ‘01]
• Use variable elimination for maximization: [Bertele & Brioschi ‘72]

max_{A1,A2,A3,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g(A2,A3) ]

• e.g., if A2 reboots and A3 does nothing, then A4 gets $10
• Here we need only 23, instead of 63, sum operations
• Limited communication for optimal action choice
• Comm. bandwidth = induced width of coord. graph
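A minimal sketch of the variable-elimination maximization above, assuming binary actions and made-up Q tables keyed by action pairs (the numbers are illustrative, not from the talk):

```python
from itertools import product

ACTIONS = (0, 1)  # e.g., 0 = do nothing, 1 = reboot (binary for simplicity)

# Illustrative local Q tables: Q1(A1,A2), Q2(A1,A3), Q3(A3,A4), Q4(A2,A4)
Q1 = {(a1, a2): 2.0 * a1 - a2 for a1, a2 in product(ACTIONS, repeat=2)}
Q2 = {(a1, a3): a1 + 0.5 * a3 for a1, a3 in product(ACTIONS, repeat=2)}
Q3 = {(a3, a4): 1.5 * a4 - a3 for a3, a4 in product(ACTIONS, repeat=2)}
Q4 = {(a2, a4): a2 * a4       for a2, a4 in product(ACTIONS, repeat=2)}

# Eliminate A4 first: g(A2, A3) = max_{A4} [ Q3(A3, A4) + Q4(A2, A4) ]
g = {(a2, a3): max(Q3[(a3, a4)] + Q4[(a2, a4)] for a4 in ACTIONS)
     for a2, a3 in product(ACTIONS, repeat=2)}

# Then maximize the remaining terms over A1, A2, A3.
best = max(
    (Q1[(a1, a2)] + Q2[(a1, a3)] + g[(a2, a3)], (a1, a2, a3))
    for a1, a2, a3 in product(ACTIONS, repeat=3)
)
print(best)  # (max value, maximizing (A1, A2, A3)); A4 is recovered by a backward pass
```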