
Multiagent Coordination, Planning, Learning and Generalization with Factored MDPs






Presentation Transcript


  1. Multiagent Coordination, Planning, Learning and Generalization with Factored MDPs. Carlos Guestrin, Daphne Koller, Stanford University. Includes collaborations with: Geoffrey Gordon1, Michail Lagoudakis2, Ronald Parr2, Relu Patrascu4, Dale Schuurmans4, Shobha Venkataraman3 (1Carnegie Mellon University, 2Duke University, 3Stanford University, 4University of Waterloo)

  2. Multiagent Coordination Examples • Search and rescue • Factory management • Supply chain • Firefighting • Network routing • Air traffic control • Multiple, simultaneous decisions • Limited observability • Limited communication

  3. Network Management Problem • Administrators must coordinate to maximize global reward • [Figure: two-slice dynamic Bayesian network over neighboring machines M1–M4; each machine i has Status Si, Load Li, and administrator action Ai at time t, with Si', Li' at time t+1, and reward Ri received when a process terminates successfully]

  4. Joint Decision Space • Represent as MDP: • Action space: joint action a = {a1,…, an} for all agents • State space: joint state x of entire system • Reward function: total reward r • Action space is exponential in # agents • State space is exponential in # variables • Global decision requires complete observation
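
In the talk's notation, the joint model on this slide can be written compactly. The following is a minimal LaTeX sketch: the sum-of-local-rewards form is taken from the Ri nodes that appear on slide 12, and the remaining symbols are standard, not shown on the slide itself.

```latex
% Joint MDP over n agents and m state variables (sketch, following the slide).
\[
\mathcal{M} = (\mathbf{X}, \mathbf{A}, P, R), \qquad
\mathbf{A} = A_1 \times \cdots \times A_n, \qquad
\mathbf{X} = \mathrm{Dom}(X_1) \times \cdots \times \mathrm{Dom}(X_m),
\]
\[
R(\mathbf{x}, \mathbf{a}) = \sum_i R_i(\mathbf{x}, \mathbf{a}),
\qquad
|\mathbf{A}| = \prod_{i=1}^{n} |A_i|, \qquad
|\mathbf{X}| = \prod_{j=1}^{m} |\mathrm{Dom}(X_j)|.
\]
```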

  5. Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization

  6. Long-Term Utilities • One-step utility: SysAdmin Ai receives reward ($) if a process completes • Total utility: sum of rewards • Optimal action requires long-term planning • Long-term utility Q(x,a): expected reward, given current state x and action a • Optimal action at state x: the action a maximizing Q(x,a) (see the sketch below)
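
Written out, the greedy rule in the last bullet is the standard one. A minimal sketch of the definitions behind it, assuming a discounted criterion with factor γ (the slide only says "sum of rewards"):

```latex
% Long-term utility and greedy action choice (standard discounted form; sketch).
\[
Q(\mathbf{x}, \mathbf{a}) \;=\; R(\mathbf{x}, \mathbf{a})
  \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\, V(\mathbf{x}'),
\qquad
\mathbf{a}^*(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}}\; Q(\mathbf{x}, \mathbf{a}).
\]
```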

  7. Local Q-function Approximation [Guestrin, Koller, Parr ‘01] • Q(A1,…,A4, X1,…,X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4) • Q3 is associated with Agent 3, which observes only X2 and X3 • Limited observability: agent i only observes the variables in Qi • Must choose the action maximizing Σi Qi

  8. Maximizing Σi Qi: Coordination Graph [Guestrin, Koller, Parr ‘01] • Limited communication for optimal action choice • Comm. bandwidth = induced width of coordination graph • Trees don’t increase communication requirements • Cycles require graph triangulation • [Figure: coordination graph over agents A1–A11]

  9. Variable Coordination Structure [Guestrin, Venkataraman, Koller ‘02] • With whom should I coordinate? • It depends! • Real-world: coordination structure must be dynamic • Exploit context-specific independence • Obtain coordination structure changing with state

  10. Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization

  11. Where do the Qi come from? • Long-term planning requires a Markov Decision Process: # states exponential, # actions exponential • Efficient approximation by exploiting structure! • Use function approximation to find Qi: Q(X1, …, X4, A1, …, A4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)

  12. Dynamic Decision Diagram • [Figure: two-slice diagram over machines M1–M4 showing state dynamics (X1,…,X4 at time t and X1',…,X4' at time t+1), decisions A1,…,A4, and rewards R1,…,R4] • State dynamics factor over small parent sets, e.g., P(X1'|X1, X4, A1)
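
The diagram encodes a factored transition model. A sketch of the product form, where Parents(Xi') stands for the small set of state and action variables the diagram connects to Xi':

```latex
% Factored transition model encoded by the dynamic decision diagram (sketch).
\[
P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) \;=\;
  \prod_i P\big(X_i' \,\big|\, \mathrm{Parents}(X_i')\big),
\qquad \text{e.g.}\quad P(X_1' \mid X_1, X_4, A_1).
\]
```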

  13. Long-term Utility = Value of MDP [Manne `60] • Value computed by linear programming (see the LP sketch below): • One variable V(x) for each state x • One constraint for each state x and action a • Number of states and actions is exponential!
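
For reference, the exact LP the slide describes, in its standard Manne-style form; the state-relevance weights α(x) > 0 and the discount factor γ are left implicit on the slide.

```latex
% Exact LP for the value function: one variable V(x) per state,
% one constraint per state-action pair (sketch of the standard formulation).
\begin{align*}
\min_{V}\;\; & \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x}) \\
\text{s.t.}\;\; & V(\mathbf{x}) \;\ge\; R(\mathbf{x}, \mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\, V(\mathbf{x}')
  \qquad \forall\, \mathbf{x}, \mathbf{a}.
\end{align*}
```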

  14. Decomposable Value Functions • Linear combination of restricted domain functions [Bellman et al. `63] [Tsitsiklis & Van Roy `96] [Koller & Parr `99,`00] [Guestrin et al. `01] • Each hi is the status of a small part of a complex system: • Status of a machine and its neighbors • Load on a machine • Must find w giving a good approximate value function
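
The decomposition the slide refers to, written out: each basis function hi looks only at a small subset xi of the state variables (e.g., one machine and its neighbors), and only the weights w are free.

```latex
% Linear, factored value-function approximation (sketch).
\[
V(\mathbf{x}) \;\approx\; V_{\mathbf{w}}(\mathbf{x}) \;=\; \sum_i w_i\, h_i(\mathbf{x}_i),
\qquad \mathbf{x}_i \subseteq \{X_1, \ldots, X_m\} \text{ small}.
\]
```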

  15. Single LP Solution for Factored MDPs [Schweitzer and Seidmann ‘85] [de Farias and Van Roy ‘01] • One variable wi for each basis function → polynomially many LP variables • One constraint for every state and action → exponentially many LP constraints • hi, Qi depend on small sets of variables/actions → exploit structure as in variable elimination [Guestrin, Koller, Parr `01] (see the LP sketch below)
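
Substituting Vw into the exact LP of slide 13 gives the approximate LP this slide describes. A sketch, with αi = Σx α(x) hi(x); this aggregation step is implied rather than shown on the slide.

```latex
% Approximate LP over the basis-function weights w (sketch).
\begin{align*}
\min_{\mathbf{w}}\;\; & \sum_i w_i\, \alpha_i \\
\text{s.t.}\;\; & \sum_i w_i\, h_i(\mathbf{x}) \;\ge\;
  R(\mathbf{x}, \mathbf{a}) + \gamma \sum_{\mathbf{x}'}
  P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}')
  \qquad \forall\, \mathbf{x}, \mathbf{a}.
\end{align*}
```

There are still exponentially many constraints, but because every hi and every reward/transition factor has small scope, the whole constraint set can be represented compactly with the same variable-elimination construction used for the coordination graph.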

  16. Summary of Algorithm • Pick local basis functions hi • Single LP to compute local Qi’s in factored MDP • Coordination graph computes maximizing action

  17. Multiagent Policy Quality Comparing to Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99]

  18. Multiagent Policy Quality • Comparing to Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99] • [Figure: policy quality plot with the Distributed reward and Distributed value curves]

  19. Multiagent Policy Quality • Comparing to Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99] • [Figure: policy quality plot with the LP pair basis, LP single basis, Distributed reward, and Distributed value curves]

  20. Multiagent Running Time • [Figure: running time curves for Ring of rings, Star pair basis, and Star single basis]

  21. Solve Very Large MDPs • Solved MDPs with: 500 agents; over 10^150 actions; and 1322070819480806636890455259752144365965422032752148167664920368226828597346704899540778313850608061963909777696872582355950954582100618911865342725257953674027620225198320803878014774228964841274390400117588618041128947815623094438061566173054086674490506178125480344405547054397038895817465368254916136220830268563778582290228463983078878969185564040848989376093732421718463599386955167650189405881090604260896714388641028143503856487471658320106143661321731027689028552200 01 states

  22. Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization

  23. Reinforcement Learning • Do we know the model? NO: • Reinforcement learning • Training data: < x, a, x’, r > • Data collected while acting in the world • Model-free approach: learn the Q-function directly [Guestrin, Lagoudakis, Parr `02] (see the sketch below) • Model-based approach: learn a factored MDP [Guestrin, Patrascu, Schuurmans `02]
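
To make the model-free bullet concrete, here is a minimal, single-agent LSTD-Q-style sketch of fitting a linear Q-function from < x, a, x', r > samples. It is not the coordinated multiagent algorithm of [Guestrin, Lagoudakis, Parr `02]; the feature map phi, the discount gamma, the fixed policy, and the toy data below are all illustrative assumptions.

```python
# Minimal LSTD-Q-style sketch: fit Q_w(x, a) = w . phi(x, a) from samples
# gathered while acting in the world (illustrative, single-agent version).
import numpy as np

def lstdq(samples, phi, greedy_action, gamma=0.95, ridge=1e-6):
    """samples: list of (x, a, x_next, r); phi(x, a) -> np.ndarray of features."""
    k = phi(*samples[0][:2]).shape[0]
    A = ridge * np.eye(k)          # small ridge term keeps A invertible
    b = np.zeros(k)
    for x, a, x_next, r in samples:
        f = phi(x, a)
        f_next = phi(x_next, greedy_action(x_next))  # next action from current policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)   # weights w such that Q_w(x, a) = w . phi(x, a)

# Toy usage on a 2-state, 2-action problem (purely illustrative data):
states, actions = [0, 1], [0, 1]
phi = lambda x, a: np.eye(4)[2 * x + a]   # one-hot feature per (x, a) pair
greedy = lambda x: 0                      # fixed policy for this single evaluation step
rng = np.random.default_rng(0)
samples = [(x, a, rng.integers(2), float(x == a))
           for x in states for a in actions for _ in range(50)]
w = lstdq(samples, phi, greedy)
print("learned Q-weights:", w)
```

Inside an LSPI-style loop, this solve would be repeated, with greedy_action updated each iteration to be greedy with respect to the latest weights.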

  24. Power Grid – Multiagent LSPI [Schneider et al. ‘99] • Lower is better!

  25. Power Grid – Multiagent LSPI [Schneider et al. ‘99] • Lower is better!

  26. Power Grid – Multiagent LSPI [Schneider et al. ‘99] • Lower is better!

  27. Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization

  28. Hierarchical and Relational Models [Guestrin, Gordon ‘02] [Guestrin, Koller ‘02] • Classes of objects (e.g., Server, Client) • Instances • Relations • Value functions at the class level • Factored MDP equivalents of: OOBNs [Koller, Pfeffer ‘97], PRMs [Koller, Pfeffer ‘98]

  29. Generalization • Sample a set of scenarios • Solve a linear program with these scenarios to obtain class value functions • When faced with a new problem: • Use the class value function • No re-planning needed

  30. Theorem • Exponentially (infinitely) many worlds → need exponentially many samples? NO! • Value function within ε, with probability at least 1−δ • Proof method related to [de Farias, Van Roy ‘02]

  31. Generalizing to New Problems

  32. Generalizing to New Problems

  33. Generalizing to New Problems

  34. Classes of Objects Discovered • Learned 3 classes • [Figure: network with nodes labeled Server, Intermediate, and Leaf]

  35. Learning Classes of Objects

  36. Learning Classes of Objects

  37. Roadmap for Multiagents • Multiagent Coordination and Planning • Variable Coordination Structure • Loopy Approximate Linear Programming • Hierarchical Factored MDPs • Coordinated Reinforcement Learning • Relational MDPs

  38. Conclusions • Multiagent planning algorithm: • Limited Communication • Limited Observability • Unified view of function approximation and multiagent communication • Single LP solution is simple and very efficient • Efficient reinforcement learning • Generalization to new domains • Exploit structure to reduce computation costs! • Solve very large MDPs efficiently

  39. Maximizing Σi Qi: Coordination Graph [Guestrin, Koller, Parr ‘01] • Use variable elimination for maximization [Bertele & Brioschi ‘72]: maxA1,A2,A3,A4 [Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4)] = maxA1,A2,A3 [Q1(A1,A2) + Q2(A1,A3) + maxA4 [Q3(A3,A4) + Q4(A2,A4)]] = maxA1,A2,A3 [Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3)] • Example rule: "If A2 reboots and A3 does nothing, then A4 gets $10" • Here we need only 23, instead of 63, sum operations • Limited communication for optimal action choice • Comm. bandwidth = induced width of coord. graph
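
To make the elimination above concrete, here is a minimal Python sketch that maximizes a sum of table-based local Qi functions by eliminating one agent at a time. The table format, the made-up payoffs, and the elimination order are illustrative assumptions, not taken from the talk's implementation.

```python
# Sketch of variable-elimination maximization over a coordination graph:
# each local Q_i is a table over a small set of agents; agents are eliminated
# one at a time, producing intermediate tables like g1(A2, A3) on the slide.
from itertools import product

def eliminate_max(factors, order, domains):
    """factors: list of (scope_tuple, table) with table[action_tuple] = value.
    Returns the maximum of the summed factors over all joint actions."""
    factors = list(factors)
    for agent in order:
        touching = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        # New scope: all other agents appearing in the factors that touch `agent`.
        scope = tuple(sorted({a for s, _ in touching for a in s if a != agent}))
        table = {}
        for assign in product(*(domains[a] for a in scope)):
            ctx = dict(zip(scope, assign))
            table[assign] = max(
                sum(t[tuple(dict(ctx, **{agent: v})[a] for a in s)] for s, t in touching)
                for v in domains[agent])
        factors = rest + [(scope, table)]
    # All agents eliminated: every remaining factor has an empty scope.
    return sum(t[()] for _, t in factors)

# The slide's example structure: Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4),
# here with binary actions and arbitrary made-up payoffs.
domains = {a: [0, 1] for a in ("A1", "A2", "A3", "A4")}
Q = lambda: {acts: float(sum(acts)) for acts in product([0, 1], repeat=2)}
factors = [(("A1", "A2"), Q()), (("A1", "A3"), Q()),
           (("A3", "A4"), Q()), (("A2", "A4"), Q())]
print(eliminate_max(factors, order=["A4", "A3", "A2", "A1"], domains=domains))
```

When A4 is eliminated first, the intermediate table plays the role of g1(A2, A3) in the derivation above; recovering the maximizing joint action only requires remembering the argmax at each elimination step, and the messages exchanged follow the induced width of the coordination graph.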
