Multiagent Coordination, Planning, Learning and Generalization with Factored MDPs
Carlos Guestrin, Daphne Koller — Stanford University
Includes collaborations with: Geoffrey Gordon1, Michail Lagoudakis2, Ronald Parr2, Relu Patrascu4, Dale Schuurmans4, Shobha Venkataraman3
1Carnegie Mellon University 2Duke University 3Stanford University 4University of Waterloo
Multiagent Coordination Examples • Search and rescue • Factory management • Supply chain • Firefighting • Network routing • Air traffic control
Common challenges: • Multiple, simultaneous decisions • Limited observability • Limited communication
Network Management Problem
Administrators must coordinate to maximize global reward
[Figure: two-time-slice model over machines M1–M4. Each machine i has status Si and load Li at time t (Si’, Li’ at time t+1) and an administrator action Ai; status depends on neighboring machines, and reward Ri is received when a process terminates successfully.]
Joint Decision Space • Represent as MDP: • Action space: joint action a= {a1,…, an} for all agents • State space: joint state x of entire system • Reward function: total reward r • Action space is exponential in # agents • State space is exponential in # variables • Global decision requires complete observation
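To make the exponential blow-up concrete, here is a minimal sketch (not from the talk; agent count and action encoding are made up) that enumerates the joint action space for a handful of agents with binary actions:

```python
from itertools import product

n_agents = 4                  # hypothetical: 4 SysAdmin agents
local_actions = [0, 1]        # e.g., 0 = do nothing, 1 = reboot

# Joint action a = (a1, ..., an): one choice per agent.
joint_actions = list(product(local_actions, repeat=n_agents))
print(len(joint_actions))     # 2**4 = 16; grows as |A|**n with the number of agents
```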
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Long-Term Utilities • One-step utility: SysAdmin Ai receives reward ($) if a process completes • Total utility: sum of rewards • Optimal action requires long-term planning • Long-term utility Q(x,a): expected reward, given current state x and action a • Optimal action at state x is the one maximizing Q: a*(x) = argmax_a Q(x,a)
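The slide leaves the formula implicit; a standard Bellman-style way to write the long-term utility (the discount factor γ is my addition, not spelled out on the slide) is:

```latex
% Long-term utility of taking action a in state x and acting optimally afterwards
Q(\mathbf{x}, \mathbf{a}) \;=\; R(\mathbf{x}, \mathbf{a})
  \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\,
        \max_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}')
% Optimal action at state x
\mathbf{a}^*(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a})
```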
Local Q Function Approximation [Guestrin, Koller, Parr ‘01]
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
• Q3 is associated with Agent 3, who observes only X2 and X3
• Limited observability: agent i only observes the variables in Qi
• Must choose actions to maximize ∑i Qi
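As a rough illustration of this decomposition, the sketch below represents each local Qi as a function of a few agents’ actions and state variables and evaluates their sum; the functional forms and numbers are invented for illustration only:

```python
# Minimal sketch of a local-Q decomposition (illustrative values, not from the talk).
# Each local term sees only a small subset of agents and state variables.

def Q1(a1, a4, x1, x4): return 1.0 * x1 + 0.5 * (a1 == 1)
def Q2(a1, a2, x1, x2): return 0.8 * x2 - 0.2 * (a1 == a2)
def Q3(a2, a3, x2, x3): return 1.2 * x3 + 0.3 * (a2 == 1)
def Q4(a3, a4, x3, x4): return 0.6 * x4 + 0.1 * (a3 != a4)

def global_Q(a, x):
    """Approximate Q(x, a) as the sum of the local terms."""
    a1, a2, a3, a4 = a
    x1, x2, x3, x4 = x
    return (Q1(a1, a4, x1, x4) + Q2(a1, a2, x1, x2)
            + Q3(a2, a3, x2, x3) + Q4(a3, a4, x3, x4))

print(global_Q((1, 0, 1, 0), (1, 1, 0, 1)))
```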
Maximizing ∑i Qi: Coordination Graph [Guestrin, Koller, Parr ‘01]
[Figure: coordination graph over agents A1–A11; an edge connects agents whose local Qi functions share an action.]
• Limited communication for optimal action choice
• Comm. bandwidth = induced width of coord. graph
• Trees don’t increase communication requirements
• Cycles require graph triangulation
Variable Coordination Structure [Guestrin, Venkataraman, Koller ‘02]
• With whom should I coordinate? It depends!
• Real-world: coordination structure must be dynamic
• Exploit context-specific independence
• Obtain a coordination structure that changes with the state
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Where do the Qi come from?
• Long-term planning requires a Markov Decision Process: # states exponential, # actions exponential
• Efficient approximation by exploiting structure!
• Use function approximation to find the Qi:
Q(X1, …, X4, A1, …, A4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
Dynamic Decision Diagram
[Figure: two-time-slice diagram with state variables X1–X4 (state dynamics), actions A1–A4 (decisions), and rewards R1–R4; each next-step variable depends on only a few parents, e.g. P(X1’|X1, X4, A1).]
Long-term Utility = Value of MDP [Manne `60]
• Value computed by linear programming:
• One variable V(x) for each state
• One constraint for each state x and action a
• Number of states and actions exponential!
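For reference, the exact LP alluded to here is the standard one; the state-relevance weights α(x) and discount γ below are my notation, not spelled out on the slide:

```latex
\min_{V} \;\; \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x})
\quad \text{s.t.} \quad
V(\mathbf{x}) \;\ge\; R(\mathbf{x}, \mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\, V(\mathbf{x}')
\qquad \forall\, \mathbf{x}, \mathbf{a}
```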
Decomposable Value Functions
Linear combination of restricted-domain functions [Bellman et al. `63] [Tsitsiklis & Van Roy `96] [Koller & Parr `99,`00] [Guestrin et al. `01]
• Each hi is the status of a small part of the complex system: status of a machine and its neighbors, load on a machine
• Must find weights w giving a good approximate value function
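In symbols (my notation; x[Ci] denotes the small subset of state variables in the scope of hi):

```latex
% Approximate value function as a weighted sum of restricted-domain basis functions
V(\mathbf{x}) \;\approx\; \hat{V}_{\mathbf{w}}(\mathbf{x})
  \;=\; \sum_{i} w_i \, h_i\!\left(\mathbf{x}[C_i]\right)
```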
Single LP Solution for Factored MDPs [Schweitzer and Seidmann ‘85] [de Farias and Van Roy ‘01]
• One variable wi for each basis function → polynomially many LP variables
• One constraint for every state and action → exponentially many LP constraints
• hi, Qi depend on small sets of variables/actions
• Exploit structure as in variable elimination [Guestrin, Koller, Parr `01]
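Substituting the linear value function into the exact LP gives the approximate LP this slide refers to (same notation as above; the slide does not write it out). Because each hi, reward, and transition factor depends on only a few variables, the exponentially many constraints can be encoded compactly with a variable-elimination-style construction, as the last bullet notes.

```latex
\min_{\mathbf{w}} \;\; \sum_{i} w_i \sum_{\mathbf{x}} \alpha(\mathbf{x})\, h_i(\mathbf{x})
\quad \text{s.t.} \quad
\sum_{i} w_i\, h_i(\mathbf{x}) \;\ge\;
  R(\mathbf{x}, \mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) \sum_{i} w_i\, h_i(\mathbf{x}')
\qquad \forall\, \mathbf{x}, \mathbf{a}
```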
Summary of Algorithm • Pick local basis functions hi • Single LP to compute local Qi’s in factored MDP • Coordination graph computes maximizing action
Multiagent Policy Quality
Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99]
[Figure: policy quality of the LP single-basis and LP pair-basis solutions versus the distributed reward and distributed value baselines.]
Multiagent Running Time
[Figure: running time for single-basis and pair-basis approximations on the 'star' and 'ring of rings' network topologies.]
Solve Very Large MDPs
Solved MDPs with: 500 agents; over 10^150 actions; and an astronomically large number of states (the exact state count, several hundred digits long, was written out on the slide).
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Reinforcement Learning
• Do we know the model? NO: reinforcement learning
• Training data: <x, a, x’, r>, collected while acting in the world
• Model-free approach: learn the Q-function directly [Guestrin, Lagoudakis, Parr `02]
• Model-based approach: learn a factored MDP [Guestrin, Patrascu, Schuurmans `02]
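As a rough sketch of the model-free direction, here is a generic LSTD-Q weight fit with linear features — a plain single-agent sketch, not the coordinated LSPI algorithm from the talk; the feature map phi and policy pi are hypothetical placeholders:

```python
import numpy as np

def lstdq_weights(samples, phi, pi, k, gamma=0.95, reg=1e-6):
    """Fit linear Q-function weights from <x, a, x', r> samples (generic LSTD-Q).

    phi(x, a) -> length-k feature vector; pi(x) -> action of the policy being
    evaluated. Regularization keeps the system solvable with few samples.
    """
    A = reg * np.eye(k)
    b = np.zeros(k)
    for x, a, x_next, r in samples:
        f = phi(x, a)
        f_next = phi(x_next, pi(x_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)   # Q_w(x, a) ≈ w · phi(x, a)
```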
Power Grid – Multiagent LSPI
[Figure: performance on the power grid domain of Schneider et al. ‘99; lower is better.]
Multiagents with Factored MDPs • Coordination • Planning • Learning • Generalization
Hierarchical and Relational Models [Guestrin, Gordon ‘02] [Guestrin, Koller ‘02]
• Classes of objects, instances, relations
• Value functions at the class level
• Factored MDP equivalents of: OOBNs [Koller, Pfeffer ‘97], PRMs [Koller, Pfeffer ‘98]
[Figure: relational model with Server and Client classes.]
Generalization
• Sample a set of scenarios
• Solve a linear program over these scenarios to obtain class value functions
• When faced with a new problem: use the class value function; no re-planning needed
Theorem [number-of-samples bound shown on slide]:
Exponentially (infinitely) many worlds → need exponentially many samples? NO!
Value function within ε, with prob. at least 1−δ.
Proof method related to [de Farias, Van Roy ‘02]
Classes of Objects Discovered • Learned 3 classes
[Figure: network tree with nodes labeled Server, Intermediate, and Leaf.]
Roadmap for Multiagents
• Multiagent Coordination and Planning
• Variable Coordination Structure
• Loopy Approximate Linear Programming
• Hierarchical Factored MDPs
• Coordinated Reinforcement Learning
• Relational MDPs
Conclusions • Multiagent planning algorithm: • Limited Communication • Limited Observability • Unified view of function approximation and multiagent communication • Single LP solution is simple and very efficient • Efficient reinforcement learning • Generalization to new domains • Exploit structure to reduce computation costs! • Solve very large MDPs efficiently
Maximizing ∑i Qi: Coordination Graph [Guestrin, Koller, Parr ‘01]
• Use variable elimination for maximization: [Bertele & Brioschi ‘72]

max_{A1,A2,A3,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
  = max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g(A2,A3) ]

• e.g., if A2 reboots and A3 does nothing, then A4 gets $10
• Here we need only 23, instead of 63, sum operations
• Limited communication for optimal action choice
• Comm. bandwidth = induced width of coord. graph
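A minimal sketch of the variable-elimination maximization above, assuming binary actions and made-up Q tables keyed by action pairs (the numbers are illustrative, not from the talk):

```python
from itertools import product

ACTIONS = (0, 1)  # e.g., 0 = do nothing, 1 = reboot (binary for simplicity)

# Illustrative local Q tables: Q1(A1,A2), Q2(A1,A3), Q3(A3,A4), Q4(A2,A4)
Q1 = {(a1, a2): 2.0 * a1 - a2 for a1, a2 in product(ACTIONS, repeat=2)}
Q2 = {(a1, a3): a1 + 0.5 * a3 for a1, a3 in product(ACTIONS, repeat=2)}
Q3 = {(a3, a4): 1.5 * a4 - a3 for a3, a4 in product(ACTIONS, repeat=2)}
Q4 = {(a2, a4): a2 * a4       for a2, a4 in product(ACTIONS, repeat=2)}

# Eliminate A4 first: g(A2, A3) = max_{A4} [ Q3(A3, A4) + Q4(A2, A4) ]
g = {(a2, a3): max(Q3[(a3, a4)] + Q4[(a2, a4)] for a4 in ACTIONS)
     for a2, a3 in product(ACTIONS, repeat=2)}

# Then maximize the remaining terms over A1, A2, A3.
best = max(
    (Q1[(a1, a2)] + Q2[(a1, a3)] + g[(a2, a3)], (a1, a2, a3))
    for a1, a2, a3 in product(ACTIONS, repeat=3)
)
print(best)  # (max value, maximizing (A1, A2, A3)); A4 is recovered by a backward pass
```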