Execution-Time Communication Decisions for Coordination of Multi-Agent Teams Maayan Roth Thesis Defense Carnegie Mellon University September 4, 2007
Cooperative Multi-Agent Teams Operating Under Uncertainty and Partial Observability • Cooperative teams • Agents work together to achieve team reward • No individual motivations • Uncertainty • Actions have stochastic outcomes • Partial observability • Agents don’t always know world state
Coordinating When Communication is a Limited Resource • Tight coordination • One agent’s best action choice depends on the action choices of its teammates • We wish to Avoid Coordination Errors • Limited communication • Communication costs • Limited bandwidth
Thesis Question “How can we effectively use communication to enable the coordination of cooperative multi-agent teams making sequential decisions under uncertainty and partial observability?”
Thesis Statement “Reasoning about communication decisions at execution-time provides a more tractable means for coordinating teams of agents operating under uncertainty and partial observability.”
Thesis Contributions • Algorithms that: • Guarantee agents will Avoid Coordination Errors (ACE) during decentralized execution • Answer the questions of when and what agents should communicate
Outline • Dec-POMDP model • Impact of communication on complexity • Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB) • ACE-PJB-Comm: When should agents communicate? • Selective ACE-PJB-Comm: What should agents communicate? • Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP) • Future directions
Dec-POMDP Model • Decentralized Partially Observable Markov Decision Process • Multi-agent extension of single-agent POMDP model • Sequential decision-making in domains where: • Uncertainty in outcome of actions • Partial observability - uncertainty about world state
Dec-POMDP Model • M = <m, S, {Ai}i≤m, T, {Ωi}i≤m, O, R> • m is the number of agents • S is the set of possible world states • {Ai}i≤m defines the set of joint actions <a1, …, am>, where ai ∈ Ai • T defines transition probabilities over joint actions • {Ωi}i≤m defines the set of joint observations <ω1, …, ωm>, where ωi ∈ Ωi • O defines observation probabilities over joint actions and joint observations • R is the team reward function
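A minimal sketch of how this tuple might be represented in code (Python; the field names and types are illustrative assumptions, not taken from the thesis):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]       # <a_1, ..., a_m>
JointObservation = Tuple[str, ...]  # <omega_1, ..., omega_m>

@dataclass
class DecPOMDP:
    num_agents: int                           # m
    states: List[str]                         # S
    actions: List[List[str]]                  # A_i for each agent i
    observations: List[List[str]]             # Omega_i for each agent i
    # T(s, joint_action, s') -> probability of reaching s'
    transition: Callable[[str, JointAction, str], float]
    # O(s', joint_action, joint_obs) -> probability of the joint observation
    observation_fn: Callable[[str, JointAction, JointObservation], float]
    # R(s, joint_action) -> shared team reward
    reward: Callable[[str, JointAction], float]
```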
Dec-POMDP Complexity • Goal - Compute a policy which, for each agent, maps its local observation history to an action • For all m ≥ 2, a Dec-POMDP with m agents is NEXP-complete • Agents must reason about the possible actions and observations of their teammates
Impact of Communication on Complexity [Pynadath and Tambe, 2002] • If communication is free: • Dec-POMDP reducible to single-agent POMDP • Optimal communication policy is to communicate at every time step • When communication has any cost, Dec-POMDP is still intractable (NEXP-complete) • Agents must reason about value of information
Classifying Communication Heuristics • AND- vs. OR-communication [Emery-Montemerlo, 2005] • AND-communication does not replace domain-level actions • OR-communication does replace domain-level actions • Initiating communication [Xuan et al., 2001] • Tell - Agent decides to tell local information to teammates • Query - Agent asks a teammate for information • Sync - All agents broadcast all information simultaneously
Classifying Communication Heuristics • Does the algorithm consider communication cost? • Is the algorithm applicable to: • General Dec-POMDP domains • General Dec-MDP domains • Restricted domains • Are the agents guaranteed to Avoid Coordination Errors?
Related Work • Prior approaches compared along the dimensions above: AND vs. OR communication, Tell / Query / Sync initiation, cost consideration, ACE guarantee, and applicability to unrestricted domains (comparison table not reproduced)
Overall Approach • Recall, if communication is free, you can treat a Dec-POMDP like a single agent 1) At plan-time, pretend communication is free - Generate a centralized policy for the team 2) At execution-time, use communication to enable decentralized execution of this policy while Avoiding Coordination Errors
Outline • Dec-POMDP, Dec-MDP models • Impact of communication on complexity • Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB) • ACE-PJB-Comm: When should agents communicate? • Selective ACE-PJB-Comm: What should agents communicate? • Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP) • Future directions
Tiger Domain: (States, Actions) • Two-agent tiger problem [Nair et al., 2003]: Individual Actions: ai ∈ {OpenL, OpenR, Listen} Robot can open left door, open right door, or listen S: {SL, SR} Tiger is either behind left door or behind right door
Tiger Domain: (Observations) Individual Observations: ωi ∈ {HL, HR} Robot can hear tiger behind left door or hear tiger behind right door Observations are noisy and independent.
Tiger Domain: (Reward) • Coordination problem – agents must act together for maximum reward • Listen has a small cost (-1 per agent) • Both agents opening the door with the tiger leads to a medium negative reward (-50) • Maximum reward (+20) when both agents open the door with the treasure • Minimum reward (-100) when only one agent opens the door with the tiger
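A sketch of this joint reward as a function (Python; the values are the ones listed on this slide, while the handling of unlisted action combinations is an assumption):

```python
def tiger_joint_reward(state, joint_action):
    """Joint reward for the two-agent tiger domain, covering only the
    cases listed on this slide."""
    a1, a2 = joint_action
    tiger_door = "OpenL" if state == "SL" else "OpenR"
    treasure_door = "OpenR" if state == "SL" else "OpenL"

    if a1 == a2 == "Listen":
        return -2                    # -1 per agent
    if a1 == a2 == tiger_door:
        return -50                   # both open the door with the tiger
    if a1 == a2 == treasure_door:
        return +20                   # both open the door with the treasure
    if tiger_door in (a1, a2):
        return -100                  # only one agent opens the tiger door
    raise ValueError("combination not specified on this slide")
```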
Coordination Errors • Example: agent 1 selects a1 = OpenR (e.g., after observing HL, HL, HL, …) while agent 2 selects a2 = OpenL: Reward(<OpenR, OpenL>) = -100, which is worse than even Reward(<OpenL, OpenL>) ≥ -50 • Agents Avoid Coordination Errors when each agent’s action is a best response to its teammates’ actions.
Avoid Coordination Errors by Reasoning Over Possible Joint Beliefs (ACE-PJB) • Centralized POMDP policy maps joint beliefs to joint actions • Joint belief (bt) – distribution over world states • Individual agents can’t compute the joint belief • Don’t know what their teammates have observed or what action they selected • Simplifying assumption: • What if agents knew the joint action at each timestep? • Agents would only have to reason about possible observations • How can this be assured?
Ensuring Action Synchronization • Agents are only allowed to choose actions based on information known to all team members • At start of execution, agents know: b0 – initial distribution over world states; a0 – optimal joint action given b0, based on the centralized policy • At each timestep, each agent computes Lt, the distribution of possible joint beliefs: Lt = {<bt, pt, ωt>}, where ωt is the joint observation history that led to bt and pt is the likelihood of observing ωt
Possible Joint Beliefs • L0: b: P(SL) = 0.5, p(b) = 1.0; joint action a = <Listen, Listen> • L1, one node per joint observation: <HL,HL>: P(SL) = 0.8, p(b) = 0.29; <HL,HR>: P(SL) = 0.5, p(b) = 0.21; <HR,HL>: P(SL) = 0.5, p(b) = 0.21; <HR,HR>: P(SL) = 0.2, p(b) = 0.29 • How should agents select actions over joint beliefs?
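A sketch of the L0 → L1 expansion shown above (Python; `prob_joint_observation` and `belief_update` stand in for a standard POMDP observation-likelihood and Bayesian belief-update routine, and are assumed helpers):

```python
import itertools

def expand_possible_joint_beliefs(L_t, joint_action, model):
    """Expand each <b, p, history> node over every possible joint
    observation, given the joint action all agents agreed on."""
    L_next = []
    for b, p, history in L_t:
        for joint_obs in itertools.product(*model.observations):
            # likelihood of this joint observation after taking joint_action from b
            p_obs = prob_joint_observation(b, joint_action, joint_obs, model)  # assumed helper
            if p_obs == 0.0:
                continue
            # standard Bayesian belief update over world states (assumed helper)
            b_next = belief_update(b, joint_action, joint_obs, model)
            L_next.append((b_next, p * p_obs, history + [joint_obs]))
    return L_next
```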
Q-POMDP Heuristic • Select joint action that maximizes expected reward over possible joint beliefs • Q-MDP [Littman et al., 1995] • approximate solution to large POMDP using underlying MDP • Q-POMDP [Roth et al., 2005] • approximate solution to Dec-POMDP using underlying single-agent POMDP
Q-POMDP Heuristic • Choose the joint action by computing expected reward over all leaves: from b0 (P(SL) = 0.5, p = 1.0), the leaves after <Listen, Listen> are <HL,HL>: P(SL) = 0.8, p = 0.29; <HL,HR>: P(SL) = 0.5, p = 0.21; <HR,HL>: P(SL) = 0.5, p = 0.21; <HR,HR>: P(SL) = 0.2, p = 0.29 • Agents will independently select the same joint action, guaranteeing they avoid coordination errors… but the action choice is very conservative (always <Listen, Listen>) • ACE-PJB-Comm: Communication adds local observations to the joint belief
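A sketch of the Q-POMDP selection rule over the leaves of Lt (Python; `q_value(b, a)` is an assumed single-agent POMDP Q-value function derived from the centralized policy):

```python
def q_pomdp_action(L_t, joint_actions, q_value):
    """Pick the joint action maximizing expected value over all possible
    joint beliefs; every agent computes this from the same L_t, so all
    agents independently select the same joint action."""
    def expected_value(a):
        return sum(p * q_value(b, a) for b, p, _history in L_t)
    return max(joint_actions, key=expected_value)
```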
ACE-PJB-Comm Example • Agent 1 observes HL and has communicated nothing so far • L1 contains one joint belief per joint observation: <HL,HL>, <HL,HR>, <HR,HL>, <HR,HR> • aNC = Q-POMDP(L1) = <Listen, Listen> • L* = the joint beliefs consistent with agent 1’s observation: <HL,HL>, <HL,HR> • aC = Q-POMDP(L*) = <Listen, Listen> • Communicating would not change the joint action, so agent 1 does not communicate
ACE-PJB-Comm Example (continued) • Agent 1 has now observed HL twice ({HL, HL}) and has still communicated nothing • After another a = <Listen, Listen>, L2 contains one node per joint observation history: <HL,HL> <HL,HL>, <HL,HL> <HL,HR>, <HL,HL> <HR,HL>, <HL,HL> <HR,HR>, <HL,HR> <HL,HL>, <HL,HR> <HL,HR>, <HL,HR> <HR,HL>, <HL,HR> <HR,HR>, … • aNC = Q-POMDP(L2) = <Listen, Listen> • L* = the joint beliefs consistent with agent 1’s observations • aC = Q-POMDP(L*) = <OpenR, OpenR> • V(aC) - V(aNC) > ε, so agent 1 communicates
ACE-PJB-Comm Example (continued) • Agent 1 communicates <HL, HL> • The possible joint beliefs in L2 are pruned to those consistent with agent 1’s observations • Q-POMDP(L2) = <OpenR, OpenR> • Agents open the right door!
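A sketch of the when-to-communicate test illustrated by this example (Python; `prune_consistent`, which keeps only the joint beliefs consistent with the agent’s own observation history, is an assumed helper, and `q_pomdp_action` is the sketch above):

```python
def should_communicate(L_t, own_history, joint_actions, q_value, epsilon):
    """Communicate if revealing local observations would improve the
    team's joint action by more than epsilon."""
    a_nc = q_pomdp_action(L_t, joint_actions, q_value)    # action without communicating
    L_star = prune_consistent(L_t, own_history)           # beliefs matching my observations
    a_c = q_pomdp_action(L_star, joint_actions, q_value)  # action if I communicated
    gain = (sum(p * q_value(b, a_c) for b, p, _ in L_star) -
            sum(p * q_value(b, a_nc) for b, p, _ in L_star))
    return gain > epsilon, a_nc
```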
ACE-PJB-Comm Results • 20,000 trials in the 2-Agent Tiger Domain • 6 timesteps per trial • Compared to communicating at every timestep, agents communicate 49.7% fewer observations using ACE-PJB-Comm and send 93.3% fewer messages • The small difference in expected reward arises because ACE-PJB-Comm is slightly pessimistic about the outcome of communication
Additional Challenges • Number of possible joint beliefs grows exponentially • Use particle filter to model distribution of possible joint beliefs • ACE-PJB-Comm answers the question of when agents should communicate • Doesn’t deal with what to communicate • Agents communicate all observations that they haven’t previously communicated
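One way to keep the set of possible joint beliefs bounded, as suggested above, is to resample a fixed number of belief particles at each step (a minimal sketch assuming each particle is a <b, p, history> tuple; not the thesis’s exact filter):

```python
import random

def resample_joint_beliefs(L_t, num_particles):
    """Approximate the exponentially growing distribution of possible
    joint beliefs with a fixed-size particle set, sampled by likelihood."""
    weights = [p for _b, p, _h in L_t]
    particles = random.choices(L_t, weights=weights, k=num_particles)
    # give each surviving particle equal weight after resampling
    return [(b, 1.0 / num_particles, h) for b, _p, h in particles]
```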
Selective ACE-PJB-Comm[Roth et al., 2006] • Answers what agents should communicate • Chooses most valuable subset of observations • Hill-climbing heuristic to choose observations that “push” teams towards aC • aC - joint action that would be chosen if agent communicated all observations • See details in thesis document
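A sketch of the hill-climbing idea: greedily add the local observation that most “pushes” the team toward aC, stopping once sharing the chosen subset would make the team select aC (illustrative only; the thesis’s actual scoring is in the document; `prune_consistent` and `q_pomdp_action` are the sketches above):

```python
def select_observations(L_t, own_obs, a_c, joint_actions, q_value):
    """Greedily choose a subset of local observations to communicate."""
    chosen, remaining = [], list(own_obs)
    while remaining:
        # score a candidate by the expected value of a_c after sharing it
        def score(obs):
            L = prune_consistent(L_t, chosen + [obs])
            return sum(p * q_value(b, a_c) for b, p, _ in L)
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
        # stop as soon as the shared subset is enough to induce a_c
        if q_pomdp_action(prune_consistent(L_t, chosen), joint_actions, q_value) == a_c:
            break
    return chosen
```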
Selective ACE-PJB-Comm Results • 2-Agent Tiger domain: • Communicates 28.7% fewer observations • Same expected reward • Slightly more messages
Outline • Dec-POMDP, Dec-MDP models • Impact of communication on complexity • Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB) • ACE-PJB-Comm: When should agents communicate? • Selective ACE-PJB-Comm: What should agents communicate? • Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP) • Future directions
Dec-MDP • State is collectively observable • One agent can’t identify full state on its own • Union of team observations uniquely identifies state • Underlying problem is an MDP, not a POMDP • Dec-MDP has same complexity as Dec-POMDP • NEXP-Complete
Acting Independently • ACE-PJB requires agents to know joint action at every timestep • Claim: In many multi-agent domains, agents can act independently for long periods of time, only needing to coordinate infrequently
Meeting-Under-Uncertainty Domain • Agents must move to the goal location and signal simultaneously • Reward: +20 – both agents signal at the goal; -50 – both agents signal at another location; -100 – only one agent signals; -1 – agents move north, south, east, west, or stop
Factored Representations • Represent relationships among state variables instead of relationships among states • S = <X0, Y0, X1, Y1> • Each agent observes its own position
Factored Representations • A Dynamic Decision Network (DDN) models how the state variables evolve over time • Example: the transition model for joint actions in which agent 0 moves East, at = <East, *>
Tree-structured Policies • Decision tree that branches over state variables • A tree-structured joint policy has joint actions at the leaves
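A minimal sketch of such a tree-structured policy: internal nodes branch on a state variable’s value, leaves hold actions (Python; field names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PolicyNode:
    """Internal node: branches on one state variable; leaf: holds an action."""
    variable: Optional[str] = None                          # e.g. "X0", "Y1"; None at a leaf
    children: Optional[Dict[object, "PolicyNode"]] = None   # variable value -> subtree
    action: Optional[object] = None                         # joint or individual action at a leaf

    def is_leaf(self) -> bool:
        return self.variable is None
```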
Approach[Roth et al., 2007] • Generate tree-structured joint policies for underlying centralized MDP • Use this joint policy to generate a tree-structured individual policy for each agent* • Execute individual policies * See details in thesis document
Context-specific Independence Claim: In many multi-agent domains, one agent’s individual policy will have large sections where it is independent of variables that its teammates observe.
Individual Policies • One agent’s individual policy may depend on state features it doesn’t observe
Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP) • Robot traverses policy tree according to its observations • If it reaches a leaf, its action is independent of its teammates’ observations • If it reaches a state variable that it does not observe directly, it must ask a teammate for the current value of that variable • The amount of communication needed to execute a particular policy corresponds to the amount of context-specific independence in that domain
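A sketch of this decentralized execution loop (Python; it reuses the `PolicyNode` sketch above, and `ask_teammate` stands in for whatever communication channel is assumed to be available):

```python
def execute_factored_policy(root, local_state, ask_teammate):
    """Walk the individual policy tree; query a teammate only when a
    branch variable is not observed locally."""
    node = root
    while not node.is_leaf():
        var = node.variable
        if var in local_state:
            value = local_state[var]      # observed locally, no communication needed
        else:
            value = ask_teammate(var)     # communicate: ask for this variable's current value
        node = node.children[value]
    return node.action                    # leaf: action independent of teammates' observations
```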
Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP) • Benefits: • Agents can act independently without reasoning about the possible observations or actions of their teammates • Policy directs agents about when, what, and with whom to communicate • Drawback: • In domains with little independence, agents may need to communicate a lot
Experimental Results • In 3x3 domain, executing factored policy required less than half as many messages as full communication, with same reward • Communication usage decreases relative to full communication as domain size increases
Factored Dec-POMDPs • [Hansen and Feng, 2000] looked at factored POMDPs • ADD-representations of transition, observation, and reward functions • Policy is a finite-state controller • Nodes are actions • Transitions depend on conjunctions of state variable assignments • To extend to Dec-POMDP, make individual policy a finite-state controller among individual actions • Somehow combine nodes with the same action • Communicate to enable transitions between action nodes
Future Directions • Considering communication cost in ACE-IFP • All children of a particular variable may have similar values • Worst-case cost of mis-coordination? • Modeling teammate variables requires reasoning about possible teammate actions • Extending factoring to Dec-POMDPs
Future Directions • Knowledge persistence • Modeling teammates’ variables • Can we identify “necessary conditions”? • e.g., instead of repeatedly asking “Are you here yet?”, request “Tell me when you reach the goal.”
Contributions • Decentralized execution of centralized policies • Guarantee that agents will Avoid Coordination Errors • Make effective use of limited communication resources • When should agents communicate? • What should agents communicate? • Demonstrate significant communication savings in experimental domains