Achieving Goals in Decentralized POMDPs Christopher Amato Shlomo Zilberstein UMass Amherst May 14, 2009
Overview • The importance of goals • DEC-POMDP model • Previous work on goals • Indefinite-horizon DEC-POMDPs • Goal-directed DEC-POMDPs • Results and future work
Achieving goals in a multiagent setting • General setting • The problem proceeds over a sequence of steps until a goal is achieved • Multiagent setting • Can terminate when any number of agents achieve local goals or when all agents achieve a global goal • Many problems have this structure • Meeting or catching a target • Cooperatively completing a task • How do we make use of this structure?
DEC-POMDPs • Decentralized partially observable Markov decision process (DEC-POMDP) • Multiagent sequential decision making under uncertainty • At each stage, each agent receives: • A local observation rather than the actual state • A joint immediate reward • [Figure: two agents choose actions a1, a2 and receive local observations o1, o2 plus a joint reward r from the environment]
DEC-POMDP definition • A two-agent DEC-POMDP can be defined with the tuple: M = ⟨S, A1, A2, P, R, Ω1, Ω2, O⟩ • S, a finite set of states with designated initial state distribution b0 • A1 and A2, each agent's finite set of actions • P, the state transition model: P(s' | s, a1, a2) • R, the reward model: R(s, a1, a2) • Ω1 and Ω2, each agent's finite set of observations • O, the observation model: O(o1, o2 | s', a1, a2) • This model can be extended to any number of agents
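To make the tuple concrete, here is a minimal sketch of the model as a generative data structure with a sampling step; the class name, field names, and dictionary-based layout are illustrative assumptions, not definitions from the paper:

```python
import random

# Minimal sketch of a two-agent DEC-POMDP as a generative model (all names
# here are illustrative choices, not from the paper).
class DecPOMDP:
    def __init__(self, states, actions1, actions2, P, R, obs1, obs2, O, b0):
        self.states = states        # S
        self.actions1 = actions1    # A1
        self.actions2 = actions2    # A2
        self.P = P                  # P[s][(a1, a2)] -> {s_next: prob}
        self.R = R                  # R[s][(a1, a2)] -> reward
        self.obs1 = obs1            # Omega1
        self.obs2 = obs2            # Omega2
        self.O = O                  # O[s_next][(a1, a2)] -> {(o1, o2): prob}
        self.b0 = b0                # initial state distribution {s: prob}

    def reset(self):
        """Sample an initial state from b0."""
        states, probs = zip(*self.b0.items())
        return random.choices(states, weights=probs)[0]

    def step(self, s, a1, a2):
        """Sample s' ~ P(.|s,a1,a2) and (o1,o2) ~ O(.|s',a1,a2); return R(s,a1,a2)."""
        r = self.R[s][(a1, a2)]
        next_states, p = zip(*self.P[s][(a1, a2)].items())
        s_next = random.choices(next_states, weights=p)[0]
        joint_obs, q = zip(*self.O[s_next][(a1, a2)].items())
        o1, o2 = random.choices(joint_obs, weights=q)[0]
        return s_next, (o1, o2), r
```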
DEC-POMDP solutions • A policy for each agent is a mapping from its observation sequences to actions, Ωi* → Ai, allowing distributed execution • Note that planning can be centralized but execution is distributed • A joint policy is a policy for each agent • Finite-horizon case: the goal is to maximize expected reward over a finite number of steps • Infinite-horizon case: discount the reward to keep the sum finite using a factor γ
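For reference, the two objectives written out in standard notation (a restatement of the bullets above in my own notation, not copied from the slides):

```latex
% Finite-horizon objective: maximize expected total reward over T steps.
\max_{\pi^1,\pi^2}\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} R(s_t, a^1_t, a^2_t)\;\middle|\; b_0\right]

% Infinite-horizon objective: discount with \gamma \in [0,1) to keep the sum finite.
\max_{\pi^1,\pi^2}\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a^1_t, a^2_t)\;\middle|\; b_0\right]
```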
Achieving goals • If the problem terminates once the goal is achieved, how do we model it? • It is unclear how many steps are needed until termination • Want to avoid a discount factor: its value is often arbitrary and can change the solution
Previous work • Some work on goals exists for POMDPs, but for DEC-POMDPs only Goldman and Zilberstein 04 • Modeled problems with goals as finite-horizon and studied the complexity • Same complexity unless agents have independent transitions and observations and one goal is always better • This assumes negative rewards for non-goal states and a no-op action available at the goal
Indefinite-horizon DEC-POMDPs • Extend the POMDP assumptions of Patek 01 and Hansen 07 • Our assumptions • Each agent possesses a set of terminal actions • Negative rewards for non-terminal actions • The problem stops when a terminal action is taken by each agent simultaneously • Can capture uncertainty about reaching the goal • Many problems can be modeled this way • Example: capturing a target • All (or a subset) of the agents must attack simultaneously • Or agents and targets must meet at the same location • Agents are unsure when the goal is reached, but must choose when to terminate the problem
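Under these assumptions the objective can be written compactly as an undiscounted sum up to the (random) termination step; this restatement uses my notation, not the slides':

```latex
% T is the stage at which every agent first selects a terminal action;
% rewards are negative until all agents take terminal actions.
\max_{\pi^1,\pi^2}\; \mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t, a^1_t, a^2_t)\;\middle|\; b_0\right]
```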
Optimal solution • Lemma 3.1. An optimal set of indefinite-horizon policy trees must have a bounded horizon, where the bound depends on R̄_T, the value of the best combination of terminal actions; R̄_N, the value of the best combination of non-terminal actions; and v̂, the maximum value attained by choosing a set of terminal actions on the first step given the initial state distribution. • Theorem 3.2. Our dynamic programming algorithm for indefinite-horizon DEC-POMDPs returns an optimal set of policy trees for the given initial state distribution.
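A sketch of the reasoning behind the horizon bound, using the three quantities named in the lemma; the exact expression on the slide did not survive extraction, so the constant here is a reconstruction from the stated assumptions:

```latex
% All non-terminal joint rewards are negative, so a joint policy that runs for
% h steps before every agent takes a terminal action is worth at most
% (h-1)\bar{R}_N + \bar{R}_T.  To be optimal it must be at least as good as
% terminating immediately, which is worth \hat{v}:
\[
  (h-1)\,\bar{R}_N + \bar{R}_T \;\ge\; \hat{v}
  \quad\Longrightarrow\quad
  h \;\le\; 1 + \frac{\hat{v} - \bar{R}_T}{\bar{R}_N}
\]
% Dividing by \bar{R}_N < 0 flips the inequality; the bound is finite and
% positive because \hat{v} \le \bar{R}_T.
```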
Goal-directed DEC-POMDPs • Relax the assumptions, but still have a goal • The problem terminates when: • The set of agents reaches a global goal state • A single agent or a set of agents reach local goal states • Any chosen combination of actions and observations is taken or seen by the set of agents (see the sketch below) • Can no longer guarantee termination, so this becomes a subclass of the infinite-horizon case • More problems fall into this class (the problem can terminate without the agents' knowledge) • Example: completing a set of experiments • Robots must travel to different sites and perform different experiments at each • Some require cooperation (simultaneous action) while others can be completed independently • The problem ends when all necessary experiments are completed
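As a sketch of what these termination conditions look like operationally, here is a hypothetical check that could be applied during simulation; the function and argument names (is_terminal, local_state, terminal_combos) are illustrative assumptions, not from the paper:

```python
# Hypothetical termination check for a goal-directed DEC-POMDP. global_goals is
# a set of goal states, local_goals[i] a set of local goal states for agent i
# (extracted from the joint state via local_state), and terminal_combos a set of
# (joint action, joint observation) pairs designated as goal-reaching.
def is_terminal(state, joint_action, joint_obs,
                global_goals, local_goals, terminal_combos, local_state):
    # 1) The agents reach a global goal state.
    if state in global_goals:
        return True
    # 2) Every designated agent reaches its local goal state.
    if local_goals and all(local_state(state, i) in goals
                           for i, goals in local_goals.items()):
        return True
    # 3) A chosen combination of actions/observations is taken or observed.
    if (joint_action, joint_obs) in terminal_combos:
        return True
    return False
```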
Sample-based approach • Use sampling to generate agent trajectories (sketched below) • From the known initial state until the goal conditions are met • Produces only action and observation sequences that lead to the goal • This reduces the number of policies to consider • We prove a bound on the number of samples required to approach optimality (extended from Kearns, Mansour and Ng 99): the probability that the value attained is at least ε from optimal is at most δ, given a sufficient number of samples
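A minimal sketch of the trajectory-generation step, assuming the DecPOMDP sampler from earlier, random exploration policies, and a goal test with the signature is_terminal_fn(state, joint_action, joint_obs) (e.g., the earlier check with its goal sets bound in advance); this is illustrative scaffolding, not the authors' implementation:

```python
import random

def sample_goal_trajectories(model, is_terminal_fn, num_samples, max_steps=1000):
    """Sample action-observation trajectories from b0 until the goal conditions
    are met, keeping only trajectories that actually reach the goal."""
    trajectories = []
    for _ in range(num_samples):
        s = model.reset()
        history1, history2, value = [], [], 0.0
        for _ in range(max_steps):
            a1 = random.choice(model.actions1)   # exploration policy, agent 1
            a2 = random.choice(model.actions2)   # exploration policy, agent 2
            s_next, (o1, o2), r = model.step(s, a1, a2)
            history1.append((a1, o1))
            history2.append((a2, o2))
            value += r                           # undiscounted return
            if is_terminal_fn(s_next, (a1, a2), (o1, o2)):
                # Only goal-reaching trajectories are kept for policy construction.
                trajectories.append((value, history1, history2))
                break
            s = s_next
    # The highest-valued trajectories are used to build the initial controllers.
    trajectories.sort(key=lambda t: t[0], reverse=True)
    return trajectories
```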
Getting more from fewer samples • Optimize a finite-state controller • Use the trajectories to create a controller • Ensures a valid DEC-POMDP policy • Allows the solution to be more compact • Choose actions and adjust the resulting transitions (permitting possibilities that were not sampled) • Optimize in the context of the other agents • The trajectories create an initial controller, which is then optimized to produce a high-valued policy
Generating controllers from trajectories • Sample trajectories (one agent's action–observation sequences, each ending with the goal observation o1g): a1-o1g; a1-o3 a1-o1g; a1-o3 a1-o3 a1-o1g; a4-o4 a1-o2 a3-o1g; a4-o3 a1-o1g • [Figure: the trajectories are merged into an initial controller (nodes labeled with actions, edges with observations), which is then reduced and optimized into controllers (a) and (b)]
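A minimal sketch of how one agent's kept trajectories could be merged into an initial prefix-tree controller like the one in the figure; the data layout (node_action, transitions keyed by observation) is my own choice, not the authors' representation:

```python
def build_initial_controller(histories):
    """Merge one agent's goal-reaching (action, observation) sequences into a
    prefix-tree controller: node_action[n] is the action taken at node n, and
    transitions[(n, o)] is the node reached after observing o.  Assumes the
    kept trajectories agree on the action taken after a given observation
    history (the optimization step reselects actions anyway)."""
    node_of_history = {(): 0}     # observation prefix -> node id
    node_action = {}              # node id -> action (goal nodes get no action)
    transitions = {}              # (node id, observation) -> node id
    for history in histories:
        obs_seen = ()
        node = 0
        for action, obs in history:
            node_action.setdefault(node, action)
            # Follow (or create) the edge labeled by this observation.
            obs_seen = obs_seen + (obs,)
            if obs_seen not in node_of_history:
                new_node = len(node_of_history)
                node_of_history[obs_seen] = new_node
                transitions[(node, obs)] = new_node
            node = node_of_history[obs_seen]
    return node_action, transitions
```

Merging equivalent nodes would then give the reduced controller, and reselecting actions and transitions in the context of the other agent gives the optimized controllers described on the previous slide.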
Experiments • Compared our goal-directed approach with leading approximate infinite-horizon algorithms: BFS (Szer and Charpillet 05), DEC-BPI (Bernstein, Hansen and Zilberstein 05), and NLP (Amato, Bernstein and Zilberstein 07) • Each approach was run with increasingly large controllers until resources were exhausted (2GB of memory or 4 hours) • BFS provides an optimal deterministic controller for a given size • The other algorithms were run 10 times, and mean times and values are reported
Experimental results • [Results table omitted; per problem, between 500,000 and 5,000,000 trajectories were sampled and the 5–25 highest-valued ones were kept] • We built controllers from a small number of the highest-valued trajectories • Our sample-based approach outperforms the other methods on these problems
Conclusions • Make use of goal structure, when present, to improve efficiency and solution quality • Indefinite-horizon approach • Created a model for DEC-POMDPs • Developed an algorithm and proved its optimality • Goal-directed problems • Described a more general goal model • Developed a sample-based algorithm and demonstrated high-quality results • Proved a bound on the number of samples needed to approach optimality • Future: this work can be extended to general finite and infinite-horizon problems