
Dynamic Programming for Partially Observable Stochastic Games



  1. Dynamic Programming for Partially Observable Stochastic Games Daniel S. Bernstein University of Massachusetts Amherst in collaboration with Christopher Amato, Eric A. Hansen, Shlomo Zilberstein June 23, 2004

  2. Extending the MDP Framework • The MDP framework can be extended to incorporate partial observability and multiple agents • Can we still do dynamic programming? • Lots of work on the single-agent case (POMDP): Sondik 78, Cassandra et al. 97, Hansen 98 • Some work on the multi-agent case, but with limited theoretical guarantees: Varaiya & Walrand 78, Nair et al. 03

  3. Our Contribution • We extend DP to the multi-agent case • For cooperative agents (DEC-POMDPs): the first optimal DP algorithm • For noncooperative agents: the first DP algorithm for iterated elimination of dominated strategies • This unifies ideas from game theory and partially observable MDPs

  4. Game Theory • Normal-form game • Only one decision to make – no dynamics • A mixed strategy is a probability distribution over pure strategies [Figure: a 2x2 payoff matrix with row strategies a1, a2 and column strategies b1, b2]

  5. Solving Games • One approach to solving games is iterated elimination of dominated strategies • Roughly speaking, this removes all unreasonable strategies • Unfortunately, it can't always prune down to a single strategy per player

  6. Dominance • A strategy is dominated if, for every joint distribution over the other players' strategies, there is another strategy that is at least as good • The dominance test can be carried out with linear programming (see the sketch below) [Figure: a payoff matrix in which strategy a3 is dominated, with opponent strategies b1 and b2]
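
For concreteness, here is a minimal sketch of that linear-programming test in Python (the payoff matrix, function name, and use of numpy/scipy are illustrative assumptions, not from the slides): a row strategy is flagged as dominated if some mixture of the remaining row strategies does at least as well against every column strategy.

```python
import numpy as np
from scipy.optimize import linprog

def is_dominated(U, d):
    """Check whether row strategy d is weakly dominated by a mixture of the
    other row strategies, i.e. whether some mixed strategy is at least as
    good against every column strategy (illustrative sketch)."""
    others = [i for i in range(U.shape[0]) if i != d]
    # Feasibility LP over mixture weights x >= 0, sum(x) = 1:
    #   for every column b:  sum_i x_i * U[others[i], b] >= U[d, b]
    A_ub = -U[others, :].T          # -A x <= -U[d, :]  <=>  A x >= U[d, :]
    b_ub = -U[d, :]
    res = linprog(c=np.zeros(len(others)), A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, len(others))), b_eq=[1.0],
                  bounds=(0, None))
    return res.success

# Illustrative row-player payoffs (rows a1..a3, columns b1, b2):
U = np.array([[3.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])          # a3 is dominated by mixing a1 and a2
print(is_dominated(U, 2))           # True
print(is_dominated(U, 0))           # False
```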

  7. Dynamic Programming for POMDPs • We'll start with some important concepts: belief states, policy trees, and the linear value function each policy tree defines over belief states [Figure: a policy tree over actions a1, a2, a3 and observations o1, o2, and linear value functions over states s1, s2]
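
As a small illustration (the class and function names are my own, not the authors'): a policy tree pairs an action with one subtree per observation, and each tree's value at a belief state is linear in the belief, so the optimal value function is the pointwise maximum of these linear functions.

```python
from dataclasses import dataclass
from typing import Dict
import numpy as np

@dataclass
class PolicyTree:
    """A depth-t policy tree: take `action`, then branch on the observation."""
    action: int
    children: Dict[int, "PolicyTree"]   # observation -> depth-(t-1) subtree

# Each tree q has a value vector V_q with one entry per state, so its value at
# a belief state b (a distribution over states) is the dot product V_q . b,
# which is linear in b.
def best_tree(b: np.ndarray, value_vectors) -> int:
    """Index of the policy tree with the highest value at belief b."""
    return int(np.argmax([np.dot(v, b) for v in value_vectors]))
```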

  8. Dynamic Programming [Figure: the one-step policy trees (a1 and a2) and their value functions over states s1 and s2]

  9. Dynamic Programming [Figure: all two-step policy trees produced by an exhaustive backup, with their value functions over s1 and s2]

  10. Dynamic Programming [Figure: the two-step policy trees that remain after pruning dominated ones, with their value functions over s1 and s2]

  11. Dynamic Programming [Figure: the value functions of the remaining policy trees over s1 and s2]

  12. Properties of Dynamic Programming • After T steps, the set of policy trees that remains contains a best tree for the initial state s0 • The pruning test is exactly the same as the test used in elimination of dominated strategies in normal-form games (see the sketch below)
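
To make the backup-and-prune step concrete, here is a compact Python sketch under assumed array conventions (P[s, a, s'], R[s, a], O[a, s', o]); it is an illustration, not the authors' implementation. The pruning LP is exactly the normal-form dominance test above, with states playing the role of the opponent's strategies.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def backup(V, P, R, O):
    """Exhaustive POMDP backup (sketch).  V is a list of value vectors, one per
    depth-t tree; returns the vectors of all |A| * |V|**|Z| depth-(t+1) trees."""
    S, A = R.shape
    Z = O.shape[2]
    new_V = []
    for a in range(A):                                             # root action
        for choice in itertools.product(range(len(V)), repeat=Z):  # subtree per obs
            v = np.array([R[s, a] + sum(P[s, a, s2] * O[a, s2, o] * V[choice[o]][s2]
                                        for s2 in range(S) for o in range(Z))
                          for s in range(S)])
            new_V.append(v)
    return new_V

def prune(V):
    """Remove vectors that are weakly dominated over the belief simplex; the LP
    is the same dominance test used for normal-form games."""
    keep = list(V)
    changed = True
    while changed:
        changed = False
        for i in range(len(keep)):
            others = [v for j, v in enumerate(keep) if j != i]
            if not others:
                break
            res = linprog(np.zeros(len(others)),
                          A_ub=-np.array(others).T, b_ub=-keep[i],
                          A_eq=np.ones((1, len(others))), b_eq=[1.0])
            if res.success:             # keep[i] is dominated by a mixture
                keep.pop(i)
                changed = True
                break
    return keep
```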

  13. Partially Observable Stochastic Game • Multiple agents control a Markov process • Each can have a different observation and reward function [Figure: two agents choose actions a1 and a2 and receive observations and rewards (o1, r1) and (o2, r2) from the world]

  14. POSG – Formal Definition • A POSG is a tuple ⟨S, A1, A2, P, R1, R2, Ω1, Ω2, O⟩, where • S is a finite state set, with initial state s0 • A1, A2 are finite action sets • P(s, a1, a2, s') is the state transition function • R1(s, a1, a2) and R2(s, a1, a2) are reward functions • Ω1, Ω2 are finite observation sets • O(s, o1, o2) is the observation function • Straightforward generalization to n agents
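
One minimal way to hold this tuple in code (a sketch with illustrative field names and signatures, not a fixed API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class POSG:
    """Two-agent POSG <S, A1, A2, P, R1, R2, Omega1, Omega2, O> (sketch)."""
    states: List[str]                          # S, with initial state states[0]
    actions1: List[str]                        # A1
    actions2: List[str]                        # A2
    P: Callable[[str, str, str, str], float]   # P(s, a1, a2, s')
    R1: Callable[[str, str, str], float]       # R1(s, a1, a2)
    R2: Callable[[str, str, str], float]       # R2(s, a1, a2)
    obs1: List[str]                            # Omega1
    obs2: List[str]                            # Omega2
    O: Callable[[str, str, str], float]        # O(s, o1, o2)
    # The n-agent generalization adds more action, reward, and observation fields.
```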

  15. POSG – More Definitions • A local policy is a mapping πi : Ωi* → Ai • A joint policy is a pair ⟨π1, π2⟩ • Each agent wants to maximize its own expected reward over T steps • Although execution is distributed, planning is centralized
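
In code, a deterministic local policy is just a table from observation histories to actions, and a joint policy is a pair of such tables (the observation and action names below are illustrative):

```python
from typing import Dict, Tuple

LocalPolicy = Dict[Tuple[str, ...], str]   # observation history -> action

pi1: LocalPolicy = {(): "a1", ("o1",): "a1", ("o2",): "a2"}
pi2: LocalPolicy = {(): "a2", ("o1",): "a2", ("o2",): "a1"}
joint_policy = (pi1, pi2)                  # planned centrally, executed separately
```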

  16. Strategy Elimination in POSGs • Could simply convert the POSG to normal form • But the number of strategies is doubly exponential in the horizon length (see the calculation below)
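
To see why, note that a depth-T policy tree has one action node per observation history of length less than T, so the number of deterministic horizon-T policies for one agent is |A| raised to that number of nodes. A quick illustrative calculation (assuming at least two observations):

```python
def num_policy_trees(num_actions: int, num_obs: int, horizon: int) -> int:
    """Number of deterministic depth-`horizon` policy trees (assumes num_obs >= 2)."""
    nodes = (num_obs ** horizon - 1) // (num_obs - 1)   # action nodes in the tree
    return num_actions ** nodes

for T in range(1, 5):
    print(T, num_policy_trees(2, 2, T))   # 2, 8, 128, 32768 -- doubly exponential
```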

  17. A Better Way to Do Elimination • We use dynamic programming to eliminate dominated strategies without first converting to normal form • Pruning a subtree eliminates the set of trees containing it [Figure: pruning a one-step subtree eliminates every larger policy tree that contains it]

  18. Generalizing Dynamic Programming • Build policy trees as in the single-agent case • The pruning rule is a natural generalization (see the sketch below) [Table: what to prune, and the space over which pruning is tested, for POMDPs vs. POSGs]
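
Here is a sketch of the generalized pruning test (my own formulation for illustration, assuming a precomputed value array V_i[q, q_other, s] giving agent i's value when it uses tree q, the other agent uses tree q_other, and the state is s): a tree is pruned if some mixture of agent i's other trees is at least as good at every (state, other-agent-tree) pair, and hence for every distribution over states and the other agent's trees.

```python
import numpy as np
from scipy.optimize import linprog

def tree_is_dominated(V_i, d):
    """Generalized pruning test for agent i (sketch).  Tree d is pruned if a
    mixture of agent i's other trees does at least as well at every
    (other-agent tree, state) pair -- the same LP as the normal-form dominance
    test, with those pairs playing the role of the opponent's columns."""
    n = V_i.shape[0]
    others = [q for q in range(n) if q != d]
    cols = V_i[others].reshape(len(others), -1)   # one column per (q_other, s)
    target = V_i[d].reshape(-1)
    res = linprog(np.zeros(len(others)),
                  A_ub=-cols.T, b_ub=-target,
                  A_eq=np.ones((1, len(others))), b_eq=[1.0])
    return res.success
```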

  19. Dynamic Programming [Figure: each agent's one-step policy trees (a1 and a2) and their value functions]

  20. Dynamic Programming [Figure: all two-step policy trees for both agents after an exhaustive backup]

  21. Dynamic Programming [Figure: a dominated two-step tree has been pruned from one agent's set]

  22. Dynamic Programming [Figure: further dominated trees are pruned]

  23. Dynamic Programming [Figure: pruning continues, alternating between the two agents]

  24. Dynamic Programming [Figure: pruning continues until no agent has a dominated tree left]

  25. Dynamic Programming [Figure: the sets of policy trees that remain after pruning]

  26. Correctness of Dynamic Programming Theorem: DP performs iterated elimination of dominated strategies in the normal form of the POSG. Corollary: DP can be used to find an optimal joint policy in a cooperative POSG.

  27. Dynamic Programming in Practice • Initial empirical results show that much pruning is possible • Can solve problems with small state sets • And we can import ideas from the POMDP literature to scale up to larger problems: Boutilier & Poole 96, Hauskrecht 00, Feng & Hansen 00, Hansen & Zhou 03, Theocharous & Kaelbling 03

  28. Conclusion • First exact DP algorithm for POSGs • Natural combination of two ideas • Iterated elimination of dominated strategies • Dynamic programming for POMDPs • Initial experiments on small problems, ideas for scaling to larger problems
