
POMDP


robbinsp


Presentation Transcript


  1. POMDP

  2. MDP (Markov Decision Process) • An MDP model contains: • A set of states S • A set of actions A • A state-transition function T • Deterministic or stochastic • A reward function R(s, a)

  3. POMDP • A POMDP model contains: • A set of states S • A set of actions A • A state-transition function T • A reward function R(s, a) • A finite set of observations Ω • An observation function O: S × A → Π(Ω) • O(s’, a, o) is the probability of observing o after taking action a and arriving in state s’
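As a concrete illustration, the POMDP tuple above can be held in a small container. This is a sketch in our own notation, not code from the slides; the class and field names are our choices.

```python
from dataclasses import dataclass

# Illustrative container for the POMDP tuple (S, A, T, R, Ω, O).
# Probabilities are stored in flat dicts keyed by tuples.
@dataclass
class POMDP:
    states: list          # S
    actions: list         # A
    observations: list    # Ω
    T: dict               # T[(s, a, s_next)] = Pr(s_next | s, a)
    R: dict               # R[(s, a)] = immediate reward
    O: dict               # O[(s_next, a, o)] = Pr(o | s_next, a)

    def check(self):
        # Transition and observation distributions must each sum to 1.
        for s in self.states:
            for a in self.actions:
                assert abs(sum(self.T[(s, a, s2)] for s2 in self.states) - 1) < 1e-9
        for s2 in self.states:
            for a in self.actions:
                assert abs(sum(self.O[(s2, a, o)] for o in self.observations) - 1) < 1e-9
```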

  4. POMDP Problem • 1. Belief state • First approach: choose the most probable state of the world, given past experience • Observations carry only partial information about the state, so this approach discards information • Second approach: maintain a probability distribution over states of the world (a belief state), which summarizes everything the agent has observed so far
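The second approach amounts to a standard Bayes-filter update of the belief after each action and observation. A minimal sketch, with hypothetical two-state numbers rather than anything from the slides:

```python
def belief_update(b, a, o, states, T, O):
    """Bayes-filter update: b'(s') ∝ O(s', a, o) · sum_s T(s, a, s') b(s)."""
    unnorm = {s2: O[(s2, a, o)] * sum(T[(s, a, s2)] * b[s] for s in states)
              for s2 in states}
    norm = sum(unnorm.values())
    return {s2: p / norm for s2, p in unnorm.items()}

# Hypothetical two-state example: a static world and a "listen" action
# whose observation matches the true state with probability 0.85.
states = ["sl", "sr"]
T = {(s, "listen", s2): float(s == s2) for s in states for s2 in states}
O = {("sl", "listen", "Tl"): 0.85, ("sl", "listen", "Tr"): 0.15,
     ("sr", "listen", "Tl"): 0.15, ("sr", "listen", "Tr"): 0.85}
b1 = belief_update({"sl": 0.5, "sr": 0.5}, "listen", "Tl", states, T, O)
# b1["sl"] is 0.85: hearing the noise on the left shifts belief left.
```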

  5. POMDP Problem • 2. Finding an optimal policy: • The optimal policy is the solution of a continuous-space “Belief MDP”: • The set of belief states B • The set of actions A • The state-transition function τ(b, a, b’), the probability of reaching belief b’ from belief b under action a • The reward function on belief states ρ(b, a) = Σs b(s) R(s, a)
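These belief-MDP quantities can be sketched directly: ρ is an expectation of R under b, and τ(b, a, b’) sums Pr(o | b, a) over the observations whose Bayes update of b yields b’. The helper names and the example numbers below are our own.

```python
def rho(b, a, R):
    # Belief-MDP reward: rho(b, a) = sum_s b(s) R(s, a).
    return sum(p * R[(s, a)] for s, p in b.items())

def obs_prob(b, a, o, states, T, O):
    # Pr(o | b, a); tau(b, a, b') is the sum of this quantity over the
    # observations o whose Bayes update of b yields b'.
    return sum(O[(s2, a, o)] * sum(T[(s, a, s2)] * b[s] for s in states)
               for s2 in states)

# Hypothetical two-state numbers (listening does not change the state):
states = ["sl", "sr"]
T = {(s, "listen", s2): float(s == s2) for s in states for s2 in states}
O = {("sl", "listen", "Tl"): 0.85, ("sl", "listen", "Tr"): 0.15,
     ("sr", "listen", "Tl"): 0.15, ("sr", "listen", "Tr"): 0.85}
R = {("sl", "listen"): -1.0, ("sr", "listen"): -1.0}
b = {"sl": 0.5, "sr": 0.5}
```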

  6. Value function • Policy Tree • Infinite Horizon • Witness Algorithm

  7. Policy Tree • A tree of depth t that specifies a complete t-step policy. • Nodes: actions; the top node determines the first action to be taken. • Edges: observations; the observation received selects which subtree to execute next

  8. Sample Policy Tree

  9. Policy Tree • Value Evaluation: • Vp(s) is the t-step value of starting from state s and executing policy tree p: • Vp(s) = R(s, a(p)) + γ Σs’ T(s, a(p), s’) Σo O(s’, a(p), o) Vo(p)(s’) • where a(p) is the action at the root of p and o(p) is the subtree followed after observing o

  10. Policy Tree • Value Evaluation: • Expected value under policy tree p: Vp(b) = Σs b(s) Vp(s) = b · αp • where αp = ⟨Vp(s1), …, Vp(sn)⟩ • Different initial belief states make different policy trees optimal, so the t-step value function is Vt(b) = maxp b · αp
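The evaluation described on these slides can be sketched recursively. The tree encoding `(action, {observation: subtree})` and the flat `model` tuple are our own conventions, not from the slides; the numbers are hypothetical.

```python
def tree_value(p, s, model, gamma=0.9):
    """V_p(s): expected value of executing policy tree p from state s.
    A tree is (action, {observation: subtree}); an empty dict ends the tree."""
    states, observations, T, R, O = model
    a, subtrees = p
    v = R[(s, a)]
    if subtrees:
        for s2 in states:
            for o in observations:
                v += gamma * T[(s, a, s2)] * O[(s2, a, o)] * \
                     tree_value(subtrees[o], s2, model, gamma)
    return v

def belief_value(p, b, model, gamma=0.9):
    # V_p(b) = sum_s b(s) V_p(s): the alpha-vector view of a policy tree.
    states = model[0]
    return sum(b[s] * tree_value(p, s, model, gamma) for s in states)

# Hypothetical two-state model where "listen" costs 1 and is 85% accurate.
states = ["sl", "sr"]
observations = ["Tl", "Tr"]
T = {(s, "listen", s2): float(s == s2) for s in states for s2 in states}
R = {("sl", "listen"): -1.0, ("sr", "listen"): -1.0}
O = {("sl", "listen", "Tl"): 0.85, ("sl", "listen", "Tr"): 0.15,
     ("sr", "listen", "Tl"): 0.15, ("sr", "listen", "Tr"): 0.85}
model = (states, observations, T, R, O)
leaf = ("listen", {})                       # 1-step tree: value R(s, listen)
two_step = ("listen", {"Tl": leaf, "Tr": leaf})  # listen twice: -1 - gamma
```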

  11. Policy Tree • Value Evaluation: • Vt with only two states: each tree p contributes a line b · αp over b(s1), and Vt is the upper surface of these lines, so it is piecewise linear and convex

  12. Policy Tree • Value Evaluation: • Vt with three states: the belief simplex is a triangle, and Vt is the upper surface of the planes b · αp

  13. Infinite Horizon • Three algorithms to compute V: • Naive approach • Improvement: keep only useful policy trees • The witness algorithm

  14. Infinite Horizon • Naive approach: iterate until successive value functions differ by at most ε, where ε is a small number • A t-step policy tree contains (|Ω|^t − 1)/(|Ω| − 1) nodes • Each node can be labeled with |A| possible actions • Total number of policy trees: |A|^((|Ω|^t − 1)/(|Ω| − 1))
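The node and tree counts above can be checked numerically. A small sketch with hypothetical helper names:

```python
def num_nodes(t, n_obs):
    # A depth-t policy tree has sum_{i=0}^{t-1} |Ω|^i nodes:
    # one root, then |Ω| children per node, down to depth t.
    return sum(n_obs ** i for i in range(t))

def num_policy_trees(t, n_actions, n_obs):
    # Each node can be labeled with any of |A| actions.
    return n_actions ** num_nodes(t, n_obs)

# For example, with |A| = 3, |Ω| = 2, t = 2: 3 nodes per tree, 3^3 = 27 trees.
```

The double exponential in t is what makes the naive approach intractable even for tiny problems.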

  15. Infinite Horizon • Improvement: keep only useful policy trees • Vt−1, the set of useful (t − 1)-step policy trees, can be used to construct a superset Vt+ of the useful t-step policy trees • There are |A||Vt−1|^|Ω| elements in Vt+
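Keeping only useful trees amounts to pruning dominated alpha vectors from Vt+. The full construction prunes with a linear program; the sketch below applies only the cheaper pointwise-dominance check, which never removes a useful tree but may keep some useless ones.

```python
def prune_pointwise(alphas):
    """Drop alpha vectors strictly dominated pointwise by another vector.
    A vector dominated at every belief state corresponds to a policy tree
    that is never the best choice, so it can be discarded."""
    def dominates(x, y):
        return (all(xi >= yi for xi, yi in zip(x, y))
                and any(xi > yi for xi, yi in zip(x, y)))
    return [a for a in alphas if not any(dominates(b, a) for b in alphas)]
```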

  16. Infinite Horizon • Improved by choosing useful policy tree:

  17. Infinite Horizon • Witness algorithm:

  18. Infinite Horizon • Witness algorithm: • Qa,t is the set of t-step policy trees that have action a at their root • Qa,t(b) = maxp∈Qa,t b · αp is the associated value function • And Vt(b) = maxa Qa,t(b)

  19. Infinite Horizon • Witness algorithm: • Finding witnesses: • At each iteration we ask: is there some belief state b for which the true value Qa,t(b), computed by one-step lookahead using Vt−1, differs from the estimated value Q̃a,t(b), computed using the set U? • Such a b is a witness that U is still incomplete

  20. Infinite Horizon • Witness algorithm: • Finding witnesses: • Now we can state the witness theorem [25]: the true Q-function Qa,t differs from the approximate Q-function Q̃a,t if and only if there is some p ∈ U, some o ∈ Ω, and some p’ ∈ Vt−1 for which there is some b such that Vpnew(b) > Vp̃(b) for all p̃ ∈ U, where pnew is p with its o-subtree replaced by p’

  21. Infinite Horizon • Witness algorithm: • Finding witness:

  22. Infinite Horizon • Witness algorithm: • Finding witnesses: • The linear program used to find witness points: maximize δ subject to b · αpnew ≥ b · αp + δ for every p ∈ U, Σs b(s) = 1, and b(s) ≥ 0 • If the optimal δ is positive, the optimal b is a witness point
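With only two states, the belief is a single number, so the linear program can be approximated by scanning the belief interval. The sketch below is a brute-force stand-in for the LP, useful only for building intuition, and the function name is our own.

```python
def find_witness_2state(alpha_new, U, grid=1000):
    """Scan beliefs b = (x, 1 - x) for a point where alpha_new strictly
    beats every vector in U — a brute-force stand-in for the LP
    'maximize δ s.t. b·α_new ≥ b·α + δ for all α in U, b a distribution'."""
    for i in range(grid + 1):
        x = i / grid
        b = (x, 1.0 - x)
        v_new = b[0] * alpha_new[0] + b[1] * alpha_new[1]
        if all(v_new > b[0] * a[0] + b[1] * a[1] for a in U):
            return b   # witness point: alpha_new improves on U here
    return None        # no witness found on the grid
```

A real implementation would solve the LP exactly (e.g. with an off-the-shelf solver), since a finite grid can miss witness points in narrow belief regions.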

  23. Infinite Horizon • Witness algorithm: • Complete value iteration: • Initialize an agenda containing any single policy tree • Maintain a set U of the useful policy trees found so far • For the tree pnew at the head of the agenda, use the linear program to determine whether it is an improvement over the policy trees in U • 1. If no witness point is discovered, the policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates. • 2. If a witness point is discovered, the best policy tree for that point is computed and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.

  24. Infinite Horizon • Witness algorithm: • Complexity: • No more than |Vt| witness points are discovered, since each one adds a tree to the set of useful policy trees • Each discovered tree puts its single-subtree variants on the agenda, so only |Vt||Ω|(|Vt−1| − 1) trees can ever be added to the agenda (in addition to the one tree in the initial agenda) • Each linear program solved has one variable per state plus one for the advantage δ, and no more than one constraint per tree in U • Each of these linear programs either removes a policy tree from the agenda (this happens at most once per agenda tree) or discovers a witness point (this happens at most |Vt| times)

  25. Tiger Problem • Two doors: • Behind one door is a tiger • Behind the other door is a large reward • Two states: • sl when the tiger is behind the left door, sr when it is behind the right • Three actions: • left, right, and listen • Rewards: • +10 for opening the correct door, −100 for opening the door with the tiger behind it, −1 for each listen • Observations: • hear the tiger on the left (Tl) or hear it on the right (Tr) • In state sl, the listen action yields observation Tl with probability 0.85 and Tr with probability 0.15; conversely for state sr
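The belief dynamics of listening follow directly from the numbers on this slide: the tiger does not move, so each listen is a pure Bayes update. A minimal sketch (the function name is our own):

```python
def listen_update(b_left, heard_left, accuracy=0.85):
    """Pr(tiger behind left door) after one listen, by Bayes' rule.
    The state is static: listening never moves the tiger."""
    p_obs_given_left = accuracy if heard_left else 1.0 - accuracy
    p_obs_given_right = 1.0 - accuracy if heard_left else accuracy
    num = p_obs_given_left * b_left
    return num / (num + p_obs_given_right * (1.0 - b_left))

b = 0.5
b = listen_update(b, True)   # hear the tiger on the left once  -> 0.85
b = listen_update(b, True)   # hear it on the left again        -> ~0.97
```

This is why listening repeatedly pays off despite the −1 cost: each consistent observation sharpens the belief, shrinking the chance of the −100 penalty.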

  26. Tiger Problem

  27. Tiger Problem

  28. Tiger Problem • Decreasing listening reliability from 0.85 down to 0.65:
