
Partially Observable MDP (POMDP)




  1. Partially Observable MDP (POMDP) Ruti Glick, Bar-Ilan University

  2. Motivation • MDP • Non-deterministic environment • Full observation • Partially Observable MDP • Like MDP but… • Missing knowledge of the state the agent is in: • May not be complete • May not be correct

  3. Example • In our 4x3 environment (with +1 and -1 exits): • The agent doesn’t know where it is • It has no sensors • It is put in an unknown state • What should it do? • If it knew it was in (3,3), it would take the Right action • But it doesn’t know

  4. Many other applications • Helicopter control [Bagnell & Schneider 2001] • Dialogue management [Roy, Pineau & Thrun 2000] • Preference elicitation [Boutilier 2002] • Optimal search and sensor scheduling [Krishnamurthy & Singh 2000] • Medical diagnosis and treatment [Hauskrecht & Fraser 2000] • Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]

  5. Solutions • First solution: • Step 1: reduce uncertainty • Step 2: try heading to the +1 exit • Reducing uncertainty: • Move Left 5 times, so the agent is quite likely to be at the left wall • Then move Up 5 times, so it is quite likely to be at the top-left corner • Then move Right 5 times toward the goal state • Continuing to move Right increases the chance of reaching +1 • Quality of solution: • An 81.8% chance of reaching +1 • Expected utility of about 0.08 • Second solution: POMDP

  6. MDP • Components: • States – S • Actions – A • Transitions – T(s, a, s’) • Rewards – R(s) • Problem: • Choose actions that make the right tradeoff between immediate rewards and future gains, yielding the best possible solution • Solution: • Policy
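These components can be written down concretely. Below is a minimal sketch using plain Python dicts; the two-state model, its probabilities, and its rewards are invented purely for illustration and are not part of the slides.

```python
# Minimal sketch of the MDP components listed above, for an invented
# two-state toy model: states S, actions A, transitions T(s, a, s'),
# rewards R(s), and a policy mapping each state to an action.

S = ["s0", "s1"]                        # states
A = ["stay", "go"]                      # actions
T = {                                   # T[(s, a)] -> {s': probability}
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.3, "s1": 0.7},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.1, "s1": 0.9},
}
R = {"s0": -0.04, "s1": 1.0}            # reward per state
policy = {"s0": "go", "s1": "stay"}     # a policy: state -> action
```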

  7. PO-MDP • Components: • States – S • Actions – A • Transitions – T(s, a, s’) • Rewards – R(s) • Observations – O(s, o)

  8. Mapping in MDP & POMDP • In MDP, mapping is from states to actions. • In POMDP, mapping is from probability distributions (over states) to actions.

  9. Observation Model • O(s, o) – probability of getting observation o in state s • Example: • For a robot without sensors: • O(s, o) = 1 for every s in S • Only one observation o exists

  10. Belief state • The set of actual states the agent may be in • b = probability distribution over all possible states • Example: • In our 4x3 example, for a robot without sensors, the initial belief state is: b = <1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 0, 0>
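As a concrete illustration, the sketch below constructs this uniform initial belief for the sensorless robot. It assumes the standard 4x3 layout in which square (2,2) is a wall and (4,3) / (4,2) are the +1 / -1 exits; the slide itself only gives the resulting vector.

```python
from fractions import Fraction

# Uniform initial belief for the sensorless robot in the 4x3 world.
# Assumed layout (standard grid): square (2, 2) is a wall, (4, 3) is the
# +1 exit and (4, 2) is the -1 exit.
states = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
terminals = {(4, 3), (4, 2)}

b = {s: (Fraction(0) if s in terminals else Fraction(1, 9)) for s in states}

assert sum(b.values()) == 1            # a belief is a probability distribution
print(b[(1, 1)], b[(4, 3)])            # -> 1/9 0
```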

  11. Belief state (Cont.) • b(s) = probability assigned to actual state s by belief b • 0 ≤ b(s) ≤ 1 for every s in S • ∑s in S b(s) = 1 • b’(s’) = α O(s’, o) ∑s T(s, a, s’) b(s) • α is a normalizing constant that makes the belief state sum to 1 • We define b’ = FORWARD(b, a, o)
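A minimal sketch of the FORWARD update defined above, using a small invented two-state model with two observations (all model numbers here are for illustration only):

```python
# Sketch of b' = FORWARD(b, a, o) from the update rule above:
#   b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)
# The tiny two-state model below is invented purely for illustration.

T = {  # T[(s, a)] -> {s': probability}
    ("s0", "go"): {"s0": 0.3, "s1": 0.7},
    ("s1", "go"): {"s0": 0.1, "s1": 0.9},
}
O = {  # O[s][o] = probability of observing o in state s
    "s0": {"beep": 0.8, "quiet": 0.2},
    "s1": {"beep": 0.1, "quiet": 0.9},
}

def forward(b, a, o):
    """Exact belief update: predict with T, weight by O, then normalize."""
    unnormalized = {}
    for s_next in O:
        pred = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in b)
        unnormalized[s_next] = O[s_next][o] * pred
    alpha = 1.0 / sum(unnormalized.values())   # normalizing constant
    return {s: alpha * p for s, p in unnormalized.items()}

b = {"s0": 0.5, "s1": 0.5}
print(forward(b, "go", "beep"))   # belief shifts toward the state that explains "beep"
```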

  12. Finding the best action • The optimal action depends only on the current belief state • Choosing the action: • Given belief state b, execute π*(b) • Receive observation o • Calculate the new belief state b’ • Set b ← b’
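The control loop on this slide could look like the sketch below. The policy, environment, and belief update here are placeholders standing in for π*(b), the real world, and the FORWARD update from the previous slide.

```python
import random

# Sketch of the act-observe-update loop described above. Everything below
# is a placeholder: a real agent would use the optimal belief-space policy
# pi*(b), a real environment, and the exact FORWARD update.

def policy(b):
    # placeholder for pi*(b): here, act as if in the most likely state
    return "go" if max(b, key=b.get) == "s0" else "stay"

def environment_step(a):
    # placeholder simulator returning some observation for action a
    return random.choice(["beep", "quiet"])

def forward(b, a, o):
    # placeholder for the exact belief update b' = FORWARD(b, a, o)
    return b

b = {"s0": 0.5, "s1": 0.5}
for _ in range(5):
    a = policy(b)              # execute pi*(b)
    o = environment_step(a)    # receive observation o
    b = forward(b, a, o)       # compute the new belief and set b <- b'
```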

  13. π*(b) vs. π*(s) • The POMDP belief space is continuous • In the 4x3 example, b lies in an 11-dimensional continuous space • b(s) is a point on the line between 0 and 1 • b is a point in an n-dimensional space

  14. Transition model & rewards • We need a different view of the transition model and reward • Transition model as τ(b, a, b’) • Instead of T(s, a, s’) • The actual state s isn’t known • Reward as ρ(b) • Instead of R(s) • Has to take into account the uncertainty in the belief state

  15. Transition model • Probability of getting observation o: • P(o|a,b) = ∑s’ P(o|a,s’,b) P(s’|a,b) = ∑s’ O(s’,o) P(s’|a,b) = ∑s’ [O(s’,o) ∑s T(s,a,s’) b(s)] • Transition model: • τ(b, a, b’) = P(b’|a,b) = ∑o P(b’|b,a,o) P(o|a,b) = ∑o P(b’|b,a,o) ∑s’ O(s’,o) ∑s T(s,a,s’) b(s) • P(b’|b,a,o) = 1 if b’ = FORWARD(b,a,o) • P(b’|b,a,o) = 0 otherwise
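Both quantities can be computed directly from these definitions. Below is a sketch using the same invented two-state model as in the FORWARD example; the numbers are illustrative only.

```python
# Sketch of P(o | a, b) and tau(b, a, b') from the formulas above,
# reusing the invented two-state model from the FORWARD example.

T = {
    ("s0", "go"): {"s0": 0.3, "s1": 0.7},
    ("s1", "go"): {"s0": 0.1, "s1": 0.9},
}
O = {
    "s0": {"beep": 0.8, "quiet": 0.2},
    "s1": {"beep": 0.1, "quiet": 0.9},
}

def prob_obs(o, a, b):
    """P(o | a, b) = sum_s' O(s', o) * sum_s T(s, a, s') * b(s)."""
    return sum(O[s2][o] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in b)
               for s2 in O)

def forward(b, a, o):
    """b' = FORWARD(b, a, o), the belief update from the earlier slide."""
    u = {s2: O[s2][o] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in b) for s2 in O}
    z = sum(u.values())
    return {s: p / z for s, p in u.items()}

def tau(b, a, b_next):
    """tau(b, a, b') = sum of P(o | a, b) over observations whose update gives b'."""
    return sum(prob_obs(o, a, b)
               for o in ["beep", "quiet"]
               if forward(b, a, o) == b_next)

b = {"s0": 0.5, "s1": 0.5}
print(prob_obs("beep", "go", b))                 # 0.24
print(tau(b, "go", forward(b, "go", "beep")))    # 0.24: only "beep" yields this b'
```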

  16. Reward function • ρ(b) = ∑s b(s) R(s) • The expected reward over all the states the agent might be in
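ρ(b) is simply an expectation under b, as in the tiny sketch below; the reward values assume the usual 4x3 setting of -0.04 per non-terminal state and ±1 at the exits, which the slides do not state explicitly.

```python
# Sketch of rho(b) = sum_s b(s) * R(s): the expected immediate reward
# under belief b. Reward values are an assumption (-0.04 per step, +-1 exits).

def rho(b, R):
    return sum(b[s] * R[s] for s in b)

R = {"plain": -0.04, "plus_exit": 1.0, "minus_exit": -1.0}
b = {"plain": 0.8, "plus_exit": 0.15, "minus_exit": 0.05}   # an arbitrary belief
print(rho(b, R))   # -0.032 + 0.15 - 0.05 = 0.068
```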

  17. From POMDP to MDP • τ(b, a, b’) and ρ(b) define an MDP over belief states • π*(b) for this MDP is the optimal policy for the original POMDP • The belief state is fully observable to the agent • We need new versions of Value / Policy iteration: • The belief space is continuous • The standard MDP versions don’t apply directly • Beyond this course

  18. Back to our example • Solution for the sensorless 4x3 environment: [Left, Up, Up, Right, Up, Up, Right, Up, Up, Right, Up, Right, Up, Right, Up …] • The policy is a fixed sequence since the problem is deterministic in belief space • The agent reaches +1 86.6% of the time • The expected utility is 0.38

  19. Computational complexity • Finite-horizon: • PSPACE-hard [Papadimitriou & Tsitsiklis 1987] • NP-complete if unobservable • Infinite-horizon: • Undecidable [Madani, Hanks & Condon 1999] • NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001] • NP-hard for the memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999]
