
Class 4



Presentation Transcript


  1. Class 4 • Partially Observable MDPs

  2. Partially Observable Markov Decision Processes (POMDPs) • Set S = s1,…,sn of possible states • Set A = a1,…,am of possible actions • Set O = o1,…,ol of possible observations • Reward model R: same as before • Transition model Ta: same as before • Observation model P(o | s) • function from S to probability distributions over O • Initial model P0(s) • probability distribution over initial state of world
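
As a rough sketch (not part of the slides), these pieces can be bundled into one Python container; every field name below is an illustrative choice, not notation fixed by the course.

```python
# A minimal sketch of a POMDP specification as plain Python data.
# Field names are illustrative; only the components mirror the slide.
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list        # S = {s1, ..., sn}
    actions: list       # A = {a1, ..., am}
    observations: list  # O = {o1, ..., ol}
    R: dict             # reward model: R[(s, a)] -> reward
    T: dict             # transition model: T[a][s] -> {s': P(s' | s, a)}
    Z: dict             # observation model: Z[s] -> {o: P(o | s)}
    P0: dict            # initial model: P0[s] -> probability the world starts in s
```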

  3. Interaction Model • Environment starts in state i with probability P0(i) • Agent receives observation o with probability P(o | i) • Agent chooses action a • Agent receives reward R(i,a) • State transitions to j according to Ta(i, ·) • Agent receives observation o’ with probability P(o’ | j) • Agent chooses action b • Agent receives reward R(j,b) • State transitions to k according to Tb(j, ·), etc.
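
The loop above can be turned into a short simulation sketch, assuming the POMDP container from the previous snippet and an agent given as a function from the percept history so far to an action (both assumptions, not course code).

```python
import random

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def simulate(pomdp, agent, horizon):
    """Run the interaction loop: observe, act, collect reward, transition."""
    total_reward, history = 0.0, []
    state = sample(pomdp.P0)                      # environment starts in i ~ P0
    for _ in range(horizon):
        history.append(sample(pomdp.Z[state]))    # agent receives o ~ P(o | i)
        action = agent(history)                   # agent chooses an action (may use all percepts)
        total_reward += pomdp.R[(state, action)]  # agent receives R(i, a)
        state = sample(pomdp.T[action][state])    # state transitions according to Ta(i, ·)
    return total_reward
```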

  4. Example 1: Miniature POMDP • States S1, S2 • Actions A1, A2 • Observations O1, O2 • Reward function:

  5. Miniature POMDP: Transition Models

  6. Miniature POMDP: Observation and Initial Models P0(S1) = 0.3, P0(S2) = 0.7
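
Only the initial distribution survives in the transcript (the transition, observation, and reward tables were figures). As a sketch, a concrete instance could look like the following, where every number except P0 is a made-up placeholder.

```python
# Miniature POMDP as an instance of the sketch class above.
# Only P0 = (0.3, 0.7) comes from the slide; all other numbers are placeholders.
mini = POMDP(
    states=["S1", "S2"],
    actions=["A1", "A2"],
    observations=["O1", "O2"],
    R={("S1", "A1"): 1.0, ("S1", "A2"): 0.0,   # made-up rewards
       ("S2", "A1"): 0.0, ("S2", "A2"): 1.0},
    T={"A1": {"S1": {"S1": 0.9, "S2": 0.1},    # made-up transition probabilities
              "S2": {"S1": 0.4, "S2": 0.6}},
       "A2": {"S1": {"S1": 0.2, "S2": 0.8},
              "S2": {"S1": 0.5, "S2": 0.5}}},
    Z={"S1": {"O1": 0.8, "O2": 0.2},           # made-up observation probabilities
       "S2": {"O1": 0.3, "O2": 0.7}},
    P0={"S1": 0.3, "S2": 0.7},                 # from the slide
)
```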

  7. Example 2: Robot Navigation With Doors • Robot needs to get to goal and avoid stairwell • There are doors between some locations • Doors open and close stochastically • Robot can only see doors at its location • Robot does not know its location and heading

  8. Specification of POMDP • State must specify • Location of robot • Heading of robot • For each door, whether it is open or closed • Transition and reward models same as before • What are the observations?

  9. Specification of POMDP • State must specify • Location of robot • Heading of robot • For each door, whether it is open or closed • Transition and reward models same as before • What are the observations? • <door in front, door left, door right, door behind>

  10. Specification of POMDP • State must specify • Location of robot • Heading of robot • For each door, whether it is open or closed • Transition and reward models same as before • What are the observations? • <door in front, door left, door right, door behind> • Observation model: projects all doors onto current location
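
One way to encode this state and the "project all doors onto current location" idea in code; the map lookup `doors_at` and the field names are assumptions for illustration only.

```python
from collections import namedtuple

# State: robot location, heading, and the set of doors that are currently open.
NavState = namedtuple("NavState", ["location", "heading", "open_doors"])

def observe(state, doors_at):
    """Deterministic core of the observation model: project all doors onto the
    robot's current location. `doors_at[(location, heading)]` is a hypothetical
    map giving the door id (or None) in front / left / right / behind.
    Returns the observation <door in front, door left, door right, door behind>."""
    nearby = doors_at[(state.location, state.heading)]
    return tuple(door is not None and door in state.open_doors for door in nearby)
```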

  11. Example 3: Plant-Eating Robot • Given a plant, robot must decide what to do with it • Robot can test the plant to get an observation • test can be performed multiple times • Actions: eat, test, destroy • Observations: none, N, P • Rewards: +10 for eating nutritious plant, -20 for eating poisonous, -1 for testing
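
A sketch of the plant robot as an instance of the earlier POMDP container. The rewards (+10 / -20 / -1) are from the slide; the prior, the destroy reward, and the test accuracy are assumptions, and the "none" observation is glossed over because the simple P(o | s) model here does not depend on whether a test was just performed.

```python
# Plant-eating robot as a POMDP instance (reuses the POMDP sketch class).
# Assumptions: 0.5/0.5 prior, 0.9 test accuracy, reward 0 for destroy.
plant = POMDP(
    states=["nutritious", "poisonous"],
    actions=["eat", "test", "destroy"],
    observations=["none", "N", "P"],
    R={("nutritious", "eat"): 10, ("poisonous", "eat"): -20,        # from the slide
       ("nutritious", "test"): -1, ("poisonous", "test"): -1,       # from the slide
       ("nutritious", "destroy"): 0, ("poisonous", "destroy"): 0},  # assumed
    T={a: {s: {s: 1.0} for s in ["nutritious", "poisonous"]}        # plant type never changes
       for a in ["eat", "test", "destroy"]},
    Z={"nutritious": {"N": 0.9, "P": 0.1},                          # assumed test accuracy
       "poisonous":  {"N": 0.1, "P": 0.9}},
    P0={"nutritious": 0.5, "poisonous": 0.5},                       # assumed prior
)
```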

  12. Visualization of Model

  13. POMDPs are Hard! Why? • Optimal action depends on entire history of percepts and actions • size of policy exponential in length of history • Optimal action considers future beliefs, not just state • need information-gathering actions

  14. Hardness • Finite horizon: exponential in horizon • Infinite horizon: undecidable • Dilemma: • POMDPs describe many real world domains • but getting optimal policies is too hard • in fact optimal policies are too large to describe • What can we do? • try to find a good, not necessarily optimal policy

  15. Types of Policy: 1. Finite Memory Policy • Specifies what to do for each value of the last k percepts • E.g.: • π() = Test • π(N) = Test • π(P) = Test • π(NP) = Test • π(PN) = Test • π(NN) = Eat • π(PP) = Destroy
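
The table above maps the last k = 2 test readings directly to actions, so it can be written as a plain lookup; the fallback to Test for histories not listed on the slide is an assumption.

```python
# Finite-memory policy from the slide: a lookup on the last k = 2 percepts
# (most recent last). Unlisted histories fall back to Test (an assumption).
FINITE_MEMORY_POLICY = {
    (): "test",
    ("N",): "test", ("P",): "test",
    ("N", "P"): "test", ("P", "N"): "test",
    ("N", "N"): "eat", ("P", "P"): "destroy",
}

def finite_memory_act(history, k=2):
    """Choose an action from the last k percepts only."""
    return FINITE_MEMORY_POLICY.get(tuple(history[-k:]), "test")
```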

  16. Types of Policy: 2. Finite State Policy • Agent is a finite state machine • Nodes are internal states of agent • not to be confused with states of the world • Edges are labeled with observations • Nodes are labeled with action to take • Is this the same policy as the previous finite memory policy?
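
As a sketch, such a controller is just two tables: an action per internal node and a successor node per (node, observation) edge. The particular nodes and edges below are illustrative, not the machine drawn on the slide, so they do not settle the question asked above.

```python
# Illustrative finite state controller (not the slide's figure).
node_action = {"start": "test", "saw_N": "test", "saw_P": "test",
               "sure_N": "eat", "sure_P": "destroy"}
node_next = {("start", "N"): "saw_N", ("start", "P"): "saw_P",
             ("saw_N", "N"): "sure_N", ("saw_N", "P"): "start",
             ("saw_P", "P"): "sure_P", ("saw_P", "N"): "start"}

def run_controller(observations, start="start"):
    """Follow observation-labelled edges; emit the action attached to each node."""
    node, actions = start, []
    for obs in observations:
        actions.append(node_action[node])
        node = node_next.get((node, obs), node)   # stay put on unlabelled edges
    return actions
```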

  17. Types of Policy: 3. Belief State Policy • A belief state is a probability distribution over the current state of the system, given a sequence of observations • Belief state policy: divide belief state into regions, and specify an action for each region, e.g. • if P(N | History) > 0.9, Eat • if 0.2 < P(N | History) ≤ 0.9, Test • if P(N | History) ≤ 0.2, Destroy
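
The three regions above translate directly into a threshold rule on the belief that the plant is nutritious:

```python
def belief_policy(p_nutritious):
    """Pick an action from the belief P(N | history), using the slide's regions."""
    if p_nutritious > 0.9:
        return "eat"
    if p_nutritious > 0.2:      # 0.2 < P(N | history) <= 0.9
        return "test"
    return "destroy"            # P(N | history) <= 0.2
```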

  18. Using a Belief State Policy • We need to be able to compute the belief state efficiently, given a history • Let Pt denote the belief state at time t, i.e. Pt(i) = P(St = i | o1,…,ot)

  19. Using a Belief State Policy • We need to be able to compute the belief state efficiently, given a history • Let Pt denote the belief state at time t, i.e. Pt(i) = P(St = i | o1,…,ot) • Updating rule: Pt+1(j) = α P(ot+1 | j) Σi Ta(i, j) Pt(i), where a is the action taken at time t and α is a normalizing constant
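
A direct implementation sketch of this update, assuming the POMDP container from the earlier snippets and an observation with nonzero predicted probability:

```python
def update_belief(pomdp, belief, action, observation):
    """Belief update: new belief(j) is proportional to
    P(observation | j) * sum_i Ta(i, j) * belief(i)."""
    new_belief = {}
    for j in pomdp.states:
        predicted_j = sum(pomdp.T[action][i].get(j, 0.0) * belief[i]
                          for i in pomdp.states)
        new_belief[j] = pomdp.Z[j].get(observation, 0.0) * predicted_j
    alpha = sum(new_belief.values())            # normalizing constant
    return {j: p / alpha for j, p in new_belief.items()}
```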
