Class 4 • Partially Observable MDPs
Partially Observable Markov Decision Processes (POMDPs) • Set S = s1,…,sn of possible states • Set A = a1,…,am of possible actions • Set O = o1,…,ol of possible observations • Reward model R: same as before • Transition model Ta: same as before • Observation model P(o | s) • function from S to probability distributions over O • Initial model P0(s) • probability distribution over initial state of world
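These components are enough to write the model down directly. A minimal sketch in Python (the class layout and field names below are illustrative choices, not taken from any particular library):

```python
# A minimal container for a discrete POMDP, mirroring the components on this slide.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[str]                     # S = s1, ..., sn
    actions: List[str]                    # A = a1, ..., am
    observations: List[str]               # O = o1, ..., ol
    R: Dict[Tuple[str, str], float]       # reward R(s, a)
    T: Dict[Tuple[str, str, str], float]  # transition probability, keyed (s, a, s')
    Z: Dict[Tuple[str, str], float]       # observation model P(o | s), keyed (o, s)
    P0: Dict[str, float]                  # initial distribution over states
```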
Interaction Model • Environment starts in state i with probability P0(i) • Agent receives observation o with probability P(o | i) • Agent chooses action a • Agent receives reward R(i,a) • State transitions to j according to Ta(i,·) • Agent receives observation o' with probability P(o' | j) • Agent chooses action b • Agent receives reward R(j,b) • State transitions to k according to Tb(j,·), and so on
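This loop is straightforward to simulate once the model is written down. A sketch, assuming the hypothetical POMDP container from the previous block and some policy function mapping the observation history to an action:

```python
# Simulating the interaction loop above.
import random

def sample(dist):
    """Draw a key from a dict mapping outcomes to probabilities."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def simulate(pomdp, policy, horizon):
    total_reward = 0.0
    history = []
    # Environment starts in state i with probability P0(i)
    state = sample(pomdp.P0)
    for t in range(horizon):
        # Agent receives observation o with probability P(o | state)
        obs = sample({o: pomdp.Z[(o, state)] for o in pomdp.observations})
        history.append(obs)
        # Agent chooses an action and receives reward R(state, action)
        action = policy(history)
        total_reward += pomdp.R[(state, action)]
        # State transitions according to Ta(state, .)
        state = sample({s2: pomdp.T[(state, action, s2)] for s2 in pomdp.states})
    return total_reward
```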
Example 1: Miniature POMDP • States S1, S2 • Actions A1, A2 • Observations O1, O2 • Reward function:
Miniature POMDP: Observation and Initial Models • P0(S1) = 0.3, P0(S2) = 0.7
Example 2: Robot Navigation With Doors • Robot needs to get to goal and avoid stairwell • There are doors between some locations • Doors open and close stochastically • Robot can only see doors at its location • Robot does not know its location and heading
Specification of POMDP • State must specify • Location of robot • Heading of robot • For each door, whether it is open or closed • Transition and reward models same as before • What are the observations? • <door in front, door left, door right, door behind> • Observation model: projects all doors onto current location
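One way to picture that projection is the sketch below. The state encoding (location, heading, which doors are open) and the door-layout lookup are hypothetical, and a real sensor model would also add noise:

```python
# A sketch of the "project doors onto the current location" observation function.
HEADINGS = ["N", "E", "S", "W"]

def observe(location, heading, door_open, doors_at):
    """Return <door in front, door left, door right, door behind> for this state.

    door_open : dict mapping door id -> True/False
    doors_at  : dict mapping (location, absolute direction) -> door id (or None)
    """
    idx = HEADINGS.index(heading)
    relative = {
        "front":  HEADINGS[idx],
        "right":  HEADINGS[(idx + 1) % 4],
        "behind": HEADINGS[(idx + 2) % 4],
        "left":   HEADINGS[(idx + 3) % 4],
    }
    obs = {}
    for rel, absolute in relative.items():
        door = doors_at.get((location, absolute))
        obs[rel] = bool(door is not None and door_open[door])
    return (obs["front"], obs["left"], obs["right"], obs["behind"])
```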
Example 3: Plant-Eating Robot • Given a plant, robot must decide what to do with it • Robot can test the plant to get an observation • test can be performed multiple times • Actions: eat, test, destroy • Observations: none, N, P • Rewards: +10 for eating nutritious plant, -20 for eating poisonous, -1 for testing
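Written out as a sketch with the numbers given above. Note that the test's accuracy (0.8 below) and the reward for destroying the plant are not specified on this slide, so those values are placeholder assumptions:

```python
# The plant-eating robot, using the rewards given on the slide.
states = ["Nutritious", "Poisonous"]
actions = ["eat", "test", "destroy"]
observations = ["none", "N", "P"]

R = {
    ("Nutritious", "eat"):    +10,  # eating a nutritious plant
    ("Poisonous",  "eat"):    -20,  # eating a poisonous plant
    ("Nutritious", "test"):    -1,  # testing costs 1 either way
    ("Poisonous",  "test"):    -1,
    ("Nutritious", "destroy"):  0,  # assumed, not given on the slide
    ("Poisonous",  "destroy"):  0,  # assumed, not given on the slide
}

# Hypothetical observation model after a test: the reading matches the true
# state 80% of the time, and repeated tests give independent readings.
P_obs_after_test = {
    ("N", "Nutritious"): 0.8, ("P", "Nutritious"): 0.2,
    ("N", "Poisonous"):  0.2, ("P", "Poisonous"):  0.8,
}
```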
POMDPs are Hard! Why? • Optimal action depends on entire history of percepts and actions • size of policy exponential in length of history • Optimal action considers future beliefs, not just state • need information-gathering actions
Hardness • Finite horizon: solving is exponential in the horizon • Infinite horizon: finding an optimal policy is undecidable • Dilemma: • POMDPs describe many real-world domains • but computing optimal policies is too hard • in fact, optimal policies may be too large even to write down • What can we do? • try to find a good, not necessarily optimal, policy
Types of Policy: 1. Finite Memory Policy • Specifies what to do for each value of the last k percepts • E.g.: • π() = Test • π(N) = Test • π(P) = Test • π(NP) = Test • π(PN) = Test • π(NN) = Eat • π(PP) = Destroy
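As a sketch, such a policy is just a lookup table over the last k percepts (here k = 2; using Test as a fallback for unlisted histories is an assumption on my part):

```python
# The finite memory policy from this slide as a lookup table over the last
# two observations (N = nutritious reading, P = poisonous reading).
finite_memory_policy = {
    ():         "test",
    ("N",):     "test",
    ("P",):     "test",
    ("N", "P"): "test",
    ("P", "N"): "test",
    ("N", "N"): "eat",
    ("P", "P"): "destroy",
}

def act(percept_history, k=2):
    key = tuple(percept_history[-k:])
    return finite_memory_policy.get(key, "test")  # fallback is assumed
```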
Types of Policy: 2. Finite State Policy • Agent is a finite state machine • Nodes are internal states of agent • not to be confused with states of the world • Edges are labeled with observations • Nodes are labeled with action to take • Is this the same policy as the previous finite memory policy?
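The slide's diagram is not reproduced in this text, so the machine below is only a hypothetical illustration of the encoding: each node carries an action, each (node, observation) pair names the next node, and unlisted observations leave the node unchanged. Whether a machine like this matches the finite memory policy above is exactly the question the slide poses:

```python
# A finite state policy: actions on nodes, observations on edges.
fsm_action = {
    "start":  "test",
    "saw_N":  "test",
    "saw_P":  "test",
    "sure_N": "eat",
    "sure_P": "destroy",
}
fsm_next = {
    ("start", "N"): "saw_N",  ("start", "P"): "saw_P",
    ("saw_N", "N"): "sure_N", ("saw_N", "P"): "saw_P",
    ("saw_P", "P"): "sure_P", ("saw_P", "N"): "saw_N",
}

def run_fsm(observations):
    node = "start"
    for o in observations:
        node = fsm_next.get((node, o), node)  # self-loop on unlisted observations
    return fsm_action[node]
```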
Types of Policy: 3. Belief State Policy • A belief state is a probability distribution over the current state of the system, given a sequence of observations • Belief state policy: divide the space of belief states into regions, and specify an action for each region, e.g. • if P(N | History) > 0.9, Eat • if 0.2 < P(N | History) ≤ 0.9, Test • if P(N | History) ≤ 0.2, Destroy
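As a sketch, such a policy reduces to a few threshold tests on the current belief, here written against P(N | History), the probability that the plant is nutritious:

```python
# The threshold policy from this slide: the action depends only on which
# region the current belief falls into.
def belief_state_policy(p_nutritious):
    if p_nutritious > 0.9:
        return "eat"
    elif p_nutritious > 0.2:
        return "test"
    else:
        return "destroy"
```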
Using a Belief State Policy • We need to be able to compute the belief state efficiently, given a history • Let Pt denote the belief state at time t, i.e. Pt(i) = P(St = i | o1,…,ot) • Updating rule, where a is the action taken at time t: Pt+1(j) ∝ P(ot+1 | j) Σi Ta(i,j) Pt(i), normalized so that the Pt+1(j) sum to 1
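A direct implementation of this update, reusing the hypothetical POMDP container from the first sketch:

```python
# Belief update: fold in the transition for the chosen action, weight each
# state by the likelihood of the new observation, then renormalize.
def update_belief(pomdp, belief, action, observation):
    new_belief = {}
    for j in pomdp.states:
        predicted = sum(pomdp.T[(i, action, j)] * belief[i] for i in pomdp.states)
        new_belief[j] = pomdp.Z[(observation, j)] * predicted
    total = sum(new_belief.values())
    return {j: p / total for j, p in new_belief.items()}
```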