Class 4 • Partially Observable MDPs
Partially Observable Markov Decision Processes (POMDPs) • Set S = s1,…,sn of possible states • Set A = a1,…,am of possible actions • Set O = o1,…,ol of possible observations • Reward model R: same as before • Transition model Ta: same as before • Observation model P(o | s) • function from S to probability distributions over O • Initial model P0(s) • probability distribution over initial state of world
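These components are enough to write the model down directly. A minimal sketch in Python (the class layout and field names below are illustrative choices, not taken from any particular library):

```python
# A minimal container for a discrete POMDP, mirroring the components on this slide.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[str]                     # S = s1, ..., sn
    actions: List[str]                    # A = a1, ..., am
    observations: List[str]               # O = o1, ..., ol
    R: Dict[Tuple[str, str], float]       # reward R(s, a)
    T: Dict[Tuple[str, str, str], float]  # transition probability, keyed (s, a, s')
    Z: Dict[Tuple[str, str], float]       # observation model P(o | s), keyed (o, s)
    P0: Dict[str, float]                  # initial distribution over states
```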
Interaction Model • Environment starts in state i with probability P0(i) • Agent receives observation o with probability P(o | i) • Agent chooses action a • Agent receives reward R(i,a) • State transitions to j according to Ta(i,·) • Agent receives observation o' with probability P(o' | j) • Agent chooses action b • Agent receives reward R(j,b) • State transitions to k according to Tb(j,·), and so on
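This loop is straightforward to simulate once the model is written down. A sketch, assuming the hypothetical POMDP container from the previous block and some policy function mapping the observation history to an action:

```python
# Simulating the interaction loop above.
import random

def sample(dist):
    """Draw a key from a dict mapping outcomes to probabilities."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def simulate(pomdp, policy, horizon):
    total_reward = 0.0
    history = []
    # Environment starts in state i with probability P0(i)
    state = sample(pomdp.P0)
    for t in range(horizon):
        # Agent receives observation o with probability P(o | state)
        obs = sample({o: pomdp.Z[(o, state)] for o in pomdp.observations})
        history.append(obs)
        # Agent chooses an action and receives reward R(state, action)
        action = policy(history)
        total_reward += pomdp.R[(state, action)]
        # State transitions according to Ta(state, .)
        state = sample({s2: pomdp.T[(state, action, s2)] for s2 in pomdp.states})
    return total_reward
```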
Example 1: Miniature POMDP • States S1, S2 • Actions A1, A2 • Observations O1, O2 • Reward function:
Miniature POMDP: Observation and Initial Models • P0(S1) = 0.3, P0(S2) = 0.7
Example 2: Robot Navigation With Doors • Robot needs to get to goal and avoid stairwell • There are doors between some locations • Doors open and close stochastically • Robot can only see doors at its location • Robot does not know its location and heading
Specification of POMDP • State must specify • Location of robot • Heading of robot • For each door, whether it is open or closed • Transition and reward models same as before • What are the observations? • <door in front, door left, door right, door behind> • Observation model: projects all doors onto current location
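One way to picture that projection is the sketch below. The state encoding (location, heading, which doors are open) and the door-layout lookup are hypothetical, and a real sensor model would also add noise:

```python
# A sketch of the "project doors onto the current location" observation function.
HEADINGS = ["N", "E", "S", "W"]

def observe(location, heading, door_open, doors_at):
    """Return <door in front, door left, door right, door behind> for this state.

    door_open : dict mapping door id -> True/False
    doors_at  : dict mapping (location, absolute direction) -> door id (or None)
    """
    idx = HEADINGS.index(heading)
    relative = {
        "front":  HEADINGS[idx],
        "right":  HEADINGS[(idx + 1) % 4],
        "behind": HEADINGS[(idx + 2) % 4],
        "left":   HEADINGS[(idx + 3) % 4],
    }
    obs = {}
    for rel, absolute in relative.items():
        door = doors_at.get((location, absolute))
        obs[rel] = bool(door is not None and door_open[door])
    return (obs["front"], obs["left"], obs["right"], obs["behind"])
```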
Example 3: Plant-Eating Robot • Given a plant, robot must decide what to do with it • Robot can test the plant to get an observation • test can be performed multiple times • Actions: eat, test, destroy • Observations: none, N, P • Rewards: +10 for eating nutritious plant, -20 for eating poisonous, -1 for testing
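Written out as a sketch with the numbers given above. Note that the test's accuracy (0.8 below) and the reward for destroying the plant are not specified on this slide, so those values are placeholder assumptions:

```python
# The plant-eating robot, using the rewards given on the slide.
states = ["Nutritious", "Poisonous"]
actions = ["eat", "test", "destroy"]
observations = ["none", "N", "P"]

R = {
    ("Nutritious", "eat"):    +10,  # eating a nutritious plant
    ("Poisonous",  "eat"):    -20,  # eating a poisonous plant
    ("Nutritious", "test"):    -1,  # testing costs 1 either way
    ("Poisonous",  "test"):    -1,
    ("Nutritious", "destroy"):  0,  # assumed, not given on the slide
    ("Poisonous",  "destroy"):  0,  # assumed, not given on the slide
}

# Hypothetical observation model after a test: the reading matches the true
# state 80% of the time, and repeated tests give independent readings.
P_obs_after_test = {
    ("N", "Nutritious"): 0.8, ("P", "Nutritious"): 0.2,
    ("N", "Poisonous"):  0.2, ("P", "Poisonous"):  0.8,
}
```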
POMDPs are Hard! Why? • Optimal action depends on entire history of percepts and actions • size of policy exponential in length of history • Optimal action considers future beliefs, not just state • need information-gathering actions
Hardness • Finite horizon: solving is exponential in the horizon • Infinite horizon: finding an optimal policy is undecidable • Dilemma: • POMDPs describe many real-world domains • but computing optimal policies is too hard • in fact, optimal policies may be too large even to write down • What can we do? • try to find a good, not necessarily optimal, policy
Types of Policy: 1. Finite Memory Policy • Specifies what to do for each value of the last k percepts • E.g.: • π() = Test • π(N) = Test • π(P) = Test • π(NP) = Test • π(PN) = Test • π(NN) = Eat • π(PP) = Destroy
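As a sketch, such a policy is just a lookup table over the last k percepts (here k = 2; using Test as a fallback for unlisted histories is an assumption on my part):

```python
# The finite memory policy from this slide as a lookup table over the last
# two observations (N = nutritious reading, P = poisonous reading).
finite_memory_policy = {
    ():         "test",
    ("N",):     "test",
    ("P",):     "test",
    ("N", "P"): "test",
    ("P", "N"): "test",
    ("N", "N"): "eat",
    ("P", "P"): "destroy",
}

def act(percept_history, k=2):
    key = tuple(percept_history[-k:])
    return finite_memory_policy.get(key, "test")  # fallback is assumed
```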
Types of Policy: 2. Finite State Policy • Agent is a finite state machine • Nodes are internal states of agent • not to be confused with states of the world • Edges are labeled with observations • Nodes are labeled with action to take • Is this the same policy as the previous finite memory policy?
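The slide's diagram is not reproduced in this text, so the machine below is only a hypothetical illustration of the encoding: each node carries an action, each (node, observation) pair names the next node, and unlisted observations leave the node unchanged. Whether a machine like this matches the finite memory policy above is exactly the question the slide poses:

```python
# A finite state policy: actions on nodes, observations on edges.
fsm_action = {
    "start":  "test",
    "saw_N":  "test",
    "saw_P":  "test",
    "sure_N": "eat",
    "sure_P": "destroy",
}
fsm_next = {
    ("start", "N"): "saw_N",  ("start", "P"): "saw_P",
    ("saw_N", "N"): "sure_N", ("saw_N", "P"): "saw_P",
    ("saw_P", "P"): "sure_P", ("saw_P", "N"): "saw_N",
}

def run_fsm(observations):
    node = "start"
    for o in observations:
        node = fsm_next.get((node, o), node)  # self-loop on unlisted observations
    return fsm_action[node]
```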
Types of Policy: 3. Belief State Policy • A belief state is a probability distribution over the current state of the system, given a sequence of observations • Belief state policy: divide the space of belief states into regions, and specify an action for each region, e.g. • if P(N | History) > 0.9, Eat • if 0.2 < P(N | History) ≤ 0.9, Test • if P(N | History) ≤ 0.2, Destroy
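As a sketch, such a policy reduces to a few threshold tests on the current belief, here written against P(N | History), the probability that the plant is nutritious:

```python
# The threshold policy from this slide: the action depends only on which
# region the current belief falls into.
def belief_state_policy(p_nutritious):
    if p_nutritious > 0.9:
        return "eat"
    elif p_nutritious > 0.2:
        return "test"
    else:
        return "destroy"
```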
Using a Belief State Policy • We need to be able to compute the belief state efficiently, given a history • Let Pt denote the belief state at time t, i.e. Pt(i) = P(St = i | o1,…,ot) • Updating rule, where a is the action taken at time t: Pt+1(j) ∝ P(ot+1 | j) Σi Ta(i,j) Pt(i), normalized so that the Pt+1(j) sum to 1
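A direct implementation of this update, reusing the hypothetical POMDP container from the first sketch:

```python
# Belief update: fold in the transition for the chosen action, weight each
# state by the likelihood of the new observation, then renormalize.
def update_belief(pomdp, belief, action, observation):
    new_belief = {}
    for j in pomdp.states:
        predicted = sum(pomdp.T[(i, action, j)] * belief[i] for i in pomdp.states)
        new_belief[j] = pomdp.Z[(observation, j)] * predicted
    total = sum(new_belief.values())
    return {j: p / total for j, p in new_belief.items()}
```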