Partially Observable Markov Decision Process (Chapter 15 & 16) José Luis Peralta
Contents • POMDP • Example POMDP • Finite World POMDP algorithm • Practical Considerations • Approximate POMDP Techniques
Partially Observable Markov Decision Processes (POMDP) • POMDP: • Uncertainty in measurements of the state • Uncertainty in control effects • Adapt the previous Value Iteration Algorithm (VIA)
Partially Observable Markov Decision Processes (POMDP) • POMDP: • The world can't be sensed directly • Measurements are incomplete, noisy, etc. • Partial observability • The robot has to estimate a posterior distribution over possible world states.
Partially Observable Markov Decision Processes (POMDP) • POMDP: • An algorithm to find the optimal control policy exists for a FINITE WORLD: • state space • action space • space of observations • planning horizon (all finite) • The computation is complex • For the continuous case there are approximations
Partially Observable Markov Decision Processes (POMDP) • The algorithms we are going to study are all based on Value Iteration (VI): V_T(x) = γ max_u [ r(x,u) + ∫ V_{T-1}(x') p(x'|u,x) dx' ], with V_1(x) = γ max_u r(x,u) • The same as before, but the state x is not observable • The robot has to make decisions in the BELIEF STATE • The belief is the robot's internal knowledge about the state of the environment • It lives in the space of posterior distributions over states
Partially Observable Markov Decision Processes (POMDP) • So in belief space: V_T(b) = γ max_u [ r(b,u) + ∫ V_{T-1}(b') p(b'|u,b) db' ], with V_1(b) = γ max_u r(b,u) • Control policy: π_T(b) = argmax_u [ r(b,u) + ∫ V_{T-1}(b') p(b'|u,b) db' ]
Partially Observable Markov Decision Processes (POMDP) • Belief bel(x) • Each value in a POMDP is a function of an entire probability distribution • Problems: • a finite state space gives a continuous belief space • a continuous state space gives an infinitely-dimensional belief continuum • There is also complexity in calculating the value function, because of the integral over all possible distributions
Partially Observable Markov Decision Processes (POMDP) • In the end an optimal solution exists for an interesting special case, the finite world: • state space, action space, space of observations, and planning horizon all finite • The value function solutions are piecewise linear functions over the belief space • This arises because: • expectation is a linear operation • the robot can select different controls in different parts of the belief space
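As a concrete picture of this piecewise-linear structure (standard finite-POMDP notation, not something recovered from the slides), each finite-horizon value function can be written as a maximum over a finite set of linear functions of the belief, often called α-vectors:

V_T(b) = max_k Σ_i α_k,i · b(x_i)

where each coefficient vector (α_k,1, …, α_k,N) describes one linear facet; convexity follows because a maximum of linear functions is convex.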
Example POMDP • 2 states (x1, x2 in the notation of [1]) • 3 control actions (u1, u2, u3)
Example POMDP • When the robot executes a terminal action it receives a payoff r(x, u) • Dilemma: each terminal action has opposite payoffs in the two states, so knowledge of the state translates directly into payoff
Example POMDP • To acquire knowledge the robot has a third control (with a cost: the cost of waiting, the cost of sensing, etc.), and this control affects the state of the world in a non-deterministic manner:
Example POMDP • Benefit: before each control decision the robot can sense • By sensing, the robot gains knowledge about the state • It can make better control decisions • It obtains a higher payoff in expectation • In the case of the third control action, the robot senses without executing a terminal action
Example POMDP • The measurement model is governed by the following probability distribution:
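The actual numbers were shown graphically on the slide and did not survive extraction; a generic two-state measurement model of this kind has the form (q is an illustrative parameter, not a value recovered from the slide):

p(z1 | x1) = q, p(z2 | x1) = 1 − q
p(z1 | x2) = 1 − q, p(z2 | x2) = q

so that z1 is evidence for x1 and z2 is evidence for x2 whenever q > 0.5.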
Example POMDP • This example is easy to graph over the belief space (2 states) • The belief state is summarized by a single number p1 = b(x1), with b(x2) = 1 − p1
Example POMDP • Control policy • A function that maps the unit interval [0, 1] (the belief parameter p1) to the space of all actions
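Because the belief is fully described by p1, a policy is literally a function on [0, 1]. A minimal Python sketch with purely illustrative thresholds (the real switching points come out of the value iteration developed below):

def policy(p1: float) -> str:
    """Map the belief parameter p1 = b(x1) to a control action.

    The thresholds 0.3 and 0.7 are placeholders for illustration;
    the actual switching points are determined by value iteration.
    """
    if p1 < 0.3:
        return "u1"   # terminal action favored when x2 is likely (illustrative)
    elif p1 > 0.7:
        return "u2"   # terminal action favored when x1 is likely (illustrative)
    else:
        return "u3"   # too uncertain: sense/wait instead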
Example POMDP – Control Choice • Control choice (when to execute what control?) • First consider the immediate payoff at horizon T = 1 • The payoff is now a function of the belief state: r(b, u) = E_x[r(x, u)] = Σ_x b(x) r(x, u) • So for each action u, the expected payoff is a linear function of the belief; this is the payoff in POMDPs
Example POMDP – Control Choice • First we calculate r(b, u) for every action • At horizon 1 the robot simply selects the action with the highest expected payoff: V_1(b) = max_u r(b, u) • This is a piecewise linear, convex function: the maximum of the individual payoff functions
Example POMDP – Control Choice • From V_1(b) we read off the optimal policy: the robot selects the action whose payoff line is highest at the current belief • The transition in the optimal policy occurs at the belief where the payoff lines intersect
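A minimal numeric sketch of this horizon-1 step; the payoff values below are illustrative assumptions chosen so that the two terminal actions have opposite payoffs in the two states (they are not recovered from the slides):

import numpy as np

# Illustrative payoffs r(x, u); rows are states (x1, x2), columns are actions (u1, u2, u3).
r = np.array([[-100.0,  100.0, -1.0],
              [ 100.0,  -50.0, -1.0]])

p1 = np.linspace(0.0, 1.0, 101)          # belief parameter b(x1)
belief = np.stack([p1, 1.0 - p1])        # shape (2, 101)

# Expected payoff r(b, u) = sum_x b(x) r(x, u): one line per action.
r_b = r.T @ belief                       # shape (3, 101)

V1 = r_b.max(axis=0)                     # horizon-1 value function: max of the payoff lines
best_action = r_b.argmax(axis=0)         # optimal action index at each belief
# The points where best_action changes are where the payoff lines intersect.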
Example POMDP - Sensing • Now we add perception: what if the robot can sense before it chooses a control? • How does this affect the optimal value function? • Sensing provides information about the state and enables the robot to choose a better control action • In the previous example we computed the expected payoff V_1(b); how much better will it be after sensing?
Example POMDP – Control Choice • The belief after sensing, as a function of the belief before sensing, is given by Bayes rule
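In the two-state parametrization, the Bayes update after observing z1 has a simple closed form (writing p1 for the belief before sensing and p1' for the belief after):

p1' = p(z1 | x1) p1 / ( p(z1 | x1) p1 + p(z1 | x2) (1 − p1) )

and similarly for z2; the denominator is the probability p(z1) of the measurement under the current belief.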
Example POMDP – Control Choice • How does this affect the value function?
Example POMDP – Control Choice • Mathematically, this is just replacing the belief b by the Bayes-updated belief in the value function V_1
Example POMDP – Control Choice • However, our interest is the complete expected value function after sensing, which also accounts for the probability of observing each possible measurement. This is given by:
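Written out, the expected value after sensing averages the updated value over the possible measurements:

V̄_1(b) = Σ_z p(z | b) · V_1( B(b, z) )

where B(b, z) is the Bayes-updated belief. Because V_1 is a maximum of linear functions and B(b, z) carries p(z | b) in its denominator, those factors cancel and V̄_1 is again piecewise linear and convex in b.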
Example POMDP – Control Choice • And this results in:
Example POMDP – Control Choice Mathematically
Example POMDP - Prediction • To plan at a horizon larger than 1 we have to take the state transition into account and project our value function accordingly, using the transition probability model of the sensing/waiting control • If the robot is certain it is in x1, the projected belief is the transition distribution from x1; if it is certain it is in x2, the projected belief is the transition distribution from x2 • In between, the expectation is linear in the belief
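Concretely, the prediction step maps the belief through the transition model. With generic transition probabilities (the slide's actual numbers did not survive extraction), the projected belief parameter is:

p1' = p(x1 | x1, u3) · p1 + p(x1 | x2, u3) · (1 − p1)

which is affine in p1, so projecting a piecewise-linear value function through this map keeps it piecewise linear.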
Example POMDP – Prediction • And this results in:
Example POMDP – Prediction • And adding back the payoff lines of the terminal actions u1 and u2, we have:
Example POMDP – Prediction • Mathematically, the projected value function also includes the cost of executing u3
Example POMDP – Pruning • A full backup quickly produces a large number of linear constraints, most of which are dominated everywhere and can be pruned • Without pruning, the full backup is impractical: efficient approximate POMDP techniques are needed
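A minimal sketch of the simplest pruning idea for this one-dimensional belief space: keep a linear function only if it attains the maximum somewhere on [0, 1]. The grid test below is an illustrative stand-in; exact pruning is usually formulated with linear programs.

import numpy as np

def prune(lines, grid_size=1001):
    """Keep only the linear functions a*p1 + b that attain the max somewhere on [0, 1].

    `lines` is an array of shape (K, 2): each row holds (slope, intercept).
    This grid-based test is an illustrative stand-in for exact LP-based pruning.
    """
    lines = np.asarray(lines, dtype=float)
    p1 = np.linspace(0.0, 1.0, grid_size)
    values = lines[:, 0][:, None] * p1[None, :] + lines[:, 1][:, None]  # (K, grid)
    winners = np.unique(values.argmax(axis=0))  # indices that are maximal somewhere
    return lines[winners]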
Finite World POMDP algorithm • To understand this, read the mathematical derivation of POMDPs, pp. 531-536 in [1]
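For orientation only, here is a compact Python sketch of one exact value-iteration backup over α-vectors for a finite POMDP, in the spirit of that derivation; the variable names and exact structure are my own, and no pruning is included:

import numpy as np
from itertools import product

def pomdp_backup(alphas, R, T, O, gamma=1.0):
    """One exact backup of a finite POMDP value function.

    alphas : list of alpha-vectors, each of shape (n_states,), representing V_{t-1}
    R      : reward array, shape (n_states, n_actions)
    T      : transition probabilities, T[u, x, x'] = p(x' | x, u)
    O      : measurement probabilities, O[z, x'] = p(z | x')
    Returns the (unpruned) list of alpha-vectors representing V_t.
    """
    n_actions = R.shape[1]
    n_obs = O.shape[0]
    new_alphas = []
    for u in range(n_actions):
        # g[z][k][x] = sum_x' alpha_k[x'] * p(z | x') * p(x' | x, u)
        g = [[T[u] @ (O[z] * a) for a in alphas] for z in range(n_obs)]
        # One new alpha-vector per choice of an old alpha for every observation.
        for choice in product(range(len(alphas)), repeat=n_obs):
            alpha_new = R[:, u] + gamma * sum(g[z][choice[z]] for z in range(n_obs))
            new_alphas.append(alpha_new)
    return new_alphas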
Example POMDP – Practical Considerations • It looks easy, so let's try something more “real”: the probabilistic robot “RoboProb”
Example POMDP – Practical Considerations • Probabilistic robot “RoboProb” • 11 states • 5 control actions, one of which is sensing without moving • Transition model: the intended motion succeeds with probability 0.8, and the robot drifts to either side with probability 0.1 each
Example POMDP – Practical Considerations • Probabilistic robot “RoboProb” • “Reward” payoff: the same payoff set is used for all control actions (an example is shown on the slide)
Example POMDP – Practical Considerations • It’s getting kind of hard :S… • Probabilistic robot “RoboProb” • Transition probability example: the intended move has probability 0.8, each lateral drift has probability 0.1
Example POMDP – Practical Considerations It’s getting kind of hard :S… Probabilistic Robot “RoboProb” Transition Probability Example
Example POMDP – Practical Considerations It’s getting kind of hard :S… Probabilistic Robot “RoboProb” Measurement Probability
Example POMDP – Practical Considerations It’s getting kind of hard :S… Probabilistic Robot “RoboProb” Belief States Impossible to graph!!
Example POMDP – Practical Considerations • It’s getting kind of hard :S… • Probabilistic robot “RoboProb” • Each linear function results from executing a control, followed by observing a measurement, and then executing another control.
Example POMDP – Practical Considerations • It’s getting kind of hard :S… • Probabilistic robot “RoboProb” • Defining the measurement probability • Defining the “reward” payoff • Defining the transition probability • Merging the transition (control) probabilities
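The slide shows these definitions as screenshots that did not survive extraction; a hypothetical Python sketch of what the corresponding data structures could look like (the state and action counts follow the slides, while every numeric value and the observation count are assumptions):

import numpy as np

n_states, n_actions, n_obs = 11, 5, 11   # 11 cells, 4 moves + "sense without moving"; n_obs assumed

# Measurement probability O[z, x] = p(z | x): a noisy identity used purely as a placeholder.
O = np.full((n_obs, n_states), 0.02)
np.fill_diagonal(O, 1.0 - 0.02 * (n_states - 1))

# "Reward" payoff R[x, u]: the same payoff profile reused for every control action.
state_payoff = np.zeros(n_states)        # placeholder values
R = np.tile(state_payoff[:, None], (1, n_actions))

# Transition probability T[u, x, x'] = p(x' | x, u): intended move 0.8, drift 0.1 + 0.1.
T = np.zeros((n_actions, n_states, n_states))
# ... fill in per the map geometry; the sensing action leaves the state unchanged:
T[-1] = np.eye(n_states)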
Example POMDP – Practical Considerations • It’s getting kind of hard :S… • Probabilistic robot “RoboProb” • Setting the beliefs • Executing a control • Sensing • Executing the next control
Example POMDP – Practical Considerations • Now what…? • Probabilistic robot “RoboProb” • Calculating the value function over beliefs: the real problem is to compute the belief transition distribution p(b' | u, b)
Example POMDP – Practical Considerations • The real problem is to compute p(b' | u, b); the key factor in this update is this conditional probability • It specifies a distribution over probability distributions: given a belief b and a control action u, the outcome b' is itself a distribution over states • Because the next belief also depends on the next measurement, and the measurement is generated stochastically, the next belief is a random quantity
Example POMDP – Practical Considerations The real problem is to compute So we make Contain only on non-zero term =