Partially Observable Markov Decision Process (POMDP)

Douglas Aberdeen, National ICT Australia, 2003 • By Ye Fang, Department of Computer Science, Rice University

Overview • Recap POMDP and the exact solution • Heuristic methods • Heuristics for exact methods • Grid methods • Factored belief states • Simulation • Methods for continuous state and action spaces • Solution

Learning with a Model • The agent knows the model , , • Observation/action history: • Belief state • [Figure: belief-state example with probabilities 1/3, 1/3, 1/3; 1/2, 1/2; and 1, with a Goal state]

Learning with a Model • Update beliefs: • Long-term value of a belief state • Define:
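
The belief update can be sketched as a discrete Bayes filter. This is a minimal sketch, assuming a tabular model; the names `T`, `O`, and `update_belief` are illustrative and do not come from the slides.

```python
def update_belief(b, a, o, T, O):
    """Bayes-filter belief update for a discrete POMDP.

    b: list of probabilities over states.
    T[a][s][s2]: assumed probability of moving from s to s2 under action a.
    O[a][s2][o]: assumed probability of observing o after reaching s2 via a.
    """
    n = len(b)
    # Predict with the transition model, then weight by the observation likelihood.
    unnorm = [
        O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnorm]


# Hypothetical two-state model: one "stay" action and a noisy sensor.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.8, 0.2], [0.3, 0.7]]]
b_next = update_belief([0.5, 0.5], a=0, o=0, T=T, O=O)
```

Starting from a uniform belief, seeing observation 0 shifts probability mass toward state 0, because that state is more likely to produce it under the assumed sensor model.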

Complexity of Exact Methods • The number of states is exponential in the number of state variables: • Updating the belief state is expensive. • Belief-state monitoring is hard. • Exponential number of belief states: • PSPACE-hard for simplified finite-horizon POMDPs. • NP-hard to find a policy.

How to make POMDP feasible? • It is almost impossible to find an exact solution for a POMDP model • Where does the complexity of the exact solution come from? • Infinitely many belief states • Updating belief states and their value functions • Introduce heuristic methods for exact methods

How to make POMDP feasible? • Why can heuristics work? • Simplify the representation of the value function by assuming the system is an MDP. • Replace the belief state b with a real world state

Heuristic for Exact Methods • The intuition behind these heuristics is to treat the system as an MDP by finding an approximate projection from belief state to world state.

Heuristic for Exact Methods • Goal: • Find a good approximate projection from belief state to world state. • Find a good policy for each belief state.

Heuristic for Exact Methods • MLS (most likely state) • Voting heuristic • QMDP heuristic • Heuristic using the uncertainty of the belief state

MLS Heuristic • We can assume the system is in the most likely world state (MLS) i at time t. The policy then executes the action with the largest Q-value at state i.

MLS Heuristic • This method neglects all possible world states except the MLS at belief state b. • Ex: Given the optimal actions in a world with three states and two actions, u(s0) = a0, u(s1) = a0, u(s2) = a1, and b = [0.3, 0.3, 0.4], the MLS is s2, so the heuristic chooses a1.
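
The example above can be reproduced in a few lines. This is a minimal sketch; the function name `mls_action` and the list-based encoding of the policy are illustrative assumptions.

```python
def mls_action(b, policy):
    """Return the MDP-optimal action of the most likely state under belief b."""
    most_likely = max(range(len(b)), key=lambda s: b[s])
    return policy[most_likely]


b = [0.3, 0.3, 0.4]
policy = ["a0", "a0", "a1"]  # u(s0), u(s1), u(s2) from the example
chosen = mls_action(b, policy)  # s2 has the largest belief
```

Note that 60% of the probability mass sits on states whose optimal action is a0, yet the heuristic ignores it because no single one of those states is the most likely.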

Voting Heuristic • The voting heuristic assigns a probability distribution over the actions instead of over the states. • Given: • The action for each belief state:

Voting Heuristic • Ex: Given the optimal actions in a world with three states and two actions, u(s0) = a0, u(s1) = a0, u(s2) = a1, b = [0.3, 0.3, 0.4], V(s0,a0) = 5, V(s0,a1) = 4, V(s1,a0) = 5, V(s1,a1) = 4, V(s2,a0) = 0, V(s2,a1) = 10. At belief state b, a0 collects 0.6 of the vote and a1 collects 0.4, so voting picks a0, even though expectedR(a0) = 3 and expectedR(a1) = 6.4.
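
The vote count for this example can be sketched as follows. This is a minimal sketch, assuming each state votes for its MDP-optimal action with a weight equal to its belief; the name `voting_action` is illustrative.

```python
from collections import defaultdict


def voting_action(b, policy):
    """Each state votes for its optimal action, weighted by its belief."""
    votes = defaultdict(float)
    for s, p in enumerate(b):
        votes[policy[s]] += p
    winner = max(votes, key=votes.get)
    return winner, dict(votes)


b = [0.3, 0.3, 0.4]
policy = ["a0", "a0", "a1"]  # u(s0), u(s1), u(s2) from the example
chosen, votes = voting_action(b, policy)
```

Here a0 wins with roughly 0.6 of the vote, which differs from the MLS choice for the same belief and ignores how much reward each action is worth.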

Voting Heuristic • This method does not take the reward of an action into account. • Introduce QMDP, which emphasizes the Q-function of the optimal policy rather than the policy itself.

QMDP Heuristic • QMDP takes the uncertainty of the belief state into account only at the first step; afterwards it assumes the state is fully observable. • If the chosen action does little to disambiguate the state, this method cannot improve its actions over time.
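
A minimal sketch of QMDP action selection, assuming the Q-values of the underlying MDP are already available as a table `Q[s][a]`. The numbers reuse the values from the voting example, read here as Q-values; that reading and the name `qmdp_action` are assumptions.

```python
def qmdp_action(b, Q):
    """Pick the action with the largest belief-weighted MDP Q-value.

    Q[s][a]: assumed Q-value of action a in world state s.
    """
    n_actions = len(Q[0])
    scores = [
        sum(b[s] * Q[s][a] for s in range(len(b)))
        for a in range(n_actions)
    ]
    best = max(range(n_actions), key=lambda a: scores[a])
    return best, scores


b = [0.3, 0.3, 0.4]
Q = [[5, 4], [5, 4], [0, 10]]  # rows: s0, s1, s2; columns: a0, a1
best, scores = qmdp_action(b, Q)
```

With these numbers the belief-weighted scores are about 3 for a0 and 6.4 for a1, so QMDP chooses a1 where voting would choose a0.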

Shortcomings of the Heuristics • What if the belief state is close to uniform? • Ex: a robot trying to reach the other end of a featureless desert. From its observations, it holds almost the same belief of being at any location. • What if there is a lot of uncertainty in the information state? • Consider the uncertainty when taking actions

Formal measurement of Uncertainty • Entropy is the measure of a probability distribution that reflects how spiked or spread out the probability mass is, essentially capturing the amount of uncertainty with a single number. • f(.) is a discrete probability mass function.
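
As an illustration, the entropy of a belief state can be computed as below. This is a minimal sketch, assuming the Shannon entropy with natural logarithm; the normalized variant, which divides by the entropy of the uniform distribution so the result lies in [0, 1], is an added convenience and the function names are illustrative.

```python
import math


def entropy(f):
    """Shannon entropy of a discrete probability mass function f."""
    return -sum(p * math.log(p) for p in f if p > 0)


def normalized_entropy(f):
    """Entropy divided by its maximum, log(len(f)), giving a value in [0, 1]."""
    if len(f) < 2:
        return 0.0
    return entropy(f) / math.log(len(f))


spread = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # uniform belief
spiked = normalized_entropy([1.0, 0.0, 0.0, 0.0])      # certain belief
```

A uniform belief gives the maximum normalized entropy and a belief concentrated on one state gives zero.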

Two Objectives • When choosing actions, we want to: • Take actions that will yield the highest rewards. • Reduce the entropy of the information state.

Weighted Entropy Control • Intuition: relate the entropy to the rewards to give some rough measure of the value of information.

Weighted Entropy Control • When the normalized entropy is near 1, the environment is almost totally unobservable. • When it is near 0, the model is almost an MDP.

Weighted Entropy Control • Define VL to be a lower bound on the POMDP value function. • The value at each belief state is: • The control strategy will be:
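
One way to relate entropy to value, sketched here as an assumption rather than as the slide's exact formula, is to blend the lower bound and the MDP value by the normalized entropy: the lower bound dominates when the belief is nearly uniform, and the MDP value dominates when the belief is nearly certain. The names `weighted_entropy_value`, `v_lower`, and `v_mdp` are hypothetical.

```python
import math


def normalized_entropy(b):
    """Shannon entropy of b scaled to [0, 1]."""
    if len(b) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in b if p > 0)
    return h / math.log(len(b))


def weighted_entropy_value(b, v_lower, v_mdp):
    """Entropy-weighted blend of two value estimates (assumed form).

    v_lower: assumed lower-bound value VL at belief b (a number).
    v_mdp:   assumed per-state values of the underlying MDP.
    """
    h = normalized_entropy(b)
    v_mdp_b = sum(p * v for p, v in zip(b, v_mdp))
    return h * v_lower + (1.0 - h) * v_mdp_b


certain = weighted_entropy_value([1.0, 0.0], v_lower=2.0, v_mdp=[8.0, 6.0])
uniform = weighted_entropy_value([0.5, 0.5], v_lower=2.0, v_mdp=[8.0, 6.0])
```

Under this assumed form, a control strategy would pick the action whose blended value is largest.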

Other Heuristics for POMDP • Grid method • Factored belief method • Simulation

Grid method • Instead of composing the world state from the belief state, it picks a set of real world states. • How to choose the set of real world states (an interesting region of each belief state)? • How to interpolate?

How to choose grid points? • Use simulation to find useful points • Add points where the values differ a lot despite similar observations.

How to interpolate? • Maintain the convex nature of the value function: • f(g,u) is the value of grid point g under action u. • Examples: nearest neighbors, linear interpolation, etc.
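
Two simple interpolation rules over grid points can be sketched as follows. This is a minimal sketch, assuming grid points are belief vectors and `values[i]` holds f(g_i, u) for one fixed action u. The inverse-distance rule is a stand-in for linear interpolation: its weights are non-negative and sum to one, but it is not claimed to be the scheme used in the slides.

```python
import math


def nearest_neighbor_value(b, grid, values):
    """Value of the grid point closest to belief b (Euclidean distance)."""
    i = min(range(len(grid)), key=lambda g: math.dist(b, grid[g]))
    return values[i]


def inverse_distance_value(b, grid, values, eps=1e-12):
    """Convex combination of grid values with inverse-distance weights."""
    dists = [math.dist(b, g) for g in grid]
    for d, v in zip(dists, values):
        if d < eps:  # b coincides with a grid point
            return v
    weights = [1.0 / d for d in dists]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical grid: the two corners of a two-state belief simplex.
grid = [[1.0, 0.0], [0.0, 1.0]]
values = [0.0, 10.0]
nn = nearest_neighbor_value([0.9, 0.1], grid, values)
mid = inverse_distance_value([0.5, 0.5], grid, values)
```

The midpoint belief is equally far from both corners, so its interpolated value is the average of the two grid values.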

Factored Belief State • Intuition: learn the dependencies between state variables • Ex: if "raining" is true at time t, then "ground is dry" is not very likely to be true at time t+1.

Factored Belief State • We can use a subset of state variables to construct a Bayes network (BN). • Belief-state projections can be searched to find a suitable BN for a specific problem (belief monitoring). This amounts to learning to adjust the belief network parametrized by ϕ. • Factored linear value function: weighted linear combinations of polynomial basis functions.

Simulation and Belief State • Concentrate learning effort on the states that are most likely to be encountered. • In terms of Q-learning, we can simulate a path in the POMDP and perform iterations of the value function on the monitored current belief states. • Not good for POMDPs with more than a few hundred states.

Simulation and Belief State • Learn a Q-function that generalizes to all belief states • Artificial neural networks can also be used to approximate the value function over the full belief state.

Continuous State and Action Spaces • Sampled belief states. • Use particle filters to update the belief state. • The value function is approximated using the average of k nearest neighbors.
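
A particle-filter belief update can be sketched as below. This is a minimal sketch under stated assumptions: the one-dimensional state, the Gaussian motion and sensor models, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def particle_filter_update(particles, action, observation,
                           sample_transition, obs_likelihood, rng):
    """One predict/weight/resample step over a sampled belief state."""
    # Predict: push every particle through the (stochastic) motion model.
    moved = [sample_transition(s, action, rng) for s in particles]
    # Weight: score each particle by how well it explains the observation.
    weights = [obs_likelihood(observation, s) for s in moved]
    if sum(weights) == 0.0:
        return moved  # observation explains no particle; keep the prediction
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def sample_transition(s, a, rng):
    """Assumed motion model: move by a with small Gaussian noise."""
    return s + a + rng.gauss(0.0, 0.1)


def obs_likelihood(o, s):
    """Assumed sensor model: unnormalized Gaussian around the true state."""
    return math.exp(-0.5 * ((o - s) / 0.5) ** 2)


rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
particles = particle_filter_update(particles, 1.0, 4.0,
                                   sample_transition, obs_likelihood, rng)
estimate = sum(particles) / len(particles)
```

After one update the particles cluster around the observed position, so their mean serves as a point estimate of the state.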

Policy search vs. value search • It is simpler to determine how to act than to determine the value of acting. • Approximate value function methods usually produce deterministic policies. • The heuristic methods are approximate projections from belief states to world states. • It is better to introduce randomness into the policy.

Policy search vs. value search • Policy search can be very difficult. • Value search can be better for small POMDPs. • Value search imposes the Bellman equations as constraints.

Policy search • Policy search can be implemented by policy iteration. • Step 1: Evaluate the current policy • Step 2: Improve the policy
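
The two steps can be illustrated on a small, fully observable MDP. This is a minimal sketch, not a POMDP policy search: it assumes tabular transitions `T[s][a][s2]` and rewards `R[s][a]`, uses iterative evaluation with a fixed number of sweeps, and all names and numbers are hypothetical.

```python
def policy_iteration(T, R, gamma=0.9, eval_sweeps=200):
    """Alternate policy evaluation and greedy improvement until stable."""
    n_states, n_actions = len(T), len(T[0])
    policy = [0] * n_states

    def q(s, a, V):
        # One-step lookahead value of taking action a in state s.
        return R[s][a] + gamma * sum(
            T[s][a][s2] * V[s2] for s2 in range(n_states)
        )

    while True:
        # Step 1: evaluate the current policy by repeated Bellman backups.
        V = [0.0] * n_states
        for _ in range(eval_sweeps):
            V = [q(s, policy[s], V) for s in range(n_states)]
        # Step 2: improve the policy greedily with respect to V.
        improved = [
            max(range(n_actions), key=lambda a: q(s, a, V))
            for s in range(n_states)
        ]
        if improved == policy:
            return policy, V
        policy = improved


# Hypothetical MDP: action 0 stays, action 1 switches state;
# only staying in state 1 pays a reward of 1.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
policy, V = policy_iteration(T, R)
```

In this toy problem the loop settles on switching out of state 0 and staying in state 1.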

Recap • Different Heuristics • Projecting a belief state to a world state • Evaluating the values for belief states • Finding good policy
