What Are Partially Observable Markov Decision Processes and Why Might You Care? Bob Wall CS 536
POMDPs • A generalization of the Markov Decision Process (MDP). In an MDP, the environment is fully observable, and with the Markov assumption for the transition model, the optimal policy depends only on the current state. • For POMDPs, the environment is only partially observable
POMDP Implications • Since the current state is not necessarily known, the agent cannot simply execute the optimal policy for that state. • A POMDP is defined by the following (see the data-structure sketch below): • Set of states S, set of actions A, set of observations O • Transition model T(s, a, s’) • Reward model R(s) • Observation model O(s, o) – probability of observing observation o in state s
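As a concrete illustration of these components, the tuple (S, A, O, T, R, O) can be held in a small container like the Python sketch below. This is a hypothetical representation used by the later examples, not code from any POMDP library; the observation model is named Z here only to avoid clashing with the set of observations O.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[str]                      # S
    actions: List[str]                     # A
    observations: List[str]                # O
    T: Dict[Tuple[str, str, str], float]   # T(s, a, s'): transition probabilities
    R: Dict[str, float]                    # R(s): reward for being in state s
    Z: Dict[Tuple[str, str], float]        # observation model: P(o | s'), keyed by (s', o)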
POMDP Implications (cont.) • Optimal action depends not on the current state but on the agent’s current belief state • The belief state b is a probability distribution over all possible states • Given belief state b, if the agent performs action a and perceives observation o, the new belief state is • b’(s’) = α O(s’, o) Σ_s T(s, a, s’) b(s), where α is a normalizing constant (see the sketch below) • Optimal policy π*(b) maps from belief states to actions
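A minimal sketch of this belief update, assuming the POMDP container above and beliefs stored as state-to-probability dictionaries (all names are illustrative only):

def update_belief(pomdp, belief, action, obs):
    # b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)
    new_belief = {}
    for s_next in pomdp.states:
        # Probability mass that flows into s_next from the current belief under `action`
        reach = sum(pomdp.T[(s, action, s_next)] * belief[s] for s in pomdp.states)
        new_belief[s_next] = pomdp.Z[(s_next, obs)] * reach
    # alpha is the normalizing constant that makes the new belief sum to 1
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else new_belief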
POMDP Solutions • Solving a POMDP on a physical state space is equivalent to solving an MDP on the corresponding belief state space • However, the belief state space is continuous and very high-dimensional, so solutions are difficult to compute • Even finding approximately optimal solutions is PSPACE-hard (i.e. really hard)
Why Study POMDPs? • In spite of the difficulties, POMDPs are still very important. • Many real-world problems and situations are not fully observable, but the Markov assumption is often valid. • Active area of research • Google search on “POMDP” returns ~5000 results • A number of current papers on the topic
Some Solution Techniques • Most exact solution algorithms (value iteration, policy iteration) use dynamic programming techniques • These algorithms repeatedly transform one value function over belief space into another; the value function is piecewise linear and convex (PWLC), so it can be represented by a finite set of vectors (see the sketch below) • Dynamic programming algorithms: one-pass (1971), exhaustive (1982), linear support (1988), witness (1996) • A better method – incremental pruning (1996)
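To make the PWLC property concrete: the value function over belief space can be represented by a finite set of "alpha vectors" (one linear function per conditional plan), and evaluating a belief just takes the maximum of their dot products with it. A hypothetical sketch, not a rendering of any specific algorithm named above:

def value_of_belief(alpha_vectors, belief):
    # V(b) = max over alpha vectors of sum_s alpha(s) * b(s)
    # Each alpha vector is a dict mapping states to values; taking the max of
    # linear functions is what makes V piecewise linear and convex (PWLC) in b.
    return max(sum(alpha[s] * belief[s] for s in belief) for alpha in alpha_vectors)

The exact algorithms listed above differ mainly in how they generate and prune this set of alpha vectors at each dynamic programming step.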
POMDPs at Work • Pattern recognition tasks • SA-POMDP (Single-Action POMDP) – the only decision is whether to change state or not • Model constructed to recognize words within text to which noise was added – i.e. individual letters within the words were randomly corrupted • The SA-POMDP outperformed a pattern recognizer based on Hidden Markov Models, and exhibited better immunity to noise
POMDPs at Work (cont.) • Robotics • Mission planning • Robot navigation • POMDP used to control the movement of an autonomous robot within a crowded environment • Used to predict the motion of other objects within the robot’s environment • The state space is decomposed into a hierarchy, so that each individual POMDP has a computationally tractable task
POMDPs at Work (cont.) • BATmobile – the Bayesian Automated Taxi • Its many different tasks make use of a number of AI techniques • POMDPs used for the actual driving control (as opposed to higher-level trip planning) • Uses approximation techniques to keep the computation efficient
BAT (cont.) • Several different techniques combined: • Dynamic Probabilistic Network (DPN) to maintain the current belief state • Dynamic Decision Network (DDN) to perform bounded lookahead (a toy lookahead sketch follows below) • Hand-coded explicit policy representations – e.g. decision trees • Supervised / reinforcement learning techniques to learn policy decisions
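As a rough illustration of bounded lookahead over belief states – a toy stand-in for the DDN machinery, reusing update_belief from the sketch above; the fixed depth, state-only rewards, and lack of discounting are all simplifying assumptions:

def lookahead_value(pomdp, belief, depth):
    # Expected immediate reward under the current belief
    exp_reward = sum(belief[s] * pomdp.R[s] for s in pomdp.states)
    if depth == 0:
        return exp_reward
    best = float("-inf")
    for a in pomdp.actions:
        future = 0.0
        for o in pomdp.observations:
            # P(o | b, a): chance of seeing o after taking a from the current belief
            p_o = sum(pomdp.Z[(s2, o)] * pomdp.T[(s, a, s2)] * belief[s]
                      for s in pomdp.states for s2 in pomdp.states)
            if p_o > 0:
                future += p_o * lookahead_value(pomdp, update_belief(pomdp, belief, a, o), depth - 1)
        best = max(best, future)
    return exp_reward + best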
BAT (cont.) • The BAT has been constructed in a simulation environment and has been demonstrated to successfully handle a variety of driving problems, such as passing slower vehicles, reacting to unsafe drivers, avoiding stalled vehicles, and merging into traffic.
Resources • Tutorial on POMDPs: • http://www.cs.brown.edu/research/ai/pomdp/tutorial/index.html • Additional pointers to articles on my web site: • http://www.cs.montana.edu/~bwall/cs536