POMDPs
• Logistics
  - No class Wed
  - Please attend colloquium Tues 11/25, 3:30
• Outline
Value-Iteration (Recap)
• DP update – a step in value-iteration
• MDP
  - S – finite set of states in the world
  - A – finite set of actions
  - T: SxAxS -> [0,1] (e.g. T(s,a,s') = 0.2)
  - R: SxA -> Real (e.g. R(s,a) = 10)
• Algorithm
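For reference, a minimal sketch of the DP update in code. The array layout (T[s, a, s'], R[s, a]), the function name, and the discount factor gamma are assumptions made for illustration, not something taken from the slides.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """Value iteration for a plain MDP.

    T[s, a, s'] = P(s' | s, a), R[s, a] = immediate reward.
    Returns the converged value function and a greedy policy.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # DP update: Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * np.einsum('sat,t->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new
```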
Heuristic Search
• Dynamic programming is exhaustive
• Can't we use heuristics?
Conformant Planning
• Simplest case of partial observability
• No probabilities, no observations
• Search through the space of …?
Heuristics for Belief State Space
• [Figure: search from the initial belief state to the goal belief state – what heuristic to guide it?]
POMDP
• <S, A, T, R, Ω, O> tuple
  - S, A, T, R as in an MDP
  - Ω – finite set of observations
  - O: SxAxΩ -> [0, 1]
  - Z is an alternate notation for Ω
• Belief state
  - information state
  - b, a probability distribution over S
  - b(s1) = probability that the current state is s1
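One way to hold the tuple in code, continuing the array conventions above (the names and layout are illustrative only). A belief state b is then just a length-|S| probability vector.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    T: np.ndarray   # T[s, a, s'] = P(s' | s, a)
    R: np.ndarray   # R[s, a]     = immediate reward
    O: np.ndarray   # O[s', a, o] = P(o | s', a); o ranges over Omega (aka Z)
    gamma: float = 0.95   # discount factor, assumed -- the slides never give one

    @property
    def n_states(self):
        return self.T.shape[0]

    @property
    def n_actions(self):
        return self.T.shape[1]

    @property
    def n_obs(self):
        return self.O.shape[2]
```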
POMDP - SE
• SE – State Estimator
• Updates the belief state based on the previous belief state, the last action, and the current observation
• SE(b, a, o) = b'
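A minimal sketch of the state estimator as a Bayes-filter update, b'(s') ∝ O(s', a, o) Σ_s T(s, a, s') b(s), assuming the container sketched above:

```python
def state_estimator(pomdp, b, a, o):
    """SE(b, a, o) -> b': Bayes update of the belief state."""
    predicted = b @ pomdp.T[:, a, :]        # sum_s b(s) T(s, a, s')
    unnormalized = pomdp.O[:, a, o] * predicted
    norm = unnormalized.sum()               # = P(o | b, a)
    if norm == 0:
        raise ValueError("Observation o has zero probability under (b, a)")
    return unnormalized / norm
```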
POMDP - Π
• How to generate the policy Π?
• POMDP -> "Belief MDP"
• MDP parameters:
  - S => B, the set of belief states
  - A => same
  - T => τ(b, a, b')
  - R => ρ(b, a)
• Solve with the value-iteration algorithm
Flaws
• Insufficient background/motivation
• Weak evaluation / comparison
• Numeric stability
• Policy graph construction
• Approximate policy vs. exact
Experiments [UAI97]
• [Charts: solution speed and problem size]
Belief-State Transitions & Rewards
• τ(b, a, b')
• ρ(b, a)
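The slides leave τ and ρ implicit; a standard way to write them in terms of the state estimator SE (this is the usual belief-MDP construction, not something specific to these slides) is:

```latex
\rho(b, a) = \sum_{s \in S} b(s)\, R(s, a)

\tau(b, a, b') = \sum_{o \in \Omega} P(o \mid b, a)\,\big[\,SE(b, a, o) = b'\,\big],
\quad\text{where}\quad
P(o \mid b, a) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)
```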
Two Problems
• How to represent the value function over a continuous belief space?
• How to update the value function Vt from Vt-1?
• POMDP -> MDP
  - S => B, set of belief states
  - A => same
  - T => τ(b, a, b')
  - R => ρ(b, a)
Alternate Notation(s)
• x(s) = probability of s in belief x
• Z = set of observations {z}
Running Example
• POMDP with
  - Two states (s1 and s2)
  - Two actions (a1 and a2)
  - Three observations (z1, z2, z3)
• [Figure: the 1D belief space for a 2-state POMDP – the axis is the probability that the state is s1]
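To make the later snippets concrete, the running example can be instantiated with the container sketched earlier. Only the rewards (used on the Horizon 1 slide below) come from the lecture; the transition probabilities, observation probabilities, and discount factor are invented purely for illustration.

```python
import numpy as np

# Two states (s1, s2), two actions (a1, a2), three observations (z1, z2, z3).
# T and O below are made-up numbers -- the slides never specify them.
T = np.zeros((2, 2, 2))
T[:, 0, :] = [[0.9, 0.1],    # a1 mostly keeps the current state
              [0.1, 0.9]]
T[:, 1, :] = [[0.2, 0.8],    # a2 tends to flip it
              [0.8, 0.2]]

O = np.zeros((2, 2, 3))
O[0, :, :] = [0.6, 0.3, 0.1]   # observation distribution in s1 (same for both actions)
O[1, :, :] = [0.1, 0.3, 0.6]   # observation distribution in s2

# Rewards from the Horizon 1 slide: R[s, a]
R = np.array([[1.0, 0.0],      # R(s1, a1) = 1,  R(s1, a2) = 0
              [0.0, 1.5]])     # R(s2, a1) = 0,  R(s2, a2) = 1.5

example = POMDP(T=T, R=R, O=O, gamma=0.95)
```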
First Problem Solved
• Key insight: the value function is piecewise linear & convex (PWLC)
• Convexity makes intuitive sense
  - Middle of belief space – high entropy, can't select actions appropriately, less long-term reward
  - Near corners of the simplex – low entropy, take actions more likely to be appropriate for the current world state, gain more reward
• Each line (hyperplane) is represented with a vector
  - Coefficients of the line (hyperplane)
  - e.g. V(b) = c1 x b(s1) + c2 x (1 - b(s1))
  - To find the value at b, find the vector with the largest dot product with b
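Evaluating a PWLC value function is exactly the "largest dot product" operation above; a minimal sketch:

```python
import numpy as np

def pwlc_value(alpha_vectors, b):
    """Value of belief b under a PWLC value function.

    alpha_vectors: array of shape (n_vectors, n_states), one hyperplane each.
    Returns V(b) = max over vectors of (alpha . b).
    """
    return np.max(alpha_vectors @ b)
```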
Second Problem
• Can't iterate over all belief states (infinitely many) for value-iteration, but…
• Given the vectors representing Vt-1, generate the vectors representing Vt
• Theorem (Sondik 1971) – the finite-horizon value function is piecewise linear and convex, and the infinite-horizon value function can be approximated arbitrarily well by a piecewise linear function
Horizon 1
• No future: the value function consists only of the immediate reward
• e.g.
  - R(s1, a1) = 1, R(s2, a1) = 0
  - R(s1, a2) = 0, R(s2, a2) = 1.5
  - b = <0.25, 0.75>
• Value of doing a1 = 1 x b(s1) + 0 x b(s2) = 1 x 0.25 + 0 x 0.75 = 0.25
• Value of doing a2 = 0 x b(s1) + 1.5 x b(s2) = 0 x 0.25 + 1.5 x 0.75 = 1.125
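The same computation in vector form, continuing the running example and the pwlc_value helper above (the numbers match the slide: 0.25 for a1, 1.125 for a2):

```python
import numpy as np

# Horizon 1: one alpha-vector per action, taken straight from the immediate rewards,
# i.e. alpha_a(s) = R(s, a).
alpha_h1 = R.T                     # shape (n_actions, n_states): [[1, 0], [0, 1.5]]

b = np.array([0.25, 0.75])
print(alpha_h1 @ b)                # [0.25, 1.125] -> a2 is best at this belief
print(pwlc_value(alpha_h1, b))     # 1.125
```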
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
Horizon 2 – Given action & obs
• If in belief state b, what is the best value of doing action a1 and seeing z1?
• Best value = best value of the immediate action + best value of the next action
• Best value of the immediate action = horizon 1 value function
Horizon 2 – Given action & obs
• Assume the immediate action is a1 and the observation is z1
• What's the best action for the b' that results from the initial b when we perform a1 and observe z1?
• Not feasible to do this for every belief state (there are infinitely many)
Horizon 2 – Given action & obs
• Construct a function over the entire (initial) belief space
  - from the horizon 1 value function
  - with the belief transformation built in
Horizon 2 – Given action & obs
• S(a1, z1) corresponds to [Cassandra UAI97]'s S(a, z)
• Built into S():
  - the horizon 1 value function
  - the belief transformation
  - the "weight" of seeing z after performing a
  - the discount factor
  - the immediate reward
• S() is PWLC
• Note: using x to represent the belief state
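Written out (roughly in the notation of the Incremental Pruning paper), each vector α' of the previous value function Vt-1 generates one vector of S(a, z), folding in the immediate reward split over observations, the discount factor γ, and the belief transformation:

```latex
S^{a,z} = \big\{\, \alpha^{a,z} \;:\; \alpha' \in V_{t-1} \,\big\},
\qquad
\alpha^{a,z}(s) = \frac{R(s, a)}{|Z|} \;+\; \gamma \sum_{s' \in S} \alpha'(s')\, O(s', a, z)\, T(s, a, s')
```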
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
Horizon 2 – Given action
• What is the horizon 2 value of a belief state, given that the immediate action is a1?
  - Horizon 2: do action a1
  - Horizon 1: do action …?
Horizon 2 – Given action
• What's the best strategy at b?
• How to compute the line (vector) representing the best strategy at b?
• How many strategies are there in the figure?
• What's the maximum number of strategies (after taking immediate action a1)?
Horizon 2 – Given action
• How can we represent the 4 regions (strategies) as a value function?
• Note: each region is a strategy
Horizon 2 – Given action
• Sum up the vectors representing each region
• A sum of vectors is itself a vector (add lines, get lines)
• Corresponds to the paper's transformation
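A sketch of that summation as a cross-sum over the per-observation sets: pick one vector from each S(a, z) and add them, for every combination. The |Vt-1|^|Z| combinations per action are what make the later pruning step necessary. The name S_az below is a hypothetical container for the per-observation vector sets.

```python
import numpy as np
from itertools import product

def cross_sum(vector_sets):
    """Cross-sum of several sets of alpha-vectors.

    vector_sets: list of arrays, each of shape (n_i, n_states).
    Returns every possible sum of one vector per set -- one candidate
    strategy ("which future vector to follow after each observation").
    """
    return np.array([sum(combo) for combo in product(*vector_sets)])

# Horizon-2 value of committing to action a (before choosing the action):
#   S_a = cross_sum([S_az[z] for z in range(n_obs)])
```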
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
Horizon 2
• The full horizon 2 value function: a1 U a2 (union of the vector sets for each action)
Horizon 2
• This tells you how to act!
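Acting from the value function: tag each vector with the immediate action that generated it; the maximizing vector at the current belief then gives both the value and the action. A minimal sketch:

```python
import numpy as np

def best_action(alpha_vectors, actions, b):
    """Greedy action at belief b.

    alpha_vectors: array (n_vectors, n_states); actions[i] is the immediate
    action that produced alpha_vectors[i].  The maximizing vector at b
    gives V(b) and tells you how to act.
    """
    best = int(np.argmax(alpha_vectors @ b))
    return actions[best], float(alpha_vectors[best] @ b)
```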
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
• Use the horizon 2 value function to update horizon 3's, and so on…
The Hard Step
• Easy to visually inspect the figure to obtain the different regions
• But in a higher-dimensional belief space, with many actions and observations… a hard problem
Naïve way – Enumerate
• How does Incremental Pruning do it?
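For contrast with naïve enumeration, here is a hedged sketch of the basic dominance test that pruning methods build on: keep a vector only if there exists a belief where it beats all the others. This is the "one LP per vector" idea mentioned on the Future Work slide, not the incremental-pruning algorithm itself, and it ignores ties/duplicate vectors that a real implementation has to handle.

```python
import numpy as np
from scipy.optimize import linprog

def dominates_somewhere(alpha, kept, tol=1e-9):
    """Is there a belief b where `alpha` beats every vector in `kept`?

    LP over variables (b(s_1)..b(s_n), delta): maximize delta subject to
    b . (alpha - alpha') >= delta for all alpha' in kept, b a distribution.
    """
    if len(kept) == 0:
        return True
    n = len(alpha)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # maximize delta
    A_ub = np.hstack([-(alpha - np.asarray(kept)),  # -b.(alpha - alpha') + delta <= 0
                      np.ones((len(kept), 1))])
    b_ub = np.zeros(len(kept))
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # sum_s b(s) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)], method='highs')
    return res.success and -res.fun > tol

def prune(vectors):
    """Simple filter: keep only vectors that are maximal at some belief."""
    kept_idx = []
    for i in range(len(vectors)):
        others = [vectors[j] for j in range(len(vectors)) if j != i]
        if dominates_somewhere(vectors[i], others):
            kept_idx.append(i)
    return vectors[kept_idx]
```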
Future Work
• Scaling up
  - One LP solved per vector
  - Start with |S| vectors; may get 2^|S| vectors; and |S| = 2^|A|
• Scaling up
  - Policy iteration?
  - Factoring, ADDs, reachability?
  - Search through the space of policies
  - Monte Carlo methods
Variants – Belief State MDP
• Exact V, exact b
• Approximate V, exact b
  - Discretize b into a grid and interpolate
• Exact V, approximate b
  - Use particle filters to sample b
  - Track an approximate belief state using a DBN
• Approximate V, approximate b
  - Combine the previous two
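A rough sketch of the "exact V, approximate b" idea with particle filtering, continuing the earlier container (purely illustrative):

```python
import numpy as np

def particle_filter_update(pomdp, particles, a, o, rng=None):
    """Approximate SE(b, a, o) with a fixed-size set of sampled states.

    particles: 1-D int array of states whose histogram approximates b.
    Propagate each particle through T, weight by the observation
    likelihood O(s', a, o), and resample.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Propagate: sample s' ~ T(s, a, .) for each particle.
    next_states = np.array([rng.choice(pomdp.n_states, p=pomdp.T[s, a])
                            for s in particles])
    # Weight by observation likelihood and resample.
    weights = pomdp.O[next_states, a, o]
    if weights.sum() == 0:
        raise ValueError("Observation o is inconsistent with every particle")
    weights = weights / weights.sum()
    return rng.choice(next_states, size=len(particles), p=weights)
```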