POMDPs
• Logistics
  - No class Wed
  - Please attend colloquium Tues 11/25, 3:30
• Outline
Value-Iteration (Recap)
• DP update – a step in value-iteration
• MDP
  - S – finite set of states in the world
  - A – finite set of actions
  - T: SxAxS -> [0,1] (e.g. T(s,a,s') = 0.2)
  - R: SxA -> Real (e.g. R(s,a) = 10)
• Algorithm
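For reference, a minimal sketch of the DP update in code. The array layout (T[s, a, s'], R[s, a]), the function name, and the discount factor gamma are assumptions made for illustration, not something taken from the slides.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """Value iteration for a plain MDP.

    T[s, a, s'] = P(s' | s, a), R[s, a] = immediate reward.
    Returns the converged value function and a greedy policy.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # DP update: Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * np.einsum('sat,t->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new
```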
Heuristic Search
• Dynamic programming is exhaustive
• Can't we use heuristics?
Conformant Planning
• Simplest case of partial observability
• No probabilities, no observations
• Search through the space of …?
Heuristics for Belief State Space
• [Figure: search from the initial belief state to the goal belief state – what heuristic to guide it?]
POMDP
• <S, A, T, R, Ω, O> tuple
  - S, A, T, R as in an MDP
  - Ω – finite set of observations
  - O: SxAxΩ -> [0, 1]
  - Z is an alternate notation for Ω
• Belief state
  - information state
  - b, a probability distribution over S
  - b(s1) = probability that the current state is s1
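One way to hold the tuple in code, continuing the array conventions above (the names and layout are illustrative only). A belief state b is then just a length-|S| probability vector.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    T: np.ndarray   # T[s, a, s'] = P(s' | s, a)
    R: np.ndarray   # R[s, a]     = immediate reward
    O: np.ndarray   # O[s', a, o] = P(o | s', a); o ranges over Omega (aka Z)
    gamma: float = 0.95   # discount factor, assumed -- the slides never give one

    @property
    def n_states(self):
        return self.T.shape[0]

    @property
    def n_actions(self):
        return self.T.shape[1]

    @property
    def n_obs(self):
        return self.O.shape[2]
```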
POMDP - SE
• SE – State Estimator
• Updates the belief state based on the previous belief state, the last action, and the current observation
• SE(b, a, o) = b'
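A minimal sketch of the state estimator as a Bayes-filter update, b'(s') ∝ O(s', a, o) Σ_s T(s, a, s') b(s), assuming the container sketched above:

```python
def state_estimator(pomdp, b, a, o):
    """SE(b, a, o) -> b': Bayes update of the belief state."""
    predicted = b @ pomdp.T[:, a, :]        # sum_s b(s) T(s, a, s')
    unnormalized = pomdp.O[:, a, o] * predicted
    norm = unnormalized.sum()               # = P(o | b, a)
    if norm == 0:
        raise ValueError("Observation o has zero probability under (b, a)")
    return unnormalized / norm
```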
POMDP - Π
• How to generate the policy Π?
• POMDP -> "Belief MDP"
• MDP parameters:
  - S => B, the set of belief states
  - A => same
  - T => τ(b, a, b')
  - R => ρ(b, a)
• Solve with the value-iteration algorithm
Flaws
• Insufficient background/motivation
• Weak evaluation / comparison
• Numeric stability
• Policy graph construction
• Approximate policy vs. exact
Experiments [UAI97]
• [Charts: solution speed and problem size]
Belief-State Transitions & Rewards
• τ(b, a, b')
• ρ(b, a)
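The slides leave τ and ρ implicit; a standard way to write them in terms of the state estimator SE (this is the usual belief-MDP construction, not something specific to these slides) is:

```latex
\rho(b, a) = \sum_{s \in S} b(s)\, R(s, a)

\tau(b, a, b') = \sum_{o \in \Omega} P(o \mid b, a)\,\big[\,SE(b, a, o) = b'\,\big],
\quad\text{where}\quad
P(o \mid b, a) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)
```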
Two Problems
• How to represent the value function over a continuous belief space?
• How to update the value function Vt from Vt-1?
• POMDP -> MDP
  - S => B, set of belief states
  - A => same
  - T => τ(b, a, b')
  - R => ρ(b, a)
Alternate Notation(s)
• x(s) = probability of s in belief x
• Z = set of observations {z}
Running Example
• POMDP with
  - Two states (s1 and s2)
  - Two actions (a1 and a2)
  - Three observations (z1, z2, z3)
• [Figure: the 1D belief space for a 2-state POMDP – the axis is the probability that the state is s1]
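To make the later snippets concrete, the running example can be instantiated with the container sketched earlier. Only the rewards (used on the Horizon 1 slide below) come from the lecture; the transition probabilities, observation probabilities, and discount factor are invented purely for illustration.

```python
import numpy as np

# Two states (s1, s2), two actions (a1, a2), three observations (z1, z2, z3).
# T and O below are made-up numbers -- the slides never specify them.
T = np.zeros((2, 2, 2))
T[:, 0, :] = [[0.9, 0.1],    # a1 mostly keeps the current state
              [0.1, 0.9]]
T[:, 1, :] = [[0.2, 0.8],    # a2 tends to flip it
              [0.8, 0.2]]

O = np.zeros((2, 2, 3))
O[0, :, :] = [0.6, 0.3, 0.1]   # observation distribution in s1 (same for both actions)
O[1, :, :] = [0.1, 0.3, 0.6]   # observation distribution in s2

# Rewards from the Horizon 1 slide: R[s, a]
R = np.array([[1.0, 0.0],      # R(s1, a1) = 1,  R(s1, a2) = 0
              [0.0, 1.5]])     # R(s2, a1) = 0,  R(s2, a2) = 1.5

example = POMDP(T=T, R=R, O=O, gamma=0.95)
```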
First Problem Solved
• Key insight: the value function is piecewise linear & convex (PWLC)
• Convexity makes intuitive sense
  - Middle of belief space – high entropy, can't select actions appropriately, less long-term reward
  - Near corners of the simplex – low entropy, take actions more likely to be appropriate for the current world state, gain more reward
• Each line (hyperplane) is represented with a vector
  - Coefficients of the line (hyperplane)
  - e.g. V(b) = c1 x b(s1) + c2 x (1 - b(s1))
  - To find the value at b, find the vector with the largest dot product with b
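Evaluating a PWLC value function is exactly the "largest dot product" operation above; a minimal sketch:

```python
import numpy as np

def pwlc_value(alpha_vectors, b):
    """Value of belief b under a PWLC value function.

    alpha_vectors: array of shape (n_vectors, n_states), one hyperplane each.
    Returns V(b) = max over vectors of (alpha . b).
    """
    return np.max(alpha_vectors @ b)
```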
Second Problem
• Can't iterate over all belief states (infinitely many) for value-iteration, but…
• Given the vectors representing Vt-1, generate the vectors representing Vt
• Theorem (Sondik 1971) – the finite-horizon value function is piecewise linear and convex, and the infinite-horizon value function can be approximated arbitrarily well by a piecewise linear function
Horizon 1
• No future: the value function consists only of the immediate reward
• e.g.
  - R(s1, a1) = 1, R(s2, a1) = 0
  - R(s1, a2) = 0, R(s2, a2) = 1.5
  - b = <0.25, 0.75>
• Value of doing a1 = 1 x b(s1) + 0 x b(s2) = 1 x 0.25 + 0 x 0.75 = 0.25
• Value of doing a2 = 0 x b(s1) + 1.5 x b(s2) = 0 x 0.25 + 1.5 x 0.75 = 1.125
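The same computation in vector form, continuing the running example and the pwlc_value helper above (the numbers match the slide: 0.25 for a1, 1.125 for a2):

```python
import numpy as np

# Horizon 1: one alpha-vector per action, taken straight from the immediate rewards,
# i.e. alpha_a(s) = R(s, a).
alpha_h1 = R.T                     # shape (n_actions, n_states): [[1, 0], [0, 1.5]]

b = np.array([0.25, 0.75])
print(alpha_h1 @ b)                # [0.25, 1.125] -> a2 is best at this belief
print(pwlc_value(alpha_h1, b))     # 1.125
```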
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
Horizon 2 – Given action & obs
• If in belief state b, what is the best value of doing action a1 and seeing z1?
• Best value = best value of the immediate action + best value of the next action
• Best value of the immediate action = horizon 1 value function
Horizon 2 – Given action & obs
• Assume the immediate action is a1 and the observation is z1
• What's the best action for the b' that results from the initial b when we perform a1 and observe z1?
• Not feasible to do this for every belief state (there are infinitely many)
Horizon 2 – Given action & obs
• Construct a function over the entire (initial) belief space
  - from the horizon 1 value function
  - with the belief transformation built in
Horizon 2 – Given action & obs
• S(a1, z1) corresponds to [Cassandra UAI97]'s S(a, z)
• Built into S():
  - the horizon 1 value function
  - the belief transformation
  - the "weight" of seeing z after performing a
  - the discount factor
  - the immediate reward
• S() is PWLC
• Note: using x to represent the belief state
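Written out (roughly in the notation of the Incremental Pruning paper), each vector α' of the previous value function Vt-1 generates one vector of S(a, z), folding in the immediate reward split over observations, the discount factor γ, and the belief transformation:

```latex
S^{a,z} = \big\{\, \alpha^{a,z} \;:\; \alpha' \in V_{t-1} \,\big\},
\qquad
\alpha^{a,z}(s) = \frac{R(s, a)}{|Z|} \;+\; \gamma \sum_{s' \in S} \alpha'(s')\, O(s', a, z)\, T(s, a, s')
```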
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
Horizon 2 – Given action
• What is the horizon 2 value of a belief state, given that the immediate action is a1?
  - Horizon 2: do action a1
  - Horizon 1: do action …?
Horizon 2 – Given action
• What's the best strategy at b?
• How to compute the line (vector) representing the best strategy at b?
• How many strategies are there in the figure?
• What's the maximum number of strategies (after taking immediate action a1)?
Horizon 2 – Given action
• How can we represent the 4 regions (strategies) as a value function?
• Note: each region is a strategy
Horizon 2 – Given action
• Sum up the vectors representing each region
• A sum of vectors is itself a vector (add lines, get lines)
• Corresponds to the paper's transformation
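A sketch of that summation as a cross-sum over the per-observation sets: pick one vector from each S(a, z) and add them, for every combination. The |Vt-1|^|Z| combinations per action are what make the later pruning step necessary. The name S_az below is a hypothetical container for the per-observation vector sets.

```python
import numpy as np
from itertools import product

def cross_sum(vector_sets):
    """Cross-sum of several sets of alpha-vectors.

    vector_sets: list of arrays, each of shape (n_i, n_states).
    Returns every possible sum of one vector per set -- one candidate
    strategy ("which future vector to follow after each observation").
    """
    return np.array([sum(combo) for combo in product(*vector_sets)])

# Horizon-2 value of committing to action a (before choosing the action):
#   S_a = cross_sum([S_az[z] for z in range(n_obs)])
```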
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
Horizon 2
• The full horizon 2 value function: a1 U a2 (union of the vector sets for each action)
Horizon 2
• This tells you how to act!
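Acting from the value function: tag each vector with the immediate action that generated it; the maximizing vector at the current belief then gives both the value and the action. A minimal sketch:

```python
import numpy as np

def best_action(alpha_vectors, actions, b):
    """Greedy action at belief b.

    alpha_vectors: array (n_vectors, n_states); actions[i] is the immediate
    action that produced alpha_vectors[i].  The maximizing vector at b
    gives V(b) and tells you how to act.
    """
    best = int(np.argmax(alpha_vectors @ b))
    return actions[best], float(alpha_vectors[best] @ b)
```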
Second Problem
• Break the problem down into 3 steps
  - Compute the value of a belief state given an action and an observation
  - Compute the value of a belief state given an action
  - Compute the value of a belief state
• Use the horizon 2 value function to update horizon 3's, and so on…
The Hard Step
• Easy to visually inspect the figure to obtain the different regions
• But in a higher-dimensional belief space, with many actions and observations… a hard problem
Naïve way – Enumerate
• How does Incremental Pruning do it?
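For contrast with naïve enumeration, here is a hedged sketch of the basic dominance test that pruning methods build on: keep a vector only if there exists a belief where it beats all the others. This is the "one LP per vector" idea mentioned on the Future Work slide, not the incremental-pruning algorithm itself, and it ignores ties/duplicate vectors that a real implementation has to handle.

```python
import numpy as np
from scipy.optimize import linprog

def dominates_somewhere(alpha, kept, tol=1e-9):
    """Is there a belief b where `alpha` beats every vector in `kept`?

    LP over variables (b(s_1)..b(s_n), delta): maximize delta subject to
    b . (alpha - alpha') >= delta for all alpha' in kept, b a distribution.
    """
    if len(kept) == 0:
        return True
    n = len(alpha)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # maximize delta
    A_ub = np.hstack([-(alpha - np.asarray(kept)),  # -b.(alpha - alpha') + delta <= 0
                      np.ones((len(kept), 1))])
    b_ub = np.zeros(len(kept))
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # sum_s b(s) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)], method='highs')
    return res.success and -res.fun > tol

def prune(vectors):
    """Simple filter: keep only vectors that are maximal at some belief."""
    kept_idx = []
    for i in range(len(vectors)):
        others = [vectors[j] for j in range(len(vectors)) if j != i]
        if dominates_somewhere(vectors[i], others):
            kept_idx.append(i)
    return vectors[kept_idx]
```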
Future Work
• Scaling up
  - One LP solved per vector
  - Start with |S| vectors; may get 2^|S| vectors; and |S| = 2^|A|
• Scaling up
  - Policy iteration?
  - Factoring, ADDs, reachability?
  - Search through the space of policies
  - Monte Carlo methods
Variants – Belief State MDP
• Exact V, exact b
• Approximate V, exact b
  - Discretize b into a grid and interpolate
• Exact V, approximate b
  - Use particle filters to sample b
  - Track an approximate belief state using a DBN
• Approximate V, approximate b
  - Combine the previous two
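A rough sketch of the "exact V, approximate b" idea with particle filtering, continuing the earlier container (purely illustrative):

```python
import numpy as np

def particle_filter_update(pomdp, particles, a, o, rng=None):
    """Approximate SE(b, a, o) with a fixed-size set of sampled states.

    particles: 1-D int array of states whose histogram approximates b.
    Propagate each particle through T, weight by the observation
    likelihood O(s', a, o), and resample.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Propagate: sample s' ~ T(s, a, .) for each particle.
    next_states = np.array([rng.choice(pomdp.n_states, p=pomdp.T[s, a])
                            for s in particles])
    # Weight by observation likelihood and resample.
    weights = pomdp.O[next_states, a, o]
    if weights.sum() == 0:
        raise ValueError("Observation o is inconsistent with every particle")
    weights = weights / weights.sum()
    return rng.choice(next_states, size=len(particles), p=weights)
```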