Incremental Pruning CSE 574 May 9, 2003 Stanley Kok
Value-Iteration (Recap) • DP update – a step in value-iteration • MDP • S – finite set of states in the world • A – finite set of actions • T: S×A -> Π(S) (e.g. T(s,a,s’) = 0.2) • R: S×A -> ℝ (e.g. R(s,a) = 10) • Algorithm (see the sketch below)
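A minimal sketch of the DP update above, assuming T and R are stored as numpy arrays indexed [s, a, s'] and [s, a]; the array names and shapes are illustrative, not from the paper.

```python
import numpy as np

def dp_update(V, T, R, gamma=0.95):
    """One DP (value-iteration) update for an MDP.

    V : shape (|S|,)          current value estimates
    T : shape (|S|,|A|,|S|)   T[s, a, s'] = P(s' | s, a)
    R : shape (|S|,|A|)       immediate rewards R(s, a)
    """
    # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
    Q = R + gamma * np.einsum('sax,x->sa', T, V)
    return Q.max(axis=1)   # V_t(s) = max_a Q[s, a]
```

Iterating dp_update until the values stop changing gives the standard value-iteration algorithm.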
POMDP • <S, A, T, R, Ω, O> tuple • S, A, T, R of MDP • Ω – finite set of observations • O:SxA-> Π(Ω) • Belief state • - information state • – b, probability distribution over S • - b(s1)
POMDP - SE • SE – State Estimator • updates the belief state based on the previous belief state, the last action, and the current observation • SE(b,a,o) = b’
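A sketch of SE(b, a, o) as a Bayes update, assuming T[s, a, s'] = P(s'|s, a) and O[s', a, o] = P(o|s', a) are numpy arrays (these names are mine, not the slide's).

```python
import numpy as np

def state_estimator(b, a, o, T, O):
    """SE(b, a, o) -> b': update the belief after acting and observing."""
    predicted = b @ T[:, a, :]                 # P(s' | b, a) = sum_s b(s) T(s, a, s')
    unnormalized = O[:, a, o] * predicted      # weight by observation likelihood
    return unnormalized / unnormalized.sum()   # divide by P(o | b, a)
```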
POMDP - Π • Focus on the Π (policy) component • POMDP -> “Belief MDP” • MDP parameters: • S => B, set of belief states • A => same • T => τ(b,a,b’) • R => ρ(b, a) • Solve with the value-iteration algorithm
POMDP - Π • τ(b,a,b’) = sum of P(o | b, a) over the observations o with SE(b,a,o) = b’, where P(o | b, a) = Σs’ O(s’,a,o) Σs T(s,a,s’) b(s) • ρ(b, a) = Σs b(s) R(s,a) • (see the sketch below)
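As a sketch, the two belief-MDP quantities can be computed directly from the definitions above, reusing the state_estimator sketch; the helper names and array conventions are mine.

```python
import numpy as np

def rho(b, a, R):
    """rho(b, a) = sum_s b(s) R(s, a): expected immediate reward in belief b."""
    return float(b @ R[:, a])

def tau(b, a, b_next, T, O):
    """tau(b, a, b') = sum of P(o | b, a) over observations o
    whose updated belief SE(b, a, o) equals b'."""
    predicted = b @ T[:, a, :]                    # P(s' | b, a)
    total = 0.0
    for o in range(O.shape[2]):
        p_o = float(O[:, a, o] @ predicted)       # P(o | b, a)
        if p_o > 0 and np.allclose(state_estimator(b, a, o, T, O), b_next):
            total += p_o
    return total
```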
Two Problems • How to represent the value function over a continuous belief space? • How to update value function Vt from Vt-1? • POMDP -> MDP S => B, set of belief states A => same T => τ(b,a,b’) R => ρ(b, a)
Running Example • POMDP with • Two states (s1 and s2) • Two actions (a1 and a2) • Three observations (z1, z2, z3) • [Figure: 1D belief space for a 2-state POMDP; the axis is the probability that the state is s1]
First Problem Solved • Key insight: the value function is • piecewise linear & convex (PWLC) • Convexity makes intuitive sense • Middle of belief space – high entropy, can’t select actions appropriately, less long-term reward • Near corners of simplex – low entropy, take actions more likely to be appropriate for the current world state, gain more reward • Each line (hyperplane) is represented by a vector • Coefficients of the line (hyperplane) • e.g. V(b) = c1 x b(s1) + c2 x (1-b(s1)) • To find the value at b, find the vector with the largest dot product with b
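A small sketch of the PWLC representation: the value function is just a set of vectors (one per line/hyperplane), and evaluating it at a belief is a max over dot products.

```python
import numpy as np

def pwlc_value(b, vectors):
    """Value of belief b under a PWLC value function given as rows of alpha-vectors.
    Returns the value and the index of the maximizing vector (its region/strategy)."""
    dots = vectors @ b
    best = int(np.argmax(dots))
    return float(dots[best]), best
```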
Second Problem • Can’t iterate over all belief states (infinite) for value-iteration but… • Given vectors representing Vt-1, generate vectors representing Vt
Horizon 1 • No future • Value function consists only of immediate reward • e.g. • R(s1, a1) = 0, R(s2, a1) = 1.5, • R(s1, a2) = 1, R(s2, a2) = 0 • b = <0.25, 0.75> • Value of doing a1 • = 0 x b(s1) + 1.5 x b(s2) • = 0 x 0.25 + 1.5 x 0.75 = 1.125 • Value of doing a2 • = 1 x b(s1) + 0 x b(s2) • = 1 x 0.25 + 0 x 0.75 = 0.25
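For the horizon-1 example above, the alpha-vectors are just the reward columns, so the pwlc_value sketch from earlier reproduces the numbers:

```python
import numpy as np

alphas = np.array([[0.0, 1.5],    # action a1: (R(s1,a1), R(s2,a1))
                   [1.0, 0.0]])   # action a2: (R(s1,a2), R(s2,a2))
b = np.array([0.25, 0.75])
print(pwlc_value(b, alphas))      # (1.125, 0) -> a1 is best at this belief
```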
Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state
Horizon 2 – Given action & obs • If in belief state b,what is the best value of • doing action a1 and seeing z1? • Best value = best value of immediate action + best value of next action • Best value of immediate action = horizon 1 value function
Horizon 2 – Given action & obs • Assume best immediate action is a1 and obs is z1 • What’s the best action for b’ that results from initial b when perform a1 and observe z1? • Not feasible – do this for all belief states (infinite)
Horizon 2 – Given action & obs • Construct function over entire (initial) belief space • from horizon 1 value function • with belief transformation built in
Horizon 2 – Given action & obs • S(a1, z1) corresponds to paper’s • S() built in: • - horizon 1 value function • - belief transformation • - “Weight” of seeing z after performing a • - Discount factor • - Immediate Reward • S() PWLC
Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state
Horizon 2 – Given action • What is the horizon 2 value of a belief state given immediate action is a1? • Horizon 2, do action a1 • Horizon 1, do action…?
Horizon 2 – Given action • What’s the best strategy at b? • How to compute line (vector) representing best strategy at b? (easy) • How many strategies are there in figure? • What’s the max number of strategies (after taking immediate action a1)?
Horizon 2 – Given action • How can we represent the 4 regions (strategies) as a value function? • Note: each region is a strategy
Horizon 2 – Given action • Sum up the vectors representing each region • A sum of vectors is a vector (add lines, get lines) • Corresponds to the paper’s cross-sum transformation (see the sketch below)
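"Sum up the vectors" is the cross-sum of the per-observation sets: pick one vector from each S(a, z) and add them. A sketch, with names of my own choosing:

```python
import numpy as np
from itertools import product

def cross_sum(*vector_sets):
    """Cross-sum: every way of choosing one vector per observation, summed.
    With |S(a,z)| vectors per observation the result has prod |S(a,z)| vectors,
    which is exactly the blow-up that purging must control."""
    return np.array([np.sum(choice, axis=0)
                     for choice in product(*[list(vs) for vs in vector_sets])])
```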
Horizon 2 – Given action • What does each region represent? • Why is this step hard (alluded to in paper)?
Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state
Horizon 2 • a1 ∪ a2 – combine the value functions for the two actions and keep the upper surface
Horizon 2 • This tells you how to act!
Second Problem • Break problem down into 3 steps • -Compute value of belief state given action and observation • -Compute value of belief state given action • -Compute value of belief state • Use horizon 2 value function to update horizon 3’s ...
The Hard Step • Easy to inspect visually to obtain the different regions • But in higher-dimensional spaces, with many actions and observations… it is a hard problem
Naïve way - Enumerate • How does Incremental Pruning do it?
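The purge (filter) step can be sketched with one linear program per candidate vector; this is a simplified version of Lark's filtering using scipy's linprog, with all names and tolerances my own.

```python
import numpy as np
from scipy.optimize import linprog

def dominate(w, U, eps=1e-9):
    """Return a belief b at which vector w beats every vector in the list U
    (a 'witness'), or None if no such belief exists.

    LP: maximize d  s.t.  b.(w - u) >= d for all u in U,  sum(b) = 1,  b >= 0.
    """
    n = len(w)
    if not U:
        return np.full(n, 1.0 / n)
    c = np.zeros(n + 1); c[-1] = -1.0                          # minimize -d
    A_ub = np.hstack([np.array(U) - w, np.ones((len(U), 1))])  # b.(u - w) + d <= 0
    b_ub = np.zeros(len(U))
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])      # sum(b) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)])
    if res.success and -res.fun > eps:                         # d > 0: genuine witness
        return res.x[:n]
    return None

def purge(vectors):
    """Keep only vectors that are best at some belief (Lark's filtering,
    ignoring lexicographic tie-breaking for simplicity)."""
    F = [np.asarray(v, dtype=float) for v in vectors]
    D = []
    while F:
        b = dominate(F[-1], D)
        if b is None:
            F.pop()                                            # never the best anywhere
        else:
            best = max(range(len(F)), key=lambda i: float(b @ F[i]))
            D.append(F.pop(best))
    return np.array(D)
```

The naïve enumeration is then purge(cross_sum(S(a,z1), S(a,z2), S(a,z3))) for each action, followed by one more purge over the union across actions.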
Incremental Pruning • How does IP improve on the naïve method? • Will IP ever do worse than the naïve method? • [Figure: combinations vs. purge/filter]
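Incremental Pruning's main trick, sketched in terms of the cross_sum and purge helpers above: purge after every pairwise cross-sum instead of only once at the end, so the intermediate sets stay small.

```python
def incremental_cross_sum(S_az_list):
    """Build the (purged) horizon-t vector set for one action:
    purge(... purge(purge(S(a,z1) (+) S(a,z2)) (+) S(a,z3)) ...)."""
    acc = purge(S_az_list[0])
    for S_z in S_az_list[1:]:
        acc = purge(cross_sum(acc, S_z))
    return acc
```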
Incremental Pruning • What other novel idea(s) are in IP? • RR (restricted region): come up with a smaller set D as the argument to Dominate() • RR uses more linear programs but fewer constraints in the worst case • Empirically, the reduction in constraints saves more time than the extra linear programs cost
Incremental Pruning • Why are the terms after the union (∪) needed? • What other novel idea(s) are in IP? • RR: come up with a smaller set D as the argument to Dominate()
Identifying Witness • Witness Thm: • - Let Ua be a set of vectors representing the value function • - Let u be in Ua (e.g. u = αz1,a2 + αz2,a1 + αz3,a1) • - If there is a vector v which differs from u in exactly one observation (e.g. v = αz1,a1 + αz2,a1 + αz3,a1) and • there is a belief b such that b.v > b.u, • - then Ua is not equal to the true value function
Witness Algorithm • Randomly choose a belief state b • Compute the vector representing the best value at b (easy) • Add the vector to the agenda • While the agenda is not empty • Get vector Vtop from the top of the agenda • b’ = Dominate(Vtop, Ua) • If b’ is not null (there is a witness), • compute the vector u for the best value at b’ and add it to Ua • compute all vectors v that differ from u at one observation and add them to the agenda
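A sketch of the loop above for a single action, reusing dominate() and the S(a, z) sets from the earlier sketches; bookkeeping such as duplicate agenda entries is ignored, so treat it as illustrative only.

```python
import numpy as np

def best_vector_at(b, S_az):
    """Best vector at belief b: for each observation pick the transformed
    vector with the largest dot product with b, then sum the choices."""
    idx = [int(np.argmax(vs @ b)) for vs in S_az]
    return sum(S_az[z][i] for z, i in enumerate(idx)), idx

def witness(S_az, num_states):
    b0 = np.full(num_states, 1.0 / num_states)   # any starting belief works
    Ua, agenda = [], [best_vector_at(b0, S_az)[0]]
    while agenda:
        v_top = agenda.pop()
        b = dominate(v_top, Ua)                  # witness belief, or None
        if b is None:
            continue
        u, idx = best_vector_at(b, S_az)
        Ua.append(u)
        for z, vs in enumerate(S_az):            # neighbours: change one observation's choice
            for i in range(len(vs)):
                if i != idx[z]:
                    agenda.append(u - vs[idx[z]] + vs[i])
    return np.array(Ua)
```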
Linear Support • If the value function is incorrect, the biggest difference is at the edges/corners of the regions (by convexity)
Experiments • Comments???
Important Ideas • Purge()
Flaws • Insufficient background/motivation
Future Research • Better best-case/worst-case analyses • Precision parameter ε
Variants • Reactive Policy • - st = zt • - π(z) = a • - branch & bound search • - gradient ascent search • - perceptual aliasing problem • Finite History Window • - π(z1…zk) = a • - Suffix tree to represent observation histories, with actions at the leaves • Recurrent Neural Nets • - use neural nets to maintain some state (so information about the past is not forgotten)
Variants – Belief State MDP • Exact V, exact b • Approximate V, exact b • - Discretize b into a grid and interpolate • Exact V, approximate b • - Use particle filters to sample b • - track the approximate belief state using a DBN • Approximate V, approximate b • - combine the previous two
Variants - Pegasus • Policy Evaluation of Goodness And Search Using Scenarios • Convert POMDP to another POMDP with deterministic state transitions • Search for policy of transformed POMDP with highest estimated value