Approximate POMDPs using Point-based Value Iteration Ryan Luna 21 March 2013
The More You Know G. Shani, J. Pineau and R. Kaplow. A Survey of Point-based POMDP Solvers. Autonomous Agents and Multi-agent Systems. 2012. T. Smith and R. Simmons. Heuristic Search Value Iteration for POMDPs. Uncertainty in Artificial Intelligence. 2004. J. Pineau, G. Gordon and S. Thrun. Point-based Value Iteration: An anytime algorithm for POMDPs. Int'l Joint Conference on Artificial Intelligence. 2003. S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press. 2005.
POMDP • Solving a POMDP is very similar to solving an MDP • The similarities: • State transitions are still stochastic • Value function is still a function of our current "state" • We still perform Bellman backups to compute V • The differences: • Our "state" is now a probability distribution over where we might be (a belief) • We receive (stochastic) observations of the underlying state, which update that belief
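For concreteness, here is a minimal sketch (Python, with my own dictionary conventions, not anything from the slides) of the pieces a finite POMDP solver manipulates; the later sketches reuse these conventions.

```python
# A minimal sketch of a finite POMDP (S, A, Omega, T, O, R, gamma).
# The dictionary conventions are illustrative assumptions, not the slides' code.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[str]
    actions: List[str]
    observations: List[str]
    T: Dict[Tuple[str, str, str], float]   # T[(s, a, s')] = p(s' | s, a)
    O: Dict[Tuple[str, str, str], float]   # O[(a, s', z)] = p(z | a, s')
    R: Dict[Tuple[str, str], float]        # R[(s, a)]     = r(s, a)
    gamma: float = 0.95

# A belief is just a distribution over states, e.g. {"x1": 0.5, "x2": 0.5}.
```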
Let's solve a POMDP! [Figure: a tiny two-state example with states x1 and x2, actions u1, u2, u3, measurements, and payoffs]
You Said We Were Solving A POMDP • Fine. We sense z1. Now what? • We have gained information. Update our value function! p(z1 | x1) = 0.7, p(z1 | x2) = 0.3
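A hedged sketch of that belief update: plain Bayes' rule with the sensor model from the slide (p(z1 | x1) = 0.7, p(z1 | x2) = 0.3); the function and variable names are mine.

```python
# Belief update after sensing z1: b'(x) is proportional to p(z1 | x) * b(x).
def belief_update_obs(belief, p_z_given_x):
    """belief: {state: prob}; p_z_given_x: {state: p(z | state)}."""
    unnorm = {x: p_z_given_x[x] * belief[x] for x in belief}
    eta = sum(unnorm.values())                  # normalizer, i.e. p(z | b)
    return {x: v / eta for x, v in unnorm.items()}

b0 = {"x1": 0.5, "x2": 0.5}
p_z1 = {"x1": 0.7, "x2": 0.3}                   # sensor model from the slide
print(belief_update_obs(b0, p_z1))              # -> {'x1': 0.7, 'x2': 0.3}
```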
I'm a Beliefer [Figure: the value function V1(b), the updated belief b'(b | z1), and the conditional value V1(b | z1)]
HEY! You said POMDP… • Geez. OK. We don't know in advance which observation we will receive, so we must compute the expected value over observations.
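Since we won't know which measurement actually arrives, the value at a belief folds in an expectation over observations. A sketch, assuming the value function is kept as a set of α-vectors (dicts from state to value) and the observation model as p(z | x):

```python
# Expected value over observations: V(b) = sum_z p(z | b) * V(b after seeing z).
def value(belief, alphas):
    """Piecewise-linear value: max over alpha-vectors of the dot product with b."""
    return max(sum(belief[x] * a[x] for x in belief) for a in alphas)

def expected_value_over_obs(belief, alphas, obs_model):
    """obs_model: {z: {state: p(z | state)}}."""
    total = 0.0
    for z, p_z_given_x in obs_model.items():
        p_z = sum(p_z_given_x[x] * belief[x] for x in belief)   # p(z | b)
        if p_z == 0.0:
            continue
        b_z = {x: p_z_given_x[x] * belief[x] / p_z for x in belief}
        total += p_z * value(b_z, alphas)
    return total
```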
Value of Sensing [Figure: the value function before sensing and after sensing]
Lather, Rinse, Repeat • We just did a full backup for T=1! • Repeat for T=2.
The POMDP Hex • POMDPs have both the curse of dimensionality and the curse of history. • Scholars maintain that history is the truly unmanageable part. O(|V| × |A| × |Ω| × |S|² + |A| × |S| × |V|^|Ω|) — belief update (taking an action) + value backups (sensor measurements)
The POMDP Hex • T = 1; |V| = 4 • T = 3; |V| ≈ 64 • T = 20; |V| ≈ 10^547,864 • T = 30; |V| ≈ 10^561,012,337
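Those are not typos: the exponents themselves have six and nine digits. A back-of-the-envelope sketch of where numbers like that come from, using the unpruned growth |V_{t+1}| = |A| · |V_t|^|Ω| for a 3-action, 2-observation problem; the slide's exact constants differ a bit (early steps get pruned), but the double-exponential trend is the point.

```python
# Curse of history, roughly: each exact backup multiplies the alpha-vector
# count by |A| and raises it to the |Omega| power. Track log10(|V|) because
# the raw number is astronomically large well before T = 20.
import math

def log10_vector_count(horizon, n_actions=3, n_obs=2, v0=1):
    log_v = math.log10(v0)
    for _ in range(horizon):
        log_v = math.log10(n_actions) + n_obs * log_v
    return log_v

for t in (1, 3, 20, 30):
    print(f"T = {t:2d}: |V| on the order of 10^{log10_vector_count(t):,.0f}")
```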
So, How can we address this? • You fly to Paris to enjoy a nice croque monsieur
Here we go • Point-based Value Iteration • Heuristic-search Value Iteration
Point-based Value Iteration • Addresses two major concerns: • Value Iteration treats all possible beliefs equally, no matter how absurd • An exact value function is probably unnecessary • How do we do this? • Maintain a set of belief points over which the value function is computed • Only keep α-vectors that maximize the value for at least one member of the belief set • Focus the search on the most probable beliefs
Point-based Value Iteration • Maintains a fixed set of belief points, B • The value function is computed only over B • B only contains reachable beliefs • The number of constraints (|V|) is fixed at |B| • Value updates are now PTIME • PBVI provides an anytime solution that converges to the optimal value function • The error in the value function is bounded
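One common way (in the spirit of the PBVI paper) to grow B with reachable, well-spread beliefs is greedy expansion: for each current point, simulate candidate successors and keep the one farthest from the set. A sketch under those assumptions; successors() is a hypothetical helper that does the forward simulation.

```python
# PBVI-style belief set expansion (greedy, distance-based), sketched.
def l1(b1, b2):
    return sum(abs(b1[x] - b2[x]) for x in b1)

def expand_beliefs(B, successors):
    """B: list of beliefs; successors(b): candidate next beliefs for b
    (e.g., one per action with a sampled observation)."""
    new_B = list(B)
    for b in B:
        candidates = successors(b)
        if not candidates:
            continue
        # Keep the candidate that is farthest from everything we already have.
        best = max(candidates, key=lambda c: min(l1(c, bb) for bb in new_B))
        if best not in new_B:
            new_B.append(best)
    return new_B
```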
What does it Mean? • Start with a small number of reachable beliefs • Point-based backup instead of Bellman backup • Implicitly prunes value simplex • Increase # beliefs until timeout O(|V| × |A| × |Ω| × |S|² + |B| × |A| × |S| × |Ω|)
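The point-based backup itself, sketched for a single belief point. The model dictionaries follow the conventions from the first sketch (T[(s, a, s')], O[(a, s', z)], R[(s, a)]), and V is the current, non-empty list of α-vectors; this is an illustration, not the paper's code.

```python
# One point-based backup: build the best alpha-vector for belief b and return it.
def point_based_backup(b, V, states, actions, observations, T, O, R, gamma):
    best_alpha, best_val = None, float("-inf")
    for a in actions:
        # Start from the immediate reward vector for action a.
        alpha_a = {s: R[(s, a)] for s in states}
        for z in observations:
            # Project every alpha-vector back through (a, z); keep the best at b.
            best_proj, best_proj_val = None, float("-inf")
            for alpha in V:
                proj = {s: gamma * sum(T[(s, a, sp)] * O[(a, sp, z)] * alpha[sp]
                                       for sp in states)
                        for s in states}
                val = sum(b[s] * proj[s] for s in states)
                if val > best_proj_val:
                    best_proj, best_proj_val = proj, val
            for s in states:
                alpha_a[s] += best_proj[s]
        val = sum(b[s] * alpha_a[s] for s in states)
        if val > best_val:
            best_alpha, best_val = alpha_a, val
    return best_alpha
```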
Heuristic Search Value Iteration • Popular flavor of PBVI • Upper and lower bound on value estimate • Uses several powerful heuristics • Anytime. Gets arbitrarily close* to optimal • Performs a depth-first search into belief space • Depth is bounded
Upper and Lower Bounds! • Lower bound is the standard vector simplex • Upper bound is a set of belief/value points • Search ends when difference in bounds at initial belief is < ε • Initialization? • Lower bound is easy • Upper bound is a solution to the MDP
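A sketch of one reasonable way to do that initialization, in the spirit of HSVI: seed the lower bound with "blind" policies (repeat one action forever) and the upper bound with the value of the underlying MDP, one value per corner of the belief simplex. The function names and iteration counts are my own.

```python
# Lower bound: one alpha-vector per action, the value of blindly repeating it.
def blind_lower_bound(states, actions, T, R, gamma, iters=200):
    alphas = []
    for a in actions:
        worst = min(R[(s, a)] for s in states) / (1.0 - gamma)
        alpha = {s: worst for s in states}           # safe under-estimate
        for _ in range(iters):
            alpha = {s: R[(s, a)] + gamma * sum(T[(s, a, sp)] * alpha[sp]
                                                for sp in states)
                     for s in states}
        alphas.append(alpha)
    return alphas

# Upper bound: value of the fully observable MDP (which upper-bounds the
# POMDP value); interpolate between these corner values at arbitrary beliefs.
def mdp_upper_bound(states, actions, T, R, gamma, iters=200):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[(s, a)] + gamma * sum(T[(s, a, sp)] * V[sp]
                                            for sp in states)
                    for a in actions)
             for s in states}
    return V
```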
We are doing Heuristic Search • Startling observation: the quality of the value estimate at a successor affects its predecessor • width = difference in upper and lower bounds • Want to choose successors which minimize width at initial belief • What does this mean for successors? • We have to pick observations and actions
IE-Max Heuristic • OK. Pick an action… • Select the one with the max upper bound
Excess Uncertainty Heuristic • To complete the deal, we need an observation • Pick the one whose successor belief has the largest probability-weighted excess uncertainty (bound gap beyond the precision we need at that depth)
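Putting the two heuristics together, a sketch of one forward-exploration step. Q_upper(b, a), p_obs(z, b, a), tau(b, a, z), and width(b) are assumed helpers (upper-bound Q value, observation probability, belief update, and bound gap); epsilon is the target precision at the root.

```python
# One HSVI-style forward step, sketched.
def choose_successor(b, t, actions, observations,
                     Q_upper, p_obs, tau, width, epsilon, gamma):
    # IE-MAX: act greedily with respect to the upper bound.
    a_star = max(actions, key=lambda a: Q_upper(b, a))

    # Excess uncertainty: how far a successor's bound gap sticks out above
    # the precision needed at depth t + 1, weighted by p(z | b, a*).
    def excess(z):
        b_next = tau(b, a_star, z)
        return p_obs(z, b, a_star) * (width(b_next) - epsilon / gamma ** (t + 1))

    z_star = max(observations, key=excess)
    return a_star, z_star, tau(b, a_star, z_star)
```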
Search Depth • Depth-first search strikes fear into the hearts of even the strongest men • Our search is bounded at depth t… phew • Once we expand a belief whose bound gap has shrunk below ε·γ^(−t) (no excess uncertainty left), the search ceases
Updating the Value Function • When search ceases, perform full Bellman backups in reverse order • Just insert the constraint vector for the l.b. • Update u.b. based on expected value • Both bounds are uniformly improvable
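A sketch of those local updates at a visited belief, under the assumption that backup(b, V) is a point-based backup like the one sketched earlier (with the model curried in) and Q_upper(b, a) is a one-step lookahead evaluated against the current upper bound.

```python
# Local HSVI-style updates at a visited belief b, sketched.
def update_bounds(b, lower_V, upper_points, actions, backup, Q_upper):
    """backup(b, V): point-based backup returning one alpha-vector.
    Q_upper(b, a): one-step lookahead against the current upper bound."""
    # Lower bound: just insert the new constraint vector for b.
    lower_V.append(backup(b, lower_V))
    # Upper bound: add a new belief/value point based on expected value.
    upper_points.append((b, max(Q_upper(b, a) for a in actions)))
    return lower_V, upper_points
```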
Wait, wat? [Figure: the forward search tree rooted at the initial belief b0, branching on actions a0 … an and observations z0 … zm at depths t = 0, 1, …, k; the action branch is chosen by max upper bound, the observation branch by max "excess uncertainty"]
Properties of HSVI • Upper and lower bounds monotonically converge to the optimal value function • Local updates preserve improvability • Maximum regret is ε • Finite search depth • Finite depth → finite Bellman updates
Rock-Sample • Deterministic motions • Noisy sensor for rock goodness • +10 for sampling good • -10 for sampling bad • +10 for exiting • No other cost/reward
More Results • Lots of comparisons in the original paper
What did we learn • Exact POMDP solutions are utterly infeasible • Even for the tiniest problem ever • History is (probably) worse than dimensionality • Approximate solutions have better properties • Anytime POMDP solutions are possible • They scale to hundreds or even thousands of states • The problem is really really really really hard