Approximate POMDP planning: Overcoming the curse of history! Presented by: Joelle Pineau Joint work with: Geoff Gordon and Sebastian Thrun Machine Learning Lunch - March 10, 2003
To use or not to use a POMDP • POMDPs provide a rich framework for sequential decision-making, which can model: • varying rewards across actions and goals • uncertainty in the action effects • uncertainty in the state of the world
Existing applications of POMDPs • Maintenance scheduling • Puterman, 1994 • Robot navigation • Koenig & Simmons, 1995; Roy & Thrun, 1999 • Helicopter control • Bagnell & Schneider, 2001; Ng et al., 2002 • Dialogue modeling • Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000 • Preference elicitation • Boutilier, 2002
Graphical Model Representation A POMDP is an n-tuple { S, A, Ω, b, T, O, R }: S = state set A = action set Ω = observation set b(s) = initial belief T(s,a,s') = state-to-state transition probabilities O(s,a,o) = observation generation probabilities R(s,a) = reward function [Figure: two-slice dynamic Bayes net. What goes on: hidden states st-1, st with policy π(s). What we see: actions at-1, at, rewards rt-1, rt, observations ot-1, ot. What we infer: beliefs bt-1, bt with policy π(b).]
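The tuple above maps directly onto arrays. A minimal sketch in Python (the toy 2-state / 2-action / 2-observation numbers and all names here are illustrative, not taken from the talk):

```python
import numpy as np

# Minimal POMDP container: the n-tuple {S, A, Omega, b, T, O, R} as arrays.
# Illustrative toy problem: 2 states, 2 actions, 2 observations.
n_states, n_actions, n_obs = 2, 2, 2

T = np.array([[[0.9, 0.1],    # T[a, s, s'] = P(s' | s, a)
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.5, 0.5]]])
O = np.array([[[0.85, 0.15],  # O[a, s', o] = P(o | s', a)
               [0.15, 0.85]],
              [[0.5, 0.5],
               [0.5, 0.5]]])
R = np.array([[1.0, 0.0],     # R[s, a] = immediate reward
              [0.0, 1.0]])
b0 = np.array([0.5, 0.5])     # b(s) = initial belief
```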
Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. for S={s1, s2} the belief lies on a line segment [Figure: axis P(s1) from 0 to 1]; for S={s1, s2, s3} on a 2-D simplex [Figure: axes P(s1), P(s2)]; for S={s1, s2, s3, s4} on a 3-D simplex [Figure: axes P(s1), P(s2), P(s3)]
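Since the state is never observed directly, the belief is maintained with a Bayes filter after every action/observation pair. A sketch, assuming the array layout of the previous snippet:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') * b(s)."""
    predicted = T[a].T @ b            # sum_s T(s,a,s') b(s), for every s'
    unnormalized = O[a, :, o] * predicted
    norm = unnormalized.sum()         # = P(o | b, a)
    if norm == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return unnormalized / norm

# Example: b1 = belief_update(b0, a=0, o=0, T=T, O=O)
```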
The first curse of POMDP planning • The curse of dimensionality: • dimension of the belief = # of states • dimension of planning problem = # of states • related to the MDP curse of dimensionality
Planning for POMDPs • Learning a value function V(b), ∀b ∈ B • Learning an action-selection policy π(b), ∀b ∈ B
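The value-function and policy equations on this slide were lost in extraction; the standard belief-MDP forms they refer to are (a reconstruction, not a transcription of the slide):

```latex
V(b) \;=\; \max_{a \in A}\Big[\sum_{s \in S} R(s,a)\,b(s)
      \;+\; \gamma \sum_{o \in \Omega} P(o \mid b,a)\, V(b^{a,o})\Big]
\qquad
\pi(b) \;=\; \operatorname*{arg\,max}_{a \in A}\Big[\sum_{s \in S} R(s,a)\,b(s)
      \;+\; \gamma \sum_{o \in \Omega} P(o \mid b,a)\, V(b^{a,o})\Big]
```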
Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2
Iteration   # hyper-planes
0           1
1           3
2           27
3           2187
4           14,348,907
[Figure: value functions V0(b) through V2(b) shown as the upper envelope of hyper-planes over the belief space P(s1)]
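The counts in this table follow the standard growth rate of exact backups: each new hyper-plane chooses one action and, for each observation, one hyper-plane from the previous iteration, so

```latex
|\Gamma_{n+1}| \;=\; |A| \cdot |\Gamma_n|^{|\Omega|} \;=\; 3 \cdot |\Gamma_n|^{2}
\quad\Rightarrow\quad
1 \;\to\; 3 \;\to\; 27 \;\to\; 2187 \;\to\; 14{,}348{,}907 .
```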
Properties of exact value iteration • Value function is always piecewise-linear convex • Many hyper-planes can be pruned away • |S|=2, |A|=3, |Ω|=2
Iteration   # hyper-planes
0           1
1           3
2           5
3           9
4           7
5           13
10          27
15          47
20          59
…
[Figure: pruned value function V(b) over P(s1)]
Is pruning sufficient? • |S|=20, |A|=6, |Ω|=8
Iteration   # hyper-planes
0           1
1           5
2           213
3           ?????
…
Not for this problem!
The second curse of POMDP planning • The curse of dimensionality: • the dimension of each hyper-plane = # of states • The curse of history: • the number of hyper-planes grows exponentially with the planning horizon • Complexity of POMDP value iteration, with one factor per curse (see the expression below):
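The complexity expression the two labels ("dimensionality", "history") point at did not survive extraction; a reconstruction consistent with the rest of the talk is:

```latex
\underbrace{|S|^2\,|A|}_{\text{curse of dimensionality}}
\;\cdot\;
\underbrace{|\Gamma_n|^{|\Omega|}}_{\text{curse of history}}
```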
Possible approximation approaches • Ignore the belief [Littman et al., 1995]: overcomes both curses; very fast; performs poorly in high-entropy beliefs • Discretize the belief [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001]: overcomes the curse of history (sort of); scales exponentially with # states • Compress the belief [Poupart & Boutilier, 2002; Roy & Gordon, 2002]: overcomes the curse of dimensionality • Plan for trajectories [Baxter & Bartlett, 2000; Ng & Jordan, 2002]: can diminish both curses; requires a restricted policy class; local minima, slow-changing gradients
A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points (focus on reachable beliefs) • Plan for those belief points only (learn the value and its gradient) [Figure: value function V(b) over P(s1), with belief points b0, b1, b2 linked by (a,o) transitions]
Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b ∈ B: • For each (a,o): project forward b → ba,o and find the best value • Sum over observations • Max over actions (see the sketch below) [Figure sequence: Vn(b) over P(s1); from a point b, the projected beliefs ba1,o1, ba1,o2, ba2,o1, ba2,o2 and the hyper-plane that is best at each; summing over observations yields per-action vectors ba1, ba2; maxing over actions gives Vn+1(b) at the points b0, b1, b2]
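A compact sketch of this backup for a single belief point, reusing the array layout from the earlier snippets; `alphas` holds the current α-vectors, one per row. This is a paraphrase of the update described above, not the authors' code:

```python
import numpy as np

def point_based_backup(b, alphas, T, O, R, gamma, n_actions, n_obs):
    """One point-based backup at belief b: project, sum over observations, max over actions.
    alphas: (k, |S|) array of current alpha-vectors; returns the new alpha-vector for b."""
    best_value, best_alpha = -np.inf, None
    for a in range(n_actions):
        # Start from the immediate-reward term r_a(s) = R(s, a).
        alpha_a = R[:, a].copy()
        for o in range(n_obs):
            # Projection: alpha^{a,o}(s) = gamma * sum_{s'} T(s,a,s') O(s',a,o) alpha(s')
            projected = gamma * (T[a] * O[a, :, o]) @ alphas.T   # shape (|S|, k)
            # Keep the projected vector that is best at b, then sum over observations.
            alpha_a += projected[:, np.argmax(b @ projected)]
        # Max over actions: keep the action whose vector gives the highest value at b.
        value = b @ alpha_a
        if value > best_value:
            best_value, best_alpha = value, alpha_a
    return best_alpha
```

Running this for every b ∈ B and collecting the returned vectors gives the next value function Vn+1.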
Complexity of value update
Step             Exact update     Point-based update
I - Projection   S² A Ω Γn        S² A Ω B
II - Sum         S A Γn^Ω         S A Ω B²
III - Max        S A Γn+1         S A B
where: S = # states, Γn = # solution vectors at iteration n, A = # actions, B = # belief points, Ω = # observations
Theoretical properties of point-based updates • Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm εn = ||VnB − Vn*|| is bounded by:
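The bound itself was lost in extraction; the result published with PBVI (Pineau, Gordon & Thrun, 2003) is, with εB the density of the belief set B over the reachable belief space:

```latex
\varepsilon_n \;=\; \|V_n^B - V_n^*\|_\infty
\;\le\; \frac{(R_{\max} - R_{\min})\;\varepsilon_B}{(1-\gamma)^2},
\qquad
\varepsilon_B \;=\; \max_{b' \in \bar{\Delta}} \min_{b \in B} \|b - b'\|_1 .
```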
Back to the full algorithm • Main idea: • Select a small set of belief points (PART II) • Plan for those belief points only (PART I) [Figure: value function V(b) over P(s1), with reachable belief points b0, b1, b2 linked by (a,o) transitions]
Experimental results: Lasertag domain • State space = RobotPosition × OpponentPosition • Observable: RobotPosition (always); OpponentPosition (only if same as Robot) • Action space = {North, South, East, West, Tag} • Opponent strategy: move away from robot w/ Pr=0.8 • |S|=870, |A|=5, |Ω|=30
Performance of PBVI on Lasertag domain [Figure: two policy evaluations, captioned "Opponent tagged 59% of trials" and "Opponent tagged 17% of trials"]
Performance on well-known POMDPs

Maze33 (|S|=36, |A|=5, |Ω|=17)
  Method    QMDP    Grid    PBUA    PBVI
  Reward    0.198   0.94    2.30    2.25
  Time(s)   0.19    n.v.    12166   3448
  B         n.a.    174     660     470
  %Goal     47      n.v.    100     95

Hallway (|S|=60, |A|=5, |Ω|=20)
  Method    QMDP    Grid    PBUA    PBVI
  Reward    0.261   n.v.    0.53    0.53
  Time(s)   0.51    n.v.    450     288
  B         n.a.    n.a.    300     86
  %Goal     22      98      100     98

Hallway2 (|S|=92, |A|=5, |Ω|=17)
  Method    QMDP    Grid    PBUA    PBVI
  Reward    0.109   n.v.    0.35    0.34
  Time(s)   1.44    n.v.    27898   360
  B         n.a.    337     1840    95
Back to the full algorithm • Main idea: • Select a small set of belief points (PART II) • Plan for those belief points only (PART I) [Figure: value function V(b) over P(s1), with reachable belief points b0, b1, b2 linked by (a,o) transitions]
Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs. • What can we learn from MDP exploration techniques? • Select widely-spaced beliefs, rather than near-by beliefs. [Figure: from a belief b, the one-step reachable beliefs ba1,o1, ba1,o2, ba2,o1, ba2,o2 over P(s1)]
How does PBVI actually select belief points? • Start with B = {b0} • For each belief point b ∈ B: • For each action a ∈ A: generate a new belief ba by applying a and stochastically picking an observation o (a sketch follows below). [Figure: from b, a sampled successor ba1 reached via (a1,o2) over P(s1)]
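A sketch of this expansion step, assuming the `belief_update` helper defined earlier; the rule of keeping, for each b, the sampled successor farthest in L1 distance from the current set is the heuristic from the published PBVI algorithm, and all other names are illustrative:

```python
import numpy as np

def expand_belief_set(B, T, O, n_actions, n_obs, rng=np.random.default_rng(0)):
    """PBVI-style expansion: for each b in B, simulate one (a, o) successor per action
    and keep the successor farthest (L1 distance) from the beliefs collected so far."""
    new_points = []
    for b in B:
        candidates = []
        for a in range(n_actions):
            # Sample a state, a next state, then an observation, to pick o stochastically.
            s = rng.choice(len(b), p=b)
            s_next = rng.choice(len(b), p=T[a, s])
            o = rng.choice(n_obs, p=O[a, s_next])
            candidates.append(belief_update(b, a, o, T, O))
        # Keep the candidate that is farthest from every belief already collected.
        dist = lambda c: min(np.abs(c - bb).sum() for bb in B + new_points)
        new_points.append(max(candidates, key=dist))
    return B + new_points

# Example: B = expand_belief_set([b0], T, O, n_actions=2, n_obs=2)
```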