Black Box and Generalized Algorithms for Planning in Uncertain Domains

Presentation Transcript


  1. Black Box and Generalized Algorithms for Planning in Uncertain Domains Thesis Proposal, Dept. of Computer Science, Carnegie Mellon University H. Brendan McMahan

  2. Outline • The Problem and Approach • Motivating Examples • Goals and Techniques • MDPs and Uncertainty • Example Algorithms • Proposed Future Work

  3. Mars Rover Mission Planning • Human control not realistic • Collect data while conserving power and bandwidth References: First Experiments in the Robotic Investigation of Life in the Atacama Desert of Chile. D. Wettergreen, et al. 2005. Recent Progress in Local and Global Traversability for Planetary Rovers. S. Singh, et al. 2000.

  4. Autonomous Helicopter Control • 6+ continuous state dimensions • Complex, non-linear dynamics • High failure cost References: Inverted Autonomous Helicopter Flight via Reinforcement Learning. A. Ng, et al. Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods. J. Bagnell and J. Schneider.

  5. Online Shortest Path Problem Getting from my (old) house to CMU each day:

  6. Other Domains

  7. Goal • Planning: multiple decisions over time to achieve goals or minimize cost • in Uncertain Domains: NOT deterministic, fully observable, perfectly modeled

  8. The Black Box Approach Hard Planning Problem Easier Problems New Algorithm Fast Existing Algorithm Solution Solutions

  9. The Generalization Approach Fast Existing Algorithm Hard Planning Problem Generalization of Existing Algorithm Solution

  10. Two Examples • Black Box Approach: MDP Alg (e.g., value iteration) is used as a black box by Oracle Algorithms (MDPs with unknown costs) • Generalization Approach: Dijkstra’s Alg (Shortest Paths) is generalized to Algorithms for Stochastic Shortest Paths

  11. Benefits of using Black Boxes • Use fast/optimized/mature implementations • Pick implementation for specific domain • Will be able to use algorithms not even invented yet • Theoretical advantages

  12. Benefits of Generalization • New intuitions • Some performance guarantees for free

  13. Markov Decision Processes An MDP (S, A, P, c) … • S is a finite set of states • A is a finite set of actions • dynamics P(y | x, a) • costs c(x,a) A Research MDP • A = {eat, wait, work} [Figure: research MDP with states Hungry, No New Ideas, and Goal: New idea!; edges labeled with costs ($4.75, $0.10, $1.00) and transition probabilities]
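
The MDP definition on this slide can be tied to a planning objective with one equation. The following is a sketch under an assumption the slide does not state: an undiscounted, goal-directed setting with a zero-cost absorbing goal set. The symbols V* (optimal expected cost-to-go) and G (goal set) are introduced here for illustration; S, A, P, and c are the slide's own.

```latex
V^*(x) = \min_{a \in A} \Big[ c(x,a) + \sum_{y \in S} P(y \mid x, a)\, V^*(y) \Big]
\quad \text{for } x \notin G,
\qquad
V^*(x) = 0 \quad \text{for } x \in G.
```

An optimal policy then picks, in each state x, an action that attains the minimum on the right-hand side.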

  14. Simple Example Domain Robot path planning problem: • Actions = {8 neighbors} • Cost: Euclidean Distance • Prob. p of random action
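
A minimal value-iteration sketch for a domain of this kind may make the setup concrete. Several details are assumptions rather than anything stated on the slide: the grid is a small square, a move that would leave the grid keeps the robot in place, the cost charged is the Euclidean length of the chosen move, and with probability p a uniformly random neighbor move replaces the chosen one. The names `step` and `value_iteration` are hypothetical.

```python
import math

# The 8 neighbor moves (dx, dy), excluding "stay".
ACTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def step(state, action, size):
    """Apply a move; assumed behavior: moves off the grid leave the state unchanged."""
    x, y = state[0] + action[0], state[1] + action[1]
    return (x, y) if 0 <= x < size and 0 <= y < size else state


def value_iteration(size, goal, p, tol=1e-9):
    """Expected cost-to-go for every cell of a size x size grid."""
    states = [(x, y) for x in range(size) for y in range(size)]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == goal:
                continue  # the goal is absorbing with zero cost
            # Expected next value if a uniformly random action is executed.
            random_next = sum(V[step(s, b, size)] for b in ACTIONS) / len(ACTIONS)
            best = math.inf
            for a in ACTIONS:
                expected_next = (1 - p) * V[step(s, a, size)] + p * random_next
                best = min(best, math.hypot(*a) + expected_next)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V_det = value_iteration(size=3, goal=(2, 2), p=0.0)
V_noisy = value_iteration(size=3, goal=(2, 2), p=0.1)
print(V_det[(0, 0)], V_noisy[(0, 0)])
```

With p = 0 the problem is an ordinary shortest-path problem, so the corner-to-corner value is two diagonal moves; with p > 0 the expected cost grows because random moves waste steps.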

  15. Types of Uncertainty • Outcome Uncertainty (MDPs) • Partial Observability (POMDPs) • Model Uncertainty (families of MDPs, RL) • Modeling Other Agents (Agent Uncertainty?)

  16. The Curse of Dimensionality • The size of |S| is exponential in the number of state variables: <x,y, vx, vy, battery_power, door_open, another_door_open, goal_x, goal_y, bob_x, bob_y, …>

  17. Outline • The Problem and Approach • Example Algorithms • MDPs with Unknown Costs • Generalizing Dijkstra’s Algorithm • Proposed Future Work

  18. Unknown Costs, Offline Version A game with two players: • The Planner chooses a policy for an MDP with known dynamics • The Sentry chooses a cost function from a set K = {c1,…,ck} of possible cost functions.

  19. Avoiding Detection by Sensors The Planner (robot) picks policies (paths): The Sentry picks cost functions (sensor placements):

  20. Matrix Game Formulation An optimal mixed strategy for the planner is a distribution over deterministic policies (paths). Matrix game M: • Planner (rows) selects a policy π • Sentry (columns) selects a cost c • M(π, c) = [total cost of π under costs c] Goal: Find a minimax solution to M
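
The minimax goal on this slide can be written out explicitly. This is a sketch in notation introduced here for illustration: Π is the set of deterministic policies, Δ(Π) the set of distributions over them, p the Planner's mixed strategy, and v the game value. Only M, π, c, and K = {c1,…,ck} come from the slides.

```latex
\min_{p \in \Delta(\Pi)} \; \max_{c \in K} \; \sum_{\pi \in \Pi} p(\pi)\, M(\pi, c)
% Equivalent linear program over (p, v):
\begin{aligned}
\text{minimize}   \quad & v \\
\text{subject to} \quad & \sum_{\pi \in \Pi} p(\pi)\, M(\pi, c_j) \le v, \qquad j = 1, \dots, k, \\
                        & \sum_{\pi \in \Pi} p(\pi) = 1, \qquad p(\pi) \ge 0 .
\end{aligned}
```

Written this way, the LP has one variable per deterministic policy, which is why the matrix form is exponentially large.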

  21. Interpretations • Model Uncertainty: → unknown cost function • Partial Observability: → fixed, unobservable cost function • Agent Uncertainty: →an adversary picks the cost function

  22. How to Solve It • Problem: Matrix M is exponentially big • Solution: Can be represented compactly as a Linear Program (LP) • Problem: LP still takes much too long to solve • Solution: The Single Oracle Algorithm, taking advantage of fast black box MDP algorithms

  23. Single Oracle Algorithm • F is a small set of policies • M’ is the matrix game where the Planner must play from F. • We can solve M’ efficiently: it is only |F| × |K| in size! [Figure: example with |F| = 2]

  24. Single Oracle Algorithm If only … • we knew it was sufficient for the Planner to randomize among a small set of strategies • and we could find that set of strategies.

  25. Single Oracle Algorithm • Use an MDP algorithm to find an optimal policy π against the fixed cost function c. • Add π to F • Solve M’ and let c be the expected cost function under the Sentry’s optimal mixed strategy.
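
The loop on this slide can be sketched on a toy instance. Everything concrete below is an assumption made for illustration: the MDP is a deterministic five-edge graph, Dijkstra's algorithm stands in for the black-box MDP solver, and there are only two cost functions (`C1`, `C2`), so M’ can be solved exactly by checking where the Planner's cost lines cross instead of calling an LP solver. The stopping test (the best response is no cheaper than the value of M’) is likewise a choice made for this sketch.

```python
import heapq
import itertools

EDGES = [("s", "a"), ("a", "g"), ("s", "b"), ("b", "g"), ("s", "g")]
C1 = dict(zip(EDGES, [1, 9, 1, 1, 5]))  # hypothetical: sensor covers edge a->g
C2 = dict(zip(EDGES, [1, 1, 1, 9, 5]))  # hypothetical: sensor covers edge b->g


def shortest_path(costs, start="s", goal="g"):
    """Black-box planner: returns (path as a tuple of edges, path cost)."""
    queue = [(0.0, start, ())]
    done = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return path, dist
        if node in done:
            continue
        done.add(node)
        for (u, v), c in costs.items():
            if u == node:
                heapq.heappush(queue, (dist + c, v, path + ((u, v),)))
    raise ValueError("goal unreachable")


def path_cost(path, costs):
    return sum(costs[e] for e in path)


def solve_restricted_game(F):
    """Solve M' for two cost functions: the Sentry plays C1 with probability q."""
    lines = [(path_cost(p, C1), path_cost(p, C2)) for p in F]

    def value(q):  # Planner's best expected cost within F against mix q
        return min(q * m1 + (1 - q) * m2 for m1, m2 in lines)

    candidates = [0.0, 1.0]
    for (a1, a2), (b1, b2) in itertools.combinations(lines, 2):
        denom = (a1 - a2) - (b1 - b2)
        if denom != 0:
            q = (b2 - a2) / denom  # where the two cost lines cross
            if 0.0 <= q <= 1.0:
                candidates.append(q)
    q = max(candidates, key=value)
    return q, value(q)


def single_oracle(tol=1e-9):
    F, q, value = [], 1.0, None
    while True:
        c = {e: q * C1[e] + (1 - q) * C2[e] for e in EDGES}  # expected costs
        path, br_cost = shortest_path(c)  # best response to the Sentry's mix
        if F and br_cost >= value - tol:
            return F, q, value  # no policy outside F improves on M'
        F.append(path)
        q, value = solve_restricted_game(F)


F, q, value = single_oracle()
print(len(F), q, value)
```

The value of M’ can only overestimate the value of the full game, because the Planner is restricted to F, while the best-response cost can only underestimate it. When the two meet, the restricted solution is optimal for the full game.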

  26. Example Run: Initialization • Fix policy (blue path) • Solve M’ to find red sensor field (cost vector), fix this as c

  27. Iteration 1: Best Response • Solve for the best response policy π (new blue line) • Add π to F Red: Fixed cost vector (expected field of view) Blue: Shortest path given costs

  28. Iteration 1: Solve the Game • Solve M’ Minimax Equilibrium: Red: Mixture of Costs Blue: Mixture of Paths from F

  29. Iteration 2: Best Response • Solve for the best response policy π (new blue line) • Add π to F Red: Fixed cost vector (expected field of view) Blue: Shortest path given costs

  30. Iteration 2: Solve the Game • Solve M’ Minimax Equilibrium: Red: Mixture of Costs Blue: Mixture of Paths from F

  31. Iteration 6: Convergence Solution to M’ Best Response

  32. Unknown Costs, Online Version • Go from my house to CMU each day • Model as a graph

  33. A Shortest Path Problem? • If we knew all the edge costs, it would be easy! • But, traffic, downed trees → uncertainty

  34. Limited Observations • Each day, observe the total length of the path we actually took to get to CMU • BGA Algorithm: • Keep estimates of edge lengths • Most days, follow the FPL [1] algorithm: pick the shortest path with respect to estimated lengths plus a little noise. • Occasionally, play a “random” path in order to make sure we have good estimates of the edge lengths. 1 [Kalai and Vempala, 2003]
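
The daily decision rule described on this slide might look roughly like the sketch below. It covers only the choice of path, not how BGA updates its edge-length estimates from the observed total path length, which the slide does not spell out. The function name `choose_path`, the exponential noise, the uniform exploration over a fixed list of candidate paths, and the default parameter values are all assumptions made for illustration.

```python
import random


def path_length(path, lengths):
    """Total length of a path given per-edge lengths."""
    return sum(lengths[edge] for edge in path)


def choose_path(paths, estimates, rng, noise_scale=0.1, explore_prob=0.05):
    """Pick today's path from a list of candidate paths (tuples of edge names)."""
    # Occasionally play a random path to keep the estimates accurate.
    if rng.random() < explore_prob:
        return rng.choice(paths)
    # Otherwise follow the perturbed leader: add a little noise to each
    # estimated edge length and take the shortest path under the result.
    perturbed = {
        edge: length + noise_scale * rng.expovariate(1.0)
        for edge, length in estimates.items()
    }
    return min(paths, key=lambda p: path_length(p, perturbed))


rng = random.Random(0)
paths = [("a", "b"), ("c",)]
estimates = {"a": 1.0, "b": 1.0, "c": 3.0}
greedy = choose_path(paths, estimates, rng, noise_scale=0.0, explore_prob=0.0)
today = choose_path(paths, estimates, rng)
print(greedy, today)
```

With noise and exploration switched off, the rule reduces to following the currently shortest estimated path.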

  35. Dijkstra's Algorithm • Keeps states on a priority queue • Pops states in order of increasing distance, updates predecessors • Prioritized Sweeping [1,2] has a similar structure, but doesn’t reduce to Dijkstra’s algorithm [Figure: graph with goal G (v'=0) and states x1 to x4, annotated with tentative values v'] 1 [A. Moore, C. Atkeson 1993] 2 [D. Andre, et al. 1998]
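
The structure described on this slide (a priority queue, states popped in order of increasing distance, predecessors updated) can be sketched as a backward search from the goal. The graph encoding and the name `dijkstra_to_goal` are assumptions made for illustration; the small example graph is hypothetical and is not the one in the slide's figure.

```python
import heapq
import math


def dijkstra_to_goal(edges, goal):
    """Cost-to-go v[x] for each state that can reach the goal.

    edges maps a state x to a list of (successor y, nonnegative cost) pairs.
    """
    # Invert the graph so we can look up the predecessors of a popped state.
    preds = {}
    for x, outgoing in edges.items():
        for y, cost in outgoing:
            preds.setdefault(y, []).append((x, cost))

    v = {goal: 0.0}
    queue = [(0.0, goal)]
    closed = set()
    while queue:
        dist, y = heapq.heappop(queue)  # closest unfinished state
        if y in closed:
            continue  # stale queue entry
        closed.add(y)
        for x, cost in preds.get(y, []):
            if dist + cost < v.get(x, math.inf):
                v[x] = dist + cost  # better route to the goal through y
                heapq.heappush(queue, (v[x], x))
    return v


edges = {
    "x1": [("G", 1.0)],
    "x2": [("x1", 1.0), ("G", 3.0)],
    "x3": [("x2", 2.0)],
}
values = dijkstra_to_goal(edges, "G")
print(values)
```

Once a state is popped its value is final, so each state is expanded at most once.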

  36. Prioritized Sweeping • When we pop a state x, backup x, update priorities of predecessors w [Figure: states w1, w2, x1, y1, y2, y3. Values of red states updated based on value of purple states.]
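
A textbook-style sketch of this pop, backup, and reprioritize loop is given below for a cost-minimizing MDP with an absorbing goal. It is not necessarily the exact variant on the slide. The priority used here (the size of the value change times the largest transition probability from the predecessor), the initialization of the queue with every state's initial Bellman error, and the name `prioritized_sweeping` are assumptions made for illustration.

```python
import heapq


def prioritized_sweeping(P, cost, goal, eps=1e-9):
    """P[x][a] is a list of (y, prob) pairs; cost[x][a] is the cost of a in x."""
    V = {x: 0.0 for x in P}
    V[goal] = 0.0

    # preds[y][w] = max over actions a of P(y | w, a)
    preds = {}
    for w in P:
        for a in P[w]:
            for y, prob in P[w][a]:
                entry = preds.setdefault(y, {})
                entry[w] = max(entry.get(w, 0.0), prob)

    def backup(x):
        return min(cost[x][a] + sum(p * V[y] for y, p in P[x][a]) for a in P[x])

    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-abs(backup(x) - V[x]), x) for x in P if x != goal]
    heapq.heapify(queue)
    while queue:
        _, x = heapq.heappop(queue)
        new_value = backup(x)
        change = abs(new_value - V[x])
        V[x] = new_value
        # A change at x may invalidate the values of its predecessors w.
        for w, prob in preds.get(x, {}).items():
            if w != goal and change * prob > eps:
                heapq.heappush(queue, (-change * prob, w))
    return V


# Hypothetical chain: A -> B surely; from B, reach G or stay in B with equal odds.
P = {"A": {"go": [("B", 1.0)]}, "B": {"go": [("G", 0.5), ("B", 0.5)]}}
cost = {"A": {"go": 1.0}, "B": {"go": 1.0}}
V = prioritized_sweeping(P, cost, goal="G")
print(V)
```

In this example the exact values are V(B) = 2 and V(A) = 3; the sweep approaches them geometrically and stops once pending changes fall below `eps`.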

  37. Improved Prioritized Sweeping • When we pop a state x, its value has already been updated • Update values and priorities of predecessors w [Figure: states w1, w2, x1, y1, y2, y3. Values of red states updated based on value of purple states.]

  38. Priority Function Intuitions Update the state: • with lowest value (closest to goal) • whose value is most accurately known • For Dijkstra’s algorithm, the updated (popped) state’s optimal value is known • This is the state whose value will change the least in the future. • whose value has changed the most since it was last updated.

  39. Comparison PS, same problem: IPS, deterministic domain: Dark red indicates recently popped from queue, lighter means less recently.

  40. Outline • The Problem and Approach • Example Algorithms • Proposed Future Work • Bounded RTDP and extensions • Large action spaces • Details of proposed contributions

  41. Bounded RTDP RTDP: • Fixed start state means many states are irrelevant • Sample, backup along start → goal trajectories BRTDP adds: • performance guarantees, much faster convergence (often better than HDP, LRTDP, and LAO*)

  42. Dijkstra and BRTDP Dijkstra-style scheduling of backups for BRTDP • Sample multiple trajectories • Use priority queue to schedule backups of states on all trajectories

  43. Dijkstra, BRTDP, and POMDPs • HSVI [1] is like BRTDP, but for POMDPs • The same trick should apply • But more benefit, because backups are more expensive [Figure: piecewise linear belief-space value function over x1, x2] 1 [T. Smith and R. Simmons. 2004]

  44. Large Action Spaces • (Prioritized) Policy Iteration already has an advantage • Better tradeoff between policy evaluation, policy improvement? • Structured sets of actions? • Application of Experts/Bandits algorithms?

  45. Details: Proposed Contributions • Discussion of algorithms already developed: Oracle Algorithms, BGA, IPS, BRTDP, and several others. • At least two significant new algorithmic contributions: • BRTDP + Dijkstra algorithm, extension to POMDPs • Improved version of PPI to handle large action spaces • Something else: generalizations of conjugate-gradient linear solvers to MDPs, extensions of the technique for finding upper bounds introduced in the BRTDP paper, algorithms for efficiently solving restricted classes of POMDPs...

  46. Details: Proposed Contributions • At least one significant new theoretical contribution: • Approximation algorithm for Canadian Traveler’s Problem or Stochastic TSP • Results connecting online algorithms / MDP techniques to stochastic optimization • New contributions on bandit-style online algorithms, perhaps applications to MDPs

  47. Summary Black Boxes: MDPs with unknown Costs Motivating Problems Future Work: BRTDP + Dijkstra, Large action spaces Generalization: Reducing to Dijkstra

  48. Questions?

  49. Relationships of Algorithms Discussed

  50. Iteration 3: Best Response • Solve for the best response policy π (new blue line) • Add π to F Red: Fixed cost vector (expected field of view) Blue: Shortest path given costs
