Sample-based Planning for Continuous Action Markov Decision Processes
Ari Weinstein, Chris Mansley, Michael L. Littman
aweinst@rutgers.edu, cmansley@cs.rutgers.edu, mlittman@cs.rutgers.edu
Rutgers Laboratory for Real-Life Reinforcement Learning
Motivation
• Sample-based planning:
  • Planning cost is independent of the size of the state space
  • Sometimes the MDP is too large to solve globally
• Continuous action MDPs:
  • A common setting, but few RL algorithms exist for it
  • Imagine riding in a car where the gas and brakes are on/off switches
  • If we have, or can learn, dynamics for continuous action domains, how do we plan with them?
Sample-based planning for finite MDPs
• Don't care about regions of the state space far from the current state
• Requires a generative model (interface sketched below)
  • Ask for a <s, a, r, s'> for any <s, a>, at any time
• Sparse sampling [Kearns et al. 1999]
  • PAC-style guarantees
  • Too expensive in practice
• Monte-Carlo tree search
  • Weaker theoretical guarantees (generally)
  • In practice, more useful
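A minimal sketch of the generative-model interface assumed throughout; the names State, Action, and GenerativeModel are illustrative and not from the original talk.

```python
from typing import Protocol, Tuple

State = Tuple[float, ...]   # illustrative: a tuple of continuous state variables
Action = float              # illustrative: a single continuous control input


class GenerativeModel(Protocol):
    """Answers one query at a time: given <s, a>, sample <r, s'>."""
    def __call__(self, state: State, action: Action) -> Tuple[float, State]: ...
```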
Monte-Carlo Tree [DAG] Search
• Possible trajectories (rollouts) through an MDP can be encoded as a DAG
• Layered by depth, with all states represented at each depth
• Edges carry actions and rewards
• Explore the DAG so that a high-value action is taken at the root
Upper Confidence bounds applied to Trees (UCT) [Kocsis, Szepesvári 2006]
• An instance of Monte-Carlo tree search
• Leverages the bandit literature
• Places a bandit agent similar to UCB1 [Auer et al. 2002] at each <state, depth> in the rollout tree (in the slides, illustrated only at the root)
• Action selection follows the rule below:
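The original slide shows the selection rule as an image; a standard reconstruction of the UCB1-style rule used by UCT, with empirical mean return \hat{Q}, visit counts n, and an exploration constant C, is:

\[
a^{*} = \arg\max_{a} \left[ \hat{Q}(s, d, a) + C \sqrt{\frac{\ln n(s, d)}{n(s, d, a)}} \right]
\]

where s is the state, d the depth in the rollout tree, and a ranges over the (finite) actions.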
Continuous action spaces
• Most canonical RL domains are continuous action MDPs – why ignore that?
  • Hillcar, pole balancing, acrobot, double integrator, robotics…
• Coarse discretization is not good enough
  • It can incur infinite regret
• Want to focus samples in the optimal region of the action space
Hierarchical Optimistic Optimization (HOO) [Bubeck et al. 2008]
• Partitions the action space, similar to a KD-tree
• Keeps track of rewards for each subtree
• In the original figure, blue is the bandit and red is the decomposition of the HOO tree; thickness represents estimated reward
• The tree grows deeper and builds estimates at higher resolution where the reward is highest
HOO continued
• Exploration bonuses account for the number of samples and the size of each subregion
  • Regions with large volume and few samples are unknown, and vice versa
• Pull an arm in the region with the maximal B-value (below)
• Has optimal regret, independent of the action dimension
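The maximized quantity, shown as an image in the original slide, is HOO's B-value; a reconstruction following Bubeck et al.:

\[
U_{h,i} = \hat{\mu}_{h,i} + \sqrt{\frac{2 \ln n}{N_{h,i}}} + \nu_1 \rho^{h},
\qquad
B_{h,i} = \min\left\{ U_{h,i},\ \max\left( B_{h+1,2i-1},\ B_{h+1,2i} \right) \right\}
\]

where (h, i) indexes a node at depth h, \hat{\mu}_{h,i} and N_{h,i} are the empirical mean reward and pull count of that region, n is the total number of pulls, \nu_1 \rho^{h} bounds the diameter of the region, and unvisited nodes have B = +∞.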
HOOT [Weinstein, Mansley, Littman 2010]
• Hierarchical Optimistic Optimization applied to Trees
• The ideas follow from UCT
• Instead of UCB1, places a HOO agent at each <state, depth> in the rollout tree (sketched below)
• The result is continuous action planning
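A rough Python sketch of this construction, not the authors' implementation: it assumes a one-dimensional action interval, a generative model callable as model(s, a) -> (r, s'), and an illustrative state-discretization key; names and defaults (v1, rho, cells, horizon, rollouts) are placeholders. Returns are fed to HOO unnormalized here; in practice they would be rescaled into a fixed range (the vMin/vMax scaling mentioned later).

```python
import math
import random


class HOONode:
    """One node of a HOO tree; covers the 1-D action interval [lo, hi]."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.n = 0                     # pulls that fell in this region
        self.mean = 0.0                # empirical mean return of those pulls
        self.b = float("inf")          # B-value (optimistic bound)
        self.left = self.right = None


class HOO:
    """Minimal 1-D HOO bandit; confidence terms assume returns roughly in [0, 1]."""
    def __init__(self, lo=-1.0, hi=1.0, v1=1.0, rho=0.5):
        self.root = HOONode(lo, hi, 0)
        self.v1, self.rho = v1, rho    # smoothness parameters (assumed values)
        self.t = 0                     # total pulls so far

    def select(self):
        # Follow maximal B-values to a leaf, split it, then sample an action
        # uniformly from the chosen region.
        path, node = [self.root], self.root
        while node.left is not None:
            node = node.left if node.left.b >= node.right.b else node.right
            path.append(node)
        mid = 0.5 * (node.lo + node.hi)
        node.left = HOONode(node.lo, mid, node.depth + 1)
        node.right = HOONode(mid, node.hi, node.depth + 1)
        return random.uniform(node.lo, node.hi), path

    def update(self, path, reward):
        # Record the reward along the selected path, then refresh B-values
        # bottom-up: B = min(U, max of the children's B-values).
        self.t += 1
        for node in path:
            node.n += 1
            node.mean += (reward - node.mean) / node.n
        for node in reversed(path):
            u = (node.mean
                 + math.sqrt(2.0 * math.log(self.t) / node.n)
                 + self.v1 * self.rho ** node.depth)
            node.b = min(u, max(node.left.b, node.right.b))

    def greedy(self):
        # Midpoint of the most-visited region: the action returned after planning.
        node = self.root
        while node.left is not None and node.left.n + node.right.n > 0:
            node = node.left if node.left.n >= node.right.n else node.right
        return 0.5 * (node.lo + node.hi)


class HOOT:
    """One HOO bandit per <discretized state, depth>, driven by UCT-style rollouts."""
    def __init__(self, model, gamma=0.99, horizon=50, rollouts=2048):
        self.model = model             # generative model: model(s, a) -> (r, s')
        self.gamma = gamma
        self.horizon = horizon
        self.rollouts = rollouts
        self.bandits = {}              # (state_key, depth) -> HOO

    def plan(self, state):
        for _ in range(self.rollouts):
            s, trace = state, []
            for depth in range(self.horizon):
                bandit = self.bandits.setdefault((self._key(s), depth), HOO())
                a, path = bandit.select()
                r, s = self.model(s, a)
                trace.append((bandit, path, r))
            # Back up discounted returns; each bandit is updated with the
            # return observed from its own depth onward.
            ret = 0.0
            for bandit, path, r in reversed(trace):
                ret = r + self.gamma * ret
                bandit.update(path, ret)
        return self.bandits[(self._key(state), 0)].greedy()

    @staticmethod
    def _key(state, cells=20):
        # Coarse state discretization so continuous states map to hashable
        # keys; the granularity here is an illustrative choice.
        return tuple(int(round(x * cells)) for x in state)
```

As in UCT, each bandit at a <state, depth> pair is trained on the discounted return observed from that depth onward during its rollouts; the difference is only that the per-node bandit is HOO over a continuous interval rather than UCB1 over a finite action set.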
Benefits of HOOT
• Planning cost independent of state size
• Continuous action planning
• Adaptive partitioning of the action space allows for more efficient tree search
  • Fewer samples wasted on suboptimal actions
• Good performance in high-dimensional action spaces
• Good horizon depth
Experiments
• Domains: D-double integrator and D-link swimmer
• The number of samples to the generative model is fixed at 2048 and 8192 per planning step, respectively
• Since both UCT and HOOT are discrete-state planners, each state dimension is given a coarse discretization of 20 divisions
D-Double Integrator [Santamaría et al. 1998]
• An object with position p and velocity; control the acceleration a. The reward is -(p² + a²) (a model sketch follows below)
• Shows the consequences of poor action discretization
• The explosion in the number of discretized actions causes failure as D grows
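As a concrete instance of the generative-model interface sketched earlier, a hypothetical 1-D double-integrator model matching the slide's reward; the timestep, action bounds, and clipping are assumptions, not taken from the paper.

```python
def double_integrator_model(state, action, dt=0.05):
    """Hypothetical 1-D double integrator: state = (position, velocity),
    action = acceleration. Reward is -(p^2 + a^2), as on the slide."""
    p, v = state
    a = max(-1.0, min(1.0, action))        # action bounds and dt are assumptions
    next_state = (p + v * dt, v + a * dt)  # Euler step of dp/dt = v, dv/dt = a
    return -(p ** 2 + a ** 2), next_state


# Illustrative use with the HOOT sketch above; 64 rollouts of depth 32 give the
# 2048 generative-model samples per planning step quoted in the experiments.
# planner = HOOT(double_integrator_model, horizon=32, rollouts=64)
# action = planner.plan((0.5, 0.0))
```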
D-link Swimmer [Tassa et al. 2007]
• Swim the head from the start to the goal
• For D links there are D-1 action dimensions and 2D+4 state dimensions
• The most complex version has 5 continuous action and 16 continuous state dimensions
• Difficult to get good coverage with standard RL methods
• As the dimension grows, UCT fails while HOOT improves significantly
In the interest of full disclosure
• Bad (undirected) exploration
• Theoretical analysis is difficult (nonstationarity)
• Degenerate behavior due to vMin, vMax scaling
• UCT also has these problems
Conclusions
• HOOT is a planner that operates directly in continuous action spaces
  • Local solution of the MDP means costs are independent of state size
  • No action-discretization tuning
• Coarse discretization is not good enough even in simple MDPs, even when tuned
  • Coarse discretization explodes in high dimensions, making planning almost impossible
• Future work:
  • HOOT for continuous state spaces
  • Using optimizers in place of max for continuous action RL algorithms of other forms
References
• Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 2006.
• Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47, 2002.
• Kearns, M., Mansour, Y., and Ng, A. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In IJCAI, 1999.
• Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems (NIPS), 2008.
• Santamaría, J. C., Sutton, R., and Ram, A. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6, 1998.
• Tassa, Y., Erez, T., and Smart, W. D. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS), 2007.