Reinforcement Learning: Dealing with Complexity and Safety in RL
Subramanian Ramamoorthy, School of Informatics
27 March 2012
(Why) Isn’t RL Deployed More Widely?
There is a very interesting discussion at http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning, maintained by Satinder Singh.
• Negative views/myths: RL is hard because of dimensionality, partial observability, function approximation, and so on.
• Positive view: there is no getting away from the fact that RL is the proper statement of the “agent’s problem”. So the question is really one of how to solve it!
A Provocative Claim
“The (PO)MDP frameworks are fundamentally broken, not because they are insufficiently powerful representations, but because they are too powerful. We submit that, rather than generalizing these models, we should be specializing them if we want to make progress on solving real problems in the real world.”
T. Lane and W.D. Smart, Why (PO)MDPs Lose for Spatial Tasks and What to Do About It, ICML Workshop on Rich Representations for RL, 2005.
What is the Issue? (Lane et al.)
• In our efforts to formalize the notion of “learning control”, we have striven to construct ever more general and, putatively, powerful models. By the mid-1990s we had (with a little bit of blatant “borrowing” from the Operations Research community) arrived at the (PO)MDP formalism (Puterman, 1994) and grounded our RL methods in it (Sutton & Barto, 1998; Kaelbling et al., 1996; Kaelbling et al., 1998).
• These models are mathematically elegant, have enabled precise descriptions and analysis of a wide array of RL algorithms, and are incredibly general. We argue, however, that their very generality is a hindrance in many practical cases.
• In their generality, these models have discarded the very qualities — metric, topology, scale, etc. — that have proven to be so valuable for many, many science and engineering disciplines.
What is Missing in POMDPs?
• POMDPs do not describe the natural metrics of the environment
  • When driving, we know both global and local distances
• POMDPs do not natively recognize differences between scales
  • Uncertainty in control is entirely different from uncertainty in routing
• POMDPs conflate properties of the environment with properties of the agent
  • Roads and buildings behave differently from cars and pedestrians: we need to generalize over them differently
• POMDPs are defined in a global coordinate frame, often a discrete one
  • We may need many different representations in real problems
Specific Insight #1
The metric of a space imposes a “speed limit” on the agent — the agent cannot transition to arbitrary points in the environment in a single step. Consequences:
• The agent can neglect large parts of the state space when planning (see the sketch after this slide).
• More importantly, this result implies that control experience can be generalized across regions of the state space.
• If the agent learns a good policy for one bounded region of the state space, and it can find a second bounded region that is homeomorphic to the first, the learned policy can be reused in the second region.
[Figure: metric envelope bound for point-to-point navigation in an open-space gridworld environment. The outer region is the elliptical envelope that contains 90% of the trajectory probability mass. The inner, darker region is the set of states occupied by an agent in a total of 10,000 steps of experience (319 trajectories from bottom to top).]
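The “speed limit” idea can be illustrated with a minimal sketch. This is an illustration of mine, not code from the paper: it assumes a gridworld with a Manhattan metric and a maximum speed of one cell per step, and the function names are hypothetical. Any state farther than speed × horizon from the start can simply be ignored by a planner with that horizon.

```python
# Illustrative sketch (not from the paper): in a gridworld with a Manhattan
# metric and a maximum "speed" of one cell per step, any state further than
# `horizon` cells from the start cannot be reached within `horizon` steps,
# so a planner with that horizon can ignore it entirely.

def within_metric_envelope(start, state, horizon, speed=1):
    """Return True if `state` can possibly be reached from `start` within `horizon` steps."""
    sx, sy = start
    x, y = state
    return abs(x - sx) + abs(y - sy) <= speed * horizon

def prune_states(all_states, start, horizon):
    """Keep only the states the metric 'speed limit' allows the agent to reach."""
    return [s for s in all_states if within_metric_envelope(start, s, horizon)]

grid = [(x, y) for x in range(50) for y in range(50)]
reachable = prune_states(grid, start=(0, 0), horizon=10)
print(len(reachable), "of", len(grid), "states need to be considered")
```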
Insight #2: Manifold Representations
• Informally, a manifold representation models the domain of the value function using a set of overlapping local regions, called charts.
• Each chart has a local coordinate frame, is a (topological) disk, and has a (local) Euclidean distance metric. The collection of charts and their overlap regions is called a manifold.
• We can embed partial value functions (and other models) on these charts and combine them, using the theory of manifolds, to provide a global value function (or model), as sketched below.
[Figure: 13 equivalence classes; if rotational symmetry is considered, only 4 classes.]
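As a rough illustration of the chart idea (a hypothetical API of mine, not the authors' implementation), the sketch below keeps a set of overlapping disk-shaped charts, each with its own coordinate frame and local value function, and blends the local estimates of all charts containing a query point into a single global value.

```python
# Minimal sketch of the chart/manifold idea (hypothetical API, not the
# authors' code): each chart covers a local disk-shaped region, stores a
# local value function in its own coordinates, and overlapping charts are
# blended to produce a single global value estimate.

class Chart:
    def __init__(self, center, radius, local_value_fn):
        self.center = center            # origin of the local coordinate frame
        self.radius = radius            # the chart is a (topological) disk
        self.local_value_fn = local_value_fn

    def contains(self, s):
        return sum((a - b) ** 2 for a, b in zip(s, self.center)) ** 0.5 <= self.radius

    def value(self, s):
        # express s in local coordinates before evaluating the local model
        local = tuple(a - b for a, b in zip(s, self.center))
        return self.local_value_fn(local)

def global_value(charts, s):
    """Average the local value estimates of all charts whose region contains s."""
    vals = [c.value(s) for c in charts if c.contains(s)]
    return sum(vals) / len(vals) if vals else None

# Two overlapping charts, each with a toy local value function
charts = [
    Chart((0.0, 0.0), 2.0, lambda p: 1.0 - 0.1 * (p[0] ** 2 + p[1] ** 2)),
    Chart((1.5, 0.0), 2.0, lambda p: 0.8 - 0.1 * (p[0] ** 2 + p[1] ** 2)),
]
print(global_value(charts, (1.0, 0.0)))   # point in the overlap region
```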
What Makes Some POMDP Problems Easy to Approximate?
David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007
Understanding Why PBVI Works
• Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for POMDPs.
• What are the belief-space properties that allow some POMDP problems to be approximated efficiently, and that explain the point-based algorithms’ success?
Hardness of POMDPs
• Intractability is due to the curse of dimensionality: the size of the belief space grows exponentially with the size of the state space |S| (a belief is a point in the (|S|−1)-dimensional probability simplex; see the belief update below).
• In recent years, however, good progress has been made by sampling the belief space and approximating solutions.
• Hsu et al. report solutions to POMDPs with hundreds of states computed in seconds.
• Tag problem: a robot must search for and tag a moving target whose position is unobserved except when the robot bumps into it; the belief space is roughly 870-dimensional.
  • Solved using point-based value iteration (PBVI) methods in under a minute.
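For context, the blow-up comes from the fact that a belief is a full probability distribution over states, maintained with the standard Bayes filter; the update below is textbook POMDP material rather than anything specific to the Hsu et al. paper.

```latex
% Standard POMDP belief update (textbook material, not specific to the cited paper):
% after taking action a in belief b and observing o,
\[
  b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                    {\Pr(o \mid b, a)},
  \qquad
  \Pr(o \mid b, a) \;=\; \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).
\]
% A belief is a point in the (|S|-1)-dimensional probability simplex,
% which is why the belief space grows with the number of states.
```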
Initial Observation
• Many point-based algorithms explore only a subset of the belief space B: the reachable space R(b0).
• The reachable space contains all points reachable from a given initial belief point b0 under arbitrary sequences of actions and observations (a sketch follows below).
• Is the reason for PBVI’s success that the reachable space is small?
• Not always: Tag has an approximately 860-dimensional reachable space.
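A minimal sketch of what “reachable space” means in practice, assuming a tiny hand-made POMDP with hypothetical transition and observation tables T and O: starting from b0, enumerate the beliefs produced by every short action/observation sequence using the standard belief update.

```python
# Illustrative sketch: enumerate beliefs reachable from b0 by breadth-first
# expansion over action/observation sequences, using the standard belief
# update. T[a][s][s'] and O[a][s'][o] are hypothetical tables for a tiny POMDP.

from collections import deque

def belief_update(b, a, o, T, O):
    unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(len(b)))
              for s2 in range(len(b))]
    z = sum(unnorm)
    return tuple(round(x / z, 6) for x in unnorm) if z > 0 else None

def reachable_beliefs(b0, T, O, actions, observations, max_depth=3):
    """Collect beliefs reachable from b0 within max_depth action/observation steps."""
    seen, frontier = {tuple(b0)}, deque([(tuple(b0), 0)])
    while frontier:
        b, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for a in actions:
            for o in observations:
                b2 = belief_update(b, a, o, T, O)
                if b2 is not None and b2 not in seen:
                    seen.add(b2)
                    frontier.append((b2, depth + 1))
    return seen

# Tiny 2-state, 1-action, 2-observation example (purely illustrative numbers)
T = [[[0.9, 0.1], [0.2, 0.8]]]                 # T[a][s][s']
O = [[[0.7, 0.3], [0.4, 0.6]]]                 # O[a][s'][o]
print(len(reachable_beliefs([0.5, 0.5], T, O, actions=[0], observations=[0, 1])))
```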
Covering Number
• The covering number of a space is the minimum number of balls of a given size needed to cover the space fully (a rough sketch of estimating it follows below).
• Hsu et al. show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0).
• The covering number also reveals that the belief space for Tag behaves more like a union of 29-dimensional spaces than like an 870-dimensional space, because the robot’s own position is fully observed.
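As a rough illustration of the covering-number idea (not the construction used in the paper), the sketch below greedily picks ball centers among a set of sampled belief points; the number of centers picked is an upper bound on the δ-covering number of that sample.

```python
# Illustrative sketch: a greedy upper bound on the covering number of a set
# of sampled belief points. Each "ball" has L1 radius delta; the number of
# centers picked is an upper bound on the true delta-covering number.

import random

def l1_dist(b1, b2):
    return sum(abs(x - y) for x, y in zip(b1, b2))

def greedy_cover_size(points, delta):
    """Greedily pick centers until every point lies within delta of some center."""
    centers = []
    for p in points:
        if not any(l1_dist(p, c) <= delta for c in centers):
            centers.append(p)
    return len(centers)

def random_belief(n):
    """A random point on the (n-1)-dimensional probability simplex."""
    w = [random.random() for _ in range(n)]
    z = sum(w)
    return tuple(x / z for x in w)

random.seed(0)
beliefs = [random_belief(4) for _ in range(1000)]
print(greedy_cover_size(beliefs, delta=0.3))
```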
Further Questions
• Is it possible to compute an approximate solution efficiently under the weaker condition of a small covering number for the optimal reachable space R*(b0), which contains only the points in B reachable from b0 under an optimal policy?
• Unfortunately, this problem is NP-hard. It remains NP-hard even if the optimal policies have a compact piecewise-linear representation using α-vectors.
• However, given a suitable set of points that “cover” R*(b0) well, a good approximate solution can be computed in polynomial time.
• Using sampling to approximate an optimal reachable space, rather than just the reachable space, may be a promising approach in practice.
Lyapunov Design for Safe Reinforcement Learning
Theodore J. Perkins and Andrew G. Barto, JMLR 2002
Dynamical Systems
• Dynamical systems can be described by states and the evolution of those states over time.
• The evolution of states is constrained by the dynamics of the system.
• In other words, a dynamical system is a mapping from the current state to the next state.
• If the mapping is a contraction, the state will eventually converge to a fixed point (see the toy example below).
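As a toy illustration of the last point (my own example, not tied to any particular RL system): iterating a contraction mapping drives the state to its unique fixed point.

```python
# Toy example (illustration only): the map f(x) = 0.5*x + 1 is a contraction
# with factor 0.5, so iterating it from any start converges to the unique
# fixed point x* = 2, where f(x*) = x*.

def f(x):
    return 0.5 * x + 1.0

x = 10.0
for _ in range(20):
    x = f(x)
print(x)   # approaches 2.0
```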
Reinforcement Learning – Traditional Methods
• The target or goal state may not be a natural attractor.
• Hypothesis: learning is easier if the target is a fixed point, e.g., TD-Gammon.
• People have tried to embed domain knowledge in various ways:
  • Known good actions are specified
  • Sub-goals are explicitly specified
Key Idea
• Use Lyapunov functions to constrain action selection.
• This forces the RL agent to move towards the goal.
• For example, in a gridworld, a Lyapunov-constrained agent reaches the goal in a finite number of steps (see the sketch below).
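A minimal sketch of the idea, assuming a simple gridworld and using the Manhattan distance to the goal as the Lyapunov function (an illustrative example of mine, not the paper's pendulum domain): the action set at every state is restricted to actions that strictly decrease the Lyapunov function, so even a random policy over the restricted set reaches the goal in finitely many steps.

```python
# Sketch of Lyapunov-constrained action selection in a gridworld (illustrative
# only, not the authors' code). The Lyapunov function is the Manhattan distance
# to the goal; only actions that strictly decrease it are made available, so
# any policy over the restricted set reaches the goal in at most L(s0) steps.

import random

GOAL = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def lyapunov(s):
    return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])

def step(s, a):
    dx, dy = ACTIONS[a]
    return (s[0] + dx, s[1] + dy)

def safe_actions(s):
    """Only allow actions that strictly decrease the Lyapunov function."""
    return [a for a in ACTIONS if lyapunov(step(s, a)) < lyapunov(s)]

s, steps = (0, 0), 0
while s != GOAL:
    a = random.choice(safe_actions(s))   # even a random policy is safe
    s = step(s, a)
    steps += 1
print("reached goal in", steps, "steps; L(s0) =", lyapunov((0, 0)))
```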
Problem Setup
• Deterministic dynamical system
• Evolution according to an MDP with deterministic transitions
Lyapunov Functions
• Generalized energy functions: non-negative functions of the state that decrease along the system’s trajectories (a sketch of the descent argument follows below).
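The descent argument behind this construction can be stated as follows; this is a paraphrase of the standard Lyapunov argument, not a quotation of the paper's theorems, and the notation (f for the deterministic dynamics, A'(s) for the restricted action set, G for the goal set) is my own.

```latex
% Standard Lyapunov descent argument (paraphrased, not quoted from the paper):
% if L is non-negative and every available action decreases it by at least
% some Delta > 0 outside the goal set G, then any policy over those actions
% reaches G in a bounded number of steps.
\[
  L(s) \ge 0, \qquad
  L(f(s,a)) \le L(s) - \Delta \quad \forall\, s \notin G,\ a \in A'(s)
  \;\;\Longrightarrow\;\;
  G \text{ is reached within } \left\lceil L(s_0)/\Delta \right\rceil \text{ steps.}
\]
```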
Pendulum Problem
Results 1
• A_EA and A_All had shorter trials than A_Const.
• A_EA outperformed A_All, especially at fine resolutions of discretization.
• A_EA trial times seemed independent of the binning.
• A_Const alone never worked.
Note: the theorem guarantees that A_EA monotonically increases energy.
Results 2
[Figure legend: 1: A_EA, G2; 2: A_All, G2; 3: A_Const, G2; 4: A_All + sat. LQR, G1]
Stochastic Case
Results – Stochastic Case
Some Open Questions
• How can you improve performance using less sophisticated ‘primitive’ actions?
• Perkins and Barto use deep intuition to design the local control laws, e.g., to avoid undesired gravity-control equilibria. How do we deal with this when the dynamics are less well understood?
• The stochastic cases come with rather weak guarantees. How can they be improved?