Reinforcement Learning: Dealing with Complexity and Safety in RL
Subramanian Ramamoorthy, School of Informatics
27 March 2012
(Why) Isn’t RL Deployed More Widely?
There is a very interesting discussion at http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning, maintained by Satinder Singh.
• Negative views/myths: RL is hard because of dimensionality, partial observability, function approximation, and so on.
• Positive view: there is no getting away from the fact that RL is the proper statement of the “agent’s problem”. So the question is really one of how to solve it!
A Provocative Claim
“The (PO)MDP frameworks are fundamentally broken, not because they are insufficiently powerful representations, but because they are too powerful. We submit that, rather than generalizing these models, we should be specializing them if we want to make progress on solving real problems in the real world.”
T. Lane and W.D. Smart, Why (PO)MDPs Lose for Spatial Tasks and What to Do About It, ICML Workshop on Rich Representations for RL, 2005.
What is the Issue? (Lane et al.)
• In our efforts to formalize the notion of “learning control”, we have striven to construct ever more general and, putatively, powerful models. By the mid-1990s we had (with a little bit of blatant “borrowing” from the Operations Research community) arrived at the (PO)MDP formalism (Puterman, 1994) and grounded our RL methods in it (Sutton & Barto, 1998; Kaelbling et al., 1996; Kaelbling et al., 1998).
• These models are mathematically elegant, have enabled precise descriptions and analysis of a wide array of RL algorithms, and are incredibly general. We argue, however, that their very generality is a hindrance in many practical cases.
• In their generality, these models have discarded the very qualities — metric, topology, scale, etc. — that have proven to be so valuable for many, many science and engineering disciplines.
What is Missing in POMDPs?
• POMDPs do not describe the natural metrics of the environment
  • When driving, we know both global and local distances
• POMDPs do not natively recognize differences between scales
  • Uncertainty in control is entirely different from uncertainty in routing
• POMDPs conflate properties of the environment with properties of the agent
  • Roads and buildings behave differently from cars and pedestrians: we need to generalize over them differently
• POMDPs are defined in a global coordinate frame, often a discrete one
  • We may need many different representations in real problems
Specific Insight #1
The metric of a space imposes a “speed limit” on the agent — the agent cannot transition to arbitrary points in the environment in a single step. Consequences:
• The agent can neglect large parts of the state space when planning (see the sketch after this slide).
• More importantly, this result implies that control experience can be generalized across regions of the state space.
• If the agent learns a good policy for one bounded region of the state space, and it can find a second bounded region that is homeomorphic to the first, the learned policy can be reused in the second region.
[Figure: metric envelope bound for point-to-point navigation in an open-space gridworld environment. The outer region is the elliptical envelope that contains 90% of the trajectory probability mass. The inner, darker region is the set of states occupied by an agent in a total of 10,000 steps of experience (319 trajectories from bottom to top).]
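The “speed limit” idea can be illustrated with a minimal sketch. This is an illustration of mine, not code from the paper: it assumes a gridworld with a Manhattan metric and a maximum speed of one cell per step, and the function names are hypothetical. Any state farther than speed × horizon from the start can simply be ignored by a planner with that horizon.

```python
# Illustrative sketch (not from the paper): in a gridworld with a Manhattan
# metric and a maximum "speed" of one cell per step, any state further than
# `horizon` cells from the start cannot be reached within `horizon` steps,
# so a planner with that horizon can ignore it entirely.

def within_metric_envelope(start, state, horizon, speed=1):
    """Return True if `state` can possibly be reached from `start` within `horizon` steps."""
    sx, sy = start
    x, y = state
    return abs(x - sx) + abs(y - sy) <= speed * horizon

def prune_states(all_states, start, horizon):
    """Keep only the states the metric 'speed limit' allows the agent to reach."""
    return [s for s in all_states if within_metric_envelope(start, s, horizon)]

grid = [(x, y) for x in range(50) for y in range(50)]
reachable = prune_states(grid, start=(0, 0), horizon=10)
print(len(reachable), "of", len(grid), "states need to be considered")
```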
Insight #2: Manifold Representations
• Informally, a manifold representation models the domain of the value function using a set of overlapping local regions, called charts.
• Each chart has a local coordinate frame, is a (topological) disk, and has a (local) Euclidean distance metric. The collection of charts and their overlap regions is called a manifold.
• We can embed partial value functions (and other models) on these charts and combine them, using the theory of manifolds, to provide a global value function (or model), as sketched below.
[Figure: 13 equivalence classes; if rotational symmetry is considered, only 4 classes.]
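As a rough illustration of the chart idea (a hypothetical API of mine, not the authors' implementation), the sketch below keeps a set of overlapping disk-shaped charts, each with its own coordinate frame and local value function, and blends the local estimates of all charts containing a query point into a single global value.

```python
# Minimal sketch of the chart/manifold idea (hypothetical API, not the
# authors' code): each chart covers a local disk-shaped region, stores a
# local value function in its own coordinates, and overlapping charts are
# blended to produce a single global value estimate.

class Chart:
    def __init__(self, center, radius, local_value_fn):
        self.center = center            # origin of the local coordinate frame
        self.radius = radius            # the chart is a (topological) disk
        self.local_value_fn = local_value_fn

    def contains(self, s):
        return sum((a - b) ** 2 for a, b in zip(s, self.center)) ** 0.5 <= self.radius

    def value(self, s):
        # express s in local coordinates before evaluating the local model
        local = tuple(a - b for a, b in zip(s, self.center))
        return self.local_value_fn(local)

def global_value(charts, s):
    """Average the local value estimates of all charts whose region contains s."""
    vals = [c.value(s) for c in charts if c.contains(s)]
    return sum(vals) / len(vals) if vals else None

# Two overlapping charts, each with a toy local value function
charts = [
    Chart((0.0, 0.0), 2.0, lambda p: 1.0 - 0.1 * (p[0] ** 2 + p[1] ** 2)),
    Chart((1.5, 0.0), 2.0, lambda p: 0.8 - 0.1 * (p[0] ** 2 + p[1] ** 2)),
]
print(global_value(charts, (1.0, 0.0)))   # point in the overlap region
```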
What Makes Some POMDP Problems Easy to Approximate?
David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007
Understanding Why PBVI Works
• Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for POMDPs.
• What are the belief-space properties that allow some POMDP problems to be approximated efficiently, and that explain the point-based algorithms’ success?
Hardness of POMDPs
• Intractability is due to the curse of dimensionality: the size of the belief space grows exponentially with the size of the state space |S| (a belief is a point in the (|S|−1)-dimensional probability simplex; see the belief update below).
• In recent years, however, good progress has been made by sampling the belief space and approximating solutions.
• Hsu et al. report solutions to POMDPs with hundreds of states computed in seconds.
• Tag problem: a robot must search for and tag a moving target whose position is unobserved except when the robot bumps into it; the belief space is roughly 870-dimensional.
  • Solved using point-based value iteration (PBVI) methods in under a minute.
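For context, the blow-up comes from the fact that a belief is a full probability distribution over states, maintained with the standard Bayes filter; the update below is textbook POMDP material rather than anything specific to the Hsu et al. paper.

```latex
% Standard POMDP belief update (textbook material, not specific to the cited paper):
% after taking action a in belief b and observing o,
\[
  b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                    {\Pr(o \mid b, a)},
  \qquad
  \Pr(o \mid b, a) \;=\; \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).
\]
% A belief is a point in the (|S|-1)-dimensional probability simplex,
% which is why the belief space grows with the number of states.
```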
Initial Observation
• Many point-based algorithms explore only a subset of the belief space B: the reachable space R(b0).
• The reachable space contains all points reachable from a given initial belief point b0 under arbitrary sequences of actions and observations (a sketch follows below).
• Is the reason for PBVI’s success that the reachable space is small?
• Not always: Tag has an approximately 860-dimensional reachable space.
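A minimal sketch of what “reachable space” means in practice, assuming a tiny hand-made POMDP with hypothetical transition and observation tables T and O: starting from b0, enumerate the beliefs produced by every short action/observation sequence using the standard belief update.

```python
# Illustrative sketch: enumerate beliefs reachable from b0 by breadth-first
# expansion over action/observation sequences, using the standard belief
# update. T[a][s][s'] and O[a][s'][o] are hypothetical tables for a tiny POMDP.

from collections import deque

def belief_update(b, a, o, T, O):
    unnorm = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(len(b)))
              for s2 in range(len(b))]
    z = sum(unnorm)
    return tuple(round(x / z, 6) for x in unnorm) if z > 0 else None

def reachable_beliefs(b0, T, O, actions, observations, max_depth=3):
    """Collect beliefs reachable from b0 within max_depth action/observation steps."""
    seen, frontier = {tuple(b0)}, deque([(tuple(b0), 0)])
    while frontier:
        b, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for a in actions:
            for o in observations:
                b2 = belief_update(b, a, o, T, O)
                if b2 is not None and b2 not in seen:
                    seen.add(b2)
                    frontier.append((b2, depth + 1))
    return seen

# Tiny 2-state, 1-action, 2-observation example (purely illustrative numbers)
T = [[[0.9, 0.1], [0.2, 0.8]]]                 # T[a][s][s']
O = [[[0.7, 0.3], [0.4, 0.6]]]                 # O[a][s'][o]
print(len(reachable_beliefs([0.5, 0.5], T, O, actions=[0], observations=[0, 1])))
```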
Covering Number
• The covering number of a space is the minimum number of balls of a given size needed to cover the space fully (a rough sketch of estimating it follows below).
• Hsu et al. show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0).
• The covering number also reveals that the belief space for Tag behaves more like a union of 29-dimensional spaces than like an 870-dimensional space, because the robot’s own position is fully observed.
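As a rough illustration of the covering-number idea (not the construction used in the paper), the sketch below greedily picks ball centers among a set of sampled belief points; the number of centers picked is an upper bound on the δ-covering number of that sample.

```python
# Illustrative sketch: a greedy upper bound on the covering number of a set
# of sampled belief points. Each "ball" has L1 radius delta; the number of
# centers picked is an upper bound on the true delta-covering number.

import random

def l1_dist(b1, b2):
    return sum(abs(x - y) for x, y in zip(b1, b2))

def greedy_cover_size(points, delta):
    """Greedily pick centers until every point lies within delta of some center."""
    centers = []
    for p in points:
        if not any(l1_dist(p, c) <= delta for c in centers):
            centers.append(p)
    return len(centers)

def random_belief(n):
    """A random point on the (n-1)-dimensional probability simplex."""
    w = [random.random() for _ in range(n)]
    z = sum(w)
    return tuple(x / z for x in w)

random.seed(0)
beliefs = [random_belief(4) for _ in range(1000)]
print(greedy_cover_size(beliefs, delta=0.3))
```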
Further Questions
• Is it possible to compute an approximate solution efficiently under the weaker condition of a small covering number for the optimal reachable space R*(b0), which contains only the points in B reachable from b0 under an optimal policy?
• Unfortunately, this problem is NP-hard. It remains NP-hard even if the optimal policies have a compact piecewise-linear representation using α-vectors.
• However, given a suitable set of points that “cover” R*(b0) well, a good approximate solution can be computed in polynomial time.
• Using sampling to approximate an optimal reachable space, rather than just the reachable space, may be a promising approach in practice.
Lyapunov Design for Safe Reinforcement Learning
Theodore J. Perkins and Andrew G. Barto, JMLR 2002
Dynamical Systems
• Dynamical systems can be described by states and the evolution of those states over time.
• The evolution of states is constrained by the dynamics of the system.
• In other words, a dynamical system is a mapping from the current state to the next state.
• If the mapping is a contraction, the state will eventually converge to a fixed point (see the toy example below).
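As a toy illustration of the last point (my own example, not tied to any particular RL system): iterating a contraction mapping drives the state to its unique fixed point.

```python
# Toy example (illustration only): the map f(x) = 0.5*x + 1 is a contraction
# with factor 0.5, so iterating it from any start converges to the unique
# fixed point x* = 2, where f(x*) = x*.

def f(x):
    return 0.5 * x + 1.0

x = 10.0
for _ in range(20):
    x = f(x)
print(x)   # approaches 2.0
```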
Reinforcement Learning – Traditional Methods
• The target or goal state may not be a natural attractor.
• Hypothesis: learning is easier if the target is a fixed point, e.g., TD-Gammon.
• People have tried to embed domain knowledge in various ways:
  • Known good actions are specified
  • Sub-goals are explicitly specified
Key Idea
• Use Lyapunov functions to constrain action selection.
• This forces the RL agent to move towards the goal.
• For example, in a gridworld, a Lyapunov-constrained agent reaches the goal in a finite number of steps (see the sketch below).
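A minimal sketch of the idea, assuming a simple gridworld and using the Manhattan distance to the goal as the Lyapunov function (an illustrative example of mine, not the paper's pendulum domain): the action set at every state is restricted to actions that strictly decrease the Lyapunov function, so even a random policy over the restricted set reaches the goal in finitely many steps.

```python
# Sketch of Lyapunov-constrained action selection in a gridworld (illustrative
# only, not the authors' code). The Lyapunov function is the Manhattan distance
# to the goal; only actions that strictly decrease it are made available, so
# any policy over the restricted set reaches the goal in at most L(s0) steps.

import random

GOAL = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def lyapunov(s):
    return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])

def step(s, a):
    dx, dy = ACTIONS[a]
    return (s[0] + dx, s[1] + dy)

def safe_actions(s):
    """Only allow actions that strictly decrease the Lyapunov function."""
    return [a for a in ACTIONS if lyapunov(step(s, a)) < lyapunov(s)]

s, steps = (0, 0), 0
while s != GOAL:
    a = random.choice(safe_actions(s))   # even a random policy is safe
    s = step(s, a)
    steps += 1
print("reached goal in", steps, "steps; L(s0) =", lyapunov((0, 0)))
```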
Problem Setup
• Deterministic dynamical system
• Evolution according to an MDP with deterministic transitions
Lyapunov Functions
• Generalized energy functions: non-negative functions of the state that decrease along the system’s trajectories (a sketch of the descent argument follows below).
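The descent argument behind this construction can be stated as follows; this is a paraphrase of the standard Lyapunov argument, not a quotation of the paper's theorems, and the notation (f for the deterministic dynamics, A'(s) for the restricted action set, G for the goal set) is my own.

```latex
% Standard Lyapunov descent argument (paraphrased, not quoted from the paper):
% if L is non-negative and every available action decreases it by at least
% some Delta > 0 outside the goal set G, then any policy over those actions
% reaches G in a bounded number of steps.
\[
  L(s) \ge 0, \qquad
  L(f(s,a)) \le L(s) - \Delta \quad \forall\, s \notin G,\ a \in A'(s)
  \;\;\Longrightarrow\;\;
  G \text{ is reached within } \left\lceil L(s_0)/\Delta \right\rceil \text{ steps.}
\]
```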
Pendulum Problem
Results 1
• A_EA and A_All had shorter trials than A_Const.
• A_EA outperformed A_All, especially at fine resolutions of discretization.
• A_EA trial times seemed independent of the binning.
• A_Const alone never worked.
Note: the theorem guarantees that A_EA monotonically increases energy.
Results 2
[Figure legend: 1: A_EA, G2; 2: A_All, G2; 3: A_Const, G2; 4: A_All + sat. LQR, G1]
Stochastic Case
Results – Stochastic Case
Some Open Questions
• How can you improve performance using less sophisticated ‘primitive’ actions?
• Perkins and Barto use deep intuition to design the local control laws, e.g., to avoid undesired gravity-control equilibria. How do we deal with this when the dynamics are less well understood?
• The stochastic cases come with rather weak guarantees. How can they be improved?