Structural Return Maximization for Reinforcement Learning
Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy
How should we act in the presence of complex, unknown dynamics?
What do I mean by complex dynamics?
• Can't derive from first principles / intuition
• Any dynamics model will be approximate
• Limited data
  • Otherwise just do nearest neighbors
• Batch data
  • Trying to keep it as simple as possible for now
  • Fairly straightforward to extend to active learning
How does RL solve these problems?
• Assume some representation class for:
  • Dynamics model
  • Value function
  • Policy
• Collect some data
• Find the "best" representation based on the data
How does RL solve these problems?
• The "best" representation based on the data is the one whose policy achieves the highest value (return):

  $V(\pi) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t\, r(s_t, a_t)\Big]$, with starting state $s_0 \sim p_0$, policy $a_t \sim \pi(\cdot \mid s_t)$, reward $r$, and unknown dynamics model $s_{t+1} \sim T(\cdot \mid s_t, a_t)$

• This defines the best policy…not the best representation
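To make the return concrete, here is a minimal Python sketch of estimating it by rolling a policy out in a black-box simulator. The `env.reset()`/`env.step()` interface and the `policy(s, rng)` signature are assumptions for illustration, not anything specified in the talk.

```python
import numpy as np

def discounted_return(env, policy, gamma=0.95, horizon=200, rng=None):
    """Roll out `policy` once in `env` and sum discounted rewards.

    `env` is any black-box simulator with reset() and step(a) -> (s', r, done);
    the dynamics inside `env` are never inspected, only sampled.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s, rng)
        s, r, done = env.step(a)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total
```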
…but does RL actually solve this problem?
• Policy Search
  • Policy directly parameterized by $\theta$: $\pi_\theta$
  • What we actually maximize is an empirical estimate of the return over $N$ episodes ($N$ = number of episodes):

    $\hat{V}(\pi_\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_t \gamma^t\, r_t^{(i)}, \qquad \theta^* = \arg\max_\theta \hat{V}(\pi_\theta)$
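A hedged sketch of the resulting policy-search procedure, reusing `discounted_return` from the sketch above: average the return over N episodes and keep the best candidate. The finite grid of `thetas` is my stand-in for whatever search (e.g., gradient-based) one would actually run over a continuous parameterization.

```python
import numpy as np

def empirical_value(env, policy, n_episodes=50, gamma=0.95):
    """Monte Carlo estimate of V(pi): mean discounted return over N episodes."""
    return np.mean([discounted_return(env, policy, gamma)
                    for _ in range(n_episodes)])

def policy_search(env, make_policy, thetas, n_episodes=50):
    """Maximize the *empirical* return over a finite set of candidate parameters."""
    return max(thetas,
               key=lambda th: empirical_value(env, make_policy(th), n_episodes))
```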
…but does RL actually solve this problem?
• Model-based RL
  • Dynamics model = $T_{\hat\theta}$, fit by maximum likelihood on the batch data $\mathcal{D}$:

    $\hat\theta = \arg\max_\theta \sum_{(s,a,s') \in \mathcal{D}} \log p(s' \mid s, a;\, \theta)$

  • …then act with the policy that is optimal under $T_{\hat\theta}$
• Maximizing likelihood ≠ maximizing return
• …similar story for value-based methods
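In the tabular case (my choice of setting, for concreteness), maximum-likelihood model fitting reduces to empirical transition frequencies. Note that the reward never appears anywhere in the fit, which is exactly the problem the next slide dwells on.

```python
from collections import Counter, defaultdict

def fit_mle_dynamics(transitions):
    """Maximum-likelihood tabular dynamics model from batch data.

    `transitions` is a list of (s, a, s') tuples; the MLE is just the
    empirical transition frequency for each (s, a) pair. The reward
    function never enters the fit.
    """
    counts = defaultdict(Counter)
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    return {
        (s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
        for (s, a), c in counts.items()
    }
```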
ML model selection in RL
• So why do we do it?
  • It's easy
  • It sometimes works really well
  • Intuitively it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
  • It chooses an "average" model based on the data
  • It ignores the reward function
• What do we do then?
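A toy numeric illustration of the "average model" failure (my own construction, not from the talk): an action that moves ±2 with equal probability looks, under the best-fitting deterministic model, like it moves by the average displacement of roughly 0, so the fitted model predicts a reward the true dynamics never deliver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch data: action "go" moves +2 or -2 with equal probability.
data_go = rng.choice([+2.0, -2.0], size=100)

# ML fit within a *deterministic* model class: next_state = state + displacement.
# The most likely single displacement is the sample mean, roughly 0 -- an
# "average" model that matches neither mode of the true dynamics.
mle_displacement = data_go.mean()

reward = lambda s: 1.0 if abs(s) < 1 else 0.0

# Under the ML model, "go" from s=0 lands near 0 -> predicted reward 1.0.
print("predicted reward of 'go':", reward(0.0 + mle_displacement))
# Under the true bimodal dynamics, "go" lands at +/-2 -> actual reward 0.0.
print("true expected reward of 'go':", np.mean([reward(d) for d in data_go]))
```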
Our Approach
• Model-based RL, but select the model by return rather than likelihood
• Dynamics model = $T_{\hat\theta}$, where $\hat\theta$ maximizes the empirical estimate of the return of the policy that is optimal under $T_\theta$:

  $\hat\theta = \arg\max_\theta \hat{V}\big(\pi_{T_\theta}\big)$

• We can do the same thing in a value-based setting.
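A sketch of return-based model selection, reusing `empirical_value` from above. Here `solve(theta)` is a hypothetical planner that returns the policy optimal under the candidate model $T_\theta$, and candidates are scored with fresh rollouts for simplicity; in the batch setting of this talk the return would instead be estimated from the fixed data (e.g., by importance sampling).

```python
def return_based_model_selection(env, thetas, solve, n_episodes=50):
    """Pick dynamics-model parameters by the empirical return of the induced policy.

    For each candidate theta, `solve(theta)` plans in the learned model T_theta
    and returns the resulting policy; we score that policy by its Monte Carlo
    return and keep the best-scoring theta. Likelihood never enters the
    selection criterion.
    """
    scored = {th: empirical_value(env, solve(th), n_episodes) for th in thetas}
    return max(scored, key=scored.get)
```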
…but
• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be "small"
• Small = less data?
  • Intuitively you'd think so
  • Empirical evidence from toy problems
  • But all of our guarantees rely on infinite data
• …maybe there's a way to be more concrete
What we want
• How does the representation space relate to true return?
  • …they've been doing this in classification since the 60s
• Relationship between the "size" of the representation space and the amount of data?
How to get there
• Map RL (model-based, value-based, policy search) to classification
• Empirical Risk Minimization
• Measuring function class size → bound on true risk
• Structure of function classes → structural risk minimization
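For concreteness, the kind of result this roadmap points at is Vapnik's classical VC bound (my addition; the talk's actual bound for the RL setting may differ):

```latex
% With probability at least 1 - \eta, simultaneously for every f in a
% class \mathcal{F} of VC dimension h, given n samples:
R(f) \;\le\; \hat{R}_n(f)
  + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\eta}}{n}}

% Structural risk minimization: over a nested structure
% \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots (h_1 \le h_2 \le \cdots),
% minimize the right-hand side jointly over the class index and f,
% trading empirical risk against function-class "size".
```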
Classification
• Risk = expected loss (cost) under the unknown data distribution $P$:

  $R(f) = \mathbb{E}_{(x,y) \sim P}\big[L(f(x),\, y)\big]$
Empirical Risk Minimization
• $P$ is unknown, so replace the risk with its empirical estimate over the $n$ samples ($n$ = number of samples), and minimize that:

  $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i),\, y_i), \qquad \hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$
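A minimal runnable ERM example (illustrative; the function class and data are my own): 1-D threshold classifiers under 0/1 loss, with the empirical risk averaged over the n samples.

```python
import numpy as np

def erm_threshold(X, y, thresholds):
    """Empirical risk minimization over a tiny function class:
    threshold classifiers f_t(x) = 1[x > t] with 0/1 loss.
    Returns the threshold of lowest empirical risk on (X, y)."""
    def empirical_risk(t):
        preds = (X > t).astype(int)
        return np.mean(preds != y)   # 0/1 loss averaged over the n samples
    return min(thresholds, key=empirical_risk)

# Usage sketch on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=200)
y = (X + rng.normal(scale=0.5, size=200) > 0).astype(int)   # noisy labels
t_hat = erm_threshold(X, y, thresholds=np.linspace(-2, 2, 81))
```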