Structural Return Maximization for Reinforcement Learning
Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy
How should we act in the presence of complex, unknown dynamics?
What do I mean by complex dynamics?
• Can't derive from first principles / intuition
• Any dynamics model will be approximate
• Limited data
  • Otherwise just do nearest neighbors
• Batch data
  • Trying to keep it as simple as possible for now
  • Fairly straightforward to extend to active learning
How does RL solve these problems?
• Assume some representation class for:
  • Dynamics model
  • Value function
  • Policy
• Collect some data
• Find the "best" representation based on the data
How does RL solve these problems?
• The "best" representation based on the data is the one whose policy achieves the highest value (return):

  $V(\pi) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t\, r(s_t, a_t)\Big]$, with starting state $s_0 \sim p_0$, policy $a_t \sim \pi(\cdot \mid s_t)$, reward $r$, and unknown dynamics model $s_{t+1} \sim T(\cdot \mid s_t, a_t)$

• This defines the best policy…not the best representation
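To make the return concrete, here is a minimal Python sketch of estimating it by rolling a policy out in a black-box simulator. The `env.reset()`/`env.step()` interface and the `policy(s, rng)` signature are assumptions for illustration, not anything specified in the talk.

```python
import numpy as np

def discounted_return(env, policy, gamma=0.95, horizon=200, rng=None):
    """Roll out `policy` once in `env` and sum discounted rewards.

    `env` is any black-box simulator with reset() and step(a) -> (s', r, done);
    the dynamics inside `env` are never inspected, only sampled.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s, rng)
        s, r, done = env.step(a)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total
```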
…but does RL actually solve this problem?
• Policy Search
  • Policy directly parameterized by $\theta$: $\pi_\theta$
  • What we actually maximize is an empirical estimate of the return over $N$ episodes ($N$ = number of episodes):

    $\hat{V}(\pi_\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_t \gamma^t\, r_t^{(i)}, \qquad \theta^* = \arg\max_\theta \hat{V}(\pi_\theta)$
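A hedged sketch of the resulting policy-search procedure, reusing `discounted_return` from the sketch above: average the return over N episodes and keep the best candidate. The finite grid of `thetas` is my stand-in for whatever search (e.g., gradient-based) one would actually run over a continuous parameterization.

```python
import numpy as np

def empirical_value(env, policy, n_episodes=50, gamma=0.95):
    """Monte Carlo estimate of V(pi): mean discounted return over N episodes."""
    return np.mean([discounted_return(env, policy, gamma)
                    for _ in range(n_episodes)])

def policy_search(env, make_policy, thetas, n_episodes=50):
    """Maximize the *empirical* return over a finite set of candidate parameters."""
    return max(thetas,
               key=lambda th: empirical_value(env, make_policy(th), n_episodes))
```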
…but does RL actually solve this problem?
• Model-based RL
  • Dynamics model = $T_{\hat\theta}$, fit by maximum likelihood on the batch data $\mathcal{D}$:

    $\hat\theta = \arg\max_\theta \sum_{(s,a,s') \in \mathcal{D}} \log p(s' \mid s, a;\, \theta)$

  • …then act with the policy that is optimal under $T_{\hat\theta}$
• Maximizing likelihood ≠ maximizing return
• …similar story for value-based methods
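In the tabular case (my choice of setting, for concreteness), maximum-likelihood model fitting reduces to empirical transition frequencies. Note that the reward never appears anywhere in the fit, which is exactly the problem the next slide dwells on.

```python
from collections import Counter, defaultdict

def fit_mle_dynamics(transitions):
    """Maximum-likelihood tabular dynamics model from batch data.

    `transitions` is a list of (s, a, s') tuples; the MLE is just the
    empirical transition frequency for each (s, a) pair. The reward
    function never enters the fit.
    """
    counts = defaultdict(Counter)
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    return {
        (s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
        for (s, a), c in counts.items()
    }
```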
ML model selection in RL
• So why do we do it?
  • It's easy
  • It sometimes works really well
  • Intuitively it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
  • It chooses an "average" model based on the data
  • It ignores the reward function
• What do we do then?
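A toy numeric illustration of the "average model" failure (my own construction, not from the talk): an action that moves ±2 with equal probability looks, under the best-fitting deterministic model, like it moves by the average displacement of roughly 0, so the fitted model predicts a reward the true dynamics never deliver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch data: action "go" moves +2 or -2 with equal probability.
data_go = rng.choice([+2.0, -2.0], size=100)

# ML fit within a *deterministic* model class: next_state = state + displacement.
# The most likely single displacement is the sample mean, roughly 0 -- an
# "average" model that matches neither mode of the true dynamics.
mle_displacement = data_go.mean()

reward = lambda s: 1.0 if abs(s) < 1 else 0.0

# Under the ML model, "go" from s=0 lands near 0 -> predicted reward 1.0.
print("predicted reward of 'go':", reward(0.0 + mle_displacement))
# Under the true bimodal dynamics, "go" lands at +/-2 -> actual reward 0.0.
print("true expected reward of 'go':", np.mean([reward(d) for d in data_go]))
```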
Our Approach
• Model-based RL, but select the model by return rather than likelihood
• Dynamics model = $T_{\hat\theta}$, where $\hat\theta$ maximizes the empirical estimate of the return of the policy that is optimal under $T_\theta$:

  $\hat\theta = \arg\max_\theta \hat{V}\big(\pi_{T_\theta}\big)$

• We can do the same thing in a value-based setting.
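A sketch of return-based model selection, reusing `empirical_value` from above. Here `solve(theta)` is a hypothetical planner that returns the policy optimal under the candidate model $T_\theta$, and candidates are scored with fresh rollouts for simplicity; in the batch setting of this talk the return would instead be estimated from the fixed data (e.g., by importance sampling).

```python
def return_based_model_selection(env, thetas, solve, n_episodes=50):
    """Pick dynamics-model parameters by the empirical return of the induced policy.

    For each candidate theta, `solve(theta)` plans in the learned model T_theta
    and returns the resulting policy; we score that policy by its Monte Carlo
    return and keep the best-scoring theta. Likelihood never enters the
    selection criterion.
    """
    scored = {th: empirical_value(env, solve(th), n_episodes) for th in thetas}
    return max(scored, key=scored.get)
```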
…but
• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be "small"
• Small = less data?
  • Intuitively you'd think so
  • Empirical evidence from toy problems
  • But all of our guarantees rely on infinite data
• …maybe there's a way to be more concrete
What we want
• How does the representation space relate to true return?
  • …they've been doing this in classification since the 60s
• Relationship between the "size" of the representation space and the amount of data?
How to get there
• Map RL (model-based, value-based, policy search) to classification
• Empirical Risk Minimization
• Measuring function class size → bound on true risk
• Structure of function classes → structural risk minimization
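For concreteness, the kind of result this roadmap points at is Vapnik's classical VC bound (my addition; the talk's actual bound for the RL setting may differ):

```latex
% With probability at least 1 - \eta, simultaneously for every f in a
% class \mathcal{F} of VC dimension h, given n samples:
R(f) \;\le\; \hat{R}_n(f)
  + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\eta}}{n}}

% Structural risk minimization: over a nested structure
% \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots (h_1 \le h_2 \le \cdots),
% minimize the right-hand side jointly over the class index and f,
% trading empirical risk against function-class "size".
```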
Classification
• Risk = expected loss (cost) under the unknown data distribution $P$:

  $R(f) = \mathbb{E}_{(x,y) \sim P}\big[L(f(x),\, y)\big]$
Empirical Risk Minimization
• $P$ is unknown, so replace the risk with its empirical estimate over the $n$ samples ($n$ = number of samples), and minimize that:

  $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i),\, y_i), \qquad \hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$
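A minimal runnable ERM example (illustrative; the function class and data are my own): 1-D threshold classifiers under 0/1 loss, with the empirical risk averaged over the n samples.

```python
import numpy as np

def erm_threshold(X, y, thresholds):
    """Empirical risk minimization over a tiny function class:
    threshold classifiers f_t(x) = 1[x > t] with 0/1 loss.
    Returns the threshold of lowest empirical risk on (X, y)."""
    def empirical_risk(t):
        preds = (X > t).astype(int)
        return np.mean(preds != y)   # 0/1 loss averaged over the n samples
    return min(thresholds, key=empirical_risk)

# Usage sketch on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=200)
y = (X + rng.normal(scale=0.5, size=200) > 0).astype(int)   # noisy labels
t_hat = erm_threshold(X, y, thresholds=np.linspace(-2, 2, 81))
```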