Exploration in Reinforcement Learning


Presentation Transcript


  1. Exploration in Reinforcement Learning Jeremy Wyatt Intelligent Robotics Lab School of Computer Science University of Birmingham, UK jlw@cs.bham.ac.uk www.cs.bham.ac.uk/research/robotics www.cs.bham.ac.uk/~jlw

  2. The talk in one slide
  • Optimal learning problems: how to act while learning how to act
  • We’re going to look at this while learning from rewards
  • Old heuristic: be optimistic in the face of uncertainty
  • Our method: apply the principle of optimism directly to a model of how the world works
  • Bayesian

  3. Plan
  • Reinforcement Learning
  • How to act while learning from rewards
  • An approximate algorithm
  • Results
  • Learning with structure

  4. Reinforcement Learning (RL)
  [Figure: a trajectory of states s1, s2, ..., actions a1, a2, ..., and rewards r1, ..., r9 (e.g. +3, +50, -1)]
  • Learning from punishments and rewards
  • Agent moves through the world, observing states and rewards
  • Adapts its behaviour to maximise some function of reward

  5. Reinforcement Learning (RL)
  [Figure: the same trajectory of states, actions, and rewards]
  • Let’s assume our agent acts according to some rules, called a policy, π
  • The return Rt is a measure of the long-term reward collected after time t

  6. Reinforcement Learning (RL)
  • Rt is a random variable
  • So it has an expected value in a state under a given policy
  • The RL problem is to find the optimal policy π* that maximises the expected value in every state
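
  For reference, these quantities can be written out as follows (a discount factor γ is assumed, as is standard for discounted return, though it does not appear in the transcript):

  \[ R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s \right], \qquad \pi^* = \arg\max_{\pi} V^{\pi}(s) \ \text{for every } s \]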

  7. Markov Decision Processes (MDPs)
  [Figure: a two-state MDP with actions a1 (r = 2) and a2 (r = 0)]
  • The transitions between states are uncertain
  • The probabilities depend only on the current state
  • Transition matrix P, and reward function R
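
  As a concrete illustration (the transition probabilities are invented; only the r = 2 and r = 0 rewards come from the figure), a two-state, two-action MDP can be held in a pair of NumPy arrays:

    import numpy as np

    # Hypothetical two-state, two-action MDP loosely matching the slide's figure.
    # P[a, s, s'] = probability of moving from state s to s' under action a.
    P = np.array([
        [[0.8, 0.2],   # action a1, one row per current state
         [0.1, 0.9]],
        [[0.5, 0.5],   # action a2
         [0.3, 0.7]],
    ])

    # R[a, s] = immediate reward for taking action a in state s
    # (a1 pays r = 2, a2 pays r = 0, as in the figure).
    R = np.array([
        [2.0, 2.0],
        [0.0, 0.0],
    ])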

  8. Bellman equations and bootstrapping
  [Figure: backup diagram for a state i and action a with successor states 3, 4, 5]
  • Conditional independence allows us to define the expected return V* for the optimal policy π* in terms of a recurrence relation
  • We can use the recurrence relation to bootstrap our estimate of V* in two ways
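
  The recurrence relation referred to here is the Bellman optimality equation; in its standard form,

  \[ V^*(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big] \]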

  9. Two types of bootstrapping
  [Figure: a full backup over successor states 3, 4, 5 for state i and action a, and a sample backup over (st, at, rt+1, st+1)]
  • We can bootstrap using explicit knowledge of P and R (Dynamic Programming)
  • Or we can bootstrap using samples from P and R (temporal difference learning)
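
  A minimal sketch of the two backup styles, written against the toy P and R arrays above (γ and the learning rate α are illustrative choices, not values from the talk):

    gamma, alpha = 0.95, 0.1
    V = np.zeros(P.shape[1])   # one value estimate per state

    # Dynamic-programming backup: uses P and R explicitly
    # (one sweep of the Bellman optimality backup over all states).
    def dp_backup(V):
        return np.max(R + gamma * np.einsum('ast,t->as', P, V), axis=0)

    # Temporal-difference backup: uses a single sampled transition (s, a, r, s').
    def td_backup(V, s, r, s_next):
        return V[s] + alpha * (r + gamma * V[s_next] - V[s])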

  10. Multi-agent RL: Learning to play football
  • Learning to play in a team
  • Too time-consuming to do on real robots
  • There is a well established simulator league
  • We can learn effectively from reinforcement

  11. Learning to play backgammon
  • TD(λ) learning and a backprop net with one hidden layer
  • 1,500,000 training games (self-play)
  • Equivalent in skill to the top dozen human players
  • Backgammon has ~10^20 states, so it can’t be solved using DP

  12. The Exploration problem: intuition
  • We are learning to maximise performance
  • But how should we act while learning?
  • Trade-off: exploit what we know, or explore to gain new information?
  • Optimal learning: maximise performance while learning, given your imperfect knowledge

  13. The optimal learning problem
  • If we knew P it would be easy
  • However …
  • We estimate P from observations
  • P is a random variable
  • There is a density f(P) over the space of possible MDPs
  • What is it? How does it help us solve our problem?

  14. A density over MDPs
  [Figure: the two-state MDP with actions a1 (r = 2) and a2 (r = 0), and a matrix M of transition counts]
  • Suppose we’ve wandered around for a while
  • We have a matrix M containing the transition counts
  • The density over possible P depends on M: f(P|M) is a product of Dirichlet densities
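
  A minimal sketch of this posterior, assuming the usual conjugate update of observed counts plus prior pseudo-counts (the counts and the prior of one pseudo-count per transition are illustrative, not the talk's):

    rng = np.random.default_rng(0)

    # Hypothetical transition counts M[a, s, s'] collected while wandering around,
    # plus one pseudo-count per transition as a Dirichlet prior.
    M = np.array([[[3, 1],
                   [0, 2]],
                  [[1, 4],
                   [2, 0]]])
    prior = np.ones_like(M)

    def sample_P(counts):
        # f(P|M) factors into one Dirichlet per (action, state) row, so sampling a
        # whole transition model is just sampling each row independently.
        rows = (counts + prior).reshape(-1, counts.shape[-1])
        sampled = np.array([rng.dirichlet(row) for row in rows])
        return sampled.reshape(counts.shape)

    P_sample = sample_P(M)   # one plausible transition model drawn from f(P|M)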

  15. A density over multinomials
  [Figure: the Dirichlet density over the multinomial transition probabilities for a single state-action pair, given its counts]

  16. Optimal learning formally stated
  [Figure: the two-state MDP with actions a1 (r = 2) and a2 (r = 0), and the count matrix M]
  • Given f(P|M), find the policy π that maximises the expected return
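
  Written out in the usual Bayesian formulation (a reconstruction, since the maximised quantity itself does not appear in the transcript), the objective is the value of the policy averaged over the posterior:

  \[ \max_{\pi} \int V^{\pi}_{P}(s)\, f(P \mid M)\, dP \]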

  17. Transforming the problem
  [Figure: a fragment of the information-state MDP, branching on actions a1, a2 with returns R2, R3]
  • When we evaluate the integral we get another MDP!
  • This one is defined over the space of information states
  • This space grows exponentially in the depth of the look-ahead

  18. A heuristic: Optimistic Model Selection
  • Solving the information-state space MDP is intractable
  • An old heuristic is to be optimistic in the face of uncertainty
  • So here we pick an optimistic P
  • Find V* for that P only
  • How do we pick P optimistically?

  19. Optimistic Model Selection
  • Do some DP-style bootstrapping to improve the estimated V
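
  The slides do not give the selection rule itself, so the sketch below is only one plausible reading of "pick an optimistic P": sample candidate models from f(P|M) (using sample_P and R from the earlier sketches), solve each with DP-style bootstrapping, and keep the one that promises the most value.

    def value_iteration(P, R, gamma=0.95, sweeps=100):
        # Plain DP bootstrapping: repeatedly apply the Bellman optimality backup.
        V = np.zeros(P.shape[1])
        for _ in range(sweeps):
            V = np.max(R + gamma * np.einsum('ast,t->as', P, V), axis=0)
        return V

    def optimistic_model(counts, R, n_candidates=20):
        # Hypothetical optimism rule: among sampled models, keep the one with the
        # largest summed optimal value. The talk's actual rule may well differ.
        best_P, best_V = None, None
        for _ in range(n_candidates):
            P_cand = sample_P(counts)
            V_cand = value_iteration(P_cand, R)
            if best_V is None or V_cand.sum() > best_V.sum():
                best_P, best_V = P_cand, V_cand
        return best_P, best_V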

  20. Experimental results

  21. Bayesian view: performance while learning

  22. Bayesian view: policy quality

  23. Do we really care?
  • Why solve for MDPs? While challenging, they are too simple to be useful
  • Structured representations are more powerful

  24. Model-based RL: structured models
  • Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP)
  • V is represented as a tree
  • Backups look like goal regression operators
  • Converging with the AI planning community
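
  As a small illustration of why a DBN is compact (the variables, actions, and probabilities here are invented for the example, not taken from the talk), each state variable gets its own small conditional table instead of one row per flat state:

    # Hypothetical factored model with two state variables. Each variable's next
    # value depends only on a few parents, so the model is a handful of small
    # tables rather than one large flat transition matrix.
    factored_model = {
        'door_open': {            # P(door_open' | door_open, action)
            ('closed', 'push'): {'open': 0.9, 'closed': 0.1},
            ('closed', 'wait'): {'open': 0.0, 'closed': 1.0},
            ('open',   'push'): {'open': 1.0, 'closed': 0.0},
            ('open',   'wait'): {'open': 1.0, 'closed': 0.0},
        },
        'robot_in_room': {        # P(robot_in_room' | robot_in_room, door_open, action)
            ('out', 'open',   'enter'): {'in': 0.95, 'out': 0.05},
            ('out', 'closed', 'enter'): {'in': 0.0,  'out': 1.0},
        },
    }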

  25. Structured Exploration: results

  26. Challenge: Learning with Hidden State
  [Figure: a POMDP drawn as a DBN over time: hidden states st, st+1, st+2, observations ot, ot+1, ot+2, actions at, at+1, at+2, and rewards rt+1, rt+2]
  • Learning in a POMDP, or k-Markov environment
  • Planning in POMDPs is intractable
  • Factored POMDPs look promising
  • POMDPs are the basis of the state of the art in mobile robotics
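
  For context, hidden state means the agent must track a belief over states rather than the state itself; the standard belief update after taking action a and observing o (not shown on the slide; O is the observation model) is

  \[ b'(s') \propto O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s) \]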

  27. Wrap up
  • RL is a class of problems
  • We can pose some optimal learning problems elegantly in this framework
  • Can’t be perfect, but we can do alright
  • BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability
  • General probabilistic representations are best avoided
  • How?

  28. Cumulative Discounted Return

  29. Cumulative Discounted Return

  30. Cumulative Discounted Return

  31. Policy Quality

  32. Policy Quality

  33. Policy Quality
