Exploration in Reinforcement Learning Jeremy Wyatt Intelligent Robotics Lab School of Computer Science University of Birmingham, UK jlw@cs.bham.ac.uk www.cs.bham.ac.uk/research/robotics www.cs.bham.ac.uk/~jlw
The talk in one slide • Optimal learning problems: how to act while learning how to act • We're going to look at this while learning from rewards • Old heuristic: be optimistic in the face of uncertainty • Our method: apply the principle of optimism directly to a Bayesian model of how the world works
Plan • Reinforcement Learning • How to act while learning from rewards • An approximate algorithm • Results • Learning with structure
Reinforcement Learning (RL) • Learning from punishments and rewards • Agent moves through world, observing states and rewards • Adapts its behaviour to maximise some function of reward [Figure: a trajectory of states s1…s9, actions a1…a9 and rewards r1…r9 (e.g. +3, −1, −1, +50)]
Reinforcement Learning (RL) • Let's assume our agent acts according to some rules, called a policy, π • The return Rt is a measure of the long-term reward collected after time t
Reinforcement Learning (RL) • Rt is a random variable • So it has an expected value in a state under a given policy • The RL problem is to find the optimal policy π* that maximises this expected value in every state (see the formulas below)
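For concreteness, the return and value function can be written as follows. This is a standard discounted-return formulation; the discount factor γ is an assumption, since the slides do not state which return the talk uses.

```latex
% Return after time t (discounted formulation; the discount factor \gamma
% is an assumption, as the slides do not specify the return used)
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

% Expected return of state s under policy \pi, and the RL objective
V^{\pi}(s) = \mathbb{E}\left[ R_t \mid s_t = s, \pi \right],
\qquad
V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for every state } s
```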
Markov Decision Processes (MDPs) • The transitions between states are uncertain • The probabilities depend only on the current state and the action taken • Transition matrix P, and reward function R (sketched in code below) [Figure: a two-state MDP with actions a1 (r = 2) and a2 (r = 0)]
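A minimal sketch of such a model in code, using the two-state MDP from the figure. The figure's exact transition probabilities are not recoverable here, so the numbers below are illustrative placeholders only.

```python
import numpy as np

n_states, n_actions = 2, 2

# P[a, s, s'] = probability of moving from s to s' under action a
# (illustrative values; the slide's figure does not give the probabilities)
P = np.array([
    [[0.8, 0.2],    # action a1, from state 1 and state 2
     [0.1, 0.9]],
    [[0.5, 0.5],    # action a2, from state 1 and state 2
     [0.3, 0.7]],
])

# R[a, s] = immediate reward for taking action a in state s
# (the figure labels a1 with r = 2 and a2 with r = 0)
R = np.array([
    [2.0, 2.0],     # a1
    [0.0, 0.0],     # a2
])

assert np.allclose(P.sum(axis=2), 1.0)  # each transition row is a distribution
```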
Bellman equations and bootstrapping • Conditional independence allows us to define the expected return V* for the optimal policy π* in terms of a recurrence relation: $V^*(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$ • We can use the recurrence relation to bootstrap our estimate of V* in two ways
Two types of bootstrapping • We can bootstrap using explicit knowledge of P and R (Dynamic Programming) • Or we can bootstrap using samples from P and R (temporal difference learning), as in the sketch below
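The sketch below contrasts the two styles on the illustrative two-state model from above: a value-iteration sweep that uses P and R directly, and a TD(0) update that uses only a sampled transition. The model, discount factor and step size are assumptions, not values from the talk.

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s'] (illustrative)
R = np.array([[2.0, 2.0], [0.0, 0.0]])     # R[a, s]     (illustrative)
gamma = 0.9

# 1) Dynamic Programming: one value-iteration sweep uses P and R explicitly.
def dp_sweep(V):
    Q = R + gamma * (P @ V)     # Q[a, s] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    return Q.max(axis=0)

# 2) Temporal-difference learning: one TD(0) update uses a single sampled
#    transition (s, a, r, s') instead of the full model.
def td_update(V, s, r, s_next, alpha=0.1):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = np.zeros(2)
for _ in range(100):
    V = dp_sweep(V)              # DP: converges towards V*

V_td = np.zeros(2)
rng = np.random.default_rng(0)
s = 0
for _ in range(1000):            # TD: learns from sampled experience
    a = rng.integers(2)          # here, a random exploration policy
    s_next = rng.choice(2, p=P[a, s])
    V_td = td_update(V_td, s, R[a, s], s_next)
    s = s_next
```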
Multi-agent RL: Learning to play football • Learning to play in a team • Too time consuming to do on real robots • There is a well established simulator league • We can learn effectively from reinforcement
Learning to play backgammon • TD(λ) learning and a Backprop net with one hidden layer • 1,500,000 training games (self play) • Equivalent in skill to the top dozen human players • Backgammon has ~10^20 states, so can't be solved using DP
The Exploration problem: intuition • We are learning to maximise performance • But how should we act while learning? • Trade-off: exploit what we know or explore to gain new information? • Optimal learning: maximise performance while learning given your imperfect knowledge
The optimal learning problem • If we knew P it would be easy • However … • We estimate P from observations • P is a random variable • There is a density f(P) over the space of possible MDPs • What is it? How does it help us solve our problem?
A density over MDPs • Suppose we've wandered around for a while • We have a matrix M containing the transition counts • The density over possible P depends on M: f(P|M) is a product of Dirichlet densities [Figure: the two-state MDP with actions a1 (r = 2), a2 (r = 0) and its matrix of transition counts]
A density over multinomials [Figure: the Dirichlet density over the multinomial transition probabilities for a single state–action pair, given the observed counts]
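As a sketch of what f(P|M) looks like for a single state–action pair, the snippet below forms the Dirichlet posterior from observed counts. The uniform Dirichlet(1, 1) prior and the counts themselves are assumptions for illustration; the full f(P|M) is the product of one such density per state–action pair.

```python
import numpy as np

counts = np.array([1.0, 0.0])        # row of M: transitions seen from one (s, a)
prior = np.array([1.0, 1.0])         # Dirichlet pseudo-counts (assumed uniform prior)

alpha = prior + counts               # posterior over this row is Dirichlet(alpha)
posterior_mean = alpha / alpha.sum() # expected transition probabilities

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, 5)    # five plausible transition rows under f(P|M)

print(posterior_mean)                # [0.667 0.333] with these counts
print(samples)
```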
Optimal learning formally stated • Given f(P|M), find the policy π that maximises the expected return under the posterior, $\int \mathbb{E}\left[ R_t \mid \pi, P \right] f(P \mid M)\, dP$ [Figure: the two-state MDP with its matrix of transition counts]
Transforming the problem • When we evaluate the integral we get another MDP! • This one is defined over the space of information states • This space grows exponentially in the depth of the look-ahead [Figure: a fragment of the information-state tree branching on actions a1 and a2]
A heuristic: Optimistic Model Selection • Solving the information-state MDP is intractable • An old heuristic is to be optimistic in the face of uncertainty • So here we pick an optimistic P • Find V* for that P only • How do we pick P optimistically?
Optimistic Model Selection • Do some DP-style bootstrapping to improve the estimated V (one possible scheme is sketched below)
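The slide does not preserve the details of the selection rule, so the following is only a hedged sketch of one way such a scheme could look: tilt each Dirichlet posterior row towards the successor state that currently looks best, then interleave this with DP-style value-iteration sweeps. The bonus parameter, prior and counts are all assumptions, not the talk's method.

```python
import numpy as np

def optimistic_row(counts_sa, prior, V, bonus=1.0):
    """One optimistic transition row for a single (s, a) pair.

    counts_sa : observed transition counts from (s, a) to each successor
    prior     : Dirichlet pseudo-counts (assumption: uniform prior)
    V         : current value estimates over successor states
    bonus     : extra pseudo-count given to the best-looking successor
    """
    alpha = prior + counts_sa
    alpha[np.argmax(V)] += bonus      # optimism: favour the best successor
    return alpha / alpha.sum()        # mean of the tilted Dirichlet

def solve_optimistically(counts, R, gamma=0.9, sweeps=100):
    n_actions, n_states, _ = counts.shape
    prior = np.ones(n_states)
    V = np.zeros(n_states)
    for _ in range(sweeps):
        # Rebuild the optimistic model P_opt[a, s, s'] around the current V,
        # then do one DP-style bootstrapping sweep on it.
        P_opt = np.array([[optimistic_row(counts[a, s], prior, V)
                           for s in range(n_states)]
                          for a in range(n_actions)])
        Q = R + gamma * (P_opt @ V)
        V = Q.max(axis=0)
    return V

counts = np.array([[[1., 0.], [0., 1.]],   # illustrative transition counts M
                   [[0., 1.], [1., 0.]]])
R = np.array([[2.0, 2.0], [0.0, 0.0]])     # illustrative rewards R[a, s]
print(solve_optimistically(counts, R))
```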
Do we really care? • Why solve MDPs? While challenging, they are too simple to be useful • Structured representations are more powerful
Model-based RL: structured models • Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP) • V is represented as a tree • Backups look like goal regression operators • Converging with the AI planning community
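As a rough illustration (not the representation used in the talk), a factored transition model stores one conditional distribution per state variable, conditioned only on that variable's parents in the Dynamic Bayes Net, instead of one row per flat state. The variable names and numbers below are hypothetical.

```python
# Hypothetical factored transition model for a single action "a1".
# Each state variable is boolean; each entry maps its parents' values
# to P(variable becomes True). All names and numbers are made up.
factored_P_a1 = {
    "has_key":   {"parents": ("has_key",),
                  "table": {(False,): 0.1, (True,): 1.0}},
    "door_open": {"parents": ("has_key", "door_open"),
                  "table": {(False, False): 0.0, (True, False): 0.8,
                            (False, True): 1.0, (True, True): 1.0}},
}

def p_next_true(var, state, model):
    """P(var = True at the next step) given the current factored state (a dict)."""
    spec = model[var]
    parent_values = tuple(state[p] for p in spec["parents"])
    return spec["table"][parent_values]

state = {"has_key": True, "door_open": False}
print(p_next_true("door_open", state, factored_P_a1))  # 0.8
```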
Challenge: Learning with Hidden State • Learning in a POMDP, or k-Markov environment • Planning in POMDPs is intractable • Factored POMDPs look promising • POMDPs are the basis of the state of the art in mobile robotics [Figure: a POMDP influence diagram over states st, observations ot, actions at and rewards rt]
Wrap up • RL is a class of problems • We can pose some optimal learning problems elegantly in this framework • Can't be perfect, but we can do alright • BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability • General probabilistic representations are best avoided • How?