Exploration in Reinforcement Learning Jeremy Wyatt Intelligent Robotics Lab School of Computer Science University of Birmingham, UK jlw@cs.bham.ac.uk www.cs.bham.ac.uk/research/robotics www.cs.bham.ac.uk/~jlw
The talk in one slide • Optimal learning problems: how to act while learning how to act • We're going to look at this while learning from rewards • Old heuristic: be optimistic in the face of uncertainty • Our method: apply the principle of optimism directly to a Bayesian model of how the world works
Plan • Reinforcement Learning • How to act while learning from rewards • An approximate algorithm • Results • Learning with structure
Reinforcement Learning (RL) • Learning from punishments and rewards • Agent moves through world, observing states and rewards • Adapts its behaviour to maximise some function of reward [Figure: a trajectory of states s1…s9, actions a1…a9 and rewards r1…r9 (e.g. +3, −1, −1, +50)]
Reinforcement Learning (RL) • Let's assume our agent acts according to some rules, called a policy, π • The return Rt is a measure of the long-term reward collected after time t
Reinforcement Learning (RL) • Rt is a random variable • So it has an expected value in a state under a given policy • The RL problem is to find the optimal policy π* that maximises this expected value in every state (see the formulas below)
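For concreteness, the return and value function can be written as follows. This is a standard discounted-return formulation; the discount factor γ is an assumption, since the slides do not state which return the talk uses.

```latex
% Return after time t (discounted formulation; the discount factor \gamma
% is an assumption, as the slides do not specify the return used)
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

% Expected return of state s under policy \pi, and the RL objective
V^{\pi}(s) = \mathbb{E}\left[ R_t \mid s_t = s, \pi \right],
\qquad
V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for every state } s
```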
Markov Decision Processes (MDPs) • The transitions between states are uncertain • The probabilities depend only on the current state and the action taken • Transition matrix P, and reward function R (sketched in code below) [Figure: a two-state MDP with actions a1 (r = 2) and a2 (r = 0)]
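A minimal sketch of such a model in code, using the two-state MDP from the figure. The figure's exact transition probabilities are not recoverable here, so the numbers below are illustrative placeholders only.

```python
import numpy as np

n_states, n_actions = 2, 2

# P[a, s, s'] = probability of moving from s to s' under action a
# (illustrative values; the slide's figure does not give the probabilities)
P = np.array([
    [[0.8, 0.2],    # action a1, from state 1 and state 2
     [0.1, 0.9]],
    [[0.5, 0.5],    # action a2, from state 1 and state 2
     [0.3, 0.7]],
])

# R[a, s] = immediate reward for taking action a in state s
# (the figure labels a1 with r = 2 and a2 with r = 0)
R = np.array([
    [2.0, 2.0],     # a1
    [0.0, 0.0],     # a2
])

assert np.allclose(P.sum(axis=2), 1.0)  # each transition row is a distribution
```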
Bellman equations and bootstrapping • Conditional independence allows us to define the expected return V* for the optimal policy π* in terms of a recurrence relation: $V^*(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$ • We can use the recurrence relation to bootstrap our estimate of V* in two ways
Two types of bootstrapping • We can bootstrap using explicit knowledge of P and R (Dynamic Programming) • Or we can bootstrap using samples from P and R (temporal difference learning), as in the sketch below
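The sketch below contrasts the two styles on the illustrative two-state model from above: a value-iteration sweep that uses P and R directly, and a TD(0) update that uses only a sampled transition. The model, discount factor and step size are assumptions, not values from the talk.

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s'] (illustrative)
R = np.array([[2.0, 2.0], [0.0, 0.0]])     # R[a, s]     (illustrative)
gamma = 0.9

# 1) Dynamic Programming: one value-iteration sweep uses P and R explicitly.
def dp_sweep(V):
    Q = R + gamma * (P @ V)     # Q[a, s] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    return Q.max(axis=0)

# 2) Temporal-difference learning: one TD(0) update uses a single sampled
#    transition (s, a, r, s') instead of the full model.
def td_update(V, s, r, s_next, alpha=0.1):
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

V = np.zeros(2)
for _ in range(100):
    V = dp_sweep(V)              # DP: converges towards V*

V_td = np.zeros(2)
rng = np.random.default_rng(0)
s = 0
for _ in range(1000):            # TD: learns from sampled experience
    a = rng.integers(2)          # here, a random exploration policy
    s_next = rng.choice(2, p=P[a, s])
    V_td = td_update(V_td, s, R[a, s], s_next)
    s = s_next
```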
Multi-agent RL: Learning to play football • Learning to play in a team • Too time consuming to do on real robots • There is a well established simulator league • We can learn effectively from reinforcement
Learning to play backgammon • TD(λ) learning and a Backprop net with one hidden layer • 1,500,000 training games (self play) • Equivalent in skill to the top dozen human players • Backgammon has ~10^20 states, so can't be solved using DP
The Exploration problem: intuition • We are learning to maximise performance • But how should we act while learning? • Trade-off: exploit what we know or explore to gain new information? • Optimal learning: maximise performance while learning given your imperfect knowledge
The optimal learning problem • If we knew P it would be easy • However … • We estimate P from observations • P is a random variable • There is a density f(P) over the space of possible MDPs • What is it? How does it help us solve our problem?
A density over MDPs • Suppose we've wandered around for a while • We have a matrix M containing the transition counts • The density over possible P depends on M: f(P|M) is a product of Dirichlet densities [Figure: the two-state MDP with actions a1 (r = 2), a2 (r = 0) and its matrix of transition counts]
A density over multinomials [Figure: the Dirichlet density over the multinomial transition probabilities for a single state–action pair, given the observed counts]
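As a sketch of what f(P|M) looks like for a single state–action pair, the snippet below forms the Dirichlet posterior from observed counts. The uniform Dirichlet(1, 1) prior and the counts themselves are assumptions for illustration; the full f(P|M) is the product of one such density per state–action pair.

```python
import numpy as np

counts = np.array([1.0, 0.0])        # row of M: transitions seen from one (s, a)
prior = np.array([1.0, 1.0])         # Dirichlet pseudo-counts (assumed uniform prior)

alpha = prior + counts               # posterior over this row is Dirichlet(alpha)
posterior_mean = alpha / alpha.sum() # expected transition probabilities

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, 5)    # five plausible transition rows under f(P|M)

print(posterior_mean)                # [0.667 0.333] with these counts
print(samples)
```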
Optimal learning formally stated • Given f(P|M), find the policy π that maximises the expected return under the posterior, $\int \mathbb{E}\left[ R_t \mid \pi, P \right] f(P \mid M)\, dP$ [Figure: the two-state MDP with its matrix of transition counts]
Transforming the problem • When we evaluate the integral we get another MDP! • This one is defined over the space of information states • This space grows exponentially in the depth of the look-ahead [Figure: a fragment of the information-state tree branching on actions a1 and a2]
A heuristic: Optimistic Model Selection • Solving the information-state MDP is intractable • An old heuristic is to be optimistic in the face of uncertainty • So here we pick an optimistic P • Find V* for that P only • How do we pick P optimistically?
Optimistic Model Selection • Do some DP-style bootstrapping to improve the estimated V (one possible scheme is sketched below)
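The slide does not preserve the details of the selection rule, so the following is only a hedged sketch of one way such a scheme could look: tilt each Dirichlet posterior row towards the successor state that currently looks best, then interleave this with DP-style value-iteration sweeps. The bonus parameter, prior and counts are all assumptions, not the talk's method.

```python
import numpy as np

def optimistic_row(counts_sa, prior, V, bonus=1.0):
    """One optimistic transition row for a single (s, a) pair.

    counts_sa : observed transition counts from (s, a) to each successor
    prior     : Dirichlet pseudo-counts (assumption: uniform prior)
    V         : current value estimates over successor states
    bonus     : extra pseudo-count given to the best-looking successor
    """
    alpha = prior + counts_sa
    alpha[np.argmax(V)] += bonus      # optimism: favour the best successor
    return alpha / alpha.sum()        # mean of the tilted Dirichlet

def solve_optimistically(counts, R, gamma=0.9, sweeps=100):
    n_actions, n_states, _ = counts.shape
    prior = np.ones(n_states)
    V = np.zeros(n_states)
    for _ in range(sweeps):
        # Rebuild the optimistic model P_opt[a, s, s'] around the current V,
        # then do one DP-style bootstrapping sweep on it.
        P_opt = np.array([[optimistic_row(counts[a, s], prior, V)
                           for s in range(n_states)]
                          for a in range(n_actions)])
        Q = R + gamma * (P_opt @ V)
        V = Q.max(axis=0)
    return V

counts = np.array([[[1., 0.], [0., 1.]],   # illustrative transition counts M
                   [[0., 1.], [1., 0.]]])
R = np.array([[2.0, 2.0], [0.0, 0.0]])     # illustrative rewards R[a, s]
print(solve_optimistically(counts, R))
```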
Do we really care? • Why solve MDPs? While challenging, they are too simple to be useful • Structured representations are more powerful
Model-based RL: structured models • Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP) • V is represented as a tree • Backups look like goal regression operators • Converging with the AI planning community
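As a rough illustration (not the representation used in the talk), a factored transition model stores one conditional distribution per state variable, conditioned only on that variable's parents in the Dynamic Bayes Net, instead of one row per flat state. The variable names and numbers below are hypothetical.

```python
# Hypothetical factored transition model for a single action "a1".
# Each state variable is boolean; each entry maps its parents' values
# to P(variable becomes True). All names and numbers are made up.
factored_P_a1 = {
    "has_key":   {"parents": ("has_key",),
                  "table": {(False,): 0.1, (True,): 1.0}},
    "door_open": {"parents": ("has_key", "door_open"),
                  "table": {(False, False): 0.0, (True, False): 0.8,
                            (False, True): 1.0, (True, True): 1.0}},
}

def p_next_true(var, state, model):
    """P(var = True at the next step) given the current factored state (a dict)."""
    spec = model[var]
    parent_values = tuple(state[p] for p in spec["parents"])
    return spec["table"][parent_values]

state = {"has_key": True, "door_open": False}
print(p_next_true("door_open", state, factored_P_a1))  # 0.8
```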
Challenge: Learning with Hidden State • Learning in a POMDP, or k-Markov environment • Planning in POMDPs is intractable • Factored POMDPs look promising • POMDPs are the basis of the state of the art in mobile robotics [Figure: a POMDP influence diagram over states st, observations ot, actions at and rewards rt]
Wrap up • RL is a class of problems • We can pose some optimal learning problems elegantly in this framework • Can't be perfect, but we can do alright • BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability • General probabilistic representations are best avoided • How?