Reinforcement Learning Ata Kaban A.Kaban@cs.bham.ac.uk School of Computer Science University of Birmingham
Learning by reinforcement • Examples: • Learning to play Backgammon • Robot learning to dock on battery charger • Characteristics: • No direct training examples – delayed rewards instead • Need for exploration & exploitation • The environment is stochastic and unknown • The actions of the learner affect future rewards
Brief history & successes • Minsky’s PhD thesis (1954): Stochastic Neural-Analog Reinforcement Computer • Analogies with animal learning and psychology • TD-Gammon (Tesauro, 1992) – big success story • Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997) • Robotic soccer (Stone and Veloso, 1998) – part of the world-champion approach • ‘An approximate solution to a complex problem can be better than a perfect solution to a simplified problem’
The RL problem • States s ∈ S • Actions a ∈ A • Immediate rewards r(s, a) • Eventual (discounted) reward V = rt + γ rt+1 + γ² rt+2 + … • Discount factor 0 ≤ γ < 1 • Goal: maximise the eventual reward from any starting state
Markov Decision Process (MDP) • MDP is a formal model of the RL problem • At each discrete time point • Agent observes state st and chooses action at • Receives reward rt from the environment and the state changes to st+1 • Markov assumption: rt = r(st, at) and st+1 = δ(st, at), i.e. rt and st+1 depend only on the current state and action • In general, the functions r and δ may not be deterministic and are not necessarily known to the agent
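A minimal sketch of how such a deterministic MDP could be written down in Python. The states, actions, rewards and transitions below are invented purely for illustration; they are not from the slides.

# A toy deterministic MDP: reward function r(s, a) and transition
# function delta(s, a), stored as lookup tables. Values are illustrative.
states = ["s1", "s2", "s3"]
actions = ["left", "right"]

r = {("s1", "right"): 0, ("s2", "right"): 100, ("s3", "right"): 0,
     ("s1", "left"): 0, ("s2", "left"): 0, ("s3", "left"): 0}
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "s3",
         ("s1", "left"): "s1", ("s2", "left"): "s1", ("s3", "left"): "s2"}

def step(s, a):
    """One interaction with the environment: immediate reward and next state."""
    return r[(s, a)], delta[(s, a)]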
Agent’s Learning Task Execute actions in the environment, observe the results and • Learn an action policy π : S → A that maximises the eventual reward rt + γ rt+1 + γ² rt+2 + … from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards • Note: • Target function is π : S → A • There are no training examples of the form (s, a) but only of the form ((s, a), r)
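A short sketch of how the discounted eventual reward is computed for a given reward sequence; the numbers and the choice γ = 0.9 are illustrative assumptions, not from the slides.

# Discounted return of a reward sequence r_t, r_{t+1}, r_{t+2}, ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 0, 100]))  # 0 + 0.9*0 + 0.81*100 = 81.0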
Example: TD-Gammon • Immediate reward: • +100 if win • -100 if lose • 0 for all other states • Trained by playing 1.5 million games against itself • Now approximately equal to the best human player
Example: Mountain-Car • States: position and velocity • Actions: accelerate forward, accelerate backward, coast • Rewards (two possible choices): • Reward = -1 for every step, until the car reaches the top • Reward = 1 at the top, 0 otherwise, with γ < 1 • Either way, the eventual reward is maximised by minimising the number of steps to the top of the hill
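A minimal sketch of one step of a mountain-car environment. The slide does not give the dynamics; the constants below follow the classic Sutton & Barto formulation and should be read as an assumption, together with the "-1 per step" reward variant.

import math

def step(position, velocity, action):        # action in {-1, 0, +1}
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    done = position >= 0.5                    # reached the top of the hill
    reward = -1.0                             # "-1 for every step" variant
    return position, velocity, reward, done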
Q Learning algorithm (in deterministic worlds) • For each (s, a) initialise the table entry Q̂(s, a) := 0 • Observe current state s • Do forever: • Select an action a and execute it • Receive immediate reward r • Observe new state s’ • Update the table entry: Q̂(s, a) := r + γ maxa’ Q̂(s’, a’) • s := s’
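A hedged, self-contained sketch of this algorithm in Python. The tiny two-state world (its rewards and transitions) is made up for illustration, the "do forever" loop is truncated to a fixed number of iterations, and actions are chosen uniformly at random for simplicity.

import random

states, actions, gamma = ["A", "B"], ["stay", "go"], 0.9
r     = {("A", "go"): 0, ("A", "stay"): 0, ("B", "go"): 100, ("B", "stay"): 0}
delta = {("A", "go"): "B", ("A", "stay"): "A", ("B", "go"): "A", ("B", "stay"): "B"}

Q = {(s, a): 0.0 for s in states for a in actions}   # initialise table entries
s = "A"                                              # observe current state
for _ in range(1000):                                # "do forever" (truncated)
    a = random.choice(actions)                       # select an action and execute it
    reward, s_next = r[(s, a)], delta[(s, a)]        # receive r, observe new state s'
    Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)  # update rule
    s = s_next                                       # s := s'

print(Q)   # learned estimates of Q(s, a)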
Example: updating Q̂ given the Q̂ values from a previous iteration (shown on the arrows of the state diagram)
Exploration versus Exploitation • The Q-learning algorithm doesn’t say how we should choose an action • If we always choose the action that maximises our current estimate of Q, we may never explore better alternatives • To converge on the true Q values we must favour higher estimated Q values but still leave a chance of choosing actions with worse estimates, for exploration (see the convergence proof of the Q-learning algorithm in [Mitchell, sec. 13.3.4]). An action selection function of the following form may be employed, where k > 0: P(ai | s) = k^Q̂(s, ai) / Σj k^Q̂(s, aj)
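A sketch of this probabilistic action selection in Python; the default value k = 2.0 is an arbitrary illustrative choice (larger k exploits the current estimates more strongly, k closer to 1 explores more).

import random

def select_action(Q, s, actions, k=2.0):
    # P(a_i | s) proportional to k ** Q(s, a_i)
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]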
Summary • Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance • The goal of a reinforcement learning program is to maximise the eventual reward • Q-learning is a form of reinforcement learning that doesn’t require the learner to have prior knowledge of how its actions affect the environment
Further topics: Nondeterministic case • What if the reward and the state transition are not deterministic? – e.g. in Backgammon, learning and playing depend on rolls of the dice! • Then V and Q need to be redefined by taking expected values • Similar reasoning applies, and a convergent update iteration still exists • To be continued next week.
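As a preview, a hedged sketch of the standard nondeterministic Q-learning update (as in Mitchell, ch. 13): each table entry is moved only partially toward the new sample, using a learning rate alpha. The values gamma = 0.9 and alpha = 0.1 are illustrative assumptions.

def q_update(Q, s, a, reward, s_next, actions, gamma=0.9, alpha=0.1):
    # Blend the old estimate with the new sampled estimate of the expected value.
    sample = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample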