Reinforcement Learning (I.) Ata Kaban A.Kaban@cs.bham.ac.uk School of Computer Science University of Birmingham
Learning by reinforcement • State-action rewards • Markov Decision Process • Policies and Value functions • Q-learning
Learning by reinforcement • Examples: • Learning to play Backgammon • Robot learning to dock on battery charger • Characteristics: • No direct training examples – delayed rewards instead • Need for exploration & exploitation • The environment is stochastic and unknown • The actions of the learner affect future rewards
Supervised Learning vs. Reinforcement Learning (comparison diagram)
Brief history & successes • Minsky’s PhD thesis (1954): Stochastic Neural-Analog Reinforcement Computer • Analogies with animal learning and psychology • TD-Gammon (Tesauro, 1992) – big success story • Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997) • Robotic soccer (Stone and Veloso, 1998) – part of the world-champion approach • ‘An approximate solution to a complex problem can be better than a perfect solution to a simplified problem’
The RL problem • States s ∈ S • Actions a ∈ A • Immediate rewards r(s,a) • Eventual reward rt + γrt+1 + γ²rt+2 + … • Discount factor 0 ≤ γ < 1 • Goal: maximise the eventual reward from any starting state
Markov Decision Process (MDP) • MDP is a formal model of the RL problem • At each discrete time point • Agent observes state st and chooses action at • Receives reward rt from the environment and the state changes to st+1 • Markov assumption: rt = r(st,at), st+1 = δ(st,at), i.e. rt and st+1 depend only on the current state and action • In general, the functions r and δ may not be deterministic and are not necessarily known to the agent
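To make the interaction loop concrete, here is a minimal Python sketch of an agent acting in a deterministic MDP. The corridor world, the random action choice and the function names delta and r are illustrative assumptions, not part of the lecture material.

```python
# A minimal sketch of the agent-environment loop in a deterministic MDP.
# The corridor world and the random policy are illustrative placeholders.
import random

N_STATES = 5                      # states 0..4; state 4 is an absorbing goal
ACTIONS = [-1, +1]                # move left / move right

def delta(s, a):
    """Deterministic transition function: s_{t+1} = delta(s_t, a_t)."""
    return max(0, min(N_STATES - 1, s + a))

def r(s, a):
    """Immediate reward r_t = r(s_t, a_t): 100 on entering the goal, else 0."""
    return 100 if delta(s, a) == N_STATES - 1 else 0

s = 0
for t in range(20):               # discrete time points
    a = random.choice(ACTIONS)    # the agent chooses an action (randomly here)
    reward, s_next = r(s, a), delta(s, a)
    print(f"t={t}: s={s}, a={a:+d}, r={reward}, s'={s_next}")
    if s_next == N_STATES - 1:    # episode ends at the goal
        break
    s = s_next
```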
Agent’s Learning Task Execute actions in the environment, observe the results, and • Learn an action policy π : S → A that maximises E[rt + γrt+1 + γ²rt+2 + …] from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards • Note: • Target function is π : S → A • There are no training examples of the form (s,a) but only of the form ((s,a),r)
Example: TD-Gammon • Immediate reward: +100 if win -100 if lose 0 for all other states • Trained by playing 1.5 million games against itself • Now approximately equal to the best human player
Example: Mountain-Car States: position and velocity Actions: accelerate forward, accelerate backward, coast Rewards (two possible formulations): reward = −1 for every step until the car reaches the top; or reward = 1 at the top and 0 otherwise, with γ < 1 In either case the eventual reward is maximised by minimising the number of steps to the top of the hill
Value function We will consider deterministic worlds first • Given a policy π (adopted by the agent), define an evaluation function over states: Vπ(st) ≡ rt + γrt+1 + γ²rt+2 + … = Σi≥0 γ^i rt+i • Property: Vπ(st) = rt + γVπ(st+1)
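As a small illustration of this definition (the reward sequence below is made up, not from the slides), the following snippet computes the discounted sum and checks the recursive property Vπ(st) = rt + γVπ(st+1):

```python
# Discounted return of an illustrative (finite) reward sequence, and a check
# of the recursive property V(s_t) = r_t + gamma * V(s_{t+1}).
gamma = 0.9
rewards = [0, 0, 0, 100]                      # r_t, r_{t+1}, r_{t+2}, r_{t+3}

def value(rs, gamma):
    """Discounted sum r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma**i * r for i, r in enumerate(rs))

v_t  = value(rewards, gamma)                  # value at time t
v_t1 = value(rewards[1:], gamma)              # value at time t+1
assert abs(v_t - (rewards[0] + gamma * v_t1)) < 1e-12
print(v_t)                                    # 0.9**3 * 100 = 72.9
```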
Example Grid world environment • Six possible states • Arrows represent possible actions • G: goal state One optimal policy – denoted π* What is the best thing to do when in each state? Compute the values of the states for this policy – denoted V*
r(s,a) (immediate reward) values; V*(s) values, with γ = 0.9; one optimal policy. V*(s6) = 100 + 0.9*0 = 100 V*(s5) = 0 + 0.9*100 = 90 V*(s4) = 0 + 0.9*90 = 81 Restated, the task is to learn the optimal policy π*
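The chain of values above can be checked directly from the recursion V*(s) = r(s,a*) + γV*(δ(s,a*)) along the optimal path; the state names follow the slide, and the path s4 → s5 → s6 → G is read off the quoted numbers:

```python
# Check of the V* values quoted above, using V*(s) = r(s,a*) + gamma * V*(delta(s,a*))
# along the optimal path s4 -> s5 -> s6 -> G (the goal G is absorbing).
gamma = 0.9
v_G  = 0                      # no further reward after the absorbing goal
v_s6 = 100 + gamma * v_G      # entering G yields immediate reward 100
v_s5 = 0   + gamma * v_s6     # 90.0
v_s4 = 0   + gamma * v_s5     # 81.0
print(v_s6, v_s5, v_s4)
```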
The task, revisited • We might try to have the agent learn the evaluation function V* • It could then do a look-ahead search to choose the best action from any state using π*(s) = argmax_a [r(s,a) + γV*(δ(s,a))] … yes, if we knew both the transition function δ and the reward function r. In general these are unknown to the agent, so it cannot choose actions this way • BUT: there is a way to do it!!
Q function • Define a new function very similar to V*: Q(s,a) ≡ r(s,a) + γV*(δ(s,a)) • What difference does it make? • If the agent learns Q, then it can choose the optimal actions even without knowing δ and r • Let us see how.
Rewrite things using this new definition: π*(s) = argmax_a [r(s,a) + γV*(δ(s,a))] = argmax_a Q(s,a), and since V*(s) = max_a' Q(s,a'), we obtain the recurrence Q(s,a) = r(s,a) + γ max_a' Q(δ(s,a),a') • Now, let Q̂ denote the agent’s current approximation to Q. Consider the iterative update rule Q̂(s,a) ← r + γ max_a' Q̂(s',a'). Under some assumptions (every <s,a> visited infinitely often), this will converge to the true Q
Q Learning algorithm (in deterministic worlds) • For each (s,a) initialise the table entry Q̂(s,a) := 0 • Observe current state s • Do forever: • Select an action a and execute it • Receive immediate reward r • Observe new state s’ • Update the table entry as follows: Q̂(s,a) := r + γ max_a’ Q̂(s’,a’) • s := s’
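Below is a short Python sketch of this algorithm on a small deterministic grid world. The 2×3 layout, the start state and the reward of 100 for entering the goal are assumptions chosen to resemble the six-state example above, not the exact environment from the slides.

```python
# Q-learning sketch for a deterministic world, using the update rule
# Q(s,a) := r + gamma * max_a' Q(s',a'). The 2x3 grid, start state and
# rewards are illustrative assumptions.
import random
from collections import defaultdict

GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)                              # absorbing goal state G
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(s, a):
    """Deterministic transitions; moves off the grid leave the state unchanged."""
    nr, nc = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s

def reward(s, a):
    """Immediate reward: 100 for entering the goal, 0 otherwise."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

Q = defaultdict(float)                     # the Q-hat table, initialised to 0

for episode in range(1000):
    s = (1, 0)                             # start in the bottom-left corner
    while s != GOAL:
        a = random.choice(list(ACTIONS))   # random exploration suffices here
        r, s_next = reward(s, a), delta(s, a)
        # the table update from the algorithm above
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

# Read off the greedy policy and state values from the learned table
for s in [(i, j) for i in range(ROWS) for j in range(COLS) if (i, j) != GOAL]:
    best = max(ACTIONS, key=lambda a: Q[(s, a)])
    print(s, best, round(max(Q[(s, a)] for a in ACTIONS), 1))
```

With γ = 0.9 the printed values fall into the same 100 / 90 / 81 pattern as the worked example earlier.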
Example: updating Q̂ for one state-action pair, given the Q̂ values from a previous iteration (shown on the arrows of the grid-world diagram)
Sketch of the convergence proof of Q-learning • Consider the case of a deterministic world, where each (s,a) is visited infinitely often. • Define a full interval as an interval during which each (s,a) is visited. It can easily be shown that during any such interval, the absolute value of the largest error in the Q̂ table is reduced by a factor of γ. • Consequently, as γ < 1, after infinitely many updates the largest error converges to zero. • Go through the details in [Mitchell, sec. 13.3]
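A sketch of the key contraction step (my expansion of the argument referenced above, following Mitchell; Q̂n denotes the table at the start of a full interval, Δn its largest error, and s' = δ(s,a)):

```latex
\begin{align*}
|\hat{Q}_{n+1}(s,a) - Q(s,a)|
  &= \bigl|\bigl(r + \gamma \max_{a'} \hat{Q}_{n}(s',a')\bigr)
        - \bigl(r + \gamma \max_{a'} Q(s',a')\bigr)\bigr| \\
  &= \gamma\,\bigl|\max_{a'} \hat{Q}_{n}(s',a') - \max_{a'} Q(s',a')\bigr| \\
  &\le \gamma \max_{a'} \bigl|\hat{Q}_{n}(s',a') - Q(s',a')\bigr|
   \;\le\; \gamma \max_{s'',a'} \bigl|\hat{Q}_{n}(s'',a') - Q(s'',a')\bigr|
   \;=\; \gamma\,\Delta_{n}.
\end{align*}
```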
Obs. As a consequence of the convergence proof, Q-learning need not train on optimal action sequences in order to converge to the optimal policy. It can learn the Q function (and hence the optimal policy) while training from actions chosen at random as long as the resulting training sequence visits every (state, action) infinitely often.
Exploration versus Exploitation • The Q-learning algorithm doesn’t specify how to choose an action • If we always choose the action that maximises our estimate of Q we could end up not exploring better alternatives • To converge on the true Q values we must favour higher estimated Q values but still have a chance of choosing worse estimated Q values for exploration (see the convergence proof of the Q-learning algorithm in [Mitchell, sec. 13.3.4.]). An action selection function of the following form may be employed, where k > 0: P(ai | s) = k^Q̂(s,ai) / Σj k^Q̂(s,aj)
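A small Python sketch of this selection rule; the Q̂ estimates and the state/action names are made up for illustration:

```python
# Probabilistic action selection: P(a_i | s) is proportional to k ** Q_hat(s, a_i).
# Larger k favours exploitation of high-Q actions; k near 1 favours exploration.
import random

def select_action(Q, s, actions, k=2.0):
    """Sample an action with probability proportional to k ** Q[(s, a)]."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}          # illustrative estimates
picks = [select_action(Q, "s1", ["left", "right"]) for _ in range(10000)]
print(picks.count("right") / len(picks))                 # ~ k**3/(k**1 + k**3) = 0.8
```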
Nondeterministic case • What if the reward and the state transition are not deterministic? – e.g. in Backgammon, learning and playing depend on rolls of dice! • Then V and Q need to be redefined by taking expected values • Similar reasoning and a convergent update rule apply • Will continue next week.
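As a preview of what "taking expected values" looks like (following Mitchell's treatment of the nondeterministic case; the decaying learning rate αn below is part of that treatment, not of this week's slides):

```latex
% Nondeterministic case: definitions via expectations, and the training rule
% with a learning rate alpha_n that decays with the number of visits to (s,a).
\begin{align*}
V^{\pi}(s) \equiv E\Bigl[\sum_{i \ge 0} \gamma^{i} r_{t+i}\Bigr], \qquad
Q(s,a) \equiv E\bigl[r(s,a) + \gamma V^{*}(\delta(s,a))\bigr],
\end{align*}
\begin{align*}
\hat{Q}_{n}(s,a) \leftarrow (1-\alpha_{n})\,\hat{Q}_{n-1}(s,a)
  + \alpha_{n}\bigl[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')\bigr],
\qquad \alpha_{n} = \frac{1}{1 + \mathit{visits}_{n}(s,a)}.
\end{align*}
```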
Summary • Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance • The goal of a reinforcement learning program is to maximise the eventual reward • Q-learning is a form of reinforcement learning that doesn’t require that the learner has prior knowledge of how its actions affect the environment