ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 11: Temporal Difference Learning (cont.), Eligibility Traces. October 11, 2010. Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2010.
Outline • Actor-Critic Model (TD) • Eligibility Traces
Actor-Critic Methods • Explicit (and independent) representation of the policy as well as the value function • A scalar critique signal drives all learning in both actor and critic • These methods received much attention early on, and are being revisited now! • Appealing in the context of psychological and neural models • Dopamine neurons (W. Schultz et al., Université de Fribourg)
Actor-Critic Details • Typically, the critic is a state-value function • After each action selection, an evaluation (TD) error is obtained in the form δt = rt+1 + γV(st+1) − V(st), where V is the critic's current value function • Positive error → action at should be strengthened for the future • A typical actor is a parameterized mapping of states to actions • Suppose actions are generated by a Gibbs softmax over preferences p(s, a); then the agent can update the preferences as p(st, at) ← p(st, at) + β δt
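A minimal sketch (not from the lecture) of the tabular actor-critic updates above, assuming a small discrete, Gym-style environment (`env.reset()`, `env.step(a)`); the tables `p` (action preferences) and `V`, and the step sizes `alpha` and `beta`, are illustrative choices:

```python
import numpy as np

def gibbs_policy(p, s):
    """Gibbs (softmax) action probabilities from the preference row p[s, :]."""
    prefs = p[s] - p[s].max()          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_critic_episode(env, p, V, alpha=0.1, beta=0.1, gamma=0.99):
    """One episode of tabular actor-critic: the critic learns V, the actor learns preferences p."""
    s = env.reset()
    done = False
    while not done:
        probs = gibbs_policy(p, s)
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done, _ = env.step(a)
        # TD error: delta = r + gamma*V(s') - V(s), with V(terminal) = 0
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        V[s] += alpha * delta          # critic update
        p[s, a] += beta * delta        # actor update: strengthen (or weaken) action a
        s = s_next
    return p, V
```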
Actor-Critic Models (cont.) • Actor-Critic methods offer a powerful framework for scalable RL systems (as will be shown later) • They are particularly interesting since they … • Operate inherently online • Require minimal computation in order to select actions • e.g. draw a number from a given distribution • In neural networks this is equivalent to a single feed-forward pass • Can cope with non-Markovian environments
Summary of TD • TD is based on prediction (and the associated error) • Introduced one-step tabular model-free TD methods • Extended prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods • Have been shown to have some correlation with biological systems
Eligibility Traces • ET are one of the basic practical mechanisms in RL • Almost any TD method can be combined with ET to obtain a more efficient learning engine • Combine TD concepts with Monte Carlo ideas • Bridge the temporal gap between events and training data • A trace is a temporary record of the occurrence of an event • The trace marks the memory parameters associated with the event as eligible for undergoing learning changes • When a TD error is recorded, eligible states or actions are assigned credit or "blame" for the error • There will be two views of ET • Forward view – more theoretical • Backward view – more mechanistic
n-step TD Prediction • Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
Mathematics of n-step TD Prediction • Monte Carlo: Rt = rt+1 + γrt+2 + γ²rt+3 + … + γ^(T−t−1) rT • TD: Rt^(1) = rt+1 + γVt(st+1) • Use V(s) to estimate the remaining return • n-step TD: • 2-step return: Rt^(2) = rt+1 + γrt+2 + γ²Vt(st+2) • n-step return at time t: Rt^(n) = rt+1 + γrt+2 + … + γ^(n−1) rt+n + γ^n Vt(st+n)
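Below is a small illustrative helper (not from the slides) that computes the n-step return from stored rewards and value estimates; the list layout (`rewards[k]` holding r_{k+1}, `values[k]` holding V(s_k)) is an assumption made for the sketch:

```python
def n_step_return(rewards, values, t, n, gamma):
    """
    n-step return R_t^(n) = r_{t+1} + g*r_{t+2} + ... + g^(n-1)*r_{t+n} + g^n * V(s_{t+n}).

    rewards[k] holds r_{k+1} (the reward on the step from s_k to s_{k+1});
    values[k] holds the current estimate V(s_k).
    """
    T = len(rewards)                      # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                         # bootstrap from V(s_{t+n}) unless the episode ended
        G += discount * values[t + n]
    return G
```

With n = 1 this is the one-step TD target; once t + n reaches the end of the episode it reduces to the full Monte Carlo return.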
Learning with n-step Backups • Backup (on-line or off-line): ΔVt(st) = α [Rt^(n) − Vt(st)] • Error reduction property of n-step returns: the maximum error using the n-step return is at most γ^n times the maximum error using V(s) (written out below) • Using this, one can show that n-step methods converge • Yields a family of methods, of which TD and MC are members
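A LaTeX rendering of the error-reduction property, in the standard textbook form:

```latex
% Error-reduction property: the worst-case error of the n-step return
% is at most gamma^n times the worst-case error of the current estimate V.
\max_{s}\Bigl|\,\mathbb{E}_{\pi}\!\bigl[R_t^{(n)} \mid s_t = s\bigr] - V^{\pi}(s)\Bigr|
\;\le\;
\gamma^{n}\,\max_{s}\bigl|\,V_t(s) - V^{\pi}(s)\bigr|
```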
On-line vs. Off-line Updating • In on-line updating, updates are done during the episode, as soon as the increment is computed • In that case we have Vt+1(s) = Vt(s) + ΔVt(s) • In off-line updating, we update the value of each state at the end of the episode • Increments are accumulated and calculated "on the side" • Values are constant throughout the episode • Given a value V(s), the new value (in the next episode) will be V(s) + Σt ΔVt(s)
Averaging n-step Returns • n-step methods were introduced to help with the understanding of TD(λ) • Idea: backup an average of several returns • e.g. backup half of the 2-step return and half of the 4-step return • The above is called a complex backup • Draw each component • Label with the weights for that component • TD(λ) can be viewed as one way of averaging n-step backups
Forward View of TD(λ) • TD(λ) is a method for averaging all n-step backups • Each n-step return is weighted proportionally to λ^(n−1) (time since visitation), normalized by (1 − λ) • λ-return: Rt^λ = (1 − λ) Σ_{n=1}^{∞} λ^(n−1) Rt^(n) • Backup using the λ-return: ΔVt(st) = α [Rt^λ − Vt(st)]
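A sketch (illustrative, not the course's code) of the forward-view λ-return for an episodic task; the same list layout as in the n-step sketch above is assumed, and the truncation at termination matches the weighting described on the next slide:

```python
def lambda_return(rewards, values, t, lam, gamma):
    """
    Forward-view lambda-return for an episodic task:
    R_t^lam = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * R_t^(n)  +  lam^(T-t-1) * R_t,
    where R_t (the full return) absorbs all remaining weight after termination.
    rewards[k] holds r_{k+1}; values[k] holds V(s_k).
    """
    T = len(rewards)

    def n_step(n):
        # n-step return: truncated discounted reward sum plus a bootstrap from V
        G, disc = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            G += disc * rewards[k]
            disc *= gamma
        return G + (disc * values[t + n] if t + n < T else 0.0)

    G_lam, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):              # weighted n-step returns before termination
        G_lam += weight * n_step(n)
        weight *= lam
    return G_lam + lam ** (T - t - 1) * n_step(T - t)   # remaining weight on the full return
```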
λ-Return Weighting Function (for episodic tasks) • [Figure: the weights assigned to each n-step return decay until termination; the remaining weight is assigned to the full return after termination]
Relation of λ-Return to TD(0) and Monte Carlo • For an episodic task, the λ-return can be rewritten as Rt^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^(n−1) Rt^(n) + λ^(T−t−1) Rt • If λ = 1, you get Monte Carlo: Rt^λ = Rt • If λ = 0, you get TD(0): Rt^λ = Rt^(1)
Forward View of TD(λ) • Look forward from each state to determine its update from future states and rewards • Q: Can this be practically implemented?
λ-Return on the Random Walk • Same 19-state random walk as before • Q: Why do you think intermediate values of λ are best?
Backward View • The forward view was theoretical • The backward view provides a practical mechanism • "Shout" δt backwards over time • The strength of your voice decreases with temporal distance by a factor of γλ
Backward View of TD(λ) • TD(λ) parametrically shifts from TD to MC • New variable called the eligibility trace, et(s) • On each step, decay all traces by γλ • γ is the discount rate and λ is the return weighting coefficient • Increment the trace for the current state by 1 • The accumulating trace is thus et(s) = γλ et−1(s) for s ≠ st, and et(s) = γλ et−1(s) + 1 for s = st
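A minimal sketch of tabular TD(λ) prediction with accumulating traces, assuming a Gym-style discrete environment and a fixed `policy(s)` to evaluate; the hyperparameters are illustrative, not values from the lecture:

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.8):
    """One episode of tabular TD(lambda) prediction with accumulating traces."""
    e = np.zeros_like(V)                 # eligibility trace e_t(s), one entry per state
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        # TD error for this transition (V(terminal) = 0)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        # accumulating trace: decay all traces by gamma*lambda, bump the current state
        e *= gamma * lam
        e[s] += 1.0
        # "shout" delta backwards: every state is updated in proportion to its trace
        V += alpha * delta * e
        s = s_next
    return V
```

Setting `lam=0` recovers one-step TD(0), while `lam=1` with `gamma=1` behaves like an incremental Monte Carlo method, matching the relation discussed on the next slide.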
Relation of the Backward View to MC & TD(0) • Using the update rule ΔVt(s) = α δt et(s): • As before, if you set λ = 0, you get TD(0) • If you set λ = γ = 1 (no decay), you get MC, but in a better way • Can apply TD(1) to continuing tasks • Works incrementally and on-line (instead of waiting until the end of the episode) • In between, earlier states are given less credit for the TD error
Forward View = Backward View • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating: summed over an episode, the backward-view updates equal the forward-view updates • The book shows this (pp. 176-178) • On-line updating with small α is similar (algebra shown in the book)
On-line versus Off-line on the Random Walk • Same 19-state random walk • On-line performs better over a broader range of parameters
Control: Sarsa(λ) • Next we want to use ET for control, not just prediction (i.e. estimation of value functions) • Idea: we save eligibility for state-action pairs instead of just states
Implementing Q(λ) • Two methods have been proposed that combine ET and Q-learning: Watkins's Q(λ) and Peng's Q(λ) • Recall that Q-learning is an off-policy method • It learns about the greedy policy while following exploratory actions • Suppose the agent follows the greedy policy for the first two steps, but not on the third • Watkins: zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice
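A compact sketch of one episode of Watkins's Q(λ) with accumulating state-action traces and ε-greedy exploration; the Gym-style environment interface and the hyperparameters are assumptions of the sketch:

```python
import numpy as np

def watkins_q_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.8, eps=0.1):
    """One episode of Watkins's Q(lambda): traces are cut after non-greedy actions."""
    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(Q.shape[1])
        return int(np.argmax(Q[s]))

    e = np.zeros_like(Q)                       # trace e_t(s, a), one entry per state-action pair
    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done, _ = env.step(a)
        a_next = eps_greedy(s_next) if not done else 0
        a_star = int(np.argmax(Q[s_next]))     # greedy action in the next state
        # back up toward the greedy (max) target, as in one-step Q-learning
        delta = r + (0.0 if done else gamma * Q[s_next, a_star]) - Q[s, a]
        e[s, a] += 1.0                         # accumulating trace for (s, a)
        Q += alpha * delta * e
        if not done and a_next == a_star:
            e *= gamma * lam                   # keep decaying traces while acting greedily
        else:
            e[:] = 0.0                         # Watkins: cut traces after a non-greedy action
        s, a = s_next, a_next
    return Q
```

The trace is decayed only while the next action agrees with the greedy one; any exploratory action cuts it to zero, which is exactly the drawback discussed on the next slide.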
Peng's Q(λ) • Disadvantage of Watkins's method: • Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces • Peng: • Back up the max action except at the end • Never cut traces • Disadvantage: • Complicated to implement
Variable λ • ET methods can be improved by allowing λ to change over time • Can generalize to a variable λ, defined, for example, as a function of the state visited at time t • States visited with high-certainty value estimates → λ near 0 • Use that value estimate fully and ignore subsequent states • States visited with uncertain value estimates → λ near 1 • Causes their estimated values to have little effect on any updates
Conclusions • Eligibility Traces offer an efficient, incremental way to combine MC and TD • Includes advantages of MC • Can deal with lack of Markov property • Consider an n-step interval for improved performance • Includes advantages of TD • Using TD error • Bootstrapping • Can significantly speed learning • Does have a cost in computation