
ECE-517 Reinforcement Learning in Artificial Intelligence, Lecture 11: Temporal Difference Learning (cont.), Eligibility Traces. October 11, 2010. Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee, Fall 2010.

Presentation Transcript


  1. ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 11: Temporal Difference Learning (cont.), Eligibility Traces October 11, 2010 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2010

  2. Outline • Actor-Critic Model (TD) • Eligibility Traces

  3. Actor-Critic Methods • Explicit (and independent) representation of the policy as well as the value function • A scalar critique signal drives all learning in both actor and critic • These methods received much attention early on, and are being revisited now! • Appealing in the context of psychological and neural models • Dopamine neurons (W. Schultz et al., Universite de Fribourg)

  4. Actor-Critic Details • Typically, the critic is a state-value function • After each action selection, an evaluation error is obtained in the form δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t), where V is the critic's current value function • Positive error → the action a_t should be strengthened for the future • A typical actor is a parameterized mapping of states to actions • Suppose actions are generated by the Gibbs softmax π_t(s,a) = e^{p(s,a)} / Σ_b e^{p(s,b)}; then the agent can update the preferences as p(s_t,a_t) ← p(s_t,a_t) + β δ_t
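
To make the two updates concrete, here is a minimal tabular sketch (not from the slides; the table layout, step sizes alpha/beta, and discount gamma are assumptions):

```python
import numpy as np

def softmax_policy(prefs, s):
    """Gibbs/softmax action distribution for state s from the actor's preferences p(s, a)."""
    e = np.exp(prefs[s] - prefs[s].max())
    return e / e.sum()

def actor_critic_step(V, prefs, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.99):
    """One TD-based actor-critic update after observing (s, a, r, s_next)."""
    delta = r + gamma * V[s_next] - V[s]   # critic's evaluation (TD) error
    V[s] += alpha * delta                  # critic: move V(s) toward the target
    prefs[s, a] += beta * delta            # actor: strengthen/weaken the action just taken
    return delta
```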

  5. Actor-Critic Models (cont.) • Actor-Critic methods offer a powerful framework for scalable RL systems (as will be shown later) • They are particularly interesting since they … • Operate inherently online • Require minimal computation in order to select actions • e.g. draw a number from a given distribution • In neural networks this is equivalent to a single feed-forward pass • Can cope with non-Markovian environments

  6. Summary of TD • TD is based on prediction (and the associated error) • Introduced one-step tabular model-free TD methods • Extended prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods • Have been shown to have some correlation with biological systems

  7. Unified View of RL methods (so far)

  8. Eligibility Traces • ET are one of the basic practical mechanisms in RL • Almost any TD method can be combined with ET to obtain a more efficient learning engine • Combine TD concepts with Monte Carlo ideas • Bridge the gap between events and training data • A temporary record of the occurrence of an event • The trace marks the memory parameters associated with the event as eligible for undergoing learning changes • When a TD error is recorded – eligible states or actions are assigned credit or "blame" for the error • There will be two views of ET • Forward view – more theoretical • Backward view – more mechanistic

  9. n-step TD Prediction • Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)

  10. Mathematics of n-step TD Prediction • Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T−t−1} r_T • TD: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1}) • Use V(s) to estimate the remaining return • n-step TD: • 2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2}) • n-step return at time t: R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
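
A small sketch of the n-step return computation (the reward list, value table, and names are assumptions, not from the slides):

```python
def n_step_return(rewards, V, s_n, n, gamma=0.99):
    """R_t^(n): the first n discounted rewards plus the discounted value estimate
    of the state reached after n steps.
    rewards: [r_{t+1}, ..., r_{t+n}] observed from time t; s_n: the state s_{t+n}."""
    G = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return G + gamma**n * V[s_n]
```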

  11. Learning with n-step Backups • Backup (on-line or off-line): ΔV_t(s_t) = α [R_t^{(n)} − V_t(s_t)] • Error reduction property of n-step returns: max_s |E_π{R_t^{(n)} | s_t = s} − V^π(s)| ≤ γ^n max_s |V(s) − V^π(s)|, i.e. the maximum error using the n-step return is bounded by γ^n times the maximum error using V(s) • Using this, one can show that n-step methods converge • Yields a family of methods, of which TD and MC are members

  12. On-line vs. Off-line Updating • In on-line updating – updates are done during the episode, as soon as the increment is computed • In that case we have V_{t+1}(s) = V_t(s) + ΔV_t(s) • In off-line updating – we update the value of each state at the end of the episode • Increments are accumulated and calculated "on the side" • Values are constant throughout the episode • Given a value V(s), the new value (in the next episode) will be V(s) + Σ_t ΔV_t(s)
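
A sketch contrasting the two modes (the helper names and the (state, return) bookkeeping are assumptions):

```python
from collections import defaultdict

def online_update(V, s, G, alpha=0.1):
    """On-line: apply the increment as soon as the n-step return G for state s is available."""
    V[s] += alpha * (G - V[s])

def offline_episode_update(V, visits, alpha=0.1):
    """Off-line: accumulate increments 'on the side' while V stays fixed during the episode,
    then apply them all at the end. visits: list of (state, n-step return) pairs."""
    increments = defaultdict(float)
    for s, G in visits:
        increments[s] += alpha * (G - V[s])   # V is not modified inside this loop
    for s, dv in increments.items():
        V[s] += dv
```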

  13. Random Walk Revisited • e.g. results for the 19-state random walk (figure on slide)

  14. Averaging n-step Returns • n-step methods were introduced to help with understanding TD(λ) • Idea: back up an average of several returns • e.g. back up half of the 2-step and half of the 4-step return: R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)} • The above is called a complex backup • Draw each component • Label with the weights for that component • TD(λ) can be viewed as one way of averaging n-step backups

  15. Forward View of TD(λ) • TD(λ) is a method for averaging all n-step backups • Weight the n-step return by λ^{n−1} (decaying with time since visitation) • λ-return: R_t^λ = (1−λ) Σ_{n=1..∞} λ^{n−1} R_t^{(n)} • Backup using the λ-return: ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]
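
A small sketch computing the λ-return from precomputed n-step returns for an episodic task (the list layout and names are assumptions); setting lam to 0 or 1 reproduces the TD(0) and Monte Carlo special cases discussed two slides below:

```python
def lambda_return(n_step_returns, lam):
    """R_t^lambda for an episodic task: weight R_t^(n) by (1 - lam) * lam**(n - 1),
    with the residual weight lam**(T - t - 1) falling on the complete return.
    n_step_returns: [R_t^(1), ..., R_t^(T-t)], the last entry being the full return R_t."""
    horizon = len(n_step_returns)                       # T - t
    weighted = sum((1 - lam) * lam**(n - 1) * n_step_returns[n - 1]
                   for n in range(1, horizon))
    return weighted + lam**(horizon - 1) * n_step_returns[-1]
```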

  16. λ-Return Weighting Function • For episodic tasks, the weight (1−λ) λ^{n−1} decays geometrically until termination; all remaining weight after termination falls on the complete return

  17. Relation of λ-Return to TD(0) and Monte Carlo • The λ-return can be rewritten as: R_t^λ = (1−λ) Σ_{n=1..T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t • If λ = 1, you get Monte Carlo: R_t^λ = R_t • If λ = 0, you get TD(0): R_t^λ = R_t^{(1)}

  18. Forward View of TD(λ) • Look forward from each state to determine the update from future states and rewards • Q: Can this be practically implemented?

  19. λ-Return on the Random Walk • Same 19-state random walk as before • Q: Why do you think intermediate values of λ are best?

  20. Backward View • The forward view was theoretical • The backward view provides a practical mechanism • "Shout" δ_t backwards over time • The strength of your voice decreases with temporal distance by γλ

  21. Backward View of TD(λ) • TD(λ) parametrically shifts from TD to MC • A new variable called the eligibility trace • On each step, decay all traces by γλ • γ is the discount rate and λ is the return-weighting coefficient • Increment the trace for the current state by 1 • The accumulating trace is thus e_t(s) = γλ e_{t−1}(s) for s ≠ s_t, and e_t(s) = γλ e_{t−1}(s) + 1 for s = s_t

  22. On-line Tabular TD(λ)
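
The algorithm box on this slide is not reproduced in the transcript; here is a minimal sketch of on-line tabular TD(λ) with accumulating traces, following the updates above (the gym-like environment interface, policy callable, and step sizes are assumptions):

```python
import numpy as np

def tabular_td_lambda(env, policy, n_states, n_episodes=100,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """On-line tabular TD(lambda) with accumulating eligibility traces.
    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done), policy(s) -> a."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)               # eligibility traces, cleared each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]   # one-step TD error
            e *= gamma * lam                 # decay every trace
            e[s] += 1.0                      # accumulating trace for the current state
            V += alpha * delta * e           # update all states in proportion to their traces
            s = s_next
    return V
```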

  23. Relation of the Backward View to MC & TD(0) • Using the update rule ΔV_t(s) = α δ_t e_t(s) • As before, if you set λ to 0, you get TD(0) • If you set λ = γ = 1 (no decay), you get MC, but in a better way • Can apply TD(1) to continuing tasks • Works incrementally and on-line (instead of waiting until the end of the episode) • In between – earlier states are given less credit for the TD error

  24. Forward View = Backward View • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating: the sum of the backward-view updates over an episode equals the sum of the forward-view updates • The book shows this (pp. 176-178) • On-line updating with small α behaves similarly (the algebra is shown in the book)

  25. On-line versus Off-line on the Random Walk • Same 19-state random walk • On-line performs better over a broader range of parameters

  26. Control: Sarsa(λ) • Next we want to use ET for control, not just prediction (i.e. estimation of value functions) • Idea: we keep an eligibility trace for each state-action pair instead of just for states

  27. Sarsa(λ) Algorithm
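
The boxed algorithm is only an image on the slide; a sketch of the per-step Sarsa(λ) update with accumulating state-action traces (tabular Q and e as NumPy arrays are an assumption) might look like:

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9, terminal=False):
    """One on-policy Sarsa(lambda) update; Q and e are [n_states, n_actions] arrays."""
    target = r if terminal else r + gamma * Q[s_next, a_next]
    delta = target - Q[s, a]          # Sarsa TD error
    e[s, a] += 1.0                    # accumulating trace for the visited pair
    Q += alpha * delta * e            # update every state-action pair by its trace
    e *= gamma * lam                  # decay all traces
    return delta
```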

  28. Implementing Q(λ) • Two methods have been proposed that combine ET and Q-learning: Watkins's Q(λ) and Peng's Q(λ) • Recall that Q-learning is an off-policy method • It learns about the greedy policy while following exploratory actions • Suppose the agent follows the greedy policy for the first two steps, but not on the third • Watkins: zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice

  29. Watkins's Q(λ)
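
The algorithm itself is again an image; the distinguishing steps of Watkins's Q(λ) – backing up toward the greedy value and cutting the traces after a non-greedy action – are sketched below (array layout and names are assumptions):

```python
import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next,
                          alpha=0.1, gamma=0.99, lam=0.9):
    """One Watkins's Q(lambda) update; a_next is the action actually chosen at s_next."""
    a_star = np.argmax(Q[s_next])                 # greedy action at the next state
    delta = r + gamma * Q[s_next, a_star] - Q[s, a]
    e[s, a] += 1.0
    Q += alpha * delta * e
    if a_next == a_star:
        e *= gamma * lam                          # chosen action is greedy: keep decaying traces
    else:
        e[:] = 0.0                                # non-greedy (exploratory) action: cut all traces
    return delta
```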

  30. Peng's Q(λ) • Disadvantage of Watkins's method: • Early in learning, the eligibility traces will frequently be "cut" (zeroed out), resulting in little advantage from the traces • Peng: • Back up the max action, except at the end • Never cut traces • Disadvantage: • Complicated to implement

  31. Variable λ • ET methods can be improved by allowing λ to change over time • Can generalize to a variable λ, defined, for example, as a function of the state visited at time t: λ_t = λ(s_t) • States visited with high-certainty value estimates → λ → 0 • Use that value estimate fully and ignore subsequent states • States whose value estimates are uncertain → λ → 1 • This causes their estimated values to have little effect on any updates
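
As a toy illustration of the idea (not from the slides; the certainty measure is hypothetical), λ could be scheduled per state like this:

```python
def state_lambda(certainty, lam_max=0.9):
    """Illustrative state-dependent lambda: certainty in [0, 1] is an assumed measure of
    how reliable V(s) is. High certainty -> lambda near 0 (trust the estimate fully);
    low certainty -> lambda near lam_max (let the estimate carry little weight)."""
    return lam_max * (1.0 - certainty)
```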

  32. Conclusions • Eligibility traces offer an efficient, incremental way to combine MC and TD • They include the advantages of MC • Can deal with the lack of the Markov property • Considering an n-step interval improves performance • They include the advantages of TD • Use of the TD error • Bootstrapping • Can significantly speed up learning • But do come at a cost in computation
