Reinforcement Learning: Eligibility Traces. Speaker: 虞台文, Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab
Content • n-step TD Prediction • Forward View of TD(λ) • Backward View of TD(λ) • Equivalence of the Forward and Backward Views • Sarsa(λ) • Q(λ) • Eligibility Traces for Actor-Critic Methods • Replacing Traces • Implementation Issues
Reinforcement Learning: Eligibility Traces. n-Step TD Prediction. (Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab)
Elementary Methods • Monte Carlo Methods • Dynamic Programming • TD(0)
Monte Carlo vs. TD(0) • Monte Carlo: observes the rewards for all steps in an episode • TD(0): observes only one step ahead
n-Step TD Prediction: backup diagrams ranging from TD (1-step), through 2-step, 3-step, and n-step backups, up to Monte Carlo.
n-Step TD Prediction: the corrected n-step truncated return.
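For reference, a LaTeX rendering of the corrected n-step truncated return in the standard Sutton and Barto notation (the slide's own equation is not reproduced in the text, so this formulation is assumed):

```latex
% corrected n-step truncated return: n rewards plus the discounted current estimate
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots
          + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n} V_t(s_{t+n})
```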
Backups: comparison of the Monte Carlo, TD(0), and n-step TD backup diagrams.
n-Step TD Backup: the increment can be applied online (at every step) or offline (accumulated and applied at the end of the episode). When offline, the new V(s) takes effect only for the next episode.
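As an illustration of the two modes (notation assumed to follow Sutton and Barto), the per-step increment and how each mode applies it:

```latex
% n-step TD increment computed at time t
\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]
% online:  applied immediately, so later steps of the same episode use the new estimate
% offline: increments are accumulated and applied once, at the end of the episode:
%          V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)
```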
Error Reduction Property: for both online and offline updating, the worst-case error of the expected n-step return is at most γ^n times the worst-case error of the current value estimate V.
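In LaTeX, the error reduction property as it is usually stated (Sutton and Barto); the slide is assumed to show the same bound:

```latex
% the worst-case error of the expected n-step return shrinks by a factor gamma^n
\max_s \left| \mathbb{E}_\pi\!\left[ R_t^{(n)} \,\middle|\, s_t = s \right] - V^{\pi}(s) \right|
\;\le\; \gamma^{\,n} \, \max_s \left| V(s) - V^{\pi}(s) \right|
```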
Example (Random Walk): five states A, B, C, D, E; episodes start in the center state C; all rewards are 0 except a reward of 1 on termination at the right end. The true values are V(A)=1/6, V(B)=2/6, V(C)=3/6, V(D)=4/6, V(E)=5/6. Consider 2-step TD, 3-step TD, and so on: which n is optimal?
Example (19-state Random Walk): as above but with 19 states, a reward of −1 on termination at the left end and +1 at the right end. The figure plots the average RMS error over the first 10 trials for online and offline n-step TD, across different n and step sizes.
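A minimal Python sketch of online n-step TD prediction on a 19-state random walk; the reward signs (−1 left, +1 right), the center start state, and the parameter names are assumptions for illustration, not taken from the slides:

```python
import numpy as np

N_STATES = 19                              # non-terminal states 1..19; 0 and 20 are terminal
TRUE_V = np.arange(-18, 20, 2) / 20.0      # true values of states 1..19 under the random policy

def n_step_td_episode(V, n, alpha, gamma=1.0, rng=np.random):
    """Run one episode of n-step TD prediction, updating V online."""
    state = N_STATES // 2 + 1              # start in the middle state
    states, rewards = [state], [0.0]       # rewards[t+1] is the reward observed after step t
    T = float('inf')
    t = 0
    while True:
        if t < T:
            next_state = state + (1 if rng.random() < 0.5 else -1)
            reward = 1.0 if next_state == N_STATES + 1 else (-1.0 if next_state == 0 else 0.0)
            states.append(next_state)
            rewards.append(reward)
            if next_state == 0 or next_state == N_STATES + 1:
                T = t + 1                  # episode terminates at time T
            state = next_state
        tau = t - n + 1                    # the time whose state estimate is updated now
        if tau >= 0:
            # corrected n-step truncated return
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * V[states[tau + n]]
            s_tau = states[tau]
            V[s_tau] += alpha * (G - V[s_tau])
        if tau == T - 1:
            break
        t += 1
    return V
```

For instance, with V = np.zeros(N_STATES + 2), running this for 10 episodes at several (n, alpha) pairs and comparing np.sqrt(np.mean((V[1:-1] - TRUE_V) ** 2)) gives the kind of comparison plotted on the slide.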
Exercise (Random Walk): a grid world with standard moves and terminal rewards of +1 and −1. • Evaluate the value function for the random policy. • Approximate the value function using n-step TD (try different n's and α's) and compare their performance. • Find the optimal policy.
Reinforcement Learning: Eligibility Traces. The Forward View of TD(λ). (Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab)
Averaging n-step Returns • We are not limited to using a single n-step TD return • We can average several n-step returns, as long as the weights sum to 1; the average still counts as one backup. An example follows.
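For example (the specific mixture on the slide is not reproduced in the text; the book's half-and-half example is assumed):

```latex
% average of a 2-step and a 4-step return; the weights are positive and sum to 1
R_t^{\mathrm{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}
```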
TD(λ) and the λ-Return • TD(λ) is a method for averaging all n-step backups • each n-step return is weighted by λ^(n−1), decaying with the time since visitation • the weights (1−λ), (1−λ)λ, (1−λ)λ², …, with the final weight λ^(T−t−1) given to the complete return, sum to 1 • the resulting average is called the λ-return • TD(λ) backs each state up toward its λ-return
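A LaTeX statement of the λ-return as defined in Sutton and Barto (the slide is assumed to use this definition):

```latex
% lambda-return: weighted average of all n-step returns
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)}

% episodic form: all weight beyond the end of the episode goes to the complete return R_t
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1} R_t^{(n)} + \lambda^{\,T-t-1} R_t
```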
Forward View of TD(λ): a theoretical view in which each state looks forward to the future rewards and states to form its λ-return.
Reinforcement Learning: Eligibility Traces. The Backward View of TD(λ). (Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab)
Why the Backward View? • The forward view is acausal and therefore not directly implementable • The backward view is causal and implementable • In the offline case, it achieves exactly the same result as the forward view
Eligibility Traces • Each state s is associated with an additional memory variable, its eligibility trace e_t(s), defined as shown below.
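The standard accumulating-trace definition, in LaTeX (assumed to match the slide):

```latex
% accumulating eligibility trace
e_t(s) =
\begin{cases}
  \gamma \lambda \, e_{t-1}(s)       & \text{if } s \neq s_t,\\[2pt]
  \gamma \lambda \, e_{t-1}(s) + 1   & \text{if } s = s_t.
\end{cases}
```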
Eligibility Traces Record Recency of Visiting • At any time, the traces record which states have recently been visited, where "recently" is defined in terms of γλ. • The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur. • The reinforcing event is the moment-by-moment 1-step TD error.
Reinforcing Event: the moment-by-moment 1-step TD error.
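In LaTeX, the 1-step TD error that plays the role of the reinforcing event:

```latex
% 1-step TD error at time t
\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)
```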
TD(λ): the eligibility-trace update, the reinforcing event (the 1-step TD error), and the resulting value updates, combined into one online algorithm; a sketch follows.
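A minimal Python sketch of the backward view, i.e. online TD(λ) with accumulating traces; the environment interface (env.reset, env.step) and the parameter names are assumptions for illustration:

```python
import numpy as np

def td_lambda_episode(env, V, alpha=0.1, gamma=1.0, lam=0.8):
    """Run one episode of online TD(lambda), updating the value table V in place."""
    e = np.zeros_like(V)                  # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step()      # policy is assumed fixed inside env (prediction)
        # reinforcing event: 1-step TD error
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]
        # accumulating trace for the visited state
        e *= gamma * lam
        e[s] += 1.0
        # backward-view update: every state learns in proportion to its trace
        V += alpha * delta * e
        s = s_next
    return V
```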
Backward View vs. MC and TD(0) • Setting λ = 0 gives TD(0) • Setting λ = 1 gives Monte Carlo, but in a better way: • TD(1) can be applied to continuing tasks • it works incrementally and online (instead of waiting until the end of the episode) • What about 0 < λ < 1?
Reinforcement Learning: Eligibility Traces. Equivalence of the Forward and Backward Views. (Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab)
Offline TD(λ)'s: the offline forward TD(λ) update, which backs each visited state up toward its λ-return, and the offline backward TD(λ) update, which accumulates the trace-weighted TD-error increments over the episode.
Forward View = Backward View: for offline updating, the total forward (λ-return) update of each state over an episode equals its total backward (eligibility-trace) update (see the proof).
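The equivalence as usually stated for offline updating (Sutton and Barto's notation assumed):

```latex
% total backward-view update = total forward-view update, for every state s
\sum_{t=0}^{T-1} \Delta V_t^{TD}(s)
  \;=\;
\sum_{t=0}^{T-1} \alpha \left[ R_t^{\lambda} - V_t(s_t) \right] \mathbf{1}\{ s = s_t \}
```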
TD(λ) on the Random Walk: average RMS error over the first 10 trials for the offline λ-return algorithm (forward view) and online TD(λ) (backward view).
Reinforcement Learning: Eligibility Traces. Sarsa(λ). (Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab)
Sarsa(λ) • TD(λ) uses eligibility traces for policy evaluation (prediction) • How can eligibility traces be used for control? • Learn action values Q_t(s, a) rather than state values V_t(s).
Sarsa(λ): state-action eligibility traces, reinforcing events (the 1-step TD errors on action values), and the resulting updates; a summary follows.
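A LaTeX summary of the standard Sarsa(λ) equations with accumulating traces (assumed to match the slide):

```latex
% state-action eligibility traces
e_t(s,a) =
\begin{cases}
  \gamma \lambda \, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t,\\[2pt]
  \gamma \lambda \, e_{t-1}(s,a)     & \text{otherwise.}
\end{cases}

% reinforcing event: 1-step Sarsa TD error
\delta_t = r_{t+1} + \gamma \, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)

% update, for all s and a
Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a)
```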
Sarsa(λ) Traces in the Grid World • After a single trial, the agent already has much more information about how to reach the goal • not necessarily by the best path • which considerably accelerates learning.
Reinforcement Learning: Eligibility Traces. Q(λ). (Institute of Computer Science and Engineering, Tatung University, Intelligent Multimedia Research Lab)
Q-Learning • Q-learning is an off-policy method • the behavior policy breaks from the estimation policy from time to time to take exploratory actions • so a simple decaying trace cannot be applied directly • How can eligibility traces be combined with Q-learning? • Three methods: • Watkins's Q(λ) • Peng's Q(λ) • Naïve Q(λ)
Watkins's Q(λ): diagram contrasting the behavior policy (e.g., ε-greedy) with the estimation policy (e.g., greedy); the greedy path and the non-greedy path diverge at the first non-greedy action, which is where the trace is cut.
Watkins's Q(λ): how should the eligibility traces be defined? The backup distinguishes two cases: • Case 1: both the behavior and estimation policies follow the greedy path. • Case 2: the behavior policy takes a non-greedy action before the episode ends, and the backup is truncated there. A sketch of one update step follows.
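A minimal Python sketch of a single Watkins's Q(λ) time step with accumulating traces; the ε-greedy action selection, the |S| x |A| array layout, and the parameter names are assumptions for illustration:

```python
import numpy as np

def watkins_q_lambda_step(Q, E, s, a, r, s_next, done,
                          alpha=0.1, gamma=0.95, lam=0.8, eps=0.1,
                          rng=np.random):
    """Update Q and the traces E for one observed transition (s, a, r, s_next)."""
    # choose the next behavior action (epsilon-greedy)
    if rng.random() < eps:
        a_next = rng.randint(Q.shape[1])
    else:
        a_next = int(np.argmax(Q[s_next]))
    a_star = int(np.argmax(Q[s_next]))          # greedy (estimation-policy) action

    # TD error always backs up toward the greedy action's value
    target = r if done else r + gamma * Q[s_next, a_star]
    delta = target - Q[s, a]

    # accumulating trace for the visited state-action pair, then update all pairs
    E[s, a] += 1.0
    Q += alpha * delta * E

    # decay traces while the behavior action is still greedy; cut them otherwise
    if Q[s_next, a_next] == Q[s_next, a_star]:
        E *= gamma * lam
    else:
        E[:] = 0.0
    return a_next
```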
Peng's Q(λ) • Cutting off traces loses much of the advantage of using eligibility traces. • If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than 1-step Q-learning. • Peng's Q(λ) is an alternate version of Q(λ) meant to remedy this.
Peng's Q(λ) backups. Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3). • Never cuts traces • Backs up toward the max action except at the end • The book reports that it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ) • Disadvantage: difficult to implement
Peng's Q(λ): see Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the update equations and notation.