Decision Making in Intelligent Systems, Lecture 5
BSc course Kunstmatige Intelligentie 2007
Bram Bakker
Intelligent Systems Lab Amsterdam, Informatics Institute, Universiteit van Amsterdam
bram@science.uva.nl
Overview of this lecture • Solving the full RL problem • Given that we do not have the MDP model • Temporal Difference (TD) methods • Making TD methods more efficient with eligibility traces • Making TD methods more efficient with function approximation
Simplest TD Method • [figure: TD(0) backup diagram over sample trajectories; the T's mark terminal states]
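The update behind this slide is the TD(0) update (standard form from Sutton & Barto):
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
The quantity in brackets is the one-step TD error: the difference between the bootstrapped target r_{t+1} + γ V(s_{t+1}) and the current estimate V(s_t).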
TD methods bootstrap and sample • Bootstrapping: update involves an estimate • MC does not bootstrap • DP bootstraps • TD bootstraps • Sampling: update does not involve an expected value • MC samples • DP does not sample • TD samples
Sarsa: On-Policy TD Control Turn TD into a control method by always updating the policy to be (epsilon-)greedy with respect to the current estimate. S A R S A: State Action Reward State Action
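The Sarsa update applied after each transition (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), in its standard form:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
Because a_{t+1} is the action actually selected by the (epsilon-)greedy behavior policy, Sarsa evaluates and improves the policy it is following, which makes it on-policy.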
Windy Gridworld undiscounted, episodic, reward = –1 until goal
Cliffwalking • ε-greedy, ε = 0.1 • Q-learning learns values for the optimal path along the cliff edge but occasionally falls in because of exploration, while Sarsa takes the exploration into account and learns the longer, safer path, giving better online performance
Summary one-step tabular TD methods • Introduced one-step tabular model-free TD methods • Extend prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods
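For reference, the Q-learning update (standard form) differs from Sarsa only in the target, which uses the greedy action in the next state rather than the action actually taken:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]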
End of first part of the course • This concludes the material for the midterm exam (chapters 1-6 of Sutton & Barto) • Midterm exam (deeltoets): 25 March, 9:00-12:00, building B/B-B (Nieuwe Achtergracht 166, exam hall)
The Book • Part I: The Problem • Introduction • Evaluative Feedback • The Reinforcement Learning Problem • Part II: Elementary Solution Methods • Dynamic Programming • Monte Carlo Methods • Temporal Difference Learning • Part III: A Unified View • Eligibility Traces • Generalization and Function Approximation • Planning and Learning • Dimensions of Reinforcement Learning • Case Studies
TD(λ) • New variable called the eligibility trace, e(s) • On each step, decay all traces by γλ and increment the trace for the current state by 1 • This is the accumulating trace
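Written out, the accumulating-trace update applied on every time step (standard form):
e_t(s) = γλ e_{t−1}(s)          for all s ≠ s_t
e_t(s_t) = γλ e_{t−1}(s_t) + 1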
Standard one-step TD • λ = 0 gives standard one-step TD learning: TD(0)
Eligibility traces: backward view • Shout the TD error δ_t backwards over time • The strength of your voice decreases with temporal distance by a factor γλ per step
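In the backward view every state is updated in proportion to its current eligibility, using the one-step TD error (standard TD(λ) form):
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
V(s) ← V(s) + α δ_t e_t(s)          for all s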
Control: Sarsa(λ) • Keep eligibility traces for state-action pairs instead of just for states
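A minimal sketch of tabular Sarsa(λ) with accumulating traces, in Python. The env object and its reset()/step(a) interface (returning a state index, and a (next_state, reward, done) tuple respectively) are illustrative assumptions, not part of the lecture material:

import numpy as np

def sarsa_lambda(env, n_states, n_actions, n_episodes=500,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating eligibility traces."""
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # Behave epsilon-greedily with respect to the current estimate Q
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)          # traces are reset at the start of each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next) if not done else None
            target = r if done else r + gamma * Q[s_next, a_next]
            delta = target - Q[s, a]  # one-step TD error
            e[s, a] += 1.0            # accumulating trace (set to 1.0 for replacing traces)
            Q += alpha * delta * e    # update all state-action pairs by their eligibility
            e *= gamma * lam          # decay all traces
            s, a = s_next, a_next
    return Q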
Sarsa(λ) Gridworld Example • After one trial, the agent has much more information about how to get to the goal (though not necessarily along the best path) • Eligibility traces can considerably accelerate learning
Replacing Traces • Using accumulating traces, frequently visited states can have eligibilities greater than 1 • This can be a problem for convergence • Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1
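With replacing traces the per-step update becomes (standard form):
e_t(s) = γλ e_{t−1}(s)          for all s ≠ s_t
e_t(s_t) = 1
so a trace never exceeds 1, no matter how often its state is revisited.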
Replacing Traces Example • Same 19-state random walk task as before • Replacing traces perform better than accumulating traces over more values of λ
Overview of this lecture • Solving the full RL problem • Given that we do not have the MDP model • Temporal Difference (TD) methods • Making TD methods more efficient with eligibility traces • Making TD methods more efficient with function approximation
Generalization and Function Approximation • Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part • Overview of function approximation (FA) methods and how they can be adapted to RL
Generalization illustration • [figure: a lookup table with one value entry per state s_1 … s_N, versus a generalizing function approximator trained on only part of the state space ("train here")]
Generalization illustration cont. So with function approximation, a single value update affects a larger region of the state space
Value Prediction with FA Before, value functions were stored in lookup tables.
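Now the value function is a parameterized function of the state, e.g.
V_t(s) = V(s; θ_t) ≈ V^π(s)
where θ_t is a parameter vector with (typically) far fewer components than there are states, so adjusting θ_t changes the estimated values of many states at once.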
Adapt Supervised Learning Algorithms • [diagram: inputs → supervised learning system → outputs, trained with training info = desired (target) outputs] • Training example = {input (state), target output} • Error = (target output − actual output)
Backups as Training Examples • Each backup can be treated as a conventional training example: the input is the (description of the) state being backed up, and the target output is the backed-up value
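For example, a TD(0) backup yields the training example (standard form): input = (a description of) s_t, target output = r_{t+1} + γ V_t(s_{t+1}).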
Any FA Method? • In principle, yes: • artificial neural networks • decision trees • multivariate regression methods • etc. • But RL has some special requirements: • usually want to learn while interacting • ability to handle “moving targets”
Performance Measures • Many are applicable but… • a common and simple one is the mean-squared error (MSE) over a distribution P : • P is the distribution of states at which backups are done. • The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
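In its standard form:
MSE(θ_t) = Σ_s P(s) [ V^π(s) − V_t(s) ]²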
Gradient Descent Iteratively move down the gradient:
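θ_{t+1} = θ_t − ½ α ∇_{θ_t} MSE(θ_t)
where ∇_{θ_t} denotes the vector of partial derivatives with respect to the components of the parameter vector θ_t, and α is a positive step-size parameter.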
Gradient Descent Cont. For the MSE given above and using the chain rule:
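θ_{t+1} = θ_t + α Σ_s P(s) [ V^π(s) − V_t(s) ] ∇_{θ_t} V_t(s)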
Gradient Descent Cont. Use just the sample gradient instead. Since each sample gradient is an unbiased estimate of the true gradient (provided the target v_t is an unbiased estimate of V^π(s_t), e.g. a Monte Carlo return), this converges to a local minimum of the MSE if α decreases appropriately with t. The sample-gradient update:
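θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_{θ_t} V_t(s_t)
where v_t is the target value used for the visited state s_t.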
Nice Properties of Linear FA Methods • The gradient is very simple (see below) • Linear gradient-descent TD(λ) converges if: • the step size decreases appropriately • states are sampled on-line from the on-policy distribution • It converges to a parameter vector whose error is bounded relative to the best parameter vector (Tsitsiklis & Van Roy, 1997)
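For a linear approximator V_t(s) = θ_tᵀ φ_s the gradient is simply the feature vector, ∇_{θ_t} V_t(s) = φ_s, and the asymptotic error is bounded in terms of the best achievable error (the bound of Tsitsiklis & Van Roy, 1997, as stated in Sutton & Barto):
MSE(θ_∞) ≤ [ (1 − γλ) / (1 − γ) ] MSE(θ*)
where θ* is the minimum-MSE parameter vector.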
Control with FA • Learning state-action values • Training examples now of the form {description of (s_t, a_t), target v_t} • The general gradient-descent rule and gradient-descent Sarsa(λ) (backward view) are written out below
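In their standard forms:
General gradient-descent rule:   θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_{θ_t} Q_t(s_t, a_t)
Gradient-descent Sarsa(λ), backward view:
θ_{t+1} = θ_t + α δ_t e_t
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
e_t = γλ e_{t−1} + ∇_{θ_t} Q_t(s_t, a_t)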
States as feature vectors But how should the state features be constructed?
Tile Coding • Binary feature for each tile • Number of features present at any one time is constant • Binary features mean the weighted sum is easy to compute • Easy to compute the indices of the features present
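A minimal sketch of uniform grid tile coding for a two-dimensional state, in Python. The function name, the (x, y) state representation, and the grid layout are illustrative assumptions; the lecture's CMAC-style tilings may differ:

import numpy as np

def active_tiles(x, y, n_tilings=8, tiles_per_dim=10,
                 x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Return the indices of the binary features (tiles) active for the point (x, y).

    Each of the n_tilings grids is offset by a fraction of a tile width,
    so exactly n_tilings features are active for any input.
    """
    indices = []
    x_scaled = (x - x_range[0]) / (x_range[1] - x_range[0]) * tiles_per_dim
    y_scaled = (y - y_range[0]) / (y_range[1] - y_range[0]) * tiles_per_dim
    tiles_per_tiling = (tiles_per_dim + 1) ** 2
    for t in range(n_tilings):
        offset = t / n_tilings                    # shift each tiling by a fraction of a tile
        col = int(np.floor(x_scaled + offset))
        row = int(np.floor(y_scaled + offset))
        col = min(max(col, 0), tiles_per_dim)     # clip to the (padded) grid
        row = min(max(row, 0), tiles_per_dim)
        indices.append(t * tiles_per_tiling + row * (tiles_per_dim + 1) + col)
    return indices

# Because the features are binary, the approximate value is just a sum of weights:
# V(s) = sum(theta[i] for i in active_tiles(x, y))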
Tile Coding Cont. • Irregular tilings • CMAC: "Cerebellar model arithmetic computer" (Albus, 1971)