Decision Making in Intelligent Systems, Lecture 5
BSc course Kunstmatige Intelligentie 2007
Bram Bakker
Intelligent Systems Lab Amsterdam, Informatics Institute, Universiteit van Amsterdam
bram@science.uva.nl
Overview of this lecture • Solving the full RL problem • Given that we do not have the MDP model • Temporal Difference (TD) methods • Making TD methods more efficient with eligibility traces • Making TD methods more efficient with function approximation
Simplest TD Method • [figure: TD(0) backup diagram over sample trajectories; the T's mark terminal states]
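The update behind this slide is the TD(0) update (standard form from Sutton & Barto):
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
The quantity in brackets is the one-step TD error: the difference between the bootstrapped target r_{t+1} + γ V(s_{t+1}) and the current estimate V(s_t).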
TD methods bootstrap and sample • Bootstrapping: update involves an estimate • MC does not bootstrap • DP bootstraps • TD bootstraps • Sampling: update does not involve an expected value • MC samples • DP does not sample • TD samples
Sarsa: On-Policy TD Control Turn TD into a control method by always updating the policy to be (epsilon-)greedy with respect to the current estimate. S A R S A: State Action Reward State Action
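The Sarsa update applied after each transition (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), in its standard form:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
Because a_{t+1} is the action actually selected by the (epsilon-)greedy behavior policy, Sarsa evaluates and improves the policy it is following, which makes it on-policy.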
Windy Gridworld undiscounted, episodic, reward = –1 until goal
Cliffwalking • ε-greedy, ε = 0.1 • Q-learning learns values for the optimal path along the cliff edge but occasionally falls in because of exploration, while Sarsa takes the exploration into account and learns the longer, safer path, giving better online performance
Summary one-step tabular TD methods • Introduced one-step tabular model-free TD methods • Extend prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods
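For reference, the Q-learning update (standard form) differs from Sarsa only in the target, which uses the greedy action in the next state rather than the action actually taken:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]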
End of first part of the course • This concludes the material for the midterm exam (chapters 1-6 of Sutton & Barto) • Midterm exam (deeltoets): 25 March, 9:00-12:00, building B/B-B (Nieuwe Achtergracht 166, exam hall)
The Book • Part I: The Problem • Introduction • Evaluative Feedback • The Reinforcement Learning Problem • Part II: Elementary Solution Methods • Dynamic Programming • Monte Carlo Methods • Temporal Difference Learning • Part III: A Unified View • Eligibility Traces • Generalization and Function Approximation • Planning and Learning • Dimensions of Reinforcement Learning • Case Studies
TD(λ) • New variable called the eligibility trace, e(s) • On each step, decay all traces by γλ and increment the trace for the current state by 1 • This is the accumulating trace
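Written out, the accumulating-trace update applied on every time step (standard form):
e_t(s) = γλ e_{t−1}(s)          for all s ≠ s_t
e_t(s_t) = γλ e_{t−1}(s_t) + 1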
Standard one-step TD • λ = 0 gives standard one-step TD learning: TD(0)
Eligibility traces: backward view • Shout the TD error δ_t backwards over time • The strength of your voice decreases with temporal distance by a factor γλ per step
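In the backward view every state is updated in proportion to its current eligibility, using the one-step TD error (standard TD(λ) form):
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
V(s) ← V(s) + α δ_t e_t(s)          for all s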
Control: Sarsa(λ) • Keep eligibility traces for state-action pairs instead of just for states
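A minimal sketch of tabular Sarsa(λ) with accumulating traces, in Python. The env object and its reset()/step(a) interface (returning a state index, and a (next_state, reward, done) tuple respectively) are illustrative assumptions, not part of the lecture material:

import numpy as np

def sarsa_lambda(env, n_states, n_actions, n_episodes=500,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating eligibility traces."""
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # Behave epsilon-greedily with respect to the current estimate Q
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)          # traces are reset at the start of each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next) if not done else None
            target = r if done else r + gamma * Q[s_next, a_next]
            delta = target - Q[s, a]  # one-step TD error
            e[s, a] += 1.0            # accumulating trace (set to 1.0 for replacing traces)
            Q += alpha * delta * e    # update all state-action pairs by their eligibility
            e *= gamma * lam          # decay all traces
            s, a = s_next, a_next
    return Q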
Sarsa(λ) Gridworld Example • After one trial, the agent has much more information about how to get to the goal (though not necessarily along the best path) • Eligibility traces can considerably accelerate learning
Replacing Traces • Using accumulating traces, frequently visited states can have eligibilities greater than 1 • This can be a problem for convergence • Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1
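With replacing traces the per-step update becomes (standard form):
e_t(s) = γλ e_{t−1}(s)          for all s ≠ s_t
e_t(s_t) = 1
so a trace never exceeds 1, no matter how often its state is revisited.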
Replacing Traces Example • Same 19-state random walk task as before • Replacing traces perform better than accumulating traces over more values of λ
Overview of this lecture • Solving the full RL problem • Given that we do not have the MDP model • Temporal Difference (TD) methods • Making TD methods more efficient with eligibility traces • Making TD methods more efficient with function approximation
Generalization and Function Approximation • Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part • Overview of function approximation (FA) methods and how they can be adapted to RL
Generalization illustration • [figure: a lookup table with one value entry per state s_1 … s_N, versus a generalizing function approximator trained on only part of the state space ("train here")]
Generalization illustration cont. So with function approximation, a single value update affects a larger region of the state space
Value Prediction with FA Before, value functions were stored in lookup tables.
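Now the value function is a parameterized function of the state, e.g.
V_t(s) = V(s; θ_t) ≈ V^π(s)
where θ_t is a parameter vector with (typically) far fewer components than there are states, so adjusting θ_t changes the estimated values of many states at once.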
Adapt Supervised Learning Algorithms • [diagram: inputs → supervised learning system → outputs, trained with training info = desired (target) outputs] • Training example = {input (state), target output} • Error = (target output − actual output)
Backups as Training Examples • Each backup can be treated as a conventional training example: the input is the (description of the) state being backed up, and the target output is the backed-up value
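For example, a TD(0) backup yields the training example (standard form): input = (a description of) s_t, target output = r_{t+1} + γ V_t(s_{t+1}).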
Any FA Method? • In principle, yes: • artificial neural networks • decision trees • multivariate regression methods • etc. • But RL has some special requirements: • usually want to learn while interacting • ability to handle “moving targets”
Performance Measures • Many are applicable but… • a common and simple one is the mean-squared error (MSE) over a distribution P : • P is the distribution of states at which backups are done. • The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
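In its standard form:
MSE(θ_t) = Σ_s P(s) [ V^π(s) − V_t(s) ]²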
Gradient Descent Iteratively move down the gradient:
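θ_{t+1} = θ_t − ½ α ∇_{θ_t} MSE(θ_t)
where ∇_{θ_t} denotes the vector of partial derivatives with respect to the components of the parameter vector θ_t, and α is a positive step-size parameter.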
Gradient Descent Cont. For the MSE given above and using the chain rule:
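θ_{t+1} = θ_t + α Σ_s P(s) [ V^π(s) − V_t(s) ] ∇_{θ_t} V_t(s)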
Gradient Descent Cont. Use just the sample gradient instead. Since each sample gradient is an unbiased estimate of the true gradient (provided the target v_t is an unbiased estimate of V^π(s_t), e.g. a Monte Carlo return), this converges to a local minimum of the MSE if α decreases appropriately with t. The sample-gradient update:
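θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_{θ_t} V_t(s_t)
where v_t is the target value used for the visited state s_t.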
Nice Properties of Linear FA Methods • The gradient is very simple (see below) • Linear gradient-descent TD(λ) converges if: • the step size decreases appropriately • states are sampled on-line from the on-policy distribution • It converges to a parameter vector whose error is bounded relative to the best parameter vector (Tsitsiklis & Van Roy, 1997)
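For a linear approximator V_t(s) = θ_tᵀ φ_s the gradient is simply the feature vector, ∇_{θ_t} V_t(s) = φ_s, and the asymptotic error is bounded in terms of the best achievable error (the bound of Tsitsiklis & Van Roy, 1997, as stated in Sutton & Barto):
MSE(θ_∞) ≤ [ (1 − γλ) / (1 − γ) ] MSE(θ*)
where θ* is the minimum-MSE parameter vector.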
Control with FA • Learning state-action values • Training examples now of the form {description of (s_t, a_t), target v_t} • The general gradient-descent rule and gradient-descent Sarsa(λ) (backward view) are written out below
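In their standard forms:
General gradient-descent rule:   θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_{θ_t} Q_t(s_t, a_t)
Gradient-descent Sarsa(λ), backward view:
θ_{t+1} = θ_t + α δ_t e_t
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
e_t = γλ e_{t−1} + ∇_{θ_t} Q_t(s_t, a_t)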
States as feature vectors But how should the state features be constructed?
Tile Coding • Binary feature for each tile • Number of features present at any one time is constant • Binary features mean the weighted sum is easy to compute • Easy to compute the indices of the features present
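A minimal sketch of uniform grid tile coding for a two-dimensional state, in Python. The function name, the (x, y) state representation, and the grid layout are illustrative assumptions; the lecture's CMAC-style tilings may differ:

import numpy as np

def active_tiles(x, y, n_tilings=8, tiles_per_dim=10,
                 x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Return the indices of the binary features (tiles) active for the point (x, y).

    Each of the n_tilings grids is offset by a fraction of a tile width,
    so exactly n_tilings features are active for any input.
    """
    indices = []
    x_scaled = (x - x_range[0]) / (x_range[1] - x_range[0]) * tiles_per_dim
    y_scaled = (y - y_range[0]) / (y_range[1] - y_range[0]) * tiles_per_dim
    tiles_per_tiling = (tiles_per_dim + 1) ** 2
    for t in range(n_tilings):
        offset = t / n_tilings                    # shift each tiling by a fraction of a tile
        col = int(np.floor(x_scaled + offset))
        row = int(np.floor(y_scaled + offset))
        col = min(max(col, 0), tiles_per_dim)     # clip to the (padded) grid
        row = min(max(row, 0), tiles_per_dim)
        indices.append(t * tiles_per_tiling + row * (tiles_per_dim + 1) + col)
    return indices

# Because the features are binary, the approximate value is just a sum of weights:
# V(s) = sum(theta[i] for i in active_tiles(x, y))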
Tile Coding Cont. • Irregular tilings • CMAC: "Cerebellar model arithmetic computer" (Albus, 1971)