Decision Making in Intelligent Systems, Lecture 3 BSc course Kunstmatige Intelligentie 2008 Bram Bakker Intelligent Systems Lab Amsterdam Informatics Institute Universiteit van Amsterdam bram@science.uva.nl
Overview of this lecture • Solving the full RL problem • Given that we have the MDP model • Dynamic Programming • Policy iteration • Value iteration
Markov Decision Processes • The agent-environment interaction produces a trajectory of states, actions, and rewards: $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
Returns: A Unified Notation • In episodic tasks, we number the time steps of each episode starting from zero. • Think of each episode as ending in an absorbing state that always produces a reward of zero. • We can then cover both episodic and continuing tasks by writing the return as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma = 1$ is allowed only if every episode reaches the absorbing (terminal) state.
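As a quick illustration of this return, here is a minimal Python sketch (not from the slides); the function name and the reward values are made up purely for illustration.

```python
# Minimal sketch: the discounted return R_t = sum_k gamma^k * r_{t+k+1}
# for a finite reward sequence. An absorbing state contributes only zeros,
# so truncating the sum is harmless for episodic tasks.

def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over the sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, -2.0, 5.0]          # hypothetical rewards r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, 0.9))   # 1.0 + 0.0 - 1.62 + 3.645 = 3.025
```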
Value Functions • The value of a state is the expected return starting from that state; it depends on the agent's policy: $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}$ • The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$: $Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \}$
Bellman Equation for a Policy $\pi$ • The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$ • So: $V^{\pi}(s) = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$ • Or, without the expectation operator: $V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$
Bellman Optimality Equation for $V^*$ • The value of a state under an optimal policy must equal the expected return for the best action from that state: $V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$ • $V^*$ is the unique solution of this system of nonlinear equations.
Dynamic Programming (DP) • A collection of classical solution methods for MDPs • Policy iteration • Value iteration • DP can be used to compute value functions, and hence, optimal policies • Assumes a known MDP model (state transition model and reward model) • Combination of Policy Evaluation and Policy Improvement
Policy Evaluation • Policy evaluation: for a given policy $\pi$, compute the state-value function $V^{\pi}$ • Recall the Bellman equation: $V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$
Iterative Methods • A "sweep" consists of applying a backup operation to each state, and results in a new value function for iteration $k+1$. • The full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$
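To make the sweep concrete, here is a minimal Python sketch of iterative policy evaluation. It assumes a tabular model stored as `P[s][a] = [(prob, next_state, reward), ...]` and a stochastic policy `pi[s][a]`; this data layout and all names are illustrative assumptions, not code from the course.

```python
# Sketch of iterative policy evaluation with full backups: each sweep builds a
# new value function V_{k+1} from the previous one, V_k.

def policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    V = {s: 0.0 for s in P}                      # V_0 = 0 for all nonterminal states
    while True:
        delta = 0.0
        new_V = {}
        for s in P:                              # one full sweep over the state set
            v = 0.0
            for a, prob_a in pi[s].items():
                for p, s2, r in P[s][a]:
                    # terminal states are absent from P and have value 0
                    v += prob_a * p * (r + gamma * V.get(s2, 0.0))
            new_V[s] = v
            delta = max(delta, abs(v - V[s]))
        V = new_V
        if delta < theta:                        # stop when the sweep changes little
            return V
```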
Bootstrapping • In the full policy-evaluation backup above, the new estimated value for each state $s$ is based on the old estimated values of all possible successor states $s'$. • Bootstrapping: estimating values based on your own (current) estimates of values.
A Small Gridworld • An undiscounted, episodic task • Nonterminal states: 1, 2, . . ., 14; terminal states shown as shaded squares • Actions that would take the agent off the grid leave the state unchanged • Reward is –1 until a terminal state is reached
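A sketch of how this gridworld could be encoded in the same `P[s][a]` format used in the policy-evaluation sketch above; the cell numbering (0 to 15, with the shaded corners taken to be cells 0 and 15) is an assumption made for illustration.

```python
# Hypothetical encoding of the 4x4 gridworld: reward -1 per step, actions that
# would leave the grid keep the state unchanged, corner cells are terminal.

ACTIONS = {'up': -4, 'down': 4, 'left': -1, 'right': 1}
TERMINAL = {0, 15}                       # assumed positions of the shaded squares

def next_state(s, a):
    if a == 'up' and s < 4:          return s    # top row: can't move up
    if a == 'down' and s > 11:       return s    # bottom row: can't move down
    if a == 'left' and s % 4 == 0:   return s    # left column
    if a == 'right' and s % 4 == 3:  return s    # right column
    return s + ACTIONS[a]

P = {s: {a: [(1.0, next_state(s, a), -1.0)] for a in ACTIONS}
     for s in range(16) if s not in TERMINAL}

# Equiprobable random policy; policy_evaluation(P, pi, gamma=1.0) from the
# sketch above should converge to V for the random policy on this gridworld.
pi = {s: {a: 0.25 for a in ACTIONS} for s in P}
```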
Policy Improvement • Suppose we have computed $V^{\pi}$ for a deterministic policy $\pi$. • For a given state $s$, would it be better to take an action $a \neq \pi(s)$? Check the one-step lookahead value $Q^{\pi}(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$; if $Q^{\pi}(s,a) > V^{\pi}(s)$, it is better to switch to $a$ in state $s$.
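The improvement step is just this one-step lookahead applied everywhere. A minimal sketch under the same assumed `P[s][a]` model and a value dict `V` as in the sketches above:

```python
# Sketch of greedy policy improvement ("greedification") with respect to V.

def q_from_v(P, V, s, gamma=1.0):
    """One-step lookahead: Q(s,a) = sum_s' P(s'|s,a) [r + gamma * V(s')]."""
    return {a: sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
            for a in P[s]}

def greedy_policy(P, V, gamma=1.0):
    """Deterministic improved policy: pi'(s) = argmax_a Q(s,a)."""
    policy = {}
    for s in P:
        q = q_from_v(P, V, s, gamma)
        policy[s] = max(q, key=q.get)
    return policy
```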
Policy Iteration • Alternate the two processes: $\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^* \rightarrow V^*$ • Policy evaluation: compute $V^{\pi}$ for the current policy • Policy improvement ("greedification"): make the policy greedy with respect to the current value function
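Putting the two steps together, a sketch of the policy-iteration loop; it reuses the `policy_evaluation` and `greedy_policy` sketches above, so it shares their assumptions and is not the course's reference implementation.

```python
# Sketch of policy iteration: evaluate, greedify, repeat until the policy is stable.

def policy_iteration(P, gamma=1.0):
    # Start from the equiprobable random policy.
    pi = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(P, pi, gamma)        # policy evaluation
        greedy = greedy_policy(P, V, gamma)        # policy improvement
        new_pi = {s: {a: (1.0 if a == greedy[s] else 0.0) for a in P[s]}
                  for s in P}
        if new_pi == pi:                           # policy stable: it is optimal
            return greedy, V
        pi = new_pi
```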
Value Iteration • Recall the full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$ • Here is the full value-iteration backup: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$ • Essentially, it combines policy evaluation and policy improvement in one step.
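A corresponding sketch of value iteration with the same assumed model: the backup takes a max over actions instead of an expectation under the policy, and a greedy policy is read off at the end.

```python
# Sketch of value iteration: max-backup each state, then extract a greedy policy.

def value_iteration(P, gamma=1.0, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            q = [sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                 for a in P[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                            # update in place
        if delta < theta:
            # Greedy policy with respect to the (near-)optimal value function
            pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0))
                                                 for p, s2, r in P[s][a]))
                  for s in P}
            return pi, V
```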
Dynamic Programming visualized [figure: backup diagram, terminal states marked T]
Monte Carlo (for comparison) [figure: backup diagram, terminal states marked T]
Asynchronous DP • All the DP methods described so far require exhaustive sweeps of the entire state set. • Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup. • Still needs lots of computation, but does not get locked into hopelessly long sweeps. • Can you select states to back up intelligently? YES: an agent’s experience can act as a guide.
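A sketch of an asynchronous variant under the same assumed model: back up one randomly chosen state at a time instead of sweeping. The stopping rule (no noticeable change over a window of consecutive backups) is an illustrative heuristic, not a formal convergence test.

```python
# Sketch of asynchronous value iteration: single-state backups in random order.

import random

def async_value_iteration(P, gamma=1.0, theta=1e-8, window=10_000):
    V = {s: 0.0 for s in P}
    states = list(P)
    quiet_backups = 0                              # consecutive near-zero updates
    while quiet_backups < window:
        s = random.choice(states)                  # pick a state at random
        best = max(sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                   for a in P[s])
        quiet_backups = quiet_backups + 1 if abs(best - V[s]) < theta else 0
        V[s] = best
    return V
```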
Generalized Policy Iteration • Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. • A geometric metaphor for the convergence of GPI: evaluation and improvement pull the value function and the policy toward different targets, but the joint process converges to $V^*$ and $\pi^*$.
Efficiency of DP • Finding an optimal policy with DP is polynomial in the number of states… • BUT the number of states is often astronomical, e.g., it often grows exponentially with the number of state variables (what Bellman called “the curse of dimensionality”). • In practice, classical DP can be applied to problems with up to a few million states. • Asynchronous DP can be applied to larger problems, and is well suited to parallel computation. • It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary • Policy evaluation: estimate value of current policy • Policy improvement: determine new current policy by being greedy w.r.t. current value function • Policy iteration: alternate the above two processes • Value iteration: combine the above two processes in 1 step • Bootstrapping: updating estimates based on your own estimates • Full backups (to be contrasted with sample backups) • Generalized Policy Iteration (GPI)
Before we start… • Questions? Are some concepts still unclear? • Are you making progress with the practical assignments (practicum)? • Advice: I do not cover everything from the book in detail. Issues and algorithms that I emphasize in the lectures are the most important (also for the exams).
Next Class • Lecture next Monday: • Chapters 5 (& 6?) of Sutton & Barto