Decision Making in Intelligent Systems, Lecture 3 BSc course Kunstmatige Intelligentie 2008 Bram Bakker Intelligent Systems Lab Amsterdam Informatics Institute Universiteit van Amsterdam bram@science.uva.nl
Overview of this lecture • Solving the full RL problem • Given that we have the MDP model • Dynamic Programming • Policy iteration • Value iteration
Markov Decision Processes • The agent-environment interaction produces a trajectory of states, actions, and rewards: $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
Returns: A Unified Notation • In episodic tasks, we number the time steps of each episode starting from zero. • Think of each episode as ending in an absorbing state that always produces a reward of zero. • We can then cover both episodic and continuing tasks by writing the return as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma = 1$ is allowed only if every episode reaches the absorbing (terminal) state.
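As a quick illustration of this return, here is a minimal Python sketch (not from the slides); the function name and the reward values are made up purely for illustration.

```python
# Minimal sketch: the discounted return R_t = sum_k gamma^k * r_{t+k+1}
# for a finite reward sequence. An absorbing state contributes only zeros,
# so truncating the sum is harmless for episodic tasks.

def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over the sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, -2.0, 5.0]          # hypothetical rewards r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, 0.9))   # 1.0 + 0.0 - 1.62 + 3.645 = 3.025
```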
Value Functions • The value of a state is the expected return starting from that state; it depends on the agent's policy: $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}$ • The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$: $Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \}$
Bellman Equation for a Policy $\pi$ • The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$ • So: $V^{\pi}(s) = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$ • Or, without the expectation operator: $V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$
Bellman Optimality Equation for $V^*$ • The value of a state under an optimal policy must equal the expected return for the best action from that state: $V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$ • $V^*$ is the unique solution of this system of nonlinear equations.
Dynamic Programming (DP) • A collection of classical solution methods for MDPs • Policy iteration • Value iteration • DP can be used to compute value functions, and hence, optimal policies • Assumes a known MDP model (state transition model and reward model) • Combination of Policy Evaluation and Policy Improvement
Policy Evaluation • Policy evaluation: for a given policy $\pi$, compute the state-value function $V^{\pi}$ • Recall the Bellman equation: $V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$
Iterative Methods • A "sweep" consists of applying a backup operation to each state, and results in a new value function for iteration $k+1$. • The full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$
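To make the sweep concrete, here is a minimal Python sketch of iterative policy evaluation. It assumes a tabular model stored as `P[s][a] = [(prob, next_state, reward), ...]` and a stochastic policy `pi[s][a]`; this data layout and all names are illustrative assumptions, not code from the course.

```python
# Sketch of iterative policy evaluation with full backups: each sweep builds a
# new value function V_{k+1} from the previous one, V_k.

def policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    V = {s: 0.0 for s in P}                      # V_0 = 0 for all nonterminal states
    while True:
        delta = 0.0
        new_V = {}
        for s in P:                              # one full sweep over the state set
            v = 0.0
            for a, prob_a in pi[s].items():
                for p, s2, r in P[s][a]:
                    # terminal states are absent from P and have value 0
                    v += prob_a * p * (r + gamma * V.get(s2, 0.0))
            new_V[s] = v
            delta = max(delta, abs(v - V[s]))
        V = new_V
        if delta < theta:                        # stop when the sweep changes little
            return V
```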
Bootstrapping • In the full policy-evaluation backup above, the new estimated value for each state $s$ is based on the old estimated values of all possible successor states $s'$. • Bootstrapping: estimating values based on your own (current) estimates of values.
A Small Gridworld • An undiscounted, episodic task • Nonterminal states: 1, 2, . . ., 14; terminal states shown as shaded squares • Actions that would take the agent off the grid leave the state unchanged • Reward is –1 until a terminal state is reached
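A sketch of how this gridworld could be encoded in the same `P[s][a]` format used in the policy-evaluation sketch above; the cell numbering (0 to 15, with the shaded corners taken to be cells 0 and 15) is an assumption made for illustration.

```python
# Hypothetical encoding of the 4x4 gridworld: reward -1 per step, actions that
# would leave the grid keep the state unchanged, corner cells are terminal.

ACTIONS = {'up': -4, 'down': 4, 'left': -1, 'right': 1}
TERMINAL = {0, 15}                       # assumed positions of the shaded squares

def next_state(s, a):
    if a == 'up' and s < 4:          return s    # top row: can't move up
    if a == 'down' and s > 11:       return s    # bottom row: can't move down
    if a == 'left' and s % 4 == 0:   return s    # left column
    if a == 'right' and s % 4 == 3:  return s    # right column
    return s + ACTIONS[a]

P = {s: {a: [(1.0, next_state(s, a), -1.0)] for a in ACTIONS}
     for s in range(16) if s not in TERMINAL}

# Equiprobable random policy; policy_evaluation(P, pi, gamma=1.0) from the
# sketch above should converge to V for the random policy on this gridworld.
pi = {s: {a: 0.25 for a in ACTIONS} for s in P}
```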
Policy Improvement • Suppose we have computed $V^{\pi}$ for a deterministic policy $\pi$. • For a given state $s$, would it be better to take an action $a \neq \pi(s)$? Check the one-step lookahead value $Q^{\pi}(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi}(s') \right]$; if $Q^{\pi}(s,a) > V^{\pi}(s)$, it is better to switch to $a$ in state $s$.
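The improvement step is just this one-step lookahead applied everywhere. A minimal sketch under the same assumed `P[s][a]` model and a value dict `V` as in the sketches above:

```python
# Sketch of greedy policy improvement ("greedification") with respect to V.

def q_from_v(P, V, s, gamma=1.0):
    """One-step lookahead: Q(s,a) = sum_s' P(s'|s,a) [r + gamma * V(s')]."""
    return {a: sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
            for a in P[s]}

def greedy_policy(P, V, gamma=1.0):
    """Deterministic improved policy: pi'(s) = argmax_a Q(s,a)."""
    policy = {}
    for s in P:
        q = q_from_v(P, V, s, gamma)
        policy[s] = max(q, key=q.get)
    return policy
```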
Policy Iteration • Alternate the two processes: $\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^* \rightarrow V^*$ • Policy evaluation: compute $V^{\pi}$ for the current policy • Policy improvement ("greedification"): make the policy greedy with respect to the current value function
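Putting the two steps together, a sketch of the policy-iteration loop; it reuses the `policy_evaluation` and `greedy_policy` sketches above, so it shares their assumptions and is not the course's reference implementation.

```python
# Sketch of policy iteration: evaluate, greedify, repeat until the policy is stable.

def policy_iteration(P, gamma=1.0):
    # Start from the equiprobable random policy.
    pi = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(P, pi, gamma)        # policy evaluation
        greedy = greedy_policy(P, V, gamma)        # policy improvement
        new_pi = {s: {a: (1.0 if a == greedy[s] else 0.0) for a in P[s]}
                  for s in P}
        if new_pi == pi:                           # policy stable: it is optimal
            return greedy, V
        pi = new_pi
```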
Value Iteration • Recall the full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$ • Here is the full value-iteration backup: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$ • Essentially, it combines policy evaluation and policy improvement in one step.
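A corresponding sketch of value iteration with the same assumed model: the backup takes a max over actions instead of an expectation under the policy, and a greedy policy is read off at the end.

```python
# Sketch of value iteration: max-backup each state, then extract a greedy policy.

def value_iteration(P, gamma=1.0, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            q = [sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                 for a in P[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                            # update in place
        if delta < theta:
            # Greedy policy with respect to the (near-)optimal value function
            pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0))
                                                 for p, s2, r in P[s][a]))
                  for s in P}
            return pi, V
```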
Dynamic Programming visualized [figure: backup diagram, terminal states marked T]
Monte Carlo (for comparison) [figure: backup diagram, terminal states marked T]
Asynchronous DP • All the DP methods described so far require exhaustive sweeps of the entire state set. • Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup. • Still needs lots of computation, but does not get locked into hopelessly long sweeps. • Can you select states to back up intelligently? YES: an agent’s experience can act as a guide.
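A sketch of an asynchronous variant under the same assumed model: back up one randomly chosen state at a time instead of sweeping. The stopping rule (no noticeable change over a window of consecutive backups) is an illustrative heuristic, not a formal convergence test.

```python
# Sketch of asynchronous value iteration: single-state backups in random order.

import random

def async_value_iteration(P, gamma=1.0, theta=1e-8, window=10_000):
    V = {s: 0.0 for s in P}
    states = list(P)
    quiet_backups = 0                              # consecutive near-zero updates
    while quiet_backups < window:
        s = random.choice(states)                  # pick a state at random
        best = max(sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                   for a in P[s])
        quiet_backups = quiet_backups + 1 if abs(best - V[s]) < theta else 0
        V[s] = best
    return V
```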
Generalized Policy Iteration • Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. • A geometric metaphor for the convergence of GPI: evaluation and improvement pull the value function and the policy toward different targets, but the joint process converges to $V^*$ and $\pi^*$.
Efficiency of DP • Finding an optimal policy with DP is polynomial in the number of states… • BUT the number of states is often astronomical, e.g., it often grows exponentially with the number of state variables (what Bellman called “the curse of dimensionality”). • In practice, classical DP can be applied to problems with up to a few million states. • Asynchronous DP can be applied to larger problems, and is well suited to parallel computation. • It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary • Policy evaluation: estimate value of current policy • Policy improvement: determine new current policy by being greedy w.r.t. current value function • Policy iteration: alternate the above two processes • Value iteration: combine the above two processes in 1 step • Bootstrapping: updating estimates based on your own estimates • Full backups (to be contrasted with sample backups) • Generalized Policy Iteration (GPI)
Before we start… • Questions? Are some concepts still unclear? • Are you making progress with the practical assignments (practicum)? • Advice: I do not cover everything from the book in detail. Issues and algorithms that I emphasize in the lectures are the most important (also for the exams).
Next Class • Lecture next Monday: • Chapters 5 (& 6?) of Sutton & Barto