1 / 30

CS 188: Artificial Intelligence Spring 2007

Learn how agents act to optimize rewards in MDPs, explore optimal policies, and understand stationarity and infinite utilities in RL. Discover the concepts of utility, Bellman's equation, policy iteration, and value iteration in reinforcement learning.

slarry
Download Presentation

CS 188: Artificial Intelligence Spring 2007

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 188: Artificial IntelligenceSpring 2007 Lecture 21:Reinforcement Learning: II MDP 4/12/2007 Srini Narayanan – ICSI and UC Berkeley

  2. Announcements • Othello tournament signup • Please send email to cs188@imail.berkeley.edu • HW on classification out • Due 4/23 • Can work in pairs

  3. Reinforcement Learning • Basic idea: • Receive feedback in the form of rewards • Agent’s utility is defined by the reward function • Must learn to act so as to maximize expected utility • Change the rewards, change the behavior • Examples: • Learning optimal paths • Playing a game, reward at the end for winning / losing • Vacuuming a house, reward for each piece of dirt picked up • Automated taxi, reward for each passenger delivered

  4. Recap: MDPs • Markov decision processes: • States S • Actions A • Transitions P(s’|s,a) (or T(s,a,s’)) • Rewards R(s,a,s’) • Start state s0 • Examples: • Gridworld, High-Low, N-Armed Bandit • Any process where the result of your action is stochastic • Goal: find the “best” policy  • Policies are maps from states to actions • What do we mean by “best”? • This is like search – it’s planning using a model, not actually interacting with the environment

  5. MDP Solutions • In deterministic single-agent search, want an optimal sequence of actions from start to a goal • In an MDP, like expectimax, want an optimal policy (s) • A policy gives an action for each state • Optimal policy maximizes expected utility (i.e. expected rewards) if followed • Defines a reflex agent Optimal policy when R(s, a, s’) = -0.04 for all non-terminals s

  6. Example Optimal Policies R(s) = -0.01 R(s) = -0.03 R(s) = -0.4 R(s) = -2.0

  7. Stationarity • In order to formalize optimality of a policy, need to understand utilities of reward sequences • Typically consider stationary preferences: • Theorem: only two ways to define stationary utilities • Additive utility: • Discounted utility:

  8. Infinite Utilities?! • Problem: infinite state sequences with infinite rewards • Solutions: • Finite horizon: • Terminate after a fixed T steps • Gives nonstationary policy ( depends on time left) • Absorbing state(s): guarantee that for every policy, agent will eventually “die” (like “done” for High-Low) • Discounting: for 0 <  < 1 • Smaller  means smaller horizon

  9. How (Not) to Solve an MDP • The inefficient way: • Enumerate policies • For each one, calculate the expected utility (discounted rewards) from the start state • E.g. by simulating a bunch of runs • Choose the best policy • We’ll return to a (better) idea like this later

  10. Utility of a State • Define the utility of a state under a policy: V(s) = expected total (discounted) rewards starting in s and following  • Recursive definition (one-step look-ahead):

  11. Policy Evaluation • Idea one: turn recursive equations into updates • Idea two: it’s just a linear system, solve with Matlab (or Mosek, or Cplex)

  12. Example: High-Low • Policy: always say “high” • Iterative updates:

  13. Optimal Utilities • Goal: calculate the optimal utility of each state V*(s) = expected (discounted) rewards with optimal actions • Why: Given optimal utilities, MEU tells us the optimal policy

  14. That’s my equation! Bellman’s Equation for Selecting actions • Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy Formally: Bellman’s Equation

  15. Example: GridWorld

  16. Value Iteration • Idea: • Start with bad guesses at all utility values (e.g. V0(s) = 0) • Update all values simultaneously using the Bellman equation (called a value update or Bellman update): • Repeat until convergence • Theorem: will converge to unique optimal values • Basic idea: bad guesses get refined towards optimal values • Policy may converge long before values do

  17. Example: Bellman Updates

  18. Example: Value Iteration • Information propagates outward from terminal states and eventually all states have correct value estimates [DEMO]

  19. Convergence* • Define the max-norm: • Theorem: For any two approximations U and V (any two utility vectors) • I.e. any distinct approximations must get closer to each other (after the Bellman update), so, in particular, any approximation must get closer to the true U (Bellman update is U) and value iteration converges to a unique, stable, optimal solution • Theorem: • I.e. once the change in our approximation is small, it must also be close to correct

  20. Policy Iteration • Alternate approach: • Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture) • Policy improvement: update policy based on resulting converged utilities • Repeat until policy converges • This is policy iteration • Can converge faster under some conditions

  21. Policy Iteration • If we have a fixed policy , use simplified Bellman equation to calculate utilities: • For fixed utilities, easy to find the best action according to one-step look-ahead

  22. Comparison • In value iteration: • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • In policy iteration: • Several passes to update utilities with frozen policy • Occasional passes to update policies • Hybrid approaches (asynchronous policy iteration): • Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

  23. Reinforcement Learning • Reinforcement learning: • Still have an MDP: • A set of states s  S • A model T(s,a,s’) • A reward function R(s) • Still looking for a policy (s) • New twist: don’t know T or R • I.e. don’t know which states are good or what the actions do • Must actually try actions and states out to learn

  24. Example: Animal Learning • RL studied experimentally for more than 60 years in psychology • Rewards: food, pain, hunger, drugs, etc. • Mechanisms and sophistication debated • Example: foraging • Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies • Bees have a direct neural connection from nectar intake measurement to motor planning area

  25. Example: Backgammon • Reward only for win / loss in terminal states, zero otherwise • TD-Gammon learns a function approximation to U(s) using a neural network • Combined with depth 3 search, one of the top 3 players in the world

  26. Passive Learning • Simplified task • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You are given a policy (s) • Goal: learn the state values (and maybe the model) • In this case: • No choice about what actions to take • Just execute the policy and learn from experience • We’ll get to the general case soon

  27. Example: Direct Estimation y • Episodes: +100 (1,1) up -1 (1,2) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right +100 (done) (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) right -1 (4,2) right -100 (done) -100 x  = 1, R = -1 U(1,1) ~ (93 + -105) / 2 = -6 U(3,3) ~ (100 + 98 + -101) / 3 = 32.3

  28. Model-Based Learning • Idea: • Learn the model empirically (rather than values) • Solve the MDP as if the learned model were correct • Empirical model learning • Simplest case: • Count outcomes for each s,a • Normalize to give estimate of T(s,a,s’) • Discover R(s,a,s’) the first time we experience (s,a,s’) • More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)

  29. Example: Model-Based Learning y • Episodes: +100 (1,1) up -1 (1,2) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right +100 (done) (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) right -1 (4,2) right -100 (done) -100 x  = 1 T(<3,3>, right, <4,3>) = 1 / 3 T(<2,3>, right, <3,3>) = 2 / 2

  30. Model-Based Learning • In general, want to learn the optimal policy, not evaluate a fixed policy • Idea: adaptive dynamic programming • Learn an initial model of the environment: • Solve for the optimal policy for this model (value or policy iteration) • Refine model through experience and repeat • Crucial: we have to make sure we actually learn about all of the model

More Related