300 likes | 441 Views
CS 188: Artificial Intelligence Spring 2007. Lecture 21:Reinforcement Learning: II MDP 4/12/2007. Srini Narayanan – ICSI and UC Berkeley. Announcements. Othello tournament signup Please send email to cs188@imail.berkeley.edu HW on classification out Due 4/23 Can work in pairs.
E N D
CS 188: Artificial IntelligenceSpring 2007 Lecture 21:Reinforcement Learning: II MDP 4/12/2007 Srini Narayanan – ICSI and UC Berkeley
Announcements • Othello tournament signup • Please send email to cs188@imail.berkeley.edu • HW on classification out • Due 4/23 • Can work in pairs
Reinforcement Learning • Basic idea: • Receive feedback in the form of rewards • Agent’s utility is defined by the reward function • Must learn to act so as to maximize expected utility • Change the rewards, change the behavior • Examples: • Learning optimal paths • Playing a game, reward at the end for winning / losing • Vacuuming a house, reward for each piece of dirt picked up • Automated taxi, reward for each passenger delivered
Recap: MDPs • Markov decision processes: • States S • Actions A • Transitions P(s’|s,a) (or T(s,a,s’)) • Rewards R(s,a,s’) • Start state s0 • Examples: • Gridworld, High-Low, N-Armed Bandit • Any process where the result of your action is stochastic • Goal: find the “best” policy • Policies are maps from states to actions • What do we mean by “best”? • This is like search – it’s planning using a model, not actually interacting with the environment
MDP Solutions • In deterministic single-agent search, want an optimal sequence of actions from start to a goal • In an MDP, like expectimax, want an optimal policy (s) • A policy gives an action for each state • Optimal policy maximizes expected utility (i.e. expected rewards) if followed • Defines a reflex agent Optimal policy when R(s, a, s’) = -0.04 for all non-terminals s
Example Optimal Policies R(s) = -0.01 R(s) = -0.03 R(s) = -0.4 R(s) = -2.0
Stationarity • In order to formalize optimality of a policy, need to understand utilities of reward sequences • Typically consider stationary preferences: • Theorem: only two ways to define stationary utilities • Additive utility: • Discounted utility:
Infinite Utilities?! • Problem: infinite state sequences with infinite rewards • Solutions: • Finite horizon: • Terminate after a fixed T steps • Gives nonstationary policy ( depends on time left) • Absorbing state(s): guarantee that for every policy, agent will eventually “die” (like “done” for High-Low) • Discounting: for 0 < < 1 • Smaller means smaller horizon
How (Not) to Solve an MDP • The inefficient way: • Enumerate policies • For each one, calculate the expected utility (discounted rewards) from the start state • E.g. by simulating a bunch of runs • Choose the best policy • We’ll return to a (better) idea like this later
Utility of a State • Define the utility of a state under a policy: V(s) = expected total (discounted) rewards starting in s and following • Recursive definition (one-step look-ahead):
Policy Evaluation • Idea one: turn recursive equations into updates • Idea two: it’s just a linear system, solve with Matlab (or Mosek, or Cplex)
Example: High-Low • Policy: always say “high” • Iterative updates:
Optimal Utilities • Goal: calculate the optimal utility of each state V*(s) = expected (discounted) rewards with optimal actions • Why: Given optimal utilities, MEU tells us the optimal policy
That’s my equation! Bellman’s Equation for Selecting actions • Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy Formally: Bellman’s Equation
Value Iteration • Idea: • Start with bad guesses at all utility values (e.g. V0(s) = 0) • Update all values simultaneously using the Bellman equation (called a value update or Bellman update): • Repeat until convergence • Theorem: will converge to unique optimal values • Basic idea: bad guesses get refined towards optimal values • Policy may converge long before values do
Example: Value Iteration • Information propagates outward from terminal states and eventually all states have correct value estimates [DEMO]
Convergence* • Define the max-norm: • Theorem: For any two approximations U and V (any two utility vectors) • I.e. any distinct approximations must get closer to each other (after the Bellman update), so, in particular, any approximation must get closer to the true U (Bellman update is U) and value iteration converges to a unique, stable, optimal solution • Theorem: • I.e. once the change in our approximation is small, it must also be close to correct
Policy Iteration • Alternate approach: • Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture) • Policy improvement: update policy based on resulting converged utilities • Repeat until policy converges • This is policy iteration • Can converge faster under some conditions
Policy Iteration • If we have a fixed policy , use simplified Bellman equation to calculate utilities: • For fixed utilities, easy to find the best action according to one-step look-ahead
Comparison • In value iteration: • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • In policy iteration: • Several passes to update utilities with frozen policy • Occasional passes to update policies • Hybrid approaches (asynchronous policy iteration): • Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
Reinforcement Learning • Reinforcement learning: • Still have an MDP: • A set of states s S • A model T(s,a,s’) • A reward function R(s) • Still looking for a policy (s) • New twist: don’t know T or R • I.e. don’t know which states are good or what the actions do • Must actually try actions and states out to learn
Example: Animal Learning • RL studied experimentally for more than 60 years in psychology • Rewards: food, pain, hunger, drugs, etc. • Mechanisms and sophistication debated • Example: foraging • Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies • Bees have a direct neural connection from nectar intake measurement to motor planning area
Example: Backgammon • Reward only for win / loss in terminal states, zero otherwise • TD-Gammon learns a function approximation to U(s) using a neural network • Combined with depth 3 search, one of the top 3 players in the world
Passive Learning • Simplified task • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You are given a policy (s) • Goal: learn the state values (and maybe the model) • In this case: • No choice about what actions to take • Just execute the policy and learn from experience • We’ll get to the general case soon
Example: Direct Estimation y • Episodes: +100 (1,1) up -1 (1,2) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right +100 (done) (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) right -1 (4,2) right -100 (done) -100 x = 1, R = -1 U(1,1) ~ (93 + -105) / 2 = -6 U(3,3) ~ (100 + 98 + -101) / 3 = 32.3
Model-Based Learning • Idea: • Learn the model empirically (rather than values) • Solve the MDP as if the learned model were correct • Empirical model learning • Simplest case: • Count outcomes for each s,a • Normalize to give estimate of T(s,a,s’) • Discover R(s,a,s’) the first time we experience (s,a,s’) • More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
Example: Model-Based Learning y • Episodes: +100 (1,1) up -1 (1,2) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right +100 (done) (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) right -1 (4,2) right -100 (done) -100 x = 1 T(<3,3>, right, <4,3>) = 1 / 3 T(<2,3>, right, <3,3>) = 2 / 2
Model-Based Learning • In general, want to learn the optimal policy, not evaluate a fixed policy • Idea: adaptive dynamic programming • Learn an initial model of the environment: • Solve for the optimal policy for this model (value or policy iteration) • Refine model through experience and repeat • Crucial: we have to make sure we actually learn about all of the model