Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
Reading
• Today: Reinforcement Learning
• Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
• Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
• For Markov Decision Processes
• Read 1st/2nd ed. AIMA Chapter 17, sections 1-4.
• Optional reading: "Planning and Acting in Partially Observable Stochastic Domains," by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99-134.
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Q values • Q learning • Multi-step backups • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon.
Situations: • Board configurations (~10^20)
Actions: • Moves
Rewards: • +100 if win • -100 if lose • 0 for all other states
• Trained by playing 1.5 million games against itself.
• Currently roughly equal to the best human player.
Reinforcement Learning Problem
[Figure: agent-environment loop. The agent observes state s_t and reward r_t from the environment and executes action a_t, producing the sequence s0 -a0/r0-> s1 -a1/r1-> s2 -a2/r2-> s3 ...]
Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn action policy π: S → A that
• Maximizes life reward r0 + γr1 + γ²r2 + … from any start state.
• Discount: 0 < γ < 1
Note:
• Unsupervised learning
• Delayed reward
Goal: Learn to choose actions that maximize life reward r0 + γr1 + γ²r2 + …
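As a concrete illustration of the discounted life reward above, here is a minimal Python sketch (not from the original slides) that sums r0 + γr1 + γ²r2 + … for a finite reward sequence; the function name `discounted_return` is just an illustrative choice.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted life reward r0 + gamma*r1 + gamma^2*r2 + ... for a finite episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A reward of 100 received three steps in the future is worth 0.9**3 * 100, i.e. about 72.9, now.
print(discounted_return([0, 0, 0, 100], gamma=0.9))
```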
How About Learning the Policy Directly?
• π*: S → A
• Fill out table entries for π* by collecting statistics on training pairs <s, a*>.
• Where does a* come from?
How About Learning the Value Function?
• Have the agent learn the value function V^π*, denoted V*.
• Given the learned V*, the agent selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Problem:
• Works well if the agent knows the environment model:
  • δ: S × A → S
  • r: S × A → ℝ
• With no model, the agent can't choose an action from V*.
• With a model, V* could be computed via value iteration, so why learn it?
How About Learning the Model as Well?
• Have the agent learn δ and r from statistics on training instances <s_t, a_t, r_{t+1}, s_{t+1}>.
• Compute V* by value iteration: V_{t+1}(s) ← max_a [r(s,a) + γ V_t(δ(s,a))]
• The agent selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Problem: A viable strategy for many problems, but…
• When do you stop learning the model and compute V*?
• It may take a long time to converge on the model.
• We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?
Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Key idea:
• Define a function that encapsulates V*, δ and r: Q(s,a) = r(s,a) + γ V*(δ(s,a))
• From the learned Q, the agent can choose an optimal action without knowing δ or r: π*(s) = argmax_a Q(s,a)
V = cumulative reward of being in s.
Q = cumulative reward of being in s and taking action a.
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Q values • Q learning • Multi-step backups • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
How Do We Learn Q?
Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))
Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <s_t, a_t, r_{t+1}, s_{t+1}>: Q(s_t,a_t) ← r_{t+1} + γ V*(s_{t+1})
How do we eliminate V*?
• Q and V* are closely related: V*(s) = max_a' Q(s,a')
• Substituting Q for V*: Q(s_t,a_t) ← r_{t+1} + γ max_a' Q(s_{t+1},a')
This update is called a backup.
Q-Learning for Deterministic Worlds
Let Q̂ denote the learner's current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s,a) ← 0.
• Observe the initial state s0.
Do for all time t:
• Select an action a_t and execute it.
• Receive immediate reward r_{t+1}.
• Observe the new state s_{t+1}.
• Update the table entry for Q̂(s_t,a_t) as follows: Q̂(s_t,a_t) ← r_{t+1} + γ max_a' Q̂(s_{t+1},a')
• s_t ← s_{t+1}
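A minimal Python sketch of this deterministic Q-learning loop, assuming hypothetical environment helpers `actions(s)`, `step(s, a)` (returns the next state and reward) and `is_goal(s)` that are not part of the slides:

```python
import random
from collections import defaultdict

def q_learning_deterministic(start_state, actions, step, is_goal,
                             gamma=0.9, episodes=1000):
    """Tabular Q-learning with the deterministic backup
    Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                      # every Q(s,a) starts at 0
    for _ in range(episodes):
        s = start_state
        while not is_goal(s):
            a = random.choice(actions(s))       # exploration policy (see the exploration slide)
            s_next, r = step(s, a)              # execute a, observe reward and new state
            future = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
            Q[(s, a)] = r + gamma * future      # deterministic backup
            s = s_next                          # move to the new state
    return Q
```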
Example: Q-Learning Update
[Figure: two grid states s1 and s2; the arrows leaving s2 carry current Q̂ values 63, 81, and 100, and the arrow from s1 to s2 carries Q̂ = 72; γ = 0.9, and 0 reward is received on the move.]
Example: Q-Learning Update (continued)
[Figure: same grid; after the backup, the arrow for a_right from s1 to s2 carries Q̂ = 90.]
Q̂(s1, a_right) ← r(s1, a_right) + γ max_a' Q̂(s2, a')
              = 0 + 0.9 · max{63, 81, 100}
              = 90
(γ = 0.9; 0 reward received on the move.)
Note: if rewards are non-negative:
• For all s, a, n: Q̂_n(s,a) ≤ Q̂_{n+1}(s,a)
• For all s, a, n: 0 ≤ Q̂_n(s,a) ≤ Q(s,a)
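The arithmetic of this backup is easy to check; a tiny Python sketch reproducing the slide's numbers:

```python
gamma = 0.9
q_s2 = [63, 81, 100]           # current Q-hat values on the actions leaving s2
print(0 + gamma * max(q_s2))   # 0 reward received, so the new Q(s1, a_right) is 90.0
```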
Q-Learning Iterations: Episodic
[Figure: six-state grid world with states s1, s2, s3 above and s6, s5, s4 below, plus a goal state G; the three transitions into G carry reward 10, all other rewards are 0.]
• Start at upper left – move clockwise; the table is initially 0; γ = 0.8.
• Update rule: Q̂(s,a) ← r + γ max_a' Q̂(s',a')
• Over successive episodes the reward propagates backwards from G: first the entries on transitions into G become 10, then their predecessors become 0.8 · 10 = 8, and so on.
Example Summary: Value Iteration and Q-Learning
[Figure: the same grid world shown three ways: the immediate rewards r(s,a), the learned Q̂(s,a) values (72, 81, 90, 100), the resulting V*(s) values (81, 90, 100, and 0 at G), and one optimal policy.]
Exploration vs. Exploitation
How do you pick actions as you learn?
• Greedy action selection:
  • Always select the action that currently looks best: π(s) = argmax_a Q̂(s,a)
• Probabilistic action selection:
  • The likelihood of a is proportional to its current Q̂ value:
  • P(a_i | s) = k^Q̂(s,a_i) / Σ_j k^Q̂(s,a_j)
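A small Python sketch of both selection rules, assuming a Q table keyed by (state, action) pairs; the base k in the probabilistic rule is a free parameter (larger k favors exploitation over exploration):

```python
import random

def greedy_action(Q, s, actions):
    """Always take the action that currently looks best."""
    return max(actions, key=lambda a: Q[(s, a)])

def probabilistic_action(Q, s, actions, k=2.0):
    """Pick a_i with probability proportional to k ** Q(s, a_i)."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]
```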
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Q values • Q learning • Multi-step backups • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
TD(λ): Temporal Difference Learning (from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997)
Q-learning: reduce the discrepancy between successive Q estimates.
One-step time difference: Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_{t+1},a)
Why not two steps? Q^(2)(s_t,a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2},a)
Or n? Q^(n)(s_t,a_t) = r_t + γ r_{t+1} + … + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n},a)
Blend all of these: Q^λ(s_t,a_t) = (1-λ) [Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + …]
Eligibility Traces
[Figure: visits to a data point <s,a> over time t, and the corresponding accumulating trace.]
Idea: Perform backups on the N previous data points as well as the most recent data point.
• Select which data to back up based on frequency of visitation.
• Bias towards frequent data by a geometric decay γ^(i-j).
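The slides describe the idea abstractly; as one concrete illustration (not the slides' own algorithm), here is a minimal Sarsa(λ) episode with accumulating eligibility traces, in which every previously visited (s,a) pair receives a share of each new TD error, decayed geometrically by γλ. The helpers `actions`, `step`, and `is_goal` are the same hypothetical environment interface used in the earlier Q-learning sketch.

```python
import random
from collections import defaultdict

def sarsa_lambda_episode(start_state, actions, step, is_goal, Q=None,
                         gamma=0.9, lam=0.7, alpha=0.1):
    """One episode of Sarsa(lambda) with accumulating traces."""
    Q = Q if Q is not None else defaultdict(float)
    e = defaultdict(float)                       # eligibility trace per (s, a)
    s = start_state
    a = random.choice(actions(s))
    while not is_goal(s):
        s2, r = step(s, a)
        a2 = random.choice(actions(s2)) if not is_goal(s2) else None
        target = r + (gamma * Q[(s2, a2)] if a2 is not None else 0.0)
        delta = target - Q[(s, a)]               # one-step TD error
        e[(s, a)] += 1.0                         # accumulate the trace on the visited pair
        for key in list(e):
            Q[key] += alpha * delta * e[key]     # multi-step backup through the trace
            e[key] *= gamma * lam                # geometric decay
        s, a = s2, a2
    return Q
```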
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs: • Value Iteration • Q Learning • Function Approximators • Model-based Learning • Summary
Nondeterministic MDPs
State transitions become probabilistic: δ(s,a,s')
Example:
[Figure: four states, S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia, with actions R (research path) and D (development path); the transitions carry probabilities 0.1, 0.9, and 1.0.]
Nondeterministic Case
• How do we redefine cumulative reward to handle non-determinism?
• Define V and Q based on expected values:
  V^π(s_t) = E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
  V^π(s_t) = E[ Σ_i γ^i r_{t+i} ]
  Q(s_t,a_t) = E[ r(s_t,a_t) + γ V*(δ(s_t,a_t)) ]
Value Iteration for Nondeterministic MDPs
V_1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Q_t(s,a) := r(s,a) + γ Σ_{s' in S} δ(s,a,s') V_{t-1}(s')
    end loop
    V_t(s) := max_a [Q_t(s,a)]
  end loop
until |V_t(s) - V_{t-1}(s)| < ε for all s in S
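A direct Python transcription of this pseudocode, assuming S and A are finite lists and that `delta(s, a, s2)` (transition probability) and `r(s, a)` (expected reward) are supplied as functions; these names are hypothetical, not part of the slides:

```python
def value_iteration(S, A, delta, r, gamma=0.9, epsilon=1e-6):
    """Value iteration for a nondeterministic MDP:
    Q(s,a) = r(s,a) + gamma * sum_s' delta(s,a,s') V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): r(s, a) + gamma * sum(delta(s, a, s2) * V[s2] for s2 in S)
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if all(abs(V_new[s] - V[s]) < epsilon for s in S):
            return V_new, Q
        V = V_new
```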
Q-Learning for Nondeterministic MDPs
Q*(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_a' [Q*(s',a')]
• Alter the training rule for the nondeterministic case:
  Q̂_n(s_t,a_t) ← (1 - α_n) Q̂_{n-1}(s_t,a_t) + α_n [r_{t+1} + γ max_a' Q̂_{n-1}(s_{t+1},a')]
  where α_n = 1/(1 + visits_n(s,a))
• Convergence of Q̂ can still be proved [Watkins and Dayan, 92].
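A sketch of this averaging update in Python, keeping the visit counts needed for α_n = 1/(1 + visits_n(s,a)); the surrounding loop (action selection, observing s_{t+1}) is omitted here:

```python
from collections import defaultdict

Q = defaultdict(float)          # current estimate Q_n(s, a)
visits = defaultdict(int)       # visits_n(s, a)

def nondeterministic_q_update(s, a, r_next, s_next, actions, gamma=0.9):
    """Q <- (1 - alpha_n) Q + alpha_n [r + gamma * max_a' Q(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r_next + gamma * max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```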
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Function Approximation
Function approximators:
• Backprop neural network
• Radial basis function network
• CMAC network
• Nearest neighbor, memory-based
• Decision tree
[Figure: block diagram; the state s and action a feed a function approximator that outputs Q(s,a), and targets or errors drive gradient-descent methods that adjust the approximator.]
Function Approximation Example: Adjusting Network Weights
Function approximator: Q(s,a) = f(s,a,w)
Update: gradient-descent Sarsa:
  w ← w + α [r_{t+1} + γ Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)] ∇_w f(s_t,a_t,w)
[Figure: same block diagram; s and a feed the approximator, which outputs Q(s,a); the difference between the target value and the estimated value gives the error used by standard backprop to adjust the weight vector.]
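For a linear approximator Q(s,a) = w · x(s,a), the gradient ∇_w f is just the feature vector, which makes the update easy to write out. A minimal sketch under that assumption; the feature map `features` is a hypothetical stand-in for a real network's gradient computation:

```python
import numpy as np

def gradient_sarsa_update(w, features, s, a, r_next, s_next, a_next,
                          alpha=0.01, gamma=0.9):
    """w <- w + alpha * [r + gamma*Q(s',a') - Q(s,a)] * grad_w f(s,a,w),
    specialized to a linear Q(s,a) = w . features(s,a)."""
    x = features(s, a)
    q = float(np.dot(w, x))
    q_next = float(np.dot(w, features(s_next, a_next))) if a_next is not None else 0.0
    td_error = r_next + gamma * q_next - q
    return w + alpha * td_error * x
```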
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon.
Situations: • Board configurations (~10^20)
Actions: • Moves
Rewards: • +100 if win • -100 if lose • 0 for all other states
• Trained by playing 1.5 million games against itself.
• Currently roughly equal to the best human player.
Example: TD-Gammon [Tesauro, 1995]
[Figure: neural network; the raw board position (# of pieces at each position) feeds 0–160 hidden units with random initial weights, and the output V(s) is the predicted probability of winning.]
• On a win: Outcome = 1. On a loss: Outcome = 0.
• TD error: V(s_{t+1}) - V(s_t)
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Model-based Learning: Certainty-Equivalence Method
For every step:
• Use the new experience to update the model parameters:
  • Transitions
  • Rewards
• Solve the model for V and π:
  • Value iteration
  • Policy iteration
• Use the policy to choose the next action.
Learning the Model
For each state-action pair <s,a> visited, accumulate:
• Mean transition: T(s,a,s') = number-times-seen(s, a → s') / number-times-tried(s,a)
• Mean reward: R(s,a)
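A minimal Python sketch of these running statistics; the names `record`, `T`, and `R` are illustrative choices, not from the slides:

```python
from collections import defaultdict

tried = defaultdict(int)        # number-times-tried(s, a)
seen = defaultdict(int)         # number-times-seen(s, a -> s')
reward_sum = defaultdict(float)

def record(s, a, r, s_next):
    """Fold one experience tuple <s, a, r, s'> into the model statistics."""
    tried[(s, a)] += 1
    seen[(s, a, s_next)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability."""
    return seen[(s, a, s_next)] / tried[(s, a)] if tried[(s, a)] else 0.0

def R(s, a):
    """Estimated mean reward."""
    return reward_sum[(s, a)] / tried[(s, a)] if tried[(s, a)] else 0.0
```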
Comparison of Model-based and Model-free methods Temporal Differencing / Q Learning: Only does computation for the states the system is actually in. • Good real-time performance • Inefficient use of data Model-based methods: Computes the best estimates for every state on every time step. • Efficient use of data • Terrible real-time performance What is a middle ground?
Dyna: A Middle Ground[Sutton, Intro to RL, 97] At each step, incrementally: • Update model based on new data • Update policy based on new data • Update policy based on updated model Performance, until optimal, on Grid World: • Q-Learning: • 531,000 Steps • 531,000 Backups • Dyna: • 61,908 Steps • 3,055,000 Backups
Dyna Algorithm
Given state s:
• Choose action a using the estimated policy.
• Observe the new state s' and reward r.
• Update T and R of the model.
• Update V at <s,a>: V(s) ← max_a [r(s,a) + γ Σ_{s'} T(s,a,s') V(s')]
• Perform k additional updates (see the sketch below):
  • Pick k states s_j at random from the states visited so far.
  • Update each V(s_j): V(s_j) ← max_a [r(s_j,a) + γ Σ_{s'} T(s_j,a,s') V(s')]
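A sketch of one Dyna step in Python, reusing the hypothetical model estimates T and R from the model-learning sketch above and assuming helpers `choose_action` and `execute` for the real environment; `states` is assumed to be a list of visited states.

```python
import random

def dyna_step(s, V, states, actions, T, R, record, choose_action, execute,
              gamma=0.9, k=10):
    """Act once, update the model, back up V at the real state,
    then do k extra model-based backups at randomly chosen states."""
    a = choose_action(s, V)
    s_next, r = execute(s, a)
    record(s, a, r, s_next)                     # update T and R

    def backup(sj):
        return max(R(sj, aj) + gamma * sum(T(sj, aj, s2) * V[s2] for s2 in states)
                   for aj in actions)

    V[s] = backup(s)                            # backup at the state actually visited
    for sj in random.sample(states, min(k, len(states))):
        V[sj] = backup(sj)                      # k additional simulated backups
    return s_next
```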
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Ongoing Research
• Handling cases where the state is only partially observable.
• Design of optimal exploration strategies.
• Extending to continuous actions and states.
• Learning and using the model δ: S × A → S.
• Scaling up in the size of the state space:
  • Function approximators (neural net instead of table)
  • Generalization
  • Macros
  • Exploiting substructure
• Multiple learners: multi-agent reinforcement learning.
Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s,a)
• Reward for each state and action, R(s,a)
Process:
• Observe state s_t in S
• Choose action a_t in A
• Receive immediate reward r_t
• State changes to s_{t+1}
Deterministic example:
[Figure: agent trajectory s0 -a0/r0-> s1 -a1/r1-> s2 -a2/r2-> s3, and a small grid world with goal G; only legal transitions are shown, the transitions into G carry reward 10, and the reward on unlabeled transitions is 0.]
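One simple way to encode such a deterministic MDP in Python is a pair of dictionaries for δ and R keyed by (state, action); the particular transitions below are purely illustrative and are not meant to reproduce the grid in the figure:

```python
# Illustrative deterministic MDP: delta maps (state, action) to the next state,
# and R gives the immediate reward (10 on the transitions into the goal G, else 0).
delta = {
    ("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "down"): "G",
    ("s4", "up"): "G",     ("s5", "right"): "s4", ("s6", "up"): "s1",
}
R = {sa: (10 if s_next == "G" else 0) for sa, s_next in delta.items()}
```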
Crib Sheet: MDPs by Value Iteration
Insight: Optimal values can be calculated iteratively using dynamic programming.
Algorithm:
• Iteratively calculate values using Bellman's equation: V*_{t+1}(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))]
• Terminate when values are "close enough": |V*_{t+1}(s) - V*_t(s)| < ε
• The agent selects the optimal action by one-step lookahead on V*: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the learner's current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s,a) ← 0.
• Observe the current state s.
Do forever:
• Select an action a and execute it.
• Receive immediate reward r.
• Observe the new state s'.
• Update the table entry for Q̂(s,a) as follows: Q̂(s,a) ← r + γ max_a' Q̂(s',a')
• s ← s'