Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
Reading
• Today: Reinforcement Learning
• Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
• Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
• For Markov Decision Processes
• Read 1st/2nd ed. AIMA Chapter 17, sections 1-4.
• Optional reading: "Planning and Acting in Partially Observable Stochastic Domains," by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99-134.
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Q values • Q learning • Multi-step backups • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon.
Situations: • Board configurations (~10^20)
Actions: • Moves
Rewards: • +100 if win • -100 if lose • 0 for all other states
• Trained by playing 1.5 million games against itself.
• Currently roughly equal to the best human player.
Reinforcement Learning Problem
[Figure: agent-environment loop. The agent observes state s_t and reward r_t from the environment and executes action a_t, producing the sequence s0 -a0/r0-> s1 -a1/r1-> s2 -a2/r2-> s3 ...]
Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn action policy π: S → A that
• Maximizes life reward r0 + γr1 + γ²r2 + … from any start state.
• Discount: 0 < γ < 1
Note:
• Unsupervised learning
• Delayed reward
Goal: Learn to choose actions that maximize life reward r0 + γr1 + γ²r2 + …
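As a concrete illustration of the discounted life reward above, here is a minimal Python sketch (not from the original slides) that sums r0 + γr1 + γ²r2 + … for a finite reward sequence; the function name `discounted_return` is just an illustrative choice.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted life reward r0 + gamma*r1 + gamma^2*r2 + ... for a finite episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A reward of 100 received three steps in the future is worth 0.9**3 * 100, i.e. about 72.9, now.
print(discounted_return([0, 0, 0, 100], gamma=0.9))
```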
How About Learning the Policy Directly?
• π*: S → A
• Fill out table entries for π* by collecting statistics on training pairs <s, a*>.
• Where does a* come from?
How About Learning the Value Function?
• Have the agent learn the value function V^π*, denoted V*.
• Given the learned V*, the agent selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Problem:
• Works well if the agent knows the environment model:
  • δ: S × A → S
  • r: S × A → ℝ
• With no model, the agent can't choose an action from V*.
• With a model, V* could be computed via value iteration, so why learn it?
How About Learning the Model as Well?
• Have the agent learn δ and r from statistics on training instances <s_t, a_t, r_{t+1}, s_{t+1}>.
• Compute V* by value iteration: V_{t+1}(s) ← max_a [r(s,a) + γ V_t(δ(s,a))]
• The agent selects the optimal action by one-step lookahead: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Problem: A viable strategy for many problems, but…
• When do you stop learning the model and compute V*?
• It may take a long time to converge on the model.
• We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?
Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Key idea:
• Define a function that encapsulates V*, δ and r: Q(s,a) = r(s,a) + γ V*(δ(s,a))
• From the learned Q, the agent can choose an optimal action without knowing δ or r: π*(s) = argmax_a Q(s,a)
V = cumulative reward of being in s.
Q = cumulative reward of being in s and taking action a.
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Q values • Q learning • Multi-step backups • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
How Do We Learn Q?
Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))
Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <s_t, a_t, r_{t+1}, s_{t+1}>: Q(s_t,a_t) ← r_{t+1} + γ V*(s_{t+1})
How do we eliminate V*?
• Q and V* are closely related: V*(s) = max_a' Q(s,a')
• Substituting Q for V*: Q(s_t,a_t) ← r_{t+1} + γ max_a' Q(s_{t+1},a')
This update is called a backup.
Q-Learning for Deterministic Worlds
Let Q̂ denote the learner's current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s,a) ← 0.
• Observe the initial state s0.
Do for all time t:
• Select an action a_t and execute it.
• Receive immediate reward r_{t+1}.
• Observe the new state s_{t+1}.
• Update the table entry for Q̂(s_t,a_t) as follows: Q̂(s_t,a_t) ← r_{t+1} + γ max_a' Q̂(s_{t+1},a')
• s_t ← s_{t+1}
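A minimal Python sketch of this deterministic Q-learning loop, assuming hypothetical environment helpers `actions(s)`, `step(s, a)` (returns the next state and reward) and `is_goal(s)` that are not part of the slides:

```python
import random
from collections import defaultdict

def q_learning_deterministic(start_state, actions, step, is_goal,
                             gamma=0.9, episodes=1000):
    """Tabular Q-learning with the deterministic backup
    Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                      # every Q(s,a) starts at 0
    for _ in range(episodes):
        s = start_state
        while not is_goal(s):
            a = random.choice(actions(s))       # exploration policy (see the exploration slide)
            s_next, r = step(s, a)              # execute a, observe reward and new state
            future = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
            Q[(s, a)] = r + gamma * future      # deterministic backup
            s = s_next                          # move to the new state
    return Q
```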
Example: Q-Learning Update
[Figure: two grid states s1 and s2; the arrows leaving s2 carry current Q̂ values 63, 81, and 100, and the arrow from s1 to s2 carries Q̂ = 72; γ = 0.9, and 0 reward is received on the move.]
Example: Q-Learning Update (continued)
[Figure: same grid; after the backup, the arrow for a_right from s1 to s2 carries Q̂ = 90.]
Q̂(s1, a_right) ← r(s1, a_right) + γ max_a' Q̂(s2, a')
              = 0 + 0.9 · max{63, 81, 100}
              = 90
(γ = 0.9; 0 reward received on the move.)
Note: if rewards are non-negative:
• For all s, a, n: Q̂_n(s,a) ≤ Q̂_{n+1}(s,a)
• For all s, a, n: 0 ≤ Q̂_n(s,a) ≤ Q(s,a)
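The arithmetic of this backup is easy to check; a tiny Python sketch reproducing the slide's numbers:

```python
gamma = 0.9
q_s2 = [63, 81, 100]           # current Q-hat values on the actions leaving s2
print(0 + gamma * max(q_s2))   # 0 reward received, so the new Q(s1, a_right) is 90.0
```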
Q-Learning Iterations: Episodic
[Figure: six-state grid world with states s1, s2, s3 above and s6, s5, s4 below, plus a goal state G; the three transitions into G carry reward 10, all other rewards are 0.]
• Start at upper left – move clockwise; the table is initially 0; γ = 0.8.
• Update rule: Q̂(s,a) ← r + γ max_a' Q̂(s',a')
• Over successive episodes the reward propagates backwards from G: first the entries on transitions into G become 10, then their predecessors become 0.8 · 10 = 8, and so on.
Example Summary: Value Iteration and Q-Learning
[Figure: the same grid world shown three ways: the immediate rewards r(s,a), the learned Q̂(s,a) values (72, 81, 90, 100), the resulting V*(s) values (81, 90, 100, and 0 at G), and one optimal policy.]
Exploration vs. Exploitation
How do you pick actions as you learn?
• Greedy action selection:
  • Always select the action that currently looks best: π(s) = argmax_a Q̂(s,a)
• Probabilistic action selection:
  • The likelihood of a is proportional to its current Q̂ value:
  • P(a_i | s) = k^Q̂(s,a_i) / Σ_j k^Q̂(s,a_j)
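A small Python sketch of both selection rules, assuming a Q table keyed by (state, action) pairs; the base k in the probabilistic rule is a free parameter (larger k favors exploitation over exploration):

```python
import random

def greedy_action(Q, s, actions):
    """Always take the action that currently looks best."""
    return max(actions, key=lambda a: Q[(s, a)])

def probabilistic_action(Q, s, actions, k=2.0):
    """Pick a_i with probability proportional to k ** Q(s, a_i)."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]
```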
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Q values • Q learning • Multi-step backups • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
TD(λ): Temporal Difference Learning (from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997)
Q-learning: reduce the discrepancy between successive Q estimates.
One-step time difference: Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_{t+1},a)
Why not two steps? Q^(2)(s_t,a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2},a)
Or n? Q^(n)(s_t,a_t) = r_t + γ r_{t+1} + … + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n},a)
Blend all of these: Q^λ(s_t,a_t) = (1-λ) [Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + …]
Eligibility Traces
[Figure: visits to a data point <s,a> over time t, and the corresponding accumulating trace.]
Idea: Perform backups on the N previous data points as well as the most recent data point.
• Select which data to back up based on frequency of visitation.
• Bias towards frequent data by a geometric decay γ^(i-j).
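The slides describe the idea abstractly; as one concrete illustration (not the slides' own algorithm), here is a minimal Sarsa(λ) episode with accumulating eligibility traces, in which every previously visited (s,a) pair receives a share of each new TD error, decayed geometrically by γλ. The helpers `actions`, `step`, and `is_goal` are the same hypothetical environment interface used in the earlier Q-learning sketch.

```python
import random
from collections import defaultdict

def sarsa_lambda_episode(start_state, actions, step, is_goal, Q=None,
                         gamma=0.9, lam=0.7, alpha=0.1):
    """One episode of Sarsa(lambda) with accumulating traces."""
    Q = Q if Q is not None else defaultdict(float)
    e = defaultdict(float)                       # eligibility trace per (s, a)
    s = start_state
    a = random.choice(actions(s))
    while not is_goal(s):
        s2, r = step(s, a)
        a2 = random.choice(actions(s2)) if not is_goal(s2) else None
        target = r + (gamma * Q[(s2, a2)] if a2 is not None else 0.0)
        delta = target - Q[(s, a)]               # one-step TD error
        e[(s, a)] += 1.0                         # accumulate the trace on the visited pair
        for key in list(e):
            Q[key] += alpha * delta * e[key]     # multi-step backup through the trace
            e[key] *= gamma * lam                # geometric decay
        s, a = s2, a2
    return Q
```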
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs: • Value Iteration • Q Learning • Function Approximators • Model-based Learning • Summary
Nondeterministic MDPs
State transitions become probabilistic: δ(s,a,s')
Example:
[Figure: four states, S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia, with actions R (research path) and D (development path); the transitions carry probabilities 0.1, 0.9, and 1.0.]
Nondeterministic Case
• How do we redefine cumulative reward to handle non-determinism?
• Define V and Q based on expected values:
  V^π(s_t) = E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
  V^π(s_t) = E[ Σ_i γ^i r_{t+i} ]
  Q(s_t,a_t) = E[ r(s_t,a_t) + γ V*(δ(s_t,a_t)) ]
Value Iteration for Nondeterministic MDPs
V_1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Q_t(s,a) := r(s,a) + γ Σ_{s' in S} δ(s,a,s') V_{t-1}(s')
    end loop
    V_t(s) := max_a [Q_t(s,a)]
  end loop
until |V_t(s) - V_{t-1}(s)| < ε for all s in S
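A direct Python transcription of this pseudocode, assuming S and A are finite lists and that `delta(s, a, s2)` (transition probability) and `r(s, a)` (expected reward) are supplied as functions; these names are hypothetical, not part of the slides:

```python
def value_iteration(S, A, delta, r, gamma=0.9, epsilon=1e-6):
    """Value iteration for a nondeterministic MDP:
    Q(s,a) = r(s,a) + gamma * sum_s' delta(s,a,s') V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): r(s, a) + gamma * sum(delta(s, a, s2) * V[s2] for s2 in S)
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if all(abs(V_new[s] - V[s]) < epsilon for s in S):
            return V_new, Q
        V = V_new
```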
Q-Learning for Nondeterministic MDPs
Q*(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_a' [Q*(s',a')]
• Alter the training rule for the nondeterministic case:
  Q̂_n(s_t,a_t) ← (1 - α_n) Q̂_{n-1}(s_t,a_t) + α_n [r_{t+1} + γ max_a' Q̂_{n-1}(s_{t+1},a')]
  where α_n = 1/(1 + visits_n(s,a))
• Convergence of Q̂ can still be proved [Watkins and Dayan, 92].
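A sketch of this averaging update in Python, keeping the visit counts needed for α_n = 1/(1 + visits_n(s,a)); the surrounding loop (action selection, observing s_{t+1}) is omitted here:

```python
from collections import defaultdict

Q = defaultdict(float)          # current estimate Q_n(s, a)
visits = defaultdict(int)       # visits_n(s, a)

def nondeterministic_q_update(s, a, r_next, s_next, actions, gamma=0.9):
    """Q <- (1 - alpha_n) Q + alpha_n [r + gamma * max_a' Q(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r_next + gamma * max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```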
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Function Approximation
Function approximators:
• Backprop neural network
• Radial basis function network
• CMAC network
• Nearest neighbor, memory-based
• Decision tree
[Figure: block diagram; the state s and action a feed a function approximator that outputs Q(s,a), and targets or errors drive gradient-descent methods that adjust the approximator.]
Function Approximation Example: Adjusting Network Weights
Function approximator: Q(s,a) = f(s,a,w)
Update: gradient-descent Sarsa:
  w ← w + α [r_{t+1} + γ Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)] ∇_w f(s_t,a_t,w)
[Figure: same block diagram; s and a feed the approximator, which outputs Q(s,a); the difference between the target value and the estimated value gives the error used by standard backprop to adjust the weight vector.]
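For a linear approximator Q(s,a) = w · x(s,a), the gradient ∇_w f is just the feature vector, which makes the update easy to write out. A minimal sketch under that assumption; the feature map `features` is a hypothetical stand-in for a real network's gradient computation:

```python
import numpy as np

def gradient_sarsa_update(w, features, s, a, r_next, s_next, a_next,
                          alpha=0.01, gamma=0.9):
    """w <- w + alpha * [r + gamma*Q(s',a') - Q(s,a)] * grad_w f(s,a,w),
    specialized to a linear Q(s,a) = w . features(s,a)."""
    x = features(s, a)
    q = float(np.dot(w, x))
    q_next = float(np.dot(w, features(s_next, a_next))) if a_next is not None else 0.0
    td_error = r_next + gamma * q_next - q
    return w + alpha * td_error * x
```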
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon.
Situations: • Board configurations (~10^20)
Actions: • Moves
Rewards: • +100 if win • -100 if lose • 0 for all other states
• Trained by playing 1.5 million games against itself.
• Currently roughly equal to the best human player.
Example: TD-Gammon [Tesauro, 1995]
[Figure: neural network; the raw board position (# of pieces at each position) feeds 0–160 hidden units with random initial weights, and the output V(s) is the predicted probability of winning.]
• On a win: Outcome = 1. On a loss: Outcome = 0.
• TD error: V(s_{t+1}) - V(s_t)
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Model-based Learning: Certainty-Equivalence Method
For every step:
• Use the new experience to update the model parameters:
  • Transitions
  • Rewards
• Solve the model for V and π:
  • Value iteration
  • Policy iteration
• Use the policy to choose the next action.
Learning the Model
For each state-action pair <s,a> visited, accumulate:
• Mean transition: T(s,a,s') = number-times-seen(s, a → s') / number-times-tried(s,a)
• Mean reward: R(s,a)
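A minimal Python sketch of these running statistics; the names `record`, `T`, and `R` are illustrative choices, not from the slides:

```python
from collections import defaultdict

tried = defaultdict(int)        # number-times-tried(s, a)
seen = defaultdict(int)         # number-times-seen(s, a -> s')
reward_sum = defaultdict(float)

def record(s, a, r, s_next):
    """Fold one experience tuple <s, a, r, s'> into the model statistics."""
    tried[(s, a)] += 1
    seen[(s, a, s_next)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability."""
    return seen[(s, a, s_next)] / tried[(s, a)] if tried[(s, a)] else 0.0

def R(s, a):
    """Estimated mean reward."""
    return reward_sum[(s, a)] / tried[(s, a)] if tried[(s, a)] else 0.0
```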
Comparison of Model-based and Model-free methods Temporal Differencing / Q Learning: Only does computation for the states the system is actually in. • Good real-time performance • Inefficient use of data Model-based methods: Computes the best estimates for every state on every time step. • Efficient use of data • Terrible real-time performance What is a middle ground?
Dyna: A Middle Ground[Sutton, Intro to RL, 97] At each step, incrementally: • Update model based on new data • Update policy based on new data • Update policy based on updated model Performance, until optimal, on Grid World: • Q-Learning: • 531,000 Steps • 531,000 Backups • Dyna: • 61,908 Steps • 3,055,000 Backups
Dyna Algorithm
Given state s:
• Choose action a using the estimated policy.
• Observe the new state s' and reward r.
• Update T and R of the model.
• Update V at <s,a>: V(s) ← max_a [r(s,a) + γ Σ_{s'} T(s,a,s') V(s')]
• Perform k additional updates (see the sketch below):
  • Pick k states s_j at random from the states visited so far.
  • Update each V(s_j): V(s_j) ← max_a [r(s_j,a) + γ Σ_{s'} T(s_j,a,s') V(s')]
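A sketch of one Dyna step in Python, reusing the hypothetical model estimates T and R from the model-learning sketch above and assuming helpers `choose_action` and `execute` for the real environment; `states` is assumed to be a list of visited states.

```python
import random

def dyna_step(s, V, states, actions, T, R, record, choose_action, execute,
              gamma=0.9, k=10):
    """Act once, update the model, back up V at the real state,
    then do k extra model-based backups at randomly chosen states."""
    a = choose_action(s, V)
    s_next, r = execute(s, a)
    record(s, a, r, s_next)                     # update T and R

    def backup(sj):
        return max(R(sj, aj) + gamma * sum(T(sj, aj, s2) * V[s2] for s2 in states)
                   for aj in actions)

    V[s] = backup(s)                            # backup at the state actually visited
    for sj in random.sample(states, min(k, len(states))):
        V[sj] = backup(sj)                      # k additional simulated backups
    return s_next
```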
Markov Decision Processes and Reinforcement Learning • Motivation • Learning policies through reinforcement • Nondeterministic MDPs • Function Approximators • Model-based Learning • Summary
Ongoing Research
• Handling cases where the state is only partially observable.
• Design of optimal exploration strategies.
• Extending to continuous actions and states.
• Learning and using the model δ: S × A → S.
• Scaling up in the size of the state space:
  • Function approximators (neural net instead of table)
  • Generalization
  • Macros
  • Exploiting substructure
• Multiple learners: multi-agent reinforcement learning.
Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s,a)
• Reward for each state and action, R(s,a)
Process:
• Observe state s_t in S
• Choose action a_t in A
• Receive immediate reward r_t
• State changes to s_{t+1}
Deterministic example:
[Figure: agent trajectory s0 -a0/r0-> s1 -a1/r1-> s2 -a2/r2-> s3, and a small grid world with goal G; only legal transitions are shown, the transitions into G carry reward 10, and the reward on unlabeled transitions is 0.]
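One simple way to encode such a deterministic MDP in Python is a pair of dictionaries for δ and R keyed by (state, action); the particular transitions below are purely illustrative and are not meant to reproduce the grid in the figure:

```python
# Illustrative deterministic MDP: delta maps (state, action) to the next state,
# and R gives the immediate reward (10 on the transitions into the goal G, else 0).
delta = {
    ("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "down"): "G",
    ("s4", "up"): "G",     ("s5", "right"): "s4", ("s6", "up"): "s1",
}
R = {sa: (10 if s_next == "G" else 0) for sa, s_next in delta.items()}
```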
Crib Sheet: MDPs by Value Iteration
Insight: Optimal values can be calculated iteratively using dynamic programming.
Algorithm:
• Iteratively calculate values using Bellman's equation: V*_{t+1}(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))]
• Terminate when values are "close enough": |V*_{t+1}(s) - V*_t(s)| < ε
• The agent selects the optimal action by one-step lookahead on V*: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the learner's current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s,a) ← 0.
• Observe the current state s.
Do forever:
• Select an action a and execute it.
• Receive immediate reward r.
• Observe the new state s'.
• Update the table entry for Q̂(s,a) as follows: Q̂(s,a) ← r + γ max_a' Q̂(s',a')
• s ← s'