Reinforcement Learning Based on Slides by Avi Pfeffer and David Parkes
Closed Loop Interactions (diagram): the environment's mechanism produces the next state and a reward; the agent receives these as percepts through its sensors and acts back on the environment through its actuators (actions).
Reinforcement Learning • When the mechanism (= model) is unknown • When the mechanism is known, but the model is too hard to solve
Basic Idea • Select an action using some action-selection process • If it leads to a reward, reinforce taking that action in the future • If it leads to a punishment, avoid taking that action in the future
But It’s Not So Simple • Rewards and punishments may be delayed • credit assignment problem: how do you figure out which actions were responsible? • study -> get-degree -> get job • How do you choose an action? • exploration versus exploitation • What if the state space is very large so you can’t visit all states?
Model-Based Reinforcement Learning • Mechanism is an MDP • Approach: • learn the MDP • solve it to determine the optimal policy • Works when model is unknown, but it is not too large to store and solve
Learning the MDP • We need to learn the parameters of the reward and transition models • We assume the agent plays every action in every state a number of times • Let R_ai = total reward received for playing a in state i • Let N_ai = number of times a was played in state i • Let N_aij = number of times j was reached after playing a in state i • R(i,a) = R_ai / N_ai • T_aij = N_aij / N_ai
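As a rough illustration, these counting estimates can be maintained in a few lines of Python; the names record, R_total, N_sa, and N_saj are illustrative, not from the slides:

from collections import defaultdict

# Running counts gathered while acting in the environment.
R_total = defaultdict(float)   # R_total[(i, a)]  = total reward received for playing a in state i
N_sa    = defaultdict(int)     # N_sa[(i, a)]     = number of times a was played in state i
N_saj   = defaultdict(int)     # N_saj[(i, a, j)] = number of times j was reached after playing a in i

def record(i, a, r, j):
    """Record one experience tuple <i, a, r, j>."""
    R_total[(i, a)]  += r
    N_sa[(i, a)]     += 1
    N_saj[(i, a, j)] += 1

def estimated_reward(i, a):
    """R(i,a) = R_ai / N_ai (only defined once (i,a) has been tried)."""
    return R_total[(i, a)] / N_sa[(i, a)]

def estimated_transition(i, a, j):
    """T_aij = N_aij / N_ai."""
    return N_saj[(i, a, j)] / N_sa[(i, a)]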
Note • Learning and solving the MDP need not be a one-off thing • Instead, we can repeatedly re-evaluate the model and re-solve it to get better and better policies • How often should we re-solve the MDP? • It depends on how expensive solving is compared to acting in the world
Model-Based Reinforcement Learning Algorithm
Let π0 be arbitrary
k ← 0
Experience ← ∅
Repeat:
  k ← k + 1
  Begin in state i
  For a while:
    Choose action a based on πk-1
    Receive reward r and transition to j
    Experience ← Experience ∪ {<i, a, r, j>}
    i ← j
  Learn MDP M from Experience
  Solve M to obtain πk
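The same loop as a hedged Python sketch. Here learn_mdp and solve_mdp stand in for the counting estimates above and for any MDP solver (e.g., value or policy iteration), and env is assumed to expose reset() and step(a) returning (reward, next_state); none of these names come from the slides:

import random

def model_based_rl(env, actions, episodes=100, steps_per_episode=1000):
    """Repeatedly gather experience, re-estimate the MDP, and re-solve it."""
    experience = []                       # list of <i, a, r, j> tuples
    policy = {}                           # empty policy: fall back to random actions

    for k in range(episodes):
        i = env.reset()                   # begin in some state i
        for _ in range(steps_per_episode):
            # Choose an action from the current policy (random if state unseen).
            a = policy.get(i, random.choice(actions))
            r, j = env.step(a)            # receive reward r, transition to j
            experience.append((i, a, r, j))
            i = j
        mdp = learn_mdp(experience)       # estimate R(i,a) and T_aij from counts
        policy = solve_mdp(mdp)           # e.g., value iteration or policy iteration
    return policy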
Credit Assignment • How does model-based RL deal with the credit assignment problem? • By learning the MDP, the agent knows which states lead to which other states • Solving the MDP ensures that the agent plans ahead and takes the long run effects of actions into account • So the problem is solved optimally
Action Selection • The line "Choose action a based on πk-1" in the algorithm is not specific • How do we choose the action? • Obvious answer: the policy tells us which action to perform • But is that always what we want to do?
Exploration versus Exploitation • Exploit: use your learning results to play the action that maximizes your expected utility, relative to the model you have learned • Explore: play an action that will help you learn the model better
Questions • When to explore • How to explore • simple answer: play an action you haven’t played much yet in the current state • more sophisticated: play an action that will probably lead you to part of the space you haven’t explored much • How to exploit • we know the answer to this: follow the learned policy
Conditions for Optimality To ensure that the optimal policy will eventually be reached, we need to ensure that • Every action is taken in every state infinitely often in the long run • The probability of exploitation tends to 1
Possible Exploration Strategies: 1 • Explore until time T, then exploit • Why is this bad? • We may not explore long enough to get an accurate model • As a result, the optimal policy will not be reached • But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy • Works well for learning from simulation and performing in the real world
Possible Exploration Strategies: 2 • Explore with a fixed probability p • Why is this bad? • It does not fully exploit once learning has converged to the optimal policy • When could this approach be useful? • If the world is changing gradually
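A minimal sketch of exploring with a fixed probability (essentially ε-greedy action selection); Q is assumed to map (state, action) pairs to current value estimates and is not a name from the slides:

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit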
Boltzmann Exploration • In state i, choose action a with probability P(a) = exp(Q(i,a)/T) / Σb exp(Q(i,b)/T), where Q(i,a) is the current estimate of the value of playing a in i • T is called the temperature • High temperature: more exploration • T should be cooled down to reduce the amount of exploration over time • Sensitive to the cooling schedule
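A sketch of Boltzmann (softmax) action selection under the same assumption that Q maps (state, action) pairs to value estimates; the caller would cool the temperature down over time:

import math
import random

def boltzmann_action(Q, state, actions, temperature):
    """Sample an action with probability proportional to exp(Q(i,a)/T)."""
    # Subtract the max value for numerical stability; this does not change the distribution.
    values = [Q.get((state, a), 0.0) for a in actions]
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    return random.choices(actions, weights=weights, k=1)[0]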
Guarantee • If: • every action is taken in every state infinitely often • probability of exploration tends to zero • Then: • Model-based reinforcement learning will converge to the optimal policy with probability 1
Pros and Cons • Pro: • makes maximal use of experience • solves model optimally given experience • Con: • assumes model is small enough to solve • requires expensive solution procedure
R-Max • Assume R(s,a) = R-max (the maximal possible reward) • This is called an optimism bias • Assume a special "heaven" state • R(heaven) = R-max • Tr(heaven, a, heaven) = 1 • Solve and act optimally • Once N_ai > c, update R(i,a) and Tr(i,a,j) with their empirical estimates • After each update, re-solve • If c is chosen properly, this converges to the optimal policy in a polynomial number of iterations
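A hedged sketch of the R-max bookkeeping: every (state, action) pair starts out optimistic (reward R-max, deterministic transition to the fictitious heaven state) and switches to its empirical estimates once it has been tried more than c times. The function and argument names are illustrative, and solve_mdp is the same assumed solver as before:

def rmax_model(states, actions, counts, reward_sums, transition_counts,
               R_max, c, heaven="heaven"):
    """Build the optimistic MDP that R-max plans with, from the current counts."""
    R, T = {}, {}
    for i in list(states) + [heaven]:
        for a in actions:
            n = counts.get((i, a), 0)
            if i == heaven or n <= c:
                # Unknown (or fictitious) pair: assume the best possible outcome.
                R[(i, a)] = R_max
                T[(i, a)] = {heaven: 1.0}
            else:
                # Known pair (tried more than c times): use the empirical estimates.
                R[(i, a)] = reward_sums[(i, a)] / n
                T[(i, a)] = {j: m / n
                             for (i2, a2, j), m in transition_counts.items()
                             if i2 == i and a2 == a}
    return R, T   # feed these to solve_mdp and act greedily on the resulting policy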
Monte Carlo Sampling • If we want to estimate y = Ex~D[f(x)] we can • Generate random samples x1,…,xN from D • Estimate ŷ = (1/N) Σn f(xn) • Guaranteed to converge to the correct estimate with sufficiently many samples • Requires keeping a count of the number of samples • Alternative, running-average update: • Generate random samples x1,…,xN from D • After each sample xn, update ŷ ← (1-α) ŷ + α f(xn)
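Both estimators in a short Python sketch; sample() stands in for drawing x ~ D and is an assumption, not part of the slides:

def mc_estimate(sample, f, N):
    """Plain Monte Carlo average: y_hat = (1/N) * sum_n f(x_n)."""
    return sum(f(sample()) for _ in range(N)) / N

def mc_running_estimate(sample, f, N, alpha=0.1):
    """Incremental version: y_hat <- (1 - alpha) * y_hat + alpha * f(x_n).
    No sample count needs to be stored, which is useful when samples arrive online;
    with a constant alpha this is a recency-weighted average, matching the slides' update."""
    y_hat = 0.0
    for _ in range(N):
        y_hat = (1 - alpha) * y_hat + alpha * f(sample())
    return y_hat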
Estimating the Value of a Policy Using Monte-Carlo Sampling • Fix a policy π • When starting in state i, taking action a according to π, receiving reward r and transitioning to j, we get a sample of r + Vπ(j) • So we can update Vπ(i) ← (1-α)Vπ(i) + α(r + Vπ(j)) • This is called bootstrapping – we use V to update itself • The initial value of Vπ(j) can be 0 or some guess
Temporal Difference Algorithm
For each state i: V(i) ← 0
Begin in state i
Repeat:
  Apply action a based on the current policy
  Receive reward r and transition to j
  V(i) ← (1-α)V(i) + α(r + V(j))
  i ← j
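A minimal TD(0) policy-evaluation sketch along these lines. The slides omit a discount factor, which corresponds to gamma = 1 below; env.reset() and env.step(a) returning (reward, next_state, done) are an assumed interface:

def td0_evaluate(env, policy, alpha=0.1, gamma=1.0, episodes=1000):
    """Estimate V_pi by bootstrapping: V(i) <- (1-alpha)*V(i) + alpha*(r + gamma*V(j))."""
    V = {}                                     # unseen states default to value 0
    for _ in range(episodes):
        i, done = env.reset(), False
        while not done:
            a = policy(i)                      # apply action from the current policy
            r, j, done = env.step(a)           # receive reward r, transition to j
            V[i] = (1 - alpha) * V.get(i, 0.0) + alpha * (r + gamma * V.get(j, 0.0))
            i = j
    return V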
Credit Assignment • By linking values to those of the next state, rewards and punishments are eventually propagated backwards • Alternatively, we can wait until the end of an episode and then propagate the updates backwards in reverse order • The long-term impact of a choice is inherent in the definition of the value function
But How Do We Learn to Act? • We want to implement something like policy iteration • This requires learning the Q function: Qπ(i,a) = R(i,a) + Σj T_aij Vπ(j) • We use a TD method, known as SARSA, to estimate the Q function w.r.t. the current policy • We can then update the policy as usual (policy improvement)
TD for Control: SARSA
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
  Initialize s
  Choose a from s using a policy derived from Q (e.g., ε-greedy)
  Repeat (for each step of the episode):
    Take action a, observe r, s'
    Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
    Q(s,a) ← Q(s,a) + α(r + Q(s',a') − Q(s,a))
    s ← s', a ← a'
  Until s is terminal
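A hedged SARSA sketch matching the pseudocode above (gamma = 1 reproduces the slides' undiscounted update; the env interface is the same assumption as before):

import random

def sarsa(env, actions, alpha=0.1, gamma=1.0, epsilon=0.1, episodes=1000):
    """On-policy TD control: the update bootstraps on the action actually chosen next."""
    Q = {}                                                # Q[(s, a)], default 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            r, s2, done = env.step(a)                     # take a, observe r, s'
            a2 = eps_greedy(s2)                           # choose a' from s'
            target = r + gamma * Q.get((s2, a2), 0.0)     # bootstrap on (s', a')
            Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
            s, a = s2, a2
    return Q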
Off-Policy vs. On-Policy • On-policy learning: learn only the value of the actions used by the current policy. SARSA is an example of an on-policy method: we learn the Q values w.r.t. the policy we are currently using • Off-policy learning: can learn the value of a policy different from the one being followed – separating learning from control. Q-learning is an example: it learns about the optimal policy while following a different policy (e.g., an ε-greedy policy)
Q-Learning • Don’t learn the model, learn the optimal Q-function, Q*, directly • Works particularly well when the model is too large to store, to solve, or to learn • size of model: O(|States|²) • cost of solution by policy iteration: O(|States|³) • size of Q function: O(|Actions| × |States|)
Learning the Q Values • We don’t know T_aij and we don’t want to learn an explicit model • If only we knew that our future Q values were accurate… • …then every time we applied a in state i, received reward r, and transitioned to j, the quantity r + maxb Q(j,b) would be a sample of Q(i,a) • So we pretend that they are accurate • (after all, they get more and more accurate)
Q Learning Update Rule • On transitioning from i to j by taking action a and receiving reward r, update Q(i,a) ← (1-α)Q(i,a) + α(r + maxb Q(j,b)) • α is the learning rate • Large α: learning is quicker, but may not converge • α is often decreased over the course of learning
Q Learning Algorithm
For each state i and action a: Q(i,a) ← 0
Begin in state i
Repeat:
  Choose action a based on the Q values Q(i, ·) for all actions (with some exploration)
  Receive reward r and transition to j
  Q(i,a) ← (1-α)Q(i,a) + α(r + maxb Q(j,b))
  i ← j
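The same loop for Q-learning as a hedged sketch; the only change from the SARSA sketch is that the bootstrap target uses maxb Q(j,b) rather than the Q value of the action actually chosen next:

import random

def q_learning(env, actions, alpha=0.1, gamma=1.0, epsilon=0.1, episodes=1000):
    """Off-policy TD control: learn Q* while behaving epsilon-greedily."""
    Q = {}                                                 # Q[(i, a)], default 0

    for _ in range(episodes):
        i, done = env.reset(), False
        while not done:
            # Choose action based on the Q values for state i (with exploration).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q.get((i, b), 0.0))
            r, j, done = env.step(a)                       # receive reward r, transition to j
            best_next = max(Q.get((j, b), 0.0) for b in actions)
            Q[(i, a)] = (1 - alpha) * Q.get((i, a), 0.0) + alpha * (r + gamma * best_next)
            i = j
    return Q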
Choosing Which Action to Take • Once you have learned the Q function, you can use it to determine the policy • in state i, choose action a that has highest estimated Q(i,a) • But we need to combine exploitation with exploration • same methods as before
Guarantee • If: • every action is taken in every state infinitely often • α is sufficiently small • Then Q learning will converge to the optimal Q values with probability 1 • If also: • probability of exploration tends to zero • Then Q learning will converge to the optimal policy with probability 1
Credit Assignment • By linking Q values to those of the next state, rewards and punishments are eventually propagated backwards • But may take a long time • Idea: wait until end of game and then propagate backwards in reverse order
Example (Q-learning with α = 1): a chain of nine states S1,…,S9 with actions a and b, where the a-path eventually yields reward +1 and the b-path eventually yields reward −1. After playing aaaa, the reward has propagated back: Q(S1,a) = Q(S2,a) = Q(S3,a) = Q(S4,a) = 1, with Q(S1,b) = Q(S2,b) = Q(S3,b) = Q(S4,b) = 0. After playing bbbb, only Q(S8,b) = −1 has been updated; Q(S6,a), Q(S7,a), Q(S8,a), Q(S6,b), and Q(S7,b) remain 0, so the punishment reaches only the state where it occurred.
Bottom Line • Q-learning makes an optimistic assumption about the future (the max over next actions) • Rewards are propagated back in linear time, but punishments may take exponential time to propagate • But eventually, Q-learning will converge to the optimal policy
SARSA vs. Q-learning: how will each perform here?