
Reinforcement Learning

Based on slides by Avi Pfeffer and David Parkes.

Presentation Transcript


  1. Reinforcement Learning • Based on slides by Avi Pfeffer and David Parkes

  2. Closed Loop Interactions [Figure: agent and environment in a closed loop; labels: Mechanism, Reward, State, Environment, Agent, Sensors, Actuators, Percepts, Actions]

  3. Reinforcement Learning • When the mechanism (= model) is unknown • When the mechanism is known, but the model is too hard to solve

  4. Basic Idea • Select an action using some sort of action selection process • If it leads to a reward, reinforce taking that action in future • If it leads to a punishment, avoid taking that action in future

  5. But It’s Not So Simple • Rewards and punishments may be delayed • credit assignment problem: how do you figure out which actions were responsible? • study -> get-degree -> get job • How do you choose an action? • exploration versus exploitation • What if the state space is very large so you can’t visit all states?

  6. Model-Based RL

  7. Model-Based Reinforcement Learning • Mechanism is an MDP • Approach: • learn the MDP • solve it to determine the optimal policy • Works when model is unknown, but it is not too large to store and solve

  8. Learning the MDP • We need to learn the parameters of the reward and transition models • We assume the agent plays every action in every state a number of times • Let Rai = total reward received for playing a in state i • Let Nai = number of times played a in state i • Let Naij = number of times j was reached when played a in state i • R(i,a) = Rai / Nai • Taij = Naij / Nai
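The count-based estimates on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the original deck: the function name `learn_mdp`, the tuple layout `(i, a, r, j)` and the toy experience list are assumptions made for the example.

```python
from collections import defaultdict

def learn_mdp(experience):
    """Estimate R(i,a) and T(i,a,j) from (i, a, r, j) tuples by counting."""
    reward_sum = defaultdict(float)  # Rai: total reward for playing a in state i
    visits = defaultdict(int)        # Nai: number of times a was played in state i
    reached = defaultdict(int)       # Naij: number of times j followed (i, a)
    for i, a, r, j in experience:
        reward_sum[(i, a)] += r
        visits[(i, a)] += 1
        reached[(i, a, j)] += 1
    R = {ia: reward_sum[ia] / n for ia, n in visits.items()}
    T = {(i, a, j): n / visits[(i, a)] for (i, a, j), n in reached.items()}
    return R, T

# Hypothetical experience: "go" was played three times in s0, "stay" once in s1.
experience = [("s0", "go", 1.0, "s1"), ("s0", "go", 0.0, "s0"),
              ("s0", "go", 1.0, "s1"), ("s1", "stay", 2.0, "s1")]
R, T = learn_mdp(experience)
```

State-action pairs that were never played do not appear in `R` or `T`, which is why the slide assumes every action is played in every state a number of times.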

  9. Note • Learning and solving the MDP need not be a one-off thing • Instead, we can repeatedly re-estimate the model and re-solve it to get better and better policies • How often should we solve the MDP? • depends on how expensive solving is compared to acting in the world

  10. Model-Based Reinforcement Learning Algorithm
      Let π0 be arbitrary
      k ← 0
      Experience ← ∅
      Repeat:
        k ← k + 1
        Begin in state i
        For a while:
          Choose action a based on πk-1
          Receive reward r and transition to j
          Experience ← Experience ∪ < i, a, r, j >
          i ← j
        Learn MDP M from Experience
        Solve M to obtain πk
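The "Solve M" step is not spelled out on the slide. One possible way to do it is value iteration, sketched below under stated assumptions: the discount factor `gamma`, the fixed number of sweeps, the dictionary layout of `R` and `T`, and the two-state toy model are all choices made for this example rather than anything taken from the deck.

```python
def value_iteration(states, actions, R, T, gamma=0.9, sweeps=200):
    """Solve a small learned MDP; R[(s,a)] and T[(s,a,j)] are dicts (assumed layout)."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected return of playing a in s, then following the current values V.
        future = sum(T.get((s, a, j), 0.0) * V[j] for j in states)
        return R.get((s, a), 0.0) + gamma * future

    for _ in range(sweeps):
        V = {s: max(q(s, a) for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, policy

# Hypothetical learned model: staying in B pays 1, everything else pays 0.
states, actions = ["A", "B"], ["stay", "move"]
R = {("A", "stay"): 0.0, ("A", "move"): 0.0, ("B", "stay"): 1.0, ("B", "move"): 0.0}
T = {("A", "stay", "A"): 1.0, ("A", "move", "B"): 1.0,
     ("B", "stay", "B"): 1.0, ("B", "move", "A"): 1.0}
V, policy = value_iteration(states, actions, R, T)
```

In this toy model the resulting policy moves from A to B and then stays in B, which is the kind of planning ahead that the learned model makes possible.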

  11. Credit Assignment • How does model-based RL deal with the credit assignment problem? • By learning the MDP, the agent knows which states lead to which other states • Solving the MDP ensures that the agent plans ahead and takes the long run effects of actions into account • So the problem is solved optimally

  13. Action Selection • The line in the algorithm Choose action a based on πk-1 is not specific • How do we choose the action? • Obvious answer: the policy tells us the action to perform • But is that always what we want to do?

  14. Exploration versus Exploitation • Exploit: use your learning results to play the action that maximizes your expected utility, relative to the model you have learned • Explore: play an action that will help you learn the model better

  15. Questions • When to explore • How to explore • simple answer: play an action you haven’t played much yet in the current state • more sophisticated: play an action that will probably lead you to part of the space you haven’t explored much • How to exploit • we know the answer to this: follow the learned policy

  16. Conditions for Optimality To ensure that the optimal policy will eventually be reached, we need to ensure that • Every action is taken in every state infinitely often in the long run • The probability of exploitation tends to 1

  20. Possible Exploration Strategies: 1 • Explore until time T, then exploit • Why is this bad? • We may not explore long enough to get an accurate model • As a result, the optimal policy will not be reached • But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy • Works well for learning from simulation and performing in the real world

  24. Possible Exploration Strategies: 2 • Explore with a fixed probability of p • Why is this bad? • Does not fully exploit when learning has converged to optimal policy • When could this approach be useful? • If world is changing gradually
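Exploring with a fixed probability p might look like the sketch below. The function name `choose_action`, the uniform choice among actions when exploring, and the seeded generator are assumptions made for this illustration.

```python
import random

def choose_action(policy_action, actions, p, rng=random):
    """With probability p explore (uniform random action); otherwise exploit."""
    if rng.random() < p:
        return rng.choice(actions)
    return policy_action

# Hypothetical usage: the learned policy recommends "x" in the current state.
rng = random.Random(0)
picks = [choose_action("x", ["x", "y"], 0.2, rng) for _ in range(2000)]
exploit_fraction = picks.count("x") / len(picks)
```

Because p never shrinks, roughly a fraction p of all moves stay exploratory forever, which is the drawback the slide points out, and also why the approach suits a world that keeps changing.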

  25. Boltzmann Exploration • In state i, choose action a with probability P(a | i) = exp(Q(i,a)/T) / Σb exp(Q(i,b)/T) • T is called the temperature • High temperature: more exploration • T should be cooled down to reduce the amount of exploration over time • Sensitive to cooling schedule
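A minimal sketch of Boltzmann exploration, assuming the usual softmax form in which action a is chosen with probability proportional to exp(Q(i,a)/T). The function name and the example Q values are hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the probabilities.

```python
import math

def boltzmann_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q / T) for one state."""
    m = max(q_values)  # shift by the max so exp() cannot overflow
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q_values = [1.0, 2.0]                     # hypothetical Q(i,a) for two actions
hot = boltzmann_probs(q_values, 100.0)    # high temperature: near uniform
cold = boltzmann_probs(q_values, 0.01)    # low temperature: near greedy
```

At a high temperature the two actions are chosen almost equally often; at a low temperature nearly all the probability goes to the action with the higher Q value, which is why cooling T reduces exploration.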

  26. Guarantee • If: • every action is taken in every state infinitely often • probability of exploration tends to zero • Then: • Model-based reinforcement learning will converge to the optimal policy with probability 1

  27. Pros and Cons • Pro: • makes maximal use of experience • solves model optimally given experience • Con: • assumes model is small enough to solve • requires expensive solution procedure

  28. R-Max • Assume R(s,a)=R-max (the maximal possible reward) • Called optimism bias • Assume a special “heavens” state • R(heavens)=R-max • Tr(heavens,a,heavens)=1 • Solve and act optimally • When Nai > c, update R(i,a) and Tr(i,a,j) • After each update, re-solve • If you choose c properly, converges to the optimal policy in a polynomial number of iterations

  29. Model-Free RL

  30. Monte Carlo Sampling • If we want to estimate y = Ex~D[f(x)] we can • Generate random samples x1,…,xN from D • Estimate y ≈ (1/N) Σn f(xn) • Guaranteed to converge to correct estimate with sufficient samples • Requires keeping count of # of samples • Alternative, update average: • Generate random samples x1,…,xN from D • Estimate y ← (1-α)y + α f(xn) after each sample xn
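The two estimators can be compared side by side. In this sketch the distribution D (uniform on [0, 1]), the function f(x) = x², the sample size and the step size α = 0.01 are all assumptions chosen for illustration; the true expectation in that case is 1/3.

```python
import random

def batch_mean(samples):
    """Plain average: needs the number of samples."""
    return sum(samples) / len(samples)

def running_mean(samples, alpha):
    """Update-average form: y <- (1 - alpha) * y + alpha * sample, no count needed."""
    estimate = 0.0
    for x in samples:
        estimate = (1 - alpha) * estimate + alpha * x
    return estimate

rng = random.Random(0)
samples = [rng.uniform(0.0, 1.0) ** 2 for _ in range(20000)]  # f(x) = x^2
plain = batch_mean(samples)
smoothed = running_mean(samples, alpha=0.01)
```

The running form weights recent samples more heavily, so with a constant α it keeps fluctuating around the true value instead of settling exactly; this is the same update shape used for the value estimates on the following slides.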

  31. Estimating the Value of a Policy Using Monte-Carlo Sampling • Fix a policy π • When starting in state i, taking action a according to π, getting reward r and transitioning to j, we get a sample of Vπ(i): r + Vπ(j) • So we can update Vπ(i) ← (1-α)Vπ(i) + α(r + Vπ(j)) • This is called bootstrapping: we use V to update itself • The initial value of Vπ(j) can be 0 or some guess

  32. Temporal Difference Algorithm
      For each state i: V(i) ← 0
      Begin in state i
      Repeat:
        Apply action a based on current policy
        Receive reward r and transition to j
        V(i) ← (1-α)V(i) + α(r + V(j))
        i ← j
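A small sketch of temporal difference value estimation, using the undiscounted update V(i) ← (1-α)V(i) + α(r + V(j)) from the previous slide. The three-state chain, the step size and the episode count are hypothetical; each episode is given as a list of (i, r, j) transitions already generated by the fixed policy.

```python
def td0(episodes, alpha, n_states):
    """Estimate V for a fixed policy from observed (i, r, j) transitions."""
    V = [0.0] * n_states
    for episode in episodes:
        for i, r, j in episode:
            # Bootstrapping: the current estimate V[j] stands in for the future.
            V[i] = (1 - alpha) * V[i] + alpha * (r + V[j])
    return V

# Hypothetical chain 0 -> 1 -> 2, reward 1 on the last step; state 2 is terminal.
episode = [(0, 0.0, 1), (1, 1.0, 2)]
V = td0([episode] * 200, alpha=0.1, n_states=3)
```

The reward is only received on the second step, yet V[0] also approaches 1 over repeated episodes because it is updated towards V[1].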

  33. Credit Assignment • By linking values to those of the next state, rewards and punishments are eventually propagated backwards • We wait until end of game and then propagate backwards in reverse order • Long term impact of a choice is inherent in the definition of value function

  34. But how do we learn to act? • We want to implement something like policy iteration • This requires learning the Q function • We use a TD method, known as SARSA, to estimate the Q function w.r.t. the current policy • We can then update the policy as usual (policy improvement)

  35. TD for Control: SARSA
      Initialize Q(s,a) arbitrarily
      Repeat (for each episode):
        Initialize s
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Repeat (for each step of episode):
          Take action a, observe r, s’
          Choose a’ from s’ using policy derived from Q (e.g., ε-greedy)
          Update: Q(s,a) ← (1-α)Q(s,a) + α(r + Q(s’,a’))
          s ← s’, a ← a’
        Until s is terminal
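The SARSA loop can be sketched as follows. Everything about the environment is an assumption made for this example: a four-state corridor with a goal at state 3, a reward of -1 per step, actions -1 (left) and +1 (right), and an undiscounted update in the same (1-α) form used for the value estimates earlier in the deck.

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng):
    """Explore with probability eps, otherwise take the action with highest Q."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(step, start, is_terminal, states, actions,
          episodes=1000, alpha=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start
        a = epsilon_greedy(Q, s, actions, eps, rng)
        while not is_terminal(s):
            r, s2 = step(s, a)
            a2 = epsilon_greedy(Q, s2, actions, eps, rng)
            # On-policy: the target uses a2, the action that will actually be taken.
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + Q[(s2, a2)])
            s, a = s2, a2
    return Q

GOAL = 3  # hypothetical corridor: states 0..3, goal at the right end

def step(s, a):
    """Move left (-1) or right (+1), clipped to the corridor; every step costs 1."""
    return -1.0, min(max(s + a, 0), GOAL)

Q = sarsa(step, 0, lambda s: s == GOAL, range(GOAL + 1), [-1, +1])
```

After training, moving right should have the higher Q value in every non-terminal state. The learned values are those of the ε-greedy policy itself, so they include the cost of its occasional exploratory steps.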

  36. Off-Policy vs. On-Policy • On-policy learning: learn only the value of actions used in the current policy. SARSA is an example of an on-policy method. We learn the Q values w.r.t. the policy we are currently using • Off-policy learning: can learn the value of a policy different from the one used, separating learning from control. Q-learning is an example. It learns about the optimal policy while following a different policy (e.g., an ε-greedy policy).

  37. Q-Learning • Don’t learn the model, learn the optimal Q-function, Q*, directly • Works particularly well when model is too large to store, to solve or to learn • size of model: O(|States|²) • cost of solution by policy iteration: O(|States|³) • size of Q function: O(|Actions|·|States|)

  38. Recursive Formulation of Q Function • Q(i,a) = R(i,a) + Σj Taij maxb Q(j,b)

  41. Learning the Q Values • We don’t know Tai and we don’t want to learn an explicit model • If only we knew that our future Q values were accurate… • …every time we applied a in state i and transitioned to j, receiving reward r, we would get a sample of R(i,a)+maxbQ(j,b) • So we pretend that they are accurate • (after all, they get more and more accurate)

  43. Q Learning Update Rule • On transitioning from i to j, taking action a, receiving reward r, update Q(i,a) ← (1-α)Q(i,a) + α(r + maxb Q(j,b)) • α is the learning rate • Large α: • learning is quicker • but may not converge • α is often decreased over the course of learning

  44. Q Learning Algorithm
      For each state i and action a: Q(i,a) ← 0
      Begin in state i
      Repeat:
        Choose action a based on the Q values for state i for all actions
        Receive reward r and transition to j
        Q(i,a) ← (1-α)Q(i,a) + α(r + maxb Q(j,b))
        i ← j
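A minimal Q-learning sketch, using the sample r + maxb Q(j,b) described on the earlier slides as the update target. The corridor environment (states 0 to 3, goal at 3, reward -1 per step), the fixed exploration probability and the other parameters are assumptions made for this example only.

```python
import random

def q_learning(step, start, is_terminal, states, actions,
               episodes=500, alpha=0.5, explore_p=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        i = start
        while not is_terminal(i):
            if rng.random() < explore_p:
                a = rng.choice(actions)                    # explore
            else:
                a = max(actions, key=lambda b: Q[(i, b)])  # exploit
            r, j = step(i, a)
            # Off-policy: the target uses the best next action, not the one taken.
            best_next = max(Q[(j, b)] for b in actions)
            Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * (r + best_next)
            i = j
    return Q

GOAL = 3  # hypothetical corridor: states 0..3, goal at the right end

def step(s, a):
    """Move left (-1) or right (+1), clipped to the corridor; every step costs 1."""
    return -1.0, min(max(s + a, 0), GOAL)

Q = q_learning(step, 0, lambda s: s == GOAL, range(GOAL + 1), [-1, +1])
```

In this deterministic corridor the learned values for moving right approach minus the number of steps remaining to the goal, even though the agent keeps exploring while it learns.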

  45. Choosing Which Action to Take • Once you have learned the Q function, you can use it to determine the policy • in state i, choose action a that has highest estimated Q(i,a) • But we need to combine exploitation with exploration • same methods as before

  46. Guarantee • If: • every action is taken in every state infinitely often • α is sufficiently small • Then Q learning will converge to the optimal Q values with probability 1 • If also: • probability of exploration tends to zero • Then Q learning will converge to the optimal policy with probability 1

  47. Credit Assignment • By linking Q values to those of the next state, rewards and punishments are eventually propagated backwards • But may take a long time • Idea: wait until end of game and then propagate backwards in reverse order

  48. Q-learning (α = 1) [Figure: state diagram over S1 to S9, with transitions labelled a and b and rewards 0, 1 and -1]
      After playing aaaa: Q(S1,a) = 1, Q(S2,a) = 1, Q(S3,a) = 1, Q(S4,a) = 1; Q(S1,b) = 0, Q(S2,b) = 0, Q(S3,b) = 0, Q(S4,b) = 0
      After playing bbbb: Q(S1,a) = 1, Q(S6,a) = 0, Q(S7,a) = 0, Q(S8,a) = 0; Q(S1,b) = 0, Q(S6,b) = 0, Q(S7,b) = 0, Q(S8,b) = -1
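The Q values listed on this slide can be reproduced with a short replay, under an assumed reading of the figure: playing aaaa goes S1 → S2 → S3 → S4 → S5 with a reward of 1 on the last step, and playing bbbb goes S1 → S6 → S7 → S8 → S9 with a reward of -1 on the last step. That layout, and updating in reverse order at the end of each game, are assumptions; only the resulting Q values come from the slide.

```python
from collections import defaultdict

ACTIONS = ("a", "b")

def replay_backwards(Q, episode, alpha=1.0):
    """Apply the Q-learning update to one finished episode, last step first."""
    for s, act, r, s2 in reversed(episode):
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, act)] = (1 - alpha) * Q[(s, act)] + alpha * (r + best_next)

Q = defaultdict(float)  # every Q value starts at 0
upper = [("S1", "a", 0, "S2"), ("S2", "a", 0, "S3"),
         ("S3", "a", 0, "S4"), ("S4", "a", 1, "S5")]
lower = [("S1", "b", 0, "S6"), ("S6", "b", 0, "S7"),
         ("S7", "b", 0, "S8"), ("S8", "b", -1, "S9")]

replay_backwards(Q, upper)   # after playing aaaa
after_aaaa = dict(Q)
replay_backwards(Q, lower)   # after playing bbbb
```

The reward of 1 travels all the way back to S1 in one pass, while the punishment stops at S8: the max over actions in S8 still sees the untried action a with value 0, so S7, S6 and S1 are not affected.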

  49. Bottom Line • Q learning makes an optimistic assumption about the future • Rewards will be propagated back in linear time, but punishments may take exponential time to be propagated • But eventually, Q learning will converge to the optimal policy

  50. SARSA vs. Q-learning • How will each perform here?
