
Reinforcement Learning


Presentation Transcript


  1. Reinforcement Learning (Textbook, Chapter 13)

  2. Reinforcement Learning Overview • We’ve focused a lot on supervised learning: • Training examples: (x1, y1), (x2, y2),... • But consider a different type of learning problem, in which a robot has to learn to do tasks in a particular environment. • E.g., • Navigate without crashing into anything • Locate and retrieve an object • Perform some multi-step manipulation of objects resulting in a desired configuration (e.g., sorting objects)

  3. This type of problem doesn’t typically provide clear “training examples” with detailed feedback at each step. • Rather, robot gets intermittent “rewards” for doing “the right thing”, or “punishment” for doing “the wrong thing”. • Goal: To have robot (or other learning system) learn, based on such intermittent rewards, what action to take in a given situation. • Ideas for this have been inspired by “reinforcement learning” experiments in psychology literature.

  4. Exploitation vs. Exploration • On-line versus off-line learning • On-line learning requires the correct balance between “exploitation” and “exploration”

  5. • Exploitation: exploit current good strategies to obtain known reward • Exploration: explore new strategies in hope of finding ways to increase reward • Exploitation vs. exploration in genetic algorithms?

  6. Two-armed bandit model for exploitation and exploration with non-deterministic rewards (From D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.)

  7. Two-armed bandit model for exploitation and exploration with non-deterministic rewards • You are given n quarters to play with, and don’t know the probabilities of payoffs of the respective arms, a1 and a2. • What is the optimal way to allocate your quarters between the two arms so as to maximize your earnings (or minimize your losses) over the n arm-pulls? (From D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.)

  8. Each arm roughly corresponds with a possible “strategy” to test. • The question is, how should you allocate the next sample between the two different arms, given what rewards you have obtained so far? • Let Q(a) = expected reward of arm a.

  9. Here is one algorithm for estimating Q(a): repeat for t = 1 to n, each time pulling some arm a* and adjusting its estimate Q(a*) toward the observed payoff, where rt is the reward obtained by pulling arm a* at time t and η is a learning rate parameter.
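The update rule itself appears only as a figure on the slide; the sketch below assumes it is the standard incremental estimate Q(a*) ← Q(a*) + η (rt − Q(a*)), and uses ε-greedy selection as one possible way to pick which arm to pull next (the slide does not fix a selection rule). The names run_bandit, pull_arm, and epsilon are illustrative, not from the slides.

```python
import random

def run_bandit(pull_arm, n_pulls, n_arms=2, eta=0.1, epsilon=0.1):
    """Play an n_arms bandit for n_pulls pulls while estimating Q(a).

    pull_arm(a) is assumed to return a (possibly random) payoff for arm a.
    Epsilon-greedy selection is one simple way to balance exploitation
    (pull the arm with the best current estimate) against exploration
    (occasionally try another arm).
    """
    Q = [0.0] * n_arms
    total_reward = 0.0
    for t in range(n_pulls):
        if random.random() < epsilon:
            a_star = random.randrange(n_arms)                # explore
        else:
            a_star = max(range(n_arms), key=lambda a: Q[a])  # exploit
        r_t = pull_arm(a_star)
        Q[a_star] += eta * (r_t - Q[a_star])   # nudge estimate toward observed payoff
        total_reward += r_t
    return Q, total_reward

# Example: two arms paying a quarter with different (unknown) probabilities.
estimates, earnings = run_bandit(
    lambda a: 0.25 if random.random() < (0.3, 0.6)[a] else 0.0,
    n_pulls=1000)
```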

  10. Can generalize to “k-armed bandit”, where each arm is a possible “strategy”. • This model was used both for GAs and for reinforcement learning.

  11. Applications of reinforcement learning:A few examples • Learning to play backgammon • Robot arm control (juggling) • Robo-soccer • Robot navigation • Elevator dispatching • Power systems stability control • Job-shop scheduling • Air traffic control

  12. Robby the Robot can learn via reinforcement learning (“policy” = “strategy”)
     Sensors: N, S, E, W, C(urrent)
     Actions: Move N, Move S, Move E, Move W, Move random, Stay put, Try to pick up can
     Rewards/Penalties (points):
     • Picks up can: 10
     • Tries to pick up can on empty site: −1
     • Crashes into wall: −5
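As a concrete reading of the reward numbers above, here is a minimal sketch of Robby's reward rules. The site encoding and the function name reward are illustrative assumptions (the slide's grid picture is not reproduced here), and actions outside the three listed cases are assumed to score 0 points.

```python
# Hypothetical encoding of Robby's world: each grid site holds 'E' (empty),
# 'C' (can), or 'W' (wall). The point values below are the ones on the slide.
def reward(current_site, action, crashed_into_wall):
    """Points Robby receives for one action.

    current_site      -- contents of Robby's current site ('E' or 'C')
    action            -- the chosen action, e.g. 'pick_up' or 'move_north'
    crashed_into_wall -- True if a move action ran into a wall
    """
    if action == "pick_up":
        return 10 if current_site == "C" else -1   # can picked up vs. empty site
    if crashed_into_wall:
        return -5                                  # bumped into a wall
    return 0                                       # other outcomes: assumed 0 points
```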

  13. Formalization. Reinforcement learning is typically formalized as a Markov decision process (MDP). Agent L only knows the current state and the actions available from that state. Agent L can:
     • perceive a set S of distinct states of an environment
     • perform a set A of actions.
     Components of a Markov decision process (MDP):
     • Possible set S of states (state space)
     • Possible set A of actions (action space)
     • State transition function (unknown to learner L): δ: S × A → S, δ(st, at) = st+1
     • Reward function (unknown to learner L): r: S × A → ℝ, r(st, at) = rt

  14. Formalization (continued). The same MDP components as on the previous slide, with a note added: both δ and r can be deterministic or probabilistic. In the Robby the Robot example, they are both deterministic.
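To make these components concrete, here is a minimal sketch of a deterministic MDP bundled as a Python object; the class and attribute names are illustrative assumptions, and step shows the only thing the learner actually observes when it acts.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable

class DeterministicMDP:
    """The MDP components from the slide: state space S, action space A,
    transition function delta: S x A -> S, and reward function r: S x A -> R.
    The learner is not given delta or r; it only sees the result of acting."""

    def __init__(self,
                 states: Sequence[State],
                 actions: Sequence[Action],
                 delta: Callable[[State, Action], State],
                 reward: Callable[[State, Action], float]):
        self.states, self.actions = states, actions
        self.delta, self.reward = delta, reward

    def step(self, s: State, a: Action):
        """What the agent observes after acting: the next state and the reward."""
        return self.delta(s, a), self.reward(s, a)
```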

  15. Goal is for agent L to learn a policy π, π: S → A, such that π maximizes cumulative reward.

  16. Example: Cumulative value of a policy.
     Policy A: Always move east, picking up cans as you go.
     Policy B: Move up and down the columns of the grid, picking up cans as you go.
     (Sensors, actions, and rewards/penalties are as on slide 12: picks up can: 10; tries to pick up can on empty site: −1; crashes into wall: −5.)

  17. Value function: formal definition of “cumulative value” with discounted rewards. Let Vπ(st) denote the “cumulative value” of π starting from initial state st: Vπ(st) = rt + γ rt+1 + γ^2 rt+2 + ... = Σi γ^i rt+i • where 0 ≤ γ ≤ 1 is a “discounting” constant that determines the relative value of delayed versus immediate rewards. • Vπ(st) is called the “value function” for policy π. It gives the expected value of starting in state st and following policy π “forever”.

  18. Value function (continued). Same definition as on the previous slide. Why use a discounting constant?

  19. Note that rewards received i time steps into the future are discounted exponentially (by a factor of γ^i). • If γ = 0, we only care about immediate reward. • The closer γ is to 1, the more we care about future rewards, relative to immediate reward.
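A quick numerical sketch of what the discount factor does; the three-step reward sequence and the helper name discounted_return are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_{t+i} over a (finite) reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [0, 0, 10]                     # a reward of 10 arrives two steps from now
print(discounted_return(rewards, 0.0))   # 0.0  -- only immediate reward counts
print(discounted_return(rewards, 0.5))   # 2.5  -- 0.5**2 * 10
print(discounted_return(rewards, 0.9))   # ~8.1 -- 0.9**2 * 10
```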

  20. Precise specification of learning task: we require that the agent learn a policy π that maximizes Vπ(s) for all states s. We call such a policy an optimal policy, and denote it by π*: π* = argmaxπ Vπ(s), for all s. To simplify notation, let V*(s) denote Vπ*(s).

  21. What should L learn? • Hard to learn π* directly, since we don’t have training data of the form (s, a) • The only training data available to the learner is a sequence of rewards: r(s0, a0), r(s1, a1), ... • So, what should the learner L learn?

  22. Learn evaluation function V? • One possibility: learn the evaluation function V*(s). • Then L would know what action to take next: • L should prefer state s1 over s2 whenever V*(s1) > V*(s2) • The optimal action a in state s is the one that maximizes the sum of r(s,a) (immediate reward) and V* of the immediate successor state, discounted by γ: π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
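As a sketch of what acting on V* would look like in code, with placeholder names r, delta, V_star, and gamma standing in for the (unknown) reward function, transition function, learned value function, and discount factor:

```python
def greedy_action_from_V(s, actions, r, delta, V_star, gamma):
    """Pick the action maximizing r(s, a) + gamma * V_star(delta(s, a)).

    Note that evaluating this expression needs both delta and r, which is
    exactly the knowledge the learner does not have (see the next slide).
    """
    return max(actions, key=lambda a: r(s, a) + gamma * V_star(delta(s, a)))
```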

  23. Problem with learning V*: Using V* to obtain the optimal policy π* requires perfect knowledge of δ and r, which we earlier said are unknown to L.

  24. Alternative: Q learning • Alternative: we will estimate the following evaluation function Q: S × A → ℝ, defined as follows: Q(s,a) = r(s,a) + γ V*(δ(s,a)) • Suppose learner L has a good estimate for Q(s,a). Then, at each time step, L simply chooses the action that maximizes Q(s,a).
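By contrast with the V*-based rule above, once Q estimates are available, choosing an action needs neither δ nor r. A minimal sketch, assuming the estimates live in a dictionary Q keyed by (state, action) pairs:

```python
def greedy_action_from_Q(s, actions, Q):
    """With a learned table Q[(s, a)], just take the action with the
    largest estimated Q-value; no transition or reward model is needed."""
    return max(actions, key=lambda a: Q[(s, a)])
```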

  25. How to learn Q? (In theory) We have V*(s) = maxa´ Q(s, a´), and therefore Q(s,a) = r(s,a) + γ maxa´ Q(δ(s,a), a´). We can learn Q via iterative approximation of this recursive equation.

  26. How to learn Q? (In practice)
     • Initialize Q(s,a) to small random values for all s and a
     • Initialize s
     • Repeat forever (or as long as you have time for):
       • Select action a
       • Take action a and receive reward r
       • Observe new state s´
       • Update Q(s,a) ← Q(s,a) + η (r + γ maxa´ Q(s´,a´) − Q(s, a))
       • Update s ← s´
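A runnable sketch of this loop, assuming a tabular Q stored in a dictionary, an environment function step(s, a) that returns (next_state, reward), and ε-greedy choice for the unspecified “select action a” step; the function name, defaults, and exploration scheme are illustrative, not from the slides.

```python
import random
from collections import defaultdict

def q_learning(actions, step, start_state, n_steps,
               eta=0.2, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning loop following the slide.

    step(s, a) is assumed to return (next_state, reward). Q is initialized
    to 0 for every (s, a) here rather than to small random values.
    """
    Q = defaultdict(float)              # Q(s, a) table
    s = start_state                     # initialize s
    for _ in range(n_steps):            # "repeat as long as you have time for"
        if random.random() < epsilon:
            a = random.choice(actions)                     # explore
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])    # exploit
        s_next, r = step(s, a)          # take action a, get reward r, observe s'
        best_next = max(Q[(s_next, a_)] for a_ in actions)
        # Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
        s = s_next                      # update s <- s'
    return Q
```

For the Robby example, actions would be the seven actions listed on slide 12 and step would wrap the grid simulation.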

  27. Example of Q learning. Let γ = 0.2 and η = 1. (The following slides step through updates on Robby's grid; the state is the contents of the North, South, East, West, and Current sites, where W = Wall, E = Empty, C = Can, and the reward received at each step is noted on each slide.)

  28. Example of Q learning: Initialize Q(s,a) to small random values (here, 0) for all s and a.

  29. Example of Q learning: Initialize s.

  30. Example of Q learning (Trial 1): Select action a.

  31. Example of Q learning (Trial 1): Perform action a and receive reward r = −1.

  32. Example of Q learning (Trial 1): Observe new state s´.

  33. Example of Q learning (Trial 1): Update Q(s,a) ← Q(s,a) + η (r + γ maxa´ Q(s´,a´) − Q(s, a)), with r = −1.
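For concreteness: with the settings from slide 27 (η = 1, γ = 0.2), and assuming every Q entry is still at its initial value of 0 (slide 28), this first update works out as

$$Q(s,a) \leftarrow 0 + 1\cdot\bigl(-1 + 0.2\cdot\max_{a'} Q(s',a') - 0\bigr) = -1 + 0.2\cdot 0 = -1,$$

so the entry for this state–action pair drops to −1.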

  34. Example of Q learning (Trial 1): Update s ← s´.

  35. Example of Q learning (Trial 2): Select action a.

  36. Example of Q learning (Trial 2): Perform action a and receive reward r = 0.

  37. Example of Q learning (Trial 2): Observe new state s´.

  38. Example of Q learning (Trial 2): Update Q(s,a) ← Q(s,a) + η (r + γ maxa´ Q(s´,a´) − Q(s, a)), with r = 0.

  39. Example of Q learning (Trial 2): Update s ← s´.

  40. Example of Q learning (Trial 3): Select action a.

  41. Example of Q learning (Trial 3): Perform action a and receive reward r = 10.

  42. Example of Q learning (Trial 3): Observe new state s´.

  43. Example of Q learning (Trial 3): Update Q(s,a) ← Q(s,a) + η (r + γ maxa´ Q(s´,a´) − Q(s, a)), with r = 10.
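Likewise, assuming the entries involved are still at 0 when the can is picked up in trial 3, the update is

$$Q(s,a) \leftarrow 0 + 1\cdot\bigl(10 + 0.2\cdot 0 - 0\bigr) = 10,$$

so that state–action pair jumps to 10; later updates can pass a discounted share (γ = 0.2) of this value back to state–action pairs that lead here.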

  44. Example of Q learning (Trial 3): Update s ← s´.

  45. Skipping ahead... (Trial m): Select action a.

  46. Skipping ahead... (Trial m): Perform action a and receive reward r = 10.

  47. Skipping ahead... (Trial m): Observe new state s´.

  48. Skipping ahead... (Trial m): Update Q(s,a) ← Q(s,a) + η (r + γ maxa´ Q(s´,a´) − Q(s, a)), with r = 10.

  49. Now on a new environment... (Trial n): Select action a.
