
Reinforcement Learning


Presentation Transcript


  1. Reinforcement Learning Guest Lecturer: Chengxiang Zhai 15-681 Machine Learning December 6, 2001

  2. Outline For Today • The Reinforcement Learning Problem • Markov Decision Process • Q-Learning • Summary

  3. The Checker Problem Revisited • Goal: To win every game! • What to learn: Given any board position, choose a “good” move • But, what is a “good” move? • A move that helps win a game • A move that will lead to a “better” board position • So, what is a “better” board position? • A position where a “good” next move exists!

  4. Structure of the Checker Problem • You are interacting/experimenting with an environment (board) • You see the state of the environment (board position) • And, you take an action (move), which will • change the state of the environment • result in an immediate reward • Immediate reward = 0 unless you win (+100) or lose (-100) the game • You want to learn to “control” the environment (board) so as to maximize your long-term reward (win the game)

  5. Reinforcement Learning Problem • At each time step t, the agent observes state st and reward rt, and takes action at • The environment responds with reward rt+1 and new state st+1 • Interaction generates a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, … • Goal: maximize the cumulative discounted reward r1 + γr2 + γ²r3 + … (0 < γ < 1 is the discount factor)
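
A quick way to make the discounted return concrete is to compute it directly; the minimal Python sketch below uses a made-up reward sequence and γ = 0.9, chosen only for illustration.

# Illustration only: the discounted return r1 + γ*r2 + γ²*r3 + ...
# The reward sequence and gamma are made-up example values.
gamma = 0.9
rewards = [0, 0, 0, 100]    # e.g. nothing until a win on the last move
ret = sum(gamma**i * r for i, r in enumerate(rewards))
print(ret)                  # 0.9**3 * 100 = 72.9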

  6. Three Elements in RL • State s • Action a • Reward r (Slide from Prof. Sebastian Thrun’s lecture)

  7. Example 1 : Slot Machine • State: configuration of slots • Action: stopping time • Reward: $$$ (Slide from Prof. Sebastian Thrun’s lecture)

  8. Example 2 : Mobile Robot • State: location of robot, people, etc. • Action: motion • Reward: the number of happy faces (Slide from Prof. Sebastian Thrun’s lecture)

  9. Example 3 : Backgammon • State: Board position • Action: move • Reward: • win (+100) • lose (-100) • TD-Gammon ≈ best human players in the world

  10. What Are We Learning Exactly? • A decision function/policy • Given the state, choose an action • Formally, • States: S = {s1, …, sn} • Actions: A = {a1, …, am} • Reward: R • Find π: S → A that maximizes R (cumulative reward over time)
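
Concretely, a deterministic policy is just a mapping from states to actions; in code it can be as simple as a lookup table (the state and action names below are placeholders, not from the lecture).

# A policy pi: S -> A as a plain lookup table (illustrative states/actions).
pi = {"s1": "right", "s2": "right", "s3": "left"}

def act(s):
    return pi[s]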

  11. So, What’s Special About Reinforcement Learning? Find π: S → A ⇒ Function Approx.?

  12. Reinforcement Learning Problem • At each time step t, the agent observes state st and reward rt, and takes action at • The environment responds with reward rt+1 and new state st+1 • Interaction generates a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, … • Goal: maximize the cumulative discounted reward r1 + γr2 + γ²r3 + … (0 < γ < 1 is the discount factor)

  13. What’s So Special About RL? (Answers from “the book”) • Delayed Reward • Exploration • Partially observable states • Life-long learning

  14. Now that we know the problem, how do we solve it? ==> Markov Decision Process (MDP)

  15. Markov Decision Process (MDP) • Finite set of states S • Finite set of actions A • At each time step the agent observes state st ∈ S and chooses action at ∈ A(st) • Then receives immediate reward rt+1 = r(st,at) • And the state changes to st+1 = δ(st,at) • Markov assumption: st+1 = δ(st,at) and rt+1 = r(st,at) • The next reward and state depend only on the current state st and action at • Functions δ(st,at) and r(st,at) may be non-deterministic • Functions δ(st,at) and r(st,at) are not necessarily known to the agent
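
To keep the later slides concrete, here is a minimal deterministic MDP sketched in Python; the three states, two actions, and the +100 reward for reaching s3 are invented for illustration and are not part of the lecture.

# A toy deterministic MDP: delta(s, a) gives the next state, r(s, a) the reward.
S = ["s1", "s2", "s3"]
A = ["left", "right"]
delta = {("s1", "right"): "s2", ("s2", "right"): "s3", ("s3", "right"): "s3",
         ("s1", "left"):  "s1", ("s2", "left"):  "s1", ("s3", "left"):  "s2"}
# Reward +100 for moving into the goal state s3, 0 otherwise.
r = {(s, a): (100 if s_next == "s3" and s != "s3" else 0)
     for (s, a), s_next in delta.items()}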

  16. Learning A Policy • A policy tells us how to choose an action given a state • An optimal policy is one that gives the best cumulative reward from any initial state • We define a cumulative value function for a policy π: Vπ(s) = rt + γrt+1 + γ²rt+2 + … = Σi=0…∞ γ^i rt+i, where rt, rt+1, … are generated by following policy π from start state s • Task: learn the optimal policy π* that maximizes Vπ(s) for every s: π* = argmaxπ Vπ(s), ∀s • Define the optimal value function V*(s) = Vπ*(s)
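
For a deterministic MDP, Vπ(s) can be estimated simply by following π and summing discounted rewards; the sketch below reuses the toy delta and r from the previous snippet and truncates the sum at a finite horizon, which is safe because γ < 1 makes the tail vanish.

# Estimate V_pi(s) by rolling the policy out for a finite horizon.
def V_pi(s, pi, delta, r, gamma=0.9, horizon=50):
    total, coeff = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        total += coeff * r[(s, a)]
        coeff *= gamma
        s = delta[(s, a)]
    return total

print(V_pi("s1", lambda s: "right", delta, r))   # 0 + 0.9*100 + 0 + ... = 90.0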

  17. Idea 1: Enumerating Policies • For each policy π: S → A • For each state s, compute the evaluation function Vπ(s) • Pick the π that has the largest Vπ(s) • What’s the problem? • Complexity! • How do we get around this? • Observation: if we know V*(s) = Vπ*(s), we can find π*: π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
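
The observation above is easy to state in code: given r, δ, and V*, the optimal action is the greedy one-step lookahead. The sketch reuses the toy MDP from earlier; the V* values passed in are assumed numbers, treating s3 as a terminal (value 0) state.

# Greedy policy extraction: pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ]
def greedy_policy(s, A, r, delta, V_star, gamma=0.9):
    return max(A, key=lambda a: r[(s, a)] + gamma * V_star[delta[(s, a)]])

print(greedy_policy("s1", A, r, delta, {"s1": 90.0, "s2": 100.0, "s3": 0.0}))  # 'right'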

  18. Idea 2: Learn V*(s) • For each state, compute V*(s) (less complexity) • Given the current state s, choose the action according to π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))] • What’s the problem this time? • This works, but only if we know r(s,a) and δ(s,a) • How can we evaluate an action without knowing r(s,a) and δ(s,a)? • Observation: it seems that all we need is some function like Q(s,a) … [π*(s) = argmaxa Q(s,a)]

  19. Idea 3: Learn Q(s,a) • Because we know π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))] • If we want π*(s) = argmaxa Q(s,a), then we must have Q(s,a) = r(s,a) + γ V*(δ(s,a)) • We can express V* in terms of Q! V*(s) = maxa Q(s,a) • So we have THE RULE FOR Q-LEARNING: Q(s,a) = r(s,a) + γ maxa’ Q(δ(s,a), a’) • Reading the rule: Q(s,a) is the value of a on s, r(s,a) is the immediate reward of a on s, and maxa’ Q(δ(s,a), a’) is the best value of any action on the next state
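
The payoff of learning Q rather than V* is that acting no longer needs r or δ. A small sketch, using Q values that the toy MDP above would produce if s3 is treated as terminal (these numbers are worked out for illustration, not given on the slide):

# Once Q is known, both V*(s) = max_a Q(s,a) and pi*(s) = argmax_a Q(s,a)
# can be read straight off the table -- no model of r or delta required.
Q = {("s1", "left"): 81.0, ("s1", "right"): 90.0,
     ("s2", "left"): 81.0, ("s2", "right"): 100.0}

def V_star(s, A, Q):
    return max(Q[(s, a)] for a in A)

def pi_star(s, A, Q):
    return max(A, key=lambda a: Q[(s, a)])

print(V_star("s1", ["left", "right"], Q), pi_star("s1", ["left", "right"], Q))  # 90.0 right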

  20. Q-Learning for Deterministic Worlds For each <s,a>, initialize the table entry Q(s,a) = 0 Observe the current state s Do forever: • Select an action a and execute it • Receive immediate reward r • Observe the new state s’ • Update the entry for Q(s,a) as follows: Q(s,a) = r + γ maxa’ Q(s’,a’) • Change to state s’
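
A minimal Python sketch of the loop above, under a few assumptions that are mine rather than the slide's: the task is episodic with a set of terminal states, actions are selected uniformly at random, and γ = 0.9.

import random
from collections import defaultdict

def q_learning(S, A, delta, r, terminal, gamma=0.9, episodes=2000):
    Q = defaultdict(float)                       # initialize every Q(s,a) to 0
    for _ in range(episodes):
        s = random.choice(S)                     # observe a starting state
        while s not in terminal:
            a = random.choice(A)                 # select an action and execute it
            reward, s_next = r[(s, a)], delta[(s, a)]   # immediate reward, new state
            # update rule: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in A)
            s = s_next                           # change to state s'
    return Q

On the toy MDP sketched earlier, with terminal = {"s3"}, the table settles at Q(s1,right) = 90 and Q(s2,right) = 100, i.e. the γ-discounted value of the +100 win.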

  21. Why Does Q-Learning Work? • Q-learning converges! • Intuitively, for non-negative rewards, the estimated Q values never decrease and never exceed the true Q values • The maximum error goes down by a factor of γ after every state is updated • Under the update Qn+1(s,a) = r + γ maxa’ Qn(s’,a’), the new error |Qn+1(s,a) − Q(s,a)| is at most γ times the largest current error with respect to the true Q(s,a)

  22. Nondeterministic Case • Both r(s,a) and δ(s,a) may have probabilistic outcomes • Solution: just add expectations! Q(s,a) = E[r(s,a)] + γ Ep(s’|s,a)[maxa’ Q(s’,a’)] • The update rule is slightly different (partial updates, with the step size shrinking until updates effectively stop after enough visits) • It also converges
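
One common form of this "partial update" blends the old estimate with the new sample, using a step size that shrinks as (s,a) is visited more often; the 1/(1 + visits) decay schedule below is an illustrative assumption, not something fixed by the slide.

# Partial update: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') ]
visits = {}

def q_update(Q, s, a, reward, s_next, A, gamma=0.9):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])          # step size decays with the visit count
    target = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in A)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target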

  23. Extensions of Q-Learning • How can we accelerate Q-learning? • Choose the action a that maximizes Q(s,a) (exploration vs. exploitation) • Choose good updating sequences • Store past state-action transitions • Exploit knowledge of the transition and reward functions (simulation) • What if we can’t store all the entries? • Function approximation (neural networks, etc.)
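
For the exploration-vs-exploitation point, one standard choice (my example, not named on the slide) is ε-greedy action selection: mostly pick the action the current Q table favors, occasionally pick at random.

import random

def epsilon_greedy(Q, s, A, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(A)                        # explore
    return max(A, key=lambda a: Q.get((s, a), 0.0))    # exploit the current estimates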

  24. Temporal Difference (TD) Learning • Learn by reducing discrepancies between estimates made at different times • Q-learning is a special case with one-step lookahead. Why not more than one step? • TD(λ): blend one-step, two-step, …, n-step lookahead with coefficients depending on λ: Qλ(st,at) = rt + γ[(1 − λ) maxa Q(st+1,a) + λ Qλ(st+1,at+1)] • When λ = 0, we get one-step Q-learning • When λ = 1, only the observed r values are considered
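
To see how the blend works, the recursive target can be computed along a recorded episode; the sketch below is an illustrative reading of the formula, assuming lists of visited states and received rewards from a rollout.

# TD(lambda) target: r_t + gamma * [ (1 - lam) * max_a Q(s_{t+1}, a) + lam * (target at t+1) ]
def td_lambda_target(t, states, rewards, Q, A, lam=0.5, gamma=0.9):
    if t == len(rewards) - 1:          # last recorded step: only the observed reward
        return rewards[t]
    best_next = max(Q.get((states[t + 1], a), 0.0) for a in A)
    return rewards[t] + gamma * ((1 - lam) * best_next
                                 + lam * td_lambda_target(t + 1, states, rewards,
                                                          Q, A, lam, gamma))

Setting lam = 0 collapses this to the one-step Q-learning target rt + γ maxa Q(st+1,a); setting lam = 1 drops the max term and leaves only the observed rewards rt + γrt+1 + γ²rt+2 + ….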

  25. What You Should Know • All basic concepts of RL (state, action, reward, policy, value functions, discounted cumulative reward, …) • Mathematical foundation of RL is MDP and dynamic programming • Details of Q-learning including its limitation (You should be able to implement it!) • Q-learning is a member of temporal difference algorithms
