Reinforcement Learning Guest Lecturer: Chengxiang Zhai 15-681 Machine Learning December 6, 2001
Outline For Today • The Reinforcement Learning Problem • Markov Decision Process • Q-Learning • Summary
The Checker Problem Revisited • Goal: To win every game! • What to learn: Given any board position, choose a “good” move • But, what is a “good” move? • A move that helps win a game • A move that will lead to a “better” board position • So, what is a “better” board position? • A position where a “good” next move exists!
Structure of the Checker Problem • You are interacting/experimenting with an environment (board) • You see the state of the environment (board position) • And, you take an action (move), which will • change the state of the environment • result in an immediate reward • Immediate reward = 0 unless you win (+100) or lose (-100) the game • You want to learn to “control” the environment (board) so as to maximize your long term reward (win the game)
Reinforcement Learning Problem • Agent: observes state s_t, takes action a_t • Environment: returns reward r_{t+1} and new state s_{t+1} • The interaction produces a trajectory s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, … • Goal: maximize the discounted cumulative reward r_1 + γ r_2 + γ^2 r_3 + … (0 ≤ γ < 1 is the discount factor)
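To make the objective concrete, here is a minimal Python sketch (not from the lecture; the function name and the example reward sequence are invented for illustration) that computes the discounted cumulative reward the agent tries to maximize.

def discounted_return(rewards, gamma=0.9):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward sequence."""
    assert 0 <= gamma < 1
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: intermediate rewards are 0, winning at the end gives +100.
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 100 * 0.9**3 = 72.9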
Three Elements in RL • State s • Action a • Reward r (Slide from Prof. Sebastian Thrun's lecture)
Example 1 : Slot Machine • State: configuration of slots • Action: stopping time • Reward: $$$ (Slide from Prof. Sebastian Thrun’s lecture)
Example 2 : Mobile Robot • State: location of robot, people, etc. • Action: motion • Reward: the number of happy faces (Slide from Prof. Sebastian Thrun’s lecture)
Example 3 : Backgammon • State: Board position • Action: move • Reward: • win (+100) • lose (-100) • TD-Gammon plays at the level of the best human players in the world
What Are We Learning Exactly? • A decision function/policy • Given the state, choose an action • Formally, • States: S = {s1, …, sn} • Actions: A = {a1, …, am} • Reward: R • Find π: S → A that maximizes R (cumulative reward over time)
So, What's Special About Reinforcement Learning? • We are still learning a function π: S → A • Isn't this just function approximation?
Reinforcement Learning Problem (revisited) • Same agent-environment loop: observe s_t, take a_t, receive r_{t+1}, move to s_{t+1} • Maximize r_1 + γ r_2 + γ^2 r_3 + … (0 ≤ γ < 1 discount factor)
What's So Special About RL? (Answers from “the book”) • Delayed reward • Exploration • Partially observable states • Life-long learning
Now that we know the problem, how do we solve it? ==> Markov Decision Process (MDP)
Markov Decision Process (MDP) • Finite set of states S • Finite set of actions A • At each time step the agent observes state s_t ∈ S and chooses action a_t ∈ A(s_t) • Then receives immediate reward r_{t+1} = r(s_t, a_t) • And the state changes to s_{t+1} = δ(s_t, a_t) • Markov assumption: s_{t+1} = δ(s_t, a_t) and r_{t+1} = r(s_t, a_t) • The next reward and state depend only on the current state s_t and action a_t • The functions δ(s_t, a_t) and r(s_t, a_t) may be non-deterministic • The functions δ(s_t, a_t) and r(s_t, a_t) are not necessarily known to the agent
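As a concrete illustration of these definitions, the following is a hypothetical toy deterministic MDP written as plain Python dictionaries (the three-state chain, its actions, and its rewards are made up for this example, not taken from the lecture): delta maps (state, action) to the next state, and r maps (state, action) to the immediate reward.

# Toy deterministic MDP: three states in a chain, two actions.
S = ["s1", "s2", "s3"]
A = ["left", "right"]

# delta(s, a): the next state
delta = {
    ("s1", "right"): "s2", ("s1", "left"): "s1",
    ("s2", "right"): "s3", ("s2", "left"): "s1",
    ("s3", "right"): "s3", ("s3", "left"): "s2",
}

# r(s, a): the immediate reward (reaching s3 pays +100, everything else 0)
r = {(s, a): 0 for s in S for a in A}
r[("s2", "right")] = 100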
Learning A Policy • A policy tells us how to choose an action given a state • An optimal policy is one that gives the best cumulative reward from any initial state • We define a cumulative value function for a policy π: V^π(s) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i=0..∞} γ^i r_{t+i}, where r_t, r_{t+1}, … are generated by following the policy π from start state s • Task: learn the optimal policy π* that maximizes V^π(s) for every s: π* = argmax_π V^π(s), ∀s • Define the optimal value function V*(s) = V^π*(s)
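One way to see what V^π(s) means is to compute it by simulation: follow the policy from s and add up the discounted rewards. The sketch below (an illustration only; it truncates the infinite sum at a fixed horizon and reuses the toy delta and r dictionaries from the previous sketch) does exactly that.

def evaluate_policy(pi, s, delta, r, gamma=0.9, horizon=100):
    """Approximate V^pi(s) by following pi for a fixed number of steps."""
    value, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi[s]                      # the policy picks the action
        value += discount * r[(s, a)]  # accumulate the discounted reward
        discount *= gamma
        s = delta[(s, a)]              # deterministic transition
    return value

# Example policy for the toy MDP: always move right.
pi_right = {"s1": "right", "s2": "right", "s3": "right"}
# evaluate_policy(pi_right, "s1", delta, r)  -> 0 + 0.9*100 + 0 + ... = 90.0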
Idea 1: Enumerating Policies • For each policy π: S → A • For each state s, compute the evaluation function V^π(s) • Pick the π that has the largest V^π(s) • What's the problem? • Complexity! There are |A|^|S| deterministic policies to evaluate • How do we get around this? • Observation: if we know V*(s) = V^π*(s), we can find π* directly: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
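The brute-force idea can be written out directly; the sketch below (hypothetical, reusing evaluate_policy and the toy MDP from the earlier sketches, and scoring a policy by its summed value over start states as a simple surrogate) makes the blow-up visible: the loop runs |A|^|S| times.

from itertools import product

def best_policy_by_enumeration(S, A, delta, r, gamma=0.9):
    """Idea 1: score every deterministic policy and keep the best one."""
    best_pi, best_score = None, float("-inf")
    for choice in product(A, repeat=len(S)):   # |A|^|S| candidate policies
        pi = dict(zip(S, choice))              # one action per state
        # Score a policy by its total value over all start states.
        score = sum(evaluate_policy(pi, s, delta, r, gamma) for s in S)
        if score > best_score:
            best_pi, best_score = pi, score
    return best_pi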
Idea 2: Learn V*(s) • For each state, compute V*(s) (far less complexity) • Given the current state s, choose the action according to π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))] • What's the problem this time? • This works, but only if we know r(s,a) and δ(s,a) • How can we evaluate an action without knowing r(s,a) and δ(s,a)? • Observation: it seems that all we need is some function Q(s,a) with π*(s) = argmax_a Q(s,a)
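If V*, r, and δ are all known, the greedy choice on this slide is a one-liner; this hypothetical sketch assumes V_star is a Python dict from states to their optimal values and that delta and r are lookup tables like the toy ones above.

def greedy_action(s, A, V_star, delta, r, gamma=0.9):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ]"""
    return max(A, key=lambda a: r[(s, a)] + gamma * V_star[delta[(s, a)]])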
Idea 3: Learn Q(s,a) • We know π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))] • If we want π*(s) = argmax_a Q(s,a), then we must have Q(s,a) = r(s,a) + γ V*(δ(s,a)) • We can express V* in terms of Q: V*(s) = max_a Q(s,a) • So we have THE RULE FOR Q-LEARNING: Q(s,a) = r(s,a) + γ max_a' Q(δ(s,a), a') • (value of a on s) = (immediate reward of a on s) + (discounted best value of any action on the next state)
Q-Learning for Deterministic Worlds • For each <s,a>, initialize the table entry Q(s,a) = 0 • Observe the current state s • Do forever: • Select an action a and execute it • Receive immediate reward r • Observe the new state s' • Update the table entry: Q(s,a) ← r + γ max_a' Q(s',a') • Change to state s': s ← s'
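A minimal tabular implementation of this loop might look like the Python sketch below; the episode-based structure and the environment interface (env.reset() returning a start state, env.step(a) returning reward, next state, and a done flag) are assumptions made for the example, not part of the lecture.

import random
from collections import defaultdict

def q_learning(env, A, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world."""
    Q = defaultdict(float)                       # Q(s,a), initialized to 0
    for _ in range(episodes):
        s = env.reset()                          # observe the current state
        done = False
        while not done:
            a = random.choice(A)                 # select an action and execute it
            reward, s_next, done = env.step(a)   # receive reward, observe new state
            # Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in A)
            s = s_next                           # change to state s'
    return Q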
Why does Q-learning Work? • Q-learning converges! • Intuitively, for non-negative rewards, the estimated Q values never decrease and never exceed the true Q values: 0 ≤ Q_n(s,a) ≤ Q_{n+1}(s,a) ≤ true Q(s,a) • The maximum error goes down by a factor of γ each time every state-action pair has been updated • Update rule: Q_{n+1}(s,a) = r + γ max_a' Q_n(s',a')
Nondeterministic Case • Both r(s,a) and δ(s,a) may have probabilistic outcomes • Solution: just add expectations! Q(s,a) = E[r(s,a)] + γ E_p(s'|s,a)[max_a' Q(s',a')] • The update rule is slightly different: update Q(s,a) only partially toward the new sample, with a learning rate that decays as <s,a> is visited more often (so the entry nearly “stops” changing after enough visits) • Also converges
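A common form of the partial update (an assumption here, consistent with the partial updating mentioned above) uses a learning rate alpha that decays with the number of times <s,a> has been visited, so the entry settles down after enough visits.

def nondeterministic_update(Q, visits, s, a, reward, s_next, A, gamma=0.9):
    """Partial update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])           # decays as <s,a> is visited more often
    target = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in A)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target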
Extensions of Q-Learning • How can we accelerate Q-learning? • Choose the action a that maximizes Q(s,a), or explore instead? (exploration vs. exploitation) • Choose good updating sequences • Store past state-action transitions and replay them • Exploit knowledge of the transition and reward functions (simulation) • What if we can't store all the entries? • Function approximation (neural networks, etc.)
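For the exploration-vs-exploitation point, one standard tactic (not spelled out in the slides, so treat the details as an illustrative assumption) is an epsilon-greedy choice: usually take the action with the highest current Q value, but occasionally pick a random one so that unexplored actions still get tried.

import random

def epsilon_greedy(Q, s, A, epsilon=0.1):
    """Exploit the current Q table most of the time, explore occasionally."""
    if random.random() < epsilon:
        return random.choice(A)                      # explore
    return max(A, key=lambda a: Q.get((s, a), 0.0))  # exploit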
Temporal Difference (TD) Learning • Learn by reducing discrepancies between estimates made at different times • Q-learning is a special case with one-step lookahead; why not look ahead more than one step? • TD(λ): blend one-step, two-step, …, n-step lookaheads with coefficients that depend on λ: Q^λ(s_t,a_t) = r_t + γ [(1-λ) max_a Q(s_{t+1},a) + λ Q^λ(s_{t+1},a_{t+1})] • When λ = 0, we get one-step Q-learning • When λ = 1, only the observed r values are considered
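One way to compute the blended TD(λ) targets is to walk backwards over a finished episode; the episode format (a list of (state, action, reward) triples) and the helper name below are assumptions made for this sketch.

def td_lambda_targets(episode, Q, A, gamma=0.9, lam=0.5):
    """Compute Q^lambda(s_t, a_t) targets for each step of a finished episode."""
    targets = [0.0] * len(episode)
    next_target = 0.0
    for t in reversed(range(len(episode))):
        s, a, reward = episode[t]
        if t + 1 < len(episode):
            s_next = episode[t + 1][0]
            one_step = max(Q.get((s_next, a2), 0.0) for a2 in A)
            # Blend the one-step lookahead with the recursive TD(lambda) estimate.
            blended = (1 - lam) * one_step + lam * next_target
        else:
            blended = 0.0                        # terminal step: no future value
        targets[t] = reward + gamma * blended
        next_target = targets[t]
    return targets

# lam=0 recovers the one-step Q-learning target; lam=1 uses only the observed rewards.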
What You Should Know • All basic concepts of RL (state, action, reward, policy, value functions, discounted cumulative reward, …) • Mathematical foundation of RL is MDP and dynamic programming • Details of Q-learning including its limitation (You should be able to implement it!) • Q-learning is a member of temporal difference algorithms