Reinforcement Learning (2) Lirong Xia Tue, March 21, 2014
Reminder • Project 2 due tonight • Project 3 is online (more later) • due in two weeks
Recap: MDPs • Markov decision processes: • States S • Start state s0 • Actions A • Transition p(s’|s,a) (or T(s,a,s’)) • Reward R(s,a,s’) (and discount γ) • MDP quantities: • Policy = choice of action for each (MAX) state • Utility (or return) = sum of discounted rewards
Optimal Utilities • The value of a state s: • V*(s) = expected utility starting in s and acting optimally • The value of a Q-state (s,a): • Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally • The optimal policy: • π*(s) = optimal action from state s
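The three quantities above are tied together by the standard Bellman relations. They are written here with the T, R, and γ from the recap slide; the equations are not taken from the slide itself but follow from the definitions:

```latex
\begin{align*}
V^*(s)   &= \max_a Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\, V^*(s')\bigr] \\
\pi^*(s) &= \arg\max_a Q^*(s,a)
\end{align*}
```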
Solving MDPs • Value iteration • Start with V1(s) = 0 • Given Vi, calculate the values for all states for depth i+1: Vi+1(s) = max_a Σ_s’ T(s,a,s’) [R(s,a,s’) + γ Vi(s’)] • Repeat until convergence • Use Vi as the evaluation function when computing Vi+1 • Policy iteration • Step 1: policy evaluation: calculate utilities for some fixed policy • Step 2: policy improvement: update the policy using one-step look-ahead with the resulting utilities as future values • Repeat until the policy converges
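A minimal Python sketch of value iteration, followed by the one-step look-ahead used in policy improvement. The transition-table layout, the function names, and the two-state toy MDP are all assumptions made for illustration; none of them come from the course code.

```python
from typing import Dict, List, Tuple

# Assumed layout: transitions[(s, a)] = [(next_state, probability, reward), ...]
Transitions = Dict[Tuple[str, str], List[Tuple[str, float, float]]]


def q_from_values(s, a, values, transitions, gamma):
    """One-step look-ahead: sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (r + gamma * values[s2]) for s2, p, r in transitions[(s, a)])


def value_iteration(states, transitions, gamma, tol=1e-10, max_iters=10_000):
    values = {s: 0.0 for s in states}  # start from all-zero values
    for _ in range(max_iters):
        new_values = {}
        for s in states:
            legal = [a for (s0, a) in transitions if s0 == s]
            # States with no actions (terminal states) keep value 0.
            new_values[s] = max(
                (q_from_values(s, a, values, transitions, gamma) for a in legal),
                default=0.0,
            )
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < tol:  # stop once the values have stopped changing
            break
    return values


def greedy_policy(states, transitions, values, gamma):
    """Policy improvement step: pick the action with the best look-ahead value."""
    policy = {}
    for s in states:
        legal = [a for (s0, a) in transitions if s0 == s]
        if legal:
            policy[s] = max(
                legal, key=lambda a: q_from_values(s, a, values, transitions, gamma)
            )
    return policy


# Hypothetical toy MDP: "stay" pays 1 forever; "exit" usually pays 10 and ends.
toy_states = ["s0", "end"]
toy_transitions = {
    ("s0", "stay"): [("s0", 1.0, 1.0)],
    ("s0", "exit"): [("end", 0.8, 10.0), ("s0", 0.2, 0.0)],
}
toy_values = value_iteration(toy_states, toy_transitions, gamma=0.5)
toy_policy = greedy_policy(toy_states, toy_transitions, toy_values, gamma=0.5)
print(toy_values, toy_policy)
```

In this toy MDP the fixed point for s0 solves V = 8 + 0.1 V, so the value should settle near 80/9 with "exit" as the greedy action.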
Reinforcement learning • Don’t know T and/or R, but can observe R • Learn by doing • can have multiple episodes (trials)
The Story So Far: MDPs and RL • Things we know how to do: • If we know the MDP • Compute V*, Q*, π* exactly • Evaluate a fixed policy π • If we don’t know the MDP • We can estimate the MDP, then solve it • We can estimate V for a fixed policy π • We can estimate Q*(s,a) for the optimal policy while executing an exploration policy • Techniques: • Computation • Value and policy iteration • Policy evaluation • Model-based RL • Sampling • Model-free RL • Q-learning
Model-Free Learning • Model-free (temporal difference) learning • Experience world through episodes (s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’…) • Update estimates each transition (s,a,r,s’) • Over time, updates will mimic Bellman updates
Detour: Q-Value Iteration • We’d like to do Q-value updates to each Q-state: Qi+1(s,a) = Σ_s’ T(s,a,s’) [R(s,a,s’) + γ max_a’ Qi(s’,a’)] • But we can’t compute this update without knowing T and R • Instead, compute the average as we go • Receive a sample transition (s,a,r,s’) • This sample suggests Q(s,a) ≈ r + γ max_a’ Q(s’,a’) • But we want to merge the new observation with the old ones • So keep a running average: Q(s,a) ← (1-α) Q(s,a) + α [r + γ max_a’ Q(s’,a’)]
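The running-average update described above can be sketched in Python as follows. This assumes a tabular Q stored in a dictionary keyed by (state, action); the function name, the default α and γ, and the two-state example are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Blend one sample transition (s, a, r, s') into the running average Q(s, a)."""
    # Value of the best action available in s'; 0 if s' has no actions (terminal).
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample
    return q[(s, a)]


# Unseen Q-states default to 0.
q_table = defaultdict(float)
first = q_learning_update(q_table, "A", "right", 1.0, "B", ["left", "right"])
second = q_learning_update(q_table, "A", "right", 1.0, "B", ["left", "right"])
print(first, second)
```

Each repeated observation of the same transition moves Q("A", "right") halfway toward the sample value, which is what a running average with α = 0.5 does.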
Q-Learning Properties • Converges to the optimal policy • If you explore enough (i.e. visit each Q-state many times) • If you make the learning rate small enough • Basically doesn’t matter how you select actions (!) • Off-policy learning: learns the optimal Q-values, not the values of the policy you are following
Q-Learning • Q-learning produces tables of q-values:
Exploration / Exploitation • Random actions (ε-greedy) • Every time step, flip a coin • With probability ε, act randomly • With probability 1-ε, act according to the current policy
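A minimal sketch of the coin flip described above, assuming the current policy is the greedy one with respect to a Q-table stored as a dictionary. The function name and the optional `rng` argument are illustrative choices, not part of the project code.

```python
import random
from typing import Dict, Hashable, Optional, Sequence, Tuple


def epsilon_greedy(
    q: Dict[Tuple[Hashable, Hashable], float],
    state: Hashable,
    actions: Sequence[Hashable],
    epsilon: float,
    rng: Optional[random.Random] = None,
):
    """With probability epsilon act randomly, otherwise act greedily on q."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(actions))  # explore
    # Exploit: unseen Q-states are treated as having value 0.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


example_q = {("s", "N"): 1.0, ("s", "S"): 2.0}
print(epsilon_greedy(example_q, "s", ["N", "S"], epsilon=0.1))
```

Passing a seeded `random.Random` makes runs repeatable, which helps when debugging a learning agent.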
Today: Q-Learning with state abstraction • In realistic situations, we cannot possibly learn about every single state! • Too many states to visit them all in training • Too many states to hold the Q-tables in memory • Instead, we want to generalize: • Learn about some small number of training states from experience • Generalize that experience to new, similar states • This is a fundamental idea in machine learning, and we’ll see it over and over again
Example: Pacman • Let’s say we discover through experience that this state is bad: • In naive Q-learning, we know nothing about this state or its Q-states: • Or even this one!
Feature-Based Representations • Solution: describe a state using a vector of features (properties) • Features are functions from states to real numbers (often 0/1) that capture important properties of the state • Example features: • Distance to closest ghost • Distance to closest dot • Number of ghosts • 1 / (dist to dot)^2 • Is Pacman in a tunnel? (0/1) • …etc • Is it the exact state on this slide? • Can also describe a Q-state (s,a) with features (e.g. action moves closer to food) • Similar to an evaluation function
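To make the idea concrete, here is a small sketch that computes a few of the example features for a toy state. The `ToyState` class, the feature names, and the use of Manhattan distance are assumptions for illustration; the real Pacman project has its own state and feature-extractor classes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pos = Tuple[int, int]


@dataclass
class ToyState:
    """Hypothetical stand-in for a game state: positions on a grid."""
    pacman: Pos
    ghosts: List[Pos]
    dots: List[Pos]


def manhattan(p: Pos, q: Pos) -> int:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def extract_features(state: ToyState) -> Dict[str, float]:
    """Map a state to a small dictionary of real-valued features."""
    features = {"num-ghosts": float(len(state.ghosts))}
    if state.ghosts:
        features["dist-closest-ghost"] = float(
            min(manhattan(state.pacman, g) for g in state.ghosts)
        )
    if state.dots:
        d = min(manhattan(state.pacman, dot) for dot in state.dots)
        features["dist-closest-dot"] = float(d)
        # Guard against division by zero if Pacman stands on a dot.
        features["inv-sq-dist-dot"] = 1.0 / max(d, 1) ** 2
    return features


state_a = ToyState(pacman=(1, 1), ghosts=[(1, 3), (4, 1)], dots=[(2, 1), (5, 5)])
# Same layout shifted by one cell: a different state with identical features.
state_b = ToyState(pacman=(2, 2), ghosts=[(2, 4), (5, 2)], dots=[(3, 2), (6, 6)])
features_a = extract_features(state_a)
features_b = extract_features(state_b)
print(features_a)
```

The two states differ, yet they map to the same feature vector, so anything learned about one carries over to the other.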
Linear Feature Functions • Using a feature representation, we can write a Q function (or value function) for any state using a few weights: • V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s) • Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a) • Advantage: our experience is summed up in a few powerful numbers • Disadvantage: states may share features but actually be very different in value!
Function Approximation • Q-learning with linear Q-functions: transition = (s,a,r,s’) • difference = [r + γ max_a’ Q(s’,a’)] - Q(s,a) • Exact Q’s: Q(s,a) ← Q(s,a) + α [difference] • Approximate Q’s: wi ← wi + α [difference] fi(s,a) • Intuitive interpretation: • Adjust weights of active features • E.g. if something unexpectedly bad happens, disprefer all states with that state’s features • Formal justification: online least squares
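Example: Q-Pacman • Current Q-function: Q(s,a) = 4.0 fDOT(s,a) - 1.0 fGST(s,a) • In state s: fDOT(s,NORTH) = 0.5, fGST(s,NORTH) = 1.0, so Q(s,NORTH) = +1 • Take a = NORTH, observe r = R(s,a,s’) = -500 and land in s’ • difference = -501 • wDOT ← 4.0 + α[-501]0.5 • wGST ← -1.0 + α[-501]1.0 • Updated Q-function: Q(s,a) = 3.0 fDOT(s,a) - 3.0 fGST(s,a)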
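The numbers in this example can be checked with a short sketch of the approximate Q-learning weight update. Two values are assumptions, because the slide does not state them: the best Q-value in s’ is taken to be 0 (which is what makes the difference -501), and α = 0.004 is inferred from the weights moving from 4.0 and -1.0 to roughly 3.0 and -3.0. The function names are hypothetical.

```python
from typing import Dict, Tuple


def linear_q(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Q(s,a) as a weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def approx_q_update(
    weights: Dict[str, float],
    features: Dict[str, float],
    reward: float,
    max_next_q: float,
    alpha: float,
    gamma: float = 1.0,
) -> Tuple[Dict[str, float], float]:
    """Return updated weights and the difference (target minus prediction)."""
    difference = (reward + gamma * max_next_q) - linear_q(weights, features)
    new_weights = dict(weights)
    for name, value in features.items():
        # Only features that are active (non-zero) in (s,a) change their weight.
        new_weights[name] = new_weights.get(name, 0.0) + alpha * difference * value
    return new_weights, difference


weights = {"DOT": 4.0, "GST": -1.0}
features_north = {"DOT": 0.5, "GST": 1.0}  # f(s, NORTH) from the slide
# Assumed: max_next_q = 0 and alpha = 0.004 (neither is given on the slide).
new_weights, difference = approx_q_update(
    weights, features_north, reward=-500.0, max_next_q=0.0, alpha=0.004
)
print(difference, {k: round(v, 1) for k, v in new_weights.items()})
```

Under these assumptions the update reproduces the slide: the difference is -501 and the weights round to 3.0 and -3.0.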
Minimizing Error • Imagine we had only one point x with features f(x) • Approximate Q update explained: each weight moves in proportion to (“target” - “prediction”)
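The equations for this slide did not survive extraction. The standard least-squares derivation it refers to can be sketched as follows, where y (the observed value for the single point x) is a symbol introduced here for illustration and the weights and features follow the linear form used in the earlier slides:

```latex
\begin{align*}
\text{error}(w) &= \tfrac{1}{2}\Big(y - \sum_k w_k f_k(x)\Big)^2 \\
\frac{\partial\,\text{error}(w)}{\partial w_m} &= -\Big(y - \sum_k w_k f_k(x)\Big) f_m(x) \\
w_m &\leftarrow w_m + \alpha \Big(y - \sum_k w_k f_k(x)\Big) f_m(x) \\
w_m &\leftarrow w_m + \alpha \Big[\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{``target''}} - \underbrace{Q(s,a)}_{\text{``prediction''}}\Big] f_m(s,a)
\end{align*}
```

The third line is one gradient-descent step on the squared error; the last line is the same step with the Q-learning sample as the target and the current linear Q-value as the prediction.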
How many features should we use? • As many as possible? • computational burden • overfitting • Feature selection is important • requires domain expertise
Overview of Project 3 • MDPs • Q1: value iteration • Q2: find parameters that lead to a certain optimal policy • Q3: similar to Q2 • Q-learning • Q4: implement the Q-learning algorithm • Q5: implement ε-greedy action selection • Q6: try the algorithm • Approximate Q-learning and state abstraction • Q7: Pacman • Q8 (bonus): design and implement a state-abstraction Q-learning algorithm • Hints • make your implementation general • try to use class methods as much as possible