Model-Free vs. Model-Based RL: Q, SARSA, & E3
Administrivia

• Reminder:
  • Office hours tomorrow truncated to 9:00-10:15 AM
  • Can schedule other times if necessary
• Final projects:
  • Final presentations Dec 2, 7, 9
  • 20 min (max) presentations
  • 3 or 4 per day
  • Sign up for presentation slots today!
The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Action space A;
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q

Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s,a) = Q(s,a) + α*(r + γ*max_a'(Q(s',a')) - Q(s,a))
} Until (bored)
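For concreteness, here is a minimal runnable Python sketch of the same tabular Q-learning loop. The environment interface (env.reset() returning a state, env.step(a) returning (r, s_next, done)), the ε-greedy action picker, and all parameter defaults are illustrative assumptions of this sketch, not part of the slides.

import random
from collections import defaultdict

def q_learn(env, actions, episodes=500, gamma=0.95, alpha=0.1, epsilon=0.1):
    # Tabular Q-learning; env.reset() -> s and env.step(a) -> (r, s_next, done)
    # are assumed interfaces for this sketch.
    Q = defaultdict(float)                      # Q[(s, a)], defaults to 0

    def pick_next_action(s):
        # ε-greedy exploration over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):                   # "Repeat ... Until (bored)"
        s, done = env.reset(), False
        while not done:
            a = pick_next_action(s)
            r, s_next, done = env.step(a)
            # Off-policy backup: bootstrap from the best action at s_next
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q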
SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: State space S; Action space A;
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q

s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  Q(s,a) = Q(s,a) + α*(r + γ*Q(s',a') - Q(s,a))
  a = a'; s = s';
} Until (bored)
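The corresponding Python sketch for SARSA, reusing the same hypothetical env interface and ε-greedy picker as the Q-learning sketch above. Note that the next action a_next is chosen first and then used in the update.

import random
from collections import defaultdict

def sarsa_learn(env, actions, episodes=500, gamma=0.95, alpha=0.1, epsilon=0.1):
    # On-policy SARSA; same assumed env interface as the Q-learning sketch.
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = pick_next_action(s)
        while not done:
            r, s_next, done = env.step(a)
            a_next = pick_next_action(s_next)
            # On-policy backup: bootstrap from the action we will actually take
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q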
SARSA vs. Q

• SARSA and Q-learning are very similar
• SARSA updates Q(s,a) toward the value of the policy it is actually executing
  • Lets the pick_next_action() function choose the action used in the update
• Q-learning updates Q(s,a) toward the greedy policy w.r.t. the current Q
  • Uses max_a' to choose the action used in the update
  • This may differ from the action it actually executes at s'
• In practice: Q-learning will learn the "true" π*, but SARSA will learn the value of what it is actually doing
• Exploration can get Q-learning in trouble...
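The entire difference is in the bootstrap target. In the notation of the Python sketches above (Q dict, actions list, and variables r, gamma, s_next, a_next assumed to be in scope):

# Q-learning (off-policy): bootstrap from the greedy action at s_next
td_target_q = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

# SARSA (on-policy): bootstrap from the action actually chosen at s_next
td_target_sarsa = r + gamma * Q[(s_next, a_next)]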
Radioactive breadcrumbs

• Can now define eligibility traces for SARSA
• In addition to the Q(s,a) table, keep an e(s,a) table
  • Records the "eligibility" (a real number) of each state/action pair
• At every step (each (s,a,r,s',a') tuple):
  • Increment e(s,a) for the current (s,a) pair by 1
  • Update all Q(s'',a'') values in proportion to their e(s'',a'')
  • Decay all e(s'',a'') by a factor of λγ
• Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL
SARSA(λ)-learning algorithm

Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q

e(s,a) = 0                  // for all s, a
s = get_current_world_state(); a = pick_next_action(Q, s);
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  δ = r + γ*Q(s',a') - Q(s,a)
  e(s,a) += 1
  foreach (s'',a'') pair in (S X A) {
    Q(s'',a'') = Q(s'',a'') + α*e(s'',a'')*δ
    e(s'',a'') *= λγ
  }
  a = a'; s = s';
} Until (bored)
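A Python sketch of the same SARSA(λ) loop with accumulating traces, again assuming the hypothetical env interface from the earlier sketches. One small deviation for efficiency: it only touches (s,a) pairs with nonzero eligibility rather than literally sweeping all of S×A, and it resets the trace at the start of each episode.

import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, episodes=500,
                       gamma=0.95, alpha=0.1, lam=0.9, epsilon=0.1):
    # SARSA(λ) with accumulating eligibility traces.
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        e = defaultdict(float)              # eligibility trace, reset per episode
        s, done = env.reset(), False
        a = pick_next_action(s)
        while not done:
            r, s_next, done = env.step(a)
            a_next = pick_next_action(s_next)
            delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
            e[(s, a)] += 1                  # drop a breadcrumb on the current pair
            for sa in list(e):              # only pairs with nonzero eligibility
                Q[sa] += alpha * e[sa] * delta
                e[sa] *= lam * gamma        # decay the trail
            s, a = s_next, a_next
    return Q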
The trail of crumbs

[Figures: the trail of eligibility left along the agent's trajectory for different values of λ, including the λ=0 case. Sutton & Barto, Sec 7.5]
Eligibility for a single state

[Figure: e(si,aj) over time for a single state/action pair, jumping up at the 1st visit, 2nd visit, ..., and decaying between visits. Sutton & Barto, Sec 7.5]
Eligibility trace followup

• The eligibility trace allows:
  • Tracking where the agent has been
  • Backup of rewards over longer periods
  • Credit assignment: state/action pairs are rewarded for having contributed to getting to the reward
• Why does it work?
The “forward view” of eligibility

• Original SARSA did a “one step” backup:
  • Info backed up into Q(s,a): rt and Q(st+1,at+1), i.e. target = rt + γ*Q(st+1,at+1)
  • The rest of the trajectory is summarized by the Q(st+1,at+1) estimate
The “forward view” of eligibility

• Original SARSA did a “one step” backup
• Could also do a “two step” backup:
  • Info backed up: rt, rt+1, and Q(st+2,at+2), i.e. target = rt + γ*rt+1 + γ^2*Q(st+2,at+2)
The “forward view” of eligibility

• Original SARSA did a “one step” backup
• Could also do a “two step” backup
• Or even an “n step” backup:
  • target = rt + γ*rt+1 + ... + γ^(n-1)*rt+n-1 + γ^n*Q(st+n,at+n)
The “forward view” of eligibility

• Small-step backups (n=1, n=2, etc.) are slow and nearsighted
• Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects
• Want a way to combine them
• Can take a weighted average of different backups
  • E.g.:
The “forward view” of eligibility

[Figure: a compound backup that averages two backups of different lengths, with weights 1/3 and 2/3]
The “forward view” of eligibility

• How do you know how many steps to average over? And what should the weights be?
• Accumulating eligibility traces are just a clever way to easily average over all n:
The “forward view” of eligibility

• Weight the n-step backup by (1-λ)*λ^(n-1), i.e. weights proportional to λ^0, λ^1, λ^2, ..., λ^(n-1), ...
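Written out as formulas (a reconstruction following Sutton & Barto's forward-view definitions, using this lecture's indexing in which rt is the reward received after the action taken at time t):

\[ R_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n Q(s_{t+n}, a_{t+n}) \]
\[ R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)} \]

Setting λ=0 recovers the one-step SARSA backup, while λ approaching 1 approaches a full Monte Carlo return.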
Replacing traces

• The kind just described are accumulating eligibility traces
  • Every time you return to a state/action pair, you add extra eligibility
• There are also replacing eligibility traces
  • Every time you return to a state/action pair, reset e(s,a) to 1
• Works better sometimes

Sutton & Barto, Sec 7.8
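In the SARSA(λ) sketch above, switching between the two trace styles is a one-line change (variable names as in that sketch):

# Accumulating trace: eligibility piles up on repeat visits
e[(s, a)] += 1

# Replacing trace: eligibility is reset (capped) to 1 on each visit
e[(s, a)] = 1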
What do you know?

• Both Q-learning and SARSA(λ) are model-free methods
  • A.k.a. value-based methods
• They learn a Q function
  • Never learn T or R explicitly
  • At the end of learning, the agent knows how to act, but doesn’t explicitly know anything about the environment
• Also, no guarantees about the explore/exploit tradeoff
• Sometimes, we want one or both of the above
Model-based methods

• Model-based methods, on the other hand, do explicitly learn T & R
• At the end of learning, they have the entire model M = 〈S,A,T,R〉
  • And also have π*
• At least one model-based method also guarantees explore/exploit tradeoff properties
E3

• The Efficient Explore & Exploit algorithm
  • Kearns & Singh, Machine Learning 49, 2002
• Explicitly keeps a T matrix and an R table
• Plans (policy iteration) with the current T & R to get the current π
• Every state/action entry in T and R:
  • Can be marked known or unknown
  • Has a visit counter, nv(s,a)
• After every 〈s,a,r,s’〉 tuple, update T & R (running average)
• When nv(s,a) > NVthresh, mark the cell as known & re-plan
• When all states are known, learning is done and we have π*
The E3 algorithm

Algorithm: E3_learn_sketch          // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, Rmax, Varmax
Outputs: T, R, π*

Initialization:
  R(s,a) = Rmax                     // for all s, a (optimistic initialization)
  T(s,a,s') = 1/|S|                 // for all s, a, s'
  known(s,a) = 0; nv(s,a) = 0;      // for all s, a
  π = policy_iter(S,A,T,R)
The E3 algorithm

Algorithm: E3_learn_sketch          // continued
Repeat {
  s = get_current_world_state()
  a = π(s)
  (r,s') = act_in_world(a)
  // Running-average model updates (R update included, per the previous slide);
  // entries T(s,a,s'') for s'' ≠ s' should also be scaled by nv(s,a)/(nv(s,a)+1)
  // so that T(s,a,·) remains a probability distribution
  T(s,a,s') = (1 + T(s,a,s')*nv(s,a)) / (nv(s,a)+1)
  R(s,a) = (r + R(s,a)*nv(s,a)) / (nv(s,a)+1)
  nv(s,a)++;
  if (nv(s,a) > NVthresh) {
    known(s,a) = 1;
    π = policy_iter(S,A,T,R)
  }
} Until (all (s,a) known)
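A Python rendering of this loop's bookkeeping, as a sketch only: the full E3 algorithm of Kearns & Singh also reasons explicitly about when to explore vs. exploit among the known states, which this sketch omits. The env interface (reset()/step()), the value-iteration planner standing in for policy_iter, and all parameter defaults are assumptions of this sketch.

from collections import defaultdict

def e3_learn_sketch(env, states, actions, nv_thresh=20, r_max=1.0, gamma=0.95):
    # Model-based learning in the spirit of the E3 sketch above.
    n_s = len(states)
    T = {(s, a): {s2: 1.0 / n_s for s2 in states} for s in states for a in actions}
    R = {(s, a): r_max for s in states for a in actions}    # optimistic reward init
    nv = defaultdict(int)
    known = set()

    def plan(iters=200):
        # Simple value iteration on the current model estimate; stands in for
        # the slides' policy_iter (either planner works for this sketch).
        V = {s: 0.0 for s in states}
        for _ in range(iters):
            V = {s: max(R[(s, a)] + gamma * sum(T[(s, a)][s2] * V[s2] for s2 in states)
                        for a in actions)
                 for s in states}
        return {s: max(actions, key=lambda a: R[(s, a)] + gamma *
                       sum(T[(s, a)][s2] * V[s2] for s2 in states))
                for s in states}

    pi = plan()
    s = env.reset()
    while len(known) < len(states) * len(actions):
        a = pi[s]
        r, s_next, _ = env.step(a)
        n = nv[(s, a)]
        # Running-average model updates; rescale the whole row so it stays a distribution
        for s2 in states:
            T[(s, a)][s2] *= n / (n + 1)
        T[(s, a)][s_next] += 1.0 / (n + 1)
        R[(s, a)] = (R[(s, a)] * n + r) / (n + 1)
        nv[(s, a)] += 1
        if nv[(s, a)] > nv_thresh and (s, a) not in known:
            known.add((s, a))
            pi = plan()                     # re-plan with the updated model
        s = s_next
    return T, R, pi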