
Model-Free vs. Model-Based RL: Q, SARSA, & E 3


  1. Model-Free vs. Model-Based RL: Q, SARSA, & E3

  2. Administrivia • Reminder: • Office hours tomorrow truncated • 9:00-10:15 AM • Can schedule other times if necessary • Final projects • Final presentations Dec 2, 7, 9 • 20 min (max) presentations • 3 or 4 per day • Sign up for presentation slots today!

  3. The Q-learning algorithm • Algorithm: Q_learn • Inputs: State space S; Act. space A • Discount γ (0<=γ<1); Learning rate α (0<=α<1) • Outputs: Q • Repeat { • s=get_current_world_state() • a=pick_next_action(Q,s) • (r,s’)=act_in_world(a) • Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a)) • } Until (bored)
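
A minimal Python sketch of this loop, assuming tabular S and A and hypothetical environment hooks get_current_world_state, pick_next_action (e.g. ε-greedy), and act_in_world matching the pseudocode above:

    import numpy as np

    def q_learn(num_states, num_actions, gamma=0.95, alpha=0.1, num_steps=10000):
        Q = np.zeros((num_states, num_actions))  # tabular Q, initialized to zero
        for _ in range(num_steps):               # "Repeat ... Until (bored)"
            s = get_current_world_state()        # hypothetical environment hook
            a = pick_next_action(Q, s)           # e.g. epsilon-greedy in Q[s, :]
            r, s_next = act_in_world(a)          # act, observe reward and next state
            # Off-policy target: bootstrap from the greedy action at s'
            target = r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
        return Q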

  4. SARSA-learning algorithm • Algorithm: SARSA_learn • Inputs: State space S; Act. space A • Discount γ (0<=γ<1); Learning rate α (0<=α<1) • Outputs: Q • s=get_current_world_state() • a=pick_next_action(Q,s) • Repeat { • (r,s’)=act_in_world(a) • a’=pick_next_action(Q,s’) • Q(s,a)=Q(s,a)+α*(r+γ*Q(s’,a’)-Q(s,a)) • a=a’; s=s’; • } Until (bored)
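
The same loop in its on-policy SARSA form, again as a sketch over the hypothetical hooks above; the only change is that the target bootstraps from the action a’ that will actually be executed:

    import numpy as np

    def sarsa_learn(num_states, num_actions, gamma=0.95, alpha=0.1, num_steps=10000):
        Q = np.zeros((num_states, num_actions))
        s = get_current_world_state()            # hypothetical environment hook
        a = pick_next_action(Q, s)
        for _ in range(num_steps):
            r, s_next = act_in_world(a)
            a_next = pick_next_action(Q, s_next) # the action we will actually take
            # On-policy target: bootstrap from the action chosen at s'
            target = r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
        return Q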

  5. SARSA vs. Q • SARSA and Q-learning are very similar • SARSA updates Q(s,a) for the policy it’s actually executing • Lets the pick_next_action() function pick the action to update • Q updates Q(s,a) for the greedy policy w.r.t. the current Q • Uses max_a’ to pick the action to update • Might be different from the action it actually executes at s’ • In practice: Q will learn the “true” π*, but SARSA will learn about what it’s actually doing • Exploration can get Q-learning into trouble...

  6. Radioactive breadcrumbs • Can now define eligibility traces for SARSA • In addition to Q(s,a) table, keep an e(s,a) table • Records “eligibility” (real number) for each state/action pair • At every step ((s,a,r,s’,a’) tuple): • Increment e(s,a) for current (s,a) pair by 1 • Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’) • Decay all e(s’’,a’’) by factor of λγ • Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL

  7. SARSA(λ)-learning alg. • Algorithm: SARSA(λ)_learn • Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1) • Outputs: Q • e(s,a)=0 // for all s, a • s=get_curr_world_st(); a=pick_nxt_act(Q,s); • Repeat { • (r,s’)=act_in_world(a) • a’=pick_next_action(Q,s’) • δ=r+γ*Q(s’,a’)-Q(s,a) • e(s,a)+=1 • foreach (s’’,a’’) pair in (SXA) { • Q(s’’,a’’)=Q(s’’,a’’)+α*e(s’’,a’’)*δ • e(s’’,a’’)*=λγ } • a=a’; s=s’; • } Until (bored)
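
A Python sketch of the same algorithm with an accumulating trace table, using the same hypothetical hooks; the full sweep over all (s’’,a’’) pairs is expressed as whole-array operations, which is what the foreach loop in the pseudocode does:

    import numpy as np

    def sarsa_lambda_learn(num_states, num_actions, gamma=0.95, alpha=0.1,
                           lam=0.9, num_steps=10000):
        Q = np.zeros((num_states, num_actions))
        e = np.zeros((num_states, num_actions))  # eligibility of every (s, a) pair
        s = get_current_world_state()            # hypothetical environment hook
        a = pick_next_action(Q, s)
        for _ in range(num_steps):
            r, s_next = act_in_world(a)
            a_next = pick_next_action(Q, s_next)
            delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # one-step TD error
            e[s, a] += 1.0                       # accumulate eligibility at (s, a)
            Q += alpha * delta * e               # credit all pairs by eligibility
            e *= lam * gamma                     # decay every trace
            s, a = s_next, a_next
        return Q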

  8. The trail of crumbs Sutton & Barto, Sec 7.5

  9. The trail of crumbs λ=0 Sutton & Barto, Sec 7.5

  10. The trail of crumbs Sutton & Barto, Sec 7.5

  11. Eligibility for a single state • [Figure: e(si,aj) plotted over time, jumping up at the 1st visit, 2nd visit, ... and decaying between visits; Sutton & Barto, Sec 7.5]

  12. Eligibility trace followup • Eligibility trace allows: • Tracking where the agent has been • Backup of rewards over longer periods • Credit assignment: state/action pairs rewarded for having contributed to getting to the reward • Why does it work?

  13. The “forward view” of elig. • Original SARSA did a “one step” backup: Q(s,a) is backed up toward rt+γ*Q(st+1,at+1) • [Figure: one-step backup diagram; rt is the info backed up, Q(st+1,at+1) summarizes the rest of the trajectory]

  14. The “forward view” of elig. • Original SARSA did a “one step” backup: Q(s,a) backed up toward rt+γ*Q(st+1,at+1) • Could also do a “two step backup”: Q(s,a) backed up toward rt+γ*rt+1+γ^2*Q(st+2,at+2) • [Figure: two-step backup diagram; rt and rt+1 are the info backed up, Q(st+2,at+2) summarizes the rest of the trajectory]

  15. The “forward view” of elig. • Original SARSA did a “one step” backup: • Could also do a “two step backup”: • Or even an “n step backup”: Q(s,a) backed up toward rt+γ*rt+1+...+γ^(n-1)*rt+n-1+γ^n*Q(st+n,at+n)
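
A small illustrative sketch of that n-step target in Python; the trajectory list of (state, action, reward) tuples, the time index t, and the tabular Q array are assumptions made for this example:

    def n_step_target(trajectory, t, n, Q, gamma=0.95):
        # trajectory: list of (state, action, reward) tuples, assumed long enough
        target = 0.0
        for k in range(n):                       # n discounted rewards: rt, rt+1, ...
            _, _, r = trajectory[t + k]
            target += (gamma ** k) * r
        s_n, a_n, _ = trajectory[t + n]          # bootstrap from Q at step t+n
        return target + (gamma ** n) * Q[s_n, a_n]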

  16. The “forward view” of elig. • Small-step backups (n=1, n=2, etc.) are slow and nearsighted • Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects • Want a way to combine them • Can take a weighted average of different backups • E.g.:

  17. The “forward view” of elig. • [Figure: example compound backup mixing two n-step backups with weights 1/3 and 2/3]

  18. The “forward view” of elig. • How do you know which number of steps to avg over? And what the weights should be? • Accumulating eligibility traces are just a clever way to easily avg. over all n:

  19. The “forward view” of elig. • The n-step backups are weighted in proportion to λ^0, λ^1, λ^2, ..., λ^(n-1) (each scaled by 1-λ so the weights sum to 1)
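
Written out, that weighting is just a weighted average of the n-step targets; below is a hedged sketch that truncates the sum at max_n and reuses the n_step_target helper sketched above:

    def lambda_return(trajectory, t, Q, lam=0.9, gamma=0.95, max_n=50):
        # Truncated forward-view lambda-return: average of n-step targets
        # with weights (1 - lam) * lam**(n - 1)
        total, weight_sum = 0.0, 0.0
        for n in range(1, max_n + 1):
            w = (1.0 - lam) * lam ** (n - 1)     # weight on the n-step backup
            total += w * n_step_target(trajectory, t, n, Q, gamma)
            weight_sum += w
        return total / weight_sum                # renormalize after truncation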

  20. Replacing traces • The kind just described are accumulating e-traces • Every time you go back to a state/action, extra eligibility is added • There are also replacing eligibility traces • Every time you go back to a state/action, reset e(s,a) to 1 • Works better sometimes (Sutton & Barto, Sec 7.8)
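
The two styles differ only in how the visited pair’s trace is bumped; a minimal sketch (e is the eligibility table from the SARSA(λ) sketch above, and the replacing flag is an assumption of this example):

    def bump_trace(e, s, a, replacing=False):
        # Accumulating traces pile up on repeated visits;
        # replacing traces are reset to 1 on every visit instead.
        e[s, a] = 1.0 if replacing else e[s, a] + 1.0
        return e
    # Either way, every trace then decays by lam * gamma at each step.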

  21. Model-free vs. Model-based

  22. What do you know? • Both Q-learning and SARSA(λ) are model free methods • A.k.a., value-based methods • Learn a Q function • Never learn T or R explicitly • At the end of learning, agent knows how to act, but doesn’t explicitly know anything about the environment • Also, no guarantees about explore/exploit tradeoff • Sometimes, want one or both of the above

  23. Model-based methods • Model based methods, OTOH, do explicitly learn T & R • At end of learning, have entire M=〈S,A,T,R〉 • Also have π* • At least one model-based method also guarantees explore/exploit tradeoff properties

  24. E3 • Explicit Explore or Exploit algorithm • Kearns & Singh, Machine Learning 49, 2002 • Explicitly keeps a T matrix and an R table • Plan (policy iter) w/ curr. T & R -> curr. π • Every state/action entry in T and R: • Can be marked known or unknown • Has a #visits counter, nv(s,a) • After every 〈s,a,r,s’〉 tuple, update T & R (running average) • When nv(s,a)>NVthresh, mark cell as known & re-plan • When all states are known, done learning & have π*

  25. The E3 algorithm • Algorithm: E3_learn_sketch // only an overview • Inputs: S, A, γ (0<=γ<1), NVthresh, Rmax, Varmax • Outputs: T, R, π* • Initialization: • R(s)=Rmax // for all s • T(s,a,s’)=1/|S| // for all s,a,s’ • known(s,a)=0; nv(s,a)=0; // for all s, a • π=policy_iter(S,A,T,R)

  26. The E3 algorithm • Algorithm: E3_learn_sketch // con’t • Repeat { • s=get_current_world_state() • a=π(s) • (r,s’)=act_in_world(a) • T(s,a,s’)=(1+T(s,a,s’)*nv(s,a))/(nv(s,a)+1) • nv(s,a)++; • if (nv(s,a)>NVthresh) { • known(s,a)=1; • π=policy_iter(S,A,T,R) • } • } Until (all (s,a) known)
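
A rough Python sketch of this loop, reusing the hypothetical environment hooks from the Q-learning sketch; policy_iter is a stand-in planner, the reward-averaging line is not shown on the slide and is an assumption here, and the transition update renormalizes the whole row so T(s,a,·) stays a proper distribution:

    import numpy as np

    def e3_learn_sketch(num_states, num_actions, gamma=0.95,
                        nv_thresh=100, r_max=1.0):
        # Optimistic initialization: unknown states look maximally rewarding
        T = np.full((num_states, num_actions, num_states), 1.0 / num_states)
        R = np.full(num_states, r_max)
        nv = np.zeros((num_states, num_actions), dtype=int)
        known = np.zeros((num_states, num_actions), dtype=bool)
        pi = policy_iter(T, R, gamma)            # hypothetical planner stub
        while not known.all():
            s = get_current_world_state()        # hypothetical environment hook
            a = pi[s]
            r, s_next = act_in_world(a)
            # Running average of observed transitions out of (s, a)
            T[s, a] *= nv[s, a] / (nv[s, a] + 1.0)
            T[s, a, s_next] += 1.0 / (nv[s, a] + 1.0)
            # Assumed running-average reward estimate (not shown on the slide)
            R[s] = (r + R[s] * nv[s, a]) / (nv[s, a] + 1.0)
            nv[s, a] += 1
            if nv[s, a] > nv_thresh and not known[s, a]:
                known[s, a] = True               # (s, a) is now "known"; re-plan
                pi = policy_iter(T, R, gamma)
        return T, R, pi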
