Administrivia • Reminder: Q3 Nov 10 • CS outreach: • UNM SOE holding open house for HS seniors • Want CS dept participation • We want to show off the coolest things in CS • Come demo your P1 and P2 code! • Contact me or Lynne Jacobson
The bird of time... • Last time: • Eligibility traces • The SARSA(λ) algorithm • Design exercise • This time: • Tip o’ the day • Notes on exploration • Design exercise, cont’d.
Tip o’ the day • Micro-experiments • Often, often, often when hacking: • “How the heck does that function work?” • “The docs don’t say what happens when you hand null to the constructor...” • “Uhhh... Will this work if I do it this way?” • “WTF does that mean?” • Could spend a bunch of time in the docs • Or... • Could just go and try it
Tip o’ the day • Answer: micro-experiments • Write a very small (<50 line) test program to make sure you understand what the thing does • Think: homework assignment from CS152 • Quick to write • Answers question better than docs can • Builds your intuition about what the machine is doing • Using the debugger to watch is also good
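For example, a micro-experiment answering "what does a Map return for a key that was never inserted vs. a key explicitly mapped to null?" could be as small as the sketch below (the class name and the question are illustrative, not part of the course code):

// MicroExperiment.java -- a hypothetical ~20-line probe, not project code.
// Question: does HashMap.get() distinguish "key absent" from "key mapped to null"?
import java.util.HashMap;
import java.util.Map;

public class MicroExperiment {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("present-but-null", null);

        // Both of these print "null" -- get() alone can't tell the cases apart...
        System.out.println(m.get("present-but-null"));
        System.out.println(m.get("never-inserted"));

        // ...but containsKey() can.
        System.out.println(m.containsKey("present-but-null")); // true
        System.out.println(m.containsKey("never-inserted"));   // false
    }
}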
Q learning in code...

public class MyAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();
    // Off-policy: bootstrap from the greedy action at the next state
    Action nextAct = _policy.argmaxAct(end);
    double Qnow  = _policy.get(start, act);
    double Qnext = _policy.get(end, nextAct);
    // One-step Q-learning backup
    double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);
    _policy.set(start, act, Qrevised);
  }
}
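For reference, the Qrevised line is the standard one-step Q-learning backup; written in the usual notation (my transcription, with α and γ being the values returned by getAlpha() and getGamma()):

\[
Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr]
\]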
The SARSA(λ) code

public class SARSAlAgent implements Agent {
  public void updateModel(SARSTuple s) {
    State2d start = s.getInitState();
    State2d end   = s.getNextState();
    Action act    = s.getAction();
    double r      = s.getReward();
    // On-policy: bootstrap from the action the agent will actually take
    Action nextAct = pickAction(end);
    double Qnow  = _policy.get(start, act);
    double Qnext = _policy.get(end, nextAct);
    double delta = r + _gamma * Qnext - Qnow;
    // Bump the eligibility of the pair we just visited
    setElig(start, act, getElig(start, act) + 1.0);
    // Push the TD error to every eligible pair, then decay its trace
    for (SAPair p : getEligiblePairs()) {
      double currQ = _policy.get(p.getS(), p.getA());
      _policy.set(p.getS(), p.getA(),
                  currQ + getElig(p.getS(), p.getA()) * _alpha * delta);
      setElig(p.getS(), p.getA(),
              getElig(p.getS(), p.getA()) * _gamma * _lambda);
    }
  }
}
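For reference, the loop above is the usual SARSA(λ) backup with accumulating traces; in standard notation (my transcription):

\[
\delta = r + \gamma\, Q(s',a') - Q(s,a), \qquad e(s,a) \leftarrow e(s,a) + 1
\]
\[
\text{for all } (s,a):\quad Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta\, e(s,a), \qquad e(s,a) \leftarrow \gamma\lambda\, e(s,a)
\]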
Q & SARSA(λ): Key diffs • Use of eligibility traces • Q updates only the single most recent state/action pair • SARSA(λ) keeps a record of recently visited state/action pairs: e(s,a) • Updates each Q(s,a) value in proportion to e(s,a) • Decays e(s,a) by γλ each step
Q & SARSA(λ): Key diffs • How the "next state" action is picked • Q: nextAct=_policy.argmaxAct(end) • Picks the greedy (highest-Q) action at the next state • SARSA: nextAct=RLAgent.pickAction(end) • Picks the action the agent will actually take at the next state, exploration included • Huh? What's the difference?
Exploration vs. exploitation • Sometimes the agent wants to do something other than the "best currently known action" • Why? • If the agent never tries anything new, it may never discover that there's a better answer out there... • Called the "exploration vs. exploitation" tradeoff • Is it better to "explore" to find new stuff, or to "exploit" what you already know?
ε-Greedy exploration • Answer: • “Most of the time” do the best known thing • act=argmaxa(Q(s,a)) • “Rarely” try something random • act=pickAtRandom(allActionSet) • ε-greedy exploration policies: • “rarely”==prob ε • “most of the time”==prob 1-ε
ε-Greedy in code

import java.util.Random;
import java.util.Set;

public class eGreedyAgent implements RLAgent {
  // Implements the ε-greedy exploration policy
  public Action pickAction(State2d s) {
    final double rVal = _rand.nextDouble();
    if (rVal < _epsilon) {
      // With probability ε: explore -- pick an action uniformly at random
      return randPick(_ASet);
    }
    // With probability 1-ε: exploit -- pick the greedy action
    return _policy.argmaxAct(s);
  }

  private final Set<Action> _ASet;
  private final double _epsilon;
  private final Random _rand = new Random();
  // (plus a _policy field of whatever type provides argmaxAct -- not shown on the slide)
}
Design exercise • For M4/Rollout, need to be able to: • Train agent for many trials/steps per trial • Generate learning curves for agent’s learning • Run some trials w/ learning turned on • Freeze learning • Run some trials w/ learning turned off • Average steps-to-goal over those trials • Save average as one point in curve • Design: objects/methods to support this learning framework • Support: diff learning algs, diff environments, diff params, variable # of trials/steps, etc.
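One possible shape for this framework, as a rough sketch only: the Environment, TrainableAgent, and ExperimentRunner names below are hypothetical (not a required API), and State2d, Action, SARSTuple, and Agent are the types already used on the earlier slides.

import java.util.ArrayList;
import java.util.List;

interface Environment {
    State2d reset();              // start a new trial
    SARSTuple step(Action a);     // apply an action, return the (s,a,r,s') it produced
    boolean atGoal();
}

interface TrainableAgent extends Agent {
    Action pickAction(State2d s);
    void setLearning(boolean on); // freeze/unfreeze learning
}

class ExperimentRunner {
    private final Environment env;
    private final TrainableAgent agent;
    private final int maxStepsPerTrial;

    ExperimentRunner(Environment env, TrainableAgent agent, int maxStepsPerTrial) {
        this.env = env;
        this.agent = agent;
        this.maxStepsPerTrial = maxStepsPerTrial;
    }

    // Run one trial; returns steps taken to reach the goal (or the step cap).
    int runTrial(boolean learning) {
        agent.setLearning(learning);
        State2d s = env.reset();
        int steps = 0;
        while (!env.atGoal() && steps < maxStepsPerTrial) {
            SARSTuple t = env.step(agent.pickAction(s));
            if (learning) agent.updateModel(t);
            s = t.getNextState();
            steps++;
        }
        return steps;
    }

    // Alternate blocks of learning trials with frozen evaluation trials;
    // each evaluation block contributes one averaged point to the curve.
    List<Double> learningCurve(int blocks, int learnTrials, int evalTrials) {
        List<Double> curve = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            for (int i = 0; i < learnTrials; i++) runTrial(true);
            double total = 0;
            for (int i = 0; i < evalTrials; i++) total += runTrial(false);
            curve.add(total / evalTrials);
        }
        return curve;
    }
}

The design point is that a single runTrial(learning) loop serves both phases: learning blocks call it with learning on, evaluation blocks call it with learning frozen, and each evaluation block's average steps-to-goal becomes one point on the learning curve. Different learning algorithms, environments, and parameters plug in through the Agent/Environment interfaces without changing the runner.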