Reinforcement Learning for Soaring
CDMRG – 24 May 2010
Nick Lawrance
Reinforcement Learning for Soaring
• What I want to do
• Have a good understanding of the dynamics involved in aerodynamic soaring in known conditions, but:
• Dynamic soaring requires energy-loss actions to achieve net energy-gain cycles, which is difficult with traditional control or path-generation methods
• Wind is difficult to predict; guidance and navigation must be done on-line while simultaneously maintaining reasonable energy levels and meeting safety requirements
• Classic exploration-exploitation problem, with the added catch that exploration requires energy gained through exploitation
Reinforcement Learning for Soaring
• Why reinforcement learning?
• Previous work focused on understanding soaring and examining alternatives for generating energy-gain paths
• Always have the issue of balancing exploration and exploitation; my code ended up as long sequences of heuristic rules
• Reinforcement learning could provide the link from known good paths towards optimal paths
Monte Carlo, TD, Sarsa & Q-learning
• Monte Carlo – Learn an average reward for actions taken over a series of episodes
• Temporal Difference (TD) – Simultaneously estimate the expected reward and the value function
• Sarsa – TD for on-policy control
• Q-learning – TD for off-policy control
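The practical difference between the last two is the bootstrap target. A minimal tabular sketch of the two update rules (the array layout, step size and discount factor below are illustrative assumptions, not values from the slides):

```python
import numpy as np

# Tabular action values Q[s, a]; alpha = step size, gamma = discount factor.

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.95):
    # On-policy: bootstrap on the action a2 actually chosen by the behaviour policy.
    td_error = r + gamma * Q[s2, a2] - Q[s, a]
    Q[s, a] += alpha * td_error

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.95):
    # Off-policy: bootstrap on the greedy action in s2, regardless of what
    # the behaviour policy does next.
    td_error = r + gamma * np.max(Q[s2]) - Q[s, a]
    Q[s, a] += alpha * td_error
```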
Figure 6.13: The cliff-walking task. Off-policy Q-learning learns the optimal policy, along the edge of the cliff, but then keeps falling off because of the ε-greedy action selection. On-policy Sarsa learns a safer policy that takes the action selection method into account. These data are from a single run, but smoothed.
Eligibility Traces
• TD(0) is effectively a one-step backup of Vπ (reward only counts towards the previous action)
• Eligibility traces extend this to reward the whole sequence of actions that led to the current reward
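In the standard accumulating-trace form (notation follows the usual TD(λ) presentation, not anything specific to these slides), each state's trace decays by γλ per step and the TD error is applied to every state in proportion to its trace:

```latex
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
e_t(s)   = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t]
V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s) \qquad \text{for all } s
```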
Sarsa(λ)
• Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
• Repeat (for each episode):
 • Initialize s, a
 • Repeat (for each step of episode):
  • Take action a, observe r, s'
  • Choose a' from s' using policy derived from Q (ε-greedy)
  • δ ← r + γQ(s',a') − Q(s,a)
  • e(s,a) ← e(s,a) + 1
  • For all s, a:
   • Q(s,a) ← Q(s,a) + αδe(s,a)
   • e(s,a) ← γλe(s,a)
  • s ← s'; a ← a'
 • until s is terminal
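A compact tabular version of the same algorithm, as a sketch only: the env.reset()/env.step() interface, episode count and hyperparameter values are assumptions made for illustration, not anything specified on the slides.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    # With probability eps take a random action, otherwise a greedy one.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.95, lam=0.9, eps=0.1, seed=0):
    # Tabular Sarsa(lambda) with accumulating eligibility traces.
    # Assumes env.reset() -> s and env.step(a) -> (s', r, done) with
    # integer state and action indices (a hypothetical interface).
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)              # traces reset at each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, eps, rng)
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            e[s, a] += 1.0                # accumulating trace
            Q += alpha * delta * e        # update all (s, a) pairs
            e *= gamma * lam              # decay all traces
            s, a = s2, a2
    return Q
```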
Simplest soaring attempt
• Square grid, simple motion, energy sinks and sources
• Movement cost, turn cost, edge cost
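For illustration only, a reward structure of that kind could be encoded along the following lines; the grid size, cost values and source/sink locations are hypothetical placeholders, not the ones used in the actual experiment.

```python
import numpy as np

GRID_SIZE = 10  # hypothetical grid dimensions
energy_field = np.full((GRID_SIZE, GRID_SIZE), -1.0)  # base movement cost per step
energy_field[3, 4] = 5.0    # example energy source (thermal-like cell)
energy_field[7, 2] = -5.0   # example energy sink
TURN_COST = 0.5
EDGE_COST = 10.0

def reward(new_pos, heading, new_heading):
    # Leaving the grid incurs the edge penalty.
    if not (0 <= new_pos[0] < GRID_SIZE and 0 <= new_pos[1] < GRID_SIZE):
        return -EDGE_COST
    # Energy gained or lost in the cell entered, minus any turn penalty.
    r = energy_field[new_pos]
    if new_heading != heading:
        r -= TURN_COST
    return r
```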
Hex grid, dynamic soaring
• Energy-based simulation
• Drag movement cost, turn cost
• Constant speed
• No wind motion (due to limited states)
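A sketch of the kind of energy bookkeeping this implies, using specific total energy per unit mass; the drag and turn cost values below are placeholders, and only the constant-speed assumption comes from the slide.

```python
G = 9.81  # gravitational acceleration, m/s^2

def specific_energy(height, speed):
    # Specific total energy per unit mass: potential plus kinetic.
    return G * height + 0.5 * speed ** 2

def step_reward(height, new_height, speed, turned,
                drag_cost=0.2, turn_cost=0.1):
    # With speed held constant, the energy change reduces to the change in
    # potential energy; drag and turning subtract fixed (placeholder) costs.
    dE = specific_energy(new_height, speed) - specific_energy(height, speed)
    return dE - drag_cost - (turn_cost if turned else 0.0)
```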
Next
• Reinforcement learning has advantages to offer our group, but our contribution should probably be focused on well-defined areas
• For most of our problems, the state spaces are very large and usually continuous; we need estimation methods
• We usually have a good understanding of at least some aspects of the problem; how can/should we use this information to give better solutions?