Reinforcement Learning for Soaring
CDMRG – 24 May 2010
Nick Lawrance
Reinforcement Learning for Soaring
• What I want to do
• Have a good understanding of the dynamics involved in aerodynamic soaring in known conditions, but:
• Dynamic soaring requires energy-loss actions to achieve net energy-gain cycles, which is difficult with traditional control or path-generation methods
• Wind is difficult to predict; guidance and navigation must be done on-line while simultaneously maintaining reasonable energy levels and meeting safety requirements
• Classic exploration-exploitation problem, with the added catch that exploration requires energy gained through exploitation
Reinforcement Learning for Soaring
• Why reinforcement learning?
• Previous work focused on understanding soaring and examining alternatives for generating energy-gain paths
• Always have the issue of balancing exploration and exploitation; my code ended up as long sequences of heuristic rules
• Reinforcement learning could provide the link from known good paths towards optimal paths
Monte Carlo, TD, Sarsa & Q-learning
• Monte Carlo – Learn an average reward for actions taken over a series of episodes
• Temporal Difference (TD) – Simultaneously estimate the expected reward and the value function
• Sarsa – TD for on-policy control
• Q-learning – TD for off-policy control
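The practical difference between the last two is the bootstrap target. A minimal tabular sketch of the two update rules (the array layout, step size and discount factor below are illustrative assumptions, not values from the slides):

```python
import numpy as np

# Tabular action values Q[s, a]; alpha = step size, gamma = discount factor.

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.95):
    # On-policy: bootstrap on the action a2 actually chosen by the behaviour policy.
    td_error = r + gamma * Q[s2, a2] - Q[s, a]
    Q[s, a] += alpha * td_error

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.95):
    # Off-policy: bootstrap on the greedy action in s2, regardless of what
    # the behaviour policy does next.
    td_error = r + gamma * np.max(Q[s2]) - Q[s, a]
    Q[s, a] += alpha * td_error
```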
Figure 6.13: The cliff-walking task. Off-policy Q-learning learns the optimal policy, along the edge of the cliff, but then keeps falling off because of the ε-greedy action selection. On-policy Sarsa learns a safer policy that takes the action selection method into account. These data are from a single run, but smoothed.
Eligibility Traces
• TD(0) is effectively a one-step backup of Vπ (reward only counts towards the previous action)
• Eligibility traces extend this to reward the whole sequence of actions that led to the current reward
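In the standard accumulating-trace form (notation follows the usual TD(λ) presentation, not anything specific to these slides), each state's trace decays by γλ per step and the TD error is applied to every state in proportion to its trace:

```latex
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
e_t(s)   = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t]
V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s) \qquad \text{for all } s
```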
Sarsa(λ)
• Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
• Repeat (for each episode):
 • Initialize s, a
 • Repeat (for each step of episode):
  • Take action a, observe r, s'
  • Choose a' from s' using policy derived from Q (ε-greedy)
  • δ ← r + γQ(s',a') − Q(s,a)
  • e(s,a) ← e(s,a) + 1
  • For all s, a:
   • Q(s,a) ← Q(s,a) + αδe(s,a)
   • e(s,a) ← γλe(s,a)
  • s ← s'; a ← a'
 • until s is terminal
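A compact tabular version of the same algorithm, as a sketch only: the env.reset()/env.step() interface, episode count and hyperparameter values are assumptions made for illustration, not anything specified on the slides.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    # With probability eps take a random action, otherwise a greedy one.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.95, lam=0.9, eps=0.1, seed=0):
    # Tabular Sarsa(lambda) with accumulating eligibility traces.
    # Assumes env.reset() -> s and env.step(a) -> (s', r, done) with
    # integer state and action indices (a hypothetical interface).
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)              # traces reset at each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, eps, rng)
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            e[s, a] += 1.0                # accumulating trace
            Q += alpha * delta * e        # update all (s, a) pairs
            e *= gamma * lam              # decay all traces
            s, a = s2, a2
    return Q
```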
Simplest soaring attempt
• Square grid, simple motion, energy sinks and sources
• Movement cost, turn cost, edge cost
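For illustration only, a reward structure of that kind could be encoded along the following lines; the grid size, cost values and source/sink locations are hypothetical placeholders, not the ones used in the actual experiment.

```python
import numpy as np

GRID_SIZE = 10  # hypothetical grid dimensions
energy_field = np.full((GRID_SIZE, GRID_SIZE), -1.0)  # base movement cost per step
energy_field[3, 4] = 5.0    # example energy source (thermal-like cell)
energy_field[7, 2] = -5.0   # example energy sink
TURN_COST = 0.5
EDGE_COST = 10.0

def reward(new_pos, heading, new_heading):
    # Leaving the grid incurs the edge penalty.
    if not (0 <= new_pos[0] < GRID_SIZE and 0 <= new_pos[1] < GRID_SIZE):
        return -EDGE_COST
    # Energy gained or lost in the cell entered, minus any turn penalty.
    r = energy_field[new_pos]
    if new_heading != heading:
        r -= TURN_COST
    return r
```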
Hex grid, dynamic soaring
• Energy-based simulation
• Drag movement cost, turn cost
• Constant speed
• No wind motion (due to limited states)
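A sketch of the kind of energy bookkeeping this implies, using specific total energy per unit mass; the drag and turn cost values below are placeholders, and only the constant-speed assumption comes from the slide.

```python
G = 9.81  # gravitational acceleration, m/s^2

def specific_energy(height, speed):
    # Specific total energy per unit mass: potential plus kinetic.
    return G * height + 0.5 * speed ** 2

def step_reward(height, new_height, speed, turned,
                drag_cost=0.2, turn_cost=0.1):
    # With speed held constant, the energy change reduces to the change in
    # potential energy; drag and turning subtract fixed (placeholder) costs.
    dE = specific_energy(new_height, speed) - specific_energy(height, speed)
    return dE - drag_cost - (turn_cost if turned else 0.0)
```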
Next
• Reinforcement learning has advantages to offer our group, but our contribution should probably be focused on well-defined areas
• For most of our problems, the state spaces are very large and usually continuous; we need estimation methods
• We usually have a good understanding of at least some aspects of the problem; how can/should we use this information to give better solutions?