Event-Learning with a Non-Markovian Controller
István Szita, Bálint Takács & András Lőrincz
Eötvös Loránd University, Hungary
Acknowledgements
• thanks to ECCAI for the travel grant
• work partially supported by
  • European Office for Aerospace Research and Development
  • Air Force Office of Scientific Research
  • Hungarian National Science Foundation
• thanks to Csaba Szepesvári for the helpful comments
Introduction: reinforcement learning
(diagram: the agent–environment loop; the agent observes the state, chooses an action, and receives a reward from the environment; goal: maximize reward)
Introduction: Markov decision processes
• fully observable, Markovian
• state and action space: S, A
• transition probabilities: P(s,a,s’)
• reward function: R(s,a,s’)
• policy: π(s,a)
• value function: V(s) := E(∑ₜ γᵗ rₜ | s₀ = s)
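To make these definitions concrete, here is a minimal sketch of policy evaluation on a random tabular toy MDP; the toy instance, the einsum-based backup, and all names are illustrative assumptions, not material from the talk:

```python
import numpy as np

# Hypothetical toy MDP: P[s,a,s'] are transition probabilities,
# R[s,a,s'] the rewards, gamma the discount factor.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, n_states))

def policy_evaluation(pi, tol=1e-8):
    """Compute V(s) = E(sum_t gamma^t r_t | s_0 = s) for a fixed policy pi[s,a]."""
    V = np.zeros(n_states)
    while True:
        # Bellman backup: V(s) = sum_{a,s'} pi(s,a) P(s,a,s') (R(s,a,s') + gamma V(s'))
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
print(policy_evaluation(uniform_policy))
```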
Introduction: Solving MDPs
• optimal value function: V*(s)
• action-value function: Q(s,a), Q*(s,a)
• policy derived from Q
• solution methods: iteration, iterated averaging, sampling (e.g. DP, Q-learning, SARSA); a sketch of Q-learning follows below
• most of them provably converge
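As an example of the sampling-style methods above, a minimal sketch of tabular Q-learning; the Gym-like env interface (reset() → s, step(a) → (s’, r, done)) is an assumed convention, not something specified in the talk:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # move Q(s,a) toward the bootstrap target r + gamma * max_a' Q(s',a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```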
Event-learning
• basic idea: learn the values of events, E(s,s’)
• an event is an (s,s’) transition
• policy: πE(s,s’)
• expected advantages:
  • captures the subgoal concept
  • higher-level decisions
  • good performance
• needs a controller (to realize the chosen events)
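To make the event-value idea concrete, a simplified sketch of a single backup of E(s,s’). This SARSA-like update, which backs up from the state actually reached, is an illustrative assumption, not the paper’s exact formulation:

```python
import numpy as np

def event_backup(E, s, s_desired, s_reached, r, alpha=0.1, gamma=0.9):
    """One illustrative update of the event-value table E[s, s'].

    E rates transitions (s -> s') directly, with no action argument:
    realizing the chosen transition is delegated to the controller,
    so s_reached may differ from s_desired.
    """
    target = r + gamma * np.max(E[s_reached])  # value of the best event from s_reached
    E[s, s_desired] += alpha * (target - E[s, s_desired])
    return E
```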
Event-learning: the controller
• selects an action (or action sequence)
• tries to realize the planned event (s,s’)
• what should the controller be?
  • simplest: (approximate) inverse dynamics, but it may be too coarse
  • hierarchical: a lower-level RL agent, but it is hard to tune
  • an intermediate solution: the SDS controller, simple but robust
The SDS controller
• approximate inverse dynamics: gives an action for the desired event (s, sdesired)
• error: (s, sexperienced) happens instead
• correction (feedback) term: discounted integral of sdesired(t) – sexperienced(t)
• the action given by the inverse dynamics is corrected by Λ·(feedback term)
• (continuous action space)
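A minimal sketch of this control law: the inverse-dynamics action plus Λ times a discounted integral of the event error. The class layout, the discrete-time form of the discounted integral, and the inverse_dynamics signature are assumptions for illustration:

```python
class SDSController:
    """Approximate inverse dynamics corrected by a discounted error integral."""

    def __init__(self, inverse_dynamics, Lambda=10.0, decay=0.9):
        self.inverse_dynamics = inverse_dynamics  # rough model: (s, s_desired) -> action
        self.Lambda = Lambda                      # feedback gain
        self.decay = decay                        # discounting of the error integral
        self.feedback = 0.0                       # running discounted sum of errors

    def act(self, s, s_desired, s_experienced):
        # accumulate the discounted error between the planned and realized states
        self.feedback = self.decay * self.feedback + (s_desired - s_experienced)
        # correct the coarse inverse-dynamics action by Lambda times the feedback term
        return self.inverse_dynamics(s, s_desired) + self.Lambda * self.feedback
```

With a sufficiently large Λ, the feedback term compensates for the coarseness of the inverse dynamics; this is the intuition behind the bounded-error property on the next slide.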
Properties of the SDS controller
• very mild conditions on the approximate inverse dynamics
• asymptotically bounded error (< ε for sufficiently large Λ)
• robust (→ experiments)
• Event-learning with SDS:
  • is non-Markovian
  • performance guarantee?
ε-stationary MDPs (ε-MDPs)
• transition probabilities may change over time
• the changes are small and not cumulative: the transition kernel remains in a small ε-neighborhood of some base MDP P
RL in ε-MDPs
• we can use RL algorithms in ε-MDPs as well
• they do not converge to an optimal policy (one does not exist)
• we showed that they are still near-optimal: for large t, ‖Vt – V*‖ < K·ε, where V* is the optimal value function of the base MDP
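In equation form (a restatement of the two slides above; the choice of norm, the neighborhood notation, and the constant K follow the paper and are not spelled out on the slide):

```latex
% epsilon-MDP: all transition kernels stay in an epsilon-neighborhood of a base MDP
\[
\mathcal{P}_t \in B_{\varepsilon}(\mathcal{P}) \quad \text{for all } t.
\]
% near-optimality of RL algorithms in an epsilon-MDP: for sufficiently large t,
\[
\lVert V_t - V^{*} \rVert \;<\; K \cdot \varepsilon ,
\]
% where $V^{*}$ is the optimal value function of the base MDP $\mathcal{P}$
% and $K$ is an algorithm-dependent constant.
```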
Back to event-learning and SDS
• from the viewpoint of the event-learning agent, the controller is part of the environment!
• the error of SDS is less than ε
• hence the environment is an ε-MDP
• therefore event-learning with SDS is asymptotically near-optimal
Demonstration problem: the pendulum
(figures: the pendulum task)
Experiment 1: Comparison with SARSA
(figure)
Experiment 2: Robustness
(figure)
Summary – ε-MDPs
• a general theorem on the near-optimality of RL algorithms
• applicable to:
  • event-learning
  • fast-changing or uncertain environments
Summary – event-learning
• learns tiny subgoals (events)
• Event-learning with an SDS controller is
  • practically: robust
  • theoretically: within a bounded deviation from the optimum