Event-Learning with a Non-Markovian Controller
István Szita, Bálint Takács & András Lőrincz
Eötvös Loránd University, Hungary
Acknowledgements
• thanks to ECCAI for the travel grant
• work partially supported by
  • European Office for Aerospace Research and Development
  • Air Force Office of Scientific Research
  • Hungarian National Science Foundation
• thanks to Csaba Szepesvári for the helpful comments
Introduction: reinforcement learning
(diagram: the agent–environment loop; the agent observes the state, chooses an action, and receives a reward from the environment; goal: maximize reward)
Introduction: Markov decision processes
• fully observable, Markovian
• state and action space: S, A
• transition probabilities: P(s,a,s’)
• reward function: R(s,a,s’)
• policy: π(s,a)
• value function: V(s) := E(∑ₜ γᵗ rₜ | s₀ = s)
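To make these definitions concrete, here is a minimal sketch of policy evaluation on a random tabular toy MDP; the toy instance, the einsum-based backup, and all names are illustrative assumptions, not material from the talk:

```python
import numpy as np

# Hypothetical toy MDP: P[s,a,s'] are transition probabilities,
# R[s,a,s'] the rewards, gamma the discount factor.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, n_states))

def policy_evaluation(pi, tol=1e-8):
    """Compute V(s) = E(sum_t gamma^t r_t | s_0 = s) for a fixed policy pi[s,a]."""
    V = np.zeros(n_states)
    while True:
        # Bellman backup: V(s) = sum_{a,s'} pi(s,a) P(s,a,s') (R(s,a,s') + gamma V(s'))
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
print(policy_evaluation(uniform_policy))
```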
Introduction: Solving MDPs
• optimal value function: V*(s)
• action-value function: Q(s,a), Q*(s,a)
• policy derived from Q
• solution methods: iteration, iterated averaging, sampling (e.g. DP, Q-learning, SARSA); a sketch of Q-learning follows below
• most of them provably converge
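As an example of the sampling-style methods above, a minimal sketch of tabular Q-learning; the Gym-like env interface (reset() → s, step(a) → (s’, r, done)) is an assumed convention, not something specified in the talk:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # move Q(s,a) toward the bootstrap target r + gamma * max_a' Q(s',a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```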
Event-learning
• basic idea: learn the values of events, E(s,s’)
• an event is an (s,s’) transition
• policy: πE(s,s’)
• expected advantages:
  • captures the subgoal concept
  • higher-level decisions
  • good performance
• needs a controller (to realize the chosen events)
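To make the event-value idea concrete, a simplified sketch of a single backup of E(s,s’). This SARSA-like update, which backs up from the state actually reached, is an illustrative assumption, not the paper’s exact formulation:

```python
import numpy as np

def event_backup(E, s, s_desired, s_reached, r, alpha=0.1, gamma=0.9):
    """One illustrative update of the event-value table E[s, s'].

    E rates transitions (s -> s') directly, with no action argument:
    realizing the chosen transition is delegated to the controller,
    so s_reached may differ from s_desired.
    """
    target = r + gamma * np.max(E[s_reached])  # value of the best event from s_reached
    E[s, s_desired] += alpha * (target - E[s, s_desired])
    return E
```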
Event-learning: the controller
• selects an action (or action sequence)
• tries to realize the planned event (s,s’)
• what should the controller be?
  • simplest: (approximate) inverse dynamics, but it may be too coarse
  • hierarchical: a lower-level RL agent, but it is hard to tune
  • an intermediate solution: the SDS controller, simple but robust
The SDS controller
• approximate inverse dynamics: gives an action for the desired event (s, sdesired)
• error: (s, sexperienced) happens instead
• correction (feedback) term: discounted integral of sdesired(t) – sexperienced(t)
• the action given by the inverse dynamics is corrected by Λ·(feedback term)
• (continuous action space)
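A minimal sketch of this control law: the inverse-dynamics action plus Λ times a discounted integral of the event error. The class layout, the discrete-time form of the discounted integral, and the inverse_dynamics signature are assumptions for illustration:

```python
class SDSController:
    """Approximate inverse dynamics corrected by a discounted error integral."""

    def __init__(self, inverse_dynamics, Lambda=10.0, decay=0.9):
        self.inverse_dynamics = inverse_dynamics  # rough model: (s, s_desired) -> action
        self.Lambda = Lambda                      # feedback gain
        self.decay = decay                        # discounting of the error integral
        self.feedback = 0.0                       # running discounted sum of errors

    def act(self, s, s_desired, s_experienced):
        # accumulate the discounted error between the planned and realized states
        self.feedback = self.decay * self.feedback + (s_desired - s_experienced)
        # correct the coarse inverse-dynamics action by Lambda times the feedback term
        return self.inverse_dynamics(s, s_desired) + self.Lambda * self.feedback
```

With a sufficiently large Λ, the feedback term compensates for the coarseness of the inverse dynamics; this is the intuition behind the bounded-error property on the next slide.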
Properties of the SDS controller
• very mild conditions on the approximate inverse dynamics
• asymptotically bounded error (< ε for sufficiently large Λ)
• robust (→ experiments)
• Event-learning with SDS:
  • is non-Markovian
  • performance guarantee?
ε-stationary MDPs (ε-MDPs)
• transition probabilities may change over time
• the changes are small and not cumulative: the transition kernel remains in a small ε-neighborhood of some base MDP P
RL in ε-MDPs
• we can use RL algorithms in ε-MDPs as well
• they do not converge to an optimal policy (one does not exist)
• we showed that they are still near-optimal: for large t, ‖Vt – V*‖ < K·ε, where V* is the optimal value function of the base MDP
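In equation form (a restatement of the two slides above; the choice of norm, the neighborhood notation, and the constant K follow the paper and are not spelled out on the slide):

```latex
% epsilon-MDP: all transition kernels stay in an epsilon-neighborhood of a base MDP
\[
\mathcal{P}_t \in B_{\varepsilon}(\mathcal{P}) \quad \text{for all } t.
\]
% near-optimality of RL algorithms in an epsilon-MDP: for sufficiently large t,
\[
\lVert V_t - V^{*} \rVert \;<\; K \cdot \varepsilon ,
\]
% where $V^{*}$ is the optimal value function of the base MDP $\mathcal{P}$
% and $K$ is an algorithm-dependent constant.
```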
Back to event-learning and SDS
• from the viewpoint of the event-learning agent, the controller is part of the environment!
• the error of SDS is less than ε
• hence the environment is an ε-MDP
• therefore event-learning with SDS is asymptotically near-optimal
Demonstration problem: the pendulum
(figures: the pendulum task)
Experiment 1: Comparison with SARSA
(figure)
Experiment 2: Robustness
(figure)
Summary – ε-MDPs
• a general theorem on the near-optimality of RL algorithms
• applicable to:
  • event-learning
  • fast-changing or uncertain environments
Summary – event-learning
• learns tiny subgoals (events)
• Event-learning with an SDS controller is
  • practically: robust
  • theoretically: within a bounded deviation from the optimum