Source • “Reinforcement Learning in the Multi-Robot Domain” : Maja J Mataric
Introduction • Mataric describes a method for real-time learning in an autonomous agent • Reinforcement Learning (RL) is used • The agent learns based on rewards and punishments • This method is experimentally validated on a group of 4 robots learning a foraging task
Why? • A successful learning algorithm would allow autonomous agents to exhibit complex behaviors with little (or no) extra programming.
Two Main Challenges • The state space is prohibitively large • Building a predictive model is very slow • It may be more efficient to learn a policy directly • Structuring and assigning reinforcement is hard • The environment does not provide a direct source of immediate reinforcement • Credit is delayed
Addressing Challenges Since the state space is prohibitively large, it is necessary to reduce the space using behaviors and conditions • Behaviors • Homing / Wall-following • Abstract away low-level controllers • Conditions • Have-puck? / at-home? • Abstract away details of the state space
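As a rough sketch (not the paper's implementation), each condition can be computed as a boolean predicate over raw sensing, so the learner sees a small discrete space rather than raw sensor values; the sensor fields and thresholds below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class SensorSnapshot:
    # hypothetical raw sensor readings; the real robots use more and noisier channels
    gripper_closed: bool
    distance_to_home: float
    nearest_robot_distance: float
    clock_minutes: int

def conditions(s: SensorSnapshot) -> tuple:
    """Abstract raw sensing into the four boolean conditions used by the learner."""
    return (
        s.gripper_closed,                 # have-puck?
        s.distance_to_home < 0.5,         # at-home?  (threshold is illustrative)
        s.nearest_robot_distance < 1.0,   # near-intruder?
        s.clock_minutes % 20 >= 15,       # night-time? (simulated day/night cycle)
    )
```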
Addressing Challenges • Assigning reinforcement is difficult because an event that triggers it may be due to past actions, such as • Attempts to reach a goal • or Reactions to another robot • To address this, Mataric uses shaped reinforcement in the form of • Heterogeneous reward functions • and Progress estimators
Reward Functions • Heterogeneous reward functions combine multi-modal feedback from • External (sensory) • and Internal (state) modalities • Each behavior has an associated goal, which provides a reinforcement signal • More sub-goals mean more frequent reinforcement, which leads to faster convergence
Reward Functions • Progress Estimators (PEs) provide positive or negative reinforcement with respect to the current goal • PEs decrease sensitivity to noise • Noise-induced events are not consistently reinforced • PEs encourage exploration • Non-productive behaviors are terminated • PEs decrease fortuitous rewards • Over time, less reward is given to fortuitous successes
The Learning Task • The learning task consists of finding a mapping from conditions • Have-puck? • At-home? • Near-intruder? • Night-time? • to behaviors • Safe-wandering • Dispersion • Resting • Homing
Learning Algorithm • Matrix A(c,b) is a normalized sum of the reinforcement R for each condition-behavior pair over time t: A(c,b) = Σ_t R(c,t) • Learning is continuous
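A minimal sketch of this table in Python, reading "normalized sum" as a running average of received reinforcement; that reading, and all structure below, are assumptions rather than the paper's code:

```python
from collections import defaultdict

# behaviors from the "Learning Task" slide; a condition is a tuple of the four predicates
BEHAVIORS = ["safe-wandering", "dispersion", "resting", "homing"]

class ValueTable:
    """A(c, b): normalized sum of reinforcement for each condition-behavior pair."""

    def __init__(self):
        self.sums = defaultdict(lambda: {b: 0.0 for b in BEHAVIORS})
        self.counts = defaultdict(lambda: {b: 0 for b in BEHAVIORS})

    def update(self, condition, behavior, reinforcement):
        """Accumulate reinforcement for the active (c, b) pair; learning is continuous."""
        self.sums[condition][behavior] += reinforcement
        self.counts[condition][behavior] += 1

    def value(self, condition, behavior):
        """A(c, b) as a running average of the reinforcement received so far."""
        n = self.counts[condition][behavior]
        return self.sums[condition][behavior] / n if n else 0.0
```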
Immediate Reinforcement • Positive • Ep: grasped-puck • Egd: dropped-puck-at-home • Egw: woke-up-at-home • Negative • Ebd: dropped-puck-away-from-home • Ebw: woke-up-away-from-home • The events are merged into one heterogeneous reinforcement function RE(c)
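A sketch of how these events could be merged into a single reinforcement function RE(c); the event names follow the slide, while the +1/-1 magnitudes are illustrative assumptions:

```python
# reinforcement per event; signs follow the slide, magnitudes are illustrative
EVENT_REWARDS = {
    "grasped-puck": +1.0,                  # Ep
    "dropped-puck-at-home": +1.0,          # Egd
    "woke-up-at-home": +1.0,               # Egw
    "dropped-puck-away-from-home": -1.0,   # Ebd
    "woke-up-away-from-home": -1.0,        # Ebw
}

def heterogeneous_reinforcement(event: str) -> float:
    """RE(c): immediate reinforcement delivered when a recognized event occurs."""
    return EVENT_REWARDS.get(event, 0.0)
```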
Progress Estimators • RI(c,t) – Minimizing Interference • Positive for increasing distance • Negative for decreasing distance • RH(c,t) – Homing (with puck) • Positive for nearer to home • Negative for farther from home
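One plausible realization compares successive distance readings while the relevant behavior is active; the structure and magnitudes below are assumptions, not the paper's code:

```python
def interference_progress(prev_dist_to_robot: float, dist_to_robot: float) -> float:
    """RI(c, t): reward moving away from other robots, punish moving closer."""
    if dist_to_robot > prev_dist_to_robot:
        return +0.5
    if dist_to_robot < prev_dist_to_robot:
        return -0.5
    return 0.0

def homing_progress(have_puck: bool, prev_dist_to_home: float, dist_to_home: float) -> float:
    """RH(c, t): while carrying a puck, reward getting closer to home."""
    if not have_puck:
        return 0.0
    if dist_to_home < prev_dist_to_home:
        return +0.5
    if dist_to_home > prev_dist_to_home:
        return -0.5
    return 0.0
```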
Control Algorithm • Behavior selection is induced by events • Events are triggered • Externally • Internally • By progress estimators
Control Algorithm When an event is detected, the following control sequence is executed • Current (c,b) pair is reinforced • Current behavior is terminated • New behavior selected • Choose an untried behavior • Otherwise choose “best” behavior
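A minimal event-driven selection loop following this sequence, reusing the ValueTable sketch from the Learning Algorithm slide; any detail beyond "try untried behaviors first, then pick the best" is an assumption:

```python
import random

def select_behavior(table, condition):
    """Prefer behaviors not yet tried under this condition; otherwise take the best."""
    untried = [b for b in BEHAVIORS if table.counts[condition][b] == 0]
    if untried:
        return random.choice(untried)
    return max(BEHAVIORS, key=lambda b: table.value(condition, b))

def on_event(table, condition, current_behavior, reinforcement):
    """Run when an external, internal, or progress-estimator event fires."""
    # 1. reinforce the current (condition, behavior) pair
    table.update(condition, current_behavior, reinforcement)
    # 2. terminate the current behavior and 3. select a new one
    return select_behavior(table, condition)
```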
Experimental Results Three approaches are compared • Monolithic single-goal reward • Puck delivery to home • Heterogeneous reward function • Heterogeneous reward function with two progress estimators
Experimental Results • The hand-coded base policy
Experimental Results • Percent of correct policy learned after 15 minutes, compared for the monolithic, heterogeneous, and heterogeneous-with-progress-estimators reward functions (chart)
Evaluation • Monolithic • Does not provide enough feedback • Heterogeneous Reward • Certain behaviors pursued too long • Behaviors with delayed reward ignored (homing) • Heterogeneous with Progress Estimators • Eliminates thrashing • Impact of fortuitous rewards minimized
Conclusions • Mataric's combination of heterogeneous reward functions and progress estimators can improve learning performance by exploiting domain knowledge
Critique • Is this truly multi-robot learning? • The techniques converge to a hand-crafted policy; what is the optimal policy?