Source • “Reinforcement Learning in the Multi-Robot Domain” : Maja J Mataric
Introduction • Mataric describes a method for real-time learning in an autonomous agent • Reinforcement Learning (RL) is used • The agent learns based on rewards and punishments • This method is experimentally validated on a group of 4 robots learning a foraging task
Why? • A successful learning algorithm would allow autonomous agents to exhibit complex behaviors with little (or no) extra programming.
Two Main Challenges • The state space is prohibitively large • Building a predictive model is very slow • It may be more efficient to learn a policy directly • Structuring and assigning reinforcement is hard • The environment does not provide a direct source of immediate reinforcement • Credit is delayed
Addressing Challenges Since the state space is prohibitively large, it is necessary to reduce the space using behaviors and conditions • Behaviors • Homing / Wall-following • Abstract away low-level controllers • Conditions • Have-puck? / at-home? • Abstract away details of the state space
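As a rough sketch (not the paper's implementation), each condition can be computed as a boolean predicate over raw sensing, so the learner sees a small discrete space rather than raw sensor values; the sensor fields and thresholds below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class SensorSnapshot:
    # hypothetical raw sensor readings; the real robots use more and noisier channels
    gripper_closed: bool
    distance_to_home: float
    nearest_robot_distance: float
    clock_minutes: int

def conditions(s: SensorSnapshot) -> tuple:
    """Abstract raw sensing into the four boolean conditions used by the learner."""
    return (
        s.gripper_closed,                 # have-puck?
        s.distance_to_home < 0.5,         # at-home?  (threshold is illustrative)
        s.nearest_robot_distance < 1.0,   # near-intruder?
        s.clock_minutes % 20 >= 15,       # night-time? (simulated day/night cycle)
    )
```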
Addressing Challenges • Assigning reinforcement is difficult because an event that triggers it may be due to past actions, such as • Attempts to reach a goal • or Reactions to another robot • To address this, Mataric uses shaped reinforcement in the form of • Heterogeneous reward functions • and Progress estimators
Reward Functions • Heterogeneous reward functions combine multi-modal feedback from • External (sensory) • and Internal (state) modalities • Each behavior has an associated goal, which provides a reinforcement signal • More sub-goals mean more frequent reinforcement, which leads to faster convergence
Reward Functions • Progress Estimators (PEs) provide positive or negative reinforcement with respect to the current goal • PEs decrease sensitivity to noise • Noise-induced events are not consistently reinforced • PEs encourage exploration • Non-productive behaviors are terminated • PEs decrease fortuitous rewards • Over time, less reward is given to fortuitous successes
The Learning Task • The learning task consists of finding a mapping from conditions • Have-puck? • At-home? • Near-intruder? • Night-time? • to behaviors • Safe-wandering • Dispersion • Resting • Homing
Learning Algorithm • Matrix A(c,b) is a normalized sum of the reinforcement R for each condition-behavior pair over time t: A(c,b) = Σ_t R(c,t) • Learning is continuous
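A minimal sketch of this table in Python, reading "normalized sum" as a running average of received reinforcement; that reading, and all structure below, are assumptions rather than the paper's code:

```python
from collections import defaultdict

# behaviors from the "Learning Task" slide; a condition is a tuple of the four predicates
BEHAVIORS = ["safe-wandering", "dispersion", "resting", "homing"]

class ValueTable:
    """A(c, b): normalized sum of reinforcement for each condition-behavior pair."""

    def __init__(self):
        self.sums = defaultdict(lambda: {b: 0.0 for b in BEHAVIORS})
        self.counts = defaultdict(lambda: {b: 0 for b in BEHAVIORS})

    def update(self, condition, behavior, reinforcement):
        """Accumulate reinforcement for the active (c, b) pair; learning is continuous."""
        self.sums[condition][behavior] += reinforcement
        self.counts[condition][behavior] += 1

    def value(self, condition, behavior):
        """A(c, b) as a running average of the reinforcement received so far."""
        n = self.counts[condition][behavior]
        return self.sums[condition][behavior] / n if n else 0.0
```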
Immediate Reinforcement • Positive • Ep: grasped-puck • Egd: dropped-puck-at-home • Egw: woke-up-at-home • Negative • Ebd: dropped-puck-away-from-home • Ebw: woke-up-away-from-home • The events are merged into one heterogeneous reinforcement function RE(c)
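A sketch of how these events could be merged into a single reinforcement function RE(c); the event names follow the slide, while the +1/-1 magnitudes are illustrative assumptions:

```python
# reinforcement per event; signs follow the slide, magnitudes are illustrative
EVENT_REWARDS = {
    "grasped-puck": +1.0,                  # Ep
    "dropped-puck-at-home": +1.0,          # Egd
    "woke-up-at-home": +1.0,               # Egw
    "dropped-puck-away-from-home": -1.0,   # Ebd
    "woke-up-away-from-home": -1.0,        # Ebw
}

def heterogeneous_reinforcement(event: str) -> float:
    """RE(c): immediate reinforcement delivered when a recognized event occurs."""
    return EVENT_REWARDS.get(event, 0.0)
```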
Progress Estimators • RI(c,t) – Minimizing Interference • Positive for increasing distance • Negative for decreasing distance • RH(c,t) – Homing (with puck) • Positive for nearer to home • Negative for farther from home
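One plausible realization compares successive distance readings while the relevant behavior is active; the structure and magnitudes below are assumptions, not the paper's code:

```python
def interference_progress(prev_dist_to_robot: float, dist_to_robot: float) -> float:
    """RI(c, t): reward moving away from other robots, punish moving closer."""
    if dist_to_robot > prev_dist_to_robot:
        return +0.5
    if dist_to_robot < prev_dist_to_robot:
        return -0.5
    return 0.0

def homing_progress(have_puck: bool, prev_dist_to_home: float, dist_to_home: float) -> float:
    """RH(c, t): while carrying a puck, reward getting closer to home."""
    if not have_puck:
        return 0.0
    if dist_to_home < prev_dist_to_home:
        return +0.5
    if dist_to_home > prev_dist_to_home:
        return -0.5
    return 0.0
```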
Control Algorithm • Behavior selection is induced by events • Events are triggered • Externally • Internally • By progress estimators
Control Algorithm When an event is detected, the following control sequence is executed • Current (c,b) pair is reinforced • Current behavior is terminated • New behavior selected • Choose an untried behavior • Otherwise choose “best” behavior
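A minimal event-driven selection loop following this sequence, reusing the ValueTable sketch from the Learning Algorithm slide; any detail beyond "try untried behaviors first, then pick the best" is an assumption:

```python
import random

def select_behavior(table, condition):
    """Prefer behaviors not yet tried under this condition; otherwise take the best."""
    untried = [b for b in BEHAVIORS if table.counts[condition][b] == 0]
    if untried:
        return random.choice(untried)
    return max(BEHAVIORS, key=lambda b: table.value(condition, b))

def on_event(table, condition, current_behavior, reinforcement):
    """Run when an external, internal, or progress-estimator event fires."""
    # 1. reinforce the current (condition, behavior) pair
    table.update(condition, current_behavior, reinforcement)
    # 2. terminate the current behavior and 3. select a new one
    return select_behavior(table, condition)
```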
Experimental Results Three approaches are compared • Monolithic single-goal reward • Puck delivery to home • Heterogeneous reward function • Heterogeneous reward function with two progress estimators
Experimental Results • The hand-coded base policy
Experimental Results • Percent of correct policy learned after 15 minutes, compared for the monolithic, heterogeneous, and heterogeneous-with-progress-estimators reward functions (chart)
Evaluation • Monolithic • Does not provide enough feedback • Heterogeneous Reward • Certain behaviors pursued too long • Behaviors with delayed reward ignored (homing) • Heterogeneous with Progress Estimators • Eliminates thrashing • Impact of fortuitous rewards minimized
Conclusions • Mataric's combination of heterogeneous reward functions and progress estimators can improve learning performance by exploiting domain knowledge
Critique • Is this truly multi-robot learning? • The techniques converge to a hand-crafted policy; what is the optimal policy?