Effective Reinforcement Learning for Mobile Robots Smart, W.D. and Kaelbling, L.P.
Content • Background • Review Q-learning • Reinforcement learning on mobile robots • Learning framework • Experimental results • Conclusion • Discussion
Background • Hard to code behaviour efficiently and correctly • Reinforcement learning: tell the robot what to do, not how to do it • How well suited is reinforcement learning for mobile robots?
Review Q-learning • Discrete states s and actions a • Learn the value function by observing rewards • Optimal value function: Q*(s,a) = E[R(s,a) + γ max_a' Q*(s',a')] • Learned with the update Q(s_t,a_t) ← (1−α) Q(s_t,a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1},a')) • The sampling distribution has no effect on the learned policy π*(s) = argmax_a Q*(s,a)
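A minimal sketch of the tabular update above, assuming discrete states and actions; the function names and Q-table layout are illustrative, not the authors' code:

from collections import defaultdict

ALPHA = 0.2    # learning rate (the α in the update rule)
GAMMA = 0.99   # discount factor (γ)

# Q-table for discrete states and actions, initialised to zero.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions):
    # Q(s_t,a_t) ← (1−α) Q(s_t,a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1},a'))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

def greedy_policy(s, actions):
    # π*(s) = argmax_a Q*(s,a)
    return max(actions, key=lambda a: Q[(s, a)])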
Reinforcement learning on mobile robots • Sparse reward function • Reward R(s,a) is almost always zero • Non-zero reward only on success or failure • Continuous environment • HEDGER is used as a function approximator • Function approximation is safe as long as it never extrapolates from the training data
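HEDGER itself is a locally weighted regression learner that refuses to answer queries falling outside the hull of its training data; the sketch below only illustrates that "never extrapolate" idea with a much simpler nearest-neighbour check, and every name and threshold in it is an assumption:

import numpy as np

class NoExtrapolationQ:
    # Stores (state, action) vectors with Q-value targets and answers a query
    # only when it lies close to stored data; otherwise it returns a default
    # value instead of extrapolating. A crude stand-in for HEDGER's hull test.

    def __init__(self, max_dist=0.5, default_q=0.0, k=5):
        self.X = []               # stored (state, action) feature vectors
        self.y = []               # stored Q-value targets
        self.max_dist = max_dist  # beyond this distance we refuse to predict
        self.default_q = default_q
        self.k = k

    def add(self, sa_vector, q_target):
        self.X.append(np.asarray(sa_vector, dtype=float))
        self.y.append(float(q_target))

    def predict(self, sa_vector):
        if not self.X:
            return self.default_q
        sa = np.asarray(sa_vector, dtype=float)
        dists = np.array([np.linalg.norm(sa - x) for x in self.X])
        near = np.argsort(dists)[:self.k]
        if dists[near[0]] > self.max_dist:
            # Query is far from all training data: do not extrapolate.
            return self.default_q
        # Distance-weighted average of nearby targets, a rough stand-in
        # for the locally weighted regression used by HEDGER.
        w = 1.0 / (dists[near] + 1e-6)
        return float(np.dot(w, np.array(self.y)[near]) / w.sum())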
Reinforcement learning on mobile robots • Q-learning can only succeed if a state with positive reward is actually visited • With a sparse reward function and a continuous environment, these reward states are hard to find by trial and error alone • Solution: show the robot how to find the reward states
Learning framework • Split learning into two phases: • Phase one: actions are supplied by an external controller (human or hand-coded); the learning algorithm only passively observes • Phase two: the learning algorithm takes control and learns the optimal policy • By 'showing' the robot where the interesting states are, learning should be quicker
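A sketch of how the two phases could be wired together; the environment interface (reset/step), the learner's observe and policy methods, and the run counts are all assumed for illustration:

def run_phase(env, learner, choose_action, n_runs):
    # One pass over n_runs episodes; the only difference between the two
    # phases is who chooses the action.
    for _ in range(n_runs):
        s = env.reset()
        done = False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)
            learner.observe(s, a, r, s_next)   # value backups happen here
            s = s_next

def train(env, learner, teacher_policy, n_teacher_runs, n_learner_runs):
    # Phase one: the teacher (human or hand-coded controller) drives the
    # robot while the learner only watches the resulting transitions.
    run_phase(env, learner, teacher_policy, n_teacher_runs)
    # Phase two: the learner's own policy takes control and keeps improving
    # from its own experience.
    run_phase(env, learner, learner.policy, n_learner_runs)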
Experimental setup • Two experiments on a B21r mobile robot • Translation speed is fixed externally; only the rotation speed has to be learned • Settings: α = 0.2, γ = 0.99 or 0.90 • Performance is measured after every 5 runs • The robot does not learn from these test runs • Starting position and orientation are similar across runs, not identical
Experimental Results: Corridor Following Task • State space: • distance to the end of the corridor • distance to the left wall as a fraction of corridor width • angle θ to the target point
Experimental Results: Corridor Following Task • Computer-controlled teacher • Rotation speed is a fixed fraction of the angle θ
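The computer-controlled teacher can be as simple as a proportional controller on the heading error; in the sketch below the state encoding and the gain value are assumptions (the slide only says the rotation speed is a fraction of θ):

from collections import namedtuple

# Assumed encoding of the corridor-following state described above.
CorridorState = namedtuple("CorridorState", ["dist_to_end", "frac_from_left", "theta"])

K_TURN = 0.3  # assumed proportional gain; the actual fraction is not given here

def computer_teacher(state):
    # Rotation command is a fixed fraction of the angle θ to the target point;
    # translation speed is fixed externally, so only rotation is returned.
    return K_TURN * state.theta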
Experimental Results: Corridor Following Task • Human-controlled teacher • Different corridor than the one used with the computer-controlled teacher
Experimental Results: Corridor Following Task Results • Initial decrease in performance at the start of phase 2 • Phase 2 supplies more novel experiences than phase 1 • The sloppy human-controlled teacher leads to faster convergence than the rigid computer-controlled one • Fewer phase 1 and phase 2 runs are needed • The human controller supplies more varied training data
Experimental Results: Corridor Following Task Results • Simulated performance without the advantage of teacher examples
Experimental Results: Obstacle Avoidance Task • State space: • direction and distance to obstacles • direction and distance to the target
Experimental Results: Obstacle Avoidance Task Results • Human-controlled teacher • The robot starts 3 m from the target with a random orientation
Experimental Results: Obstacle Avoidance Task Results • Simulation without teacher examples • No obstacles present; the robot only has to reach the goal • The simulated robot starts in the correct orientation • Starting 3 m from the target: 18.7% of runs reached the target within one week of simulated time, taking 6.54 hours on average
Conclusion • Passive observation of appropriate state-action behaviour can speed up Q-learning • No knowledge of the robot or of the learning algorithm is needed to provide the examples • Any example trajectories that reach the reward will do; providing a good (near-optimal) solution is not necessary