Effective Reinforcement Learning for Mobile Robots Smart, W.D. and Kaelbling, L.P.
Content • Background • Review Q-learning • Reinforcement learning on mobile robots • Learning framework • Experimental results • Conclusion • Discussion
Background • Hard to code behaviour efficiently and correctly • Reinforcement learning: tell the robot what to do, not how to do it • How well suited is reinforcement learning for mobile robots?
Review Q-learning • Discrete states s and actions a • Learn the value function by observing rewards • Optimal value function: Q*(s,a) = E[R(s,a) + γ max_a' Q*(s',a')] • Learned with the update Q(s_t,a_t) ← (1−α) Q(s_t,a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1},a')) • The sampling distribution has no effect on the learned policy π*(s) = argmax_a Q*(s,a)
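A minimal sketch of the tabular update above, assuming discrete states and actions; the function names and Q-table layout are illustrative, not the authors' code:

from collections import defaultdict

ALPHA = 0.2    # learning rate (the α in the update rule)
GAMMA = 0.99   # discount factor (γ)

# Q-table for discrete states and actions, initialised to zero.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions):
    # Q(s_t,a_t) ← (1−α) Q(s_t,a_t) + α (r_{t+1} + γ max_a' Q(s_{t+1},a'))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

def greedy_policy(s, actions):
    # π*(s) = argmax_a Q*(s,a)
    return max(actions, key=lambda a: Q[(s, a)])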
Reinforcement learning on mobile robots • Sparse reward function • Reward R(s,a) is almost always zero • Non-zero reward only on success or failure • Continuous environment • HEDGER is used as a function approximator • Function approximation is safe as long as it never extrapolates from the training data
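HEDGER itself is a locally weighted regression learner that refuses to answer queries falling outside the hull of its training data; the sketch below only illustrates that "never extrapolate" idea with a much simpler nearest-neighbour check, and every name and threshold in it is an assumption:

import numpy as np

class NoExtrapolationQ:
    # Stores (state, action) vectors with Q-value targets and answers a query
    # only when it lies close to stored data; otherwise it returns a default
    # value instead of extrapolating. A crude stand-in for HEDGER's hull test.

    def __init__(self, max_dist=0.5, default_q=0.0, k=5):
        self.X = []               # stored (state, action) feature vectors
        self.y = []               # stored Q-value targets
        self.max_dist = max_dist  # beyond this distance we refuse to predict
        self.default_q = default_q
        self.k = k

    def add(self, sa_vector, q_target):
        self.X.append(np.asarray(sa_vector, dtype=float))
        self.y.append(float(q_target))

    def predict(self, sa_vector):
        if not self.X:
            return self.default_q
        sa = np.asarray(sa_vector, dtype=float)
        dists = np.array([np.linalg.norm(sa - x) for x in self.X])
        near = np.argsort(dists)[:self.k]
        if dists[near[0]] > self.max_dist:
            # Query is far from all training data: do not extrapolate.
            return self.default_q
        # Distance-weighted average of nearby targets, a rough stand-in
        # for the locally weighted regression used by HEDGER.
        w = 1.0 / (dists[near] + 1e-6)
        return float(np.dot(w, np.array(self.y)[near]) / w.sum())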
Reinforcement learning on mobile robots • Q-learning can only succeed if a state with positive reward is actually visited • With a sparse reward function and a continuous environment, these reward states are hard to find by trial and error alone • Solution: show the robot how to find the reward states
Learning framework • Split learning into two phases: • Phase one: actions are supplied by an external controller (human or hand-coded); the learning algorithm only passively observes • Phase two: the learning algorithm takes control and learns the optimal policy • By 'showing' the robot where the interesting states are, learning should be quicker
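A sketch of how the two phases could be wired together; the environment interface (reset/step), the learner's observe and policy methods, and the run counts are all assumed for illustration:

def run_phase(env, learner, choose_action, n_runs):
    # One pass over n_runs episodes; the only difference between the two
    # phases is who chooses the action.
    for _ in range(n_runs):
        s = env.reset()
        done = False
        while not done:
            a = choose_action(s)
            s_next, r, done = env.step(a)
            learner.observe(s, a, r, s_next)   # value backups happen here
            s = s_next

def train(env, learner, teacher_policy, n_teacher_runs, n_learner_runs):
    # Phase one: the teacher (human or hand-coded controller) drives the
    # robot while the learner only watches the resulting transitions.
    run_phase(env, learner, teacher_policy, n_teacher_runs)
    # Phase two: the learner's own policy takes control and keeps improving
    # from its own experience.
    run_phase(env, learner, learner.policy, n_learner_runs)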
Experimental setup • Two experiments on a B21r mobile robot • Translation speed is fixed externally; only the rotation speed has to be learned • Settings: α = 0.2, γ = 0.99 or 0.90 • Performance is measured after every 5 runs • The robot does not learn from these test runs • Starting position and orientation are similar across runs, not identical
Experimental Results: Corridor Following Task • State space: • distance to the end of the corridor • distance to the left wall as a fraction of corridor width • angle θ to the target point
Experimental Results: Corridor Following Task • Computer-controlled teacher • Rotation speed is a fixed fraction of the angle θ
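The computer-controlled teacher can be as simple as a proportional controller on the heading error; in the sketch below the state encoding and the gain value are assumptions (the slide only says the rotation speed is a fraction of θ):

from collections import namedtuple

# Assumed encoding of the corridor-following state described above.
CorridorState = namedtuple("CorridorState", ["dist_to_end", "frac_from_left", "theta"])

K_TURN = 0.3  # assumed proportional gain; the actual fraction is not given here

def computer_teacher(state):
    # Rotation command is a fixed fraction of the angle θ to the target point;
    # translation speed is fixed externally, so only rotation is returned.
    return K_TURN * state.theta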
Experimental Results: Corridor Following Task • Human-controlled teacher • Different corridor than the one used with the computer-controlled teacher
Experimental Results: Corridor Following Task Results • Initial decrease in performance at the start of phase 2 • Phase 2 supplies more novel experiences than phase 1 • The sloppy human-controlled teacher leads to faster convergence than the rigid computer-controlled one • Fewer phase 1 and phase 2 runs are needed • The human controller supplies more varied training data
Experimental Results: Corridor Following Task Results • Simulated performance without the advantage of teacher examples
Experimental Results: Obstacle Avoidance Task • State space: • direction and distance to obstacles • direction and distance to the target
Experimental Results: Obstacle Avoidance Task Results • Human-controlled teacher • The robot starts 3 m from the target with a random orientation
Experimental Results: Obstacle Avoidance Task Results • Simulation without teacher examples • No obstacles present; the robot only has to reach the goal • The simulated robot starts in the correct orientation • Starting 3 m from the target: 18.7% of runs reached the target within one week of simulated time, taking 6.54 hours on average
Conclusion • Passive observation of appropriate state-action behaviour can speed up Q-learning • No knowledge of the robot or of the learning algorithm is needed to provide the examples • Any example trajectories that reach the reward will do; providing a good (near-optimal) solution is not necessary