Introduction to Reinforcement Learning
Freek Stulp
Overview
• General principles of RL
• Markov Decision Process as model
• Values of states: V(s)
• Values of state-actions: Q(a,s)
• Exploration vs. Exploitation
• Issues in RL
• Conclusion
General principles of RL
• Neural networks are supervised learning algorithms: for each input, we know the desired output.
• What if we don't know the output for each input? (Example: a flight control system.)
• Let the agent learn how to achieve certain goals itself, through interaction with the environment.
General principles of RL
• Let the agent learn how to achieve certain goals itself, through interaction with the environment.
• This alone does not solve the problem: the agent still needs to be told what to achieve.
• Rewards are used to specify goals (example: dogs).
• [Diagram: the agent sends actions to the environment; the environment returns percepts and rewards.]
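This interaction can be made concrete as a simple loop. Below is a minimal sketch, assuming a hypothetical environment with reset/step methods and an agent with act/learn methods; these names are illustrative and not part of the original slides.

```python
def run_episode(env, agent, max_steps=1000):
    """Run one episode of the agent-environment loop sketched above.

    `env` and `agent` are assumed, hypothetical interfaces:
    env.reset() -> state, env.step(action) -> (next_state, reward, done),
    agent.act(state) -> action, agent.learn(...) updates the agent.
    """
    state = env.reset()                                 # initial percept
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # agent chooses an action
        next_state, reward, done = env.step(action)     # environment responds with percept + reward
        agent.learn(state, action, reward, next_state)  # learn from the interaction
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```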
Popular model: MDPs
• Markov Decision Process = {S, A, R, T}
  • Set of states S
  • Set of actions A
  • Reward function R
  • Transition function T
• Markov property: the transition probabilities Tss´ depend only on the current state s (and the chosen action), not on the history of earlier states.
• Policy: π: S → A
• Problem: find the policy π that maximizes the reward.
• Discounted reward: r0 + γr1 + γ²r2 + ... + γⁿrn
• [Diagram: a trajectory s0 →(a0) s1 →(a1) s2 →(a2) s3 with rewards r0, r1, r2.]
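As a concrete illustration, an MDP tuple and the discounted return might be represented as follows; this is a minimal sketch, and the class name, field names, and dictionary layout are assumptions made for the example, not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite MDP {S, A, R, T} in tabular form (illustrative layout)."""
    states: list        # S: list of states
    actions: list       # A: list of actions
    rewards: dict       # R: rewards[s] -> immediate reward in state s
    transitions: dict   # T: transitions[(s, a)] -> {s_next: probability}

def discounted_return(reward_sequence, gamma=0.5):
    """Discounted reward r0 + gamma*r1 + gamma^2*r2 + ... for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(reward_sequence))

# Rewards r0, r1, r2 collected along a trajectory s0 -a0-> s1 -a1-> s2:
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```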
Values of states: Vπ(s)
• Definition of value Vπ(s): the cumulative reward obtained when starting in state s and executing the policy π until a terminal state is reached.
• The optimal policy yields V*(s).
• [Grid-world figures: R (rewards), Vπ(s) under a random policy, and V*(s) under the optimal policy.]
Determining Vπ(s)
• Dynamic programming: V(s) = R(s) + γ Σs´ Tss´ V(s´)
  - Necessary to consider all states.
• TD-learning: V(s) = V(s) + α(R(s) + γV(s´) - V(s))
  + Only visited states are used.
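Both update rules fit in a few lines of tabular code. The sketch below assumes dictionaries V, R, T, and policy with the layout used in the MDP sketch above; alpha and gamma are the learning rate and discount factor.

```python
def dp_backup(s, V, R, T, policy, gamma=0.9):
    """Dynamic-programming backup: V(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * V(s').

    Needs the transition model T and is applied in sweeps over all states.
    """
    a = policy[s]
    return R[s] + gamma * sum(p * V[s_next] for s_next, p in T[(s, a)].items())

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    Model-free: only the states actually visited during interaction are updated.
    """
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```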
Values of state-actions: Q(a,s)
• Q-values Q(a,s): the value of taking action a in state s.
• Dynamic programming: Q(a,s) = R(s) + γ Σs´ Tss´ maxa´ Q(a´,s´)
• TD-learning: Q(a,s) = Q(a,s) + α(R(s) + γ maxa´ Q(a´,s´) - Q(a,s))
• T does not appear in this formula: model-free learning!
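This TD rule for Q-values is the tabular Q-learning update. Here is a minimal sketch, assuming a dictionary Q indexed by (state, action) pairs and a list of available actions.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The transition model T appears nowhere: the update is model-free and uses
    only the observed transition (s, a, r, s_next).
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```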
Exploration vs. Exploitation
• Only exploitation: new (and possibly better) paths are never discovered.
• Only exploration: what is learned is never exploited.
• Good trade-off: explore first to learn, exploit later to benefit.
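A common way to realize this trade-off (not specific to these slides) is ε-greedy action selection: act randomly with probability epsilon and greedily otherwise, typically decaying epsilon over time so the agent explores first and exploits later.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest Q-value in state s (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```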
Some issues
• Hidden state: if you don't know where you are, you can't know what to do.
• Curse of dimensionality: very large state spaces.
• Continuous state/action spaces: the algorithms above use discrete tables; what about continuous values?
• Many of your articles discuss solutions to these problems.
Conclusion
• RL: learning through interaction and rewards.
• The Markov Decision Process is a popular model.
• Values of states: V(s)
• Values of state-actions: Q(a,s) (model-free!)
• Still some problems: not quite ready for complex real-world problems yet, but research is underway!
Literature
• Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
• Machine Learning, Tom M. Mitchell
• Reinforcement Learning: A Tutorial, Mance E. Harmon and Stephanie S. Harmon