Reinforcement Learning: an introduction, part 4
Ann Nowé, Ann.nowe@vub.ac.be, http://como.vub.ac.be
By Sutton and Barto
Backup diagrams in DP
State-value function for policy π: [backup diagram] V(s) is backed up from the successor-state values V(s1), V(s2), V(s3) (and, one step further, V(s1'), V(s2'), V(s3')).
Action-value function for policy π: [backup diagram] Q(s,a) is backed up through the successor states s1, s2 from the action values Q(s1,a1), Q(s1,a2), Q(s2,a1), Q(s2,a2).
Dynamic Programming, model based
[Backup diagram: full-width backups over all possible successor states, down to terminal states T.]
Recall Value Iteration in DP
Q(s,a) \leftarrow \sum_{s'} P(s' \mid s,a)\,\bigl[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \bigr]
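To make the backup concrete, here is a minimal sketch of Q-value iteration over a known model; the data layout (P[s][a] as a list of (probability, next state, reward) triples) and all names are illustrative assumptions, not the slides' notation.

```python
# Minimal sketch of Q-value iteration for a known (model-based) MDP.
# Assumption: P[s][a] is a list of (prob, next_state, reward) triples.
def value_iteration(P, gamma=0.9, theta=1e-6):
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                new_q = 0.0
                for prob, s2, r in P[s][a]:
                    # Value of the next state: best action there, 0 if terminal.
                    best_next = max(Q[s2].values()) if Q.get(s2) else 0.0
                    new_q += prob * (r + gamma * best_next)
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < theta:   # stop once the largest update is negligible
            return Q
```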
RL, model free
[Backup diagram: sample backups along single experienced trajectories, down to terminal states T.]
Q-Learning, a value iteration approach
Q-learning is off-policy: it updates towards the greedy (max) value of the next state, while the behaviour policy that generates the experience may be exploratory.
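A minimal sketch of the corresponding model-free, tabular Q-learning update; the names and data structures are illustrative assumptions.

```python
from collections import defaultdict

# Tabular Q-learning: model-free, off-policy. Q[(s, a)] is updated from
# sampled transitions (s, a, r, s2).
Q = defaultdict(float)

def q_learning_update(s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy value of the next state,
    # independent of the action the behaviour policy will actually take there.
    best_next = max((Q[(s2, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```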
Example
[Diagram: a small MDP with states 1-6, actions a-d, stochastic transition probabilities (0.2, 0.3, 0.7, 0.8, 1) and transition rewards (R = 1, 2, 4, 5, 10).]
Observed trajectories: Epoch 1: 1,2,4; Epoch 2: 1,6; Epoch 3: 1,3; Epoch 4: 1,2,5; Epoch 6: 2,5
Some convergence issues
Q-learning is guaranteed to converge in a Markovian setting, provided every state-action pair keeps being updated and the learning rates decay appropriately.
Tsitsiklis J.N., Asynchronous Stochastic Approximation and Q-learning. Machine Learning, Vol. 16:185-202, 1994.
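For reference, the step-size conditions typically assumed in such convergence results (the Robbins-Monro conditions), stated as a sketch in the usual tabular notation:

```latex
% Every state-action pair (s,a) is visited infinitely often, and the
% learning rates satisfy the Robbins-Monro conditions:
\sum_{t=0}^{\infty} \alpha_t(s,a) = \infty,
\qquad
\sum_{t=0}^{\infty} \alpha_t^2(s,a) < \infty .
```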
Proof by Tsitsiklis (cont.): on the convergence of Q-learning
Proof by Tsitsiklis: on the convergence of Q-learning
The Q-learning update, with its ingredients annotated:
Q_{t+1}(s,a) = (1 - \alpha_t(s,a))\, Q_t(s,a) + \alpha_t(s,a)\,\bigl[ r + \gamma \max_{a'} Q_t(s',a') \bigr]
- \alpha_t(s,a) is the "learning factor";
- the sampled reward r and next state s' make the bracketed target a noisy estimate of the expected backup (noise term);
- the target is computed from the Q vector, but with possibly outdated components (asynchronous updates);
- the expected backup is a contraction mapping.
Proof by Tsitsiklis, cont.
Stochastic approximation, written as a vector iteration: component i of the vector q is updated as
q_i(t+1) = q_i(t) + \alpha_i(t)\,\bigl[ F_i(q(t)) - q_i(t) + w_i(t) \bigr]
where F_i may depend on all components q_j and w_i(t) is a noise term.
Proof by Tsitsiklis, cont.
Relating Q-learning to stochastic approximation:
- F is the Bellman optimality operator, a contraction mapping;
- the ith component corresponds to one state-action pair (s,a);
- the sampled reward and next state supply the noise term;
- the learning factor \alpha_i(t) can vary in time.
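As a reminder of the contraction property this argument relies on, the Bellman optimality operator and its sup-norm contraction (assuming discount factor γ < 1) can be stated as:

```latex
% Bellman optimality operator H acting on a Q-table, and its contraction
% property in the sup norm (discount factor gamma < 1):
(HQ)(s,a) = \sum_{s'} P(s' \mid s,a)\,\bigl[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \bigr],
\qquad
\lVert HQ_1 - HQ_2 \rVert_\infty \le \gamma \,\lVert Q_1 - Q_2 \rVert_\infty .
```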
Sarsa: On-Policy TD Control When is Sarsa = Q-learning?
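One way to read the question: if the behaviour policy is greedy with respect to Q, the selected next action is the maximizing one and the Sarsa target coincides with the Q-learning target. A minimal sketch of the tabular Sarsa update, with illustrative names:

```python
# Tabular Sarsa update (on-policy): the target uses the action a2 actually
# selected by the behaviour policy in the next state s2, not the greedy max.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
```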
Q-Learning versus Sarsa
Q-learning is off-policy; Sarsa is on-policy.
Cliff Walking example
Actions: up, down, left, right. Reward: cliff -100, goal 0, default -1.
Action selection: ε-greedy, with ε = 0.1.
Sarsa takes the exploration into account and therefore learns the safer path away from the cliff edge.
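A minimal sketch of the ε-greedy action selection used here (ε = 0.1); the names are illustrative.

```python
import random

# Epsilon-greedy selection: explore uniformly with probability epsilon,
# otherwise pick the greedy action under the current Q estimates.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit
```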
Q-learning for CAC (Call Admission Control)
Acceptance criterion: maximize network revenue.
[Diagram: on a class-1 or class-2 call arrival the controller consults the Q-values of accepting versus rejecting, e.g. Q(s1,A1) and Q(s1,R1) in state S1 = (2,4), or Q(s3,A2) and Q(s3,R2) in state S3 = (3,3); another state shown is S2 = (3,4).]
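As an illustration only (the state encoding and the action labels below are assumptions, not the slides' notation), an admission rule based on the learned Q-values might look like:

```python
# Illustrative admission rule for the CAC example: on an arrival of a
# class-k call in state s, accept when the learned value of accepting is
# at least that of rejecting.
def admit_call(Q, s, k):
    return Q[(s, ("accept", k))] >= Q[(s, ("reject", k))]
```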
Continuous Time Q-learning for CAC [Bratke]
[Timeline diagram: call arrivals and call departures occur at event times t0 = 0, t1, t2, ..., tn; the system state (e.g. x, then y) changes at these events.]