Reinforcement Learning
Michael Roberts
With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998)
What is RL? • Trial & error learning • without a model • with a model • Structure
[Diagram: a chain of states s1 through s4 linked by transitions that yield rewards r1, r2, r3]
RL vs. Supervised Learning • Evaluative vs. Instructional feedback • Role of exploration • On-line performance
K-armed Bandit Problem
[Diagram: an agent chooses among k actions (arms), each with a history of observed rewards and its running average, e.g. rewards 0, 0, 5, 10, 35 average to 10; rewards 5, 10, -15, -15, -10 average to -5; two further arms show average rewards 100 and 0]
K-armed Bandit Cont. • Greedy exploration • ε-greedy • Softmax
Average reward: $Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$
Incremental formula: $Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right]$, where $\alpha = 1/(k+1)$
Probability of choosing action a (softmax): $P(a) = \frac{e^{Q(a)/\tau}}{\sum_b e^{Q(b)/\tau}}$, with temperature $\tau$
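To make these update rules concrete, here is a minimal Python sketch of an ε-greedy agent on a k-armed bandit using the incremental average formula above. The arm count, reward means, noise model, and constants are illustrative assumptions, not values from the slides.

```python
import random

K = 5
EPSILON = 0.1                             # exploration rate (assumed)
true_means = [0.0, 0.5, 1.0, 1.5, 2.0]    # hypothetical arm means

Q = [0.0] * K                             # estimated average reward per arm
n = [0] * K                               # pull counts per arm

for t in range(1000):
    if random.random() < EPSILON:
        a = random.randrange(K)                   # explore: random arm
    else:
        a = max(range(K), key=lambda i: Q[i])     # exploit: greedy arm
    r = random.gauss(true_means[a], 1.0)          # noisy reward sample
    n[a] += 1
    alpha = 1.0 / n[a]                            # the alpha = 1/(k+1) step size
    Q[a] += alpha * (r - Q[a])                    # incremental average update

print([round(q, 2) for q in Q])                   # Q should approach true_means
```

A softmax policy would replace the if/else with a draw weighted by $e^{Q(a)/\tau}$ instead of the hard greedy/random split.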
More General Problems • More than one state • Delayed rewards • Markov Decision Process (MDP) • Set of states • Set of actions • Reward function • State transition function • Table or Function Approximation
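To ground these components, the sketch below lays out a tabular MDP as plain Python dictionaries. The two-state example itself is hypothetical; only the structure (state set, action set, reward function, transition function) comes from the slide. Later sketches reuse these tables.

```python
# A tiny tabular MDP (hypothetical example).
states = ["s1", "s2"]
actions = ["left", "right"]

# R[(s, a)] -> expected immediate reward
R = {("s1", "left"): 0.0, ("s1", "right"): 5.0,
     ("s2", "left"): 1.0, ("s2", "right"): -1.0}

# P[(s, a)] -> {next_state: probability}; each row sums to 1
P = {("s1", "left"):  {"s1": 0.9, "s2": 0.1},
     ("s1", "right"): {"s1": 0.2, "s2": 0.8},
     ("s2", "left"):  {"s1": 1.0},
     ("s2", "right"): {"s2": 1.0}}
```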
Backup Diagram
[Diagram: a backup tree rooted at the current state, branching over actions and successor states with transition probabilities (.25/.25/.25, .4/.6, .7/.3, .5/.5) and leaf rewards 10, 5, 200, 200, -10, 1000]
Performance Metrics • Eventual convergence to optimality • Speed of convergence to optimality • Regret (Kaelbling, Littman, & Moore, 1996)
Value Iteration
Initialize V arbitrarily, e.g. $V(s) = 0$ for all $s \in S$
Repeat
  $\Delta \leftarrow 0$
  For each $s \in S$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$ (a small positive number)
Output a deterministic policy $\pi$ such that: $\pi(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,[R(s, a, s') + \gamma V(s')]$
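A minimal Python rendering of this loop, reusing the `states`, `actions`, `R`, and `P` tables from the MDP sketch above; `GAMMA` and `THETA` are assumed hyperparameters, not values from the slides.

```python
GAMMA = 0.9    # discount factor (assumed)
THETA = 1e-6   # convergence threshold, the "small positive number"

def backup(V, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())

V = {s: 0.0 for s in states}              # initialize V arbitrarily (zeros here)
while True:
    delta = 0.0
    for s in states:
        v = V[s]
        V[s] = max(backup(V, s, a) for a in actions)   # Bellman optimality backup
        delta = max(delta, abs(v - V[s]))
    if delta < THETA:                     # stop once a full sweep changes little
        break

# Greedy deterministic policy with respect to the converged V
pi = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
print(V, pi)
```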
Temporal Difference Learning • RL without a model • Issue of temporal credit assignment • Bootstraps like DP • TD(0): $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
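The sketch below applies this TD(0) update to the classic five-state random walk (states 1-5 between two terminals, reward +1 on exiting right). The environment and constants are standard illustrative assumptions, not from the slides; note that no transition model is needed, only sampled experience.

```python
import random

ALPHA, GAMMA = 0.1, 1.0
V = {s: 0.0 for s in range(7)}    # states 0..6; 0 and 6 are terminal (V stays 0)

for episode in range(2000):
    s = 3                                          # start in the middle
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))        # random-walk policy
        r = 1.0 if s_next == 6 else 0.0            # +1 only on the right exit
        # TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print({s: round(V[s], 2) for s in range(1, 6)})    # approaches 1/6 .. 5/6
```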
TD Learning
Again, TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$, with TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
TD(λ): $V(s) \leftarrow V(s) + \alpha\,\delta_t\,e_t(s)$ for all states s, where $e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbf{1}[s = s_t]$ is called an eligibility trace
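A companion sketch of TD(λ) with accumulating eligibility traces on the same random walk; `LAMBDA` and the other constants are assumptions. The trace lets one TD error update every recently visited state at once, easing temporal credit assignment.

```python
import random

ALPHA, GAMMA, LAMBDA = 0.1, 1.0, 0.8
V = {s: 0.0 for s in range(7)}

for episode in range(2000):
    e = {s: 0.0 for s in range(7)}                 # eligibility traces, per episode
    s = 3
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        delta = r + GAMMA * V[s_next] - V[s]       # TD error
        e[s] += 1.0                                # accumulating trace for s
        for x in range(7):
            V[x] += ALPHA * delta * e[x]           # credit all eligible states
            e[x] *= GAMMA * LAMBDA                 # decay every trace
        s = s_next
```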
Additional Work • POMDPs • Macros • Multi-agent RL • Multiple reward structures