Introduction to Reinforcement Learning Gerry Tesauro IBM T.J.Watson Research Center http://www.research.ibm.com/infoecon http://www.research.ibm.com/massdist
Outline • Statement of the problem: • What RL is all about • How it’s different from supervised learning • Mathematical Foundations • Markov Decision Problem (MDP) framework • Dynamic Programming: Value Iteration, ... • Temporal Difference (TD) and Q Learning • Applications: Combining RL and function approximation
Acknowledgement • Lecture material shamelessly adapted from: R. S. Sutton and A. G. Barto, “Reinforcement Learning” • Book published by MIT Press, 1998 • Available on the web at: RichSutton.com • Many slides shamelessly stolen from web site
Basic RL Framework • 1. Learning with evaluative feedback • Learner’s output is “scored” by a scalar signal (“Reward” or “Payoff” function) saying how well it did • Supervised learning: Learner is told the correct answer! • May need to try different outputs just to see how well they score (exploration …)
Basic RL Framework • 2. Learning to Act: Learning to manipulate the environment • Supervised learning is passive: Learner doesn’t affect the distribution of exemplars or the class labels
Basic RL Framework • Learner has to figure out which action is best, and which actions lead to which states. Might have to try all actions! • Exploration vs. Exploitation: when to try a “wrong” action vs. sticking to the “best” action
Basic RL Framework • 3. Learning Through Time: • Reward is delayed (Act now, reap the reward later) • Agent may take long sequence of actions before receiving reward • “Temporal Credit Assignment” Problem: Given sequence of actions and rewards, how to assign credit/blame for each action?
Agent’s objective is to maximize expected value of the “return” R_t: the discounted sum of future rewards: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … • γ is a “discount parameter” (0 ≤ γ ≤ 1) • Example: Cart-Pole Balancing Problem: • reward = -1 at failure, else 0 • expected return = -γ^k for k steps to failure ⇒ reward maximized by making k → ∞
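As a quick illustration (my own Python sketch, not from the lecture; `discounted_return` is a hypothetical helper), the discounted return can be computed by folding the reward sequence backwards:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    computed by folding the reward sequence from the end: R = r + gamma*R."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Cart-pole flavour: failure after k steps (reward -1 at failure, 0 before).
k = 10
rewards = [0.0] * (k - 1) + [-1.0]
print(discounted_return(rewards))  # = -gamma**(k-1); approaches 0 (the max) as k grows
```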
We consider non-deterministic environments: • Action a_t in state s_t • Probability distribution of rewards r_{t+1} • Probability distribution of new states s_{t+1} • Some environments have nice property: distributions are history-independent and stationary. These are called Markov environments and the agent’s task is a Markov Decision Problem (MDP)
An MDP specification consists of: • list of states s ∈ S • list of legal action sets A(s) for every s • set of transition probabilities for every s, a, s': P^a_{ss'} = Pr( s_{t+1} = s' | s_t = s, a_t = a ) • set of expected rewards for every s, a, s': R^a_{ss'} = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]
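To make the specification concrete, here is a minimal sketch (my own illustrative names and toy numbers, not from the lecture) of an MDP held in nested dictionaries:

```python
# A toy illustrative MDP: states, legal actions A(s),
# transition probs P[s][a][s'], and expected rewards R[s][a][s'].
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}

P = {  # P[s][a][s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.1, "s1": 0.9}},
    "s1": {"stay": {"s1": 1.0}},
}

R = {  # R[s][a][s'] = expected immediate reward on that transition
    "s0": {"stay": {"s0": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s1": 0.0}},
}

# Sanity check: each transition distribution sums to 1
for s in states:
    for a in actions[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```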
Given an MDP specification: • Agent learns a policy π: • deterministic policy: π(s) = action to take in state s • non-deterministic policy: π(s,a) = probability of choosing action a in state s • Agent’s objective is to learn the policy that maximizes expected value of return R_t • “Value Function” associated with a policy π tells us how good the policy is. Two types of value functions ...
State-Value Function V^π(s) = Expected return starting in state s and following policy π: V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ] • Action-Value Function Q^π(s,a) = Expected return starting from action a in state s, and then following policy π: Q^π(s,a) = E_π[ R_t | s_t = s, a_t = a ]
Bellman Equation for a Policy π • The basic idea: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = r_{t+1} + γ R_{t+1} • Apply expectation for state s under policy π: V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ] • A linear system of equations for V^π; unique solution
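Since the Bellman equations for a fixed policy are linear, V^π can be obtained with one linear solve. A minimal NumPy sketch (my own code, assuming the nested-dict MDP format from the earlier sketch and a policy given as action probabilities):

```python
import numpy as np

def evaluate_policy(states, actions, P, R, policy, gamma=0.9):
    """Solve the linear Bellman equations exactly: (I - gamma*P_pi) V = r_pi,
    where policy[s][a] = probability of taking a in s, and P, R are the
    nested dicts from the earlier MDP sketch."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.zeros(len(states))
    for s in states:
        for a in actions[s]:
            pi_sa = policy[s][a]
            for s2, p in P[s][a].items():
                A[idx[s], idx[s2]] -= gamma * pi_sa * p
                b[idx[s]] += pi_sa * p * R[s][a][s2]
    V = np.linalg.solve(A, b)
    return {s: float(V[idx[s]]) for s in states}
```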
Why V*, Q* are useful • Any policy that is greedy w.r.t. V* or Q* is an optimal policy π*. • One-step lookahead using V*: π*(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ] • Zero-step lookahead using Q*: π*(s) = argmax_a Q*(s,a)
Two methods to solve for V*, Q* • Policy improvement: given a policy π, find a better policy π'. • Policy Iteration: Keep repeating above and ultimately you will get to π*. • Value Iteration: Directly solve Bellman’s optimality equation, without explicitly writing down the policy.
Policy Improvement • Evaluate the policy: given π, compute V^π(s) and Q^π(s,a) (from linear Bellman equations). • For every state s, construct new policy: do the best initial action, and then follow policy π thereafter: π'(s) = argmax_a Q^π(s,a) • The new policy is greedy w.r.t. Q^π(s,a) and V^π(s) ⇒ V^{π'}(s) ≥ V^π(s) ⇒ π' ≥ π in our partial ordering.
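A sketch of the greedy improvement step (my own code, same illustrative nested-dict format as above); policy iteration simply alternates this with the policy evaluation sketched earlier until the policy stops changing:

```python
def greedy_improvement(states, actions, P, R, V, gamma=0.9):
    """Construct the policy that is greedy w.r.t. a value function V (dict s -> value):
    in each state pick argmax_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma*V[s'])."""
    new_policy = {}
    for s in states:
        def q(a):
            return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
        best = max(actions[s], key=q)
        # Deterministic greedy choice, written as action probabilities
        new_policy[s] = {a: (1.0 if a == best else 0.0) for a in actions[s]}
    return new_policy
```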
Policy Improvement, contd. • What if the new policy has the same value as the old policy? ( V^{π'}(s) = V^π(s) for all s ) ⇒ V^π(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ] • But this is the Bellman Optimality equation: if V^π solves it, then it must be the optimal value function V*.
Value Iteration • Use the Bellman Optimality equation to define an iterative “bootstrap” calculation: V_{k+1}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ] • This is guaranteed to converge to a unique V* (the backup is a contraction mapping)
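A minimal value-iteration sketch in the same illustrative format (the tolerance and discount here are arbitrary choices of mine):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup
    V(s) <- max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma*V(s'))
    until the largest change falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in actions[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place (Gauss-Seidel style) backup
        if delta < tol:
            return V
```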
Summary of DP methods • Guaranteed to converge to π* in polynomial time (in size of state space); in practice often faster than linear • The method of choice if you can do it. • Why it might not be doable: • your problem is not an MDP • the transition probs and rewards are unknown or too hard to specify • Bellman’s “curse of dimensionality:” the state space is too big (≫ 10^6 states) • RL may be useful in these cases
Monte Carlo Methods • Estimate V^π(s) by sampling • perform a trial: run the policy starting from s until a termination state is reached; measure the actual return R_t • N trials: average R_t accurate to ~ 1/sqrt(N) • no “bootstrapping:” not using V(s') to estimate V(s) • Two important advantages of Monte Carlo: • Can learn online without a model of the environment • Can learn in a simulated environment
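A bare-bones sketch of the Monte Carlo estimate (my own code; `run_episode` is a hypothetical function that runs the policy from s to termination and returns the observed return):

```python
def mc_estimate_value(run_episode, s, n_trials=1000):
    """Estimate V^pi(s) as the average of sampled returns.
    The standard error of the estimate shrinks roughly like 1/sqrt(n_trials)."""
    returns = [run_episode(s) for _ in range(n_trials)]
    return sum(returns) / n_trials
```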
Temporal Difference Learning • Error signal: difference between current estimate and improved estimate; drives change of current estimate • Supervised learning error: error(x) = target_output(x) - learner_output(x) • Bellman error (DP): error(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ] - V(s) = “1-step full-width lookahead” - “0-step lookahead” • Monte Carlo error: error(s) = <R_t> - V(s) = “many-step sample lookahead” - “0-step lookahead”
TD error signal • Temporal Difference Error Signal: take one step using current policy, observe r and s', then: error(s) = r + γ V(s') - V(s) = “1-step sample lookahead” - “0-step lookahead” • In particular, for undiscounted sequences with no intermediate rewards, we have simply: error(s) = V(s') - V(s) • Self-consistent prediction goal: predicted returns should be self-consistent from one time step to the next (true of both TD and DP)
Learning using the Error Signal: we could just do a reassignment: V(s) ← r + γ V(s') • But it’s often a good idea to learn incrementally: V(s) ← V(s) + α [ r + γ V(s') - V(s) ] where α is a small “learning rate” parameter (either constant, or decreases with time) • the above algorithm is known as “TD(0)”; convergence to be discussed later...
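The whole TD(0) update fits in a few lines; a tabular sketch (my own code, with V as a plain dict):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) update after observing the transition s --(r)--> s_next.
    delta = r + gamma*V(s') - V(s) is the TD error; V(s) moves a fraction
    alpha of the way toward the 1-step sample target."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta
```

It is called once per observed transition (s, r, s') while following the policy.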
Advantages of TD Learning • Combines the “bootstrapping” (1-step self-consistency) idea of DP with the “sampling” idea of MC; maybe the best of both worlds • Like MC, doesn’t need a model of the environment, only experience • TD, but not MC, can be fully incremental • you can learn before knowing the final outcome • you can learn without the final outcome (from incomplete sequences) • Bootstrapping TD has reduced variance compared to Monte Carlo, but possibly greater bias
The point of the λ parameter • (My view): λ in TD(λ) is a knob to twiddle: provides a smooth interpolation between λ=0 (pure TD) and λ=1 (pure MC) • For many toy grid-world type problems, can show that intermediate values of λ work best. • For real-world problems, best λ will be highly problem-dependent.
Convergence of TD(λ) • TD(λ) converges to the correct value function V^π(s) with probability 1 for all λ. Requires: • lookup table representation (V(s) is a table), • must visit all states an infinite # of times, • a certain schedule for decreasing α(t). (Usually α(t) ~ 1/t) • BUT: TD(λ) converges only for a fixed policy π. What if we want to learn π as well as V? We still have more work to do ...
Q-Learning: TD Idea to Learn π* • Q-Learning (Watkins, 1989): one-step sample backup to learn action-value function Q(s,a). The most important RL algorithm in use today. Uses one-step error: error = r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) to define an incremental learning algorithm: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) ] where α(t) follows same schedule as in TD algorithm.
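A tabular Q-learning sketch (my own code; Q is a dict of dicts, and the behavior policy that picked `a` can be anything sufficiently exploratory):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: bootstrap from max_{a'} Q(s', a'),
    regardless of which action the behavior policy will actually take next."""
    target = r + gamma * max(Q[s_next][a2] for a2 in actions[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```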
Nice properties of Q-learning • Q guaranteed to converge to Q* w/ probability 1. • Greedy policy w.r.t. Q guaranteed to converge to π*. • But (amazingly), don’t need to follow a fixed policy, or the greedy policy, during learning! Virtually any policy will do, as long as all (s,a) pairs are visited infinitely often. • As with TD, don’t need a model, can learn online, both bootstraps and samples.
RL and Function Approximation • DP infeasible for many real applications due to curse of dimensionality: |S| too big. • FA may provide a way to “lift the curse:” • complexity D of FA needed to capture regularity in environment may be << |S|. • no need to sweep thru entire state space: train on N “plausible” samples and then generalize to similar samples drawn from the same distribution. • PAC learning tells us generalization error ~D/N; N need only scale linearly with D.
RL + Gradient Parameter Training • Recall incremental training of lookup tables: V(s) ← V(s) + α [ r + γ V(s') - V(s) ] • If instead V(s) = V_w(s) is parameterized by weights w, adjust w to reduce the MSE (R - V(s))² by gradient descent: Δw = α [ r + γ V(s') - V(s) ] ∇_w V(s)
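A sketch of the gradient form for the simplest case, a linear approximator V_w(s) = w·φ(s), where ∇_w V(s) is just the feature vector (my own code; `features` is a hypothetical feature map returning a NumPy vector):

```python
import numpy as np

def td0_gradient_update(w, features, s, r, s_next, alpha=0.01, gamma=1.0):
    """TD(0) with a linear value function V_w(s) = w . features(s).
    The TD error stands in for (R - V(s)), and grad_w V(s) = features(s)."""
    x, x_next = features(s), features(s_next)
    delta = r + gamma * float(np.dot(w, x_next)) - float(np.dot(w, x))
    w += alpha * delta * x  # w <- w + alpha * delta * grad_w V(s)
    return delta
```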
Example: TD(λ) training of neural networks (episodic; γ = 1 and intermediate r = 0): w_{t+1} - w_t = α ( V_{t+1} - V_t ) Σ_{k=1}^{t} λ^{t-k} ∇_w V_k
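The sum over past gradients can be maintained incrementally as an eligibility trace e_t = λ e_{t-1} + ∇_w V_t. A sketch of one episode’s updates with a linear approximator standing in for the neural net (my own code; `features` returning a NumPy vector and the linear form are assumptions, a real network would backpropagate to get ∇_w V):

```python
import numpy as np

def td_lambda_episode(w, features, positions, z, alpha=0.01, lam=0.7):
    """Episodic TD(lambda) with gamma = 1 and no intermediate reward:
    each successive prediction is the target for the previous one, and the
    final outcome z is the target for the last prediction."""
    e = np.zeros_like(w)  # eligibility trace: sum_k lam^(t-k) * grad_w V_k
    V_prev = None
    for x in positions:
        phi = features(x)
        V = float(np.dot(w, phi))
        if V_prev is not None:
            w += alpha * (V - V_prev) * e  # TD error V_{t+1} - V_t
        e = lam * e + phi
        V_prev = V
    w += alpha * (z - V_prev) * e  # final step: the outcome z is the target
    return w
```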
Case-Study Applications • Several commonalities: • Problems are more-or-less MDPs • |S| is enormous ⇒ can’t do DP • State-space representation critical: use of “features” based on domain knowledge • FA is reasonably simple (linear or NN) • Train in a simulator! Need lots of experience, but still << |S| • Only visit plausible states; only generalize to plausible states
Learning backgammon using TD(λ) • Neural net observes a sequence of input patterns x_1, x_2, x_3, …, x_f : sequence of board positions occurring during a game • Representation: Raw board description (# of White or Black checkers at each location) using simple truncated unary encoding. (“hand-crafted features” added in later versions) • At final position x_f, reward signal z given: • z = 1 if White wins; • z = 0 if Black wins • Train neural net using gradient version of TD(λ) • Trained NN output V_t = V(x_t, w) should estimate prob( White wins | x_t )
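For illustration only, a generic truncated unary encoder of a checker count (my own sketch; not necessarily the exact unit layout or scaling used in the actual program):

```python
def truncated_unary(n, width=4):
    """Encode a non-negative count n into `width` units: the first width-1
    units switch on one at a time (n >= 1, n >= 2, ...), and the last unit
    carries the scaled overflow beyond that threshold."""
    units = [1.0 if n >= i else 0.0 for i in range(1, width)]
    units.append(max(0.0, (n - (width - 1)) / 2.0))
    return units

# truncated_unary(0) -> [0.0, 0.0, 0.0, 0.0]
# truncated_unary(2) -> [1.0, 1.0, 0.0, 0.0]
# truncated_unary(5) -> [1.0, 1.0, 1.0, 1.0]
```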
Q: Who makes the moves?? • A: Let neural net make the moves itself, using its current evaluator: score all legal moves, and pick max V_t for White, or min V_t for Black. • Hopelessly non-theoretical and crazy: • Training V^π using a non-stationary π (no convergence proof) • Training V using nonlinear func. approx. (no cvg. proof) • Random initial weights ⇒ Random initial play! ⇒ Extremely long sequence of random moves and random outcome ⇒ Learning seems hopeless to a human observer • But what the heck, let’s just try and see what happens...