A Finite Sample Upper Bound on the Generalization Error for Q-Learning S.A. Murphy Univ. of Michigan CALD: February, 2005
Outline • Two Examples • Q-functions & Q-Learning • The Goal • Finite Sample Bounds • Discussion
Two Examples • Treatment: Managing Drug Dependence, Mental Illness, HIV infection • Preventive Intervention: Increasing and Maintaining Activity • Both are multi-stage decision problems: repeated decisions are made over time on each subject.
Managing Drug Dependence • Goal is to reduce long-term abuse. • Individuals present in an acute stage. • What should the first treatment be? (Medication? Psychosocial therapy?) • At what point do we say the individual is not responding? • What should the second treatment be for nonresponders? (Medication? Psychosocial therapy?) • What should the second treatment be for responders? • What information should be used in making these decisions?
Improving Activity • Goal is to maximize long-term weekly step counts. • Physically inactive individuals. • How should the web coach set the weekly goal? • What framing should the web coach use in providing feedback on the past week’s performance? • What information should the web coach (policy) use to make these decisions?
Commonalities • One training set of finite-horizon trajectories. • Actions were made according to a known stochastic policy. • System dynamics are poorly understood. • Policy class is restricted: • Explicitly constrained to be interpretable, and/or • Implicitly constrained because of function approximation.
T+1 Decisions • Observations made prior to the tth decision time (a vector of continuous and discrete variables) • Action at the tth decision time • Reward at the tth decision time
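A minimal sketch of standard notation for this setting; the symbols and the history vectors below are assumptions, not the talk's own notation:

```latex
% Assumed notation: for t = 0, 1, ..., T,
%   O_t : observations made prior to the t-th decision,
%   A_t : action at the t-th decision,
%   R_t : reward at the t-th decision.
\bar{O}_t = (O_0, \dots, O_t), \qquad \bar{A}_t = (A_0, \dots, A_t)
```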
The Goal Given a training set of n trajectories, estimate the policy that maximizes the mean of the sum of rewards over the policy class.
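A minimal sketch of the training data and the criterion under the notation above (the exact form is an assumption):

```latex
% Training set: n finite-horizon trajectories of the form
(O_0, A_0, R_0, \; O_1, A_1, R_1, \; \dots, \; O_T, A_T, R_T)
% Goal: estimate the policy \pi in the class \Pi maximizing
E_{\pi}\Big[\sum_{t=0}^{T} R_t\Big]
```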
Q-functions & Q-Learning Notation: one expectation operator denotes expectation when the actions are chosen according to the policy of interest; a second denotes expectation when the actions are chosen according to the stochastic exploration policy that generated the training data.
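A minimal sketch of the two expectation operators, assuming the exploration policy is written p (both symbols are assumptions):

```latex
E_{\pi}[\,\cdot\,]  % actions chosen according to the policy \pi
E_{p}[\,\cdot\,]    % actions chosen according to the stochastic exploration policy p
```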
Q-functions The Q-functions for a policy are given recursively, for t = T, T-1, …, by the relation sketched below.
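One plausible form of the recursion for a fixed (deterministic) policy, written for history-dependent Q-functions; the notation is an assumption:

```latex
Q_T^{\pi}(\bar{o}_T, \bar{a}_T) = E\big[R_T \mid \bar{O}_T = \bar{o}_T, \, \bar{A}_T = \bar{a}_T\big]
% and, for t = T-1, ..., 0,
Q_t^{\pi}(\bar{o}_t, \bar{a}_t) = E\big[R_t
    + Q_{t+1}^{\pi}\big(\bar{O}_{t+1}, \bar{A}_t, \pi_{t+1}(\bar{O}_{t+1}, \bar{A}_t)\big)
    \mid \bar{O}_t = \bar{o}_t, \, \bar{A}_t = \bar{a}_t\big]
```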
Q-functions The Q-functions for an optimal policy are given recursively, for t = T, T-1, …, by the relation sketched below.
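One plausible form of the recursion for the optimal Q-functions, with the same caveats as above:

```latex
Q_T^{*}(\bar{o}_T, \bar{a}_T) = E\big[R_T \mid \bar{O}_T = \bar{o}_T, \, \bar{A}_T = \bar{a}_T\big]
% and, for t = T-1, ..., 0,
Q_t^{*}(\bar{o}_t, \bar{a}_t) = E\big[R_t
    + \max_{a} Q_{t+1}^{*}(\bar{O}_{t+1}, \bar{A}_t, a)
    \mid \bar{O}_t = \bar{o}_t, \, \bar{A}_t = \bar{a}_t\big]
```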
Q-functions An optimal policy (optimal overall, not only over the restricted class) is given by the greedy rule sketched below.
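A sketch of the greedy rule, again under the assumed notation:

```latex
\pi_t^{*}(\bar{o}_t, \bar{a}_{t-1}) \in \arg\max_{a} \, Q_t^{*}(\bar{o}_t, \bar{a}_{t-1}, a),
\qquad t = 0, \dots, T
```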
Q-learning with finite-horizon trajectories Given an approximation space for the Q-functions, minimize a least-squares criterion at time T over the approximation space and set the time-T Q-function estimate to the minimizer (sketched below).
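A sketch of the time-T step, written as a sample least-squares problem over an approximation space; averaging over the n trajectories in this way is an assumption:

```latex
\hat{Q}_T \in \arg\min_{Q \in \mathcal{Q}_T} \;
  \frac{1}{n} \sum_{i=1}^{n} \big( R_{T,i} - Q(\bar{O}_{T,i}, \bar{A}_{T,i}) \big)^2
```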
Q-Learning For each t = T-1, …, 0, minimize the corresponding least-squares criterion over the approximation space, set the time-t Q-function estimate to the minimizer, and so on (sketched below).
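A sketch of the backwards-induction step for t < T, with the same caveats:

```latex
% for t = T-1, ..., 0:
\hat{Q}_t \in \arg\min_{Q \in \mathcal{Q}_t} \;
  \frac{1}{n} \sum_{i=1}^{n} \Big( R_{t,i}
    + \max_{a} \hat{Q}_{t+1}(\bar{O}_{t+1,i}, \bar{A}_{t,i}, a)
    - Q(\bar{O}_{t,i}, \bar{A}_{t,i}) \Big)^2
```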
Q-Learning The estimated policy is the greedy policy with respect to the estimated Q-functions, as sketched below.
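A sketch of the estimated policy as the greedy rule over the fitted Q-functions:

```latex
\hat{\pi}_t(\bar{o}_t, \bar{a}_{t-1}) \in \arg\max_{a} \, \hat{Q}_t(\bar{o}_t, \bar{a}_{t-1}, a),
\qquad t = 0, \dots, T
```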
The Goal Approximate each Q-function by a linear combination of k features. This implicitly constrains the class of policies; call this the constrained class.
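To make the algorithm and the linear parameterization concrete, here is a minimal, self-contained Python sketch of batch Q-learning with linear features on finite-horizon trajectories. It is an illustration under simplifying assumptions (a Markov observation in place of the full history, made-up feature maps and synthetic data); it is not the implementation from the talk or paper.

```python
"""Sketch: batch Q-learning (backwards induction) with linear features."""
import numpy as np

def fit_q_learning(obs, actions, rewards, n_actions, featurize):
    """
    obs:      array (n, T+1, d) of observations for n trajectories, T+1 decision times
    actions:  array (n, T+1) of actions chosen by the exploration policy (integers)
    rewards:  array (n, T+1) of rewards
    featurize(t, o, a): length-k feature vector for time t, observation o, action a
    Returns a list of weight vectors theta[t], one per decision time.
    """
    n, T_plus_1, _ = obs.shape
    T = T_plus_1 - 1
    theta = [None] * T_plus_1
    # Backwards induction: regress the (bootstrapped) target on the features.
    for t in range(T, -1, -1):
        X = np.array([featurize(t, obs[i, t], actions[i, t]) for i in range(n)])
        y = rewards[:, t].astype(float)
        if t < T:
            # Add the value of the greedy action at time t+1 under the current fit.
            for i in range(n):
                y[i] += max(featurize(t + 1, obs[i, t + 1], a) @ theta[t + 1]
                            for a in range(n_actions))
        # Least-squares fit (assumes the feature design is well conditioned).
        theta[t], *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def greedy_action(theta, featurize, t, o, n_actions):
    """Estimated policy: the action maximizing the fitted Q-function at time t."""
    return max(range(n_actions), key=lambda a: featurize(t, o, a) @ theta[t])

if __name__ == "__main__":
    # Tiny synthetic example: 2 actions, scalar observation, T+1 = 3 decisions.
    rng = np.random.default_rng(0)
    n, T1, n_actions = 200, 3, 2
    obs = rng.normal(size=(n, T1, 1))
    actions = rng.integers(0, n_actions, size=(n, T1))   # uniform exploration policy
    rewards = obs[:, :, 0] * (2 * actions - 1) + rng.normal(scale=0.1, size=(n, T1))

    def featurize(t, o, a):
        # One intercept and one slope per action (k = 4 features).
        x = np.zeros(2 * n_actions)
        x[2 * a] = 1.0
        x[2 * a + 1] = o[0]
        return x

    theta = fit_q_learning(obs, actions, rewards, n_actions, featurize)
    print("Greedy action at o=+1, t=0:",
          greedy_action(theta, featurize, 0, np.array([1.0]), n_actions))
```

Because the estimated policy acts through a greedy rule on the fitted linear Q-functions, restricting the features implicitly restricts the policy class, which is the point of this slide.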
Goal: Given a learning algorithm and approximation classes, assess the ability of the learning algorithm to produce the best policy in the class. Construct an upper bound on how far the mean total reward of the estimated policy falls below that of the best policy in the class.
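One natural reading of the quantity being bounded, consistent with the stated goal; the symbols and exact form are assumptions:

```latex
\max_{\pi \in \Pi} E_{\pi}\Big[\sum_{t=0}^{T} R_t\Big]
  \; - \; E_{\hat{\pi}}\Big[\sum_{t=0}^{T} R_t\Big]
% \hat{\pi} is the estimated policy; E_{\pi} is expectation with actions chosen by \pi.
```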
We can expect that our estimator of the Q-functions will be close to a projection onto the approximation space if the training set is large. The projection is defined at time T and then recursively for t = T-1, …, 0.
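One plausible form of the projection, written as population least squares under the exploration policy; this choice is an assumption made to mirror the sample criteria above:

```latex
\tilde{Q}_T \in \arg\min_{Q \in \mathcal{Q}_T}
   E_{p}\big[\big(R_T - Q(\bar{O}_T, \bar{A}_T)\big)^2\big]
% and, for t = T-1, ..., 0,
\tilde{Q}_t \in \arg\min_{Q \in \mathcal{Q}_t}
   E_{p}\Big[\big(R_t + \max_{a} \tilde{Q}_{t+1}(\bar{O}_{t+1}, \bar{A}_t, a)
       - Q(\bar{O}_t, \bar{A}_t)\big)^2\Big]
```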
Finite Sample Bounds Primary Assumptions: (1) a certain matrix is invertible for each t; (2) the number of possible actions is finite; (3) a boundedness condition holds for a constant L > 1.
With probability at least 1 - δ, the bound holds for n satisfying a complexity constraint; the constraint involves the size of the action space and k, the number of features.
The message The goal of the Q-learning algorithm is to produce estimated Q-functions that are close to the projection. This is different from the goal of producing a policy that will maximize the mean total reward.
For the relevant pair of policies, with probability at least 1 - δ, the corresponding bound holds for all n satisfying the complexity constraint.
Suppose there is a policy in the class for which the mean total reward is maximal. Then, with probability at least 1 - δ, the corresponding bound holds for n satisfying the complexity constraint from before.
The bound involves an approximation error term: if this term is zero, then members of the approximation space are arbitrarily close to the optimal Q-functions (optimal overall, not just within the restricted class).
For the relevant pair of policies, with probability at least 1 - δ, the corresponding bound holds for n satisfying the complexity constraint from before.
Difference in values can be related to Q-functions: Kakade (2003)
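For deterministic policies, a relation of this kind is the finite-horizon performance-difference identity in the spirit of Kakade (2003); the form below is a sketch under the assumed notation:

```latex
E_{\pi}\Big[\sum_{t=0}^{T} R_t\Big] - E_{\pi'}\Big[\sum_{t=0}^{T} R_t\Big]
 = \sum_{t=0}^{T} E_{\pi}\Big[
     Q_t^{\pi'}\big(\bar{O}_t, \bar{A}_{t-1}, \pi_t(\bar{O}_t, \bar{A}_{t-1})\big)
   - Q_t^{\pi'}\big(\bar{O}_t, \bar{A}_{t-1}, \pi'_t(\bar{O}_t, \bar{A}_{t-1})\big)\Big]
```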
The message The goal of the Q-learning algorithm is to produce estimated Q-functions that are close to the projection. This is different from the goal of producing a policy whose value is close to the value of the best policy.
The message Using function approximation in Q-learning provides a way to add information to the data, but the price is bias. Other methods that add information (e.g., modeling the dynamics) can be expected to incur a bias as well.
Discussion • When the information in the training set is small relative to the observation space, parameterizing the Q-functions is one way to add information. But how can one reduce the bias of Q-learning? • Policy search with importance weights? Low bias but high variance across training sets; Q-learning has lower variance but higher bias. • Can we construct algorithms that tell us when we must add information to the training set so as to reduce variability? • What kinds of statistical tests would be most useful in assessing whether the estimated policy is better than an ad hoc policy?
This seminar can be found at: http://www.stat.lsa.umich.edu/~samurphy/seminars/cald0205.ppt The paper can be found at: http://www.stat.lsa.umich.edu/~samurphy/papers/Qlearning.pdf samurphy@umich.edu
A policy search method with importance sampling weights would employ a variant of the weighted value estimate sketched below.
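A minimal sketch of the importance-weighted value estimate such a method would maximize over the policy class, assuming a deterministic candidate policy and a known exploration policy p_t; the notation is an assumption:

```latex
\hat{V}(\pi) = \frac{1}{n} \sum_{i=1}^{n}
   \Bigg( \prod_{t=0}^{T}
      \frac{\mathbf{1}\{A_{t,i} = \pi_t(\bar{O}_{t,i}, \bar{A}_{t-1,i})\}}
           {p_t(A_{t,i} \mid \bar{O}_{t,i}, \bar{A}_{t-1,i})} \Bigg)
   \sum_{t=0}^{T} R_{t,i}
```

Policy search would then select the policy in the class maximizing this weighted average, which is low bias but can be highly variable when the weights are large.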