This talk discusses the knowledge gradient policy for sequential off-line learning and presents theoretical and numerical results. It also introduces several sampling problems and compares different exploration and exploitation techniques.
Sequential Off-line Learning with Knowledge Gradients Peter Frazier Warren Powell Savas Dayanik Department of Operations Research and Financial Engineering Princeton University
Overview • Problem Formulation • Knowledge Gradient Policy • Theoretical Results • Numerical Results
Measurement Phase • [Figure: alternatives 1, 2, 3, ..., M, each producing noisy experimental outcomes] • N opportunities to do experiments
Implementation Phase • [Figure: choose one of alternatives 1, ..., M and collect its reward]
Taxonomy of Sampling Problems • [Figure: taxonomy tree of sampling problems]
Example 1 • One common experimental design is to spread measurements equally across the alternatives. • [Figure: quality vs. alternative]
Round Robin Exploration Example 1
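A minimal sketch, in Python, of what round-robin exploration looks like (my own code, not from the slides): measure the alternatives in a fixed cyclic order, ignoring everything learned so far.

def round_robin_policy(n, M):
    # Measure alternative n mod M at time n, cycling through all M alternatives.
    return n % M

# Example: with M = 5 alternatives and N = 12 measurements,
# each alternative is measured 2 or 3 times.
print([round_robin_policy(n, 5) for n in range(12)])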
Example 2 • How might we improve round-robin exploration for use with this prior?
Largest Variance Exploration Example 2
Example 3 • Exploitation: measure the alternative that currently appears best, i.e. the one with the largest estimated mean.
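A minimal sketch of this pure-exploitation rule (hypothetical helper, not from the slides): always measure the alternative with the largest current estimate, ignoring uncertainty.

import numpy as np

def exploitation_policy(mu):
    # mu[x] = current estimate (posterior mean) of the value of alternative x
    return int(np.argmax(mu))

print(exploitation_policy(np.array([-0.2, 1.3, 0.4])))  # -> 1, the index of the largest estimate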
Model • x^n is the alternative tested at time n. • Measuring alternative x^n returns the noisy observation ŷ^{n+1} = Y_{x^n} + ε^{n+1}. • At time n, Y_x ~ N(μ^n_x, (σ^n_x)²), independent across alternatives. • The error ε^{n+1} is independent N(0, (σ_ε)²). • At time N, choose an alternative. • Maximize E[max_x μ^N_x].
State Transition • At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the measurement: μ^{n+1}_{x^n} = (β^n_{x^n} μ^n_{x^n} + β_ε ŷ^{n+1}) / (β^n_{x^n} + β_ε), where β = 1/σ² denotes precision. • Estimates of the other Y_x do not change.
• Let σ̃^n_x be the standard deviation of the change in our best estimate of Y_x due to the measurement: (σ̃^n_x)² = (σ^n_x)² − (σ^{n+1}_x)², i.e. the uncertainty about Y_x before the measurement minus the uncertainty about Y_x after the measurement. • The value of the optimal policy satisfies Bellman's equation, V^n(S^n) = max_x E[V^{n+1}(S^{n+1}) | x^n = x]. • At time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^n_x)².
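A short sketch of the belief update and of σ̃^n_x under the normal model above; the function names and array layout are my own.

import numpy as np

def update_belief(mu, sigma2, x, y, noise_var):
    # Posterior update after observing y = Y_x + noise for the measured alternative x.
    # mu, sigma2: arrays of posterior means and variances at time n.
    mu, sigma2 = mu.copy(), sigma2.copy()
    beta_prior, beta_eps = 1.0 / sigma2[x], 1.0 / noise_var   # precisions
    beta_post = beta_prior + beta_eps
    mu[x] = (beta_prior * mu[x] + beta_eps * y) / beta_post   # precision-weighted average
    sigma2[x] = 1.0 / beta_post                               # uncertainty after the measurement
    return mu, sigma2

def sigma_tilde(sigma2_x, noise_var):
    # (sigma_tilde)^2 = variance before the measurement minus variance after it.
    return np.sqrt(sigma2_x - 1.0 / (1.0 / sigma2_x + 1.0 / noise_var))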
Utility of Information • Consider our "utility of information" U^n = max_x μ^n_x, and consider the random change in utility, U^{n+1} − U^n, due to a measurement at time n.
Knowledge Gradient Definition • The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility: x^{KG,n} = argmax_x E[U^{n+1} − U^n | x^n = x].
Knowledge Gradient • We may compute the knowledge gradient policy via ν^{KG,n}_x = E[max(μ^{n+1}_x, max_{x'≠x} μ^n_{x'}) | x^n = x] − max_{x'} μ^n_{x'}; the first term is the expectation of the maximum of a normal and a constant.
Knowledge Gradient • The computation becomes ν^{KG,n}_x = σ̃^n_x f(ζ^n_x), where Φ is the normal cdf, φ is the normal pdf, f(z) = z Φ(z) + φ(z), and ζ^n_x = −|μ^n_x − max_{x'≠x} μ^n_{x'}| / σ̃^n_x.
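The formula above translates directly into code. A hedged sketch (names are mine; scipy is assumed for the normal cdf and pdf):

import numpy as np
from scipy.stats import norm

def knowledge_gradient_factors(mu, sigma2, noise_var):
    # nu[x] = sigma_tilde_x * f(zeta_x), with f(z) = z*Phi(z) + phi(z)
    s_tilde = np.sqrt(sigma2 - 1.0 / (1.0 / sigma2 + 1.0 / noise_var))
    nu = np.empty(len(mu))
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))                 # max over x' != x
        zeta = -abs(mu[x] - best_other) / s_tilde[x]
        nu[x] = s_tilde[x] * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
    return nu

# The KG policy measures the alternative with the largest factor:
# x_kg = int(np.argmax(knowledge_gradient_factors(mu, sigma2, noise_var)))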
Optimality Results • If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.
Optimality Results • The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity. • This is really a convergence result: given enough measurements, the policy eventually identifies the best alternative.
Optimality Results • The sub-optimality of the knowledge gradient policy, V^n − V^{KG,n}, is bounded, where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.
Optimality Results • If there are exactly 2 alternatives (M=2), the knowledge gradient policy is optimal. • In this case, the optimal policy reduces to measuring the alternative whose current belief has the larger variance.
Optimality Results • If there is no measurement noise and the alternatives can be reordered so that their prior means and variances satisfy a certain monotone ordering, then the knowledge gradient policy is optimal.
Numerical Experiments • 100 randomly generated problems • M ~ Uniform{1, ..., 100} • N ~ Uniform{M, 3M, 10M} • μ^0_x ~ Uniform[−1, 1] • (σ^0_x)² = 1 with probability 0.9, 10⁻³ with probability 0.1 • σ_ε = 1
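A small sketch of how one such random test problem might be drawn, using the distributions listed above (the function name and structure are my own):

import numpy as np

def random_problem(rng):
    M = int(rng.integers(1, 101))                          # M ~ Uniform{1, ..., 100}
    N = int(M * rng.choice([1, 3, 10]))                    # N ~ Uniform{M, 3M, 10M}
    mu0 = rng.uniform(-1.0, 1.0, size=M)                   # prior means mu^0_x
    sigma0_sq = np.where(rng.random(M) < 0.9, 1.0, 1e-3)   # prior variances (sigma^0_x)^2
    noise_std = 1.0                                        # measurement noise sigma_eps
    return M, N, mu0, sigma0_sq, noise_std

M, N, mu0, sigma0_sq, noise_std = random_problem(np.random.default_rng(0))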
Interval Estimation • Compare alternatives via a linear combination of mean and standard deviation, μ^n_x + z_{α/2} σ^n_x, and measure the one with the largest value. • The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
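A one-function sketch of the IE rule above (my own naming; z_alpha plays the role of z_{α/2}):

import numpy as np

def interval_estimation_policy(mu, sigma2, z_alpha=2.0):
    # Score each alternative by mean + z_alpha standard deviations; measure the best score.
    return int(np.argmax(mu + z_alpha * np.sqrt(sigma2)))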
KG / IE Comparison • [Figure: Value of KG − Value of IE]
IE and "Sticking" • [Figure: example in which alternative 1 is known perfectly]
Thank You Any Questions?
Boltzmann Exploration • Parameterized by a declining sequence of temperatures T^0, ..., T^{N−1}.
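A hedged sketch of Boltzmann (softmax) exploration in its standard form, which the slide does not spell out: at time n, measure alternative x with probability proportional to exp(μ^n_x / T^n).

import numpy as np

def boltzmann_policy(mu, T_n, rng):
    # Sample an alternative with probability proportional to exp(mu_x / T_n).
    logits = mu / T_n
    p = np.exp(logits - logits.max())   # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(mu), p=p))

# Example: a geometrically declining temperature schedule (my own choice):
# T = [1.0 * 0.9 ** n for n in range(N)]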