An Asymptotically Optimal Algorithm for the Max k-Armed Bandit Problem
Matthew Streeter & Stephen Smith, Carnegie Mellon University
NESCAI, April 29, 2006
Outline • Problem statement & motivations • Modeling payoff distributions • An asymptotically optimal algorithm
The k-Armed Bandit [figure: three slot machines with payoff distributions D1, D2, D3] • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution Di • You are allowed n total pulls • Goal: maximize the total payoff • > 50 years of papers
The Max k-Armed Bandit [figure: the same three slot machines with payoff distributions D1, D2, D3] • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution Di • You are allowed n total pulls • Goal: maximize the highest payoff • Introduced ~2003
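To make the setting concrete, here is a minimal Python sketch of the max k-armed bandit loop. The arm distributions, the `round_robin` strategy, and all function names are illustrative assumptions, not from the talk:

```python
import random

def max_k_armed(pull_fns, strategy, n):
    """Run `strategy` for n pulls and return the single highest payoff seen.
    pull_fns[i]() draws one payoff from arm i's unknown distribution D_i."""
    best, history = float("-inf"), []
    for _ in range(n):
        i = strategy(history, len(pull_fns))   # strategy picks an arm
        payoff = pull_fns[i]()                 # one pull of that arm
        history.append((i, payoff))
        best = max(best, payoff)               # objective: the max, not the sum
    return best

# Illustrative arms and a naive strategy that cycles through the arms.
arms = [lambda: random.gauss(1.0, 0.1), lambda: random.gauss(0.0, 1.0)]
round_robin = lambda history, k: len(history) % k
print(max_k_armed(arms, round_robin, 1000))
```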
The Max k-Armed Bandit: Motivations [figure: randomized heuristics (tabu search, hill climbing, simulated annealing) with solution-quality distributions D1, D2, D3] • Given: some optimization problem and k randomized heuristics • Each run of a heuristic yields a solution of a certain quality • Allowed n runs (assumption: each run has the same computational cost) • Goal: maximize the quality of the best solution • Cicirello & Smith (2005) show competitive performance on the RCPSP
The Max k-Armed Bandit: Example • Given n pulls, what strategy maximizes the (expected) maximum payoff? • If n=1, should pull arm 1 (higher mean) • If n=1000, should pull arm 2 (higher variance)
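The slide's payoff distributions are not recoverable from the export; assuming, for illustration, that arm 1 pays N(1, 0.1) and arm 2 pays N(0, 1), a quick Monte Carlo confirms the crossover:

```python
import random

def expected_max(mu, sigma, n, trials=2000):
    """Monte Carlo estimate of E[max of n draws from N(mu, sigma)]."""
    return sum(max(random.gauss(mu, sigma) for _ in range(n))
               for _ in range(trials)) / trials

for n in (1, 1000):
    print(n, expected_max(1.0, 0.1, n), expected_max(0.0, 1.0, n))
# n=1:    arm 1 wins (~1.00 vs ~0.00)
# n=1000: arm 2 wins (~1.32 vs ~3.24)
```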
Can’t Handle Arbitrary Payoff Distributions • Needle in the haystack: suppose one arm occasionally pays 1 and every other arm always pays 0; you can’t distinguish the arms until you get a payoff > 0, at which point the highest payoff can’t be improved
Assumption • We will assume each machine returns payoffs drawn from a generalized extreme value (GEV) distribution

Why? • Extremal Types Theorem: the maximum of n independent draws from some fixed distribution converges in distribution to a GEV • Compare to the Central Limit Theorem: the sum of n draws converges in distribution to a Gaussian
The GEV distribution • Z has a GEV distribution if $\Pr[Z \le z] = \exp\!\left(-\left(1 + s\,\tfrac{z-\mu}{\sigma}\right)^{-1/s}\right)$ for constants s, μ, and σ > 0 • μ determines the mean • σ determines the standard deviation • s determines the shape
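One way to see the three parameters in action is scipy's GEV implementation. Note that `scipy.stats.genextreme` uses shape c = −s relative to the convention above; the parameter values below are arbitrary:

```python
from scipy.stats import genextreme

mu, sigma = 0.0, 1.0
for s in (-0.2, 0.0, 0.2):
    # scipy's shape parameter c is the negative of the slide's s
    dist = genextreme(c=-s, loc=mu, scale=sigma)
    print(f"s={s:+.1f}  mean={dist.mean():.3f}  std={dist.std():.3f}")
```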
Example payoff distribution: Job Shop Scheduling • Job shop scheduling: assign start times to operations, subject to constraints. • Length of schedule = latest completion time of any operation • Goal: find a schedule with minimum length • Many heuristics (branch and bound, simulated annealing...)
Example payoff distribution: Job Shop Scheduling • “ft10” is a notorious instance of the job shop scheduling problem • Heuristic h: do hill-climbing 500 times • Ran h 1000 times on ft10; fit a GEV to the payoff data, where payoff = −(schedule length), so maximizing payoff minimizes length
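The fit itself is a one-liner with scipy. The data below is a synthetic stand-in, since the actual ft10 payoffs are of course not in the deck:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
payoffs = -rng.normal(990, 15, size=1000)  # synthetic stand-in for -(length)

c, loc, scale = genextreme.fit(payoffs)    # maximum-likelihood GEV fit
print(f"shape={c:.3f}  location={loc:.1f}  scale={scale:.2f}")
```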
Example payoff distribution: Job Shop Scheduling [figure: fitted GEV density (probability vs. −(schedule length)) and E[max payoff] vs. number of runs] • The fitted distribution is truncated at 931; the optimal schedule length is 930 (Carlier & Pinson, 1986) • The best of 50,000 sampled schedules has length 1014
Notation • $m_i(t)$ = expected maximum payoff from pulling the i-th arm t times • $m^*(t) = \max_{1 \le i \le k} m_i(t)$ • $m_S(t)$ = expected maximum payoff obtained by following strategy S for t pulls
The Algorithm • Strategy S* (δ and ε to be determined): • For i from 1 to k: using D pulls, estimate $m_i(n)$; pick D so that with probability 1 − δ, the estimate is within ε of the true $m_i(n)$ • For the remaining n − kD pulls: pull the arm with the maximum estimated $m_i(n)$ • Guarantee: $m_{S^*}(n) = m^*(n) - o(1)$
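Here is a hedged Python sketch of S* for the s = 0 case. It simplifies in two ways: it uses plain sample means rather than the median-of-means estimator the analysis requires, and D is a placeholder budget rather than the exact constant from the proof:

```python
import math

def s_star(pull_fns, n, epsilon, delta):
    k = len(pull_fns)
    # Placeholder exploration budget per arm (the proof pins down the constant).
    D = max(2, min(n // (2 * k),
                   int((math.log(n) ** 2 / epsilon ** 2) * math.log(1 / delta))))
    best_seen, predicted = float("-inf"), []
    for pull in pull_fns:
        xs = [pull() for _ in range(D)]
        best_seen = max(best_seen, max(xs))
        m1 = sum(xs) / D                                  # estimate of m_i(1)
        m2 = sum(max(xs[j], xs[j + 1])                    # estimate of m_i(2)
                 for j in range(0, D - 1, 2)) / (D // 2)
        predicted.append(m1 + (m2 - m1) * math.log2(n))   # extrapolated m_i(n)
    best_arm = max(range(k), key=lambda i: predicted[i])
    for _ in range(n - k * D):                            # exploit the rest
        best_seen = max(best_seen, pull_fns[best_arm]())
    return best_seen
```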
The GEV distribution • Z has a GEV distribution if $\Pr[Z \le z] = \exp\!\left(-\left(1 + s\,\tfrac{z-\mu}{\sigma}\right)^{-1/s}\right)$ for constants s, μ, and σ > 0 • μ determines the mean • σ determines the standard deviation • s determines the shape
Behavior of the GEV [figure: the expected maximum in the three shape regimes: s > 0 (“lots of algebra”), s = 0 (“not so bad”), s < 0]
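For the s = 0 (Gumbel) case, the key piece of algebra (a standard fact, stated here for context) is that the maximum $M_t$ of t independent Gumbel(μ, σ) draws is itself Gumbel, with the location shifted by σ ln t:

```latex
\Pr[M_t \le z] = \exp\!\left(-e^{-(z-\mu)/\sigma}\right)^{t}
             = \exp\!\left(-e^{-(z-\mu-\sigma\ln t)/\sigma}\right),
\qquad
\mathbb{E}[M_t] = \mu + \sigma\gamma + \sigma\ln t
```

Here γ is the Euler–Mascheroni constant, so $m_i(t)$ is linear in ln t, which is exactly what the interpolation on the next slides exploits.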
Predicting $m_i(n)$ [figure: line through the empirical $m_i(1)$ and $m_i(2)$, extended to the predicted $m_i(n)$] • Estimation procedure: linear interpolation! • Estimate $m_i(1)$ and $m_i(2)$, then interpolate to get $m_i(n)$
Predicting $m_i(n)$: Lemma • Let X be a random variable with (unknown) mean μ and standard deviation at most $\sigma_{\max}$. Then $O(\epsilon^{-2} \log \delta^{-1})$ samples of X suffice to obtain an estimate that, with probability at least 1 − δ, is within ε of the true mean (the hidden constant depends on $\sigma_{\max}$) • Proof idea: use “median of means”
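A generic median-of-means estimator, as a sketch of the proof idea (the function and parameter names are mine):

```python
import random, statistics

def median_of_means(draw, num_groups, group_size):
    """Average within each group, then take the median of the group means;
    the median is far less sensitive to heavy-tailed outliers than one big mean."""
    means = [statistics.fmean(draw() for _ in range(group_size))
             for _ in range(num_groups)]
    return statistics.median(means)

# Heavy-tailed example: Pareto(3) has true mean 1.5 but occasional huge draws.
print(median_of_means(lambda: random.paretovariate(3.0), 9, 100))
```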
Predicting $m_i(n)$ [figure: empirical $m_i(1)$ and $m_i(2)$ with the fitted line extended to the predicted $m_i(n)$] • Equation for the line: $m_i(n) = m_i(1) + [m_i(2) - m_i(1)] \log_2 n$ • The extrapolation multiplies estimation error by a factor of O(log n), so $m_i(1)$ and $m_i(2)$ must be estimated to within ε/O(log n); estimating $m_i(n)$ to within ε therefore requires $O((\log n)^2\, \epsilon^{-2} \log \delta^{-1})$ pulls
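As a sanity check of the two-point extrapolation: a Gumbel(0, 1) arm has m(t) = γ + ln t exactly, and the formula reproduces it:

```python
import math

def predict_m(m1, m2, n):
    """Extrapolate m_i(n) from m_i(1) and m_i(2); exact when s = 0,
    since m_i(t) is then linear in log t."""
    return m1 + (m2 - m1) * math.log2(n)

gamma = 0.5772156649                  # Euler-Mascheroni constant
m1, m2 = gamma, gamma + math.log(2)   # exact m(1), m(2) for Gumbel(0, 1)
print(predict_m(m1, m2, 1024))        # = gamma + ln(1024), about 7.51
```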
The Algorithm • Strategy S* (δ and ε to be determined): • For i from 1 to k: using D pulls, estimate $m_i(n)$; pick D so that with probability 1 − δ, the estimate is within ε of the true $m_i(n)$ • For the remaining n − kD pulls: pull the arm with the maximum predicted $m_i(n)$ • Guarantee: $m_{S^*}(n) = m^*(n) - o(1)$ • Three things make S* less than optimal: • with probability up to kδ, some estimate is bad • even a good estimate can be off by ε • the kD estimation pulls forfeit $m^*(n) - m^*(n - kD)$
Analysis • Three things make S* less than optimal: the kδ chance of a bad estimate, the ε estimation error, and $m^*(n) - m^*(n - kD)$ • Setting δ = n⁻² and ε = n⁻¹ᐟ³ takes care of the first two. For the third, since $m^*(t)$ grows like O(log t) when s = 0:
$m^*(n) - m^*(n - kD) = O(\log n - \log(n - kD)) = O(kD/n) = O(k (\log n)^2\, \epsilon^{-2} (\log \delta^{-1}) / n) = O(k (\log n)^3\, n^{-1/3}) = o(1)$
Summary & Future Work • Defined the max k-armed bandit problem and discussed applications to heuristic search • Presented an asymptotically optimal algorithm for GEV payoff distributions (we analyzed the special case s = 0) • Working on applications to scheduling problems
The Extremal Types Theorem • Define $M_n$ = the maximum of n independent draws, and suppose $\lim_{n \to \infty} \Pr[r_n(M_n) \le z] = G(z)$, where each $r_n$ is a linear “rescaling function”. Then G is either a point mass or a “generalized extreme value distribution”: $G(z) = \exp\!\left(-\left(1 + s\,\tfrac{z-\mu}{\sigma}\right)^{-1/s}\right)$ for constants s, μ, and σ > 0.
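An empirical illustration of the theorem (my example, not the talk's): for Exp(1) draws, the linear rescaling $r_n(M_n) = M_n - \ln n$ converges to a standard Gumbel, whose mean is γ ≈ 0.577 and whose variance is π²/6 ≈ 1.645:

```python
import math, random, statistics

n, trials = 2000, 1000
rescaled = [max(random.expovariate(1.0) for _ in range(n)) - math.log(n)
            for _ in range(trials)]
print(statistics.fmean(rescaled), statistics.pvariance(rescaled))
# ~0.58 and ~1.64, matching the Gumbel (s = 0) member of the GEV family
```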