A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem Matthew Streeter and Stephen Smith Carnegie Mellon University
Outline • The max k-armed bandit problem • Previous work • Our distribution-free approach • Experimental evaluation
The classical k-armed bandit • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di • Allowed n total pulls • Goal: maximize total payoff • > 50 years of papers
The max k-armed bandit • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di • Allowed n total pulls • Goal: maximize highest payoff • Introduced ~2003
Goal: improve multi-start heuristics • A multi-start heuristic runs an underlying randomized heuristic many times and returns the best solution found • Examples: • HBSS (Bresina 1996) • VBSS (Cicirello & Smith 2005) • GRASPs (Feo & Resende 1995, and many others)
Application: selecting among heuristics • Given: some optimization problem, k randomized heuristics • Each time you run a heuristic, you get a solution of a certain quality • Allowed n runs • Goal: maximize quality of best solution
The max k-armed bandit: example • Given n pulls, how can we maximize the (expected) maximum payoff? • If n=1, should pull blue arm (higher mean) • If n=1000, should mainly pull maroon arm (higher variance)
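The slide's figure is not reproduced here, so the effect can be checked numerically instead. Below is a minimal Monte Carlo sketch in Python; the two payoff distributions, a higher-mean low-variance "blue" arm and a lower-mean high-variance "maroon" arm, are illustrative assumptions rather than the values from the talk's figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull_blue(size):
    # Hypothetical "blue" arm: higher mean, low variance.
    return rng.normal(0.5, 0.05, size)

def pull_maroon(size):
    # Hypothetical "maroon" arm: lower mean, high variance.
    return rng.normal(0.4, 0.20, size)

def expected_max(pull, n, trials=10_000):
    # Monte Carlo estimate of E[max of n pulls from a single arm].
    return np.mean([pull(n).max() for _ in range(trials)])

for n in (1, 1000):
    print(n, expected_max(pull_blue, n), expected_max(pull_maroon, n))
```

With n=1 the blue arm's higher mean wins; with n=1000 the maroon arm's heavier upper tail gives a much larger expected maximum, matching the intuition on the slide.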
Distributional assumptions? • Without distributional assumptions, the optimal strategy is not interesting. • For example, suppose payoffs are in {0,1} and the arms are shuffled so you don't know which is which. • The optimal strategy samples the arms in round-robin order! • You can't distinguish a “good” arm until you receive payoff 1, at which point the max payoff can't be improved
Distributional assumptions? • All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution • Why? • Extremal Types Theorem: let Mn = maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of Mn converges to a GEV distribution • GEV sometimes gives an excellent fit to payoff distributions we care about
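As a quick numerical illustration of the Extremal Types Theorem (not part of the original slides), the sketch below compares the distribution of the shifted maximum of n Exp(1) draws against the standard Gumbel CDF, the GEV special case assumed in the prior work cited on the next slide; the choice of Exp(1) as the base distribution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10_000

# Maximum of n Exp(1) draws, shifted by log(n); the Extremal Types Theorem
# implies this converges in distribution to a standard Gumbel (a GEV case).
shifted_maxima = rng.exponential(1.0, size=(trials, n)).max(axis=1) - np.log(n)

for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(shifted_maxima <= x)
    gumbel = np.exp(-np.exp(-x))  # standard Gumbel CDF: exp(-exp(-x))
    print(f"x={x:+.1f}  empirical={empirical:.3f}  Gumbel={gumbel:.3f}")
```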
Previous work • Cicirello & Smith (CP 2004, AAAI 2005): • Assumed Gumbel distributions (special case of GEV), no rigorous performance guarantees • Good results selecting among heuristics for the RCPSP/max • Streeter & Smith (AAAI 2006) • Rigorous result for general GEV distributions • But no experimental evaluation
Our contributions • Threshold Ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine • Chernoff interval estimation: a strategy for the classical k-armed bandit that works well when mean payoffs are small (we assume payoffs in [0,1])
Threshold Ascent • Parameters: strategy S for the classical k-armed bandit, integer m > 0 • Idea: • Initialize t ← -∞ • Use S to maximize the number of payoffs that exceed t • Once m payoffs > t have been received, increase t and repeat
Threshold Ascent • Designed to work well when: • For t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms
Threshold Ascent • Parameters: strategy S for the classical k-armed bandit, integer m > 0 • Idea: • Initialize t ← -∞ • Use S to maximize the number of payoffs that exceed t • Once m payoffs > t have been received, increase t and repeat • Notes: • m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t) • as t gets large, S sees a classical k-armed bandit instance in which almost all payoffs are zero • we don't really start S from scratch each time we increase t
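The following is a minimal Python sketch of Threshold Ascent as described on this slide. Two things are assumptions: the classical strategy S is replaced by a simple UCB-style stand-in (the paper plugs in Chernoff interval estimation), and the rule for raising t, setting it to the m-th largest payoff seen so far, is one plausible reading of "increase t and repeat", not necessarily the exact rule from the paper.

```python
import math

def threshold_ascent(arms, n, m=100):
    """Sketch of Threshold Ascent: arms is a list of zero-argument
    functions that each return one payoff; n is the total pull budget."""
    k = len(arms)
    t = -math.inf                       # initialize t <- -infinity
    pulls = [0] * k                     # pulls per arm
    hits = [0] * k                      # payoffs > t per arm
    observed = [[] for _ in range(k)]   # all payoffs, so hits can be re-counted
    best = -math.inf

    for step in range(1, n + 1):
        # Stand-in for S: pull the arm with the highest optimistic
        # estimate of P(payoff > t).
        def score(i):
            if pulls[i] == 0:
                return math.inf
            return hits[i] / pulls[i] + math.sqrt(2 * math.log(step) / pulls[i])
        arm = max(range(k), key=score)

        x = arms[arm]()
        best = max(best, x)
        pulls[arm] += 1
        observed[arm].append(x)
        if x > t:
            hits[arm] += 1

        # Once m payoffs above t have been received, raise t and re-count,
        # keeping all statistics gathered so far (S is not restarted).
        if sum(hits) >= m:
            t = sorted(y for obs in observed for y in obs)[-m]
            hits = [sum(1 for y in observed[i] if y > t) for i in range(k)]
    return best
```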
Interval Estimation • Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm's mean payoff and pulls the arm with the highest upper bound • [Figure: confidence intervals for Arms 1-3; the arm with the highest upper confidence bound is pulled]
Chernoff Interval Estimation • We analyze a variant of interval estimation whose confidence intervals are derived from Chernoff bounds • regret = μ* - average_payoff(strategy), where μ* = mean payoff of the best arm • We prove an O(√μ* · X) regret bound, where X = √(k (log n)/n) • Using Hoeffding's inequality just gives O(X) (Auer et al. 2002); as μ* → 0, our bound is much better • Comparable bounds can be obtained using “multiplicative weight update” algorithms
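To make the gap concrete, here is a small Python sketch comparing a Chernoff-style upper confidence bound with the usual Hoeffding bound for payoffs in [0, 1]. The exact constants below are illustrative assumptions, not the ones used in the paper's analysis; the point is that the Chernoff-style radius shrinks with the empirical mean while the Hoeffding radius does not.

```python
import math

def chernoff_ucb(mean, n_pulls, delta=0.01):
    # Chernoff/Bernstein-flavored bound: radius ~ sqrt(mean * log(1/delta) / n),
    # which is much tighter than Hoeffding when the mean payoff is small.
    if n_pulls == 0:
        return math.inf
    alpha = math.log(1.0 / delta)
    return mean + math.sqrt(2.0 * mean * alpha / n_pulls) + 3.0 * alpha / n_pulls

def hoeffding_ucb(mean, n_pulls, delta=0.01):
    # Standard Hoeffding bound: radius independent of the mean.
    if n_pulls == 0:
        return math.inf
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n_pulls))

# With a small empirical mean, the Chernoff-style interval is far tighter:
print(chernoff_ucb(mean=0.01, n_pulls=1000))   # ~0.033
print(hoeffding_ucb(mean=0.01, n_pulls=1000))  # ~0.058
```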
The RCPSP/max • Assign start times to activities subject to resource and temporal constraints • Goal: find a schedule with minimum makespan • NP-hard, “one of the most intractable problems in operations research” (Mohring 2000) • Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)
Evaluation • Note: we use a less aggressive variant of interval estimation in these experiments • Five multi-start heuristics; each is a randomized rule for greedily building a schedule: • LPF - “longest path following” • LST - “latest start time” • MST - “minimum slack time” • MTS - “most total successors” • RSM - “resource scheduling method” • Three max k-armed bandit strategies: • Threshold Ascent (m=100, S = Chernoff interval estimation with 99% confidence intervals) • round-robin sampling • QD-BEACON (Cicirello & Smith 2004, 2005)
Evaluation • Ran on 169 instances from the ProGen/max library • For each instance, ran each of the five rules 10,000 times and saved the results to a file • For each of the three strategies, solved the resulting max 5-armed bandit problem with n=10,000 pulls • Define regret = difference between the max. possible payoff and the max. payoff actually obtained
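A minimal harness for this kind of replay evaluation might look like the sketch below; the function and variable names are hypothetical, and `strategy` stands for any of the three max k-armed bandit strategies above, taking the history of (arm, payoff) pairs seen so far and returning the next arm to pull.

```python
import numpy as np

def evaluate(strategy, saved_runs, n=10_000):
    """Replay one strategy against pre-computed heuristic runs and return
    its regret.  saved_runs is a list of k NumPy arrays, each holding the
    solution qualities from 10,000 saved runs of one heuristic."""
    next_index = [0] * len(saved_runs)  # next unused saved run per heuristic
    history, best = [], -np.inf
    for _ in range(n):
        arm = strategy(history)
        payoff = saved_runs[arm][next_index[arm]]  # consume one saved run
        next_index[arm] += 1
        history.append((arm, payoff))
        best = max(best, payoff)
    # Regret = best payoff available anywhere in the files minus best obtained.
    return max(run.max() for run in saved_runs) - best
```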
Results • Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five “pure” strategies
Summary & Conclusions • The max k-armed bandit problem is a simple online learning problem with applications to heuristic search • We described a new, distribution-free approach to the max k-armed bandit problem • Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max