A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem Matthew Streeter and Stephen Smith Carnegie Mellon University
Outline • The max k-armed bandit problem • Previous work • Our distribution-free approach • Experimental evaluation
The classical k-armed bandit • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di • Allowed n total pulls • Goal: maximize total payoff • > 50 years of papers
The max k-armed bandit • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di • Allowed n total pulls • Goal: maximize highest payoff • Introduced ~2003
Goal: improve multi-start heuristics • A multi-start heuristic runs an underlying randomized heuristic many times and returns the best solution found • Examples: • HBSS (Bresina 1996) • VBSS (Cicirello & Smith 2005) • GRASPs (Feo & Resende 1995, and many others)
Application: selecting among heuristics • Given: some optimization problem, k randomized heuristics • Each time you run a heuristic, you get a solution of a certain quality • Allowed n runs • Goal: maximize quality of best solution
The max k-armed bandit: example • Given n pulls, how can we maximize the (expected) maximum payoff? • If n=1, should pull blue arm (higher mean) • If n=1000, should mainly pull maroon arm (higher variance)
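The slide's figure is not reproduced here, so the effect can be checked numerically instead. Below is a minimal Monte Carlo sketch in Python; the two payoff distributions, a higher-mean low-variance "blue" arm and a lower-mean high-variance "maroon" arm, are illustrative assumptions rather than the values from the talk's figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull_blue(size):
    # Hypothetical "blue" arm: higher mean, low variance.
    return rng.normal(0.5, 0.05, size)

def pull_maroon(size):
    # Hypothetical "maroon" arm: lower mean, high variance.
    return rng.normal(0.4, 0.20, size)

def expected_max(pull, n, trials=10_000):
    # Monte Carlo estimate of E[max of n pulls from a single arm].
    return np.mean([pull(n).max() for _ in range(trials)])

for n in (1, 1000):
    print(n, expected_max(pull_blue, n), expected_max(pull_maroon, n))
```

With n=1 the blue arm's higher mean wins; with n=1000 the maroon arm's heavier upper tail gives a much larger expected maximum, matching the intuition on the slide.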
Distributional assumptions? • Without distributional assumptions, the optimal strategy is not interesting. • For example, suppose payoffs are in {0,1} and the arms are shuffled so you don't know which is which. • The optimal strategy samples the arms in round-robin order! • You can't distinguish a “good” arm until you receive payoff 1, at which point the max payoff can't be improved
Distributional assumptions? • All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution • Why? • Extremal Types Theorem: let Mn = maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of Mn converges to a GEV distribution • GEV sometimes gives an excellent fit to payoff distributions we care about
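As a quick numerical illustration of the Extremal Types Theorem (not part of the original slides), the sketch below compares the distribution of the shifted maximum of n Exp(1) draws against the standard Gumbel CDF, the GEV special case assumed in the prior work cited on the next slide; the choice of Exp(1) as the base distribution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 10_000

# Maximum of n Exp(1) draws, shifted by log(n); the Extremal Types Theorem
# implies this converges in distribution to a standard Gumbel (a GEV case).
shifted_maxima = rng.exponential(1.0, size=(trials, n)).max(axis=1) - np.log(n)

for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(shifted_maxima <= x)
    gumbel = np.exp(-np.exp(-x))  # standard Gumbel CDF: exp(-exp(-x))
    print(f"x={x:+.1f}  empirical={empirical:.3f}  Gumbel={gumbel:.3f}")
```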
Previous work • Cicirello & Smith (CP 2004, AAAI 2005): • Assumed Gumbel distributions (special case of GEV), no rigorous performance guarantees • Good results selecting among heuristics for the RCPSP/max • Streeter & Smith (AAAI 2006) • Rigorous result for general GEV distributions • But no experimental evaluation
Our contributions • Threshold Ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine • Chernoff interval estimation: a strategy for the classical k-armed bandit that works well when mean payoffs are small (we assume payoffs in [0,1])
Threshold Ascent • Parameters: strategy S for the classical k-armed bandit, integer m > 0 • Idea: • Initialize t ← -∞ • Use S to maximize the number of payoffs that exceed t • Once m payoffs > t have been received, increase t and repeat
Threshold Ascent • Designed to work well when: • For t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms
Threshold Ascent • Parameters: strategy S for the classical k-armed bandit, integer m > 0 • Idea: • Initialize t ← -∞ • Use S to maximize the number of payoffs that exceed t • Once m payoffs > t have been received, increase t and repeat • Notes: • m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t) • as t gets large, S sees a classical k-armed bandit instance in which almost all payoffs are zero • we don't really start S from scratch each time we increase t
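The following is a minimal Python sketch of Threshold Ascent as described on this slide. Two things are assumptions: the classical strategy S is replaced by a simple UCB-style stand-in (the paper plugs in Chernoff interval estimation), and the rule for raising t, setting it to the m-th largest payoff seen so far, is one plausible reading of "increase t and repeat", not necessarily the exact rule from the paper.

```python
import math

def threshold_ascent(arms, n, m=100):
    """Sketch of Threshold Ascent: arms is a list of zero-argument
    functions that each return one payoff; n is the total pull budget."""
    k = len(arms)
    t = -math.inf                       # initialize t <- -infinity
    pulls = [0] * k                     # pulls per arm
    hits = [0] * k                      # payoffs > t per arm
    observed = [[] for _ in range(k)]   # all payoffs, so hits can be re-counted
    best = -math.inf

    for step in range(1, n + 1):
        # Stand-in for S: pull the arm with the highest optimistic
        # estimate of P(payoff > t).
        def score(i):
            if pulls[i] == 0:
                return math.inf
            return hits[i] / pulls[i] + math.sqrt(2 * math.log(step) / pulls[i])
        arm = max(range(k), key=score)

        x = arms[arm]()
        best = max(best, x)
        pulls[arm] += 1
        observed[arm].append(x)
        if x > t:
            hits[arm] += 1

        # Once m payoffs above t have been received, raise t and re-count,
        # keeping all statistics gathered so far (S is not restarted).
        if sum(hits) >= m:
            t = sorted(y for obs in observed for y in obs)[-m]
            hits = [sum(1 for y in observed[i] if y > t) for i in range(k)]
    return best
```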
Interval Estimation • Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm's mean payoff and pulls the arm with the highest upper bound • [Figure: confidence intervals for Arms 1-3; the arm with the highest upper confidence bound is pulled]
Chernoff Interval Estimation • We analyze a variant of interval estimation whose confidence intervals are derived from Chernoff bounds • regret = μ* - average_payoff(strategy), where μ* = mean payoff of the best arm • We prove an O(√μ* · X) regret bound, where X = √(k (log n)/n) • Using Hoeffding's inequality just gives O(X) (Auer et al. 2002); as μ* → 0, our bound is much better • Comparable bounds can be obtained using “multiplicative weight update” algorithms
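To make the gap concrete, here is a small Python sketch comparing a Chernoff-style upper confidence bound with the usual Hoeffding bound for payoffs in [0, 1]. The exact constants below are illustrative assumptions, not the ones used in the paper's analysis; the point is that the Chernoff-style radius shrinks with the empirical mean while the Hoeffding radius does not.

```python
import math

def chernoff_ucb(mean, n_pulls, delta=0.01):
    # Chernoff/Bernstein-flavored bound: radius ~ sqrt(mean * log(1/delta) / n),
    # which is much tighter than Hoeffding when the mean payoff is small.
    if n_pulls == 0:
        return math.inf
    alpha = math.log(1.0 / delta)
    return mean + math.sqrt(2.0 * mean * alpha / n_pulls) + 3.0 * alpha / n_pulls

def hoeffding_ucb(mean, n_pulls, delta=0.01):
    # Standard Hoeffding bound: radius independent of the mean.
    if n_pulls == 0:
        return math.inf
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n_pulls))

# With a small empirical mean, the Chernoff-style interval is far tighter:
print(chernoff_ucb(mean=0.01, n_pulls=1000))   # ~0.033
print(hoeffding_ucb(mean=0.01, n_pulls=1000))  # ~0.058
```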
The RCPSP/max • Assign start times to activities subject to resource and temporal constraints • Goal: find a schedule with minimum makespan • NP-hard, “one of the most intractable problems in operations research” (Mohring 2000) • Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)
Evaluation • Note: we use a less aggressive variant of interval estimation in these experiments • Five multi-start heuristics; each is a randomized rule for greedily building a schedule: • LPF - “longest path following” • LST - “latest start time” • MST - “minimum slack time” • MTS - “most total successors” • RSM - “resource scheduling method” • Three max k-armed bandit strategies: • Threshold Ascent (m=100, S = Chernoff interval estimation with 99% confidence intervals) • round-robin sampling • QD-BEACON (Cicirello & Smith 2004, 2005)
Evaluation • Ran on 169 instances from the ProGen/max library • For each instance, ran each of the five rules 10,000 times and saved the results to a file • For each of the three strategies, solved the resulting max 5-armed bandit problem with n=10,000 pulls • Define regret = difference between the max. possible payoff and the max. payoff actually obtained
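A minimal harness for this kind of replay evaluation might look like the sketch below; the function and variable names are hypothetical, and `strategy` stands for any of the three max k-armed bandit strategies above, taking the history of (arm, payoff) pairs seen so far and returning the next arm to pull.

```python
import numpy as np

def evaluate(strategy, saved_runs, n=10_000):
    """Replay one strategy against pre-computed heuristic runs and return
    its regret.  saved_runs is a list of k NumPy arrays, each holding the
    solution qualities from 10,000 saved runs of one heuristic."""
    next_index = [0] * len(saved_runs)  # next unused saved run per heuristic
    history, best = [], -np.inf
    for _ in range(n):
        arm = strategy(history)
        payoff = saved_runs[arm][next_index[arm]]  # consume one saved run
        next_index[arm] += 1
        history.append((arm, payoff))
        best = max(best, payoff)
    # Regret = best payoff available anywhere in the files minus best obtained.
    return max(run.max() for run in saved_runs) - best
```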
Results • Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five “pure” strategies
Summary & Conclusions • The max k-armed bandit problem is a simple online learning problem with applications to heuristic search • We described a new, distribution-free approach to the max k-armed bandit problem • Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max