The K-armed Dueling Bandits Problem COLT 2009 Cornell University
Multi-armed Bandits
• K bandits (arms / actions / strategies)
• Each time step, the algorithm chooses a bandit based on prior actions and feedback.
• Only the feedback from actions taken is observed.
• A partial-feedback online learning problem
Optimizing Retrieval Functions
• Interactively learn the best retrieval function
  • Search users provide implicit feedback, e.g., what they click on
• Seems like a natural bandit problem
  • Each retrieval function is a bandit
  • Feedback is clicks
  • Find the best w.r.t. clickthrough rate
  • Assumes clicks → explicit absolute feedback!
What Results do Users View/Click? [Joachims et al., TOIS 2007]
Team-Game Interleaving (u=thorsten, q="svm")

f1(u,q) → r1:
1. Kernel Machines (http://svm.first.gmd.de/)
2. SVM-Light Support Vector Machine (http://ais.gmd.de/~thorsten/svm_light/)
3. Support Vector Machine and Kernel ... References (http://svm.research.bell-labs.com/SVMrefs.html)
4. Lucent Technologies: SVM demo applet (http://svm.research.bell-labs.com/SVT/SVMsvt.html)
5. Royal Holloway Support Vector Machine (http://svm.dcs.rhbnc.ac.uk)

f2(u,q) → r2:
1. Kernel Machines (http://svm.first.gmd.de/)
2. Support Vector Machine (http://jbolivar.freeservers.com/)
3. An Introduction to Support Vector Machines (http://www.support-vector.net/)
4. Archives of SUPPORT-VECTOR-MACHINES ... (http://www.jiscmail.ac.uk/lists/SUPPORT...)
5. SVM-Light Support Vector Machine (http://ais.gmd.de/~thorsten/svm_light/)

Interleaving(r1, r2):
1. Kernel Machines [T2] (http://svm.first.gmd.de/)
2. Support Vector Machine [T1] (http://jbolivar.freeservers.com/)
3. SVM-Light Support Vector Machine [T2] (http://ais.gmd.de/~thorsten/svm_light/)
4. An Introduction to Support Vector Machines [T1] (http://www.support-vector.net/)
5. Support Vector Machine and Kernel ... References [T2] (http://svm.research.bell-labs.com/SVMrefs.html)
6. Archives of SUPPORT-VECTOR-MACHINES ... [T1] (http://www.jiscmail.ac.uk/lists/SUPPORT...)
7. Lucent Technologies: SVM demo applet [T2] (http://svm.research.bell-labs.com/SVT/SVMsvt.html)

• Mix the results of f1 and f2
• Relative feedback is more reliable
• Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2)
[Radlinski, Kurup, Joachims; CIKM 2008]
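As a sketch of how the interleaving itself might be computed, here is a minimal team-draft-style routine in Python; the function names, coin-flip order, and click-crediting details are illustrative assumptions rather than the exact procedure of Radlinski et al.

```python
import random

def team_draft_interleave(r1, r2, seed=0):
    """Merge rankings r1 and r2 into one list; return (results, team_labels)."""
    rng = random.Random(seed)
    interleaved, teams, shown = [], [], set()
    total = len(set(r1) | set(r2))
    while len(shown) < total:
        # A coin flip decides which team drafts first in this round.
        order = [("T1", r1), ("T2", r2)]
        if rng.random() < 0.5:
            order.reverse()
        for team, ranking in order:
            # The team contributes its highest-ranked result not yet shown.
            doc = next((d for d in ranking if d not in shown), None)
            if doc is not None:
                interleaved.append(doc)
                teams.append(team)
                shown.add(doc)
    return interleaved, teams

def preferred(teams, clicked_positions):
    """Slide's interpretation: r1 > r2 iff clicks(T1) > clicks(T2)."""
    c1 = sum(1 for p in clicked_positions if teams[p] == "T1")
    c2 = sum(1 for p in clicked_positions if teams[p] == "T2")
    return "r1" if c1 > c2 else ("r2" if c2 > c1 else "tie")
```

Clicks on the interleaved list are credited to whichever team contributed each result, which yields the relative signal clicks(T1) vs clicks(T2) used above.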
Dueling Bandits Problem
• Given K bandits b1, …, bK
• Each iteration: compare (duel) two bandits
  • E.g., interleave two retrieval functions
• Comparisons are noisy
  • Each comparison result is independent
  • Comparison probabilities are initially unknown
  • Comparison probabilities are fixed over time
• A total preference ordering exists, but is initially unknown
Dueling Bandits Problem
• Want to find the best (or a good) bandit
• Similar to finding the max with noisy comparisons
• Ours is a regret-minimization setting
• Choose the pair (bt, bt') to minimize regret:
  R_T = Σ_{t=1..T} [ (P(b* > bt) − ½) + (P(b* > bt') − ½) ]
  (the fraction of users who prefer the best bandit b* over the chosen ones)
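To make the setting concrete, here is a minimal simulator sketch (not from the paper): a fixed preference matrix P, an independent noisy comparison, and per-duel regret charged against the best bandit. The class name and interface are illustrative assumptions.

```python
import random

class DuelingBanditsSim:
    """Toy dueling-bandits environment with a fixed preference matrix.

    P[i][j] = P(bandit i beats bandit j); P[i][j] + P[j][i] = 1, P[i][i] = 0.5.
    Bandit 0 is assumed to be the best bandit b*.
    """
    def __init__(self, P, seed=0):
        self.P = P
        self.rng = random.Random(seed)
        self.regret = 0.0

    def duel(self, i, j):
        # Per-step regret: fraction of users preferring b* over each chosen bandit,
        # i.e. (P(b* > b_i) - 1/2) + (P(b* > b_j) - 1/2).
        self.regret += (self.P[0][i] - 0.5) + (self.P[0][j] - 0.5)
        # Independent noisy comparison with fixed, initially unknown probabilities.
        return self.rng.random() < self.P[i][j]

# Example: three bandits with b0 > b1 > b2.
P = [[0.5, 0.6, 0.7],
     [0.4, 0.5, 0.6],
     [0.3, 0.4, 0.5]]
env = DuelingBanditsSim(P)
won = env.duel(1, 2)          # True iff b1 beat b2 in this noisy duel
```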
Related Work
• Existing MAB settings
  • [Lai & Robbins, '85] [Auer et al., '02]
  • PAC setting [Even-Dar et al., '06]
• Partial monitoring problem
  • [Cesa-Bianchi et al., '06]
• Computing with noisy comparisons
  • Finding the max [Feige et al., '97]
  • Binary search [Karp & Kleinberg, '07] [Ben-Or & Hassidim, '08]
• Continuous-armed Dueling Bandits Problem
  • [Yue & Joachims, '09]
Assumptions
• P(bi > bj) = ½ + εij (distinguishability)
• Strong stochastic transitivity
  • For three bandits bi > bj > bk: εik ≥ max{εij, εjk}
  • A monotonicity property
• Stochastic triangle inequality
  • For three bandits bi > bj > bk: εik ≤ εij + εjk
  • A diminishing-returns property
• Satisfied by many standard models
  • E.g., logistic / Bradley-Terry
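To see why a Bradley-Terry model satisfies both properties, here is a small numeric check (a sketch with arbitrary skill parameters, not taken from the paper):

```python
# Bradley-Terry model: P(b_i > b_j) = g_i / (g_i + g_j) for skill parameters g_i.
g = [3.0, 2.0, 1.0]          # skills for b_i > b_j > b_k (arbitrary example values)

def eps(a, b):
    """epsilon_ab = P(b_a > b_b) - 1/2 under the Bradley-Terry model."""
    return g[a] / (g[a] + g[b]) - 0.5

e_ij, e_jk, e_ik = eps(0, 1), eps(1, 2), eps(0, 2)
assert e_ik >= max(e_ij, e_jk)        # strong stochastic transitivity
assert e_ik <= e_ij + e_jk            # stochastic triangle inequality
print(e_ij, e_jk, e_ik)               # 0.1, ~0.167, 0.25
```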
Naïve Approach
• In the deterministic case, O(K) comparisons suffice to find the max
• Extension to the noisy case:
  • Repeatedly compare a pair until confident that one is better
• Problem: comparing two awful (but similar) bandits
  • Wastes comparisons deciding which awful bandit is better
  • Incurs high regret for each of those comparisons
• The same issue applies to elimination tournaments
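The "repeatedly compare until confident" subroutine can be sketched as follows (an illustration only; `duel(i, j)` is assumed to return True when bandit i wins one noisy comparison, and the confidence radius is a standard Hoeffding-style bound with a union bound over time):

```python
import math

def compare_until_confident(duel, i, j, delta):
    """Duel bandits i and j until one is better with confidence 1 - delta.

    Returns (winner, number_of_comparisons). With gap epsilon this needs on
    the order of (1 / epsilon^2) * log(1 / delta) comparisons, which is why
    comparing two nearly identical (and awful) bandits is so expensive.
    """
    wins = n = 0
    while True:
        n += 1
        wins += 1 if duel(i, j) else 0
        p_hat = wins / n
        radius = math.sqrt(math.log(2.0 * n * n / delta) / (2.0 * n))
        if p_hat - radius > 0.5:
            return i, n
        if p_hat + radius < 0.5:
            return j, n
```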
Interleaved Filter
• Choose a candidate bandit at random
• Make noisy comparisons (Bernoulli trials) against all other bandits simultaneously
  • Maintain a mean and confidence interval for each pair of bandits being compared
• …until another bandit is better with confidence 1 − δ
• Repeat the process with the new candidate
  • Remove all empirically worse bandits
• Continue until one candidate is left
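Putting these steps together, here is a minimal Python sketch of Interleaved Filter (an illustrative reconstruction of the procedure described above, not the authors' code; `duel(i, j)` is assumed to return True when bandit i wins a single noisy comparison, and the Hoeffding-style confidence radius and δ = 1/(K²T) follow the slides):

```python
import math, random

def interleaved_filter(K, duel, T, seed=0):
    """Sketch of Interleaved Filter; returns the index of the surviving bandit."""
    rng = random.Random(seed)
    delta = 1.0 / (K * K * T)              # confidence parameter delta = 1/(K^2 T)
    active = list(range(K))
    cand = active.pop(rng.randrange(K))    # random initial candidate

    while active:
        wins = {b: 0 for b in active}      # wins of the candidate against each b
        n = 0
        new_cand = None
        while new_cand is None and active:
            n += 1
            for b in active:               # duel the candidate against all others
                wins[b] += 1 if duel(cand, b) else 0
            radius = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
            # Remove bandits the candidate beats with confidence 1 - delta.
            active = [b for b in active if wins[b] / n - radius <= 0.5]
            # Has some bandit beaten the candidate with confidence 1 - delta?
            for b in active:
                if wins[b] / n + radius < 0.5:
                    new_cand = b
                    break
        if new_cand is not None:
            # Prune everything empirically worse than the old candidate,
            # then start the next round with the winner as the new candidate.
            active = [b for b in active
                      if b != new_cand and wins[b] / n <= 0.5]
            cand = new_cand
    return cand

# Example (with the DuelingBanditsSim sketch from earlier):
#   best = interleaved_filter(len(P), env.duel, T=10_000)
```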
Regret Analysis
• Define a round to be all the time steps for a particular candidate bandit
  • A round halts when the 1 − δ confidence threshold is met
• Treat the sequence of candidate bandits as a random walk
  • Takes O(log K) rounds to reach the best bandit
• Define a match to be all the comparisons between two bandits within a round
  • O(K) total matches in expectation
  • A constant fraction of the bandits is removed each round
Regret Analysis
• O(K) total matches
• Each match incurs O((1/ε1,2) log T) regret, where ε1,2 = P(b1 > b2) − ½
  • Depends on δ = 1/(K²T)
  • Finds the best bandit w.p. 1 − 1/T
• Expected regret: E[R_T] = O((K/ε1,2) log T)
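The bound follows by multiplying the two quantities above; written out as a sketch of the arithmetic (constants suppressed):

```latex
\[
\underbrace{O(K)}_{\text{matches}} \;\times\;
\underbrace{O\!\left(\tfrac{1}{\epsilon_{1,2}}\log T\right)}_{\text{regret per match}}
\;=\; O\!\left(\frac{K}{\epsilon_{1,2}}\log T\right),
\qquad \epsilon_{1,2} = P(b_1 \succ b_2) - \tfrac{1}{2}.
\]
```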
Lower Bound Example
• Order the bandits b1 > b2 > … > bK
• P(b > b') = ½ + ε for every pair b > b'
• Each match takes Ω((1/ε²) log T) comparisons
• Pay Θ(ε) regret for each comparison
• Accumulated regret over all matches is at least Ω((K/ε) log T)
Moving Forward
• Extensions
  • Drifting user interests
  • Changing user population / document collection
  • Contextualization / side information
  • Cost-sensitive active learning with relative feedback
• Live user studies
  • Interactive experiments on a search service
  • Sheds insight; guides future design
Per-Match Regret
• Number of comparisons in the match bi vs bj: O((1/ε*²) log T), where ε* = max{ε1i, εij}
  • ε1i > εij: the round ends before we can conclude bi > bj
  • ε1i < εij: we conclude bi > bj (and remove bj) before the round ends
• Pay ε1i + ε1j regret for each comparison
• By the triangle inequality, ε1i + ε1j ≤ 2·max{ε1i, εij}
• Thus, by stochastic transitivity, the accumulated regret of the match is O((1/ε1,2) log T)
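The per-match algebra, written out as a sketch (ε* = max{ε1i, εij}; the last step uses ε* ≥ ε1,2, which follows from strong stochastic transitivity):

```latex
\[
\text{regret}(b_i \text{ vs } b_j)
\;\le\; (\epsilon_{1,i} + \epsilon_{1,j}) \cdot O\!\left(\frac{\log T}{(\epsilon^{*})^{2}}\right)
\;\le\; 2\,\epsilon^{*} \cdot O\!\left(\frac{\log T}{(\epsilon^{*})^{2}}\right)
\;=\; O\!\left(\frac{\log T}{\epsilon^{*}}\right)
\;\le\; O\!\left(\frac{\log T}{\epsilon_{1,2}}\right).
\]
```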
Number of Rounds
• Assume all superior bandits have an equal probability of defeating the candidate
  • This is the worst-case scenario under strong stochastic transitivity
• Model the sequence of candidates as a random walk
  • rj transitions to each ri (i < j) with equal probability
  • Compute the total number of steps before reaching r1 (i.e., r*)
• Can show the walk takes O(log K) rounds w.h.p. using a Chernoff bound
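A quick empirical sanity check of the random-walk argument (an illustrative sketch, not from the paper): start at a uniformly random rank, jump to a uniformly random strictly better rank each round, and count the rounds needed to reach the best bandit.

```python
import math, random, statistics

def rounds_to_best(K, rng):
    """Random walk over ranks: from rank j, jump uniformly to some rank i < j."""
    j = rng.randrange(K)           # ranks 0 (best) .. K-1, random starting candidate
    rounds = 0
    while j > 0:
        j = rng.randrange(j)       # uniform over strictly better ranks
        rounds += 1
    return rounds

rng = random.Random(0)
K = 1024
samples = [rounds_to_best(K, rng) for _ in range(2000)]
print(statistics.mean(samples), math.log(K))   # mean number of rounds grows like log K
```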
Total Matches Played
• O(K) matches are played in each round
  • A naïve analysis therefore yields O(K log K) total matches
• However, all empirically worse bandits are also removed at the end of each round
  • They will not participate in future rounds
• Assume, in the worst case, that each inferior bandit has only a ½ chance of being empirically worse
• Can show w.h.p. that O(K) total matches are played over the O(log K) rounds
Removing Inferior Bandits
• At the conclusion of each round, remove any empirically worse bandits
• Intuition:
  • High confidence that the winner is better than the incumbent candidate
  • Empirically worse bandits cannot be "much better" than the incumbent candidate
• Can show via a Hoeffding bound that, with high confidence, the winner is also better than the empirically worse bandits
• Preserves the overall 1 − 1/T confidence that we will find the best bandit
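For concreteness, the Hoeffding-style confidence radius behind the stopping and removal rules might be computed as follows (a sketch; the exact constants in the paper's bounds may differ):

```python
import math

def conf_radius(n, delta):
    """Hoeffding radius: |p_hat - p| <= radius with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def confidently_better(wins, n, delta):
    """Stopping rule: the winner of n duels is better with confidence 1 - delta."""
    return wins / n - conf_radius(n, delta) > 0.5

def empirically_worse(candidate_wins, n):
    """Removal rule from the slide: the candidate won more than half of the duels."""
    return candidate_wins / n > 0.5
```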