Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem

ICML 2009. Yisong Yue and Thorsten Joachims, Cornell University

Learning To Rank • Supervised Learning Problem • Extension of classification/regression • Relatively well understood • High applicability in Information Retrieval • Requires explicitly labeled data • Expensive to obtain • Expert judged labels == search user utility? • Doesn’t generalize to other search domains.

Our Contribution • Learn from implicit feedback (users’ clicks) • Reduce labeling cost • More representative of end user information needs • Learn using pairwise comparisons • Humans are more adept at making pairwise judgments • Via Interleaving [Radlinski et al., 2008] • On-line framework (Dueling Bandits Problem) • We leverage users when exploring new retrieval functions • Exploration vs exploitation tradeoff (regret)

Team-Game Interleaving (u=thorsten, q=“svm”) [Radlinski, Kurup, Joachims; CIKM 2008]
f1(u,q) → r1:
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk
f2(u,q) → r2:
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
Interleaving(r1,r2), with the contributing team (T1 or T2) marked for each result:
1. Kernel Machines (T2) http://svm.first.gmd.de/
2. Support Vector Machine (T1) http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2) http://ais.gmd.de/~thorsten/svm_light/
4. An Introduction to Support Vector Machines (T1) http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References (T2) http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1) http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet (T2) http://svm.research.bell-labs.com/SVT/SVMsvt.html
Invariant: For all k, in expectation the same number of team members from each team appear in the top k.
Interpretation: (r2 ≻ r1) ↔ clicks(T2) > clicks(T1)
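The interleaving step above can be sketched in code. This is a minimal illustration that assumes the team-draft scheme described by Radlinski et al. (the team with fewer picks chooses next, ties are broken by a coin flip, and each team contributes its highest-ranked result not yet shown). The function names `team_draft_interleave` and `winner` are hypothetical and do not come from the slides.

```python
import random


def team_draft_interleave(r1, r2, rng=None):
    """Interleave two rankings; return the merged list and a doc -> team map."""
    rng = rng or random.Random()
    interleaved = []
    team_of = {}              # document -> 1 or 2 (the team that contributed it)
    picks = {1: 0, 2: 0}      # number of picks made by each team so far
    while True:
        remaining1 = [d for d in r1 if d not in team_of]
        remaining2 = [d for d in r2 if d not in team_of]
        # Stop once either ranking has nothing new left to contribute.
        if not remaining1 or not remaining2:
            break
        if picks[1] < picks[2]:
            team = 1
        elif picks[2] < picks[1]:
            team = 2
        else:
            team = rng.choice((1, 2))   # coin flip on ties
        doc = remaining1[0] if team == 1 else remaining2[0]
        interleaved.append(doc)
        team_of[doc] = team
        picks[team] += 1
    return interleaved, team_of


def winner(team_of, clicked_docs):
    """Return 1 or 2 for the team with more clicks, or 0 for a tie."""
    c1 = sum(1 for d in clicked_docs if team_of.get(d) == 1)
    c2 = sum(1 for d in clicked_docs if team_of.get(d) == 2)
    if c1 > c2:
        return 1
    if c2 > c1:
        return 2
    return 0
```

Because the team that is behind always picks next, the two teams never differ by more than one result in any prefix of the merged list, which is what makes the click counts comparable.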

Dueling Bandits Problem • Continuous space bandits F • E.g., parameter space of retrieval functions (i.e., weight vectors) • Each time step compares two bandits • E.g., interleaving test on two retrieval functions • Comparison is noisy & independent • Choose pair (ft, ft’) to minimize regret: R_T = Σ_{t=1..T} [ P(f* > ft) + P(f* > ft’) − 1 ] • (% users who prefer best bandit over chosen ones)

Example 1 • P(f* > f) = 0.9 • P(f* > f’) = 0.8 • Incurred Regret = 0.7 • Example 2 • P(f* > f) = 0.7 • P(f* > f’) = 0.6 • Incurred Regret = 0.3 • Example 3 • P(f* > f) = 0.51 • P(f* > f’) = 0.55 • Incurred Regret = 0.06
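The three regret values above are consistent with a per-step regret of (P(f* > f) − 1/2) + (P(f* > f’) − 1/2), that is, how far each chosen bandit falls short of an even match against the best one. The short check below reproduces the arithmetic; the function name `incurred_regret` is hypothetical.

```python
def incurred_regret(p_best_over_f, p_best_over_f_prime):
    """Per-step regret: excess win probability of f* over each chosen bandit."""
    return (p_best_over_f - 0.5) + (p_best_over_f_prime - 0.5)


# The three (P(f* > f), P(f* > f')) pairs from the examples above.
examples = [(0.9, 0.8), (0.7, 0.6), (0.51, 0.55)]
for p, q in examples:
    print(round(incurred_regret(p, q), 2))  # prints 0.7, then 0.3, then 0.06
```

Regret is zero only when both chosen bandits tie with f*, so playing f* against itself is the regret-free choice.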

Modeling Assumptions • Each bandit f ∈ F has intrinsic value v(f) • Never observed directly • Assume v(f) is strictly concave (⇒ unique f*) • Comparisons based on v(f) • P(f > f’) = σ( v(f) – v(f’) ) • P is L-Lipschitz • For example: the logistic function σ(x) = 1/(1+exp(-x))
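A minimal sketch of the comparison model P(f > f’) = σ( v(f) – v(f’) ), assuming the logistic function as the link σ (the slides name the logistic function as the transfer function used in the experiments; other links are possible). The helper names `sigma` and `p_prefers` are hypothetical.

```python
import math


def sigma(x):
    """Logistic link: maps a value difference to a win probability."""
    return 1.0 / (1.0 + math.exp(-x))


def p_prefers(v_f, v_f_prime):
    """P(f > f') under the model P = sigma(v(f) - v(f'))."""
    return sigma(v_f - v_f_prime)
```

Two properties follow directly from the symmetry of the logistic function: equal values give a probability of exactly 1/2, and P(f > f’) + P(f’ > f) = 1. Its slope never exceeds 1/4, so this particular link is Lipschitz with constant 1/4.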

Probability Functions

Dueling Bandit Gradient Descent • Maintain ft • Compare with ft’ (close to ft, as defined by the step size) • Update if ft’ wins comparison • Expectation of update close to gradient of P(ft > f’) • Builds on Bandit Gradient Descent [Flaxman et al., 2005]

Dueling Bandit Gradient Descent [Figure: one update step, showing the current point, a losing candidate, and a winning candidate; δ – explore step size, γ – exploit step size]
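The update rule described above can be sketched as follows. This is an illustrative reading of the slides rather than the authors' implementation: it assumes the candidate is drawn by stepping δ along a uniformly random unit direction, that a win moves the current point by γ along that same direction, and that points are kept inside a ball of a given radius. The names `dbgd`, `project_to_ball`, `random_unit_vector`, and `make_simulated_duel` are hypothetical, and the simulated user assumes v(x) = −xᵀx with a logistic link.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def project_to_ball(w, radius):
    """Project w back onto the ball of the given radius (assumed feasible set)."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def dbgd(duel, w0, delta, gamma, steps, radius, rng):
    """duel(current, candidate) returns True when the candidate wins."""
    w = list(w0)
    for _ in range(steps):
        u = random_unit_vector(len(w), rng)
        # Explore: propose a candidate delta away from the current point.
        candidate = project_to_ball(
            [wi + delta * ui for wi, ui in zip(w, u)], radius)
        # Exploit: move gamma along u only if the candidate won the duel.
        if duel(w, candidate):
            w = project_to_ball(
                [wi + gamma * ui for wi, ui in zip(w, u)], radius)
    return w


def make_simulated_duel(rng):
    """Noisy comparison with v(x) = -x^T x and a logistic link (assumption)."""
    def value(w):
        return -sum(x * x for x in w)

    def duel(w, candidate):
        p_candidate_wins = 1.0 / (1.0 + math.exp(-(value(candidate) - value(w))))
        return rng.random() < p_candidate_wins

    return duel
```

In a live system the `duel` callback would be replaced by an interleaving test on real user clicks; the optimizer itself only ever sees the binary outcome.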

Analysis (Sketch) • Dueling Bandit Gradient Descent • Sequence of partially convex functions ct(f) = P(ft > f) • Random binary updates (expectation close to gradient) • Bandit Gradient Descent [Flaxman et al., SODA 2005] • Sequence of convex functions • Use randomized update (expectation close to gradient) • Can be extended to our setting (Assumes more information)

Analysis (Sketch) • Convex functions satisfy • Both additive and multiplicative error • Depends on exploration step size δ • Main analytical contribution: bounding multiplicative error

Regret Bound • Regret grows as O(T^(3/4)) • Average regret shrinks as O(T^(-1/4)) • In the limit, we do as well as knowing f* in hindsight • Step sizes: δ = O(T^(-1/4)), γ = O(T^(-1/2))
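The step from total regret to average regret is a one-line calculation. The worked numbers below ignore the constant hidden in the O-notation, so they illustrate the rate only, not actual regret values. The symbol R_T for cumulative regret after T comparisons is notation introduced here.

```latex
\[
  R_T = O\!\left(T^{3/4}\right)
  \quad\Longrightarrow\quad
  \frac{R_T}{T} = O\!\left(T^{-1/4}\right)
  \xrightarrow[T \to \infty]{} 0 .
\]
% Rate illustration (constants dropped):
%   T = 10^4:  T^{3/4} = 10^3,  T^{-1/4} = 10^{-1}
%   T = 10^8:  T^{3/4} = 10^6,  T^{-1/4} = 10^{-2}
```

Sub-linear total regret is what justifies the "as well as knowing f* in hindsight" claim: the per-comparison cost of exploring vanishes as T grows, although at this rate a tenfold reduction in average regret takes ten thousand times as many comparisons.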

Practical Considerations • Need to set step size parameters • Depends on P(f > f’) • Cannot be set optimally • We don’t know the specifics of P(f > f’) • Algorithm should be robust to parameter settings • Set parameters approximately in experiments

50-dimensional parameter space • Value function v(x) = −xᵀx • Logistic transfer function • Random point has regret almost 1 • More experiments in paper

Web Search Simulation • Leverage web search dataset • 1000 Training Queries, 367 Dimensions • Simulate “users” issuing queries • Value function based on NDCG@10 (ranking measure) • Use logistic to make probabilistic comparisons • Use linear ranking function. • Not intended to compete with supervised learning • Feasibility check for online learning w/ users • Supervised labels difficult to acquire “in the wild”

Chose parameters with best final performance • Curves basically identical for validation and test sets (no over-fitting) • Sampling multiple queries makes no difference

What Next? • Better simulation environments • More realistic user modeling assumptions • DBGD simple and extensible • Incorporate pairwise document preferences • Deal with ranking discontinuities • Test on real search systems • Varying scales of user communities • Sheds insight on / guides future development

Extra Slides

Active vs Passive Learning • Passive Data Collection (offline) • Biased by current retrieval function • Point-wise Evaluation • Design retrieval function offline • Evaluate online • Active Learning (online) • Automatically propose new rankings to evaluate • Our approach

Relative vs Absolute Metrics • Our framework based on relative metrics • E.g., comparing pairs of results or rankings • Relatively recent development • Absolute Metrics • E.g., absolute click-through rate • More common in literature • Suffers from presentation bias • Less robust to the many different sources of noise

What Results do Users View/Click? [Joachims et al., TOIS 2007]

Analysis (Sketch) • Convex functions satisfy • We have both multiplicative and additive error • Depends on exploration step size δ • Main technical contribution: bounding multiplicative error • Existing results yield sub-linear bounds on:

Analysis (Sketch) • We know how to bound • Regret: • We can show using Lipschitz and symmetry of σ:

More Simulation Experiments • Logistic transfer function σ(x) = 1/(1+exp(-x)) • 4 choices of value functions • δ, γ set approximately

NDCG • Normalized Discounted Cumulative Gain • Multiple levels of relevance • DCG: sums a contribution from each rank position i, discounted by rank • NDCG is normalized DCG • Best possible ranking has score NDCG = 1
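For reference, a small sketch of NDCG using one widely used formulation, in which the result at rank i with relevance level y contributes (2^y − 1) / log2(i + 1). The slide's own formula did not survive extraction, so the gain function, the base-2 logarithm, and the function names `dcg` and `ndcg` are assumptions rather than the authors' exact definition.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranking given graded relevance labels."""
    rels = relevances if k is None else relevances[:k]
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels, start=1))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define NDCG as 0
    return dcg(relevances, k) / ideal
```

Calling `ndcg(labels, k=10)` gives the NDCG@10 variant used as the value function in the web search simulation. A ranking that is already sorted by relevance scores exactly 1.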

Considerations • NDCG is discontinuous w.r.t. function parameters • Try larger values of δ, γ • Try sampling multiple queries per update • Homogeneous user values • NDCG@10 • Not an optimization concern • Modeling limitation • Not intended to compete with supervised learning • Sanity check of feasibility for online learning w/ users
