Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation SIGIR 2010 Yisong Yue Cornell University Joint work with: Yue Gao, Olivier Chapelle, Ya Zhang, Thorsten Joachims
Retrieval Evaluation Using Click Data • Eliciting relative feedback • E.g., is A better than B? • Evaluation pipeline • Online experiment design (example to follow) • Collect clicks • Use standard statistical tests (e.g., t-test) • Contribution: Supervised learning algorithm for training a more efficient test statistic
Team-Game Interleaving (Online Experiment for Search Applications) (u=thorsten, q="svm") [Radlinski, Kurup, Joachims, CIKM 2008]

A(u,q) → r1:
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

B(u,q) → r2:
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/

Interleaving(r1, r2):
1. Kernel Machines (T2) http://svm.first.gmd.de/
2. Support Vector Machine (T1) http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2) http://ais.gmd.de/~thorsten/svm_light/
4. An Introduction to Support Vector Machines (T1) http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References (T2) http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1) http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet (T2) http://svm.research.bell-labs.com/SVT/SVMsvt.html

• Mix results of A and B • Relative feedback • More reliable • Interpretation: (r1 > r2) ↔ clicks(r1) > clicks(r2)
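A minimal Python sketch of this mixing step, in the spirit of the team-draft scheme from Radlinski et al. (CIKM 2008); the exact tie-breaking and de-duplication rules below are simplifying assumptions, not the paper's precise algorithm.

```python
import random

def next_unseen(ranking, idx, seen):
    """Advance idx past results already shown in the interleaved list."""
    while idx < len(ranking) and ranking[idx] in seen:
        idx += 1
    return idx

def team_draft_interleave(r1, r2, rng=random):
    """Interleave rankings r1 (function A) and r2 (function B).
    Returns the merged list plus, per shown result, the team credited."""
    interleaved, teams, seen = [], [], set()
    picks = {"A": 0, "B": 0}
    i = j = 0
    while True:
        i = next_unseen(r1, i, seen)
        j = next_unseen(r2, j, seen)
        a_avail, b_avail = i < len(r1), j < len(r2)
        if not (a_avail or b_avail):
            break
        # The team with fewer picks drafts next; coin flip on ties.
        a_turn = picks["A"] < picks["B"] or (
            picks["A"] == picks["B"] and rng.random() < 0.5)
        if (a_turn and a_avail) or not b_avail:
            doc, team, i = r1[i], "A", i + 1
        else:
            doc, team, j = r2[j], "B", j + 1
        interleaved.append(doc)
        teams.append(team)
        seen.add(doc)
        picks[team] += 1
    return interleaved, teams
```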
Determining Statistical Significance • For each query q, interleave A(q) and B(q), log clicks • t-Test • For each q, score: % clicks on A(q) • E.g., 3/4 = 0.75 • Sample mean score (e.g., 0.6) • Compute confidence (p-value) • E.g., want p = 0.05 (i.e., 95% confidence) • More data, more confidence
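A sketch of this baseline test with SciPy, on made-up per-query scores; under the null hypothesis that A and B attract clicks equally, the mean score is 0.5.

```python
import numpy as np
from scipy import stats

# Hypothetical per-query scores: fraction of a session's clicks on A's results.
scores = np.array([0.75, 0.5, 1.0, 0.6, 0.5, 0.8])  # e.g., 3/4 clicks on A -> 0.75

t_stat, p_value = stats.ttest_1samp(scores, popmean=0.5)  # H0: mean = 0.5 (tie)
print(f"mean={scores.mean():.2f}  t={t_stat:.2f}  p={p_value:.3f}")
```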
Determining Statistical Significance • For each query q, interleave A(q) and B(q), log clicks • Other statistical tests: • z-Test (equivalent to the t-Test for large samples) • Rank Test • Binomial Test • Etc. • All similar
Limitation • Example: query session with 2 clicks • One click at rank 1 (from A) • Later click at rank 4 (from B) • Normally would count this query session as a tie • But second click is probably more informative… • …so B should get more credit for this query
Linear Model • Feature vector φ(q,c) describes each click c in query session q (e.g., whether it is the session's last click) • Weight of click is wᵀφ(q,c) (see the sketch below)
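A hedged sketch of the click featurization, assuming a hypothetical two-feature φ(q,c) = (is last click, is any other click) chosen to match the two-dimensional example on the next slide; the paper's real feature set is richer.

```python
import numpy as np

def phi(session, click):
    """Hypothetical 2-feature vector for click c in session q:
    (1 if c is the session's last click else 0, 1 otherwise)."""
    is_last = 1.0 if click is session["clicks"][-1] else 0.0
    return np.array([is_last, 1.0 - is_last])

# Weight of a click is w^T phi(q, c).
w = np.array([1.0, 0.0])                          # e.g., count only last clicks
session = {"clicks": [{"rank": 1}, {"rank": 4}]}
print(w @ phi(session, session["clicks"][-1]))    # -> 1.0
```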
Example • wᵀφ(q,c) differentiates last clicks from other clicks • Interleave A vs B • 3 clicks per session • Last click: 60% on result from A • Other 2 clicks: random • Conventional w = (1,1) has significant variance • Counting only the last click, w = (1,0), minimizes variance
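An illustrative simulation of this example (the session structure is from the slide; the scoring convention, clicks on A minus clicks on B, is an assumption). Analytically, the mean stays 0.2 under both weightings, while the per-session standard deviation drops from √2.96 ≈ 1.72 to √0.96 ≈ 0.98, so the last-click-only statistic has a larger per-query z-score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Per session: two "other" clicks landing on A with prob 0.5,
# and a last click landing on A with prob 0.6.
other_on_a = rng.random((n, 2)) < 0.5
last_on_a = rng.random(n) < 0.6

def scores(w_last, w_other):
    # session score = weighted clicks on A minus weighted clicks on B
    s = w_last * np.where(last_on_a, 1.0, -1.0)
    s += w_other * np.where(other_on_a, 1.0, -1.0).sum(axis=1)
    return s

for w in [(1.0, 1.0), (1.0, 0.0)]:   # conventional vs last-click-only
    s = scores(*w)
    print(f"w={w}: mean={s.mean():.3f}  std={s.std():.3f}  "
          f"per-query z={s.mean() / s.std():.3f}")
```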
Scoring Query Sessions • Feature representation for query session: ψq = Σ_{c ∈ clicks on A} φ(q,c) − Σ_{c ∈ clicks on B} φ(q,c) • Weighted score for query: wᵀψq • Positive score favors A, negative favors B (sketch below)
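A sketch of this aggregation: sum φ over clicks credited to A, minus the sum over clicks credited to B, so a positive score favors A. The dict layout and "team" field are assumptions for illustration; φ is the earlier two-feature sketch, repeated here for self-containment.

```python
import numpy as np

def phi(session, click):
    # Hypothetical 2 features: (is last click, is any other click).
    is_last = 1.0 if click is session["clicks"][-1] else 0.0
    return np.array([is_last, 1.0 - is_last])

def psi(session):
    """Session vector: sum of phi over A's clicks minus sum over B's clicks."""
    total = np.zeros(2)
    for click in session["clicks"]:
        sign = 1.0 if click["team"] == "A" else -1.0
        total += sign * phi(session, click)
    return total

# Example session: click at rank 1 (team A), later click at rank 4 (team B).
session = {"clicks": [{"rank": 1, "team": "A"}, {"rank": 4, "team": "B"}]}
print(psi(session))   # -> [-1.  1.]; w^T psi(session) is the session's score
```

With last-click-only weights w = (1,0), this session scores −1, crediting B for the later click, which is exactly the behavior the Limitation slide calls for.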
Upgraded Test Statistic • t-Test: • Compute mean score wᵀψq over all queries • E.g., 0.2 • Null hypothesis: mean = 0 • Can reach statistical significance sooner • How to learn w?
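The same one-sample t-test as before, now applied to the weighted session scores wᵀψq with a null mean of 0 (the scores below are made up).

```python
import numpy as np
from scipy import stats

# Hypothetical per-session scores w^T psi_q; positive favors A.
scores = np.array([0.8, -0.2, 0.5, 0.1, 0.4, -0.1, 0.6])

t_stat, p_value = stats.ttest_1samp(scores, popmean=0.0)  # H0: mean = 0 (tie)
print(f"mean={scores.mean():.2f}  t={t_stat:.2f}  p={p_value:.3f}")
```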
Supervised Learning • Will optimize for the z-Test: the Inverse z-Test • Approximately equal to the t-Test for large samples • z-score = mean / standard deviation • (Assumes A > B in the training data)
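Maximizing mean(wᵀψ)/std(wᵀψ) over w is a generalized Rayleigh quotient, wᵀμ / √(wᵀΣw), whose maximizer is proportional to Σ⁻¹μ (the Fisher-discriminant form). A sketch under that reading; the paper's exact estimator and regularization may differ.

```python
import numpy as np

def inverse_z_test_weights(Psi):
    """Psi: (n_sessions, n_features) matrix of session vectors psi_q,
    oriented so that the known-better function is A (A > B).
    Returns w maximizing the empirical z-score mean(w^T psi) / std(w^T psi)."""
    mu = Psi.mean(axis=0)                   # empirical mean of psi
    Sigma = np.cov(Psi, rowvar=False)       # empirical covariance of psi
    Sigma += 1e-6 * np.eye(Sigma.shape[0])  # small ridge for numerical stability
    w = np.linalg.solve(Sigma, mu)          # w proportional to Sigma^{-1} mu
    return w / np.linalg.norm(w)            # scale does not affect the z-score
```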
Inverting Other Statistical Tests • Most statistical tests use a test statistic • E.g., z-score • Rank test: % concordant pairs in confidence ranking (ROC area) • We optimized using logistic regression
Recap • Collect training data • Pairs of retrieval functions A & B with known A > B • Interleave them, collect usage logs • Build features • For each query session q: ψq • Query session score: wᵀψq (positive favors A) • Train w to optimize the test statistic • z-Test: maximize mean(wᵀψ) / std(wᵀψ)
Experiment Setup • Data collection • Pool of retrieval functions • Hash users into partitions • Run interleaving of different pairs in parallel • Collected on arXiv.org • 2 pools of retrieval-function pairs • Training pool (6 pairs): A > B known • New pool (12 pairs)
Experimental Results • Inverse z-Test works well • Beats the baseline on most of the new interleaving pairs • Directions of all tests in agreement • In 6/12 pairs, at p = 0.1, reduces the required sample size by 10% • In 4/12 pairs, achieves p = 0.05 where the baseline does not • 400 to 650 queries per interleaving experiment • Weights hard to interpret (features correlated) • Largest weight: "1 if single click & rank > 1"
Conclusion • Principled, offers practical benefits • Should perform better with more training data • Can be applied to other application domains • Limitations: • Treats training data as one sample • Might not work well when test data is very different from training data • Susceptible to adversarial behavior
Training Logistic Regression • Need to mirror the training data (sketch below) • E.g., ψq with label 1 and −ψq with label 0 • Otherwise, the model learns a trivial solution that always predicts 1
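A sketch of this mirroring with scikit-learn; the toy ψ data is fabricated for illustration. With the negated copies included, the problem is symmetric, so the intercept is redundant and can be dropped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Psi = rng.normal(size=(200, 2)) + np.array([0.3, 0.1])  # toy session vectors, A > B

# Mirror: psi_q labeled 1, -psi_q labeled 0. With only positive labels,
# a constant "always predict 1" model would fit the data perfectly.
X = np.vstack([Psi, -Psi])
y = np.concatenate([np.ones(len(Psi)), np.zeros(len(Psi))])

clf = LogisticRegression(fit_intercept=False)  # symmetry makes an intercept useless
clf.fit(X, y)
w = clf.coef_.ravel()                          # learned click-weight vector w
print(w)
```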