A Support Vector Method for Optimizing Average Precision SIGIR 2007 Yisong Yue Cornell University In Collaboration With: Thomas Finley, Filip Radlinski, Thorsten Joachims (Cornell University)
Motivation • Learn to Rank Documents • Optimize for IR performance measures • Mean Average Precision • Leverage Structural SVMs [Tsochantaridis et al. 2005]
MAP vs Accuracy • Average precision (AP) is the average of the precision scores at the rank positions of the relevant documents. (The example ranking on this slide was a figure and is omitted here.) • Mean Average Precision (MAP) is the mean of the AP scores over a group of queries. • A machine learning algorithm optimizing for accuracy may learn a very different model than one optimizing for MAP. • Ex: the slide's second example ranking has an AP of about 0.64, but a maximum accuracy of 0.8, versus 0.6 for the ranking above it.
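The AP and MAP computations described above can be sketched directly; the function names are illustrative, not from the paper:

```python
def average_precision(rels):
    """AP of one ranked list; rels[i] is 1 if the document at rank i+1 is relevant."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank          # precision at this relevant document
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: the mean of the AP scores over a group of queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For example, `average_precision([1, 0, 1, 1, 0])` averages the precisions 1/1, 2/3, and 3/4 observed at the three relevant documents.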
Recent Related Work Greedy & Local Search • [Metzler & Croft 2005] – optimized for MAP using gradient descent; expensive for large numbers of features. • [Caruana et al. 2004] – iteratively built an ensemble to greedily improve arbitrary performance measures. Surrogate Performance Measures • [Burges et al. 2005] – used neural nets optimizing for cross entropy. • [Cao et al. 2006] – used SVMs optimizing for a modified ROC-Area. Relaxations • [Xu & Li 2007] – used Boosting with an exponential loss relaxation.
Conventional SVMs • Input examples denoted by x (high-dimensional points) • Output targets denoted by y (either +1 or −1) • SVMs learn a hyperplane w; predictions are sign(wᵀx) • Training finds the w which minimizes ½‖w‖² + C Σᵢ ξᵢ subject to yᵢ(wᵀxᵢ) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for every training example i • The sum of slacks Σᵢ ξᵢ upper-bounds the accuracy loss (the number of training errors)
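As a rough illustration of the objective above, here is a minimal subgradient-descent sketch of the soft-margin linear SVM; a real SVM package solves the dual quadratic program instead, and the hyperparameters here are made up:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        viol = margins < 1                                  # nonzero hinge loss
        grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w
```

Predictions are then `np.sign(X @ w)`, matching the sign(wᵀx) rule on the slide.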
Adapting to Average Precision • Let x denote the set of documents for a query • Let y denote a (weak) ranking, encoded pairwise: y_ij ∈ {−1, 0, +1} indicates whether document i is ranked below, tied with, or above document j • Same objective function: minimize ½‖w‖² + C Σᵢ ξᵢ • Constraints are defined for each incorrect labeling y′ over the set of documents x • Each constraint requires the joint discriminant score of the correct labeling to be at least as large as that of the incorrect labeling plus the performance loss
Adapting to Average Precision • Minimize ½‖w‖² + C Σᵢ ξᵢ subject to wᵀΨ(xᵢ, yᵢ) ≥ wᵀΨ(xᵢ, y′) + Δ(yᵢ, y′) − ξᵢ for every incorrect ranking y′, where Ψ(x, y) is the joint feature map (a sum of y_jk(φ(x_j) − φ(x_k)) over relevant/non-relevant document pairs j, k) and Δ(y, y′) = 1 − AP(y′) is the MAP loss • The sum of slacks upper-bounds the MAP loss • After learning w, a prediction is made by sorting the documents on wᵀxᵢ
Too Many Constraints! • For average precision, the true labeling is a ranking with all the relevant documents ranked at the front • An incorrect labeling is any other ranking (the example rankings on this slide were figures and are omitted here) • The slide's example incorrect ranking has an average precision of about 0.8, so Δ(y, y′) ≈ 0.2 • There is an exponential number of rankings, and thus an exponential number of constraints!
Structural SVM Training • STEP 1: Solve the SVM objective function using only the current working set of constraints. • STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints. • STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by more than some small constant, add it to the working set. • Repeat STEPS 1–3 until no additional constraints are added, then return the most recent model trained in STEP 1. STEPS 1–3 are guaranteed to loop for at most a polynomial number of iterations. [Tsochantaridis et al. 2005]
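The loop in STEPS 1–3 can be sketched generically. The three callables stand in for the actual QP solver and constraint oracle, which depend on the loss and feature map; all names here are illustrative:

```python
def cutting_plane_train(solve_qp, most_violated, violation, eps=1e-3, max_iter=1000):
    """Working-set loop: solve_qp(W) fits a model to working set W (STEP 1),
    most_violated(model) returns the worst overall constraint (STEP 2), and
    violation(model, c) measures how violated constraint c is under model."""
    working_set = []
    model = solve_qp(working_set)
    for _ in range(max_iter):
        c = most_violated(model)                                   # STEP 2
        worst_in_set = max((violation(model, ci) for ci in working_set),
                           default=0.0)
        if violation(model, c) <= worst_in_set + eps:              # STEP 3 test
            break                                                  # converged
        working_set.append(c)
        model = solve_qp(working_set)                              # STEP 1
    return model
```

The structure mirrors the slide: only constraints that are substantially more violated than anything already in the working set get added, so the working set stays small.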
Illustrative Example Original SVM Problem • Exponential number of constraints • Most are dominated by a small set of “important” constraints Structural SVM Approach • Repeatedly finds the next most violated constraint… • …until the set of constraints is a good approximation.
Finding Most Violated Constraint • Structural SVM is an oracle framework. • Requires subroutine to find the most violated constraint. • Dependent on formulation of loss function and joint feature representation. • Exponential number of constraints! • Efficient algorithm in the case of optimizing MAP.
Finding Most Violated Constraint Observation • MAP is invariant to the order of documents within a relevance class • Swapping two relevant (or two non-relevant) documents does not change MAP • The joint SVM score is optimized by keeping each relevance class sorted by document score wᵀx • The problem reduces to finding an interleaving between two sorted lists of documents
Finding Most Violated Constraint • Start with the perfect ranking • Consider swapping adjacent relevant/non-relevant documents • Find the best feasible position for the non-relevant document • Repeat for the next non-relevant document • Never swap past the previous non-relevant document • Repeat until all non-relevant documents have been considered
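The greedy swapping procedure above is the paper's efficient algorithm. As a hedged illustration only, the sketch below finds the most violated constraint by brute force over all interleavings of the two score-sorted lists, which makes "most violated" concrete but is exponential in general; all names are illustrative:

```python
from itertools import combinations

def _average_precision(rels):
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def most_violated_ranking(scores_rel, scores_non):
    """Brute force: try every interleaving of the score-sorted relevant and
    non-relevant lists and return the one maximizing Delta(y, y') + w.Psi(x, y').
    (The constant term -w.Psi(x, y) does not affect the argmax.)"""
    R, N = len(scores_rel), len(scores_non)
    best_rels, best_viol = None, float("-inf")
    for rel_positions in combinations(range(R + N), R):
        ranking, ri, ni = [], 0, 0
        for pos in range(R + N):
            if pos in rel_positions:
                ranking.append((1, scores_rel[ri])); ri += 1
            else:
                ranking.append((0, scores_non[ni])); ni += 1
        rels = [lab for lab, _ in ranking]
        delta = 1.0 - _average_precision(rels)       # MAP loss of this ranking
        # Pairwise discriminant: +1 if the relevant doc is ranked above the
        # non-relevant doc, -1 otherwise, weighted by the score difference.
        disc = sum((1 if i < j else -1) * (si - sj)
                   for i, (li, si) in enumerate(ranking) if li == 1
                   for j, (lj, sj) in enumerate(ranking) if lj == 0)
        if delta + disc > best_viol:
            best_viol, best_rels = delta + disc, rels
    return best_rels, best_viol
```

For instance, if the model scores a non-relevant document above a relevant one, the most violated constraint places the non-relevant document first.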
Quick Recap SVM Formulation • SVMs optimize a tradeoff between model complexity and MAP loss • Exponential number of constraints (one for each incorrect ranking) • Structural SVMs find a small subset of important constraints • Requires a sub-procedure to find the most violated constraint Finding the Most Violated Constraint • The loss function is invariant to re-ordering within the relevant (and non-relevant) documents • The SVM score imposes an ordering on the documents within each class • Reduces to finding an interleaving of two sorted lists • The loss function has certain monotonic properties • Yields an efficient algorithm
Experiments • Used TREC 9 & 10 Web Track corpus. • Features of document/query pairs computed from outputs of existing retrieval functions. (Indri Retrieval Functions & TREC Submissions) • Goal is to learn a recombination of outputs which improves mean average precision.
Moving Forward • Approach also works (in theory) for other measures. • Some promising results when optimizing for NDCG (with only 1 level of relevance). • Currently working on optimizing for NDCG with multiple levels of relevance. • Preliminary MRR results not as promising.
Conclusions • Principled approach to optimizing average precision (avoids difficult-to-control heuristics) • Performs at least as well as alternative SVM methods • Can be generalized to a large class of rank-based performance measures • Software available at http://svmrank.yisongyue.com