A fast algorithm for learning large scale preference relations • Vikas C. Raykar and Ramani Duraiswami, University of Maryland, College Park • Balaji Krishnapuram, Siemens Medical Solutions, USA • AISTATS 2007
Learning Many learning tasks can be viewed as function estimation.
Learning from examples Not all supervised learning procedures fit the standard classification/regression framework. In this talk we are mainly concerned with ranking/ordering. [Diagram: training data → learning algorithm → learned function]
Ranking / Ordering For some applications ordering is more important Example 1: Information retrieval Sort in the order of relevance
Ranking / Ordering For some applications ordering is more important Example 2: Recommender systems Sort in the order of preference
Ranking / Ordering For some applications ordering is more important Example 3: Medical decision making Decide over different treatment options
Plan of the talk • Ranking formulation • Algorithm • Fast algorithm • Results
Preference relations Training data – a set of pairwise preferences. Goal – learn a preference relation. Given a preference relation we can order/rank a set of instances.
Ranking function Provides a numerical score; it induces the preference relation but is not unique. Goal – learn a preference relation. New goal – learn a ranking function. Why not use a classifier/ordinal regressor as the ranking function?
Why is ranking different? [Diagram: training data = pairwise preference relations → learning algorithm → ranking function; loss = pairwise disagreements]
Training data, more formally From these two we can get a set of pairwise preference relations.
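To make the construction concrete, here is one standard way to write it (an assumed formalization, since the slide's equations were lost in extraction: each training instance x_i carries an ordinal label y_i, and a pair is preferred whenever its label is larger):

$$\mathcal{P} \;=\; \bigl\{(i,j) \;:\; y_i > y_j\bigr\}, \qquad (i,j)\in\mathcal{P} \;\Longleftrightarrow\; x_i \succ x_j .$$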
Loss function Minimize the fraction of pairwise disagreements, i.e. maximize the fraction of pairwise agreements: (total # of pairwise agreements) / (total # of pairwise preference relations) = the generalized Wilcoxon-Mann-Whitney (WMW) statistic.
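Written out with the pair set $\mathcal{P}$ from the sketch above, the WMW statistic of a ranking function f is the fraction of correctly ordered pairs:

$$\mathrm{WMW}(f) \;=\; \frac{\sum_{(i,j)\in\mathcal{P}} \mathbf{1}\bigl[f(x_i) > f(x_j)\bigr]}{|\mathcal{P}|}.$$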
Consider a two-class problem. [Figure: positive (+) and negative (−) instances]
Function class: linear ranking function • Different algorithms use different function classes • RankNet – neural network • RankSVM – RKHS • RankBoost – boosted decision stumps
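For the linear function class used here, the ranking function and the ordering it induces are simply

$$f(x) \;=\; w^{\top}x, \qquad x_i \succ x_j \;\Longleftrightarrow\; w^{\top}x_i > w^{\top}x_j .$$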
Plan of the talk • Ranking formulation • Training data – Pairwise preference relations • Ideal Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Fast algorithm • Results
The likelihood Maximizing the WMW statistic directly is a discrete optimization problem. Instead, model each pairwise preference with a sigmoid [Burges et al.]. Assumption: every pair is drawn independently. Choose w to maximize the log-likelihood.
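A RankNet-style way to write this model (an assumed form, with the sigmoid applied to the score difference as in Burges et al.):

$$\Pr(x_i \succ x_j) \;=\; \sigma\bigl(w^{\top}(x_i - x_j)\bigr), \quad \sigma(z)=\frac{1}{1+e^{-z}}, \qquad \log L(w) \;=\; \sum_{(i,j)\in\mathcal{P}} \log\sigma\bigl(w^{\top}(x_i - x_j)\bigr).$$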
Another interpretation What we want to maximize: the 0-1 indicator function (one term per pairwise agreement). What we actually maximize: the log-sigmoid. The log-sigmoid is a lower bound for the indicator function.
Lower bounding the WMW Log-likelihood ≤ WMW
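One way to make this precise (an assumed normalization; the paper may scale the bound differently): with the base-2 logarithm, 1 + log₂ σ(z) ≤ 1[z > 0] for every z, so averaging over the pairs lower-bounds the WMW statistic:

$$\mathrm{WMW}(f) \;\ge\; \frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}} \Bigl(1 + \log_2 \sigma\bigl(w^{\top}(x_i - x_j)\bigr)\Bigr).$$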
Gradient based learning • Use a nonlinear conjugate-gradient algorithm. • Requires only gradient evaluations. • No function evaluations. • No second derivatives. • The gradient is a sum over all pairwise preference relations (see the sketch below).
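A minimal sketch (not the authors' code) of the quadratic-time objective/gradient and a nonlinear CG fit, assuming the log-sigmoid model written above; note that SciPy's CG line search also uses function values, unlike the gradient-only scheme mentioned on the slide. The data, pair construction, and tolerances here are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(w, X, pairs):
    """Negative log-sigmoid objective and its gradient, summed over all preference pairs."""
    obj, grad = 0.0, np.zeros_like(w)
    for i, j in pairs:                       # (i, j) means x_i is preferred over x_j
        d = X[i] - X[j]                      # difference of feature vectors
        s = 1.0 / (1.0 + np.exp(-(w @ d)))   # sigmoid of the score difference
        obj -= np.log(s + 1e-12)             # accumulate negative log-likelihood
        grad -= (1.0 - s) * d                # its gradient; one term per pair -> O(#pairs)
    return obj, grad

# Illustrative usage on random two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
pairs = [(i, j) for i in range(100) for j in range(100) if y[i] > y[j]]
res = minimize(neg_log_likelihood_and_grad, np.zeros(5), args=(X, pairs),
               jac=True, method='CG', options={'gtol': 1e-3})
w_hat = res.x                                # learned linear ranking weights
```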
RankNet: training data = pairwise preference relations, loss = cross entropy, function class = neural network trained by backpropagation.
RankSVM: training data = pairwise preference relations, loss = pairwise disagreements (SVM), function class = RKHS.
RankBoost: training data = pairwise preference relations, loss = pairwise disagreements (boosting), function class = decision stumps.
Plan of the talk • Ranking formulation • Training data – Pairwise preference relations • Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Maximize a lower bound on WMW • Use conjugate-gradient • Quadratic complexity • Fast algorithm • Results
Key idea • Use an approximate gradient. • It can be computed extremely fast, in linear time. • The optimization converges to the same solution. • It requires only a few more iterations.
Core computational primitive Weighted summation of erfc functions
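A minimal sketch of this primitive in its direct, quadratic-time form, assuming it has the generic shape E(y_j) = Σ_i q_i · erfc(y_j − x_i); any scaling of the arguments used in the paper is absorbed into x and y here.

```python
import numpy as np
from scipy.special import erfc

def weighted_erfc_sum_direct(x, q, y):
    """Direct evaluation: for each target y_j, sum q_i * erfc(y_j - x_i) over all N sources."""
    # Broadcasting forms the full M x N matrix of differences -> O(MN) time and memory.
    return (q[None, :] * erfc(y[:, None] - x[None, :])).sum(axis=1)
```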
1. Beaulieu's series expansion • Retain only the first few terms contributing to the desired accuracy. • Derive bounds on the truncation error to choose the number of terms.
2. Regrouping • The coefficients A and B do not depend on y and can be precomputed in O(pN). • Once A and B are precomputed, the sum at all M targets can be computed in O(pM). • Total cost is reduced from O(MN) to O(p(M+N)).
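A sketch of the regrouping trick only, not the authors' implementation. It assumes a truncated sine-series approximation of the form erfc(z) ≈ 1 − (4/π) Σ_{n odd} (e^{−n²h²}/n) sin(2nhz); the exact constants, the choice of h and of the truncation order p, and the truncation/space-subdivision corrections follow the technical report.

```python
import numpy as np

def weighted_erfc_sum_fast(x, q, y, h=0.5, p=10):
    """Approximate sum_i q_i * erfc(y_j - x_i) in O(p(N + M)) via series expansion + regrouping."""
    n = 2 * np.arange(p) + 1                    # odd series indices 1, 3, ..., 2p-1
    coef = np.exp(-(n * h) ** 2) / n            # series coefficients (assumed form)

    # Source-dependent sums A_n, B_n: O(pN), independent of the targets y.
    A = (q[None, :] * np.cos(2 * h * n[:, None] * x[None, :])).sum(axis=1)
    B = (q[None, :] * np.sin(2 * h * n[:, None] * x[None, :])).sum(axis=1)

    # Evaluation at all M targets: O(pM), using sin(a-b) = sin(a)cos(b) - cos(a)sin(b).
    sy = np.sin(2 * h * n[:, None] * y[None, :])
    cy = np.cos(2 * h * n[:, None] * y[None, :])
    series = (coef[:, None] * (sy * A[:, None] - cy * B[:, None])).sum(axis=0)
    return q.sum() - (4.0 / np.pi) * series
```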
3. Other tricks • Exploit the rapid saturation of the erfc function. • Space subdivision. • Choose the parameters to achieve the desired error bound. • See the technical report for details.
Plan of the talk • Ranking formulation • Training data – Pairwise preference relations • Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Maximize a lower bound on WMW • Use conjugate-gradient • Quadratic complexity • Fast algorithm • Use fast approximate gradient • Fast summation of erfc functions • Results
Datasets • 12 public benchmark datasets • Five-fold cross-validation experiments • CG tolerance 1e-3 • Accuracy for the gradient computation 1e-6
Direct vs. fast: WMW statistic. The WMW is similar for both the exact and the fast approximate versions.
Comparison with other methods • RankNet - Neural network • RankSVM - SVM • RankBoost - Boosting
Comparison with other methods • The WMW statistic is similar for all the methods. • The proposed method is faster than all the other methods. • RankBoost shows the next best running time. • Only the proposed method can handle large datasets.
Application to collaborative filtering • Predict movie ratings for a user based on the ratings provided by other users. • MovieLens dataset (www.grouplens.org) • 1 million ratings (on a 1–5 scale) • 3592 movies • 6040 users • Feature vector for each movie – the ratings provided by d other users
Plan/Conclusion of the talk • Ranking formulation • Training data – Pairwise preference relations • Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Maximize a lower bound on WMW • Use conjugate-gradient • Quadratic complexity • Fast algorithm • Use fast approximate gradient • Fast summation of erfc functions • Results • Similar accuracy to other methods • But much, much faster
Future work • Nonlinear/kernelized variation • Other applications: neural networks, probit regression • Code coming soon