A fast algorithm for learning large scale preference relations • Vikas C. Raykar and Ramani Duraiswami, University of Maryland, College Park • Balaji Krishnapuram, Siemens Medical Solutions, USA • AISTATS 2007
Learning Many learning tasks can be viewed as function estimation.
Learning from examples Not all supervised learning procedures fit the standard classification/regression framework. In this talk we are mainly concerned with ranking/ordering. [Diagram: training data → learning algorithm → learned function]
Ranking / Ordering For some applications ordering is more important Example 1: Information retrieval Sort in the order of relevance
Ranking / Ordering For some applications ordering is more important Example 2: Recommender systems Sort in the order of preference
Ranking / Ordering For some applications ordering is more important Example 3: Medical decision making Decide over different treatment options
Plan of the talk • Ranking formulation • Algorithm • Fast algorithm • Results
Preference relations Training data – a set of pairwise preferences. Goal – learn a preference relation. Given a preference relation we can order/rank a set of instances.
Ranking function Provides a numerical score; it induces the preference relation but is not unique. Goal – learn a preference relation. New goal – learn a ranking function. Why not use a classifier/ordinal regressor as the ranking function?
Why is ranking different? [Diagram: training data = pairwise preference relations → learning algorithm → ranking function; loss = pairwise disagreements]
Training data, more formally From these two we can get a set of pairwise preference relations.
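To make the construction concrete, here is one standard way to write it (an assumed formalization, since the slide's equations were lost in extraction: each training instance x_i carries an ordinal label y_i, and a pair is preferred whenever its label is larger):

$$\mathcal{P} \;=\; \bigl\{(i,j) \;:\; y_i > y_j\bigr\}, \qquad (i,j)\in\mathcal{P} \;\Longleftrightarrow\; x_i \succ x_j .$$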
Loss function Minimize the fraction of pairwise disagreements, i.e. maximize the fraction of pairwise agreements: (total # of pairwise agreements) / (total # of pairwise preference relations) = the generalized Wilcoxon-Mann-Whitney (WMW) statistic.
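Written out with the pair set $\mathcal{P}$ from the sketch above, the WMW statistic of a ranking function f is the fraction of correctly ordered pairs:

$$\mathrm{WMW}(f) \;=\; \frac{\sum_{(i,j)\in\mathcal{P}} \mathbf{1}\bigl[f(x_i) > f(x_j)\bigr]}{|\mathcal{P}|}.$$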
Consider a two-class problem. [Figure: positive (+) and negative (−) instances]
Function class: linear ranking function • Different algorithms use different function classes • RankNet – neural network • RankSVM – RKHS • RankBoost – boosted decision stumps
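For the linear function class used here, the ranking function and the ordering it induces are simply

$$f(x) \;=\; w^{\top}x, \qquad x_i \succ x_j \;\Longleftrightarrow\; w^{\top}x_i > w^{\top}x_j .$$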
Plan of the talk • Ranking formulation • Training data – Pairwise preference relations • Ideal Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Fast algorithm • Results
The likelihood Maximizing the WMW statistic directly is a discrete optimization problem. Instead, model each pairwise preference with a sigmoid [Burges et al.]. Assumption: every pair is drawn independently. Choose w to maximize the log-likelihood.
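A RankNet-style way to write this model (an assumed form, with the sigmoid applied to the score difference as in Burges et al.):

$$\Pr(x_i \succ x_j) \;=\; \sigma\bigl(w^{\top}(x_i - x_j)\bigr), \quad \sigma(z)=\frac{1}{1+e^{-z}}, \qquad \log L(w) \;=\; \sum_{(i,j)\in\mathcal{P}} \log\sigma\bigl(w^{\top}(x_i - x_j)\bigr).$$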
Another interpretation What we want to maximize: the 0-1 indicator function (one term per pairwise agreement). What we actually maximize: the log-sigmoid. The log-sigmoid is a lower bound for the indicator function.
Lower bounding the WMW Log-likelihood ≤ WMW
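One way to make this precise (an assumed normalization; the paper may scale the bound differently): with the base-2 logarithm, 1 + log₂ σ(z) ≤ 1[z > 0] for every z, so averaging over the pairs lower-bounds the WMW statistic:

$$\mathrm{WMW}(f) \;\ge\; \frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}} \Bigl(1 + \log_2 \sigma\bigl(w^{\top}(x_i - x_j)\bigr)\Bigr).$$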
Gradient based learning • Use a nonlinear conjugate-gradient algorithm. • Requires only gradient evaluations. • No function evaluations. • No second derivatives. • The gradient is a sum over all pairwise preference relations (see the sketch below).
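A minimal sketch (not the authors' code) of the quadratic-time objective/gradient and a nonlinear CG fit, assuming the log-sigmoid model written above; note that SciPy's CG line search also uses function values, unlike the gradient-only scheme mentioned on the slide. The data, pair construction, and tolerances here are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(w, X, pairs):
    """Negative log-sigmoid objective and its gradient, summed over all preference pairs."""
    obj, grad = 0.0, np.zeros_like(w)
    for i, j in pairs:                       # (i, j) means x_i is preferred over x_j
        d = X[i] - X[j]                      # difference of feature vectors
        s = 1.0 / (1.0 + np.exp(-(w @ d)))   # sigmoid of the score difference
        obj -= np.log(s + 1e-12)             # accumulate negative log-likelihood
        grad -= (1.0 - s) * d                # its gradient; one term per pair -> O(#pairs)
    return obj, grad

# Illustrative usage on random two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
pairs = [(i, j) for i in range(100) for j in range(100) if y[i] > y[j]]
res = minimize(neg_log_likelihood_and_grad, np.zeros(5), args=(X, pairs),
               jac=True, method='CG', options={'gtol': 1e-3})
w_hat = res.x                                # learned linear ranking weights
```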
RankNet: training data = pairwise preference relations, loss = cross entropy, function class = neural network trained by backpropagation.
RankSVM: training data = pairwise preference relations, loss = pairwise disagreements (SVM), function class = RKHS.
RankBoost: training data = pairwise preference relations, loss = pairwise disagreements (boosting), function class = decision stumps.
Plan of the talk • Ranking formulation • Training data – Pairwise preference relations • Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Maximize a lower bound on WMW • Use conjugate-gradient • Quadratic complexity • Fast algorithm • Results
Key idea • Use an approximate gradient. • It can be computed extremely fast, in linear time. • The optimization converges to the same solution. • It requires only a few more iterations.
Core computational primitive Weighted summation of erfc functions
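A minimal sketch of this primitive in its direct, quadratic-time form, assuming it has the generic shape E(y_j) = Σ_i q_i · erfc(y_j − x_i); any scaling of the arguments used in the paper is absorbed into x and y here.

```python
import numpy as np
from scipy.special import erfc

def weighted_erfc_sum_direct(x, q, y):
    """Direct evaluation: for each target y_j, sum q_i * erfc(y_j - x_i) over all N sources."""
    # Broadcasting forms the full M x N matrix of differences -> O(MN) time and memory.
    return (q[None, :] * erfc(y[:, None] - x[None, :])).sum(axis=1)
```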
1. Beaulieu's series expansion • Retain only the first few terms contributing to the desired accuracy. • Derive bounds on the truncation error to choose the number of terms.
2. Regrouping • The coefficients A and B do not depend on y and can be precomputed in O(pN). • Once A and B are precomputed, the sum at all M targets can be computed in O(pM). • Total cost is reduced from O(MN) to O(p(M+N)).
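A sketch of the regrouping trick only, not the authors' implementation. It assumes a truncated sine-series approximation of the form erfc(z) ≈ 1 − (4/π) Σ_{n odd} (e^{−n²h²}/n) sin(2nhz); the exact constants, the choice of h and of the truncation order p, and the truncation/space-subdivision corrections follow the technical report.

```python
import numpy as np

def weighted_erfc_sum_fast(x, q, y, h=0.5, p=10):
    """Approximate sum_i q_i * erfc(y_j - x_i) in O(p(N + M)) via series expansion + regrouping."""
    n = 2 * np.arange(p) + 1                    # odd series indices 1, 3, ..., 2p-1
    coef = np.exp(-(n * h) ** 2) / n            # series coefficients (assumed form)

    # Source-dependent sums A_n, B_n: O(pN), independent of the targets y.
    A = (q[None, :] * np.cos(2 * h * n[:, None] * x[None, :])).sum(axis=1)
    B = (q[None, :] * np.sin(2 * h * n[:, None] * x[None, :])).sum(axis=1)

    # Evaluation at all M targets: O(pM), using sin(a-b) = sin(a)cos(b) - cos(a)sin(b).
    sy = np.sin(2 * h * n[:, None] * y[None, :])
    cy = np.cos(2 * h * n[:, None] * y[None, :])
    series = (coef[:, None] * (sy * A[:, None] - cy * B[:, None])).sum(axis=0)
    return q.sum() - (4.0 / np.pi) * series
```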
3. Other tricks • Exploit the rapid saturation of the erfc function. • Space subdivision. • Choose the parameters to achieve the desired error bound. • See the technical report for details.
Plan of the talk • Ranking formulation • Training data – Pairwise preference relations • Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Maximize a lower bound on WMW • Use conjugate-gradient • Quadratic complexity • Fast algorithm • Use fast approximate gradient • Fast summation of erfc functions • Results
Datasets • 12 public benchmark datasets • Five-fold cross-validation experiments • CG tolerance 1e-3 • Accuracy for the gradient computation 1e-6
Direct vs. fast: WMW statistic. The WMW is similar for both the exact and the fast approximate versions.
Comparison with other methods • RankNet - Neural network • RankSVM - SVM • RankBoost - Boosting
Comparison with other methods • The WMW statistic is similar for all the methods. • The proposed method is faster than all the other methods. • RankBoost shows the next best running time. • Only the proposed method can handle large datasets.
Application to collaborative filtering • Predict movie ratings for a user based on the ratings provided by other users. • MovieLens dataset (www.grouplens.org) • 1 million ratings (on a 1–5 scale) • 3592 movies • 6040 users • Feature vector for each movie – the ratings provided by d other users
Plan/Conclusion of the talk • Ranking formulation • Training data – Pairwise preference relations • Loss function – WMW statistic • Function class – linear ranking functions • Algorithm • Maximize a lower bound on WMW • Use conjugate-gradient • Quadratic complexity • Fast algorithm • Use fast approximate gradient • Fast summation of erfc functions • Results • Similar accuracy to other methods • But much, much faster
Future work • Nonlinear/kernelized variation • Other applications: neural networks, probit regression • Code coming soon