
Learning to Rank: From Pairwise Approach to Listwise Approach



Presentation Transcript


  1. Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date: 2009/12/09 Published in ICML 2007

  2. Contents • Introduction • Pairwise approach • Listwise approach • Probability models • Permutation probability • Top k probability • Learning method: ListNet • Experiments • Conclusions and future work

  3. Introduction • Learning to rank: • Ranking objects for some queries • Document retrieval, expert finding, anti web spam, and product ratings, etc. • Learning to rank methods: • Pointwise approach • Pairwise approach • Listwise approach

  4. Pairwise approach (1/2) • Training samples: document pairs • Learning task: classification of object pairs into 2 categories (correctly ranked or incorrectly ranked) • Methods: • RankSVM (Herbrich et al., 1999) • RankBoost (Freund et al., 1998) • RankNet (Burges et al., 2005)

  5. Pairwise approach (2/2) • Advantages: • Existing classification methods can be applied directly • Training instances of document pairs are easy to obtain • E.g. click-through data from users (Joachims, 2002) • Problems… • The learning objective is to minimize errors in classifying document pairs, not to minimize errors in ranking documents • The assumption of i.i.d. generated document pairs is too strong • The number of document pairs varies widely from query to query, resulting in models biased towards queries with more document pairs

  6. Listwise approach (1/2) • Training samples: document lists • Listwise loss function • Represents the difference between the ranking list output by the ranking model and the ground truth ranking list • Probabilistic methods + cross-entropy • Permutation probability • Top k probability • Ranking model: neural network • Optimization algorithm: gradient descent

  7. Listwise approach (2/2) • Listwise framework (diagram): for each query, documents are represented as feature vectors; the model generates a score for each document, and the listwise loss function compares the model-generated score list against the ground-truth relevance scores

  8. Probability models • Map a list of scores to a probability distribution • Permutation probability • Top k probability • Take any metric between probability distributions as a loss function • Cross-entropy

  9. Permutation probability (1/6) • $n$ objects are to be ranked • A permutation $\pi = \langle \pi(1), \pi(2), \ldots, \pi(n) \rangle$ = a ranking order of the $n$ objects • $\Omega_n$ = the set of all possible permutations of the $n$ objects • A list of scores $s = (s_1, s_2, \ldots, s_n)$

  10. Permutation probability (2/6) • Permutation probability is defined as: $P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}$ where $\phi$ = an increasing and strictly positive function, $s_{\pi(j)}$ = the score of the object at position $j$ of permutation $\pi$ • For example ($n = 3$): $P_s(\pi) = \frac{\phi(s_{\pi(1)})}{\phi(s_{\pi(1)}) + \phi(s_{\pi(2)}) + \phi(s_{\pi(3)})} \cdot \frac{\phi(s_{\pi(2)})}{\phi(s_{\pi(2)}) + \phi(s_{\pi(3)})} \cdot \frac{\phi(s_{\pi(3)})}{\phi(s_{\pi(3)})}$

  11. Permutation probability (3/6) • The permutation probabilities form a probability distribution over $\Omega_n$ • $P_s(\pi) > 0$ and $\sum_{\pi \in \Omega_n} P_s(\pi) = 1$ • The permutation with larger elements in front has higher probability • If $s_1 > s_2 > \cdots > s_n$: • $\langle 1, 2, \ldots, n \rangle$ has the highest probability • $\langle n, n-1, \ldots, 1 \rangle$ has the lowest probability

  12. Permutation probability (4/6) • Example: 3 objects with scores 3, 5, 10
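The distribution for this example can be computed directly from the definition, taking phi = exp as in the paper. A minimal sketch (the function name is illustrative):

```python
import math
from itertools import permutations

def permutation_probability(scores, pi, phi=math.exp):
    # P_s(pi) = product over positions j of phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k)),
    # with phi an increasing, strictly positive function (exp here).
    prob = 1.0
    for j in range(len(pi)):
        prob *= phi(scores[pi[j]]) / sum(phi(scores[k]) for k in pi[j:])
    return prob

scores = [3, 5, 10]  # the three objects from the slide (0-based indices 0, 1, 2)
dist = {pi: permutation_probability(scores, pi) for pi in permutations(range(3))}

# The probabilities sum to 1; the permutation (2, 1, 0), which puts the
# largest score first, is most likely, and its reverse is least likely.
for pi, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(pi, round(p, 4))
```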

  13. Permutation probability (5/6) • For a linear function $\phi(x) = \alpha x$ ($\alpha > 0$), the permutation probability is scale invariant: $P_{\lambda s}(\pi) = P_s(\pi)$, where $\lambda > 0$ • For an exponential function $\phi(x) = \exp(x)$, the permutation probability is translation invariant: $P_{s+\mu}(\pi) = P_s(\pi)$, where $\mu$ is any real constant
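The invariance properties on this slide are easy to check numerically; a small sketch, using phi(x) = x (alpha = 1) for the linear case and phi(x) = exp(x) for the exponential case:

```python
import math
from itertools import permutations

def perm_prob(scores, pi, phi):
    # P_s(pi) = product over positions j of phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k))
    prob = 1.0
    for j in range(len(pi)):
        prob *= phi(scores[pi[j]]) / sum(phi(scores[k]) for k in pi[j:])
    return prob

scores = [3.0, 5.0, 10.0]
for pi in permutations(range(3)):
    # Linear phi: scaling every score by lambda = 2 leaves P_s(pi) unchanged.
    assert abs(perm_prob(scores, pi, lambda x: x)
               - perm_prob([2.0 * s for s in scores], pi, lambda x: x)) < 1e-12
    # Exponential phi: shifting every score by mu = 7 leaves P_s(pi) unchanged.
    assert abs(perm_prob(scores, pi, math.exp)
               - perm_prob([s + 7.0 for s in scores], pi, math.exp)) < 1e-12
print("scale and translation invariance verified")
```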

  14. Permutation probability (6/6) • However… • The number of permutations is $n!$, so computing the full distribution is of order $O(n!)$ • The computation is intractable for large $n$ • Consider the top $k$ probability!

  15. Top k probability (1/4) • The probability of objects $(j_1, j_2, \ldots, j_k)$ (out of $n$ objects) being ranked in the top $k$ positions • The top $k$ subgroup $G_k(j_1, \ldots, j_k)$ is defined as the set containing all the permutations in which the top $k$ objects are exactly $(j_1, \ldots, j_k)$, in that order • $\mathcal{G}_k$ is the collection of all the top $k$ subgroups • $\mathcal{G}_k$ now has only $n!/(n-k)!$ elements $\ll n!$ • E.g. for 5 objects, the top 2 subgroup $G_2(1, 3)$ includes: {(1,3,2,4,5), (1,3,2,5,4), (1,3,4,2,5), (1,3,4,5,2), (1,3,5,2,4), (1,3,5,4,2)}

  16. Top k probability (2/4) • The top $k$ probability of objects $(j_1, \ldots, j_k)$ is defined as: $P_s(G_k(j_1, \ldots, j_k)) = \sum_{\pi \in G_k(j_1, \ldots, j_k)} P_s(\pi)$ • For example (5 objects): $P_s(G_2(1, 3)) = P_s(\langle 1,3,2,4,5 \rangle) + P_s(\langle 1,3,2,5,4 \rangle) + \cdots + P_s(\langle 1,3,5,4,2 \rangle)$ • Still needs to compute $n!$ permutations?

  17. Top k probability (3/4) • The top $k$ probability can be computed efficiently as: $P_s(G_k(j_1, \ldots, j_k)) = \prod_{t=1}^{k} \frac{\phi(s_{j_t})}{\sum_{l=t}^{n} \phi(s_{j_l})}$ where $s_{j_t}$ = the score of object $j_t$ (ranked at position $t$) and $j_{k+1}, \ldots, j_n$ denote the objects not in the top $k$ • For example (1, 3, x, x, x): $P_s(G_2(1, 3)) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \cdots + \phi(s_5)} \cdot \frac{\phi(s_3)}{\phi(s_2) + \phi(s_3) + \phi(s_4) + \phi(s_5)}$
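The closed form avoids enumerating permutations: at each of the k steps, the placed object's phi-score is divided by the phi-scores of all objects not yet placed. A sketch (illustrative names), checked against brute-force enumeration:

```python
import math
from itertools import permutations

def top_k_probability(scores, top, phi=math.exp):
    # P_s(G_k(j_1..j_k)) = product over t of phi(s_{j_t}) divided by the sum
    # of phi over object j_t and every object not yet placed.
    remaining = list(range(len(scores)))
    prob = 1.0
    for j in top:
        prob *= phi(scores[j]) / sum(phi(scores[k]) for k in remaining)
        remaining.remove(j)
    return prob

def perm_prob(scores, pi, phi=math.exp):
    return top_k_probability(scores, pi, phi)  # a full permutation is the k = n case

scores = [1.0, 0.5, 2.0, 0.2, 1.5]
# Brute force: sum full permutation probabilities over all orderings starting 0, 2.
brute = sum(perm_prob(scores, pi) for pi in permutations(range(5)) if pi[:2] == (0, 2))
fast = top_k_probability(scores, (0, 2))
print(abs(brute - fast) < 1e-12)  # the closed form matches the brute-force sum
```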

  18. Top k probability (4/4) • Top $k$ probabilities form a probability distribution over the collection $\mathcal{G}_k$ • The top $k$ subgroup with larger elements in front has higher top $k$ probability • Top $k$ probability is scale or translation invariant with a carefully designed function $\phi$

  19. Listwise loss function • Cross-entropy between the top $k$ distributions of two lists of scores: $L(y^{(i)}, z^{(i)}) = -\sum_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}}(g)$ where $i$ denotes the query, $y^{(i)}$ denotes the ground truth list of scores, and $z^{(i)}$ denotes the model-generated list of scores
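With k = 1 (the setting the paper uses in its experiments), the top k distribution reduces to a softmax over the scores when phi = exp, and the loss becomes a softmax cross-entropy. A minimal sketch with illustrative names:

```python
import math

def top1_distribution(scores):
    # With phi = exp, the top-1 probability of object j is
    # exp(s_j) / sum_k exp(s_k): a softmax over the score list.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(y_scores, z_scores):
    # L(y, z) = -sum_j P_y(top1 = j) * log P_z(top1 = j)
    return -sum(p * math.log(q) for p, q in
                zip(top1_distribution(y_scores), top1_distribution(z_scores)))

ground_truth = [3.0, 5.0, 10.0]
good = listwise_loss(ground_truth, [0.3, 0.5, 1.0])  # same ordering as ground truth
bad = listwise_loss(ground_truth, [1.0, 0.5, 0.3])   # reversed ordering
print(good < bad)  # True: correctly ordered scores incur a lower loss
```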

  20. Learning method: ListNet (1/2) • A learning to rank method that optimizes the listwise loss function based on top $k$ probability, with a neural network as the model and gradient descent as the optimization algorithm • $f_\omega$ denotes the ranking function based on the neural network model with parameter $\omega$ • For a given feature vector $x_j^{(i)}$, the ranking function gives a score $f_\omega(x_j^{(i)})$ • Score list $z^{(i)}(f_\omega) = (f_\omega(x_1^{(i)}), f_\omega(x_2^{(i)}), \ldots, f_\omega(x_{n^{(i)}}^{(i)}))$

  21. Learning method: ListNet (2/2) • Learning algorithm of ListNet: Input: training data $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ Parameter: number of iterations $T$ and learning rate $\eta$ Initialize parameter $\omega$ for $t = 1$ to $T$ do for $i = 1$ to $m$ do Input $x^{(i)}$ of query $i$ to the neural network and compute score list $z^{(i)}(f_\omega)$ with current $\omega$ Compute gradient $\Delta\omega = \partial L(y^{(i)}, z^{(i)}(f_\omega)) / \partial \omega$ Update $\omega \leftarrow \omega - \eta \times \Delta\omega$ end for end for Output neural network model $\omega$
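The training loop on this slide can be sketched in a few lines. This assumes a linear scoring function and the top-1 (softmax) loss, whose gradient with respect to the scores is simply P_z minus P_y; all names here are illustrative, not from the paper:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_listnet(queries, n_features, iterations, lr):
    # queries: list of (feature_vectors, ground_truth_scores) pairs, one per query.
    w = [0.0] * n_features  # linear scoring model: f_w(x) = w . x
    for _ in range(iterations):
        for xs, ys in queries:
            zs = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
            py, pz = softmax(ys), softmax(zs)
            # Top-1 cross-entropy gradient: dL/dw = sum_j (P_z(j) - P_y(j)) * x_j
            grad = [sum((pz[j] - py[j]) * xs[j][d] for j in range(len(xs)))
                    for d in range(n_features)]
            w = [wi - lr * g for wi, g in zip(w, grad)]  # gradient descent step
    return w

# Toy query: one feature, and the ground-truth score grows with it.
queries = [([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])]
w = train_listnet(queries, n_features=1, iterations=200, lr=0.1)
print(w[0] > 0)  # the model learned to rank higher-feature documents first
```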

  22. Experiments • ListNet compared with 3 pairwise methods: • RankNet • RankSVM • RankBoost • 3 datasets • TREC • OHSUMED • CSearch

  23. TREC dataset • .gov domain web pages in 2002 • 1,053,110 pages, 11,164,829 hyperlinks • 50 queries • Binary relevance judgment (relevant or irrelevant) • 20 features extracted from each query-document pair (e.g. content features and hyperlink features)

  24. OHSUMED dataset • A collection of documents and queries on medicine • 348,566 documents, 106 queries • 16,140 query-document pairs • Relevance judgment: definitely relevant, possibly relevant, not relevant • 30 features extracted for each query-document pair

  25. CSearch dataset • A dataset from a commercial search engine • About 25,000 queries with 1,000 documents associated with each query • About 600 features in total, including query-dependent and query-independent features • 5 levels of relevance judgment: 4 (perfect match) to 0 (bad match)

  26. Ranking performance measure (1/2) • Normalized Discounted Cumulative Gain (NDCG) at position $k$: $N(k) = Z_k \sum_{j=1}^{k} \frac{2^{r(j)} - 1}{\log(1 + j)}$ where $r(j)$ = the relevance rating of the document at rank $j$ and $Z_k$ = a normalization constant chosen so that a perfect ranking gives $N(k) = 1$ • Can be used with more than 2 levels of relevance score
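A sketch of NDCG@k with gain 2^r - 1 and discount 1/log2(1 + j); the normalization constant Z_k is computed as the inverse of the ideal ranking's DCG (the function name is illustrative):

```python
import math

def ndcg_at_k(relevances, k):
    # NDCG@k with gain 2^r - 1 and discount 1/log2(1 + j) for 1-based rank j;
    # `relevances` lists the graded relevance of each document in ranked order.
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rels[:k], start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 2, 1, 0], k=4))        # a perfectly ordered list scores 1.0
print(ndcg_at_k([0, 1, 2, 2], k=4) < 1.0)  # misordered lists score lower
```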

  27. Ranking performance measure (2/2) • Mean Average Precision (MAP) • Average precision for a single query: $AP = \frac{\sum_{j} P(j) \cdot rel(j)}{\text{number of relevant documents}}$ where $P(j)$ = precision at rank $j$ (number of relevant documents in the top $j$, divided by $j$) and $rel(j)$ = 1 if the document at rank $j$ is relevant, 0 otherwise • MAP = average of AP over all queries • Can only use binary relevance judgments
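The two definitions above can be sketched directly (function names are illustrative):

```python
def average_precision(relevant_flags):
    # AP for one query; `relevant_flags` marks each ranked document as
    # relevant (1) or not (0), in rank order. Binary judgments only.
    hits, precisions = 0, []
    for j, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / j)  # P(j) = relevant docs in top j, over j
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_query_flags):
    # MAP = average of AP over all queries.
    return sum(average_precision(q) for q in per_query_flags) / len(per_query_flags)

print(round(average_precision([1, 0, 1, 0]), 4))  # (1/1 + 2/3) / 2 = 0.8333
```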

  28. Experimental results (1/4) • Ranking accuracies in terms of NDCG on TREC (figure: NDCG vs. top $k$)

  29. Experimental results (2/4) • Ranking accuracies in terms of NDCG on OHSUMED (figure: NDCG vs. top $k$)

  30. Experimental results (3/4) • Ranking accuracies in terms of NDCG on CSearch (figure: NDCG vs. top $k$)

  31. Experimental results (4/4) • Ranking accuracies in terms of MAP

  32. Discussions (1/2) • For the pairwise approach, the number of document pairs varies widely from query to query • Distribution of the number of document pairs per query in OHSUMED (figure)

  33. Discussions (2/2) • The pairwise approach employs a “pairwise” loss function, which is not well suited to NDCG and MAP as performance measures • The listwise approach better represents the performance measures • Verification? • Observe the relationship between loss and NDCG at each training iteration

  34. Pairwise loss vs. NDCG in RankNet (figure: pairwise loss and NDCG vs. training iteration)

  35. Listwise loss vs. NDCG in ListNet (figure: listwise loss and NDCG vs. training iteration)

  36. Conclusions and future work • Conclusions • Listwise approach for learning to rank • Permutation probability and top k probability • Cross-entropy as loss function • Using neural network as model and gradient descent as the optimization algorithm • Future work • Use other metrics for loss function • Use other models • Investigate the relationship between listwise loss functions and performance measures
