Learn about the evolution of learning-to-rank techniques from pairwise to listwise approaches, encompassing methodologies, probability models, and practical applications. Explore the advantages, challenges, and experiments conducted in the field, along with the ListNet learning method. Discover how listwise approaches optimize ranking lists by minimizing error and enhancing ranking accuracy.
Learning to Rank: From Pairwise Approach to Listwise Approach
Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li
Published in ICML 2007
Presenter: Davidson
Date: 2009/12/09
Contents
• Introduction
• Pairwise approach
• Listwise approach
• Probability models
  • Permutation probability
  • Top k probability
• Learning method: ListNet
• Experiments
• Conclusions and future work
Introduction
• Learning to rank:
  • Ranking objects for given queries
  • Applications: document retrieval, expert finding, anti web spam, product ratings, etc.
• Learning to rank methods:
  • Pointwise approach
  • Pairwise approach
  • Listwise approach
Pairwise approach (1/2)
• Training samples: document pairs
• Learning task: classification of object pairs into 2 categories (correctly ranked or incorrectly ranked)
• Methods:
  • RankSVM (Herbrich et al., 1999)
  • RankBoost (Freund et al., 1998)
  • RankNet (Burges et al., 2005)
Pairwise approach (2/2)
• Advantages:
  • Handiness of applying existing classification methods
  • Ease of obtaining training instances of document pairs
    • E.g. click-through data from users (Joachims, 2002)
• Problems:
  • The learning objective is to minimize errors in classifying document pairs, not to minimize errors in ranking documents
  • The assumption that document pairs are generated i.i.d. is too strong
  • The number of document pairs varies greatly from query to query, resulting in models biased toward queries with more document pairs
Listwise approach (1/2)
• Training samples: document lists
• Listwise loss function
  • Represents the difference between the ranking list output by the ranking model and the ground-truth ranking list
  • Probability models + cross entropy
    • Permutation probability
    • Top k probability
• Ranking model: neural network
• Optimization algorithm: gradient descent
Listwise approach (2/2)
• Listwise framework:
  [diagram: for each query, the associated documents are mapped to feature vectors; the model produces a list of scores, which the listwise loss function compares against the ground-truth relevance scores]
Probability models
• Map a list of scores to a probability distribution
  • Permutation probability
  • Top k probability
• Take any metric between probability distributions as a loss function
  • Cross entropy
Permutation probability (1/6)
• n objects {1, 2, …, n} are to be ranked
• A permutation π = a ranking order of the n objects = ⟨π(1), π(2), …, π(n)⟩
• Ω_n = the set of all n! possible permutations of the n objects
• A list of scores s = (s_1, s_2, …, s_n), where s_j is the ranking score of object j
Permutation probability (2/6)
• The permutation probability is defined as:
  P_s(π) = ∏_{j=1}^{n} φ(s_{π(j)}) / Σ_{k=j}^{n} φ(s_{π(k)})
  where φ = an increasing and strictly positive function
        s_{π(j)} = the score of the object at position j of permutation π
• For example, with n = 3 and π = ⟨1, 2, 3⟩:
  P_s(π) = φ(s_1)/(φ(s_1)+φ(s_2)+φ(s_3)) · φ(s_2)/(φ(s_2)+φ(s_3)) · φ(s_3)/φ(s_3)
Permutation probability (3/6)
• The permutation probabilities form a probability distribution over Ω_n:
  P_s(π) > 0 for all π ∈ Ω_n, and Σ_{π ∈ Ω_n} P_s(π) = 1
• The permutation with larger score elements ranked in front has higher probability
• If s_1 > s_2 > … > s_n:
  • π = ⟨1, 2, …, n⟩ has the highest probability
  • π = ⟨n, n−1, …, 1⟩ has the lowest probability
Permutation probability (4/6)
• Example: 3 objects with scores 3, 5, 10
  [chart: the probabilities of the six possible permutations]
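The example above can be worked through in code. The sketch below brute-forces the full permutation distribution for the scores 3, 5, 10, assuming the linear choice φ(x) = x (valid here since all scores are positive); it is an illustration of the definition, not the paper's implementation:

```python
from itertools import permutations

def permutation_probability(scores, pi, phi=lambda x: x):
    """P_s(pi) = product over positions j of phi(s_pi(j)) / sum_{k=j..n} phi(s_pi(k)).
    `pi` is a tuple of 0-based object indices; phi defaults to the identity
    (the linear choice discussed on the next slide)."""
    p = 1.0
    n = len(pi)
    for j in range(n):
        p *= phi(scores[pi[j]]) / sum(phi(scores[pi[k]]) for k in range(j, n))
    return p

# The slide's example: 3 objects with scores 3, 5, 10.
scores = [3, 5, 10]
dist = {pi: permutation_probability(scores, pi) for pi in permutations(range(3))}
```

The six probabilities sum to 1, and the permutation that places the highest score first, (10, 5, 3), is the most likely one.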
Permutation probability (5/6)
• For a linear function φ(x) = αx (α > 0), the permutation probability is scale invariant:
  P_s(π) = P_{λs}(π), where λs = (λs_1, …, λs_n), λ > 0
• For an exponential function φ(x) = exp(x), the permutation probability is translation invariant:
  P_s(π) = P_{s+λ}(π), where s + λ = (s_1 + λ, …, s_n + λ)
Permutation probability (6/6)
• However…
  • Computing the whole distribution requires evaluating n! permutations
  • The computation is intractable for large n
• Consider the top k probability instead!
Top k probability (1/4)
• The probability of k objects (j_1, j_2, …, j_k) out of n objects being ranked in the top k positions
• The top k subgroup G(j_1, …, j_k) is defined as the set containing all permutations in which the top k objects are exactly j_1, …, j_k, in that order
• G_k is the collection of all top k subgroups
• G_k has only n!/(n−k)! elements << n!
• E.g. for 5 objects, the top 2 subgroup G(1, 3) includes: {(1,3,2,4,5), (1,3,2,5,4), (1,3,4,2,5), (1,3,4,5,2), (1,3,5,2,4), (1,3,5,4,2)}
Top k probability (2/4)
• The top k probability of objects (j_1, …, j_k) is defined as the probability of their subgroup:
  P_s(G(j_1, …, j_k)) = Σ_{π ∈ G(j_1, …, j_k)} P_s(π)
• For example (5 objects):
  P_s(G(1, 3)) = the sum of P_s(π) over the 3! = 6 permutations beginning with (1, 3)
• Do we still need to sum over all these permutations?
Top k probability (3/4)
• The top k probability can be computed efficiently as follows:
  P_s(G(j_1, …, j_k)) = ∏_{t=1}^{k} φ(s_{j_t}) / Σ_{l=t}^{n} φ(s_{j_l})
  where s_{j_t} = the score of object j_t (ranked at position t)
• For example (1, 3, x, x, x):
  P_s(G(1, 3)) = φ(s_1)/(φ(s_1)+φ(s_2)+φ(s_3)+φ(s_4)+φ(s_5)) · φ(s_3)/(φ(s_2)+φ(s_3)+φ(s_4)+φ(s_5))
Top k probability (4/4)
• Top k probabilities form a probability distribution over the collection G_k
• The top k subgroup with larger score elements in the front has higher top k probability
• Top k probability is scale invariant (with a linear φ) or translation invariant (with an exponential φ)
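The closed form can be checked numerically against the defining sum. The sketch below (assuming the exponential choice φ(x) = eˣ) computes a top 2 probability with just 2 factors and compares it with the brute-force sum over every permutation in the subgroup:

```python
from itertools import permutations
from math import exp

def top_k_probability(scores, top, phi=exp):
    """Closed-form top k probability: k factors instead of a sum over permutations.
    `top` is the tuple of 0-based object indices fixed at positions 1..k."""
    rest = [i for i in range(len(scores)) if i not in top]
    p = 1.0
    for t in range(len(top)):
        unplaced = list(top[t:]) + rest          # objects not yet assigned a position
        p *= phi(scores[top[t]]) / sum(phi(scores[i]) for i in unplaced)
    return p

def top_k_probability_bruteforce(scores, top, phi=exp):
    """Defining sum: add up P_s(pi) over every permutation starting with `top`."""
    n, total = len(scores), 0.0
    for pi in permutations(range(n)):
        if pi[:len(top)] != tuple(top):
            continue
        p = 1.0
        for j in range(n):
            p *= phi(scores[pi[j]]) / sum(phi(scores[pi[k]]) for k in range(j, n))
        total += p
    return total
```

Summing the closed form over all ordered pairs of objects also confirms that the top 2 probabilities form a distribution over G_2.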
Listwise loss function
• Cross entropy between the top k distributions of two lists of scores:
  L(y^(i), z^(i)) = − Σ_{g ∈ G_k} P_{y^(i)}(g) log P_{z^(i)}(g)
  where i denotes the query
        y^(i) denotes the ground-truth list of scores
        z^(i) denotes the model-generated list of scores
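With k = 1 and φ = exp, the top 1 distribution reduces to a softmax over the score list, so the loss becomes a softmax cross entropy between the two lists. A minimal sketch of that special case:

```python
import math

def softmax(scores):
    """Top-1 probabilities with phi = exp: a softmax over the score list."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_loss(y, z):
    """Cross entropy between the top-1 distributions of the ground-truth
    scores y and the model scores z; lower means closer rankings."""
    py, pz = softmax(y), softmax(z)
    return -sum(pyj * math.log(pzj) for pyj, pzj in zip(py, pz))

y_true = [2.0, 1.0, 0.0]                          # ground-truth scores for one query
good = listwise_loss(y_true, [5.0, 3.0, 1.0])     # same ordering as the truth
bad = listwise_loss(y_true, [1.0, 3.0, 5.0])      # reversed ordering
```

As expected, a score list that orders the documents like the ground truth incurs a much smaller loss than one that reverses the order.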
Learning method: ListNet (1/2)
• A learning to rank method that optimizes the listwise loss function based on top k probability, with a neural network as the model and gradient descent as the optimization algorithm
• f_ω denotes the ranking function based on the neural network model with parameter ω
• For a given feature vector x_j^(i), the ranking function outputs a score f_ω(x_j^(i))
• Score list: z^(i)(f_ω) = (f_ω(x_1^(i)), f_ω(x_2^(i)), …, f_ω(x_{n^(i)}^(i)))
Learning method: ListNet (2/2)
• Learning algorithm of ListNet:
  Input: training data {(x^(1), y^(1)), …, (x^(m), y^(m))}
  Parameter: number of iterations T and learning rate η
  Initialize parameter ω
  for t = 1 to T do
    for i = 1 to m do
      Input x^(i) of query i to the neural network and compute score list z^(i)(f_ω) with current ω
      Compute gradient Δω = ∂L(y^(i), z^(i)(f_ω)) / ∂ω
      Update ω ← ω − η × Δω
    end for
  end for
  Output neural network model ω
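The loop above can be sketched end to end, assuming a linear scoring model f_ω(x) = ω·x and the top 1 (softmax) version of the loss; for a linear model the gradient is Xᵀ(P_z − P_y). The toy data is made up for illustration only:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def train_listnet(queries, epochs=2000, lr=0.1):
    """Gradient descent on the top-1 listwise loss with a linear model w.
    queries: list of (X, y) where X is one query's docs-x-features matrix
    and y its ground-truth score list."""
    dim = queries[0][0].shape[1]
    w = np.zeros(dim)
    for _ in range(epochs):
        for X, y in queries:
            z = X @ w                                # current model score list
            grad = X.T @ (softmax(z) - softmax(y))   # dL/dw for the top-1 loss
            w -= lr * grad
    return w

# Toy query: 4 documents, 2 features; the true score is the first feature.
X = np.array([[3.0, 1.0], [1.0, 2.0], [2.0, 0.0], [0.0, 1.0]])
y = X[:, 0]
w = train_listnet([(X, y)])
ranking = np.argsort(-(X @ w))       # documents in decreasing model score
```

Because the loss is convex in w for a linear model, gradient descent recovers a score list that ranks the toy documents in the ground-truth order.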
Experiments
• ListNet compared with 3 pairwise methods:
  • RankNet
  • RankSVM
  • RankBoost
• 3 datasets:
  • TREC
  • OHSUMED
  • CSearch
TREC dataset
• Web pages from the .gov domain, crawled in 2002
• 1,053,110 pages, 11,164,829 hyperlinks
• 50 queries
• Binary relevance judgment (relevant or irrelevant)
• 20 features extracted from each query-document pair (e.g. content features and hyperlink features)
OHSUMED dataset
• A collection of documents and queries on medicine
• 348,566 documents, 106 queries
• 16,140 query-document pairs
• Relevance judgments: definitely relevant, possibly relevant, not relevant
• 30 features extracted for each query-document pair
CSearch dataset
• A dataset from a commercial search engine
• About 25,000 queries, with 1,000 documents associated with each query
• About 600 features in total, including query-dependent and query-independent features
• 5 levels of relevance judgment, from 4 (perfect match) down to 0 (bad match)
Ranking performance measure (1/2)
• Normalized Discounted Cumulative Gain (NDCG) at position k:
  NDCG@k = Z_k Σ_{j=1}^{k} (2^{r(j)} − 1) / log(1 + j)
  where r(j) = the relevance rating of the document at position j
        Z_k = a normalization constant chosen so that a perfect ranking gives NDCG@k = 1
• Can be used with more than 2 levels of relevance score
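The measure can be sketched directly from the formula; log base 2 is used below, though any base gives the same NDCG after normalization:

```python
import math

def ndcg_at_k(ratings, k):
    """NDCG@k = Z_k * sum_{j=1..k} (2^r(j) - 1) / log2(1 + j),
    where `ratings` lists relevance ratings in the model's ranked order and
    Z_k normalizes the ideal (descending) ranking to 1."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rs[:k], start=1))
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1, and any deviation from the ideal order scores strictly less.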
Ranking performance measure (2/2)
• Mean Average Precision (MAP)
  AP = Σ_{j=1}^{N} P(j) · rel(j) / (number of relevant documents)
  where P(j) = the precision of the top j results
        rel(j) = 1 if the document at position j is relevant, 0 otherwise
• MAP = the average of AP over all queries
• Can only use binary relevance judgments
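A matching sketch of AP and MAP over binary relevance lists:

```python
def average_precision(rels):
    """AP = sum_j P(j) * rel(j) / (# relevant docs), where `rels` is the
    binary relevance list in ranked order and P(j) is the precision of
    the top j results."""
    hits, total = 0, 0.0
    for j, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / j        # P(j), counted only at relevant positions
    return total / hits if hits else 0.0

def mean_average_precision(queries):
    """MAP: average the per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)
```

For instance, a query whose two relevant documents are ranked first and second gets AP = 1, while one whose single relevant document sits at rank 2 gets AP = 0.5.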
Experimental results (1/4)
• Ranking accuracies in terms of NDCG on TREC
  [figure: NDCG plotted against top k]
Experimental results (2/4)
• Ranking accuracies in terms of NDCG on OHSUMED
  [figure: NDCG plotted against top k]
Experimental results (3/4)
• Ranking accuracies in terms of NDCG on CSearch
  [figure: NDCG plotted against top k]
Experimental results (4/4) • Ranking accuracies in terms of MAP
Discussions (1/2)
• For the pairwise approach, the number of document pairs varies greatly from query to query
• Distribution of the number of document pairs per query in OHSUMED:
  [figure]
Discussions (2/2)
• The pairwise approach employs a "pairwise" loss function, which is not well suited to NDCG and MAP as performance measures
• The listwise approach better represents the performance measures
• Verification?
  • Observe the relationship between loss and NDCG at each iteration
Pairwise loss vs. NDCG in RankNet
[figure: pairwise loss and NDCG plotted against training iteration]
Listwise loss vs. NDCG in ListNet
[figure: listwise loss and NDCG plotted against training iteration]
Conclusions and future work
• Conclusions:
  • A listwise approach for learning to rank
  • Permutation probability and top k probability
  • Cross entropy as the loss function
  • A neural network as the model and gradient descent as the optimization algorithm
• Future work:
  • Use other metrics for the loss function
  • Use other models
  • Investigate the relationship between listwise loss functions and performance measures