
Learning to Rank: From Pairwise Approach to Listwise Approach



Presentation Transcript


  1. Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date: 2009/12/09 Published in ICML 2007

  2. Contents • Introduction • Pairwise approach • Listwise approach • Probability models • Permutation probability • Top k probability • Learning method: ListNet • Experiments • Conclusions and future work

  3. Introduction • Learning to rank: • Ranking objects for some queries • Document retrieval, expert finding, anti web spam, and product ratings, etc. • Learning to rank methods: • Pointwise approach • Pairwise approach • Listwise approach

  4. Pairwise approach (1/2) • Training samples: document pairs • Learning task: classification of object pairs into 2 categories (correctly ranked or incorrectly ranked) • Methods: • RankSVM (Herbrich et al., 1999) • RankBoost (Freund et al., 1998) • RankNet (Burges et al., 2005)

  5. Pairwise approach (2/2) • Advantages: • Existing classification methods can be applied directly • Training instances of document pairs are easy to obtain • E.g. click-through data from users (Joachims, 2002) • Problems… • The learning objective is to minimize errors in classifying document pairs, not to minimize errors in ranking documents • The assumption of i.i.d. generated document pairs is too strong • The number of document pairs varies widely from query to query, resulting in models biased towards queries with more document pairs

  6. Listwise approach (1/2) • Training samples: document lists • Listwise loss function • Represents the difference between the ranking list output by the ranking model and the ground truth ranking list • Probabilistic methods + cross-entropy • Permutation probability • Top k probability • Ranking model: neural network • Optimization algorithm: gradient descent

  7. Listwise approach (2/2) • Listwise framework (diagram): for each query, documents are represented as feature vectors; the model generates a score for each document, and the listwise loss function compares the model-generated score list against the ground-truth relevance scores

  8. Probability models • Map a list of scores to a probability distribution • Permutation probability • Top k probability • Take any metric between probability distributions as a loss function • Cross-entropy

  9. Permutation probability (1/6) • $n$ objects are to be ranked • A permutation $\pi = \langle \pi(1), \pi(2), \ldots, \pi(n) \rangle$ = a ranking order of the $n$ objects • $\Omega_n$ = the set of all possible permutations of the $n$ objects • A list of scores $s = (s_1, s_2, \ldots, s_n)$

  10. Permutation probability (2/6) • Permutation probability is defined as: $P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}$ where $\phi$ = an increasing and strictly positive function, $s_{\pi(j)}$ = the score of the object at position $j$ of permutation $\pi$ • For example ($n = 3$): $P_s(\pi) = \frac{\phi(s_{\pi(1)})}{\phi(s_{\pi(1)}) + \phi(s_{\pi(2)}) + \phi(s_{\pi(3)})} \cdot \frac{\phi(s_{\pi(2)})}{\phi(s_{\pi(2)}) + \phi(s_{\pi(3)})} \cdot \frac{\phi(s_{\pi(3)})}{\phi(s_{\pi(3)})}$

  11. Permutation probability (3/6) • The permutation probabilities form a probability distribution over $\Omega_n$ • $P_s(\pi) > 0$ and $\sum_{\pi \in \Omega_n} P_s(\pi) = 1$ • The permutation with larger elements in front has higher probability • If $s_1 > s_2 > \cdots > s_n$: • $\langle 1, 2, \ldots, n \rangle$ has the highest probability • $\langle n, n-1, \ldots, 1 \rangle$ has the lowest probability

  12. Permutation probability (4/6) • Example: 3 objects with scores 3, 5, 10
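The distribution for this example can be computed directly from the definition, taking phi = exp as in the paper. A minimal sketch (the function name is illustrative):

```python
import math
from itertools import permutations

def permutation_probability(scores, pi, phi=math.exp):
    # P_s(pi) = product over positions j of phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k)),
    # with phi an increasing, strictly positive function (exp here).
    prob = 1.0
    for j in range(len(pi)):
        prob *= phi(scores[pi[j]]) / sum(phi(scores[k]) for k in pi[j:])
    return prob

scores = [3, 5, 10]  # the three objects from the slide (0-based indices 0, 1, 2)
dist = {pi: permutation_probability(scores, pi) for pi in permutations(range(3))}

# The probabilities sum to 1; the permutation (2, 1, 0), which puts the
# largest score first, is most likely, and its reverse is least likely.
for pi, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(pi, round(p, 4))
```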

  13. Permutation probability (5/6) • For a linear function $\phi(x) = \alpha x$ ($\alpha > 0$), the permutation probability is scale invariant: $P_{\lambda s}(\pi) = P_s(\pi)$, where $\lambda > 0$ • For an exponential function $\phi(x) = \exp(x)$, the permutation probability is translation invariant: $P_{s+\mu}(\pi) = P_s(\pi)$, where $\mu$ is any real constant
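The invariance properties on this slide are easy to check numerically; a small sketch, using phi(x) = x (alpha = 1) for the linear case and phi(x) = exp(x) for the exponential case:

```python
import math
from itertools import permutations

def perm_prob(scores, pi, phi):
    # P_s(pi) = product over positions j of phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k))
    prob = 1.0
    for j in range(len(pi)):
        prob *= phi(scores[pi[j]]) / sum(phi(scores[k]) for k in pi[j:])
    return prob

scores = [3.0, 5.0, 10.0]
for pi in permutations(range(3)):
    # Linear phi: scaling every score by lambda = 2 leaves P_s(pi) unchanged.
    assert abs(perm_prob(scores, pi, lambda x: x)
               - perm_prob([2.0 * s for s in scores], pi, lambda x: x)) < 1e-12
    # Exponential phi: shifting every score by mu = 7 leaves P_s(pi) unchanged.
    assert abs(perm_prob(scores, pi, math.exp)
               - perm_prob([s + 7.0 for s in scores], pi, math.exp)) < 1e-12
print("scale and translation invariance verified")
```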

  14. Permutation probability (6/6) • However… • The number of permutations is $n!$, so computing the full distribution is of order $O(n!)$ • The computation is intractable for large $n$ • Consider the top $k$ probability!

  15. Top k probability (1/4) • The probability of objects $(j_1, j_2, \ldots, j_k)$ (out of $n$ objects) being ranked in the top $k$ positions • The top $k$ subgroup $G_k(j_1, \ldots, j_k)$ is defined as the set containing all the permutations in which the top $k$ objects are exactly $(j_1, \ldots, j_k)$, in that order • $\mathcal{G}_k$ is the collection of all the top $k$ subgroups • $\mathcal{G}_k$ now has only $n!/(n-k)!$ elements $\ll n!$ • E.g. for 5 objects, the top 2 subgroup $G_2(1, 3)$ includes: {(1,3,2,4,5), (1,3,2,5,4), (1,3,4,2,5), (1,3,4,5,2), (1,3,5,2,4), (1,3,5,4,2)}

  16. Top k probability (2/4) • The top $k$ probability of objects $(j_1, \ldots, j_k)$ is defined as: $P_s(G_k(j_1, \ldots, j_k)) = \sum_{\pi \in G_k(j_1, \ldots, j_k)} P_s(\pi)$ • For example (5 objects): $P_s(G_2(1, 3)) = P_s(\langle 1,3,2,4,5 \rangle) + P_s(\langle 1,3,2,5,4 \rangle) + \cdots + P_s(\langle 1,3,5,4,2 \rangle)$ • Still needs to compute $n!$ permutations?

  17. Top k probability (3/4) • The top $k$ probability can be computed efficiently as: $P_s(G_k(j_1, \ldots, j_k)) = \prod_{t=1}^{k} \frac{\phi(s_{j_t})}{\sum_{l=t}^{n} \phi(s_{j_l})}$ where $s_{j_t}$ = the score of object $j_t$ (ranked at position $t$) and $j_{k+1}, \ldots, j_n$ denote the objects not in the top $k$ • For example (1, 3, x, x, x): $P_s(G_2(1, 3)) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \cdots + \phi(s_5)} \cdot \frac{\phi(s_3)}{\phi(s_2) + \phi(s_3) + \phi(s_4) + \phi(s_5)}$
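The closed form avoids enumerating permutations: at each of the k steps, the placed object's phi-score is divided by the phi-scores of all objects not yet placed. A sketch (illustrative names), checked against brute-force enumeration:

```python
import math
from itertools import permutations

def top_k_probability(scores, top, phi=math.exp):
    # P_s(G_k(j_1..j_k)) = product over t of phi(s_{j_t}) divided by the sum
    # of phi over object j_t and every object not yet placed.
    remaining = list(range(len(scores)))
    prob = 1.0
    for j in top:
        prob *= phi(scores[j]) / sum(phi(scores[k]) for k in remaining)
        remaining.remove(j)
    return prob

def perm_prob(scores, pi, phi=math.exp):
    return top_k_probability(scores, pi, phi)  # a full permutation is the k = n case

scores = [1.0, 0.5, 2.0, 0.2, 1.5]
# Brute force: sum full permutation probabilities over all orderings starting 0, 2.
brute = sum(perm_prob(scores, pi) for pi in permutations(range(5)) if pi[:2] == (0, 2))
fast = top_k_probability(scores, (0, 2))
print(abs(brute - fast) < 1e-12)  # the closed form matches the brute-force sum
```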

  18. Top k probability (4/4) • Top $k$ probabilities form a probability distribution over the collection $\mathcal{G}_k$ • The top $k$ subgroup with larger elements in front has higher top $k$ probability • Top $k$ probability is scale or translation invariant with a carefully designed function $\phi$

  19. Listwise loss function • Cross-entropy between the top $k$ distributions of two lists of scores: $L(y^{(i)}, z^{(i)}) = -\sum_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}}(g)$ where $i$ denotes the query, $y^{(i)}$ denotes the ground truth list of scores, and $z^{(i)}$ denotes the model-generated list of scores
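With k = 1 (the setting the paper uses in its experiments), the top k distribution reduces to a softmax over the scores when phi = exp, and the loss becomes a softmax cross-entropy. A minimal sketch with illustrative names:

```python
import math

def top1_distribution(scores):
    # With phi = exp, the top-1 probability of object j is
    # exp(s_j) / sum_k exp(s_k): a softmax over the score list.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(y_scores, z_scores):
    # L(y, z) = -sum_j P_y(top1 = j) * log P_z(top1 = j)
    return -sum(p * math.log(q) for p, q in
                zip(top1_distribution(y_scores), top1_distribution(z_scores)))

ground_truth = [3.0, 5.0, 10.0]
good = listwise_loss(ground_truth, [0.3, 0.5, 1.0])  # same ordering as ground truth
bad = listwise_loss(ground_truth, [1.0, 0.5, 0.3])   # reversed ordering
print(good < bad)  # True: correctly ordered scores incur a lower loss
```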

  20. Learning method: ListNet (1/2) • A learning to rank method that optimizes the listwise loss function based on top $k$ probability, with a neural network as the model and gradient descent as the optimization algorithm • $f_\omega$ denotes the ranking function based on the neural network model with parameter $\omega$ • For a given feature vector $x_j^{(i)}$, the ranking function gives a score $f_\omega(x_j^{(i)})$ • Score list $z^{(i)}(f_\omega) = (f_\omega(x_1^{(i)}), f_\omega(x_2^{(i)}), \ldots, f_\omega(x_{n^{(i)}}^{(i)}))$

  21. Learning method: ListNet (2/2) • Learning algorithm of ListNet: Input: training data $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ Parameter: number of iterations $T$ and learning rate $\eta$ Initialize parameter $\omega$ for $t = 1$ to $T$ do for $i = 1$ to $m$ do Input $x^{(i)}$ of query $i$ to the neural network and compute score list $z^{(i)}(f_\omega)$ with current $\omega$ Compute gradient $\Delta\omega = \partial L(y^{(i)}, z^{(i)}(f_\omega)) / \partial \omega$ Update $\omega \leftarrow \omega - \eta \times \Delta\omega$ end for end for Output neural network model $\omega$
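The training loop on this slide can be sketched in a few lines. This assumes a linear scoring function and the top-1 (softmax) loss, whose gradient with respect to the scores is simply P_z minus P_y; all names here are illustrative, not from the paper:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_listnet(queries, n_features, iterations, lr):
    # queries: list of (feature_vectors, ground_truth_scores) pairs, one per query.
    w = [0.0] * n_features  # linear scoring model: f_w(x) = w . x
    for _ in range(iterations):
        for xs, ys in queries:
            zs = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
            py, pz = softmax(ys), softmax(zs)
            # Top-1 cross-entropy gradient: dL/dw = sum_j (P_z(j) - P_y(j)) * x_j
            grad = [sum((pz[j] - py[j]) * xs[j][d] for j in range(len(xs)))
                    for d in range(n_features)]
            w = [wi - lr * g for wi, g in zip(w, grad)]  # gradient descent step
    return w

# Toy query: one feature, and the ground-truth score grows with it.
queries = [([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])]
w = train_listnet(queries, n_features=1, iterations=200, lr=0.1)
print(w[0] > 0)  # the model learned to rank higher-feature documents first
```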

  22. Experiments • ListNet compared with 3 pairwise methods: • RankNet • RankSVM • RankBoost • 3 datasets • TREC • OHSUMED • CSearch

  23. TREC dataset • .gov domain web pages in 2002 • 1,053,110 pages, 11,164,829 hyperlinks • 50 queries • Binary relevance judgment (relevant or irrelevant) • 20 features extracted from each query-document pair (e.g. content features and hyperlink features)

  24. OHSUMED dataset • A collection of documents and queries on medicine • 348,566 documents, 106 queries • 16,140 query-document pairs • Relevance judgment: definitely relevant, possibly relevant, not relevant • 30 features extracted for each query-document pair

  25. CSearch dataset • A dataset from a commercial search engine • About 25,000 queries with 1,000 documents associated with each query • About 600 features in total, including query-dependent and query-independent features • 5 levels of relevance judgment: 4 (perfect match) to 0 (bad match)

  26. Ranking performance measure (1/2) • Normalized Discounted Cumulative Gain (NDCG) at position $k$: $N(k) = Z_k \sum_{j=1}^{k} \frac{2^{r(j)} - 1}{\log(1 + j)}$ where $r(j)$ = the relevance rating of the document at rank $j$ and $Z_k$ = a normalization constant chosen so that a perfect ranking gives $N(k) = 1$ • Can be used with more than 2 levels of relevance score
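A sketch of NDCG@k with gain 2^r - 1 and discount 1/log2(1 + j); the normalization constant Z_k is computed as the inverse of the ideal ranking's DCG (the function name is illustrative):

```python
import math

def ndcg_at_k(relevances, k):
    # NDCG@k with gain 2^r - 1 and discount 1/log2(1 + j) for 1-based rank j;
    # `relevances` lists the graded relevance of each document in ranked order.
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rels[:k], start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 2, 1, 0], k=4))        # a perfectly ordered list scores 1.0
print(ndcg_at_k([0, 1, 2, 2], k=4) < 1.0)  # misordered lists score lower
```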

  27. Ranking performance measure (2/2) • Mean Average Precision (MAP) • Average precision for a single query: $AP = \frac{\sum_{j} P(j) \cdot rel(j)}{\text{number of relevant documents}}$ where $P(j)$ = precision at rank $j$ (number of relevant documents in the top $j$, divided by $j$) and $rel(j)$ = 1 if the document at rank $j$ is relevant, 0 otherwise • MAP = average of AP over all queries • Can only use binary relevance judgments
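The two definitions above can be sketched directly (function names are illustrative):

```python
def average_precision(relevant_flags):
    # AP for one query; `relevant_flags` marks each ranked document as
    # relevant (1) or not (0), in rank order. Binary judgments only.
    hits, precisions = 0, []
    for j, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / j)  # P(j) = relevant docs in top j, over j
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_query_flags):
    # MAP = average of AP over all queries.
    return sum(average_precision(q) for q in per_query_flags) / len(per_query_flags)

print(round(average_precision([1, 0, 1, 0]), 4))  # (1/1 + 2/3) / 2 = 0.8333
```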

  28. Experimental results (1/4) • Ranking accuracies in terms of NDCG on TREC (figure: NDCG vs. top $k$)

  29. Experimental results (2/4) • Ranking accuracies in terms of NDCG on OHSUMED (figure: NDCG vs. top $k$)

  30. Experimental results (3/4) • Ranking accuracies in terms of NDCG on CSearch (figure: NDCG vs. top $k$)

  31. Experimental results (4/4) • Ranking accuracies in terms of MAP

  32. Discussions (1/2) • For the pairwise approach, the number of document pairs varies widely from query to query • Distribution of the number of document pairs per query in OHSUMED (figure)

  33. Discussions (2/2) • The pairwise approach employs a “pairwise” loss function, which is not well suited to NDCG and MAP as performance measures • The listwise approach better represents the performance measures • Verification? • Observe the relationship between loss and NDCG at each training iteration

  34. Pairwise loss vs. NDCG in RankNet (figure: pairwise loss and NDCG vs. training iteration)

  35. Listwise loss vs. NDCG in ListNet (figure: listwise loss and NDCG vs. training iteration)

  36. Conclusions and future work • Conclusions • Listwise approach for learning to rank • Permutation probability and top k probability • Cross-entropy as loss function • Using neural network as model and gradient descent as the optimization algorithm • Future work • Use other metrics for loss function • Use other models • Investigate the relationship between listwise loss functions and performance measures
