
Discovering Bucket Orders from Full Rankings

Discovering Bucket Orders from Full Rankings. Jianlin Feng* Department of Computer Science and Technology Huazhong University of Science and Technology Qiong Fang, Wilfred Ng Department of Computer Science and Engineering Hong Kong University of Science and Technology. * Work done at UIUC.


Presentation Transcript


  1. Discovering Bucket Orders from Full Rankings • Jianlin Feng*, Department of Computer Science and Technology, Huazhong University of Science and Technology • Qiong Fang, Wilfred Ng, Department of Computer Science and Engineering, Hong Kong University of Science and Technology • * Work done at UIUC • SIGMOD, 6/10/08

  2. Definitions of Rankings and Orders • Full ranking • A permutation of n items (or objects). • A full ranking T of 6 items: a ≺ c ≺ b ≺ d ≺ f ≺ e • Formalized by a total order • A binary relation on items satisfying the three criteria of antisymmetry, transitivity, and linearity. • Partial ranking • A full ranking of k nonempty buckets. • Items in the same bucket are tied. • Formalized by a bucket order • A total order of buckets (i.e., “ties”). • A bucket order B: {a, b, c, d} ≺ {e, f} HUST & HKUST

  3. Introduction • Input: m full rankings (total orders) over n items • Output: a single full ranking over n items • Rank aggregation • Voting • Meta-search • Multi-criteria query • Or output: a single bucket order over n items? • Bucket Order Discovering (BOD)

  4. Motivation (1): Representing Collective Browsing Habits • Each user’s habit is reflected in his or her browsing sequence: • user 1: frontpage → sports → weather • user 2: news → politics → weather • … • user n: frontpage → politics → weather • Similar users should have similar, but not strictly identical, browsing sequences. • A “representative” bucket order of collective browsing habits: {frontpage, news} → {politics, sports} → {weather}

  5. Motivation (2): Approximating Bucket Order of Fossil Sites • [Figure: seriation of a 0-1 matrix with axes “species” and “fossil site”] • Seriation in paleontology • Given a 0-1 matrix, find an order of the rows such that the 1s are as consecutive as possible. • Markov Chain Monte Carlo (MCMC) → total orders • Puolamäki et al, PLoS Comput Biol 2006 • The underlying order is indeed a bucket order. • Paleontological dataset: g10s10 • 124 fossil sites • The “ground truth” bucket order has 15 buckets. • Given the total orders generated by MCMC, which are linear extensions of the underlying bucket order, we want to find a good approximation of the underlying bucket order.

  6. Problem statement: Bucket Order Discovering (BOD) • Given m full rankings R = {T1, T2, ..., Tm} over n items, • we want to find a bucket order B such that • “representative” perspective: B is a good “representative” that summarizes R well; • “approximation” perspective: B is a good “approximation” of some “ground truth” bucket order G, • where R is simply a set of “linear extensions” of G.
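
A minimal sketch of the “linear extension” notion used above, assuming a bucket order is stored as a list of sets (first bucket first) and a full ranking as a sequence of items. The function name and data layout are illustrative choices, not taken from the slides.

```python
def is_linear_extension(ranking, buckets):
    """Return True if `ranking` respects the bucket order `buckets`.

    `ranking` is a sequence of items, first item ranked first.
    `buckets` is a list of sets, first bucket ranked first.
    """
    bucket_of = {item: i for i, bucket in enumerate(buckets) for item in bucket}
    # Both sides must cover exactly the same items, each exactly once.
    if len(ranking) != len(bucket_of) or set(ranking) != set(bucket_of):
        return False
    # Walking down the ranking, the bucket index may never decrease.
    indices = [bucket_of[item] for item in ranking]
    return all(x <= y for x, y in zip(indices, indices[1:]))


B = [{"a", "b", "c", "d"}, {"e", "f"}]      # bucket order from slide 2
print(is_linear_extension("acbdfe", B))      # True
print(is_linear_extension("aebdfc", B))      # False: e is placed before b
```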

  7. Outline • Motivation • Problem formulation • Previous algorithms • The Bucket Pivot Algorithm • The Dynamic Programming Algorithm • Our approach • The Bucket Gap Algorithm • Experimental study • Conclusion

  8. What Makes a Good Bucket Order? • Precedence probability Ptu: • The fraction of the input full rankings in which item t precedes u. • A good bucket order B should preserve the pairwise precedence relationships well: • small |Ptu - 1.0| ==> t should precede u in B. • small |Ptu - 0.5| ==> t and u should be “tied” in B. • small |Ptu - 0.0| ==> u should precede t in B. • The distance between B and the input full rankings • The sum, over item pairs, of the values |Ptu - 1.0|, |Ptu - 0.5|, or |Ptu - 0.0|, whichever applies to the pair in B.
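
The precedence probabilities can be computed directly from the input rankings. The sketch below is illustrative only: it assumes each ranking is a sequence of distinct items and stores Ptu in a dictionary keyed by the ordered pair (t, u); the function name is hypothetical.

```python
from itertools import combinations


def pair_order_matrix(rankings):
    """Map each ordered pair (t, u) to the fraction of rankings where t precedes u."""
    items = sorted(rankings[0])
    counts = {(t, u): 0 for t in items for u in items if t != u}
    for ranking in rankings:
        pos = {item: i for i, item in enumerate(ranking)}
        for t, u in combinations(items, 2):
            if pos[t] < pos[u]:
                counts[t, u] += 1
            else:
                counts[u, t] += 1
    m = len(rankings)
    return {pair: c / m for pair, c in counts.items()}


# Four made-up full rankings over three items.
C = pair_order_matrix(["abc", "acb", "bac", "abc"])
print(C["a", "b"], C["b", "a"])  # 0.75 0.25
print(C["a", "c"])               # 1.0
```

The double loop over item pairs for every ranking is what gives the O(mn²) cost mentioned later in the scalability slide.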

  9. Distance in Matrix Notation (Gionis et al, KDD’2006) • The input pair-order matrix C: Ctu is Ptu. • The pair-order matrix CB for bucket order B: • CBtu equals 1.0 if t precedes u in B • CBtu equals 0.0 if u precedes t in B • CBtu equals 0.5 if t and u are “tied” in B • The distance between B and the input full rankings: the sum over item pairs (t, u) of |CBtu - Ctu| • This is the I-Distance, for goodness of “representative”.

  10. G-Distance for goodness of “approximation” (Gionis et al, KDD’2006) • CG: the pair-order matrix of the “ground truth” G. • The G-Distance compares CB with CG in the same way: the sum over item pairs (t, u) of |CBtu - CGtu|.
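
As a sketch of the two distances, the code below builds the pair-order matrix of a bucket order and sums absolute differences against another pair-order matrix. It assumes the sum runs over all ordered pairs (t, u) with t ≠ u and that matrices are dictionaries keyed by (t, u); both are assumptions made for illustration, and the example matrix C is made up.

```python
def bucket_pair_order_matrix(buckets):
    """Pair-order matrix CB of a bucket order given as a list of sets."""
    bucket_of = {item: i for i, bucket in enumerate(buckets) for item in bucket}
    CB = {}
    for t in bucket_of:
        for u in bucket_of:
            if t == u:
                continue
            if bucket_of[t] < bucket_of[u]:
                CB[t, u] = 1.0      # t precedes u
            elif bucket_of[t] > bucket_of[u]:
                CB[t, u] = 0.0      # u precedes t
            else:
                CB[t, u] = 0.5      # t and u are tied
    return CB


def l1_distance(CB, C):
    """Sum of |CB[t,u] - C[t,u]| over all ordered pairs (assumed convention)."""
    return sum(abs(CB[pair] - C[pair]) for pair in CB)


# A made-up input pair-order matrix over items a, b, c.
C = {("a", "b"): 0.5, ("b", "a"): 0.5,
     ("a", "c"): 1.0, ("c", "a"): 0.0,
     ("b", "c"): 0.75, ("c", "b"): 0.25}

print(l1_distance(bucket_pair_order_matrix([{"a", "b"}, {"c"}]), C))    # 0.5
print(l1_distance(bucket_pair_order_matrix([{"a"}, {"b"}, {"c"}]), C))  # 1.5
```

Passing the input matrix C gives the I-Distance; passing the matrix of a ground-truth bucket order G in place of C gives the G-Distance.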

  11. Formal Definition of BOD • The BOD problem is now formulated as: • Given a collection of input full rankings, • find a bucket order that minimizes the I-Distance (or G-Distance). • This optimization problem is NP-hard (Gionis et al, KDD’2006). • We have to use heuristic algorithms.

  12. Outline • Motivation • Problem formulation • Previous algorithms • The Bucket Pivot Algorithm • The Dynamic Programming Algorithm • Our approach • The Bucket Gap Algorithm • Experimental study • Conclusion

  13. The Bucket Pivot Algorithm (PIVOT) (Gionis et al, KDD’2006) • Input: the input pair-order matrix C • Output: a bucket order B • Idea: • If Ctu is close enough to 0.5: • 0.5 - f ≤ Ctu < 0.5 + f, where f is a bounding parameter, • then t and u should be put into the same bucket in B. • Else put u to the “left” (u ≺ t) or to the “right” (t ≺ u). • To avoid checking each Ctu, proceed like the quicksort algorithm. • Adapted from the FAS-PIVOT algorithm (Ailon et al, STOC’2005)
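
The quicksort-like recursion can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it assumes C is a dictionary keyed by (t, u), that the pivot is drawn uniformly at random, and that the pivot plays the role of t in the test on Ctu.

```python
import random


def bucket_pivot(items, C, f, rng=random):
    """Recursively split `items` around a random pivot; returns a list of sets."""
    if not items:
        return []
    pivot = rng.choice(items)
    left, same, right = [], [pivot], []
    for u in items:
        if u == pivot:
            continue
        c = C[pivot, u]              # fraction of rankings where pivot precedes u
        if 0.5 - f <= c < 0.5 + f:
            same.append(u)           # close enough to 0.5: tie with the pivot
        elif c >= 0.5 + f:
            right.append(u)          # pivot precedes u
        else:
            left.append(u)           # u precedes pivot
    return (bucket_pivot(left, C, f, rng)
            + [set(same)]
            + bucket_pivot(right, C, f, rng))


# A made-up matrix in which a and b are tied and both precede c.
C = {("a", "b"): 0.5, ("b", "a"): 0.5,
     ("a", "c"): 1.0, ("c", "a"): 0.0,
     ("b", "c"): 1.0, ("c", "b"): 0.0}
print(bucket_pivot(["a", "b", "c"], C, f=0.25, rng=random.Random(0)))
```

On this small input every pivot choice leads to the buckets {a, b} followed by {c}; in general the output depends on the pivots drawn and on f.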

  14. Limitations of PIVOT: results heavily depend on the pivots chosen and on f • [Figure: the input pair-order matrix C] • When f is 0.35: • If a is the pivot in the 1st recursion: {a, b, c, d, e, f} • When f is 0.25: • If a is the pivot in the 1st recursion: {a, c, d, f} ≺ {b} ≺ {e} • If b is the pivot in the 1st recursion: {a, c} ≺ {b, d, f} ≺ {e}

  15. The Dynamic Programming Algorithm (DP) (Fagin et al, PODS’2004) • Idea: • If two items’ median ranks are close enough, they should be put into the same bucket. • Median_rank(i) = median(T1(i), T2(i), …, Tm(i)) • Step 1: pre-processing • To avoid checking “closeness” on median rank between each pair of items. • MEDRANK (Fagin et al, SIGMOD’2003) sorts the n items into a total order T in non-decreasing order of the items’ median ranks. • T: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6> • Step 2: use “closeness” on median rank to form buckets • Use dynamic programming to segment T into a bucket order B.
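
Step 1 can be illustrated with a direct computation of median ranks. The sketch below is not MEDRANK itself (which scans the rankings incrementally); it simply computes each item's median position and sorts. Using the lower median for an even number of rankings and breaking ties by item name are assumptions made here.

```python
from statistics import median_low


def median_rank_order(rankings):
    """Return [(item, median rank)] sorted by non-decreasing median rank."""
    items = sorted(rankings[0])
    ranks = {
        # Positions are 1-based: the first item of a ranking has rank 1.
        item: median_low([r.index(item) + 1 for r in rankings])
        for item in items
    }
    return sorted(ranks.items(), key=lambda kv: (kv[1], kv[0]))


# Three made-up full rankings over four items.
print(median_rank_order(["abcd", "bacd", "abdc"]))
# [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
```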

  16. Two Limitations of DP: from the “Approximation” Perspective • Limitation 1: • Two items from different buckets in the “ground truth” bucket order G can also have close median ranks. • Limitation 2: • DP’s minimizing of bucket costs tends to break a big bucket b of G into several small buckets. • Bucket cost: [formula not preserved in the transcript; its annotations are “median rank of the l-th item along a total order T” and “average position”] • Observed on g10s10: • DP generates 34 buckets, while G has only 15 buckets.

  17. Outline • Motivation • Problem formulation • Previous algorithms • The Bucket Pivot Algorithm • The Dynamic Programming Algorithm • Our approach • The Bucket Gap Algorithm • Experimental study • Conclusion

  18. The Bucket Gap Algorithm (GAP): Basic Ideas • Motivated by the two limitations of DP • Idea 1: If two items are close on multiple quantile ranks, it is more reliable to put them into the same bucket. • Quantile_rank(i) = quantile(T1(i), T2(i), …, Tm(i)) • The median rank is the quantile rank w.r.t. the 50% quantile. • Idea 2: Items from different buckets should have “abnormally large gaps” between their quantile ranks. • DP’s idea: items in the same bucket should have small gaps between their median ranks.

  19. The Bucket Gap Algorithm: A Two-Phase Framework • Phase 1: check “closeness” of items on each quantile rank separately. • For each quantile, sort all the items in non-decreasing order of their corresponding quantile ranks. • Such a total order is called a quantile order. • Use our novel Abnormal Rank Gap heuristic to segment quantile orders into initial bucket orders. • Phase 2: aggregate the “closeness” of items on each quantile rank to generate the final bucket order. • Perform a median rank aggregation on the initial bucket orders.
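
A quantile order can be sketched by generalizing the median rank to an arbitrary quantile. The definition used here is an assumption: for a quantile of p percent over m rankings, the quantile rank of an item is the ceil(p·m/100)-th smallest of its positions. Quantiles are passed as integer percentages to avoid floating-point rounding, and ties are broken by item name.

```python
def quantile_rank(positions, percent):
    """The ceil(percent * m / 100)-th smallest position (assumed definition)."""
    ordered = sorted(positions)
    k = max(1, -(-percent * len(ordered) // 100))  # integer ceiling
    return ordered[k - 1]


def quantile_order(rankings, percent):
    """Return [(item, quantile rank)] in non-decreasing order of quantile rank."""
    items = sorted(rankings[0])
    ranks = {
        item: quantile_rank([r.index(item) + 1 for r in rankings], percent)
        for item in items
    }
    return sorted(ranks.items(), key=lambda kv: (kv[1], kv[0]))


rankings = ["abcd", "bacd", "abdc"]   # made-up input
print(quantile_order(rankings, 30))   # [('a', 1), ('b', 1), ('c', 3), ('d', 3)]
print(quantile_order(rankings, 50))   # [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
print(quantile_order(rankings, 90))   # [('a', 2), ('b', 2), ('c', 4), ('d', 4)]
```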

  20. MEDRANK+: generating quantile orders • First, sort the quantiles in increasing order. • Then, perform a round-robin scan of all the input full rankings. • In each round, output items with their quantile ranks to the corresponding quantile orders. • [Figure: example scan with quantiles 30%, 50%, 70%, 90%; item a is output with rank 1 (a: 1) to the 30% and 50% quantile orders]

  21. The Abnormal Rank Gap Heuristic • A quantile order Q1 [shown on the slide, not preserved in the transcript] has 5 rank gaps. • Average gap ga and standard deviation sg: • ga = 4/5, sg = sqrt(14) / 5. • A rank gap gi is abnormal if gi > ga + sg, i.e., the average gap plus one standard deviation. • The heuristic • An abnormal rank gap separates two consecutive buckets. • Na abnormal rank gaps ⇒ (Na + 1) buckets. • Only g5 is abnormal in Q1 ⇒ initial bucket order B1: < {a, b, c, d, f}, {e} >
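
The heuristic can be sketched in a few lines. Because the slide's Q1 is not preserved, the quantile order below is a hypothetical one, chosen so that its gaps reproduce the stated values ga = 4/5 and sg = sqrt(14)/5 when the population standard deviation is used; the function name is likewise illustrative.

```python
from statistics import mean, pstdev


def segment_by_abnormal_gaps(quantile_order):
    """Cut a quantile order [(item, rank), ...] at every abnormal rank gap."""
    items = [item for item, _ in quantile_order]
    ranks = [rank for _, rank in quantile_order]
    gaps = [b - a for a, b in zip(ranks, ranks[1:])]
    if not gaps:
        return [set(items)] if items else []
    # A gap is abnormal if it exceeds the average gap plus one
    # (population) standard deviation.
    threshold = mean(gaps) + pstdev(gaps)
    buckets = [{items[0]}]
    for item, gap in zip(items[1:], gaps):
        if gap > threshold:
            buckets.append({item})    # abnormal gap: start a new bucket
        else:
            buckets[-1].add(item)     # normal gap: stay in the current bucket
    return buckets


# Hypothetical Q1 with gaps 0, 1, 0, 1, 2: mean 4/5, std dev sqrt(14)/5.
Q1 = [("a", 1), ("b", 1), ("c", 2), ("d", 2), ("f", 3), ("e", 5)]
print(segment_by_abnormal_gaps(Q1))   # two buckets: {a, b, c, d, f} then {e}
```

Only the last gap (2) exceeds the threshold of about 1.55, so the order is cut once, matching the initial bucket order B1 on the slide.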

  22. Median Rank Aggregation on Initial Bucket Orders • Put items with the same median rank into the same bucket in the final bucket order.
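
A sketch of this aggregation step, under assumptions the slide does not spell out: an item's rank in an initial bucket order is taken to be the 1-based index of its bucket, the lower median is used, and the final buckets are ordered by increasing median. The three initial bucket orders are made up for illustration.

```python
from statistics import median_low


def aggregate_bucket_orders(bucket_orders):
    """Group items by the median of their bucket indices across the orders."""
    ranks = {}
    for order in bucket_orders:
        for index, bucket in enumerate(order, start=1):
            for item in bucket:
                ranks.setdefault(item, []).append(index)
    medians = {item: median_low(values) for item, values in ranks.items()}
    # One final bucket per distinct median value, in increasing order.
    return [
        {item for item, med in medians.items() if med == value}
        for value in sorted(set(medians.values()))
    ]


initial = [
    [{"a", "b", "c", "d", "f"}, {"e"}],
    [{"a", "c"}, {"b", "d", "f"}, {"e"}],
    [{"a", "b", "c"}, {"d", "f"}, {"e"}],
]
print(aggregate_bucket_orders(initial))   # {a, b, c} then {d, f} then {e}
```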

  23. Outline • Motivation • Problem formulation • Previous algorithms • The Bucket Pivot Algorithm • The Dynamic Programming Algorithm • Our approach • The Bucket Gap Algorithm • Experimental study • Conclusion

  24. Experimental study • Algorithms: • PIVOT, DP, GAP • Only PIVOT has error bars, showing one standard deviation. • Datasets • Synthetic datasets • Noise level: 20% • Real clickstream dataset • MSNBC • Real paleontology dataset g10s10 • 2,000 sequences, 124 items. • Details of the results are in the paper.

  25. Scalability using G-Distance - Synthetic Dataset • The bottleneck of PIVOT (or of using I-Distance): computing the input pair-order matrix costs O(mn²). • m: number of input full rankings • n: number of items

  26. I-Distance and G-Distance - Paleontological data (2,000 sequences, 124 items) • [Figure label: GAP using Median Rank only] • The adoption of multiple quantile ranks makes sense. • Since GAP is fast, we can run it several times to search for the best result.

  27. Conclusions • Introduce a two-phase rank aggregation framework • to exploit “closeness” on multiple quantile ranks • Can achieve more reliable bucket forming • Introduce the Abnormal Rank Gap heuristic • Can better check “closeness” on a single quantile rank • Avoids breaking big buckets into small ones. • Future work • The general setting: • input full rankings have various lengths. • Some theoretical basis: • gain better insight into GAP’s effectiveness.

  28. A Note on Correction of Reference 7 • Two authors were left out of Reference 7 (Sivakumar, D., and Vee, E.). • The correct version should be: • Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Ranking with Ties. ACM PODS, 2004, pp. 47–58.

  29. Thank you. Any questions?

  30. References • [Ailon, STOC’2005] Ailon, N., Charikar, M., and Newman, A. Aggregating Inconsistent Information: Ranking and Clustering. ACM STOC, 2005, pp. 684–693. • [Fagin, PODS’2004] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Ranking with Ties. ACM PODS, 2004, pp. 47–58. • [Fagin, SIGMOD’2003] Fagin, R., Kumar, R., and Sivakumar, D. Efficient Similarity Search and Classification via Rank Aggregation. ACM SIGMOD, 2003, pp. 301–312. • [Gionis, KDD’2006] Gionis, A., Mannila, H., Puolamäki, K., and Ukkonen, A. Algorithms for Discovering Bucket Orders from Data. ACM KDD, 2006, pp. 561–566. • [Puolamäki, PLoS Comput Biol 2006] Puolamäki, K., Fortelius, M., and Mannila, H. Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods. PLoS Comput Biol 2(2): e6, 2006.
