Discovering Bucket Orders from Full Rankings
Jianlin Feng*, Department of Computer Science and Technology, Huazhong University of Science and Technology
Qiong Fang, Wilfred Ng, Department of Computer Science and Engineering, Hong Kong University of Science and Technology
* Work done at UIUC
SIGMOD, 6/10/08
Definitions of Rankings and Orders
• Full ranking
  • A permutation of n items (or objects).
  • Example: a full ranking T of 6 items: a c b d f e
  • Formalized by a total order
    • A binary relation on items that satisfies antisymmetry, transitivity, and linearity.
• Partial ranking
  • A full ranking of k nonempty buckets
  • Items in the same bucket are tied.
  • Formalized by a bucket order
    • A total order of buckets (i.e., “ties”)
  • Example: a bucket order B: {a, b, c, d} {e, f}
HUST & HKUST
Introduction
• Input: m full rankings (total orders) over n items
• Output: a single full ranking over n items
  • Rank aggregation
    • Voting
    • Meta-search
    • Multi-criteria query
• Or output: a single bucket order over n items?
  • Bucket Order Discovering (BOD)
Motivation (1): Representing Collective Browsing Habits
• Each user’s habit is reflected in his or her browsing sequence:
  • user 1: sports weather
  • user 2: politics weather
  • …
  • user n: politics weather
• Similar users should have similar, but not strictly identical, browsing sequences.
• A “representative” bucket order of collective browsing habits: {politics, sports} {weather}
[Figure: labels frontpage, news, and the bucket {frontpage, news}]
Motivation (2): Approximating the Bucket Order of Fossil Sites
• Seriation in paleontology
  • Given a 0-1 matrix, find an order of the rows such that the 1s are as consecutive as possible.
  • Markov Chain Monte Carlo (MCMC) total orders
    • Puolamäki et al, PLoS Comput Biol’s 2006
  • The underlying order is indeed a bucket order.
• Paleontological dataset: g10s10
  • 124 fossil sites
  • The “ground truth” bucket order has 15 buckets.
• Given the total orders generated by MCMC, which are linear extensions of the underlying bucket order, we want to find a good approximation of the underlying bucket order.
[Figure: seriation of a 0-1 matrix; axis labels: species, fossil site]
Problem statement: Bucket Order Discovering (BOD)
• Given m full rankings R = {T1, T2, ..., Tm} over n items,
• we want to find a bucket order B such that
  • “representative” perspective: B is a good “representative” that summarizes R well;
  • “approximation” perspective: B is a good “approximation” of some “ground truth” bucket order G,
    • where R is simply a set of “linear extensions” of G.
Outline
• Motivation
• Problem formulation
• Previous algorithms
  • The Bucket Pivot Algorithm
  • The Dynamic Programming Algorithm
• Our approach
  • The Bucket Gap Algorithm
• Experimental study
• Conclusion
What Makes a Good Bucket Order?
• Precedence probability Ptu:
  • The fraction of the input full rankings in which item t precedes item u.
• A good bucket order B should preserve the pair-wise precedence relationships well:
  • small |Ptu - 1.0| ==> t should precede u in B.
  • small |Ptu - 0.5| ==> t and u should be “tied” in B.
  • small |Ptu - 0.0| ==> u should precede t in B.
• The distance between B and the input full rankings
  • The sum, over item pairs, of the value |Ptu - 1.0|, |Ptu - 0.5|, or |Ptu - 0.0| that matches how B orders the pair.
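The precedence probability can be computed by counting pairs across the input rankings. The following is a minimal sketch, not the authors' code: the function name `precedence_probabilities` is hypothetical, and apart from the first ranking (the slide's example a c b d f e) the input rankings are made up for illustration.

```python
from itertools import combinations

def precedence_probabilities(rankings):
    """Return P[(t, u)]: the fraction of rankings in which t precedes u."""
    items = sorted(rankings[0])
    m = len(rankings)
    counts = {(t, u): 0 for t in items for u in items if t != u}
    for ranking in rankings:
        position = {item: i for i, item in enumerate(ranking)}
        for t, u in combinations(items, 2):
            if position[t] < position[u]:
                counts[t, u] += 1
            else:
                counts[u, t] += 1
    return {pair: count / m for pair, count in counts.items()}

# Four hypothetical full rankings over six items.
rankings = [list("acbdfe"), list("abcdef"), list("cabdfe"), list("acbdef")]
P = precedence_probabilities(rankings)
print(P["a", "c"])  # 0.75: a precedes c in 3 of the 4 rankings
print(P["e", "f"])  # 0.5: e and f are natural candidates for a tie
```

Counting every pair in every ranking takes O(mn^2) time for m rankings over n items.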
Distance in Matrix Notation (Gionis et al, KDD’2006)
• The input pair-order matrix C: Ctu is Ptu.
• The pair-order matrix CB for bucket order B:
  • CBtu equals 1.0, if t precedes u in B
  • CBtu equals 0.0, if u precedes t in B
  • CBtu equals 0.5, if t and u are “tied” in B
• The distance between B and the input full rankings is measured between CB and C.
  • This is the I-Distance, which measures how good a “representative” B is.
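A minimal sketch of the matrix form follows. It assumes the distance is the plain entrywise sum of |CBtu - Ctu| over ordered pairs; the paper may apply a different normalization. The names `bucket_matrix` and `l1_distance`, and the small matrix `C`, are hypothetical.

```python
def bucket_matrix(buckets):
    """Pair-order matrix CB of a bucket order given as a list of sets."""
    bucket_of = {item: i for i, bucket in enumerate(buckets) for item in bucket}
    CB = {}
    for t in bucket_of:
        for u in bucket_of:
            if t == u:
                continue
            if bucket_of[t] < bucket_of[u]:
                CB[t, u] = 1.0   # t precedes u
            elif bucket_of[t] > bucket_of[u]:
                CB[t, u] = 0.0   # u precedes t
            else:
                CB[t, u] = 0.5   # t and u are tied
    return CB

def l1_distance(C, CB):
    """Sum of |CB[t, u] - C[t, u]| over all ordered pairs (assumed form)."""
    return sum(abs(CB[pair] - C[pair]) for pair in CB)

# Hypothetical input pair-order matrix over three items.
C = {("a", "b"): 0.5, ("b", "a"): 0.5,
     ("a", "c"): 1.0, ("c", "a"): 0.0,
     ("b", "c"): 0.75, ("c", "b"): 0.25}
B = [{"a", "b"}, {"c"}]
print(l1_distance(C, bucket_matrix(B)))  # 0.5
```

Only the pair (b, c) disagrees with B here, contributing 0.25 in each direction.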
G-Distance for goodness of “approximation” (Gionis et al, KDD’2006)
• CG: the pair-order matrix of the “ground truth” G.
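The formulas on the two distance slides did not survive extraction. Under the assumption that both distances are entrywise L1 distances between pair-order matrices, as the verbal definition suggests, they can be written as below; the symbols D_I and D_G are introduced here for illustration, and the original may include a normalizing constant.

```latex
D_I(B) = \sum_{t \neq u} \left| C^{B}_{tu} - C_{tu} \right|,
\qquad
D_G(B) = \sum_{t \neq u} \left| C^{B}_{tu} - C^{G}_{tu} \right|
```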
Formal Definition of BOD
• The BOD problem is now formulated as follows:
  • Given a collection of input full rankings,
  • find a bucket order that minimizes the I-Distance (or G-Distance).
• This optimization problem is NP-hard (Gionis et al, KDD’2006).
  • We have to use heuristic algorithms.
The Bucket Pivot Algorithm (PIVOT) (Gionis et al, KDD’2006)
• Input: the input pair-order matrix C
• Output: a bucket order B
• Idea:
  • If Ctu is close enough to 0.5:
    • 0.5 - f ≤ Ctu < 0.5 + f, where f is a bounding parameter
    • Then t and u should be put into the same bucket in B.
  • Else u goes “left” (u precedes t) or “right” (t precedes u).
• To avoid checking each Ctu, proceed recursively around a pivot, as in quicksort.
  • Adapted from the FAS-PIVOT algorithm (Ailon et al, STOC’2005)
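The recursion can be sketched as follows. This is an illustrative reading of the slide, not the reference implementation: the function name `bucket_pivot`, the nested-dict layout of `C`, and the `rng` parameter are assumptions, and the pivot is drawn uniformly at random.

```python
import random

def bucket_pivot(items, C, f, rng=None):
    """Quicksort-style bucket pivot; C[t][u] is the fraction of rankings with t before u."""
    rng = rng or random.Random()
    items = list(items)
    if not items:
        return []
    pivot = rng.choice(items)
    left, same, right = [], [pivot], []
    for u in items:
        if u == pivot:
            continue
        p = C[pivot][u]
        if 0.5 - f <= p < 0.5 + f:
            same.append(u)    # close to 0.5: tie u with the pivot
        elif p >= 0.5 + f:
            right.append(u)   # pivot precedes u
        else:
            left.append(u)    # u precedes pivot
    return (bucket_pivot(left, C, f, rng)
            + [set(same)]
            + bucket_pivot(right, C, f, rng))

# Hypothetical noise-free matrix for the bucket order {a, b} {c}.
C = {"a": {"b": 0.5, "c": 1.0},
     "b": {"a": 0.5, "c": 1.0},
     "c": {"a": 0.0, "b": 0.0}}
print(bucket_pivot("abc", C, f=0.25, rng=random.Random(0)))
```

On this noise-free input every pivot choice yields the same two buckets; on noisy input the result depends on both the pivots and f.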
Limitations of PIVOT: Results heavily depend on the pivots chosen and on f
[Figure: the input pair-order matrix C]
• f is 0.35
  • If a is the pivot in the 1st recursion: {a, b, c, d, e, f}
• f is 0.25
  • If a is the pivot in the 1st recursion: {a, c, d, f} {b} {e}
  • If b is the pivot in the 1st recursion: {a, c} {b, d, f} {e}
The Dynamic Programming Algorithm (DP) (Fagin et al, PODS’2004)
• Idea:
  • If two items’ median ranks are close enough, they should be put into the same bucket.
  • Median_rank(i) = median(T1(i), T2(i), …, Tm(i))
• Step 1: pre-processing
  • Avoids checking “closeness” of median ranks between each pair of items.
  • MEDRANK (Fagin et al, SIGMOD’2003) sorts the n items into a total order T in non-decreasing order of the items’ median ranks.
  • T: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>
• Step 2: use “closeness” of median ranks to form buckets
  • Use dynamic programming to segment T into a bucket order B.
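Step 1 can be illustrated by computing median ranks directly and sorting on them. This is only a sketch of the resulting order, not of MEDRANK's access pattern: the function name `median_rank_order` is hypothetical, and `median_low` is an assumed convention for an even number of rankings.

```python
from statistics import median_low

def median_rank_order(rankings):
    """Sort items by median rank (1-based positions); ties broken by item name."""
    ranks = {}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            ranks.setdefault(item, []).append(position)
    median_rank = {item: median_low(r) for item, r in ranks.items()}
    return sorted(median_rank.items(), key=lambda pair: (pair[1], pair[0]))

# Three hypothetical full rankings over three items.
rankings = [list("abc"), list("acb"), list("bac")]
print(median_rank_order(rankings))  # [('a', 1), ('b', 2), ('c', 3)]
```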
Two Limitations of DP: from the “Approximation” Perspective
• Limitation 1:
  • Two items from different buckets in the “ground truth” bucket order G can also have close median ranks.
• Limitation 2:
  • DP’s minimization of bucket costs tends to break a big bucket b of G into several small buckets.
  • Bucket cost: (formula figure; annotated terms: the median rank of the l-th item along a total order T, and the average position)
  • Observed on g10s10:
    • DP generates 34 buckets, while G has only 15 buckets.
The Bucket Gap Algorithm (GAP): Basic Ideas
• Motivated by the two limitations of DP
• Idea 1: If two items are close on multiple quantile ranks, it is more reliable to put them into the same bucket.
  • Quantile_rank(i) = quantile(T1(i), T2(i), …, Tm(i))
  • The median rank is the quantile rank for the 50% quantile.
• Idea 2: Items from different buckets should have “abnormally large gaps” between their quantile ranks.
  • DP’s idea: items in the same bucket should have small gaps between their median ranks.
The Bucket Gap Algorithm: A Two-Phase Framework
• Phase 1: check “closeness” of items on each quantile rank separately.
  • For each quantile, sort all the items in non-decreasing order of their corresponding quantile ranks.
    • Such a total order is called a quantile order.
  • Use our novel Abnormal Rank Gap heuristic to segment quantile orders into initial bucket orders.
• Phase 2: aggregate the “closeness” of items on each quantile rank to generate the final bucket order.
  • Perform a median rank aggregation on the initial bucket orders.
MEDRANK+: generating quantile orders
• First, sort the quantiles in increasing order.
• Then, perform a round-robin scan of all the input full rankings.
• In each round, output items with their quantile ranks to the corresponding quantile orders.
[Figure: round-robin scan of the input rankings feeding quantile orders for the 30%, 50%, 70%, and 90% quantiles; item a enters the 30% and 50% orders with rank 1]
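A quantile order can also be obtained by computing quantile ranks directly, which is easier to read than the round-robin scan though it is not how MEDRANK+ proceeds. In this sketch the function name `quantile_order` is hypothetical, and the quantile rank of an item is assumed to be its ceil(q * m)-th smallest rank across the m rankings; the paper's exact definition may differ.

```python
import math

def quantile_order(rankings, q):
    """Sort items by their q-quantile rank (assumed: ceil(q * m)-th smallest rank)."""
    m = len(rankings)
    # Small epsilon guards against float error when q * m is an integer.
    k = max(1, math.ceil(q * m - 1e-9))
    ranks = {}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            ranks.setdefault(item, []).append(position)
    quantile_rank = {item: sorted(r)[k - 1] for item, r in ranks.items()}
    return sorted(quantile_rank.items(), key=lambda pair: (pair[1], pair[0]))

# Three hypothetical full rankings over three items.
rankings = [list("abc"), list("acb"), list("bac")]
print(quantile_order(rankings, 0.5))  # [('a', 1), ('b', 2), ('c', 3)]
print(quantile_order(rankings, 0.9))  # [('a', 2), ('b', 3), ('c', 3)]
```

At the 90% quantile b and c share a rank even though their median ranks differ, which is the kind of extra evidence the multiple-quantile idea relies on.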
The Abnormal Rank Gap Heuristic
• A quantile order Q1 over 6 items has 5 rank gaps.
• Average gap ga and standard deviation sg:
  • ga = 4/5, sg = sqrt(14) / 5.
• A rank gap gi is abnormal if gi > average gap + one unit of standard deviation.
• The heuristic:
  • An abnormal rank gap separates two consecutive buckets.
  • Na abnormal rank gaps ==> (Na + 1) buckets
• Only g5 is abnormal in Q1.
• Initial bucket order B1: < { a, b, c, d, f }, { e } >
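The heuristic can be sketched in a few lines. The ranks in `q1` below are hypothetical: the slide's figure was lost, so they were chosen only to reproduce its stated statistics (ga = 4/5, sg = sqrt(14)/5, which matches the population standard deviation). The function name `segment_by_abnormal_gaps` is also an assumption.

```python
from statistics import mean, pstdev

def segment_by_abnormal_gaps(quantile_order):
    """Split a quantile order, a list of (item, rank) sorted by rank, into buckets."""
    if not quantile_order:
        return []
    ranks = [rank for _, rank in quantile_order]
    gaps = [b - a for a, b in zip(ranks, ranks[1:])]
    buckets = [{quantile_order[0][0]}]
    if not gaps:
        return buckets
    # A gap is abnormal if it exceeds the mean by more than one standard deviation.
    threshold = mean(gaps) + pstdev(gaps)
    for (item, _), gap in zip(quantile_order[1:], gaps):
        if gap > threshold:
            buckets.append({item})   # abnormal gap: start a new bucket
        else:
            buckets[-1].add(item)
    return buckets

# Hypothetical ranks with gaps 0, 1, 0, 1, 2 (mean 0.8, std about 0.748).
q1 = [("a", 1), ("b", 1), ("c", 2), ("d", 2), ("f", 3), ("e", 5)]
print(segment_by_abnormal_gaps(q1))  # two buckets: {a, b, c, d, f} and {e}
```

Because the threshold adapts to each quantile order, a run of small gaps inside a large bucket does not trigger a split.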
Median Rank Aggregation on Initial Bucket Orders
• Put items with the same median rank into the same bucket in the final bucket order.
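Phase 2 can be sketched as follows, under two assumptions that the slide does not state: an item's rank in an initial bucket order is taken to be the 1-based index of its bucket, and `median_low` is used as the median. The function name `aggregate_bucket_orders` and the three initial bucket orders are hypothetical.

```python
from statistics import median_low

def aggregate_bucket_orders(bucket_orders):
    """Group items by their median bucket index across the initial bucket orders."""
    ranks = {}
    for order in bucket_orders:
        for index, bucket in enumerate(order, start=1):
            for item in bucket:
                ranks.setdefault(item, []).append(index)
    final = {}
    for item, r in ranks.items():
        final.setdefault(median_low(r), set()).add(item)
    # Buckets are emitted in increasing order of median rank.
    return [final[rank] for rank in sorted(final)]

# Hypothetical initial bucket orders, one per quantile.
initial = [
    [{"a", "b"}, {"c"}],
    [{"a"}, {"b", "c"}],
    [{"a", "b"}, {"c"}],
]
print(aggregate_bucket_orders(initial))  # two buckets: {a, b} and {c}
```

Item b is tied with a in two of the three initial orders, so the median places it in the first bucket.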
Experimental study
• Algorithms:
  • PIVOT, DP, GAP
  • Only PIVOT has error bars, showing one unit of standard deviation.
• Datasets
  • Synthetic datasets
    • Noise level: 20%
  • Real clickstream dataset
    • MSNBC
  • Real paleontology dataset g10s10
    • 2,000 sequences, 124 items.
• Detailed results are in the paper.
Scalability using G-Distance: Synthetic Dataset
• The bottleneck of PIVOT (or of using I-Distance): computing the input pair-order matrix costs O(mn^2).
  • m: number of input full rankings
  • n: number of items
I-Distance and G-Distance: Paleontological Data (2,000 sequences, 124 items)
[Figure: includes a baseline labeled “GAP using Median Rank only”]
• The adoption of multiple quantile ranks makes sense.
• Since GAP is fast, we can run it several times to search for the best result.
Conclusions
• Introduce a two-phase rank aggregation framework
  • to exploit “closeness” on multiple quantile ranks
  • Can achieve more reliable bucket forming
• Introduce the Abnormal Rank Gap heuristic
  • Can better check “closeness” on a single quantile rank
  • Avoids breaking big buckets into small ones.
• Future work
  • The general setting:
    • input full rankings have various lengths.
  • Some theoretical basis:
    • gain better insight into GAP’s effectiveness.
A Note on the Correction of Reference 7
• Two authors were left out of Reference 7 (Sivakumar, D., and Vee, E.).
• The correct version should be:
  • Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47-58.
Thank you. Any questions?
References
• [Ailon, STOC’2005] Ailon, N., Charikar, M., and Newman, A. Aggregating Inconsistent Information: Ranking and Clustering. ACM STOC, 2005, pp. 684-693.
• [Fagin, PODS’2004] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47-58.
• [Fagin, SIGMOD’2003] Fagin, R., Kumar, R., and Sivakumar, D. Efficient Similarity Search and Classification via Rank Aggregation. ACM SIGMOD, 2003, pp. 301-312.
• [Gionis, KDD’2006] Gionis, A., Mannila, H., Puolamäki, K., and Ukkonen, A. Algorithms for Discovering Bucket Orders from Data. ACM KDD, 2006, pp. 561-566.
• [Puolamäki, PLoS Comput Biol’s 2006] Puolamäki, K., Fortelius, M., and Mannila, H. Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods. PLoS Comput Biol 2(2): e6, 2006.