Discovering Bucket Orders from Full Rankings

Jianlin Feng*
Department of Computer Science and Technology, Huazhong University of Science and Technology
Qiong Fang, Wilfred Ng
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
* Work done at UIUC
SIGMOD, 6/10/08

Definitions of Rankings and Orders
• Full ranking
  • A permutation of n items (or objects).
  • Example: a full ranking T of 6 items: a → c → b → d → f → e
  • Formalized by a total order
    • A binary relation on items satisfying the three criteria of antisymmetry, transitivity, and linearity.
• Partial ranking
  • A full ranking of k nonempty buckets
  • Items in the same bucket are tied.
  • Formalized by a bucket order
    • A total order of buckets (i.e., “ties”)
    • Example: a bucket order B: {a, b, c, d} → {e, f}
HUST & HKUST

Introduction
• Input: m full rankings (total orders) over n items
• Output: a single full ranking over n items
  • Rank aggregation
    • Voting
    • Meta-search
    • Multi-criteria query
• Or output: a single bucket order over n items?
  • Bucket Order Discovering (BOD)

Motivation (1): Representing Collective Browsing Habits
• Each user’s habit is reflected in his or her browsing sequence:
  • user 1: frontpage → sports → weather
  • user 2: news → politics → weather
  • …
  • user n: frontpage → politics → weather
• Similar users should have similar, but not strictly identical, browsing sequences.
• A “representative” bucket order of collective browsing habits:
  • {frontpage, news} → {politics, sports} → {weather}

Motivation (2): Approximating the Bucket Order of Fossil Sites
[Figure: species × fossil site 0-1 matrices illustrating seriation]
• Seriation in paleontology
  • Given a 0-1 matrix, find an order of the rows such that the 1s are as consecutive as possible.
  • Markov Chain Monte Carlo (MCMC) → total orders
    • Puolamäki et al, PLoS Comput Biol 2006
• The underlying order is in fact a bucket order.
  • Paleontological dataset: g10s10
    • 124 fossil sites
    • The “ground truth” bucket order has 15 buckets.
• Given the total orders generated by MCMC,
  • which are linear extensions of the underlying bucket order,
  • we want to find a good approximation of the underlying bucket order.

Problem Statement: Bucket Order Discovering (BOD)
• Given m full rankings R = {T1, T2, ..., Tm} over n items,
• we want to find a bucket order B such that
  • “representative” perspective: B is a good “representative” that summarizes R well;
  • “approximation” perspective: B is a good “approximation” of some “ground truth” bucket order G,
    • where R is simply a set of “linear extensions” of G.
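The “linear extension” relation used in the approximation perspective can be made concrete with a short check. This is a minimal sketch, assuming a bucket order is represented as a list of sets and a full ranking as a list of items; the function name `is_linear_extension` is illustrative and not from the talk.

```python
def is_linear_extension(ranking, bucket_order):
    """Return True if `ranking` never places an item of a later bucket
    before an item of an earlier bucket (ties may be broken arbitrarily)."""
    # Map every item to the index of the bucket that contains it.
    bucket_of = {item: i for i, bucket in enumerate(bucket_order) for item in bucket}
    indices = [bucket_of[item] for item in ranking]
    # Bucket indices must be non-decreasing along the ranking.
    return all(x <= y for x, y in zip(indices, indices[1:]))


B = [{"a", "b", "c", "d"}, {"e", "f"}]
print(is_linear_extension(list("acbdfe"), B))  # True
print(is_linear_extension(list("aecbdf"), B))  # False
```

The first ranking only permutes items inside each bucket, so it extends B; the second moves e ahead of items from the first bucket, so it does not.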

Outline
• Motivation
• Problem formulation
• Previous algorithms
  • The Bucket Pivot Algorithm
  • The Dynamic Programming Algorithm
• Our approach
  • The Bucket Gap Algorithm
• Experimental study
• Conclusion

What Makes a Good Bucket Order?
• Precedence probability P_tu:
  • The fraction of the input full rankings in which item t precedes u.
• A good bucket order B should preserve the pairwise precedence relationships well:
  • small |P_tu - 1.0| ==> t should precede u in B.
  • small |P_tu - 0.5| ==> t and u should be “tied” in B.
  • small |P_tu - 0.0| ==> u should precede t in B.
• The distance between B and the input full rankings
  • The sum, over item pairs, of |P_tu - 1.0|, |P_tu - 0.5|, or |P_tu - 0.0|, whichever matches how B orders t and u.

Distance in Matrix Notation (Gionis et al, KDD’2006)
• The input pair-order matrix C: C_tu is P_tu.
• The pair-order matrix C^B for a bucket order B:
  • C^B_tu equals 1.0 if t precedes u in B
  • C^B_tu equals 0.0 if u precedes t in B
  • C^B_tu equals 0.5 if t and u are “tied” in B
• The distance between B and the input full rankings: the sum over item pairs (t, u) of |C^B_tu - C_tu|.
  • This is the I-Distance, which measures goodness as a “representative”.
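The matrix definitions above can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary-of-pairs representation, and the choice to sum over ordered pairs (t, u) with t ≠ u are assumptions, not details taken from the talk.

```python
from itertools import permutations


def pair_order_matrix(rankings):
    """C[t, u] = fraction of the input rankings in which t precedes u."""
    items = rankings[0]
    counts = {pair: 0 for pair in permutations(items, 2)}
    for ranking in rankings:
        pos = {item: i for i, item in enumerate(ranking)}
        for t, u in counts:
            if pos[t] < pos[u]:
                counts[t, u] += 1
    return {pair: c / len(rankings) for pair, c in counts.items()}


def bucket_matrix(bucket_order):
    """C^B[t, u] = 1.0, 0.0, or 0.5 depending on how B orders t and u."""
    bucket_of = {item: i for i, bucket in enumerate(bucket_order) for item in bucket}
    CB = {}
    for t, u in permutations(bucket_of, 2):
        if bucket_of[t] < bucket_of[u]:
            CB[t, u] = 1.0
        elif bucket_of[t] > bucket_of[u]:
            CB[t, u] = 0.0
        else:
            CB[t, u] = 0.5
    return CB


def distance(CB, C):
    """Sum of absolute entry-wise differences over ordered pairs (assumed)."""
    return sum(abs(CB[pair] - C[pair]) for pair in C)


rankings = [list("abc"), list("acb")]
C = pair_order_matrix(rankings)
print(distance(bucket_matrix([{"a"}, {"b", "c"}]), C))    # 0.0
print(distance(bucket_matrix([{"a"}, {"b"}, {"c"}]), C))  # 1.0
```

In the example, b and c each precede the other in half of the rankings, so tying them costs nothing, while forcing b before c costs 0.5 for each of the two ordered pairs.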

G-Distance for Goodness of “Approximation” (Gionis et al, KDD’2006)
• C^G: the pair-order matrix of the “ground truth” G.

Formal Definition of BOD
• The BOD problem is now formulated as follows:
  • Given a collection of input full rankings,
  • find a bucket order that minimizes I-Distance (or G-Distance).
• This optimization problem is NP-hard (Gionis et al, KDD’2006).
  • We have to use heuristic algorithms.

The Bucket Pivot Algorithm (PIVOT) (Gionis et al, KDD’2006)
• Input: the input pair-order matrix C
• Output: a bucket order B
• Idea:
  • If C_tu is close enough to 0.5:
    • 0.5 - f ≤ C_tu < 0.5 + f, where f is the bounding parameter
    • then t and u should be put into the same bucket in B.
  • Else “left” (u → t) or “right” (t → u)
  • To avoid checking every C_tu, proceed like the quick-sort algorithm.
• Adapted from the FAS-PIVOT algorithm (Ailon et al, STOC’2005)
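The quick-sort-like recursion described above might look as follows. This is a hedged sketch rather than the published algorithm verbatim: the function name `bucket_pivot`, the uniformly random pivot choice, and the `rng` argument are assumptions, and C is taken to be a dictionary keyed by ordered item pairs.

```python
import random


def bucket_pivot(items, C, f, rng=None):
    """Return a list of buckets (lists of items), earliest bucket first."""
    rng = rng or random.Random()
    if not items:
        return []
    pivot = rng.choice(items)           # assumed: pivot drawn uniformly at random
    left, same, right = [], [pivot], []
    for u in items:
        if u == pivot:
            continue
        p = C[pivot, u]                 # fraction of rankings where pivot precedes u
        if p < 0.5 - f:
            left.append(u)              # u mostly precedes the pivot
        elif p < 0.5 + f:
            same.append(u)              # close to 0.5: tie u with the pivot
        else:
            right.append(u)             # pivot mostly precedes u
    return bucket_pivot(left, C, f, rng) + [same] + bucket_pivot(right, C, f, rng)


# Hypothetical matrix: a always precedes b and c; b and c are evenly split.
C = {("a", "b"): 1.0, ("b", "a"): 0.0,
     ("a", "c"): 1.0, ("c", "a"): 0.0,
     ("b", "c"): 0.5, ("c", "b"): 0.5}
print(bucket_pivot(["a", "b", "c"], C, f=0.25, rng=random.Random(0)))
```

Only the entries C[pivot, u] are read, which is why the recursion avoids inspecting every pair. On this tiny matrix every pivot choice yields the buckets {a} and {b, c}; on less clear-cut inputs the result depends on the pivots and on f.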

Limitations of PIVOT: Results Heavily Depend on the Pivots Chosen and on f
[Figure: the input pair-order matrix C]
• f is 0.35
  • If a is the pivot in the 1st recursion:
    • {a, b, c, d, e, f}
• f is 0.25
  • If a is the pivot in the 1st recursion:
    • {a, c, d, f} → {b} → {e}
  • If b is the pivot in the 1st recursion:
    • {a, c} → {b, d, f} → {e}

The Dynamic Programming Algorithm (DP) (Fagin et al, PODS’2004)
• Idea:
  • If two items’ median ranks are close enough, they should be put into the same bucket.
  • Median_rank(i) = median(T1(i), T2(i), …, Tm(i))
• Step 1: pre-processing
  • Avoids checking the “closeness” of median ranks for every pair of items.
  • MEDRANK (Fagin et al, SIGMOD’2003) sorts the n items into a total order T in non-decreasing order of their median ranks.
    • T: <a: 1, c: 2, b: 4, d: 4, f: 5, e: 6>
• Step 2: use “closeness” of median ranks to form buckets
  • Dynamic programming segments T into a bucket order B.
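The two steps can be sketched as below. The bucket cost is an assumption made for illustration: a bucket covering positions i..j of T is charged the sum of |median rank - (i + j)/2| over its items, i.e. the deviation of each median rank from the bucket's average position. The function names and the use of `statistics.median` (which averages the two middle values for an even number of rankings) are likewise illustrative.

```python
from statistics import median


def median_ranks(rankings):
    """Median of each item's 1-based positions across the input rankings."""
    positions = {item: [] for item in rankings[0]}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            positions[item].append(rank)
    return {item: median(ranks) for item, ranks in positions.items()}


def segment_by_dp(order, med):
    """Split `order` into consecutive buckets with minimum total assumed cost."""
    n = len(order)

    def cost(i, j):
        # Bucket holds 1-based positions i..j; its position is their average.
        avg_pos = (i + j) / 2
        return sum(abs(med[order[l - 1]] - avg_pos) for l in range(i, j + 1))

    best = [0.0] + [float("inf")] * n  # best[j]: min cost for the first j items
    cut = [0] * (n + 1)                # cut[j]: 0-based start of the last bucket
    for j in range(1, n + 1):
        for i in range(1, j + 1):
            c = best[i - 1] + cost(i, j)
            if c < best[j]:
                best[j], cut[j] = c, i - 1
    buckets, j = [], n
    while j > 0:                       # walk the cut points backwards
        buckets.append(order[cut[j]:j])
        j = cut[j]
    return buckets[::-1]


rankings = [list("abc"), list("bac")]
med = median_ranks(rankings)             # a: 1.5, b: 1.5, c: 3
order = sorted(rankings[0], key=med.get)  # Step 1: sort by median rank
print(segment_by_dp(order, med))          # [['a', 'b'], ['c']]
```

This naive version recomputes each bucket cost from scratch, so it is cubic in n; it is meant to show the recurrence, not to be efficient.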

Two Limitations of DP: from the “Approximation” Perspective
• Limitation 1:
  • Two items from different buckets in the “ground truth” bucket order G can also have close median ranks.
• Limitation 2:
  • DP’s minimization of bucket costs tends to break a big bucket b of G into several small buckets.
  • Bucket cost: [formula not preserved; its annotated terms are the median rank of the l-th item along a total order T and the average position]
• Observed on g10s10:
  • DP generates 34 buckets, while G has only 15 buckets.

The Bucket Gap Algorithm (GAP): Basic Ideas
• Motivated by the two limitations of DP
• Idea 1: If two items are close on multiple quantile ranks, it is more reliable to put them into the same bucket.
  • Quantile_rank(i) = quantile(T1(i), T2(i), …, Tm(i))
  • The median rank is the quantile rank for the 50% quantile.
• Idea 2: Items from different buckets should have “abnormally large gaps” between their quantile ranks.
  • DP’s idea: items in the same bucket should have small gaps between their median ranks.

The Bucket Gap Algorithm: A Two-Phase Framework
• Phase 1: check the “closeness” of items on each quantile rank separately.
  • For each quantile, sort all the items in non-decreasing order of their corresponding quantile ranks.
    • Such a total order is called a quantile order.
  • Use our novel Abnormal Rank Gap heuristic to segment the quantile orders into initial bucket orders.
• Phase 2: aggregate the “closeness” of items on each quantile rank to generate the final bucket order.
  • Perform a median rank aggregation on the initial bucket orders.

MEDRANK+: Generating Quantile Orders
• First, sort the quantiles in increasing order.
• Then, perform a round-robin scan of all the input full rankings.
• In each round, output items with their quantile ranks to the corresponding quantile orders.
[Figure: round-robin scan of the input rankings feeding the 30%, 50%, 70%, and 90% quantile orders; a: 1 appears in the 30% and 50% orders]
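A round-robin scan of this kind could be sketched as follows. This is a guess at the mechanics, not the authors' implementation: it assumes an item receives its q-quantile rank in the round (rank position) where it has been seen in ceil(q · m) of the m rankings, and the function name `quantile_orders` is hypothetical.

```python
from math import ceil


def quantile_orders(rankings, quantiles):
    """Return {q: [(item, quantile_rank), ...]} in order of emission."""
    m, n = len(rankings), len(rankings[0])
    quantiles = sorted(quantiles)
    # Assumed rule: an item needs ceil(q * m) sightings to get its q-quantile rank.
    # round() guards against floating-point noise such as 3.0000000000000004.
    needed = [max(1, ceil(round(q * m, 9))) for q in quantiles]
    seen = {item: 0 for item in rankings[0]}
    orders = {q: [] for q in quantiles}
    for pos in range(n):                 # one round per rank position
        for ranking in rankings:         # visit every input ranking in turn
            item = ranking[pos]
            seen[item] += 1
            for q, need in zip(quantiles, needed):
                if seen[item] == need:
                    orders[q].append((item, pos + 1))
    return orders


rankings = [list("abc"), list("abc"), list("bac"), list("cab")]
orders = quantile_orders(rankings, [0.5, 0.75])
print(orders[0.5])   # [('a', 1), ('b', 2), ('c', 3)]
print(orders[0.75])  # [('b', 2), ('a', 2), ('c', 3)]
```

Because rounds proceed in increasing rank position, each quantile order comes out already sorted by quantile rank, with no separate sort per quantile.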

The Abnormal Rank Gap Heuristic
• A quantile order Q1 with 5 rank gaps g1, ..., g5 [figure not preserved]
• Average gap ga and standard deviation sg
  • ga = 4/5, sg = sqrt(14) / 5.
• A rank gap gi is abnormal if gi > average gap + one unit of standard deviation.
• The heuristic
  • An abnormal rank gap separates two consecutive buckets.
  • Na abnormal rank gaps → (Na + 1) buckets
• Only g5 is abnormal in Q1 → initial bucket order B1: < { a, b, c, d, f }, { e } >
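The heuristic is easy to try on a small quantile order. The ranks below are hypothetical, since the slide's Q1 was a figure that did not survive; they were chosen so that the gaps reproduce the stated ga = 4/5 and sg = sqrt(14)/5, which matches the population standard deviation (`statistics.pstdev`). The function name is illustrative.

```python
from statistics import mean, pstdev


def segment_by_abnormal_gaps(quantile_order):
    """quantile_order: list of (item, rank) pairs in non-decreasing rank order."""
    items = [item for item, _ in quantile_order]
    if len(items) < 2:
        return [items] if items else []
    ranks = [rank for _, rank in quantile_order]
    gaps = [b - a for a, b in zip(ranks, ranks[1:])]
    threshold = mean(gaps) + pstdev(gaps)   # average gap + one standard deviation
    buckets = [[items[0]]]
    for item, gap in zip(items[1:], gaps):
        if gap > threshold:                 # abnormal gap: start a new bucket
            buckets.append([])
        buckets[-1].append(item)
    return buckets


# Hypothetical quantile order: gaps are 0, 1, 0, 1, 2.
Q1 = [("a", 1), ("c", 1), ("b", 2), ("d", 2), ("f", 3), ("e", 5)]
print(segment_by_abnormal_gaps(Q1))  # [['a', 'c', 'b', 'd', 'f'], ['e']]
```

Here the threshold is 0.8 + 0.748 ≈ 1.548, so only the final gap of 2 is abnormal and e is split off into its own bucket.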

Median Rank Aggregation on Initial Bucket Orders
• Put items with the same median rank into the same bucket in the final bucket order.
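One way to read this step is sketched below. It assumes that an item's rank in an initial bucket order is the 1-based index of its bucket, and that final buckets are ordered by increasing median rank; both are assumptions, as is the function name.

```python
from statistics import median


def aggregate_bucket_orders(bucket_orders):
    """Group items by their median bucket index across the initial bucket orders."""
    ranks = {}
    for order in bucket_orders:
        for index, bucket in enumerate(order, start=1):
            for item in bucket:
                ranks.setdefault(item, []).append(index)
    grouped = {}
    for item, item_ranks in ranks.items():
        grouped.setdefault(median(item_ranks), []).append(item)
    # Final buckets in increasing order of median rank.
    return [sorted(grouped[value]) for value in sorted(grouped)]


initial = [
    [["a", "b"], ["c"]],
    [["a"], ["b", "c"]],
    [["a", "b"], ["c"]],
]
print(aggregate_bucket_orders(initial))  # [['a', 'b'], ['c']]
```

In the example, b falls into the second bucket in only one of three initial orders, so its median rank stays 1 and it remains tied with a.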

Experimental Study
• Algorithms:
  • PIVOT, DP, GAP
  • Only PIVOT has error bars showing one unit of standard deviation.
• Datasets
  • Synthetic datasets
    • Noise level: 20%
  • Real clickstream dataset
    • MSNBC
  • Real paleontology dataset g10s10
    • 2,000 sequences, 124 items
• Details of the results are in the paper.

Scalability using G-Distance: Synthetic Dataset
• The bottleneck of PIVOT (or of using I-Distance): computing the input pair-order matrix costs O(mn²).
  • m: number of input full rankings
  • n: number of items

I-Distance and G-Distance: Paleontological Data (2,000 sequences, 124 items)
[Figure label: GAP using median rank only]
• The adoption of multiple quantile ranks makes sense.
• Since GAP is fast, we can run it several times to search for the best result.

Conclusions
• Introduced a two-phase rank aggregation framework
  • to exploit “closeness” on multiple quantile ranks
  • Can achieve more reliable bucket forming
• Introduced the Abnormal Rank Gap heuristic
  • Can better check “closeness” on a single quantile rank
  • Avoids breaking big buckets into small ones.
• Future work
  • The general setting:
    • input full rankings have various lengths.
  • Some theoretical basis:
    • to gain better insight into GAP’s effectiveness.

A Note on the Correction of Reference 7
• Two authors were left out of Reference 7 (Sivakumar, D., and Vee, E.).
• The correct version should be:
  • Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47–58.

Thank you. Any questions?

References
• [Ailon, STOC’2005] Ailon, N., Charikar, M., and Newman, A. Aggregating Inconsistent Information: Ranking and Clustering. ACM STOC, 2005, pp. 684–693.
• [Fagin, PODS’2004] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. Comparing and Aggregating Rankings with Ties. ACM PODS, 2004, pp. 47–58.
• [Fagin, SIGMOD’2003] Fagin, R., Kumar, R., and Sivakumar, D. Efficient Similarity Search and Classification via Rank Aggregation. ACM SIGMOD, 2003, pp. 301–312.
• [Gionis, KDD’2006] Gionis, A., Mannila, H., Puolamäki, K., and Ukkonen, A. Algorithms for Discovering Bucket Orders from Data. ACM KDD, 2006, pp. 561–566.
• [Puolamäki, PLoS Comput Biol 2006] Puolamäki, K., Fortelius, M., and Mannila, H. Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods. PLoS Comput Biol 2(2): e6, 2006.
