Ranking with Uncertain Scores Mohamed Soliman and Ihab Ilyas University of Waterloo
• [ICDE’07, SIGMOD’07 demo] New semantics for top-k queries on uncertain data
  • Studying the interaction between scores and probabilities
  • Integrating scoring and uncertainty dimensions into the same query definitions
• [VLDB’08] Finding the top-k probable nearest neighbors in uncertain databases
  • Lazy evaluation of probabilistic NN integrals
  • Optimizing number of object retrievals
• [TODS’08] Efficient support of probabilistic ranking and aggregation queries
  • Pipelined query processing
  • Unified framework for handling uncertain ranking and aggregation
Ranking (Top-k) Queries
• Given a query-specified scoring function, return the k records with maximum scores
• Each record has a single score
• Score ties are resolved deterministically
• Total order model
    SELECT A.address
    FROM Apartment A
    ORDER BY 0.6*A.rent+0.4*A.deposit  -- record's score
    LIMIT 10                           -- result size (k)
M.A. Soliman and I.F. Ilyas ICDE'09
Ranking (Top-k) Queries
• Real-world data are influenced by several sources of uncertainty
  • Data entry error
  • Privacy concerns
  • Presentation style
• Incomplete/missing attribute values exist
• Uncertain ratings exist
• The total order model no longer applies
Incomplete/Missing Data
• Attribute values expressed as ranges
• Missing attribute value
    SELECT A.address
    FROM Apartment A
    ORDER BY 0.6*A.rent+0.4*A.deposit
    LIMIT 10
?
Modeling Uncertain Scores
• A record’s score is a random variable defined on a real interval
• A straightforward method is to rank on expected scores
  • E(s(a1)) > E(s(a3)) > E(s(a2)) > E(s(a4)) > E(s(a5))
  • Then, a1 > a3 > a2 > a4 > a5
Ranking on Expected Scores
• Uniform score densities (figure: score intervals of t1, t2, t3)
• Equal expectations → all rankings are equally likely
• But … the distribution on possible rankings is non-uniform
  • Pr(t1 > t2 > t3) = 0.25
  • Pr(t1 > t3 > t2) = 0.2
  • Pr(t2 > t1 > t3) = 0.05
  • Pr(t2 > t3 > t1) = 0.2
  • Pr(t3 > t1 > t2) = 0.05
  • Pr(t3 > t2 > t1) = 0.25
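The non-uniform ranking distribution can be checked by simulation. The sketch below is illustrative only: the three intervals are an assumption (chosen so that every record has expected score 50 and uniform density), and `ranking_distribution` is a hypothetical helper name, not something defined in the slides.

```python
import random
from collections import Counter

# Assumed intervals (not stated explicitly in the slides): each has mean 50.
INTERVALS = {"t1": (0, 100), "t2": (40, 60), "t3": (30, 70)}

def ranking_distribution(intervals, samples=100_000, seed=0):
    """Estimate Pr(ranking) by drawing one score per record and sorting."""
    rng = random.Random(seed)
    names = list(intervals)
    counts = Counter()
    for _ in range(samples):
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        # Highest score first; ties have probability zero for these intervals.
        ranking = tuple(sorted(names, key=scores.get, reverse=True))
        counts[ranking] += 1
    return {r: c / samples for r, c in counts.items()}

dist = ranking_distribution(INTERVALS)
for ranking, p in sorted(dist.items()):
    print(" > ".join(ranking), round(p, 3))
```

Under these assumed intervals the estimates land close to the rounded values on the slide, even though ranking on expected scores cannot distinguish the three records.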
Our Proposal: Ranking with Uncertain Scores
• Partial order-based model
  • Multiple possible rankings are encapsulated
  • Enumerating all possible rankings is infeasible
• Multiple challenges
  • Ranking Model
  • Query Semantics
  • Query Processing
Agenda
• Data model
• Query definitions
• Processing algorithms
• Summary and conclusions
Data Model
• Score of record $t_i$
  • An interval $[lo_i, up_i]$ enclosing possible score values
  • A PDF $f_i$ encoding the likelihood of possible scores in $[lo_i, up_i]$
• Probabilistic Partial Order (PPO) model
  • Non-intersecting intervals are totally ordered
    • Score Dominance: $t_i$ dominates $t_j$ iff $lo_i \geq up_j$
    • irreflexive: $\neg(t_i > t_i)$; antisymmetric: $(t_i > t_j) \Rightarrow \neg(t_j > t_i)$; transitive: $(t_i > t_j) \wedge (t_j > t_k) \Rightarrow (t_i > t_k)$
  • Intersecting intervals have a Probabilistic Dominance Relationship
    • $\Pr(t_i > t_j) \in (0,1)$
    • $\Pr(t_i > t_j) = 1 - \Pr(t_j > t_i)$
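The two dominance relations can be made concrete for uniform score densities. This is a minimal sketch under assumptions: scores are independent, densities are uniform, and the function names (`uniform_cdf`, `dominates`, `pr_greater`) are hypothetical. Ties between two identical deterministic scores are not handled.

```python
def uniform_cdf(x, lo, up):
    """CDF of a uniform score on [lo, up]; a point interval acts as a step."""
    if x < lo:
        return 0.0
    if x >= up:
        return 1.0
    return (x - lo) / (up - lo)

def dominates(a, b):
    """Score dominance: interval a = (lo_a, up_a) dominates b iff lo_a >= up_b."""
    return a[0] >= b[1]

def pr_greater(a, b, steps=10_000):
    """Pr(score_a > score_b) = integral of f_a(x) * F_b(x) dx (midpoint rule)."""
    lo, up = a
    if lo == up:  # deterministic score: no integration needed
        return uniform_cdf(lo, *b)
    width = (up - lo) / steps
    total = sum(uniform_cdf(lo + (k + 0.5) * width, *b) for k in range(steps))
    return total / steps  # f_a is constant, so the integral is a plain average

# Intersecting intervals: probabilities strictly between 0 and 1, summing to 1.
p = pr_greater((4, 8), (3, 5))
q = pr_greater((3, 5), (4, 8))
print(round(p, 4), round(q, 4), dominates((7, 7), (6, 6)))
```

For non-uniform densities the same integral applies, with `uniform_cdf` replaced by the CDF of the other record and the average weighted by the density of the first.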
Data Model
• Each linear extension (t1, t2, …, tn) is mapped to a nested integral that gives Pr(t1, t2, …, tn)
(figure: example records with score intervals [7,7], [4,8], [6,6], [3,5], [2,3.5], [1,1])
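One common way to write such a nested integral is sketched here, assuming independent scores and a linear extension listed from the highest-ranked record $t_1$ to the lowest-ranked $t_n$. The integration variables $x_i$ are notation introduced for this sketch; each upper limit depends on the score drawn for the preceding record.

```latex
\Pr(t_1, t_2, \ldots, t_n) =
\int_{lo_1}^{up_1}
\int_{lo_2}^{\min(up_2,\, x_1)}
\cdots
\int_{lo_n}^{\min(up_n,\, x_{n-1})}
f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)\;
dx_n \cdots dx_2\, dx_1
```

When an upper limit falls below the corresponding lower limit, the inner integral is taken as zero, which is what makes orderings that contradict score dominance impossible.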
Query Definitions
• Record-Rank Queries
  • UTop-Rank(i,j), e.g. UTop-Rank(1,2)
• Top-k Queries
  • UTop-Prefix(k), e.g. UTop-Prefix(3)
  • UTop-Set(k), e.g. UTop-Set(3)
Query Definitions
• Rank-Agg Queries
  • Find a consensus ranking minimizing the average distance to all linear extensions
  • Each linear extension is a voter
  • Spearman footrule distance
(figure: linear extensions over t1, t2, t3 with probabilities 0.5, 0.2, 0.3)
Query Definitions
• Based on possible worlds semantics
• Query answer is (with high probability) the answer obtained when drawing a random possible world
• Project the possible worlds space on different dimensions
  • Top ranks, e.g. records at rank 1: t5, t2
  • Top vectors, e.g. prefixes of length 2: (t2,t5), (t5,t1), (t5,t2)
  • Top sets, e.g. top sets of size 2: {t5,t2}, {t5,t1}
Processing Algorithms: Score Samples
• Map linear extensions to the space of possible score combinations (draw one score value for each record)
• Score intervals: t5 [7,7], t2 [4,8], t1 [6,6], t3 [3,5], t4 [2,3.5], t6 [1,1]
• Different score samples give different linear extensions
  • s(t1)=6, s(t2)=4.5, s(t3)=3.7, s(t4)=3, s(t5)=7, s(t6)=1 gives ⟨t5, t1, t2, t3, t4, t6⟩
  • s(t1)=6, s(t2)=7.5, s(t3)=3, s(t4)=3.5, s(t5)=7, s(t6)=1 gives ⟨t2, t5, t1, t4, t3, t6⟩
Processing Algorithms: Record-Rank Queries
• Monte-Carlo Integration
  • Sample from the space of possible score combinations
    • Can be sampled at random
    • Sample from each $[lo_i, up_i]$ a possible score value
  • Allows estimating the summation of the probabilities of linear extensions having record t in the rank range i … j
• Let $\Gamma$ be the space of all possible score combinations and $\Gamma_{(t,i,j)}$ the space of score combinations with t at rank i … j; draw s samples uniformly from $\Gamma$, of which $s_t$ fall in $\Gamma_{(t,i,j)}$
  • $\Pr(t \text{ at rank } i \ldots j) \approx \frac{s_t}{s} \times \mathrm{Vol}(\Gamma) \times \mathrm{Avg}_{x \in \Gamma_{(t,i,j)}}\left(\prod_i f_i(x_i)\right)$
  • If the $f_i$’s are uniform: $\Pr(t \text{ at rank } i \ldots j) \approx \frac{s_t}{s}$
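For uniform densities the estimate reduces to counting the fraction of score samples in which the record falls in the requested rank range. The following is a sketch under that assumption; the interval assignment follows the six-record running example (with t3 and t4 inferred from the score samples shown earlier), and the function names are hypothetical.

```python
import random

# Six-record running example; uniform densities are assumed.
INTERVALS = {
    "t1": (6, 6), "t2": (4, 8), "t3": (3, 5),
    "t4": (2, 3.5), "t5": (7, 7), "t6": (1, 1),
}

def rank_probabilities(intervals, samples=50_000, seed=0):
    """Estimate Pr(t at rank r) for every record t; list index 0 is rank 1."""
    rng = random.Random(seed)
    names = list(intervals)
    hits = {t: [0] * len(names) for t in names}
    for _ in range(samples):
        # One score sample: a single value drawn from each record's interval.
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        ordered = sorted(names, key=scores.get, reverse=True)
        for rank, t in enumerate(ordered):
            hits[t][rank] += 1
    return {t: [h / samples for h in hits[t]] for t in names}

def pr_rank_range(probs, t, i, j):
    """Pr(t at rank i..j) with 1-based, inclusive ranks."""
    return sum(probs[t][i - 1:j])

probs = rank_probabilities(INTERVALS)
# UTop-Rank(1, 2): the record most likely to appear at rank 1 or 2.
answer = max(INTERVALS, key=lambda t: pr_rank_range(probs, t, 1, 2))
print(answer, round(pr_rank_range(probs, "t5", 1, 1), 3))
```

A single pass over the samples fills the whole record-by-rank table, so the same estimates can serve any rank range afterwards.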
Processing Algorithms: Record-Rank Queries
• Monte-Carlo integration allows us to avoid computing complex multidimensional integrals with dependent limits
• Accuracy depends only on the number of drawn samples
  • Size of the linear extensions space does not affect accuracy
  • Error is in $O(1/\sqrt{s})$, where s is the number of samples
• Need to evaluate a linear number of integrals (one Monte-Carlo integral per record)
  • Given a fixed number of samples, total cost is linear in n
• The l most probable answers can be computed on the fly
Processing Algorithms: Top-k Queries
• Markov Chain Monte-Carlo (MCMC) method
• Example step 1: swap (t3, t4) at a random position, biased by probabilistic dominance, Pr(t4 > t3) = 0.05
  • Current state ⟨t1, t2, t3, t4, t5, t6⟩: Pr(prefix) = 0.25
  • Proposed state ⟨t1, t2, t4, t3, t5, t6⟩: Pr(prefix) = 0.06
  • Accept with probability min(1, 0.06/0.25): rejected
• Example step 2: swap (t1, t2) at a random position, biased by probabilistic dominance, Pr(t2 > t1) = 0.9
  • Current state ⟨t1, t2, t3, t4, t5, t6⟩: Pr(prefix) = 0.25
  • Proposed state ⟨t2, t1, t3, t4, t5, t6⟩: Pr(prefix) = 0.55
  • Accept with probability min(1, 0.55/0.25): accepted
Processing Algorithms: Top-k Queries
(figure: accepted and rejected states of the chain; axes: Prefix, Probability)
Processing Algorithms: Top-k Queries
• We use multiple independent Markov chains that simulate the top-k prefix/set distribution independently
  • The Gelman-Rubin diagnostic is used to test the chains’ convergence to the true distribution
• The MCMC method converges to the target distribution with arbitrary state transitions, provided that each state has non-zero probability of being visited and the chain is aperiodic
• The l most probable visited states across all chains are the approximate query answers
• Approximation error is based on upper bounds on the top-k prefix/set probability:
  • Top-k Prefix: $\min_{i=1 \ldots k} \max_{j=1 \ldots n} \Pr(t_j \text{ at rank } i)$
  • Top-k Set: k-th largest $\Pr(t_i \text{ at rank } 1 \ldots k)$
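The accept/reject rule min(1, Pr(new)/Pr(current)) can be illustrated with a small Metropolis sampler. This sketch departs from the slides in several labeled ways: it uses a plain symmetric adjacent-swap proposal rather than a swap biased by probabilistic dominance, its states are full rankings rather than prefixes, and the target probabilities are supplied as a lookup table (the six ranking probabilities of the three-record example) instead of being evaluated from integrals. `metropolis_chain` is a hypothetical name.

```python
import random
from collections import Counter

# Target distribution over rankings (rounded values from the three-record example).
TARGET = {
    ("t1", "t2", "t3"): 0.25, ("t1", "t3", "t2"): 0.20,
    ("t2", "t1", "t3"): 0.05, ("t2", "t3", "t1"): 0.20,
    ("t3", "t1", "t2"): 0.05, ("t3", "t2", "t1"): 0.25,
}

def metropolis_chain(target, start, steps, seed):
    """Metropolis sampler over rankings with a symmetric adjacent-swap proposal."""
    rng = random.Random(seed)
    state = tuple(start)
    visits = Counter()
    for _ in range(steps):
        pos = rng.randrange(len(state) - 1)  # random position to swap
        proposal = list(state)
        proposal[pos], proposal[pos + 1] = proposal[pos + 1], proposal[pos]
        proposal = tuple(proposal)
        # Accept with probability min(1, pi(proposal) / pi(state)).
        ratio = target.get(proposal, 0.0) / target[state]
        if rng.random() < min(1.0, ratio):
            state = proposal
        visits[state] += 1  # a rejected move counts the current state again
    return {s: c / steps for s, c in visits.items()}

freq = metropolis_chain(TARGET, ("t3", "t1", "t2"), steps=200_000, seed=1)
top = sorted(freq, key=freq.get, reverse=True)[:2]  # the l = 2 most visited states
print(top)
```

Because the proposal is symmetric, no correction term is needed in the acceptance ratio; a dominance-biased proposal would require the usual Metropolis-Hastings ratio of proposal probabilities.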
Processing Algorithms: Rank-Agg Queries
• Optimal rank aggregation under Spearman’s footrule distance
  • Polynomial time algorithm [Dwork et al., 2001] by modeling the problem as bipartite graph matching
• Main Challenge: huge set of voters (linear extensions)
  • We show that we can use record-rank probabilities to construct the graph model used in [Dwork et al., 2001]
• Overall polynomial cost to compute Pr(t at rank i) (using the Monte-Carlo algorithm for Record-Rank queries) and solve the resulting bipartite graph matching problem
Processing Algorithms: Rank-Agg Queries (footrule)
• Optimal rank aggregation under Spearman footrule distance
  • For a set of rankings {ω1,..,ωm}, find the permutation ω* that minimizes 1/m ∑i F(ωi,ω*)
  • Polynomial time algorithm [Dwork et al., 2001]
• Example (figure: rankings ω1, ω2, ω3 over records a, b, c and the weighted bipartite graph between records and ranks)
  • weight(c,3) = ∑i |ωi(c) − 3|
  • Optimal ranking: a > c > b, matching cost = 2 + 1 + 3 = 6
Processing Algorithms: Rank-Agg Queries (footrule)
• We view each LE as a voter
• We are looking for the ranking that minimizes disagreements among LE’s
• Main Challenge: computing edge weights in the bipartite graph without expanding the huge set of voters (LE’s)
• We show that we can compute weight(t,r) as follows:
  • weight(t,r) ∝ ∑i=1…n Pr(t at rank i) × |i − r|
• Overall polynomial cost to compute Pr(t at rank i) (using the Monte-Carlo algorithm for Record-Rank queries) and solve the bipartite graph matching problem
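The reason the edge weights need only record-rank probabilities can be sketched in a few lines. The notation is an assumption for this sketch: $\omega(t)$ denotes the rank of record $t$ in linear extension $\omega$, and voters are weighted by $\Pr(\omega)$ instead of $1/m$.

```latex
\sum_{\omega} \Pr(\omega)\, F(\omega, \omega^{*})
  = \sum_{\omega} \Pr(\omega) \sum_{t} \lvert \omega(t) - \omega^{*}(t) \rvert
  = \sum_{t} \sum_{i=1}^{n} \lvert i - \omega^{*}(t) \rvert
      \sum_{\omega \,:\, \omega(t) = i} \Pr(\omega)
  = \sum_{t} \sum_{i=1}^{n} \Pr(t \text{ at rank } i)\, \lvert i - \omega^{*}(t) \rvert
```

Grouping the linear extensions by the rank they give to $t$ collapses the sum over voters into $\Pr(t \text{ at rank } i)$, so the total cost of a candidate $\omega^{*}$ is the sum of weight$(t, \omega^{*}(t))$ over all records, which is exactly a perfect matching cost.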
Processing Algorithms: Rank-Agg Queries (footrule)
• For each rank i and each record t, compute Pr(t at rank i) using Monte-Carlo integration
• The edge weight w(t,r) ∝ ∑i Pr(t at rank i) × |i − r|
• Apply polynomial graph matching algorithm
• Example: linear extensions ⟨t1, t2, t3⟩, ⟨t2, t1, t3⟩, ⟨t1, t3, t2⟩ with probabilities 0.5, 0.2, 0.3
  • Rank distributions: λ1 = {t1:0.8, t2:0.2}, λ2 = {t1:0.2, t2:0.5, t3:0.3}, λ3 = {t2:0.3, t3:0.7}
  • Edge weights w(t,r):
          r=1   r=2   r=3
    t1    0.2   0.8   1.8
    t2    1.1   0.5   0.9
    t3    1.7   0.7   0.3
  • Min-cost Perfect Matching = {(t1,1), (t2,2), (t3,3)}
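The example's edge weights and matching can be reproduced directly from the rank distributions λ1, λ2, λ3. In this sketch the matching is found by exhaustive search over permutations, which is only a stand-in for a polynomial matching algorithm and is practical for a handful of records; the function names are hypothetical.

```python
from itertools import permutations

# LAMBDA[i - 1][t] = Pr(t at rank i), taken from the slide's example.
LAMBDA = [
    {"t1": 0.8, "t2": 0.2},
    {"t1": 0.2, "t2": 0.5, "t3": 0.3},
    {"t2": 0.3, "t3": 0.7},
]

def edge_weight(rank_dists, t, r):
    """w(t, r) = sum over ranks i of Pr(t at rank i) * |i - r| (1-based ranks)."""
    return sum(d.get(t, 0.0) * abs(i - r)
               for i, d in enumerate(rank_dists, start=1))

def footrule_consensus(rank_dists, records):
    """Min-cost assignment of records to ranks by brute force (small n only)."""
    def cost(order):
        return sum(edge_weight(rank_dists, t, r)
                   for r, t in enumerate(order, start=1))
    best = min(permutations(records), key=cost)
    return list(best), cost(best)

ranking, total = footrule_consensus(LAMBDA, ["t1", "t2", "t3"])
print(ranking, round(total, 2))
```

For larger inputs an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` could take the place of the permutation search, using the same weight matrix.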
Processing Algorithms: Rank-Agg Queries (Kendall’s tau)
• Optimal rank aggregation under Kendall’s tau distance
  • NP-hard in general (reduction from minimum feedback arc set)
• We define a key property that identifies easier instances of the problem
  • Weak Stochastic Transitivity: [Pr(x>y) ≥ 0.5 and Pr(y>z) ≥ 0.5] → [Pr(x>z) ≥ 0.5]
• A PPO with non-uniform densities
  • If it is weak stochastic transitive, the optimal rank aggregation can be computed in polynomial time
  • Otherwise the problem is NP-hard
• A PPO with uniform score densities
  • The PPO is provably weak stochastic transitive, and the optimal rank aggregation can be computed in O(n log n)
Processing Algorithms: Rank-Agg Queries (Kendall’s tau)
• A PPO with uniform score densities
  • Theorem: [E(fi) ≥ E(fj)] ↔ [Pr(ti > tj) ≥ 0.5]
• The optimal rank aggregation ω*: [Pr(ti > tj) ≥ 0.5] → [ω*(ti) < ω*(tj)]
  • Guaranteed to be a valid ranking (no cycles)
  • Can be computed by sorting on expected scores
  • Must belong to the set of LE’s
• Example: intervals t5 [7,7], t2 [4,8], t1 [6,6], t3 [3,5], t4 [2,3.5], t6 [1,1] have expected scores t5:7, t1:6, t2:6, t3:4, t4:2.75, t6:1
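Sorting on expected scores is a one-liner once each interval is reduced to its midpoint. A minimal sketch, assuming uniform densities and the six-record running example; breaking the tie between t1 and t2 by record name is an arbitrary choice made here, and `kendall_consensus_uniform` is a hypothetical name.

```python
# Six-record running example; uniform densities are assumed.
INTERVALS = {
    "t1": (6, 6), "t2": (4, 8), "t3": (3, 5),
    "t4": (2, 3.5), "t5": (7, 7), "t6": (1, 1),
}

def kendall_consensus_uniform(intervals):
    """Order records by expected score (interval midpoint), highest first."""
    expected = {t: (lo + up) / 2 for t, (lo, up) in intervals.items()}
    # Equal expectations are ordered by name; any tie order is equally good.
    order = sorted(intervals, key=lambda t: (-expected[t], t))
    return order, expected

order, expected = kendall_consensus_uniform(INTERVALS)
print(order)
```

The sort dominates the cost, which is where the O(n log n) bound comes from.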
Gelman-Rubin Diagnostic
• Run m ≥ 2 chains of length 2n from over-dispersed starting values.
• Discard the first n draws in each chain.
• Calculate the within-chain and between-chain variance.
• Calculate the estimated variance of the distribution as a weighted sum of the within-chain and between-chain variance.
• Calculate the potential scale reduction factor.
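Gelman-Rubin Diagnostic
• Within-chain variance: $W = \frac{1}{m} \sum_{j=1}^{m} s_j^2$, where $s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\theta_{ij} - \bar{\theta}_j)^2$ and $\theta_{ij}$ is the i-th retained draw of chain j
  • $s_j^2$ is the within-chain variance of chain j
  • W is the average within-chain variance
• Between-chain variance: $B = \frac{n}{m-1} \sum_{j=1}^{m} (\bar{\theta}_j - \bar{\bar{\theta}})^2$, where $\bar{\bar{\theta}} = \frac{1}{m} \sum_{j=1}^{m} \bar{\theta}_j$
  • This is the variance of the chain means multiplied by n because each chain is based on n draws.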
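The diagnostic can be computed in a few lines once the burn-in draws are discarded. This sketch uses the standard textbook form of the estimator, with the estimated variance taken as (1 − 1/n)·W + B/n and the potential scale reduction factor as the square root of that variance over W. It assumes each chain has already been reduced to n scalar draws; which scalar is monitored for top-k prefixes is not stated in the slides, and the synthetic Gaussian chains below are purely illustrative.

```python
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of n scalar draws each."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    W = mean(variance(c) for c in chains)   # average within-chain variance
    B = n * variance(chain_means)           # variance of chain means times n
    var_hat = (1 - 1 / n) * W + B / n       # weighted sum of W and B
    return (var_hat / W) ** 0.5

rng = random.Random(0)
# Four chains sampling the same distribution: factor close to 1.
mixed = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(4)]
# Four chains stuck around different values: factor well above 1.
stuck = [[rng.gauss(3 * j, 1) for _ in range(2000)] for j in range(4)]
print(round(gelman_rubin(mixed), 3), round(gelman_rubin(stuck), 3))
```

A factor near 1 indicates that between-chain variation is no larger than within-chain variation; values well above 1 suggest the chains have not yet converged to a common distribution.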
Experiments: Setup
• Two real datasets
  • Apts: 33,000 apartment listings obtained by scraping the search results of apartments.com
  • Cars: 10,000 car ads scraped from carpages.ca
  • The rent attribute in Apts is used as the scoring function (65% of scraped apartment listings have uncertain rent values), and similarly, the price attribute in Cars is used as the scoring function (10% of scraped car ads have uncertain price)
• Three synthetic datasets with different distributions of score intervals’ bounds
  • Syn-u-0:5: bounds are uniformly distributed
  • Syn-g-0:5: bounds are drawn from a Gaussian distribution
  • Syn-e-0:5: bounds are drawn from an exponential distribution
  • Proportion of records with uncertain scores in each dataset is 50%, and the size of each dataset is 100,000 records
• Score densities are taken as uniform
Experiments: Database Shrinkage
Experiments: UTop-Rank Accuracy
Experiments: UTop-Rank Efficiency
Experiments: UTop-k Accuracy
Experiments: UTop-k Convergence
Summary and Conclusions
• A study of the problem of ranking in databases with incomplete/uncertain information
• A probabilistic model encoding possible rankings, allowing for new definitions of ranking queries
• Sampling-based algorithms using Monte-Carlo/MCMC methods to find query answers
  • Accuracy of the query answer mainly depends on the number of samples
• Polynomial time algorithms for some instances of the problem of rank aggregation in partial orders