Ranking with Uncertain Scores

Mohamed Soliman and Ihab Ilyas, University of Waterloo

• [ICDE’07, SIGMOD’07 demo] New semantics for top-k queries on uncertain data
  • Studying the interaction between scores and probabilities
  • Integrating scoring and uncertainty dimensions into the same query definitions
• [VLDB’08] Finding the top-k probable nearest neighbors in uncertain databases
  • Lazy evaluation of probabilistic NN integrals
  • Optimizing number of object retrievals
• [TODS’08] Efficient support of probabilistic ranking and aggregation queries
  • Pipelined query processing
  • Unified framework for handling uncertain ranking and aggregation

Ranking (Top-k) Queries
• Given a query-specified scoring function, return the k records with maximum scores
• Each record has a single score
• Score ties are resolved deterministically
• Total order model
SELECT A.address
FROM Apartment A
ORDER BY 0.6*A.rent + 0.4*A.deposit  -- record’s score
LIMIT 10                             -- result size (k)
M.A. Soliman and I.F. Ilyas ICDE'09

Ranking (Top-k) Queries
• Real-world data are influenced by several sources of uncertainty
  • Data entry error
  • Privacy concerns
  • Presentation style
• Incomplete/missing attribute values exist
• Uncertain ratings exist
• The total order model does not apply any more

Incomplete/Missing Data
[Figure: Apartment table annotated "Attribute values expressed as ranges" and "Missing attribute value"]
SELECT A.address
FROM Apartment A
ORDER BY 0.6*A.rent + 0.4*A.deposit
LIMIT 10
Result: ?

Modeling Uncertain Scores
• Record’s score is a random variable defined on a real interval
• A straightforward method is to rank on expected scores
  • E(s(a1)) > E(s(a3)) > E(s(a2)) > E(s(a4)) > E(s(a5))
  • Then, a1 > a3 > a2 > a4 > a5

Ranking on Expected Scores
• Uniform score densities: t1 on [0, 100], t2 on [40, 60], t3 on [30, 70], each with expected score 50
• But …
  • Pr(t1 > t2 > t3) = 0.25
  • Pr(t1 > t3 > t2) = 0.2
  • Pr(t2 > t1 > t3) = 0.05
  • Pr(t2 > t3 > t1) = 0.2
  • Pr(t3 > t1 > t2) = 0.05
  • Pr(t3 > t2 > t1) = 0.25
• Non-uniform distribution on possible rankings
• Equal expectations do not imply that all rankings are equally likely
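The gap between equal expected scores and unequal ranking probabilities can be checked numerically. The following is a minimal sketch, assuming the three uniform intervals read off the slide (t1 on [0, 100], t2 on [40, 60], t3 on [30, 70]); the function name, sample count, and seed are illustrative choices and not part of the talk.

```python
import random
from collections import Counter

# Score intervals as read from the slide (assumption: uniform densities).
INTERVALS = {"t1": (0.0, 100.0), "t2": (40.0, 60.0), "t3": (30.0, 70.0)}

def ranking_distribution(intervals, samples=100_000, seed=0):
    """Estimate Pr(ranking) by drawing one score per record and sorting."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(samples):
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        ranking = tuple(sorted(scores, key=scores.get, reverse=True))
        counts[ranking] += 1
    return {r: c / samples for r, c in counts.items()}

dist = ranking_distribution(INTERVALS)
# Print rankings from most to least probable.
for ranking, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(" > ".join(ranking), round(p, 3))
```

The estimates should land close to the rounded values on the slide, even though all three records share the same expected score.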

Our Proposal: Ranking with Uncertain Scores
• Partial order-based model
  • Multiple possible rankings are encapsulated
  • Enumerating all possible rankings is infeasible
• Multiple challenges
  • Ranking Model
  • Query Semantics
  • Query Processing

Agenda
• Data model
• Query definitions
• Processing algorithms
• Summary and conclusions

Data Model
• Score of record ti
  • An interval [lo_i, up_i] enclosing possible score values
  • A PDF f_i encoding the likelihood of possible scores in [lo_i, up_i]
• Probabilistic Partial Order (PPO) model
  • Non-intersecting intervals are totally ordered
    • Score dominance: ti dominates tj iff lo_i ≥ up_j
    • Irreflexive: ¬(ti > ti); antisymmetric: (ti > tj) ⇒ ¬(tj > ti); transitive: (ti > tj), (tj > tk) ⇒ (ti > tk)
  • Intersecting intervals have a probabilistic dominance relationship
    • Pr(ti > tj) ∈ (0,1)
    • Pr(ti > tj) = 1 − Pr(tj > ti)
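The two kinds of dominance can be sketched in a few lines. This is an illustration under stated assumptions: scores are independent and uniform on their intervals, and `dominates` and `pr_greater` are hypothetical helper names, not functions from the talk. The probabilistic case is estimated by sampling rather than by a closed form.

```python
import random

def dominates(a, b):
    """Score dominance: interval a = (lo, up) dominates b iff lo_a >= up_b."""
    return a[0] >= b[1]

def pr_greater(a, b, samples=100_000, seed=0):
    """Monte-Carlo estimate of Pr(score_a > score_b), assuming independent
    uniform densities on the two intervals."""
    if dominates(a, b):
        return 1.0
    if dominates(b, a):
        return 0.0
    rng = random.Random(seed)
    hits = sum(rng.uniform(*a) > rng.uniform(*b) for _ in range(samples))
    return hits / samples

# Non-intersecting intervals: totally ordered.
print(pr_greater((6.0, 6.0), (3.0, 5.0)))
# Intersecting intervals: a probability strictly between 0 and 1.
print(pr_greater((4.0, 8.0), (7.0, 7.0)))
```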

Data Model
• Each linear extension is mapped to a nested integral
• Pr(t1, t2, …, tn) = [nested integral; formula not preserved in this transcript]
[Figure: example PPO over the score intervals [7,7], [4,8], [6,6], [3,5], [2,3.5], [1,1]]
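One way to write the nested integral, offered as a reconstruction from the data model rather than the authors' exact notation, is to integrate the joint density over the region where the scores respect the order t_1 > t_2 > … > t_n. The sketch assumes independent scores, with x_i denoting the score of t_i and Γ_ω a name introduced here for that region.

```latex
\Pr(t_1, t_2, \ldots, t_n)
  = \int_{\Gamma_\omega} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n,
\qquad
\Gamma_\omega = \{\, x : lo_i \le x_i \le up_i,\; x_1 > x_2 > \cdots > x_n \,\}

% Equivalent nested form with dependent limits; an inner integral whose
% upper limit falls below its lower limit is taken to be zero.
\Pr(t_1, t_2, \ldots, t_n)
  = \int_{lo_1}^{up_1} \int_{lo_2}^{\min(up_2,\, x_1)} \cdots
    \int_{lo_n}^{\min(up_n,\, x_{n-1})}
    f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)\; dx_n \cdots dx_2\, dx_1
```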

Query Definitions
• Record-Rank Queries
  • UTop-Rank(i,j)
• Top-k Queries
  • UTop-Prefix(k)
  • UTop-Set(k)
[Figure: examples of UTop-Prefix(3), UTop-Set(3), and UTop-Rank(1,2)]

Query Definitions
• Rank-Agg Queries
  • Find a consensus ranking minimizing the average distance to all linear extensions
  • Each linear extension is a voter
  • Spearman footrule distance
[Figure: linear extensions over t1, t2, t3 acting as voters, with probabilities 0.5, 0.2, and 0.3]

Query Definitions
• Based on possible worlds semantics
• Query answer is (with high probability) the answer obtained when drawing a random possible world
• Project possible worlds space on different dimensions
  • Top ranks (records at rank 1): t5, t2
  • Top vectors (prefixes of length 2): (t2,t5), (t5,t1), (t5,t2)
  • Top sets (sets of size 2): {t5,t2}, {t5,t1}
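The three projections can be illustrated on an explicit list of possible worlds. The sketch below is a toy, assuming a small hypothetical distribution over three linear extensions and reading UTop-Rank(i, j) as "the record most likely to appear at a rank in i…j"; real instances have far too many linear extensions to enumerate this way, and all function names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy distribution over linear extensions (possible worlds).
WORLDS = [
    (("t1", "t2", "t3"), 0.5),
    (("t1", "t3", "t2"), 0.3),
    (("t2", "t1", "t3"), 0.2),
]

def rank_range_probs(worlds, i, j):
    """Project on ranks: Pr(record appears at some rank in i..j), 1-based."""
    probs = defaultdict(float)
    for ranking, p in worlds:
        for t in ranking[i - 1:j]:
            probs[t] += p
    return dict(probs)

def prefix_probs(worlds, k):
    """Project on ordered top-k vectors (prefixes of length k)."""
    probs = defaultdict(float)
    for ranking, p in worlds:
        probs[ranking[:k]] += p
    return dict(probs)

def set_probs(worlds, k):
    """Project on unordered top-k sets."""
    probs = defaultdict(float)
    for ranking, p in worlds:
        probs[frozenset(ranking[:k])] += p
    return dict(probs)

def most_probable(probs):
    """Return the key with the largest probability."""
    return max(probs, key=probs.get)

print(most_probable(rank_range_probs(WORLDS, 1, 1)))
print(most_probable(prefix_probs(WORLDS, 2)))
print(sorted(most_probable(set_probs(WORLDS, 2))))
```

Note that a set accumulates the probability of every prefix with the same members, so the most probable set can carry more probability than the most probable prefix.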

Processing Algorithms: Score Samples
• Map linear extensions to the space of possible score combinations (draw one score value for each record)
• Score intervals: t5 [7,7], t2 [4,8], t1 [6,6], t3 [3,5], t4 [2,3.5], t6 [1,1]
• Different score samples yield different linear extensions
  • s(t1)=6, s(t2)=4.5, s(t3)=3.7, s(t4)=3, s(t5)=7, s(t6)=1 gives t5, t1, t2, t3, t4, t6
  • s(t1)=6, s(t2)=7.5, s(t3)=3, s(t4)=3.5, s(t5)=7, s(t6)=1 gives t2, t5, t1, t4, t3, t6

Processing Algorithms: Record-Rank Queries
• Monte-Carlo integration
  • Sample from the space of possible score combinations
  • The space can be sampled at random: draw a possible score value from each [lo_i, up_i]
  • Allows estimating the sum of the probabilities of the linear extensions that have record t in the rank range i…j
• Notation: Γ is the space of all possible score combinations; Γ(t,i,j) is the space of score combinations with t at rank i…j
• Pr(t at rank i…j) ≈ (fraction of samples in Γ(t,i,j)) × Vol(Γ) × Avg(∏k fk(xk)), with the average taken over the samples x in Γ(t,i,j)
• If the fi’s are uniform: Pr(t at rank i…j) ≈ fraction of samples in Γ(t,i,j)
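For uniform densities the estimate reduces to counting samples, which can be sketched as follows. The intervals are those of the running six-record example; the function name, sample count, and seed are assumptions made for illustration.

```python
import random

# Intervals from the running example; uniform densities are assumed.
INTERVALS = {
    "t1": (6.0, 6.0), "t2": (4.0, 8.0), "t3": (3.0, 5.0),
    "t4": (2.0, 3.5), "t5": (7.0, 7.0), "t6": (1.0, 1.0),
}

def rank_probability(intervals, record, i, j, samples=20_000, seed=0):
    """Estimate Pr(record at rank i..j) as the fraction of score samples
    in which the record lands in that 1-based rank range."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # One score sample: draw a value from every record's interval.
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        rank = ranking.index(record) + 1
        hits += i <= rank <= j
    return hits / samples

print(rank_probability(INTERVALS, "t5", 1, 1))
print(rank_probability(INTERVALS, "t2", 1, 2))
```

In this example t5 is at rank 1 exactly when the score of t2 falls below 7, so the first estimate should be near 0.75; t2 is in the top two exactly when its score exceeds 6, so the second should be near 0.5.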

Processing Algorithms: Record-Rank Queries
• Monte-Carlo integration allows us to avoid computing complex multidimensional integrals with dependent limits
• Accuracy depends only on the number of drawn samples
  • Size of the linear extensions space does not affect accuracy
  • Error is in O(1/√s), where s is the number of samples
• Need to evaluate a linear number of integrals (one Monte-Carlo integral per record)
  • Given a fixed number of samples, total cost is linear in n
• The l most probable answers can be computed on the fly

Processing Algorithms: Top-k Queries
• Markov Chain Monte-Carlo (MCMC) method
• Current state: t1, t2, t3, t4, t5, t6 with Pr(prefix) = 0.25
• Proposed state: swap (t3, t4) at a random position, biased by probabilistic dominance (Pr(t4 > t3) = 0.05), giving t1, t2, t4, t3, t5, t6 with Pr(prefix) = 0.06
• Accept with probability min(1, 0.06/0.25): rejected

Processing Algorithms: Top-k Queries
• Markov Chain Monte-Carlo (MCMC) method
• Current state: t1, t2, t3, t4, t5, t6 with Pr(prefix) = 0.25
• Proposed state: swap (t1, t2) at a random position, biased by probabilistic dominance (Pr(t2 > t1) = 0.9), giving t2, t1, t3, t4, t5, t6 with Pr(prefix) = 0.55
• Accept with probability min(1, 0.55/0.25): accepted
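The acceptance test in the swap examples can be sketched on its own. This is a minimal illustration that mirrors only the ratio shown on the slides; how the proposal is generated and how Pr(prefix) is evaluated are left out, and the function names are hypothetical.

```python
import random

def acceptance_probability(p_current, p_proposed):
    """Acceptance ratio min(1, p_proposed / p_current) as shown on the slides."""
    return min(1.0, p_proposed / p_current)

def step(current, proposed, p_current, p_proposed, rng):
    """Return the next chain state: the proposal if accepted, else the current state."""
    if rng.random() < acceptance_probability(p_current, p_proposed):
        return proposed
    return current

rng = random.Random(0)
current = ("t1", "t2", "t3", "t4", "t5", "t6")

# Swap (t1, t2): the proposal is more probable, so it is always accepted.
proposed = ("t2", "t1", "t3", "t4", "t5", "t6")
print(step(current, proposed, 0.25, 0.55, rng))

# Swap (t3, t4): the proposal is accepted only with probability 0.06/0.25.
print(acceptance_probability(0.25, 0.06))
```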

Processing Algorithms: Top-k Queries
[Figure: accepted and rejected states of the Markov chain; axes: prefix, probability]

Processing Algorithms: Top-k Queries
• We use multiple independent Markov chains that simulate the top-k prefix/set distribution independently
• The Gelman-Rubin diagnostic is used to test the chains’ convergence to the true distribution
• The MCMC method converges to the target distribution with arbitrary state transitions, provided that each state has non-zero probability of being visited and the chain is aperiodic
• The l most probable visited states across all chains are the approximate query answers
• Approximation error is based on upper bounds on the top-k prefix/set probability:
  • Top-k Prefix: min over i=1…k of max over j=1…n of Pr(tj at rank i)
  • Top-k Set: kth largest Pr(ti at rank 1…k)
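Both upper bounds can be computed directly from record-rank probabilities. The sketch below assumes a table `rank_probs[t][i-1] = Pr(t at rank i)`; the three-record values and the function names are illustrative, not taken from the experiments.

```python
def prefix_upper_bound(rank_probs, k):
    """min over ranks i=1..k of the max over records of Pr(record at rank i)."""
    return min(max(p[i] for p in rank_probs.values()) for i in range(k))

def set_upper_bound(rank_probs, k):
    """k-th largest value of Pr(record at rank 1..k)."""
    in_top_k = sorted((sum(p[:k]) for p in rank_probs.values()), reverse=True)
    return in_top_k[k - 1]

# Illustrative rank probabilities for three records.
RANK_PROBS = {
    "t1": [0.8, 0.2, 0.0],
    "t2": [0.2, 0.5, 0.3],
    "t3": [0.0, 0.3, 0.7],
}

print(prefix_upper_bound(RANK_PROBS, 2))
print(set_upper_bound(RANK_PROBS, 2))
```

The prefix bound holds because a prefix can be no more probable than its least likely position; the set bound holds because a set can be no more probable than its least likely member being in the top k.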

Processing Algorithms: Rank-Agg Queries
• Optimal rank aggregation under Spearman’s footrule distance
  • Polynomial time algorithm [Dwork et al., 2001] by modeling the problem as bipartite graph matching
• Main challenge: huge set of voters (linear extensions)
• We show that we can use record-rank probabilities to construct the graph model used in [Dwork et al., 2001]
• Overall polynomial cost to compute Pr(t at rank i) (using the Monte-Carlo algorithm for Record-Rank queries) and solve the resulting bipartite graph matching problem

Processing Algorithms: Rank-Agg Queries (footrule)
• Optimal rank aggregation under Spearman footrule distance
• For a set of rankings {ω1,…,ωm}, find the permutation ω* that minimizes (1/m) ∑i F(ωi, ω*)
• Polynomial time algorithm [Dwork et al., 2001]
• Example with three rankings ω1, ω2, ω3 over records a, b, c
  • Edge weight: weight(c,3) = ∑i |ωi(c) − 3|
  • Optimal ranking: a > c > b, with matching cost 2 + 1 + 3 = 6
[Figure: bipartite graph matching records a, b, c to ranks 1, 2, 3]

Processing Algorithms: Rank-Agg Queries (footrule)
• We view each linear extension (LE) as a voter
• We are looking for the ranking that minimizes disagreements among LE’s
• Main challenge: computing edge weights in the bipartite graph without expanding the huge set of voters (LE’s)
• We show that we can compute weight(t,r) as follows:
  • weight(t,r) ∝ ∑i=1…n Pr(t at rank i) · |i − r|
• Overall polynomial cost to compute Pr(t at rank i) (using the Monte-Carlo algorithm for Record-Rank queries) and solve the bipartite graph matching problem

Processing Algorithms: Rank-Agg Queries (footrule)
• For each rank i and each record t, compute Pr(t at rank i) using Monte-Carlo integration
• The edge weight w(t,r) ∝ ∑i Pr(t at rank i) · |i − r|
• Apply a polynomial graph matching algorithm
• Example
  • Linear extensions: (t1,t2,t3) with probability 0.5, (t1,t3,t2) with 0.3, (t2,t1,t3) with 0.2
  • Rank distributions: λ1 = {t1: 0.8, t2: 0.2}, λ2 = {t1: 0.2, t2: 0.5, t3: 0.3}, λ3 = {t2: 0.3, t3: 0.7}
  • Edge weights w(t,r):
          r=1   r=2   r=3
    t1    0.2   0.8   1.8
    t2    1.1   0.5   0.9
    t3    1.7   0.7   0.3
  • Min-cost perfect matching = {(t1,1), (t2,2), (t3,3)}
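The edge weights and the matching for this example can be reproduced with a short sketch. It uses the rank distributions λ1…λ3 from the slide; the matching step is a brute-force search over permutations, swapped in here for brevity in place of a polynomial matching algorithm, so it is only suitable for tiny inputs. Function names are illustrative.

```python
from itertools import permutations

# Rank distributions from the slide: RANK_PROBS[t][i-1] = Pr(t at rank i).
RANK_PROBS = {
    "t1": [0.8, 0.2, 0.0],
    "t2": [0.2, 0.5, 0.3],
    "t3": [0.0, 0.3, 0.7],
}

def edge_weight(probs, r):
    """w(t, r) = sum_i Pr(t at rank i) * |i - r|, with 1-based ranks."""
    return sum(p * abs(i - r) for i, p in enumerate(probs, start=1))

def footrule_consensus(rank_probs):
    """Brute-force min-cost perfect matching of records to ranks."""
    best_cost, best_order = float("inf"), None
    for order in permutations(rank_probs):
        # order[r-1] is the record assigned to rank r.
        cost = sum(edge_weight(rank_probs[t], r)
                   for r, t in enumerate(order, start=1))
        if cost < best_cost:
            best_cost, best_order = cost, order
    return best_order, best_cost

order, cost = footrule_consensus(RANK_PROBS)
print(order, round(cost, 2))
```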

Processing Algorithms: Rank-Agg Queries (Kendall’s tau)
• Optimal rank aggregation under Kendall’s tau distance
  • NP-Hard in general (reduction from minimum feedback arc set)
• We define a key property that allows identifying easier instances of the problem
  • Weak Stochastic Transitivity: [Pr(x > y) ≥ 0.5 and Pr(y > z) ≥ 0.5] → [Pr(x > z) ≥ 0.5]
• A PPO with non-uniform densities
  • If it is Weak Stochastic Transitive, the optimal rank aggregation can be computed in polynomial time
  • Otherwise the problem is NP-Hard
• A PPO with uniform score densities
  • The PPO is provably Weak Stochastic Transitive, and the optimal rank aggregation can be computed in O(n log n)
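Weak stochastic transitivity can be tested directly from pairwise dominance probabilities. The sketch below assumes a nested dictionary `pr[x][y] = Pr(x > y)` and checks every ordered triple; both example tables are hypothetical.

```python
from itertools import permutations

def is_weakly_stochastic_transitive(pr):
    """Check [Pr(x>y) >= 0.5 and Pr(y>z) >= 0.5] -> [Pr(x>z) >= 0.5]
    for every ordered triple of distinct records."""
    for x, y, z in permutations(pr, 3):
        if pr[x][y] >= 0.5 and pr[y][z] >= 0.5 and pr[x][z] < 0.5:
            return False
    return True

# Hypothetical pairwise probabilities with a consistent order a, b, c.
CONSISTENT = {
    "a": {"b": 0.7, "c": 0.8},
    "b": {"a": 0.3, "c": 0.6},
    "c": {"a": 0.2, "b": 0.4},
}

# Hypothetical cyclic preferences: a beats b, b beats c, c beats a.
CYCLIC = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.4, "c": 0.6},
    "c": {"a": 0.6, "b": 0.4},
}

print(is_weakly_stochastic_transitive(CONSISTENT))
print(is_weakly_stochastic_transitive(CYCLIC))
```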

Processing Algorithms: Rank-Agg Queries (Kendall’s tau)
• A PPO with uniform score densities
  • Theorem: [E(fi) ≥ E(fj)] ↔ [Pr(ti > tj) ≥ 0.5]
  • The optimal rank aggregation ω*: [Pr(ti > tj) ≥ 0.5] → [ω*(ti) < ω*(tj)]
    • Guaranteed to be a valid ranking (no cycles)
    • Can be computed by sorting on expected scores
    • Must belong to the set of LE’s
• Example: intervals t5 [7,7], t2 [4,8], t1 [6,6], t3 [3,5], t4 [2,3.5], t6 [1,1] have expected scores t5: 7, t1: 6, t2: 6, t3: 4, t4: 2.75, t6: 1
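Sorting on expected scores is simple to sketch for the example intervals. Under a uniform density the expected score is the interval midpoint; the function name is an illustrative choice.

```python
# Intervals from the running example; uniform densities are assumed.
INTERVALS = {
    "t1": (6.0, 6.0), "t2": (4.0, 8.0), "t3": (3.0, 5.0),
    "t4": (2.0, 3.5), "t5": (7.0, 7.0), "t6": (1.0, 1.0),
}

def expected_score(interval):
    """Expected score of a uniform density: the interval midpoint."""
    lo, up = interval
    return (lo + up) / 2

def consensus_by_expected_score(intervals):
    """Order records by decreasing expected score (stable for ties)."""
    return sorted(intervals, key=lambda t: -expected_score(intervals[t]))

print(consensus_by_expected_score(INTERVALS))
```

Records t1 and t2 tie at an expected score of 6, which corresponds to Pr(t2 > t1) = 0.5, so either relative order of the two satisfies the condition on ω*.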

Gelman-Rubin Diagnostic
• Run m ≥ 2 chains of length 2n from over-dispersed starting values
• Discard the first n draws in each chain
• Calculate the within-chain and between-chain variance
• Calculate the estimated variance of the distribution as a weighted sum of the within-chain and between-chain variance
• Calculate the potential scale reduction factor

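Gelman-Rubin Diagnostic
• Within-chain variance: W = (1/m) ∑j=1…m sj², where sj² = (1/(n−1)) ∑i=1…n (θij − θ̄j)²
  • sj² is the within-chain variance of chain j
  • W is the average within-chain variance
• Between-chain variance: B = (n/(m−1)) ∑j=1…m (θ̄j − θ̄)², where θ̄ = (1/m) ∑j=1…m θ̄j
  • This is the variance of the chain means multiplied by n because each chain is based on n draws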
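The diagnostic steps can be put together in a short sketch. It follows the standard Gelman-Rubin recipe and assumes each chain is summarized by one scalar per draw with burn-in already discarded; which statistic of a top-k state the talk tracks is not stated in the transcript, so the synthetic Gaussian chains below are purely illustrative.

```python
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of n scalar draws each."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    W = mean([variance(c) for c in chains])   # average within-chain variance
    B = n * variance(chain_means)             # between-chain variance
    var_hat = (1 - 1 / n) * W + B / n         # weighted sum of W and B
    return (var_hat / W) ** 0.5

rng = random.Random(0)

# Four chains drawn from the same distribution: factor close to 1.
mixed = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(4)]
print(round(gelman_rubin(mixed), 3))

# Two chains stuck around different means: factor well above 1.
stuck = [[rng.gauss(mu, 1) for _ in range(500)] for mu in (0, 5)]
print(round(gelman_rubin(stuck), 3))
```

A factor near 1 suggests the chains are sampling the same distribution; values well above 1 suggest they have not yet converged.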

Experiments: Setup
• Two real datasets
  • Apts: 33,000 apartment listings obtained by scraping the search results of apartments.com
  • Cars: 10,000 car ads scraped from carpages.ca
  • The rent attribute in Apts is used as the scoring function (65% of scraped apartment listings have uncertain rent values); similarly, the price attribute in Cars is used as the scoring function (10% of scraped car ads have uncertain price)
• Three synthetic datasets with different distributions of score intervals’ bounds
  • Syn-u-0:5: bounds are uniformly distributed
  • Syn-g-0:5: bounds are drawn from a Gaussian distribution
  • Syn-e-0:5: bounds are drawn from an exponential distribution
  • The proportion of records with uncertain scores in each dataset is 50%, and the size of each dataset is 100,000 records
• Score densities are taken as uniform

Experiments: Database Shrinkage

Experiments: UTop-Rank Accuracy

Experiments: UTop-Rank Efficiency

Experiments: UTop-k Accuracy

Experiments: UTop-k Convergence

Summary and Conclusions
• A study of the problem of ranking in databases with incomplete/uncertain information
• A probabilistic model encoding possible rankings, allowing for new definitions of ranking queries
• Sampling-based algorithms based on Monte-Carlo/MCMC methods to find query answers
  • Accuracy of the query answer mainly depends on the number of samples
• Polynomial time algorithms for some instances of the problem of rank aggregation in partial orders
