Probabilistic Data Management

Chapter 8: Probabilistic Query Answering (6)

Objectives
• In this chapter, you will:
  • Explore the definitions of more probabilistic query types
  • Probabilistic top-k query

Recall: Probabilistic Query Types
Query types over an uncertain/probabilistic database:
• Probabilistic spatial queries
  • Probabilistic range query
  • Probabilistic k-nearest neighbor query
  • Probabilistic group nearest neighbor (PGNN) query
  • Probabilistic reverse k-nearest neighbor query
  • Probabilistic spatial join / similarity join
• Probabilistic preference queries
  • Probabilistic top-k query (or ranked query)
  • Probabilistic skyline query
  • Probabilistic reverse skyline query

Motivation Example
• In a coal mine surveillance application, a number of sensors are deployed to detect the density of gas, the temperature, and so on
• Assume we have a preference function f(O) = O.temp + O.den
• Top-k query: retrieve the k sensors with the highest scores (the most dangerous ones)

Motivation Example (cont'd)
• Sensor data usually contain noise
• The reported data can be modeled as uncertain objects
• Goal: obtain top-k query answers over uncertain data with high confidence
[Figure: actual data]

Background of Probabilistic Top-k Query
• Under possible worlds semantics:
  • Each tuple t is associated with a score t.score
  • Each tuple t is associated with an existence probability t.prob
[Figure: possible worlds and the query answer in each possible world]
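
The possible worlds of a small database can be enumerated by brute force. The sketch below is illustrative only: it assumes tuples exist independently of one another (the slide does not state this), and the function name `possible_worlds` and the `(name, score, prob)` layout are hypothetical. The enumeration is exponential in the number of tuples, so it is only useful for toy inputs.

```python
from itertools import product

def possible_worlds(tuples):
    """Enumerate possible worlds of a tuple-independent database.

    tuples: list of (name, score, prob) triples (hypothetical layout).
    Returns a list of (ranked_names, world_probability), where ranked_names
    lists the tuples present in that world in descending score order.
    """
    worlds = []
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        present = []
        for exists, (name, score, p) in zip(mask, tuples):
            # Independence assumption: multiply p or (1 - p) per tuple.
            prob *= p if exists else 1.0 - p
            if exists:
                present.append((name, score))
        if prob > 0:
            present.sort(key=lambda t: -t[1])
            worlds.append(([name for name, _ in present], prob))
    return worlds
```

The top-k answer of a single world is then simply the first k names of its ranked list.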

Different Semantics of Probabilistic Top-k Query
• Top-k query in probabilistic databases
  • Consider each possible world, from which top-k answers are retrieved
  • Aggregate the top-k answers (weighted by the probabilities of the possible worlds)
• Aggregation semantics
  • Uncertain Top-k (U-Topk) [Soliman et al., ICDE 2007]
  • Uncertain Rank-k (U-kRanks) [Soliman et al., ICDE 2007]
  • Probabilistic Threshold Top-k (PT(h)) [Hua et al., SIGMOD 2008]
  • Expected Ranks (Exp-Rank) [Cormode et al., ICDE 2009]
  • Expected Score (E-Score) [Cormode et al., ICDE 2009]

Uncertain Top-k (U-Topk) [Soliman et al., ICDE 2007]
• Group possible worlds by their top-k answer vectors
• Find the one top-k answer vector that appears in possible worlds with the highest probability
[Figure: probabilistic database, possible worlds, top-k answer vectors, U-Topk answers]

Example of U-Topk
• Given the uncertain database and k = 2:
  • Pr[{t1, t2}] = 0.2
  • Pr[{t1, t3}] = 0.2
  • Pr[{t2, t3}] = 0.3
  • Pr[{t3, t4}] = 0.3
• Final result: {t2, t3} or {t3, t4} (the two vectors tie at probability 0.3)
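
The U-Topk aggregation can be sketched in a few lines. The database table is not reproduced in this transcript, so the four worlds in `WORLDS` are a hypothetical reconstruction chosen to be consistent with the probabilities shown in the examples; the function name `u_topk` is also an assumption.

```python
from collections import defaultdict

# Hypothetical possible worlds (tuples in descending score order, probability),
# reconstructed to match the numbers on the slides; not taken from the source.
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]

def u_topk(worlds, k):
    """Return (probability of each top-k vector, list of most probable vectors)."""
    vector_prob = defaultdict(float)
    for ranked, prob in worlds:
        # Worlds sharing the same top-k prefix are grouped together.
        vector_prob[tuple(ranked[:k])] += prob
    best = max(vector_prob.values())
    winners = [v for v, p in vector_prob.items() if abs(p - best) < 1e-12]
    return dict(vector_prob), winners

vector_prob, winners = u_topk(WORLDS, k=2)
```

With these assumed worlds, `winners` contains both `("t2", "t3")` and `("t3", "t4")`, matching the tie in the example.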

Uncertain Rank-k (U-kRanks) [Soliman et al., ICDE 2007]
• For each j ∈ [1, k], group possible worlds by the tuple with the j-th rank
• For each j ∈ [1, k], find the one tuple that has the j-th rank in possible worlds with the highest probability
[Figure: probabilistic database, possible worlds, tuple with the j-th rank, U-kRanks answers]

Example of U-kRanks
• Given the uncertain database and k = 2:
  • At rank i = 1: Pr[t1] = 0.4, Pr[t2] = 0.3, Pr[t3] = 0.3
  • At rank i = 2: Pr[t2] = 0.2, Pr[t3] = 0.5, Pr[t4] = 0.3
• Final result: {t1, t3}
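
A minimal sketch of U-kRanks over explicit possible worlds follows. The worlds in `WORLDS` are a hypothetical reconstruction consistent with the rank probabilities above (the original database table is not in this transcript), and `u_kranks` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical possible worlds (tuples in descending score order, probability).
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]

def u_kranks(worlds, k):
    """For each rank j = 1..k, find the tuple most likely to hold that rank."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for ranked, prob in worlds:
        for j, name in enumerate(ranked[:k]):
            rank_prob[j][name] += prob  # index j corresponds to rank j + 1
    answer = [max(dist, key=dist.get) for dist in rank_prob]
    return rank_prob, answer

rank_prob, answer = u_kranks(WORLDS, k=2)
```

Here `answer` is `["t1", "t3"]`. Because each rank is decided separately, the same tuple can in general win more than one rank under this semantics.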

Probabilistic Threshold Top-k (PT(h)) [Hua et al., SIGMOD 2008]
• Group possible worlds by the tuples in their top-h answer sets
• Find the k tuples that are in the top-h answer sets of possible worlds with the highest probabilities
[Figure: probabilistic database, possible worlds, top-h answer sets, PT(h) answers]

Example of PT-k
• Given the uncertain database, k = 2, and threshold = 0.5
• Probability of each tuple being in a top-2 answer:
  • Pr[t1] = 0.4
  • Pr[t2] = 0.5
  • Pr[t3] = 0.8
  • Pr[t4] = 0.3
• Tuples that meet the threshold of 0.5: Pr[t2] = 0.5, Pr[t3] = 0.8
• Final result: {t2, t3}
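
The threshold test can be sketched as follows, again over a hypothetical set of possible worlds reconstructed to match the probabilities above. The names `topk_probabilities` and `pt_k` are assumptions, and the sketch treats "meets the threshold" as "greater than or equal to", which is what the example's result implies.

```python
from collections import defaultdict

# Hypothetical possible worlds (tuples in descending score order, probability).
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]

def topk_probabilities(worlds, k):
    """Probability that each tuple appears among the top k of a possible world."""
    prob = defaultdict(float)
    for ranked, p in worlds:
        for name in ranked[:k]:
            prob[name] += p
    return dict(prob)

def pt_k(worlds, k, threshold):
    """Tuples whose top-k probability is at least the threshold."""
    probs = topk_probabilities(worlds, k)
    # A small tolerance guards against floating-point rounding at the boundary.
    return sorted(n for n, p in probs.items() if p >= threshold - 1e-12)

probs = topk_probabilities(WORLDS, k=2)
result = pt_k(WORLDS, k=2, threshold=0.5)
```

With the assumed worlds, `result` is `["t2", "t3"]`.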

Expected Ranks (Exp-Rank) [Cormode et al., ICDE 2009]
• Expected rank of t1: $\sum_{pw} r_{pw}(t_1) \cdot \Pr(pw)$
• Find the k tuples with the best expected ranks (that is, the numerically smallest values, since rank 1 is the top)
[Figure: probabilistic database with tuples t1, t2, ... and their alternatives; possible worlds]

Expected Score (E-Score) [Cormode et al., ICDE 2009]
• Expected score of t1: $\sum_{pw} \mathrm{score}(t_1) \cdot \Pr(pw)$
• Find the k tuples with the highest expected scores
[Figure: probabilistic database with tuples t1, t2, ... and their alternatives; possible worlds]

Example of Expected Ranks
• Given the uncertain database and k = 2
• If a tuple does not appear in a world, its rank is considered to be the last one
  • E[R(t1)] = 1×0.2 + 1×0.2 + 3×0.3 + 3×0.3 = 2.2
  • E[R(t2)] = 2.4
  • E[R(t3)] = 1.9
  • E[R(t4)] = 2.9
• Final result: {t3, t1} (the two smallest expected ranks)
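
The expected-rank arithmetic can be checked with a short sketch. Two things here are assumptions rather than content from the slides: the worlds in `WORLDS` are a hypothetical reconstruction of the missing database table, and "the last one" is interpreted as one past the number of tuples in that world, which is the reading that reproduces all four values above.

```python
# Hypothetical possible worlds (tuples in descending score order, probability).
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]

def expected_ranks(worlds, names):
    """Expected 1-based rank of each tuple over all possible worlds."""
    exp = {n: 0.0 for n in names}
    for ranked, p in worlds:
        last = len(ranked) + 1  # assumed rank of a tuple absent from this world
        for n in names:
            rank = ranked.index(n) + 1 if n in ranked else last
            exp[n] += rank * p
    return exp

exp = expected_ranks(WORLDS, ["t1", "t2", "t3", "t4"])
top2 = sorted(exp, key=exp.get)[:2]  # smaller expected rank is better
```

Under these assumptions `top2` is `["t3", "t1"]`.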

Unified Ranking Functions
• Parameterized Ranking Function (PRF)
• A probabilistic top-k query returns the k tuples with the highest $|g_w|$ values, where w is a weight function
• Reference: Li, J., Deshpande, A. A Unified Approach to Ranking in Probabilistic Databases. In VLDB, 2009.

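Unified Ranking Functions (cont'd)
• When w(t, i) = 1, the result is the set of k tuples with the highest existence probability
• When w(t, i) = score(t): E-Score
• When w(t, i) = 1 for i ≤ h and w(t, i) = 0 otherwise: PT(h)
• When w(t, i) = 1 for i = j and w(t, i) = 0 otherwise: U-Rank (for rank j)
• PRF cannot simulate U-Topk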
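
To make the special cases concrete, the sketch below evaluates a PRF-style value, taken here to be the sum over ranks i of w(t, i) · Pr(r(t) = i). That formula is an assumption based on the cited paper; the transcript does not reproduce it. The worlds, the scores in `SCORES`, and the indicator weights used for PT(h) and U-Rank are likewise hypothetical choices derived from the definitions of those semantics.

```python
from collections import defaultdict

# Hypothetical possible worlds (tuples in descending score order, probability).
WORLDS = [
    (["t1", "t2", "t3"], 0.2),
    (["t1", "t3", "t4"], 0.2),
    (["t2", "t3"], 0.3),
    (["t3", "t4"], 0.3),
]
SCORES = {"t1": 100, "t2": 92, "t3": 80, "t4": 70}  # hypothetical scores

def rank_distribution(worlds):
    """dist[t][i] = probability that tuple t holds (1-based) rank i."""
    dist = defaultdict(lambda: defaultdict(float))
    for ranked, p in worlds:
        for i, name in enumerate(ranked, start=1):
            dist[name][i] += p
    return dist

def prf(dist, w):
    """Assumed PRF value: sum over ranks i of w(t, i) * Pr(r(t) = i)."""
    return {t: sum(w(t, i) * p for i, p in ranks.items())
            for t, ranks in dist.items()}

DIST = rank_distribution(WORLDS)
existence = prf(DIST, lambda t, i: 1.0)                    # w = 1
e_score = prf(DIST, lambda t, i: SCORES[t])                # w = score(t)
pt_2 = prf(DIST, lambda t, i: 1.0 if i <= 2 else 0.0)      # PT(h), h = 2
rank_2 = prf(DIST, lambda t, i: 1.0 if i == 2 else 0.0)    # U-Rank, j = 2
```

Under these assumptions `pt_2` reproduces the top-2 probabilities of the PT-k example, and `rank_2` reproduces the rank-2 probabilities of the U-kRanks example.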

Unified Ranking Functions (cont'd)
• Two new semantics: PRFw(h) and PRFe(h)
• PRFw(h): $w(t, i) = w_i$ for $i \le h$, and $w(t, i) = 0$ for $i > h$
• PRFe(h): $w(t, i) = a^i$, where a can be a real or complex number

Ranking Algorithms
• Assuming tuple independence
• Compute the probability that a tuple $t_i$ has the j-th rank
• Observation: the coefficient $c_j$ of $x^j$ in a generating function $F_i(x)$ is exactly the probability that $t_i$ is at rank j

Example
• Consider the rank of tuple t3
• Incremental computation of $F_i(x)$: 0.4x
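
The generating-function idea can be sketched for the tuple-independent case. The transcript does not reproduce the formula for F_i(x), so the form used here is an assumption: one factor (1 - p + p·x) for every higher-scored tuple, multiplied by p_i·x for the tuple itself. The function name and the probabilities in the usage line are hypothetical.

```python
def rank_polynomial(probs, i):
    """Coefficients of the assumed F_i(x) for tuple i (0-based index).

    probs: existence probabilities of independent tuples, sorted by
           descending score.
    Returns coeffs, where coeffs[j] is the coefficient of x^j, read as the
    probability that tuple i is at rank j.
    """
    coeffs = [1.0]
    # Incrementally multiply in one factor per higher-scored tuple.
    for p in probs[:i]:
        nxt = [0.0] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):
            nxt[d] += c * (1.0 - p)   # that tuple is absent
            nxt[d + 1] += c * p       # that tuple is present and ranks above
        coeffs = nxt
    # Multiply by p_i * x: tuple i exists and occupies one rank itself.
    return [0.0] + [c * probs[i] for c in coeffs]

coeffs = rank_polynomial([0.5, 0.6, 0.4], 2)
```

For these hypothetical probabilities the third tuple gets coefficients of approximately 0.08, 0.2 and 0.12 for x, x² and x³, which sum to its existence probability of 0.4. Building the product one factor at a time is what makes the computation incremental.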

Ranking Algorithms (cont'd)
• Assuming a correlated database represented by an and/xor tree
• Generating functions on the and/xor tree
• Observation: the coefficient $c_j$ of the term $x^{j-1} y$ is $\Pr(r(t_i) = j)$

Summary
• Probabilistic top-k query
• Different semantics w.r.t. ranks and probabilities in possible worlds
• A unified approach
