On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases • Presented by Xi Zhang • February 8th, 2008
Outline • Background • Motivation Examples • Top-k Queries in Probabilistic Databases • Conclusion
Outline • Background • Probabilistic database model • Top-k queries & scoring functions • Motivation Examples • Top-k Queries in Probabilistic Databases • Conclusion
Probabilistic Databases • Motivation • Uncertainty/vagueness/imprecision in data • History • Incomplete information in relational DB [Imielinski & Lipski 1984] • Probabilistic DB model [Cavallo & Pittarelli 1987] • Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.] • Comeback • Proliferation of uncertain data in real-world applications • Examples: WWW, biological data, sensor networks, etc.
Probabilistic Database Model [Fuhr & Rölleke 1997] • Probabilistic Database Model • A generalization of relational DB • Probabilistic Relational Algebra (PRA) • A generalization of standard relational algebra
A Table in a Probabilistic Database • DocTerm [table not recoverable] • Each tuple carries an event expression • Basic events are independent
Probabilistic Relational Algebra • Just like in Relational Algebra… • Selection • Projection • Join • Union • Difference
Selection • DocTerm [table not recoverable] • In a derived table, each event expression is a propositional expression of basic events
Projection DocTerm:
Join DocAu: DocTerm:
Join + Projection • DocAu and DocTerm [tables not recoverable] • Result rows for IR and DB • Prob: 0.81 * 0.21 = 0.1701; 0.56 * 0.91 = 0.5096; 0.4368 • Intensional Semantics vs. Extensional Semantics
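The way selection, join, and projection carry event expressions along can be sketched in a few lines. This is only an illustration under stated assumptions: the table contents, the basic-event names (`dt1`, `da1`, …), and the helper functions are all hypothetical, since the slide's tables did not survive extraction. Event expressions are kept in disjunctive normal form, join combines them with AND, and projection merges duplicates with OR.

```python
# Event expressions in disjunctive normal form (DNF):
# a frozenset of conjunctions, each conjunction a frozenset of basic-event names.
def basic(name):
    return frozenset([frozenset([name])])

def e_and(x, y):
    # AND distributes over the disjuncts of both operands.
    return frozenset(a | b for a in x for b in y)

def e_or(x, y):
    return x | y

def select(rel, pred):
    # Selection keeps qualifying tuples and leaves their events untouched.
    return {t: e for t, e in rel.items() if pred(t)}

def join(r, s, r_col, s_col):
    # A joined tuple exists only if both input tuples exist: AND the events.
    return {t + u: e_and(e, f)
            for t, e in r.items() for u, f in s.items()
            if t[r_col] == u[s_col]}

def project(rel, cols):
    # Tuples that collapse to the same value are merged: OR the events.
    out = {}
    for t, e in rel.items():
        key = tuple(t[c] for c in cols)
        out[key] = e_or(out[key], e) if key in out else e
    return out

# Hypothetical base tables: (DocNo, Term) and (DocNo, AName).
doc_term = {(1, "IR"): basic("dt1"), (2, "DB"): basic("dt2"),
            (3, "IR"): basic("dt3"), (3, "DB"): basic("dt4")}
doc_au = {(1, "Bauer"): basic("da1"), (2, "Bauer"): basic("da2"),
          (3, "Schmidt"): basic("da3"), (3, "Bauer"): basic("da4")}

# Authors of documents about IR.
ir_docs = select(doc_term, lambda t: t[1] == "IR")
authors = project(join(ir_docs, doc_au, 0, 0), [3])
for name, expr in sorted(authors.items()):
    print(name, sorted(sorted(c) for c in expr))
```

In this toy data Bauer ends up with the expression (dt1 AND da1) OR (dt3 AND da4), which is the kind of complex event whose probability the two semantics then evaluate differently.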
Intensional vs. Extensional • Intensional Semantics • Assumes data independence of base tables • Keeps track of data dependence during the evaluation • Extensional Semantics • Assumes data independence during the evaluation • Can compute WRONG probabilities!
When Does Intensional = Extensional? • When no basic event occurs more than once in the event expression • In the Join + Projection example, the row with probability 0.4368 has an identical basic event in its expression
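A small numeric check makes the difference concrete. The sketch below uses hypothetical basic events `a`, `b`, `c` with made-up probabilities (none of this is from the slides). The intensional value is computed by enumerating all truth assignments; the extensional value combines probabilities operator by operator as if the operands were independent.

```python
from itertools import product

# Hypothetical independent basic events and their probabilities.
P = {"a": 0.5, "b": 0.6, "c": 0.7}

def intensional(expr, probs):
    """Exact probability: total weight of the worlds in which expr holds."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for n in names:
            weight *= probs[n] if world[n] else 1.0 - probs[n]
        if expr(world):
            total += weight
    return total

def ext_and(x, y):
    return x * y              # treats the operands as independent

def ext_or(x, y):
    return x + y - x * y      # treats the operands as independent

# (a AND b) OR (a AND c): basic event a occurs twice.
shared_int = intensional(lambda w: (w["a"] and w["b"]) or (w["a"] and w["c"]), P)
shared_ext = ext_or(ext_and(P["a"], P["b"]), ext_and(P["a"], P["c"]))

# (a AND b) OR c: every basic event occurs once.
disjoint_int = intensional(lambda w: (w["a"] and w["b"]) or w["c"], P)
disjoint_ext = ext_or(ext_and(P["a"], P["b"]), P["c"])

print(round(shared_int, 4), round(shared_ext, 4))      # 0.44 0.545
print(round(disjoint_int, 4), round(disjoint_ext, 4))  # 0.79 0.79
```

With the repeated event the extensional shortcut overestimates (0.545 instead of 0.44); without repetition both agree.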
Fuhr & Rölleke 1997 • Summary • Probabilistic DB Model • Concept of event • Basic vs. complex event • Event expression • Probabilistic Relational Algebra • Just like in Relational Algebra… • Computation of event probabilities • Intensional vs. extensional semantics • Yield the same result when there is NO data dependence in event expressions
Outline • Background • Probabilistic database model • Top-k queries & scoring functions • Motivation Examples • Top-k Queries in Probabilistic Databases • Semantics • Query Evaluation • Conclusion
Top-k Queries • Traditionally, given • Objects: o1, o2, …, on • A non-negative integer: k • A scoring function s that assigns a score to each object • Question: What are the k objects with the highest scores? • Have been studied in Web, XML, and relational databases, and more recently in probabilistic databases.
Scoring Function • A scoring function s over a deterministic relation R is a function $s: R \to \mathbb{R}$ • For any ti and tj from R, ti is ranked above tj if and only if $s(t_i) > s(t_j)$
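For a deterministic relation, a top-k query is just a selection of the k largest scores. A minimal sketch, using the three scores from the smart environment example that appears later in the deck; the function name `top_k` is an illustrative choice, not something from the slides.

```python
import heapq

def top_k(objects, score, k):
    """Return the k objects with the highest score, best first."""
    return heapq.nlargest(k, objects, key=score)

# Scores taken from the smart environment example.
scores = {"Aiden": 0.65, "Bob": 0.55, "Chris": 0.45}
print(top_k(scores, scores.get, 2))  # ['Aiden', 'Bob']
```

The probabilistic setting is harder precisely because a high-scoring tuple may not exist in every possible world.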
Outline • Background • Motivation Examples • Smart Environment Example • Sensor Network Example • Top-k Queries in Probabilistic Databases • Conclusion
Motivating Example I • Smart Environment • Sample Question • “Who were the two visitors in the lab last Saturday night?” • Data • Biometric data from sensors • A scoring function measures how well those data match the profile of each candidate • Historical statistics • e.g. the probability of a certain candidate being in the lab on Saturday nights
Motivating Example I (cont.) • Table columns: Personnel; Biometrics score(…); Probability of being in lab on Saturday nights • Probabilities of the three candidates: 0.3, 0.9, 0.4 • Question: Find two people in the lab last Saturday night • This is a Top-2 query over the above probabilistic database under the above scoring function
Motivating Example II • Sensor Network in a Habitat • Sample Question • “What is the temperature of the warmest spot?” • Data • Sensor readings from different sensors • At a sampling time, only one “real” reading from a sensor • Each sensor reading comes with a confidence value
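Motivating Example II (cont.) • Table of sensor readings [temperature values not recoverable] • C1 (from Sensor 1): two readings with Prob 0.6 and 0.4 • C2 (from Sensor 2): two readings with Prob 0.1 and 0.6 • Question: What is the temperature of the warmest spot? • This is a Top-1 query over the above probabilistic database under a scoring function proportional to temperature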
Outline • Background • Motivation Examples • Top-k Queries in Probabilistic Databases • Semantics • Query Evaluation • Conclusion
Models • A probabilistic relation Rp=<R, p, C > • R: the support deterministic relation • p: probability function • C : a partition of R, such that $\sum_{t \in C_i} p(t) \le 1$ for every part $C_i \in C$ • Simple vs. General probabilistic relation • Simple • Assume tuple independence, i.e. |C |=|R| • E.g. smart environment example • General • Tuples can be independent or exclusive, i.e. |C |<|R| • E.g. sensor network example
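The triple <R, p, C> maps naturally onto a small data structure. The sketch below is one possible encoding, not the presenter's: the class name, field names, and tuple ids (`r1` … `r4`) are hypothetical, and the validity check assumes the partition condition is that the probabilities inside each part sum to at most 1 (tuples in one part are mutually exclusive).

```python
from dataclasses import dataclass

@dataclass
class ProbRelation:
    prob: dict    # tuple id -> p(t); the keys play the role of R
    parts: list   # the partition C, as a list of sets of tuple ids

    def validate(self):
        covered = set()
        for part in self.parts:
            if covered & part:
                raise ValueError("parts overlap")
            covered |= part
            # Assumed condition: exclusive tuples cannot exceed total mass 1.
            if sum(self.prob[t] for t in part) > 1.0 + 1e-9:
                raise ValueError("probabilities in a part exceed 1")
        if covered != set(self.prob):
            raise ValueError("parts do not cover R")
        return self

    def is_simple(self):
        # Simple relation: every tuple is its own part, i.e. |C| = |R|.
        return len(self.parts) == len(self.prob)

# Smart environment: independent tuples.
smart = ProbRelation({"Aiden": 0.3, "Bob": 0.9, "Chris": 0.4},
                     [{"Aiden"}, {"Bob"}, {"Chris"}]).validate()

# Sensor network: readings of the same sensor are exclusive.
sensor = ProbRelation({"r1": 0.6, "r2": 0.4, "r3": 0.1, "r4": 0.6},
                      [{"r1", "r2"}, {"r3", "r4"}]).validate()

print(smart.is_simple(), sensor.is_simple())  # True False
```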
Challenges Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • How to compute the top-k answer of Rp ? (Query Evaluation)
What is a “Good” Semantics? • Desired Properties • Exact-k • Faithfulness • Stability
Properties • Exact-k • If R has at least k tuples, then exactly k tuples are returned as the top-k answer • Faithfulness • A “better” tuple, i.e. higher in score and probability, is more likely to be in the top-k answer, compared to a “worse” one • Stability • Raising the score/prob. of a winning tuple will not cause it to lose • Lowering the score/prob. of a losing tuple will not cause it to win
Global-Topk Semantics Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • Global-Topk • Return the k highest-ranked tuples according to their probability of being in the top-k answers of possible worlds • Global-Topk satisfies all three of the aforementioned properties
Smart Environment Example • Query: Find two people in the lab last Saturday night
• Columns: Personnel; Biometrics Score(Face Detection, Voice Detection, …); Prob.
• Aiden: Score(0.70, 0.60, …) = 0.65; Prob. 0.3
• Bob: Score(0.50, 0.60, …) = 0.55; Prob. 0.9
• Chris: Score(0.50, 0.40, …) = 0.45; Prob. 0.4
• Possible worlds and their probabilities: ∅ 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
• Global-Topk Semantics: Pr(Bob in top-2) = 0.9; Pr(Aiden in top-2) = 0.3; Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
• Top-2 Answer: {Bob, Aiden}
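The numbers on this slide can be reproduced by brute force. The sketch below enumerates all possible worlds of the three independent tuples and accumulates, for each person, the probability of appearing among the two highest-scored tuples present. It is a naive illustration of the semantics (exponential in the number of tuples), not the evaluation algorithm presented later; the function name is an illustrative choice.

```python
from itertools import product

def global_topk_bruteforce(score, prob, k):
    """Enumerate possible worlds of independent tuples; exponential, for illustration."""
    names = sorted(score, key=score.get, reverse=True)   # best score first
    in_topk = dict.fromkeys(names, 0.0)
    for present in product([False, True], repeat=len(names)):
        weight = 1.0
        for n, here in zip(names, present):
            weight *= prob[n] if here else 1.0 - prob[n]
        world = [n for n, here in zip(names, present) if here]
        for n in world[:k]:          # top-k of this world
            in_topk[n] += weight
    # k tuples with the highest Global-Topk probability.
    answer = sorted(names, key=in_topk.get, reverse=True)[:k]
    return answer, in_topk

score = {"Aiden": 0.65, "Bob": 0.55, "Chris": 0.45}
prob = {"Aiden": 0.3, "Bob": 0.9, "Chris": 0.4}
answer, in_topk = global_topk_bruteforce(score, prob, 2)
print(answer)
print({n: round(p, 3) for n, p in in_topk.items()})
```

Running it gives 0.3 for Aiden, 0.9 for Bob, and 0.292 for Chris, so the top-2 answer is Bob and Aiden.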
Other Semantics • Soliman, Ilyas & Chang 2007 • Two Alternative Semantics • U-Topk • U-kRanks
U-Topk Semantics Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • U-Topk • Return the most probable top-k answer set that belongs to possible worlds • U-Topk does not satisfy all three properties
Smart Environment Example • Query: Find two people in the lab last Saturday night
• Columns: Personnel; Biometrics Score(Face Detection, Voice Detection, …); Prob.
• Aiden: Score(0.70, 0.60, …) = 0.65; Prob. 0.3
• Bob: Score(0.50, 0.60, …) = 0.55; Prob. 0.9
• Chris: Score(0.50, 0.40, …) = 0.45; Prob. 0.4
• Possible worlds and their probabilities: ∅ 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
• U-Topk Semantics: Pr({Bob}) = 0.378; …; Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
• Top-2 Answer: {Bob}
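The same enumeration reproduces the U-Topk answer. In this sketch each possible world contributes its whole top-2 list as one candidate answer, and the candidate with the largest total probability wins. As before, this is a brute-force illustration of the semantics with an illustrative function name, not an efficient algorithm.

```python
from collections import defaultdict
from itertools import product

def u_topk_bruteforce(score, prob, k):
    """Most probable top-k answer over all possible worlds (independent tuples)."""
    names = sorted(score, key=score.get, reverse=True)   # best score first
    answer_prob = defaultdict(float)
    for present in product([False, True], repeat=len(names)):
        weight = 1.0
        for n, here in zip(names, present):
            weight *= prob[n] if here else 1.0 - prob[n]
        world = [n for n, here in zip(names, present) if here]
        answer_prob[tuple(world[:k])] += weight   # this world's top-k answer
    best = max(answer_prob, key=answer_prob.get)
    return best, dict(answer_prob)

score = {"Aiden": 0.65, "Bob": 0.55, "Chris": 0.45}
prob = {"Aiden": 0.3, "Bob": 0.9, "Chris": 0.4}
best, answer_prob = u_topk_bruteforce(score, prob, 2)
print(best, round(answer_prob[best], 3))
print(round(answer_prob[("Aiden", "Bob")], 3))
```

The winner is the single-tuple answer (Bob) with probability 0.378, ahead of (Aiden, Bob) at 0.27, which shows why U-Topk can return fewer than k tuples.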
U-kRanks Semantics Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • U-kRanks • For i=1,2,…,k, return the most probable ith-ranked tuple across all possible worlds • U-kRanks does not satisfy all three properties
Smart Environment Example • Query: Find two people in the lab last Saturday night
• Columns: Personnel; Biometrics Score(Face Detection, Voice Detection, …); Prob.
• Aiden: Score(0.70, 0.60, …) = 0.65; Prob. 0.3
• Bob: Score(0.50, 0.60, …) = 0.55; Prob. 0.9
• Chris: Score(0.50, 0.40, …) = 0.45; Prob. 0.4
• Possible worlds and their probabilities: ∅ 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
• U-kRanks Semantics: for Aiden, Bob, and Chris, compare the probability of being at rank 1 and at rank 2 • Bob is the most probable tuple at rank 1 and also at rank 2 • e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.264
• Top-2 Answer: {Bob}
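The per-rank probabilities can also be obtained by enumerating the possible worlds. The sketch below records, for each rank i up to k, how much probability mass puts each person at that rank, then picks the most probable person per rank. It is a brute-force illustration with an illustrative function name.

```python
from itertools import product

def u_kranks_bruteforce(score, prob, k):
    """For each rank 1..k, the tuple most likely to hold that rank."""
    names = sorted(score, key=score.get, reverse=True)   # best score first
    rank_prob = [dict.fromkeys(names, 0.0) for _ in range(k)]
    for present in product([False, True], repeat=len(names)):
        weight = 1.0
        for n, here in zip(names, present):
            weight *= prob[n] if here else 1.0 - prob[n]
        world = [n for n, here in zip(names, present) if here]
        for i, n in enumerate(world[:k]):
            rank_prob[i][n] += weight      # n holds rank i+1 in this world
    winners = [max(names, key=rp.get) for rp in rank_prob]
    return winners, rank_prob

score = {"Aiden": 0.65, "Bob": 0.55, "Chris": 0.45}
prob = {"Aiden": 0.3, "Bob": 0.9, "Chris": 0.4}
winners, rank_prob = u_kranks_bruteforce(score, prob, 2)
print(winners)
for i, rp in enumerate(rank_prob, start=1):
    print(i, {n: round(p, 3) for n, p in rp.items()})
```

Bob wins rank 1 (0.63) and rank 2 (0.27 against 0.264 for Chris), so the answer set collapses to the single tuple Bob.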
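Properties • A better semantics • [Table comparing the three semantics on Exact-k, Faithfulness, and Stability; cell values not recoverable] • * Yes when the relation is simple, No otherwise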
Challenges Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) Answer: Global-Topk • How to compute the top-k answer of Rp ? (Query Evaluation)
Global-Topk in Simple Relation • Given Rp=<R, p, C >, a scoring function s, a non-negative integer k • Assumptions • Tuples are independent, i.e. |C |=|R| • R={t1,t2,…tn}, ordered in the decreasing order of their scores, i.e. $s(t_1) > s(t_2) > \dots > s(t_n)$
Global-Topk in Simple Relation • Query Evaluation • Recursion • Pk,s(ti): Global-Topk probability of tuple ti • Dynamic Programming
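One way to realize a dynamic program over the tuple index and k is sketched below. The slide's exact recurrence did not survive extraction, so this uses a counting formulation derived from the semantics and should be read as an assumption: tuple ti is in the top-k of a world exactly when ti is present and fewer than k of the higher-scored tuples t1, …, t(i-1) are present. The table `exact` and the function name are illustrative.

```python
def global_topk_probs(probs, k):
    """probs: p(t_1), ..., p(t_n) in decreasing score order (independent tuples).
    Returns the Global-Topk probability of each tuple in O(n * k) time."""
    # exact[j] = Pr(exactly j of the tuples scanned so far are present), j < k.
    exact = ([1.0] + [0.0] * (k - 1)) if k > 0 else []
    result = []
    for p in probs:
        # t_i makes the top-k iff it exists and fewer than k better tuples exist.
        result.append(p * sum(exact))
        # Fold t_i into the counts; mass reaching k present tuples is dropped.
        for j in range(k - 1, 0, -1):
            exact[j] = exact[j] * (1.0 - p) + exact[j - 1] * p
        if k > 0:
            exact[0] *= 1.0 - p
    return result

# Smart environment example: Aiden, Bob, Chris in decreasing score order.
names = ["Aiden", "Bob", "Chris"]
gt = global_topk_probs([0.3, 0.9, 0.4], 2)
answer = sorted(names, key=lambda n: gt[names.index(n)], reverse=True)[:2]
print([round(x, 3) for x in gt], answer)
```

On the smart environment data this yields 0.3, 0.9, and 0.292, matching the possible-worlds computation without enumerating the worlds.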
Optimization • Threshold Algorithm (TA) • [Fagin, Lotem & Naor 2001] • Given a system of objects, such that • For each object attribute, there is a sorted list ranking objects in the decreasing order of their scores on that attribute • An aggregation function f combines individual attribute scores xi, i=1,2,…m, to obtain the overall object score f(x1,x2,…,xm) • f is monotonic • f(x1,x2,…,xm) ≤ f(x’1,x’2,…,x’m) whenever xi ≤ x’i for every i • TA is cost-optimal in finding the top-k objects • TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
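The generic form of TA described above can be sketched as follows. This is a simplified in-memory illustration, assuming every object appears in every attribute list; the data, the function name, and the choice of sum as the monotonic aggregation function are hypothetical. It is not the Global-Topk-specific variant.

```python
import heapq

def threshold_algorithm(lists, f, k):
    """lists: one dict per attribute, object -> attribute score.
    f: monotonic aggregation function. Returns (top-k [(object, score)], depth read)."""
    if k == 0:
        return [], 0
    # Sorted access: each attribute list in decreasing score order.
    ranked = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    seen = {}
    for depth in range(len(ranked[0])):
        last = []
        for column in ranked:
            obj, x = column[depth]
            last.append(x)
            if obj not in seen:
                # Random access: fetch the object's other attribute scores.
                seen[obj] = f(*(l[obj] for l in lists))
        # No unseen object can score above f of the last scores read.
        threshold = f(*last)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best, depth + 1
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), len(ranked[0])

# Hypothetical objects with two attribute scores each.
attr1 = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.2, "e": 0.1}
attr2 = {"a": 0.7, "b": 0.9, "c": 0.4, "d": 0.3, "e": 0.2}
top, depth = threshold_algorithm([attr1, attr2], lambda x, y: x + y, 2)
print([obj for obj, _ in top], depth)
```

Here the scan stops after reading two entries per list out of five, because the second-best seen score already meets the threshold.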
Applying TA Optimization • Global-Topk • Two attributes: probability & score • Aggregation function: Global-Topk probability
Global-Topk in General Relation • Given Rp=<R, p, C >, a scoring function s, a non-negative integer k • Assumptions • Tuples are independent or exclusive, i.e. |C |<|R| • R={t1,t2,…tn}, ordered in the decreasing order of their scores, i.e. $s(t_1) > s(t_2) > \dots > s(t_n)$
Global-Topk in General Relation • Induced Event Relation • For each tuple in R, there is a probabilistic relation Ep=<E, pE, C E> generated by the following two rules • Ep is simple
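Sensor Network Example • Prob. Relation (general) [temperature values not recoverable] • C1 (from Sensor 1): Prob 0.6 and 0.4 • C2 (from Sensor 2): Prob 0.1 and 0.6 • For example: a tuple t with p(t) = 0.6 • Induced Event Relation (simple) • Rule 2, where i=1: event with Prob 0.6 • Rule 1: event with Prob 0.6 = p(t)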
Evaluating Global-Topk in General Relation • For each tuple t, generate corresponding induced event relation • Compute the Global-Topk probability of t by Theorem 4.3 • Pick the k tuples with the highest Global-Topk probability
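The three steps can be sketched as below. The two generation rules and Theorem 4.3 are not recoverable from the slides, so the reduction here is my reading, derived from the semantics: for a tuple t, every other part contributes one independent event "this part yields a tuple scoring above t", whose probability is the sum of the probabilities of those tuples, and t makes the top-k when it exists and fewer than k such events occur. The temperature values and tuple ids are hypothetical; only the probabilities come from the sensor example.

```python
def global_topk_general(tuples, parts, k):
    """tuples: id -> (score, prob); parts: list of lists of ids, exclusive within a part.
    Returns (top-k ids, id -> Global-Topk probability)."""
    if k == 0:
        return [], dict.fromkeys(tuples, 0.0)
    part_of = {t: i for i, part in enumerate(parts) for t in part}
    result = {}
    for t, (s_t, p_t) in tuples.items():
        # One independent event per other part: "a better tuple appears there".
        events = []
        for i, part in enumerate(parts):
            if i != part_of[t]:
                q = sum(tuples[u][1] for u in part if tuples[u][0] > s_t)
                if q > 0:
                    events.append(q)
        # exact[j] = Pr(exactly j of these events occur), tracked for j < k.
        exact = [1.0] + [0.0] * (k - 1)
        for q in events:
            for j in range(k - 1, 0, -1):
                exact[j] = exact[j] * (1.0 - q) + exact[j - 1] * q
            exact[0] *= 1.0 - q
        result[t] = p_t * sum(exact)
    answer = sorted(result, key=result.get, reverse=True)[:k]
    return answer, result

# Probabilities from the sensor example; temperatures are made up for illustration.
readings = {"r1": (22, 0.6), "r2": (10, 0.4), "r3": (25, 0.1), "r4": (15, 0.6)}
answer, gt = global_topk_general(readings, [["r1", "r2"], ["r3", "r4"]], 1)
print(answer, {t: round(p, 3) for t, p in gt.items()})
```

For the reading with hypothetical temperature 15 and probability 0.6, the only other part (Sensor 1) contributes an event of probability 0.6, which agrees with the induced event relation shown for the sensor example.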
Summary on Query Evaluation • Simple (Independent Tuples) • Dynamic Programming • Tuples are ordered on their scores • Recursion on the tuple index and k • General (Independent/Exclusive Tuples) • Polynomial reduction to simple cases
Complexity • [Complexity table not recoverable] • * m is a rule-engine-related factor: it represents how complicated the relationships between tuples can be
Outline • Background • Motivation Examples • Top-k Queries in Probabilistic Databases • Conclusion