550 likes | 706 Views
Probabilities in Databases and Logics I. Nilesh Dalvi and Dan Suciu University of Washington. Two Lectures. Today: probabilistic database to model imprecisions probabilistic logics Tomorrow: probabilistic database to model incompletness random graphs. Motivation. Record reconciliation
E N D
Probabilities inDatabases and Logics I • Nilesh Dalvi and Dan Suciu • University of Washington
Two Lectures • Today: • probabilistic database to model imprecisions • probabilistic logics • Tomorrow: • probabilistic database to model incompletness • random graphs
Motivation • Record reconciliation • Information extraction • Constraint violations • Schema matching
name title title rating rating year p p p Twelve Monkeys Twelve Monkeys Monkey Love good fair 1995 .5 .53 .8 Monkey Love Monkey Love fair good 1997 .2 .42 .4 Monkey Love Pl Monkey Love fair fair 1935 .6 .15 .9 Monkey Love Pl poor 2005 .9 .7 Queries: A(x,y) :- Review(x,y), Movie(x,z), z > 1991 Answers: Top k Problem Setting Tables: Review Movie Problem: complexity of query evaluation
Query evaluation problem Fix answer tuple (a,b) Given database I, compute Pr(Q(a,b)) Top-k answering problem Fix k > 0 Given database I, find k answer tuples with highest probabilities Two Problems Fixed schema S, conjunctive query Q(x,y)
Related Work: DB • Cavallo&Pitarelli:1987 • Barbara,Garcia-Molina, Porter:1992 • Lakshmanan,Leone,Ross&Subrahmanian:1997 • Fuhr&Roellke:1997 • Dalvi&S:2004 • Widom:2005
Related Work: Logic • Query reliability [Gradel,Gurevitch,Hirsch’98] • Degrees of belief [Bacchus,Grove,Halpern,Koller’96] • Probabilistic Logic [Nielson] • Probabilistic model checking [Kwiatkowska’02] • Probabilistic Relational Model[Taskar,Abbeel,Koller’02]
Outline • Definitions • Query Evaluation • Top-k answering (joint with Chris Re) • Conclusions
Probabilistic Database • Schema S, Domain D, Set of instances Inst • DefinitionProbabilistic database is a probability distribution Pr : Inst → [0,1], ∑I Pr[I] = 1 If Pr[I] > 0 then I is called “possible world”
Probabilistic Database • Representation: • Independent tuples:I-database DB over some schema Si • Independent and disjoint tuples:ID-database DB over some schema Sid • Semantics: • DB “means” probability distribution Pr over schema S
Mov Mov Scor Scor Representation m42 good m99 good Pr[I1]= (1-p1)*(1-p2)*(1-p3) p1*p2*(1-p3) Pr[I4]= p1*p2*p3 Pr[I8]= Possible worlds semantics I-Databases Reviewsi(M,S,p) Reviews(M,S) Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1
Time Time Time Time Time Act Act Act Act Act t+1 t t t walk walk walk run t+1 walk Pr[I1]= (1-p1-p2)*(1-p3) p2*(1-p3) Pr[I3]= Pr[I5]= p1*p3 ID-Databases Activitiesid Activities Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1
Movied Movie Movie Scored Score Score P P P m42 m42 m42 good good good p1 p1 p1 m99 m99 m99 good good good p2 p2 p2 m76 m76 m76 poor poor poor p3 p3 p3 ID subsumes I Reviewsid Reviewsi = Note: means all tuples are disjoint Reviewsid
Syntax: conjunctive queries over schema S id mid year rating P p m42 1995 0.95 m42 4 0.7 m99 2002 0.65 m42 5 0.45 m76 2002 0.1 m99 5 0.82 m05 2005 0.7 m99 4 0.68 m05 5 0.79 Queries Q(y) :- Movie(x,y), Review(x,z), z >= 3 Reviewi Moviei
Two Query Semantics • Possible answer sets • Given set A: • Used for views • Possible tuples • Given tuple t: • Used for query evaluation and top-k Pr[{t | I ⊨ Q(t)} = A] This talk Pr[I ⊨ Q(t)]
id id id year id year year year year p m76 1995 m42 2004 m87 1934 m99 1935 1935 p2 + p3 = 0.6 m99 1935 m99 1901 m44 1904 m05 1903 2004 p1 + p3 = 0.5 m05 2004 m76 1902 1995 p3 = 0.2 p1 p2 p3 p4 . . . . . . top k Query Semantics Q(y) :- Movie(x,y), Review(x,z) Tupleprobabilities
Summary on Data Model • Data Model:Semantics = possible worldsSyntax = I-databases or ID-databases • Queries:Syntax = unchanged (conjunctive queries)Semantics = tuple probabilities
Outline • Definitions • Query evaluation • Top-k answering • Conclusions
Problem Definition • Fix schema S, query Q, answer tuple t • Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] • Conventions: For upper bounds (P or #P): probabilities are rationalsFor lower bounds (#P): probabilities are 1/2 Pr[Q(t)] notation:
Query Evaluationon I-Databases • Outline • Intuition • Extensional plans: PTIME case • Hard queries: #P-complete case • Dichotomy Theorem
mid Year id rate year p p p m42 4 q1 m42 1995 p1 1995 m42 2 q2 m99 2002 p2 m42 3 q3 m76 2002 p3 2002 m99 1 q4 m05 2005 p4 m99 3 q5 m76 5 q6 Intuition Q(y) :- Movie(x,y), Review(x,z) Reviewi Moviei Answer p1× (1 - (1 - q1) ×(1 - q2)×(1 - q3)) p2× (1 - (1 - q4)×(1 - q5)) 1 - (1 - ) × (1 - ) p3× q6
I-Extensional Plans • Add Join ⋈ p = p1 * p2Projection ∏ p = 1-(1-p1)(1-p2)...(1-pn)Selection σ p = p • Note: data complexity is PTIME [Barbara92,Lakshmanan97] p
1995 1995 m1 m1 q1 q1 p p m1 m1 q2 q2 m1 m1 q3 q3 ∏ ∏ ⋈ ⋈ ∏ Review Movie Movie Review Q(y) :- Movie(x,y), Review(x,z) CORRECT INCORRECT!
A B A p B p p1 q1 p2 q2 p3 q3 p4 q4 #P-Complete Queries Ri S Ti Qbad :- Ri(x), S(x,y), Ti(y) Theorem: Data complexity is #P-complete
A A B p B p y1 x1 x2 y3 1/2 1/2 x2 y2 x1 y2 1/2 1/2 x4 y3 x3 y3 1/2 1/2 x4 x3 1/2 y1 Proof: Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete S Ti Ri Reduction: x2y3 V x1y2 V x4y3 V x3y1 Qbad :- Ri(x), S(x,y), Ti(y)
I-Dichotomy Q = boolean conjunctive query Definition 1. For each variable x: goals(x) = set of goals that contain x Definition 2. Q is hierarchical if forall x, y: (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)
R S x y Q :- R(x),S(x,y),T(x,y,z),K(x,v) v K z T “hierarchical” Q:- R(x), S(x,y), T(y) x y S T R “non-hierarchical”
[Dalvi&S.’04] I-Dichotomy Schema Si = {R1i, R2i, . . ., Rmi} Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan Q is hierarchical or: Q is #P-complete Q has subgoals R(x,...),S(x,y,...),T(y,...)
Proof Lemma 1. If Q is non-hierarchical, then #P-complete Proof: y x z S T R v K Q :- Ri(v,x), Si(x,y), Ti(y,z), Ki(z) rest is like for Qbad
Proof Lemma 2. If Q is hierarchical, then PTIME Proof: Case 1: has no root Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3) This is extensional join⋈
Proof Case 2: has root x x Dom={a1, a2, . . ., an} Pr(Q) = 1 - (1-Pr(Q(a1/x))(1-Pr(Q(a2/x))...(1-Pr(Q(an/x))) This is an extensional projection: ∏ QED
Query Evaluationon ID-Databases • ID-extensional plans • #P-complete queries • Dichotomoy Theorem
Extensional Plans for ID-DBs • Only difference: two kinds of projections:independent 1-(1-p1)...(1-pn)disjoint p1 + ... + pn
#P-Complete Queries Q1 :- Ri(x), Si(x,y), Ti(y) Q2 :- Rd(xd,y), Sd(yd,z) Q3 :- Rd(xd,y), Sd(zd,y)
[Dalvi&S.’04] I-DB Dichotomy Schema Sid s.t. each table is either Ri or Rid Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan or: Q is #P-complete Q has one of Q1, Q2, Q3 as subqueries
Extensions • Extensions of the dichotomoy theorem exists for: • Mixed schemas (some relations are deterministic) • Functional dependencies
Summary on Query Evaluation • Extensional plans: popular, efficient, BUT • “Equivalent” plans lead to different results • Some queries admit “correct” plans • Some simple queries: #P-complete complexity • Dichotomy theorem • Future work: remove ‘no-self-join’ restriction
Outline • Definitions • Query evaluation • Top-k answering (joint with Chris Re) • Conclusions
Top-k Ranking Problem • Fix schema S, query Q, number k > 0 • Problem: given I- or ID-database DB,find k answers t1,...,tk with highest probabilities • Note: CheckingPr[Q(ti)] > Pr[Q(tj)] is PP-completeGoal: efficient polynomial time approximation Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...
A e1 p e2 e3 Pr 0 0 0 (1-p1)(1-p2)(1-p3) e1 p1 0 0 1 (1-p1)(1-p2)p3 e2 p2 0 1 0 (1-p1)p2(1-p3) e3 p3 0 1 1 (1-p1)p2p3 1 0 0 p1(1-p2)(1-p3) 1 0 1 p1(1-p2)p3 1 1 0 p1p2(1-p3) 1 1 1 p1p2p3 Probabilities of Boolean Expressions What is the probability of e1⋀e2 ⋁e1⋀e2⋁e1⋀e3? Theorem #P-hard [Valiant] (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3
Monte Carlo Simulation Algorithm: radomly pick each e1, e2, e3 = false or true compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false ?repeat Approximate probability p with frequency p’ Better: PTAS [Karp&Luby’83] p’- ε p’+ ε p’ Pr( |p’-p| < ε ) > 1-δ p
N=1 N=2 N=3 Monte Carlo Simulation 0 1 p N=0
Year P 1995 ?? 2002 ?? 1933 ?? 1984 ?? 0 1 The Multisimulation Problem Schedule simulation steps to find top-k
p2 2 p1 1 p4 p3 4 3 p5 5 Multisimulation How to find the top k out of n ? 0 1 Example: looking for top k=2; Which one simulate next ?
Multisimulation Critical region: (k’th left, k+1’th right) 0 1 k=2
Multisimulation Algorithm Case 1: pick a “double crosser” and simulate it 0 1 this k=2
Multisimulation Algorithm Case 2: pick both a “left” AND a “right” crosser 0 1 this and this k=2
Multisimulation Algorithm Case 3: pick a “max crosser” and simulate it 0 1 this k=2
Multisimulation Algorithm End: when critical region is “empty” 0 1 k=2 To sort the top k, find the top k-1, etc
Multisimulation Algorithm Theorem (1) It runs in < 2 Optimal # steps (2) no other deterministic algorithm does better