Probabilities in Databases and Logics I

Probabilities inDatabases and Logics I • Nilesh Dalvi and Dan Suciu • University of Washington

Two Lectures • Today: • probabilistic database to model imprecisions • probabilistic logics • Tomorrow: • probabilistic database to model incompletness • random graphs

Motivation • Record reconciliation • Information extraction • Constraint violations • Schema matching

name title title rating rating year p p p Twelve Monkeys Twelve Monkeys Monkey Love good fair 1995 .5 .53 .8 Monkey Love Monkey Love fair good 1997 .2 .42 .4 Monkey Love Pl Monkey Love fair fair 1935 .6 .15 .9 Monkey Love Pl poor 2005 .9 .7 Queries: A(x,y) :- Review(x,y), Movie(x,z), z > 1991 Answers: Top k Problem Setting Tables: Review Movie Problem: complexity of query evaluation

Query evaluation problem Fix answer tuple (a,b) Given database I, compute Pr(Q(a,b)) Top-k answering problem Fix k > 0 Given database I, find k answer tuples with highest probabilities Two Problems Fixed schema S, conjunctive query Q(x,y)

Related Work: DB • Cavallo&Pitarelli:1987 • Barbara,Garcia-Molina, Porter:1992 • Lakshmanan,Leone,Ross&Subrahmanian:1997 • Fuhr&Roellke:1997 • Dalvi&S:2004 • Widom:2005

Related Work: Logic • Query reliability [Gradel,Gurevitch,Hirsch’98] • Degrees of belief [Bacchus,Grove,Halpern,Koller’96] • Probabilistic Logic [Nielson] • Probabilistic model checking [Kwiatkowska’02] • Probabilistic Relational Model[Taskar,Abbeel,Koller’02]

Outline • Definitions • Query Evaluation • Top-k answering (joint with Chris Re) • Conclusions

Probabilistic Database • Schema S, Domain D, Set of instances Inst • DefinitionProbabilistic database is a probability distribution Pr : Inst → [0,1], ∑I Pr[I] = 1 If Pr[I] > 0 then I is called “possible world”

Probabilistic Database • Representation: • Independent tuples:I-database DB over some schema Si • Independent and disjoint tuples:ID-database DB over some schema Sid • Semantics: • DB “means” probability distribution Pr over schema S

Mov Mov Scor Scor Representation m42 good m99 good Pr[I1]= (1-p1)*(1-p2)*(1-p3) p1*p2*(1-p3) Pr[I4]= p1*p2*p3 Pr[I8]= Possible worlds semantics I-Databases Reviewsi(M,S,p) Reviews(M,S) Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1

Time Time Time Time Time Act Act Act Act Act t+1 t t t walk walk walk run t+1 walk Pr[I1]= (1-p1-p2)*(1-p3) p2*(1-p3) Pr[I3]= Pr[I5]= p1*p3 ID-Databases Activitiesid Activities Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1

Movied Movie Movie Scored Score Score P P P m42 m42 m42 good good good p1 p1 p1 m99 m99 m99 good good good p2 p2 p2 m76 m76 m76 poor poor poor p3 p3 p3 ID subsumes I Reviewsid Reviewsi = Note: means all tuples are disjoint Reviewsid

Syntax: conjunctive queries over schema S id mid year rating P p m42 1995 0.95 m42 4 0.7 m99 2002 0.65 m42 5 0.45 m76 2002 0.1 m99 5 0.82 m05 2005 0.7 m99 4 0.68 m05 5 0.79 Queries Q(y) :- Movie(x,y), Review(x,z), z >= 3 Reviewi Moviei

Two Query Semantics • Possible answer sets • Given set A: • Used for views • Possible tuples • Given tuple t: • Used for query evaluation and top-k Pr[{t | I ⊨ Q(t)} = A] This talk Pr[I ⊨ Q(t)]

id id id year id year year year year p m76 1995 m42 2004 m87 1934 m99 1935 1935 p2 + p3 = 0.6 m99 1935 m99 1901 m44 1904 m05 1903 2004 p1 + p3 = 0.5 m05 2004 m76 1902 1995 p3 = 0.2 p1 p2 p3 p4 . . . . . . top k Query Semantics Q(y) :- Movie(x,y), Review(x,z) Tupleprobabilities

Summary on Data Model • Data Model:Semantics = possible worldsSyntax = I-databases or ID-databases • Queries:Syntax = unchanged (conjunctive queries)Semantics = tuple probabilities

Outline • Definitions • Query evaluation • Top-k answering • Conclusions

Problem Definition • Fix schema S, query Q, answer tuple t • Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] • Conventions: For upper bounds (P or #P): probabilities are rationalsFor lower bounds (#P): probabilities are 1/2 Pr[Q(t)] notation:

Query Evaluationon I-Databases • Outline • Intuition • Extensional plans: PTIME case • Hard queries: #P-complete case • Dichotomy Theorem

mid Year id rate year p p p m42 4 q1 m42 1995 p1 1995 m42 2 q2 m99 2002 p2 m42 3 q3 m76 2002 p3 2002 m99 1 q4 m05 2005 p4 m99 3 q5 m76 5 q6 Intuition Q(y) :- Movie(x,y), Review(x,z) Reviewi Moviei Answer p1× (1 - (1 - q1) ×(1 - q2)×(1 - q3)) p2× (1 - (1 - q4)×(1 - q5)) 1 - (1 - ) × (1 - ) p3× q6

I-Extensional Plans • Add Join ⋈ p = p1 * p2Projection ∏ p = 1-(1-p1)(1-p2)...(1-pn)Selection σ p = p • Note: data complexity is PTIME [Barbara92,Lakshmanan97] p

1995 1995 m1 m1 q1 q1 p p m1 m1 q2 q2 m1 m1 q3 q3 ∏ ∏ ⋈ ⋈ ∏ Review Movie Movie Review Q(y) :- Movie(x,y), Review(x,z) CORRECT INCORRECT!

A B A p B p p1 q1 p2 q2 p3 q3 p4 q4 #P-Complete Queries Ri S Ti Qbad :- Ri(x), S(x,y), Ti(y) Theorem: Data complexity is #P-complete

A A B p B p y1 x1 x2 y3 1/2 1/2 x2 y2 x1 y2 1/2 1/2 x4 y3 x3 y3 1/2 1/2 x4 x3 1/2 y1 Proof: Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete S Ti Ri Reduction: x2y3 V x1y2 V x4y3 V x3y1 Qbad :- Ri(x), S(x,y), Ti(y)

I-Dichotomy Q = boolean conjunctive query Definition 1. For each variable x: goals(x) = set of goals that contain x Definition 2. Q is hierarchical if forall x, y: (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)

R S x y Q :- R(x),S(x,y),T(x,y,z),K(x,v) v K z T “hierarchical” Q:- R(x), S(x,y), T(y) x y S T R “non-hierarchical”

[Dalvi&S.’04] I-Dichotomy Schema Si = {R1i, R2i, . . ., Rmi} Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan Q is hierarchical or: Q is #P-complete Q has subgoals R(x,...),S(x,y,...),T(y,...)

Proof Lemma 1. If Q is non-hierarchical, then #P-complete Proof: y x z S T R v K Q :- Ri(v,x), Si(x,y), Ti(y,z), Ki(z) rest is like for Qbad

Proof Lemma 2. If Q is hierarchical, then PTIME Proof: Case 1: has no root Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3) This is extensional join⋈

Proof Case 2: has root x x Dom={a1, a2, . . ., an} Pr(Q) = 1 - (1-Pr(Q(a1/x))(1-Pr(Q(a2/x))...(1-Pr(Q(an/x))) This is an extensional projection: ∏ QED

Query Evaluationon ID-Databases • ID-extensional plans • #P-complete queries • Dichotomoy Theorem

Extensional Plans for ID-DBs • Only difference: two kinds of projections:independent 1-(1-p1)...(1-pn)disjoint p1 + ... + pn

#P-Complete Queries Q1 :- Ri(x), Si(x,y), Ti(y) Q2 :- Rd(xd,y), Sd(yd,z) Q3 :- Rd(xd,y), Sd(zd,y)

[Dalvi&S.’04] I-DB Dichotomy Schema Sid s.t. each table is either Ri or Rid Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan or: Q is #P-complete Q has one of Q1, Q2, Q3 as subqueries

Extensions • Extensions of the dichotomoy theorem exists for: • Mixed schemas (some relations are deterministic) • Functional dependencies

Summary on Query Evaluation • Extensional plans: popular, efficient, BUT • “Equivalent” plans lead to different results • Some queries admit “correct” plans • Some simple queries: #P-complete complexity • Dichotomy theorem • Future work: remove ‘no-self-join’ restriction

Outline • Definitions • Query evaluation • Top-k answering (joint with Chris Re) • Conclusions

Top-k Ranking Problem • Fix schema S, query Q, number k > 0 • Problem: given I- or ID-database DB,find k answers t1,...,tk with highest probabilities • Note: CheckingPr[Q(ti)] > Pr[Q(tj)] is PP-completeGoal: efficient polynomial time approximation Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...

A e1 p e2 e3 Pr 0 0 0 (1-p1)(1-p2)(1-p3) e1 p1 0 0 1 (1-p1)(1-p2)p3 e2 p2 0 1 0 (1-p1)p2(1-p3) e3 p3 0 1 1 (1-p1)p2p3 1 0 0 p1(1-p2)(1-p3) 1 0 1 p1(1-p2)p3 1 1 0 p1p2(1-p3) 1 1 1 p1p2p3 Probabilities of Boolean Expressions What is the probability of e1⋀e2 ⋁e1⋀e2⋁e1⋀e3? Theorem #P-hard [Valiant] (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

Monte Carlo Simulation Algorithm: radomly pick each e1, e2, e3 = false or true compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false ?repeat Approximate probability p with frequency p’ Better: PTAS [Karp&Luby’83] p’- ε p’+ ε p’ Pr( |p’-p| < ε ) > 1-δ p

N=1 N=2 N=3 Monte Carlo Simulation 0 1 p N=0

Year P 1995 ?? 2002 ?? 1933 ?? 1984 ?? 0 1 The Multisimulation Problem Schedule simulation steps to find top-k

p2 2 p1 1 p4 p3 4 3 p5 5 Multisimulation How to find the top k out of n ? 0 1 Example: looking for top k=2; Which one simulate next ?

Multisimulation Critical region: (k’th left, k+1’th right) 0 1 k=2

Multisimulation Algorithm Case 1: pick a “double crosser” and simulate it 0 1 this k=2

Multisimulation Algorithm Case 2: pick both a “left” AND a “right” crosser 0 1 this and this k=2

Multisimulation Algorithm Case 3: pick a “max crosser” and simulate it 0 1 this k=2

Multisimulation Algorithm End: when critical region is “empty” 0 1 k=2 To sort the top k, find the top k-1, etc

Multisimulation Algorithm Theorem (1) It runs in < 2 Optimal # steps (2) no other deterministic algorithm does better

Probabilities in Databases and Logics I

Probabilities in Databases and Logics I

Presentation Transcript

Probabilities

Random Variables and Probabilities

Probabilities and distributions

States and probabilities

Probabilities

LOGICS

Probabilities and Proportions

Description Logics

Restricted -terms and logics

Timed Logics .....

Databases I Domeincalculus

Probabilities and Probabilistic Models

Description Logics

Description Logics

Probabilities

Probabilities and Proportions

Probabilities

Description logics

Internet Databases I