1 / 55

Probabilities in Databases and Logics I

Probabilities in Databases and Logics I. Nilesh Dalvi and Dan Suciu University of Washington. Two Lectures. Today: probabilistic database to model imprecisions probabilistic logics Tomorrow: probabilistic database to model incompletness random graphs. Motivation. Record reconciliation

darva
Download Presentation

Probabilities in Databases and Logics I

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Probabilities inDatabases and Logics I • Nilesh Dalvi and Dan Suciu • University of Washington

  2. Two Lectures • Today: • probabilistic database to model imprecisions • probabilistic logics • Tomorrow: • probabilistic database to model incompletness • random graphs

  3. Motivation • Record reconciliation • Information extraction • Constraint violations • Schema matching

  4. name title title rating rating year p p p Twelve Monkeys Twelve Monkeys Monkey Love good fair 1995 .5 .53 .8 Monkey Love Monkey Love fair good 1997 .2 .42 .4 Monkey Love Pl Monkey Love fair fair 1935 .6 .15 .9 Monkey Love Pl poor 2005 .9 .7 Queries: A(x,y) :- Review(x,y), Movie(x,z), z > 1991 Answers: Top k Problem Setting Tables: Review Movie Problem: complexity of query evaluation

  5. Query evaluation problem Fix answer tuple (a,b) Given database I, compute Pr(Q(a,b)) Top-k answering problem Fix k > 0 Given database I, find k answer tuples with highest probabilities Two Problems Fixed schema S, conjunctive query Q(x,y)

  6. Related Work: DB • Cavallo&Pitarelli:1987 • Barbara,Garcia-Molina, Porter:1992 • Lakshmanan,Leone,Ross&Subrahmanian:1997 • Fuhr&Roellke:1997 • Dalvi&S:2004 • Widom:2005

  7. Related Work: Logic • Query reliability [Gradel,Gurevitch,Hirsch’98] • Degrees of belief [Bacchus,Grove,Halpern,Koller’96] • Probabilistic Logic [Nielson] • Probabilistic model checking [Kwiatkowska’02] • Probabilistic Relational Model[Taskar,Abbeel,Koller’02]

  8. Outline • Definitions • Query Evaluation • Top-k answering (joint with Chris Re) • Conclusions

  9. Probabilistic Database • Schema S, Domain D, Set of instances Inst • DefinitionProbabilistic database is a probability distribution Pr : Inst → [0,1], ∑I Pr[I] = 1 If Pr[I] > 0 then I is called “possible world”

  10. Probabilistic Database • Representation: • Independent tuples:I-database DB over some schema Si • Independent and disjoint tuples:ID-database DB over some schema Sid • Semantics: • DB “means” probability distribution Pr over schema S

  11. Mov Mov Scor Scor Representation m42 good m99 good Pr[I1]= (1-p1)*(1-p2)*(1-p3) p1*p2*(1-p3) Pr[I4]= p1*p2*p3 Pr[I8]= Possible worlds semantics I-Databases Reviewsi(M,S,p) Reviews(M,S) Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1

  12. Time Time Time Time Time Act Act Act Act Act t+1 t t t walk walk walk run t+1 walk Pr[I1]= (1-p1-p2)*(1-p3) p2*(1-p3) Pr[I3]= Pr[I5]= p1*p3 ID-Databases Activitiesid Activities Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1

  13. Movied Movie Movie Scored Score Score P P P m42 m42 m42 good good good p1 p1 p1 m99 m99 m99 good good good p2 p2 p2 m76 m76 m76 poor poor poor p3 p3 p3 ID subsumes I Reviewsid Reviewsi = Note: means all tuples are disjoint Reviewsid

  14. Syntax: conjunctive queries over schema S id mid year rating P p m42 1995 0.95 m42 4 0.7 m99 2002 0.65 m42 5 0.45 m76 2002 0.1 m99 5 0.82 m05 2005 0.7 m99 4 0.68 m05 5 0.79 Queries Q(y) :- Movie(x,y), Review(x,z), z >= 3 Reviewi Moviei

  15. Two Query Semantics • Possible answer sets • Given set A: • Used for views • Possible tuples • Given tuple t: • Used for query evaluation and top-k Pr[{t | I ⊨ Q(t)} = A] This talk Pr[I ⊨ Q(t)]

  16. id id id year id year year year year p m76 1995 m42 2004 m87 1934 m99 1935 1935 p2 + p3 = 0.6 m99 1935 m99 1901 m44 1904 m05 1903 2004 p1 + p3 = 0.5 m05 2004 m76 1902 1995 p3 = 0.2 p1 p2 p3 p4 . . . . . . top k Query Semantics Q(y) :- Movie(x,y), Review(x,z) Tupleprobabilities

  17. Summary on Data Model • Data Model:Semantics = possible worldsSyntax = I-databases or ID-databases • Queries:Syntax = unchanged (conjunctive queries)Semantics = tuple probabilities

  18. Outline • Definitions • Query evaluation • Top-k answering • Conclusions

  19. Problem Definition • Fix schema S, query Q, answer tuple t • Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] • Conventions: For upper bounds (P or #P): probabilities are rationalsFor lower bounds (#P): probabilities are 1/2 Pr[Q(t)] notation:

  20. Query Evaluationon I-Databases • Outline • Intuition • Extensional plans: PTIME case • Hard queries: #P-complete case • Dichotomy Theorem

  21. mid Year id rate year p p p m42 4 q1 m42 1995 p1 1995 m42 2 q2 m99 2002 p2 m42 3 q3 m76 2002 p3 2002 m99 1 q4 m05 2005 p4 m99 3 q5 m76 5 q6 Intuition Q(y) :- Movie(x,y), Review(x,z) Reviewi Moviei Answer p1× (1 - (1 - q1) ×(1 - q2)×(1 - q3)) p2× (1 - (1 - q4)×(1 - q5)) 1 - (1 - ) × (1 - ) p3× q6

  22. I-Extensional Plans • Add Join ⋈ p = p1 * p2Projection ∏ p = 1-(1-p1)(1-p2)...(1-pn)Selection σ p = p • Note: data complexity is PTIME [Barbara92,Lakshmanan97] p

  23. 1995 1995 m1 m1 q1 q1 p p m1 m1 q2 q2 m1 m1 q3 q3 ∏ ∏ ⋈ ⋈ ∏ Review Movie Movie Review Q(y) :- Movie(x,y), Review(x,z) CORRECT INCORRECT!

  24. A B A p B p p1 q1 p2 q2 p3 q3 p4 q4 #P-Complete Queries Ri S Ti Qbad :- Ri(x), S(x,y), Ti(y) Theorem: Data complexity is #P-complete

  25. A A B p B p y1 x1 x2 y3 1/2 1/2 x2 y2 x1 y2 1/2 1/2 x4 y3 x3 y3 1/2 1/2 x4 x3 1/2 y1 Proof: Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #P-complete S Ti Ri Reduction: x2y3 V x1y2 V x4y3 V x3y1 Qbad :- Ri(x), S(x,y), Ti(y)

  26. I-Dichotomy Q = boolean conjunctive query Definition 1. For each variable x: goals(x) = set of goals that contain x Definition 2. Q is hierarchical if forall x, y: (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x)

  27. R S x y Q :- R(x),S(x,y),T(x,y,z),K(x,v) v K z T “hierarchical” Q:- R(x), S(x,y), T(y) x y S T R “non-hierarchical”

  28. [Dalvi&S.’04] I-Dichotomy Schema Si = {R1i, R2i, . . ., Rmi} Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan Q is hierarchical or: Q is #P-complete Q has subgoals R(x,...),S(x,y,...),T(y,...)

  29. Proof Lemma 1. If Q is non-hierarchical, then #P-complete Proof: y x z S T R v K Q :- Ri(v,x), Si(x,y), Ti(y,z), Ki(z) rest is like for Qbad

  30. Proof Lemma 2. If Q is hierarchical, then PTIME Proof: Case 1: has no root Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3) This is extensional join⋈

  31. Proof Case 2: has root x x Dom={a1, a2, . . ., an} Pr(Q) = 1 - (1-Pr(Q(a1/x))(1-Pr(Q(a2/x))...(1-Pr(Q(an/x))) This is an extensional projection: ∏ QED

  32. Query Evaluationon ID-Databases • ID-extensional plans • #P-complete queries • Dichotomoy Theorem

  33. Extensional Plans for ID-DBs • Only difference: two kinds of projections:independent 1-(1-p1)...(1-pn)disjoint p1 + ... + pn

  34. #P-Complete Queries Q1 :- Ri(x), Si(x,y), Ti(y) Q2 :- Rd(xd,y), Sd(yd,z) Q3 :- Rd(xd,y), Sd(zd,y)

  35. [Dalvi&S.’04] I-DB Dichotomy Schema Sid s.t. each table is either Ri or Rid Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan or: Q is #P-complete Q has one of Q1, Q2, Q3 as subqueries

  36. Extensions • Extensions of the dichotomoy theorem exists for: • Mixed schemas (some relations are deterministic) • Functional dependencies

  37. Summary on Query Evaluation • Extensional plans: popular, efficient, BUT • “Equivalent” plans lead to different results • Some queries admit “correct” plans • Some simple queries: #P-complete complexity • Dichotomy theorem • Future work: remove ‘no-self-join’ restriction

  38. Outline • Definitions • Query evaluation • Top-k answering (joint with Chris Re) • Conclusions

  39. Top-k Ranking Problem • Fix schema S, query Q, number k > 0 • Problem: given I- or ID-database DB,find k answers t1,...,tk with highest probabilities • Note: CheckingPr[Q(ti)] > Pr[Q(tj)] is PP-completeGoal: efficient polynomial time approximation Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...

  40. A e1 p e2 e3 Pr 0 0 0 (1-p1)(1-p2)(1-p3) e1 p1 0 0 1 (1-p1)(1-p2)p3 e2 p2 0 1 0 (1-p1)p2(1-p3) e3 p3 0 1 1 (1-p1)p2p3 1 0 0 p1(1-p2)(1-p3) 1 0 1 p1(1-p2)p3 1 1 0 p1p2(1-p3) 1 1 1 p1p2p3 Probabilities of Boolean Expressions What is the probability of e1⋀e2 ⋁e1⋀e2⋁e1⋀e3? Theorem #P-hard [Valiant] (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

  41. Monte Carlo Simulation Algorithm: radomly pick each e1, e2, e3 = false or true compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false ?repeat Approximate probability p with frequency p’ Better: PTAS [Karp&Luby’83] p’- ε p’+ ε p’ Pr( |p’-p| < ε ) > 1-δ p

  42. N=1 N=2 N=3 Monte Carlo Simulation 0 1 p N=0

  43. Year P 1995 ?? 2002 ?? 1933 ?? 1984 ?? 0 1 The Multisimulation Problem Schedule simulation steps to find top-k

  44. p2 2 p1 1 p4 p3 4 3 p5 5 Multisimulation How to find the top k out of n ? 0 1 Example: looking for top k=2; Which one simulate next ?

  45. Multisimulation Critical region: (k’th left, k+1’th right) 0 1 k=2

  46. Multisimulation Algorithm Case 1: pick a “double crosser” and simulate it 0 1 this k=2

  47. Multisimulation Algorithm Case 2: pick both a “left” AND a “right” crosser 0 1 this and this k=2

  48. Multisimulation Algorithm Case 3: pick a “max crosser” and simulate it 0 1 this k=2

  49. Multisimulation Algorithm End: when critical region is “empty” 0 1 k=2 To sort the top k, find the top k-1, etc

  50. Multisimulation Algorithm Theorem (1) It runs in < 2 Optimal # steps (2) no other deterministic algorithm does better

More Related