1 / 51

Efficient Query Evaluation on Probabilistic Databases

Efficient Query Evaluation on Probabilistic Databases. Nilesh Dalvi Dan Suciu. Presenter : Amit Goyal Discussion Lead : Jonatan. Outline. Motivation Query Evaluation: Intensional Extensional Query Optimization Complexity Unsafe Plans Extensions Conclusions.

Download Presentation

Efficient Query Evaluation on Probabilistic Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan

  2. Outline • Motivation • Query Evaluation: • Intensional • Extensional • Query Optimization • Complexity • Unsafe Plans • Extensions • Conclusions

  3. Databases Are Deterministic • Databases we see today are deterministic • A tuple is either in the query answer or not • They don’t deal with uncertainties

  4. Future of Data Management • Uncertainties in Data • Biological Data • Sensor Data (Geographical Data) • Data extracted from various AI, data mining techniques (information extraction) • Uncertainties are represented as probabilities • Extend data management tools to handle probabilistic data

  5. Example Review Text I have not used IPOD but Apple products are good Facts Table

  6. Representing Uncertainty • Tuple-existence uncertainty • All attributes in a tuple are known precisely; existence of the tuple is uncertain • E.g. in previous slide. More later • Attribute-value uncertainty • Tuples (identified by keys) exist for certain; attributes (one or more) value are however uncertain • Tomorrow, it may rain (probability is 0.6)

  7. Our Goal For Today • Understand how queries can be evaluated efficiently on Probabilistic Databases • For simplicity, we will deal with tuple-level uncertainties only • We also assume independence among tuples. i.e. P(t1, t2) = P(t1) * P(t2)

  8. Cam Feat Cam Feat Cam Feat Cam Feat C21 Lens C29 Batt C21 Lens C21 Lens C31 Lens C29 Batt I1 I2 I3 C31 Lens I4 I5 (1-p1)(1-p2)(1-p3) p1(1-p2)(1-p3) (1-p1)p2(1-p3) p1(1-p2)p3 p1p2p3 Possible Worlds: Example 1 Total number of worlds: 2^count_tuples ; ∑Ii = 1

  9. Possible Worlds: Example 2 s1 s2 0.8 0.5 0.6 t1 T S Possible Worlds pwd(Dp)

  10. Query Evaluation • So, lets consider a query: • Q(D) :- S(A,B), T(C,D), B = C • S join T on B = C, project on D • Intuitively: • Execute the query on each possible world • The final result is a probabilistic relation that represents end result

  11. Query Evaluation: Example S join T on B = C, project on D qpwd(Dp) =

  12. Query Evaluation • Semantically correct • If T has ‘n’ tuples, there can be as many as 2^n possible worlds. • Exponential complexity, thus impractical Goal of the paper: Evaluate query efficiently

  13. Intensional Query Evaluation • Define the complex event ep(t) for each tuple t • For each intermediate tuple, associate an explicit (complex) event expression • Compute the actual probabilities at the end • For this talk, we will look only select, join project queries

  14. v E v v1 v2 E1 ˄ E2 v E1 v1 E1 v2 E2 v E2 v E Intensional Semantics E1 V E2 V…   X … …

  15. Theorem (2) The intesional semantics and the possible world semantics on probabilistic databases are equivalent for conjunctive queries. pwd(qi(Dp)) = qpwd(Dp)

  16. Intensional Semantics: Example T S Project on D S join T on B = C qrank(Dp) = Pr(q) = (0.8 * 0.6) + (0.5 * 0.6) – (0.8 * 0.5 * 0.6) = 0.48 + 0.3 – 0.24 = 0.54

  17. Intensional Semantics • Does not depend on the choice of plan • Impractical to use it: • The event expressions can become very large due to projections • For each tuple t, one has to compute Pr(e) for its event e, which is #P-complete problem • Thus very expensive

  18. Extensional Semantics • Play with probabilities instead of event expressions • Much more efficient • Assume tuple independence • Not always correct. WHY?

  19. v p v 1-(1-p1)(1-p2)… v1 v2 p1 p2 v p1 v1 p1 v2 p2 v p2 v p Extensional Semantics   x

  20. Extensional Query Evaluation:Example Plan : πD(S joinB=C T) T S S join T on B = C Project on D Wrong?? Because the two tuples in the join are no longer independent!!

  21. Extensional: Alternate Query Plan Plan : πD(πB(S) joinB=C T) T S Join with T on B=C Project S on B CORRECT!!

  22. Observation The answer depends on query plan

  23. Notations • R is a relation name. • D = instance of a database schema • Γ = set of functional dependencies • E = set of all complex events • q = query • PRels(q) = the probabilistic relation names in q • Attr(q) = all attributes in all relations in q • Head(q) = the set of attributes that are in output of the query q

  24. Safe Plan • A plan is safe if it produces the correct result • Formally, given a schema Rp, Γp, a plan P for a query q is safe if Pe(Dp) = qrank(Dp) for all instances Dp of that schema

  25. Theorem (3) • Consider a database schema where all the probabilistic relations are tuple-independent. Let q, q’ be the conjunctive queries that do not share any relation name. Then • σ is always safe • x is always safe in q x q’ • Π is safe iff A1,…Ak, R.E → Head (q)

  26. Example • Same example, Γp is : • S.A, S.B → S.E • T.C, T.D → T.E • S.E → S.A, S.B • T.E → T.C, T.D • Query :- S join T on B = C, project on D • Plan : πD(S joinB=C T) • Join is safe. We need to check the safeness of project. From theorem 3, we need to check • A1,…Ak, R.E → Head (q) • T.D, S.E → S.A, S.B, T.C, T.D (pass) • T.D, T.E → S.A, S.B, T.C, T.D (fails, why?) • Where • A1,…Ak is T.D • R.E is S.E and T.E • Head (q) is S.A, S.B, T.C, T.D

  27. Example: Alternative Plan • Query :- S join T on B = C, project on D • Plan : πD(πB(S) joinB=C T) • Project on B is safe. We need to check the safeness of project on D. From theorem 3, we need to check • A1,…Ak, R.E → Head (q) • T.D, S.E → S.B, T.C, T.D • T.D, T.E → S.B, T.C, T.D • Where • A1,…Ak is T.D • R.E is S.E and T.E • Head (q) is S.B, T.C, T.D Plan is safe!!

  28. Separation • Let q be a conjunctive query. Two relations R1, R2 are called connected if the query contains a join condition R1.A = R2.B and either R1.A or R2.B is not in Head(q). The relations R1, R2 are called separate if they are not connected. • Two sets of relations Y1 and Y2 are said to form a separation for query q iff • They partition the set Rels(q) • For any pair of R1 and R2 s.t. R1 belongs to Y1 and R2 belongs to Y2, they are separate • Intuitively, • The query does not contains a join condition • If the query has join condition, output of query does contains both R1.A and R2.B

  29. Separation: Example • Query :- S(A,B), T(C,D), B = C • qBC = (S joinB=C T) • Head(qBC) = {B,C,D} S join T on B = C Both B and C are present in head(qBC). Thus S and T are separate for this query

  30. Finding Safe Plan • Authors proposed SAFE-PLAN algorithm to find safe plans for a query • Try to postpone all safe projections in the query plan • When no more safe projections possible, it tries to perform a join, by splitting q into q1 join q2 • Since we perform join in the last, all attributes of join condition must be in Head(q), thus making sure that relations involved in join are separate. • If a safe plan exist, the algorithm finds it

  31. Finding Safe Plan: Example Processing :- SAFE-PLAN(πD(S joinB=C T)) Is πHead(q)(qA) is a safe operator? Conditions: T.D, S.E → S.A, T.D (safe) T.D, T.E → S.A, T.D (unsafe) Head(qA) = {A, D} qA = πD(S joinB=C T)) Z = {A} Head(q) = {D}

  32. Finding Safe Plan: Example Processing :- SAFE-PLAN(πD(S joinB=C T)) Is πHead(q)(qB) is a safe operator? Conditions: T.D, S.E → S.B, T.D (safe) T.D, T.E → S.B, T.D (safe) Return πD(SAFE-PLAN(qB)) Head(qB) = {B, D} qB = πD(S joinB=C T)) Z = {B} Head(q) = {D}

  33. Finding Safe Plan: Example Processing :- πD(SAFE-PLAN(qB)) • Is πHead(q)(qAB) is a safe operator? • Conditions: • T.D, S.E → S.A, S.B, T.D (safe) • T.D, T.E → S.A, S.B, T.D (unsafe) Head(qAB) = {A, B, D} qAB = πD(S joinB=C T)) Z = {A} Head(qB) = {B, D}

  34. Finding Safe Plan: Example Processing :- πD(SAFE-PLAN(qB)) • Is πHead(q)(qBC) is a safe operator? • Conditions: • T.D, S.E → T.C, S.B, T.D (safe) • T.D, T.E → T.C, S.B, T.D (safe) • Return • πBD(SAFE-PLAN(qBC)) Head(qBC) = {B, C, D} qBC = πD(S joinB=C T)) Z = {C} Head(qB) = {B, D}

  35. Finding Safe Plan: Example Processing :- πD(πBD(SAFE-PLAN(qBC)) • Is πHead(q)(qABC) is a safe operator? • Conditions: • T.D, S.E → S.A,T.C, S.B, T.D (safe) • T.D, T.E → S.A,T.C, S.B, T.D (unsafe) Head(qABC) = {A, B, C, D} qABC = πD(S joinB=C T)) Z = {A} Head(qBC) = {B, C, D}

  36. Finding Safe Plan: Example Processing :- πD(πBD(SAFE-PLAN(qBC)) • Split qBC into q1 joinB=C q2, s.t. • q1(B) :- S(A,B) • q2(C,D) :- T(C,D) • We know that S and T are separate on query qBC!! • Return • SAFE-PLAN(q1) joinB=C • SAFE-PLAN(q2)) No projection possible!! qBC = πD(S joinB=C T)) Head(qBC) = {B, C, D}

  37. Finding Safe Plan: Example πD(πBD(SAFE-PLAN(q1) joinB=C SAFE-PLAN(q2))) • Is πHead(q1)(qA) is a safe operator? • Conditions: • S.B, S.E → S.A, S.B (safe) • Return • πB(SAFE-PLAN(S(A,B))) • i.e. πB(S(A,B)) Head(qA) = {A, B} qA = S(A,B) Z = {A} Head(q1) = {B}

  38. Finding Safe Plan: Example • SAFE-PLAN(q2) = T(C,D) • Thus, final result : πD(πBD(πB(S) joinB=C T)) • πBD is redundant. Can be optimized. • SAFE-PLAN algorithm is sound and complete • How can we optimize our query plan? • Traditional equivalences do not work in extensional semantics. • Need to define extensional semantics equivalences

  39. Query Optimization • Select behaves exactly like traditional select operator • Extensional joins are commutative R join S  S join R • Extensional joins are associative R join (S join T)  (R join S) join T • Cascading Projections πA(πAUB(R))  πA(R) • Pushing Projection below a join πA(R join S) => (πA(R)) join (πA(S)) • Lifting Projections Up a Join: only when it satisfies the project condition in theorem 3 (πA(R)) join S => πAUAttrs(S)(R join S) • Theorem (10) : Let Z1 and Z2 be two safe plans for a query q. Then Z1  Z2

  40. Complexity Fundamentals • PTIME : solvable in polynomial time • NP complete : Is? Checks satisfiability. • #P complete : How many?

  41. Complexity Analysis • The data complexity of a query q is the complexity of evaluating qrank(Dp) as a function of size of Dp • If q has a safe plan, then its data complexity is in PTIME • All extensional operators are in PTIME • If q does not has a safe plan, then its data complexity is in #P-complete. i.e. if SAFE-PLAN algorithm fails to return a plan

  42. Unsafe Plans • What if there is no safe plan? The author proposes two solutions • Least Unsafe Plans • Monte-Carlo Approximations

  43. Least Unsafe Plans • Minimize the error in computing the probabilities • Modify SAFE-PLAN algorithm • When splitting a query q in two sub-queries q1 and q2, allow joins b/w q1 and q2 on attributes not in Head(q), then project out these attributes • These projections will be unsafe. Minimize their degree of unsafety • Pick q1, q2 to be a minimum cut of graph (rather than separation) • Problem of finding minimum cut is in PTIME

  44. Monte-Carlo Approximations • Let q’ be the query obtained from q by making it return all the variables in its body. • Evaluate q’ instead of q without any probability calculations • Group the tuples based on the values of attributes in Head(q) • Complex event expression of a group will be in CNF. i.e. Vni=1Ci where each Ci is in DNF. i.e. e1 ˄ e2 ˄ … • Back to same problem!! • Complexity of evaluating the probability of a boolean expression is in #P-complete

  45. Monte-Carlo Approximations • Given a DNF formula with N clauses and any ε and δ, the probability can be approximated in time O(N/ε2 ln (1/δ)) • Probability of the error being greater than ε is less than δ. • If N is small, an exact algorithm may be applied in place of simulation

  46. Extensions • Till now: • All the events in probabilistic relations are distinct • Dealt with select, project, join queries. • The authors have extended their solutions to non-distinct relations and additional operators

  47. Handling Repeated Events • Multiple tuple can share a common event • 4 easy steps to handle them: • Normalize the schema – represents the same data in normalized form, s.t. no probabilistic table has repeated events TP :- T1 and TP2 • Translate original query into new schema • Find a safe plan • Translate back to original schema

  48. Handling Repeated Events:Example • Consider two prob. Relations: R(A,B) and S(C,D) s.t. R has all distinct events while S has a distinct event for each value of D • Query q(x) :- R(x,y), S(y,z) • Step1: create a new schema. Decompose S into two relations: S1(C, D, EID) and S2(EID) • q’(x) :- R(x,y), S1(y,z, eid), S2(eid) • Using SAFE-PLAN, we get the following plan • P’ = πA(R joinB=C (πB,EID(S1) joinEID S2)) • Substitute back S1 and S2 accordingly

  49. Additional Operators • Union, Difference and Groupby operators • Covers almost all queries with nested sub-queries, aggregates, group-by and existensial/universal quantifiers

  50. Uncertain Predicates • q≈ predicate on a deterministic database • Syntactic closeness: String Matching. e.g. certain ~ uncertain • Edit distances, q-grams etc. • Semantic closeness: e.g. musical ~ opera • TF/IDF, ontologies from Wordnet • Numeric closeness: e.g. 25 ~ 26 • similar numeric values • Once distances are defined, they need to be meaningfully converted into probabilities • gaussian, student-T, normal-gamma • parameters can be learned (ideal case) or can be specified by user

More Related