Queries with Difference on Probabilistic Databases

Sanjeev Khanna, Sudeepa Roy, Val Tannen
University of Pennsylvania

Probabilistic Databases
• To model and query uncertain data (sensor networks, information extraction, …)
• Possible worlds model
• Each possible world W is a standard database instance and has a probability P[W]
• Compact representation D, assuming independence
[Figure: example probabilistic database D with relations R, S, T]

Query Semantics
• Query semantics on probabilistic databases:
  • Apply the query q to each possible world W
  • Add up the probabilities of the worlds that give the same query answer A:
    $P[q(D) = A] = \sum_{W \,:\, q(W) = A} P[W]$
• Goal: efficiently evaluate P[q(D) = A]
  • Data complexity; we want time polynomial in n = |D|
• Can we always efficiently compute P[q(D)]?
  • NO, in general it is #P-hard

Query Answering in Two Steps
• Boolean query q() :- R(x), S(x, y), T(y)
• Introduce event variables for tuples, each with a probability (P[w1] = 0.3, …)
• Step 1 (easy): Boolean provenance for q(D) [FR ’97, Z ’97]
  f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
• Step 2 (hard): Compute P[q(D)] = P[f] given P[w1] = 0.3, P[v1] = 0.4, …
[Figure: database D with relations R, S, T, annotated with event variables and probabilities]
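The two steps can be made concrete with a brute-force sketch that sums P[W] over every world satisfying the provenance f from this slide. Only P[w1] = 0.3 and P[v1] = 0.4 come from the slide; every other probability below is a made-up placeholder, and the function name `prob_dnf_bruteforce` is hypothetical. The loop is exponential in the number of variables, which is exactly why Step 2 is the hard one.

```python
from itertools import product

# Provenance f from the slide, as a list of minterms (tuples of event variables).
F = [("w1", "v1", "u1"), ("w2", "v2", "u1"), ("w3", "v3", "u2"), ("w3", "v4", "u3")]

# P[w1] and P[v1] are from the slide; all other values are ASSUMED placeholders.
PROB = {"w1": 0.3, "v1": 0.4,
        "w2": 0.5, "w3": 0.5, "v2": 0.5, "v3": 0.5, "v4": 0.5,
        "u1": 0.5, "u2": 0.5, "u3": 0.5}


def prob_dnf_bruteforce(minterms, prob):
    """Sum P[W] over every world W (truth assignment) that satisfies the DNF."""
    variables = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        # The DNF holds in this world if some minterm has all its variables true.
        if any(all(world[v] for v in m) for m in minterms):
            weight = 1.0
            for v in variables:  # independence: multiply per-variable probabilities
                weight *= prob[v] if world[v] else 1.0 - prob[v]
            total += weight
    return total


if __name__ == "__main__":
    print(prob_dnf_bruteforce(F, PROB))
```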

Probability Computation for Positive Queries
• Dichotomy result [DS ’04, ’07; DSS ’10]: given q as input, we can efficiently decide whether q is
  • Safe: safe plans run in poly-time on all instances, or
  • Unsafe: #P-hard, e.g. q() :- R(x), S(x, y), T(y)
• Instance-by-instance approach [SDG ’10, RPT ’11]
  • Both q and D are given as input
  • Poly-time algorithm to compute P[q(D)] for special cases even if q is unsafe
What about queries with difference?

Boolean Provenances for Difference
• q1(x) :- R(x, y), S(y, z)
• q2(x) :- R(x, y), S(y, z), T(z)
• q = q1 – q2
[Figure: example relations R, S, T]

Previous Work on Difference
[FOR ’11]
• Framework for exact and approximate probability computation
• But no guarantee of polynomial running time
In fact, we show in this paper that with difference, in some cases P[q(D)] cannot be efficiently approximated (unless NP = RP)
How far can we go with difference in poly-time?

A Quick Comparison
With difference
• DNF of the Boolean provenance may be exponential in n
• P[q(D)] may not be approximable
Without difference
• DNF of the Boolean provenance is poly-size (n^|q|)
• P[q(D)] is always approximable (FPRAS)
FPRAS: Fully Polynomial Randomized Approximation Scheme
• In time polynomial in n and 1/ε, compute a value p such that, with probability ≥ ¾,
  $p \in [(1-\varepsilon)\,P[q(D)],\ (1+\varepsilon)\,P[q(D)]]$

Our Results
• We study queries of the form q1 – q2 and their generalizations
• FPRAS: if q1 is any UCQ and q2 is any safe CQ-
• #P-hardness: even if both q1 and q2 are safe CQ-
• Inapproximability: even if q1 is the trivial TRUE query and q2 is a UCQ
• Our FPRAS result extends to a larger class of queries, of which q1 – q2 is a special case
[CQ-: conjunctive queries without self-joins]

Difference Rank
• Define the difference rank δ(q) of a query q recursively
• δ(R) = 0
• δ(q1 - q2) = δ(q1) + δ(q2) + 1
  • R – S : rank 1
• δ(q1 ⋈ q2) = δ(q1) + δ(q2)
  • (R – S1) ⋈ (R - S2) : rank 2
  • (R - T1) ⋈ T2 : rank 1
• δ(q1 ∪ q2) = max(δ(q1), δ(q2))
  • ((R – S1) ⋈ (R - S2)) ∪ ((R - T1) ⋈ T2) : rank 2
• Select, project: rank remains the same
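The recursive definition of difference rank translates directly into a function over a query tree. The sketch below assumes a tiny hypothetical AST (`Rel`, `Diff`, `Join`, `Union`, `Unary`); these class names are illustrative and not from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:          # base relation, e.g. R
    name: str


@dataclass(frozen=True)
class Diff:         # q1 - q2
    left: object
    right: object


@dataclass(frozen=True)
class Join:         # q1 JOIN q2
    left: object
    right: object


@dataclass(frozen=True)
class Union:        # q1 UNION q2
    left: object
    right: object


@dataclass(frozen=True)
class Unary:        # select or project over a sub-query
    child: object


def difference_rank(q):
    """Difference rank of a query tree, following the slide's recursive rules."""
    if isinstance(q, Rel):
        return 0
    if isinstance(q, Diff):
        return difference_rank(q.left) + difference_rank(q.right) + 1
    if isinstance(q, Join):
        return difference_rank(q.left) + difference_rank(q.right)
    if isinstance(q, Union):
        return max(difference_rank(q.left), difference_rank(q.right))
    if isinstance(q, Unary):
        return difference_rank(q.child)   # select/project leave the rank unchanged
    raise TypeError(f"unknown query node: {q!r}")


if __name__ == "__main__":
    R, S1, S2 = Rel("R"), Rel("S1"), Rel("S2")
    print(difference_rank(Join(Diff(R, S1), Diff(R, S2))))  # 2
```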

FPRAS for queries q with difference rank δ(q) = 1, given that some conditions hold (inapproximable for δ(q) = 1 in general)

Steps in FPRAS
• Step 1: Compute the Boolean provenance of q(D) for any query q with difference rank δ(q) = 1
• Step 2: Write the Boolean provenance in a “Probability Friendly Form” (if possible)
• Step 3: FPRAS inspired by the Karp-Luby framework

Boolean Provenance for Queries q s.t. δ(q) = 1
Lemma: For any q with difference rank δ(q) = 1, on any D, the provenance f of q(D) has the form
[Formula not recovered from the slide]
f is poly-size in n = |D| and poly-time computable

Probability Friendly Form (PFF)
• f is in PFF if its negated DNFs can be written as poly-size d-DNNFs (next slide)
• If f is in PFF, we can approximate P[f] using the Karp-Luby framework

d-DNNF [Darwiche ’01, ’02; DM ’02]
Deterministic, Decomposable Negation Normal Form
• Deterministic: under any assignment, at most one child of a +-node is satisfied
• Decomposable: children of a ·-node do not share variables
• Negation normal form: no internal node is a negation (negation appears only at the leaves)
• In general, can be a DAG
• Probability can be computed in linear time
[Figure: example d-DNNF]
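The linear-time claim follows from the two structural properties: at a +-node the children are mutually exclusive, so their probabilities add, and at a ·-node the children are independent, so their probabilities multiply. A minimal sketch, assuming a hypothetical tuple encoding of nodes (`("var", x)`, `("not", x)`, `("and", [...])`, `("or", [...])`) and independent variables:

```python
def ddnnf_prob(node, prob, cache=None):
    """Probability that a d-DNNF is true, visiting each DAG node once.

    node: ("var", x) | ("not", x) | ("and", children) | ("or", children)
    prob: dict mapping variable name to its (independent) probability of being true
    """
    if cache is None:
        cache = {}
    key = id(node)                      # memoise by identity so shared nodes are reused
    if key in cache:
        return cache[key]
    kind = node[0]
    if kind == "var":
        p = prob[node[1]]
    elif kind == "not":                 # negation only on leaves
        p = 1.0 - prob[node[1]]
    elif kind == "and":                 # decomposable: children are independent
        p = 1.0
        for child in node[1]:
            p *= ddnnf_prob(child, prob, cache)
    elif kind == "or":                  # deterministic: children are disjoint events
        p = sum(ddnnf_prob(child, prob, cache) for child in node[1])
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    cache[key] = p
    return p


if __name__ == "__main__":
    # x OR y written deterministically as x + (NOT x) . y
    x_or_y = ("or", [("var", "x"), ("and", [("not", "x"), ("var", "y")])])
    print(ddnnf_prob(x_or_y, {"x": 0.3, "y": 0.4}))
```

The function does not check that its input really is deterministic and decomposable; on an arbitrary Boolean DAG the result would not be a probability.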

Karp-Luby Framework [KL ’83]
Given Boolean expression DAGs F1, …, Fm, let f = F1 + F2 + ... + Fm
P[f] can be approximated (FPRAS) in time polynomial in m, n, and 1/ε if, for all i, in poly-time:
(1) P[Fi] can be computed
(2) it can be checked whether a given assignment satisfies Fi
(3) a random satisfying assignment of Fi can be sampled
Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms, e.g. f = xyz + xyw + wuv
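For the DNF special case, the three conditions are easy to see in code: P[Fi] is a product, checking a minterm is a scan, and sampling a satisfying assignment of Fi fixes its variables to true. The sketch below is one standard variant of the Karp-Luby estimator (a sample counts only when the chosen minterm is the first satisfied one), not necessarily the exact variant used by the authors; the function name, the variable probabilities, and the sample count are assumptions.

```python
import random


def karp_luby_dnf(minterms, prob, samples, rng):
    """Estimate P[F1 + ... + Fm] for positive minterms over independent variables."""
    variables = sorted(prob)
    # Condition (1): P[Fi] is the product of its variables' probabilities.
    weights = []
    for m in minterms:
        w = 1.0
        for v in m:
            w *= prob[v]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(samples):
        # Pick minterm i with probability P[Fi] / sum_j P[Fj].
        i = rng.choices(range(len(minterms)), weights=weights)[0]
        chosen = set(minterms[i])
        # Condition (3): sample a world conditioned on Fi being true.
        world = {v: (v in chosen) or (rng.random() < prob[v]) for v in variables}
        # Condition (2): find the first minterm this world satisfies.
        first = next(j for j, m in enumerate(minterms) if all(world[v] for v in m))
        if first == i:
            hits += 1
    return total * hits / samples


if __name__ == "__main__":
    f = [("x", "y", "z"), ("x", "y", "w"), ("w", "u", "v")]   # f = xyz + xyw + wuv
    fair = {v: 0.5 for v in "xyzwuv"}                          # ASSUMED probabilities
    print(karp_luby_dnf(f, fair, 20000, random.Random(0)))
```

With all probabilities set to 0.5, the exact value is 18/64 (18 of the 64 assignments satisfy f), which gives a quick sanity check on the estimate.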

Conditions (1) and (2) hold for PFF
• The product of a minterm and a d-DNNF is another d-DNNF
[Figure: d-DNNF example with w2 = 1, z1 = 1]

Condition (3) also holds
Lemma: Generating a random satisfying assignment on a d-DNNF can be done in poly-time
• Process the nodes in reverse topological order
• Generate a random satisfying assignment bottom-up
[Figure: d-DNNF over v1, v2 with a partial assignment at each node and a random choice at each +-node]
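One way to realise the bottom-up procedure is sketched below: each node returns its probability together with a random partial assignment that satisfies it; a +-node picks one child's assignment with probability proportional to that child's probability, and a ·-node merges its children's assignments. The tuple encoding of nodes and the function names are hypothetical, and filling in unmentioned variables from their prior probabilities is an assumption of this sketch.

```python
import random


def sample_node(node, prob, rng, cache=None):
    """Return (P[node], random partial assignment satisfying node), bottom-up."""
    if cache is None:
        cache = {}
    key = id(node)
    if key in cache:
        return cache[key]
    kind = node[0]
    if kind == "var":
        result = (prob[node[1]], {node[1]: True})
    elif kind == "not":
        result = (1.0 - prob[node[1]], {node[1]: False})
    elif kind == "and":                 # children use disjoint variables: merge
        p, assignment = 1.0, {}
        for child in node[1]:
            cp, ca = sample_node(child, prob, rng, cache)
            p *= cp
            assignment.update(ca)
        result = (p, assignment)
    elif kind == "or":                  # children are disjoint events: pick one
        parts = [sample_node(child, prob, rng, cache) for child in node[1]]
        weights = [cp for cp, _ in parts]
        p = sum(weights)
        picked = rng.choices(parts, weights=weights)[0][1] if p > 0 else parts[0][1]
        result = (p, dict(picked))
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    cache[key] = result
    return result


def sample_satisfying(root, prob, rng):
    """Random full assignment satisfying the d-DNNF rooted at `root`."""
    p, partial = sample_node(root, prob, rng)
    if p == 0.0:
        raise ValueError("expression has probability 0; nothing to sample")
    # Variables not fixed by the chosen branch are free: draw them from the prior.
    return {v: partial[v] if v in partial else rng.random() < prob[v] for v in prob}


if __name__ == "__main__":
    x_or_y = ("or", [("var", "x"), ("and", [("not", "x"), ("var", "y")])])
    print(sample_satisfying(x_or_y, {"x": 0.3, "y": 0.4}, random.Random(0)))
```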

Expressibility in PFF
So, if f is in PFF, we can approximate P[q(D)]
But can we decide in poly-time whether some sub-expressions of a Boolean expression have poly-size d-DNNFs?
• Not known
• But there are natural sufficient conditions that can be verified in poly-time
  • If certain sub-queries are safe and hence generate read-once expressions [OH ’08]
  • If sub-queries generate poly-size OBDDs [JS ’11]
• Extends to the instance-by-instance approach (both q, D given)

#P-hardness for q1 - q2 (both q1, q2 are safe CQ-)

#P-hardness: Steps in the proof
“Hard” query q = q1 – q2
• q1() :- R1(x, y1), R2(x, y2), R3(x, y3), R4(x, y4)
• q2() :- R1(x1, y), R2(x2, y), R3(x3, y), R4(x4, y)
Problems in the reduction chain
• Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings
• Counting independent sets in 3-regular bipartite graphs [XZ ’06]

Other Related Work
• Semantics of probabilistic query answering
  • Fuhr-Rollecke ’97, Zimanyi ’97
• Dichotomy of CQ-, CQ and UCQ queries
  • Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
• Knowledge compilation techniques
  • Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11
• Instance-by-instance approach
  • Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11

Conclusions and Future Work
A step towards understanding the complexity of exact and approximate computation for queries with difference operations
Future work
• Dichotomy results that syntactically classify difference queries (similar to positive UCQ)?
• Extending the FPRAS to queries with difference rank > 1?
• Experimental evaluation of our algorithms

Thank you Questions?
