Exploring meaningful explanations behind query answers through explanation-ready databases (ERDs): intervening on the database for insights beyond simple predicates, and ranking diverse explanations efficiently.
Explaining Query Answers with Explanation-Ready Databases
Sudeepa Roy, Laurel Orr, Dan Suciu
This talk • Given • a database D • an aggregate query Q • the output(s) Q[D] • a question q on the output(s) • e.g., "why is x high/low, or higher than y?" • How do we provide meaningful explanations?
The NSF Example [http://www.nsf.gov/awardsearch/download.jsp]
D: Award(aid, amount, title, year, startdate, enddate, dir, div), Institution(aid, inst, address), Investigator(aid, name, email)
• dir = Directorate (e.g., CS); div = Division (e.g., IIS for DB)
Q: Find the top-5 universities with the highest total award money in CS from 1990
SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)
GROUP BY B.inst
ORDER BY totalAward DESC
Q[D]: But I remember a news article…
The NSF Example: Give me an Explanation!
q: why is q[D] = Q[D, UIUC] - Q[D, CMU] high?
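A minimal SQL reading of the question q over the schema above (a sketch; the conditional-aggregation phrasing and the institution strings 'UIUC' and 'CMU' are our assumptions about the data, not from the talk):

-- q[D]: the UIUC vs. CMU gap in total CS award money from 1990
SELECT SUM(CASE WHEN B.inst = 'UIUC' THEN A.amount ELSE 0 END)
     - SUM(CASE WHEN B.inst = 'CMU'  THEN A.amount ELSE 0 END) AS gap
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)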
In search of a "causal" explanation by "intervention"…
Aristotle (384-322 BC), David Hume (1711-76), Karl Pearson (1857-1936), Carl Gustav Hempel (1905-97), Donald Rubin, Judea Pearl
• Make changes to the world (a.k.a. the database)
• If it affects the outcome (a.k.a. the query answer), you have a good explanation (to your question)
Explanation by Intervention
• Candidate explanation Φ
• The change it causes to the database is the intervention ΔΦ ⊆ D
• Goal: find the top explanations ranked by q[D] - q[D - ΔΦ]
[Figure: applying the intervention ΔΦ to D and re-running Q turns q[D] into q[D - ΔΦ], shrinking the UIUC vs. CMU gap]
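A naïve way to evaluate a single intervention, assuming the Award-side tuples of this one ΔΦ have been staged in a hypothetical table Delta_Award_Phi(aid, ...): exclude them and re-run q (a sketch, not the paper's algorithm):

-- q[D - ΔΦ] for one explanation Φ
SELECT SUM(CASE WHEN B.inst = 'UIUC' THEN A.amount ELSE 0 END)
     - SUM(CASE WHEN B.inst = 'CMU'  THEN A.amount ELSE 0 END) AS gap_after
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)
  AND A.aid NOT IN (SELECT aid FROM Delta_Award_Phi)  -- hypothetical deletion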
Related Work on Explanations
• [Scorpion, Wu-Madden, VLDB 2013]: Q = SELECT time, AVG(Temp) FROM readings GROUP BY time. Question: why is AVG(temp) at 12pm so high? Answer Φ: [sensor = 3]. Q[D] = the original avg(temp) at 12pm; Q[D - ΔΦ] = the new avg(temp) at 12pm. (Thanks to Eugene Wu for the example!)
• [Roy-Suciu, SIGMOD 2014]: Question: why is there a peak in #SIGMOD papers from industry around 2000? Answer Φ: [inst = 'bell-labs']. Allows dependencies between relations (delete author ⇒ delete paper; delete paper ⇒ delete author), but materializes the join before looking for explanations.
• Both are restricted to • Φ: simple conjunctive predicates • single-block queries Q
The NSF example again...
Award(aid, amount, title, year, startdate, enddate, dir, div), Institution(aid, inst, address), Investigator(aid, name, email)
SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)
GROUP BY B.inst
ORDER BY totalAward DESC
• Previous work considers predicates only from Award and Institution • e.g., [div = 'ACI'] (cyber infrastructure): $893M for UIUC vs. $26M for CMU
• What about the Investigator table? Investigator is not in the query, so it is ignored • e.g., [name = 'RP'] from UIUC received > $580M from NSF
There are explanations beyond simple conjunctive predicates
Explanations of the form:
• Awards with >= 3 PIs
• PIs with >= 5 awards
• PIs with >= $10M in total award money
• PIs with >= 10 co-PIs
• Top-10 PIs in terms of money
cannot be handled by previous approaches. The concept of intervention supports these more general explanations; their candidates can themselves be computed by queries, as sketched below.
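For instance, the candidates of the template "PIs with >= 5 awards" can be enumerated with a single aggregate query (a sketch over the schema above; the threshold and query shape are illustrative):

-- each qualifying investigator name is one candidate explanation Φ
SELECT I.name
FROM Investigator I
GROUP BY I.name
HAVING COUNT(DISTINCT I.aid) >= 5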
Our Contributions • Formalize richer explanations beyond simple conjunctive predicates • By explanation-ready databases • Rank these explanations efficiently • Support queries beyond single-block queries
General Framework for Explanations
Example Φ: "PIs with >= $10 million in awards"
R1: Award(aid, amount, title, year, startdate, enddate, dir, div)
R2: Institution(aid, inst, address)
R3: Investigator(aid, name, email)
• Step 1: Find the investigators with this property: Δ3,Φ
• Step 2: Find the tuples from the other tables affected by Δ3,Φ • Δ1,Φ = awards by investigators in Δ3,Φ • Δ2,Φ = ∅
• Φ is more complex than a conjunctive predicate and requires tables outside the query, but its intervention can still be computed by queries
• Intervention of Φ: ΔΦ = (Δ1,Φ, Δ2,Φ, …, Δk,Φ)
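A sketch of the two steps for this Φ in SQL (the $10M threshold is from the slide; the exact query shape is our assumption):

-- Step 1 (Δ3,Φ): investigators whose awards total >= $10M
WITH BigPI AS (
  SELECT I.name
  FROM Investigator I JOIN Award A ON I.aid = A.aid
  GROUP BY I.name
  HAVING SUM(A.amount) >= 10000000
)
-- Step 2 (Δ1,Φ): the Award tuples held by those investigators
SELECT DISTINCT A.*
FROM Award A
JOIN Investigator I ON A.aid = I.aid
JOIN BigPI P ON I.name = P.name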
Explanation-Ready Databases (ERD): 1/2
• Fix a small set of templates (NSF: ~8)
• Get their candidate explanations by running SQL queries (NSF: ~188k)
• For all candidate explanations Φ, pre-compute, again by SQL queries: ΔΦ = (Δ1,Φ, Δ2,Φ, …, Δk,Φ)
• Store them in k tables (Δ1, Δ2, …, Δk), indexed by a new attribute Φ; descriptions are stored separately
[Figure: k = 2 relations R1, R2 and three explanation ids Φ ∈ {1, 2, 3}; the rows of Δ1 and Δ2 with Φ = 2 form the intervention for Φ = 2]
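One way this pre-computation could be materialized (a sketch; the table name Delta_Award, the surrogate id phi, and the window function are our own choices, not the paper's):

-- Δ1 for the template "PIs with >= 5 awards": one row per (Φ, affected Award tuple)
CREATE TABLE Delta_Award AS
SELECT DENSE_RANK() OVER (ORDER BY P.name) AS phi,   -- explanation id Φ
       A.aid, A.amount, A.dir, A.year
FROM Award A
JOIN Investigator I ON A.aid = I.aid
JOIN (SELECT name FROM Investigator
      GROUP BY name
      HAVING COUNT(DISTINCT aid) >= 5) P
  ON I.name = P.name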
Explanation-Ready Databases (ERD): 2/2
• The new database (D, Δ) is ready to find good explanations: an ERD
• Given a query Q and a question q: • compute q[D - ΔΦ] for all Φ • output the top explanations ranked by their intervention on the question, q[D] - q[D - ΔΦ]
[Figure: as before, the rows of Δ1 and Δ2 with Φ = 2 form the intervention for Φ = 2]
An ERD (D, Δ) depends only on D
• ΔΦ depends on Φ and the semantics of D
• An ERD does not depend on the query Q or the user question q
• The ERD for the NSF dataset can be used to answer other questions: • Query 1, Question 1a: why is UIUC $ > CMU $? • Query 1, Question 1b: why is UCSD $ > MIT $? • Query 2, Question 2: why is the avg $ of ACI high?
Explanation query q* on an ERD
Goal: given an ERD (D, Δ), compute q[D - ΔΦ] for all Φ (to rank them)
• Naïve method: iterate over all Φ and compute q[D - ΔΦ], possibly using "incremental view maintenance" (IVM); but the number of Φ-s is large, so this "for loop" is slow
• Our approach: evaluate all Φ-s simultaneously by running one explanation query q*[D, Δ]
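To give a flavor of the single-query idea for our SUM-based question, a simplified sketch using the hypothetical Delta_Award table from before: for SUM, deleting ΔΦ just subtracts its contribution, so one group-by on the explanation id scores every Φ in a single pass (the paper's q* handles general query plans via annotated relations; this is only the SUM special case):

WITH q0 AS (            -- q[D], computed once
  SELECT SUM(CASE WHEN B.inst = 'UIUC' THEN A.amount ELSE 0 END)
       - SUM(CASE WHEN B.inst = 'CMU'  THEN A.amount ELSE 0 END) AS gap
  FROM Award A JOIN Institution B ON A.aid = B.aid
  WHERE A.dir = 'CS' AND A.year >= 1990
)
SELECT DA.phi,          -- q[D - ΔΦ] for every Φ, in one pass
       q0.gap - SUM(CASE WHEN B.inst = 'UIUC' THEN DA.amount
                         WHEN B.inst = 'CMU'  THEN -DA.amount
                         ELSE 0 END) AS gap_after
FROM Delta_Award DA
JOIN Institution B ON DA.aid = B.aid
CROSS JOIN q0
WHERE DA.dir = 'CS' AND DA.year >= 1990   -- ΔΦ tuples must satisfy Q's predicate
GROUP BY DA.phi, q0.gap
ORDER BY gap_after ASC                    -- biggest drop in the gap first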
Tables as "Annotated Relations"
• Each tuple maps to a numeric annotation (initially 1)
• As the query progresses, annotations are propagated through the query plan, simultaneously for all Φ-s
[Figure: the relations of D and Δ, every tuple initially annotated with 1]
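A toy illustration of how annotations could be carried in plain SQL (our simplified encoding, not the paper's formalism): each tuple gets a count column, a join multiplies the annotations, and a projection/group-by sums them:

-- start every tuple with annotation 1; the join multiplies annotations,
-- the final group-by (projection) sums them
SELECT A.aid, B.inst, SUM(A.ann * B.ann) AS ann
FROM (SELECT aid, 1 AS ann FROM Award) A
JOIN (SELECT aid, inst, 1 AS ann FROM Institution) B ON A.aid = B.aid
GROUP BY A.aid, B.inst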
Invariant maintained by q*
For all explanations Φ, and for all partial or final queries q:
    q[D ⊖ ΔΦ] = q[D] ⊖ q*Φ[D, Δ]
where q*Φ is the restriction of q* to Φ, and ⊖ is difference considering both sets and annotations. That is, first applying the intervention of Φ and then evaluating the query is equivalent to first evaluating the query and the explanation query, and then applying the intervention by restricting to Φ.
In the paper… • Addition ⊕, difference ⊖, intersection ⊙ for annotated relations • Rules and algorithms for all operators maintaining the invariants • Examples and necessary restrictions on the inputs
ERDs can go beyond single-block queries
• e.g., they can handle a trend: the slope of a linear regression on a derived measure
WITH R AS (SELECT Year AS x, AVG(Revenue - Expenses) AS y
           FROM Sales
           WHERE Year >= 2007 AND Year <= 2010
           GROUP BY Year)
SELECT ((COUNT(R.x) * SUM(R.x*R.y)) - (SUM(R.x) * SUM(R.y)))
     / ((COUNT(R.x) * SUM(R.x*R.x)) - (SUM(R.x) * SUM(R.x)))
FROM R
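For reference, the outer SELECT computes the ordinary least-squares slope over the n pairs (x_i, y_i) produced by R, with n = COUNT(R.x) (the standard formula, stated in LaTeX):

\[ \mathrm{slope} \;=\; \frac{n \sum_i x_i y_i \;-\; \bigl(\sum_i x_i\bigr)\bigl(\sum_i y_i\bigr)}{n \sum_i x_i^2 \;-\; \bigl(\sum_i x_i\bigr)^2} \]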
Experiments (IVM = incremental view maintenance)
• Algorithms: • SingleQ-IVM = our explanation query • Iter-IVM = iteratively run similar IVM • Naïve-Iter = iterative naïve approach • Naïve-SingleQ = run a single query, but with an expensive cross-product
• Parameters: • |D| = n • number of explanations |Φ| • average intervention size for explanations: avgΦ |ΔΦ| • complexity of queries (group-by / single output, nested)
Experiments • Datasets: • (performance) synthetic datasets built from “Natality” dataset (4M rows, 233 attributes) • (qualitative evaluation) “NSF” dataset • Award (400k), Institution (419k), and Investigator (621k) • |Φ| = 188k of 8 different types • |Δ| = 1.74M • Also “Flights” dataset (in the paper)
Experiments: # of tuples
[Plot: running time vs. #tuples for the naïve and iterative-IVM baselines, our algorithm, and the original query Q[D]]
• The running time of our algorithm does not increase rapidly with #tuples, until the final join to compute the difference Q[D] - Q[D - Δ]
Experiments: # of explanations
[Plot: iterative IVM's running time increases rapidly with #explanations; our algorithm for Q[D - Δ] and for the difference Q[D] - Q[D - Δ] stays low, shown alongside the original query Q[D]]
• The running time of our algorithms does not increase rapidly with #explanations
• Computing the difference Q[D] - Q[D - Δ] is faster than computing Q[D - Δ], and can even be faster than the original query
Experiments: group size |ΔΦ|
• Running time degrades when the group size is ~100k for each of 1k explanations: |D| = 4M, but |Δ| = 50M!
• In practice, not all explanations are likely to have a huge ΔΦ
Experiments: IVM (DBToaster) vs. ERD [Ahmad et al. 2012]
• IVM: compute Q[D - Δ] (actual deletion; fixed query, dynamic updates)
• ERD: compute Q[D - Δ1], Q[D - Δ2], …, Q[D - ΔP] (hypothetical deletions; dynamic queries, fixed updates)
• We iteratively run DBToaster, a state-of-the-art IVM system
• ERD (our algorithm) outperforms iterative IVM even when IVM's time to read the data and to compile the input query is ignored
Explanations for the NSF example
Top explanations:
• Investigators with more than $100M in awards in total (there were none at CMU)
• Investigators with >= 11 awards, and with 7-8 awards
• Investigators 'RP' and 'TD' from UIUC (who share some awards)
• Awards with 6 or more investigators
• Each accounts for $400-850 million of the difference in total award money between UIUC and CMU
• The explanations above are diverse
• They give the insight that a few researchers at UIUC got a number of big awards (in cyber-infrastructure, ACI)
• This insight could not be obtained if we were restricted to simple predicates from the Award and Institution tables and ignored Investigator altogether
Summary
• We need to put effort into drawing interesting insights from data that can answer our questions
• We propose ERDs that can handle • complex explanations • complex queries
• However, there are restrictions and limitations (bag semantics, negation), space overhead, and manually defined templates: scope for future work
Thank you