Exploring meaningful explanations behind query answers through explanation-ready databases (ERDs): intervening on the database for insights beyond simple predicates, and ranking diverse explanations efficiently.
Explaining Query Answers with Explanation-Ready Databases
Sudeepa Roy, Laurel Orr, Dan Suciu
This talk • Given • a database D • an aggregate query Q • the output(s) Q[D] • a question q on the output(s) • e.g., "why is x high/low, or higher than y?" • How do we provide meaningful explanations?
The NSF Example [http://www.nsf.gov/awardsearch/download.jsp]
D: Award(aid, amount, title, year, startdate, enddate, dir, div), Institution(aid, inst, address), Investigator(aid, name, email)
• dir = Directorate (e.g., CS); div = Division (e.g., IIS for DB)
Q: Find the top-5 universities with the highest total award money in CS from 1990
SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)
GROUP BY B.inst
ORDER BY totalAward DESC
Q[D]: But I remember a news article…
The NSF Example: Give me an Explanation!
q: why is q[D] = Q[D, UIUC] - Q[D, CMU] high?
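A minimal SQL reading of the question q over the schema above (a sketch; the conditional-aggregation phrasing and the institution strings 'UIUC' and 'CMU' are our assumptions about the data, not from the talk):

-- q[D]: the UIUC vs. CMU gap in total CS award money from 1990
SELECT SUM(CASE WHEN B.inst = 'UIUC' THEN A.amount ELSE 0 END)
     - SUM(CASE WHEN B.inst = 'CMU'  THEN A.amount ELSE 0 END) AS gap
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)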
In search of a "causal" explanation by "intervention"…
Aristotle (384-322 BC), David Hume (1711-76), Karl Pearson (1857-1936), Carl Gustav Hempel (1905-97), Donald Rubin, Judea Pearl
• Make changes to the world (a.k.a. the database)
• If it affects the outcome (a.k.a. the query answer), you have a good explanation (to your question)
Explanation by Intervention
• Candidate explanation Φ
• The change it causes to the database is the intervention ΔΦ ⊆ D
• Goal: find the top explanations ranked by q[D] - q[D - ΔΦ]
[Figure: applying the intervention ΔΦ to D and re-running Q turns q[D] into q[D - ΔΦ], shrinking the UIUC vs. CMU gap]
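A naïve way to evaluate a single intervention, assuming the Award-side tuples of this one ΔΦ have been staged in a hypothetical table Delta_Award_Phi(aid, ...): exclude them and re-run q (a sketch, not the paper's algorithm):

-- q[D - ΔΦ] for one explanation Φ
SELECT SUM(CASE WHEN B.inst = 'UIUC' THEN A.amount ELSE 0 END)
     - SUM(CASE WHEN B.inst = 'CMU'  THEN A.amount ELSE 0 END) AS gap_after
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)
  AND A.aid NOT IN (SELECT aid FROM Delta_Award_Phi)  -- hypothetical deletion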
Related Work on Explanations
• [Scorpion, Wu-Madden, VLDB 2013]: Q = SELECT time, AVG(Temp) FROM readings GROUP BY time. Question: why is AVG(temp) at 12pm so high? Answer Φ: [sensor = 3]. Q[D] = the original avg(temp) at 12pm; Q[D - ΔΦ] = the new avg(temp) at 12pm. (Thanks to Eugene Wu for the example!)
• [Roy-Suciu, SIGMOD 2014]: Question: why is there a peak in #SIGMOD papers from industry around 2000? Answer Φ: [inst = 'bell-labs']. Allows dependencies between relations (delete author ⇒ delete paper; delete paper ⇒ delete author), but materializes the join before looking for explanations.
• Both are restricted to • Φ: simple conjunctive predicates • single-block queries Q
The NSF example again...
Award(aid, amount, title, year, startdate, enddate, dir, div), Institution(aid, inst, address), Investigator(aid, name, email)
SELECT TOP 5 B.inst, SUM(A.amount) AS totalAward
FROM Award A, Institution B
WHERE (A.aid = B.aid) AND (A.dir = 'CS') AND (A.year >= 1990)
GROUP BY B.inst
ORDER BY totalAward DESC
• Previous work considers predicates only from Award and Institution • e.g., [div = 'ACI'] (cyber infrastructure): $893M for UIUC vs. $26M for CMU
• What about the Investigator table? Investigator is not in the query, so it is ignored • e.g., [name = 'RP'] from UIUC received > $580M from NSF
There are explanations beyond simple conjunctive predicates
Explanations of the form:
• Awards with >= 3 PIs
• PIs with >= 5 awards
• PIs with >= $10M in total award money
• PIs with >= 10 co-PIs
• Top-10 PIs in terms of money
cannot be handled by previous approaches. The concept of intervention supports these more general explanations; their candidates can themselves be computed by queries, as sketched below.
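For instance, the candidates of the template "PIs with >= 5 awards" can be enumerated with a single aggregate query (a sketch over the schema above; the threshold and query shape are illustrative):

-- each qualifying investigator name is one candidate explanation Φ
SELECT I.name
FROM Investigator I
GROUP BY I.name
HAVING COUNT(DISTINCT I.aid) >= 5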
Our Contributions • Formalize richer explanations beyond simple conjunctive predicates • By explanation-ready databases • Rank these explanations efficiently • Support queries beyond single-block queries
General Framework for Explanations
Example Φ: "PIs with >= $10 million in awards"
R1: Award(aid, amount, title, year, startdate, enddate, dir, div)
R2: Institution(aid, inst, address)
R3: Investigator(aid, name, email)
• Step 1: Find the investigators with this property: Δ3,Φ
• Step 2: Find the tuples from the other tables affected by Δ3,Φ • Δ1,Φ = awards by investigators in Δ3,Φ • Δ2,Φ = ∅
• Φ is more complex than a conjunctive predicate and requires tables outside the query, but its intervention can still be computed by queries
• Intervention of Φ: ΔΦ = (Δ1,Φ, Δ2,Φ, …, Δk,Φ)
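A sketch of the two steps for this Φ in SQL (the $10M threshold is from the slide; the exact query shape is our assumption):

-- Step 1 (Δ3,Φ): investigators whose awards total >= $10M
WITH BigPI AS (
  SELECT I.name
  FROM Investigator I JOIN Award A ON I.aid = A.aid
  GROUP BY I.name
  HAVING SUM(A.amount) >= 10000000
)
-- Step 2 (Δ1,Φ): the Award tuples held by those investigators
SELECT DISTINCT A.*
FROM Award A
JOIN Investigator I ON A.aid = I.aid
JOIN BigPI P ON I.name = P.name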
Explanation-Ready Databases (ERD): 1/2
• Fix a small set of templates (NSF: ~8)
• Get their candidate explanations by running SQL queries (NSF: ~188k)
• For all candidate explanations Φ, pre-compute, again by SQL queries: ΔΦ = (Δ1,Φ, Δ2,Φ, …, Δk,Φ)
• Store them in k tables (Δ1, Δ2, …, Δk), indexed by a new attribute Φ; descriptions are stored separately
[Figure: k = 2 relations R1, R2 and three explanation ids Φ ∈ {1, 2, 3}; the rows of Δ1 and Δ2 with Φ = 2 form the intervention for Φ = 2]
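One way this pre-computation could be materialized (a sketch; the table name Delta_Award, the surrogate id phi, and the window function are our own choices, not the paper's):

-- Δ1 for the template "PIs with >= 5 awards": one row per (Φ, affected Award tuple)
CREATE TABLE Delta_Award AS
SELECT DENSE_RANK() OVER (ORDER BY P.name) AS phi,   -- explanation id Φ
       A.aid, A.amount, A.dir, A.year
FROM Award A
JOIN Investigator I ON A.aid = I.aid
JOIN (SELECT name FROM Investigator
      GROUP BY name
      HAVING COUNT(DISTINCT aid) >= 5) P
  ON I.name = P.name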
Explanation-Ready Databases (ERD): 2/2
• The new database (D, Δ) is ready to find good explanations: an ERD
• Given a query Q and a question q: • compute q[D - ΔΦ] for all Φ • output the top explanations ranked by their intervention on the question, q[D] - q[D - ΔΦ]
[Figure: as before, the rows of Δ1 and Δ2 with Φ = 2 form the intervention for Φ = 2]
An ERD (D, Δ) depends only on D
• ΔΦ depends on Φ and the semantics of D
• An ERD does not depend on the query Q or the user question q
• The ERD for the NSF dataset can be used to answer other questions: • Query 1, Question 1a: why is UIUC $ > CMU $? • Query 1, Question 1b: why is UCSD $ > MIT $? • Query 2, Question 2: why is the avg $ of ACI high?
Explanation query q* on an ERD
Goal: given an ERD (D, Δ), compute q[D - ΔΦ] for all Φ (to rank them)
• Naïve method: iterate over all Φ and compute q[D - ΔΦ], possibly using "incremental view maintenance" (IVM); but the number of Φ-s is large, so this "for loop" is slow
• Our approach: evaluate all Φ-s simultaneously by running one explanation query q*[D, Δ]
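To give a flavor of the single-query idea for our SUM-based question, a simplified sketch using the hypothetical Delta_Award table from before: for SUM, deleting ΔΦ just subtracts its contribution, so one group-by on the explanation id scores every Φ in a single pass (the paper's q* handles general query plans via annotated relations; this is only the SUM special case):

WITH q0 AS (            -- q[D], computed once
  SELECT SUM(CASE WHEN B.inst = 'UIUC' THEN A.amount ELSE 0 END)
       - SUM(CASE WHEN B.inst = 'CMU'  THEN A.amount ELSE 0 END) AS gap
  FROM Award A JOIN Institution B ON A.aid = B.aid
  WHERE A.dir = 'CS' AND A.year >= 1990
)
SELECT DA.phi,          -- q[D - ΔΦ] for every Φ, in one pass
       q0.gap - SUM(CASE WHEN B.inst = 'UIUC' THEN DA.amount
                         WHEN B.inst = 'CMU'  THEN -DA.amount
                         ELSE 0 END) AS gap_after
FROM Delta_Award DA
JOIN Institution B ON DA.aid = B.aid
CROSS JOIN q0
WHERE DA.dir = 'CS' AND DA.year >= 1990   -- ΔΦ tuples must satisfy Q's predicate
GROUP BY DA.phi, q0.gap
ORDER BY gap_after ASC                    -- biggest drop in the gap first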
Tables as "Annotated Relations"
• Each tuple maps to a numeric annotation (initially 1)
• As the query progresses, annotations are propagated through the query plan, simultaneously for all Φ-s
[Figure: the relations of D and Δ, every tuple initially annotated with 1]
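A toy illustration of how annotations could be carried in plain SQL (our simplified encoding, not the paper's formalism): each tuple gets a count column, a join multiplies the annotations, and a projection/group-by sums them:

-- start every tuple with annotation 1; the join multiplies annotations,
-- the final group-by (projection) sums them
SELECT A.aid, B.inst, SUM(A.ann * B.ann) AS ann
FROM (SELECT aid, 1 AS ann FROM Award) A
JOIN (SELECT aid, inst, 1 AS ann FROM Institution) B ON A.aid = B.aid
GROUP BY A.aid, B.inst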
Invariant maintained by q*
For all explanations Φ, and for all partial or final queries q:
    q[D ⊖ ΔΦ] = q[D] ⊖ q*Φ[D, Δ]
where q*Φ is the restriction of q* to Φ, and ⊖ is difference considering both sets and annotations. That is, first applying the intervention of Φ and then evaluating the query is equivalent to first evaluating the query and the explanation query, and then applying the intervention by restricting to Φ.
In the paper… • Addition ⊕, difference ⊖, intersection ⊙ for annotated relations • Rules and algorithms for all operators maintaining the invariants • Examples and necessary restrictions on the inputs
ERDs can go beyond single-block queries
• e.g., they can handle a trend: the slope of a linear regression on a derived measure
WITH R AS (SELECT Year AS x, AVG(Revenue - Expenses) AS y
           FROM Sales
           WHERE Year >= 2007 AND Year <= 2010
           GROUP BY Year)
SELECT ((COUNT(R.x) * SUM(R.x*R.y)) - (SUM(R.x) * SUM(R.y)))
     / ((COUNT(R.x) * SUM(R.x*R.x)) - (SUM(R.x) * SUM(R.x)))
FROM R
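For reference, the outer SELECT computes the ordinary least-squares slope over the n pairs (x_i, y_i) produced by R, with n = COUNT(R.x) (the standard formula, stated in LaTeX):

\[ \mathrm{slope} \;=\; \frac{n \sum_i x_i y_i \;-\; \bigl(\sum_i x_i\bigr)\bigl(\sum_i y_i\bigr)}{n \sum_i x_i^2 \;-\; \bigl(\sum_i x_i\bigr)^2} \]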
Experiments (IVM = incremental view maintenance)
• Algorithms: • SingleQ-IVM = our explanation query • Iter-IVM = iteratively run similar IVM • Naïve-Iter = iterative naïve approach • Naïve-SingleQ = run a single query, but with an expensive cross-product
• Parameters: • |D| = n • number of explanations |Φ| • average intervention size for explanations: avgΦ |ΔΦ| • complexity of queries (group-by / single output, nested)
Experiments • Datasets: • (performance) synthetic datasets built from “Natality” dataset (4M rows, 233 attributes) • (qualitative evaluation) “NSF” dataset • Award (400k), Institution (419k), and Investigator (621k) • |Φ| = 188k of 8 different types • |Δ| = 1.74M • Also “Flights” dataset (in the paper)
Experiments: # of tuples
[Plot: running time vs. #tuples for the naïve and iterative-IVM baselines, our algorithm, and the original query Q[D]]
• The running time of our algorithm does not increase rapidly with #tuples, until the final join to compute the difference Q[D] - Q[D - Δ]
Experiments: # of explanations
[Plot: iterative IVM's running time increases rapidly with #explanations; our algorithm for Q[D - Δ] and for the difference Q[D] - Q[D - Δ] stays low, shown alongside the original query Q[D]]
• The running time of our algorithms does not increase rapidly with #explanations
• Computing the difference Q[D] - Q[D - Δ] is faster than computing Q[D - Δ], and can even be faster than the original query
Experiments: group size |ΔΦ|
• Running time degrades when the group size is ~100k for each of 1k explanations: |D| = 4M, but |Δ| = 50M!
• In practice, not all explanations are likely to have a huge ΔΦ
Experiments: IVM (DBToaster) vs. ERD [Ahmad et al. 2012]
• IVM: compute Q[D - Δ] (actual deletion; fixed query, dynamic updates)
• ERD: compute Q[D - Δ1], Q[D - Δ2], …, Q[D - ΔP] (hypothetical deletions; dynamic queries, fixed updates)
• We iteratively run DBToaster, a state-of-the-art IVM system
• ERD (our algorithm) outperforms iterative IVM even when IVM's time to read the data and to compile the input query is ignored
Explanations for the NSF example
Top explanations:
• Investigators with more than $100M in awards in total (there were none at CMU)
• Investigators with >= 11 awards, and with 7-8 awards
• Investigators 'RP' and 'TD' from UIUC (who share some awards)
• Awards with 6 or more investigators
• Each accounts for $400-850 million of the difference in total award money between UIUC and CMU
• The explanations above are diverse
• They give the insight that a few researchers at UIUC got a number of big awards (in cyber-infrastructure, ACI)
• This insight could not be obtained if we were restricted to simple predicates from the Award and Institution tables and ignored Investigator altogether
Summary
• We need to put effort into drawing interesting insights from data that can answer our questions
• We propose ERDs that can handle • complex explanations • complex queries
• However, there are restrictions and limitations (bag semantics, negation), space overhead, and manually defined templates: scope for future work
Thank you