Generating Test Data for Killing SQL Mutants: A Constraint Based Approach

Generating Test Data for Killing SQL Mutants: A Constraint Based Approach Shetal Shah, S. Sudarshan, Suhas Kajbaje, Sandeep PatidarBhanu Pratap Gupta, Devang Vira CSE Department, IIT Bombay ICDE 2011

Testing SQL Queries: A Challenge • Complex SQL queries hard to get right • Question: How to check if an SQL query is correct? • Formal verification is not applicable since we do not have a separate specification and an implementation • State of the art solution: manually generate test databases and check if the query gives the intended result • Often misses errors ICDE 2011

deptnameGcount(instrID) deptnameGcount(instrID) σsal>100000 σsal>100000 instructor instructor dept dept Example • Find number of high paid instructors in each department Problem: does not output departments with no high paid instructors Test dataset should be able to distinguish these two alternatives ICDE 2011

Generating Test Data: Prior Work • Automated Test Data generation • Initial work based on database constraints alone • Later work: DB constraints + SQL query • Agenda, Reverse Query Processing (RQP), QAGen • Our earlier work in ICDE 2010 (poster) on killing SQL mutants • More recent (independent) de la Riva et al [AST10] consider test data generation for killing mutants • But limited space of mutations, no guarantees of completeness • None of the above guarantee anything about detecting errors in SQL queries • Question: How do you model SQL errors? • Answer: Query Mutation ICDE 2011

Query Mutations • Mutations: model common programming errors • Join used instead of outerjoin (or vice versa) • Join/selection condition errors • < vs. <=, missing or extra condition • Wrong aggregate (min vs. max) • Mutant: syntactic variation of the given query Q • Mutant may be the intended query • A dataset D kills a mutant M of query Q if output of Q and M differ on D • For a mutation space, a dataset suite is complete if all non-equivalent mutants are killed ICDE 2011

Mutation Testing of Queries • Mutation testing can be used to check coverage of a test suite • Given a test suite, generate mutants in a pre-defined mutation space; evaluate on a given test suite • Our goal is different: • Generate datasets for the given query, based on possible mutations • Each dataset and query result on the dataset shown to a human, who verifies that the query result is what is expected on that dataset • We do not need to actually generate and execute mutants ICDE 2011

Query and Mutation Space Query: Single Block SQL Query with unconstrained aggregation at the top Mutation Space: • Join Type Mutation : change of any of inner join, left outer join, right outer join, full outer join to another • Across all possible join orders (unlike earlier work) • Comparison Operator Mutations: An occurrence of one of (=, <, >, <=, >=, <>) in the WHERE clause replaced by another • Unconstrained Aggregation Mutations: • MIN ↔ MAX, SUM ↔ SUM(DISTINCT)AVG ↔ AVG(DISTINCT), COUNT ↔ COUNT(DISTINCT) ICDE 2011

Killing Join Mutants: Basic Idea • Schema: r(A), s(B) • To kill this mutant: ensure that for some r tuple there is no matching s tuple • Generated test case • r(A) ={(1)}; s(B) ={} • Basic idea, version 1 [ICDE 2010] • run query on given database, • from result extract matching tuples for r and s • delete s tuple to ensure no matching tuple for r • Limitation: foreign keys, repeated relations ICDE 2011

Need to Ensure Difference in Final Query Result • Schema: r(A,B), s(C,D), t(E); Extra join above mutated node • To kill this mutant we must ensure that for an r tuple there is no matching s tuple, but there is a matching t tuple • Generated test case: r(A,B)={(1,2)}; s(C,D)={}; t(E)={(2)} • Our approach: generate not-matching constraints of the form∃ r1 in r, t1 in t s.t. (r1.B=t1.E and ∃si in s s.t. (si.C=r1.A))and solve using a solver • Informally, we “nullify” s to kill the above mutation ICDE 2011

Working around Foreign Key Constraints • teaches instructoris equivalent to teaches instructor if there is a foreign key from teaches.ID to instructor.ID • BUT: teachesσdept=CS(instructor)is not equivalent to teaches σdept=CS(instructor) • Key idea: have a teaches tuple with an instructor not from CS • Selections and joins can be used to kill mutations • We show that it comes for free when we “nullify” relations to kill other join mutations or generate data to kill selection mutations ICDE 2011

Join Orders and Equivalent Mutations • Schema: r(A,B), s(C,D), t(E), Mutation: rs • Any result with a r.B being null will be removed by join with t • Similarly equivalence can result due to selections • But mutation of s to s would be non equivalent with join order (r t) s • SQL Query may not specify join orders; many join trees possible; the datasets should kill mutants across all join orders • Note: de la Riva [AST10] do not consider alternative join orders ICDE 2011

Equivalent Trees and Equivalence Classes of Attributes • Whether query conditions written as • A.x = B.x AND B.x = C.x or as • A.x = B.x AND A.x = C.x • should not affect set of mutants generated • Solution: Equivalence classes of attributes ICDE 2011

Data Generation Algorithm • Assumptions • Only PK/FK constraints, single block SQL queries, only simple arithmetic expressions, predicates are purely conjunctive (and a few more in paper) • Algorithm Overview • For each dataset generate constraints. • DB constraints : primary key, foreign key constraints • Query constraints: join conditions, selection conditions, other than conditions inconsistent with non-matching constraints (see below) • Constraints to kill mutations: not-matching constraints derived from join/selection conditions • One constraint per dataset • Generate Test Data • Using a constraint solver (currently CVC3) on above constraints ICDE 2011

Main Procedure: generateDataSet(query q) • preprocess query tree • build equivalence classes • foreign key closure • generateDataSetForOriginalQuery() A constraint set with only DB and Query constraints such that the result set for q is non-empty • killEquivalenceClasses() • killOtherPredicates() • killComparisonOperators() • killAggregates() ICDE 2011

Killing Join Mutants: Equijoin(killEquivalenceClasses) • Query : r s t where r.A = s.B and s.B =t.C • r.A, s.B, t.C are in one equivalence class • Generate one dataset for each element of each equivalence class • Three datasets generated, with not-matching mutation constraints: • D1: ∃r1εr ∃s1εs (r1.A = s1.B ∧ ∃tiεt (ti.C=r1.A)) • D2: ∃s1εs ∃t1εt (s1.B = t1.C ∧ ∃riεr (ri.A=s1.B)) • D3: ∃r1εr ∃t1εt (r1.A = t1.C ∧ ∃siεs (si.B=t1.C)) • But if t.C is an FK referencing s.B, instead generate • D3’: ∃r1εr (∃tiεt, siεs (si.B=r1.A ∧ ti.C=r1.A)) ICDE 2011

Comparison Operation Mutations • Example of comparison operation mutations: • A < 5 vs. A <= 5 vs. A > 5 vs A >= 5 vs. A=5, vs A <> 5 • Idea: generate three datasets (leaving rest of query unchanged), one each for: • A < 5 • A = 5 • A > 5 • These datasets will kill all above mutations ICDE 2011

Aggregation Operation Mutations • Aggregation operations • count(A) ↔ count(distinct A); sum(A) ↔ sum(distinct A) • avg(A) ↔ avg(distinct A); min(A) ↔ max(A) • Eg: selectmin (salary) frominstructorgroup bydept_name; • Dataset Generated: [CS, Srinivasan, 80000] [CS, Wu, 80000] [CS, Alex, 65000] will kill each of the above pairs of mutations • Issues: • Database/query constraints forcing A to be unique • Database/query constraints forcing G to be unique • Carefully crafted set of constraints, which are relaxed to handle such cases • Details in paper One group with two values equal (and <> 0), Third different ICDE 2011

On Completeness and Complexity • With respect to query size: • Number of datasets generated: linear • Number of mutants: exponential • Theorem: The problem of generating data to kill a mutant is NP-hard • reduction from query containment; polynomial in special cases. • Theorem: For the class of queries, and the space of join-type and selection mutations considered, the suite of datasets generated by our algorithm is complete, i.e., the datasets kill all non-equivalent mutations of a given query • Also complete for a restricted classes of aggregation mutations with aggregation as top operation of tree, under some restrictions on joins in input ICDE 2011

Performance Results • University database schema from Database System Concepts 6th Ed consisting of 7 relations • Further Optimizations : Unfolding of bounded first order quantified constraints • E.g. “for all i in 1 to 4, P(i)” unfolded as P(1) ∧ P(2) ∧ P(3) ∧ P(4) • Also tested with addition of input database constraints • Forcing output tuples to come from input database ICDE 2011

Results for inner join queries • # Datasets generated small; # mutants (killed) grows exponentially • Unfolding of constraints gives substantial benefits • Increasing # foreign keys reduces the total time taken • Results similar for queries with left/right outer join ICDE 2011

Results for queries with selections, aggregations • Unfolding reduces times taken, epsecially when the number of tuples are large (aggregations with joins) • (Not shown above) using input database increases time taken but not by much ICDE 2011

Conclusions and Future Work • Presented a principled approach to data generation for testing queries • And showed it works efficiently for a large class of queries • We’ve only made first steps; more work to be done: • Constrained aggregations and sub-queries • Other mutations: e.g., missing/extra join and selection conditions • Multiple queries • Query and form parameters • Acknowledgements • Bhupesh Chawda for discussions/implementation help • Microsoft Research India for their generous travel support ICDE 2011

Thank You Questions ICDE 2011

Related Work • Tuya and Suarez-Cabal [IST07], Chan et al. [QSIC05] defined a class of SQL query mutations • Shortcoming: do not address test data generation • More recently (and independent of our work) de la Riva et al [AST10] address data generation using constraints, with the Alloy solver • Do not consider alternative join orders, No completeness results, Limitations on constraints ICDE 2011

Dataset for Original Query • Array of tuples of constraint variables, per relation • One or more constraint tuple from array, for each occurrence of a relation • plus occurrences to ensure f.k. constraints • Equality conditions between variables based on equijoins • Other selection and join conditions become constraints • Constraints for primary and foreign keys • f.k. from R.A to S.B • FORALL i EXISTS j (O_R[i].A = O_S[j].B); • p.k. on R.A • FORALL i FORALL j (O_R[i].A = O_R[j].A]) => “all other attrs equal” • since range is over finite domain, p.k. and f.k. constraints can be unfolded • See sample set of constraints in file cvc3_0.cvc ICDE 2011

Killing Join Mutants: Equijoin • killEquivalenceClasses() • for each equivalence class ec do • Let allRelations := Set of all <rel, attr> pairs in ec • for each element e in allRelations do • conds := empty set • Let e := R.a • S := (set of elements in ec which are foreign keys referencing R.a directly or indirectly) UNION R:a • P := ec - S • if P:isEmpty() then • continue • else … main code for generating constraints (see next slide) ICDE 2011

Killing Join Mutants: EquiJoins • conds.add(generateEqConds(P)) • conds:add( • “NOT EXISTS i: R[i].a = ” + cvcMap(P[0])) • for all other equivalence classes oe do • conds.add(generateEqConds(oe)) • for each other predicate p do • conds:add(cvcMap(p)) • conds.add(genDBConstraints()) /*P.K. and F.K*/ • callSolver(conds) • if solution exists then • create a dataset from solver output ICDE 2011

Killing Other Predicates • Create separate dataset for each attribute in predicate • e.g. For Join condition B.x = C.x + 10 • Dataset 1 (nullifying B:x): • ASSERT NOT EXISTS (i : B_INT) : (B[i].x = C[1].x + 10); • Dataset 2 (nullifying C:x): • ASSERT NOT EXISTS (i : C_INT) : (B[1].x = C[i].x + 10); ICDE 2011

Generating Test Data for Killing SQL Mutants: A Constraint Based Approach