This study explores strategies to mitigate the effects of flaky tests on mutation testing, whose results are used to compare test suites by mutation score and to guide testing based on the mutant-test matrix. The results show that flakiness can significantly affect mutation testing and that mitigation strategies based on reruns and isolation are effective.
Mitigating the Effects of Flaky Tests on Mutation Testing
August Shi, Jonathan Bell, Darko Marinov
ISSTA 2019, Beijing, China, 7/18/2019
Grants: CCF-1421503, CNS-1646305, CNS-1740916, CCF-1763788, CCF-1763822, OAC-1839010
Mutation Testing
[Diagram: tests test1, test2, test3 run against the code under test and against each mutant; Mutant 1 is Killed, Mutant 2 Survived, forming a mutant-test matrix]
• Compare test suites by mutation score
• Guide testing based on the mutant-test matrix
A small sketch of the mutation-score computation follows below.
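To make the mutation score concrete, here is a minimal Java sketch (not PIT's API; the matrix representation and class name are hypothetical) that computes the score from a mutant-test matrix as the fraction of mutants killed by at least one test:

import java.util.List;
import java.util.Map;

class MutationScore {
    // The matrix maps each mutant id to its per-test outcomes
    // (true = the test failed on that mutant, i.e., the test kills it).
    static double score(Map<String, List<Boolean>> matrix) {
        if (matrix.isEmpty()) {
            return 0.0;
        }
        long killed = matrix.values().stream()
                .filter(outcomes -> outcomes.stream().anyMatch(Boolean::booleanValue))
                .count();
        return (double) killed / matrix.size();
    }
}

With flaky tests, individual entries of such a matrix become nondeterministic, which is the problem the next slides illustrate.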
Mutation Testing with Flaky Tests
[Diagram: with flaky tests, Run 1 and Run 2 of the same mutant-test matrix can disagree (Mutant 1: Killed? Mutant 2: Survived?); the results are still flaky]
• Get a test suite with deterministic outcomes
  • Debug/fix flaky tests [1]
  • Remove/ignore flaky tests
[1] August Shi et al. "iFixFlakies: A Framework for Automatically Fixing Order-Dependent Tests". ESEC/FSE 2019
Flaky Coverage Example
• Other reasons for flakiness:
  • Concurrency
  • Randomness
  • I/O
  • Order dependency

public class WatchDog {
  ...
  public void run() {
    ...
    synchronized (this) {
      long timeLeft = timeout - (System.currentTimeMillis() - startTime);
      // Whether the loop body runs depends on timing:
      // if timeLeft <= 0, isWaiting is false and wait() is never reached.
      isWaiting = timeLeft > 0;
      while (isWaiting) {
        ...
        wait(timeLeft);
        ...
      }
    }
    ...
  }
}

public void test() {
  new WatchDog(...).run();
  ...
}
Motivating Study
• Measure flakiness of coverage
  • 30 open-source GitHub projects from prior work
  • No flaky test outcomes! (all 35,850 tests pass in 17 runs)
• Rerun tests and measure differences in coverage (a sketch of such a comparison follows below)
  • 113,356 (22%) statements are covered by different tests across runs
  • 5,736 (16%) tests cover different statements across runs
• Lots of flakiness in coverage, even without flaky outcomes!
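A sketch of how such coverage differences could be counted, assuming a simple map-based representation (hypothetical types; the study's actual measurement tooling is described in the paper): for each statement, compare the set of tests covering it in two runs.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CoverageDiff {
    // Each run maps a statement id to the set of tests covering that statement.
    // Counts statements whose covering-test sets differ between the two runs.
    static long statementsWithFlakyCoverage(Map<String, Set<String>> run1,
                                            Map<String, Set<String>> run2) {
        Set<String> allStatements = new HashSet<>(run1.keySet());
        allStatements.addAll(run2.keySet());
        return allStatements.stream()
                .filter(stmt -> !run1.getOrDefault(stmt, Set.of())
                                     .equals(run2.getOrDefault(stmt, Set.of())))
                .count();
    }
}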
Mutation Testing with Flaky Coverage

public class WatchDog {
  ...
  public void run() {
    ...
    synchronized (this) {
      long timeLeft = timeout - (System.currentTimeMillis() - startTime);
      isWaiting = timeLeft > 0;
      while (isWaiting) {
        ...
        wait(timeLeft);   // Mutation: delete this call
        ...
      }
    }
    ...
  }
}

public void test() {
  new WatchDog(...).run();
  ...
}

Mutation not covered! In a run where timeLeft <= 0, the test never executes the mutated code, so the mutant cannot be judged Killed or Survived.
Mutation Testing Results are Unreliable
• Flakiness can shift mutation testing results
  • Mutation scores may be inflated/deflated
  • Mutant-test matrix is unreliable
• Need to mitigate the effects of flakiness on mutation testing!
  • Mitigation strategies based on reruns and isolation [2]
  • Implemented on PIT, a popular mutation testing tool for Java
https://doi.org/10.6084/m9.figshare.8226332
https://github.com/hcoles/pitest/pull/534
https://github.com/hcoles/pitest/pull/545
[2] Jonathan Bell et al. "DeFlaker: Automatically Detecting Flaky Tests". ICSE 2018
Mitigating Flakiness in Mutation Testing
Traditional mutation testing pipeline, with improvements to cope with flakiness:
• Full test-suite coverage collection (determines the mutants to test) → rerun and isolate tests
• Test-mutant prioritization (sorted tests per mutant) → run tests with the least flaky coverage first (see paper)
• Mutant execution → track mutations covered; rerun/isolate tests
Coverage Collection
• When running tests multiple times, take the union of coverage across runs (a sketch follows below)
• More lines covered means more mutants generated
• Run tests in isolation to remove test-order dependencies
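A minimal sketch of the union step, assuming a simple map-based coverage representation (hypothetical, not PIT's internal data structures): a test is treated as covering every line it covered in any of the (isolated) runs.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UnionCoverage {
    // Each run maps a test name to the set of line ids it covered;
    // the merged map keeps a line for a test if any run saw the test cover it.
    static Map<String, Set<String>> union(List<Map<String, Set<String>>> runs) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Map<String, Set<String>> run : runs) {
            run.forEach((test, lines) ->
                    merged.computeIfAbsent(test, t -> new HashSet<>()).addAll(lines));
        }
        return merged;
    }
}

More lines in the union means more places to mutate, hence more mutants generated.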
Executing Tests on Mutants
• Monitor whether tests actually execute the mutated bytecode
• Traditionally, a mutant-test pair has status Killed or Survived
  • Only applicable if the test executes the mutated bytecode
• A mutant-test pair whose test does not execute the mutated bytecode gets the new status Unknown
  • The test can potentially cover the mutation, based on prior coverage
New Status for Mutants
• Overall mutant status depends on the statuses of all mutant-test pairs run for the mutant
  • Killed (mutation covered)
  • Survived (mutation covered)
  • Unknown (mutation not covered)
• Need to reduce the number of Unknown mutants and pairs
A sketch of the pair- and mutant-level status logic follows below.
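The pair-level rule from the previous slide and the mutant-level rule from this one could look roughly like this sketch (the status names come from the talk; the exact aggregation code is an assumption about the implementation):

import java.util.Collection;

enum PairStatus { KILLED, SURVIVED, UNKNOWN }

class MutantStatusAggregator {
    // A pair is Killed/Survived only if the test actually executed the mutated
    // bytecode; otherwise it is Unknown (the test could cover the mutation based
    // on prior coverage, but did not reach it in this run).
    static PairStatus pairStatus(boolean executedMutatedBytecode, boolean testFailed) {
        if (!executedMutatedBytecode) {
            return PairStatus.UNKNOWN;
        }
        return testFailed ? PairStatus.KILLED : PairStatus.SURVIVED;
    }

    // Overall mutant status: Killed if any pair killed it, Survived if every pair
    // covered the mutant and none killed it, Unknown otherwise.
    static PairStatus mutantStatus(Collection<PairStatus> pairStatuses) {
        if (pairStatuses.contains(PairStatus.KILLED)) {
            return PairStatus.KILLED;
        }
        if (!pairStatuses.isEmpty()
                && pairStatuses.stream().allMatch(s -> s == PairStatus.SURVIVED)) {
            return PairStatus.SURVIVED;
        }
        // Some pair is Unknown (or no pair ran), so the mutant's fate is undecided.
        return PairStatus.UNKNOWN;
    }
}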
Rerunning Mutant-test Pairs
• While the status of a mutant-test pair is Unknown, rerun it
• Change the isolation level during reruns (increasing isolation, increasing cost):
  • Default: mutant-test pairs for mutants in the same class run in the same JVM
  • More Isolation: mutant-test pairs for the same mutant run in the same JVM
  • Most Isolation: each mutant-test pair runs in its own JVM
A sketch of this rerun policy follows below.
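A sketch of this rerun policy (reusing the PairStatus enum from the previous sketch; the isolation level names, the runner interface, and the per-level rerun budget are assumptions for illustration): while a pair stays Unknown, rerun it and escalate the isolation level, up to one JVM per pair.

enum Isolation { DEFAULT, SAME_MUTANT_JVM, OWN_JVM }

class RerunPolicy {
    interface PairRunner {
        // Runs one mutant-test pair at the given isolation level and reports its status.
        PairStatus run(String mutantId, String testName, Isolation isolation);
    }

    // Rerun while the status stays Unknown, stepping up isolation each time the
    // budget for the current level is exhausted.
    static PairStatus resolve(PairRunner runner, String mutantId, String testName,
                              int rerunsPerLevel) {
        for (Isolation level : Isolation.values()) {
            for (int i = 0; i < rerunsPerLevel; i++) {
                PairStatus status = runner.run(mutantId, testName, level);
                if (status != PairStatus.UNKNOWN) {
                    return status;
                }
            }
        }
        return PairStatus.UNKNOWN;
    }
}

Escalating only after cheaper reruns fail keeps the number of expensive, fully isolated JVM runs low.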
Experimental Setup
• Evaluate on the same 30 projects as in the motivating study
• All modifications built on top of the PIT mutation testing tool
• RQ1: How much flakiness is there in traditional mutation testing?
• RQ2: What is the effect of coverage on the mutants generated? (see paper)
• RQ3: What is the effect of re-executing tests on mutant status?
• RQ4: How should tests be prioritized for mutant-test executions? (see paper)
RQ1: Flakiness in Traditional Mutation Testing
• Max difference in mutation score is up to 23 percentage points!
  • Must improve mutation scores by more than this variance!
• 9% of mutant-test pairs are Unknown (max up to 55%)!
  • Matrix results can be unreliable
RQ3: Mutant Re-execution Results
• Increasing isolation greatly increases the number of covered pairs
• Unnecessary to rerun too often with the most isolation
Discussion
• Flakiness can have negative effects beyond mutation testing
• Tools/studies that rely on coverage must consider flakiness
  • Fault localization, program repair, test prioritization, test-suite reduction, test selection, test generation, runtime verification, ...
• Mitigation strategies are applicable beyond mutation testing
  • Different isolation strategies for different tasks
Conclusions
• Even seemingly non-flaky tests have flaky coverage
  • 22% of statements are not covered consistently!
• We present problems in mutation testing due to flakiness
• We propose techniques to mitigate the effects
  • Different combinations of reruns and isolation
  • We reduce Unknown mutants/pairs by 79.4%/88.1%
• Flakiness can have negative effects beyond mutation testing
awshi2@illinois.edu
https://doi.org/10.6084/m9.figshare.8226332
Prioritizing Tests for Mutants
• Run mutant-test pairs in the order that determines the overall mutant status faster and more reliably
  • Once the mutant status is known, no need to run more pairs
• Prioritize tests per mutant based on coverage
  • Tests with more "stable" coverage of the mutant are prioritized earlier
  • Then prioritize based on execution time
• When to rerun?
  • Immediately rerun the pair?
  • Run all pairs first before rerunning?
A sketch of one possible prioritization follows below.
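One possible realization of this ordering, as a hedged Java sketch (the TestInfo fields and the coverage-stability metric are illustrative assumptions, not the paper's exact definitions): sort the tests that may cover a mutant by how stably they covered the mutated code across coverage runs, breaking ties by execution time.

import java.util.Comparator;
import java.util.List;

class TestInfo {
    final String name;
    final double coverageStability; // fraction of coverage runs in which the test covered the mutated code
    final long runtimeMillis;       // average execution time of the test

    TestInfo(String name, double coverageStability, long runtimeMillis) {
        this.name = name;
        this.coverageStability = coverageStability;
        this.runtimeMillis = runtimeMillis;
    }
}

class TestPrioritizer {
    // Most stable coverage first; among equally stable tests, fastest first.
    static void prioritize(List<TestInfo> testsCoveringMutant) {
        testsCoveringMutant.sort(
                Comparator.comparingDouble((TestInfo t) -> t.coverageStability).reversed()
                          .thenComparingLong(t -> t.runtimeMillis));
    }
}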
RQ2: Coverage and Mutant Generation
• Not much difference in the numbers of mutants and pairs
• Can potentially use Default for mutant generation