Basic Definitions: Testing

Basic Definitions: Testing • What is software testing? • Running a program • In order to find faults • a.k.a. defects • a.k.a. errors • a.k.a. flaws • a.k.a. faults • a.k.a. BUGS • Hrm. . . that’s a lot of “a.k.a”s • Let’s refine this terminology a bit

Faults, Errors, and Failures • Fault: a static flaw in a program • What we usually think of as “a bug” • Error: a bad program state that results from a fault • Not every fault always produces an error • Failure: an observable incorrect behavior of a program as a result of an error • Not every error ever becomes visible

To Expose a Fault with a Test • Reachability: the test must actually reach and execute the location of the fault • Infection: the fault must actually corrupt the program state (produce an error) • Propagation: the error must persist and cause an incorrect output – a failure

An Example
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }
• Find the fault

An Example
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }
• Here’s a test case: a = {}, n = 0, x = 2
• Does not even reach the fault

An Example
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }
• Here’s another: a = {3, 9, 4}, n = 3, x = 2
• Reaches the fault
• Infects state with error
• But no failure

An Example
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }
• And finally: a = {2, 9, 4}, n = 3, x = 2
• Reaches the fault
• Infects state with error
• And fails: returns -1 instead of 0
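The three test cases above can be replayed mechanically. The following is a minimal sketch, not part of the original slides: `run_rip_cases` is a hypothetical helper name, and the empty array is represented by a one-element placeholder with n = 0, since standard C does not allow a zero-length initializer.

```c
#include <assert.h>

/* The faulty routine from the slides: the loop condition should be i >= 0. */
int findLast(int a[], int n, int x) {
    int i;
    for (i = n - 1; i > 0; i--) {
        if (a[i] == x)
            return i;
    }
    return -1;
}

/* Hypothetical helper: runs the three test cases and returns how many
   of them produced a failure (an observably wrong result). */
int run_rip_cases(void) {
    int failures = 0;
    int none[1] = {0};        /* placeholder storage; n = 0 means "empty" */
    int a2[] = {3, 9, 4};
    int a3[] = {2, 9, 4};

    /* Case 1: the loop body never runs; the result -1 is correct. */
    if (findLast(none, 0, 2) != -1) failures++;
    /* Case 2: a[0] is skipped, but it is not 2, so -1 is still correct. */
    if (findLast(a2, 3, 2) != -1) failures++;
    /* Case 3: a[0] == 2 is skipped, so we get -1 where 0 was expected. */
    if (findLast(a3, 3, 2) != 0) failures++;
    return failures;
}
```

Only the third case satisfies reachability, infection, and propagation together, so `run_rip_cases()` counts exactly one failure.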

Controllability and Observability • Goals for a test case: • Reach a fault • Produce an error • Make the error visible as a failure • In order to make this easy the program must be controllable and observable • Controllability: • How easy it is to drive the program where we want to go • Observability: • How easy it is to tell what the program is doing

Design for Testability • If a program is not designed to be controllable and observable, it generally won’t be • We have to start preparing for testing before we write any code • Testing as an after-the-fact, ad hoc, exercise is often limited by earlier design choices

Test-Driven Development • One way to design for testability is to write the test cases before the code • Idea arising from Extreme Programming and agile development • Write automated test cases first • Then write the code to satisfy tests • Helps focus attention on making software well-specified • Forces observability and controllability: you have to be able to handle the test cases you’ve already written (before deciding they were impractical) • Reduces temptation to tailor tests to idiosyncratic behaviors of implementation

Controllability: Simulation and Stubbing • A key to controllable code is effective simulation and stubbing • Simulation of low-level hardware devices through a clean driver interface • Real hardware may be slow • May be impossible/expensive to induce some hardware failure modes on real hardware • Real hardware may be a limited resource • Stubbing for other routines and code • Other code/modules may not be complete • May be slow and irrelevant to test • May need to simulate failure of other modules
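One common way to get this kind of control in C is to put the device behind a small table of function pointers, so a test can swap in a simulator that fails on demand. The sketch below is illustrative only: `FlashDriver`, `SimFlash`, `sim_write_page`, and `store_pages` are hypothetical names, not an actual driver API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver interface: the code under test only sees this struct. */
typedef struct {
    int (*write_page)(void *ctx, int page, const unsigned char *data, size_t len);
    void *ctx;
} FlashDriver;

/* Simulated device state: fails the write numbered fail_on_write
   (1-based); 0 means "never fail". */
typedef struct {
    int writes_seen;
    int fail_on_write;
} SimFlash;

int sim_write_page(void *ctx, int page, const unsigned char *data, size_t len) {
    SimFlash *sim = (SimFlash *)ctx;
    (void)page; (void)data; (void)len;  /* a fuller simulator would store the data */
    sim->writes_seen++;
    return (sim->writes_seen == sim->fail_on_write) ? -1 : 0;
}

/* Code under test: writes `count` pages and returns how many were
   written before the first error. */
int store_pages(FlashDriver *drv, int count) {
    unsigned char buf[4] = {1, 2, 3, 4};
    int i;
    for (i = 0; i < count; i++) {
        if (drv->write_page(drv->ctx, i, buf, sizeof buf) != 0)
            break;                      /* stop at the first injected failure */
    }
    return i;
}
```

A test sets `fail_on_write` to pick exactly which write fails, something that would be hard or expensive to induce on real hardware.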

Simulation and Stubbing: JPL Example • When testing JPL flash storage modules we rely on software simulation of flash devices • Real flash devices are slow • Can’t do aggressive random testing • Real flash devices are expensive • JPL only has a few boards – constant competition to test on these • Running hundreds of thousands of tests will wear the flash hardware out • Enables us to introduce rare hardware failures • System resets, spontaneous bad blocks and write failures, etc.

Controllability: Downwards Scalability • Another important aspect of controllability is to make code “downwards scalable” • Many faults cause an error only in a corner case due to a resource limit • An effective strategy for finding errors is to reduce the resource limits • Test a version of the program with very tight bounds • Finding corner cases is easier if the corners are close together • Too many programs hard-code resource limits or make assumptions about resources unconnected to defined limits • E.g., not checking the result of malloc
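As a small illustration of downwards scalability, the sketch below makes the capacity a parameter and checks the result of malloc, so a test can create a very small instance and hit the "full" corner after two insertions. The `Queue` type and its functions are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bounded queue whose capacity is a parameter,
   not a hard-coded constant. */
typedef struct {
    int *items;
    int capacity;
    int count;
} Queue;

/* Returns 0 on success, -1 if allocation fails (the malloc result is checked). */
int queue_init(Queue *q, int capacity) {
    q->items = malloc((size_t)capacity * sizeof *q->items);
    if (q->items == NULL)
        return -1;
    q->capacity = capacity;
    q->count = 0;
    return 0;
}

/* Returns 0 on success, -1 when the queue is full. */
int queue_push(Queue *q, int value) {
    if (q->count >= q->capacity)
        return -1;
    q->items[q->count++] = value;
    return 0;
}

void queue_free(Queue *q) {
    free(q->items);
    q->items = NULL;
}
```

With a capacity of 2, the third push already exercises the resource limit; with a production-sized capacity the same corner might need thousands of operations to reach.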

Downwards Scalability: JPL Example • Flight flash hardware is usually a 1-4 GB device • E.g., 64 blocks of 32 pages of 8192 bytes • We primarily test with much smaller “devices” (using software simulation) • 6 blocks of 4 pages of 64 bytes • Forces flash file system to compact storage more often • Tests assumptions about how space is used on flash • Forces more multi-page writes and directory entries over multiple pages

Downwards Scalability: JPL Example • Easier to explore various combinations of states of blocks/pages of the device • (Figure legend: used page, free page, dirty page, bad block)

Controllability • Other important themes for controllability • Network/file access • If program reads from the network or from remote files, this is hard to control • Again, simulation and stubbing are key • System calls • Similarly, reading the time from the operating system can be hard to control • Simulation and stubbing – Operating System Abstraction Layer etc. • GUI control • Allow scripted control of GUI elements so tests can be automated

Observability: Assertions • Assertions improve observability by making (some) errors into failures • Even if the effect of a fault doesn’t propagate, it may be visible if an assertion checks the state at the right time • Assertions also improve observability by making the error, rather than failure, visible • Know how the state was corrupted directly, not just eventual effect

Observability: Invariant Checkers • Can extend the idea of assertions to writing “full” invariant checkers • Do a crawl of code’s basic data structures • Check various invariants that would be too expensive to check at runtime • Invariant checker can be written the easy way (recursion, memory allocation, etc.) because it won’t run on the actual system • But be careful! If your invariant checker has a bug and changes the system state. . .
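A rough sketch of what such a checker might look like for a simple structure follows. The `SortedList` type and its two invariants (ascending order, cached length matches the node count) are assumptions made for illustration. The checker takes a `const` pointer, which guards against the "checker changes the state" hazard mentioned above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* Hypothetical structure under test: a sorted list with a cached length. */
typedef struct {
    Node *head;
    int length;   /* cached count of nodes */
} SortedList;

/* Returns 1 if all invariants hold, 0 otherwise. The const pointer means
   the checker cannot modify the structure it inspects. */
int list_invariants_hold(const SortedList *list) {
    int seen = 0;
    const Node *n;
    for (n = list->head; n != NULL; n = n->next) {
        if (n->next != NULL && n->value > n->next->value)
            return 0;             /* ordering invariant violated */
        if (++seen > list->length)
            return 0;             /* more nodes than the cached length, or a cycle */
    }
    return seen == list->length;  /* cached length must match the node count */
}
```

A test harness could call `list_invariants_hold` after every operation on the list, turning a silently corrupted state into an immediately visible failure.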

Observability • Other important themes for observability • Logging • Especially critical for GUI interfaces, to mirror GUI events in ordered parseable messages • Network/file access • If program writes to the network or to remote files, this is hard to observe

Controllability & Observability: Memory Allocation • More extreme case: embedded code for mission or safety critical systems • May be running without memory protection • Dynamic allocation often forbidden • Design module to accept a static block allocated elsewhere, and only access this memory • Controllability: allows us to introduce memory faults, simulate warm reboots • Observability: allows us to easily instrument code with low-overhead checks to find memory safety violations during testing

Coverage • Literature of software testing is primarily concerned with various notions of coverage • Ammann and Offutt identify four basic kinds of coverage: • Graph coverage • Logic coverage • Input space partitioning • Syntax-based coverage

Graph Coverage • Cover all the nodes, edges, or paths of some graph related to the program • Examples: • Statement coverage • Branch coverage • Path coverage • Data flow (def-use) coverage • Model-based testing coverage • Many more – most common kind of coverage, by far

Graph Coverage • Most FSM testing algorithms can be seen as graph coverage • Consider VC – computing a spanning tree to nodes is standard graph exploration • Beizer: “find a graph and cover it”

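Statement/Basic Block Coverage
    if (x < y) {
        y = 0;
        x = x + 1;
    } else {
        x = y;
    }
(Graph: node 1 branches on x < y to node 2 [y = 0; x = x + 1] and on x >= y to node 3 [x = y]; both lead to node 4.)
    if (x < y) {
        y = 0;
        x = x + 1;
    }
(Graph: node 1 branches on x < y to node 2 [y = 0; x = x + 1], which leads to node 3; on x >= y node 1 goes directly to node 3.)
• Statement coverage: cover every node of these graphs
• y = 0; x = x + 1 is treated as one node because if one statement executes the other must also execute (the code is a basic block)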

Branch Coverage
    if (x < y) {
        y = 0;
        x = x + 1;
    } else {
        x = y;
    }
(Graph: node 1 branches on x < y to node 2 [y = 0; x = x + 1] and on x >= y to node 3 [x = y]; both lead to node 4.)
• Branch coverage vs. statement coverage: same for if-then-else
    if (x < y) {
        y = 0;
        x = x + 1;
    }
(Graph: node 1 branches on x < y to node 2 [y = 0; x = x + 1], which leads to node 3; on x >= y node 1 goes directly to node 3.)
• But consider this if-then structure. For branch coverage we can’t just cover all nodes, but must cover all edges: get to node 3 both after 2 and without executing 2!

Path Coverage
    if (x < y) {
        y = 0;
        x = x + 1;
    } else {
        x = y;
    }
    if (x < y) {
        y = 0;
        x = x + 1;
    }
(Graph: node 1 branches on x < y to node 2 [y = 0; x = x + 1] and on x >= y to node 3 [x = y]; both lead to node 4, which branches on x < y to node 5 [y = 0; x = x + 1] and on x >= y directly to node 6; node 5 leads to node 6.)
• How many paths through this code are there? Need one test case for each to get path coverage
• To get statement and branch coverage, we only need two test cases: 1 2 4 5 6 and 1 3 4 6
• Path coverage needs two more, for four paths in all: 1 2 4 5 6, 1 3 4 6, 1 2 4 6, 1 3 4 5 6
• In general: exponential in the number of conditional branches!
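To see which of the four paths a given input takes, the code can be instrumented to record the nodes it visits. This is a sketch under the node numbering used in the path lists above (1 and 4 are the two tests, 6 is the exit); `trace_path` is a hypothetical helper that encodes the path as a decimal number.

```c
#include <assert.h>

/* Records the nodes visited as a decimal number,
   e.g. 12456 for the path 1 2 4 5 6. */
int trace_path(int x, int y) {
    int path = 1;                   /* node 1: first test */
    if (x < y) {
        path = path * 10 + 2;       /* node 2 */
        y = 0;
        x = x + 1;
    } else {
        path = path * 10 + 3;       /* node 3 */
        x = y;
    }
    path = path * 10 + 4;           /* node 4: second test */
    if (x < y) {
        path = path * 10 + 5;       /* node 5 */
        y = 0;
        x = x + 1;
    }
    path = path * 10 + 6;           /* node 6: exit */
    return path;
}
```

One thing the instrumentation makes visible: after node 3 executes x = y, the second test x < y cannot be true, so for this particular code the path 1 3 4 5 6 exists in the graph but no input can execute it. Infeasible paths like this are one reason full path coverage is often unattainable in practice.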

Data Flow Coverage
    x = 3;              // node 1: Def(x)
    y = 3;              // node 2: Def(y)
    if (w) {            // node 3
        x = y + 2;      // node 4: Def(x), Use(y)
    }
    if (z) {            // node 5
        y = x - 2;      // node 6: Def(y), Use(x)
    }
    n = x + y;          // node 7: Use(x), Use(y)
• Annotate program with locations where variables are defined and used (very basic static analysis)
• Def-use pair coverage requires executing all possible pairs of nodes where a variable is first defined and then used, without any intervening re-definitions
• E.g., this path covers the pair where x is defined at 1 and used at 7: 1 2 3 5 6 7
• But this path does NOT: 1 2 3 4 5 6 7
• May be many pairs, some not actually executable

Logic Coverage
• What if, instead of:
    if (x < y) {
        y = 0;
        x = x + 1;
    }
• we have:
    if (((a > b) || G) && (x < y)) {
        y = 0;
        x = x + 1;
    }
(Graph: node 1 goes to node 2 [y = 0; x = x + 1] on ((a > b) || G) && (x < y), and directly to node 3 on ((a <= b) && !G) || (x >= y); node 2 leads to node 3.)
• Now, branch coverage will guarantee that we cover all the edges, but does not guarantee we will do so for all the different logical reasons
• We want to test the logic of the guard of the if statement

Active Clause Coverage
Predicate: ((a > b) or G) and (x < y)
    Row   (a > b)   G   (x < y)   Predicate
    1     T         F   T         T
    2     F         F   T         F
    3     F         T   T         T
    4     F         F   T         F    (duplicate of row 2)
    5     T         T   T         T
    6     T         T   F         F
• Rows 1-2: with these values for G and (x < y), (a > b) determines the value of the predicate
• Rows 3-4: with these values for (a > b) and (x < y), G determines the value of the predicate
• Rows 5-6: with these values for (a > b) and G, (x < y) determines the value of the predicate
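The notion of a clause "determining" the predicate can be checked directly: fix the other clauses, flip the clause of interest, and see whether the predicate flips. The sketch below treats the three clauses as plain booleans; all function and parameter names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* The predicate from the slide, with the clauses as plain booleans:
   a_gt_b stands for (a > b), g for G, x_lt_y for (x < y). */
bool predicate(bool a_gt_b, bool g, bool x_lt_y) {
    return (a_gt_b || g) && x_lt_y;
}

/* A clause determines the predicate if flipping it, with the other
   clauses held fixed, flips the result. */
bool a_gt_b_determines(bool g, bool x_lt_y) {
    return predicate(true, g, x_lt_y) != predicate(false, g, x_lt_y);
}

bool g_determines(bool a_gt_b, bool x_lt_y) {
    return predicate(a_gt_b, true, x_lt_y) != predicate(a_gt_b, false, x_lt_y);
}

bool x_lt_y_determines(bool a_gt_b, bool g) {
    return predicate(a_gt_b, g, true) != predicate(a_gt_b, g, false);
}
```

For instance, `a_gt_b_determines(false, true)` holds, matching rows 1-2 of the table, while with G true the clause (a > b) no longer matters.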

Input Domain Partitioning
• Partition scheme q of domain D
• The partition q defines a set of blocks, $B_q = b_1, b_2, \ldots, b_Q$
• The partition must satisfy two properties:
• Blocks must be pairwise disjoint (no overlap): $b_i \cap b_j = \emptyset, \ \forall i \neq j, \ b_i, b_j \in B_q$
• Together the blocks cover the domain D (complete): $\bigcup_{b \in B_q} b = D$
• (Figure: domain D divided into blocks b1, b2, b3)
• Coverage then means using at least one input from each of b1, b2, b3, . . .

Input Domain Partitioning
• Some subtleties here…
• What’s wrong with this partition of file contents?
    b1: Sorted ascending file
    b2: Sorted descending file
    b3: Neither sorted ascending nor sorted descending
• Recall the two required properties: $b_i \cap b_j = \emptyset, \ \forall i \neq j, \ b_i, b_j \in B_q$ and $\bigcup_{b \in B_q} b = D$
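One way to probe a proposed partition is to count how many blocks each sample input falls into; a valid partition always gives exactly one. The sketch below models file contents as an int array and assumes "sorted ascending" means non-decreasing (and likewise for descending); both are simplifications made for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: "sorted ascending" is read as non-decreasing. */
bool sorted_ascending(const int *v, int n) {
    for (int i = 1; i < n; i++)
        if (v[i - 1] > v[i]) return false;
    return true;
}

/* Assumption: "sorted descending" is read as non-increasing. */
bool sorted_descending(const int *v, int n) {
    for (int i = 1; i < n; i++)
        if (v[i - 1] < v[i]) return false;
    return true;
}

/* Number of blocks (b1, b2, b3) that contain the input.
   A valid partition gives exactly 1 for every input. */
int blocks_containing(const int *v, int n) {
    bool asc = sorted_ascending(v, n);
    bool desc = sorted_descending(v, n);
    return (asc ? 1 : 0) + (desc ? 1 : 0) + ((!asc && !desc) ? 1 : 0);
}
```

Under these assumptions a one-element input, or one whose elements are all equal, lands in both b1 and b2, so the blocks are not pairwise disjoint.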

Syntax-Based Coverage • Based on mutation testing (a pet topic of Ammann and Offutt, who are heavily into this research area) • Bit different kind of creature than the other coverages we’ve looked at • Idea: generate many syntactic mutants of the original program • Coverage: how many mutants does a test suite kill (detect)?

Mutating Our Buggy Program
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }

Mutant #1
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n; i > 0; i--) {       // mutated: was i = n-1
            if (a[i] == x)
                return i;
        }
        return -1;
    }

Mutant #2
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return 0;                       // mutated: was return -1
    }

Mutant #3
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] != x)              // mutated: was ==
                return i;
        }
        return -1;
    }

Mutant #4
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == n)              // mutated: was x
                return i;
        }
        return -1;
    }

Mutant #5: Wait, this one’s the fix!
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i >= 0; i--) {    // mutated: was i > 0
            if (a[i] == x)
                return i;
        }
        return -1;
    }
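A test kills a mutant when the mutant's output differs from that of the program being mutated. The sketch below folds the program P (variant 0) and the five mutants into one function with a selector, purely to keep the example short; a real mutation tool would compile each mutant separately. `findLastVariant` and `kills` are hypothetical names. Because mutant 1 reads a[n], the test arrays must carry one spare element beyond n.

```c
#include <assert.h>

/* variant 0 is the program P; variants 1-5 are the mutants shown above.
   Callers must supply at least n + 1 elements, since mutant 1 reads a[n]. */
int findLastVariant(int variant, const int a[], int n, int x) {
    int i;
    int start = (variant == 1) ? n : n - 1;   /* mutant 1: i = n */
    int target = (variant == 4) ? n : x;      /* mutant 4: compare with n */
    int lowest = (variant == 5) ? 0 : 1;      /* mutant 5: i >= 0 */
    for (i = start; i >= lowest; i--) {
        int match = (variant == 3) ? (a[i] != target)   /* mutant 3: != */
                                   : (a[i] == target);
        if (match)
            return i;
    }
    return (variant == 2) ? 0 : -1;           /* mutant 2: return 0 */
}

/* Returns 1 if this test input distinguishes the mutant from P. */
int kills(int variant, const int a[], int n, int x) {
    return findLastVariant(variant, a, n, x) != findLastVariant(0, a, n, x);
}
```

Mutant 5 is killed only by inputs where the searched value sits at index 0 and nowhere later, which are exactly the inputs that expose the original off-by-one fault.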

Syntax-Based Coverage • (Figure: program P and the mutants of P) • 100% coverage means you kill all the mutants with your test suite

Generation vs. Recognition • Generation of tests based on coverage means producing a test suite to achieve a certain level of coverage • As you can imagine, generally very hard • Consider: generating a suite for 100% statement coverage easily reaches “solving the halting problem” level • Obviously hard for, say, mutant-killing • Recognition means seeing what level of coverage an existing test suite reaches

Coverage and Subsumption • Sometimes one coverage approach subsumes another • If you achieve 100% coverage of criterion A, you are guaranteed to satisfy B as well • For example, consider node and edge coverage • (there’s a subtlety here, actually – can you spot it?) • What does this mean? • Unfortunately, not a great deal • If test suite X satisfies “stronger” criterion A and test suite Y satisfies “weaker” criterion B • Y may still reveal bugs that X does not! • For example, consider our running example and statement vs. branch coverage • It means we should take coverage with a grain of salt, for one thing

Testing “for” Coverage • Never seek to improve coverage just for the sake of increasing coverage • Well, unless it’s a command from on high • Coverage is not the goal • Finding failures that expose faults is the goal • No amount of coverage will prove that the program cannot fail “Program testing can be used to show the presence of bugs, but never to show their absence!” – E. Dijkstra, Notes On Structured Programming

The Purpose of Testing “Program testing can be used to show the presence of bugs, but never to show their absence!” – E. Dijkstra, Notes On Structured Programming • Dijkstra meant this as a criticism of testing and an argument in favor of more disciplined and total approaches (proving programs correct) • But he also points out what testing is good for: exposing errors • Coverage is valuable if and only if test sets with higher coverage are more likely to expose failures

The Purpose of Testing “Program testing can be used to show the presence of bugs” • When we first start “testing,” we often want to “see that the program works” • Try out some scenarios and watch the program “do its stuff” • Surprised (annoyed) when (if) the program fails • This is not really testing: testing is not the same as a demonstration • Aim to break (your) code, if it can be broken

Levels of Testing • Adapted from Beizer, by Ammann and Offutt • Level 0: Testing is debugging • Level 1: Testing is to show the program works • Level 2: Testing is to show the program doesn’t work • Level 3: Testing is not to prove anything specific, but to reduce risk of using program • Level 4: Testing is a mental discipline that helps develop higher quality software

What’s So Good About Coverage?
• Consider a fault that causes failure every time the code is executed
• Don’t execute the code: cannot possibly find the fault!
• That’s a pretty good argument for statement coverage
    int findLast (int a[], int n, int x) {
        // Returns index of last element
        // in a equal to x, or -1 if no
        // such. n is length of a
        int i;
        for (i = n-1; i >= 0; i--) {
            if (a[i] == x)
                return i;
        }
        return 0;
    }

What’s So Good About Coverage?
• We should have an argument for any kind of coverage:
• “If I don’t cover this, then there is more chance I’ll miss a fault like that”
• Backed with empirical data, preferably!
    int findLast (int a[], int n, int x) {
        // Returns index of last element
        // in a equal to x, or -1 if no
        // such. n is length of a
        int i;
        for (i = n-1; i >= 0; i--) {
            if (a[i] == x)
                return i;
        }
        return 0;
    }

Return to Our Example
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }
• Let’s write a tester for this version of the program (back to the first off-by-one bug)
• Forget for a moment that we know what the bug is!

Return to Our Example
    int findLast (int a[], int n, int x) {
        // Returns index of last element in a
        // equal to x, or -1 if no such.
        // n is length of a
        int i;
        for (i = n-1; i > 0; i--) {
            if (a[i] == x)
                return i;
        }
        return -1;
    }
• What kind of coverage might we want to think about when testing this code?

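Return to Our Example
    #include <assert.h>
    #include <stdio.h>
    #define N 5 // 5 is "big enough"?
    // random_assign and print_array are helper routines defined elsewhere
    int testFind () {
        int a[N];
        int p, i;
        for (p = 0; p < N; p++) {
            random_assign(a, N);
            a[p] = 3;
            // Remove any 3 after position p, so that p holds the last 3
            for (i = p + 1; i < N; i++) {
                if (a[i] == 3)
                    a[i] = a[i] - 1;
            }
            printf("TEST: findLast({");
            print_array(a, N);
            printf("}, %d, 3)", N);
            assert(findLast(a, N, 3) == p);
        }
        return 0;
    }
• What kind of coverage does this tester exploit?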
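For experimenting with the tester, here is a self-contained variant, offered as a sketch rather than the slides' own code. It counts wrong answers instead of asserting, so the loop runs to completion against the faulty findLast, and it scrubs later 3s starting at p + 1 so that the planted 3 at position p survives. The body of `random_assign` is an assumption (the slides only name it), and the printing is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define N 5

/* Faulty version from the slides (the loop stops before index 0). */
int findLast(int a[], int n, int x) {
    int i;
    for (i = n - 1; i > 0; i--)
        if (a[i] == x)
            return i;
    return -1;
}

/* Assumed helper: fills a with small pseudo-random values. */
void random_assign(int a[], int n) {
    int i;
    for (i = 0; i < n; i++)
        a[i] = rand() % 10;
}

/* Returns the number of positions p for which findLast gave the wrong answer. */
int testFind(void) {
    int a[N];
    int p, i, failures = 0;
    for (p = 0; p < N; p++) {
        random_assign(a, N);
        a[p] = 3;                     /* plant the last 3 at position p */
        for (i = p + 1; i < N; i++)   /* remove any 3 after position p */
            if (a[i] == 3)
                a[i] = a[i] - 1;
        if (findLast(a, N, 3) != p)
            failures++;
    }
    return failures;
}
```

Whatever the random values are, the only position that can fail here is p = 0, since for every other p the faulty loop still reaches index p.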
