The grand game of testing and some related issues
Yuri Gurevich, Research in Software Engineering, Microsoft
Agenda
• Introducing the grand game
• Various dimensions of testing
• Myers’s paradox
• Zoology of software testing
• Remarks
• Beware of statistics
• Testing vs. scientific experimentation
Testing is rather important. Here is a proof by authority
• Bill Gates at MIT in 1996: “Testing is another area where I have to say I’m a little bit disappointed in the lack of progress. At Microsoft, in a typical development group, there are many more testers than there are engineers writing code. Yet Engineers spend well over a third of their time doing testing type work. You could say that we spend more time testing than we do writing code. And if you go back through the history of large-scale systems, that’s the way they’ve been.”
In-class exercise
• Scenario. A given program reads three integer values from a card. The three values are interpreted as representing the lengths of the sides of a triangle. The program prints a message that states whether the triangle is scalene, isosceles, or equilateral.
• Task. Write down particular inputs to the program in order to test the program as thoroughly as possible.
From “The Art of Software Testing” by Glenford J. Myers, Wiley, 1979.
Myers: Do you have these?
1. a scalene triangle
2. an isosceles triangle
3. an equilateral triangle
4. 3 permutations of 2
5. a zero-length side
6. a negative-length side
7. three positive sides, sum of two = third
8. 3 permutations of 7
9. three positive sides, sum of two < third
10. 3 permutations of 9
11. (0,0,0)
12. noninteger values
13. wrong number of input values
14. the expected output in each case
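To make the exercise concrete, here is a minimal Python sketch of a program under test (hypothetical; Myers gives no code), exercised with a few of the checklist cases:

    def classify(a, b, c):
        """Classify a triangle given three integer side lengths."""
        sides = sorted((a, b, c))
        # Checklist items 5-11: reject zero, negative, and degenerate inputs.
        if sides[0] <= 0 or sides[0] + sides[1] <= sides[2]:
            return "not a triangle"
        if a == b == c:
            return "equilateral"
        if a == b or b == c or a == c:
            return "isosceles"
        return "scalene"

    assert classify(3, 4, 5) == "scalene"          # item 1
    assert classify(2, 2, 3) == "isosceles"        # item 2
    assert classify(3, 2, 2) == "isosceles"        # item 4: a permutation of 2
    assert classify(1, 2, 3) == "not a triangle"   # item 7: sum of two = third
    assert classify(0, 0, 0) == "not a triangle"   # item 11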
Comments
• Additional test ideas:
  • real inputs
  • Boolean, character inputs
  • excessively large inputs
• Critiquing the spec:
  • Are degenerate triangles legal? It seems not.
  • Is 3.0 an integer? It seems not.
  • Why are the inputs integers? Naturally they should be real. Probably to simplify the problem and avoid rounding issues.
  • Most importantly, what should the program do on wrong inputs? But of course finding spec deficiencies may be a part of testing.
• Why this triangle problem?
  • Myers assumes that the reader has the necessary knowledge, so the problem can be formulated easily.
  • Some alternatives: Euclid’s algorithm, integer multiplication.
What programs do on wrong inputs
• The description of the expected result is a necessary part of a test.
• No expectations, no surprises.
• More generally, test what the program is supposed to do as well as what it is not supposed to do.
• An electronic trading program is supposed to perform the trades correctly, but this is not enough. It is also not supposed to round prices and put the difference into the programmer’s account.
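As a small illustration, reusing the hypothetical classify sketch above: a test suite should pair each positive expectation with explicit negative ones.

    # What the program is supposed to do:
    assert classify(5, 5, 5) == "equilateral"

    # What it is NOT supposed to do:
    assert classify(-1, 2, 2) != "isosceles"   # must not classify a negative side
    assert classify(1, 1, 2) != "isosceles"    # must not accept a degenerate triangle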
The grand game
• Programmer vs. Tester
  • individuals, teams, organizations
• The game may be cooperative, but the players have different goals. Tester’s goal is to find errors.
  • Successful tests discover errors. Cf. a positive diagnosis in medicine.
  • It is preferable that Programmer and Tester be distinct.
• What if the code does not have any errors? Can you find all the errors?
• How is the winner determined?
  • The game is not necessarily antagonistic.
• What is a move? When is the game over?
  • There are no clear-cut answers. This is a matter of engineering and economics.
A partial taxonomy
• Black box, white (or glass) box, grey box
• Unit (or module), feature; integration; system
• Alpha (in house), beta (by end users)
• Poking around, manual, automated script, model-based
• Acceptance, usability
• Functional; compatibility; regression; performance: load, stress
• Comparison; install, uninstall; recovery
• Combinatorial
• Incremental
• Sanity, smoke
• Random
• Security, privacy, compliance
• Conformance (of implementation to the spec or the other way round)
Black box testing
• It is functional testing, also known as data-driven or input-output testing.
• It is conformance testing. A spec is given. You try to find inputs on which the program disagrees with the spec.
• Exhaustive testing is usually impossible or infeasible.
  • Think of testing a compiler or an OS.
• So when to stop testing?
  • It is all about economics.
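One common black-box tactic, sketched here under the assumption that an executable oracle for the spec is available (the function names are illustrative): since exhaustive comparison is infeasible, sample the input space and check that the implementation agrees with the spec. Euclid’s algorithm, mentioned earlier, serves as the implementation under test.

    import random

    def spec_gcd(a, b):
        # Executable spec (oracle): gcd by brute-force search.
        return max(d for d in range(1, min(a, b) + 1)
                   if a % d == 0 and b % d == 0)

    def impl_gcd(a, b):
        # Implementation under test: Euclid's algorithm.
        while b:
            a, b = b, a % b
        return a

    # Black-box conformance testing: sample inputs, compare input-output behavior.
    for _ in range(1000):
        a, b = random.randint(1, 1000), random.randint(1, 1000)
        assert impl_gcd(a, b) == spec_gcd(a, b), (a, b)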
White box testing
• The test data is derived from the program (too often at the neglect of the spec).
• What is the analog of exhaustive black-box testing?
  • Program coverage? No.
  • A better answer is exhaustive path testing.
Exhaustive path testing
• Normally infeasible
• Does not guarantee compliance with the spec
  • The missing path problem
• Possibly missing data-sensitive errors
  • e.g., testing (a − b) < ε rather than |a − b| < ε; see the sketch below.
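A minimal sketch of such a data-sensitive error, assuming a tolerance comparison: both versions below have a single path, so exhaustive path testing cannot distinguish them; only the right data exposes the bug.

    EPS = 1e-6

    def close_buggy(a, b):
        return (a - b) < EPS       # bug: any a < b counts as "close"

    def close_correct(a, b):
        return abs(a - b) < EPS

    # One path through each function, yet the behaviors differ:
    assert close_buggy(2.0, 2.0) == close_correct(2.0, 2.0)   # they agree here
    assert close_buggy(1.0, 5.0) != close_correct(1.0, 5.0)   # this data exposes the bug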
3. Myers’s paradox
Myers: “The number of undiscovered bugs in a program section is proportional to the number of discovered bugs in the section.”
The observation might have been known earlier, but we have no relevant references. In fact, the observation seems to have been overlooked by later investigators of reliability.
Is the paradox true or false?
• Here is a counterexample with two one-line sections:
  1. sum(1,1) := 3
  2. sum(1,2) := 2
• So the paradox is not literally true. Is it false?
From sections to modules
• Due to the explosion in the size and complexity of software products, it is more appropriate these days to speak about modules rather than program sections.
• What about component interaction?
One analogy
• “A bird in the hand is worth two in the bush.”
• Is that literally true? The value of any bird in any hand is equal to (or exceeds?) the combined value of any two birds in any bush. Ridiculous! But is the saying false?
• No, people say it for a good reason.
• The Russian version, “A sinitza (titmouse) in the hands is better than a zhuravl (crane) in the sky,” is different and more specific, and yet the meaning is the same.
• We do not evaluate the paradox on the true/false scale. Another scale runs from “hogwash” to “brilliant.”
One meaning of the paradox
• Beware. Whatever you’ve done to debug a module, and however you measure your debugging effort, you have no guarantee that the module is well debugged.
• Here is a reference, twenty years later, to a good cautionary article: “A Critique of Software Defect Prediction Models,” by Norman E. Fenton and Martin Neil, IEEE TSE, 1999.
From the abstract: “Many organizations want to predict the number of defects (faults) in software systems, before they are deployed, to gauge the likely delivered quality and maintenance effort. To help in this, numerous software metrics and statistical models have been developed, with a correspondingly large literature. We provide a critical review of this literature and the state-of-the-art.”
A stronger meaning of the paradox
• Sriram Biyani and P. Santhanam, “Exploring Defect Data from Development and Customer Usage on Software Modules over Multiple Releases,” ISSRE 1998, International Symposium on Software Reliability Engineering.
• “We have shown that:
  • modules with more defects in development are likely to have more defects in the field,
  • ...”
Deriving Myers’s paradox
• The paradox, in its literal form, can actually be derived from assumptions that are strong but not completely ridiculous. We give two illustrations.
Example 1: Equally effective testing
Assume that two program sections (or modules) have been tested with the same effectiveness: the fraction of bugs caught by the testing is the same, say 1/k, for both sections. Then, if b bugs are found in a section, there are kb bugs there altogether, of which kb − b = b(k − 1) bugs remain undiscovered. So the number of undiscovered bugs is proportional to the number of discovered bugs, with proportionality factor k − 1.
Example 2: Uniform distribution of bugs
• Modules A and B have 10,000 lines of code each. A has 1000 bugs; B has 500. In both modules the bugs are uniformly distributed.
• A perfect inspection of 1000 lines of code reveals about 100 bugs in A and about 50 bugs in B. Twice as many bugs have been found in A.
• And A has twice as many remaining bugs: 900 vs. 450.
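A small simulation of Example 2’s arithmetic (a sketch; the slide gives only the expected counts): seed bugs uniformly at random and inspect a tenth of the lines perfectly.

    import random

    def simulate(lines, bugs, inspected, seed=0):
        rng = random.Random(seed)
        buggy = set(rng.sample(range(lines), bugs))    # uniform distribution of bugs
        found = len(buggy & set(range(inspected)))     # perfect inspection of a prefix
        return found, bugs - found

    print(simulate(10_000, 1000, 1000))   # module A: about (100, 900)
    print(simulate(10_000, 500, 1000))    # module B: about (50, 450)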
Counting bugs
• The goal of testing is often formulated thus: find bugs. This is not unreasonable, as bugs are always there.
  • The New York Times example.
• But bug counting is harder than one may think.
  • Think of a high-school essay. Programming languages are less expressive, but (a) the logic may be more involved, and (b) the computer is more demanding detail-wise than a human interlocutor.
  • Think of 1 + f(x): if the value is wrong, is the bug in the call or inside f?
Where is the bug?
• The story of the merchant with two enemies: one poisons the water, and the other makes a hole in the container.
• Take-home exercise. Program a function f so that
  • some value of f is wrong, and
  • there are two different lines i and j of code such that
    • changing line i fixes the bug,
    • changing line j fixes the bug,
    • but making both changes brings the bug back.
One possible shape of such a function is sketched below.
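A hypothetical sketch (not the exercise’s intended solution): two errors that compensate, either of which can absorb the fix.

    def f(x):
        # Intended behavior: f(x) = x + 2.
        y = x - 1        # line i
        return y - 1     # line j  -- so f(x) = x - 2, wrong for every x

    # Changing line i to   y = x + 3     makes f correct.
    # Changing line j to   return y + 3  also makes f correct.
    # Making BOTH changes yields x + 6: the bug returns.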
What is a bug?
• That is a huge issue, and we are not going there. Just a couple of remarks.
• Taxonomies of bugs
• Triage
• One man’s bug may be another man’s feature.
One advantage of human inspection
• Test automation is key, of course, but notice this: human inspection detects actual errors rather than their symptoms.
Testing: a good research area
• Testing is much needed.
• The theory is still maturing.
  • And thus entry is still relatively easy.
• The field is huge and diverse.
• There is plenty of room for engineering ingenuity.
• There is plenty of room for mathematical analysis.
• As we will see, testing is a game in more senses than one.
6. Beware of statistics: Simpson’s paradox
A university is trying out two automated diagnostic tools, A and B. Students with computer problems can dial either tool; if the tool fails to solve the problem, then a human takes the call. Is the following possible?
• Tool B works better for PCs.
• Tool B works better for Macs.
• Tool A works better overall.
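Yes, it can happen. One hypothetical set of call counts (illustrative numbers, not from the talk) produces the reversal:

    # (solved, attempted) per tool and platform -- illustrative numbers only.
    calls = {
        ("A", "PC"):  (80, 100),   # A solves 80% on PCs
        ("B", "PC"):  (9, 10),     # B solves 90% on PCs  -> B better for PCs
        ("A", "Mac"): (2, 10),     # A solves 20% on Macs
        ("B", "Mac"): (30, 100),   # B solves 30% on Macs -> B better for Macs
    }

    for tool in "AB":
        solved = sum(s for (t, _), (s, n) in calls.items() if t == tool)
        total = sum(n for (t, _), (s, n) in calls.items() if t == tool)
        print(tool, solved / total)   # A: 82/110 = 0.75, B: 39/110 = 0.35 -> A better overall

The reversal comes from the mix of calls: B handles mostly the harder Mac calls, A mostly the easier PC calls. This is Simpson’s paradox.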
Testing is no verification
• Testing and experimentation find faults. Neither proves correctness.
• But the evidence may mount. This brings us to the problem of inductive reasoning.
  • We have observed the sun rising in the east many times. Hence it always rises in the east.
One induction joke
• Is every odd number prime?
  • Engineer: 3 is prime, 5 is prime, 7 is prime, 9 is prime, 11 is prime, ...
  • Physicist: 3 is prime, 5 is prime, 7 is prime, 9 is experimental error, 11 is prime, ...
• But induction itself is no joke.
  • Induction in the natural sciences
  • Inductive reasoning in the employment of software
The problem of induction
• Hume: You can check only a finite number of the infinitely many instances. How do you justify the jump to the conclusion? Is it even useful to bother to check more instances?
• There have been numerous attempts to justify induction, e.g. probabilistically.
Popper
• The justification of induction is so hard because it is impossible.
• Any scientific theory is just that, a theory.
  • Example: Newton to Einstein
• On the other hand, scientific theories can be falsified. There is no symmetry between verification and falsification.
• Additional observations may carry no additional weight.
  • The water-boiling example.
• You have to challenge the theory.
Falsification principle
• Falsifiability is the demarcation line between science (physics, chemistry, etc.) and pseudo-science (astrology, Marxism, etc.).
• The principle caught the imagination of scientists.
Falsifiability of the falsification principle: the logic approach
• Self-application: Is the falsification principle itself scientific?
• Existential claims ∃x φ(x) can in principle be verified, provided that φ(x) is observable.
• Universal claims ∀x φ(x) can in principle be falsified, provided that φ is observable.
• What if the claim is of the form ∀x ∃y φ(x,y), or ...?
• Typically scientific claims are universal, e.g. in Newtonian mechanics, any two objects attract each other … Maybe this explains Popper’s popularity with the scientists.
Pragmatics
• What really is the form of a scientific claim?
• There are many explanations of a failed experiment.
  • The predicted discovery of Pluto was a triumph of Newtonian mechanics. What would have happened if Pluto had not been found?
• Theories are remarkably resistant to falsification, especially if there is no good alternative theory.
Falsification principle and software
• Is falsifiability a demarcation line between bad and good software?
• Can you falsify software?
  • If yes, how many bugs does it take to do the trick?
• Buggy software is fine until we have a better one. As with theories.
Distinction
• Scientific experimentation is harder. It takes a lot of ingenuity to set up a single test.
• It is typically black box.
  • Genetic research is as close as you get to an exception in the natural sciences.
  • Can you influence the “programmer”?