Empirical Methods for Benchmarking High Dependability

The Role of Defect Seeding (CSE Research Review 2002)

The Role of Defect Seeding in HDC Benchmarking
• Traditional Defect Seeding
• Assumptions and Challenges
• Potential New Approaches
  - Change histories as a source of seeded defects
  - N-Version development as a source of seeded defects
  - Connectors and Wrappers
  - Randomized benchmarks
  - Hybrids of new and traditional approaches
• Issues for discussion
  - Mapping of approaches to dependability categories
  - Alternative concepts and approaches
  - Special application challenges

Traditional Defect Seeding
• Insert N defects in the system under test (SUT)
  - Also known as "be-bugging"
• Run tests; find M seeded defects and K unseeded defects
• Estimate the remaining unseeded defects as R = K * ((N - M) / M), which follows from K / (K + R) = M / N
• Example:
  - Seed the SUT with N = 10 defects
  - Tests find M = 6 of the 10 seeded defects and K = 3 unseeded defects
  - The estimate is that R = 2 unseeded defects remain
  - The claim is that the 3 unseeded defects found represent 60% of the 5 previously undetected defects in the SUT
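The estimator above can be sketched in a few lines of Python. The function name and argument names are illustrative choices, not part of the original slides; the arithmetic follows R = K * ((N - M) / M) directly.

```python
def estimate_remaining_defects(seeded: int, seeded_found: int, unseeded_found: int) -> float:
    """Estimate the unseeded defects still in the SUT: R = K * (N - M) / M.

    seeded         -- N, number of defects inserted into the SUT
    seeded_found   -- M, number of seeded defects the tests found
    unseeded_found -- K, number of unseeded (real) defects the tests found
    """
    if seeded <= 0:
        raise ValueError("N must be positive")
    if not 0 < seeded_found <= seeded:
        # With M = 0 the ratio M/N gives no information and R is undefined.
        raise ValueError("need 0 < M <= N")
    if unseeded_found < 0:
        raise ValueError("K must be non-negative")
    return unseeded_found * (seeded - seeded_found) / seeded_found


# The worked example from the slide: N = 10, M = 6, K = 3.
remaining = estimate_remaining_defects(seeded=10, seeded_found=6, unseeded_found=3)
print(remaining)  # 2.0
```

Note that the estimate is undefined when no seeded defects are found, which is why the sketch rejects M = 0 rather than returning a number.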

Assumptions of Defect Seeding
• The seeded defects are representative of existing defects.
  - Seeding is mostly done by developers, whose blind spots miss many sources of defects.
• The test profile is representative of the operational profile.
  - The developers' knowledge of actual usage patterns is generally highly imperfect.
• The SUT is developed without knowledge of the seeding profile.
  - If the seeded defects become well known, there is a risk of consciously or unconsciously tailoring the tool to look good on the seeded defect sample.
• The source code is available for defect seeding.
  - As systems become increasingly COTS-based, this assumption becomes harder to satisfy.

Change histories as a source of seeded defects
• Use fixes from earlier versions of the SUT as sources of seeded defects
• These are representative of existing defects, having been real defects
• The problem is that they were the defects most detectable with current techniques
• Version changes may be complex combinations of defect fixes, patches, and general upgrades
  - This makes preparing an appropriately seeded SUT more difficult

N-Version Development
• Generate representative defects by giving the specs to different programmers and generating a family of SUT versions
• Use the versions as the sample space for the defect population
• Estimate non-seeded defects from the number of defects caused by seeds vs. non-seeds
• Calculate dependability estimators over samples with respect to the estimated population
• Comparative analysis of the defects found in the SUT versions can also generate estimates of the likely number of residual defects
• Studies of N-version programming have shown that it is an imperfect source of independent implementations, and it can be expensive, but it appears to be worth exploring
• Program mutations are a similar source of defect-seeding alternatives

Randomized Benchmarks
• Seed defects according to a known distribution (e.g., uniform)
• Compare the sample distribution to the seed distribution
• Parameter estimators may be used to determine the actual distribution and to estimate dependability
• Randomization helps avoid gaming of the evaluation criteria
• Non-parametric approach
  - Permutations and combinations of defect seeds
  - Jackknife and bootstrap methods to get population estimators from multiple runs
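As one possible reading of the bootstrap bullet, the sketch below resamples per-run estimates of remaining defects to obtain a percentile interval. It assumes each run reports a pair (M, K) of seeded and unseeded defects found, and it reuses the traditional estimator R = K * (N - M) / M; the run data, function names, and the choice of a percentile bootstrap are all illustrative assumptions rather than anything specified in the slides.

```python
import random
from statistics import mean


def remaining_estimate(n_seeded: int, m_found: int, k_found: int) -> float:
    """Traditional seeding estimator R = K * (N - M) / M for one run."""
    if not 0 < m_found <= n_seeded:
        raise ValueError("need 0 < M <= N")
    return k_found * (n_seeded - m_found) / m_found


def bootstrap_interval(runs, n_seeded, resamples=2000, alpha=0.10, rng=None):
    """Percentile-bootstrap interval for the mean remaining-defect estimate.

    runs -- list of (M, K) pairs, one per independent test run (assumed format)
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    estimates = [remaining_estimate(n_seeded, m, k) for m, k in runs]
    # Resample the per-run estimates with replacement and record each mean.
    means = sorted(
        mean(rng.choices(estimates, k=len(estimates))) for _ in range(resamples)
    )
    lo_idx = int(resamples * alpha / 2)
    hi_idx = resamples - 1 - lo_idx
    return mean(estimates), means[lo_idx], means[hi_idx]


# Hypothetical data: five runs against an SUT seeded with N = 10 defects.
hypothetical_runs = [(6, 3), (5, 2), (7, 4), (6, 2), (4, 3)]
point, lower, upper = bootstrap_interval(hypothetical_runs, n_seeded=10)
print(f"mean estimate {point:.2f}, 90% interval [{lower:.2f}, {upper:.2f}]")
```

With so few runs the interval is wide and sensitive to a single outlying run, which is itself useful information about how much confidence the seeding estimate deserves.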

Connectors and Wrappers
• Seed defects by intervening between normal interfaces and communications
• Seeded defects can simulate potential real defects
  - Data corruption (change data I/O through the wrapper interface)
  - Communication failures (change timing, handshaking, response, etc. through the connector)
• Actual defect estimates can be made by comparing defects caused by seeds to non-seed defects
• Expected dependability can be measured through the system's response to seeds and non-seeds
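The data-corruption case can be illustrated with a small wrapper that sits between a caller and an interface and corrupts a fraction of the values passing through. This is a minimal sketch under stated assumptions: the class name, the `corrupt` callback, and the `read_sensor` interface are hypothetical, and timing or handshaking faults are not modeled.

```python
import random
from typing import Any, Callable


class FaultInjectingWrapper:
    """Wraps a callable interface and corrupts a fraction of its return values."""

    def __init__(self, target: Callable[..., Any], corrupt: Callable[[Any], Any],
                 rate: float, seed: int = 0) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be in [0, 1]")
        self._target = target
        self._corrupt = corrupt
        self._rate = rate
        self._rng = random.Random(seed)  # seeded for repeatable experiments
        self.calls = 0      # total calls routed through the wrapper
        self.injected = 0   # calls in which a seeded fault was injected

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        self.calls += 1
        result = self._target(*args, **kwargs)
        if self._rng.random() < self._rate:
            self.injected += 1
            return self._corrupt(result)
        return result


def read_sensor() -> int:
    """Hypothetical interface of the SUT; always returns a valid reading."""
    return 42


# Corrupt roughly 30% of readings by flipping their sign.
wrapped = FaultInjectingWrapper(read_sensor, corrupt=lambda v: -v, rate=0.3, seed=1)
observed = [wrapped() for _ in range(100)]
corrupted_seen = sum(1 for v in observed if v != 42)
print(wrapped.calls, wrapped.injected, corrupted_seen)
```

Because the wrapper records how many faults it injected, failures observed downstream can be split into those caused by seeds and those that were not, which is the comparison the slide describes.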

Hybrids of New and Traditional
• One can also combine randomized defect seeding with defect distribution statistics to address the defect representativeness issue.
  - Orthogonal Defect Classification statistics are a good example.
• Combinatorial designs to reduce bias in defect seeds with respect to non-seeds
• Game theoretic approaches
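Combining randomized seeding with defect distribution statistics might look like the following sketch, which draws the type of each seeded defect at random in proportion to an observed defect-type distribution. The type names follow the Orthogonal Defect Classification defect-type attribute; the percentage shares are made-up placeholders, not measured data, and the function name is hypothetical.

```python
import random
from collections import Counter

# Placeholder shares per ODC defect type; real values would come from
# an organization's own classified defect data.
HYPOTHETICAL_ODC_SHARES = {
    "Function": 15,
    "Interface": 15,
    "Checking": 15,
    "Assignment": 20,
    "Timing/Serialization": 5,
    "Build/Package/Merge": 5,
    "Documentation": 5,
    "Algorithm": 20,
}


def draw_seed_plan(shares: dict, n_seeds: int, rng: random.Random) -> Counter:
    """Randomly choose how many defects of each type to seed.

    Each seed's type is drawn independently with probability proportional
    to its share, so the plan is random but tracks the distribution.
    """
    types = list(shares)
    weights = [shares[t] for t in types]
    return Counter(rng.choices(types, weights=weights, k=n_seeds))


plan = draw_seed_plan(HYPOTHETICAL_ODC_SHARES, n_seeds=20, rng=random.Random(1))
for defect_type, count in plan.most_common():
    print(f"{defect_type}: {count}")
```

Drawing the plan at random, rather than fixing exact counts per type, keeps the seeding profile hard to anticipate while still tying it to the observed distribution.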

Issues for Discussion
• Mapping of approaches to dependability categories
  - "Dependability" attributes and tradeoffs
• Alternative concepts and approaches
  - Mutation testing as defect seeding
  - Model-driven approaches
  - Others
• Special application challenges
  - Scalability; test oracles (e.g., for agent-based systems of systems); Heisenbug effects; others

"Dependability" Attributes and Tradeoffs
• Robustness: reliability, availability, survivability
• Protection: security, safety
• Quality of Service: accuracy, fidelity, performance assurance
• Integrity: correctness, verifiability
• Attributes mostly compatible and synergetic
• Some conflicts and tradeoffs
  - Spreading information: survivability vs. security
  - Fail-safe: safety vs. quality of service
  - Graceful degradation: survivability vs. quality of service
