Testing Effectiveness and Reliability Modeling for Diverse Software Systems CAI Xia Ph.D Term 4 April 28, 2005
Outline • Introduction • Background study • Reliability modeling • Testing effectiveness • Future work • Conclusion
Introduction • Software reliability engineering techniques • Fault avoidance • structured programming, software reuse, and formal methods • Fault removal • testing, verification, and validation • Fault tolerance • single-version techniques • multi-version techniques (design diversity) • Fault prediction • reliability modeling
Software Fault Tolerance • Layers of software fault tolerance
SFT techniques • Single-version techniques • Checkpointing and recovery • Exception handling • Data diversity • Multi-version techniques (Design diversity) • Recovery block • N-version programming • N self-checking programming
Design diversity • Deploy multiple program versions to tolerate software faults during operation • Principle: redundancy • Applications • Airplane control systems, e.g., Boeing 777 and Airbus A320/A330/A340 • Aerospace applications • Nuclear reactors • Telecommunications products
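The N-version programming idea listed above can be sketched as a majority voter over independently developed versions. A minimal sketch; the toy versions below are hypothetical stand-ins, not RSDIMU code:

```python
from collections import Counter

def majority_vote(outputs):
    """Return the majority output among the version results, or None if no majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Hypothetical 3-version system computing x^2; version_b carries an injected fault.
def version_a(x): return x * x
def version_b(x): return x * x + (1 if x == 3 else 0)  # fails only on x == 3
def version_c(x): return x ** 2

def nvp_run(x):
    """Run all versions and adjudicate by majority vote, masking a single faulty version."""
    return majority_vote([version_a(x), version_b(x), version_c(x)])
```

On x = 3 the faulty version is outvoted and the system still returns 9; the scheme breaks down only when versions fail coincidentally, which is why correlated failures matter in the models that follow.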
Design diversity (cont’) • Controversial issues: • Failures of diverse versions may correlate with each other • Reliability modeling is based on failure data collected during testing • Testing is a critical step in ensuring reliability • Testing completeness and effectiveness: test case selection and evaluation (is code coverage a good measure?) • Real-world empirical data are needed to perform the above analyses
Research questions • How to predict the reliability of design diversity on the basis of the failure data of each individual version? • How to evaluate the effectiveness of a test set? Is code coverage a good indicator?
Experimental description • Motivated by the lack of empirical data, we conducted the Redundant Strapped-Down Inertial Measurement Unit (RSDIMU) project • More than 100 students spent 12 weeks developing 34 program versions • 1200 test cases were executed on these program versions • 426 mutants were generated, each by injecting a single real fault identified during the testing phase • A number of analyses and evaluations were conducted in our previous work
Outline • Introduction • Background study • Reliability modeling • Testing effectiveness • Future work • Conclusion
Reliability models for design diversity • Eckhardt and Lee (1985): variation of difficulty over the demand space; positive correlation between version failures (conceptual model) • Littlewood and Miller (1989): forced design diversity; possibility of negative correlation (conceptual model) • Dugan and Lyu (1995): Markov reward model (structural model) • Tomek and Trivedi (1995): stochastic reward net (structural model) • Popov, Strigini et al. (2003): subdomains on the demand space; upper/lower bounds on failure probability (in between)
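The contrast between the first two models can be made precise. Let θ(x) be the probability that a version drawn at random from the development process fails on demand x, and let X be a random demand. A sketch of the standard argument (notation is ours, not from the slides):

```latex
% Eckhardt--Lee: both versions drawn from the same process
P(\text{both fail}) = E\!\left[\theta(X)^2\right]
                    = \left(E[\theta(X)]\right)^2 + \operatorname{Var}[\theta(X)]
                    \;\ge\; \left(E[\theta(X)]\right)^2
% so any variation of "difficulty" over the demand space forces positive correlation.

% Littlewood--Miller: versions forced to use different methodologies A and B
P(\text{both fail}) = E[\theta_A(X)\,\theta_B(X)]
                    = E[\theta_A(X)]\,E[\theta_B(X)]
                      + \operatorname{Cov}[\theta_A(X), \theta_B(X)]
% and under forced diversity the covariance can be negative.
```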
PS Model • Alternative estimates for the probability of failure on demand (pfd) of a 1-out-of-2 system
PS Model (cont’) • Upper bound on system pfd • “Likely” lower bound on system pfd, under the assumption of conditional independence within each subdomain
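A minimal sketch of the two estimates, assuming the demand space has already been split into subdomains with known per-version pfds; the numbers are illustrative, not from the PS paper:

```python
def ps_bounds(subdomains):
    """Popov--Strigini-style bounds on the pfd of a 1-out-of-2 system.

    subdomains: list of (P_i, pfd_A_i, pfd_B_i) tuples giving the demand-profile
    probability of each subdomain and the per-subdomain pfd of each version.
    """
    # Both versions fail at most as often as the better version in each subdomain.
    upper = sum(p * min(a, b) for p, a, b in subdomains)
    # "Likely" lower bound: conditional independence inside each subdomain.
    likely_lower = sum(p * a * b for p, a, b in subdomains)
    return likely_lower, upper

# Illustrative 3-subdomain profile.
demo = [(0.6, 0.01, 0.02), (0.3, 0.05, 0.01), (0.1, 0.10, 0.10)]
lo, up = ps_bounds(demo)
```

The gap between the two values quantifies how much the unknown within-subdomain correlation matters; finer subdomains narrow it.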
DL Model • Example: reliability model of the Distributed Recovery Block (DRB)
DL Model (cont’) • Fault tree models for 2-, 3-, and 4-version systems
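In the simplest cut set of such a fault tree, an N-version system with majority voting fails when enough versions fail at once. A sketch assuming independent, identically reliable versions; the published structural models add related-fault and decider-failure events on top of this:

```python
from math import comb

def nvp_failure_prob(q, n, m):
    """Failure probability of an m-out-of-n majority system with independent
    versions, each failing with probability q: the system fails when fewer
    than m versions succeed, i.e. when at least n - m + 1 versions fail."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(n - m + 1, n + 1))
```

For a 2-out-of-3 system with q = 0.01 this gives about 3.0e-4, versus 1e-2 for a single version; correlated failures erode exactly this gain, which is what the conceptual models quantify.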
Outline • Introduction • Background study • Reliability modeling • Testing effectiveness • Future work • Conclusion
Testing effectiveness • The key issue in software testing is test case selection and evaluation • What is a good test case? • testing effectiveness and completeness • fault coverage • To allocate testing resources, how to predict the effectiveness of a given test case in advance?
Testing effectiveness • Code coverage: an indicator of fault detection capability? • Positive evidence • High code coverage is associated with high software reliability and a low fault rate • Both code coverage and the number of faults detected grow over time as testing progresses • Negative evidence • Can this shared growth really be attributed to a causal dependency between code coverage and defect coverage, or merely to testing time?
Testing effectiveness (cont’) • Is code coverage a good indicator for fault detection capability? ( That is, what is the effectiveness of code coverage in testing? ) • Does such effect vary under different testing profiles? • Do different code coverage metrics have various effects?
Basic concepts: code coverage • Code coverage: the fraction of program code executed at least once during the test • Block coverage: the proportion of basic blocks executed • Decision coverage: the proportion of decisions executed • C-use coverage: computational uses of a variable exercised • P-use coverage: predicate uses of a variable exercised
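All of these metrics are ratios of exercised entities to total entities. A minimal sketch over hypothetical execution traces (block and decision identifiers are made up for illustration):

```python
def coverage(executed, total):
    """Generic coverage ratio: fraction of entities (blocks, decisions,
    c-uses, p-uses) exercised at least once during the test."""
    return len(set(executed) & set(total)) / len(set(total))

# Hypothetical program with 8 basic blocks and 4 decision outcomes.
all_blocks = range(8)
all_decisions = ["d1:true", "d1:false", "d2:true", "d2:false"]

block_cov = coverage([0, 1, 2, 3, 5], all_blocks)               # 5 of 8 blocks hit
decision_cov = coverage(["d1:true", "d2:true"], all_decisions)  # only true branches taken
```

Note that the same test set can score differently on different metrics: here block coverage is 62.5% while decision coverage is only 50%, since neither false branch was taken.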
Basic concepts: testing profiles • Functional testing: based on specified functional requirements • Random testing: test cases drawn from the input domain according to a predefined distribution function • Normal operational testing: based on normal operational system status • Exceptional testing: based on exceptional system status
Experimental requirements • A complex, real-world application • A large population of program versions • A controlled development process • Recorded bug history • Real faults studied • Our RSDIMU project satisfies all of the above requirements
Test case description • The 1200 test cases are grouped into six regions: I, II, III, IV, V, VI
The correlation between code coverage and fault detection Is code coverage a good indicator of fault detection capability? • In different test case regions • Functional testing vs. random testing • Normal operational testing vs. exceptional testing • In different combinations of coverage metrics
The correlation: various test regions • Test case contribution to block coverage • Test case contribution to mutant coverage
The correlation: various test regions • Goodness of fit of linear models in each test case region • Linear regression between block coverage and defect coverage over the whole test set
The correlation: various test regions • Linear regression between block coverage and defect coverage in region IV • Linear regression between block coverage and defect coverage in region VI
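The regressions reported on these slides can be reproduced in miniature. The coverage/kill numbers below are made up for illustration, not the RSDIMU data:

```python
def pearson_and_fit(xs, ys):
    """Least-squares line y = a + b*x and Pearson correlation r, in pure Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = my - b * mx            # intercept
    r = sxy / (sxx * syy) ** 0.5
    return a, b, r

# Block coverage (%) vs. mutants killed, per test case (illustrative values).
cov  = [40, 45, 50, 55, 60, 70]
kill = [ 3,  4,  6,  6,  8, 10]
a, b, r = pearson_and_fit(cov, kill)
```

A high r over a wide coverage range supports coverage as an indicator; the slides' point is that r varies sharply between regions and testing profiles.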
The correlation: various test regions Observations: • Code coverage: a moderate indicator • Reasons behind the large variance between regions IV and VI
The correlation: functional testing vs. random testing • Code coverage: a moderate indicator • Random testing: a necessary complement to functional testing • Achieves similar code coverage • Delivers high fault detection capability
The correlation: functional testing vs. random testing • Failure details of mutants that failed on fewer than 20 test cases: detected by 169 of the 800 functional test cases and 94 of the 400 random test cases
The correlation: functional testing vs. random testing • Number of mutants detected only by functional testing or only by random testing
The correlation: normal operational testing vs. exceptional testing • Operational and exceptional status are defined by the specification and are application-dependent • For the RSDIMU application • Operational status: at most two sensors failed in the input, and at most one more sensor failed during the test • Exceptional status: all other situations • The 1200 test cases are classified as operational or exceptional according to their inputs and outputs
The correlation: normal operational testing vs. exceptional testing • Normal operational testing • very weak correlation • Exceptional testing • strong correlation
The correlation: normal operational testing vs. exceptional testing • Normal testing: small coverage range (48%-52%) • Exceptional testing: two main clusters
The correlation: normal operational testing vs. exceptional testing • Number of mutants detected only by normal operational testing or only by exceptional testing
The difference between two pairs of testing profiles • The whole testing demand space can be classified into seven subsets according to system status Si,j: S0,0, S0,1, S1,0, S1,1, S2,0, S2,1, and Sothers, where i is the number of sensors failed in the input and j is the number of sensors failed during the test • Functional testing vs. random testing: large overlap across the seven system statuses • Normal testing vs. exceptional testing: no overlap across the seven system statuses • This may explain the different performance of code coverage as a testing-effectiveness indicator under the two pairs of testing profiles
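That classification can be sketched directly from the slide's definitions (the function names are ours):

```python
def system_status(failed_at_input, failed_during_test):
    """Map a test case to subset S_{i,j}; any case beyond i <= 2, j <= 1
    collapses into S_others."""
    i, j = failed_at_input, failed_during_test
    if i <= 2 and j <= 1:
        return f"S{i},{j}"
    return "S_others"

def is_operational(failed_at_input, failed_during_test):
    """Normal operational status per the RSDIMU specification: at most two
    sensors failed at input, at most one more fails during the test."""
    return failed_at_input <= 2 and failed_during_test <= 1
```

Note that the operational region coincides exactly with the six named subsets, so exceptional testing covers exactly S_others; this is why normal and exceptional testing have no overlap while functional and random testing do.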
The correlation: under different combinations • Combinations of testing profiles • Observations: • Combinations containing exceptional testing show strong correlations • Combinations containing normal testing inherit its weak correlation
The correlation: under different coverage metrics • Other metrics show patterns similar to block coverage • Differences are insignificant under normal testing • Decision and P-use coverage are related to control-flow changes • Larger variation in code coverage brings more faults detected
Discussions • Does the effect of code coverage on fault detection vary under different testing profiles? • A significant correlation exists for exceptional test cases, while there is no correlation for normal operational test cases • A higher correlation is revealed in functional testing than in random testing, but the difference is insignificant • Do different coverage metrics have various effects on this relationship? • No obvious difference with our experimental data
Discussions (cont’) • This is the first time the effect of code coverage on fault detection has been examined under different testing profiles • Overall, code coverage is a moderate indicator of testing effectiveness • The correlation within a small code coverage range is insignificant • Our findings on the positive correlation can guide the selection and evaluation of exceptional test cases
Future work • Generate 1 million test cases and exercise them on the current 34 versions to collect statistical failure data • Conduct a cross-comparison with a previous project to investigate the “variant” and “invariant” features of design diversity • Quantify the relationship between code coverage and testing effectiveness
Conclusion • Surveyed the evolution, techniques, applications, and modeling of software fault tolerance • Evaluated the performance of current reliability models for design diversity • Investigated the effect of code coverage under different testing profiles and found it to be a clear indicator of fault detection capability, especially for exceptional test cases
Q & A Thank you!