Injection of Faults for SW Validation

Injection of Faults for SW Validation Henrique Madeira, 6/out/2007

What is fault injection? Deliberate insertion of upsets (faults or errors) in computer systems to evaluate its behavior in the presence of faults or validate specific fault tolerance mechanisms in computers.

Fault tolerant mechanisms Faults, Errors, and Failures Error Failure Fault

Software Implemented Fault Injection(e.g., Xception) Reproduce pin-level fault injection by software Fault injection approaches • Pin-level fault injection(e.g., RIFLE) Injection of hardware faults only!

Why injecting software faults? • Software faults are the major cause of computer system outages • Goals: • Experimental risk assessment in component-based software development • Dependability evaluation of COTS components • Robustness testing • Fault tolerance layer evaluation • Dependability benchmarking

Output Input SW component under test Interface faults Output Input Target SW component Software faults Two possible injection points • Injection of interface faults in software components (classical robustness testing) Injection of realistic software faults inside software components (new approach)

Ex. 1 - interface faults Robustness failure in RTEMS 4.5.0 Excerpt of application code: requestedSize1 = 4294967295; returnStatus = rtems_region_get_segment (regionId, requestedSize1, option, timeout, ptsegment1); Result: Memory exception at fffffffc (illegal address) Unexpected trap (0x09) at address 0x0200aaac Data access exception at 0xfffffffc This software fault was discovered automatically by the Xception robustness testing tool!

Software faults are injected in the floppy device driver using the G-SWFIT technique. • The device is heavily used by programs. Windows XP Windows 2000 Windows NT Availability Feedback Stability Worst Worst Worst Best Best Best Ex. 2 - injection of software faults What happens if a software bug in the floppy device driver becomes active?

Components/COTS based software • Finegrain components: • Some middleware comp. • User interface small components. • Libs. • Etc. • Coarsegrain components: • Middleware comp. • Web servers • DBMS • OS Trend → use general-purpose COTS components and develop domain specific components and glue code.

Control application + Real-time operating system COTS in small scale systems

SW Components Software components Different sizes Different levels of granularity

Question 1 This is a COTS! What’s the risk of using it in my system?

Question 2 This is custom component previously built! What’s the risk of reusing it in my system?

Question 3 This is a new custom component! What’s the risk of using it without further testing?

Injection of SW faults Comp. 1 Comp. 2 Custom COTS Comp. 4 Custom Comp. 3 COTS Exception handler Comp. 5 Custom Injection of SW faults Measuring impact of faults in SW components • Inject SW faults in the component • Measure system response (failure modes) to failures in the target component. • Evaluate how critical component failures are

Comp. 1 Comp. 2 Custom COTS Comp. 4 Custom Comp. 3 COTS Exception handler Comp. 5 Custom Injection of software faults Software complexity metrics Measuring risk of faults in SW components What’s the risk of using Component 3 in my system? Risk = prob. of bug * prob. of bug activation * impact of bug activation

Comp. 1 Comp. 2 Custom COTS Comp. 4 Custom Comp. 3 COTS Exception handler Comp. 5 Custom Injection of SW faults Software fault injection • Goal: improve system dependability through: • Measuring the impact of software faults in software components. • Evaluating the risk of faults in software components • Used when there is a systems prototype (real or even simulated). • Require representative: • Operational profile • Injected faults

Two research problems • How to inject realistic software faults? • Is it really possible to estimate meaningful component risk using fault injection + software complexity metrics?

Correctness from the end user point of view: too vague What is a software fault? Software development process (in theory...) Requirements Specification Design Code development Test Deployment OK OK The requirements + specification are correct but the deployed code is not

Characterization of software faults A SW fault is characterized by the change in the code that is necessary to correct it (Orthogonal Defect Classification). Defined according two parameters: • Fault triggerconditions that make the fault to be exposed • Fault type type of mistake in the code

Types of software faults (ODC) • Assignmentvalues assigned incorrectly or not assigned • Checkingmissing or incorrect validation of data, or incorrect loop, or incorrect conditional statement • Timing/serializationmissing or incorrect serialization of shared resources • Algorithmincorrect or missing implementation that can be fixed without the need of design change • Functionincorrect or missing implementation that requires a design change to be corrected

Representative software faults? • Field data on real software errors is the most reliable information source on which faults should be injected • Typically, this information is not made public • Open source projects provide information on past (discovered) software faults

Open source field data survey

Additional step over basic ODC • Hypothesis: Faults are considered as programming elements (language constructs) that are either: • Missing E.g. Missing part of a logical expression • Wrong E.g. Wrong value used in assignment • Extraneous E.g. Surplus condition in a test

Fault characterization

Fault distribution across ODC types • There is a clear trend in fault distribution • Previous research (not open source) confirms this trend • Some faults types are more representative than others: Assignment, Checking, Algorithm

Fault nature across ODC • Missing and wrong elements are the most frequent ones • This trend is consistent across the ODC types tested

Fault characterization per program 1 – Missing constructs faults are the more frequent ones 2 – Extraneous constructs are relatively infrequent 3 – This trend is consistent across the programs tested

The “Top-N” software faults

01011000100X001 010110001X01001 010110X01001001 01X110001001001 G-SWFIT technique Target executable code Low-level code mutation engine Low level mutated versions 010110001001001 ... Library of low level mutation operators Emulate common programmer mistakes The technique can be applied to binary files prior to execution or to in-memory running processes

Fault operator example 1 missing and-expression in condition Target source code (avail. not necessary) Code with intended fault if ( a==3 && b==4 ) { do something } if ( a==3 && b==4) { do something } Original target code (executable form) Target code with emulated fault\ cmp dword ptr off_a[ebp],3 jne short ahead cmp dword ptr off_b[ebp],4 jne short ahead ; ... do something ... ahead: ... ; remaining prog. code cmp dword ptr off_a[ebp],3 jne short ahead nop nop nop ; ... do something ... ahead: ... ; remaining prog. code The actual mutation is performed in executable (binary) code. Assembly mnemonics are presented here for readability sake

Fault operator example 2 assignment instead equality comparison Target source code (avail. not necessary) Code with intended fault if (v1 == v2) { ... } if (v1 = v2) { ... } Original target code (executable form) Target code with emulated fault MOV reg, mem1 CMP reg, mem2 JNE ahead ; ... ahead: ; ... MOV reg, mem2 MOV mem1, reg CMP reg, 0 JE ahead ; ... ahead: Some restrictions are enforced (e.g. it must not be preceded by a function call pattern to avoid func() = = val becoming func() = val) This fault is not the most common one, but it illustrates a mutation more complex than the previous one

Definition of the low-level mutation operators library

Validation of the technique • Accuracy? Are the low-level faults actually equivalent to the high-level bugs? • Generalization and portability Is the technique dependent on the compiler, optimization settings, high-level language, processor architecture, etc?

Ex. 1 of results on accuracy validation Operator: Missing or bad return statement LZari Camelot GZip Low-level High-level

Ex. 2 of results on accuracy validation Operator: Assignment instead equality comparison LZari Camelot GZip Low-level High-level

Generalization of the technique • Use of different compiler optimization settings • Use of different compilers (Borland C++, Turbo C++, Visual C++) • Use of different high‑level languages (C, C++, Pascal) • Different host architectures (Intel 80x86, Alpha AXP). The library of fault operators (code patterns + mutations) depends essentially on the target architecture.

Current use of G-SWFIT • Dependability benchmarking • DBench-OLTP: database and OLTP systems Already used to benchmark Oracle 8i, Oracle9i, and PostgreSQL running on top of Windows 2K, Windows XP, and Linux. • WEB-DB: web servers Already used to benchmark Apache and Abyss web servers running on top of Windows 2K, Windows XP, and Windows 2003. • Independent verification and validation in NASA IV&V case-studies.

Two research problems • How to inject realistic software faults? • Is it really possible to estimate meaningful component risk using fault injection + software complexity metrics?

Risk estimation Statistical Model based on Logistic Regression Riskc = prob(fc) * cost(fc) Software Fault Injection prob(factivated) * c(failure)

x Commands … … DHS PR PL DHS PR PL Ground Ground DHS DHS PR PR PL PL Ground Ground Software fault Injection (G-SWFIT) Control Control Control Control x Telemetry Linux Linux Linux Linux RTEMS or RTLinux RTEMS RTEMS RTEMS CDMS CDMS RS232 Case study: satellite data handling system • Evaluate feasibility of risk assessment approach • Compare measured risk of using RTEMS versus RTLinux

Steps to Obtain Residual Fault Estimation • Evaluate the complexity metrics of each component (LOC, Cyclomatic Complexity, etc) • Obtain the preliminary estimation of fault density combining standard estimation and component LOC to replace field observation • Use binomial distribution estimating the number of lines with residual faults • Apply the regression using the value obtained from natural logarithm of the preliminary fault density and the chosen metrics aim to obtain the coefficients • Estimate the probability of fault of each component by using the computed coefficients in the logistic equation

Probability of residual fault • Xi represents the product metrics •  and  are estimated logistic regression coefficients • exp is the base of the natural logarithms (2.718 …) From: - M.-H. Tang, M.-H. Kao, M.-H. Chen, “An Empirical Study on Object-Oriented Metrics”, In: Proceedings of the Sixth International Software Metrics Symposium, pp. 242-249, 1999. - D. Hosmer, S. Lemeshow, “Applied Logistic Regression”. John Wiley & Sons, 1989

Estimation of the probability of residual fault

Impact of software faults in the each OS Correct – there are no errors reported and the result is correct Wrong – the workload terminates but the results are not correct Crash – the application terminates abruptly before the workload complete Hang – when the application is not able to terminate in the pre-determined time

Risk evaluation example Risk = prob. of bug * prob. of bug activation * impact of bug activation prob(f) impact(f)

Conclusion • Using software fault injection to measure the impact of components faults is an established technique to: • Assess the impact of component faults (experimental criticality analysis). • Evaluate dependability improvements after system changes or patches. • Experimental risk assessment still requires further research, especially in what concerns to the validation of the approach.

henrique@dei.uc.pt

Injection of Faults for SW Validation