
Classifying Software Faults to Improve Fault Detection Effectiveness

Executive Briefing, NASA OSMA Software Assurance Symposium, September 9-11, 2008. Allen P. Nikora, JPL/Caltech.


Presentation Transcript


  1. Classifying Software Faults to Improve Fault Detection Effectiveness
     Executive Briefing, NASA OSMA Software Assurance Symposium, September 9-11, 2008
     Allen P. Nikora, JPL/Caltech
     This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The work was sponsored by the NASA Office of Safety and Mission Assurance under the Software Assurance Research Program led by the NASA Software IV&V Facility. This activity is managed locally at JPL through the Assurance and Technology Program Office.

  2. Agenda
     • Problem/Approach
     • Relevance to NASA
     • Accomplishments and/or Tech Transfer Potential
     • Next Steps

  3. Problem/Approach
     • All software systems contain faults
     • Different types of faults exhibit different types of failure behavior
     • Different types of faults require different identification techniques
     • Some faults are easier to find than others
     • The likelihood of detecting and removing software faults during development and testing, as well as the possible strategies for dealing with residual faults during mission operations, depend on the fault type
     • Goals are to:
       • Determine the relative frequencies of specific types of faults and identify trends in those frequencies
       • Develop effective techniques for identifying and removing faults or masking their effects
       • Develop guidelines, based on the analysis of faults and failures, for applying these techniques in the context of current and future missions

  4. Problem/Approach (cont’d)
     • What must be done?
       • Analyze software failure data (test and operations) from historical and current JPL and NASA missions and classify the underlying software faults
       • Further classify the faults by criticality (e.g., non-critical, significant mission impact, mission critical) and by detection phase
       • Perform statistical analysis (see the sketch after this slide):
         • Proportions of faults in each category
         • Conditional frequencies (e.g., percentage of critical faults among aging-related bugs, percentage of aging-related bugs among the critical faults)
         • Trends in conditional frequencies (within and across missions)
       • Determine criteria for further classifying faults (e.g., for aging-related bugs: faults causing round-off errors, faults causing memory leaks, etc.) to identify classes of faults with high criticality and low detectability
       • For highly critical faults that are difficult to detect prior to release, develop techniques for:
         • Identifying the component(s) most likely to contain these types of faults
         • Improving the detectability of the faults with model-based verification or static analysis tools, as well as during testing
         • Masking the faults via fault tolerance (e.g., software rejuvenation for aging-related faults); such techniques must be able to accurately distinguish between behavioral changes resulting from normal changes in the system’s operating environment and input space and those brought about by aging-related faults
       • Develop guidelines for implementing these techniques in the context of current and future missions
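To make the statistical-analysis item concrete, here is a minimal Python sketch of how fault-type proportions and the conditional frequencies mentioned above (e.g., the fraction of critical faults among aging-related bugs) could be computed once each failure record has been classified. The record fields and example data are illustrative assumptions, not the project's actual schema.

```python
# Minimal sketch (not the project's actual analysis code): given failure
# records already classified by fault type and criticality, compute the
# proportions and conditional frequencies described on the slide.
# The record fields and example data are hypothetical.
from collections import Counter

records = [
    # (fault_type, criticality) -- illustrative placeholders only
    ("bohrbug", "non-critical"),
    ("bohrbug", "mission-critical"),
    ("mandelbug", "significant"),
    ("aging-related", "mission-critical"),
    ("aging-related", "non-critical"),
]

total = len(records)
by_type = Counter(ftype for ftype, _ in records)

# Proportion of faults in each category
for ftype, count in by_type.items():
    print(f"P({ftype}) = {count / total:.2f}")

# Conditional frequencies: P(critical | aging-related) and P(aging-related | critical)
aging = [crit for ftype, crit in records if ftype == "aging-related"]
critical = [ftype for ftype, crit in records if crit == "mission-critical"]
p_crit_given_aging = sum(c == "mission-critical" for c in aging) / len(aging)
p_aging_given_crit = sum(t == "aging-related" for t in critical) / len(critical)
print(f"P(critical | aging-related) = {p_crit_given_aging:.2f}")
print(f"P(aging-related | critical) = {p_aging_given_crit:.2f}")
```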

  5. Relevance to NASA
     • Different types of faults have different types of effects; choose fault identification/mitigation strategies based on the types of failures encountered in the system being developed
       • Bohrbugs
         • Deterministically cause failures
         • Easiest to find during testing
         • Fault tolerance in the operational system can mainly be achieved with design diversity
       • Mandelbugs
         • Difficult to find, isolate, and correct during testing
         • Re-execution of an operation that failed because of a Mandelbug will generally not result in another failure
         • Fault tolerance can be achieved by simple retries or by more sophisticated approaches such as checkpointing and recovery-oriented computing
       • Aging-related bugs
         • The tendency to cause a failure increases with system run time
         • Proactive measures that clean the internal system state (software rejuvenation) and thus reduce the failure rate are useful (a minimal rejuvenation-trigger sketch follows this slide)
         • Aging can be a significant threat to NASA software systems (e.g., continuously operating planetary exploration spacecraft flight control systems), since aging-related faults are often difficult to find during development
     • Related work
       • Rejuvenation has been implemented in many different kinds of software systems, including telecommunications systems, transaction processing systems, and cluster servers
       • Various types of software systems, such as web servers and military systems, have been found to age
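Since the slide points to software rejuvenation as the proactive countermeasure for aging-related faults, the following hypothetical Python sketch shows one simple form a rejuvenation trigger could take: restart a long-running worker once its uptime or observed memory use crosses a threshold. The thresholds, helper names, and restart mechanism are illustrative assumptions, not part of the presentation.

```python
# Hypothetical sketch of a usage-based rejuvenation trigger: restart a worker
# when its accumulated run time or resident memory crosses a threshold,
# rather than waiting for an aging-related failure to occur.
import resource
import time

MAX_UPTIME_S = 24 * 3600   # assumed policy: rejuvenate at least once per day
MAX_RSS_KB = 512 * 1024    # assumed policy: rejuvenate above ~512 MB resident

def should_rejuvenate(start_time: float) -> bool:
    uptime = time.time() - start_time
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux
    return uptime > MAX_UPTIME_S or rss_kb > MAX_RSS_KB

def do_one_unit_of_work() -> None:
    time.sleep(1)            # stand-in for the real workload

def checkpoint_state() -> None:
    pass                     # stand-in: persist state that must survive a restart

def restart_cleanly() -> None:
    raise SystemExit(0)      # stand-in: a supervisor would launch a fresh process

def worker_loop() -> None:
    start = time.time()
    while True:
        do_one_unit_of_work()
        if should_rejuvenate(start):
            checkpoint_state()
            restart_cleanly()
```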

  6. Accomplishments and/or Tech Transfer Potential
     • Collected over 40,000 failure records from the JPL problem reporting system
       • Operational failures and failures observed during system test and ATLO (Assembly, Test, and Launch Operations)
       • All failures (software and non-software)
       • Over two dozen projects represented
         • Planetary exploration
         • Earth orbiters
         • Instruments
     • Continued analysis of software failures
       • Classified flight software failures for 18 projects
       • Classification of ground software failures for the same 18 missions in progress
     • Completed statistical analysis of flight software failure data
     • Started applying machine learning/data mining techniques to improve classification accuracy (a minimal text-classification sketch follows this slide):
       • Software vs. non-software failures
       • Types of software failures
       • Supervised vs. unsupervised learning
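As a concrete illustration of the supervised-learning direction mentioned above, here is a minimal scikit-learn sketch that separates software from non-software problem reports using TF-IDF features over their free-text descriptions. The example reports, labels, and model choice are assumptions for illustration, not the project's actual pipeline.

```python
# Minimal supervised text-classification sketch (not the project's actual
# pipeline): train a bag-of-words model to separate software from
# non-software problem reports using their free-text descriptions.
# The example reports and labels are fabricated placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "flight software task overran its deadline and reset the processor",
    "command dictionary update rejected by the sequencing tool",
    "connector pin bent during integration",
    "thermal blanket installation did not match the drawing",
]
labels = ["software", "software", "non-software", "non-software"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reports, labels)

# Classify a new, unseen report description
print(model.predict(["memory leak observed in ground data processing pipeline"]))
```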

  7. Next steps
     • Complete analysis of failures
       • Complete analysis of ground software ISAs by the end of September 2008
       • Complete statistical analyses for all failures to identify trends (a minimal trend-test sketch follows this slide):
         • Proportions of software failures
         • Proportions of Bohrbugs vs. Mandelbugs vs. aging-related bugs
     • Complete experiments with machine learning/data mining; identify the most appropriate failure data representations and learning models to distinguish between:
       • Software and non-software failures – find additional software failures in the problem reporting system and classify them; this can improve the accuracy of software failure type classification
       • Different types of software failures
     • Based on analyses of proportions and trends in the failure data, identify/develop appropriate fault prevention/mitigation strategies (e.g., software rejuvenation)
     • Other software improvement/defect analysis tasks and organizations at JPL have expressed interest in collaborating with this effort:
       • JPL Software Product and Process Assurance Group
       • JPL Software Quality Improvement project
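One simple way to check for differences in fault-type proportions across missions, as a precursor to the trend analyses listed above, is a chi-square test of homogeneity. The sketch below uses fabricated counts and scipy, and is only illustrative of the kind of test that could be applied.

```python
# Illustrative sketch (fabricated counts, not mission data): test whether the
# Bohrbug / non-aging Mandelbug / aging-related bug mix differs significantly
# across missions, as a first step toward identifying trends in proportions.
from scipy.stats import chi2_contingency

# Rows: missions; columns: Bohrbugs, non-aging Mandelbugs, aging-related bugs
counts = [
    [60, 30, 10],
    [55, 35, 10],
    [45, 40, 15],
]

chi2, p_value, dof, _ = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# A small p-value would suggest the fault-type proportions are not the same
# across these missions, motivating a closer look at mission-to-mission trends.
```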

  8. Backup Information

  9. Fault Classifications
     • Classification Scheme: the following definitions of software fault types are based on [Grottke05a, Grottke05b]:
       • Mandelbug := A fault whose activation and/or error propagation are complex, where “complexity” can be caused either by interactions of the software application with its system-internal environment (hardware, operating system, other applications) or by a time lag between the fault activation and the occurrence of a failure. Typically, a Mandelbug is difficult to isolate, and/or the failures it causes are not systematically reproducible. (Sometimes, Mandelbugs are – incorrectly – referred to as Heisenbugs.)
       • Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack “complexity.” Complementary antonym of Mandelbug.
       • Aging-related bug := A fault that leads to the accumulation of internal error states, resulting in an increased failure rate and/or degraded performance. Sub-type of Mandelbug.
     • According to these definitions, the classes of Bohrbugs, aging-related bugs, and non-aging-related Mandelbugs partition the space of all software faults (see the sketch after this slide).
     • References:
       • [Grottke05a] M. Grottke and K. S. Trivedi, “Software faults, software aging and software rejuvenation,” Journal of the Reliability Engineering Association of Japan, 27(7):425–438, 2005.
       • [Grottke05b] M. Grottke and K. S. Trivedi, “A classification of software faults,” Supplemental Proc. Sixteenth International Symposium on Software Reliability Engineering, 2005, pp. 4.19–4.20.
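The three-way partition described above can be captured directly in code. The sketch below only illustrates the exhaustive, disjoint classification from [Grottke05a, Grottke05b]; the boolean attributes used to pick a class are assumptions for illustration, not a published decision procedure.

```python
# Sketch of the fault-class partition: Bohrbugs, non-aging-related Mandelbugs,
# and aging-related bugs are mutually exclusive and jointly exhaustive.
from enum import Enum, auto

class FaultClass(Enum):
    BOHRBUG = auto()                 # easily isolated, manifests consistently
    NON_AGING_MANDELBUG = auto()     # complex activation/propagation, no aging
    AGING_RELATED_BUG = auto()       # Mandelbug whose error states accumulate

def classify(complex_activation: bool, accumulates_error_state: bool) -> FaultClass:
    """Map two (assumed) observations onto exactly one of the three classes."""
    if accumulates_error_state:
        return FaultClass.AGING_RELATED_BUG   # sub-type of Mandelbug
    if complex_activation:
        return FaultClass.NON_AGING_MANDELBUG
    return FaultClass.BOHRBUG

print(classify(complex_activation=True, accumulates_error_state=False))
```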

  10. Mission Characteristics Summary

  11. Analysis Results
      • Fault type proportions for the eight projects with the largest number of unique faults

  12. Analysis Results (cont’d)
      • Proportion of Bohrbugs for the four earlier missions
      • Proportion of non-aging-related Mandelbugs for the four earlier missions

  13. Analysis Results (cont’d)
      • Proportion of Bohrbugs for missions 3 and 9, and 95% confidence interval based on the four earlier missions
      • Proportion of Bohrbugs for missions 6 and 14, and 95% confidence interval based on the four earlier missions

  14. Analysis Results (cont’d)
      • Proportion of non-aging-related Mandelbugs for missions 3 and 9, and 95% confidence interval based on the four earlier missions
      • Proportion of non-aging-related Mandelbugs for missions 6 and 14, and 95% confidence interval based on the four earlier missions (a minimal interval-computation sketch follows this slide)
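As an illustration of the 95% confidence intervals referenced in slides 13 and 14, the sketch below computes a normal-approximation (Wald) interval for a fault-type proportion from pooled counts. The counts are fabricated, and the presentation does not state which interval method was actually used.

```python
# Illustrative sketch: 95% confidence interval for a fault-type proportion
# using the normal (Wald) approximation; counts are fabricated placeholders.
from math import sqrt

def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald 95% confidence interval for a binomial proportion."""
    p_hat = successes / n
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# e.g., 120 Bohrbugs among 200 classified faults pooled from the earlier missions
low, high = proportion_ci(successes=120, n=200)
print(f"Bohrbug proportion 95% CI: [{low:.3f}, {high:.3f}]")
# A later mission whose observed proportion falls outside this interval would
# suggest a shift in the fault-type mix worth investigating.
```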

  15. Machine Learning/Text Mining Results
      • Flight software failures vs. all other failures

  16. Machine Learning/Text Mining Results
      • Ground software failures vs. all other failures

  17. Machine Learning/Text Mining Results
      • Flight and ground software failures vs. all other failures

  18. Machine Learning/Text Mining Results
      • Procedural/process errors vs. all other failures
