Performance Analysis for Software Risk Mitigation

7/22/04 Report Back: Performance Analysis Track Dr. Carol Smidts, Wes Deadrick

Track Members • Carol Smidts (UMD) – Track Chair • Integrating Software into PRA • Ted Bennett and Paul Wennberg (Triakis) • Empirical Assurance of Embedded Software Using Realistic Simulated Failure Modes • Dolores Wallace (GSFC) • System and Software Reliability • Bojan Cukic (WVU) • Compositional Approach to Formal Models • Kalynnda Berens (GRC) • Software Safety Assurance of Programmable Logic • Injecting Faults for Software Error Evaluation of Flight Software • Hany Ammar (WVU) • Risk Assessment of Software Architectures

Agenda • Characterization of the Field • Problem Statement • Benefits of Performance Analysis • Future Directions • Limitations • Technology Readiness Levels

Characterization of Field • Goal: Prediction and Assessment of Software Risk/Assurance Level (Mitigation optimization) • System Characteristics of interest • Risk (Off-nominal situations) • Reliability, availability, maintainability = Dependability • Failures - general sense • Performance Analysis Techniques - modeling and simulation, data analysis, failure analysis, design analysis focused on criticality

Problem Statement • Why should NASA do performance analysis? - We care if things fail! • Successfully conducting SW and System Performance Analysis gives us the data necessary to make informed decisions in order to improve performance and overall quality • Performance analysis permits: • Ability to determine if/when system meets requirements • Risk reduction and quantification • Application of new knowledge to future systems • A better understanding of the processes by which systems are developed and therefore enables NASA to exercise continual improvement

Benefits of Performance Analysis • Reduced development and operating costs • Manage and optimize current processes thereby resulting in more efficient and effective processes • Defined and repeatable process – reduced time to do same volume of work • Reduces risk and increases safety and reliability • Better software architecture designs • More maintainable systems • Enable NASA to handle more complex systems in the future • Put the responsibility where it belongs from an organizational perspective - focuses accountability

Future Directions for Performance Analysis • Automation of modeling and data collection – increased efficiency and accuracy • A more useful, better reliability model • useful = user friendly (enable the masses not just the domain experts), increased usability of the data (learn more from what we have) • better = greater accuracy and predictability • Define and follow repeatable methods/processes for data collection and analysis including: • education and training • use of simulation • gold nugget = accurate and complete data

Future Directions for Performance Analysis (Cont.) • Develop a method for establishing accurate performance predictions earlier in life cycle • Evolve to refine system level assessment • factor in the human element • Establish and define an approach to performing trade-off of attributes – reliability, etc. • Need for early guidance on criticality of components • Optimize a defect removal model • Methods and metrics for calculating/defending return on investment of conducting performance analysis

Why not • Standard traps - Obstacles • Uncertainty about scalability • User friendliness • Lack of generality • “Not invented here” syndrome • Costs and benefits • Difficult to assess and quantify • Long term project benefit tracking recommended

Technology Readiness Level • Integrating Software into PRA – Taxonomy (7) • Test-Based Approach for Integrating SW in PRA (3) • Empirical Assurance of Embedded Software Using Realistic Simulated Failure Modes (5) • Maintaining system and SW test consistency (8) • System Reliability (3) • Software Reliability (9) • Compositional Approach to Formal Models (2) • Software Safety Assurance of Programmable Logic (2) • Injecting Faults for Software Error Evaluation of Flight Software (9) • Risk Assessment of Software Architectures (5)

Research Project Summaries

Integrating Software Into PRA Dr. Carol Smidts, Bin Li Objective: • PRA (probabilistic risk assessment) is a methodology to assess the risk of large technological systems • The objective of this research is to extend the current classical PRA methodology to account for the impact of software on mission risk

Integrating Software Into PRA (Cont) Achievements • Developed a software related failure mode taxonomy • Validated the taxonomy on multiple projects (ISS, Space Shuttle, X38) • Proposed a step-by-step approach to integration in the classical PRA framework with quantification of input and functional failures.
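As a rough illustration of what adding software contributions to a classical PRA quantification can look like, the sketch below combines one hardware basic event with software input and functional failure events under an OR gate. The probabilities, the independence assumption, and the function name `p_any` are hypothetical and are not taken from the taxonomy or the step-by-step approach described above.

```python
from math import prod

def p_any(probabilities):
    """Probability that at least one independent basic event occurs (OR gate)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical per-mission basic-event probabilities (illustrative only).
P_HARDWARE = 1e-3
P_SW_INPUT_FAILURE = 2e-4       # software receives a bad input
P_SW_FUNCTIONAL_FAILURE = 5e-4  # software computes the wrong function

# Classical PRA view: hardware only.
p_hw_only = p_any([P_HARDWARE])

# Extended view: software input and functional failures added as basic events.
p_with_sw = p_any([P_HARDWARE, P_SW_INPUT_FAILURE, P_SW_FUNCTIONAL_FAILURE])

print(f"hardware only:  {p_hw_only:.6f}")
print(f"with software:  {p_with_sw:.6f}")
```

Treating the events as independent keeps the arithmetic simple; a real model would need to justify or replace that assumption.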

TRIAKIS Corporation Problem • Most embedded SW faults found at integration test are traceable to requirements and interface misunderstanding • A disconnect exists between the system and software development loops [Figure: system and software development loops; labels: Requirements; Build; Model, Simulate, Prototype, ES, etc.; Analyze/Test/V&V; Analyze/Test/Verify; Interpretation; Design/Debug; Integration Testing]

TRIAKIS Corporation Approach • Develop & simulate entire system design using executable specifications (ES) • Verify total system design with suite of tests • Simulate controller hardware • Replace controller ES with simulated HW running object (flight) software • Test SW using system verification tests When SW passes all system verification tests, it has correctly implemented all of the tested requirements
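The idea of reusing one suite of system verification tests for both the executable specification and the software that replaces it can be sketched as follows. The thermostat-style controller, its thresholds, and all names here are hypothetical stand-ins; in the approach above the second function would be object (flight) software running on simulated hardware, not a Python function.

```python
def controller_es(temp_c: float, heater_on: bool) -> bool:
    """Executable specification: heater on below 18 C, off above 22 C, else hold."""
    if temp_c < 18.0:
        return True
    if temp_c > 22.0:
        return False
    return heater_on

def controller_sw(temp_c: float, heater_on: bool) -> bool:
    """Stand-in for the software implementation of the same requirement."""
    return temp_c < 18.0 or (heater_on and temp_c <= 22.0)

# System verification tests: (temperature, current heater state, expected command).
SYSTEM_TESTS = [
    (15.0, False, True),
    (20.0, True, True),
    (20.0, False, False),
    (25.0, True, False),
    (18.0, False, False),
    (22.0, True, True),
]

def run_suite(controller):
    """Run every system test against the given controller; return pass/fail flags."""
    return [controller(t, h) == expected for t, h, expected in SYSTEM_TESTS]

es_ok = all(run_suite(controller_es))   # verifies the system design
sw_ok = all(run_suite(controller_sw))   # same tests, controller ES swapped out
print("ES passes:", es_ok, "| SW passes:", sw_ok)
```

Because the tests are written against system behavior, swapping the controller does not require rewriting them.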

Empirical Assurance of Embedded SW Using Realistic Simulated Failure Modes TRIAKIS Corporation, IV&V Facility, Mini-AERCam • Problem: FMEA Limitations • Expensive & time-consuming • List of possible failure modes extensive • Focuses on prioritized subset of failure modes • Approach: Test SW with simulated failures • Create pure virtual simulation of Mini-AERCam HW & flight environment running on PC • Induce realistic component/subsystem failures • Observe flight SW response to induced failures • Can we improve coverage by testing SW response to simulated failures? • Compare results with project-sponsored FMEA, FTA, etc.: number of failure modes evaluated? Number of issues uncovered? Effort involved?
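A toy version of inducing component failures in a simulation and observing the software response might look like the sketch below. The sensor class, the failure-mode names, and the controller logic are all hypothetical; they stand in for the Mini-AERCam simulation and flight software, whose details are not given here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimulatedRangeSensor:
    """Hypothetical simulated sensor with injectable failure modes."""
    true_range_m: float
    failure: Optional[str] = None  # None, "stuck_zero", or "dropout"

    def read(self) -> Optional[float]:
        if self.failure == "stuck_zero":
            return 0.0          # output frozen at zero
        if self.failure == "dropout":
            return None         # no data delivered
        return self.true_range_m

def controller_step(reading: Optional[float]) -> str:
    """Toy stand-in for the software under test: returns a commanded mode."""
    if reading is None or reading <= 0.0:
        return "SAFE_HOLD"      # invalid reading: stop maneuvering
    if reading < 5.0:
        return "BACK_AWAY"
    return "APPROACH"

def run_campaign(failure_modes):
    """Inject each failure mode in turn and record the software response."""
    results = {}
    for mode in failure_modes:
        sensor = SimulatedRangeSensor(true_range_m=20.0, failure=mode)
        results[mode] = controller_step(sensor.read())
    return results

results = run_campaign([None, "stuck_zero", "dropout"])
for mode, response in results.items():
    print(f"failure={mode!r:14} response={response}")
```

Each added failure mode costs one more loop iteration, which is the sense in which simulation may cover more modes than a prioritized FMEA subset.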

Software and System Reliability Dolores Wallace, Bill Farr, Swapna Gokhale • Addresses the need to evaluate and assess the reliability and availability of large complex software intensive systems by predicting (with associated confidence intervals): • The number of software/system faults, • Mean time to failure and restore/repair, • Availability, • Estimated release time from testing.
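For orientation, the sketch below computes point estimates of mean time to failure, mean time to repair, and steady-state availability from a small failure log, using the textbook relation A = MTTF / (MTTF + MTTR). The log values and function names are hypothetical, confidence intervals are omitted, and this is not one of the SMERFS^3 models.

```python
from statistics import mean

def mttf(uptimes_hours):
    """Mean time to failure: average length of the observed operating intervals."""
    return mean(uptimes_hours)

def mttr(repair_hours):
    """Mean time to repair: average length of the observed outages."""
    return mean(repair_hours)

def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical failure log (hours); illustrative only.
uptimes = [120.0, 95.0, 140.0, 105.0]
repairs = [2.0, 3.0, 1.0, 2.0]

availability = steady_state_availability(mttf(uptimes), mttr(repairs))
print(f"MTTF={mttf(uptimes):.1f} h, MTTR={mttr(repairs):.1f} h, A={availability:.4f}")
```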

2003 & 2004 Research 2003 (Software Based) • Literature search completed • New models were selected: 1) Enhanced Schneidewind (includes risk assessment and trade-off analysis) and 2) Hypergeometric Model • Incorporated the new software models into the established public domain tool SMERFS^3 • Applied the new models on a Goddard software project • Made the latest version of SMERFS^3 available to the general public 2004 (System Based) • Conducted similar research effort for System Reliability and Availability • Will enhance SMERFS^3 and validate the system models on a Goddard data set

A Compositional approach to Validation of Formal Models Dejan Desovski, Bojan Cukic • Problem • Significant number of faults in real systems can be traced back to specifications. • Current methodologies of specification assurance have problems: • Theorem Proving: Complex • Model Checking: State explosion problems • Testing: Incomplete. • Approach • Combine them! • Use test coverage to build abstractions. • Abstractions reduce the size of the state space for model checking. • Develop visual interfaces to improve the usability of the method.
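The effect of abstraction on model checking can be illustrated with a small reachability check. In the sketch below, a tank controller with 202 concrete states is collapsed into buckets, and a safety property is checked on the abstract graph only. The model, the bucket boundaries, and all names are hypothetical; the boundaries are simply assumed to come from the branch thresholds a test suite exercises, since the source does not describe how coverage is turned into abstractions.

```python
from collections import deque

LEVELS = range(0, 101)   # level 100 represents overflow
MODES = ("FILL", "DRAIN")

def step(state):
    """Concrete transition function of the toy tank controller."""
    mode, level = state
    if mode == "FILL":
        return ("DRAIN", level) if level >= 90 else ("FILL", level + 1)
    return ("FILL", level) if level <= 10 else ("DRAIN", level - 1)

def bucket(level):
    """Abstraction of the level; thresholds assumed to come from covered branches."""
    if level >= 100:
        return "overflow"
    if level >= 90:
        return "high"
    if level <= 10:
        return "low"
    return "mid"

def alpha(state):
    """Map a concrete state to its abstract state."""
    mode, level = state
    return (mode, bucket(level))

def abstract_transitions():
    """Existential abstraction: an abstract edge exists if any member state has it."""
    edges = {}
    for mode in MODES:
        for level in LEVELS:
            s = (mode, level)
            edges.setdefault(alpha(s), set()).add(alpha(step(s)))
    return edges

def reachable(edges, start):
    """Breadth-first search over the abstract transition graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in edges.get(cur, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = abstract_transitions()
abstract_reach = reachable(edges, alpha(("FILL", 0)))
safe = all(b != "overflow" for _, b in abstract_reach)
print(f"abstract states explored: {len(abstract_reach)}, overflow unreachable: {safe}")
```

Because the abstraction over-approximates the concrete behavior, an unreachable abstract overflow state implies the concrete system cannot overflow either; the reverse does not hold, so a reachable bad abstract state would call for refinement.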

Software Fault Injection Process Kalynnda Berens, Dr. John Crigler, Richard Plastow • Standardized approach to test systems with COTS and hardware interfaces • Provides a roadmap of where to look to determine what to test [Flowchart box labels (ordering lost in extraction): Start; Obtain Source Code and Documentation; Identify Interfaces and Critical Sections; Error/Fault Research; Estimate Effort Required; Sufficient time and funds? (Yes); Importance Analysis; Select Subset; Test Case Generation; Fault Injection Testing; Feedback to FCF Project; Document Results, Metrics, Lessons Learned; End]

Programmable Logic at NASA Kalynnda Berens, Jacqueline Somos • Issues • Lack of good assurance of PLCs and PLDs • Increasing complexity = increasing problems • Usage and Assurance Survey - SA involved in less than 1/3 of the projects; limited knowledge • Recommendations • Trained SA for PLCs • PLDs – determine what is complex; use process assurance (SA or QA) • Training Created • Basic PLC and PLD training aimed at SA • Process assurance for hardware QA

Year 2 of Research • What are industry and other government agencies doing for assurance and verification? • An intensive literature search of white papers, manuals, standards, and other documents that illustrated what various organizations were doing. • Focused interviews with industry practitioners. Interviews were conducted with assurance personnel (both hardware and software) and engineering practitioners in various industries, including biomedical, aerospace, and control systems. • Meeting with FAA representatives. Discussions with FAA representatives led to a more thorough understanding of their approach and the pitfalls they have encountered along the way. • Position paper, with recommendations for NASA Code Q

Current Effort • Implement some of the recommendations • Develop coursework to educate software and hardware assurance engineers • Three courses • PLCs for Software Assurance personnel • PLDs for Software Assurance personnel • Process Assurance for Hardware QA • Guidebook • Other recommendations • For Code Q to implement if desired • Follow-up CSIP to try software-style assurance on complex electronics

Severity Analysis Methodology Hany Ammar, Katerina Goseva-Popstojanova, Ajith Guedem, Kalaivani Appukutty, Walib AbdelMoez, and Ahmad Hassan • We have developed a methodology to assess severity of failures of components, connectors, and scenarios based on UML models • This methodology is applied to NASA’s Earth Observing System (EOS)

Requirement Risk Analysis Methodology • We have developed a methodology for assessing requirements-based risk using normalized dynamic complexity and severity of failures. This can be used in the DDP process developed at JPL. • According to Dr. Martin Feather’s DDP Process, “The Requirements matrix maps the impacts of each failure mode on each requirement.” • Requirements are mapped to UML use cases and scenarios • A failure mode refers to the way in which a scenario fails to achieve its requirement [Figure: Risk factor of scenario S1 in failure mode FM2]
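One plausible way to combine the two ingredients named above is to multiply a scenario's normalized dynamic complexity by a severity weight for each failure mode. The sketch below does that for a few scenario and failure-mode pairs. The product form, the normalization by the total, the severity weights, and the complexity values are all assumptions made for illustration and are not confirmed by the slides.

```python
# Assumed severity weights per failure class (illustrative only).
SEVERITY_WEIGHT = {
    "minor": 0.25,
    "major": 0.50,
    "critical": 0.75,
    "catastrophic": 0.95,
}

def normalize(complexities):
    """Scale dynamic complexities so they sum to 1 across all entries."""
    total = sum(complexities.values())
    return {key: value / total for key, value in complexities.items()}

def risk_factor(norm_complexity, severity):
    """Assumed risk factor: normalized complexity times severity weight."""
    return norm_complexity * SEVERITY_WEIGHT[severity]

# Hypothetical (scenario, failure mode) measurements.
dynamic_complexity = {("S1", "FM1"): 12, ("S1", "FM2"): 30, ("S2", "FM1"): 18}
severity = {("S1", "FM1"): "minor", ("S1", "FM2"): "critical", ("S2", "FM1"): "major"}

norm = normalize(dynamic_complexity)
risk = {key: risk_factor(norm[key], severity[key]) for key in norm}

# Rank scenario / failure-mode pairs from highest to lowest risk factor.
for (scenario, mode), value in sorted(risk.items(), key=lambda kv: -kv[1]):
    print(f"{scenario} in {mode}: risk factor {value:.3f}")
```

A table of such values, indexed by scenario and failure mode, is the kind of input a requirements matrix like the one quoted from DDP could consume.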

What to Read • Key works in the field • Tutorials • Web sites • Will be completed at a later time
