SENG 521 Software Reliability & Testing

SENG 521 Software Reliability & Testing: Defining Necessary Reliability (Part 3)

Contents • Steps in defining necessary reliability • Failure severity class (FSC) • Failure intensity objective (FIO) • Strategies to meet FIO • System reliability • Reliability economics

SRE: Process /1 • 5 steps in SRE process: • Define necessary reliability • Develop operational profiles • Prepare for test • Execute test • Apply failure data to guide decisions

Part 3 Section 1 How to define Necessary Reliability?

Necessary Reliability: How to • Define failure with “failure severity classes (FSC)” for the product. • Set a “failure intensity objective (FIO)” for each system to be tested. • Choose a common scale for all associated systems. • Find the developed software failure intensity objective. • Engineer strategies to meet the software failure intensity objective.

1. Failure Severity Classes • Failures usually differ in their impact on the system • A failure severity class (FSC) is a set of failures that have the same per-failure impact on users, as judged by a failure classification criterion • Common classification criteria: • cost, system capability, human life, environment • Failure severity is different from failure complexity • Severity can change with the time at which the failure occurs

FSC: Common Classification • Common classification criteria: Cost • What does this failure cost in terms of operational cost, repair cost, loss of business, disruption, etc.? • Cost-based severity classes may be spaced a factor of 10 apart. • Usually 4 ranges are enough.
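A minimal sketch of cost-based classes spaced a factor of 10 apart. The function name `cost_severity_class` and the top threshold of 100,000 cost units are hypothetical illustrations, not values from the slides; each project must choose its own boundaries.

```python
def cost_severity_class(cost: float,
                        top_threshold: float = 100_000.0,
                        classes: int = 4) -> int:
    """Return 1 (most severe) .. classes (least severe) for a per-failure cost.

    Assumption: class boundaries are top_threshold, top_threshold/10, ...
    (hypothetical values chosen only for illustration).
    """
    if cost < 0:
        raise ValueError("cost must be non-negative")
    threshold = top_threshold
    for severity in range(1, classes):
        if cost >= threshold:
            return severity
        threshold /= 10.0  # next class covers the next lower decade of cost
    return classes


for cost in (250_000, 50_000, 5_000, 500):
    print(cost, "-> class", cost_severity_class(cost))
```

With four classes, everything below the lowest boundary falls into class 4.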

FSC: Common Classification • Common classification criteria: System capability (Services) • May include factors such as loss of data, downtime, recoverability, etc.

FSC: Common Classification • Common classification criteria: Environment • May include factors such as harm to the environment, loss of wildlife, etc. • Applicable to the nuclear and chemical industries, etc.

FSC: Common Classification • Common classification criteria: Human life • May include factors such as harm to humans or the environment, loss of human life, etc. • Applicable to the aeronautical, automotive, nuclear, and health care industries, military systems, etc.

How to Define FSC? • Experience based: ask users, stakeholders, and developers; compare with similar products • List all factors that may be considered as failure severity for the project • Narrow the list down to the most critical and/or measurable ones • Some factors may be hard to measure, such as impact on company reputation, etc.

FSC: Conflicting Concerns • Conflicting viewpoints (concerns) between the software developer and the customer regarding failure severity classes (FSC) should be resolved before proceeding to set the failure intensity objective • Comparing the FSC for the software with those of a similar product is usually useful

Documenting FSC • Define classes for each criterion separately

2. Failure Intensity Objective (FIO) • Failure intensity objective (FIO) reflects an estimate of the “bugs” allowed to remain in the product at release time. • FIO is an alternative way of expressing reliability.

Failure Intensity Objective • Failure intensity is usually given as the number of failures per unit of time (or per some other defined unit), e.g., • 3 alarms per 100 hours of operation. • 5 failures per 1000 print jobs, etc. • Failure intensity of a system is the sum of the failure intensities of all components of the system (assuming the exponential model).

How to Set FIO /1 • Mainly experience based and depends on the project. • Depends on the trade-off among quality characteristics (development time and development cost) and functionality and technology. • Rule of thumb: Estimate the project’s total cost (C), e.g., using COCOMO’s Early Design Model, etc., and set the FIO to 1/C, i.e., one failure per C units of operation, assuming that the cost of the highest-impact failure is equal to the total development cost.

How to Set FIO /2 • Typical FIO for various projects

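How to Set FIO: Reliability • Setting FIO in terms of reliability: $R = e^{-\lambda t}$, equivalently $\lambda = -\ln R / t$ • $\lambda$ is failure intensity • $R$ is reliability • $t$ is the number of natural units (time, etc.) • For a reliability of about 0.992 over 8 hours of operation, $\lambda$ is set to 0.001 failures per hour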
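The conversion between reliability and failure intensity can be sketched as follows, assuming the exponential model $R = e^{-\lambda t}$. The function names are illustrative only.

```python
import math


def failure_intensity_from_reliability(reliability: float, interval: float) -> float:
    """Return lambda such that exp(-lambda * interval) equals the reliability."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    if interval <= 0:
        raise ValueError("interval must be positive")
    return -math.log(reliability) / interval


def reliability_from_failure_intensity(failure_intensity: float, interval: float) -> float:
    """Return R = exp(-lambda * t) for a constant failure intensity."""
    return math.exp(-failure_intensity * interval)


# Slide example: reliability of about 0.992 over 8 hours of operation.
lam = failure_intensity_from_reliability(0.992, 8.0)
print(f"lambda = {lam:.4f} failures per hour")
print(f"R(8 h) = {reliability_from_failure_intensity(0.001, 8.0):.3f}")
```

The interval must be expressed in the same unit as the failure intensity (hours here).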

Reliability & Failure Intensity

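How to Set FIO: Availability • Setting FIO in terms of system availability (A) for the exponential model: $A = \frac{1}{1 + \lambda t_m}$, equivalently $\lambda = \frac{1 - A}{A \, t_m}$ • $\lambda$ is failure intensity • $t_m$ is downtime per failure • e.g., if a product must be available 99% of the time and downtime per failure is 6 min (0.1 hour), then the FIO is about 0.1 failures per hour, i.e., about 1 failure per 10 hours.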
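A short sketch of the availability-based calculation, assuming the relation $\lambda = (1 - A) / (A \, t_m)$, where $t_m$ is downtime per failure. The function name is illustrative only.

```python
def fio_from_availability(availability: float, downtime_per_failure: float) -> float:
    """Return failures per time unit, in the unit used for downtime_per_failure."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    if downtime_per_failure <= 0:
        raise ValueError("downtime per failure must be positive")
    return (1.0 - availability) / (availability * downtime_per_failure)


# Slide example: 99% availability, 6 minutes (0.1 hour) of downtime per failure.
lam = fio_from_availability(0.99, 6.0 / 60.0)
print(f"FIO = {lam:.3f} failures per hour (about 1 per {1.0 / lam:.1f} hours)")
```

Expressing the downtime in hours yields an FIO in failures per hour.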

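How to Set FIO: MTTF • Using MTTF: $\lambda = \frac{1}{\mathrm{MTTF}}$ • $\lambda$ is failure intensity • MTTR is mean time to repair • MTTF is mean time to failure • Another definition of availability: $A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$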
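The MTTF-based quantities can be sketched as below, assuming a constant failure intensity so that $\lambda = 1/\mathrm{MTTF}$ and $A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})$. The function names and the numbers in the example are illustrative, not taken from the slides.

```python
def failure_intensity_from_mttf(mttf: float) -> float:
    """Return lambda = 1 / MTTF (constant failure intensity assumed)."""
    if mttf <= 0:
        raise ValueError("MTTF must be positive")
    return 1.0 / mttf


def availability_from_mttf(mttf: float, mttr: float) -> float:
    """Return A = MTTF / (MTTF + MTTR); both arguments share one time unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


# Illustrative values: a failure every 9.9 hours on average, 0.1 hour to repair.
print(f"lambda = {failure_intensity_from_mttf(9.9):.3f} failures per hour")
print(f"A      = {availability_from_mttf(9.9, 0.1):.2f}")
```

These values reproduce an availability of 0.99 with roughly one failure per 10 hours.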

How to Set FIO: Hazard Rate • Hazard Rate z(t): The probability that the component will fail in a given time interval given that it has not failed prior to the interval • Hazard rate of 0.05 means that there is a 5% chance that the first failure will occur in the specified time interval and not before • For the exponential distribution, z(t) is constant: $z(t) = \lambda$

Reliability vs. Availability • Why specify reliability when availability is better understood and has better intuitive appeal? • Availability has a subjective appeal to the user, and there are usually workarounds that make the system available without increasing its intrinsic reliability. • Example: Using a replica server in case the domain server goes down increases the availability of the system, but it does not necessarily increase the reliability of the server software.

Developed Software Product • Developed software product is usually only a part of the whole system • [Diagram: the whole system comprises interfaces to other systems, acquired components, developed components, OS and system software, and hardware]

3. Choose a Common Scale • There may be various scales for expressing FIO for various project parts. • Example: • System failure intensity objective = 30 failures per 1,000,000 transactions • MTTF for OS is 3,000 hours for 10 million transactions • Hardware failure intensity is 1 failure per 30 hours of operation • One must define a single common scale for all FIOs

FIO for Developed Product • How to compute the failure intensity objective for the developed software? • Set the FIO for the whole system • Set a common measurement unit for failure intensity for the whole system • Subtract the expected failure intensity of acquired components from the FIO. • Subtract the expected failure intensity of the environment (OS, interface systems) that the developed software will run on • The remainder is the failure intensity objective for the developed software components.

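Computing Developed FIO Example 1: • System failure intensity objective = 100 failures per 1,000,000 transactions • Failure intensity for hardware = 0.1 failure/hour • OS failure intensity at a load of 100,000 transactions = 0.4 failure/hour • Converting to the common scale (taking the load as 100,000 transactions per hour, which the stated result implies): hardware = 1 failure per 1,000,000 transactions, OS = 4 failures per 1,000,000 transactions • Therefore, developed software FIO = 100 - 1 - 4 = 95 failures per 1,000,000 transactions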
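The subtraction in Example 1 can be sketched as follows. The function names are illustrative, and the load of 100,000 transactions per hour is an assumption: it is the reading of the slide under which the stated result of 95 follows.

```python
def per_million_transactions(failures_per_hour: float, transactions_per_hour: float) -> float:
    """Convert a per-hour failure intensity to failures per 1,000,000 transactions."""
    if transactions_per_hour <= 0:
        raise ValueError("load must be positive")
    return failures_per_hour / transactions_per_hour * 1_000_000


def developed_software_fio(system_fio: float, other_intensities: list[float]) -> float:
    """Subtract acquired-component and environment intensities from the system FIO."""
    remaining = system_fio - sum(other_intensities)
    if remaining <= 0:
        raise ValueError("system FIO is already consumed by the other components")
    return remaining


LOAD = 100_000.0  # transactions per hour (assumed reading of the slide)
hardware = per_million_transactions(0.1, LOAD)  # about 1 per million
operating_system = per_million_transactions(0.4, LOAD)  # about 4 per million
developed = developed_software_fio(100.0, [hardware, operating_system])
print(f"developed software FIO = {developed:.1f} failures per 1,000,000 transactions")
```

All intensities must be on the common scale before they are subtracted.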

Computing Developed FIO Example 2: Database system running on Win 2K • System failure intensity objective = 30 failures per 1,000,000 transactions • MTTF for Win 2K is around 3,000 hours for 10 million transactions • Average hardware failure intensity is 1 failure per 30 hours • Failure intensity for other systems is 9 failures per 1,000,000 transactions • What is the FIO for the developed software?

Computing Developed FIO

4. Strategies to Meet FIO • Engineer strategies to meet the software failure intensity objective for the developed software. • 4 main strategies: • Fault prevention • Fault removal • Fault tolerance • Fault/failure forecasting

Fault Prevention • To avoid fault occurrences by construction. • Activities: • Requirement review • Design review • Clear code • Establishing standards (ISO 9000-3, etc.) • Using CASE tools with built-in check mechanisms • Effectiveness factor: • Proportion of the faults remaining after prevention activities.

Fault Removal • To detect, by verification and validation, the existence of faults and eliminate them. • Activities: • Code review • Test • Effectiveness factor: • Reduction of failure intensity due to code review. • Ratio of failure intensity after test and before test.

Fault Tolerance • To provide, by redundancy, service complying with the specification in spite of fault occurrences. • Activities: • Designing and implementing redundancy • Effectiveness factor: • Reduction of failure intensity as a result of redundant design.

Fault / Failure Forecasting • To estimate, by evaluation, the presence of faults and the occurrences of failures • Activities: • Establishing reliability model • Collecting failure data • Analysis and interpretation of results • Effectiveness factor: • Reduction of failure intensity as a result of applying reliability engineering

Part 3 Section 2 System Reliability

System Reliability /1 • A system usually consists of components. • Each component consists of sub-components. • Components may have • Different reliability • Different dependencies among each other • System reliability is a function of the reliabilities of the (sub-) components and of the relationships between the components.

Serial System Reliability • System is composed of n independent serially connected components. • Failure of any component has a cross-system effect, i.e., results in failure of the whole system. • A serial system is never more reliable than any of its components (because $R_k \le 1$). • [Diagram: components $R_1/\lambda_1$, $R_2/\lambda_2$, ... connected in series]

Combining Reliabilities /1 • Serial system reliability can be calculated from component reliabilities, if the components fail independently of each other. • For serial systems: $R_{system} = \prod_{k=1}^{Q_p} R_k$ • $Q_p$ is the number of components • $R_k$ is the reliability of component $k$ • Component reliabilities ($R_k$) must be expressed with respect to a common interval.

Combining Reliabilities /2 • Using the relation between reliability and failure intensity: $R_k = e^{-\lambda_k t}$ • Will lead to: $\lambda_{system} = \sum_{k=1}^{Q_p} \lambda_k$ • i.e., the total failure intensity is the sum of the failure intensities of the components

Example: Serial System • The system is composed of 4 independent serially connected components • R1 = 0.95 • R2 = 0.87 • R3 = 0.82 • R4 = 0.73 • Rsystem = 0.95 × 0.87 × 0.82 × 0.73 = 0.4947 • Serial system reliability is smaller than any individual reliability of the components
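The serial example can be checked with a few lines of code. This sketch assumes independent components whose reliabilities refer to a common interval; the function name is illustrative.

```python
import math
from typing import Sequence


def serial_reliability(reliabilities: Sequence[float]) -> float:
    """Reliability of independent components in series: the product of R_k."""
    if any(not 0.0 <= r <= 1.0 for r in reliabilities):
        raise ValueError("each reliability must be in [0, 1]")
    return math.prod(reliabilities)


components = [0.95, 0.87, 0.82, 0.73]
r_system = serial_reliability(components)
print(f"R_system = {r_system:.4f}")
# The system is no more reliable than its weakest component.
print(r_system <= min(components))
```

`math.prod` requires Python 3.8 or later.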

Parallel System Reliability • System is composed of n independent components connected in parallel. • The whole system fails only if all components fail (principle of active redundancy). • [Diagram: components $R_1$, $R_2$, ... connected in parallel]

Example: Parallel System • The system is composed of 4 independent components connected in parallel • R1 = 0.95 • R2 = 0.87 • R3 = 0.82 • R4 = 0.73 • Rsystem = 1 – ((1 – 0.95) × (1 – 0.87) × (1 – 0.82) × (1 – 0.73)) = 1 – 0.0003159 ≈ 0.9997 • Parallel system reliability is greater than any individual reliability of the components
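The parallel example can be checked the same way. This sketch assumes independent components under active redundancy, so the system fails only when every component fails; the function name is illustrative.

```python
import math
from typing import Sequence


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """Reliability of independent parallel components: 1 - product of (1 - R_k)."""
    if any(not 0.0 <= r <= 1.0 for r in reliabilities):
        raise ValueError("each reliability must be in [0, 1]")
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


components = [0.95, 0.87, 0.82, 0.73]
r_system = parallel_reliability(components)
print(f"R_system = {r_system:.7f}")
# The system is at least as reliable as its best component.
print(r_system >= max(components))
```

The product term is the probability that all four components fail together.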

Parallel-Series System • [Diagram: m × n grid of components $R_{ij}$ ($i = 1..m$, $j = 1..n$); one row is labelled "path"]

Series-Parallel System • [Diagram: m × n grid of components $R_{ij}$ ($i = 1..m$, $j = 1..n$); one column is labelled "subsystem"]
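A sketch of how the two grid layouts could be evaluated. It rests on an assumed reading of the lost diagrams: in the parallel-series system each row of $R_{ij}$ is a serial path and the m paths are in parallel, while in the series-parallel system each column is a parallel subsystem and the n subsystems are in series. Components are assumed independent, and all names and values are illustrative.

```python
import math
from typing import Sequence

Grid = Sequence[Sequence[float]]  # grid[i][j] holds R_ij


def series(reliabilities: Sequence[float]) -> float:
    return math.prod(reliabilities)


def parallel(reliabilities: Sequence[float]) -> float:
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def parallel_series_reliability(grid: Grid) -> float:
    """Rows are serial paths; the paths are combined in parallel."""
    return parallel([series(row) for row in grid])


def series_parallel_reliability(grid: Grid) -> float:
    """Columns are parallel subsystems; the subsystems are combined in series."""
    columns = list(zip(*grid))
    return series([parallel(column) for column in columns])


grid = [[0.9, 0.8],
        [0.9, 0.8]]  # illustrative values only
print(f"parallel-series: {parallel_series_reliability(grid):.4f}")
print(f"series-parallel: {series_parallel_reliability(grid):.4f}")
```

For the same components, redundancy at the component level (series-parallel) gives a reliability at least as high as redundancy at the path level (parallel-series).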

Other Constructs • One-way bridge • Two-way bridge

Active Redundancy • Employs parallel systems. • All components are active at the same time. • Each component is able to meet the functional requirements of the system. • Only one component is required to meet the functional requirements of the system. • Each component satisfies the minimum reliability condition for the system. • System only fails if all components fail.

m-out-of-n System • System has n components. • At least m components need to work correctly for the system to function properly (m ≤ n). • m = n: serial system • m = 1: parallel system • e.g., an airplane with 4 engines can fly with only 2 engines. • Assumption: All components have the same reliability. • [Diagram: components $R_1$, $R_2$, ..., $R_i$, ..., $R_n$ feeding an m/n block]
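A sketch of the m-out-of-n calculation using the binomial distribution. It assumes, as the slide does, that all n components have the same reliability R, and additionally that they fail independently. The function name and the value R = 0.9 are illustrative.

```python
from math import comb


def m_out_of_n_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n identical, independent components work."""
    if not 1 <= m <= n:
        raise ValueError("require 1 <= m <= n")
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))


# Airplane example: 4 engines, at least 2 must work; R = 0.9 is illustrative.
print(f"2-out-of-4: {m_out_of_n_reliability(2, 4, 0.9):.4f}")
print(f"4-out-of-4 (serial):   {m_out_of_n_reliability(4, 4, 0.9):.4f}")
print(f"1-out-of-4 (parallel): {m_out_of_n_reliability(1, 4, 0.9):.4f}")
```

The two limiting cases reduce to the serial and parallel formulas, $R^n$ and $1 - (1 - R)^n$.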

Reliability Block Diagram (RBD) • Reliability Block Diagram (RBD) is a graphical representation of how the components of a system are connected from a reliability point of view. • The most common configurations of an RBD are the series and parallel configurations. • In a serial configuration, all elements must work for the system to work, and the system fails if any one of the components fails. The overall reliability of a serial system is lower than the reliability of its individual components. • In a parallel configuration, the components are considered redundant, and the system ceases to work only if all the parallel components fail. The overall reliability of a parallel system is higher than the reliability of its individual components. • A system is usually composed of combinations of serial and parallel configurations. • RBD analysis is essential for determining the reliability, availability, and downtime of the system.

RBD: Example /1
