1. Safety and Software Engineering Jim Land
BSAE WVU, 1962
MSAE USC, 1965
Member SAE
Co-Founder, High Integrity Solutions, Ltd.
President, Irvine Labs, Inc.
Associate USC CSE
2. A View of Safety "A life without adventure is likely to be unsatisfying, but a life in which adventure is allowed to take whatever form it will is likely to be short." (Bertrand Russell)
Whatever we do, we are exposed to degrees of unsafe situations
There are extremes
Fear of leaving the house
Unrestrained high-danger sports
In general, society sets levels of acceptable risk
Individually, we operate within these levels as set by society
In these next two classes we will look at aspects of safety and its impact on systems and software engineering
3. High Assurance We are really interested in a class of systems called High Assurance Systems (HAS)
Types of HAS
Safe Systems
Secure Systems
Systems in which the financial impact of failure is large (banking systems)
Systems in which the environmental impact of failure is large (nuclear, waste water, etc.)
For ease, we will limit our consideration to safe systems
4. Our Field of View Where software is likely to be involved
Software is involved in systems that demand high-integrity safety in two ways:
Preventative -- software is used to assure an acceptable level of safety in the system architecture, or
Causative -- software could be a contributing cause of a system's failure to deliver an acceptable level of safety
5. A Two Day Overview
Day One: Engineering for Safety, a Perspective
High Integrity Overview
Definitions and Concepts
Examples of High Integrity Systems
Popular examples of Safety failures
Specifying High Integrity Systems
Domain, Requirements and Stakeholder Needs
Analyzing High Integrity Systems
Methods and Techniques
Industrial Standards
Day Two: Developing Safer Systems
Architectures
Redundancy
Fault Tolerance
The Byzantine Generals Problem
Development Processes
Assuring Safer Systems
Safety Engineering
Verification and Validation Concepts
Certification
A Compendium of Tools
6. Some good references
7. What is Safe? (See IEC 61508) Safe is measured in the eyes of the beholder
A condition of exposure under which there is a practical certainty that no harm will result to exposed individuals.
Free from danger or the risk of harm.
"Freedom from unacceptable risk of harm" (IEC 61508)
Harm (in the context of a system)
Physical injury, or damage to health, property, or the environment, that may be caused by the system
Economic harm
Risk is the probable rate of occurrence of a hazard causing harm and the degree of severity of the harm. Risk has two elements:
the frequency with which a hazard occurs, and
the consequences of the hazardous event.
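As a minimal sketch of how these two elements combine in practice (a toy risk index, not from IEC 61508; the thresholds and numbers are hypothetical), whose three screening bands mirror the shunned / insignificant / in-between structure of the next slide:

    # Toy risk screening: risk combines the frequency of the hazardous
    # event with the severity of its consequences. Thresholds are
    # illustrative only, not from any standard.

    def risk_index(frequency_per_year: float, severity: int) -> float:
        # severity: 1 = negligible ... 4 = catastrophic
        return frequency_per_year * 10 ** severity

    def screen(index: float) -> str:
        if index >= 10.0:
            return "intolerable: must be shunned or redesigned"
        if index <= 0.01:
            return "broadly acceptable: insignificant"
        return "reduce as low as reasonably practicable"

    # A hazard expected about once in 100 years with severity 3 (major):
    idx = risk_index(frequency_per_year=0.01, severity=3)
    print(idx, "->", screen(idx))   # 10.0 -> intolerable under these toy thresholds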
8. Safety Integrity Level Level of Safety
how far safety should be pursued in a given context, assessed against an acceptable level of risk based on the values of society. To achieve an acceptable level of risk, we need to determine whether:
(a) the risk is so great that it must be shunned; or
(b) the risk is, or has been made, so small as to be insignificant; or
(c) the risk falls between (a) and (b) and has been reduced to the lowest level practicable (bearing in mind the benefits flowing from its acceptance and taking into account the costs of any further reduction).
Safety Integrity Level (SIL)
9. FAA AC 25.1309-1A
10. Safety Integrity Level (SIL)
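The SIL tables behind these two slide titles are not in the transcript. For continuous or high-demand operation, IEC 61508 assigns SILs by bands of probability of dangerous failure per hour (PFH); low-demand mode uses average probability of failure on demand instead. A small lookup, for concreteness:

    # IEC 61508 SIL bands for continuous / high-demand mode, expressed as
    # probability of a dangerous failure per hour (PFH).
    SIL_BANDS_PFH = [
        (4, 1e-9, 1e-8),
        (3, 1e-8, 1e-7),
        (2, 1e-7, 1e-6),
        (1, 1e-6, 1e-5),
    ]

    def sil_for_pfh(pfh: float):
        # Return the SIL whose band contains pfh, else None (out of range).
        for sil, low, high in SIL_BANDS_PFH:
            if low <= pfh < high:
                return sil
        return None

    print(sil_for_pfh(5e-8))   # -> 3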
11. More Definitions
Error: A design flaw or deviation from a desired or intended state
May or may not lead to a hazard
An error is a state (contrast with a failure, which is a behavior)
Fault: A higher-order safety-related event
All failures are faults
Not all faults are failures (the 9/11 airplanes worked perfectly as they were flown into the twin towers)
Hazard: State or condition of a system that leads to an accident
Accident: an event that results in at least a specified level of loss
Failure: inability of the system to perform its intended function; a behavior
Reliability: The probability that the system will perform its intended function satisfactorily for a prescribed time under stipulated environmental conditions
The aggregate probability of failure of the system
Reliability is determined by bottom-up failure modes and effects analysis
A numerical approach (see the sketch after this list)
Dependable: (not usually used for safety analysis) the trustworthiness of a system which allows reliance to be justifiably placed on the service it delivers (IFIP definition)
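A minimal sketch of the numerical approach mentioned above, under the usual constant-failure-rate assumption R(t) = exp(-lambda * t); the rates here are hypothetical:

    import math

    # Reliability of one element with a constant failure rate over a mission.
    def reliability(failure_rate_per_hour: float, hours: float) -> float:
        return math.exp(-failure_rate_per_hour * hours)

    # Series system: any element failing fails the system, so element
    # reliabilities multiply.
    r1 = reliability(1e-4, 10.0)   # element 1 over a 10-hour mission
    r2 = reliability(5e-5, 10.0)   # element 2 over the same mission
    r_system = r1 * r2
    print(f"system reliability: {r_system:.6f}")            # ~0.998501
    print(f"aggregate failure probability: {1 - r_system:.6f}")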
12. The Concept of System Safety System Safety
Uses Systems Theory and Systems Engineering
Prevent foreseeable accidents
Minimize impact of unforeseen events
Emphasis on Loss
Life and injury, but also,
environmental, economic, mission
Note: in HAZOP or FHA the emphasis is on a team of experts performing the core functions; in System Safety it is more the practice of Systems Theory and general systems engineering, performing the analysis with the underlying theories. The experienced domain expert may still be relied on to anticipate events.
13. Tenets of System Safety
Build it in rather than:
Build it on or
Test it out
Considers the whole system, not just elements
Takes a large scope on Hazards
Not all failures lead to a hazard
Not all hazards are caused by failures
Emphasis on analysis rather than experience
Anticipate how hazards can occur
Emphasis on qualitative rather than quantitative
14. Validation and Verification Validation answers the question, "Have we got the requirements right?"
Major errors in system development are made because we don't get the requirements or the domain description right
Issues include: completeness and consistency, conformance to standards, conflicting requirements, errors, ambiguity, and whether it can be built
Verification answers the question, "Have we implemented a system that satisfies the requirements?"
The verification effort can take as much as 70% to 80% of the development effort
Note: it is in getting the requirements right (including the domain description) that many costly mistakes are made. "Right" means: clear, precise, consistent, unambiguous, complete, able to be built and maintained, etc.
15. Modern Examples of High Integrity Systems Airplane
Fly-by-wire systems
Air traffic control systems
Terminal navigation system
TCAS (Traffic Alert and Collision Avoidance System)
Automotive
Automatic car following cruise control
Automotive automatic braking systems
Passenger protection systems (New Lexus)
Other Transportation Modes
Rail and Transit
Waterway
Industrial
Nuclear power plant control system
Hydro-electric or fossil power plant control systems
Water distribution systems
Chemical processing facilities
Waste processing facilities
Medical Systems
Banking transaction systems
16. Well Known Examples of Failure to Keep Safe Nuclear
Three Mile Island
Chernobyl
Aviation
Domestic
International
Rail
Recent Los Angeles Commuter Rail
Subway Systems
Railway Accidents
Highway Systems
Bridges: Tacoma Narrows
Industrial: Bhopal, India chemical plant accident
Questions for the Software Engineer:
Did software contribute to the accident?
Could better use of systems and software engineering have helped avoid the accident?
These are the two ways in which software is involved in safety systems engineering
17. North American Aviation's Worst Accidents
18. International Aviation's Worst Air Accidents
19. AA 191 Takeoff From O'Hare Under normal circumstances an aircraft losing an engine would be able to fly on the remaining power plants still functioning.
When the engine separated, it took a 3-foot section of the wing with it, ripping out vital hydraulic and electric lines.
The starboard slats stayed extended but the port slats retracted because of the leaking fluid, causing a stall.
The crew was unaware of the retraction because the No. 1 generator, lost with the engine, powered the Captain's instrument panel and thus the slat disagreement warning system.
The stick-shaker had also been disabled.
Investigators traced the separation to a 10-inch fracture in the rear bulkhead of the pylon.
Eight weeks before the accident, the aircraft went through a major check in which the self-aligning bearings on the bulkhead-to-wing attachment joints were changed.
Normal procedures would involve removing the engine and pylon from the wing separately, using a special cradle to lower the engine; to save time, a new procedure was adopted using a forklift truck to take the whole assembly off as one unit.
A combination of issues: design, maintenance, operations, and a lack of operator, manufacturer, and FAA coordination and communication.
As is usually the case, a number of cascading events resulted in disaster.
20. The Last Thirty Seconds at Tenerife Many contributing factors: an overly busy airport due to diversions, a bomb in the terminal, fog, language confusion
Worst air disaster in history: 583 deaths
21. Causes of Air Accidents Airplane Design
ATC and Navigation
Cargo
Collisions
External Factors
Flight Crew
Fire
Landing and Takeoff
Maintenance
Result: CFIT, emergency, etc.
Security
Weather
Unknown
Usually, a combination of factors
22. Some Observations Accidents in which large numbers of people die are the ones that get our attention (the dread factor)
But there are typically hundreds of accidents in air travel each year in which few or no deaths occur
Many of these could have ended with a different result
Many accidents occur over water and out of contact with land; most of these are attributed to mechanical failure
Overall, human error is the root cause of most accidents
New generation aircraft are using more computers and offer more opportunity for software failure; yet they are safer
Systems and software can impact safety in a number of ways design, support to maintenance and operations, etc.
23. Keeping Things in Perspective Commercial air transport is, minute for minute, far safer than driving to the local market
The only safer mode of transportation is the elevator!
Not the bicycle
Not even walking
It keeps getting safer by the year, in spite of increased air travel
Third-generation aircraft (777, 737NG, A330, etc.) are three times safer than earlier generations
Aircraft design
Air operations and emphasis on safety
But we have the dread factor
We aren't in control
A lot of people go at once
The long wait to die
Fear of the Unknown
24. Safety is improving with time Number of years to death at one cross-country round trip per week = 33,000.
From 1984 to 2003 the mileage flown doubled (6B in 2003), a challenge to keep up technology insertion.
How to view the trend: as a straight line, or, throwing out the gross outliers, an increase from 1984 to the early 90s followed by a decrease. New technology insertion? IFATC, terminal operations, airplane design?
25. Causes of Failure in Safety
26. Example of Failure in Domain Description Tacoma Narrows
27. Bhopal, India Disaster 1984, the worst industrial disaster in the world
15,000 deaths attributed
100,000s affected long after
Caused by the introduction of water into MIC (methyl isocyanate) holding tanks. The resulting reaction generated many large surges of toxic gas, forcing the emergency release of pressure.
The gas escaped while the chemical 'scrubbers' that should have treated the gas were off-line for repairs.
It is claimed that several other safety procedures were bypassed [Wikipedia]
A modern computer-based control system would have built-in safety monitoring features and controls and could have prevented the accident conditions from occurring.
Human error was the primary factor, but there were others: the switch from US to Indian operators, experienced staff quitting because of poor conditions, the plant being run by an electrical engineer rather than a chemical engineer, etc.
A well designed computer management system could have avoided this accident, even with the old physical plant.
28. Three Mile Island Three Mile Island Unit 2 (TMI-2) nuclear power plant near Middletown, Pennsylvania, on March 28, 1979
Most serious nuclear power plant accident in US history
Brought about major changes in:
Emergency Response Planning Emergency Response Facility Data System (ERFDS)
Plant operations training
Human Factors Engineering
Government regulatory oversight
Deployment of nuclear power in the USA
"The sequence of certain events (equipment malfunctions, design-related problems and worker errors) led to a partial meltdown of the TMI-2 reactor core but only very small off-site releases of radioactivity." (NRC)
29. TMI-2 Plant Diagram
30. Sequence of Events at TMI-2 4 AM, March 28, 1979, main feedwater pump stops working
Turbine shuts down
Reactor shuts down
Pressure build up in primary section, leads to pilot relief valve opening at top of pressurizer
It should have closed after pressure relief, but it didn't
Indicators failed to show operators that valve still open
Excessive loss of water in system
Uninformed operators reduced the flow of water to the core, resulting in a partial core meltdown
Aftereffects
Release of radiation from secondary building to relieve pressure on the core
Hydrogen bubble buildup in the containment facility
Fortunately, no rupture in containment building
In 1993, 14 years later, the cleanup was completed; the site is still monitored
A cascade of errors led to the hazard and the accident.
Some of the aftereffects were:
1. the imposition of the ERFDS, an independent monitoring and response system;
2. more attention to hazards assessment and the use of Probabilistic Risk Assessment (PRA), as is being used by NASA for manned space flight.
31. Modern Rail Accidents AMAGASAKI, Japan
52 died, 417 injured
April 25, 2005
Worst rail accident in Japan in 40 years
32. Ways of Assessing Risk System Safety Assessment (SSA)
Failure Modes, Effects and Criticality Analysis (FMEA/FMECA)
Probabilistic Risk Assessment (PRA)
33. Probabilistic Risk Assessment PRA as an analytical tool includes consideration of the following:
Identification and delineation of the combinations of events that, if they occur, could lead to an accident (or other undesired event);
Estimation of the chance of occurrence for each combination; and
Estimation of the consequences associated with each combination.
In nuclear applications, PRA focuses on damage to the reactor core and containment facility
Applied to the total fuel cycle
Four questions answered by PRA (NASA Study results)
1. What can go wrong, or what are the initiators or initiating events (undesirable starting events) that lead to adverse consequences?
2. What and how severe are the potential adverse consequences that the technological entity, its extended environment, and the crew may eventually be subjected to as a result of the occurrence of the initiator?
3. How likely to occur are these undesirable consequences, or what are their probabilities or frequencies?
4. How confident are we about our answers to the above questions?
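A minimal event-tree sketch of questions 1 and 3 above (identify the combinations of events, then estimate their frequencies); all numbers here are hypothetical:

    # One initiating event challenged by two safety barriers; barrier 2 is
    # only challenged if barrier 1 fails. Independence is assumed.
    INITIATOR_FREQ = 1e-2     # initiating events per year
    P_B1_FAILS = 1e-3         # barrier 1 fails on demand
    P_B2_FAILS = 1e-2         # barrier 2 fails given barrier 1 failed

    sequences = {
        "barrier 1 works (safe shutdown)":
            INITIATOR_FREQ * (1 - P_B1_FAILS),
        "barrier 1 fails, barrier 2 works (mitigated)":
            INITIATOR_FREQ * P_B1_FAILS * (1 - P_B2_FAILS),
        "barrier 1 fails, barrier 2 fails (accident)":
            INITIATOR_FREQ * P_B1_FAILS * P_B2_FAILS,
    }
    for name, freq in sequences.items():
        print(f"{name}: {freq:.2e} per year")
    # The accident sequence comes out at ~1e-7 per year. Consequence
    # severity (question 2) and uncertainty in these estimates
    # (question 4) are assessed separately.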
34. Issues for the Systems and Software Professional Ethical Issues
Our obligations to exhibit our concern for safety when engineering a system
Our personal roles in raising safety related issues during project reviews
Our obligations when we believe a system presents an unacceptable level of risk
Basis of our perceptions
Our Role
Professional Issues
Our professional obligations
Our professional society obligations
Our legal obligations
The role of independent assessment of system safety
Removes conflicts in the development organization
Independent assessment must be funded, staffed and empowered
Internal and external independence
In the FAA, the role of the DER (Designated Engineering Representative)
For civil aviation, FAA conducts independent, in-process audits
35. Specifying High Integrity Systems
Domain Description
A system perspective
A software perspective
Requirements
Understanding Stakeholder Needs
Who are the stakeholders
Getting stakeholder requirements
Bridging the gap between needs and The Specification
36. Domain Description Specifications should describe the domain explicitly; they should distinguish domain properties that are independent of the system from those that the system is required to enforce.
An ordinary domain description is in the indicative mood:
it asserts certain truths about the domain.
A requirement, on the other hand, while describing the domain, is in the optative mood.
It describes the desired state of affairs that the machine produces
37. Software Requirements and Specifications To develop software is to build a Machine, simply by describing it.
That is, software development is engineering
Application Domain: the parts of the world that will affect the machine and will be affected by it
The problem is in the application domain, the machine is the solution
The Application Domain must be explicitly and precisely described
38. The Domain is separate from the Machine
39. Simple Example of Domain Error APPLICATION OF THRUST REVERSERS UPON LANDING
Requirement
REVERSE_ENABLED if and only if MOVING_ON_RUNWAY
WRONG Software Specification
REVERSE_ENABLED if and only if WHEEL_PULSES_ON
Domain Property: WHEEL_PULSES_ON if and only if WHEELS_TURNING
NOT Domain Property: WHEEL_PULSES_ON if and only if MOVING_ON_RUNWAY
The problem was one of domain error; i.e., water on the runway and aquaplaning
DOOR LOCKING MECHANISMS DURING CRASH AND ENGINES RUNNING
AUTOMATIC BRAKING SYSTEM ON AUTOMOBILE
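A minimal sketch of why the wrong specification above violates the requirement; the aquaplaning scenario is the one the slide cites, and the code is illustrative only:

    # Requirement:     REVERSE_ENABLED  iff  MOVING_ON_RUNWAY
    # Wrong spec:      REVERSE_ENABLED  iff  WHEEL_PULSES_ON
    # Domain property: WHEEL_PULSES_ON  iff  WHEELS_TURNING
    # The unstated (and false) assumption was WHEELS_TURNING iff
    # MOVING_ON_RUNWAY; aquaplaning breaks it.

    def reverse_enabled_wrong(wheel_pulses_on: bool) -> bool:
        return wheel_pulses_on

    # Aquaplaning on a wet runway:
    moving_on_runway = True
    wheels_turning = False            # wheels skid on the water film
    wheel_pulses_on = wheels_turning  # the domain property that does hold

    assert moving_on_runway                            # requirement: enable reverse
    assert not reverse_enabled_wrong(wheel_pulses_on)  # spec refuses: domain error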
40. Safety Stakeholders Need to separate each stakeholder's safety requirements and level of acceptable risk
Understand safety requirements in relation to other system needs
For example, a weapon system such as a helicopter has acceptable levels of risk far higher than a commercial airliner's
Resolve safety requirements among stakeholders early on
Reach Agreement
among Stakeholders and
between Stakeholders and System Implementers
41. The Satisfaction Argument
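The slide's figure is not in the transcript. In the Jackson/Zave formulation this material draws on, the satisfaction argument is the obligation to show that the specification, together with the indicative domain properties, entails the optative requirement:

    D \wedge S \models R

where D is the set of domain properties, S the specification of the machine, and R the requirement. The thrust-reverser example of slide 39 fails exactly here: the domain property needed to complete the argument (WHEEL_PULSES_ON if and only if MOVING_ON_RUNWAY) does not actually hold.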
42. General Methods for Safety Analysis Safety Case
HAZOP
FMECA
Fault Tree Analysis (see the sketch after this list)
Numerical Methods
Industrial Standards of the IEEE, IEC, ANSI, SAE, and others
Government Standards
Probabilistic Risk Assessment (PRA)
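A minimal fault-tree sketch for the Fault Tree Analysis entry above: basic-event probabilities combined through AND/OR gates, assuming independent events; the numbers are hypothetical:

    def or_gate(*probs: float) -> float:
        # Probability that at least one input event occurs.
        q = 1.0
        for p in probs:
            q *= (1.0 - p)
        return 1.0 - q

    def and_gate(*probs: float) -> float:
        # Probability that all input events occur.
        q = 1.0
        for p in probs:
            q *= p
        return q

    # TOP = (pump A fails AND pump B fails) OR control power lost
    p_pump_a = p_pump_b = 1e-3
    p_power = 1e-5
    p_top = or_gate(and_gate(p_pump_a, p_pump_b), p_power)
    print(f"top event probability: {p_top:.2e}")   # ~1.10e-05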
43. Industry Specific Methods Commercial Aviation
SAE ARP 4754
SAE ARP 4761
RTCA DO-178B and DO-254
FAA Software Mega Order
Military MIL STD 882, UK MOD DEF STAN 00-55 and 00-56
NASA and Space
Nuclear
European Safety Standard IEC 61508
Medical
44. Safety Case UK MOD:
A structured argument, supported by a body of evidence that provides a compelling, comprehensible and valid case that a system is safe for a given application in a given operating environment.
UK Railroad
The safety case documents must set out how the rail operators will manage and control the health and safety of staff and the public and their contingency plans for dealing with emergencies and other abnormal situations. This includes:
safety policy and objectives
a risk assessment
safety management systems
risk control measures
Industrial source
A safety case is a comprehensive, written justification that a system or operation will be safe throughout its lifecycle from inception to eventual decommissioning.
The Safety Case is the integration of arguments and evidence that describe, quantify and substantiate the safety, and the level of confidence in the safety, of a facility or activity.
45. Goal Structuring Notation (GSN) Components
46. Using the GSN Notation
47. An Example of GSN
48. UK Ministry of Defence and Risk Management Hazard Identification.
Hazard Analysis.
Risk Estimation.
Risk and ALARP (As Low As Reasonably Practicable) Evaluation.
Risk Reduction.
Risk Acceptance.
49. HAZOP The Hazard and Operability Study, known as HAZOP, is a standard hazard analysis technique used in the preliminary safety assessment of new systems or modifications to existing ones.
The HAZOP study is a detailed examination, by a group of specialists, of components within a system to determine what would happen if that component were to operate outside its normal design mode.
Each component will have one or more parameters associated with its operation such as pressure, flow rate or electrical power.
The HAZOP study looks at each parameter in turn and uses guide words to list the possible off-normal behavior such as 'more', 'less', 'high', 'low' or 'no'.
The effects of such behavior are then assessed and noted down on study forms. The categories of information entered on these forms can vary from industry to industry and from company to company
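A minimal bookkeeping sketch of the guide-word step: cross each component parameter with the guide words to enumerate the off-normal deviations the team then assesses. The components, parameters, and guide words here are illustrative:

    from itertools import product

    parameters = {
        "feed pump": ["flow", "pressure"],
        "reactor":   ["temperature", "level"],
    }
    guide_words = ["no", "more", "less", "high", "low"]

    for component, params in parameters.items():
        for param, word in product(params, guide_words):
            # Each deviation becomes a row on the study form, to be
            # assessed for causes, consequences, and safeguards.
            print(f"{component}: {word.upper()} {param}")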
50. SAE ARP 4761 Overview Safety Assessment Process
Functional Hazards Assessment (FHA)
Preliminary System Safety Assessment (PSSA)
System Safety Assessment (SSA)
Safety Assessment Analysis Methods
Fault Tree Analysis / Dependency Diagrams / Markov Analysis (a minimal Markov sketch follows this list)
Failure Modes and Effects Analysis (FMEA)
Failure Modes and Effects Summary
Common Cause Analysis (CCA)
Zonal Safety Analysis (ZSA)
Particular Risk Analysis
Common Mode Analysis
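A minimal Markov sketch for the analysis methods listed above: a single repairable element with two states, UP and DOWN, failure rate lambda and repair rate mu; steady-state availability is mu / (lambda + mu). Rates are hypothetical:

    lam = 1e-4   # failures per hour
    mu = 1e-1    # repairs per hour (mean time to repair: 10 hours)

    # Steady-state solution of the two-state Markov model.
    availability = mu / (lam + mu)
    print(f"steady-state availability: {availability:.6f}")   # ~0.999001
    print(f"unavailability: {1 - availability:.2e}")          # ~9.99e-04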
51. FHA Conducted at the beginning of the development cycle
Identifies and classifies failure conditions associated with aircraft functions and combinations of functions
Classification
Minor (Level 1 or D)
Major
Severe
Catastrophic (Level 4 or A)
These lead to the establishment of safety objectives
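The slide's four classes follow the conventional failure-condition classification used in commercial aviation; by DO-178B practice, each class drives a software development assurance level (the full scheme also includes a "No safety effect" class, Level E, which the slide omits). A small mapping, for orientation:

    # Conventional DO-178B mapping from failure-condition classification
    # to software level (most to least rigorous).
    SOFTWARE_LEVEL = {
        "catastrophic":           "A",
        "hazardous/severe-major": "B",
        "major":                  "C",
        "minor":                  "D",
        "no safety effect":       "E",
    }
    # Catastrophic failure conditions drive Level A software, which
    # carries the most verification objectives.
    print(SOFTWARE_LEVEL["catastrophic"])   # -> A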
52. PSSA (Preliminary System Safety Assessment) Systematic examination of the proposed system architecture to determine how failures can cause the functional hazards identified in the FHA
Objective is to establish the safety requirements of the system and verify that the proposed architecture can reasonably be expected to meet the objectives identified in the FHA
Usually takes the form of a Fault Tree Analysis (FTA) and includes Common Cause Analysis (can use Dependency Diagrams or Markov Analysis)
Includes hardware and software failures
53. SSA A systematic, comprehensive evaluation of the implemented system to show that the safety objectives of the FHA and derived safety requirements of the PSSA are met.
Usually based on the FTA of the PSSA (may use DD or MA)
Uses quantitative results of the FMES
54. FMEA Systematic, bottom-up method of identifying the failure modes of a system, item, or function
Determines the effects on the next higher level
Software can be analyzed qualitatively as part of FMEA
Typically used to analyze failure effects from single point failures
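A minimal bottom-up bookkeeping sketch. Note that the Risk Priority Number used here (RPN = severity x occurrence x detection) is common automotive and industrial FMEA practice rather than part of ARP 4761, and the rows are hypothetical:

    # (item, failure mode, effect at next higher level, sev, occ, det)
    failure_modes = [
        ("valve V1",  "stuck closed", "no coolant flow to loop",      9, 3, 4),
        ("sensor T2", "reads low",    "controller overdrives heater", 7, 4, 6),
    ]
    for item, mode, effect, sev, occ, det in failure_modes:
        rpn = sev * occ * det
        print(f"{item}: {mode} -> {effect} (RPN {rpn})")
    # Rows with the highest RPN are prioritized for design changes or
    # better detection.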
55. CCA Common Cause Analysis Includes
Zonal Safety Analysis (installation, interference, maintenance)
Particular Risk Analysis
Fire
High energy devices
Leaking fluids
Hail, Ice, Snow,
Etc.
Common Mode Analysis
Hardware Error
Software Error (multiple, identical software)
Etc.
56. Commercial Tools Available A large and growing number of software-based tools are available to assist in safety analysis
Isograph Reliability WorkBench, SAIC CAFTA, Item Software (particularly for PRA), etc.
Government tools from NASA and NRC
The key ingredients: safety analysis requires
Domain Expertise
Practical Experience
Knowledge of the processes and tools