810 likes | 2k Views
Using Root Cause Analysis To Understand Failures & Accidents. 7 th Military and Aerospace Programmable Logic Devices (MAPLD) September 8, 2004. Faith Chandler Office of Safety & Mission Assurance. Agenda. What’s a NASA Mishap? Types of Mishaps Purpose of Mishap Investigation
E N D
Using Root Cause Analysis To Understand Failures & Accidents 7th Military and Aerospace Programmable Logic Devices (MAPLD) September 8, 2004 Faith Chandler Office of Safety & Mission Assurance
Agenda What’s a NASA Mishap? Types of Mishaps Purpose of Mishap Investigation Investigating Causes of Mishaps and Failures Definitions Steps in Root Cause Analysis Generating Recommendations For More Information Stop and ask yourself… Did you really find the causes of the failure or accident?
What’s A Mishap? NASA Mishap. An unplanned event that results in at least one of the following: Injury to non-NASA personnel, caused by NASA operations. Damage to public or private property (including foreign property), caused by NASA operations or NASA-funded development or research projects. Occupational injury or occupational illness to NASA personnel. NASA mission failure. Destruction of, or damage to, NASA property.
“What can go wrong?” Equipment will fail Software will contain errors Humans will make mistakes Humans will deviate from accepted policy and practices There is a lot at stake! Human life One-of-a-kind hardware Government equipment & facilities Scientific knowledge Public confidence Types of Mishaps Challenger Mars Climate Orbiter Columbia NOAA N Prime Payload Canister Helios
Purpose of NASA Mishap Investigation The purpose of NASA mishap investigation process is solely to determine cause and develop recommendations to prevent recurrence. This purpose is completely distinct from any proceedings the agency may undertake to determine civil, criminal, or administrative culpability or liability, including those that can be used to support the need for disciplinary action.
Investigating Causes of Failures & Mishaps Often investigators: Identify the part or individual that failed. Identify the type of failure. Identify the immediate cause of the failure. Stop the investigation. Problem with this approach: The underlying causes may continue to produce similar problems or mishaps in the same or related areas. Satellite System Failure Navigation Equipment Failure Circuit Shorted Bent Pin Incorrect Installation
Investigating Causes of Failures & Mishaps Like icebergs, most of the problem is usually below the surface!
Investigating Causes of Failures & Mishaps When performing an investigation, it is necessary to look at more than just the immediately visible cause, which is often the proximate cause. There are underlying organizational causes that are more difficult to see, however, they may contribute significantly to the undesired outcome and, if not corrected, they will continue to create similar types of problems. These areroot causes. Per NPR 8621.1: NASA Procedural Requirements for Mishap reporting, Investigating, and Recordkeeping, all NASA mishap and close call investigations must identify the proximate causes(s), root causes(s) and contributing factor(s).
Definitions Proximate Cause(s) (Direct Cause) The event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, directly resulted in its occurrence and, if eliminated or modified, would have prevented the undesired outcome. Examples of proximate causes: Equipment Human Arched • Pushed incorrect button Leaked • Fell Over-loaded • Dropped tool Over-heated • Connected wires
Root Cause(s) One of multiple factors (events, conditions or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and, if eliminated, or modified would have prevented the undesired outcome. Typically multiple root causes contribute to an undesired outcome. Organizational factors Any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal. Examples: resource management (budget, staff, training); policy (content, implementation, verification); and management decisions. Definitions
Definitions Root Cause Analysis (RCA) A structured evaluation methodthat identifies the root causes for an undesired outcome and the actions adequate to prevent recurrence. Root cause analysis should continue until organizational factors have been identified, or until data are exhausted. RCA is a method that helps professionals determine: What happened. How it happened. Why it happened. Allows learning from past problems, failures, and accidents.
Root Cause Analysis - Steps Identify and clearly define the undesired outcome. Gather data. Create a timeline. Place events & conditions on an event and causal factor tree. Use a fault tree or other method/tool to identify all potential causes. Decompose system failures down to a basicevents or conditions (Further describe what happened) Identify specific failure modes (Immediate Causes) Continue asking “WHY” to identify root causes. Check your logicand your facts. Eliminate items that are not causes or contributing factors. Generate solutions that address both proximate causes and root causes.
Root Cause Analysis - Steps Clearly define the undesirable outcome. Describe the undesired outcome. For example: “satellite failed to deploy,” “employee broke his arm,” or “XYZ project schedule significantly slipped.” Gather data. Identify facts surrounding the undesired outcome. When did the undesired outcome occur? Where did it occur? What conditions were present prior to its occurrence? What controls or barriers could have prevented its occurrence but did not? What are all the potential causes? What actions can prevent recurrence? What amelioration occurred? Did it prevent further damage or injury?
Root Cause Analysis - Steps Create a timeline (sequence diagram) Illustrate the sequence of events in chronological order horizontally across the page. Depict relationships between conditions, events, and exceeded or failed barriers/controls. Condition Undesired Outcome Event Event Exceeded- Failed Barrier Or Control Event
Root Cause Analysis - Steps Create a timeline (sequence diagram) If amelioration occurred (e.g., put out the fire, cared for the injured), this should be included in the evaluation to ensure that it did not contribute to the undesired outcome. Example: In the case of a death, the investigation should ensure that the death was the result of the mishap and not a delay in medical care or inappropriate medical care at the scene. Condition Undesired Outcome Exceeded- Failed Amelioration Event Event Exceeded- Failed Barrier Or Control Event
Root Cause Analysis - Steps Example: simple timeline. Lost High Speed Data Stream (Mission Failure) Satellite Powered Up Thrusters Oriented Space Craft Satellite Failed to Deploy Antenna Tech. Used Wrong Method To Correct Poor Line of Sight
Root Cause Analysis - Steps Create an event and causal factor tree. (A visual representation of the causes that led to the failure or mishap.) Place the undesired outcome at the top of the tree. Add all events, conditions, and exceeded/failed barriers that occurred immediately before the undesired outcome and might have caused it. Lost High Speed Data Stream From Satellite (Mission Failure) Satellite Power Up Thrusters Oriented Space Craft Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna Poor Line of Sight
Root Cause Analysis - Steps Create an event and causal factor tree. Brainstorm to ensure that all possible causes are included, NOT just those that you are sure are involved. Be sure to consider people, hardware, software, policy, procedures, and the environment. Lost High Speed Data Stream From Satellite (Mission Failure) Satellite Power Up Thrusters Oriented Space Craft Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna Poor Line of Sight Meteoroid Impacted Satellite Logic Control Failed Power Supply Failed Arming Relay Failed Firing Relay Failed Sabotage
Root Cause Analysis - Steps Create an event and causal factor tree continued... If you have solid data indicating that one of the possible causes is not applicable, it can be eliminated from the tree. Caution: Do not be too eager to eliminate early on. If there is a possibility that it is a causal factor, leave it and eliminate it later when more information is available. Lost High Speed Data Stream From Satellite (Mission Failure) X Satellite Power Up Thrusters Oriented Space Craft Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna Poor Line of Sight Meteoroid Impacted Satellite Logic Control Failed Power Supply Failed Arming Relay Failed Firing Relay Failed Sabotage
Root Cause Analysis - Steps Create an event and causal factor tree continued… You may use a fault tree to determine all potential causes and to decompose the failure down to the “basic event” (e.g., system component level). Lost High Speed Data Stream From Satellite (Mission Failure) Poor Line of Sight Thrusters Oriented Space Craft Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna Meteoroid Impacted Satellite Logic Control Failed Power Supply Failed Arming Relay Failed Firing Relay Failed Sabotage Current Detection Circuit Failed Battery Failed Converter Failed
Root Cause Analysis - Steps Create an event and causal factor tree continued… A fault tree can also be used to identify all possible types of human failures. Lost High Speed Data Stream From Satellite (Mission Failure) Thrusters Oriented Space Craft Poor Line of Sight Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna Didn’t Perceive System Feedback Correct Decision But Incorrect Action Didn’t Understand System Feedback Correct Interpretation Incorrect Decision Action-Execution Error Interpretation Error Perception Error Decision-Making Error Knowledge-Based Error Skill-Based Error Rule-Based Error
Root Cause Analysis - Steps Create an event and causal factor tree continued… After you have identified all the possible causes, ask yourself “WHY” each may have occurred. Be sure to keep your questions focused on the original issue. For example “Why was the condition present?”; “Why did the event occur?”; “Why was the barrier exceeded?” or “Why did the barrier fail?” Undesired Outcome Event #2 Condition Event #1 Failed or Exceeded Barrier or Control WHY Event #1 Occurred WHY Failed Exceeded Barrier or Control WHY Failed Exceeded Barrier or Control WHY Event #1 Occurred WHY Event #1 Occurred WHY Failed Exceeded Barrier or Control WHY Condition Existed or Changed WHY Condition Existed or Changed WHY Condition Existed or Changed WHY Event #2 Occurred WHY Event #2 Occurred WHY Event #2 Occurred
Root Cause Analysis – Steps Continue to ask “why” until you have reached: Root cause(s) - including all organizational factors that exert control over the design, fabrication, development, maintenance, operation, and disposal of the system. A problem that is not correctable by NASA or NASA contractor. Insufficient data to continue.
Root Cause Analysis- Steps The resultant tree of questions and answers should lead to a comprehensive picture of POTENTIALcauses for the undesired outcome Undesired Outcome Event #2 Condition Event #1 Failed or Exceeded Barrier or Control X WHY Failed Exceeded Barrier or Control WHY Failed Exceeded Barrier or Control WHY Failed Exceeded Barrier or Control WHY Condition Existed or Changed WHY Condition Existed or Changed WHY Condition Existed or Changed WHY Event #2 Occurred WHY Event #2 Occurred WHY Event #2 Occurred WHY Event #1 Occurred WHY Event #1 Occurred WHY Event #1 Occurred WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY
Root Cause Analysis- Steps Check your logic with a detailed review of each potential cause. Verify it is a contributor or cause. If the action, deficiency, or decision in question were corrected, eliminated or avoided, would the undesired outcome be prevented or avoided? If no, then eliminate it from the tree. Undesired Outcome Event #2 Condition Event #1 Failed or Exceeded Barrier or Control X WHY Condition Existed or Changed X WHY Condition Existed or Changed X WHY Condition Existed or Changed WHY Event #2 Occurred WHY Event #2 Occurred WHY Event #2 Occurred WHY Failed Exceeded Barrier or Control WHY Failed Exceeded Barrier or Control X WHY Failed Exceeded Barrier or Control WHY Event #1 Occurred WHY Event #1 Occurred WHY Event #1 Occurred X X X X X X X WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY X X X X X X X X X WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY
Create an event and causal factor tree continued… The remaining items on the tree are the causes (or probable causes). necessary to produce the undesired outcome. Proximate causes are those immediately before the undesired outcome. Intermediate causes are those between the proximate and root causes. Root causes are organizational factors or systemic problems located at the bottom of the tree. Root Cause Analysis - Steps Undesired Outcome PROXIMATE CAUSES Failed or Exceeded Barrier or Control Event #2 Condition Event #1 WHY Event #1 Occurred WHY Event #1 Occurred WHY Condition Existed or Changed WHY Condition Existed or Changed WHY Event #2 Occurred WHY Event #2 Occurred WHY Failed/Exceeded Barrier or Control WHY Failed/Exceeded Barrier or Control INTERMEDIATE CAUSES WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY ROOT CAUSES WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY
Some people choose to leave contributing factors on the tree to show all factors that influenced the event. Contributing factor: An event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence. If this is done, illustrate them differently (e.g., dotted line boxes and arrows) so that it is clear that they are not causes. Root Cause Analysis- Steps Undesired Outcome Failed or Exceeded Barrier or Control Event #2 Condition Event #1 WHY Event #1 Occurred WHY Event #1 Occurred WHY Condition Existed or Changed WHY Condition Existed or Changed WHY Event #2 Occurred WHY Event #2 Occurred WHY Failed/Exceeded Barrier or Control WHY Failed/Exceeded Barrier or Control Contributing Factors WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY WHY
Investigating Causes of Failures & Mishaps Lost High Speed Data Stream From Satellite (Mission Failure) Thrusters Oriented Space Craft Poor Line of Sight Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna Power Supply Failed Tells What Failed Battery Dead Tells Types of Failures Installed Improperly Beyond Shelf Limit Root Cause is Much Deeper Keep Asking Why Tells Why It Failed
Investigating Causes of Failures & Mishaps Lost High Speed Data Stream From Satellite (Mission Failure) Thrusters Oriented Space Craft Poor Line of Sight Technician Used Wrong Method to Correct Satellite Failed To Deploy Antenna MMOD Hit Space Craft After Oriented Correct Interpretation Incorrect Decision Power Supply Failed Decision-Making Error Battery Failed Insufficient Anomaly Training New Task Installed Improperly Beyond Shelf Limit Training Does Not Exist Procedure Incorrect No Quality Inspection Insufficient Training Budget Not Updated Insufficient Quality Staff Organization Under Estimates Importance of Anomaly Training Not Under Configuration Mgmt Insufficient Budget
Root Cause Analysis- Steps • Generating Recommendations: • At a minimum corrective actions should be generated to eliminate proximate causes and eliminate or mitigate the negative effects of root causes. • When multiple causes exist, there is limited budget, or it is difficult to determine what should be corrected: • Quantitative analysis can be used to determine the total contribution of each cause to the undesirable outcome (see NASA Fault Tree Handbook, Version 1.1, for more information). • Fishbone diagrams (or other methods) can be used to arrange causes in order of their importance. • Those causes which contribute most to the undesirable outcome should be eliminated or the negative effects should be mitigated to minimize risk.
For More Information NASA PBMA Mishap Investigation Website (http://ai-pbma-kms.intranets.com/login.asp?link=) Includes: Links (e.g, to Root Cause Analysis Software, a RCA Library). Documents (e.g., Methods, Techniques, Tools, Publications and Presentations). Threaded Discussions and Polls. HQ Office of Safety & Mission Assurance Faith.T.Chandler@nasa.gov