Aviation reliability: Programs & calculation

Aviation reliability: Programs & calculation (text, Chapter 19) 393SYS

Reliability
Definition (in statistical terms): ‘the probability of failure-free operation of an item in a specified environment for a specified amount of time’
Examples:
• If eight delays and cancellations are experienced in 200 flights, 96% of the airline's flights were dispatched on time.
• Effective February 15, 2007, the FAA ruled that US-registered ETOPS-207 operators can fly over most of the world provided that the in-flight shutdown (IFSD) rate is 1 in 100,000 engine hours. This limit is more stringent than the one for ETOPS-180 (2 in 100,000 engine hours).
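The arithmetic behind the two examples can be sketched in a few lines of Python. The function names, the limit table, and the sample fleet figures (3 shutdowns in 250,000 engine hours) are illustrative assumptions. Treating the quoted IFSD rates as upper limits is also an assumption.

```python
# IFSD limits per 100,000 engine hours, as quoted on the slide.
IFSD_LIMITS = {"ETOPS-207": 1.0, "ETOPS-180": 2.0}

def dispatch_reliability(flights, delays_and_cancellations):
    """Percentage of flights dispatched on time."""
    return 100.0 * (flights - delays_and_cancellations) / flights

def ifsd_rate_per_100k(shutdowns, engine_hours):
    """In-flight shutdowns per 100,000 engine hours."""
    return shutdowns * 100_000 / engine_hours

def meets_limit(rate, approval):
    """True if the rate does not exceed the quoted limit (assumed to be a maximum)."""
    return rate <= IFSD_LIMITS[approval]

print(dispatch_reliability(200, 8))          # the slide's 200-flight example

# Hypothetical fleet: 3 shutdowns in 250,000 engine hours.
rate = ifsd_rate_per_100k(3, 250_000)
print(rate, meets_limit(rate, "ETOPS-180"), meets_limit(rate, "ETOPS-207"))
```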

Two main approaches to reliability in the aviation industry
• The first approach is overall airline reliability, which essentially means dispatch reliability: how often the airline achieves an on-time departure of its scheduled flights. The causes of delay are categorized as maintenance, procedures, personnel, flight operations, air traffic control (ATC), etc.
• The second approach treats reliability as a program specifically designed to address maintenance problems, whether or not they cause delays, and to provide analysis of and corrective actions for those items to improve the overall reliability of the equipment. This contributes to dispatch reliability as well as to the overall operation.

Reliability Program (for maintenance)
A set of rules and practices for managing and controlling a maintenance program. Its main function is to monitor the performance of the vehicles and their associated equipment and to call attention to any need for corrective action.
Additional functions:
• Monitor the effectiveness of those corrective actions
• Provide data to justify adjusting the maintenance interval or maintenance program procedure as appropriate

Maintenance programs have four types of reliability
• Statistical reliability
• Historical reliability
• Event-oriented reliability
• Dispatch reliability

Statistical reliability
• Based upon the collection and analysis of ‘events’ such as failure, removal, and repair rates of systems or components.

Historical reliability
• Comparison of current event rates with those of past experience. Commonly used when new equipment is introduced and no established statistics are available.

Event-oriented reliability
Covers events such as bird strikes, hard landings, in-flight shutdowns (IFSD), lightning strikes, or other accidents that do not occur on a regular basis and therefore produce no usable statistical or historical data.
For ETOPS, the FAA has designated certain events to be tracked in an ‘event-oriented reliability program’. Each occurrence of these events must be investigated to determine the cause and prevent recurrence.
Examples of IFSD causes: flameout, internal failure, crew-initiated shutoff, foreign object ingestion, icing, and inability to obtain and/or control desired thrust.

Dispatch reliability
A measure of an airline's operation with respect to on-time departure. It receives considerable attention from regulatory authorities (e.g. FAA), airlines, and passengers. It is actually just a special form of the event-oriented reliability approach.

Danger of misinterpreted reliability data (1)
A pilot experiences a rudder control problem two hours before arriving at an airport. He writes up the problem in the aircraft logbook and reports it by radio to the flight operations unit at the airport. Upon arrival, the maintenance crew checks the log, finds the write-up, and begins troubleshooting. The repair takes a little longer than the scheduled turnaround time and causes a delay. Since maintenance is at work and the rudder is the problem, the delay is charged to the maintenance department.
Had the pilot and the flight operations unit informed maintenance of the problem two hours before landing, the maintenance crew could have spent the time prior to landing on troubleshooting analysis, and the delay could have been prevented. So a change in airline procedure can avoid the delay. A good reliability program should prevent the same delay in the future by altering the procedure, regardless of who or what is to blame.

Danger of misinterpreted reliability data (2)
If there were 12 write-ups of rudder problems during the month and only one of them caused a delay, there are actually two problems to investigate:
• The delay, which may or may not have been caused by the rudder problems
• The 12 rudder write-ups, which may in fact be related to an underlying maintenance problem
The dispatch delay constitutes one problem and the rudder system malfunction constitutes another. They may overlap, but they are two different problems. The delay is an event-oriented reliability item that must be investigated on its own; the 12 rudder problems should be addressed separately by the statistical (or historical) reliability program.

Elements of a Reliability Program
• Data collection
• Problem area alerting
• Data display
• Data analysis
• Corrective actions
• Follow-up analysis
• Monthly report

Data Collection
Allows the operator to compare present performance with the past. Typical data types are:
• Flight time and cycles for each aircraft
• Cancellations and delays over 15 minutes
• Unscheduled component removals
• Unscheduled engine removals
• In-flight shutdowns of engines
• Pilot reports or logbook write-ups
• Cabin logbook write-ups
• Component failures (shop maintenance)
• Maintenance check package findings
• Critical failures

Problem detection: alerting systems
Alerting systems quickly identify areas where performance is significantly different from normal so that possible problems can be investigated. Standards for event rates are set according to past performance.

Problem detection 2: setting & adjusting alert levels
Alert levels are recalculated (yearly) and false alarms are filtered out.

Quality Control Charts and the Seven Run Rule
• A control chart is a graphic display of data that illustrates the results of a process over time. It helps prevent defects and allows you to determine whether a process is in control or out of control.
• The seven run rule states that if seven data points in a row are all below the mean, all above the mean, all increasing, or all decreasing, then the process needs to be examined for non-random problems.

[Figure: control chart of 12” ruler]

Control Chart (continued)
• The output of a production process will fluctuate. The causes of fluctuation can be random, or non-random due to a desirable or undesirable process change. Control charts graph and measure process data against control limits, and can distinguish random variation from assignable (non-random) causes.
• We cannot adjust random variation out of a process. Process adjustments for random variation are neither necessary nor desirable. This is over-adjustment, or tampering, and it makes the process worse.
• We can and must investigate assignable causes (non-random causes). Points outside the control limits are evidence of process problems. Analysts must investigate every out-of-control point for an assignable cause, and must record their findings and any corrective actions. For example, a tool adjustment, a change in Formal Technical Review format, or replacement of worn tooling may correct the problem.

Pattern analysis of control charts: the 7-run rule
The 7-run rule is used to filter out the random variation in a production process. It shows the ‘trends’ that are caused by ‘assignable causes’ (non-random causes) and that require investigation and possibly corrective action.
7-run-rule patterns:
• seven points above the mean value;
• seven points below the mean value;
• seven points all increasing; or
• seven points all decreasing.
These patterns are indicators of non-random problems, which can be a symptom of a process that is out of control.

To develop a control chart to determine project stability
• Plot the individual metric values on a chart.
• Compute the mean of the metric values (Am) and plot it as a line.
• Plot the Upper Control Limit (UCL) and Lower Control Limit (LCL).
• Compute a standard deviation as (UCL - Am)/3. Plot lines one and two standard deviations above and below Am. If any of the standard deviation lines is less than 0.0, it need not be plotted unless the metric being evaluated takes on values that are less than 0.0.
• The standard deviation lines are then plotted on the control chart.
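The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the slide does not say how the control limits themselves are obtained, so they are passed in as inputs. The function name, the dictionary keys, and the sample data are assumptions.

```python
def control_chart_lines(values, ucl, lcl, metric_can_be_negative=False):
    """Return the horizontal lines to draw on the control chart.

    The standard deviation is taken as (UCL - mean) / 3, as on the slide.
    The UCL and LCL are supplied by the caller (assumed already known).
    """
    mean_value = sum(values) / len(values)
    sigma = (ucl - mean_value) / 3
    lines = {"mean": mean_value, "UCL": ucl, "LCL": lcl}
    for k in (1, 2):
        for sign, label in ((+1, "+"), (-1, "-")):
            level = mean_value + sign * k * sigma
            # Standard deviation lines below 0.0 are skipped unless the
            # metric itself can take negative values.
            if level < 0.0 and not metric_can_be_negative:
                continue
            lines[f"{label}{k}sd"] = level
    return lines

# Hypothetical metric values with assumed limits of 8 and 2.
print(control_chart_lines([4, 6, 5, 5], ucl=8, lcl=2))
```

Once the lines are computed, plotting them with any charting tool is a separate step.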

Other pattern analysis of control charts
The 7-run rule is just one method of filtering false alarms. Other pattern-analysis methods include, but are not limited to:
• a metric value lies outside the UCL or LCL;
• 2 out of 3 successive metric values lie more than 2 standard deviations away from the mean;
• 4 out of 5 successive metric values lie more than 1 standard deviation away from the mean;
• others…
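The pattern rules can be checked mechanically. The sketch below is one possible reading of the rules as stated on the slides. The function names and sample data are hypothetical. The "k out of n" check counts points on either side of the mean, which follows the slide wording, whereas some published rule sets require the points to be on the same side.

```python
def seven_run_hits(values, mean_value, run=7):
    """Indices at which a run of `run` points ends that is all above the mean,
    all below the mean, strictly increasing, or strictly decreasing."""
    hits = []
    for end in range(run, len(values) + 1):
        window = values[end - run:end]
        pairs = list(zip(window, window[1:]))
        if (all(v > mean_value for v in window)
                or all(v < mean_value for v in window)
                or all(b > a for a, b in pairs)
                or all(b < a for a, b in pairs)):
            hits.append(end - 1)
    return hits

def k_of_n_hits(values, mean_value, sigma, k, n, width):
    """Indices ending a window of n successive points in which at least k
    lie more than width * sigma away from the mean (either side)."""
    hits = []
    for end in range(n, len(values) + 1):
        window = values[end - n:end]
        far = sum(1 for v in window if abs(v - mean_value) > width * sigma)
        if far >= k:
            hits.append(end - 1)
    return hits

def outside_limits(values, ucl, lcl):
    """Indices of points above the UCL or below the LCL."""
    return [i for i, v in enumerate(values) if v > ucl or v < lcl]

data = [5, 6, 7, 8, 9, 10, 11]                 # hypothetical rising trend
print(seven_run_hits(data, mean_value=8))      # run ends at the last index
print(k_of_n_hits([8, 11, 8, 11], 8, 1, k=2, n=3, width=2))
```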

Reliability: Basic Calculation & Application

CHARACTERIZING FAILURE OCCURRENCES IN TIME
Four general ways:
• time of failure
• time interval between failures
• cumulative failures experienced up to a given time
• failures experienced in a time interval

Time-based failure specification

Failure number   Failure time (sec)   Failure interval (sec)
1                10                   10
2                19                   9
3                32                   13
4                43                   11

Failure-based failure specification

Time (sec)   Cumulative failures   Failures in interval
30           2                     2
60           5                     3
90           7                     2
120          8                     1
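The columns in the two specifications are related by simple differencing and running sums. A minimal sketch, with hypothetical function names, using the figures from the two tables:

```python
from itertools import accumulate

def failure_intervals(failure_times):
    """Time between successive failures, measured from time 0."""
    previous = [0] + list(failure_times[:-1])
    return [t - p for t, p in zip(failure_times, previous)]

def cumulative_failures(failures_per_interval):
    """Running total of failures at the end of each observation interval."""
    return list(accumulate(failures_per_interval))

print(failure_intervals([10, 19, 32, 43]))     # time-based table
print(cumulative_failures([2, 3, 2, 1]))       # failure-based table
```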

TABLE 3 Typical probability distribution of failures

Value of random variable       Probability   Product of value
(failures in time period)                    and probability
0                              0.10          0
1                              0.18          0.18
2                              0.22          0.44
3                              0.16          0.48
4                              0.11          0.44
5                              0.08          0.40
6                              0.05          0.30
7                              0.04          0.28
8                              0.03          0.24
9                              0.02          0.18
10                             0.01          0.10
Mean failures                                3.04

TIME VARIATION
• Mean value function - represents the average cumulative failures associated with each time point.
• Failure intensity function - the rate of change of the mean value function, i.e. the number of failures per unit time.

Probability distributions at times tA and tB

Value of random variable       Probability at elapsed   Probability at elapsed
(failures in time period)      time tA = 1 hr           time tB = 5 hr
0                              0.10                     0.01
1                              0.18                     0.02
2                              0.22                     0.03
3                              0.16                     0.04
4                              0.11                     0.05
5                              0.08                     0.07
6                              0.05                     0.09
7                              0.04                     0.12
8                              0.03                     0.16
9                              0.02                     0.13
10                             0.01                     0.10
11                             0                        0.07
12                             0                        0.05
13                             0                        0.03
14                             0                        0.02
15                             0                        0.01
Mean failures                  3.04                     7.77
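The mean failures at each time are the sum of value times probability. The sketch below recomputes both means from the tabulated probabilities. The last line estimates an average failure intensity between tA and tB as the slope of the mean value function. That is only an average over the interval, not the instantaneous intensity, and the variable names are illustrative.

```python
def mean_failures(probabilities):
    """Expected number of failures; probabilities[k] is P(exactly k failures)."""
    return sum(k * p for k, p in enumerate(probabilities))

# Probabilities for 0, 1, 2, ... failures, taken from the table.
P_TA = [0.10, 0.18, 0.22, 0.16, 0.11, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01]
P_TB = [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.09, 0.12,
        0.16, 0.13, 0.10, 0.07, 0.05, 0.03, 0.02, 0.01]

mean_a = mean_failures(P_TA)   # elapsed time tA = 1 hr
mean_b = mean_failures(P_TB)   # elapsed time tB = 5 hr

# Average slope of the mean value function between tA and tB (failures/hr).
average_intensity = (mean_b - mean_a) / (5 - 1)
print(round(mean_a, 2), round(mean_b, 2), round(average_intensity, 4))
```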

[Figure: the mean value function F(t) (mean failures) and the failure intensity function f(t) (failures/hr) plotted against time, with tA = 1 and tB = 5 marked]

f(x) -- probability density function
F(x) -- cumulative probability distribution function
Example: when we toss three coins, there are eight possible events. Let x be the number of coins that land head side up (H = head side up; T = tail side up):

Event   HHH   HHT   HTH   THH   HTT   THT   TTH   TTT
x       3     2     2     2     1     1     1     0

x       0     1     2     3
f(x)    1/8   3/8   3/8   1/8
F(x)    1/8   1/2   7/8   1
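The coin example can be reproduced by enumerating the eight events. This sketch uses exact fractions, and the variable names are illustrative:

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# All eight equally likely outcomes of tossing three coins.
outcomes = list(product("HT", repeat=3))

# x = number of heads in each outcome; count how often each x occurs.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# f(x): probability of exactly x heads.
f = {x: Fraction(head_counts[x], len(outcomes)) for x in sorted(head_counts)}

# F(x): probability of x heads or fewer (running sum of f).
F = dict(zip(f, accumulate(f.values())))

for x in f:
    print(x, f[x], F[x])
```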

DISCRETE FAILURE FUNCTION
f(t), the failure density function over a time interval [t1, t2], is defined as the ratio of the number of failures occurring in the interval to the size of the original population, divided by the length of the time interval:
f(t) = [n(t1) - n(t2)] / N / (t2 - t1)
where n(t) is the number of survivors at time t and N is the size of the original population.
f(t) measures the overall speed at which failures are occurring.

DISCRETE HAZARD FUNCTION
Z(t), the failure rate or hazard function, is the probability that a failure occurs in some time interval [t1, t2], given that the system has survived up to time t1. It is the ratio of the number of failures occurring in the interval to the number of survivors at the start of the interval, divided by the length of the time interval:
Z(t) = [n(t1) - n(t2)] / n(t1) / (t2 - t1)
where n(t) is the number of survivors at time t.
Z(t) measures the instantaneous speed of failure.

FAILURE CURVES

PROBABILITY OF SUCCESS
F(t) is the probability of failure (= cumulative distribution)
R(t) is the probability of success (= reliability)
⇒ F(t) + R(t) = 1

DISCRETE FUNCTION EXAMPLE
Failure data for 10 hypothetical electrical components

Failure number   Operating time, h
1                8
2                20
3                34
4                46
5                63
6                86
7                111
8                141
9                186
10               266

f(t) = [n(t1) - n(t2)] / N / (t2 - t1)

Interval t1 to t2 (h)   Failure density f(t), per hr             Hazard rate Z(t), per hr                 F(t)   R(t)   MTTF (h)      Overall MTTF (h)
0-8                     [10-9]/10/(8-0) = 1/10/8 = 0.0125        [10-9]/10/(8-0) = 1/10/8 = 0.0125        1/10   9/10   8*10 = 80     8/1*10 = 80
8-20                    [9-8]/10/(20-8) = 1/10/12 = 0.0083       [9-8]/9/(20-8) = 1/9/12 = 0.0093         2/10   8/10   12*9 = 108    20/2*10 = 100
20-34                   [8-7]/10/(34-20) = 1/10/14 = 0.0071      [8-7]/8/(34-20) = 1/8/14 = 0.0089        3/10   7/10   14*8 = 112    34/3*10 = 113
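The worked table can be recomputed from the ten failure times. The sketch below assumes one failure per interval, as in the example data. The function name and dictionary keys are illustrative. The interval MTTF is taken as interval length times survivors, which is the reciprocal of Z(t).

```python
def discrete_failure_rows(failure_times, population):
    """One row per interval between successive failures (one failure each)."""
    rows = []
    t1, survivors = 0, population
    for failures_so_far, t2 in enumerate(failure_times, start=1):
        width = t2 - t1
        rows.append({
            "interval": (t1, t2),
            "f": 1 / population / width,          # failure density per hour
            "Z": 1 / survivors / width,           # hazard rate per hour
            "F": failures_so_far / population,    # probability of failure
            "R": 1 - failures_so_far / population,
            "MTTF": width * survivors,            # reciprocal of Z
            "overall_MTTF": t2 / failures_so_far * population,
        })
        t1, survivors = t2, survivors - 1
    return rows

TIMES_H = [8, 20, 34, 46, 63, 86, 111, 141, 186, 266]
rows = discrete_failure_rows(TIMES_H, population=10)
for row in rows[:3]:
    print(row["interval"], round(row["f"], 4), round(row["Z"], 4),
          row["MTTF"], round(row["overall_MTTF"]))
```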

Achieving a reliable system (ref. Ian Sommerville, 7e, Ch. 20)
Three basic strategies to achieve reliability:
• Fault Avoidance
  • Build fault-free systems from the start
• Fault Tolerance
  • Build facilities into the system to let the system continue when faults cause system failures
• Fault Detection
  • Use software validation techniques to discover faults prior to the system being put into operation
For most systems, fault avoidance and fault detection suffice to provide the required level of reliability.

Implementing Fault Avoidance
• Availability of a formal and unambiguous system specification
• Adoption of a quality philosophy by developers. Developers should be expected to write bug-free systems
• …

Implementing Fault Tolerance
• Even if we somehow build a fault-free system, we still need fault tolerance in critical systems
• Fault-free does not mean failure-free
  • Fault-free means that the system correctly meets its specifications
  • Specifications may be incomplete, faulty, or unaware of a requirement of the environment
• We can never conclusively prove that a system is fault-free

Aspects of Fault Tolerance
• Failure Detection
  • The system must be able to detect that its current state has caused a failure or will cause a failure
• Damage Assessment
  • The system must detect what damage the system failure has caused
• Fault Recovery
  • The system must change its state to a known “safe” state
  • Can correct the damaged state (forward error recovery - harder)
  • Can restore to a previous known “safe” state (backward error recovery - easier)
• Fault Repair
  • Modifying the system so that the failure does not recur
  • Many software failures are transient and need no repair; normal processing can resume after fault recovery

Implementing Fault Tolerance
• Hardware - Triple-Modular Redundancy (TMR)
  • The hardware unit is replicated three (or more) times
  • The outputs of the three units are compared
  • If one unit fails, its output is ignored
  • The Space Shuttle is a classic example
[Figure: Machine 1, Machine 2, and Machine 3 feed an Output Comparator]
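The output comparator can be illustrated in software as a majority voter. This is a simplified sketch of the voting idea only, not a model of any real flight hardware. The function name and the sample readings are hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by more than half of the units.

    A unit whose output disagrees with the majority is effectively ignored.
    Raises ValueError if no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    raise ValueError("no majority among unit outputs")

# Hypothetical readings from three replicated units; the third has failed.
print(majority_vote([42, 42, 17]))
```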

Implementing Fault Tolerance (2)
• Using Software
  • N-Version programming
    • Have multiple teams build different versions of the software and then execute them in parallel
    • Assumes teams are unlikely to make the same mistakes
    • Not necessarily a valid assumption if the teams all work from the same specification
  • …

N-Version Programming
A commonly used approach in railway signaling, aircraft systems, and reactor protection systems.

System Configuration for Failure Event Diagram
• Divide the system into a hierarchical set of components. The reliability of each component should be known or easy to estimate or measure.
• Each component is represented as a switch. If the component is functioning, the switch is viewed as CLOSED; if it is not functioning, as OPEN. System success occurs if there is a continuous path through the configuration.
• The components are described as a combination of two types of configuration, AND and OR, with independent failures.
• Express the reliability relationship between the components with a Failure Event diagram.

EVENT DIAGRAM
[Figure: event diagram for AND-OR configuration with components A, B, C, D, and E]

EVENT EXPRESSION
RS = (A + B + C) * D * E
where RS = reliability of the system = probability of system success

TRUTH TABLE

A   B   A+B [OR]   A*B [AND]
1   1   1          1
1   0   1          0
0   1   1          0
0   0   0          0

.AND. CONFIGURATION
Rs = reliability of the system
Rs = R1 X R2, where R1 and R2 are the reliabilities of components C1 and C2.
For n components arranged in logical .AND.:
Rs = R1 X R2 X … X Rn
[Figure: components R1 and R2 in an .AND. arrangement giving Rs]

.OR. CONFIGURATION
Rs = reliability of the system
R1, R2, … Rn = reliability of components 1, 2, … n
In this case it is easier to calculate with the probability of failure F:
Fs = (1 - Rs); F1 = (1 - R1); F2 = (1 - R2)
Fs = F1 * F2 = (1 - R1) * (1 - R2)
For n components arranged in logical .OR.:
Rs = 1 - Fs = 1 - [(1 - R1) * (1 - R2) * … * (1 - Rn)]
[Figure: components R1 and R2 in an .OR. arrangement giving Rs]
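The .AND. and .OR. formulas combine directly to evaluate an event expression such as RS = (A + B + C) * D * E. In the sketch below the function names and the component reliabilities are hypothetical, and independent failures are assumed, as stated for the event diagram.

```python
from math import prod

def and_reliability(reliabilities):
    """Logical .AND.: every component must work, so Rs = R1 * R2 * ... * Rn."""
    return prod(reliabilities)

def or_reliability(reliabilities):
    """Logical .OR.: the system fails only if every component fails,
    so Rs = 1 - (1 - R1) * (1 - R2) * ... * (1 - Rn)."""
    return 1 - prod(1 - r for r in reliabilities)

# Hypothetical component reliabilities for RS = (A + B + C) * D * E.
A = B = C = 0.9
D = E = 0.99
rs = and_reliability([or_reliability([A, B, C]), D, E])
print(round(rs, 6))
```

With these sample values the three components in the .OR. group reach about 0.999 together, so the system reliability is dominated by D and E.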

Reliability Acronyms
• MTBF - Mean Time Between Failures
• MTTF - Mean Time To Failure
• MTTR - Mean Time To Repair
• MTBF = MTTF + MTTR
• Many people consider MTBF to be far more useful than measuring fault rate per LOC
