Explore reliability modeling in fault-tolerant computing, including quantitative analysis, failure rate computation, and Mean Time To Failure concepts.
ECE 753: FAULT-TOLERANT COMPUTING Kewal K. Saluja Department of Electrical and Computer Engineering Reliability Modeling and Analysis Lectures 8-10
Overview • Recap • Introduction • Reliability Modeling • reliability block diagram • combinatorial model • Markov model • Other Parameters and analysis • General remarks and Summary ECE 753 Fault Tolerant Computing
Recap • Course introduction • Fundamental principles - Four types of redundancy • FEF and breaking FEF chain • Fault modeling • models at different levels, error models, process failure models • Testing and Test Generation • test generation, fault simulation, DFT and BIST concepts • Simple concepts in fault-tolerance • hardware redundancy, information redundancy, time redundancy, and software redundancy methods
Introduction • References • [prad:96] • [john:89] • [triv:82] • These three books contain sufficient material covering this part of the course • Recap of definitions • Importance of analysis and analytical model • Mathematical formulation for quantitative analysis
Introduction (contd.) • Recap of definitions • Reliability R(t) • Availability A(t) • Performability and Dependability • Importance of analysis and analytical model • to evaluate a design • a metric to compare different designs • to provide feedback to the designer during early design stages • use a model for performance analysis • used for quantitative and qualitative analysis
Introduction (contd.) • Mathematical formulation for quantitative analysis • consider a large experiment with N identical systems • observation at time t • N0(t) - number of correctly operating systems • Nf(t) - number of failed systems, so N = N0(t) + Nf(t) • Hence • Reliability R(t) = N0(t)/N = 1 - Nf(t)/N • Unreliability Q(t) = Nf(t)/N = 1 - R(t) • Derivative of reliability: dR(t)/dt = -(1/N)(dNf(t)/dt) • dNf(t)/dt is the instantaneous rate at which systems are failing at time t
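The counting definition above can be tried on simulated data. The following is a minimal sketch, assuming a hypothetical experiment with N = 10,000 systems whose failure times are drawn from an exponential distribution with rate 0.001 per hour; the function name and all numbers are illustrative and not from the lecture.

```python
import random

def empirical_reliability(failure_times, t):
    """Return R(t) = N0(t)/N = 1 - Nf(t)/N from observed failure times."""
    n = len(failure_times)
    n_failed = sum(1 for ft in failure_times if ft <= t)  # Nf(t)
    return 1 - n_failed / n

# Hypothetical experiment: 10,000 systems, failure rate 0.001 per hour.
rng = random.Random(0)
times = [rng.expovariate(0.001) for _ in range(10_000)]

r_500 = empirical_reliability(times, 500.0)  # estimate of R(500)
q_500 = 1 - r_500                            # unreliability Q(500)
```

With a large N the estimate should approach exp(-0.5), which is what the constant-failure-rate analysis on the later slides predicts.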
Introduction (contd.) • Mathematical formulation (contd.) • Also • failure rate at time t • (instantaneous rate of failures at time t) / N0(t) • z(t) = (1/N0(t))(dNf(t)/dt) • this and the previous expressions together reduce to • z(t) = -(1/R(t))(dR(t)/dt) • z(t) is called the failure rate, hazard function, or hazard rate • We can solve the above for R(t) provided we know the failure rate z(t) • Bathtub curve for failure rate • implies constant failure rate during useful life • infant mortality and wear-out periods have time-varying failure rates
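Solving z(t) = -(1/R(t))(dR(t)/dt) with R(0) = 1 gives R(t) = exp(-∫ z(u) du), integrated from 0 to t. The sketch below evaluates that integral numerically for any hazard function; the midpoint rule, the function name, and the two example hazard functions are assumptions chosen for illustration.

```python
import math

def reliability_from_hazard(z, t, steps=10_000):
    """Approximate R(t) = exp(-integral of z from 0 to t) using the midpoint rule."""
    h = t / steps
    integral = h * sum(z((i + 0.5) * h) for i in range(steps))
    return math.exp(-integral)

# Constant hazard (useful-life region of the bathtub curve), hypothetical rate.
r_const = reliability_from_hazard(lambda u: 0.002, 1000.0)

# Linearly increasing hazard (a crude wear-out model), hypothetical slope.
r_wearout = reliability_from_hazard(lambda u: 2e-6 * u, 1000.0)
```

For the constant hazard the result matches exp(-λt); for the linear hazard the integral is 1e-6·t², so R(1000) is exp(-1).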
Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - constant failure rate λ • solve the equations - exponential function for reliability and for unreliability: R(t) = 1 - Q(t) = exp(-λt) • Reliability computation - time-varying failure rate • Weibull distribution: z(t) = αλ(λt)**(α-1) • solve the equations - R(t) = exp(-(λt)**α), which reduces to the exponential case for α = 1 • Failure rate computation - military standard • function of - learning factor, quality factor, temperature factor, environmental factor, and # of pins on IC
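The two closed forms on this slide can be written down directly. This is a small sketch, assuming the Weibull parameterization z(t) = αλ(λt)**(α-1) used above; the function names and sample values are illustrative.

```python
import math

def exponential_reliability(lam, t):
    """R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)

def weibull_hazard(alpha, lam, t):
    """z(t) = alpha * lam * (lam * t)**(alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(alpha, lam, t):
    """R(t) = exp(-(lam * t)**alpha), obtained by integrating the Weibull hazard."""
    return math.exp(-((lam * t) ** alpha))

# alpha = 1 gives a constant hazard; alpha > 1 gives a hazard that grows with time.
r_exp = exponential_reliability(0.001, 500.0)
r_weibull_alpha1 = weibull_reliability(1.0, 0.001, 500.0)
```

Setting α = 1 recovers the constant-failure-rate case, while α > 1 models wear-out and α < 1 models a decreasing failure rate such as infant mortality.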
Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - mean time to failure (MTTF) • Definition: expected time that a system will operate before the first failure occurs • Probability measure: S - sample space, E - event space • for A in E, P(A) >= 0 • P(S) = 1 • P(A ∪ B) = P(A) + P(B), when A and B are non-intersecting • Random Variable (RV) - X maps events of S to real numbers • Probability distribution function of a RV • Probability density function (pdf) - derivative of the distribution function
Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - mean time to failure • Probability density function - properties • always >= 0 • integrates to 1 over its whole range • Expectation • integrate x f(x) in the continuous case • Σ xi p(xi) in the discrete case • Application in our case • unreliability Q(t) is the probability distribution function of the time to failure - it is the cumulative probability that the system fails in the interval [0,t]
Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - MTTF and MTTR • Application in our case (contd.) • the derivative of Q(t), written as f(t), is the pdf of failure - or failure density function • The expected value of the time to failure, computed by integrating t f(t), is the Mean Time To Failure (MTTF) • constant failure rate • MTTF = 1/λ • Mean time to repair - MTTR • assume a constant repair rate (μ), use arguments similar to those used for failure analysis, and conclude MTTR = 1/μ
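As a check on MTTF = 1/λ, the expectation of t·f(t) can be integrated numerically. The sketch below assumes a hypothetical failure rate of 0.01 per hour and truncates the integral at a finite horizon, which is safe here because the exponential tail beyond it is negligible; the helper name and step count are illustrative choices.

```python
import math

def expected_value(pdf, horizon, steps=100_000):
    """Approximate the integral of t * pdf(t) over [0, horizon] with the midpoint rule."""
    h = horizon / steps
    return h * sum(((i + 0.5) * h) * pdf((i + 0.5) * h) for i in range(steps))

lam = 0.01  # hypothetical constant failure rate, per hour

def failure_density(t):
    """f(t) = dQ/dt = lam * exp(-lam * t) for a constant failure rate."""
    return lam * math.exp(-lam * t)

mttf_numeric = expected_value(failure_density, horizon=5000.0)
mttf_closed_form = 1 / lam
```

The same helper applied to a repair-time density with rate μ would give MTTR = 1/μ by the identical argument.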
Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - mean time between failures (MTBF) • Mean time between failures - MTBF • use heuristic arguments to conclude • MTBF = (total time T)/(average number of failures) • can also argue MTBF = MTTF + MTTR • Note: often λ << μ and hence MTTF >> MTTR, therefore the terms MTTF and MTBF are used interchangeably by some practitioners
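A quick numeric illustration shows why MTTF and MTBF are often treated as the same number. The rates below are hypothetical values chosen so that λ << μ; they are not from the lecture.

```python
lam = 1e-4  # hypothetical failure rate, per hour
mu = 1e-1   # hypothetical repair rate, per hour

mttf = 1 / lam       # mean time to failure, about 10,000 hours
mttr = 1 / mu        # mean time to repair, about 10 hours
mtbf = mttf + mttr   # mean time between failures

# Fraction by which MTBF exceeds MTTF; small whenever lam << mu.
relative_gap = (mtbf - mttf) / mttf
```

Here the gap is about 0.1 percent, so the distinction matters only when repairs are slow relative to the time between failures.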
Reliability Modeling • Application of the previous analysis to system models • Assumptions • system consists of modules • each module assigned a probability of working R(t), a function of time • once a module fails it is assumed to yield incorrect results • module failures are independent
Reliability Modeling • Application of the previous analysis to system models • Reliability block diagrams • consider a system - microP, controller, mem, bus, … • the system will fail if any of the components fails • Rsys = P(all subsystems work correctly) • = P(bus correct).P(mem correct)… etc. (follows from the assumption that component failures are independent) • Rsys = Rbus.Rmem.Rmicro.Rcont
Reliability Modeling • Reliability block diagrams - Series Systems • Assume system has n components • All components should survive for system to operate • Reliability of system • Rsys(t) = Π Ri(t) • For exponential distributions of each component • Rsys(t) = Π exp(-λi t) = exp(-(λ1 + λ2 + … + λn)t) = exp(-Σ λi t) • Effect is that the system failure rate is the summation of failure rates of components • Note these are nonredundant systems • [Figure: blocks R1, R2, …, Rn connected in series]
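The series result can be sketched in a few lines. The failure rates below are hypothetical values for a four-component system such as bus, memory, microprocessor, and controller; only the formulas come from the slide.

```python
import math

def series_reliability(failure_rates, t):
    """R_sys(t) = exp(-(sum of failure rates) * t) for independent exponential components."""
    return math.exp(-sum(failure_rates) * t)

rates = [2e-5, 1e-5, 5e-6, 1.5e-5]  # hypothetical per-hour failure rates
t = 1000.0                          # mission time in hours

r_sys = series_reliability(rates, t)
# Same value computed as the product of the individual reliabilities.
r_product = math.prod(math.exp(-lam * t) for lam in rates)
# The system failure rate is the sum of the component rates, so MTTF = 1/sum.
mttf_sys = 1 / sum(rates)
```

Both routes give the same reliability, which is the point of the slide: a series system behaves like one component whose failure rate is the sum of the parts.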
Reliability Modeling • Reliability block diagrams - Parallel Systems • Assume system with spares • faulty component is replaced by a spare as fault occurs • only one component needs to survive for the system to operate • Model is to represent all components connected in parallel • P(sys fail) = P(M1 fails).P(M2 fails). .. .P(Mn fails) • Rsys = 1 - P(sys fail) = 1- (1-R1)(1-R2) …(1-Rn)
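The parallel formula translates directly into code. This is a minimal sketch with hypothetical module reliabilities, assuming independent module failures as stated in the modeling assumptions.

```python
from math import prod

def parallel_reliability(module_reliabilities):
    """R_sys = 1 - (1 - R1)(1 - R2)...(1 - Rn) for independent modules."""
    return 1 - prod(1 - r for r in module_reliabilities)

r_one = parallel_reliability([0.9])              # a single module, no spare
r_two = parallel_reliability([0.9, 0.9])         # one spare added
r_three = parallel_reliability([0.9, 0.9, 0.9])  # two spares added
```

Each extra spare multiplies the failure probability by (1 - R), so the gain from every additional module shrinks.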
Reliability Modeling • Reliability block diagrams - Series-Parallel Systems • straightforward • Reliability block diagrams - MTTF of system • 1/(system failure rate) • Series systems - 1/(sum of individual failure rates) • Parallel systems and series-parallel systems - work out by integration from the reliability or unreliability equations
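For a parallel system the MTTF has to come from integrating R(t), using the standard identity MTTF = ∫ R(t) dt from 0 to infinity. The sketch below does this numerically for two identical modules in parallel; the failure rate, the finite integration horizon, and the helper name are assumptions made for illustration.

```python
import math

def integrate(f, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [0, upper]."""
    h = upper / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))

lam = 0.01  # hypothetical per-hour failure rate of each module

def r_parallel(t):
    """Reliability of two identical exponential modules in parallel."""
    r = math.exp(-lam * t)
    return 1 - (1 - r) ** 2

mttf_single = 1 / lam                               # one module
mttf_series = 1 / (2 * lam)                         # two modules in series
mttf_parallel = integrate(r_parallel, upper=5000.0) # two modules in parallel
```

Integrating 2exp(-λt) - exp(-2λt) by hand gives 3/(2λ), so the second module raises the MTTF by only 50 percent.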
Reliability Modeling • Reliability block diagrams - Non series-parallel systems • Bayes rule (theorem of total probability): consider a sample space S. Partition it into B and B' (the complement of B). Now consider an event A that falls partly in B and partly in B'. We can write: A = (A ∩ B) ∪ (A ∩ B'), so P(A) = P[(A ∩ B) ∪ (A ∩ B')] = P(A ∩ B) + P(A ∩ B') = P(A|B)P(B) + P(A|B')P(B') • In general, if the set S is partitioned into (B1, B2, …, Bn), then P(A) = Σ P(A|Bi)P(Bi) • This can also be viewed graphically (draw a tree)
Reliability Modeling • Reliability block diagrams - Non series-parallel systems • Example - consider the following non series-parallel system • [Figure: block diagram with components C1, C2, C3, C4, C5] • list all paths for system to survive, namely c1c4, c2c4, c2c5, c3c5 • These paths are not disjoint; the sum of the reliabilities of all paths gives an upper bound on the system reliability • Exact computation is possible using Bayes rule - complete in class
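The example can be worked numerically from the four listed paths alone. The sketch below computes the exact reliability two ways: by enumerating all component states, and by conditioning on C2. Conditioning on C2 is an assumption about how the in-class derivation might proceed (C2 is a natural choice because it appears in two paths), and the component reliability of 0.9 is hypothetical.

```python
from itertools import product

PATHS = [(1, 4), (2, 4), (2, 5), (3, 5)]  # success paths listed on the slide

def exact_by_enumeration(r):
    """Sum the probability of every component state in which some path is fully up."""
    comps = sorted(r)
    total = 0.0
    for states in product([True, False], repeat=len(comps)):
        up = dict(zip(comps, states))
        if any(all(up[c] for c in path) for path in PATHS):
            p = 1.0
            for c in comps:
                p *= r[c] if up[c] else 1 - r[c]
            total += p
    return total

def exact_by_conditioning_on_c2(r):
    """P(sys) = P(sys | C2 up) P(C2 up) + P(sys | C2 down) P(C2 down)."""
    given_up = 1 - (1 - r[4]) * (1 - r[5])                  # need C4 or C5
    given_down = 1 - (1 - r[1] * r[4]) * (1 - r[3] * r[5])  # need C1C4 or C3C5
    return r[2] * given_up + (1 - r[2]) * given_down

def path_sum_upper_bound(r):
    """Sum of path reliabilities; overcounts because the paths are not disjoint."""
    return sum(r[a] * r[b] for a, b in PATHS)

r = {c: 0.9 for c in range(1, 6)}  # hypothetical component reliabilities
exact = exact_by_enumeration(r)
conditioned = exact_by_conditioning_on_c2(r)
bound = path_sum_upper_bound(r)
```

With highly reliable components the path sum exceeds 1 and the bound is uninformative, which is why the exact computation is worth the effort.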
Reliability Modeling • Combinatorial model • Consider an NMR system • Assume voter reliability to be 1 • Divide all events for success into disjoint events • Compute probability of each event and add them • Example - TMR system • Can be used to compute MTTF • Can also analyze other systems such as an m-of-n system
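The combinatorial recipe (split success into disjoint events, then add their probabilities) can be sketched as follows, assuming identical modules with reliability R, independent failures, and a perfect voter. The function names are illustrative.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent modules work.

    The events 'exactly i modules work' are disjoint, so their probabilities add.
    """
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(m, n + 1))

def tmr_reliability(r):
    """TMR is 2-of-3 with a perfect voter: R_TMR = 3R^2 - 2R^3."""
    return 3 * r**2 - 2 * r**3

r_tmr_good = tmr_reliability(0.9)  # module reliability above 0.5
r_tmr_poor = tmr_reliability(0.4)  # module reliability below 0.5
```

TMR improves on a single module only when R exceeds 0.5; below that, the redundant system is worse than the simplex one.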
Reliability Modeling • Markov model • Difficulty with the previous models • incorporating repairs in the model and analysis • incorporating a coverage factor - for example, in a duplicated system we may be less than 100% certain that only the faulty unit will be eliminated when the system is reconfigured • Markov modeling - basic • Define the concept of state using the TMR system example (8 states) • Transitions between states occur with certain probabilities • Markov model - assumption • Probability of transition from a state si to sj is independent of the method of arrival into state si • Example - develop a Markov model for a TMR in class
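Reliability Modeling • Markov model • Markov model for a TMR - all details not shown • [Figure: state diagram over the eight states 000, 001, 010, 011, 100, 101, 110, 111; labeled transitions include three arcs with probability λΔt and a self-loop with probability 1-3λΔt]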
Reliability Modeling • Markov model - Reduced • Reduced Markov model for a TMR system • Previous eight state model can be reduced to a three state model by merging states and re-computing the transition probabilities • Markov model - accounting for repairs • We can include links between states knowing the repair rates of components
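The reduced model can be stepped forward in time directly from its transition probabilities. The sketch below assumes three merged states (all three modules good, exactly two good, system failed), identical modules with failure rate λ, a perfect voter, and no repair; the step size and rate are hypothetical.

```python
import math

def tmr_reliability_markov(lam, t, dt=0.1):
    """Iterate the three-state TMR difference equations from 0 to t.

    p3: all three modules good; p2: exactly two good; pf: system failed.
    In one step of length dt, p3 -> p2 with probability 3*lam*dt and
    p2 -> pf with probability 2*lam*dt.
    """
    p3, p2, pf = 1.0, 0.0, 0.0
    for _ in range(round(t / dt)):
        p3, p2, pf = (
            (1 - 3 * lam * dt) * p3,
            3 * lam * dt * p3 + (1 - 2 * lam * dt) * p2,
            2 * lam * dt * p2 + pf,
        )
    return p3 + p2  # the system works in either non-failed state

lam = 0.001  # hypothetical per-hour module failure rate
r_markov = tmr_reliability_markov(lam, 500.0)
# Combinatorial TMR result with R = exp(-lam*t), for comparison.
r_closed = 3 * math.exp(-2 * lam * 500.0) - 2 * math.exp(-3 * lam * 500.0)
```

As dt shrinks, the iterated value approaches the combinatorial result 3R² - 2R³. Adding a repair link would mean one more term moving probability from p2 back to p3.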
Reliability Modeling • Markov model - analyzing systems • Consider a duplicate compare system - no repairs • Develop Markov model with 3 states • Develop a difference equation for computing probabilities for being in different states of the system • Develop a differential equation model • Solution methods • Numerical approach • Solving differential equation • direct approach • Using Laplace transforms
Reliability Modeling • Markov model - analyzing systems • Consider a duplicate compare system - with repairs • Develop Markov model with 3 states • Develop a differential equation model • Solve using Laplace transforms • Yet one more example • duplicate compare system - with imperfect coverage • Develop Markov model with 5 states • Reduce model for different scenarios
Other Parameters and analysis • Markov model - Can use other parameters • Safety • Availability • Consider a simplex system • Develop Markov model with 2 states • Solve the system for probability of system being in available state • Define and compute steady state availability • Provide an intuitive explanation of the computed value of steady state availability and its relation to MTTF and MTTR • Maintainability
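The two-state simplex model has a short closed-form solution. The sketch below assumes a constant failure rate λ, a constant repair rate μ, and a system that starts in the available state; the numeric rates are hypothetical.

```python
import math

def availability(lam, mu, t):
    """A(t) for a two-state (up/down) model that starts in the up state.

    Solves dA/dt = -lam * A + mu * (1 - A) with A(0) = 1.
    """
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows: mu / (lam + mu)."""
    return mu / (lam + mu)

lam, mu = 1e-3, 1e-1  # hypothetical failure and repair rates, per hour
a_ss = steady_state_availability(lam, mu)
mttf, mttr = 1 / lam, 1 / mu
a_from_means = mttf / (mttf + mttr)  # same value written with MTTF and MTTR
```

The steady-state value equals MTTF/(MTTF + MTTR), which gives the intuitive reading: the long-run fraction of each failure-repair cycle that the system spends working.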
General remarks • Voter reliability issue • Performance and states with degraded performance • Mission time improvement • Redundancy Ratio • Law of diminishing return
Summary • Introduction of mathematical models • Solving models to carry out analysis • Example systems • Duplicate • Duplicate with repair • Simplex with repair for availability