
ECE 753: FAULT-TOLERANT COMPUTING

Explore reliability modeling in fault-tolerant computing, including quantitative analysis, failure rate computation, and Mean Time To Failure concepts.


Presentation Transcript


  1. ECE 753: FAULT-TOLERANT COMPUTING • Kewal K. Saluja, Department of Electrical and Computer Engineering • Reliability Modeling and Analysis • Lectures 8-10

  2. Overview • Recap • Introduction • Reliability Modeling • reliability block diagram • combinatorial model • Markov model • Other Parameters and analysis • General remarks and Summary ECE 753 Fault Tolerant Computing

  3. Recap • Course introduction • Fundamental principles - Four types of redundancy • FEF and breaking FEF chain • Fault modeling • models at different levels, error models, process failure models • Testing and Test Generation • test generation, fault simulation, DFT and BIST concepts • Simple concepts in fault-tolerance • hardware redundancy, information redundancy, time redundancy, and software redundancy methods

  4. Introduction • References • [prad:96] • [john:89] • [triv:82] • These three books contain sufficient material covering this part of the course • Recap of definitions • Importance of analysis and analytical model • Mathematical formulation for quantitative analysis

  5. Introduction (contd.) • Recap of definitions • Reliability R(t) • Availability A(t) • Performability and Dependability • Importance of analysis and analytical model • to evaluate a design • a metric to compare different designs • to provide feedback to the designer during early design stages • use a model for performance analysis • used for quantitative and qualitative analysis

  6. Introduction (contd.) • Mathematical formulation for quantitative analysis • consider a large experiment with N systems • observation at time t • N0(t) - number of correctly operating systems • Nf(t) - number of failed systems, so N0(t) + Nf(t) = N • Hence • Reliability R(t) = N0(t)/N = 1 - Nf(t)/N • Unreliability Q(t) = Nf(t)/N = 1 - R(t) • Derivative of reliability: dR(t)/dt = -(1/N)(dNf(t)/dt) • dNf(t)/dt is called the instantaneous failure rate of the component
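
The counting definition on this slide can be tried on simulated data. The following is a minimal sketch, assuming exponentially distributed lifetimes with a made-up rate `lam`; the function name `empirical_reliability` and all numeric values are illustrative and do not come from the lecture.

```python
import math
import random

def empirical_reliability(lifetimes, t):
    """Estimate R(t) = N0(t)/N from a list of observed failure times."""
    n = len(lifetimes)
    n_failed = sum(1 for life in lifetimes if life <= t)  # Nf(t)
    n_ok = n - n_failed                                   # N0(t)
    return n_ok / n

# Hypothetical experiment: N = 20000 systems, assumed failure rate 0.001 per hour.
rng = random.Random(0)
lam = 0.001
lifetimes = [rng.expovariate(lam) for _ in range(20000)]

r_hat = empirical_reliability(lifetimes, 500.0)  # counted estimate at t = 500
r_model = math.exp(-lam * 500.0)                 # exponential model value
print(r_hat, r_model)
```

With a large N the counted ratio should approach the model value; the gap between the two shrinks roughly as 1/sqrt(N).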

  7. Introduction (contd.) • Mathematical formulation (contd.) • Also • failure rate at time t • (instantaneous failure rate at time t) / N0(t) • (1/N0(t))(dNf(t)/dt) - called z(t) • this and the previous expressions together reduce to • z(t) = -(1/R(t))(dR(t)/dt) • z(t) is called the failure rate, hazard function, or hazard rate • We can solve the above for R(t) provided we know the failure rate z(t) • Bathtub curve for failure rate • implies constant failure rate during useful life • infant mortality and wear-out periods have variable failure rates

  8. Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - constant failure rate • z(t) = λ • solve the equations - exponential function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-λt) • Reliability computation - time-varying failure rate • Weibull distribution z(t) = αλ(λt)**(α-1) • solve the equations - exponential function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-(λt)**α) • Failure rate computation - military standard • function of - learning factor, quality factor, temperature factor, environmental factor, and number of pins on the IC
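
Solving z(t) = -(1/R(t))(dR(t)/dt) with R(0) = 1 gives R(t) = exp(-integral of z from 0 to t). The sketch below evaluates that for the constant and Weibull hazards on this slide and checks the hazard relation by finite differences. Function names and numeric values are illustrative assumptions.

```python
import math

def reliability_exponential(lam, t):
    """R(t) for a constant failure rate z(t) = lam."""
    return math.exp(-lam * t)

def hazard_weibull(alpha, lam, t):
    """Weibull hazard z(t) = alpha*lam*(lam*t)**(alpha-1); use t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def reliability_weibull(alpha, lam, t):
    """R(t) = exp(-(lam*t)**alpha), the integral of the Weibull hazard."""
    return math.exp(-((lam * t) ** alpha))

# Check z(t) = -(1/R) dR/dt numerically at one point (values are made up).
alpha, lam, t, h = 2.0, 0.01, 100.0, 1e-4
slope = (reliability_weibull(alpha, lam, t + h)
         - reliability_weibull(alpha, lam, t - h)) / (2 * h)
z_numeric = -slope / reliability_weibull(alpha, lam, t)
print(z_numeric, hazard_weibull(alpha, lam, t))
```

Setting `alpha = 1` reduces the Weibull case to the constant-rate case, which is a quick sanity check on both functions.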

  9. Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - mean time to failure (MTTF) • Definition: expected time that a system will operate before the first failure occurs • Probability measure: S - sample space, E - event space • P(A) >= 0 for every A in E • P(S) = 1 • P(A ∪ B) = P(A) + P(B), when A and B are non-intersecting • Random Variable (RV) - X maps outcomes in S to real numbers • Probability distribution function of a RV • Probability density function (pdf) - derivative of the distribution function

  10. Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - mean time to failure • Probability density function - properties • always >= 0 • integrates to 1 (between limits) • Expectation • integral of x f(x) dx in the continuous case • Σ xi p(xi) in the discrete case • Application in our case • unreliability Q(t) is a probability distribution function of failure - in fact it is the cumulative probability that the system fails in time [0,t]

  11. Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - MTTF and MTTR • Application in our case (contd.) • derivative of Q(t), written as f(t), is the pdf of failure - or failure density function • Expected value can be computed using integration and is the Mean Time To Failure (MTTF) • constant failure rate • MTTF = 1/λ • Mean time to repair - MTTR • assume a constant repair rate (μ), use arguments similar to those used for failure analysis, and conclude MTTR = 1/μ
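
The result MTTF = 1/λ can be checked by integrating t f(t) numerically, where f(t) = λ exp(-λt) is the failure density for a constant failure rate. This is a rough sketch using a hand-written trapezoidal rule; the helper name `mttf_numeric`, the truncation point, and the rate value are assumptions made for illustration.

```python
import math

def mttf_numeric(pdf, t_max, steps=100000):
    """Trapezoidal estimate of the integral of t*f(t) over [0, t_max]."""
    h = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        weight = 0.5 if k in (0, steps) else 1.0  # half weight at the endpoints
        total += weight * t * pdf(t)
    return total * h

lam = 0.01  # assumed constant failure rate

def failure_density(t):
    """f(t) = dQ/dt = lam*exp(-lam*t) for a constant failure rate."""
    return lam * math.exp(-lam * t)

# Truncate the infinite integral at 50 mean lifetimes; the tail is negligible.
mttf = mttf_numeric(failure_density, t_max=50.0 / lam)
print(mttf, 1.0 / lam)
```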

  12. Introduction (contd.) • Mathematical formulation (contd.) • Reliability computation - mean time between failures (MTBF) • use heuristic arguments to conclude • MTBF = (total time T)/(average number of failures) • can also argue MTBF = MTTF + MTTR • Note: often λ << μ and hence MTTF >> MTTR, therefore the terms MTTF and MTBF are used interchangeably by some practitioners

  13. Reliability Modeling • Application of the previous analysis to system models • Assumptions • system consists of modules • each module assigned a probability of working R(t), a function of time • once a module fails it is assumed to yield incorrect results • module failures are independent

  14. Reliability Modeling • Application of the previous analysis to system models • Reliability block diagrams • consider a system - microP, controller, mem, bus, … • the system will fail if any of the components fails • Rsys = P(all subsystems work correctly) • = P(bus correct).P(mem correct)… etc. (follows from the assumption that component failures are independent) • Rsys = Rbus.Rmem.Rmicro.Rcont

  15. Reliability Modeling • Reliability block diagrams - Series Systems • Assume system has n components • All components should survive for system to operate • Reliability of system • Rsys(t) = Π Ri(t), product over i = 1, …, n • For exponential distributions of each component • Rsys(t) = Π exp(-λi t) = exp(-(λ1 + λ2 + … + λn)t) = exp(-Σ λi t) • Effect is that the system failure rate is the summation of failure rates of components • Note these are nonredundant systems • [Figure: blocks R1, R2, …, Rn connected in series]
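
As a small illustration of the series formula, the sketch below multiplies per-component exponential reliabilities and compares the product with a single exponential whose rate is the sum of the component rates. The component rates and the function names are hypothetical.

```python
import math

def series_reliability(rates, t):
    """Rsys(t) as the product of exp(-lam_i * t) over all components."""
    r_sys = 1.0
    for lam in rates:
        r_sys *= math.exp(-lam * t)
    return r_sys

def series_failure_rate(rates):
    """System failure rate of a series system: the sum of component rates."""
    return sum(rates)

# Hypothetical rates for bus, memory, microprocessor, controller (per hour).
rates = [1e-5, 4e-5, 2e-5, 3e-5]
t = 1000.0
r_product = series_reliability(rates, t)
r_single = math.exp(-series_failure_rate(rates) * t)
print(r_product, r_single)
```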

  16. Reliability Modeling • Reliability block diagrams - Parallel Systems • Assume system with spares • a faulty component is replaced by a spare when a fault occurs • only one component needs to survive for the system to operate • Model is to represent all components connected in parallel • P(sys fail) = P(M1 fails).P(M2 fails)….P(Mn fails) • Rsys = 1 - P(sys fail) = 1 - (1-R1)(1-R2)…(1-Rn)
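
The parallel formula is easy to evaluate directly. A minimal sketch, assuming independent modules and made-up module reliabilities (the function name is illustrative):

```python
def parallel_reliability(module_reliabilities):
    """Rsys = 1 - product of (1 - Ri): the system fails only if every module fails."""
    p_all_fail = 1.0
    for r in module_reliabilities:
        p_all_fail *= 1.0 - r
    return 1.0 - p_all_fail

# Two modules at 0.9 each: the system fails with probability 0.1 * 0.1.
r_two = parallel_reliability([0.9, 0.9])
# Adding a third module removes another factor of 10 from the failure probability.
r_three = parallel_reliability([0.9, 0.9, 0.9])
print(r_two, r_three)
```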

  17. Reliability Modeling • Reliability block diagrams - Series-Parallel Systems • straightforward • Reliability block diagrams - MTTF of system • 1/(system failure rate), when the system failure rate is constant • Series systems - 1/(sum of individual failure rates) • Parallel systems and series-parallel systems - work out by integration from the reliability or unreliability equations
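
For a parallel system the MTTF has to be integrated from R(t). As a worked sketch, take two identical modules in parallel, each with an assumed constant rate λ: R(t) = 1 - (1 - exp(-λt))**2 = 2exp(-λt) - exp(-2λt), and integrating from 0 to infinity gives MTTF = 2/λ - 1/(2λ) = 3/(2λ). The code below checks that value numerically; the helper `integrate` and the rate value are illustrative assumptions.

```python
import math

def integrate(func, t_max, steps=200000):
    """Trapezoidal rule for the integral of func over [0, t_max]."""
    h = t_max / steps
    total = 0.5 * (func(0.0) + func(t_max))
    for k in range(1, steps):
        total += func(k * h)
    return total * h

lam = 0.01  # assumed failure rate of each module

def r_parallel(t):
    """Reliability of two identical modules in parallel."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** 2

# MTTF = integral of R(t) dt, truncated at 50 single-module mean lifetimes.
mttf_parallel = integrate(r_parallel, t_max=50.0 / lam)
print(mttf_parallel, 1.5 / lam)
```

The second module raises the MTTF by only 50 percent over a single module, not 100 percent.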

  18. Reliability Modeling • Reliability block diagrams - Non-series-parallel systems • Bayes rule: consider a sample space S. Partition it into B and B' (the complement of B). Now consider an event A that falls partly in B and partly in B'. We can write: A = (A ∩ B) ∪ (A ∩ B') • P(A) = P[(A ∩ B) ∪ (A ∩ B')] = P(A ∩ B) + P(A ∩ B') = P(A|B)P(B) + P(A|B')P(B') • In general the set S can be partitioned into (B1, B2, …, Bn) • P(A) = Σ P(A|Bi)P(Bi) • This can also be viewed graphically (draw a tree)

  19. Reliability Modeling • Reliability block diagrams - Non-series-parallel systems • Example - consider the following non-series-parallel system • [Figure: reliability block diagram with components C1, C2, C3, C4, C5] • list all paths for the system to survive, namely c1c4, c2c4, c2c5, c3c5 • These paths are not disjoint; the sum of the reliabilities of all paths gives an upper bound on the system reliability • Exact computation is possible using Bayes rule - complete in class
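
One way to work this example is sketched below. It is an illustration under stated assumptions, not the in-class solution: the system structure is taken only from the four success paths listed on the slide, conditioning on C2 is one possible choice of partition, and the reliability value 0.9 per component is made up. Brute-force enumeration of all 32 component states serves as a cross-check.

```python
from itertools import product

NAMES = ["c1", "c2", "c3", "c4", "c5"]
PATHS = [("c1", "c4"), ("c2", "c4"), ("c2", "c5"), ("c3", "c5")]

def exact_by_enumeration(rel):
    """Sum the probabilities of all component states with a working path."""
    total = 0.0
    for states in product([True, False], repeat=len(NAMES)):
        up = dict(zip(NAMES, states))
        if any(all(up[c] for c in path) for path in PATHS):
            prob = 1.0
            for name in NAMES:
                prob *= rel[name] if up[name] else 1.0 - rel[name]
            total += prob
    return total

def exact_by_conditioning(rel):
    """P(A) = P(A|C2 works)P(C2 works) + P(A|C2 fails)P(C2 fails)."""
    # C2 works: the system survives if C4 or C5 works.
    given_up = 1.0 - (1.0 - rel["c4"]) * (1.0 - rel["c5"])
    # C2 fails: only the paths c1c4 and c3c5 remain, and they share no component.
    given_down = 1.0 - (1.0 - rel["c1"] * rel["c4"]) * (1.0 - rel["c3"] * rel["c5"])
    return rel["c2"] * given_up + (1.0 - rel["c2"]) * given_down

def path_sum_bound(rel):
    """Sum of path reliabilities: an upper bound, possibly larger than 1."""
    total = 0.0
    for path in PATHS:
        term = 1.0
        for c in path:
            term *= rel[c]
        total += term
    return total

rel = {name: 0.9 for name in NAMES}  # hypothetical component reliabilities
r_enum = exact_by_enumeration(rel)
r_cond = exact_by_conditioning(rel)
r_bound = path_sum_bound(rel)
print(r_enum, r_cond, r_bound)
```

With highly reliable components the path-sum bound exceeds 1 and carries no information; it becomes useful only when the path reliabilities are small.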

  20. Reliability Modeling • Combinatorial model • Consider an NMR system • Assume voter reliability to be 1 • Divide all events for success into disjoint events • Compute probability of each event and add them • Example - TMR system • Can be used to compute MTTF • Can also analyze other systems such as an m-of-n system
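
The disjoint events for an m-of-n system are "exactly i modules work" for i = m, …, n. A short sketch, assuming identical independent modules with reliability r and a perfect voter (function names are illustrative):

```python
import math

def m_of_n_reliability(m, n, r):
    """Add the probabilities of the disjoint events 'exactly i of n modules work'."""
    return sum(math.comb(n, i) * r**i * (1.0 - r) ** (n - i)
               for i in range(m, n + 1))

def tmr_reliability(r):
    """TMR is 2-of-3: 3 r^2 (1 - r) + r^3 = 3 r^2 - 2 r^3."""
    return 3.0 * r**2 - 2.0 * r**3

print(m_of_n_reliability(2, 3, 0.9), tmr_reliability(0.9))
# Below r = 0.5, TMR is less reliable than a single module.
print(tmr_reliability(0.4), 0.4)
```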

  21. Reliability Modeling • Markov model • Difficulty with the previous models • incorporating repairs in the model and analysis • incorporating a coverage factor - for example, in a duplicated system we may be less than 100% certain that only the faulty unit is eliminated when the system is reconfigured • Markov modeling - basic • Define the concept of state using the TMR system example (8 states) • Transitions between states occur with certain probabilities • Markov model - assumption • Probability of transition from a state si to sj is independent of the method of arrival into state si • Example - develop a Markov model for a TMR in class

  22. Reliability Modeling • Markov model • Markov model for a TMR - all details not shown • [Figure: state diagram over the eight states 000, 001, 010, 011, 100, 101, 110, 111; the transitions shown are labeled λΔt (three of them) and 1-3λΔt]

  23. Reliability Modeling • Markov model - reduced • Reduced Markov model for a TMR system • Previous eight-state model can be reduced to a three-state model by merging states and re-computing the transition probabilities • Markov model - accounting for repairs • We can include links between states knowing the repair rates of components
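
A reduced TMR model can be stepped forward with difference equations. The sketch below assumes the usual merging: state 3 (all modules good), state 2 (one failed), and state F (two or more failed), with transition probabilities 3λΔt and 2λΔt per step, no repair, and a perfect voter. These state names, the rate, and the step size are assumptions for illustration; the result is compared with the combinatorial value 3R**2 - 2R**3 for R = exp(-λt).

```python
import math

def tmr_markov(lam, t_end, dt):
    """Step the three-state chain and return (p3, p2, pf) at time t_end."""
    p3, p2, pf = 1.0, 0.0, 0.0  # start with all three modules working
    steps = int(round(t_end / dt))
    for _ in range(steps):
        to_two = 3.0 * lam * dt * p3   # one of the three good modules fails
        to_fail = 2.0 * lam * dt * p2  # one of the two remaining modules fails
        p3, p2, pf = p3 - to_two, p2 + to_two - to_fail, pf + to_fail
    return p3, p2, pf

lam, t_end = 0.001, 1000.0
p3, p2, pf = tmr_markov(lam, t_end, dt=0.1)
r_markov = p3 + p2  # the system works in states 3 and 2
r_closed = 3.0 * math.exp(-2.0 * lam * t_end) - 2.0 * math.exp(-3.0 * lam * t_end)
print(r_markov, r_closed)
```

Shrinking `dt` brings the stepped result closer to the closed form; repairs could be sketched by adding a flow from state 2 back to state 3 at an assumed repair rate.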

  24. Reliability Modeling • Markov model - analyzing systems • Consider a duplicate compare system - no repairs • Develop Markov model with 3 states • Develop a difference equation for computing probabilities for being in different states of the system • Develop a differential equation model • Solution methods • Numerical approach • Solving differential equation • direct approach • Using Laplace transforms

  25. Reliability Modeling • Markov model - analyzing systems • Consider a duplicate compare system - with repairs • Develop Markov model with 3 states • Develop a differential equation model • Solve using Laplace transforms • Yet one more example • duplicate compare system - with imperfect coverage • Develop Markov model with 5 states • Reduce model for different scenarios

  26. Other Parameters and analysis • Markov model - can use other parameters • Safety • Availability • Consider a simplex system • Develop Markov model with 2 states • Solve the system for probability of system being in available state • Define and compute steady state availability • Provide an intuitive explanation of the computed value of steady state availability and its relation to MTTF and MTTR • Maintainability
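
For the simplex system with repair, a two-state model (available, under repair) with assumed constant failure rate λ and repair rate μ gives dA/dt = -λA + μ(1 - A), whose steady state is A = μ/(λ + μ) = MTTF/(MTTF + MTTR). The sketch below iterates the difference equation until it settles and compares it with both expressions; the rates, step size, and function name are hypothetical.

```python
def availability_after(lam, mu, dt, steps):
    """Iterate A <- A - lam*dt*A + mu*dt*(1 - A), starting from A = 1."""
    a = 1.0
    for _ in range(steps):
        a = a - lam * dt * a + mu * dt * (1.0 - a)
    return a

lam, mu = 0.001, 0.1  # assumed failure and repair rates (per hour)
a_numeric = availability_after(lam, mu, dt=0.1, steps=5000)
a_steady = mu / (lam + mu)

mttf, mttr = 1.0 / lam, 1.0 / mu
a_from_means = mttf / (mttf + mttr)  # fraction of each failure-repair cycle spent up
print(a_numeric, a_steady, a_from_means)
```

The last expression is the intuitive reading: over a long run the system alternates between an average up time of MTTF and an average down time of MTTR, so it is available for the fraction MTTF/(MTTF + MTTR) of the time.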

  27. General remarks • Voter reliability issue • Performance and states with degraded performance • Mission time improvement • Redundancy Ratio • Law of diminishing returns

  28. Summary • Introduction of mathematical models • Solving models to carry out analysis • Example systems • Duplicate • Duplicate with repair • Simplex with repair for availability
