Presentation Transcript


  1. HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and E-Business) Lecture 2 DEPENDABILITY CONCEPTS, MEASURES AND MODELS Winter Semester 2000/2001 Lecturer: Prof. Dr. Miroslaw Malek http://www.informatik.hu-berlin.de/~rok/zs DS - II - DCMM - 1

  2. DEPENDABILITY CONCEPTS, MEASURES AND MODELS • OBJECTIVES • TO INTRODUCE BASIC CONCEPTS AND TERMINOLOGY IN FAULT-TOLERANT COMPUTING • TO DEFINE MEASURES OF DEPENDABILITY • TO DESCRIBE MODELS FOR DEPENDABILITY EVALUATION • TO CHARACTERIZE BASIC DEPENDABILITY EVALUATION TOOLS • CONTENTS • BASIC DEFINITIONS • DEPENDABILITY MEASURES • DEPENDABILITY MODELS • EXAMPLES OF DEPENDABILITY EVALUATION TOOLS DS - II - DCMM - 2

  3. ADDING A THIRD DIMENSION DS - II - DCMM - 3

  4. FAULT INTOLERANCE • PRIOR ELIMINATION OF CAUSES OF UNRELIABILITY • fault avoidance • fault removal • NO REDUNDANCY • MANUAL / AUTOMATIC SYSTEM MAINTENANCE • FAULT INTOLERANCE ATTAINS RELIABLE SYSTEMS BY: • very reliable components  • refined design techniques • refined manufacturing techniques • shielding • comprehensive testing DS - II - DCMM - 4

  5. DEGREES OF DEFECTS • FAILURE • occurs when the delivered service deviates from the specified service: failures are caused by errors • FAULT • incorrect state of hardware or software  • ERROR – • manifestation of a fault within a program or data structure forcing deviation from the expected result of computation (incorrect result) DS - II - DCMM - 5

  6. TYPICAL FAULT MODELS FOR PARALLEL / DISTRIBUTED SYSTEMS • CRASH • OMISSION • TIMING • INCORRECT COMPUTATION / COMMUNICATION • ARBITRARY (BYZANTINE) DS - II - DCMM - 6

  7. FAULT TOLERANCE: BENEFITS & DISADVANTAGE • FAULT TOLERANCE • ACCEPT THAT AN IMPLEMENTED SYSTEM WILL NOT BE FAULT-FREE • FAULT TOLERANCE IS ATTAINED BY REDUNDANCY IN TIME AND/OR REDUNDANCY IN SPACE  • AUTOMATIC RECOVERY FROM ERRORS  • COMBINING REDUNDANCY AND FAULT INTOLERANCE  • BENEFITS OF FAULT TOLERANCE • HIGHER RELIABILITY  • LOWER TOTAL COST  • PSYCHOLOGICAL SUPPORT OF USERS  • DISADVANTAGE OF FAULT TOLERANCE • COST OF REDUNDANCY DS - II - DCMM - 7

  8. DS - II - DCMM - 8

  9. FAULT CHARACTERIZATIONS • CAUSE: specification, design, implementation, component, external • NATURE: hardware, software, analog, digital • DURATION: permanent, temporary, transient, intermittent, latent • EXTENT: local, distributed • VALUE: determinate, indeterminate DS - II - DCMM - 9

  10. BASIC TECHNIQUES/REDUNDANCY TECHNIQUES (1) • HARDWARE REDUNDANCY: Static (Masking) Redundancy, Dynamic Redundancy • SOFTWARE REDUNDANCY: Multiple Storage of Programs and Data, Test and Diagnostic Programs, Reconfiguration Programs, Program Restarts DS - II - DCMM - 10

  11. BASIC TECHNIQUES/REDUNDANCY TECHNIQUES (2) TIME (EXECUTION REDUNDANCY) Repeat or acknowledge operations at various levels Major Goal - Fault Detection and Recovery DS - II - DCMM - 11

  12. DEPENDABILITY MEASURES • DEPENDABILITY IS A VALUE OF QUANTITATIVE MEASURES SUCH AS RELIABILITY AND AVAILABILITY AS PERCEIVED OR DEFINED BY A USER • DEPENDABILITY IS THE QUALITY OF THE DELIVERED SERVICE SUCH THAT RELIANCE CAN JUSTIFIABLY BE PLACED ON THIS SERVICE • DEPENDABILITY IS THE ABILITY OF A SYSTEM TO PERFORM A REQUIRED SERVICE UNDER STATED CONDITIONS FOR A SPECIFIED PERIOD OF TIME DS - II - DCMM - 12

  13. RELIABILITY • RELIABILITY R(t) OF A SYSTEM IS THE PROBABILITY THAT THE SYSTEM WILL PERFORM SATISFACTORILY FROM TIME ZERO TO TIME t, GIVEN THAT OPERATION COMMENCES SUCCESSFULLY AT TIME ZERO (THE PROBABILITY THAT THE SYSTEM WILL CONFORM TO ITS SPECIFICATION THROUGHOUT A PERIOD OF DURATION t) • HARDWARE: Exponential distribution R(t) = e^(-λt), Weibull distribution R(t) = e^(-(λt)^α), α - shape parameter, λ - failure rate • SOFTWARE: Exponential, Weibull, normal, gamma or Bayesian • R(t) = e^(-λt) • A constant failure rate is assumed during the life of a system DS - II - DCMM - 13
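
The two hardware reliability functions above can be evaluated directly; the following Python sketch (with illustrative parameter values, not taken from the lecture) shows the exponential and Weibull forms:

    import math

    def reliability_exponential(t, lam):
        """R(t) = exp(-lambda * t): constant failure rate over the useful life."""
        return math.exp(-lam * t)

    def reliability_weibull(t, lam, alpha):
        """R(t) = exp(-(lambda * t)**alpha): alpha is the shape parameter."""
        return math.exp(-((lam * t) ** alpha))

    # Illustrative values (not from the lecture): lambda = 1e-4 failures/hour.
    print(reliability_exponential(1000, 1e-4))   # ~0.905
    print(reliability_weibull(1000, 1e-4, 0.8))  # alpha < 1 models a decreasing failure rate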

  14. MORTALITY CURVE: FAILURE RATE VS. AGE DS - II - DCMM - 14

  15. EXPANDED RELIABILITY FUNCTION • The failure rate has three components: λ(t) = λe(t) + λu + λw(t) (early-life, useful-life and wear-out). • FOR EARLY LIFE: λ(t) ≈ λe(t) + λu, BECAUSE λw(t) IS NEGLIGIBLE DURING EARLY LIFE. • FOR USEFUL LIFE: λ(t) ≈ λu, BECAUSE BOTH λe(t) AND λw(t) ARE NEGLIGIBLE FOR USEFUL LIFE. • FOR WEAR OUT: λ(t) ≈ λu + λw(t), WITH λe(t) NEGLIGIBLE. • THUS, THE GENERAL RELIABILITY FUNCTION BECOMES R(t) = Re(t) Ru(t) Rw(t), OR R(t) = exp{-∫₀ᵗ [λe(t) + λu + λw(t)] dt} DS - II - DCMM - 15

  16. SERIES SYSTEMS (modules R1(t), R2(t), R3(t) in series) - the failure of any one module causes the failure of the entire system • Rs(t) = R1(t) R2(t) R3(t) = e^(-(λ1+λ2+λ3)t) • In general for n serial modules: Rs(t) = R1(t) R2(t) ... Rn(t) = e^(-λt), WHERE λ = λ1 + λ2 + ... + λn IS THE SYSTEM FAILURE RATE AND λi IS THE INDIVIDUAL MODULE FAILURE RATE • For n identical modules: Rs(t) = [R(t)]^n = e^(-nλt) DS - II - DCMM - 16

  17. PARALLEL SYSTEMS - assume each module operates independently and the failure of one module does not affect the operation of the other • RP(t) = RA(t) + RB(t) - RA(t)RB(t) • EXAMPLE: TWO IDENTICAL SYSTEMS IN PARALLEL, EACH CAN ASSUME THE ENTIRE LOAD. RA(t) = RB(t) = 0.5, RP(t) = 0.5 + 0.5 - (0.5)(0.5) = 0.75 DS - II - DCMM - 17

  18. PARALLEL SYSTEMS (II) • In general for n parallel modules: RP(t) = 1 - (1 - R1(t))(1 - R2(t)) ... (1 - Rn(t)) • For n identical modules in parallel: RP(t) = 1 - (1 - R(t))^n • In our example: RP(t) = 1 - (1 - 0.5)² = 0.75 DS - II - DCMM - 18
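
A small Python sketch of the series and parallel composition rules from the last three slides; the 0.5/0.75 figures reproduce the two-module example above:

    from math import prod

    def series_reliability(reliabilities):
        """Series system: every module must work, so the reliabilities multiply."""
        return prod(reliabilities)

    def parallel_reliability(reliabilities):
        """Parallel system: it fails only if every module fails."""
        return 1.0 - prod(1.0 - r for r in reliabilities)

    # Two identical modules with R = 0.5, as in the slide example:
    print(parallel_reliability([0.5, 0.5]))   # 0.75
    print(series_reliability([0.5, 0.5]))     # 0.25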

  19. SOFTWARE EXAMPLE: JELINSKI - MORANDA MODEL • ASSUMPTION: THE HAZARD RATE FOR FAILURES IS A PIECEWISE CONSTANT FUNCTION AND IS PROPORTIONAL TO THE REMAINING NUMBER OF ERRORS: z(t) = C [N - (i - 1)] • C - PROPORTIONALITY CONSTANT • N - THE NUMBER OF FAULTS INITIALLY IN THE PROGRAM • z(t) IS TO BE APPLIED IN THE INTERVAL BETWEEN DETECTION OF ERROR (i - 1) AND DETECTION OF ERROR i • R(t) = exp{-C [N - (i - 1)] t}, MTTF = 1/(C [N - (i - 1)]) • STRONG CORRELATION WITH HARDWARE DESIGN RELIABILITY FUNCTION DS - II - DCMM - 19
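
A minimal Python sketch of the Jelinski-Moranda quantities defined above; the values of C and N below are illustrative assumptions, not data from the lecture:

    import math

    def jm_hazard(C, N, i):
        """Hazard rate between detection of error (i-1) and error i: z = C * (N - (i - 1))."""
        return C * (N - (i - 1))

    def jm_reliability(t, C, N, i):
        """R(t) = exp(-C * (N - (i - 1)) * t) within the i-th inter-failure interval."""
        return math.exp(-jm_hazard(C, N, i) * t)

    def jm_mttf(C, N, i):
        """MTTF = 1 / (C * (N - (i - 1))); it grows as faults are removed."""
        return 1.0 / jm_hazard(C, N, i)

    # Illustrative parameters: N = 50 initial faults, proportionality constant C = 0.002.
    print(jm_mttf(0.002, 50, 1))    # 10.0  (before the first fault is removed)
    print(jm_mttf(0.002, 50, 40))   # ~45.5 (after 39 faults have been removed)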

  20. SIMILARITIES AND DIFFERENCES • HARDWARE: design and production oriented; HW wears out; design errors are very expensive to correct • SOFTWARE: design oriented; SW does not wear out; design errors are cheaper to correct • (Figure: failure rate λ(t) versus time t for hardware and for software) DS - II - DCMM - 20

  21. WHAT IS MORE RELIABLE? DS - II - DCMM - 21

  22. AVAILABILITY • AVAILABILITY A(t) of a system is the probability that the system is operational (delivers satisfactory service) at a given time t. • STEADY-STATE AVAILABILITY As of a system is the fraction of lifetime that the system is operational. • As = UPTIME/TOTAL TIME = μ/(λ + μ) = MTTF/(MTTF + MTTR), where λ is the failure rate and μ the repair rate • MTTF (Mean Time to Failure) • MTTR (Mean Time to Repair) • MTBF (Mean Time Between Failures): MTBF = MTTF + MTTR (for exponential distribution) DS - II - DCMM - 22
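
The steady-state availability and MTBF relations can be checked with a few lines of Python; the MTTF/MTTR numbers below are illustrative:

    def steady_state_availability(mttf, mttr):
        """A_s = MTTF / (MTTF + MTTR) = mu / (lambda + mu)."""
        return mttf / (mttf + mttr)

    def mtbf(mttf, mttr):
        """For the exponential case, MTBF = MTTF + MTTR."""
        return mttf + mttr

    # Illustrative numbers: MTTF = 2000 h, MTTR = 2 h.
    print(steady_state_availability(2000, 2))  # ~0.999001
    print(mtbf(2000, 2))                       # 2002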

  23. MISSION TIME • MISSION TIME MT(r) gives the time at which system reliability falls below the prespecified level r: MT(r) = -ln(r)/λ • COVERAGE • a) Qualitative: List of classes of faults that are recoverable (testable, diagnosable) • b) Quantitative: The probability that the system successfully recovers given that a failure has occurred. • c) Quantitative: Percentage of testable/diagnosable/recoverable faults. • d) Quantitative: Sum of the coverages of all fault classes, weighted by the probability of occurrence of each fault class: C = p1C1 + p2C2 + .... + pnCn DS - II - DCMM - 23
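
A short Python sketch of mission time MT(r) and the weighted coverage sum; the failure rate, reliability level and fault-class figures are illustrative assumptions:

    import math

    def mission_time(r, lam):
        """MT(r) = -ln(r) / lambda: time until reliability drops below level r."""
        return -math.log(r) / lam

    def weighted_coverage(probabilities, coverages):
        """C = p1*C1 + ... + pn*Cn, with the fault-class probabilities p_i summing to 1."""
        return sum(p * c for p, c in zip(probabilities, coverages))

    # Illustrative values: lambda = 1e-4 per hour, required reliability level r = 0.95.
    print(mission_time(0.95, 1e-4))                               # ~513 hours
    print(weighted_coverage([0.7, 0.2, 0.1], [0.99, 0.9, 0.5]))   # 0.923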

  24. DIAGNOSABILITY • A system of n units is one-step t-fault diagnosable (t-diagnosable) if all faulty units within the system can be located without replacement, provided the number of faulty units does not exceed t. (Preparata-Metze-Chien, 12/67) 1) n ≥ 2t + 1, 2) at least t units must test each unit • FAULT TOLERANCE • Fault tolerance is the ability of a system to operate correctly in the presence of faults, or • a system S is called k-fault-tolerant with respect to a set of algorithms {A1, A2, ..., Ap} and a set of faults {F1, F2, ..., Fq} if for every k-fault F in S, Ai is executable by SF when 1 ≤ i ≤ p. (Hayes, 9/76) or • Fault tolerance is the use of redundancy (time or space) to achieve the desired level of system dependability. • SF is a subsystem of a system S with k faults. DS - II - DCMM - 24

  25. COMPARATIVE MEASURES • RELIABILITY DIFFERENCE: R2(t) - R1(t) • RELIABILITY GAIN: Rnew(t)/Rold(t) • MISSION TIME IMPROVEMENT: MTI = MTnew(r)/MTold(r) • SENSITIVITY: dR2(t)/dt vs dR1(t)/dt • OTHER MEASURES • MAINTAINABILITY (SERVICEABILITY) is the probability that a system will recover to an operable state within a specified time. • SURVIVABILITY is the probability that a system will deliver the required service in the presence of an a priori defined set of faults or any of its subsets. DS - II - DCMM - 25

  26. RESPONSIVENESS - AN OPTIMIZATION METRIC PROPOSAL • responsiveness = ri(t) = ai pi, where ri(t) reflects the responsiveness of a task at time t, ai denotes the i-th task availability, and pi represents the probability of timely completion of the i-th task DS - II - DCMM - 26

  27. QUESTION • IN MANY PRACTICAL SITUATIONS, ESPECIALLY IN REAL-TIME SYSTEMS, WE FREQUENTLY NEED TO ANSWER A QUESTION: WILL WE ACCEPT A LESS PRECISE RESULT IN A SHORTER TIME? • PROPOSED METRICS: • 1) WEIGHTED SUM: (a · PRECISION / b · TIME) · AVAILABILITY • 2) QUOTIENT-PRODUCT: [PRECISION / log(TIME)] · AVAILABILITY DS - II - DCMM - 27

  28. INTEGRITY (PERFORMANCE + DEPENDABILITY) • TOTAL BENEFIT DERIVED FROM A SYSTEM OVER A TIME t. • HOW TO MEASURE? • 1) TOTAL NUMBER OF USEFUL MACHINE CYCLES OVER THE TIME t. • 2) P - performance index, R - integrity level (probability that an expected service is delivered) • Example: In a multistage network the performance index could be P = N² (the number of paths in the network), L = number of levels in the network DS - II - DCMM - 28

  29. EXAMPLE • COMPUTE THE OVERALL SYSTEM's MTTF, MTTR, MTBF AND AVAILABILITY DS - II - DCMM - 29

  30. SERIES ELEMENT SUBSYSTEM (1) • Given MTTF and MTTR of each element • Total Failure Rate: λs = λ1 + λ2 + λ3 • Series MTTF: MTTFs = 1/λs = 1/(1/MTTF1 + 1/MTTF2 + 1/MTTF3) DS - II - DCMM - 30

  31. SERIES ELEMENT SUBSYSTEM (2) • Availability: A1 = 0.99938499, A2 = 0.99866844, A3 = 3800/(3800 + 0.8) = 0.99978952 • As = A1 A2 A3 = 0.99784418 • MTTR: MTTRs = 1.4976 Hours • MTBF: MTBFs = 693.2 + 1.5 = 694.7 Hours DS - II - DCMM - 31
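
A Python sketch that reproduces the series-subsystem numbers above; only the third element's MTTF/MTTR pair (3800 h / 0.8 h) is implied by the slide, so the other two elements enter through their availabilities, and the equivalent repair time is back-computed from the quoted series MTTF of 693.2 h:

    from math import prod

    # Element availabilities from the slide; A3 follows from MTTF = 3800 h, MTTR = 0.8 h.
    a1 = 0.99938499
    a2 = 0.99866844
    a3 = 3800 / (3800 + 0.8)          # ~0.99978952

    a_series = prod([a1, a2, a3])     # series subsystem: availabilities multiply
    print(a_series)                   # ~0.99784418

    # With the series MTTF of 693.2 h quoted on the slide, the equivalent repair time
    # follows from A_s = MTTF / (MTTF + MTTR):
    mttf_s = 693.2
    mttr_s = mttf_s * (1 - a_series) / a_series
    print(mttr_s, mttf_s + mttr_s)    # ~1.50 h MTTR and ~694.7 h MTBF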

  32. PARALLEL ELEMENT SUBSYSTEM (1) • All elements must fail to cause subsystem failure • MTTF and MTTR known for each element • Unavailability for the entire subsystem is the product of the element unavailabilities: Us = U1 U2 ... Un, where Ui = MTTRi/(MTTFi + MTTRi) • Availability is As = 1 - Us • MTTR • MTTF DS - II - DCMM - 32

  33. PARALLEL ELEMENT SUBSYSTEM (2) • Availability DS - II - DCMM - 33

  34. PARALLEL ELEMENT SUBSYSTEM (3) • MTTR MTTRs = 0.3947368 Hour • MTTF MTTFs = 6,456,914,809 Hours DS - II - DCMM - 34

  35. PARALLEL ELEMENT SUBSYSTEM (4) • Paralleling modules is a technique commonly used to significantly upgrade system reliability • Compare one universal power supply (UPS) with availability of 0.9997627 • to the parallel combination of power supplies with availability of 0.99999999994 • In practical systems, availability ranges from 0.99 to 0.999999999999 (2 to 12 9's) DS - II - DCMM - 35
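
A Python sketch of the parallel-subsystem rule from slide 32 (subsystem unavailability is the product of element unavailabilities, assuming independent failures and repairs); the two power-supply MTTF/MTTR values are illustrative, not the ones behind the 0.99999999994 figure:

    from math import prod

    def element_unavailability(mttf, mttr):
        """U_i = MTTR / (MTTF + MTTR) for one repairable element."""
        return mttr / (mttf + mttr)

    def parallel_availability(elements):
        """All elements must be down for the subsystem to be down: U_s = product of U_i."""
        u_s = prod(element_unavailability(mttf, mttr) for mttf, mttr in elements)
        return 1.0 - u_s

    # Illustrative example: two power supplies, each with MTTF = 4000 h and MTTR = 1 h.
    print(parallel_availability([(4000, 1), (4000, 1)]))   # ~0.99999994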

  36. K of N PARALLEL ELEMENT SUBSYSTEM (1) • Have N identical modules in parallel (assume all have the same MTTF and MTTR)  • Only K elements are required for full operation  • K = 1 is the same as parallel  • K = N is the same as series  • Reliability • Rs = Prob (system works for time T) = Prob (N modules work or N - 1 modules work or . . . or K modules work) • Note: The above conditions are mutually exclusive  • Rs = Prob (N modules work) + Prob (N - 1 modules work) + . . . + Prob (K modules work) DS - II - DCMM - 36

  37. K of N PARALLEL ELEMENT SUBSYSTEM (2) • RELIABILITY: Rs = Σ (i = K to N) C(N, i) Rm^i (1 - Rm)^(N-i), where Rm is the individual module's reliability. • Therefore, for 3 of 4 and Rm = 0.9: Rs = 4(0.9)³(0.1) + (0.9)⁴ = 0.9477 DS - II - DCMM - 37
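
The K-of-N reliability sum is easy to evaluate; this Python sketch reproduces the 3-of-4 case with Rm = 0.9 and the parallel/series limiting cases:

    from math import comb

    def k_of_n_reliability(k, n, r_m):
        """R_s = sum over i = k..n of C(n, i) * r_m**i * (1 - r_m)**(n - i)."""
        return sum(comb(n, i) * r_m**i * (1 - r_m)**(n - i) for i in range(k, n + 1))

    # The slide's 3-of-4 case with module reliability R_m = 0.9:
    print(k_of_n_reliability(3, 4, 0.9))   # 0.9477
    # Sanity checks: k = 1 is plain parallel, k = n is series.
    print(k_of_n_reliability(1, 2, 0.5))   # 0.75
    print(k_of_n_reliability(2, 2, 0.5))   # 0.25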

  38. K of N PARALLEL ELEMENT SUBSYSTEM (3) • MTTR • MTTF • AVAILABILITY • Example (3 out of 4 subsystem) • N = 4 K = 3  • MTTF = 1800 Hours  • MTTR = 4.50 Hours DS - II - DCMM - 38

  39. K of N PARALLEL ELEMENT SUBSYSTEM (4) • MTTR • MTTF MTTFs = 60,000 Hours DS - II - DCMM - 39

  40. K of N PARALLEL ELEMENT SUBSYSTEM (5) • AVAILABILITY • MODULE AVAILABILITY DS - II - DCMM - 40

  41. OVERALL SYSTEM (1) • Three series elements: 1. Main Computer 2. Power 3. Disks • MTTF: MTTFs = 1/(1/MTTF1 + 1/MTTF2 + 1/MTTF3) = 685.25 Hours DS - II - DCMM - 41

  42. OVERALL SYSTEM (2) • AVAILABILITY A = (0.997844) (0.99999999994) (0.9999635) A = 0.9978033 • MTTR MTTR = 1.51 Hours DS - II - DCMM - 42

  43. OVERALL SYSTEM (3) • Reliability Function: R(t) = e^(-t/MTTF) • R(2080) = e^(-2080/685.25) = 0.0480567 • MTBF = MTTF + MTTR = 685.25 + 1.51 = 686.76 Hours DS - II - DCMM - 43
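
A few lines of Python tie the overall-system figures from the last three slides together (the availability product is re-evaluated from the rounded factors, so the final digits differ slightly from the slide's 0.9978033):

    import math

    # Overall-system figures from the preceding slides.
    mttf = 685.25                                     # hours (computer, power and disks in series)
    mttr = 1.51                                       # hours
    availability = 0.997844 * 0.99999999994 * 0.9999635

    print(availability)                               # ~0.99781 with these rounded inputs
    print(mttf + mttr)                                # MTBF = 686.76 hours
    print(math.exp(-2080 / mttf))                     # R(2080) ~ 0.0481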

  44. IMPORTANT ATTRIBUTES OF COMPUTER SYSTEMS • DEPENDABILITY (RELIABILITY/AVAILABILITY): Whether or not it works, and for how long? • PERFORMANCE (Throughput, response time, etc.): Given that it works, how well does it work? • PERFORMABILITY: For degradable systems, performance evaluation of systems subject to failure/repair • RESPONSIVENESS: Does it meet deadlines in the presence of faults? • Associated tools/models: HARP, CARE III, SURE, ADVISER, SPADE, ARIES, SURF, SAVE, DEEP, RESQ, KNT MODEL, MEYER DS - II - DCMM - 44

  45. DEPENDABILITY EVALUATION • ANALYTICAL • SIMULATION AND FAULT INJECTION (SAVE, HARP, AvSim+) • MEASUREMENT, TESTING AND FAULT INJECTION (EXPERIMENTAL) • ANALYTICAL approaches: RELIABILITY BLOCK DIAGRAMS (SHARPE, SUPER), FAULT TREES (SHARPE, FaultTree+), COMBINATORIAL (ADVISER, CARE, CARE II, SPADE, SURE), MARKOV (ARIES, ARM, CARE III, GRAMP, GRAMS, HARP, MARK1, SHARPE, SURE, SURF, SURF II), ESPN (METASAN, SAN) DS - II - DCMM - 45

  46. RELIABILITY MODELS • ANALYTICAL • Independent failures • Constant failure rates • Mainly non-repairable or assuming successful fault recovery • Block diagrams • SIMULATION • Markov chains model • PETRI NETS MODEL • places, tokens and transitions DS - II - DCMM - 46

  47. EXTENDED STOCHASTIC PETRI NET MODEL - AN EXAMPLE OF A TMR SYSTEM DS - II - DCMM - 47

  48. FAULT TREES • Fault tree analysis is an application of deductive logic to produce a fault-oriented pictorial diagram which allows one to analyze system safety and reliability. • Fault trees may serve as a design aid for identifying the general fault classes. • Fault trees were traditionally used in the evaluation of hardware reliability, but they may help in designing fault-tolerant software and in developing a top-down view of the system. • A complex event such as a system failure is successively broken down into simpler events such as subsystem failures, individual component and block failures, down to single element failures. These simple events are linked together by "and" or "or" Boolean functions. • The probability of higher-level events can be calculated by combining the probabilities of the lower-level events. DS - II - DCMM - 48

  49. MARKOV MODELSARIES, CARE III, HARP, SAVE, SURE, SURF • Different Fault Types  • Transient, Intermittent, Permanent • Common-Mode • Near-Coincident • Details of Fault-Handling (Coverage) Behavior • Dynamic and Static Redundancy • Hierarchy is difficult to handle • State Explosion  DS - II - DCMM - 49
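
As an illustration of the state-based approach these tools automate, the following Python sketch solves a two-state Markov availability model (a single repairable unit with failure rate λ and repair rate μ); it is a minimal example, not one of the lecture's models:

    import numpy as np

    def two_state_availability(lam, mu):
        """Steady-state availability of a single repairable unit modelled as a
        two-state Markov chain: state 0 = up, state 1 = down."""
        # Generator matrix Q: rows sum to zero.
        Q = np.array([[-lam,  lam],
                      [  mu,  -mu]])
        # Solve pi Q = 0 with pi summing to 1 by replacing one balance equation
        # with the normalisation condition.
        A = np.vstack([Q.T[:-1], np.ones(2)])
        b = np.array([0.0, 1.0])
        pi = np.linalg.solve(A, b)
        return pi[0]                  # probability of the "up" state = mu / (lam + mu)

    # Example: lambda = 1/2000 per hour, mu = 1/2 per hour -> A ~ 0.999001.
    print(two_state_availability(1/2000, 1/2))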

  50. FAULT TREE EXAMPLE • (Figure: fault tree for a TMR system - the top "System Failure" OR gate combines the voter failure with AND gates over the processors P1, P2, P3, so the system fails if the voter fails or at least two processors fail) DS - II - DCMM - 50
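
Assuming the standard TMR fault tree structure sketched above (top OR gate over the voter failure and the joint failures of processor pairs), the top-event probability can be computed as in this Python sketch; the failure probabilities are illustrative:

    def tmr_fault_tree_failure(q_voter, q_p):
        """Top-event probability for a TMR fault tree: the system fails if the voter
        fails OR at least two of the three (independent, identical) processors fail."""
        q_two_of_three = 3 * q_p**2 * (1 - q_p) + q_p**3   # exactly 2 fail, or all 3 fail
        return 1 - (1 - q_voter) * (1 - q_two_of_three)    # OR gate over independent events

    # Illustrative failure probabilities (not from the lecture):
    print(tmr_fault_tree_failure(q_voter=0.001, q_p=0.05))   # ~0.0082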
