
From Anonymity to Ubiquity: A Study of Our Increasing Reliance on Fault-Tolerant Computing

From Anonymity to Ubiquity: A Study of Our Increasing Reliance on Fault-Tolerant Computing. Elwin Ong MIT SERL NASA Goddard OLD December 9, 2003. Abstract.




  1. From Anonymity to Ubiquity: A Study of Our Increasing Reliance on Fault-Tolerant Computing Elwin Ong MIT SERL NASA Goddard OLD December 9, 2003 1

  2. Abstract This presentation introduces the role of fault tolerance in major computing systems. A literature review outlines some fundamental elements of the field, followed by a comparison and discussion of how fault tolerance is applied in three safety-critical domains: spacecraft, aircraft, and automotive systems. Aerospace systems to be discussed include the Space Shuttle, Hubble Space Telescope, Galileo, Landsat 7, ST-5, New Horizons, and C-17. There will also be a short overview of the time-triggered protocols TTP/C and FlexRay intended for automotive drive-by-wire systems. 2

  3. Background • How I came to be at Goddard and OLD • Educational Background • UCLA Aerospace Engineering • Boeing Satellite Systems • MIT Aero/Astro • Systems Engineering Research Lab • Nancy Leveson • Safety-Critical Systems • Fault Tolerant Systems 3

  4. Purpose of Study • What I hope to gain for myself • In-depth review of fault tolerance • Catch up on the state of the art • Investigate applications of fault tolerance • Become more familiar with the spacecraft design process 4

  5. Purpose of Study • What I hope you will gain • A review of fault tolerance • An overview of fault tolerance in various safety-critical industries • Opportunities to learn and improve upon existing techniques 5

  6. Purpose of Study • What I hope to gain from you • An active discussion of fault tolerance as it is currently practiced in your projects • What are good practices? What works? What doesn’t? • Suggestions for advancements in the field 6

  7. Presentation Outline • Literature Review • Spacecraft Fault Tolerance • Aircraft Fault Tolerance • Automotive Fault Tolerance • Discussion & Conclusion 7

  8. Literature Review Outline • What is Fault Tolerance? • Define scope of study • Fault Tolerance Techniques • Fault Intolerance • Fault Detection and Reconfiguration • Fault Masking and Reconfiguration • What about Software? 8

  9. What is a Fault? • There are various definitions • Must first identify scope: • Computationally intensive systems • Real Time and Safety-Critical (and Distributed) • Spacecraft • Modern Aircraft Systems • Automotive x-by-Wire, drive train controllers • Nuclear and Chemical Processing, Maritime systems, IT Networks, etc. 9

  10. Definition of a Fault Fault: An incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design. Error: The manifestation of a fault. Failure: A result of a delivered service deviating from the specified service caused by an error or fault. 10

  11. Fault Classifications There are various classification methods: Based on Lala & Harper, IEEE 1994 11

  12. Fault Classifications 12

  13. Fault Distribution Models • Permanent Fault Distribution Models • Exponential Distribution • Weibull Distribution • Geometric Distribution • Must match sampled data to distribution models • MIL-HDBK-217 Model • Various Intermittent and Transient Fault Models 13
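As a rough illustration of the first two permanent-fault models listed on this slide, the sketch below evaluates the standard reliability functions R(t) for the exponential and Weibull distributions. The function names and the numeric values are assumptions chosen for illustration; they are not taken from the presentation or from MIL-HDBK-217.

```python
import math


def exponential_reliability(t, failure_rate):
    """R(t) = exp(-lambda * t): survival probability at a constant failure rate."""
    return math.exp(-failure_rate * t)


def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t / scale) ** shape).

    shape < 1 models a decreasing failure rate (infant mortality),
    shape > 1 models wear-out, and shape == 1 reduces to the exponential model.
    """
    return math.exp(-((t / scale) ** shape))


def exponential_mttf(failure_rate):
    """Mean time to failure for the exponential model is 1 / lambda."""
    return 1.0 / failure_rate


# Hypothetical component: 1e-4 failures per hour, evaluated at 1000 hours.
r_exp = exponential_reliability(1000.0, 1e-4)
r_wbl = weibull_reliability(1000.0, scale=10000.0, shape=1.0)
```

With shape set to 1 and scale set to 1/lambda, the two models agree, which is one quick sanity check when matching sampled data to a distribution.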

  14. How to Defeat Faults • Fault Intolerance/Prevention Methods • Fault Tolerant Methods • Redundancy • Fault Detection and Reconfiguration • Fault Masking • Software Fault Tolerance 14

  15. Fault Tolerance Taxonomy 15

  16. Fault Intolerant Techniques • Increase Signal to Noise Ratio • Lower Power Dissipation • Burn in Testing • Factors that most affect failure rates • Environment • Quality • Complexity • See MIL-HDBK-217E, NASA Standards 16

  17. Fault Tolerant Systems • Redundancy • Fault Detection & Reconfiguration • Duplication, Error Detecting Codes, Self-tests, Self-Checking Pairs, etc. • Fault Masking & Reconfiguration • Error Correcting Codes, TMR, NMR • Issues related to Fault Tolerant Systems 17

  18. Redundancy • All Fault Tolerant Systems employ redundancy • Forms of Redundancy • Temporal (Retry, Restart) • Physical (Duplication) • Functional (Analytical Modeling) • “The only thing (redundancy) guarantees is a higher fault arrival rate compared to a non-redundant system…” [Lala & Harper, IEEE 1994] 18
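The quoted remark about fault arrival rate can be made concrete with a small sketch. It assumes independent units with identical, constant failure rates and, for the TMR formula, a perfect voter; these assumptions and the function names are illustrative and do not come from the slides.

```python
def first_fault_rate(unit_failure_rate, units):
    """Rate at which the first fault arrives among `units` independent units."""
    return unit_failure_rate * units


def mean_time_to_first_fault(unit_failure_rate, units):
    """Expected time until any one of the units faults (exponential model)."""
    return 1.0 / first_fault_rate(unit_failure_rate, units)


def tmr_reliability(unit_reliability):
    """Reliability of 2-out-of-3 voting with a perfect voter: 3R^2 - 2R^3."""
    r = unit_reliability
    return 3 * r ** 2 - 2 * r ** 3
```

Under these assumptions a triplex system sees faults three times as often as a simplex one, and it only improves on the simplex reliability when each unit's reliability is above 0.5.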

  19. Fault Detection & Reconfig. • Based on simplex systems with active or passive backups. • Requires accurate fault detection • Employs all 3 types of redundancy • Common on unmanned spacecraft 19

  20. Duplication • Simplest technique • Compare two identical copies • Fault identified when copies diverge • Does not identify which copy has failed • Use in conjunction with other techniques 20

  21. Error Detecting Codes • Employ physical redundancy • Use extra bits in transmission • Hamming Distance: • The number of bit positions in which two code words differ. • Minimum distance, d, of a code is defined as the minimum Hamming distance found between any 2 code words. • A code with minimum distance d detects up to t errors, where t < d (that is, t ≤ d − 1) 21
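The definitions above can be sketched in a few lines. The example code (2 data bits plus an even-parity bit) is an assumption picked for illustration, not one shown in the presentation.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length code words differ."""
    if len(a) != len(b):
        raise ValueError("code words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code_words):
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))


# Hypothetical code: every 2-bit value followed by an even-parity bit.
code = ["000", "011", "101", "110"]
d = minimum_distance(code)
detectable_errors = d - 1  # up to d - 1 bit errors are guaranteed to be detected
```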

  22. Hamming Distance 22

  23. Parity Checks • Use 1 extra bit at the end of a word • Simplest and least expensive • Detects all single bit errors and all errors that involve an odd number of bits • Odd parity or even parity check • All 0’s failure • All 1’s failure • Ex. MIL-STD-1553 23
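A minimal sketch of the parity scheme described above, with bits held in a Python list. The helper names are hypothetical. The word width is illustrative only and is not the MIL-STD-1553 word format.

```python
def parity_bit(bits, odd=False):
    """Extra bit that makes the total number of 1s even (or odd if odd=True)."""
    bit = sum(bits) % 2
    return bit ^ 1 if odd else bit


def append_parity(bits, odd=False):
    """Return the word with its parity bit appended at the end."""
    return list(bits) + [parity_bit(bits, odd)]


def check_parity(word, odd=False):
    """True when the received word (data plus parity bit) has the expected parity."""
    return sum(word) % 2 == (1 if odd else 0)


word = append_parity([1, 0, 1, 1])      # even parity
single_error = word.copy()
single_error[2] ^= 1                    # one flipped bit: detected
double_error = single_error.copy()
double_error[0] ^= 1                    # second flipped bit: parity looks valid again
```

The choice between odd and even parity decides which stuck-line failure is caught: with odd parity an all-0s word always fails the check, while whether an all-1s word fails depends on the word length.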

  24. Checksums • Form a checksum for a block of s words by adding together all of the words in the block modulo n, where n is arbitrary. • Slow to detect faults, so not well suited to online processing. • Low diagnostic resolution: the fault can be in the block of s words, the stored checksum, or the checking circuitry. • Ex. Hard Drives 24
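The modulo-n sum described above might look like the following sketch. The modulus of 256 and the sample block are assumptions for illustration.

```python
def checksum(block, modulus=256):
    """Sum of all words in the block, reduced modulo `modulus`."""
    return sum(block) % modulus


def verify(block, stored_checksum, modulus=256):
    """True when the recomputed checksum matches the stored one."""
    return checksum(block, modulus) == stored_checksum


block = [0x12, 0x34, 0x56, 0x78]
stored = checksum(block)

corrupted = block.copy()
corrupted[1] ^= 0x01                    # single-bit fault in one word: detected

compensating = block.copy()
compensating[0] += 1                    # two faults that cancel in the sum
compensating[3] -= 1                    # are not detected
```

A mismatch only says that something in the block, the stored checksum, or the checker is wrong, which is the low diagnostic resolution the slide mentions.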

  25. Checksum Example 25

  26. Cyclic Codes • Cyclic Redundancy Check (CRC) • Easy to Implement with XOR gates • Detects all single errors, all burst errors of length b ≤ (n − k) • Ex. CDs, TTP/C, FlexRay Protocols 26
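The XOR-based division behind a CRC can be sketched bit by bit. The generator x^3 + x + 1 and the data word below are assumptions chosen to keep the example small; they are not the polynomials used by CDs, TTP/C, or FlexRay.

```python
def _mod2_divide(bits, generator):
    """Remainder of polynomial division over GF(2), done with XOR only."""
    work = list(bits)
    r = len(generator) - 1              # number of check bits, n - k
    for i in range(len(work) - r):
        if work[i]:
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[-r:]


def crc_remainder(data, generator):
    """Check bits for `data`: remainder after appending n - k zero bits."""
    r = len(generator) - 1
    return _mod2_divide(list(data) + [0] * r, generator)


def crc_encode(data, generator):
    """Code word made of the data bits followed by the check bits."""
    return list(data) + crc_remainder(data, generator)


def crc_check(code_word, generator):
    """True when the received code word divides evenly by the generator."""
    return not any(_mod2_divide(code_word, generator))


GENERATOR = [1, 0, 1, 1]                # x^3 + x + 1, so n - k = 3 check bits
DATA = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
CODE_WORD = crc_encode(DATA, GENERATOR)
```

With three check bits, any burst of three or fewer consecutive corrupted bits leaves a nonzero remainder, which matches the b ≤ (n − k) bound.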

  27. Control Flow Monitoring • Used to detect Sequential Errors 27

  28. Self-Tests • Built-in-Tests • Exercise part or all of circuit and logic and compare to oracle • Extensive use in aerospace systems • Consistency & Sanity Checks • Capability Checks • Watchdog Timers • Implemented in Hardware or Software 28
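Two of the self-test mechanisms listed above, a watchdog timer and a sanity (range) check, can be sketched in software as follows. The class name, the explicit `now` argument, and the limits are assumptions for illustration; a flight implementation would normally rely on a hardware timer.

```python
class WatchdogTimer:
    """Software watchdog: the monitored task must call kick() before the timeout."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        """Record that the monitored task is still alive at time `now`."""
        self.last_kick = now

    def expired(self, now):
        """True when no kick has arrived within the timeout window."""
        return now - self.last_kick > self.timeout


def sanity_check(value, low, high):
    """Range check: True when a reading lies within its physically plausible limits."""
    return low <= value <= high


wd = WatchdogTimer(timeout=1.0)
wd.kick(now=0.9)
```

Passing the time in explicitly keeps the sketch deterministic; a real system would read a clock and trigger a reset or switchover when the watchdog expires.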

  29. Self-Checking Pairs • Combination of Duplication and Self Tests 29

  30. Self-Checking Variations 30

  31. Model-Based Diagnosis • Employs Analytic Redundancy • Compare actual components with an analytic model (mathematical model) • Depends on the validity of the model, and the ability to accurately model a system • Relatively straightforward for linear systems, difficult for nonlinear systems (most software-based systems) 31

  32. Analytical Redundancy 32

  33. Model-Based Diagnosis • Residual Generation & Decision-Making 33
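Residual generation and decision-making can be sketched as two steps: subtract the model's prediction from the measurement, then apply a decision rule to the residual. The threshold, the persistence count, and the sample data are assumptions for illustration, not values from the presentation.

```python
def generate_residuals(measured, predicted):
    """Residual = measured output minus the analytic model's predicted output."""
    return [m - p for m, p in zip(measured, predicted)]


def detect_fault(residuals, threshold, persistence=3):
    """Index of the sample at which a fault is declared, or None.

    A fault is declared once |residual| has exceeded `threshold` for
    `persistence` consecutive samples, which filters out isolated noise spikes.
    """
    run = 0
    for k, r in enumerate(residuals):
        run = run + 1 if abs(r) > threshold else 0
        if run >= persistence:
            return k
    return None


predicted = [1.0] * 7
measured = [1.0, 1.1, 0.9, 2.0, 2.1, 2.2, 2.0]   # hypothetical sensor bias from sample 3 on
residuals = generate_residuals(measured, predicted)
declared_at = detect_fault(residuals, threshold=0.5)
```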

  34. Parameter Estimation • Based on assumption that faults are reflected in the physical system parameters such as friction, mass, viscosity, resistance, capacitance, etc. • Compare online estimations and measurements with parameters of model to identify faults. 34
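As a small instance of the idea above, the sketch below estimates a resistance from current and voltage samples by least squares and compares it with the model's nominal value. The V = R·I model, the 10 % tolerance, and the data are assumptions for illustration.

```python
def estimate_resistance(currents, voltages):
    """Least-squares estimate of R in V = R * I (line through the origin)."""
    numerator = sum(i * v for i, v in zip(currents, voltages))
    denominator = sum(i * i for i in currents)
    return numerator / denominator


def parameter_fault(estimate, nominal, tolerance=0.1):
    """True when the estimate deviates from nominal by more than the fractional tolerance."""
    return abs(estimate - nominal) > tolerance * nominal


currents = [1.0, 2.0, 3.0]
healthy = estimate_resistance(currents, [10.0, 20.0, 30.0])
degraded = estimate_resistance(currents, [15.0, 30.0, 45.0])
```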

  35. Livingstone Engine • Developed at NASA Ames • Livingstone accepts a model of the components of a complex system such as a spacecraft or chemical plant and infers from them the overall behavior of the system. 35

  36. Fault Masking Techniques • Mask faults by “out-voting” failed components • Error Correcting Codes • Triple Modular Redundancy (TMR) • NMR • Extensive applications in aircraft and manned spacecraft 36

  37. Error Correcting Codes • Hamming SEC/DED Codes • Extensive usage in memories • High performance-to-cost ratio • Reed-Solomon • There are other more advanced ECCs employed including convolutional codes (communication, coding theory) 37

  38. Hamming SEC Code 38
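For readers without the slide graphic, a single-error-correcting Hamming code can be sketched with the common (7,4) layout: parity bits at positions 1, 2, and 4, data bits at positions 3, 5, 6, and 7. This layout is an assumption, since the transcript does not show which code the slide used. Adding one overall parity bit would extend it to SEC/DED.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit code word [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4                   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (data bits, syndrome) after correcting at most one flipped bit.

    A nonzero syndrome is the 1-based position of the bit that was flipped.
    """
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome
```

A double-bit error also produces a nonzero syndrome, but the "correction" then flips a third bit, which is why memories typically use the SEC/DED variant.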

  39. TMR & NMR • Very simple concept, includes many different variations 39
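The voting concept can be sketched in two forms: a bitwise two-out-of-three voter for TMR and a general majority voter for NMR. Both are illustrative and assume exact-match (synchronous) outputs; the function names are hypothetical.

```python
from collections import Counter


def tmr_vote_bits(a, b, c):
    """Bitwise 2-out-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


def nmr_vote(outputs):
    """Value produced by a strict majority of the N modules, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None
```

A single faulty module is masked in both cases; when no strict majority exists, the NMR voter reports that explicitly rather than guessing.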

  40. TMR & NMR Variations 40

  41. Redundancy Issues • Large Overhead? • More difficult to validate • Asynchronous vs. Synchronous • Near Coincidence Errors • Generic Faults 41

  42. Asynchronous Issues • Voted value is mean, median, or some other heuristic-based value. • Must set thresholds so that failures are caught, but also limit false alarms • Can be very difficult to guarantee robustness • Requires extensive analyses and testing • Ex. F-16B FBW 42
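For asynchronous channels whose outputs never match exactly, a mid-value (median) select with a deviation threshold is one commonly used heuristic. The sketch below assumes that approach; the threshold and channel values are hypothetical and are not taken from the F-16 system cited on the slide.

```python
import statistics


def mid_value_select(channels):
    """Voted output: the median of the channel values."""
    return statistics.median(channels)


def flag_outliers(channels, threshold):
    """Per-channel flags: True when a channel deviates from the voted value by more than threshold."""
    voted = mid_value_select(channels)
    return [abs(value - voted) > threshold for value in channels]


channels = [10.0, 10.2, 14.0]
voted = mid_value_select(channels)
flags = flag_outliers(channels, threshold=1.0)
```

The threshold is the trade-off named on the slide: too tight and normal channel skew raises false alarms, too loose and a slowly drifting channel goes unflagged.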

  43. Synchronous Issues • Inputs must be the same for each channel • Each channel must be synchronized • Fault detection is simple, unless… • Interactive Consistency • Near Coincidence • Generic Faults • Most systems are what are termed “loosely synchronous” 43

  44. Byzantine Generals • Affects inputs to synchronous system as well as cross-channel voting • Stop and restart errors • Babbling Idiot Problem • Failed component sends different outputs to voting elements, confuses good components. • Intentional or intelligent malicious attacks • See Lamport et al. ACM 1982 44

  45. Interactive Consistency 45

  46. Byzantine Resiliency • Fault Containment Region (FCR) • An FCR is a collection of components that operate correctly regardless of any arbitrary logical fault outside the region. • Each FCR requires at least an independent power supply and clock signal. • May also need to be physically separated 46

  47. Byzantine Resiliency • To tolerate f Byzantine faults requires: • 3f+1 FCRs • FCRs must be interconnected through 2f+1 disjoint paths • Inputs must be exchanged f+1 times between FCRs • FCRs must be synchronized to bounded skew • Simple TMR majority voter circuit is not Byzantine Resilient 47
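The sizing rules above reduce to simple arithmetic, sketched here with hypothetical function names. The bounded-skew synchronization requirement is not modeled.

```python
def byzantine_requirements(f):
    """Minimum resources needed to tolerate f Byzantine faults, per the rules above."""
    return {
        "fcrs": 3 * f + 1,
        "disjoint_paths": 2 * f + 1,
        "exchange_rounds": f + 1,
    }


def max_byzantine_faults(fcrs):
    """Largest f such that 3f + 1 <= number of fault containment regions."""
    return max((fcrs - 1) // 3, 0)
```

With three FCRs the tolerable count is zero, which is the arithmetic behind the statement that a simple TMR majority voter is not Byzantine resilient; four FCRs are needed for f = 1.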

  48. Near Coincidence • Possibility that a second fault will occur before the system can recover from the first fault. • Must be accounted for in the design of redundancy management, e.g., 777 FBW 48

  49. Generic Faults • Externally Induced • Physical damage • Lightning strike • Power transients • Internally Induced • Hardware & Firmware defects, COTS O/S • Latent failures • Clock anomalies • Bad Design? 49

  50. What about Software? • Software faults are much more difficult to characterize • Software is • an abstract mathematical object or • a concept of “how to make a group of hardware (system) work together in order to perform a specified function” • includes Hardware design as well • Software fault = Design fault 50
