1 / 16

Fault Tolerance Some background

Fault Tolerance Some background. Claudio Pinello ( pinello@eecs.berkeley.edu ). Some Terminology. A fault is the cause of an error; an error is the part of the system state which may cause a failure; a failure is the deviation of the system from the specification.

joycedavis
Download Presentation

Fault Tolerance Some background

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fault ToleranceSome background Claudio Pinello (pinello@eecs.berkeley.edu) DRAFTS

  2. Some Terminology • A fault is the cause of an error; • an error is the part of the system state which may cause a failure; • a failure is the deviation of the system from the specification Adapted from: J.C. Laprie, “Dependability : basic concepts and terminology in English, French, German, Italian, and Japanese”, Springer-Verlag 1992, Series title: Dependable computing and fault-tolerant systems. DRAFTS

  3. Example • Office Desk • lamp bulb fails (fault) • light level drops (error) • I can’t get work done (failure) • unless… DRAFTS

  4. One Good Idea: Redundancy DRAFTS

  5. One Bad Idea: Redundancy DRAFTS

  6. Structure • System-level fault tolerance • avoid single point of failure • avoid common-mode failure (e.g. same bug in replicated software, all power supplies fail above 50oC, etc.) • fault isolation • cross fingers! DRAFTS

  7. Fault Model • Silent Faults • faults result in omission errors • Crash Faults (fail-stop) • faults result in crashes: no more data, ever! • Non-silent Faults • faults result in value errors • Byzantine Faults • malicious attacks, non-silent faults, bounded delays, etc… DRAFTS

  8. Fault Detection • Typically check for errors • Silent Faults: no errors? • “omission” errors! Easy for synchronous systems, otherwise use timeouts. • Question: You are sick in bed. How do you know if your door bell is broken? DRAFTS

  9. Fault Detection • Typically check for errors • Non-silent faults: how do you know if result is wrong? • e.g. your calculator computes sin(), how do you know if it is faulty? • BTW: what time is it? DRAFTS

  10. Fault Detection • Non-silent faults: try voting • you can tolerate up to n/2 -1 faults DRAFTS

  11. Fault Detection • Typically check for errors • Byzantine faults: oh my! • you can’t trust people on chatlines… • can you ask them the time? • the account number of the red cross for a donation? • would you ask them what medicine to take? DRAFTS

  12. Byzantine Generals • question: “attack or retreat?” • message passing (oral/written) • there are traitors • goal: determine consensus among non-traitors DRAFTS

  13. Byzantine Generals • Basic algorithm (by Lamport et al.) • n rounds of oral message passing • use majority voting, decide • Tolerates up to < 1/3 traitors • If you can use signed messages, reduced number or rounds All methods require bounded asynchrony, i.e. bounded delays DRAFTS

  14. What model to use? • Depends on your application • internet transactions? • probably Byzantine • embedded systems? • usually non-silent faults are sufficient, but… • more networked applications…. • channel transmission? • using CRC one “approximates” fail silence • HW faults or SW faults? DRAFTS

  15. Recovery • You detected a fault, now what? • Isolate fault to avoid further errors • Recover from fault • backtrack to known good checkpoint • start another agent to compute result • use another already available result • reduce functionality (e.g. slow down) • bring system to safe state (e.g. turn off engine) DRAFTS

  16. Conclusions • Faults do occur, do you care? • Model them • Use redundancy right! • System-level fault tolerance • Techniques exist, some are complex to get right DRAFTS

More Related