160 likes | 191 Views
Learn about fault tolerance, errors, failures, and fault isolation in systems. Explore fault models, detection methods, and recovery strategies to ensure system resilience and reliability. Understand redundancy and system-level fault tolerance to mitigate consequences of faults.
E N D
Fault ToleranceSome background Claudio Pinello (pinello@eecs.berkeley.edu) DRAFTS
Some Terminology • A fault is the cause of an error; • an error is the part of the system state which may cause a failure; • a failure is the deviation of the system from the specification Adapted from: J.C. Laprie, “Dependability : basic concepts and terminology in English, French, German, Italian, and Japanese”, Springer-Verlag 1992, Series title: Dependable computing and fault-tolerant systems. DRAFTS
Example • Office Desk • lamp bulb fails (fault) • light level drops (error) • I can’t get work done (failure) • unless… DRAFTS
One Good Idea: Redundancy DRAFTS
One Bad Idea: Redundancy DRAFTS
Structure • System-level fault tolerance • avoid single point of failure • avoid common-mode failure (e.g. same bug in replicated software, all power supplies fail above 50oC, etc.) • fault isolation • cross fingers! DRAFTS
Fault Model • Silent Faults • faults result in omission errors • Crash Faults (fail-stop) • faults result in crashes: no more data, ever! • Non-silent Faults • faults result in value errors • Byzantine Faults • malicious attacks, non-silent faults, bounded delays, etc… DRAFTS
Fault Detection • Typically check for errors • Silent Faults: no errors? • “omission” errors! Easy for synchronous systems, otherwise use timeouts. • Question: You are sick in bed. How do you know if your door bell is broken? DRAFTS
Fault Detection • Typically check for errors • Non-silent faults: how do you know if result is wrong? • e.g. your calculator computes sin(), how do you know if it is faulty? • BTW: what time is it? DRAFTS
Fault Detection • Non-silent faults: try voting • you can tolerate up to n/2 -1 faults DRAFTS
Fault Detection • Typically check for errors • Byzantine faults: oh my! • you can’t trust people on chatlines… • can you ask them the time? • the account number of the red cross for a donation? • would you ask them what medicine to take? DRAFTS
Byzantine Generals • question: “attack or retreat?” • message passing (oral/written) • there are traitors • goal: determine consensus among non-traitors DRAFTS
Byzantine Generals • Basic algorithm (by Lamport et al.) • n rounds of oral message passing • use majority voting, decide • Tolerates up to < 1/3 traitors • If you can use signed messages, reduced number or rounds All methods require bounded asynchrony, i.e. bounded delays DRAFTS
What model to use? • Depends on your application • internet transactions? • probably Byzantine • embedded systems? • usually non-silent faults are sufficient, but… • more networked applications…. • channel transmission? • using CRC one “approximates” fail silence • HW faults or SW faults? DRAFTS
Recovery • You detected a fault, now what? • Isolate fault to avoid further errors • Recover from fault • backtrack to known good checkpoint • start another agent to compute result • use another already available result • reduce functionality (e.g. slow down) • bring system to safe state (e.g. turn off engine) DRAFTS
Conclusions • Faults do occur, do you care? • Model them • Use redundancy right! • System-level fault tolerance • Techniques exist, some are complex to get right DRAFTS