Fault Tolerance in Embedded Systems dshap092@uottawa.ca http://site.uottawa.ca/~dshap092 Daniel Shapiro
Fault Tolerance • This presentation is based upon [1] • Focus is on the basics as applied to embedded systems with processors • This presentation does not rely on Wikipedia. • See Byzantine fault tolerance on wiki
Overview • Trends & Problems • Fault Tolerance Definitions • Fault Hiding • Fault Avoidance • Error Models • # Simultaneous Errors • Fault Tolerance Metrics • Error Detection • Error Recovery • Fault Diagnosis • Self-Repair
Trends & Problems • Fault Tolerance • Goal = safety + liveness • Safety: hide faults from the user, even under failure • Liveness: the system keeps performing its desired task • Better to fail safely than to do harm • A growing source of trouble: transient faults from cosmic rays and alpha particles
Trends & Problems • More devices per processor means more units that can fail • Think CISC vs. RISC • More complex designs mean more failure cases exist • Think AVX vs. MMX • Cache faults, and more generally memory faults • Recharging DRAM is “easier” than reloading a destroyed cache line
Fault Tolerance Definitions • Fault • Physical faults • Software faults • May manifest as an error • A masked fault does not show up as an error • Errors may also be masked • Otherwise the error results in a failure • Logical masking – 0 AND error bit = 0 • Architectural masking – e.g., an error in the destination register of a NOP • Application masking – a silent fault, like writing garbage to an unused address, produces no failure
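Logical masking from the list above can be sketched in a few lines of Python (the gate and the flipped bit are hypothetical):

```python
# Logical masking: an erroneous bit ANDed with 0 never propagates,
# so the fault produces no error at the gate output.
def and_gate(a, b):
    return a & b

fault_free = and_gate(0, 1)   # correct second input
faulty     = and_gate(0, 0)   # second input flipped by a fault (1 -> 0)
assert fault_free == faulty == 0   # the 0 input masks the fault
```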
Fault Hiding • Some faults are already recovered automatically: branch prediction recovers from faulty branches • The dangerous cases are the faults that are NOT masked • Goal: mask all faults • E.g. HDD faults are common but hidden • Transient fault – a signal glitch • Permanent fault – a burned-out wire • Intermittent fault – a cold solder joint • Fault tolerance scheme – design the system to mask the expected fault type (transient / permanent / intermittent)
Fault Avoidance • Fault avoidance is just as good as fault tolerance • Error detection and correction is the alternative • Permanent faults • Physical wear-out • Fabrication defects • Design bugs
Error Models • We only care about errors, since masked faults are innocuous • Error models • For improving fault tolerance • E.g. the stuck-at-0/1 model tells us where a potential error can appear • Many stuck-at-0 faults can still mean NO PROBLEM if they are masked • Reduces the need to evaluate every source of error: design space size ↓↓ • 3 main error model parameters • Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error • Error duration – transient, intermittent, permanent • # simultaneous errors – errors are rare; how many wars can you fight at once?
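A minimal sketch of the stuck-at model (the word and bit positions are hypothetical): force one bit of a value to a constant level and see whether an error actually appears.

```python
def stuck_at(value, bit, level):
    # Model a wire stuck at 0 or 1 by forcing that bit of the value.
    if level == 0:
        return value & ~(1 << bit)
    return value | (1 << bit)

assert stuck_at(0b1010, 0, 0) == 0b1010   # bit already 0: fault is masked
assert stuck_at(0b1010, 1, 0) == 0b1000   # fault manifests as an error
```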
# Simultaneous Errors • Maybe 1 error hides another error • E.g. a 2-bit flip fools a parity checker • Reasons for resolving: • Mission critical • High error rate • Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon the NEXT read of the word • Better to detect the first error AND to have double-error correction, since the error-rate trends are against us.
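The 2-bit-flip weakness of parity can be demonstrated directly (the word contents are hypothetical):

```python
def parity(bits):
    # Even-parity check bit over a word.
    return sum(bits) % 2

word = [1, 0, 1, 1]
stored_parity = parity(word)

single = word[:]; single[0] ^= 1                    # one bit flips
double = word[:]; double[0] ^= 1; double[1] ^= 1    # two bits flip

assert parity(single) != stored_parity   # single-bit error detected
assert parity(double) == stored_parity   # double-bit error goes unnoticed
```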
Fault Tolerance Metrics • Availability • 99.999% = five nines of availability • Reliability • P(no failure up to time t) • Most errors are not failures • Mean alone is not probability – variance matters (2 and 20 vs. 11 and 12) • MTTF – Mean Time To Failure • MTTR – Mean Time To Repair • MTBF = MTTF + MTTR
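The metrics combine as in this sketch (the MTTF/MTTR figures are made up to land exactly on five nines):

```python
mttf = 99_999.0   # mean time to failure, hours (hypothetical)
mttr = 1.0        # mean time to repair, hours (hypothetical)

mtbf = mttf + mttr            # MTBF = MTTF + MTTR
availability = mttf / mtbf    # fraction of time the system is up

assert mtbf == 100_000.0
assert abs(availability - 0.99999) < 1e-12   # five nines
```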
Fault Tolerance Metrics • Failures in Time (FIT) • Rate: # failures per 1 billion device-hours • Additive across components • FIT ∝ 1/MTTF • The billion-hour base is an arbitrary convention • Raw rate includes masked failures • Effective rate excludes masked failures • Effective FIT = FIT × AVF • Helps locate transient-error vulnerability • Shown to be a good lower bound on reliability • Architectural Vulnerability Factor (AVF) • Architecturally Correct Execution = ACE state • Otherwise = un-ACE state • E.g. PC state = ACE; branch predictor state = un-ACE • AVF = fraction of time in ACE state • Component AVF = avg # ACE bits per cycle / # state bits • If many ACE bits reside in a structure for a long time, that structure is highly vulnerable: large AVF
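A sketch of the FIT arithmetic (the MTTF and AVF values are hypothetical):

```python
BILLION_HOURS = 1e9

def fit_from_mttf(mttf_hours):
    # FIT is proportional to 1/MTTF: failures per billion device-hours.
    return BILLION_HOURS / mttf_hours

raw_fit = fit_from_mttf(1e6)   # MTTF of one million hours
avf = 0.3                      # fraction of bits/time that are ACE
effective_fit = raw_fit * avf  # masked failures excluded

assert raw_fit == 1000.0
assert effective_fit == 300.0
```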
Error Detection • Helps to provide safety • Without redundancy we cannot detect errors • What kind of redundancy do we need? • Redundancy • Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is an odd number ≥ 3) • Temporal (run twice & compare results) • Information (extra bits like parity) • Boeing 777 uses “triple-triple” modular redundancy: two levels of triple voting, where each vote comes from a different architecture
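TMR's majority gate can be expressed bitwise (the value and the corrupted copy are hypothetical):

```python
def tmr_vote(a, b, c):
    # Bitwise majority: each output bit takes the value held by >= 2 copies.
    return (a & b) | (b & c) | (a & c)

good = 0b1011
corrupted = good ^ 0b0100   # a transient fault flips one bit of one copy
assert tmr_vote(good, good, corrupted) == good   # the fault is outvoted
```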
Error Detection • Physical Redundancy • Heterogeneous hardware units can provide physical redundancy • E.g. watchdog timer • E.g. Boeing 777: different architectures running the same program and then voting on results • Design Diversity • Unit replication • Gate level • Register level • Core level • Wastes lots of area & power • NMR impractical for PCs • False error reporting becomes more likely • Using different hardware designs for the replicas and voters avoids a single design bug defeating every copy
Error Detection • Temporal Redundancy • Twice the active power but not twice the area • Can find transient but not permanent errors • Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots • Information Redundancy • Error-Detecting Code (EDC) • Words mapped to code words, like checksums and CRC • Hamming Distance (HD) • Single-Error Correcting (SEC), Double-Error Detecting (DED) with HD of 4
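A minimal information-redundancy sketch: an additive checksum acting as an EDC (the message bytes are hypothetical):

```python
def checksum(data):
    # 8-bit additive checksum: a simple error-detecting code.
    return sum(data) & 0xFF

msg = bytes([0x12, 0x34, 0x56])
stored = checksum(msg)

corrupted = bytes([0x12, 0x35, 0x56])   # one byte changed in flight
assert checksum(corrupted) != stored    # the EDC flags the error
```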
Error Detection • For the ALU we can compare bit counts of inputs and outputs, but this is not common • Many other techniques exist, like BIST or calculating a known quantity and comparing against a ROM holding the answer • ReExecution with Shifted Operands (RESO) finds permanent errors • Redundant multithreading: use empty issue slots to run redundant threads • Checking invariant conditions • Anomaly detection, like behavioural antivirus (look at data and/or traces) • Error Detection by Duplicated Instructions (EDDI) – let software check the hardware by inserting duplicated dummy instructions • Way, way more exists on caches, CAMs, consistency, and more.
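RESO can be sketched as follows, using a hypothetical adder with a modeled stuck-at-0 output bit:

```python
def alu_add(a, b, stuck_bit=None):
    s = a + b
    if stuck_bit is not None:
        s &= ~(1 << stuck_bit)   # model a permanent stuck-at-0 output bit
    return s

def reso_check(a, b, stuck_bit=None):
    # Re-execute with operands shifted left, shift the result back,
    # and compare: a permanent fault on one bit lane perturbs the
    # two runs differently, so a mismatch flags the fault.
    normal  = alu_add(a, b, stuck_bit)
    shifted = alu_add(a << 1, b << 1, stuck_bit) >> 1
    return normal == shifted

assert reso_check(5, 9)                   # healthy ALU agrees with itself
assert not reso_check(5, 9, stuck_bit=1)  # permanent fault detected
```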
Error Recovery • Safety comes from detection, but what about liveness? • Forward Error Recovery • FER • Once detected, the error is seamlessly corrected • FER is implemented using physical, information, or temporal redundancy • More HW is needed to correct than to detect • E.g. DMR can detect, but TMR or triple-triple can correct (spatial) • HD = k (information redundancy) • detects k−1 bit errors • corrects ⌊(k−1)/2⌋ bit errors • (HD, detect, correct) = (5, 4, 2) • TMR by repetition (temporal)
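The Hamming-distance arithmetic above, shown on a 3-bit repetition code:

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# The {000, 111} repetition code has minimum distance k = 3,
# so it detects k-1 = 2 bit errors and corrects (k-1)//2 = 1.
k = hamming_distance(0b000, 0b111)
assert k == 3
assert k - 1 == 2          # detectable errors
assert (k - 1) // 2 == 1   # correctable errors
```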
Error Recovery • Backward Error Recovery • BER • Rollback / safe point • Restore point • Recovery line for multicore (cool!) • How do we model communication in multiprocessors with caches? • Just log everything? Nope: save it distributed, in the caches; possibly use software • Way more crazy algorithm-selection magic… • The Output Commit Problem • Sphere of recoverability • Don’t let bad data out • Wait for the error detection hardware to complete • The latency is usually hidden • Processor state is difficult to store/restore
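A minimal BER sketch (the state dictionary and the injected error are hypothetical):

```python
import copy

state = {"pc": 0, "acc": 0}
checkpoint = copy.deepcopy(state)   # safe point / restore point

state["acc"] = 999                  # an error corrupts architectural state
error_detected = True               # detection hardware raises a flag

if error_detected:
    state = copy.deepcopy(checkpoint)   # roll back to the recovery line
state["acc"] += 42                      # re-execute the lost work

assert state["acc"] == 42
```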
Error Recovery • FER when a DRAM module fails – RAID-M / chipkill
Fault Diagnosis • Diagnosis hardware • FER and BER do not solve livelock • E.g. mult fails, recover, mult again… livelock • Idea: be smart, figure out which components are toast • BIST • Compare boundary-scan data or stored tests against a ROM with the right answers • Run BIST at fixed intervals or at the end of a context switch • Commit changes if error-free, otherwise restore • Try to test all components in the system, ideally every gate • MPs/NoCs typically have dedicated diagnosis hardware
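The ROM-comparison idea can be sketched like this (the golden answers and the fault model are hypothetical):

```python
# BIST sketch: feed known test patterns to the unit under test and
# compare its answers against golden results stored in a ROM.
GOLDEN_ROM = {(6, 7): 42, (0, 5): 0}

def multiplier(a, b, broken=False):
    return (a * b) ^ (4 if broken else 0)   # 'broken' flips one output bit

def bist(unit):
    return all(unit(a, b) == out for (a, b), out in GOLDEN_ROM.items())

assert bist(lambda a, b: multiplier(a, b))                   # passes BIST
assert not bist(lambda a, b: multiplier(a, b, broken=True))  # diagnosed
```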
Self-Repair • BIST can tell you what broke, but not how to fix it • The Core i7 can respond to errors on its on-chip buses at runtime: partial bus shorts do not kill the system, and data is transferred like a packet (NoC) • Because of all its prediction, lanes, and issue logic, a superscalar core has much more inherent redundancy than a simple RISC core • For RISC, just steal a core from the grid and mark the old core dead • CISC has some very crazy metrics for triggering self-repair • Remember the infinite-loop mult we diagnosed? • Alternative: notice that the multiplier is dead and use shift-and-add (Booth) multiplication instead • Another cool idea: if the shifter breaks, use the multiplier with power-of-2 inputs (hot spare) • A cold spare would be a fully dedicated redundant unit • The Cell BE uses only 7 SPEs and keeps an 8th as a cold spare! So cool!
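The multiply fallback mentioned above, as a shift-and-add sketch (works for non-negative operands):

```python
def shift_add_mult(a, b):
    # Synthesize multiplication from shifts and adds once the
    # hardware multiplier has been diagnosed as dead.
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result

assert shift_add_mult(6, 7) == 42
assert shift_add_mult(0, 13) == 0
```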
Conclusions • Things are getting a bit crazy in error detection and correction • Multicore and caches complicated everything • Although this fault-tolerance knowledge is not new, it is only now entering the PC market because the error rate is increasing with process technology scaling • Like the Byzantine generals problem, we start to worry about whom to trust in a running-but-broken chip • Voting works best for transient errors; it works for permanent errors too, but land the plane or you will end up crashing • You can prove that it is easier to detect a problem than to fix it.
References [1] Daniel J. Sorin, “Fault Tolerant Computer Architecture (Synthesis Lectures on Computer Architecture),” 2010.