Fault Tolerance in Embedded Systems dshap092@uottawa.ca http://site.uottawa.ca/~dshap092 Daniel Shapiro
Fault Tolerance • This presentation is based upon [1] • Focus is on the basics as applied to embedded systems with processors • This presentation does not rely on Wikipedia. • See Byzantine fault tolerance on wiki
Overview • Trends & Problems • Fault Tolerance Definitions • Fault Hiding • Fault Avoidance • Error Models • # Simultaneous Errors • Fault Tolerance Metrics • Error Detection • Error Recovery • Fault Diagnosis • Self-Repair
Trends & Problems • Fault Tolerance • Goal = safety + liveness • Safety: hide faults from the user, even under failure • Liveness: the system keeps performing its desired task • Better to fail safely than to do harm • A growing source of trouble: transient faults from cosmic rays and alpha particles
Trends & Problems • More devices per processor means more units that can fail • Think CISC vs. RISC • More complex designs mean more failure cases exist • Think AVX vs. MMX • Cache faults, and more generally memory faults • Recharging DRAM is “easier” than reloading a destroyed cache line
Fault Tolerance Definitions • Fault • Physical faults • Software faults • May manifest as an error • A masked fault does not show up as an error • Errors may also be masked • Otherwise the error results in a failure • Logical masking – 0 AND error bit = 0 • Architectural masking – e.g., an error in the destination register of a NOP • Application masking – a silent fault, like writing garbage to an unused address, produces no failure
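Logical masking from the list above can be sketched in a few lines of Python (the gate and the flipped bit are hypothetical):

```python
# Logical masking: an erroneous bit ANDed with 0 never propagates,
# so the fault produces no error at the gate output.
def and_gate(a, b):
    return a & b

fault_free = and_gate(0, 1)   # correct second input
faulty     = and_gate(0, 0)   # second input flipped by a fault (1 -> 0)
assert fault_free == faulty == 0   # the 0 input masks the fault
```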
Fault Hiding • Some faults are already recovered automatically: branch prediction recovers from faulty branches • The dangerous cases are the faults that are NOT masked • Goal: mask all faults • E.g. HDD faults are common but hidden • Transient fault – a signal glitch • Permanent fault – a burned-out wire • Intermittent fault – a cold solder joint • Fault tolerance scheme – design the system to mask the expected fault type (transient / permanent / intermittent)
Fault Avoidance • Fault avoidance is just as good as fault tolerance • Error detection and correction is the alternative • Permanent faults • Physical wear-out • Fabrication defects • Design bugs
Error Models • We only care about errors, since masked faults are innocuous • Error models • For improving fault tolerance • E.g. the stuck-at-0/1 model tells us where a potential error can appear • Many stuck-at-0 faults can still mean NO PROBLEM if they are masked • Reduces the need to evaluate every source of error: design space size ↓↓ • 3 main error model parameters • Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error • Error duration – transient, intermittent, permanent • # simultaneous errors – errors are rare; how many wars can you fight at once?
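A minimal sketch of the stuck-at model (the word and bit positions are hypothetical): force one bit of a value to a constant level and see whether an error actually appears.

```python
def stuck_at(value, bit, level):
    # Model a wire stuck at 0 or 1 by forcing that bit of the value.
    if level == 0:
        return value & ~(1 << bit)
    return value | (1 << bit)

assert stuck_at(0b1010, 0, 0) == 0b1010   # bit already 0: fault is masked
assert stuck_at(0b1010, 1, 0) == 0b1000   # fault manifests as an error
```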
# Simultaneous Errors • Maybe 1 error hides another error • E.g. a 2-bit flip fools a parity checker • Reasons for resolving: • Mission critical • High error rate • Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon the NEXT read of the word • Better to detect the first error AND to have double-error correction, since the error-rate trends are against us.
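The 2-bit-flip weakness of parity can be demonstrated directly (the word contents are hypothetical):

```python
def parity(bits):
    # Even-parity check bit over a word.
    return sum(bits) % 2

word = [1, 0, 1, 1]
stored_parity = parity(word)

single = word[:]; single[0] ^= 1                    # one bit flips
double = word[:]; double[0] ^= 1; double[1] ^= 1    # two bits flip

assert parity(single) != stored_parity   # single-bit error detected
assert parity(double) == stored_parity   # double-bit error goes unnoticed
```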
Fault Tolerance Metrics • Availability • 99.999% = five nines of availability • Reliability • P(no failure up to time t) • Most errors are not failures • Mean alone is not probability – variance matters (2 and 20 vs. 11 and 12) • MTTF – Mean Time To Failure • MTTR – Mean Time To Repair • MTBF = MTTF + MTTR
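The metrics combine as in this sketch (the MTTF/MTTR figures are made up to land exactly on five nines):

```python
mttf = 99_999.0   # mean time to failure, hours (hypothetical)
mttr = 1.0        # mean time to repair, hours (hypothetical)

mtbf = mttf + mttr            # MTBF = MTTF + MTTR
availability = mttf / mtbf    # fraction of time the system is up

assert mtbf == 100_000.0
assert abs(availability - 0.99999) < 1e-12   # five nines
```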
Fault Tolerance Metrics • Failures in Time (FIT) • Rate: # failures per 1 billion device-hours • Additive across components • FIT ∝ 1/MTTF • The billion-hour base is an arbitrary convention • Raw rate includes masked failures • Effective rate excludes masked failures • Effective FIT = FIT × AVF • Helps locate transient-error vulnerability • Shown to be a good lower bound on reliability • Architectural Vulnerability Factor (AVF) • Architecturally Correct Execution = ACE state • Otherwise = un-ACE state • E.g. PC state = ACE; branch predictor state = un-ACE • AVF = fraction of time in ACE state • Component AVF = avg # ACE bits per cycle / # state bits • If many ACE bits reside in a structure for a long time, that structure is highly vulnerable: large AVF
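A sketch of the FIT arithmetic (the MTTF and AVF values are hypothetical):

```python
BILLION_HOURS = 1e9

def fit_from_mttf(mttf_hours):
    # FIT is proportional to 1/MTTF: failures per billion device-hours.
    return BILLION_HOURS / mttf_hours

raw_fit = fit_from_mttf(1e6)   # MTTF of one million hours
avf = 0.3                      # fraction of bits/time that are ACE
effective_fit = raw_fit * avf  # masked failures excluded

assert raw_fit == 1000.0
assert effective_fit == 300.0
```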
Error Detection • Helps to provide safety • Without redundancy we cannot detect errors • What kind of redundancy do we need? • Redundancy • Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is an odd number ≥ 3) • Temporal (run twice & compare results) • Information (extra bits like parity) • Boeing 777 uses “triple-triple” modular redundancy: two levels of triple voting, where each vote comes from a different architecture
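TMR's majority gate can be expressed bitwise (the value and the corrupted copy are hypothetical):

```python
def tmr_vote(a, b, c):
    # Bitwise majority: each output bit takes the value held by >= 2 copies.
    return (a & b) | (b & c) | (a & c)

good = 0b1011
corrupted = good ^ 0b0100   # a transient fault flips one bit of one copy
assert tmr_vote(good, good, corrupted) == good   # the fault is outvoted
```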
Error Detection • Physical Redundancy • Heterogeneous hardware units can provide physical redundancy • E.g. watchdog timer • E.g. Boeing 777: different architectures running the same program and then voting on results • Design Diversity • Unit replication • Gate level • Register level • Core level • Wastes lots of area & power • NMR impractical for PCs • False error reporting becomes more likely • Using different hardware designs for the replicas and voters avoids a single design bug defeating every copy
Error Detection • Temporal Redundancy • Twice the active power but not twice the area • Can find transient but not permanent errors • Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots • Information Redundancy • Error-Detecting Code (EDC) • Words mapped to code words, like checksums and CRC • Hamming Distance (HD) • Single-Error Correcting (SEC), Double-Error Detecting (DED) with HD of 4
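A minimal information-redundancy sketch: an additive checksum acting as an EDC (the message bytes are hypothetical):

```python
def checksum(data):
    # 8-bit additive checksum: a simple error-detecting code.
    return sum(data) & 0xFF

msg = bytes([0x12, 0x34, 0x56])
stored = checksum(msg)

corrupted = bytes([0x12, 0x35, 0x56])   # one byte changed in flight
assert checksum(corrupted) != stored    # the EDC flags the error
```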
Error Detection • For the ALU we can compare bit counts of inputs and outputs, but this is not common • Many other techniques exist, like BIST or calculating a known quantity and comparing against a ROM holding the answer • ReExecution with Shifted Operands (RESO) finds permanent errors • Redundant multithreading: use empty issue slots to run redundant threads • Checking invariant conditions • Anomaly detection, like behavioural antivirus (look at data and/or traces) • Error Detection by Duplicated Instructions (EDDI) – let software check the hardware by inserting duplicated dummy instructions • Way, way more exists on caches, CAMs, consistency, and more.
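RESO can be sketched as follows, using a hypothetical adder with a modeled stuck-at-0 output bit:

```python
def alu_add(a, b, stuck_bit=None):
    s = a + b
    if stuck_bit is not None:
        s &= ~(1 << stuck_bit)   # model a permanent stuck-at-0 output bit
    return s

def reso_check(a, b, stuck_bit=None):
    # Re-execute with operands shifted left, shift the result back,
    # and compare: a permanent fault on one bit lane perturbs the
    # two runs differently, so a mismatch flags the fault.
    normal  = alu_add(a, b, stuck_bit)
    shifted = alu_add(a << 1, b << 1, stuck_bit) >> 1
    return normal == shifted

assert reso_check(5, 9)                   # healthy ALU agrees with itself
assert not reso_check(5, 9, stuck_bit=1)  # permanent fault detected
```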
Error Recovery • Safety comes from detection, but what about liveness? • Forward Error Recovery • FER • Once detected, the error is seamlessly corrected • FER is implemented using physical, information, or temporal redundancy • More HW is needed to correct than to detect • E.g. DMR can detect, but TMR or triple-triple can correct (spatial) • HD = k (information redundancy) • detects k−1 bit errors • corrects ⌊(k−1)/2⌋ bit errors • (HD, detect, correct) = (5, 4, 2) • TMR by repetition (temporal)
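The Hamming-distance arithmetic above, shown on a 3-bit repetition code:

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# The {000, 111} repetition code has minimum distance k = 3,
# so it detects k-1 = 2 bit errors and corrects (k-1)//2 = 1.
k = hamming_distance(0b000, 0b111)
assert k == 3
assert k - 1 == 2          # detectable errors
assert (k - 1) // 2 == 1   # correctable errors
```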
Error Recovery • Backward Error Recovery • BER • Rollback / safe point • Restore point • Recovery line for multicore (cool!) • How do we model communication in multiprocessors with caches? • Just log everything? Nope: save it distributed, in the caches; possibly use software • Way more crazy algorithm-selection magic… • The Output Commit Problem • Sphere of recoverability • Don’t let bad data out • Wait for the error detection hardware to complete • The latency is usually hidden • Processor state is difficult to store/restore
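A minimal BER sketch (the state dictionary and the injected error are hypothetical):

```python
import copy

state = {"pc": 0, "acc": 0}
checkpoint = copy.deepcopy(state)   # safe point / restore point

state["acc"] = 999                  # an error corrupts architectural state
error_detected = True               # detection hardware raises a flag

if error_detected:
    state = copy.deepcopy(checkpoint)   # roll back to the recovery line
state["acc"] += 42                      # re-execute the lost work

assert state["acc"] == 42
```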
Error Recovery • FER when a DRAM module fails – RAID-M / chipkill
Fault Diagnosis • Diagnosis hardware • FER and BER do not solve livelock • E.g. mult fails, recover, mult again… livelock • Idea: be smart, figure out which components are toast • BIST • Compare boundary-scan data or stored tests against a ROM with the right answers • Run BIST at fixed intervals or at the end of a context switch • Commit changes if error-free, otherwise restore • Try to test all components in the system, ideally every gate • MPs/NoCs typically have dedicated diagnosis hardware
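The ROM-comparison idea can be sketched like this (the golden answers and the fault model are hypothetical):

```python
# BIST sketch: feed known test patterns to the unit under test and
# compare its answers against golden results stored in a ROM.
GOLDEN_ROM = {(6, 7): 42, (0, 5): 0}

def multiplier(a, b, broken=False):
    return (a * b) ^ (4 if broken else 0)   # 'broken' flips one output bit

def bist(unit):
    return all(unit(a, b) == out for (a, b), out in GOLDEN_ROM.items())

assert bist(lambda a, b: multiplier(a, b))                   # passes BIST
assert not bist(lambda a, b: multiplier(a, b, broken=True))  # diagnosed
```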
Self-Repair • BIST can tell you what broke, but not how to fix it • The Core i7 can respond to errors on its on-chip buses at runtime: partial bus shorts do not kill the system, and data is transferred like a packet (NoC) • Because of all its prediction, lanes, and issue logic, a superscalar core has much more inherent redundancy than a simple RISC core • For RISC, just steal a core from the grid and mark the old core dead • CISC has some very crazy metrics for triggering self-repair • Remember the infinite-loop mult we diagnosed? • Alternative: notice that the multiplier is dead and use shift-and-add (Booth) multiplication instead • Another cool idea: if the shifter breaks, use the multiplier with power-of-2 inputs (hot spare) • A cold spare would be a fully dedicated redundant unit • The Cell BE uses only 7 SPEs and keeps an 8th as a cold spare! So cool!
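The multiply fallback mentioned above, as a shift-and-add sketch (works for non-negative operands):

```python
def shift_add_mult(a, b):
    # Synthesize multiplication from shifts and adds once the
    # hardware multiplier has been diagnosed as dead.
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result

assert shift_add_mult(6, 7) == 42
assert shift_add_mult(0, 13) == 0
```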
Conclusions • Things are getting a bit crazy in error detection and correction • Multicore and caches complicated everything • Although this fault-tolerance knowledge is not new, it is only now entering the PC market because the error rate is increasing with process technology scaling • Like the Byzantine generals problem, we start to worry about whom to trust in a running-but-broken chip • Voting works best for transient errors; it works for permanent errors too, but land the plane or you will end up crashing • You can prove that it is easier to detect a problem than to fix it.
References [1] Daniel J. Sorin, “Fault Tolerant Computer Architecture (Synthesis Lectures on Computer Architecture),” 2010.