Spring 2008 CSE 591 Compilers for Embedded Systems

Aviral Shrivastava
Department of Computer Science and Engineering
Arizona State University

Lecture 3: Soft Errors Models and Techniques

Outline
• Soft Errors Recap
• Process Technology and Packaging Solutions
• Gate-level and Circuit-level Solutions
• Microarchitectural Solutions
  • Single-core
  • Multi-threaded
• Software Solutions
• Multi Bit Upsets (MBUs)
• Single Event Latchup

Phenomenon of Soft Error
• Transient faults
  • Random and spontaneous bit changes in the system
• Can be caused by
  • Circuit noise
  • Cross-talk
  • Radiation strikes: more than 50% of soft errors

Metrics
• FIT: Failures in Time
  • Number of failures in 1 billion hours of operation
• MTTF: Mean Time To Failure
  • 1000 FIT => MTTF of 114 years
• 1 GByte of RAM @ 500 FIT/Mbit can expect an error every two weeks
• ECC reduces the failure rate by 2 orders of magnitude
• A hypothetical Terabyte system would experience a soft error every few minutes
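The FIT-to-MTTF arithmetic on this slide can be checked with a short sketch. The function names are illustrative, and the sketch assumes binary units (1 GByte = 8 x 1024 Mbit) and 365-day years. Under those assumptions the 1 GByte example comes out at roughly ten days, the same order as the "two weeks" quoted above.

```python
HOURS_PER_BILLION = 1e9      # FIT is defined per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365    # assumption: 365-day year


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a failure rate in FIT to a mean time to failure in hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


def memory_fit(size_mbit: float, fit_per_mbit: float) -> float:
    """Total FIT of a memory, assuming per-Mbit failure rates simply add up."""
    return size_mbit * fit_per_mbit


# 1000 FIT expressed as an MTTF in years
mttf_years_1000_fit = fit_to_mttf_hours(1000) / HOURS_PER_YEAR

# 1 GByte of RAM at 500 FIT/Mbit (assumption: 1 GByte = 8 * 1024 Mbit)
ram_fit = memory_fit(8 * 1024, 500)
ram_mttf_days = fit_to_mttf_hours(ram_fit) / 24

print(f"1000 FIT -> MTTF of about {mttf_years_1000_fit:.0f} years")
print(f"1 GByte @ 500 FIT/Mbit -> one error about every {ram_mttf_days:.0f} days")
```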

Trends
• DRAM
  • System error rate of DRAMs is fairly constant
• SRAM
  • Increasing exponentially
• Logic
  • Increasing exponentially

Masking Effects
• Logic Masking
  • Occurs when a particle strikes a portion of combinational logic that is blocked from affecting the output by a subsequent gate whose result is completely determined by its other input values
• Electrical Masking
  • Occurs when the pulse resulting from a particle strike is attenuated by subsequent logic gates and does not affect the result of the circuit
• Latching Window Masking
  • Occurs when the pulse resulting from a particle strike reaches a latch, but not at the clock transition where the latch captures its input values
• Microarchitectural Masking
  • Occurs when the incorrect value in the latch is ignored in the evaluation of a program variable
• Software Masking
  • Occurs when an incorrect value of a variable is ignored by the software while computing the outputs
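Logic masking can be illustrated with a toy two-input gate model. This is a minimal sketch, not a circuit simulator: it assumes the upset has already flipped one input of the gate and only asks whether the gate output changes. The helper names (`nand`, `xor`, `masked_fraction`) are made up for the illustration.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def xor(a: int, b: int) -> int:
    """Two-input XOR on 0/1 values."""
    return a ^ b


def masked_fraction(gate, flipped_input: int = 0) -> float:
    """Fraction of input states in which flipping one input leaves the
    gate output unchanged, i.e. the upset is logically masked."""
    states = list(product((0, 1), repeat=2))
    masked = 0
    for inputs in states:
        faulty = list(inputs)
        faulty[flipped_input] ^= 1          # inject the bit flip
        if gate(*inputs) == gate(*faulty):  # output determined by the other input
            masked += 1
    return masked / len(states)


print("NAND masks", masked_fraction(nand), "of single-input flips")
print("XOR masks", masked_fraction(xor), "of single-input flips")
```

In this model a NAND masks the flip whenever its other input is 0, while an XOR never masks a flipped input.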

Faults, Errors, Failures (“Fault Tolerant Computer Systems”, by Pradhan)
• Fault
  • Defect in a hardware or software component
  • Defect for a cosmic ray = upset from a high-energy neutron strike
• Error
  • Manifestation of a fault, resulting in deviation from accuracy
  • Faults cause errors (but not vice versa)
  • A masked fault is not an error!
  • Vulnerability factor = fraction of faults that cause errors
• Failure
  • Non-performance of expected action
  • Errors cause failures (but not vice versa)
  • A corrected error doesn’t cause a failure

Fault Tolerance in Microprocessors
• Information Redundancy
  • Protecting data words with information coding
  • Parity or Hamming codes
  • ECC mainly in memory arrays
  • Cost is additional storage for coding overhead, plus checking logic
• Space Redundancy
  • Carrying out the same computation on multiple independent hardware units at the same time
  • Errors are exposed by checking the independent results
  • Causes large hardware overhead
  • Good for permanent faults
• Time Redundancy
  • Execute the same computation on the same hardware at different times
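As an illustration of information redundancy, the sketch below protects a 64-bit word with a single even-parity bit. It is a minimal example with made-up function names, not a description of any particular memory design. It shows that parity detects a single bit flip but misses a double flip, which is one reason memory arrays use stronger ECC.

```python
WORD_BITS = 64


def parity(word: int) -> int:
    """Even-parity bit of a word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


def encode(word: int) -> tuple[int, int]:
    """Store the word together with its parity bit."""
    return word, parity(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word is consistent with its stored parity bit."""
    return parity(word) == stored_parity


word, p = encode(0xDEADBEEFCAFEF00D)
single_flip = word ^ (1 << 17)   # one upset bit: detected
double_flip = word ^ 0b11        # two upset bits: parity cannot see this

print("clean word passes:", check(word, p))
print("single flip detected:", not check(single_flip, p))
print("double flip detected:", not check(double_flip, p))
print(f"storage overhead: {1 / WORD_BITS:.2%}")
```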

The Soft Error Opportunity
• Key differences with classical fault tolerance
  • FIT budget 100x to 1000x more than Tandem-style machines
  • Traditional “big hammer” solutions are too expensive for the volume market and can be overkill
• Why does architecture play a critical role?
  • Errors are often defined in architecture and microarchitecture
    • e.g., a strike on a branch predictor doesn’t cause an error
  • Architectural solutions are often more cost-effective
    • One bit of parity can protect 64 bits, overhead < 2%
    • Radiation-hardened cells can have overhead around 20-40%

Processing and Packaging Solutions
• Reduce the number of particles that strike
  • Reduce upsets
• Use of highly purified fabrication materials
  • Remove traces of boron and heavy metals
• Surround by metallic frame
  • Reduce low-energy particles
  • But neutrons can pass through > 10 ft of concrete
• Process Technology Solutions
  • Partially depleted SOI: no help after 250 nm
  • Fully depleted SOI: very expensive

Transistor Level Techniques
• Normally a CMOS inverter is sized with a 2:1 ratio between p- and n-channel devices
  • To compensate for the difference between electron and hole mobilities
• Changing this ratio can increase the tolerance

Gate-Level Techniques
• Some gates are more vulnerable than others
• Radiation hardened designs use NAND gates
  • When all inputs are low, drive of p-stack is low, high leakage of n-transistors → rise in the output is slow → functional failure
• Gate vulnerability may change by 5X depending on the state
  • NAND gate
    • Extremely vulnerable when inputs are 10
    • Not vulnerable when inputs are 00
• How to synthesize to minimize vulnerability?

Circuit-Level Techniques
• Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients
• High temperature coefficients of poly-silicon resistors
• Difficult to control variation of resistance

Architectural Vulnerability Factor
• AVF: Probability that a fault in a particular structure will result in system failure
  • AVF of branch predictor = 0%
  • AVF of PC = 100%
• ACE-bit: “Architectural bits” that must be correct for “Correct Execution”
  • Count number of ACE-bits in a structure
• Identifying Un-ACE bits
  • Microarchitectural Un-ACE bits: cannot influence correct instruction execution
    • Idle or invalid state, e.g., inputs to un-chosen paths of mux
    • Mis-speculated state, e.g., wrong path instruction
    • Predictor structures, e.g., branch predictor
    • Ex-ACE state, e.g., registers
  • Architectural Un-ACE bits: affect correct path execution, but do not change the output
    • NOP instructions
    • Prefetch instructions
    • Predicated false instructions
    • Dynamically dead instructions: FDD (first-level dynamically dead), TDD (transitively dynamically dead)
• Computing AVF from a Performance Model
  • Gather the number of ACE-bits in each cycle
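One way to turn per-cycle ACE-bit counts into an AVF number is to average the fraction of ACE bits over the simulated cycles. The sketch below assumes the performance model has already produced a list of ACE-bit counts, one per cycle; the function name and the trace values are hypothetical.

```python
def avf(ace_bits_per_cycle: list[int], structure_bits: int) -> float:
    """Average fraction of a structure's bits that are ACE, over all cycles.

    ace_bits_per_cycle: number of ACE bits resident in the structure each cycle
    structure_bits:     total number of bits in the structure
    """
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or structure_bits <= 0:
        raise ValueError("need at least one cycle and a non-empty structure")
    if any(not 0 <= n <= structure_bits for n in ace_bits_per_cycle):
        raise ValueError("ACE-bit count must lie between 0 and structure_bits")
    return sum(ace_bits_per_cycle) / (structure_bits * cycles)


# Hypothetical 64-bit structure observed for four cycles
trace = [32, 16, 0, 16]
print(f"AVF = {avf(trace, 64):.0%}")
```

A structure that never holds ACE bits (such as the branch predictor) gets an AVF of 0, and one whose bits are all ACE in every cycle (such as the PC) gets 1.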

Vulnerability Contributions
• DCache: largest contributor to vulnerability
  • Data + tags
• ICache: close second
  • Instructions only
  • Tags are (almost) not vulnerable
• Register File, Pipeline
  • Rate of errors may be higher in Pipeline and RF
• Compute Cache and Register File Vulnerability

Vulnerability Variations
• System vulnerability changes with time
• How can you use this information?

D-Cache: Flushing
• 4x reduction in vulnerability

D-Cache: Write Policy
• 10x reduction in vulnerability

D-Cache: Refresh
• 3x reduction in vulnerability using write-thru (30x total)

DIVA Microarchitecture
[Figure: block diagram with BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$, Arch Regs]
• Example: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12
• Storage check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
• ALU check: add 4 + 8 and confirm it equals 12
• If both checks succeed, write 12 into LR15
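The two checks in this example can be sketched as a commit-time function. This is a simplified model under stated assumptions, not the actual DIVA checker design: registers are a Python dict, only `add` and `sub` are supported, and the `Completed` record and `check_and_commit` name are made up for the illustration.

```python
import operator
from dataclasses import dataclass

# Assumption: a tiny instruction set for illustration only
OPS = {"add": operator.add, "sub": operator.sub}


@dataclass
class Completed:
    """An instruction as handed over by the primary core at commit."""
    op: str
    src1: str
    src2: str
    dst: str
    val1: int     # operand values the primary core used
    val2: int
    result: int   # result the primary core computed


def check_and_commit(instr: Completed, arch_regs: dict[str, int]) -> bool:
    """Run the storage and ALU checks; write the result only if both pass."""
    # Storage check: operands must match the architected register file
    storage_ok = (arch_regs[instr.src1] == instr.val1
                  and arch_regs[instr.src2] == instr.val2)
    # ALU check: recompute the operation on the reported operands
    alu_ok = OPS[instr.op](instr.val1, instr.val2) == instr.result
    if storage_ok and alu_ok:
        arch_regs[instr.dst] = instr.result
        return True
    return False  # error detected: architected state is left untouched


regs = {"LR3": 4, "LR7": 8, "LR15": 0}
ok = check_and_commit(Completed("add", "LR3", "LR7", "LR15", 4, 8, 12), regs)
print("committed:", ok, "LR15 =", regs["LR15"])
```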

Microarchitecture Details
• Instructions are fed to the checker in order during commit
• The logic and storage checks detect errors in the ALUs and datapath
• The checker core is a simple in-order pipeline: easy to design and verify
• An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a ren/decode stage to the checker
• In-order core has no stalls (need bypass for register file): no data dependences, cache misses, branch mispredicts
• Contention for register file and data cache can degrade the primary thread

Recovery
• The architected register file and data cache are ECC protected: when an error is detected, it is assumed that checker and architected state are correct
• Primary core is re-started from the faulting instruction
• A fault in the primary core may result in deadlock, e.g., the instruction that produces R5 is waiting for R5 to be produced (instead of R4)
• A timeout in the checker signals an error

PPC (Partially Protected Caches)
• 2 caches at the same level of memory hierarchy
  • Main cache, and the protected mini-cache
• Mini-cache
  • Low power, low latency
  • Timing slack to harden it
• Compiler maps data to the two caches
  • Map Failure-Critical (FC) data to the protected mini-cache
  • Map Not Failure-Critical (FNC) data to the unprotected main cache
• Intuition is to provide protection to only the FC data
  • In multimedia applications, the multimedia data is NOT failure critical
  • An error → loss in Quality of Service
• How to use PPCs for general applications?
[Figure: page mapping (FNC, FC); processor pipeline; HPC and PPC; unprotected main cache and protected mini cache; memory controller; memory]

Razor
• Originally proposed to tolerate process variations
• Shadow latch clocked with a delayed clock
• If difference in values latched, raise error
• How to use it to detect soft errors?
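One possible reading of the closing question is sketched below: a transient pulse that corrupts the value seen by the main latch but has died out by the delayed clock edge shows up as a main/shadow mismatch. This is a toy timing model under that assumption, not the published Razor circuit; the helper names and time values are made up.

```python
def with_glitch(base_value: int, start: float, width: float):
    """Return a signal(t) that holds base_value except during a transient
    pulse [start, start + width), where the bit is inverted."""
    def signal(t: float) -> int:
        if start <= t < start + width:
            return base_value ^ 1
        return base_value
    return signal


def razor_sample(signal, t_clk: float, delay: float) -> tuple[int, int, bool]:
    """Sample the main latch at t_clk and the shadow latch at t_clk + delay;
    flag an error if the two latched values differ."""
    main = signal(t_clk)
    shadow = signal(t_clk + delay)
    return main, shadow, main != shadow


# Short pulse overlapping the main clock edge at t = 10: mismatch is flagged
print(razor_sample(with_glitch(1, start=9.5, width=1.0), t_clk=10, delay=2))
# Pulse long enough to cover both edges: both latches agree on the wrong value
print(razor_sample(with_glitch(1, start=9.0, width=5.0), t_clk=10, delay=2))
```

The second case shows the limitation of this scheme: a pulse wider than the clock delay corrupts both latches identically and goes undetected.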
