Understand the impact and sources of soft errors in Advanced VLSI design, and learn techniques to make hardware robust and fault-tolerant. Covers causes, effects, and mitigation strategies.
ELEC 7770 Advanced VLSI Design, Spring 2012
Soft Errors and Fault-Tolerant Design
Vishwani D. Agrawal, James J. Danaher Professor
ECE Department, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr12
ELEC 7770: Advanced VLSI Design (Agrawal)

Soft Errors
• Soft errors are errors caused by the operating environment.
• They are not due to a permanent hardware fault.
• Soft errors are intermittent or random, which makes testing for them unreliable.
• One way to deal with soft errors is to make hardware robust:
    • Capable of detecting soft errors
    • Capable of correcting soft errors
• Both measures are probabilistic.

Some Early References
• J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
• M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
• T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.

Causes of Soft Errors
• Interconnect coupling (crosstalk).
• Power supply noise: IR-drop, power droop, ground bounce.
• Ignition noise.
• Electromagnetic pulse (EMP).
• Effects generally attributed to alpha-particles:
    • Charged particles: electrons, protons, ions.
    • Radiation (photons): X-rays, gamma-rays, ultraviolet light.

Sources of Alpha-Particles
• Radioactive contamination in VLSI packaging material.
• Ionosphere, magnetosphere and solar radiation.
• Other electromagnetic radiation.

Alpha-Particle
• Helium nucleus: two protons and two neutrons, mass = $6.65 \times 10^{-27}$ kg, charge = $+2e$ ($e = 1.6 \times 10^{-19}$ C).
• Rest-mass energy, $mc^2$ = 3.73 GeV.

Soft Error Rate (SER)
• Failures in time (FIT): one FIT is 1 error per billion ($10^9$) hours of operation.
• An alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
• An MTBF of 1 year corresponds to $10^9/(365 \times 24) = 114{,}155$ FIT.

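The FIT and MTBF conversion above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and, like the slide, it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years as the slide does


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """FIT is the number of failures per 10^9 hours of operation."""
    return 1e9 / mtbf_hours


def fit_to_mtbf_years(fit: float) -> float:
    """Inverse conversion: MTBF in years for a given FIT rate."""
    return 1e9 / fit / HOURS_PER_YEAR


# An MTBF of one year expressed in FIT, rounded to the nearest integer.
print(round(mtbf_hours_to_fit(HOURS_PER_YEAR)))
```
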
Particle Strike
[Figure: an ion or charged particle passes through the n and p regions into the substrate, leaving positive and negative charges along its track.]

Induced Current
[Figure: induced current versus time.]
$I(t) = I_0 \left( e^{-t/a} - e^{-t/b} \right)$, $a \gg b$

Voltage Induced at a Node
$V = Q/C$, where $Q = \int I(t)\,dt$ and $C$ is the node capacitance.
A smaller node capacitance results in a larger voltage swing.

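As a rough illustration of the two formulas above, the sketch below integrates the double-exponential pulse in closed form, $Q = I_0 (a - b)$ over $0 \le t < \infty$, and divides by the node capacitance. The numeric values are hypothetical, and the model ignores any restoring current from the gate driving the node, so it only shows the trend with capacitance.

```python
import math


def induced_current(t, i0, a, b):
    """Double-exponential pulse I(t) = I0 * (exp(-t/a) - exp(-t/b)), with a >> b."""
    return i0 * (math.exp(-t / a) - math.exp(-t / b))


def collected_charge(i0, a, b):
    """Closed form of the integral of I(t) from 0 to infinity: Q = I0 * (a - b)."""
    return i0 * (a - b)


def voltage_swing(i0, a, b, c_node):
    """V = Q / C for a node of capacitance c_node (farads)."""
    return collected_charge(i0, a, b) / c_node


# Hypothetical values, for illustration only.
I0, A, B = 1e-4, 200e-12, 50e-12   # amperes, seconds, seconds
for c_node in (50e-15, 10e-15):    # farads
    v = voltage_swing(I0, A, B, c_node)
    print(f"C = {c_node * 1e15:.0f} fF -> V = {v:.2f} V")
```

With the same collected charge, the smaller capacitance gives the larger swing, as the slide states.
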
Effect on Digital Circuit
[Figure: combinational logic between IN and OUT with clock CK; charged particles strike the circuit at two locations.]

An SRAM Cell
[Figure: SRAM cell with word line WL, supply VDD, complementary bit lines BL, and storage nodes holding 1 and 0.]

SRAM Cell Struck by Alpha-Particle: Single-Event Upset (SEU)
[Figure: charged particles strike the SRAM cell; the storage nodes flip 1→0 and 0→1.]

A Resistor Hardened SRAM Cell
[Figure: resistor-hardened SRAM cell with word line WL, supply VDD, complementary bit lines BL, and storage nodes holding 1 and 0.]

D-Latch
[Figure: D-latch with input D, complementary outputs Q, CK = 0, and internal nodes holding 1 and 0.]

SEU in D-Latch
[Figure: charged particles strike the D-latch while CK = 0; the internal nodes flip 1→0 and 0→1.]

Single Event Transients in Combinational Logic
[Figure: combinational logic between clocked (CK) elements with signal values marked; charged particles strike an internal node.]

Effects of Transients
• Error correcting effects
    • Transient pulse is filtered by gate inertia
    • Transient is blocked by an unsensitized path
    • Transient is blocked by an inactive clock
• Error enhancing effects
    • A large number of gates can produce multiple pulses
    • Fanouts can multiply error pulses

Typical Soft Error Distribution
[Figure not recovered.]
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Soft Error Simulation
• F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on VLSI Design, January 2009, pp. 459-464.
• F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010, pp. 225-230.

SEUs in FPGA
• Parts that can be affected:
    • Look-up table (LUT)
    • Configuration memory cell
    • Flip-flop
    • Block RAM
• Ref.: F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.

LUT
[Figure: four-input look-up table with inputs F1 to F4, output out, and 16 memory cells holding 1 1 1 0 0 1 0 0 0 0 1 1 1 0 0 1.]

SEU in LUT
[Figure: the same look-up table after a charged particle strikes one memory cell; the stored 1 is changed to 0.]

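A behavioral sketch of an SEU in a four-input LUT follows. The cell contents are the 16 values from the figure, but the mapping from (F1, F2, F3, F4) to a cell index and the choice of which cell is upset are assumptions made for illustration; the function names are hypothetical.

```python
def lut_read(cells, f1, f2, f3, f4):
    """Return the stored bit selected by the four inputs (F1 is assumed to be the MSB)."""
    index = (f1 << 3) | (f2 << 2) | (f3 << 1) | f4
    return cells[index]


def inject_seu(cells, index):
    """Return a copy of the LUT contents with the bit at `index` flipped."""
    upset = list(cells)
    upset[index] ^= 1
    return upset


golden = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
faulty = inject_seu(golden, 10)  # assumed upset location: a cell storing 1

# Input combinations for which the implemented function is now wrong.
mismatches = [i for i in range(16) if golden[i] != faulty[i]]
print(mismatches)
```

A single upset cell changes the logic function for exactly one input combination, and the error persists until the cell is rewritten.
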
Four Types of SEU in FPGA
[Figure: SEU locations labeled Type 1 to Type 4 among the LUT (inputs F1 to F4), configuration memory cells (M), flip-flop (FF), and block RAM.]

SEU Detection Methods
• Hardware redundancy
• Time redundancy
• Error detection codes (EDC)
• Self-checker techniques

SEU Mitigation Techniques
• Triple modular redundancy (TMR)
• Multiple redundancy with voting
• Error detection and correction codes (EDAC)
• Hardened memory cells
• FPGA-specific methods
    • Reconfiguration
    • Partial configuration
    • Rerouting design

Hardware Redundancy for Detection
[Figure: the inputs drive the combinational logic and a duplicated copy; the two outputs are compared, and logic 1 indicates an error.]
• Hardware overhead is high, about 100%.
• Performance penalty is negligible.

Time Redundancy for Detection
[Figure: the combinational logic output is captured by two D flip-flops, one clocked by CK and one by CK + d; the captured values are compared, and logic 1 indicates an error.]
• Hardware overhead is low.
• The performance penalty is about d, the delay between the two sampling clocks, which is also the maximum detectable pulse width.

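The time-redundancy check can be sketched behaviorally: sample the logic output at CK and again at CK + d, and flag an error if the samples differ. The signal model, function names, and time values below are hypothetical, chosen only to show one detected and one undetected transient.

```python
def with_transient(correct_value, start, width):
    """Return a signal(t) that is inverted during [start, start + width)."""
    def signal(t):
        upset = start <= t < start + width
        return correct_value ^ 1 if upset else correct_value
    return signal


def error_flag(signal, t_clk, d):
    """Sample at t_clk and t_clk + d; a mismatch (XOR = 1) flags an error."""
    return signal(t_clk) ^ signal(t_clk + d)


T_CLK, D = 1.0, 0.5  # hypothetical times, arbitrary units
narrow = with_transient(1, start=0.9, width=0.3)  # pulse shorter than d
wide = with_transient(1, start=0.9, width=0.8)    # pulse longer than d
print(error_flag(narrow, T_CLK, D), error_flag(wide, T_CLK, D))
```

A pulse wider than d corrupts both samples in the same way and escapes detection, which is why d bounds the detectable pulse width.
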
Repeat on Error Detection
[Figure: time-redundancy detection circuit (D flip-flops clocked by CK and CK + d, logic 1 indicates an error) with a C-element (C) producing the output.]
Operation: If an error is detected, the output retains its previous value. Repeating the computation can then produce the correct result.

Muller C-Element
[Figure: C-element symbol with inputs A and B and one output, and an equivalent implementation using an SR latch (S, R, Q) driven by A and B.]

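The behavior of a Muller C-element can be stated in a few lines: the output follows the inputs when they agree and holds its previous value when they disagree. The sketch below assumes a non-inverting C-element; the class name is illustrative. One common SR-latch construction sets the latch when A and B are both 1 and resets it when both are 0.

```python
class CElement:
    """Behavioral Muller C-element: output changes only when both inputs agree."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        if a == b:
            self.output = a  # inputs agree: pass the common value through
        # inputs disagree: hold the previous output
        return self.output


c = CElement(initial=0)
trace = [c.update(a, b) for a, b in [(1, 1), (1, 0), (0, 1), (0, 0)]]
print(trace)
```

If A and B carry two copies of the same value and a soft error corrupts only one of them, the inputs disagree and the output keeps the last agreed value.
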
Dynamic CMOS C-Element
[Figure: C-element symbol with inputs A and B, and its dynamic CMOS implementation.]

Pseudostatic CMOS C-Element
[Figure: C-element symbol with inputs A and B, and its CMOS implementation with a weak keeper on the output.]

Built-In Soft Error Resilience (BISER)
[Figure: data from combinational logic feeds a flip-flop and a duplicate flip-flop driven by the same clock; their outputs A and B drive a C-element with a weak keeper on the output.]

BISER
• Assumptions:
    • Most soft errors in combinational logic are eliminated by inertial or logic masking.
    • A soft error pulse generated in a flip-flop is much shorter than the clock period.
    • The probability of either a master or slave latch being struck by a soft error exactly at a clock edge is small.
• The flip-flop is duplicated and both outputs are fed to a C-element.
• A twentyfold reduction in soft errors was observed.
• Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Triple Modular Redundancy (TMR)
[Figure: the inputs drive three copies of the combinational logic (copy 1, copy 2, copy 3); a majority voter combines the three results into the output.]

TMR Error Reduction
Voter input error probability = $E$, assumed independent for each input.
Output error probability:
$e = \mathrm{Prob}(\text{two errors or three errors}) = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3 = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
For very small $E$, $E^3 \ll E^2$, so $e \approx 3E^2$.

TMR Error Probability
[Figure not recovered.]

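The voter output error probability can be evaluated directly from the binomial terms. This is a minimal sketch with an illustrative function name; it assumes independent input errors and an error-free voter.

```python
from math import comb


def tmr_output_error(e_in: float) -> float:
    """Probability that two or three of the three voter inputs are wrong."""
    return sum(comb(3, k) * e_in**k * (1 - e_in) ** (3 - k) for k in (2, 3))


for e_in in (1e-1, 1e-2, 1e-3):
    exact = tmr_output_error(e_in)
    print(f"E = {e_in:g}: e = {exact:.3e}, 3E^2 = {3 * e_in**2:.3e}")
```

The output error equals the input error at E = 0.5, so under these assumptions voting reduces the error probability only when E is below 0.5.
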
Majority Voter Circuit
[Figure: majority voter block with inputs A, B, C and one output, and a gate-level implementation.]

Alternative Implementations of Voter
[Figure: voter implementations with inputs A, B, C and supply VDD, including a look-up table (LUT) holding 0 0 0 1 0 1 1 1.]

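A majority voter is the Boolean function AB + BC + AC. The sketch below is a bitwise version with an illustrative function name. Assuming the LUT cells are ordered from ABC = 000 to ABC = 111, its truth table matches the LUT contents 0 0 0 1 0 1 1 1 in the figure.

```python
from itertools import product


def majority(a, b, c):
    """Bitwise 2-of-3 majority; works on single bits or whole integer words."""
    return (a & b) | (b & c) | (a & c)


# Truth table for ABC = 000, 001, ..., 111.
truth_table = [majority(a, b, c) for a, b, c in product((0, 1), repeat=3)]
print(truth_table)
```

Because the operators are bitwise, the same function votes on every bit position of three integer words at once.
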
Triple Modular Redundancy (TMR)
[Figure: one copy of the combinational logic whose output is captured by D flip-flops on delayed clocks (CK, CK + d, CK + 2d, CK + 3d), with a majority voter producing the output.]

TMR for Memory Cells
[Figure: the combinational logic output is stored in three D flip-flops clocked by CK; a majority voter produces the output.]
• Problems:
    • Accumulation of errors in flip-flops.
    • Voter is not protected.

FF Refresh and TMR for Memory Cells
[Figure: three D flip-flops (r1, r2, r3) clocked by CK and four majority voters; voted values refresh the flip-flops and produce the output.]

Reliability Analysis
• Determine how long a system will work without failure.
• Find:
    • Mean time to failure (MTTF) or mean time between failures (MTBF)
    • Mean time to repair (MTTR)
    • FIT rate

Reliability Function
• Reliability function of a system: $R(t)$ = probability of survival at time $t$.
• It is determined from the failure rates of components: $\lambda(t)$ = number of failures per unit time.
• The failure rate generally varies with time.

Failure Rate, λ(t)
[Figure: failure rate $\lambda(t)$ in failures per second (logarithmic axis, $10^{-12}$ to $10^{0}$) versus time $t$, with three regions: infant mortality, constant failure rate $\lambda(t) = \lambda$ (useful life), and wearout or aging.]

Deriving R(t)
$R(t)$ is the probability of no error in the interval $[0, t]$.
Divide the interval into a large number ($n$) of subintervals of duration $t/n$. Let $x$ be the probability of error in one subinterval. Assume that $t/n$ is so small that at most one error can occur in a subinterval, and that errors in different subintervals are independent.
Then the average number of errors in a subinterval is $0 \cdot (1 - x) + 1 \cdot x = x = \lambda t / n$.
The probability of no error in $[0, t]$ is
$R(t) = (1 - x)^n = (1 - \lambda t / n)^n \to e^{-\lambda t}$ as $n \to \infty$, by the limit $\lim_{n \to \infty} (1 + a/n)^n = e^{a}$.

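The limit in the derivation above can be checked numerically. In this sketch the failure rate and time are hypothetical values chosen so that λt = 1; the function name is illustrative.

```python
import math


def reliability_discrete(lam, t, n):
    """(1 - lam*t/n)**n: probability of no error in n independent subintervals."""
    return (1.0 - lam * t / n) ** n


lam, t = 2.0, 0.5  # hypothetical failure rate and time, so lam * t = 1
exact = math.exp(-lam * t)
for n in (10, 1_000, 100_000):
    approx = reliability_discrete(lam, t, n)
    print(f"n = {n:>7}: {approx:.6f} (limit {exact:.6f})")
```
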
R(t) and MTBF
$R(t) = e^{-\lambda t}$
Failure rate, $\lambda$ = failures per unit time.
Number of failures in time $T$ = $\lambda T$.
$\mathrm{MTBF} = T / (\lambda T) = 1/\lambda = \int_0^\infty R(t)\,dt$
$R(t) = \exp(-t/\mathrm{MTBF})$
For $t = \mathrm{MTBF}$, $R(\mathrm{MTBF}) = e^{-1} = 0.368$.

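The identity between the MTBF and the integral of $R(t)$ can be worked out as follows. The failure-time density $f(t)$ is a symbol introduced here for the derivation and does not appear on the slide; the constant failure rate $\lambda$ is assumed throughout.

```latex
\begin{align*}
f(t) &= -\frac{dR(t)}{dt} = \lambda e^{-\lambda t}
  && \text{(density of the time to failure)} \\
\mathrm{MTBF} &= \int_0^\infty t\, f(t)\, dt
  = \Big[ -t\, R(t) \Big]_0^\infty + \int_0^\infty R(t)\, dt
  && \text{(integration by parts)} \\
&= 0 + \int_0^\infty e^{-\lambda t}\, dt
  = \left[ -\frac{e^{-\lambda t}}{\lambda} \right]_0^\infty
  = \frac{1}{\lambda}
\end{align*}
```

The boundary term vanishes because $t e^{-\lambda t} \to 0$ as $t \to \infty$.
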
Reliability and MTBF
[Figure: reliability $R(t)$ (0.0 to 1.0) versus time $t$ in multiples of MTBF (1, 2, 3 MTBF); $R(t) = 1/e = 0.368$ at $t$ = 1 MTBF.]

Example: First Generation Computer
• 10,000 electron tubes.
• Average burn-out rate: 5 tubes per 100,000 hours.
• MTBF = 100,000/5 = 20,000 hours = 2.3 years, i.e., a 37% chance of survival beyond 2.3 years.
• Time for a 95% chance of survival: $R(t) = \exp(-t/\mathrm{MTBF}) = 0.95$, so $t = -\mathrm{MTBF} \ln 0.95 \approx 1{,}026$ hours, or about 1.4 months.

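The numbers in this example can be reproduced with a short script. It is a sketch under the slide's assumptions (constant failure rate, any tube failure counts as a system failure); the helper name and the 365-day year split into 12 equal months are choices made here.

```python
import math

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12

mtbf_hours = 100_000 / 5  # 5 failures per 100,000 hours


def time_for_survival(probability, mtbf):
    """Solve exp(-t / mtbf) = probability for t."""
    return -mtbf * math.log(probability)


t95 = time_for_survival(0.95, mtbf_hours)
print(f"MTBF = {mtbf_hours:.0f} h = {mtbf_hours / HOURS_PER_YEAR:.1f} years")
print(f"R(MTBF) = {math.exp(-1):.3f}")
print(f"95% survival up to {t95:.0f} h = {t95 / HOURS_PER_MONTH:.1f} months")
```
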
Reliability of TMR
With $R = e^{-\lambda t}$ the reliability of one module:
$R_{\mathrm{TMR}} = \mathrm{Prob}(\text{all three modules correct}) + \mathrm{Prob}(\text{exactly two modules correct}) = R^3 + 3R^2 (1 - R) = 3R^2 - 2R^3 = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
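To see when TMR actually helps, the sketch below compares the TMR reliability with that of a single module. It assumes a perfect voter and independent modules with the same constant failure rate; the function names and the value of λ are illustrative.

```python
import math


def r_simplex(lam, t):
    """Reliability of a single module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """Reliability of TMR with a perfect voter: 3R^2 - 2R^3."""
    r = r_simplex(lam, t)
    return 3 * r**2 - 2 * r**3


lam = 1.0  # hypothetical failure rate, so time is in units of one module's MTBF
for t in (0.1, math.log(2), 1.0):
    print(f"t = {t:.3f}: simplex = {r_simplex(lam, t):.4f}, TMR = {r_tmr(lam, t):.4f}")
```

Since $3R^2 - 2R^3 > R$ only for $0.5 < R < 1$, TMR is more reliable than a single module only for $t < \ln 2 / \lambda$, about 0.69 of one module's MTBF. Integrating the TMR reliability gives a mean time to failure of $5/(6\lambda)$, which is shorter than the $1/\lambda$ of a single module, so without repair or refresh TMR improves short-mission reliability rather than lifetime.
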