Evaluating Impact of Soft-Errors in an Embedded System

Evaluating Impact of Soft-Errors in an Embedded System. Vijay Sheshadri, Graduate Student, Dept. of Electrical Engineering. What is a soft-error? A transient fault caused by cosmic ray particles. Sufficient charge collection causes an erroneous bit-flip.


Presentation Transcript


  1. Evaluating Impact of Soft-Errors in an Embedded System. Vijay Sheshadri, Graduate Student, Dept. of Electrical Engineering

  2. What is a Soft-error?
     • Transient fault caused by cosmic ray particles.
     • Sufficient charge collection causes an erroneous bit-flip.
     [Figure: a charged particle incident on a component creates electron-hole pairs (EHPs), which get collected by the drain; bit values 0 and 1 are marked.]

  3. Soft-error in a System
     [Figure: decision flowchart]
     • Bit read?
       • no: benign fault, no error
       • yes: Bit has error protection?
         • no: Does bit matter?
           • no: benign fault, no error
           • yes: Silent Data Corruption (SDC)
         • yes, error can be corrected (e.g., ECC): no error
         • yes, error is only detected (e.g., parity + no recovery): Detected, but unrecoverable error (DUE)
     Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
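The flowchart on this slide can be read as a small decision function. The sketch below is only an illustration in Python; the function name and its boolean parameters are hypothetical and do not come from the presentation or the cited paper.

```python
def classify_fault(bit_read, protected, correctable, bit_matters):
    """Classify a bit-flip following the slide's flowchart.

    All four parameters are hypothetical booleans chosen for this sketch:
    bit_read     - the faulty bit is read before being overwritten
    protected    - the bit has some error protection (ECC, parity, ...)
    correctable  - the protection can correct the error (e.g., ECC)
    bit_matters  - the bit affects the final program output
    """
    if not bit_read:
        return "benign fault (no error)"
    if protected:
        # Correction removes the error; detection alone gives a DUE.
        return "no error" if correctable else "DUE"
    # Unprotected bit: outcome depends on whether the bit matters.
    return "SDC" if bit_matters else "benign fault (no error)"


print(classify_fault(bit_read=True, protected=False,
                     correctable=False, bit_matters=True))
```

Only unprotected bits that are read and that matter end up as silent data corruption in this reading of the chart.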

  4. Masking of Soft-error
     [Figure: a particle strike in a combinational circuit (inputs I1 to I7, outputs O1 and O2) placed between two register stages. Panels: logical masking, electrical masking, and latching-window masking, each ending in "no soft error"; an unmasked strike ends in "soft error".]

  5. FIT Equation: Vulnerability Factors
     • $\mathrm{FIT} = \sum_i (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
     • Vulnerability Factor = Timing Vulnerability Factor * Architectural Vulnerability Factor
     • Timing Vulnerability Factor (TVF)
       • fraction of time bit is vulnerable
     • Architectural Vulnerability Factor (AVF)
       • fraction of time bit matters for final output of a program
     Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
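As a rough illustration of the FIT equation on this slide, the Python sketch below sums the derated contributions of a few devices. The function name and the device numbers are made up for the example and are not data from the presentation.

```python
def fit_rate(devices):
    """Sum intrinsic error rate * TVF * AVF over all vulnerable devices.

    `devices` is an iterable of (intrinsic_fit, tvf, avf) tuples; the
    tuple layout is an assumption made for this sketch.
    """
    return sum(rate * tvf * avf for rate, tvf, avf in devices)


# Hypothetical devices: (intrinsic FIT, timing VF, architectural VF)
example_devices = [
    (100.0, 0.5, 0.2),  # a structure that matters 20% of the time
    (50.0, 1.0, 0.0),   # a structure whose bits never affect the output
]
print(fit_rate(example_devices))
```

A device with an AVF of zero contributes nothing to the total, however high its intrinsic error rate.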

  6. Architectural Vulnerability Factor
     • Fraction of time bit matters for final output of a program
     • Branch Predictor
       • Doesn't matter at all (AVF = 0%)
     • Program Counter
       • Almost always matters (AVF ~ 100%)
     • Computing AVF for complex structures
       • Statistical Fault Injection
       • ACE (Architecturally Correct Execution) Analysis
     Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

  7. Soft-error & Automobiles
     • Mar 2010 - NHTSA enlisted the NASA Engineering and Safety Center (NESC) to investigate "Unintended Acceleration"
     • Apr 2011 - NESC discounts SEU (single-event upset) in its report to NHTSA, stating that the ICs are manufactured using SOI (silicon-on-insulator) technology
     • As per the AEC-Q100 standard, SEU testing is required for automobile electronics with RAM > 1 Mb

  8. An Example
     • Predicted Block RAM upset rate for a Virtex-5 FPGA = 635 FIT/Mb = 1.52E-05 upsets per day per Mb (1 FIT = 1 failure per 1E+09 device-hours).
     • Ref: A. Lesea, "Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits," WP286 (v1.0), Xilinx, Inc., 2008
     • Assume this FPGA is used in a throttle control module
     • If 500,000 such vehicles are produced by the vendor, then total upsets per day = 1.52E-05 x 500,000 ≈ 7.6 vehicle upsets per day (the arithmetic implies 1 Mb of Block RAM per vehicle)
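The unit conversion behind this example can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function name is hypothetical, and the figure of 1 Mb of Block RAM per vehicle is an assumption implied by the slide's numbers rather than stated in it.

```python
HOURS_PER_DAY = 24
FIT_HOURS = 1e9  # 1 FIT = 1 failure per 1e9 device-hours


def fit_to_upsets_per_day(fit):
    """Convert a FIT rate into expected upsets per day."""
    return fit * HOURS_PER_DAY / FIT_HOURS


per_day_per_mb = fit_to_upsets_per_day(635)  # 635 FIT/Mb from the slide
vehicles = 500_000
mb_per_vehicle = 1                           # assumption, see lead-in
fleet_upsets_per_day = per_day_per_mb * mb_per_vehicle * vehicles

print(f"{per_day_per_mb:.3e} upsets/day/Mb")
print(f"{fleet_upsets_per_day:.2f} upsets/day across the fleet")
```

The fleet figure of about 7.6 follows from the unrounded per-day rate; multiplying the rounded 1.5E-05 by 500,000 would give 7.5.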

  9. Soft-error Mitigation
     • Robust (radiation-hardened) circuit designs are resilient to soft-errors
     • Soft-error mitigation at
       • Device-level: silicon-on-insulator, triple-well
       • Circuit-level: DICE cell, triple-modular redundancy
       • Architecture-level: RMT, lock-stepping, ECC

  10. Soft-error Mitigation
     • Soft-error mitigation techniques incur penalties in
       • area (spatial redundancy)
       • timing (temporal redundancy)
     • Selective hardening of the components for reduced penalty
       • Often based on logical/electrical/timing derating
     • A low-cost mitigation technique is proposed for critical applications, based on application derating
       • Certain applications can mask or recover from transient faults*
     Ref: V. Wong et al., "Soft Error Resilience of Probabilistic Inference Applications," SELSE II, 2006

  11. Critical Application - An Analogy
     [Figure: applications handled by the micro-controller: Airbag deployment, Climate monitor/display, GPS, Cruise control]
     • A micro-controller embedded in a car dashboard may be handling many applications.
     • A critical application in this case could be 'Airbag deployment'.
     • A soft-error (SE) during this application could be catastrophic

  12. Target Module
     [Figure: block diagram with CPU core, ADC, PWM, and Motor]
     • PWM: the output is a pulse whose width decides the speed of the motor.
     • Etpwmi0 module
       • ~800 FFs & ~3000 logic gates
       • 180-nm CMOS technology, 80 MHz frequency

  13. Basic Simulation Steps*
     • Pre-analysis: Identify components utilized by the critical application
     • Fault injection: Inject a single fault at a random time instance by depositing the opposite value on the component
     • Error metric:
       • Error count => no. of mismatches between output and reference
       • PW count => no. of clock cycles the output is '1', as compared to the reference
     Ref: J. Blome et al., "Cost-Efficient Soft Error Protection for Embedded Microprocessors," CASES, 2006
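The two error metrics on this slide could be computed from per-cycle samples of the output as sketched below. This is a minimal Python illustration: the function names are hypothetical, and reading "PW count" as the difference in the number of high cycles is an assumption about what the slide intends.

```python
def error_count(output, reference):
    """Number of clock cycles where the output differs from the reference."""
    if len(output) != len(reference):
        raise ValueError("traces must cover the same number of cycles")
    return sum(o != r for o, r in zip(output, reference))


def pw_count_difference(output, reference):
    """Difference in the number of cycles the output is '1' (assumed reading)."""
    return sum(output) - sum(reference)


# Hypothetical per-cycle samples: the faulty pulse is delayed by one cycle.
reference = [0, 1, 1, 1, 0, 0]
faulty = [0, 0, 1, 1, 1, 0]
print(error_count(faulty, reference))          # mismatching cycles
print(pw_count_difference(faulty, reference))  # change in pulse width
```

In this example the pulse is shifted but keeps its width, so the error count is non-zero while the pulse-width difference is zero, which is why the two metrics are tracked separately.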

  14. Simulation tools
     • Verilog netlist simulated with timing information, using Synopsys VCS
     • Fault-injection module coded in C
     • Uses VPI (Verilog Procedural Interface) functions to
       • Access a net in the netlist (through a vpiHandle)
       • Read the value of the net (vpi_get_value)
       • Overwrite the value of the net (vpi_put_value)

  15. Simulation – Pre-analysis
     • Pre-analysis
       • Categorize FFs based on their activity
         • Low-activity FFs (no. of toggles less than 2)
         • High-activity FFs (no. of toggles higher than 2)
       • Opposite values are forced and the output pulse is observed for errors
       • FFs in which errors were observed are identified and subjected to fault-injection
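The activity-based categorization could look like the Python sketch below. The trace format and function names are hypothetical. The slide does not say how an FF with exactly 2 toggles is classified; this sketch assumes it counts as high-activity.

```python
def count_toggles(trace):
    """Count value changes between consecutive samples of one flip-flop."""
    return sum(a != b for a, b in zip(trace, trace[1:]))


def categorize_ffs(traces, threshold=2):
    """Split FFs into low- and high-activity groups by toggle count.

    Assumption: fewer than `threshold` toggles is low-activity, and
    everything else (including exactly `threshold`) is high-activity.
    """
    groups = {"low": [], "high": []}
    for name, trace in traces.items():
        key = "low" if count_toggles(trace) < threshold else "high"
        groups[key].append(name)
    return groups


# Hypothetical sampled values of three flip-flops.
traces = {
    "ff_a": [0, 0, 0, 1, 1],  # 1 toggle
    "ff_b": [0, 1, 0, 1, 0],  # 4 toggles
    "ff_c": [0, 1, 0, 0, 0],  # 2 toggles
}
print(categorize_ffs(traces))
```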

  16. Simulation – Fault-injection
     [Figure: the test bench (Verilog) passes the original value to the fault-injection module (C+VPI), which returns the modified value; a timing diagram marks the fault-injection window over the output pulse.]
     • Fault injection
       • For the FFs obtained from pre-analysis, inject a fault at a random instance of time (within the time interval of the first output pulse)
       • Measure Error count & PW count. Identify FFs with error within acceptable limits
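To make the injection procedure concrete, here is a toy Python model, not the author's C+VPI flow. It simulates a hypothetical 3-bit PWM with three registers (`counter`, `compare`, and an unused `spare`), flips one random bit at a random cycle inside the first output pulse, and reports the fraction of injections that change the output. Every name and parameter here is an assumption made for illustration.

```python
import random


def simulate_pwm(cycles, period=8, compare=3, fault=None):
    """Toy PWM: output is 1 while counter < compare.

    `fault` is None or a (cycle, register, bit) tuple; at that cycle the
    named register has the given bit inverted before the output is formed.
    """
    regs = {"counter": 0, "compare": compare, "spare": 0}
    out = []
    for t in range(cycles):
        if fault is not None and fault[0] == t:
            _, reg, bit = fault
            regs[reg] ^= 1 << bit  # deposit the opposite value on one bit
        out.append(1 if regs["counter"] < regs["compare"] else 0)
        regs["counter"] = (regs["counter"] + 1) % period
    return out


def run_campaign(register, trials=200, cycles=32, seed=0):
    """Fraction of single-bit injections into `register` that alter the output."""
    rng = random.Random(seed)
    golden = simulate_pwm(cycles)
    errors = 0
    for _ in range(trials):
        # Cycles 0..2 are the first output pulse for compare=3.
        fault = (rng.randrange(3), register, rng.randrange(3))
        if simulate_pwm(cycles, fault=fault) != golden:
            errors += 1
    return errors / trials


for reg in ("counter", "compare", "spare"):
    print(reg, run_campaign(reg))
```

In this toy model a flip in the never-read `spare` register is always masked, while a flip in `compare` persists and corrupts every later pulse. The real netlist-level results depend on the actual circuit and are reported on the results slides.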

  17. Absolute error vs. Acceptable error
     • Absolute error: raise an error flag for any mismatch between the output pulse and the reference
     • Acceptable error: raise an error flag only if the mismatch between the output pulse and the reference lies outside the tolerance limit*
     • Examples:
       • Delayed pulse
       • Self-correcting pulse
     [Figure: two waveform panels, one per example, each showing the target FF, the actual output, and the reference copy, with the fault-injection point marked; the delayed-pulse panel also marks the delay.]
     Ref: X. Li et al., "Exploiting Soft Computing for Increased Fault Tolerance," Workshop on Architectural Support for Gigascale Integration, 2006
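The difference between the two criteria can be sketched in Python as follows. The slide does not give the tolerance limit or its unit, so `tolerance_cycles` (a maximum number of mismatching clock cycles) is a hypothetical parameter chosen for this example, as are the function names.

```python
def absolute_error(output, reference):
    """Flag any mismatch between the output pulse and the reference."""
    return output != reference


def acceptable_error(output, reference, tolerance_cycles):
    """Flag only when the mismatch exceeds the tolerance (assumed in cycles)."""
    mismatches = sum(o != r for o, r in zip(output, reference))
    return mismatches > tolerance_cycles


# Hypothetical delayed pulse: same width, shifted by one clock cycle.
reference = [0, 1, 1, 1, 0, 0, 0, 0]
delayed = [0, 0, 1, 1, 1, 0, 0, 0]
print(absolute_error(delayed, reference))       # flagged
print(acceptable_error(delayed, reference, 2))  # within tolerance
```

With a tolerance of two cycles, the delayed pulse is flagged under the absolute criterion but not under the acceptable-error criterion.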

  18. Simulations – Combinational logic
     • Fault injection steps:
       • SE modeled as a 1 ns pulse (system clock frequency = 80 MHz)
       • Transient pulse injected onto the gate output
       • Target combinational circuit selected at random
     • Example: 2-input NAND gate
     [Figure: a 2-input NAND gate with inputs A and B and output Y, with the injected fault on Y; waveforms compare the actual output with the reference copy.]

  19. Results
     • Pre-analysis: ~18% of the FFs are used by the application
     • Fault-injection: the number of faults injected is proportional to the number of flip-flops in the group
     • Low-toggle FFs are more numerous, hence the number of faults injected in low-toggle FFs is higher

  20. Results
     • Low-toggle FFs are more vulnerable to soft-errors, since an erroneous bit-flip may remain unchanged
     • A high-toggle FF is written very often, so an erroneous bit-flip has a higher probability of getting overwritten

  21. Computing AVF
     • AVF = Pe * % component
     • Pe = probability that a fault injected in the component results in an error = (no. of errors) / (no. of faults injected)
     • % component = the percentage of that component with respect to the total number of components
     Example: for a latch,
       a. if # errors = 50% of injected faults (Pe = 0.5)
       b. if latches make up 20% of the circuit
       then AVF = 0.5 x 0.2 = 0.1
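The AVF formula used on this slide is simple enough to check directly. The Python sketch below reproduces the latch example; the function name and the count of 100 injected faults are hypothetical, chosen only so that Pe comes out to 0.5.

```python
def avf(num_errors, num_faults, component_fraction):
    """AVF as defined on the slide: Pe * fraction of the circuit.

    Pe is the number of observed errors divided by the number of
    injected faults; component_fraction is given as 0..1, not percent.
    """
    if num_faults <= 0:
        raise ValueError("at least one fault must be injected")
    pe = num_errors / num_faults
    return pe * component_fraction


# Latch example: 50 errors out of 100 injected faults, latches = 20% of circuit.
print(avf(num_errors=50, num_faults=100, component_fraction=0.2))
```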

  22. AVF - Results
     • Low-activity FFs have a higher Pe and are more numerous; hence they have a higher AVF
     • Combinational logic, though high in number, has Pe ~4E-03, causing its AVF to drop

  23. Summary
     • Fault-resilience scheme for critical applications using application derating and inherent error tolerance
     • For the application considered,
       • ~12% of the sequential logic was safety critical (previous work reports 30% of sequential logic hardened for 99% fault coverage in an ARM embedded processor running an image processing algorithm)
       • failures in combinational logic were negligible
     • The worst-case scenario would only be the same as radiation-hardening a generic system
       • i.e., all the hardware is identified as safety-critical

  24. Future Work
     • Perform fault-injection analysis on the processor core managing the control loop
     • Conduct neutron beam experiments on the circuit to compare with simulations and find the FIT rate
     • Implement circuit hardening and test the system to ascertain its robustness
