Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs


  1. Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs Test & Reliability Group (TRG) Department of Electrical & Computer Engineering Northeastern University

  2. Outline • Problem Statement & Motivation • Soft Errors Background & Previous work • Error Models in FPGAs • SER Estimation • Experimental Results • Summary & conclusions

  3. Problem Statement • Estimating soft error rate in FPGAs • The probability of system failure • Due to soft errors • For a given mapped design • Mean time to manifest a corrupted conf. bit • To primary outputs or Flip-flops

  4. Motivation • Need for soft error rate estimation • Exponential growth of vulnerable bits due to Moore’s law • High cost of Error tolerant schemes • To make appropriate cost/reliability trade-offs • Where to put redundancy • Why an analytical method? • Previous work: Fault Injection • Time-consuming / Incomplete / Expensive • Needs physical prototype board • Cannot be used in design phases

  5. Background: Error Definitions • Soft Errors: • Intermittent malfunctions of the hardware • Not reproducible • Energetic particles → Single Event Upsets (SEUs) → Soft errors → (may cause) system failure

  6. Previous Work • Based on Fault Injection (FI) • Inject fault • Run several workloads • Compare results with fault-free circuit • Exhaustive FI is very time-consuming • Candidate some locations for FI • Analysis based on statistics

  7. Previous Work (Cont.) • Radiation-based fault injection • Expensive & not commonly used • Needs physical implementation • Cannot be used during design phases • Can damage prototype board → Hard error • Simulation-based fault injection • Bit-stream alteration • Needs physical implementation • Bridging errors may lead to hard errors

  8. Outline • Problem Statement & Motivation • Soft Errors Background & Previous work • Error Models in FPGAs • SER Estimation • Experimental Results • Summary & conclusions

  9. Error Models in FPGAs • Memory resources: • User bits • Flip-flops, RAMs, … • Configuration bits • Mux select bits, LUT bits, … • User bits → Transient errors • Config. bits → Permanent errors

  10. Error Models in FPGAs (Cont.) • Bit flip • Transient error • Can be corrected at the next load • Bit flip • Permanent error • Corrected by reconfiguration • Short or open circuit • Corrected by reconfiguration • Figure: SEU (bit flip) in a configuration memory cell (M) of a Xilinx Virtex FPGA, with labels LUT, ff, BlockRAM, and clk. © Lima (DAC03)

  11. Error Models in FPGAs (Cont.) • Transient errors • User flip-flops, Logic gates, Block RAMs • Permanent errors (all configuration bits) • Routing: • MUX select bits • PIP: Short/Open • Buffer: On/Off • LUT • Control/Clocking Bits

  12. Error Models in FPGAs (Cont.) • Only permanent errors considered • Conf. bits comprise more than • 99% of all memory elements excluding RAM blocks • 95% of all memory elements including RAM blocks

  13. Outline • Problem Statement & Motivation • Soft Errors Background & Previous work • Error Models in FPGAs • SER Estimation • Experimental Results • Summary & conclusions

  14. SER Estimation • Traversing structural paths [Asadi04] • From fault sites to POs

  15. SER Estimation in ASIC Designs • S(n): System failure probability (SFP) vector • Si: SFP given node i erroneous • n: total fault sites • Experiments on ISCAS89 show that: • Three orders of magnitude faster • Compared to random-input simulation • Average accuracy: 97%

  16. FPGA vs. ASIC in SER Estimation • ASIC: transient error • Only requires propagation probability • FPGA: both transient & permanent errors • Transient errors: the same • Permanent errors: needs activation as well • Nodes with different error rates in FPGAs • Fault sites: all nodes

  17. SER Estimation of FPGAs: Steps • Compute permanent error rates for all nodes • PRi: the permanent error rate of node i • n: total number of fault sites • Compute netlist failure probability vector • Ni = failure prob. given node i erroneous • System failure rate vector (S) = PR × N • Si = PRi × Ni
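
The element-wise product on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `system_failure_rates` and the sample values are hypothetical and do not come from the presentation.

```python
def system_failure_rates(pr, n):
    """Return S with S[i] = PR[i] * N[i] for every fault site i."""
    if len(pr) != len(n):
        raise ValueError("PR and N must have one entry per fault site")
    return [pr_i * n_i for pr_i, n_i in zip(pr, n)]


# Hypothetical inputs for three fault sites (not measured data).
pr = [0.02, 0.06, 0.01]   # permanent error rate of each node
n = [0.5, 0.25, 1.0]      # failure probability given that node is erroneous
s = system_failure_rates(pr, n)
```

A node contributes little to S when either its error rate is low or its errors rarely reach an output, which is why both vectors are needed.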

  18. How to Compute Ni? • Open & stuck-at errors: • Ni = [SPi × PPi(0) + (1 − SPi) × PPi(1)] = PPi • PPi: Propagation prob. (the method used for ASIC) • SP: Signal probability is used for activation prob. • Bridging wired-AND error (nets i and j): • Ni = [SPi × (1 − SPj) × PPi(0)] + [(1 − SPi) × SPj × PPj(0)] • Bridging wired-OR error (nets i and j): • Ni = [SPi × (1 − SPj) × PPj(1)] + [(1 − SPi) × SPj × PPi(1)]
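
The three Ni formulas above translate directly into code. The sketch below assumes that PPi(v) is the propagation probability when net i carries the erroneous value v, which is an interpretation of the slide rather than something it states. All function and parameter names are hypothetical.

```python
def n_open_or_stuck(sp_i, pp_i0, pp_i1):
    """Open / stuck-at error on net i:
    Ni = SPi * PPi(0) + (1 - SPi) * PPi(1)."""
    return sp_i * pp_i0 + (1 - sp_i) * pp_i1


def n_wired_and(sp_i, sp_j, pp_i0, pp_j0):
    """Wired-AND bridge between nets i and j: the net that should be 1
    is pulled to 0 when the other net is 0."""
    return sp_i * (1 - sp_j) * pp_i0 + (1 - sp_i) * sp_j * pp_j0


def n_wired_or(sp_i, sp_j, pp_i1, pp_j1):
    """Wired-OR bridge between nets i and j: the net that should be 0
    is pulled to 1 when the other net is 1."""
    return sp_i * (1 - sp_j) * pp_j1 + (1 - sp_i) * sp_j * pp_i1
```

When PPi(0) and PPi(1) are equal, the open/stuck-at expression collapses to PPi regardless of SPi, which matches the "= PPi" on the slide.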

  19. How to Compute PRi? • PR(n): permanent error rate vector • PRi = r × f • r: Raw error rate of an SRAM cell • f: Number of all possible errors at node i • n: total number of fault sites • PRAB = 6 × r
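
As a small worked example of PRi = r × f, the sketch below uses the raw rate r = 0.01 FIT/bit quoted in the experimental setup and the factor f = 6 from the PRAB example. The second node and the names `permanent_error_rate` and `error_counts` are hypothetical.

```python
def permanent_error_rate(r, f):
    """PRi = r * f: raw SRAM-cell error rate times the number of
    possible errors at node i."""
    if r < 0 or f < 0:
        raise ValueError("r and f must be non-negative")
    return r * f


r = 0.01  # FIT/bit, the raw rate used in the experiments
# Number of possible errors per node; "AB" uses f = 6 as on the slide,
# "CD" is a made-up entry for illustration.
error_counts = {"AB": 6, "CD": 3}
pr = {node: permanent_error_rate(r, f) for node, f in error_counts.items()}
```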

  20. System Failure Rate • For the first clock: • For c clock cycles: • The same probability is valid for the next clock cycles • c: Number of clocks checking the state of the circuit • After particle hit
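
The equations for this slide are not preserved in the transcript. One common way to extend a single-cycle failure probability to c cycles, assuming each cycle fails independently with the same probability p, is 1 − (1 − p)^c. The sketch below shows that assumption only and may differ from the authors' formula. The name `failure_within_cycles` is hypothetical.

```python
def failure_within_cycles(p_single, c):
    """Probability of at least one failure in c cycles, ASSUMING each
    cycle fails independently with the same probability p_single."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if c < 0:
        raise ValueError("c must be non-negative")
    return 1.0 - (1.0 - p_single) ** c


# Hypothetical per-cycle probability, checked over the 1000 cycles
# used in the experimental setup.
p_1000 = failure_within_cycles(0.001, 1000)
```

Under this assumption the probability grows monotonically with c and approaches 1, so the choice of c bounds how much of the eventual failure probability is observed.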

  21. Outline • Problem Statement & Motivation • Soft Errors Background & previous work • Error Models in FPGAs • SER Estimation • Experimental Results • Summary & conclusions

  22. Error List • Mux-open • PIP open • Buffer off • A bit-flip in LUT • Control bit-flip

  23. Experimental Setup • Xilinx Virtex 300 (XCV300) • Xilinx Design Language (XDL) • Benchmark: some ISCAS89 circuits • r = raw failure rate for an SRAM cell • r=0.01 FIT/bit • 1000 clocks executed for each SEU • Platform: Sun Solaris Ultra-10 • 256 MB Main Memory

  24. Results: Sensitive Bits Number of sensitive SRAM bits for each part

  25. Results: Manifestation Time Mean Time To Manifest (MTTM) errors to outputs (Results are in terms of cycles)

  26. Results: SFR & Estimation Time System Failure Rate & Estimation Time Number of Clock cycles: 1000 SP Time: Signal Probability computation time SFR Time: System Failure Rate computation time

  27. Summary & Conclusions • A new approach for SER estimation • For SRAM-based FPGAs • No physical implementation required • Can be used in early design stages • Very fast simulation time • Can cover all possible faults • Mean Time To Manifest errors to outputs: • MTTM(Control/clocking) < MTTM(routing) • MTTM(routing) < MTTM(LUT)

  28. Appendix & Backup

  29. Background: Soft Error Origin • The main sources in terrestrial conditions: • Alpha particles & Neutrons • Soft Error occurs: • if hitting particles generate more than Qcrit • Critical Charge (Qcrit): • the minimum charge needed • To flip the value stored in the cell

  30. Exponential Increase of Soft Errors • e^(−Qcrit/Qs) trend with technology scaling (Shivakumar, DSN 2002) • Qcrit: the critical charge (depends on characteristics of the circuit) • Qs: the charge collection efficiency of a particle strike on the device • Particles of lower energies occur far more frequently
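
To see how strongly the exponential term reacts to a shrinking critical charge, the sketch below evaluates exp(−Qcrit/Qs) on its own. It leaves out any flux or area prefactor, the charge values are hypothetical and unitless, and the name `ser_scaling_factor` is not from the presentation.

```python
import math


def ser_scaling_factor(q_crit, q_s):
    """Return exp(-Qcrit / Qs), the exponential term of the SER trend."""
    if q_s <= 0:
        raise ValueError("q_s must be positive")
    return math.exp(-q_crit / q_s)


# Hypothetical values: halving Qcrit at a fixed Qs.
before = ser_scaling_factor(q_crit=10.0, q_s=2.0)
after = ser_scaling_factor(q_crit=5.0, q_s=2.0)
ratio = after / before  # equals exp(2.5), roughly a 12x increase
```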

  31. Background: Definitions • How to express Soft Error Rate (SER) • MTBF (Mean Time Between Failures) • FIT (Failure-in-Time) • 1 failure in a billion hours • 1 year MTBF = 114,155 FIT
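
The MTBF/FIT relationship on this slide can be checked with a short conversion. The sketch assumes a 365-day year (8,760 hours). The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a 365-day year
FIT_HOURS = 1e9            # 1 FIT = 1 failure per billion device-hours


def mtbf_hours_to_fit(mtbf_hours):
    """Convert an MTBF in hours to a failure rate in FIT."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return FIT_HOURS / mtbf_hours


def fit_to_mtbf_hours(fit):
    """Convert a failure rate in FIT to an MTBF in hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return FIT_HOURS / fit


one_year_fit = mtbf_hours_to_fit(HOURS_PER_YEAR)  # about 114,155 FIT
```

The two functions are inverses of each other, so a low FIT value corresponds to a long MTBF and vice versa.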

  32. Background: Definitions • Failure definition: • (a) Propagation of an erroneous value • to at least one flip-flop or primary output or • (b) Propagation of an erroneous value • to at least one primary output • Definition (a) is compatible with (b) • If there is no redundant flip-flop in the circuit

  33. Failure Error Rate of LUT • To reduce number of nodes • LUT as a complex gate • P(tx): the probability of O=tx • LUT failure rate • SO = [AP(t0) + AP(t1) + … + AP(t15)] × r × NO • = r × NO

  34. Xilinx Virtex FPGA Model • Figure labels: CLB (logic block), IO Mux, Switch Matrix (SM), Line Segments, IOB

  35. CLB Architecture

  36. Error Models in FPGAs (Cont.) • Config. Bits: • Care bits • All 1s • Some of 0s • Don’t care bits • Some of 0s

  37. Error Models: PIP Short/Open • 1→0: causes open • 0→1: may cause short or bridging error

  38. Error Models (Cont.) • Buffer on/off • Tri-state buffers • Used in IOBs • Look-Up Table
