Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs Test & Reliability Group (TRG) Department of Electrical & Computer Engineering Northeastern University
Problem Statement
• Estimating the soft error rate in FPGAs
  • The probability of system failure
  • Due to soft errors
  • For a given mapped design
• Mean time to manifest a corrupted configuration bit
  • To primary outputs or flip-flops
Motivation
• Need for soft error rate estimation
  • Exponential growth of vulnerable bits due to Moore’s law
  • High cost of error-tolerant schemes
  • To make appropriate cost/reliability trade-offs
  • Where to put redundancy
• Previous work: fault injection
  • Time-consuming / incomplete / expensive
  • Needs a physical prototype board
  • Cannot be used in early design phases
  • Prototype board can be damaged (hard error)
Error Models in FPGAs
• Memory resources:
  • User bits: flip-flops, RAMs, …
  • Configuration bits: mux select bits, LUT bits, PIPs, …
• User bits → transient errors
• Configuration bits → permanent errors (they persist until the configuration is rewritten)
SER Estimation
• Traversing structural paths [Asadi04]
  • From error sites to outputs
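As a rough illustration of "traversing structural paths from error sites to outputs", the sketch below walks the fan-out of a netlist breadth-first and collects the outputs an error site can reach. It only shows structural reachability, not the probability computation of [Asadi04]; the netlist, the node names, and the function `reachable_outputs` are hypothetical.

```python
from collections import deque


def reachable_outputs(fanout, error_site, outputs):
    """Return the outputs (primary outputs or flip-flops) structurally
    reachable from error_site.

    fanout  -- dict mapping each node to the nodes it drives
    outputs -- set of nodes treated as observation points
    """
    seen = {error_site}
    queue = deque([error_site])
    reached = set()
    while queue:
        node = queue.popleft()
        if node in outputs:
            reached.add(node)
            continue  # stop at an observation point
        for succ in fanout.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return reached


# Hypothetical five-node netlist, for illustration only.
fanout = {
    "a": ["g1"],
    "b": ["g1", "g2"],
    "c": ["g2"],
    "g1": ["out1"],
    "g2": ["ff1"],
}
outputs = {"out1", "ff1"}

print(sorted(reachable_outputs(fanout, "b", outputs)))
```

An error site with no structural path to any output can never cause a system failure, so a traversal like this also identifies nodes whose failure probability is zero.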
SER Estimation in ASIC Designs
• S(n): system failure probability (SFP) vector
  • Si: SFP given that node i is erroneous
  • n: total number of fault sites
• Experiments on ISCAS89 show:
  • Three orders of magnitude faster than random-input simulation
  • Accuracy: more than 90%
FPGA vs. ASIC in SER Estimation
• ASIC: transient errors
  • Only requires propagation probability
• FPGA: both transient and permanent errors
  • Transient errors: the same as in ASICs
  • Permanent errors: need activation probability as well
• More error sites in FPGAs
  • Routing signals
FPGA vs. ASIC in SER Estimation
• Nodes with different error rates in FPGAs
• No attenuation in FPGAs during propagation
SER Estimation of FPGAs: Steps
• Compute permanent error rates for all nodes
  • PRi: the permanent error rate of node i
  • n: total number of fault sites
• Compute the netlist failure probability vector
  • Ni = failure probability given that node i is erroneous
• System failure rate vector: S = PR × N (element-wise)
  • Si = PRi × Ni
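The last step is an element-wise product of the two vectors. A minimal sketch, assuming PR and N are plain Python lists indexed by fault site; the function name and the numeric values are made up for illustration.

```python
def system_failure_rates(pr, n):
    """Return S, where S[i] = PR[i] * N[i] for every fault site i."""
    if len(pr) != len(n):
        raise ValueError("PR and N must have one entry per fault site")
    return [pr_i * n_i for pr_i, n_i in zip(pr, n)]


# Illustrative values only: three fault sites.
pr = [0.06, 0.02, 0.01]   # permanent error rates PRi (FIT)
n = [0.5, 0.25, 1.0]      # netlist failure probabilities Ni

s = system_failure_rates(pr, n)
print(s)
```

Sites with a high error rate but a low failure probability (or the reverse) contribute little to S, which is why both vectors are needed to decide where redundancy pays off.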
How to Compute Ni?
• Open & stuck-at errors:
  • Ni = [SPi × PPi(0) + (1 − SPi) × PPi(1)] = PPi
  • PPi: propagation probability (the method used for ASICs)
  • SP: signal probability, used as the activation probability
• Bridging wired-AND & wired-OR errors (nets i and j):
  • Ni(Wand) = [SPi × (1 − SPj) × PPi(0)] + [(1 − SPi) × SPj × PPj(0)]
  • Ni(Wor) = [SPi × (1 − SPj) × PPj(1)] + [(1 − SPi) × SPj × PPi(1)]
• LUT bit-flip:
  • Ni = activation probability (cell) × propagation probability (LUT output)
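The formulas on this slide translate directly into small functions. The sketch below is one possible transcription, not the presenters' tool; the function and parameter names are assumptions, with `pp0`/`pp1` standing for PP(0)/PP(1) and `sp` for the signal probability SP.

```python
def n_open_or_stuck(sp_i, pp0_i, pp1_i):
    """Open / stuck-at error at node i:
    Ni = SPi * PPi(0) + (1 - SPi) * PPi(1)."""
    return sp_i * pp0_i + (1 - sp_i) * pp1_i


def n_wired_and(sp_i, sp_j, pp0_i, pp0_j):
    """Wired-AND bridge between nets i and j: whichever net carries 1
    while the other carries 0 is pulled to 0."""
    return sp_i * (1 - sp_j) * pp0_i + (1 - sp_i) * sp_j * pp0_j


def n_wired_or(sp_i, sp_j, pp1_i, pp1_j):
    """Wired-OR bridge between nets i and j: whichever net carries 0
    while the other carries 1 is pulled to 1."""
    return sp_i * (1 - sp_j) * pp1_j + (1 - sp_i) * sp_j * pp1_i


def n_lut_bit_flip(activation_prob, pp_lut_output):
    """LUT bit-flip: probability that the flipped cell is addressed,
    times the propagation probability of the LUT output."""
    return activation_prob * pp_lut_output


# Illustrative numbers only.
print(n_open_or_stuck(0.5, 0.4, 0.4))
print(n_wired_and(0.5, 0.5, 0.8, 0.4))
print(n_wired_or(0.5, 0.5, 0.8, 0.4))
print(n_lut_bit_flip(0.25, 0.5))
```

When PPi(0) and PPi(1) are equal, the open/stuck-at expression collapses to PPi for any SPi, which matches the "= PPi" on the slide.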
How to Compute PRi?
• PR(n): permanent error rate vector
• PRi = r × f
  • r: raw error rate of an SRAM cell
  • f: number of all possible errors at node i
  • n: total number of error sites
• Example: PRAB = 6 × r (f = 6)
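A short worked instance of PRi = r × f, using the r = 0.01 FIT/bit quoted in the experimental setup and the f = 6 from the PRAB example. The function name is an assumption made for this sketch.

```python
R_FIT_PER_BIT = 0.01  # raw error rate of one SRAM cell, from the experimental setup


def permanent_error_rate(f, r=R_FIT_PER_BIT):
    """PRi = r * f, where f is the number of possible errors at node i."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return r * f


# The slide's example: six possible errors at the node, so PRAB = 6 * r.
pr_ab = permanent_error_rate(6)
print(round(pr_ab, 6))
```

Because r is the same for every SRAM cell, the differences between nodes come entirely from f, that is, from how many configuration bits can corrupt each node.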
System Failure Rate
• For the first clock:
• For c clock cycles:
  • The same probability is valid for the next clock cycles
  • c: number of clocks checking the state of the circuit after the particle hit
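The slide states that the same probability applies in each subsequent clock cycle. If one additionally assumes that cycles are independent (an assumption made here, not stated on the slide), the number of cycles until an error manifests follows a geometric distribution. The sketch below illustrates that model only; it is not necessarily the presenters' formula, and the per-cycle probability `p` and both function names are hypothetical.

```python
def prob_manifest_within(p, c):
    """Probability that an error manifests within c cycles, assuming an
    independent per-cycle manifestation probability p (assumption)."""
    if not 0.0 <= p <= 1.0 or c < 0:
        raise ValueError("need 0 <= p <= 1 and c >= 0")
    return 1.0 - (1.0 - p) ** c


def mean_cycles_to_manifest(p):
    """Mean of the geometric distribution: expected cycles until the
    error first manifests, under the same independence assumption."""
    if not 0.0 < p <= 1.0:
        raise ValueError("need 0 < p <= 1")
    return 1.0 / p


print(prob_manifest_within(0.5, 2))   # 0.75
print(mean_cycles_to_manifest(0.25))  # 4.0
```

Under this model a low per-cycle probability means a long mean time to manifest, which is the quantity reported as MTTM in the results.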
Error List
• Mux open
• PIP open
• Buffer off
• A bit-flip in a LUT
• Control/clocking bit-flip
Experimental Setup
• Xilinx Virtex 300 (XCV300)
• Xilinx Design Language (XDL)
• Benchmark: some ISCAS89 circuits
• r = raw failure rate for an SRAM cell
  • r = 0.01 FIT/bit
• 1000 clocks executed for each SEU
• Platform: Sun Ultra-10 running Solaris
  • 256 MB main memory
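For readers unfamiliar with the unit, one FIT is one failure per 10^9 device-hours. The sketch below converts the per-bit rate into a raw upset rate for a group of bits; the bit count is a made-up placeholder rather than a figure from the experiments, and the function names are assumptions.

```python
HOURS_PER_FIT = 1e9    # 1 FIT = 1 failure per 1e9 device-hours
R_FIT_PER_BIT = 0.01   # raw failure rate of an SRAM cell, from the setup


def raw_rate_fit(num_bits, r=R_FIT_PER_BIT):
    """Raw upset rate, in FIT, of num_bits SRAM cells."""
    return num_bits * r


def fit_to_failures_per_hour(fit):
    """Convert a rate in FIT to failures per hour."""
    return fit / HOURS_PER_FIT


# Hypothetical count of sensitive bits, for illustration only.
sensitive_bits = 100_000
fit = raw_rate_fit(sensitive_bits)
print(round(fit, 3), "FIT")
print(fit_to_failures_per_hour(fit), "failures per hour")
```

This is only the raw upset rate of the cells; the system failure rate is lower because each upset must still be activated and propagated to an output.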
Results: Sensitive Bits
Number of sensitive SRAM bits for each part
Results: SFR & Estimation Time
System failure rate and estimation time (number of clock cycles: 1000)
SP Time: signal probability computation time
SFR Time: system failure rate computation time
Results: Manifestation Time
Mean Time To Manifest (MTTM) errors to outputs (results are in cycles)
Summary & Conclusions
• A new approach for SER estimation
  • For SRAM-based FPGAs
  • No physical implementation required
  • Can be used in early design stages
  • Very fast estimation time
  • Can cover all possible faults
• Mean Time To Manifest errors to outputs:
  • MTTM(control/clocking) < MTTM(routing)
  • MTTM(routing) << MTTM(LUT)
Questions? Thanks