480 likes | 505 Views
Explore the challenges of fault detection in electronic circuits due to physical media failures and statistical behaviors at the nanoscale. Learn about error detection schemes, system reliability, reliability tuning, and subsystem guarding strategies.
E N D
CS137:Electronic Design Automation Day 9: October 17, 2005 Fault Detection
Today • Faults in Logic • Error Detection Schemes • Optimization Problem
Problem • Gates, wires, memories: • built out of physical media • may fail
Device Physics • Represent a 1 or 0 with charge • On a gate, in a memory • Charge may be disrupted • -particle (other ionizing particles) • Ground bounce • Noise coupling • Tunneling • Thermal noise • Behavior of individual electrons is statistical
DRAMs • Small cells • Store charge dynamically on capacitor • Store about 50,000 electrons • Must be refreshed • Data leaks away through parasitic resistance • -particle can be 1,000,000 carriers?
System Reliability • Device fail with Probability: Pfail • Have N components in system • All must work for device to work • Psys = (1-Pfail)N
System Reliability • If NPfail << 1 • NPfail dominates higher order terms…
System Reliability • Psysfail N Pfail
Modern System • 100 Million 1 Billion Transistors • Not to mention wiring… • > GHz = > 1 Billion Transitions / sec. • N = 1018 per second…
As we scale? • N increases • Charge/gate decreases • Less electrons • Higher probability they wander • Greater variability in behavior • Voltage levels decrease • Smaller barriers • Greater variability in device parameters Pfail increases
Exacerbated at Nanoscale • Small numbers of dopants (10s) • High variability • Small numbers of electrons (10-1000s?) • High variability • Highly susceptible to noise • Small number of molecules • May break, decay…
What do we do about it? • Tolerate faulty components • Detect faults • Not do anything bad • Try it again • If statistically unlikely error, • high likelihood won’t recur. • …Focus on detection…
Detect Faults • Key Idea: redundancy • Include enough redundancy in computation • Can tell that an error occurred
What kind of redundancy can we use? • Multiple copies of logic • Compute something about result • Parity on number of outputs • Count of number of 1’s in output
What do we protect against? • Any n errors • Worst-case selection of errors
Single Error Detection • If Pfail small: • No error: (1-Pfail)N 1-NPfail • One error:NPfail (1-Pfail)N-1 NPfail • Two errors: [N(N-1)/2] (Pfail )2(1-Pfail)N-1 • Probability of an error going undetected • For: NPfail << 1 • Goes from NPfail • to (NPfail )2
Single Error Detection (Example) • Probability of an error going undetected • For: NPfail << 1 • Goes from NPfail • to (NPfail )2 • N=1010 Pfail=10-20 • NPfail=10-10<<1 ~1010 cycles MTTF • Mean Time To Failure • 1GHz = 10s • (NPfail)2=10-20 1020 cycles MTTUF • Mean Time To Undetected Fault • 1011s = 3000 years
Detection Overhead • …but: Correction and detection circuitry increase circuit size. • Ndetect > Nlogic • Ndetect = c Nlogic • Probability of an error going undetected • Goes from NPfail • to (cNPfail )2 • To come out ahead, want: c2 << 1/(NPfail ) • c=3, N=1010 Pfail=10-20 • (cNPfail)2=910-20 1019 cycles MTTUF • 1010s = 300 years
Detection Overhead • …but: Correction and detection circuitry increase circuit size. • Ndetect > Nlogic • Ndetect = c Nlogic • Probability of an error going undetected • Goes from NPfail • to (cNPfail )2 • To come out ahead, want: c2 << 1/(NPfail ) • c=3, N=31010 Pfail=10-11 • NPfail=0.3 • (cNPfail)2=0.81 worse • Neither workable!
Reliability Tuning • Want NPfail small • Want: (cNPfail )2 very small • Idea: • Guard subsystems independently • Make Ns suitably small • Smaller probability there is a double error localized in this small subsystem • That is: as long as compartmentalization guarantees very small (cNsPfail )2: • can reduce to single detect case.
Composing Subsystems • Psysundetect = (Nsys/Ns) Psubundetect • Psubundetect = (cNsPfail )2 • Psysundetect = (Nsys/Ns) (cNsPfail )2 • Psysundetect = Nsys Ns (cPfail )2 • Extermes: • Ns= Nsys • Ns=1 No benefit Maximum benefit factor of Nsys [in practice c=f(Ns)]
Composing Subsystems • Psysundetect = Nsys Ns (cPfail )2 • Example: c=3, Nsys=31010 Pfail=10-11 • Ns=103 • 31010 103 (310-11)2 • 33 10-9 310-8 (<<0.81) • Still < 1s MTTUF …
Problem Motivates Problem: • Generate logic capable of detecting any single error
Terminology • Fault-secure: system never produces incorrect code word • Either produces correct result • Or detects the error • Self-testing: for every fault, there is some input that produces an incorrect code word • That detects the error
Terminology • Totally Self Checking: system is both fault-secure and self-testing.
Duplication Detects any single fault (even in checker)
Duplication • N original gates • Duplicate: + N • O outputs • O xors • O/2 2 2 ors • Total 3O gates • Total: 2N+3O • O<N • 2<c<5
Duplication • Total: 2N+3O • O<N • Rent’s Rule: O~kNp • p<1 • Total: 2N+3kNp • c(N)=2+3k/N(1-p) • N small 5 • N large 2
Duplication with PLA Logic Duplicate
PLA Duplication • N product terms in original • N in duplicate • 2 O product terms for matching • ON • 2<c<4
Can we do better? • Seems like overkill to compute twice?
Idea • Encode so outputs have some checkable property • E.g. parity
Will this work? Original Logic Extra cubes for parity parity
Problem • Single fault may produce multiple output errors
How Fix? • How do we fix?
No Logic Sharing • No sharing • Single fault effects single output
Parity Checking • To check parity • Need xor tree on outputs/parity • [(O+1)/2]22 = 2(O+1) xors • For PLA • xor would blow up • Wrap multiple times • 2 product terms per xor • 4O product terms
nanoPLA Wrapped xor Note: two planes here just for buffering/inversion
Better or Worse than Dual? (not include checking)
Can we allow sharing? • When?
Multiple Parity Groups • Can share with different parity groups • Common error flagged in both groups
Multi-Parity Group Compare (AMD) (not include checking)
Best Results from Winter2004 CS137 (not include checking)
Better or Worse than Dual? • Typical results from Mitra [ITC2002] • Multi-level gate mapping to LSI std. cell library (parity here includes multiple parity)
Admin • Assignment #2 due Friday • Wednesday reading online • Friday reading handout
Big Ideas • Low-level physics imperfect • Statistical, noisy • Larger number of devices greater likelihood of faults • Redundancy • Self-checking circuits