480 likes | 504 Views
CS137: Electronic Design Automation. Day 9: October 17, 2005 Fault Detection. Today. Faults in Logic Error Detection Schemes Optimization Problem. Problem. Gates, wires, memories: built out of physical media may fail. Device Physics. Represent a 1 or 0 with charge
E N D
CS137:Electronic Design Automation Day 9: October 17, 2005 Fault Detection
Today • Faults in Logic • Error Detection Schemes • Optimization Problem
Problem • Gates, wires, memories: • built out of physical media • may fail
Device Physics • Represent a 1 or 0 with charge • On a gate, in a memory • Charge may be disrupted • -particle (other ionizing particles) • Ground bounce • Noise coupling • Tunneling • Thermal noise • Behavior of individual electrons is statistical
DRAMs • Small cells • Store charge dynamically on capacitor • Store about 50,000 electrons • Must be refreshed • Data leaks away through parasitic resistance • -particle can be 1,000,000 carriers?
System Reliability • Device fail with Probability: Pfail • Have N components in system • All must work for device to work • Psys = (1-Pfail)N
System Reliability • If NPfail << 1 • NPfail dominates higher order terms…
System Reliability • Psysfail N Pfail
Modern System • 100 Million 1 Billion Transistors • Not to mention wiring… • > GHz = > 1 Billion Transitions / sec. • N = 1018 per second…
As we scale? • N increases • Charge/gate decreases • Less electrons • Higher probability they wander • Greater variability in behavior • Voltage levels decrease • Smaller barriers • Greater variability in device parameters Pfail increases
Exacerbated at Nanoscale • Small numbers of dopants (10s) • High variability • Small numbers of electrons (10-1000s?) • High variability • Highly susceptible to noise • Small number of molecules • May break, decay…
What do we do about it? • Tolerate faulty components • Detect faults • Not do anything bad • Try it again • If statistically unlikely error, • high likelihood won’t recur. • …Focus on detection…
Detect Faults • Key Idea: redundancy • Include enough redundancy in computation • Can tell that an error occurred
What kind of redundancy can we use? • Multiple copies of logic • Compute something about result • Parity on number of outputs • Count of number of 1’s in output
What do we protect against? • Any n errors • Worst-case selection of errors
Single Error Detection • If Pfail small: • No error: (1-Pfail)N 1-NPfail • One error:NPfail (1-Pfail)N-1 NPfail • Two errors: [N(N-1)/2] (Pfail )2(1-Pfail)N-1 • Probability of an error going undetected • For: NPfail << 1 • Goes from NPfail • to (NPfail )2
Single Error Detection (Example) • Probability of an error going undetected • For: NPfail << 1 • Goes from NPfail • to (NPfail )2 • N=1010 Pfail=10-20 • NPfail=10-10<<1 ~1010 cycles MTTF • Mean Time To Failure • 1GHz = 10s • (NPfail)2=10-20 1020 cycles MTTUF • Mean Time To Undetected Fault • 1011s = 3000 years
Detection Overhead • …but: Correction and detection circuitry increase circuit size. • Ndetect > Nlogic • Ndetect = c Nlogic • Probability of an error going undetected • Goes from NPfail • to (cNPfail )2 • To come out ahead, want: c2 << 1/(NPfail ) • c=3, N=1010 Pfail=10-20 • (cNPfail)2=910-20 1019 cycles MTTUF • 1010s = 300 years
Detection Overhead • …but: Correction and detection circuitry increase circuit size. • Ndetect > Nlogic • Ndetect = c Nlogic • Probability of an error going undetected • Goes from NPfail • to (cNPfail )2 • To come out ahead, want: c2 << 1/(NPfail ) • c=3, N=31010 Pfail=10-11 • NPfail=0.3 • (cNPfail)2=0.81 worse • Neither workable!
Reliability Tuning • Want NPfail small • Want: (cNPfail )2 very small • Idea: • Guard subsystems independently • Make Ns suitably small • Smaller probability there is a double error localized in this small subsystem • That is: as long as compartmentalization guarantees very small (cNsPfail )2: • can reduce to single detect case.
Composing Subsystems • Psysundetect = (Nsys/Ns) Psubundetect • Psubundetect = (cNsPfail )2 • Psysundetect = (Nsys/Ns) (cNsPfail )2 • Psysundetect = Nsys Ns (cPfail )2 • Extermes: • Ns= Nsys • Ns=1 No benefit Maximum benefit factor of Nsys [in practice c=f(Ns)]
Composing Subsystems • Psysundetect = Nsys Ns (cPfail )2 • Example: c=3, Nsys=31010 Pfail=10-11 • Ns=103 • 31010 103 (310-11)2 • 33 10-9 310-8 (<<0.81) • Still < 1s MTTUF …
Problem Motivates Problem: • Generate logic capable of detecting any single error
Terminology • Fault-secure: system never produces incorrect code word • Either produces correct result • Or detects the error • Self-testing: for every fault, there is some input that produces an incorrect code word • That detects the error
Terminology • Totally Self Checking: system is both fault-secure and self-testing.
Duplication Detects any single fault (even in checker)
Duplication • N original gates • Duplicate: + N • O outputs • O xors • O/2 2 2 ors • Total 3O gates • Total: 2N+3O • O<N • 2<c<5
Duplication • Total: 2N+3O • O<N • Rent’s Rule: O~kNp • p<1 • Total: 2N+3kNp • c(N)=2+3k/N(1-p) • N small 5 • N large 2
Duplication with PLA Logic Duplicate
PLA Duplication • N product terms in original • N in duplicate • 2 O product terms for matching • ON • 2<c<4
Can we do better? • Seems like overkill to compute twice?
Idea • Encode so outputs have some checkable property • E.g. parity
Will this work? Original Logic Extra cubes for parity parity
Problem • Single fault may produce multiple output errors
How Fix? • How do we fix?
No Logic Sharing • No sharing • Single fault effects single output
Parity Checking • To check parity • Need xor tree on outputs/parity • [(O+1)/2]22 = 2(O+1) xors • For PLA • xor would blow up • Wrap multiple times • 2 product terms per xor • 4O product terms
nanoPLA Wrapped xor Note: two planes here just for buffering/inversion
Better or Worse than Dual? (not include checking)
Can we allow sharing? • When?
Multiple Parity Groups • Can share with different parity groups • Common error flagged in both groups
Multi-Parity Group Compare (AMD) (not include checking)
Best Results from Winter2004 CS137 (not include checking)
Better or Worse than Dual? • Typical results from Mitra [ITC2002] • Multi-level gate mapping to LSI std. cell library (parity here includes multiple parity)
Admin • Assignment #2 due Friday • Wednesday reading online • Friday reading handout
Big Ideas • Low-level physics imperfect • Statistical, noisy • Larger number of devices greater likelihood of faults • Redundancy • Self-checking circuits