1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques. A. Marchioro / PH-ESE-ME. Outline. SEU Basic Facts Special technologies for SEU protection Mitigation techniques Circuit Techniques In logic In registers In RAMS Logic (Redundancy) Techniques
1st Combined R2E Workshop & School-Days Error Detection and Correction Techniques A. Marchioro / PH-ESE-ME
Outline • SEU Basic Facts • Special technologies for SEU protection • Mitigation techniques • Circuit Techniques • In logic • In registers • In RAMS • Logic (Redundancy) Techniques • Coding techniques • Error detection only techniques • Conclusions A. Marchioro / PH-ESE
Significant also in industry Terrestrial cosmic rays and soft errors Vol. 40, No. 1, 1996 Soft Errors in Circuits and Systems Vol. 52, No. 3, 2008 A. Marchioro / PH-ESE
SEU errors in “analog” circuitry • We live in a (mostly) digital world: • (Occasional) errors in analog circuitry will be ignored or will be fixed at the digital level • Particle strike at sensing elements: • Happens all the time at particle detectors • System should be designed to cope with single wrong measurement • Can happen easily in photo-receivers • Particle strikes at critical nodes • Biasing nodes • Self-recovery • Hits at high current nodes are probably going to remain unobserved • DAC registers • Not self recovered, but detectable in digital way • Oscillator circuits and PLLs: • Recovery could take ms, but should eventually occur • May require training or synchronization sequences to be sent • Can cause long sequences of errors in applications such as self-clocking serial streams A. Marchioro / PH-ESE
SEU: where does it occur “0” from Darracq et al.: IEEE Trans. on Nuclear Science, VOL. 49, NO. 3, JUNE 2002 A. Marchioro / PH-ESE
Are all particles equally “dangerous” for SEU? • Energy loss (dE/dx) for protons in Si • Bethe-Bloch energy loss equation, for reference see: http://pdg.lbl.gov/2008/reviews/rpp2008-rev-passage-particles-matter.pdf
When and where should we care? “I have this particular component in my system, should I be worried about SEU?”
SEU: Impact on components A. Marchioro / PH-ESE (*) Both user and configuration logic are sensitive (**) Only user logic is sensitive
SEU in a circuit • SEU can occur in several places in a circuit: • In a storage node (Register, Latch or RAM) • Along a logic path (needs to be synchronized with clock sampling to be relevant) • On a clock line (rather bad!) • On a global line such as Reset (catastrophic!) • Different techniques are necessary to protect from these different events • No one-size fits-all solution! A. Marchioro / PH-ESE
Device level SEU protection: SOI + - - + + - STI Oxide well WARNING: Drawing not to scale! + - - + substrate + - - + The majority of commercial ICs are fabricated on bulk technologies. Charge can be collected from several microns of silicon under a device. In thin-film SOI, the active silicon layer can be very thin, < 300 nm, therefore little free charge can be produced. A. Marchioro / PH-ESE
SOI and SEU Bulk SRAM - A SOI SRAM 1 Bulk SRAM - B SOI SRAM 2 A. Marchioro / PH-ESE From J. Doff, TNS, 8/2007
SOI based ASIC design • SOI could be considered for specific and very demanding custom designs, but: • Requires special technology (few vendors) • Has virtually no library support • Has few if any IP available • Requires high volume • Price: Expensive to very expensive, no second source • What about the other chips in your system? • Still, it is used in space and military applications A. Marchioro / PH-ESE
Single Event Upset in logic [Figure: gate-level schematic with signals A, B, Y and CLK] • If the spike is longer than the typical gate delay, it will propagate down the logic path and possibly be sampled in the next FF • This used to be a very rare event in logic up to the 0.25 µm generation • Unfortunately it is common in 130, 90 and 65 nm (which means in most commercial chips today)
Protection against SEU in logic Register Regular (fast) gates Slow gates (filter glitches) .. or double sample at register A. Marchioro / PH-ESE
Circuit level mitigation techniques Normal Latch Strong Feedback Latch CK* CK* Din Din CK CK Extra Cap Latch Large Size Latch CK* CK* Din Din CK CK A. Marchioro / PH-ESE
Special topology D-FF cell SEU robust FF: DICE cell From Calin et al. IEEE TNS Dec 1996 A. Marchioro / PH-ESE
Single Event Upset in SRAM BL BL* 1 0 WL A. Marchioro / PH-ESE Sensitive nodes are the drains of off-state transistors
Circuit level protection from Canaris, Whitaker: Circuit Techniques for the Radiation Environment of Space, IEEE 1995 CUSTOM INTEGRATED CIRCUITS CONFERENCE A. Marchioro / PH-ESE
Remarks about SEU in RAMs • In today’s technologies, cells are so small (< 1 µm²) that a single ion can hit two or more locations at once, so multiple SEUs are common. • Single bit EDAC is likely not sufficient! • While it is true that most of the memory area is covered by the matrix of cells, hits in other areas (decoder, sense-amp), though rare, can be even more catastrophic
A 65 nm 2-Billion Transistor Itanium A. Marchioro / PH-ESE
More on SER… A. Marchioro / PH-ESE
Redundancy • Redundancy is actually a coding technique, technically a simple “repetition” code, where the information is duplicated or triplicated and checked at convenient boundaries • Redundancy is well suited to control blocks • Data paths are better protected by other techniques, such as parity etc.
Repetition Code • Take each symbol si in S and repeat it n times. This is an (n, 1) code. • For example the word {s1s2s3} becomes the codeword {s1s1s1s2s2s2s3s3s3} • Efficiency (= rate) of the code is: 1/n • The minimum distance (see later) is n and the number of errors t that can be corrected is: t = ⌊(n – 1)/2⌋
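A minimal sketch of the (n, 1) repetition code described on this slide, assuming binary symbols and majority-vote decoding. The function names `rep_encode` and `rep_decode` are illustrative, not taken from the slides.

```python
from collections import Counter

def rep_encode(symbols, n):
    """Repeat every symbol n times: an (n, 1) repetition code."""
    return [s for s in symbols for _ in range(n)]

def rep_decode(codeword, n):
    """Majority-vote each group of n received symbols."""
    groups = [codeword[i:i + n] for i in range(0, len(codeword), n)]
    return [Counter(g).most_common(1)[0][0] for g in groups]

word = [1, 0, 1]
sent = rep_encode(word, 3)      # each symbol sent three times
received = list(sent)
received[4] ^= 1                # a single upset inside the second group
decoded = rep_decode(received, 3)
```

With n = 3 the code corrects one error per group, at a rate of 1/3. A second error in the same group would outvote the correct value.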
Triple redundancy (Triple Module Redundancy) • Three copies of same user logic + state_register • Voting logic decides 2 out of 3 (majority) • Used regularly in: high reliability electronics, mainframes • Problems: 300% area and power; corrects only 1 error; can get very wrong with two errors • Problem: How do you make sure that the voting logic itself is not affected by SEU? [Figure: Triple Module Redundancy, with Input feeding FSM1, FSM2 and FSM3 (common CLK) into the voting logic that drives Output; the logic for voting combines the pairs A·B, A·C and B·C]
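The 2-out-of-3 voter can be sketched in software as a bitwise majority over three copies of a register, following the A·B, A·C, B·C structure of the voting logic. This is only a behavioural illustration: the register width and values are assumptions, and it does not model upsets in the voter itself.

```python
def vote(a, b, c):
    """Bitwise 2-out-of-3 majority: (A AND B) OR (A AND C) OR (B AND C)."""
    return (a & b) | (a & c) | (b & c)

good = 0b1011
upset = good ^ 0b0100                    # same register with one bit flipped

one_upset = vote(good, upset, good)      # one corrupted copy is outvoted
two_upsets = vote(good, upset, upset)    # two corrupted copies win the vote
```

The second case shows why the slide warns that the scheme "can get very wrong with two errors": the voter confidently returns the corrupted value.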
Example of triplicated design • Gigabit Optical Link (GOL), CERN design • 0.8 and 1.60 Gb/s optical link • Unidirectional • < 300 mW • G-Link and Gigabit Ethernet protocol • Redundant logic • More than 20,000 units in Atlas, CMS, LHCb and Alice • http://proj-gol.web.cern.ch/proj-gol/
Double redundancy Two copies of same user logic + state_register Voting logic decides if outputs are unequal If mismatch: Report to system Problems: 200% area and power Can’t be used in “real-time” but may be sufficient for many applications Reduced Module Redundancy FSM1 Input Output FSM2 Comparison logic Reset Request CLK A. Marchioro / PH-ESE
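A behavioural sketch of the duplicate-and-compare scheme, assuming a hypothetical FSM (a 3-bit counter) as the user logic. The names `next_state` and `dmr_step` are illustrative. A mismatch can only be reported, not corrected, which is why the system has to react with a reset request.

```python
def next_state(state, inp):
    """Hypothetical user logic: a 3-bit counter that advances when inp is 1."""
    return (state + inp) & 0b111

def dmr_step(state1, state2, inp):
    """Advance both copies and flag a reset request if they disagree."""
    s1 = next_state(state1, inp)
    s2 = next_state(state2, inp)
    return s1, s2, s1 != s2

clean = dmr_step(3, 3, 1)            # both copies agree
hit = dmr_step(3, 3 ^ 0b010, 1)      # second copy suffered a bit flip
```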
What to duplicate? Input Logic Input Logic Reg Reg Output Logic Output Reg Reg Comparison logic Comparison logic Logic Reg Reg • Use this: If clock frequency is low and technology is “old”. • Use this: • If clock frequency is high and technology is “advanced”. A. Marchioro / PH-ESE
FSM general structure Input Logic Input Logic Reg Reg Logic Output Logic Output Reg Reg Comparison logic Comparison logic Logic Logic Reg Reg • Not This. • Do this! A. Marchioro / PH-ESE
Redundancy in time: Single user logic block and two state_registers Two clocks (F1 and F2) Voting logic decides if outputs are unequal at completion of F2 If error: Compute again Problems: Needs time for 3 evaluations (…not really, three transients time constants are enough) No problem at 40 MHz and “modern” technology Needs multi-phase clock Temporal Redundancy Reg1 Input Logic Output CLK1 Comparison logic Reg2 Re-evaluate Request CLK2 CLK1 CLK2 A. Marchioro / PH-ESE
Check for consistency only when results will be committed to memory: For instance when two computers/microcontrollers perform a STORE operation Advantages: Processors can be “standard” Write operations are relatively rare and therefore requirements on comparison resources are small Less resources needed for checking Used in some mainframes with triple redundancy Problem: if you detect an error in processor, how do you resync it? Memory Boundary Redundancy uP 1 Shared Memory uP 2 Comparison logic Error … A. Marchioro / PH-ESE
Check for consistency only when results will become used by external devices: For instance when two computers/microcontrollers want to commit results to disk Advantages: Synchronization is less of a problem Less resources needed for checking In some cases it could even be done in software uP Architectures and/or hardware could even be different Used in high-reliability computer boxes and avionics I/O Boundary Redundancy uP 1 I/O Intf1 I/O device Mem1 I/O Intfc2 Re-evaluate Request uP 1 Comparison logic I/O CLK Mem1 … A. Marchioro / PH-ESE
Mission critical redundancy Various computer configurations used during a Shuttle mission. from: NASA Shuttle documentation A. Marchioro / PH-ESE
Redundancy in avionics from: IEEE Aerospace & Electronic Systems Magazine, October 2000 A. Marchioro / PH-ESE
Hamming Coding “Two weekends in a row I came in and found that all my stuff had been dumped and nothing was done. I was really aroused and annoyed and I wanted those answers and two weekends had been lost. And so I said, ‘Damn it, if the machine can detect an error, why can’t it locate the position of the error and correct it?’” from an interview with R. Hamming, February 3-4, 1977, quoted in T. Thompson, p.17 “The purpose of this memorandum is to give some practical codes which may detect and correct all errors of a given probability of occurrence, and which detect errors of even a rarer occurrence”. from R. Hamming, ‘Self-Correcting Codes – Case 20878, Memorandum 1130-RWH-MFW, Bell Telephone Laboratories, July 27, 1947 A. Marchioro / PH-ESE
Coding for memory repair A. Marchioro / PH-ESE
Mitigating SEU: Forward Error Correction [Figure: the transmitter sends data D together with a parity TP = f(D); the receiver computes f(R) on the received data R and compares it (=?) with the received parity RP to give OK/NotOK] • Examples of FEC: • Simple Parity (actually only error detection) • EDC: Hamming coding • single error correction capability, popular in computer DRAM • BCH • Sophisticated multiple bit error detection and correction; requires complex logic • Reed-Solomon • Sophisticated and efficient multi-word error detection and correction; requires complex logic
Mitigating SEU: FEC (2) [Figure: receiver block diagram with R, f(R), RP, the comparison (=?) giving OK/NotOK, and f⁻¹(R) recovering D] • The “parity” function must be such that, if an error is detected, one can also use it to recover the right data!
Families of Error Control Methods • Block Codes: codeword built only on current message-word • Non-block codes: codeword depends on the current message word and on some past words, e.g.: • Convolutional, used (obviously) in streaming channels • Examples of codes: • Hamming • Bose-Chaudhuri-Hocquenghem (BCH) • Golay • Reed-Solomon (RS) • Reed-Muller • Low Density Parity Check Codes • Turbo Codes • …
Parity • In B = {0,1}, start with a message word: S = {s1s2s3s4s5s6s7} • Compute a “Parity” character s8 defined as: $s_8 = s_1 \oplus s_2 \oplus \dots \oplus s_7$, where $\oplus$ is the exclusive-OR (or the sum mod 2). • Parity check can detect all single errors (but cannot give the position) • Parity check cannot detect double (or any even number of) errors • Used: often in computer memories; in serial terminal data transmission
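A short sketch of the even-parity scheme on a 7-bit message, assuming the parity bit is the XOR of all message bits. The helper names are illustrative. It shows both properties stated on the slide: a single flip is detected, a double flip is not.

```python
from functools import reduce
from operator import xor

def parity_bit(bits):
    """XOR (sum mod 2) of all bits."""
    return reduce(xor, bits, 0)

def check(codeword):
    """True when the XOR over message plus parity bit is 0 (no error seen)."""
    return parity_bit(codeword) == 0

message = [1, 0, 1, 1, 0, 0, 1]
codeword = message + [parity_bit(message)]

single = list(codeword)
single[2] ^= 1                  # one bit flipped

double = list(single)
double[5] ^= 1                  # a second bit flipped

results = (check(codeword), check(single), check(double))
```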
Two-Dimensional Parity ParityX ParityY 2 Errors A. Marchioro / PH-ESE
Two-Dimensional Parity A. Marchioro / PH-ESE
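A possible sketch of two-dimensional parity, assuming one even-parity bit per row (here taken as ParityX) and one per column (ParityY). The block contents and helper names are illustrative. A single error is located at the intersection of the failing row and the failing column; two errors in the same row are still detected through the column parities, but can no longer be located.

```python
import copy

def parities(block):
    """Even parity of every row and every column of a bit matrix."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols

def locate(block, row_par, col_par):
    """Return the indices of rows and columns whose parity no longer matches."""
    rows, cols = parities(block)
    bad_rows = [i for i, (got, want) in enumerate(zip(rows, row_par)) if got != want]
    bad_cols = [j for j, (got, want) in enumerate(zip(cols, col_par)) if got != want]
    return bad_rows, bad_cols

block = [[1, 0, 1, 1],
         [0, 1, 1, 0],
         [1, 1, 0, 0]]
row_par, col_par = parities(block)      # stored alongside the data

one_error = copy.deepcopy(block)
one_error[1][2] ^= 1
single = locate(one_error, row_par, col_par)    # one row and one column fail

two_errors = copy.deepcopy(one_error)
two_errors[1][0] ^= 1                           # second hit in the same row
double = locate(two_errors, row_par, col_par)   # row parity cancels out
```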
Hamming (intuitive version) source parity c5 s1 s2 s3 c7 c6 s4 Definition: cj = computed to give even parity in the circle • Notice: • the 16 code words in Hamming(7,4) differ from each other by at least 3 bits. A. Marchioro / PH-ESE
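The distance claim can be checked by brute force. The sketch below assumes one common assignment of source bits to the three circles (c5 covers s1, s2, s3; c6 covers s2, s3, s4; c7 covers s1, s3, s4). The actual grouping in the slide's diagram may differ, but any valid Hamming(7,4) assignment gives the same minimum distance.

```python
from itertools import combinations, product

def encode(s1, s2, s3, s4):
    """Append three even-parity bits, one per circle (assumed grouping)."""
    c5 = s1 ^ s2 ^ s3
    c6 = s2 ^ s3 ^ s4
    c7 = s1 ^ s3 ^ s4
    return (s1, s2, s3, s4, c5, c6, c7)

def distance(a, b):
    """Hamming distance: number of positions where two words differ."""
    return sum(x != y for x, y in zip(a, b))

codewords = [encode(*s) for s in product((0, 1), repeat=4)]
min_distance = min(distance(a, b) for a, b in combinations(codewords, 2))
```

A minimum distance of 3 is what allows one error to be corrected, since a word with a single flipped bit is still closer to its original codeword than to any other.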
Hamming Codes (3) Hardware for encoder a0 a0 a1 a1 a2 a2 a3 a3 p0 p1 p2 A. Marchioro / PH-ESE
Hamming Codes (4) Hardware for decoder a’0 a0 + a’1 a1 + a’2 a2 + a’3 a3 + p’0 Correction Logic p’1 p’2 A. Marchioro / PH-ESE
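A software sketch of what the decoder hardware does, under assumptions: the parity equations for p0, p1, p2 below are a hypothetical choice (the slide's wiring is not recoverable from the text), and the correction logic is modelled as a syndrome lookup. The receiver recomputes the parities from a'0..a'3, XORs them with p'0..p'2 to form a syndrome, and flips the bit the syndrome points to.

```python
# Assumed data-bit indices covered by p0, p1, p2 (hypothetical wiring).
CHECKS = ((0, 1, 2), (1, 2, 3), (0, 2, 3))

def encode(a):
    """Return [a0, a1, a2, a3, p0, p1, p2]."""
    return list(a) + [a[i] ^ a[j] ^ a[k] for i, j, k in CHECKS]

def syndrome(word):
    """Recomputed parity XOR received parity, one bit per check."""
    a, p = word[:4], word[4:]
    return tuple(p[n] ^ a[i] ^ a[j] ^ a[k] for n, (i, j, k) in enumerate(CHECKS))

# Each single-bit error produces its own distinct non-zero syndrome.
SYNDROME_TO_POS = {}
for pos in range(7):
    error = [0] * 7
    error[pos] = 1
    SYNDROME_TO_POS[syndrome(error)] = pos

def decode(word):
    """Correct at most one flipped bit, then return the four data bits."""
    word = list(word)
    s = syndrome(word)
    if any(s):
        word[SYNDROME_TO_POS[s]] ^= 1
    return word[:4]

sent = encode([1, 0, 1, 1])
received = list(sent)
received[2] ^= 1                # single upset on a2
recovered = decode(received)
```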
Cost of Hamming SEC A. Marchioro / PH-ESE
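One common way to express the cost of a Hamming single-error-correcting (SEC) code is the number of check bits r needed for k data bits, using the standard condition 2^r ≥ k + r + 1. The sketch below tabulates r and the relative overhead under that assumption. The function name and the chosen word widths are illustrative and are not taken from the slide.

```python
def sec_check_bits(k):
    """Smallest r such that 2**r >= k + r + 1 (Hamming SEC bound)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

widths = [4, 8, 16, 32, 64]
table = [(k, sec_check_bits(k)) for k in widths]

for k, r in table:
    print(f"{k:3d} data bits -> {r} check bits ({100 * r / k:.1f}% overhead)")
```

The relative overhead shrinks as the word gets wider, which is one reason SEC codes are attractive for wide memory words.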