This lecture discusses software techniques for mitigating soft errors in embedded systems, including process technology solutions, gate-level and circuit-level solutions, microarchitectural solutions, and software solutions.
Spring 2008 CSE 591: Compilers for Embedded Systems Aviral Shrivastava, Department of Computer Science and Engineering, Arizona State University
Lecture 4: Soft Errors Software Techniques
Outline • Soft Errors Recap • Process Technology and Packaging Solutions • Gate-level and Circuit-level Solutions • Microarchitectural Solutions • Single-core • Multi-threaded • Software Solutions • Multi Bit Upsets (MBUs) • Single Event Latchup
Razor • Originally proposed to tolerate process variations and achieve power reduction • Shadow latch clocked with a delayed clock • If difference in values latched, raise error • How to use it to detect soft errors?
Multi-issue Processors • Superscalar • Execute multiple instructions from the same thread in one cycle • Multi-threading • Execute instructions from a single thread in any one cycle, but can switch between threads (applications) from cycle to cycle • Simultaneous Multithreading • Issue instructions from different threads in the same cycle • [Figure: issue-slot diagrams comparing Superscalar, Multithreading, and Simultaneous Multithreading]
SMT Solutions • SRT: Simultaneous Redundant Threading • Duplicate a thread, and run the two copies on the same core as a leading thread and a trailing thread • Threads maintain their own contexts, including the register file • Threads should not diverge when there are no faults • Memory interface • Only the leading thread can read from memory • It puts a copy of each loaded value in the LVQ (load value queue); the trailing thread reads from there • The leading thread writes its store values to the STB (store buffer) • Only the trailing thread can write to memory, after checking its value against the one in the STB • Branch interface • Leading thread writes branch outcomes to the BOQ (branch outcome queue) • Trailing thread reads them and therefore has perfect branch prediction
SMT Solutions: PER • Trailing thread competes for resources – High ILP phases • STB fills up causing leading thread stalls • PER: Partial Explicit Redundancy • Leading thread uses all resources during high-ILP phases • SEM: Single Execution Mode • Trailing thread executes during low-ILP phases • REM: Redundant Execution Mode • In REM state, check all instructions • Need resume point for trailing thread • Maintain state (LVQ, STB, RF, etc…) • Proportional to slack size
SMT Solutions: IRTR • IR: Instruction Reuse • Do not execute an instruction, if it has already executed with the same inputs • Keep a reuse buffer • IRTR: Implicit Redundancy Through Reuse • Check with previous value for soft errors • If matches, continue and overwrite the value in buffer • If mis-match, raise flag • During high ILP regions
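The reuse-buffer check on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hardware design: the class name `ReuseBuffer`, the `(opcode, operands)` key, and the three return strings are hypothetical choices made for illustration.

```python
class ReuseBuffer:
    """Toy reuse buffer: remembers the last result seen for (opcode, operands)."""

    def __init__(self):
        self.entries = {}

    def check(self, opcode, operands, result):
        """Compare a freshly computed result with the buffered one, if any."""
        key = (opcode, tuple(operands))
        if key not in self.entries:
            self.entries[key] = result   # first execution: nothing to compare against
            return "miss"
        if self.entries[key] == result:
            self.entries[key] = result   # match: continue and overwrite the buffered value
            return "match"
        return "mismatch"                # mismatch: raise the soft-error flag


buffer = ReuseBuffer()
first = buffer.check("add", (1, 2), 3)      # no earlier execution to compare with
second = buffer.check("add", (1, 2), 3)     # same inputs, same result
corrupted = buffer.check("add", (1, 2), 7)  # same inputs, different result
```

A mismatch only says that one of the two executions was wrong; the sketch does not decide which one.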
Watchdog Processor & Control Flow Checking • Watchdog processor • Simple processor, receives signals from the main processor • Checks that the signals arrive in the correct order • e.g., S3 should not come directly after S1 • Watchdog program can be automatically generated • Formal techniques for correctness • Asynchronous communication between the main processor and the watchdog processor • [Figure: processor, watchdog processor, and memory; basic blocks BB1, BB2, BB3 send signals S1, S2, S3 to the watchdog]
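A minimal sketch of the order check the watchdog performs, assuming a straight-line program BB1, BB2, BB3 that loops back to BB1. The `Watchdog` class and the `ALLOWED` successor table are hypothetical; a real watchdog program would be generated from the control flow graph.

```python
# Legal successor signals, assuming the CFG BB1 -> BB2 -> BB3 -> BB1.
ALLOWED = {"S1": {"S2"}, "S2": {"S3"}, "S3": {"S1"}}


class Watchdog:
    """Checks that each received signal is a legal successor of the previous one."""

    def __init__(self, allowed):
        self.allowed = allowed
        self.last = None

    def receive(self, signal):
        ok = self.last is None or signal in self.allowed.get(self.last, set())
        self.last = signal
        return ok


good = Watchdog(ALLOWED)
good_run = [good.receive(s) for s in ["S1", "S2", "S3"]]

bad = Watchdog(ALLOWED)
bad_run = [bad.receive(s) for s in ["S1", "S3"]]  # S3 directly after S1
```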
EDDI (Error Detection by Duplicated Instructions) • Duplicate instructions • Validation instructions • Store and branch are sync points • Check store and branch operands • Memory penalty • Load/store from duplicated locations
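The duplicate-then-validate pattern can be illustrated as follows. This is a sketch in Python rather than compiler-generated machine code; `FaultDetected`, `checked_store`, `add_and_store`, and the `flip_bit` fault-injection parameter are all hypothetical names introduced for the example.

```python
class FaultDetected(Exception):
    """Raised when a master value and its shadow copy disagree."""


def checked_store(memory, shadow_memory, addr, value, value_dup):
    # Stores are sync points: compare the operands before they leave the core.
    if value != value_dup:
        raise FaultDetected(f"store operand mismatch at address {addr}")
    memory[addr] = value
    shadow_memory[addr] = value_dup  # duplicated location (the memory penalty)


def add_and_store(memory, shadow_memory, addr, a, b, flip_bit=None):
    total = a + b        # master instruction
    total_dup = a + b    # duplicated instruction on shadow registers
    if flip_bit is not None:
        total ^= 1 << flip_bit  # injected soft error in the master copy (illustration only)
    checked_store(memory, shadow_memory, addr, total, total_dup)
    return memory[addr]
```

Calling `add_and_store(mem, shadow, 0, 2, 3)` stores the sum in both memories, while passing `flip_bit=4` makes the two copies differ and the store is refused.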
EDDI+CFCSS (Control Flow Checking by Software Signatures) • At the beginning of each node, perform G = G xor d, then check that G equals the node's signature • For an edge from node 1 to node 2, d2 = s1 xor s2; then G = s1 xor (s1 xor s2) = s2 • Limitation: if two source nodes jump to the same destination node, the two source nodes must have the same signature
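The signature update can be traced with a small sketch. The three-node chain B1, B2, B3, the signature values, and the function name `path_is_legal` are assumptions made for illustration.

```python
# Hypothetical signatures for a chain B1 -> B2 -> B3.
SIGNATURES = {"B1": 0b0001, "B2": 0b0010, "B3": 0b0011}
PREDECESSOR = {"B2": "B1", "B3": "B2"}  # the single legal predecessor of each node

# d_j = s_pred xor s_j, computed at compile time.
DIFF = {node: SIGNATURES[pred] ^ SIGNATURES[node] for node, pred in PREDECESSOR.items()}


def path_is_legal(path):
    """Replay the run-time signature G along a path and check it at every node."""
    g = SIGNATURES[path[0]]
    for node in path[1:]:
        g ^= DIFF[node]              # G = G xor d at the beginning of the node
        if g != SIGNATURES[node]:    # signature check fails on an illegal jump
            return False
    return True
```

On the legal path B1, B2, B3 the check passes at every node; an illegal jump from B1 straight to B3 leaves G at `0b0000`, which differs from B3's signature.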
CFCSS + SWIFT (Software Implemented Fault Tolerance) • If two source nodes jump to the same destination node, then the two source nodes should have the same signature • To lift this restriction, add a path-dependent adjusting signature D; here d5 = s1 xor s5 • B1 -> B5: D = 0, then G = s1 xor d5 xor D = s1 xor (s1 xor s5) xor 0 = s5 • B3 -> B5: D = s1 xor s3, then G = s3 xor (s1 xor s5) xor (s1 xor s3) = s5
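The two fan-in cases on this slide can be checked numerically. In this sketch the signature values, the extra node B2 (used as an illegal source), and the function `enter_b5` are hypothetical; only the formulas for d5 and D come from the slide.

```python
# Hypothetical signatures; B1 and B3 are the legal sources of B5.
S = {"B1": 0b001, "B2": 0b010, "B3": 0b011, "B5": 0b101}

D5 = S["B1"] ^ S["B5"]                       # d5 = s1 xor s5
ADJUST = {"B1": 0, "B3": S["B1"] ^ S["B3"]}  # D set by each legal source node


def enter_b5(source):
    """Return True if the signature check at B5 passes when arriving from `source`."""
    g = S[source]
    d_adjust = ADJUST.get(source, 0)  # assumption: an illegal source leaves D at 0
    g ^= D5 ^ d_adjust                # G = G xor d5 xor D
    return g == S["B5"]
```

Both legal sources arrive at B5 with G equal to s5, while the illegal source B2 does not.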
ED4I: Error Detection by Diverse Data and Duplicated Instructions • The simplest way to detect Byzantine Faults is to run the same program on multiple processors and compare results. • ED4I is Byzantine Fault detection for uniprocessors. • Must take into account both temporary and permanent faults. • Re-executing with the same inputs does not guard against permanent faults • Overhead = 100%
Key Idea • Let's feed two different sets of data into the program and then compare the results. • Key Insight: • If the program only uses arithmetic operations, we can alter the input by multiplying all input numbers by a constant. • Then the modified output will be (the real output) * (the constant). • Thus, you can verify that the two computations succeeded AND the two computations will be affected by errors differently.
New Program • If we alter the input to the program, we must alter the program to work with this modified input. • The transformation is given the constant k (called the “diversity factor”) and it creates the “k-factor diverse program”. • The new program will have the same control flow graph as the old program, but all the variables will be k-multiples of the original ones.
Transformations • If k<0, branches flip directions (> ↔ <, ≥ ↔ ≤) • All constants in code get multiplied by k. • Addition and Subtraction of variables unchanged. • Multiplication: v1*v2*...*vn → (v1*v2*...*vn)/k^(n-1) • Division: v1/v2 → (v1/v2)*k
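The rules above can be applied by hand to a tiny integer program. The program itself, the diversity factor `K = -2`, and the names `original`, `k_diverse`, and `run_checked` are assumptions chosen for illustration; a real tool would apply the transformation automatically.

```python
K = -2  # diversity factor (value chosen for illustration)


def original(a, b, c):
    prod = a * b
    return prod + c if a > b else prod - c


def k_diverse(ka, kb, kc):
    """Same control flow, but every variable holds K times its original value."""
    prod = (ka * kb) // K  # multiplication rule: divide by K**(n-1), here n = 2
    # K < 0, so the comparison flips direction (> becomes <).
    return prod + kc if ka < kb else prod - kc


def run_checked(a, b, c):
    """Run both versions and compare: the diverse result must equal K * original."""
    y = original(a, b, c)
    ky = k_diverse(K * a, K * b, K * c)
    if ky != K * y:
        raise RuntimeError("ED4I mismatch: possible fault")
    return y
```

The integer division is exact here because the product of two scaled values is always a multiple of K.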
Fault Detection & Data Integrity • For functional unit hi (such as the adder), fault f and diversity factor k: • Xi = the set of inputs to hi • Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault. • E'i = subset of Ei that will escape detection • Ci(k) = Probability of catching an error in hi. • Di(k) = Probability of missing no errors in hi.
Choosing the value of k • For some functional units we can derive Ci(k) and Di(k) analytically for each k. • This is too hard in general so try out a range of k's empirically to determine Ci(k) and Di(k). • Bus Signal (12-bit) • 12-bit carry look-ahead adder • 12-bit Multipliers and Dividers
Analytical Computation of AVF • Iteration Space • L-dimensional integer vector space • L: levels of loop • Each point in IS represents an iteration • Data dependences exist • Fully ordered in time • Array Space • M-dimensional integer vector space • M: array dimension • Every point represents an element of the array • Example loop nest:
for (i = 0; i < N1; i++)
  for (j = 0; j < N2; j++)
    a[i][j] = a[i][j-1] + a[i-1][j] + a[i][j+1];
Analytical Computation of AVF • Access Function (AF) of a reference • Mapping from IS to AS • When are the elements of array accessed by a reference • References will access different parts of Array Space • Divide the Array Space into regions, in which every element is accessed by a subset of references • Array Interval (AI): Subset of AS that the reference accesses • Every element is accessed by the same set of references
Analytical Computation of AVF: Iteration Intervals for an Array Interval • Each reference will access the elements of an array interval at iterations given by its AF (Access Function) • Iteration Interval (II) is the AF restricted to the Array Interval • Gives a formula for the access time of each element in the II • Vulnerability can be computed as a formula on the II • Vulnerable time is the time from a read or write to the following read • A reference either reads or writes (not both) • Need to time-order points in II • Break into Iteration Segments, which can be ordered • Strict order, or point-wise ordered
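For a single array element, the "time from read/write to read" rule can be sketched on an explicit access trace. This is a trace-based illustration rather than the closed-form analysis the slides describe; the function name `vulnerable_time` and the `(cycle, kind)` trace format are assumptions.

```python
def vulnerable_time(accesses):
    """Sum the intervals that end in a read.

    `accesses` is a list of (cycle, kind) pairs sorted by cycle,
    where kind is 'r' for a read and 'w' for a write.
    An interval ending in a write is not vulnerable, because the
    value is overwritten before anyone can consume a corrupted copy.
    """
    total = 0
    for (start, _), (end, kind) in zip(accesses, accesses[1:]):
        if kind == "r":
            total += end - start
    return total


trace = [(0, "w"), (10, "r"), (25, "r"), (40, "w"), (50, "r")]
cycles_at_risk = vulnerable_time(trace)  # intervals 0-10, 10-25 and 40-50 count
```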
Multiple-bit Upsets (MBUs) • Error rate ~ 1/100th of SEU • Hamming Code • 1-bit error correction, 2-bit error detection • Reed Solomon Codes • RS(n,k) with s-bit symbols • s – each symbol is s bits • n – total number of symbols per codeword, n = 2^s - 1 • k – number of data symbols • Number of parity symbols = 2t = n-k • Can correct errors in ‘t’ symbols, where t = (n-k)/2 • RS(255, 223) with 8-bit symbols • Can correct 16 symbol errors in each codeword (255 symbols, i.e., 255 bytes) • Other multi-bit error detection and correction schemes • LDPC
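The Reed-Solomon parameter relations can be checked with a short calculation. The helper name `rs_parameters` and the returned dictionary keys are hypothetical; the formulas are the ones listed on the slide.

```python
def rs_parameters(s, k):
    """Derive RS(n, k) parameters for s-bit symbols."""
    n = 2**s - 1          # codeword length in symbols
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    parity = n - k        # number of parity symbols, equal to 2t
    if parity % 2:
        raise ValueError("n - k must be even")
    return {
        "n": n,
        "k": k,
        "parity_symbols": parity,
        "t": parity // 2,          # correctable symbol errors per codeword
        "codeword_bits": n * s,
    }


rs_255_223 = rs_parameters(8, 223)
```

For RS(255, 223) this gives 32 parity symbols, t = 16, and a codeword of 255 bytes (2040 bits).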
[Figure: flowchart of outcomes for a strike on a state bit (e.g., in register file)] • Bit read? • no: benign fault, no error • yes: bit has error protection? • Error can be corrected (e.g., ECC): no error • Error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE) • no protection: does bit matter? • yes: Silent Data Corruption (SDC) • no: benign fault, no error
Interleaving bits • [Figure: a row of physically adjacent bits; bits marked with the same symbol (X, +, /, 0) are covered by the same ECC code, bits marked with different symbols by different ECC codes] • Interleaving converts a spatial multi-bit error into multiple single-bit errors
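A small model shows why interleaving helps. It assumes that physical bit position p in a row belongs to ECC word p mod `ways`; the function name `errors_per_word` and this mapping are assumptions for illustration.

```python
from collections import Counter


def errors_per_word(start, burst, ways):
    """Count flipped bits per ECC word for a burst of adjacent physical bits.

    Physical bit position p is assumed to belong to ECC word p % ways.
    """
    return Counter(pos % ways for pos in range(start, start + burst))


# A strike that flips 3 adjacent bits, with 4-way interleaving:
interleaved = errors_per_word(start=5, burst=3, ways=4)
# The same strike with no interleaving (every bit in the same word):
plain = errors_per_word(start=5, burst=3, ways=1)
```

As long as the burst is no wider than the interleaving degree, each ECC word sees at most one flipped bit, which a single-error-correcting code can repair.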
Temporal Double Bit Errors • Two separate strikes on different bits of the same word [Figure: one strike at cycle 100, another at cycle 1,000,000] • SECDED ECC (single error correction, double error detection) • can detect a double-bit error, but cannot correct it • if errors accumulate • a correctable single-bit error becomes a double-bit error that is only detectable
Solutions for Temporal Double Bit Errors • Natural Effects • whenever a processor reads a cache block, we can correct the single bit error • check for errors when cache blocks are evicted from the cache • More Powerful ECC • SECDED ECC requires 8 bits per 64 bits • 7 bits for single bit correction • 8th bit for double bit detection • Overhead = 8/64 ≈ 13% • ECC with two bit correction requires 12 bits per 64 bits • Overhead = 12/64 ≈ 19%
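The SECDED figures can be reproduced from the Hamming bound: r check bits can correct a single error in m data bits when 2^r >= m + r + 1, and one extra parity bit adds double-error detection. The helper names below are hypothetical; the 12-bit figure for two-bit correction is taken from the slide, not derived.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2**r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """Single-error correction plus one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


sec_bits = hamming_check_bits(64)
secded_bits = secded_check_bits(64)
secded_overhead = secded_bits / 64
two_bit_overhead = 12 / 64  # 12 check bits per 64 data bits, as stated on the slide
```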
Scrubbing • Periodically read memory and correct all single bit errors • Prevents accumulation of temporal double bit errors • Standard technique in main memories (DRAMs)
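The effect of scrubbing on accumulated errors can be sketched with a simple event model. It assumes SECDED-protected words, strikes that each flip a different bit, and a scrub pass at every multiple of `scrub_period` cycles; the function name `uncorrectable_words` and the `(cycle, word)` strike format are hypothetical.

```python
def uncorrectable_words(strikes, scrub_period=None):
    """Return the set of words that accumulate two or more bit flips.

    `strikes` is a list of (cycle, word) pairs. With SECDED, one flip per
    word is correctable; a second flip before correction is only detectable.
    """
    flips = {}       # word -> number of uncorrected bit flips
    lost = set()
    last_cycle = 0
    for cycle, word in sorted(strikes):
        if scrub_period and cycle // scrub_period > last_cycle // scrub_period:
            # At least one scrub pass ran since the previous strike:
            # every single-bit error has been corrected.
            flips = {w: n for w, n in flips.items() if n >= 2}
        flips[word] = flips.get(word, 0) + 1
        if flips[word] >= 2:
            lost.add(word)
        last_cycle = cycle
    return lost


strikes = [(100, 7), (1_000_000, 7)]  # two strikes on word 7, far apart in time
without_scrub = uncorrectable_words(strikes)
with_scrub = uncorrectable_words(strikes, scrub_period=10_000)
```

Without scrubbing, the two strikes combine into an uncorrectable double-bit error; with a scrub pass in between, the first flip is repaired before the second one arrives.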
Single Event Latchup • SEL: Single Event Latchup • Parasitic circuit elements forming a silicon controlled rectifier (SCR) • Potentially destructive • the device current may destroy the device if it is not current limited and removed "in time" • Removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operations. • SEL probability increases with temperature!