Chapter 3 Fault-Tolerant Design
What is this chapter about? • Gives Overview of Fault-Tolerant Design • Focus on • Basic Concepts in Fault-Tolerant Design • Metrics Used to Specify and Evaluate Dependability • Review of Coding Theory • Fault-Tolerant Design Schemes • Hardware Redundancy • Information Redundancy • Time Redundancy • Examples of Fault-Tolerant Applications in Industry
Fault-Tolerant Design • Introduction • Fundamentals of Fault Tolerance • Fundamentals of Coding Theory • Fault Tolerant Schemes • Industry Practices • Concluding Remarks
Introduction • Fault Tolerance • Ability of system to continue error-free operation in presence of unexpected fault • Important in mission-critical applications • E.g., medical, aviation, banking, etc. • Errors very costly • Becoming important in mainstream applications • Technology scaling causing circuit behavior to become less predictable and more prone to failures • Needing fault tolerance to keep failure rate within acceptable levels
Faults • Permanent Faults • Due to manufacturing defects, early life failures, wearout failures • Wearout failures due to various mechanisms • e.g., electromigration, hot carrier degradation, dielectric breakdown, etc. • Temporary Faults • Only present for short period of time • Caused by external disturbance or marginal design parameters
Temporary Faults • Transient Errors (Non-recurring errors) • Caused by external disturbance • e.g., radiation, noise, power disturbance, etc. • Intermittent Errors (Recurring errors) • Caused by marginal design parameters • Timing problems • e.g., races, hazards, skew • Signal integrity problems • e.g., crosstalk, ground bounce, etc.
Redundancy • Fault Tolerance requires some form of redundancy • Time Redundancy • Hardware Redundancy • Information Redundancy
Time Redundancy • Perform Same Operation Twice • See if get same result both times • If not, then fault occurred • Can detect temporary faults • Cannot detect permanent faults • Would affect both computations • Advantage • Little to no hardware overhead • Disadvantage • Impacts system or circuit performance
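The compare-two-runs idea above can be sketched in a few lines. This is a minimal illustration, not a production checker; `check_twice`, `make_transient_adder`, and `stuck_adder` are hypothetical names invented here to model a temporary fault and a permanent fault.

```python
def check_twice(operation, *args):
    """Run the operation twice and flag a fault if the results differ."""
    first = operation(*args)
    second = operation(*args)
    return first, first != second


def make_transient_adder(fail_on_call):
    """Adder whose result is corrupted on exactly one call (temporary fault)."""
    calls = {"n": 0}

    def add(a, b):
        calls["n"] += 1
        result = a + b
        if calls["n"] == fail_on_call:
            result ^= 1  # flip the least significant bit once
        return result

    return add


def stuck_adder(a, b):
    """Adder with its least significant output bit stuck at 1 (permanent fault)."""
    return (a + b) | 1
```

With these definitions, `check_twice(make_transient_adder(1), 2, 2)` reports a mismatch, while `check_twice(stuck_adder, 2, 2)` returns the same wrong value twice and reports no fault, which is why time redundancy alone misses permanent faults.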
Hardware Redundancy • Replicate hardware and compare outputs • From two or more modules • Detects both permanent and temporary faults • Advantage • Little or no performance impact • Disadvantage • Area and power for redundant hardware
Information Redundancy • Encode outputs with error detecting or correcting code • Code selected to minimize redundancy for class of faults • Advantage • Less hardware to generate redundant information than replicating module • Drawback • Added complexity in design
Failure Rate • $\lambda(t)$ = Component failure rate • Measured in FITs (failures per $10^9$ hours)
System Failure Rate • System constructed from components • No Fault Tolerance • Any component fails, whole system fails • System failure rate is sum of component failure rates: $\lambda_{sys} = \sum_i \lambda_i$
Reliability • If component working at time 0 • R(t) = Probability still working at time t • Exponential Failure Law • If failure rate assumed constant: $R(t) = e^{-\lambda t}$ • Good approximation if past infant mortality period
Reliability for Series System • Series System • All components need to work for system to work • $R_{sys}(t) = \prod_i R_i(t)$
System Reliability with Redundancy • System reliability with component B in Parallel • Can tolerate one component B failing
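A minimal sketch of the series and parallel calculations, assuming independent component failures and a duplicated component whose two copies are both active. The function names and the example values are illustrative assumptions, not taken from the slides.

```python
import math


def reliability(failure_rate, t):
    """Exponential failure law: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)


def series(*component_reliabilities):
    """All components must work, so reliabilities multiply."""
    return math.prod(component_reliabilities)


def parallel(*component_reliabilities):
    """The group fails only if every copy fails."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)
```

For hypothetical components with R_A = 0.9 and R_B = 0.8, `series(0.9, 0.8)` gives 0.72, while duplicating B gives `series(0.9, parallel(0.8, 0.8))`, about 0.864.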
Mean-Time-to-Failure (MTTF) • Average time before system fails • Equal to area under reliability curve: $MTTF = \int_0^\infty R(t)\,dt$ • For Exponential Failure Law: $MTTF = 1/\lambda$
Maintainability • If system failed at time 0 • M(t) = Probability repaired and operational at time t • System repair time divided into • Passive repair time • Time for service engineer to travel to site • Active repair time • Time to locate failing component, repair/replace, and verify system operational • Can be improved through designing system so easy to locate failed component and verify
Repair Rate and MTTR • $\mu$ = rate at which system repaired • Analogous to failure rate • Maintainability often modeled as $M(t) = 1 - e^{-\mu t}$ • Mean-Time-to-Repair (MTTR) = $1/\mu$
Availability • System Availability • Fraction of time system is operational • [Figure: timeline of system state S (1 or 0) over t0 to t4, showing normal system operation and failures]
Availability • Telephone Systems • Required to have system availability of 0.9999 (“four nines”) • High-Reliability Systems • May require 7 or more nines • Fault-Tolerant Design • Needed to achieve such high availability from less reliable components
Coding Theory • Coding • Using more bits than necessary to represent data • Provides way to detect errors • Errors occur when bits get flipped • Error Detecting Codes • Many types • Detect different classes of errors • Use different amounts of redundancy • Ease of encoding and decoding data varies
Block Code • Message = Data Being Encoded • Block code • Encodes m messages with n-bit codeword • If no redundancy • m messages encoded with $\lceil \log_2 m \rceil$ bits • minimum possible
Block Code • To detect errors, some redundancy needed • Space of $2^n$ distinct blocks partitioned into codewords and non-codewords • Can detect errors that cause codeword to become non-codeword • Cannot detect errors that cause codeword to become another codeword
Separable Block Code • Separable • n-bit blocks partitioned into • k information bits directly representing message • (n-k) check bits • Denoted (n,k) Block Code • Advantage • k-bit message directly extracted without decoding • Rate of Separable Block Code = k/n
Example of Separable Block Code • (4,3) Parity Code • Check bit is XOR of 3 message bits • message 101 → codeword 1010 • Single Bit Parity
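The parity example can be reproduced with a short sketch. Bit strings are used for readability and the function names are hypothetical.

```python
def parity_encode(message):
    """Append one check bit equal to the XOR of the message bits (even parity)."""
    check = 0
    for bit in message:
        check ^= int(bit)
    return message + str(check)


def parity_ok(codeword):
    """A received word is a codeword only if it has an even number of 1s."""
    return codeword.count("1") % 2 == 0
```

Here `parity_encode("101")` yields `"1010"`. A single flipped bit such as `"1110"` fails the check, but a double flip such as `"0110"` passes, since it turns one codeword into another.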
Example of Non-Separable Block Code • One-Hot Code • Each Codeword has single 1 • Example of 8-bit one-hot • 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001 • Redundancy = $1 - \log_2(8)/8 = 5/8$
Linear Block Codes • Special class • Modulo-2 sum of any 2 codewords also codeword • Null space of $(n-k) \times n$ Boolean matrix • Called Parity Check Matrix, H • For any n-bit codeword c • $cH^T = 0$ • All 0 codeword exists in any linear code
Linear Block Codes • Generator Matrix, G • $k \times n$ Matrix • Codeword c for message m • $c = mG$ • $GH^T = 0$
Systematic Block Code • First k-bits correspond to message • Last n-k bits correspond to check bits • For Systematic Code • $G = [I_{k \times k} : P_{k \times (n-k)}]$ • $H = [P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)}]$ • Example
Distance of Code • Distance between two codewords • Number of bits in which they differ • Distance of Code • Minimum distance between any two codewords in code • If n=k (no redundancy), distance = 1 • Single-bit parity, distance = 2 • Code with distance d • Detect up to d-1 errors • Correct up to $\lfloor (d-1)/2 \rfloor$ errors
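The distance of a small code can be checked by brute force. This sketch compares every pair of codewords, which is only practical for tiny codes; the function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capability(d):
    """(detectable errors, correctable errors) for a code of distance d."""
    return d - 1, (d - 1) // 2
```

Applied to all 3-bit words the distance is 1, and applied to the eight codewords of the (4,3) parity code it is 2, so single-bit parity detects one error and corrects none.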
Error Correcting Codes • Code with distance 3 • Called single error correcting (SEC) code • Code with distance 4 • Called single error correcting and double error detecting (SEC-DED) code • Procedure for constructing SEC code • Described in [Hamming 1950] • Any H-matrix with all columns distinct and no all-0 column is SEC
Hamming Code • For any value of n • SEC code constructed by • setting each column in H equal to binary representation of column number (starting from 1) • Number of rows in H equal to $\lceil \log_2(n+1) \rceil$ • Example of SEC Hamming Code for n=7
Error Correction in Hamming Code • Syndrome, s • $s = Hv^T$ for received vector v • If v is codeword • Syndrome = 0 • If v non-codeword and single-bit error • Syndrome will match one of columns of H • Will contain binary value of bit position in error
Example of Error Correction • For (7,4) Hamming Code • Suppose codeword 0110011 has one-bit error changing it to 1110011
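The correction step can be traced with a short sketch, assuming the 7-bit Hamming code whose H-matrix column i is the binary representation of i, with bit 1 as the leftmost bit. Under that assumption the syndrome is simply the XOR of the positions that hold a 1. The function names are hypothetical.

```python
def hamming_syndrome(word):
    """XOR of the 1-based positions holding a 1 (equals H v^T for this H)."""
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit == "1":
            s ^= position
    return s


def correct_single_error(word):
    """Flip the bit named by a nonzero syndrome; assumes at most one error."""
    s = hamming_syndrome(word)
    if s == 0:
        return word
    i = s - 1
    flipped = "0" if word[i] == "1" else "1"
    return word[:i] + flipped + word[i + 1:]
```

For the codeword 0110011 the syndrome is 0. For the received word 1110011 the syndrome is 1, pointing at bit position 1, and flipping that bit restores 0110011.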
SEC-DED Code • Make SEC Hamming Code SEC-DED • By adding parity check over all bits • Extra parity bit • 1 for single-bit error • 0 for double-bit error • Makes possible to detect double bit error • Avoid assuming single-bit error and miscorrecting it
Example of Error Correction • For (7,4) SEC-DED Hamming Code • Suppose codeword 0110011 has two-bit error changing it to 1010011 • Doesn’t match any column in H
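A sketch of the SEC-DED decision logic for this example. It assumes the overall parity check is an all-1 row added to a 7-bit H-matrix whose remaining rows give the binary column number, with bit 1 leftmost. The slide's own H-matrix is not reproduced in the text, so this layout is an assumption.

```python
def secded_decode(word):
    """Classify a received word as ok, corrected, or double_error."""
    overall_parity = word.count("1") % 2
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit == "1":
            s ^= position
    if overall_parity == 0 and s == 0:
        return "ok", word
    if overall_parity == 1 and 1 <= s <= len(word):
        # Odd overall parity: treat as a single-bit error at position s.
        i = s - 1
        flipped = "0" if word[i] == "1" else "1"
        return "corrected", word[:i] + flipped + word[i + 1:]
    if overall_parity == 0:
        # Even overall parity with a nonzero syndrome: two bits flipped.
        return "double_error", None
    return "uncorrectable", None
```

The received word 1010011 has even overall parity but a nonzero syndrome, so it is flagged as a double error instead of being miscorrected.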
Hsiao Code • Weight of column • Number of 1’s in column • Constructing n-bit SEC-DED Hsiao Code • First use all possible weight-1 columns • Then all possible weight-3 columns • Then weight-5 columns, etc. • Until n columns formed • Number of check bits is $\lceil \log_2 n \rceil + 1$ • Minimizes number of 1’s in H-matrix • Less hardware and delay for computing syndrome • Disadvantage: Correction logic more complex
Example of Hsiao Code • (7,3) Hsiao Code • Uses weight-1 and weight-3 columns
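The column-selection rule can be sketched as follows. This takes odd-weight columns in simple lexicographic order. Practical Hsiao constructions also balance the number of 1s per row, which this sketch does not attempt. `hsiao_columns` is a hypothetical name.

```python
from itertools import combinations


def hsiao_columns(n, r):
    """Pick n distinct odd-weight columns of height r, lowest weights first."""
    columns = []
    for weight in range(1, r + 1, 2):
        for ones in combinations(range(r), weight):
            columns.append(tuple(int(i in ones) for i in range(r)))
            if len(columns) == n:
                return columns
    raise ValueError("r check bits cannot supply n odd-weight columns")
```

With n = 7 and r = 4 this selects the four weight-1 columns followed by three weight-3 columns, matching the (7,3) example.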
Unidirectional Errors • Errors in block of data which only cause 0→1 or 1→0, but not both • Any number of bits in error in one direction • Example • Correct codeword 111000 • Unidirectional errors could cause • 001000, 000000, 101000 (only 1→0 errors) • Non-unidirectional errors • 101001, 011001, 011011 (both 1→0 and 0→1)
Unidirectional Error Detecting Codes • All unidirectional error detecting (AUED) Codes • Detect all unidirectional errors in codeword • Single-bit parity is not AUED • Cannot detect even number of errors • No linear code is AUED • All linear codes must contain all-0 vector, so cannot detect all 1→0 errors
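The distinction can be checked mechanically. This sketch counts the flips in each direction; `classify_error` is a hypothetical helper.

```python
def classify_error(sent, received):
    """Label the error pattern between a sent and a received word."""
    drops = sum(s == "1" and r == "0" for s, r in zip(sent, received))
    rises = sum(s == "0" and r == "1" for s, r in zip(sent, received))
    if drops == 0 and rises == 0:
        return "none"
    if drops > 0 and rises > 0:
        return "bidirectional"
    return "unidirectional"
```

For the codeword 111000, the received words 001000, 000000, and 101000 are classified as unidirectional, while 101001, 011001, and 011011 are classified as bidirectional.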
Two-Rail Code • Two-Rail Code • One check bit for each information bit • Equal to complement of information bit • Two-Rail Code is AUED • 50% Redundancy • Example of (6,3) Two-Rail Code • Message 101 has Codeword 101010 • Set of all codewords • 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000
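A minimal sketch of two-rail encoding and checking, with the check bits placed after the information bits as in the example. The function names are hypothetical.

```python
def two_rail_encode(info):
    """Append the bitwise complement of the information bits."""
    return info + "".join("1" if bit == "0" else "0" for bit in info)


def two_rail_ok(codeword):
    """Each information bit must differ from its paired check bit."""
    k = len(codeword) // 2
    return all(a != b for a, b in zip(codeword[:k], codeword[k:]))
```

`two_rail_encode("101")` gives `"101010"`. Any unidirectional error leaves at least one information/check pair equal, so a word such as `"001010"` fails the check.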
Berger Codes • Lowest redundancy of separable AUED codes • For k information bits, $\lceil \log_2(k+1) \rceil$ check bits • Check bits equal to binary representation of number of 0’s in information bits • Example • Information bits 1000101 • $\log_2(7+1) = 3$ check bits • Check bits equal to 100 (4 zeros)
Berger Codes • Codewords for (5,3) Berger Code • 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100 • If unidirectional errors • Contain 1→0 errors • increase 0’s in information bits • can only decrease binary number in check bits • Contain 0→1 errors • decrease 0’s in information bits • can only increase binary number in check bits
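The Berger construction can be sketched directly from its definition. The number of check bits is taken as `k.bit_length()`, which equals $\lceil \log_2(k+1) \rceil$ for k ≥ 1. The function names are hypothetical.

```python
def berger_encode(info):
    """Append the count of 0s in the information bits, written in binary."""
    k = len(info)
    check_bits = k.bit_length()  # ceil(log2(k + 1))
    zeros = info.count("0")
    return info + format(zeros, "0{}b".format(check_bits))


def berger_ok(codeword, k):
    """Check that the check field matches the number of 0s in the first k bits."""
    info, check = codeword[:k], codeword[k:]
    return info.count("0") == int(check, 2)
```

Encoding all eight 3-bit messages reproduces the (5,3) codeword list above, and `berger_encode("1000101")` appends the check bits 100. A 1→0 error in the information bits, such as 10010 becoming 00010, raises the zero count without raising the check value, so the mismatch is detected.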
Berger Codes • If 8 information bits • Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits • (16,8) Two-Rail Code • Requires 50% redundancy • Redundancy advantage of Berger Code • Increases as k increased
Constant Weight Codes • Constant Weight Codes • Non-separable, but lower redundancy than Berger • Each codeword has same number of 1’s • Example 2-out-of-3 constant weight code • 110, 011, 101 • AUED code • Unidirectional errors always change number of 1’s
Constant Weight Codes • Number of codewords in m-out-of-n code is $\binom{n}{m}$ • Codewords maximized when m as close to n/2 as possible • n/2-out-of-n when n even • ((n-1)/2 or (n+1)/2)-out-of-n when n odd • Minimizes redundancy of code
Example • 6-out-of-12 constant weight code • $\binom{12}{6} = 924$ codewords • 12-bit Berger Code • Only $2^8 = 256$ codewords
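The codeword counts in this comparison can be reproduced with a short sketch. `berger_count` assumes the n bits are split into k information bits plus the check bits a Berger code needs for them. Both function names are hypothetical.

```python
from math import comb


def constant_weight_count(m, n):
    """Number of n-bit words with exactly m ones."""
    return comb(n, m)


def berger_count(n):
    """Codewords in the largest Berger code that fits in n bits."""
    for k in range(n, 0, -1):
        if k + k.bit_length() <= n:  # k info bits + ceil(log2(k+1)) check bits
            return 2 ** k
    return 0
```

For n = 12 the 6-out-of-12 code has 924 codewords, against 256 for the Berger code with 8 information bits and 4 check bits.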
Constant Weight Codes • Advantage • Less redundancy than Berger codes • Disadvantage • Non-separable • Need decoding logic • to convert codeword back to binary message
Burst Error • Burst Error • Common, multi-bit errors tend to be clustered • Noise source affects contiguous set of bus lines • Length of burst error • number of bits between first and last error • Wrap around from last to first bit of codeword • Example: Original codeword 00000000 • 00111100 is burst error length 4 • 00110100 is burst error length 4 • Any number of errors between first and last error
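Burst length with wrap-around can be computed as the shortest cyclic window that covers every erroneous bit. This is a sketch of that reading of the definition, counting the first and last erroneous bits inclusively. `burst_length` is a hypothetical name.

```python
def burst_length(sent, received):
    """Length of the shortest cyclic span containing all bit errors."""
    n = len(sent)
    errors = [i for i in range(n) if sent[i] != received[i]]
    if not errors:
        return 0
    best = n
    for idx, start in enumerate(errors):
        # A window starting at this error must reach the error that
        # cyclically precedes it in order to cover all of them.
        last = errors[idx - 1]
        best = min(best, (last - start) % n + 1)
    return best
```

Against the all-0 codeword, both 00111100 and 00110100 give a burst length of 4, and 10000001 gives 2 because the burst wraps from the last bit to the first.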
Cyclic Codes • Special class of linear code • Any codeword shifted cyclically is another codeword • Used to detect burst errors • Less redundancy required to detect burst error than general multi-bit errors • Some distance 2 codes can detect all burst errors of length 4 • detecting all possible 4-bit errors requires distance 5 code