Maximizing Hardware Fault Tolerance with Triple Modular Redundancy

Fault-Tolerant Computing Systems #2: Hardware Fault Tolerance
Pattara Leelaprute, Computer Engineering Department, Kasetsart University, pattara.l@ku.ac.th

Hardware Fault Tolerance
• Triple Modular Redundancy (TMR)*
  • Can mask the failure of one hardware unit
  • No explicit action (error detection, recovery, etc.) has to be performed when a fault occurs
* Proposed by von Neumann
[Figure: one input feeds three module replicas; a voting element takes the majority of their outputs to produce the output.]

Triple Modular Redundancy (1)
• Triple Modular Redundancy (TMR)
  • Suitable for transient faults
  • The voting element does not remove the faulty unit after an error occurs
  • Once one unit has failed, the reliability of the TMR system becomes lower than that of a simplex system
  • Ex. (0,1,1) = 1, (1,0,0) = 0, (1,0) = ???
[Figure: three modules feeding a majority voting element.]
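The reliability remark on this slide can be checked numerically. The following is a minimal sketch, assuming independent modules that each work with the same probability r and a voter that never fails (neither assumption is stated on the slide); the function names are illustrative.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least 2 of 3 independent modules work (perfect voter assumed)."""
    # Exactly three working: r^3; exactly two working: 3 * r^2 * (1 - r).
    return 3 * r**2 - 2 * r**3


def tmr_after_one_failure(r: float) -> float:
    """One module has already failed and stays connected: both others must work."""
    return r**2


r = 0.9
print("simplex:", r)
print("TMR, all units healthy:", tmr_reliability(r))
print("TMR, one unit already failed:", tmr_after_one_failure(r))
```

Under these assumptions the healthy TMR system beats a single module for r > 0.5, but after the first failure it needs two specific modules to survive, which is always worse than needing one.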

Triple Modular Redundancy (2)
• Bit-wise voting
  • Take the majority for each bit
• The voting element (voter) has to be a simple and highly reliable unit
• Tight synchronization is required
  • Single clock
• The generalization of TMR is N-modular redundancy (NMR)
[Figure: a voting element drawn with & and + gates; modules connected through voting elements.]
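Bit-wise majority voting over three module outputs can be sketched in a few lines. This is an illustrative software model, not the circuit from the slide; it assumes each module output is an integer of the same width, and `vote_bitwise` is a hypothetical name.

```python
def vote_bitwise(a: int, b: int, c: int) -> int:
    """Return the bit-wise majority of three words.

    A result bit is 1 when at least two of the three inputs have a 1
    in that position (AND each pair, then OR the pairs together).
    """
    return (a & b) | (b & c) | (a & c)


# Single-bit examples from the slide: (0,1,1) -> 1 and (1,0,0) -> 0.
print(vote_bitwise(0, 1, 1))
print(vote_bitwise(1, 0, 0))

# Multi-bit example: each module has a different single-bit error,
# yet the voted word equals the fault-free value 0b1011.
print(bin(vote_bitwise(0b1011 ^ 0b0001, 0b1011 ^ 0b0100, 0b1011 ^ 0b1000)))
```

Because the vote is taken per bit, the three modules may each be wrong in a different bit position and the output is still correct.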

Static Redundancy
• The effect of a faulty element (component, circuit, system) is immediately masked by permanently connected and continually operating replicas of the element.
• TMR and NMR are static redundancy schemes
• Key points
  • Permanently connected
  • Continually operating replicas
[Figure: modules and voting elements, with a fault marked on one element.]

Dynamic Redundancy
• When a fault is detected, the fault or its effect is subsequently corrected => Reconfiguration
• Consists of several units, with only one operating at a time
• The other units are just spares
[Figure: rows of modules with one operating unit and the others as spares; reconfiguration changes which module is the operating unit.]

Dynamic Redundancy
• Cold-Standby system
  • Only one unit is powered up and operational
  • Spares are not powered on -> they are still cold!
  • A faulty unit is replaced by turning off its power and powering up a spare
• Hot-Standby
  • All units operate simultaneously
  • Their outputs are then matched
  • If they are the same, one is selected arbitrarily
  • If not, the faulty unit is detected and the system is reconfigured
• Dual System
  • A matching circuit continuously compares the results of two units
  • How to detect the failed unit??
[Figure: two modules feeding a compare element.]
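The question raised for the dual system can be made concrete with a small model. This is only a sketch: the two units are modeled as plain Python callables and `dual_system` is a hypothetical helper, not something from the slides.

```python
from typing import Any, Callable, Optional, Tuple


def dual_system(unit_a: Callable[[Any], Any],
                unit_b: Callable[[Any], Any],
                x: Any) -> Tuple[Optional[Any], bool]:
    """Run both units on the same input and compare their results.

    Returns (output, mismatch). On a mismatch no output is released,
    because the comparison alone cannot tell which unit is faulty.
    """
    out_a, out_b = unit_a(x), unit_b(x)
    if out_a == out_b:
        return out_a, False
    return None, True


healthy = lambda x: x + 1
faulty = lambda x: x + 2  # stands in for a unit with a fault

print(dual_system(healthy, healthy, 3))
print(dual_system(healthy, faulty, 3))
```

The comparison detects that something is wrong but not where, so some additional diagnosis (for example a self-test of each unit) is needed before reconfiguring.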

Coding
• Code
  • One of the most important techniques for supporting fault-tolerant hardware
  • Codeword, non-codeword
  • Ex. 0001 = a, 0010 = b, 0011 = c
• Single Parity Check Code
  • Even parity
  • Odd parity

Single Parity Check
• Add a check bit (the "parity bit") to the information bits
• The total number of 1s in the codeword is always even or always odd
• Odd parity check
  • Parity bit is 1 iff the number of 1s in the data bits is even
• Even parity check
  • Parity bit is 1 iff the number of 1s in the data bits is odd
[Figure: two example codewords, each made of data bits and a parity bit; in one the number of 1s in the codeword is even, in the other it is odd.]

Parity Check
• How does it work? Ex. odd parity check
  • Sending side: the parity bit is 1 iff the number of 1s in the data bits is even
  • Receiving side: check whether the number of 1s in the codeword is odd
• A single-bit error can be detected.
• The error cannot be corrected (there is no way to locate the erroneous bit)
[Figure: a codeword (data bits and parity bit) on the sending side and two received examples, one marked x and one marked ok.]
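The encoding and checking rules for a single parity bit can be sketched as follows. This is an illustrative model in which bits are Python lists of 0/1 and the parity bit is appended at the end (an assumption about bit order); the function names are hypothetical.

```python
from typing import List


def parity_bit(data: List[int], odd: bool = True) -> int:
    """Parity bit that makes the total number of 1s odd (or even)."""
    ones = sum(data)
    if odd:
        return 1 if ones % 2 == 0 else 0
    return 1 if ones % 2 == 1 else 0


def encode(data: List[int], odd: bool = True) -> List[int]:
    """Append the parity bit to the data bits to form a codeword."""
    return data + [parity_bit(data, odd)]


def check(codeword: List[int], odd: bool = True) -> bool:
    """Return True if the received word has the expected parity."""
    return sum(codeword) % 2 == (1 if odd else 0)


sent = encode([0, 0, 0, 1, 1])        # odd parity -> [0, 0, 0, 1, 1, 1]
one_error = [1] + sent[1:]            # first bit flipped in transit
two_errors = [1, 1] + sent[2:]        # first two bits flipped

print(check(sent))        # accepted
print(check(one_error))   # rejected: single-bit error detected
print(check(two_errors))  # accepted: a double error goes unnoticed
```

The last case shows the limit of the scheme: an even number of bit errors leaves the parity unchanged, and even a detected error cannot be located.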

Parity Check (Advanced)
• Odd parity or even parity?
[Figure: a sending side and three receiving-side examples, worked through in Step 1 and Step 2.]

Coding (Hamming Distance)
• Minimum distance
  • The minimum Hamming distance between any pair of two different codewords
• Ex. Single Parity Check Code
  • Minimum distance = 2
  • A 1-bit error can be detected. Why?
• Detection: Td = d - 1
• Correction: Tc = floor((d - 1) / 2)
  • Td = number of bit errors that can be detected
  • Tc = number of bit errors that can be corrected
  • d = minimum distance
[Figure: codewords 0000 and 1111 at distance d, with non-codewords (0001, 0010, …, 0101, 0110, …, 1101, 0111, …) lying between them.]
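The link between minimum distance and detection/correction capability can be verified on small codes. A minimal sketch, assuming codewords are given as equal-length bit strings; the helper names are illustrative.

```python
from itertools import combinations, product
from typing import Iterable


def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code: Iterable[str]) -> int:
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(list(code), 2))


def detectable_errors(d: int) -> int:
    return d - 1            # Td = d - 1


def correctable_errors(d: int) -> int:
    return (d - 1) // 2     # Tc = floor((d - 1) / 2)


# Even-parity code over 3 data bits: data bits followed by the parity bit.
even_parity_code = ["".join(bits) + str(bits.count("1") % 2)
                    for bits in product("01", repeat=3)]

for name, code in [("single parity", even_parity_code),
                   ("{0000, 1111}", ["0000", "1111"])]:
    d = minimum_distance(code)
    print(name, "d =", d,
          "Td =", detectable_errors(d), "Tc =", correctable_errors(d))
```

For the single parity check code the minimum distance is 2: flipping one data bit also flips the parity bit, so any single-bit error lands on a non-codeword and is detected, but it cannot be corrected.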

Self-Checking
• Can detect faults by itself
• Ex. Self-Checking Parity Checker
  • If using odd parity:
    • Codewords (0, 1), (1, 0): error free
    • Non-codewords (0, 0), (1, 1): error (in A or B)
[Figure: a functional circuit supplies inputs x (x0 to x8) to a checker built from two blocks, A and B (labels x0 x2 x4 x6 and x1 x3 x5 x7); the checker outputs z1 and z2 form the error indication z.]

Self-Checking Circuit
• Fault-Secure
  • Even if a fault f ∈ F occurs, an incorrect codeword is never produced
• Self-Testing
  • When a fault f ∈ F occurs, there is an input that leads to a non-codeword output (which means the fault is detected)
• Totally Self-Checking
  • Fault-Secure + Self-Testing
• F = set of faults
[Figure: an input enters the circuit, which outputs a codeword or a non-codeword; a non-codeword means a fault.]

2-Rail Logic
• Does not use NOT gates
[Figure: two-rail gate circuits in which each signal x is carried on two rails (x1, x0); inputs (x1, x0) and (y1, y0), outputs (z1, z0).]

2-Rail Logic and Unidirectional Error
The effect of a fault on the output
• Unidirectional error (definition): all erroneous signals are of only one kind, either
  • errors in which 1 → 0 occurs, or
  • errors in which 0 → 1 occurs
• 2-Rail Logic
  • An incorrect codeword is never produced, e.g. (0,1) → (1,0) never occurs
  • However, a non-codeword may be produced
  • Fault-Secure
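The fault-secure behaviour of 2-rail logic can be demonstrated with a small simulation. This sketch uses the common dual-rail construction (a signal x is carried as the pair (x1, x0) with x0 the complement of x1; AND and OR are built from AND/OR gates only, and NOT is a swap of the two rails). It is assumed, not confirmed, that this matches the circuits drawn on the slides, and the fault model is limited to a single input rail stuck at 0 or 1.

```python
from itertools import product


def encode(bit):
    """Two-rail codeword (x1, x0) for a logic value: 1 -> (1, 0), 0 -> (0, 1)."""
    return (bit, 1 - bit)


def is_codeword(pair):
    return pair[0] != pair[1]


def and2(x, y):
    # True rail: AND of true rails; complement rail: OR of complement rails.
    return (x[0] & y[0], x[1] | y[1])


def or2(x, y):
    return (x[0] | y[0], x[1] & y[1])


def not2(x):
    # Negation needs no NOT gate: just swap the two rails.
    return (x[1], x[0])


def single_rail_faults_are_safe(gate):
    """Check every input and every single stuck-at fault on an input rail.

    The gate is fault-secure for this fault model if a faulty run never
    yields a codeword that differs from the fault-free output.
    """
    for a, b in product((0, 1), repeat=2):
        good = gate(encode(a), encode(b))
        for rail, stuck in product(range(4), (0, 1)):
            rails = list(encode(a) + encode(b))
            rails[rail] = stuck
            out = gate((rails[0], rails[1]), (rails[2], rails[3]))
            if is_codeword(out) and out != good:
                return False
    return True


print(single_rail_faults_are_safe(and2))
print(single_rail_faults_are_safe(or2))
```

Each output rail depends on only one rail of each input, so a single faulty rail can disturb at most one output rail; the result is then either still correct or one of the non-codewords (0,0) and (1,1).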

Disk Shadowing
Maintaining a set of identical disk images on several separate disk devices.
• Disk Mirroring
  • 2 disks
  • with 2 disk controllers
  • Write to both disks, read from either disk
• Tandem System
  • the first commercial fault-tolerant system
[Figure: two hosts, two disk controllers, and two disks.]

RAID (Redundant Array of Inexpensive Disks)
• Striping
  • Divide the storage area into several parts called stripes, then distribute those stripes across several disks
  • Load balancing between disks
  • to maximize throughput
• Fault tolerance can be implemented at low cost
[Figure: a controller with blocks D0 to D5 distributed across the disks.]

RAID-0 Striping
• Only striping, no redundancy
• Advantage
  • Good performance due to high data throughput
• Disadvantage
  • No fault tolerance
• Usable storage capacity = 100%
[Figure: a controller with three disks holding D0 D1 D2 in the first stripe and D3 D4 D5 in the second.]
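The striped layout in the figure (D0 D1 D2 on the first row, D3 D4 D5 on the second) corresponds to a simple round-robin mapping from block number to disk. A minimal sketch, assuming block-level striping with one block per disk per stripe; `locate_block` is a hypothetical name.

```python
from typing import Tuple


def locate_block(block: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical block number to (disk index, stripe row) for RAID-0."""
    return block % num_disks, block // num_disks


# Reproduce the figure for three disks.
for block in range(6):
    disk, row = locate_block(block, 3)
    print(f"D{block}: disk {disk}, row {row}")
```

Consecutive blocks land on different disks, which is what lets a large transfer keep all disks busy at once.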

RAID-1 Mirroring
• Writes all data to N disks
• Advantage
  • High fault tolerance (tolerates/masks the failure of N-1 disks)
  • Faster reads (compared to a single drive)
• Disadvantage
  • Slower writes (compared to a single drive)
  • Low utilization efficiency
• Usable storage capacity = 100/N %
[Figure: a controller with mirrored copies D0 D0 and D1 D1.]

RAID-4
• Adds one redundant parity disk (N disks in total)
• Advantage
  • Very good for reads (the same as RAID-0)
  • High utilization efficiency
  • Tolerates/masks the failure of 1 disk
• Disadvantage
  • Slow on writes (typically small random writes), due to the concentration of accesses on the parity disk
• Usable storage capacity = 100*(N-1)/N %
[Figure: a controller with three data disks and one parity disk; rows D0 D1 D2 P0~2 and D3 D4 D5 P3~5.]
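How one parity disk masks the loss of one data disk can be shown with byte-wise XOR. This is a sketch under the usual assumption that the parity block is the XOR of the data blocks in the same stripe (the slide does not spell out the parity function); the block contents are made-up example bytes.

```python
from functools import reduce
from operator import xor
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe: three data blocks and their parity block (P0~2).
d0, d1, d2 = b"\x0f\x10", b"\xf0\x01", b"\x55\xaa"
parity = xor_blocks([d0, d1, d2])

# If the disk holding d1 fails, XOR the survivors with the parity.
recovered_d1 = xor_blocks([d0, d2, parity])
print(recovered_d1 == d1)

# Small write: updating d0 alone still requires updating the parity block.
new_d0 = b"\x00\xff"
new_parity = xor_blocks([parity, d0, new_d0])
print(new_parity == xor_blocks([new_d0, d1, d2]))
```

The second part illustrates the write bottleneck: every small write, whichever data disk it targets, also has to touch the single parity disk.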

RAID-5
• Similar to RAID-4, but distributes the parity among the drives
• Advantage
  • Very good for reads and writes (even small random writes), because the parity disk is no longer a bottleneck
  • High utilization efficiency
  • Tolerates/masks the failure of 1 disk
• Disadvantage
  • Slower than RAID-4 on reads, because the parity data must be skipped on each drive during reads
• Usable storage capacity = 100*(N-1)/N %
[Figure: a controller with four disks; rows D0 D1 D2 P0~2 and D3 D4 P3~5 D5.]
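The rotating parity placement can be sketched as a function of the stripe number. The rotation rule below is inferred from the two rows shown in the figure (parity on the last disk for the first stripe, then moving one disk to the left per stripe); real arrays use several different layouts, so treat the rule and the name `raid5_stripe` as assumptions.

```python
from typing import List


def raid5_stripe(stripe: int, num_disks: int) -> List[str]:
    """Labels of the blocks stored on each disk for one RAID-5 stripe."""
    parity_disk = (num_disks - 1 - stripe) % num_disks
    next_block = stripe * (num_disks - 1)  # first data block of this stripe
    labels = []
    for disk in range(num_disks):
        if disk == parity_disk:
            labels.append("P")
        else:
            labels.append(f"D{next_block}")
            next_block += 1
    return labels


def usable_capacity_percent(num_disks: int) -> float:
    """Usable share of raw capacity: 100 * (N - 1) / N."""
    return 100 * (num_disks - 1) / num_disks


for stripe in range(4):
    print(raid5_stripe(stripe, 4))
print(usable_capacity_percent(4))
```

Over N consecutive stripes each disk holds the parity block exactly once, so parity updates are spread evenly instead of concentrating on one drive.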
