
Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance

Pattara Leelaprute, Computer Engineering Department, Kasetsart University, pattara.l@ku.ac.th. Hardware Fault Tolerance: Triple Modular Redundancy (TMR) can mask the failure of one hardware unit.


Presentation Transcript


  1. Fault-Tolerant Computing Systems #2: Hardware Fault Tolerance
     Pattara Leelaprute, Computer Engineering Department, Kasetsart University, pattara.l@ku.ac.th

  2. Hardware Fault Tolerance
     • Triple Modular Redundancy (TMR)*
       • Can mask the failure of one hardware unit
       • No explicit action (error detection, recovery, etc.) has to be taken when a fault occurs
     *Proposed by von Neumann
     [Figure: the input feeds three replica modules; a voting element takes the majority of their outputs to produce the output]

  3. Triple Modular Redundancy (1)
     • Triple Modular Redundancy (TMR)
       • Suitable for transient faults
       • The voting element does not remove the faulty unit after an error occurs
       • Once one module has failed, the reliability of the TMR system is lower than that of a simplex system
       • Ex. (0,1,1) = 1, (1,0,0) = 0, (1,0) = ???
     [Figure: input, three modules, voting element (majority vote), output]
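The reliability claim on this slide can be checked with a small calculation. The sketch below is not from the slides: it assumes a perfect voter, independent module failures, and a single per-module reliability `r`, and the function names are illustrative only.

```python
def tmr_reliability(r: float) -> float:
    """Fault-free TMR: the system works if at least 2 of 3 modules work.

    Assumes a perfect voter and independent modules, each with reliability r.
    """
    return 3 * r**2 - 2 * r**3


def degraded_tmr_reliability(r: float) -> float:
    """TMR after one module has already failed and was not removed.

    Both remaining modules must work to keep a correct majority.
    """
    return r**2


if __name__ == "__main__":
    r = 0.9
    print("simplex:     ", r)
    print("TMR:         ", tmr_reliability(r))
    print("degraded TMR:", degraded_tmr_reliability(r))
```

Under these assumptions, `r**2 < r` for any `0 < r < 1`, which is why a TMR system that keeps a failed module in the vote ends up less reliable than a single module.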

  4. Triple Modular Redundancy (2)
     • Bit-wise voting
       • Take the majority for each bit
     • The voting element has to be a simple and highly reliable unit
     • Tight synchronization is required
       • Single clock
     • The generalization of TMR is N-modular redundancy (NMR)
     [Figure: N modules feeding a voting element (voter); the voter is drawn with AND (&) and OR (+) gates]
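Bit-wise majority voting can be sketched in a few lines. This is an illustrative software model, not the hardware voter from the slides; the function names and the `width` parameter are assumptions made for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bit-wise majority of three words: each output bit is 1 iff at least two inputs have a 1."""
    return (a & b) | (b & c) | (a & c)


def nmr_vote(words: list[int], width: int) -> int:
    """Bit-wise majority over N words (N is assumed odd so a majority always exists)."""
    n = len(words)
    result = 0
    for bit in range(width):
        ones = sum((w >> bit) & 1 for w in words)
        if 2 * ones > n:  # strict majority of 1s at this bit position
            result |= 1 << bit
    return result


if __name__ == "__main__":
    # One replica (the third) has a corrupted top bit; the voter masks it.
    print(bin(majority_vote(0b1011, 0b1011, 0b0011)))
```

The three-input expression mirrors the AND/OR structure of a hardware voter, while `nmr_vote` shows the same idea generalized to N replicas.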

  5. Static Redundancy
     • The effect of a faulty element (component, circuit, system) is immediately masked by permanently connected and continually operating replicas of the element.
     • TMR and NMR are static redundancy schemes
     • Key points
       • Permanently connected
       • Continually operating replicas
     [Figure: replicated modules feeding voting elements, with a fault marked in one module]

  6. Dynamic Redundancy
     • When a fault is detected, the fault or its effect is subsequently corrected => reconfiguration
     • Consists of several units, with only one operating at a time
     • The other units are just spares
     [Figure: one operating module and several spare modules, before and after reconfiguration]

  7. Dynamic Redundancy
     • Cold-standby system
       • Only one unit is powered up and operational
       • Spares are not powered on -> they are still cold
       • A faulty unit is replaced by turning off its power and powering up a spare
     • Hot-standby system
       • All units operate simultaneously
       • Their outputs are then compared
       • If they are the same, one is selected arbitrarily
       • If not, the faulty unit is detected and the system is reconfigured
     • Dual system
       • A matching circuit continuously compares the results of the two units
       • How is the failed unit detected?
     [Figure: two modules whose outputs feed a comparator]

  8. Coding
     • Code
       • One of the most important techniques for supporting fault-tolerant hardware
       • Codeword, non-codeword
       • Ex. 0001 = a, 0010 = b, 0011 = c
     • Single parity check code
       • Even parity
       • Odd parity

  9. Single Parity Check
     • Add a check bit, the "parity bit", to the information bits
     • The total number of 1s in the codeword is always even or always odd
     • Odd parity check
       • The parity bit is 1 iff the number of 1s in the data bits is even
     • Even parity check
       • The parity bit is 1 iff the number of 1s in the data bits is odd
     [Figure: two example codewords, each made of data bits plus a parity bit, one with an even number of 1s and one with an odd number of 1s; bit values as extracted: 1 0 1 1 0 0 1 1 0 0 1 1]

  10. Parity Check
     • How does it work? Example with an odd parity check
       • Sending side: the parity bit is 1 iff the number of 1s in the data bits is even
       • Receiving side: check whether the number of 1s in the codeword is odd
     • The occurrence of a one-bit error can be detected
     • An error cannot be corrected (there is no way to locate the erroneous bit)
     [Figure: a sent codeword (data bits plus parity bit) and two received examples, one accepted ("ok?") and one rejected ("x"); bit values as extracted: 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1]
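The sending-side and receiving-side rules can be sketched as follows. This is a minimal illustration using Python lists of 0/1 values; the function names and the example data word are assumptions, not taken from the slides.

```python
def parity_bit(data_bits: list[int], odd: bool = True) -> int:
    """Return the check bit that makes the total number of 1s odd (or even)."""
    ones = sum(data_bits)
    if odd:
        return 1 if ones % 2 == 0 else 0
    return 1 if ones % 2 == 1 else 0


def encode_parity(data_bits: list[int], odd: bool = True) -> list[int]:
    """Sending side: append the parity bit to the data bits."""
    return list(data_bits) + [parity_bit(data_bits, odd)]


def check_parity(codeword: list[int], odd: bool = True) -> bool:
    """Receiving side: True if the number of 1s has the expected parity."""
    return sum(codeword) % 2 == (1 if odd else 0)


if __name__ == "__main__":
    sent = encode_parity([0, 0, 0, 1, 1], odd=True)
    received = sent.copy()
    received[0] ^= 1  # inject a single-bit error
    print(check_parity(sent), check_parity(received))
```

Flipping any single bit changes the parity and is detected; flipping two bits restores the parity, so a double error passes the check unnoticed, and in neither case does the check say which bit is wrong.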

  11. Parity Check (Advanced)
     • Odd parity or even parity?
     [Figure: sending side (Step 1, Step 2) and three receiving-side examples]

  12. Coding (Hamming Distance)
     • Minimum distance
       • The smallest Hamming distance between any pair of two different codewords
     • Ex. Single parity check code
       • Minimum distance = 2
       • A 1-bit error can be detected. Why?
     • With d = minimum distance:
       • Td = number of bit errors that can be detected = d - 1
       • Tc = number of bit errors that can be corrected = ⌊(d - 1)/2⌋
     [Figure: codewords 0000 and 1111 at distance d, with non-codewords between them (0001, 0010, …, 0101, 0110, …, 1101, 0111, …) and the detection and correction ranges marked]
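The relation between minimum distance and detection/correction capability can be worked through with a short sketch. The helper names below are illustrative; the two example codes are the 4-bit even-parity code and the two-word code {0000, 1111} that appears in the slide's figure.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code: list[str]) -> int:
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


def capabilities(d: int) -> tuple[int, int]:
    """Return (Td, Tc): detectable and correctable bit errors for minimum distance d."""
    return d - 1, (d - 1) // 2


if __name__ == "__main__":
    even_parity = [format(i, "04b") for i in range(16)
                   if format(i, "04b").count("1") % 2 == 0]
    for code in (even_parity, ["0000", "1111"]):
        d = minimum_distance(code)
        print(d, capabilities(d))
```

For the single parity check code, d = 2 gives Td = 1 and Tc = 0: any single-bit error lands on a non-codeword and is detected, but it is equally close to several codewords, so it cannot be corrected.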

  13. Self-Checking
     • Can detect faults by itself
     • Ex. Self-checking parity checker
     • If using odd parity:
       • Codewords (0, 1), (1, 0): error free
       • Non-codewords (0, 0), (1, 1): error (in A or B)
     [Figure: a functional circuit supplies inputs x0 to x8 to a checker made of two blocks, A and B, whose outputs z1 and z2 form the error indication z]
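A software model of such a checker might look like the sketch below. It assumes that blocks A and B are XOR trees, that A receives x0, x2, x4, x6 and B receives x1, x3, x5, x7, x8; this split is one reading of the figure and is not confirmed by the transcript. Function names are illustrative.

```python
from functools import reduce
from operator import xor


def parity_checker(bits: list[int]) -> tuple[int, int]:
    """Return (z1, z2) for a 9-bit word x0..x8.

    Assumed split: block A XORs x0, x2, x4, x6; block B XORs x1, x3, x5, x7, x8.
    """
    if len(bits) != 9:
        raise ValueError("expected 9 bits x0..x8")
    z1 = reduce(xor, (bits[i] for i in (0, 2, 4, 6)))
    z2 = reduce(xor, (bits[i] for i in (1, 3, 5, 7, 8)))
    return z1, z2


def indicates_error(z: tuple[int, int]) -> bool:
    """With odd parity, (0, 1) and (1, 0) are codewords; (0, 0) and (1, 1) signal an error."""
    return z[0] == z[1]


if __name__ == "__main__":
    word = [1, 0, 1, 1, 0, 0, 1, 0, 1]  # five 1s: a valid odd-parity word
    print(parity_checker(word), indicates_error(parity_checker(word)))
```

Because an odd-parity word has an overall XOR of 1, the two partial XORs must differ, so the checker output is itself a two-bit codeword whenever the input is valid.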

  14. Self-Checking Circuit
     • Fault-secure
       • Even if a fault f ∈ F occurs, an incorrect codeword is never produced
     • Self-testing
       • When a fault f ∈ F occurs, there is some input that leads to a non-codeword output (which means the fault is detected)
     • Totally self-checking
       • Fault-secure + self-testing
     • F = set of faults
     [Figure: a circuit takes an input and outputs a codeword or a non-codeword; a non-codeword means a fault]

  15. 2-Rail Logic
     • Do not use NOT gates
     [Figure: two-rail gate diagrams with input rail pairs (x1, x0), (y1, y0) and output rails (z1, z0)]

  16. 2-Rail Logic and Unidirectional Error
     • The effect of a fault on the output
     • Unidirectional error (definition): all erroneous signals are of only one kind, either
       • errors in which 1 → 0 occurs, or
       • errors in which 0 → 1 occurs
     • 2-rail logic
       • An incorrect codeword is never produced, e.g. (0,1) → (1,0) never occurs
       • However, a non-codeword may be produced
       • Fault-secure
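The fault-secure behavior of 2-rail logic can be illustrated with a small model. The sketch assumes the common encoding in which a value x is carried as the pair (x1, x0) = (x, NOT x), and models a fault as one input rail stuck at 0 or 1; the helper names are hypothetical and the gate equations are the standard dual-rail forms, not copied from the slide figure.

```python
def to_rails(x: int) -> tuple[int, int]:
    """Encode a bit as (x1, x0) = (x, complement of x)."""
    return (x, 1 - x)


def is_codeword(p: tuple[int, int]) -> bool:
    """Valid 2-rail pairs are (0, 1) and (1, 0)."""
    return p[0] != p[1]


def rail_not(x):
    """NOT is just a swap of the two rails, so no inverter gate is needed."""
    return (x[1], x[0])


def rail_and(x, y):
    """AND on the true rails, OR on the complement rails."""
    return (x[0] & y[0], x[1] | y[1])


def rail_or(x, y):
    """OR on the true rails, AND on the complement rails."""
    return (x[0] | y[0], x[1] & y[1])


if __name__ == "__main__":
    # x = 1 with its true rail stuck at 0 becomes the non-codeword (0, 0).
    faulty_x = (0, 0)
    print(rail_and(faulty_x, to_rails(1)))  # non-codeword: fault is visible
    print(rail_and(faulty_x, to_rails(0)))  # correct codeword for 1 AND 0
```

A single stuck rail can only disturb one output rail, so the output is either still correct or a non-codeword; it never flips to the opposite valid pair.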

  17. Disk Shadowing
     • Maintaining a set of identical disk images on several separate disk devices
     • Disk mirroring
       • 2 disks
       • with 2 disk controllers
       • Write to both disks, read from either disk
     • Tandem system
       • The first commercial fault-tolerant system
     [Figure: two hosts connected to two disk controllers, each controller attached to the disks]

  18. RAID (Redundant Array of Inexpensive Disks)
     • Striping
       • Divide the storage area into several parts called stripes, then distribute those stripes across several disks
       • Load balancing between disks
       • Maximizes throughput
     • Fault tolerance can be implemented at low cost
     [Figure: a controller with blocks D0 to D5 distributed over the disks]

  19. RAID-0 (Striping)
     • Only striping, no redundancy
     • Advantage
       • Good performance due to high data throughput
     • Disadvantage
       • No fault tolerance
     • Usable storage capacity percentage = 100%
     [Figure: controller with stripes D0 D1 D2 / D3 D4 D5 spread over three disks]

  20. RAID-1 (Mirroring)
     • All data is written to N disks
     • Advantage
       • High fault tolerance (tolerates/masks the failure of N-1 disks)
       • Faster reads (compared to a single drive)
     • Disadvantage
       • Slower writes (compared to a single drive)
       • Low utilization efficiency
     • Usable storage capacity percentage = 100/N %
     [Figure: controller with each block (D0, D1) stored on both disks]

  21. RAID-4
     • Adds one redundant parity disk to the N disks
     • Advantage
       • Very good for reads (the same as RAID-0)
       • High utilization efficiency
       • Tolerates/masks the failure of 1 disk
     • Disadvantage
       • Slow writes (typically small random writes), due to the concentration of accesses on the parity disk
     • Usable storage capacity percentage = 100*(N-1)/N %
     [Figure: controller with stripes D0 D1 D2 P0~2 / D3 D4 D5 P3~5; the parity blocks all sit on the last disk]
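How a parity disk lets the array survive one disk failure can be sketched with byte-wise XOR. This is a simplified model of a single stripe, assuming XOR parity and equal-sized blocks; the function names are illustrative and no real RAID controller API is implied.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks; used both to compute parity and to rebuild."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the one missing data block from the surviving blocks plus parity."""
    return xor_blocks(list(surviving) + [parity])


if __name__ == "__main__":
    d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
    p = xor_blocks([d0, d1, d2])          # parity block P0~2
    print(rebuild_block([d0, d2], p) == d1)
```

The same model hints at the write bottleneck: every update to any data block also requires an update to the parity block, and in RAID-4 all parity blocks live on one disk.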

  22. RAID-5
     • Similar to RAID-4, but distributes the parity among the drives
     • Advantage
       • Very good for reads and writes (even small random writes), since the parity disk is no longer a bottleneck
       • High utilization efficiency
       • Tolerates/masks the failure of 1 disk
     • Disadvantage
       • Slower than RAID-4 on reads, since parity data must be skipped on each drive during reads
     • Usable storage capacity percentage = 100*(N-1)/N %
     [Figure: controller with stripes D0 D1 D2 P0~2 / D3 D4 P3~5 D5; the parity block moves to a different disk in each stripe]
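The capacity formulas from the RAID slides, and the rotating parity placement of RAID-5, can be sketched as below. The rotation rule shown (parity moves one disk to the left per stripe) is one common layout that matches the two stripes visible in the slide's figure; it is an assumption, as real implementations use several different rotations. Function names are illustrative.

```python
def usable_capacity_percent(level: int, n: int) -> float:
    """Usable storage as a percentage of raw capacity for N disks, per the slides' formulas."""
    if level == 0:
        return 100.0
    if level == 1:
        return 100.0 / n
    if level in (4, 5):
        return 100.0 * (n - 1) / n
    raise ValueError("only RAID levels 0, 1, 4 and 5 are covered here")


def raid5_parity_disk(stripe: int, n: int) -> int:
    """Disk index holding the parity block of a stripe (assumed left-rotating layout)."""
    return (n - 1 - stripe) % n


if __name__ == "__main__":
    for level, n in ((0, 3), (1, 2), (4, 4), (5, 4)):
        print(f"RAID-{level} with {n} disks: {usable_capacity_percent(level, n)}%")
    print([raid5_parity_disk(s, 4) for s in range(5)])
```

With four disks, RAID-4 and RAID-5 both leave 75% of the raw capacity usable; the difference is only where the parity blocks are stored.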
