This article provides a historical perspective on fault tolerance in IBM S/390 Parallel Enterprise Server G5. It discusses terms such as concurrent error detection and repair, in-line error checking, on-line error correction, transient faults, and hard faults. The article also explores the evolution of fault tolerance mechanisms in microprocessor and memory subsystems.
IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective by L. Spainhower & T.A. Gregg • Presented by Mahmut Yilmaz
Some Terms • Concurrent error detection & repair: The system finds errors & repairs itself while still running • In-line error checking: EDC, ECC • On-line error correction: Correct error while the system can still operate • Transient (soft) faults: Temporary faults or bit flips like Single Event Upsets • Hard faults: Persistent faults that remain active for a significant period of time (forever?)
Background • S/390 failure modes • Permanent, intermittent and transient faults • If an error recurs often enough to reach a threshold → the fault is treated as permanent • Thermal Conduction Module (TCM) • TCM: A liquid cooling method introduced by IBM – A series of spring-loaded cylinders conducts the heat from the chips to the cooling chamber • Circuit growth rates exceed reliability gains • Parity check and ECC were used • Circuits were encapsulated • System repair required all system resources • Most repairs were concurrent
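The threshold rule on the Background slide can be sketched as a per-component error counter. This is only an illustration: the class name `FaultClassifier`, the component labels, and the threshold value of 3 are assumptions, not values from the paper.

```python
from collections import Counter


class FaultClassifier:
    """Counts errors per component (hypothetical helper, not IBM's logic)."""

    def __init__(self, threshold=3):
        # threshold is an assumed value chosen for the example
        self.threshold = threshold
        self.counts = Counter()

    def record_error(self, component):
        """Record one detected error and classify the underlying fault."""
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            return "permanent"  # recurring error: schedule repair or sparing
        return "transient"      # isolated error: recover and continue


classifier = FaultClassifier(threshold=3)
history = [classifier.record_error("chip0") for _ in range(3)]
```

Here the first two errors on `chip0` are classified as transient and the third as permanent, while errors on other components are counted separately.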
Background (cont.) • CMOS • G1 (1994) to G5 • G1: Less reliable than 9020 • System failures are more probable • G2: Dynamic memory sparing • G3: More robust ECC & CPU sparing (manual replacement) • G4: Concurrent CPU sparing & CPU instruction-level retry • G5: Most reliable • Greatly exceeds any TCM system • Well protected against soft faults (hard faults?)
Microprocessor Fault Tolerant Design • Duplication is used by several systems • Intel, Himalaya systems • Duplication requires more than 100% hardware overhead • Error detection only! • Fetch-decode (I-Unit) and execute (E-Unit) are generally not protected • S/390 protects them • Transient fault rates are increasing with decreasing feature sizes
Microprocessor Fault Tolerant Design (cont.) • G5 Fault Tolerant Design Point • 9X2: Main goal is to keep CPI low • G5: Main goal is to keep clock period short • In-line error protection is not suitable for G5: • High fan-out/fan-in • Increased chip area • Longer wires • Increased path length • Result: Duplicated I-unit and E-unit • A DIVA-like checker: the R-unit • Total hardware overhead: 35% • No performance penalty (?)
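The duplicate-and-compare idea on this slide can be sketched as follows. This is a software analogy under stated assumptions, not the G5 hardware design: the function names (`run_checked`, `execute`, `faulty_execute`), the dictionary used as architected state, and the `(register, value)` instruction format are all hypothetical.

```python
def execute(state, instruction):
    """Stand-in for one I-unit/E-unit copy: add a value to a register."""
    register, value = instruction
    new_state = dict(state)
    new_state[register] = new_state.get(register, 0) + value
    return new_state


def faulty_execute(state, instruction):
    """Same unit with an injected single-bit flip in its result."""
    new_state = execute(state, instruction)
    new_state[instruction[0]] ^= 1
    return new_state


def run_checked(unit_a, unit_b, checkpoint, instruction):
    """Run both copies and commit the result only if they agree.

    Plays the role the slide assigns to the R-unit: on a mismatch the
    checkpointed state is kept so the instruction can be retried.
    """
    result_a = unit_a(checkpoint, instruction)
    result_b = unit_b(checkpoint, instruction)
    if result_a != result_b:
        return checkpoint, False  # error detected, nothing committed
    return result_a, True         # results agree, new checkpoint


good_state, good_ok = run_checked(execute, execute, {"r1": 1}, ("r1", 2))
bad_state, bad_ok = run_checked(execute, faulty_execute, {"r1": 1}, ("r1", 2))
```

Note that the comparison only detects the error. Recovery comes from the fact that the checkpoint is never overwritten by an unverified result.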
Microprocessor Fault Tolerant Design (cont.) • G5 Fault Tolerant Design Point (cont.) • Recovery and on-line repair: R-unit • L1: Store-through cache • L2: Shared memory • Line sparing • Upon error detection: If retry is not successful → CPU stopped • Dynamic CPU sparing (DCS) • Faulty CPU R-unit → Spare CPU R-unit
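The recovery sequence on this slide (retry, then stop the CPU and move its checkpoint to a spare) might look like the sketch below. Everything here is an assumption made for illustration: the single retry, the names `execute_with_recovery` and `make_runner`, and the integer used as checkpoint state.

```python
def make_runner(failures):
    """Build a fake CPU model: `failures` maps a CPU name to how many
    upcoming executions on that CPU will report a detected error."""
    remaining = dict(failures)

    def run(cpu, checkpoint, instruction):
        if remaining.get(cpu, 0) > 0:
            remaining[cpu] -= 1
            return checkpoint, False       # error detected, state unchanged
        return checkpoint + instruction, True

    return run


def execute_with_recovery(run, checkpoint, instruction, active_cpu, spare_cpus):
    """Execute once, retry once from the checkpoint, then fall back to a spare."""
    for _attempt in range(2):              # first try plus one retry
        state, ok = run(active_cpu, checkpoint, instruction)
        if ok:
            return state, active_cpu
    # Retry failed: treat the fault as hard, stop this CPU and
    # restart from the same checkpoint on a spare CPU.
    if not spare_cpus:
        raise RuntimeError("no spare CPU available")
    spare = spare_cpus.pop(0)
    state, ok = run(spare, checkpoint, instruction)
    if not ok:
        raise RuntimeError("spare CPU also failed")
    return state, spare


transient = execute_with_recovery(make_runner({"cpu0": 1}), 10, 5, "cpu0", ["spare0"])
spares = ["spare0"]
hard = execute_with_recovery(make_runner({"cpu0": 2}), 10, 5, "cpu0", spares)
```

In the first call the error disappears on retry, so `cpu0` keeps running. In the second call the retry also fails and the work resumes on `spare0`.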
Memory Fault Tolerance • ECC • Permanent fault in L1 → cache line or quarter-cache delete • Permanent fault in L2 → cache delete • Data array or address directory marked as invalid • Spare lines • L3: Main memory • Background scrubbing • On-line repair: Built-in spare chips • Word line or chip kill → after reaching threshold, replace module
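Background scrubbing combined with a repair threshold can be sketched as below. The sketch uses a textbook Hamming(7,4) single-error-correcting code purely as a stand-in; the slides do not specify the actual G5 ECC, and the function names and threshold value are assumptions.

```python
def encode(data_bits):
    """Hamming(7,4): positions 1..7 hold p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(word):
    """Return (corrected word, syndrome); syndrome 0 means no error seen."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        w[syndrome - 1] ^= 1  # syndrome is the 1-based position of the bad bit
    return w, syndrome


def scrub(memory, error_counts, threshold):
    """One scrubbing pass: fix single-bit errors in place and return the
    addresses whose error count reached the (assumed) sparing threshold."""
    to_spare = []
    for address, word in enumerate(memory):
        fixed, syndrome = correct(word)
        if syndrome:
            memory[address] = fixed
            error_counts[address] = error_counts.get(address, 0) + 1
            if error_counts[address] >= threshold:
                to_spare.append(address)
    return to_spare


memory = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
clean = [list(word) for word in memory]
counts = {}
memory[1][4] ^= 1                                # inject a bit flip
first_pass = scrub(memory, counts, threshold=2)
memory[1][4] ^= 1                                # same location fails again
second_pass = scrub(memory, counts, threshold=2)
```

The first pass silently repairs the flipped bit. The second error at the same address reaches the threshold, so that address is reported as a candidate for sparing.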
I/O & Power/Cooling Subsystem Fault Tolerance • Multiple paths → path redundancy • Power/Cooling subsystems
Questions • Is duplication the optimal choice? No protection against hard faults! • How to protect a CPU against intermittent faults (delay faults)? Generally, they are the beginning phase of a hard fault • How to protect an ALU by parity check? An adder? (page 868, 1st paragraph) • If the retry is unsuccessful, the CPU is stopped. Would it not be better to use a counter to account for transient faults? What if a transient fault occurs while retrying?
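On the adder question, one general technique is parity prediction: since each sum bit is s_i = a_i XOR b_i XOR c_i, the parity of the sum equals parity(A) XOR parity(B) XOR parity(carry-ins). The sketch below illustrates that identity only; it is not claimed to be what the paper or the G5 does, and the function names and 8-bit width are assumptions.

```python
def parity(value):
    """Return 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1


def add_with_parity_prediction(a, b, width=8):
    """Add two width-bit operands and predict the parity of the sum.

    The carry vector is built separately from the sum, mirroring hardware
    where the predictor does not simply re-read the adder output.
    """
    mask = (1 << width) - 1
    total = (a + b) & mask
    carries = 0
    carry = 0
    for i in range(width):
        carries |= carry << i              # carry into bit position i
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        carry = (a_i & b_i) | (a_i & carry) | (b_i & carry)
    predicted = parity(a) ^ parity(b) ^ parity(carries)
    return total, predicted


def sum_is_consistent(total, predicted):
    """Check the adder output against the predicted parity."""
    return parity(total) == predicted


total, predicted = add_with_parity_prediction(5, 3)
```

A single flipped bit in the sum changes its parity and is caught by `sum_is_consistent`. An error in the carry chain can corrupt several sum bits at once, so a check like this does not by itself cover every fault.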