IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective

IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective, by L. Spainhower & T.A. Gregg. Presented by Mahmut Yilmaz

Some Terms
• Concurrent error detection & repair: The system finds errors and repairs itself while still running
• In-line error checking: Error-detecting codes (EDC) and error-correcting codes (ECC)
• On-line error correction: Errors are corrected while the system continues to operate
• Transient (soft) faults: Temporary faults or bit flips, such as single event upsets
• Hard faults: Persistent faults that remain active for a significant period of time (forever?)
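To make the difference between detection (EDC) and correction (ECC) concrete, here is a minimal sketch. It uses even parity and a textbook Hamming(7,4) code as stand-ins; the slides do not say which codes the S/390 hardware uses, and all function names are hypothetical.

```python
def even_parity(bits):
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits_with_parity):
    """Detection only: True if no odd number of bits has flipped."""
    return sum(bits_with_parity) % 2 == 0


def hamming74_encode(data):
    """Encode 4 data bits into 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based error position or 0 if clean).

    Assumes at most one flipped bit per codeword.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the bad position
    if position:
        c[position - 1] ^= 1
    return c, position
```

Parity can only report that something went wrong, whereas the Hamming syndrome also says where, which is what allows correction while the system keeps running.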

Background
• S/390 failure modes
• Permanent, intermittent and transient faults
• If an error recurs often enough to reach a threshold → treated as permanent
• Thermal Conduction Module (TCM)
• TCM: A liquid cooling method introduced by IBM
– A series of spring-loaded cylinders conducts the heat from the chips to the cooling chamber
• Circuit growth rates exceed reliability gains
• Parity check and ECC were used
• Circuits were encapsulated
• System repair required all system resources
• Most repairs were concurrent
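The threshold rule for promoting a recurring error to a permanent fault can be sketched as a per-unit counter. This is an illustration only: the threshold value, the class name, and the absence of a time window are assumptions, not details from the paper.

```python
from collections import defaultdict


class FaultClassifier:
    """Counts errors per unit and flags a unit once it reaches a threshold."""

    def __init__(self, threshold=3):  # threshold value is hypothetical
        self.threshold = threshold
        self.counts = defaultdict(int)

    def record_error(self, unit):
        """Record one error on `unit` and return its current classification."""
        self.counts[unit] += 1
        if self.counts[unit] >= self.threshold:
            return "permanent"
        return "transient"
```

A real implementation would probably also age out old errors, so that rare, unrelated soft errors spread over a long period do not add up to a false "permanent" verdict.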

Background (cont.)
• CMOS
• G1 (1994) to G5
• G1: Less reliable than 9020
• System failures are more probable
• G2: Dynamic memory sparing
• G3: More robust ECC & CPU sparing (manual replacement)
• G4: Concurrent CPU sparing & CPU instruction-level retry
• G5: Most reliable
• Greatly exceeds any TCM
• Well protected against soft faults (hard faults?)

Microprocessor Fault Tolerant Design
• Duplication is used by several systems
• Intel, Himalaya systems
• Duplication requires more than 100% hardware overhead
• Error detection only!
• Fetch-decode (I-unit) and execute (E-unit) are generally not protected
• S/390 protects both
• Transient fault rates are increasing with decreasing feature sizes

Microprocessor Fault Tolerant Design (cont.)
• G5 Fault Tolerant Design Point
• 9X2: Main goal is to keep CPI low
• G5: Main goal is to keep clock period short
• In-line error protection is not suitable for G5:
• High fan-out/fan-in
• Increased chip area
• Longer wires
• Increased path length
• Result: Duplicated I-unit and E-unit
• A checker similar to the DIVA checker: R-unit
• Total hardware overhead: 35%
• No performance penalty (?)
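The duplicate-and-compare idea can be sketched as two copies of the same computation whose results are compared before a checkpoint is updated. This is a software analogy under stated assumptions, not the G5 logic: the `RUnit` class, the `step` helper, the toy "add" operation, and the injected bit flip are all hypothetical.

```python
def execute(state, op):
    """Toy stand-in for one I-unit/E-unit copy: applies a single operation."""
    kind, value = op
    if kind == "add":
        return {"acc": state["acc"] + value}
    raise ValueError(f"unknown operation: {kind}")


def flaky_execute(state, op):
    """Same computation with one injected bit flip, modelling a soft fault."""
    result = execute(state, op)
    return {"acc": result["acc"] ^ 1}


class RUnit:
    """Holds the last known-good state and commits only matching results."""

    def __init__(self):
        self.checkpoint = {"acc": 0}

    def commit_if_equal(self, result_a, result_b):
        if result_a != result_b:
            return False  # mismatch: the checkpoint is left untouched
        self.checkpoint = result_a
        return True


def step(r_unit, op, unit_a=execute, unit_b=execute):
    """Run both copies from the checkpoint and let the R-unit compare them."""
    result_a = unit_a(dict(r_unit.checkpoint), op)
    result_b = unit_b(dict(r_unit.checkpoint), op)
    return r_unit.commit_if_equal(result_a, result_b)
```

Because the comparison happens at the commit point rather than inside each functional unit, the sketch adds no checking logic to the computation itself, which mirrors the slide's argument for duplication over in-line checking. A mismatch shows that the two copies disagree, not which one is wrong.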

Microprocessor Fault Tolerant Design (cont.)
• G5 Fault Tolerant Design Point (cont.)
• Recovery and on-line repair → R-unit
• L1: Store-through cache
• L2: Shared memory
• Line sparing
• Upon error detection: If retry is not successful → CPU stopped
• Dynamic CPU sparing (DCS)
• Faulty CPU R-unit → Spare CPU R-unit
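The recovery flow on this slide (retry, then stop the CPU and move its state to a spare) might look roughly like the sketch below. The `CPU` class, the `fail_plan` fault-injection list, and the single-retry policy are assumptions made for illustration.

```python
class CPU:
    """Toy CPU whose instruction attempts fail according to a scripted plan."""

    def __init__(self, name, fail_plan=()):
        self.name = name
        self.state = 0          # stands in for the R-unit checkpoint
        self.running = True
        self._fail_plan = list(fail_plan)  # True = this attempt hits a fault

    def try_instruction(self, value):
        """Return True if the attempt completed and the state was committed."""
        fails = self._fail_plan.pop(0) if self._fail_plan else False
        if fails:
            return False        # checkpoint unchanged, caller may retry
        self.state += value
        return True


def run_with_recovery(cpu, spare, value):
    """Execute once, retry once, then fall back to the spare CPU."""
    if cpu.try_instruction(value):
        return cpu
    if cpu.try_instruction(value):   # retry from the checkpoint
        return cpu                   # the fault was transient
    cpu.running = False              # retry failed: stop the faulty CPU
    spare.state = cpu.state          # hand the checkpoint to the spare
    if not spare.try_instruction(value):
        raise RuntimeError("spare CPU also failed")
    return spare
```

A fault that disappears on retry never reaches the sparing step, so only faults that persist across the retry consume a spare.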

Memory Fault Tolerance
• ECC
• Permanent fault in L1 → Cache line or quarter cache delete
• Permanent fault in L2 → Cache delete
• Data array or address directory marked as invalid
• Spare lines
• L3: Main memory
• Background scrubbing
• On-line repair: Built-in spare chips
• Word line or chip kill → After reaching threshold, replace module
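Background scrubbing combined with sparing can be sketched as a periodic pass that corrects errors, counts them per address, and switches to a spare once a threshold is reached. To keep the example short, three redundant copies with a majority vote stand in for ECC, which is not how the G5 memory works; the class name, threshold, and spare count are likewise hypothetical.

```python
from collections import Counter


class ScrubbedMemory:
    """Toy memory with a scrub pass, per-address error counts and spares."""

    def __init__(self, size, spares=1, threshold=2):
        self.copies = [[0] * size for _ in range(3)]  # stand-in for ECC
        self.errors = [0] * size    # corrected errors seen per address
        self.spared = set()         # addresses moved to spare hardware
        self.spares = spares
        self.threshold = threshold

    def write(self, addr, value):
        for copy in self.copies:
            copy[addr] = value

    def scrub(self):
        """One background pass; returns addresses that need a new module.

        Assumes at most one of the three copies is wrong at a time.
        """
        needs_replacement = []
        for addr in range(len(self.errors)):
            values = [copy[addr] for copy in self.copies]
            good, votes = Counter(values).most_common(1)[0]
            if votes == 3:
                continue                    # clean word, nothing to do
            self.write(addr, good)          # correct the bad copy in place
            self.errors[addr] += 1
            if self.errors[addr] >= self.threshold:
                if self.spares > 0:         # on-line repair with a spare
                    self.spares -= 1
                    self.spared.add(addr)
                    self.errors[addr] = 0
                else:                       # no spares left
                    needs_replacement.append(addr)
        return needs_replacement
```

Scrubbing matters because it removes single soft errors before a second error can land in the same word and exceed what the code can correct.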

I/O & Power/Cooling Subsystem Fault Tolerance
• Multiple paths → Path redundancy
• Power/Cooling subsystems

Questions
• Is duplication the optimal choice? No protection against hard faults!
• How to protect a CPU against intermittent faults (delay faults)? Generally, they are the beginning phase of a hard fault
• How to protect an ALU by parity check? An adder? (page 868, 1st paragraph)
• If the retry is unsuccessful, the CPU is stopped. Would it not be better to use a counter to account for transient faults? What if a transient fault occurs while retrying?
