Fault Tolerance in IBM S/390 Parallel Enterprise Server G5: A Historical Perspective

This article provides a historical perspective on fault tolerance in IBM S/390 Parallel Enterprise Server G5. It discusses terms such as concurrent error detection and repair, in-line error checking, on-line error correction, transient faults, and hard faults. The article also explores the evolution of fault tolerance mechanisms in microprocessor and memory subsystems.

Presentation Transcript


  1. IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective, by L. Spainhower & T.A. Gregg. Presented by Mahmut Yilmaz

  2. Some Terms • Concurrent error detection & repair: The system finds errors & repairs itself while still running • In-line error checking: EDC, ECC • On-line error correction: Correct error while the system can still operate • Transient (soft) faults: Temporary faults or bit flips like Single Event Upsets • Hard faults: Persistent faults that remain active for a significant period of time (forever?)

  3. Background • S/390 failure modes • Permanent, intermittent and transient faults • If an error occurs frequently and reaches a threshold → permanent • Thermal Conduction Module (TCM) • TCM: A liquid cooling method introduced by IBM – A series of spring-loaded cylinders conduct the heat from chips to the cooling chamber • Circuit growth rates exceed reliability gains • Parity check and ECC were used • Circuits were encapsulated • System repair required all system resources • Most repairs were concurrent
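The threshold rule on this slide (an error that recurs often enough is treated as permanent) can be sketched in a few lines. This is only an illustration: the class name `FaultClassifier`, the component label `"cpu0"`, and the threshold value of 3 are assumptions, not details from the slides or the paper.

```python
from collections import defaultdict


class FaultClassifier:
    """Counts errors per component and calls a fault permanent at a threshold."""

    def __init__(self, threshold=3):
        # The threshold value is a hypothetical choice for illustration.
        self.threshold = threshold
        self.counts = defaultdict(int)

    def record_error(self, component):
        """Record one error and return the current classification."""
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            return "permanent"
        return "transient"


classifier = FaultClassifier(threshold=3)
# Three errors on the same (hypothetical) component: the third crosses the threshold.
results = [classifier.record_error("cpu0") for _ in range(3)]
print(results)
```

A real system would likely also age out old errors so that rare, unrelated transients never accumulate to the threshold; that refinement is left out here.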

  4. Background (cont.) • CMOS • G1 (1994) to G5 • G1: Less reliable than 9020 • System failures are more probable • G2: Dynamic memory sparing • G3: More robust ECC & CPU sparing (manual replacement) • G4: Concurrent CPU sparing & CPU instruction level retry • G5: Most reliable • Greatly exceeds any TCM • Well protected against soft faults (hard faults?)

  5. Microprocessor Fault Tolerant Design • Duplication is used by several systems • Intel, Himalaya systems • Duplication requires more than 100% hardware overhead • Error detection only! • Fetch-decode (I-Unit) and execute (E-Unit) are generally not protected • S/390 protects them • Transient fault rates are increasing with decreased feature sizes

  6. Microprocessor Fault Tolerant Design (cont.) • G5 Fault Tolerant Design Point • 9X2: Main goal is to keep CPI low • G5: Main goal is to keep clock period short • In-line error protection is not suitable for G5: • High fan-out/fan-in • Increased chip area • Longer wires • Increased path length • Result: Duplicated I-unit and E-unit • A checker similar to the DIVA checker: R-unit • Total hardware overhead: 35% • No performance penalty (?)

  7. Microprocessor Fault Tolerant Design (cont.) • G5 Fault Tolerant Design Point (cont.) • Recovery and on-line repair → R-unit • L1: Store-through cache • L2: Shared memory • Line sparing • Upon error detection: If retry is not successful → CPU stopped • Dynamic CPU sparing (DCS) • Faulty CPU R-unit → Spare CPU R-unit
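The flow described on slides 6 and 7 (duplicated units, a comparison before state is committed, a retry on mismatch, then a move to a spare CPU) can be sketched as follows. This is a toy software model, not the G5 hardware design: `FlakyUnit`, `good_unit`, `run_checked`, `execute_with_sparing`, the single-retry default, and the `checkpoint` dictionary standing in for R-unit state are all assumptions made for illustration.

```python
class FlakyUnit:
    """Hypothetical execution unit that corrupts its first `bad_calls` results."""

    def __init__(self, bad_calls):
        self.bad_calls = bad_calls
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        result = x + 1
        if self.calls <= self.bad_calls:
            return result ^ 1  # flip the low bit to model a fault
        return result


def good_unit(x):
    """Fault-free copy of the same operation."""
    return x + 1


def run_checked(unit_a, unit_b, operand, checkpoint, max_retries=1):
    """Run both copies; commit to the checkpoint only when the results agree."""
    for _ in range(max_retries + 1):
        a, b = unit_a(operand), unit_b(operand)
        if a == b:
            checkpoint["value"] = a  # commit: the checkpoint advances
            return True
        # Mismatch: nothing is committed, so the retry starts from the old state.
    return False  # retry exhausted: this CPU is stopped


def execute_with_sparing(cpus, operand, checkpoint):
    """Try each CPU in order; a spare resumes from the shared checkpoint."""
    for name, (unit_a, unit_b) in cpus:
        if run_checked(unit_a, unit_b, operand, checkpoint):
            return name
    return None


# Transient fault: the first attempt mismatches, the retry succeeds on cpu0.
soft_checkpoint = {"value": 0}
soft_cpus = [("cpu0", (FlakyUnit(1), good_unit))]
soft_ran_on = execute_with_sparing(soft_cpus, 41, soft_checkpoint)

# Hard fault: cpu0 never agrees with its duplicate, so the spare takes over.
hard_checkpoint = {"value": 0}
hard_cpus = [
    ("cpu0", (FlakyUnit(float("inf")), good_unit)),
    ("spare", (good_unit, good_unit)),
]
hard_ran_on = execute_with_sparing(hard_cpus, 41, hard_checkpoint)
print(soft_ran_on, hard_ran_on)
```

Note that comparison alone only detects the error; recovery depends on the checkpoint never being updated with an unverified result.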

  8. Memory Fault Tolerance • ECC • Permanent fault in L1 → Cache line or quarter cache delete • Permanent fault in L2 → Cache delete • Data array or address directory marked as invalid • Spare lines • L3: Main memory • Background scrubbing • On-line repair: Built-in spare chips • Word line or chip kill → After reaching threshold, replace module
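To make the ECC and background scrubbing bullets concrete, here is a minimal sketch using a textbook Hamming(7,4) single-error-correcting code. The slides do not say which code G5 memory uses, so Hamming(7,4) is a stand-in, and the function names and the list-of-bits memory model are assumptions.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    c = [0] * 8  # index 0 is unused so indices match bit positions
    c[3], c[5], c[6], c[7] = data_bits
    c[1] = c[3] ^ c[5] ^ c[7]  # parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]  # parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]  # parity over positions 4, 5, 6, 7
    return c[1:]


def hamming74_correct(word):
    """Return (corrected word, syndrome); a nonzero syndrome is the bad position."""
    c = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        c[syndrome] ^= 1  # flip the single faulty bit back
    return c[1:], syndrome


def scrub(memory):
    """Walk all words, rewrite any with a correctable error, count the repairs."""
    corrected = 0
    for addr, word in enumerate(memory):
        fixed, syndrome = hamming74_correct(word)
        if syndrome:
            memory[addr] = fixed
            corrected += 1
    return corrected


memory = [hamming74_encode([1, 0, 1, 1]), hamming74_encode([0, 1, 1, 0])]
memory[1][4] ^= 1  # inject a single-bit soft error at position 5 of word 1
repairs = scrub(memory)
print(repairs, memory[1] == hamming74_encode([0, 1, 1, 0]))
```

Scrubbing matters because this kind of code corrects only one flipped bit per word: rewriting corrected words in the background keeps soft errors from accumulating into an uncorrectable double-bit error.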

  9. I/O & Power/Cooling Subsystem Fault Tolerance • Multiple paths → Path redundancy • Power/Cooling subsystems

  10. Questions • Is duplication the optimal choice? No protection against hard faults! • How to protect a CPU against intermittent faults? (Delay faults) Generally, they are the beginning phase of a hard fault • How to protect an ALU by parity check? An adder? (page 868, 1st paragraph) • If the retry is unsuccessful, the CPU is stopped. Would it not be better to use a counter to account for transient faults? What if a transient fault occurs while retrying?
