Reliability

monita

Presentation Transcript


  1. Reliability

  2. Threads for Fault Tolerance • Multiprocessors: • Transient fault detection

  3. Transient Faults • Faults that persist for a “short” duration • Cause: cosmic rays (energetic particles originating from outer space) • Effect: knock off electrons, discharge capacitors • Solution: no practical absorbent for cosmic rays • Estimated fault rate: 1 fault per 1000 computers per year • The future is worse: smaller feature size, higher transistor count, reduced noise margin
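
To get a feel for the estimated rate quoted on this slide, the arithmetic below scales it to a fleet of machines. It is only a rough sketch: the Poisson model (independent faults at a constant rate) and the function names are assumptions made for illustration, not part of the slides.

```python
import math

# Estimated rate from the slide: 1 fault per 1000 computers per year.
FAULTS_PER_MACHINE_YEAR = 1 / 1000


def expected_faults(machines, years=1.0):
    """Expected number of transient faults across a fleet."""
    return FAULTS_PER_MACHINE_YEAR * machines * years


def prob_at_least_one(machines, years=1.0):
    """P(at least one fault), assuming faults follow a Poisson process."""
    return 1.0 - math.exp(-expected_faults(machines, years))


if __name__ == "__main__":
    for fleet in (1, 1_000, 10_000):
        print(fleet, expected_faults(fleet), round(prob_at_least_one(fleet), 3))
```

Under these assumptions a single machine rarely sees a fault, but a fleet of a thousand machines is more likely than not to see at least one per year.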

  4. Background • Fault tolerant systems use redundancy to improve reliability: • Time redundancy: separate executions • Space redundancy: separate physical copies of resources • DMR/TMR • Data redundancy • ECC: Automatic repeat request (ARQ), Forward error correction (FEC) • Parity: odd/even • Examples: • IBM: duplicated pipelines, spare processors, ECC in memories... • HP: DMR/TMR processors, Parity/ECC in buses, memories...
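
The difference between DMR, TMR, and parity can be sketched in a few lines. This is a software illustration only, with hypothetical function names; real systems implement these checks in hardware. DMR detects a disagreement but cannot say which copy is wrong, TMR outvotes a single bad copy, and a parity bit detects any single flipped bit in a word.

```python
from collections import Counter


def dmr_check(a, b):
    """Dual modular redundancy: detect a mismatch between two copies."""
    return a == b


def tmr_vote(a, b, c):
    """Triple modular redundancy: return the majority value of three copies."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three copies disagree")
    return value


def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


if __name__ == "__main__":
    good = 0b1011_0010
    faulty = good ^ (1 << 3)  # simulate a single transient bit flip

    print(dmr_check(good, faulty))                            # mismatch detected
    print(tmr_vote(good, faulty, good) == good)               # bad copy outvoted
    print(even_parity_bit(good) != even_parity_bit(faulty))   # parity changes
```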

  5. Multiprocessors: Fault Detection • Chip-level Redundantly Threaded processor • Replicates register values but not memory values • The leading thread commits stores only after checking • Memory is guaranteed to be correct • Other instructions commit without checking • The leading thread sends committed values for: • branch outcomes • load/store values • store addresses

  6. Sphere of Replication (SoR) • Logical boundary of redundant execution within a system • Components within protected via redundant execution • Components outside must be protected via other means • Its size matters: • Error detection latency • Stored-state size

  7. Example Spheres of Replication ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5) Compaq Himalaya

  8. Fault Detection in Compaq Himalaya System Replicated Microprocessors + Cycle-by-Cycle Lockstepping

  9. Fault Detection via Simultaneous Multithreading (SMT) Replicated Microprocessors + Cycle-by-Cycle Lockstepping

  10. Concept • SMT improves the performance of a processor by: • allowing independent threads to execute simultaneously • doing so in different functional units • Redundant Multithreading (RMT): • leverages SMT’s properties to allow fault detection for microprocessors • runs two copies of the same program as independent threads • compares their outputs and initiates recovery in case of mismatch
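
The RMT idea on this slide (run two copies, compare outputs, recover on mismatch) can be illustrated with a small software model. It is a sketch under simplifying assumptions: the two copies run one after the other rather than as simultaneous hardware threads, and the `fault` hook is a hypothetical way to inject a bit flip into the leading copy.

```python
def run_redundantly(program, inputs, fault=None):
    """Run `program` twice on the same inputs and compare the outputs.

    `fault`, if given, is applied to the leading copy's output to simulate
    a transient fault (a hypothetical hook for illustration only).
    """
    leading = program(inputs)
    if fault is not None:
        leading = fault(leading)
    trailing = program(inputs)

    if leading != trailing:
        # A real RMT processor would initiate recovery at this point.
        return ("mismatch", None)
    return ("ok", leading)


if __name__ == "__main__":
    total = lambda xs: sum(xs)
    print(run_redundantly(total, [1, 2, 3]))
    print(run_redundantly(total, [1, 2, 3], fault=lambda v: v ^ 1))
```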

  11. Input Replication • Load Value Queue (LVQ) • Keeps both threads on the same path despite I/O or multiprocessor (MP) writes • Out-of-order load issue possible
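
A minimal model of input replication through a load value queue might look like the following. The class and method names are hypothetical, and the queue is modelled as a strict FIFO consumed in program order, which is a simplification. The point it shows is that the trailing thread takes its load value from the queue rather than from memory, so a write that lands between the two loads cannot make the threads diverge.

```python
from collections import deque


class LoadValueQueue:
    """Toy LVQ: the leading thread records loads, the trailing thread replays them."""

    def __init__(self):
        self._fifo = deque()

    def leading_load(self, memory, addr):
        # Only the leading thread touches memory.
        value = memory[addr]
        self._fifo.append((addr, value))
        return value

    def trailing_load(self, addr):
        # The trailing thread reads the recorded value and checks the address.
        lead_addr, value = self._fifo.popleft()
        if lead_addr != addr:
            raise RuntimeError("load address mismatch: fault detected")
        return value


if __name__ == "__main__":
    memory = {0x10: 7}
    lvq = LoadValueQueue()

    lead = lvq.leading_load(memory, 0x10)
    memory[0x10] = 99  # another processor or an I/O device writes in between
    trail = lvq.trailing_load(0x10)

    print(lead, trail)  # both threads observe the same value
```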

  12. Output Comparison Compare & validate output before sending it outside the SoR

  13. Store Queue Comparator (STQ) • Compares outputs headed to the data cache • Catches faults before they propagate to the rest of the system

  14. Store Queue Comparator (cont’d) • Extends residence time of leading-thread stores • Size constrained by cycle time goal • Base CPU statically partitions single queue among threads • Potential solution: per-thread store queues • Deadlock if matching trailing store cannot commit • Several small but crucial changes to avoid this
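
The store comparison described on these two slides can be sketched as a queue of leading-thread stores that are held back until the matching trailing-thread store arrives. The names below are hypothetical and the model ignores queue capacity, so it does not show the sizing or deadlock concerns listed above; it only shows that memory stays untouched until both copies agree.

```python
from collections import deque


class StoreComparator:
    """Toy store comparator: a store reaches memory only after both threads agree."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()  # leading-thread stores awaiting verification

    def leading_store(self, addr, data):
        # Held in the queue; not yet visible outside the sphere of replication.
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        # Assumes the leading thread has already issued its copy of this store.
        lead = self.pending.popleft()
        if lead != (addr, data):
            raise RuntimeError("store mismatch: fault detected")
        self.memory[addr] = data  # verified, so release a single copy


if __name__ == "__main__":
    memory = {}
    cmp_ = StoreComparator(memory)

    cmp_.leading_store(0x40, 123)
    print(memory)             # still empty: the store has not been verified
    cmp_.trailing_store(0x40, 123)
    print(memory)             # the verified store is now visible
```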

  15. Branch Outcome Queue (BOQ) • Forwards leading-thread branch targets to trailing-thread fetch • 100% prediction accuracy in absence of faults
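
A branch outcome queue can be modelled as a FIFO of committed branch targets that the trailing thread consumes as its predictions. The sketch below uses hypothetical names and assumes the trailing thread still computes each branch itself, so that a disagreement with the forwarded target signals a fault.

```python
from collections import deque


class BranchOutcomeQueue:
    """Toy BOQ: leading-thread branch targets become trailing-thread predictions."""

    def __init__(self):
        self._fifo = deque()

    def leading_commit(self, pc, target):
        self._fifo.append((pc, target))

    def trailing_predict(self, pc):
        lead_pc, target = self._fifo.popleft()
        if lead_pc != pc:
            raise RuntimeError("branch PC mismatch: fault detected")
        return target


def trailing_execute(boq, pc, computed_target):
    """Check the trailing thread's own result against the forwarded target."""
    predicted = boq.trailing_predict(pc)
    if predicted != computed_target:
        raise RuntimeError("branch target mismatch: fault detected")
    return predicted


if __name__ == "__main__":
    boq = BranchOutcomeQueue()
    boq.leading_commit(pc=0x100, target=0x180)
    print(hex(trailing_execute(boq, pc=0x100, computed_target=0x180)))
```

Without faults the forwarded target always matches what the trailing thread computes, which is why the slide quotes perfect prediction accuracy.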

  16. Simultaneous & Redundantly Threaded Processor (SRT) • SRT = SMT + Fault Detection • Less hardware compared to replicated microprocessors • SMT needs ~5% more hardware than a uniprocessor • SRT adds very little hardware overhead to existing SMT • Better performance than complete replication • better use of resources • Lower cost

  17. Issues • Cycle-by-cycle output comparison and input replication: • Equivalent instructions from different threads may execute in different cycles • Equivalent instructions from different threads may execute in a different order • Precise scheduling of the threads is crucial for optimal performance • Branch mispredictions • Cache misses
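
The first issue above can be made concrete with two made-up traces. Each trace maps a cycle number to the store committed in that cycle; the values are invented for illustration. Both threads produce the same stores in the same program order, but at different cycles, so a cycle-by-cycle comparison reports a mismatch even though nothing is wrong, while comparing the outputs in commit order succeeds.

```python
# Hypothetical committed-store traces: cycle -> (op, address, data).
lead_trace = {1: ("st", 0x10, 5), 4: ("st", 0x14, 6)}
trail_trace = {3: ("st", 0x10, 5), 9: ("st", 0x14, 6)}


def lockstep_equal(a, b):
    """Cycle-by-cycle comparison: both threads must match in every cycle."""
    cycles = set(a) | set(b)
    return all(a.get(c) == b.get(c) for c in cycles)


def output_equal(a, b):
    """Compare outputs in commit order, ignoring the cycle they occurred in."""
    return [a[c] for c in sorted(a)] == [b[c] for c in sorted(b)]


if __name__ == "__main__":
    print(lockstep_equal(lead_trace, trail_trace))  # timing differs
    print(output_equal(lead_trace, trail_trace))    # outputs agree
```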
