
Hardware Fault Tolerance Through Simultaneous Multithreading (part 3)

Presentation Transcript


  1. Hardware Fault Tolerance Through Simultaneous Multithreading (part 3) Jonathan Winter

  2. 3 SMT + Fault Tolerance Papers • Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999. • Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000. • Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.

  3. Outline • Background • SMT • Hardware fault tolerance • AR-SMT • Basic mechanisms • Implementation issues • Simulation and Results • Transient Fault Detection via SMT • Sphere of replication • Basic mechanisms • Comparison to AR-SMT • Simulation and Results • Redundant Multithreading Alternatives • Realistic processor implementation • CRT • Simulation and Results • Fault Recovery • Future Lectures ?

  4. Sphere of Replication • Size of sphere of replication • Two alternatives – with and without register file • Instruction and data caches kept outside

  5. Redundant Multithreading Alternatives • Discusses real world fault tolerant processors • Evaluates SRT on a more realistic and detailed processor than the previous paper • Proposes Chip-level Redundant Threading (CRT) • Detailed simulation results with new metric • Relative SMT-Efficiency

  6. Real World SMT, CMP, and FT • Simulated processor based on Compaq Alpha Araña (a.k.a. 21464 or EV8) • IBM Power4 and HP Mako are 2-way CMPs • Compaq Himalaya uses multi-chip lockstepping • IBM S/390 G5 uses on-chip lockstepping

  7. Detailed Processor Description • 8 way SMT with 4 hardware contexts • IBOX fetches chunks of 8 instructions and forwards them to the PBOX • Complex branch prediction mechanism • Line predictor • Branch predictor, jump target predictor, and return address stack

  8. Detailed Processor Description (part 2) • PBOX performs initial processing • Register renaming and partial decoding • Maintains tables for recovery from mispredictions • QBOX issues instructions out-of-order to the EBOX, FBOX, or MBOX • Retires instructions and commits architectural state in program order • Consists of instruction queue, in-flight table, and completion unit • MBOX conducts loads and stores • Load and store queues divided between threads • Available queue space is very small per thread

  9. SRT on Detailed Processor • Input replication uses an LVQ variant that allows out-of-order load issue from the trailing thread • Output comparison is the same as in SRT • A suggested improvement gives each thread its own store queue (SQ) • PBOX storage structures made per-thread to avoid deadlock situations • Branch outcome queue converted to line prediction queue • Preferential space redundancy (PSR) implemented to better cover permanent faults
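
The input-replication bullet above can be illustrated with a toy load value queue (LVQ). This is only a sketch under stated assumptions: the class name, the use of a per-load tag for lookup, and the address check on the trailing side are illustrative choices, not the hardware design from the paper.

```python
class LoadValueQueue:
    """Toy LVQ: the leading thread deposits load results, the trailing thread consumes them."""

    def __init__(self):
        # Hypothetical layout: load tag -> (address, value).
        self._entries = {}

    def leading_load(self, tag, address, value):
        # The leading thread performs the real memory access and records the result.
        self._entries[tag] = (address, value)

    def trailing_load(self, tag, address):
        # The trailing thread does not access memory; it reuses the recorded value.
        # Looking entries up by tag (an assumption) lets trailing loads issue out of order.
        lead_address, value = self._entries.pop(tag)
        if lead_address != address:
            raise RuntimeError(f"fault detected: address mismatch on load {tag}")
        return value


lvq = LoadValueQueue()
lvq.leading_load(0, 0x100, 7)
lvq.leading_load(1, 0x108, 9)

# The trailing thread issues its loads in the opposite order and still gets matching values.
second = lvq.trailing_load(1, 0x108)
first = lvq.trailing_load(0, 0x100)
```

Because both threads see the same load values, any later divergence between them points to a fault rather than to a change in memory contents.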

  10. Chip-level Redundant Threading • Each core executes a lead and trailing thread from different programs • LVQ and line prediction queue must forward data to other processor’s trailing thread • Store buffer must receive retired stores from other processor for comparison
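
The cross-coupling described above can be written out as a small placement table. This is a hedged sketch for a two-core chip; the function names, the core labels, and the dictionary layout are assumptions made for illustration only.

```python
def crt_placement(program_a, program_b):
    """Each core runs the leading thread of one program and the trailing thread of the other."""
    return {
        "core0": {"leading": program_a, "trailing": program_b},
        "core1": {"leading": program_b, "trailing": program_a},
    }


def forwarding_routes(placement):
    """For each program, list (source core, destination core) of the cross-core traffic."""
    routes = {}
    for lead_core, threads in placement.items():
        program = threads["leading"]
        # Find the core that hosts this program's trailing thread.
        trail_core = next(
            core for core, t in placement.items() if t["trailing"] == program
        )
        routes[program] = {
            # Load values and line predictions flow from the leading to the trailing thread.
            "lvq_and_line_predictions": (lead_core, trail_core),
            # Retired trailing stores flow back to be compared in the leading core's store buffer.
            "retired_stores": (trail_core, lead_core),
        }
    return routes


placement = crt_placement("A", "B")
routes = forwarding_routes(placement)
```

With this layout, no core holds both copies of the same program, which is why every forwarded value has to cross between the two processors.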

  11. CRT Advantages • CRT checks much less information to detect faults than lockstepped processors • Lockstep fault detection circuitry is on critical path for cache misses • CRT executes threads more efficiently because of SMT dynamic scheduling on each processor

  12. Simulation Environment • Asim performance model framework used • Simulates processor like Alpha 21464 • All 18 SPEC CPU95 benchmarks used • Combinations of SPEC used for multi-program simulations • Lockstepped processor simulated with zero fault detection delay (Lock0) and with 8-cycle delay (Lock8) • SRT architecture simulated with delays for forwarding line predictions and load values • Extra delays for CRT architecture

  13. SMT-Efficiency • SMT-Efficiency (SMT-E) used instead of IPC • SMT-E of an individual thread is its IPC in SMT mode divided by its IPC when running alone (single-thread mode) on the same SMT processor • Overall SMT-E is the arithmetic mean of the individual SMT-Efficiencies • A. Snavely and D. M. Tullsen, “Symbiotic Job Scheduling for a Simultaneous Multithreading Processor”, ASPLOS 2000
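
The SMT-Efficiency definition above can be worked through in a few lines. This is a minimal sketch; the function names and the IPC numbers are illustrative assumptions, not measurements from the paper.

```python
from statistics import mean


def thread_smt_efficiency(ipc_smt, ipc_alone):
    """IPC of a thread in SMT mode divided by its IPC when running alone."""
    if ipc_alone <= 0:
        raise ValueError("single-thread IPC must be positive")
    return ipc_smt / ipc_alone


def overall_smt_efficiency(pairs):
    """Arithmetic mean of the per-thread SMT-Efficiencies."""
    return mean([thread_smt_efficiency(smt, alone) for smt, alone in pairs])


# Hypothetical (IPC in SMT mode, IPC alone) pairs for a two-thread workload.
workload = [(1.0, 2.0), (1.5, 2.0)]
overall = overall_smt_efficiency(workload)
print(overall)  # 0.625
```

Normalizing each thread by its own single-thread IPC keeps a high-IPC program from dominating the average, which a raw IPC sum would allow.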

  14. SMT-Speedup (= SMT-Efficiency) • Y. Sazeides and T. Juan, “How to Compare the Performance of Two SMT Microarchitectures”, ISPASS 2001

  15. Preferential Space Redundancy • Without PSR, 65% of instructions execute on same functional unit • With PSR, only 0.06% of instructions run on the same unit • No performance degradation is experienced
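
The idea behind PSR can be sketched as a steering rule: remember which unit executed the leading copy of an instruction and send the trailing copy elsewhere. The rotation rule, the function name, and the unit numbering below are assumptions for illustration; the paper's actual steering hardware may differ.

```python
def steer_trailing(leading_unit, num_units=2):
    """Pick a functional unit for the trailing copy that differs from the leading copy's unit."""
    if num_units < 2:
        raise ValueError("space redundancy needs at least two units")
    # Hypothetical rule: rotate to the next unit.
    return (leading_unit + 1) % num_units


# Hypothetical record of which unit executed each leading instruction.
leading_units = [0, 1, 1, 0]
trailing_units = [steer_trailing(u) for u in leading_units]

# Count instruction pairs whose two copies shared a unit.
same_unit = sum(l == t for l, t in zip(leading_units, trailing_units))
```

Running both copies on different hardware means a permanent fault in one unit corrupts only one copy, so the output comparison can expose it.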

  16. SRT – One Logical Thread • SRT 32% slower than single thread on SMT • SRT 11% faster than running two redundant copies • Degradation 30% with per-thread store queue • Best-case 26% degradation with oracle store queue

  17. SRT – Two Logical Threads • Degradation of SRT is 40% • Per-thread store queue gives 32% degradation • Store lifetime drops from 44 cycles to vs. 39 for single thread • Oracle store queue gives 5% better efficiency

  18. Chip-level Redundant Threading • With one logical thread, CRT performs similarly to lockstepping • With two logical threads, CRT beats Lock0 and Lock8 by 10% and 2% respectively • Adding the per-thread store queue causes CRT to beat Lock8 by 13% on average (22% maximum) • Using an oracle store queue improves performance by 6% more

  19. CRT with Four Logical Threads • Initial CRT configuration is no better than Lock8 • Adding per-thread store queue gives CRT 13% better performance than Lock8 • Using an oracle store queue improves performance by only another 2%

  20. Conclusions • The benefits of SRT are not as great as in the original paper when using a detailed model • 30% and 32% degradation seen on single thread and multithread workloads • SRT methods can be used to detect permanent faults • Chip-level redundant threading gives improved performance over lockstepped processors • Overall CRT provided a 13% improvement

  21. Transient Fault Recovery • AR-SMT suggests that the R-stream could be used as a checkpoint for recovery • SRT suggests checkpoint/restart or failover • Argues that, since faults are infrequent, they will have a minor impact on performance

  22. Future Lectures ? • Hardware Transient Fault Recovery • T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading”, ISCA 2002 • Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar, and Irith Pomeranz, “Transient-Fault Recovery for Chip Multiprocessors”, ISCA 2003 • Slipstream Processors (an AR-SMT extension) • Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg, “Slipstream Processors: Improving both Performance and Fault Tolerance”, ASPLOS 2000 • Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg, “Slipstream Execution Mode for CMP-Based Multiprocessors”, HPCA 2003
