Hardware Fault Tolerance Through Simultaneous Multithreading (part 3) Jonathan Winter
3 SMT + Fault Tolerance Papers • Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999. • Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000. • Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.
Outline • Background • SMT • Hardware fault tolerance • AR-SMT • Basic mechanisms • Implementation issues • Simulation and Results • Transient Fault Detection via SMT • Sphere of replication • Basic mechanisms • Comparison to AR-SMT • Simulation and Results • Redundant Multithreading Alternatives • Realistic processor implementation • CRT • Simulation and Results • Fault Recovery • Future Lectures ?
Sphere of Replication • Size of sphere of replication • Two alternatives – with and without register file • Instruction and data caches kept outside
Redundant Multithreading Alternatives • Discusses real-world fault-tolerant processors • Evaluates SRT on a more realistic and detailed processor model than the previous paper • Proposes Chip-level Redundant Threading (CRT) • Presents detailed simulation results using a new metric • Relative SMT-Efficiency
Real World SMT, CMP, and FT • Simulated processor based on Compaq Alpha Araña (a.k.a. 21464 or EV8) • IBM Power4 and HP Mako are 2-way CMPs • Compaq Himalaya uses multi-chip lockstepping • IBM S/390 G5 uses on-chip lockstepping
Detailed Processor Description • 8 way SMT with 4 hardware contexts • IBOX fetches chunks of 8 instructions and forwards them to the PBOX • Complex branch prediction mechanism • Line predictor • Branch predictor, jump target predictor, and return address stack
Detailed Processor Description (part 2) • PBOX performs initial processing • Register renaming and partial decoding • Maintains tables for recovery from mispredictions • QBOX issues instructions out-of-order to the EBOX, FBOX, or MBOX • Retires instructions and commits architectural state in program order • Consists of instruction queue, in-flight table, and completion unit • MBOX conducts loads and stores • Load and store queues divided between threads • Available queue space is very small per thread
SRT on Detailed Processor • Input replication uses a load value queue (LVQ) variant that allows out-of-order load issue from the trailing thread • Output comparison is the same as in the original SRT design • A suggested improvement gives each thread its own store queue (SQ) • PBOX storage structures made per-thread to avoid deadlock situations • Branch outcome queue converted to line prediction queue • Preferential space redundancy (PSR) implemented to better cover permanent faults
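The input-replication idea can be illustrated with a minimal sketch of a load value queue (LVQ). It assumes each load carries a program-order tag so the trailing thread can look up its copy in any order; the class name, method names, and tag scheme are hypothetical and not taken from the paper.

```python
class LoadValueQueue:
    """Hypothetical LVQ: the leading thread writes, the trailing thread reads by tag."""

    def __init__(self):
        self._entries = {}  # load tag -> (address, value)

    def leading_load(self, tag, address, value):
        # The leading thread performs the real memory access and records the result.
        self._entries[tag] = (address, value)

    def trailing_load(self, tag, address):
        # The trailing thread takes its value from the queue instead of the cache.
        # Lookup is by tag, so trailing loads may issue out of order.
        lead_address, value = self._entries.pop(tag)
        if lead_address != address:
            raise RuntimeError(f"fault detected: address mismatch on load {tag}")
        return value
```

Because both threads see the same load values, they follow the same execution path, and an address mismatch signals a fault in one of the two copies.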
Chip-level Redundant Threading • Each core executes a lead and trailing thread from different programs • LVQ and line prediction queue must forward data to other processor’s trailing thread • Store buffer must receive retired stores from other processor for comparison
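A rough sketch of the store comparison follows. It assumes that both copies of a program retire stores in the same program order, so a simple FIFO suffices; the class and method names are illustrative only, and the cross-chip forwarding delay is not modeled.

```python
from collections import deque


class StoreComparator:
    """Hypothetical store buffer that checks trailing stores against leading ones."""

    def __init__(self):
        self._pending = deque()  # retired leading-thread stores awaiting a check

    def leading_store(self, address, value):
        # A leading-thread store is held here rather than written to memory.
        self._pending.append((address, value))

    def trailing_store(self, address, value):
        # Compare with the oldest unchecked leading store; release it on a match.
        lead = self._pending.popleft()
        if lead != (address, value):
            raise RuntimeError("fault detected: store mismatch")
        return lead  # verified store, now safe to leave the sphere of replication
```

In CRT the leading and trailing copies of a program run on different cores, so each `trailing_store` call would be fed by stores forwarded from the other processor.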
CRT Advantages • CRT checks much less information to detect faults than lockstepped processors • Lockstep fault detection circuitry is on critical path for cache misses • CRT executes threads more efficiently because of SMT dynamic scheduling on each processor
Simulation Environment • Asim performance model framework used • Simulates processor like Alpha 21464 • All 18 SPEC CPU95 benchmarks used • Combinations of SPEC used for multi-program simulations • Lockstepped processor simulated with zero fault detection delay (Lock0) and with 8-cycle delay (Lock8) • SRT architecture simulated with delays for forwarding line predictions and load values • Extra delays for CRT architecture
SMT-Efficiency • SMT-Efficiency (SMT-E) used instead of IPC • SMT-E of an individual thread is its IPC in SMT mode divided by its IPC in single-thread mode on the same SMT processor • Overall SMT-E is the arithmetic mean of the individual SMT-Efficiencies • A. Snavely and D. M. Tullsen, “Symbiotic Job Scheduling for a Simultaneous Multithreading Processor”, ASPLOS 2000
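As a small worked example of the metric defined above, the following sketch computes the overall SMT-Efficiency from per-thread IPC values. The function name and the sample IPC numbers are made up for illustration.

```python
def smt_efficiency(smt_ipc, single_thread_ipc):
    """Arithmetic mean of per-thread (IPC in SMT mode / IPC in single-thread mode)."""
    if not smt_ipc or len(smt_ipc) != len(single_thread_ipc):
        raise ValueError("need one SMT-mode and one single-thread IPC per thread")
    per_thread = [smt / alone for smt, alone in zip(smt_ipc, single_thread_ipc)]
    return sum(per_thread) / len(per_thread)


# Two threads that each reach IPC 2.0 alone but 1.0 and 0.5 when co-scheduled:
# per-thread efficiencies are 0.5 and 0.25, so the overall SMT-E is 0.375.
print(smt_efficiency([1.0, 0.5], [2.0, 2.0]))
```

Normalizing each thread by its own single-thread IPC keeps a high-IPC benchmark from dominating the average, which is the reason raw IPC is avoided for multi-program workloads.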
SMT-Speedup (= SMT-Efficiency) • Y. Sazeides and T. Juan, “How to Compare the Performance of Two SMT Microarchitectures”, ISPASS 2001
Preferential Space Redundancy • Without PSR, 65% of instructions execute on same functional unit • With PSR, only 0.06% of instructions run on the same unit • No performance degradation is experienced
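The steering policy behind PSR can be sketched as follows. This is only an illustration of the preference: the function name and the unit numbering are hypothetical, and the real mechanism operates inside the instruction queue rather than as a standalone selection function.

```python
def pick_unit_for_trailing(leading_unit, free_units):
    """Prefer a free functional unit other than the one the leading copy used."""
    for unit in free_units:
        if unit != leading_unit:
            return unit
    # Preferential, not mandatory: reuse the same unit if it is the only one free.
    if leading_unit in free_units:
        return leading_unit
    return None  # nothing free this cycle; the instruction waits
```

Running the two copies of an instruction on different units means a permanent fault in one unit corrupts only one copy, so the existing comparison can catch it.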
SRT – One Logical Thread • SRT 32% slower than single thread on SMT • SRT 11% faster than running two redundant copies • Degradation 30% with per-thread store queue • Best-case 26% degradation with oracle store queue
SRT – Two Logical Threads • Degradation of SRT is 40% • Per-thread store queue gives 32% degradation • Store lifetime drops from 44 cycles (vs. 39 cycles for a single thread) • Oracle store queue gives 5% better efficiency
Chip-level Redundant Threading • With one logical thread, CRT performs similarly to lockstepping • With two logical threads, CRT beats Lock0 and Lock8 by 10% and 2% respectively • Adding the per-thread store queue causes CRT to beat Lock8 by 13% on average (22% maximum) • Using an oracle store queue improves performance by 6% more
CRT with Four Logical Threads • Initial CRT configuration is no better than Lock8 • Adding per-thread store queue gives CRT 13% better performance than Lock8 • Using an oracle store queue improves performance only by another 2%
Conclusions • The benefits of SRT are not as great as in the original paper when using a detailed model • 30% and 32% degradation seen on single thread and multithread workloads • SRT methods can be used to detect permanent faults • Chip-level redundant threading gives improved performance over lockstepped processors • Overall CRT provided a 13% improvement
Transient Fault Recovery • AR-SMT suggests that the R-stream could be used as a checkpoint for recovery • SRT suggests checkpoint/restart or failover • Argues that since faults are infrequent, they will have a minor impact on performance
Future Lectures ? • Hardware Transient Fault Recovery • T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading”, ISCA 2002 • Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar, and Irith Pomeranz, “Transient-Fault Recovery for Chip Multiprocessors”, ISCA 2003 • Slipstream Processors (an AR-SMT extension) • Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg, “Slipstream Processors: Improving both Performance and Fault Tolerance”, ASPLOS 2000 • Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg, “Slipstream Execution Mode for CMP-Based Multiprocessors”, HPCA 2003