Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter
3 SMT + Fault Tolerance Papers • Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999. • Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000. • Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.
Outline • Background • SMT • Hardware fault tolerance • AR-SMT • Basic mechanisms • Implementation issues • Simulation and Results • Transient Fault Detection via SMT • Sphere of replication • Basic mechanisms • Comparison to AR-SMT • Simulation and Results • Redundant Multithreading Alternatives • Realistic processor implementation • CRT • Simulation and Results • Fault Recovery • Next Lecture
Transient Fault Detection via SMT • More detailed analysis of Simultaneous and Redundant Threading (SRT) • Introduces Sphere of Replication concept • Explores SRT design space • Discussion of input replication • Architecture for output comparison • Performance improving mechanisms • More depth in simulation
Sphere of Replication • Components inside sphere are protected against faults using replication • External components must use other means of fault tolerance (parity, ECC, etc.) • Inputs to sphere must be duplicated for each of the redundant processes • Outputs of the redundant processes are compared to detect faults • Simple to understand in lockstepping • Larger sphere: more state to replicate, but less input replication and output comparison
Sphere of Replication (part 2) • Size of sphere of replication • Two alternatives – with and without register file • Instruction and data caches kept outside
Input Replication • Must ensure that both threads receive the same inputs to guarantee they follow the same path • Instructions – assume no self-modifying code • Cached load data • Out-of-order execution issues • Multiprocessor cache coherence issues • Uncached load data – must synchronize • External interrupts • Stall leading thread and deliver interrupt synchronously • Record interrupt delivery point and deliver later
Cached Load Data - ALAB • Active Load Address Buffer (ALAB) • Delays cache block replacement or invalidation • ALAB is table with address tag, counter, and pending-invalidate bit • Counter tracks trailing thread’s outstanding loads • Blocks cannot be replaced or invalidated until counter is zero • Pending-invalidate set on unevictable block • Leading thread stalls when ALAB is full • Must detect and address deadlocks
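The ALAB's bookkeeping can be pictured as a small reference-counting table. The sketch below is a behavioral model with assumed names (ALAB, leading_load, try_evict), not the paper's hardware design: each entry counts leading-thread loads the trailing thread has not yet repeated, and the cache may evict a block only once that count drains to zero.

```python
# Behavioral sketch of an Active Load Address Buffer (ALAB).
# Illustrative structure only; entry layout and method names are assumptions.

class ALABEntry:
    def __init__(self, tag):
        self.tag = tag                   # cache-block address tag
        self.count = 0                   # leading loads not yet repeated by the trailer
        self.pending_invalidate = False  # eviction requested but deferred

class ALAB:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}                # tag -> ALABEntry

    def leading_load(self, tag):
        """Leading-thread load: allocate/increment, or signal a stall if full."""
        entry = self.entries.get(tag)
        if entry is None:
            if len(self.entries) == self.num_entries:
                return False             # ALAB full: leading thread must stall
            entry = ALABEntry(tag)
            self.entries[tag] = entry
        entry.count += 1
        return True

    def trailing_load(self, tag):
        """Trailing thread repeats the load; the block may become evictable."""
        entry = self.entries[tag]
        entry.count -= 1
        if entry.count == 0:
            deferred = entry.pending_invalidate
            del self.entries[tag]
            return deferred              # True: honor the deferred invalidation now
        return False

    def try_evict(self, tag):
        """Cache asks permission to replace or invalidate a block."""
        entry = self.entries.get(tag)
        if entry is None:
            return True                  # no outstanding trailing loads: OK
        entry.pending_invalidate = True
        return False                     # defer until the trailing thread catches up
```

A full table stalls the leading thread's next load, which is one reason the real design must detect and break deadlocks.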
Cached Load Data - LVQ • Load Value Queue (LVQ) • Explicit designation of leading and trailing threads • Only the leading thread issues loads and stores • Load addresses and values are forwarded to the trailing thread via the LVQ • Trailing thread executes loads in-order and non-speculatively (why?) • Input replication guaranteed • Simpler design and less pressure on the cache • Earlier fault detection • Constrains scheduling of trailing thread loads
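A minimal behavioral sketch of the LVQ, with illustrative names: the leading thread pushes (address, value) pairs in program order, and the trailing thread pops them in the same order instead of probing the data cache. Consuming in order and non-speculatively is the answer to the "(why?)" above: any reordering or squashed speculative load in the trailing thread would misalign its load stream against a simple FIFO.

```python
# Behavioral sketch of a Load Value Queue (LVQ); names are assumptions.
from collections import deque

class LoadValueQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def push(self, address, value):
        """Leading thread forwards a load; returns False to stall it when full."""
        if len(self.queue) == self.capacity:
            return False                 # backpressure on the leading thread
        self.queue.append((address, value))
        return True

    def pop(self, expected_address):
        """Trailing thread consumes the next load, strictly in program order."""
        address, value = self.queue.popleft()
        if address != expected_address:
            # Address disagreement means the threads diverged: flag a fault.
            raise RuntimeError("LVQ address mismatch: possible fault detected")
        return value
```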
Output Comparison • Store buffer used to verify address and value of stores to be committed • Trailing thread searches for matching entry • Mismatch means fault occurred • Cached load values require no checking • Uncached load values could have side effects • Issue non-speculatively, so stall leading thread • Assumes uncached loads are always detected • Register Check Buffer used to match register writebacks. • 3 register files required: future files + architectural file
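The store-side check can be pictured as a FIFO that releases a store to memory only after both threads agree on its address and value. The sketch below uses assumed names and omits the register check buffer and uncached-access handling.

```python
# Behavioral sketch of output comparison through a checked store buffer.
class CheckedStoreBuffer:
    def __init__(self):
        self.pending = []                # leading-thread stores, in program order

    def leading_store(self, address, value):
        self.pending.append((address, value))

    def trailing_store(self, address, value):
        """Compare against the oldest unmatched leading store, then commit."""
        lead_addr, lead_val = self.pending.pop(0)
        if (lead_addr, lead_val) != (address, value):
            raise RuntimeError("store mismatch: transient fault detected")
        self.commit_to_memory(address, value)

    def commit_to_memory(self, address, value):
        pass                             # the verified store may now leave the sphere
```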
Enhancing SRT Performance • Slack Fetch • Maintain a constant lag between the two threads' execution • Leading thread updates branch and data predictors • Leading thread prefetches loads • Traditional SMT ICOUNT fetch policy is modified to maintain the slack • Branch Outcome Queue • Deliver branch outcomes directly to the trailing thread • Trailing thread has no control speculation
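One way the slack can be enforced is by gating the trailing thread's fetch eligibility inside the SMT fetch-selection stage. The function below is a rough sketch of that idea layered on ICOUNT; the threshold logic and names are assumptions, not the paper's exact heuristic.

```python
# Sketch of a slack-fetch policy layered on SMT ICOUNT fetch selection.
def select_fetch_thread(lead_fetched, trail_fetched, target_slack, icount):
    """Pick which thread fetches this cycle.

    lead_fetched / trail_fetched: instructions fetched so far by each thread.
    target_slack: desired lead of the leading thread (e.g. 256 instructions).
    icount: dict thread -> in-flight instruction count (standard ICOUNT input).
    """
    slack = lead_fetched - trail_fetched
    if slack < target_slack:
        return "leading"                 # build up slack before the trailer runs
    # Once slack is established, fall back to ICOUNT: favor the thread with
    # fewer instructions occupying the front end and issue queues.
    return "leading" if icount["leading"] <= icount["trailing"] else "trailing"
```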
AR-SMT versus SRT • AR-SMT only has space redundancy in the functional units • SRT can potentially have space redundancy across the pipeline • AR-SMT is trace-processor-based while SRT is built on a conventional SMT processor • Register file of the R-stream must be protected • AR-SMT forwards load data values • AR-SMT checks every instruction during fault detection • SRT requires no operating system modifications • AR-SMT doesn't support uncached loads and stores or multiprocessor coherence • Delay buffer performs the function of the register check buffer and branch outcome queue • All of main memory is in the AR-SMT sphere • Better fault coverage but very costly
Simulation Environment • Modified SimpleScalar "sim-outorder" • Long front-end pipeline because of the out-of-order and SMT machinery • Simple approximation of a trace cache • Used 11 SPEC95 benchmarks
Results • Again, this paper only analyzes the performance impact of fault tolerance • Baseline Characterization • ORH-Dual: two pipelines, each with half the resources • SMT-Dual: replicated threads with no detection hardware • ORH-Dual and SMT-Dual are 32% slower than SMT-Single
Slack Fetch & Branch Outcome Queue • 10%, 14%, and 15% (27% max) performance improvements for SF, BOQ, and SF + BOQ • Reduced memory stalls through prefetching • Prevents trailing thread from wasting resources on speculation • Performance is better with a slack of 256 instructions than with 32 or 128
Input Replication • Assumes output comparison is performed by an oracle • Almost no performance penalty paid for a 64-entry ALAB or LVQ • With a 16-entry ALAB or LVQ, benchmark performance degraded by 8% and 5%, respectively
Output Comparison • Assumes inputs are replicated by an oracle • Leading thread can stall if the store queue is full • A 64-entry store buffer eliminates almost all stalls • A register check buffer of 32, 64, or 128 entries degrades performance by 27%, 6%, and 1%, respectively
Overall Results • Speedup of the SRT processor with a slack fetch of 256 instructions, a 128-entry branch outcome queue, a 64-entry store buffer, and a 64-entry load value queue • SRT demonstrates a 16% speedup on average (up to 29%) over a lockstepping processor with the "same" hardware
Multi-cycle and Permanent Faults • Transient faults could potentially persist for multiple cycles and affect both threads • Increasing the slack decreases this possibility • Spatial redundancy can be increased by partitioning the functional units and forcing the threads to execute on different groups • Performance loss for this approach is less than 2%
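A toy sketch of that partitioning idea, assuming a four-ALU machine split in half: each thread may issue only to its own group of functional units, so the two redundant copies of a computation never share a (possibly faulty) unit.

```python
# Sketch of spatial redundancy via functional-unit partitioning (assumed 4 ALUs).
FU_GROUPS = {
    "leading":  ["alu0", "alu1"],        # leading thread's private units
    "trailing": ["alu2", "alu3"],        # trailing thread's private units
}

def issue(thread, busy_fus):
    """Issue one instruction to a free unit in the thread's own partition."""
    for fu in FU_GROUPS[thread]:
        if fu not in busy_fus:
            busy_fus.add(fu)
            return fu                    # issued to this unit
    return None                          # no free unit in the partition this cycle
```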
Conclusions • Sphere of replication helps the analysis of input replication and output comparison • Keep the register file inside the sphere • LVQ is superior to the ALAB (simpler) • Slack fetch and branch outcome queue mechanisms enhance performance • The SRT fault tolerance method performs 16% better on average than lockstepping