Transient Fault Detection via Simultaneous Multithreading

1. Transient Fault Detection via Simultaneous Multithreading. Just introduce Steve and yourself.

2. Transient Faults. Faults that persist for a "short" duration. Cause: cosmic rays (e.g., neutrons). Effect: knock off electrons, discharge capacitor. Solution: no practical absorbent for cosmic rays. Estimated fault rate: 1 fault per 1000 computers per year. Future is worse: smaller feature size, reduced voltage, higher transistor count, reduced noise margin. Get through this slide quickly.

3. Fault Detection in Compaq Himalaya System. Get through this slide quickly. Replication is completely in hardware, not visible to OS.

4. Fault Detection via Simultaneous Multithreading. Transition to this more smoothly, cost-performance tradeoff.

5. quickly

6. Simultaneous & Redundantly Threaded Processor (SRT). + Less hardware compared to replicated microprocessors: SMT needs ~5% more hardware over uniprocessor; SRT adds very little hardware overhead to existing SMT. + Better performance than complete replication: better use of resources. + Lower cost: avoids complete replication; market volume of SMT & SRT.

7. SRT Design Challenges. Lockstepping doesn't work: SMT may issue the same instruction from redundant threads in different cycles. Must carefully fetch/schedule instructions from redundant threads: branch misprediction, cache miss.

8. Contributions & Outline. Sphere of Replication (SoR); output comparison for SRT; input replication for SRT; performance optimizations for SRT; SRT outperforms on-chip replicated microprocessors; related work; summary.

9. Sphere of Replication (SoR). SRT: time & space redundancy, unlike prior work, which is space only. Identify boundaries where redundancy ends ...

11. Sphere of Replication for SRT. SRT is derived from SMT. The SMT pipeline looks like a uniprocessor pipeline, but has a mix of instructions from two or more threads. Here we have corresponding loads from two threads in the RUU/IQ. SoR includes IQ (e.g., load). Space redundancy, time redundancy. SoR combines both logical and physical replication.

12. Output Comparison in SRT

13. Output Comparison. <address, data> for stores from redundant threads: compare & validate at commit time. Note that we don't do output comparison on all instructions, only selected ones.
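The store comparison on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the names `Store`, `StoreComparator`, `leading_commit`, and `trailing_commit` are hypothetical, and the sketch assumes the leading thread's committed stores are buffered in order until the trailing thread commits the matching store.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    address: int
    data: int


class StoreComparator:
    """Hypothetical model of <address, data> comparison at commit time."""

    def __init__(self):
        self.pending = deque()  # leading-thread stores awaiting validation
        self.memory = {}        # state outside the sphere of replication

    def leading_commit(self, store):
        # Assumption: the leading thread's store is held, not yet released.
        self.pending.append(store)

    def trailing_commit(self, store):
        # Compare the trailing store with the oldest buffered leading store.
        if not self.pending:
            raise RuntimeError("trailing thread committed a store first")
        leading = self.pending.popleft()
        if leading != store:
            return False  # mismatch: a fault is detected, nothing is written
        self.memory[store.address] = store.data
        return True
```

Only stores pass through the comparator in this sketch, matching the note that output comparison is done on selected instructions rather than on all of them.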

14. Input Replication in SRT

15. Input Replication. Cached load data: pairing loads from redundant threads is too slow; allowing both loads to probe the cache gives false faults with I/O or multiprocessors. Load Value Queue (LVQ): pre-designated leading & trailing threads. Mention: leading thread executes loads out-of-order & speculatively; trailing thread doesn't.
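A rough Python sketch of the Load Value Queue idea follows. It is a simplified model under stated assumptions: the class and method names are hypothetical, the queue is treated as a plain FIFO filled by the leading thread, and the address check on the trailing side is an assumption of this sketch rather than something stated on the slide.

```python
from collections import deque


class LoadValueQueue:
    """Hypothetical FIFO that replicates load inputs for the trailing thread."""

    def __init__(self):
        self.entries = deque()

    def leading_load(self, address, memory):
        # Only the leading thread probes the cache/memory.
        value = memory[address]
        self.entries.append((address, value))
        return value

    def trailing_load(self, address):
        # The trailing thread reads the recorded value instead of the cache.
        lead_address, value = self.entries.popleft()
        if lead_address != address:
            # Assumption: differing addresses indicate a fault.
            raise ValueError("load addresses differ between threads")
        return value
```

Because the trailing thread never probes the cache itself, a write from I/O or another processor that lands between the two loads cannot make the threads see different values, which is the false-fault case named on the slide.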

16. Input Replication (contd.). Cached load data, alternate solution: Active Load Address Buffer. Special cases: cycle- or time-sensitive instructions, external interrupts.

18. Performance Optimizations. Slack fetch: maintain constant slack of instructions between leading and trailing thread; + leading thread prefetches cache misses; + leading thread prefetches correct branch outcomes. Branch Outcome Queue: feed branch outcomes from leading to trailing thread. Combine the above two.
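The two optimizations can be illustrated with a small Python sketch. Everything here is a simplification under assumptions: `BranchOutcomeQueue`, `trailing_may_fetch`, and the PC check are hypothetical names and details, and the slack rule is modeled as a simple gate on fetched-instruction counts rather than as the actual fetch policy.

```python
from collections import deque


class BranchOutcomeQueue:
    """Hypothetical FIFO carrying branch outcomes to the trailing thread."""

    def __init__(self):
        self.outcomes = deque()

    def push(self, pc, taken):
        # Leading thread: record the resolved direction of a branch.
        self.outcomes.append((pc, taken))

    def pop(self, pc):
        # Trailing thread: use the recorded outcome instead of a prediction.
        lead_pc, taken = self.outcomes.popleft()
        if lead_pc != pc:
            # Assumption: a PC mismatch means the threads have diverged.
            raise ValueError("branch PCs differ between threads")
        return taken


def trailing_may_fetch(leading_fetched, trailing_fetched, slack):
    # Slack fetch: hold the trailing thread back until the leading thread
    # is more than `slack` instructions ahead.
    return leading_fetched - trailing_fetched > slack
```

With a slack of 4 (an arbitrary value for illustration), `trailing_may_fetch(10, 8, 4)` is false and `trailing_may_fetch(10, 5, 4)` is true, so the trailing thread stays behind and finds branch outcomes already waiting in the queue.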

19. Baseline Architecture Parameters

20. Target Architectures. SRT: SMT + fault detection; Output Comparison; Input Replication (Load Value Queue); Slack Fetch + Branch Outcome Queue. ORH-Dual (On-Chip Replicated Hardware): each pipeline of the dual has half the resources of SRT; the two pipelines share the fetch stage (including branch predictor).

21. Performance Model & Benchmarks. SimpleScalar 3.0, modified to support SMT by Steve Raasch, U. of Michigan; SMT/SimpleScalar modified to support SRT. Benchmarks: compiled with gcc 2.6 + full optimization; subset of SPEC95 suite (11 benchmarks); skipped between 300 million and 20 billion instructions; simulated 200 million for each benchmark.

22. SRT vs. ORH-Dual. Performance improves because output comparison and input replication don't hurt; Slack Fetch and Branch Outcome Queue help.

23. Recent Related Work. Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998: + first to propose use of SMT for fault detection. AR-SMT, Rotenberg, FTCS, 1999: + forwards values from leading to checker thread. DIVA, Austin, MICRO, 1999: + converts checker thread into simple processor. Our work on SRT: Sphere of replication formalizes the problem, e.g., checker and redundant threads need to be separate, unlike AR-SMT or DIVA; e.g., AR-SMT needs to be augmented with ECC on register file, DIVA cannot capture transient faults on uncached loads. Output comparison: e.g., need to compare only instructions leaving the sphere (stores for SRT), whereas every instruction is compared for AR-SMT and DIVA. Input replication: e.g., false transient fault detection in AR-SMT and DIVA because the cached load is done twice.

24. Improvements over Prior Work. Sphere of Replication (SoR): e.g., AR-SMT register file must be augmented with ECC; e.g., DIVA must handle uncached loads in a special way. Output Comparison: e.g., AR-SMT & DIVA compare all instructions, SRT compares selected ones based on SoR. Input Replication: e.g., AR-SMT & DIVA detect false transient faults, SRT avoids this problem using LVQ. Slack Fetch. Mention: DIVA and AR-SMT don't distinguish between redundant thread & checker thread.

25. Summary. Simultaneous & Redundantly Threaded Processor (SRT): SMT + fault detection. Sphere of replication; output comparison of committed store instructions; input replication via load value queue; slack fetch & branch outcome queue. SRT outperforms equivalently sized on-chip replicated hardware by 16% on average & up to 29%.
