1. Transient Fault Detection via Simultaneous Multithreading
Speaker note: introduce Steve and yourself.
2. Transient Faults
Faults that persist for a “short” duration
Cause: cosmic rays (e.g., neutrons)
Effect: knock off electrons, discharge capacitor
Solution
no practical absorbent for cosmic rays
1 fault per 1000 computers per year (estimated fault rate)
Future is worse
smaller feature size, reduced voltage, higher transistor count, reduced noise margin
Speaker note: get through this slide quickly.
3. Fault Detection in Compaq Himalaya System
Replication is completely in hardware, not visible to OS
Speaker note: get through this slide quickly.
4. Fault Detection via Simultaneous Multithreading
Speaker note: transition to this more smoothly; cost-performance tradeoff.
5. Speaker note: quickly.
6. Simultaneous & Redundantly Threaded Processor (SRT)
+ Less hardware compared to replicated microprocessors
SMT needs ~5% more hardware over uniprocessor
SRT adds very little hardware overhead to existing SMT
+ Better performance than complete replication
better use of resources
+ Lower cost
avoids complete replication
market volume of SMT & SRT
7. SRT Design Challenges
Lockstepping doesn’t work
SMT may issue same instruction from redundant threads in different cycles
Must carefully fetch/schedule instructions from redundant threads
branch misprediction
cache miss
8. Contributions & Outline
Sphere of Replication (SoR)
Output comparison for SRT
Input replication for SRT
Performance Optimizations for SRT
SRT outperforms on-chip replicated microprocessors
Related Work
Summary
9. Sphere of Replication (SoR)
SRT: time & space redundancy, unlike prior approaches, which use space redundancy only
identify boundaries where redundancy ends …
11. Sphere of Replication for SRT
SRT derived from SMT
SMT pipeline looks like uniprocessor pipeline
But, have mix of instructions from two or more threads
Here we have corresponding loads from two threads in RUU/IQ
SoR includes IQ (e.g., load)
space redundancy
time redundancy
SoR combines both logical and physical replication
12. Output Comparison in SRT
13. Output Comparison
<address, data> for stores from redundant threads
compare & validate at commit time
Speaker note: we don’t do output comparison on all instructions, only selected ones.
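As a rough illustration of comparing <address, data> pairs at commit time, here is a minimal Python sketch. The names Store and StoreComparator are hypothetical, and the sketch assumes the leading thread always commits a given store before the trailing thread does; the slides do not specify the hardware structure used.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    address: int
    data: int


class StoreComparator:
    """Hypothetical checker (not from the slides): holds leading-thread
    stores until the trailing thread commits the matching store."""

    def __init__(self):
        self.pending = deque()  # leading-thread stores, in program order
        self.memory = {}        # stands in for state outside the sphere of replication

    def leading_commit(self, store):
        # Assumption: the leading thread commits first, so its store is buffered.
        self.pending.append(store)

    def trailing_commit(self, store):
        if not self.pending:
            raise RuntimeError("trailing store has no leading counterpart")
        expected = self.pending.popleft()
        if expected != store:
            return False  # <address, data> mismatch: fault detected, store not released
        self.memory[store.address] = store.data  # validated store leaves the sphere
        return True
```

Only stores are checked here, which matches the point that output comparison applies to selected instructions rather than all of them.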
14. Input Replication in SRT
15. Input Replication
Cached load data
pair loads from redundant threads: too slow
allow both loads to probe cache: false faults with I/O or multiprocessors
Load Value Queue (LVQ)
pre-designated leading & trailing threads
Speaker note: mention that the leading thread executes loads out-of-order & speculatively; the trailing thread doesn’t.
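The Load Value Queue idea can be sketched in a few lines of Python. This is an illustrative model only: the class name LoadValueQueue, the dict used as memory, and the address check on the trailing side are assumptions, not details taken from the slides.

```python
from collections import deque


class LoadValueQueue:
    """Hypothetical FIFO model: only the leading thread reads the cache;
    the trailing thread receives the same value from the queue."""

    def __init__(self):
        self.entries = deque()  # (address, value) pairs in leading-thread order

    def leading_load(self, address, memory):
        # The leading thread performs the real cache/memory access.
        value = memory[address]
        self.entries.append((address, value))
        return value

    def trailing_load(self, address):
        # The trailing thread issues loads in program order and never probes the cache.
        lead_address, value = self.entries.popleft()
        if lead_address != address:
            # Assumed check: differing addresses indicate a possible fault.
            raise ValueError("load address mismatch between threads")
        return value
```

Because the trailing thread never reads memory itself, a write by another processor or an I/O device between the two loads cannot make the threads diverge, which is the false-fault case listed for the "both loads probe the cache" option.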
16. Input Replication (contd.)
Cached Load Data: alternate solution
Active Load Address Buffer
Special Cases
Cycle- or time-sensitive instructions
External interrupts
18. Performance Optimizations
Slack fetch
maintain constant slack of instructions between leading and trailing thread
+ leading thread prefetches cache misses
+ leading thread prefetches correct branch outcomes
Branch Outcome Queue
feed branch outcome from leading to trailing thread
Combine the above two
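The two optimizations above can be sketched as a fetch policy plus a FIFO of branch outcomes. This is a simplified Python model under stated assumptions: pick_fetch_thread, BranchOutcomeQueue, and the PC check are hypothetical names and details, and the slack value is a free parameter rather than a figure from the slides.

```python
from collections import deque


def pick_fetch_thread(leading_fetched, trailing_fetched, slack):
    """Fetch for the leading thread until it is `slack` instructions ahead."""
    if leading_fetched - trailing_fetched < slack:
        return "leading"
    return "trailing"


class BranchOutcomeQueue:
    """Hypothetical FIFO of resolved branch outcomes, written by the leading thread."""

    def __init__(self):
        self.outcomes = deque()

    def push(self, pc, taken):
        # The leading thread records the outcome once the branch has resolved.
        self.outcomes.append((pc, taken))

    def pop(self, pc):
        # The trailing thread consumes outcomes in program order instead of predicting.
        lead_pc, taken = self.outcomes.popleft()
        if lead_pc != pc:
            # Assumed check: the two threads should reach the same branches in order.
            raise ValueError("branch PC mismatch between threads")
        return taken
```

With slack fetch alone the trailing thread benefits indirectly from the leading thread's cache and predictor warm-up; with the queue it receives the resolved outcomes directly.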
19. Baseline Architecture Parameters
20. Target Architectures
SRT
SMT + fault detection
Output Comparison
Input Replication (Load Value Queue)
Slack Fetch + Branch Outcome Queue
ORH-Dual: On-Chip Replicated Hardware
Each pipeline of dual has half the resources of SRT
Two pipelines share fetch stage (including branch predictor)
21. Performance Model & Benchmarks
SimpleScalar 3.0
modified to support SMT by Steve Raasch, U. of Michigan
SMT/SimpleScalar modified to support SRT
Benchmarks
compiled with gcc 2.6 + full optimization
subset of SPEC95 suite (11 benchmarks)
skipped between 300 million and 20 billion instructions
simulated 200 million for each benchmark
22. SRT vs. ORH-Dual
Performance improves because output comparison and input replication don’t hurt
Slack Fetch and Branch Outcome Queue help
23. Recent Related Work
Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998.
+ First to propose use of SMT for fault detection
AR-SMT, Rotenberg, FTCS, 1999
+ Forwards values from leading to checker thread
DIVA, Austin, MICRO, 1999
+ Converts checker thread into simple processor
Our work on SRT
Sphere of replication
formalizes the problem
e.g., checker and redundant threads need to be separate, unlike AR-SMT or DIVA
e.g., AR-SMT needs to be augmented with ECC on register file, DIVA cannot capture transient faults on uncached loads
Output comparison
e.g., need to compare only instructions leaving the sphere (stores for SRT), whereas AR-SMT and DIVA compare every instruction
Input replication
e.g., false transient fault detection in AR-SMT and DIVA because the cached load is performed twice
24. Improvements over Prior Work
Sphere of Replication (SoR)
e.g., AR-SMT register file must be augmented with ECC
e.g., DIVA must handle uncached loads in a special way
Output Comparison
e.g., AR-SMT & DIVA compare all instructions, SRT compares selected ones based on SoR
Input Replication
e.g., AR-SMT & DIVA detect false transient faults, SRT avoids this problem using LVQ
Slack Fetch
Speaker note: mention that DIVA and AR-SMT don’t distinguish between redundant thread & checker thread.
25. Summary
Simultaneous & Redundantly Threaded Processor (SRT)
SMT + Fault detection
Sphere of replication
Output comparison of committed store instructions
Input replication via load value queue
Slack fetch & branch outcome queue
SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%