Redundant Multithreading Techniques for Transient Fault Detection
Shubu Mukherjee (Intel), Michael Kontz (HP, current), Steve Reinhardt (Intel Consultant, U. of Michigan)
Versions of this work have been presented at ISCA 2000 and ISCA 2002.
Transient Faults from Cosmic Rays & Alpha Particles • Vulnerability grows with decreasing feature size • Decreasing voltage (exponential dependence?) • Increasing number of transistors (Moore's Law) • Increasing system size (number of processors) • No practical absorbent for cosmic rays
Fault Detection via Lockstepping (HP Himalaya) • Replicated microprocessors + cycle-by-cycle lockstepping • Input replication and output comparison at the sphere boundary • Outside the sphere: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC
Fault Detection via Simultaneous Multithreading • Replicated threads instead of replicated microprocessors in lockstep • Input replication and output comparison as before • Outside the sphere: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC
Simultaneous Multithreading (SMT) • Thread 1 and Thread 2 share the instruction scheduler and functional units • Examples: Alpha 21464, Intel Northwood
Redundant Multithreading (RMT) RMT = Multithreading + Fault Detection
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
Overview • SRT = SMT + Fault Detection • Advantages • Piggyback on an SMT processor with little extra hardware • Better performance than complete replication • Lower cost due to market volume of SMT & SRT • Challenges • Lockstepping very difficult with SRT • Must carefully fetch/schedule instructions from redundant threads
Sphere of Replication • Two copies of each architecturally visible thread (leading thread and trailing thread) • Co-scheduled on SMT core • Input replication on entry to the sphere; output comparison on exit: signal fault if results differ • Memory system (incl. L1 caches) lies outside the sphere
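The sphere-of-replication idea can be illustrated with a minimal sketch (hypothetical names, not the actual hardware): two copies of a thread run on replicated inputs, and only values that leave the sphere are compared.

```python
# Minimal sketch of the sphere of replication (hypothetical simplification):
# redundant copies run on replicated inputs; outputs crossing the sphere
# boundary are compared, and any mismatch signals a transient fault.

def run_redundant(workload, inputs):
    """Run leading and trailing copies on replicated inputs; compare outputs."""
    leading_out = [workload(x) for x in inputs]   # leading thread
    trailing_out = [workload(x) for x in inputs]  # trailing thread (replicated inputs)
    for a, b in zip(leading_out, trailing_out):
        if a != b:
            raise RuntimeError("transient fault detected: output mismatch")
    return leading_out  # only compared values exit the sphere

# A fault-free run produces matching outputs.
results = run_redundant(lambda x: x * 2 + 1, [1, 2, 3])
```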
Basic Pipeline Dispatch Decode Commit Fetch Execute Data Cache
Load Value Queue (LVQ) • Leading thread accesses the data cache; load values are forwarded to the trailing thread through the LVQ • Keeps threads on the same path despite I/O or MP writes • Out-of-order load issue possible
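The LVQ mechanism can be sketched as follows (a hypothetical simplification, not the hardware design): the leading thread performs the real cache access and enqueues the value; the trailing thread consumes the queued value instead of re-accessing memory, so both threads see identical load data even if memory changes in between.

```python
from collections import deque

# Sketch of a Load Value Queue (hypothetical simplification): the leading
# thread's load value is forwarded through a FIFO so the trailing thread
# stays on the same path despite intervening I/O or multiprocessor writes.

class LoadValueQueue:
    def __init__(self):
        self.q = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]          # real cache access by the leading thread
        self.q.append((addr, value))  # forward value to the trailing thread
        return value

    def trailing_load(self, addr):
        q_addr, value = self.q.popleft()
        assert q_addr == addr         # threads must stay on the same path
        return value

mem = {0x100: 42}
lvq = LoadValueQueue()
v1 = lvq.leading_load(mem, 0x100)    # leading thread loads 42
mem[0x100] = 99                      # external write (I/O or another CPU)
v2 = lvq.trailing_load(0x100)        # trailing thread still sees 42
```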
Store Queue Comparator (STQ) • Compares leading- and trailing-thread store outputs before they reach the data cache • Catches faults before they propagate to the rest of the system
Store Queue Comparator (cont'd) • Compares address & data of matching stores (e.g., st 5 [0x120]) before release to the data cache • Extends residence time of leading-thread stores • Size constrained by cycle-time goal • Base CPU statically partitions a single queue among threads • Potential solution: per-thread store queues • Deadlock possible if the matching trailing store cannot commit; several small but crucial changes avoid this
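A sketch of the comparator's core behavior (hypothetical simplification): a leading-thread store waits in the queue until the trailing thread's matching store arrives, and only a matching address/data pair is released to the data cache.

```python
from collections import deque

# Sketch of the store queue comparator (hypothetical simplification):
# leading-thread stores are held until the trailing thread's matching
# store arrives; address and data are compared, and a mismatch signals
# a fault before anything leaves the sphere of replication.

class StoreQueueComparator:
    def __init__(self):
        self.pending = deque()   # leading-thread stores awaiting a match
        self.cache = {}          # stands in for the data cache

    def leading_store(self, addr, data):
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (lead_addr, lead_data) != (addr, data):
            raise RuntimeError("transient fault detected before cache update")
        self.cache[addr] = data  # fault-free: store exits the sphere

stq = StoreQueueComparator()
stq.leading_store(0x120, 5)
stq.trailing_store(0x120, 5)   # addresses and data match: committed
```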
Branch Outcome Queue (BOQ) • Forwards leading-thread branch targets to trailing-thread fetch • 100% prediction accuracy in the absence of faults
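The BOQ idea in miniature (a hypothetical sketch): the leading thread pushes each resolved branch target, and the trailing thread's fetch pops them in order instead of consulting a branch predictor, so fault-free execution never mispredicts.

```python
from collections import deque

# Sketch of the Branch Outcome Queue (hypothetical simplification):
# resolved leading-thread branch targets become perfect "predictions"
# for the trailing thread's fetch stage.

boq = deque()

def leading_branch_resolve(target):
    boq.append(target)       # committed branch outcome from leading thread

def trailing_fetch_predict():
    return boq.popleft()     # "prediction" is the leading thread's outcome

leading_branch_resolve(0x280)
target = trailing_fetch_predict()   # trailing thread fetches from 0x280
```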
Line Prediction Queue (LPQ) • Alpha 21464 fetches chunks using line predictions • Chunk = contiguous block of 8 instructions
Line Prediction Queue (cont'd) • Generate a stream of "chunked" line predictions; a chunk ends at the end of a cache line or at a taken branch • Every leading-thread instruction carries its I-cache coordinates • Commit logic merges these into fetch chunks for the LPQ, independent of leading-thread fetch chunks • Commit-to-fetch dependence raised deadlock issues (Figure: instruction stream split into two chunks, one ending at a cache-line boundary, the next at a taken branch)
Line Prediction Queue (cont'd) • Read-out on trailing-thread fetch is also complex • Base CPU "thread chooser" receives multiple line predictions but ignores all but one • Fetches must be retried on an I-cache miss • Tricky to keep the queue in sync with thread progress: add a handshake to advance the queue head, roll back the head on an I-cache miss, and track both the last attempted & last successful chunks
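The head-advance handshake described above can be sketched like this (hypothetical simplification of the sync logic): the head advances only when a fetch succeeds, and a miss leaves it in place so the same chunk is retried.

```python
# Sketch of keeping the Line Prediction Queue in sync with trailing-thread
# fetch (hypothetical simplification): the queue head advances via a
# handshake only after a successful fetch; an I-cache miss leaves the head
# at the last successful chunk so the attempted chunk is retried.

class LinePredictionQueue:
    def __init__(self, predictions):
        self.predictions = list(predictions)
        self.head = 0                    # index of the next chunk to attempt

    def fetch(self, icache_hit):
        pred = self.predictions[self.head]   # attempted chunk
        if icache_hit:
            self.head += 1               # handshake: advance past success
            return pred
        return None                      # miss: head unchanged, retry later

lpq = LinePredictionQueue([0x1F8, 0x200])
first = lpq.fetch(icache_hit=False)      # miss: no advance, retry same chunk
retry = lpq.fetch(icache_hit=True)       # retried chunk now succeeds
second = lpq.fetch(icache_hit=True)
```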
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
Preferential Space Redundancy • SRT combines two types of redundancy • Time: same physical resource, different time • Space: different physical resource • Space redundancy preferable • Better coverage of permanent/long-duration faults • Bias towards space redundancy where possible
PSR Example: Clustered Execution • Base CPU has two execution clusters (IQ 0/Exec 0 and IQ 1/Exec 1) • Separate instruction queues and function units • Instructions are steered to a cluster in the dispatch stage
PSR Example: Clustered Execution (cont'd) • Leading-thread instructions record their cluster (e.g., add r1,r2,r3 [0]) • The cluster bit is carried with the fetch chunk through the LPQ and attached to the trailing-thread instruction • Dispatch sends the trailing instruction to the opposite cluster if possible
PSR Example: Clustered Execution (cont'd) • 99.94% of instruction pairs use different clusters • Full spatial redundancy for execution • No performance impact (occasional slight gain)
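The steering policy above can be sketched in a few lines (hypothetical simplification of the dispatch logic): prefer the cluster the leading instruction did not use, falling back to the same cluster when the preferred one is full.

```python
# Sketch of preferential space redundancy at dispatch (hypothetical
# simplification): steer the trailing instruction to the opposite
# execution cluster when a slot is free (space redundancy), otherwise
# fall back to the same cluster (time redundancy only).

def steer_trailing(leading_cluster, free_slots):
    preferred = 1 - leading_cluster     # opposite of the two clusters (0/1)
    if free_slots[preferred] > 0:
        return preferred                # space redundancy achieved
    return leading_cluster              # fall back: same physical resource

choice_space = steer_trailing(0, {0: 4, 1: 2})  # opposite cluster free
choice_time = steer_trailing(0, {0: 4, 1: 0})   # forced onto same cluster
```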
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
SRT Evaluation • Used SPEC CPU95, 15M instrs/thread • Constrained by simulation environment • 120M instrs for 4 redundant thread pairs • Eight-issue, four-context SMT CPU • 128-entry instruction queue • 64-entry load and store queues • Default: statically partitioned among active threads • 22-stage pipeline • 64KB 2-way assoc. L1 caches • 3 MB 8-way assoc L2
SRT Performance: One Thread • One logical thread runs on two hardware contexts • Performance degradation = 30% • Per-thread store queue buys an extra 4%
SRT Performance: Two Threads • Two logical threads run on four hardware contexts • Average slowdown increases to 40% • Only 32% with per-thread store queues
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
Chip-Level Redundant Threading • SRT typically more efficient than splitting one processor into two half-size CPUs • What if you already have two CPUs? • IBM Power4, HP PA-8800 (Mako) • Conceptually easy to run these in lockstep • Benefit: full physical redundancy • Costs: latency through centralized checker logic; overheads (misspeculation etc.) incurred twice • CRT combines the best of SRT & lockstepping • Requires multithreaded CMP cores
Chip-Level Redundant Threading (cont'd) • Leading thread A on CPU A is paired with trailing thread A on CPU B, and leading thread B on CPU B with trailing thread B on CPU A • Load values, line predictions, and stores cross between the CPUs through the LVQ, LPQ, and store comparison paths
CRT Performance • With per-thread store queues, ~13% improvement over lockstepping with 8-cycle checker latency
Summary & Conclusions • SRT is applicable in a real-world SMT design • ~30% slowdown, slightly worse with two threads • Store queue capacity can limit performance • Preferential space redundancy improves coverage • Chip-level Redundant Threading = SRT for CMPs • Looser synchronization than lockstepping • Frees up resources for other application threads
More Information • Publications • S. K. Reinhardt & S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000 • S. S. Mukherjee, M. Kontz, & S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002 • Papers available from: • http://www.cs.wisc.edu/~shubu • http://www.eecs.umich.edu/~stever • Patents • Compaq/HP filed eight patent applications on SRT