Redundant Multithreading Techniques for Transient Fault Detection
Shubu Mukherjee (Intel), Michael Kontz (HP, current), Steve Reinhardt (Intel Consultant, U. of Michigan)
Versions of this work have been presented at ISCA 2000 and ISCA 2002.
Transient Faults from Cosmic Rays & Alpha Particles • Vulnerability grows with decreasing feature size • Decreasing voltage (exponential dependence?) • Increasing number of transistors (Moore's Law) • Increasing system size (number of processors) • No practical absorbent for cosmic rays
Fault Detection via Lockstepping (HP Himalaya) • Replicated microprocessors + cycle-by-cycle lockstepping • Input replication and output comparison at the sphere boundary • Outside the sphere: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC
Fault Detection via Simultaneous Multithreading • Replicated threads instead of replicated microprocessors in lockstep • Input replication and output comparison as before • Outside the sphere: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC
Simultaneous Multithreading (SMT) • Thread 1 and Thread 2 share the instruction scheduler and functional units • Examples: Alpha 21464, Intel Northwood
Redundant Multithreading (RMT) RMT = Multithreading + Fault Detection
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
Overview • SRT = SMT + Fault Detection • Advantages • Piggyback on an SMT processor with little extra hardware • Better performance than complete replication • Lower cost due to market volume of SMT & SRT • Challenges • Lockstepping very difficult with SRT • Must carefully fetch/schedule instructions from redundant threads
Sphere of Replication • Two copies of each architecturally visible thread (leading thread and trailing thread) • Co-scheduled on SMT core • Input replication on entry to the sphere; output comparison on exit: signal fault if results differ • Memory system (incl. L1 caches) lies outside the sphere
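The sphere-of-replication idea can be illustrated with a minimal sketch (hypothetical names, not the actual hardware): two copies of a thread run on replicated inputs, and only values that leave the sphere are compared.

```python
# Minimal sketch of the sphere of replication (hypothetical simplification):
# redundant copies run on replicated inputs; outputs crossing the sphere
# boundary are compared, and any mismatch signals a transient fault.

def run_redundant(workload, inputs):
    """Run leading and trailing copies on replicated inputs; compare outputs."""
    leading_out = [workload(x) for x in inputs]   # leading thread
    trailing_out = [workload(x) for x in inputs]  # trailing thread (replicated inputs)
    for a, b in zip(leading_out, trailing_out):
        if a != b:
            raise RuntimeError("transient fault detected: output mismatch")
    return leading_out  # only compared values exit the sphere

# A fault-free run produces matching outputs.
results = run_redundant(lambda x: x * 2 + 1, [1, 2, 3])
```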
Basic Pipeline Dispatch Decode Commit Fetch Execute Data Cache
Load Value Queue (LVQ) • Leading thread accesses the data cache; load values are forwarded to the trailing thread through the LVQ • Keeps threads on the same path despite I/O or MP writes • Out-of-order load issue possible
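The LVQ mechanism can be sketched as follows (a hypothetical simplification, not the hardware design): the leading thread performs the real cache access and enqueues the value; the trailing thread consumes the queued value instead of re-accessing memory, so both threads see identical load data even if memory changes in between.

```python
from collections import deque

# Sketch of a Load Value Queue (hypothetical simplification): the leading
# thread's load value is forwarded through a FIFO so the trailing thread
# stays on the same path despite intervening I/O or multiprocessor writes.

class LoadValueQueue:
    def __init__(self):
        self.q = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]          # real cache access by the leading thread
        self.q.append((addr, value))  # forward value to the trailing thread
        return value

    def trailing_load(self, addr):
        q_addr, value = self.q.popleft()
        assert q_addr == addr         # threads must stay on the same path
        return value

mem = {0x100: 42}
lvq = LoadValueQueue()
v1 = lvq.leading_load(mem, 0x100)    # leading thread loads 42
mem[0x100] = 99                      # external write (I/O or another CPU)
v2 = lvq.trailing_load(0x100)        # trailing thread still sees 42
```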
Store Queue Comparator (STQ) • Compares leading- and trailing-thread store outputs before they reach the data cache • Catches faults before they propagate to the rest of the system
Store Queue Comparator (cont'd) • Compares address & data of matching stores (e.g., st 5 [0x120]) before release to the data cache • Extends residence time of leading-thread stores • Size constrained by cycle-time goal • Base CPU statically partitions a single queue among threads • Potential solution: per-thread store queues • Deadlock possible if the matching trailing store cannot commit; several small but crucial changes avoid this
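A sketch of the comparator's core behavior (hypothetical simplification): a leading-thread store waits in the queue until the trailing thread's matching store arrives, and only a matching address/data pair is released to the data cache.

```python
from collections import deque

# Sketch of the store queue comparator (hypothetical simplification):
# leading-thread stores are held until the trailing thread's matching
# store arrives; address and data are compared, and a mismatch signals
# a fault before anything leaves the sphere of replication.

class StoreQueueComparator:
    def __init__(self):
        self.pending = deque()   # leading-thread stores awaiting a match
        self.cache = {}          # stands in for the data cache

    def leading_store(self, addr, data):
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (lead_addr, lead_data) != (addr, data):
            raise RuntimeError("transient fault detected before cache update")
        self.cache[addr] = data  # fault-free: store exits the sphere

stq = StoreQueueComparator()
stq.leading_store(0x120, 5)
stq.trailing_store(0x120, 5)   # addresses and data match: committed
```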
Branch Outcome Queue (BOQ) • Forwards leading-thread branch targets to trailing-thread fetch • 100% prediction accuracy in the absence of faults
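The BOQ idea in miniature (a hypothetical sketch): the leading thread pushes each resolved branch target, and the trailing thread's fetch pops them in order instead of consulting a branch predictor, so fault-free execution never mispredicts.

```python
from collections import deque

# Sketch of the Branch Outcome Queue (hypothetical simplification):
# resolved leading-thread branch targets become perfect "predictions"
# for the trailing thread's fetch stage.

boq = deque()

def leading_branch_resolve(target):
    boq.append(target)       # committed branch outcome from leading thread

def trailing_fetch_predict():
    return boq.popleft()     # "prediction" is the leading thread's outcome

leading_branch_resolve(0x280)
target = trailing_fetch_predict()   # trailing thread fetches from 0x280
```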
Line Prediction Queue (LPQ) • Alpha 21464 fetches chunks using line predictions • Chunk = contiguous block of 8 instructions
Line Prediction Queue (cont'd) • Generate a stream of "chunked" line predictions; a chunk ends at the end of a cache line or at a taken branch • Every leading-thread instruction carries its I-cache coordinates • Commit logic merges these into fetch chunks for the LPQ, independent of leading-thread fetch chunks • Commit-to-fetch dependence raised deadlock issues (Figure: instruction stream split into two chunks, one ending at a cache-line boundary, the next at a taken branch)
Line Prediction Queue (cont'd) • Read-out on trailing-thread fetch is also complex • Base CPU "thread chooser" receives multiple line predictions but ignores all but one • Fetches must be retried on an I-cache miss • Tricky to keep the queue in sync with thread progress: add a handshake to advance the queue head, roll back the head on an I-cache miss, and track both the last attempted & last successful chunks
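The head-advance handshake described above can be sketched like this (hypothetical simplification of the sync logic): the head advances only when a fetch succeeds, and a miss leaves it in place so the same chunk is retried.

```python
# Sketch of keeping the Line Prediction Queue in sync with trailing-thread
# fetch (hypothetical simplification): the queue head advances via a
# handshake only after a successful fetch; an I-cache miss leaves the head
# at the last successful chunk so the attempted chunk is retried.

class LinePredictionQueue:
    def __init__(self, predictions):
        self.predictions = list(predictions)
        self.head = 0                    # index of the next chunk to attempt

    def fetch(self, icache_hit):
        pred = self.predictions[self.head]   # attempted chunk
        if icache_hit:
            self.head += 1               # handshake: advance past success
            return pred
        return None                      # miss: head unchanged, retry later

lpq = LinePredictionQueue([0x1F8, 0x200])
first = lpq.fetch(icache_hit=False)      # miss: no advance, retry same chunk
retry = lpq.fetch(icache_hit=True)       # retried chunk now succeeds
second = lpq.fetch(icache_hit=True)
```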
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
Preferential Space Redundancy • SRT combines two types of redundancy • Time: same physical resource, different time • Space: different physical resource • Space redundancy preferable • Better coverage of permanent/long-duration faults • Bias towards space redundancy where possible
PSR Example: Clustered Execution • Base CPU has two execution clusters (IQ 0/Exec 0 and IQ 1/Exec 1) • Separate instruction queues and function units • Instructions are steered to a cluster in the dispatch stage
PSR Example: Clustered Execution (cont'd) • Leading-thread instructions record their cluster (e.g., add r1,r2,r3 [0]) • The cluster bit is carried with the fetch chunk through the LPQ and attached to the trailing-thread instruction • Dispatch sends the trailing instruction to the opposite cluster if possible
PSR Example: Clustered Execution (cont'd) • 99.94% of instruction pairs use different clusters • Full spatial redundancy for execution • No performance impact (occasional slight gain)
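The steering policy above can be sketched in a few lines (hypothetical simplification of the dispatch logic): prefer the cluster the leading instruction did not use, falling back to the same cluster when the preferred one is full.

```python
# Sketch of preferential space redundancy at dispatch (hypothetical
# simplification): steer the trailing instruction to the opposite
# execution cluster when a slot is free (space redundancy), otherwise
# fall back to the same cluster (time redundancy only).

def steer_trailing(leading_cluster, free_slots):
    preferred = 1 - leading_cluster     # opposite of the two clusters (0/1)
    if free_slots[preferred] > 0:
        return preferred                # space redundancy achieved
    return leading_cluster              # fall back: same physical resource

choice_space = steer_trailing(0, {0: 4, 1: 2})  # opposite cluster free
choice_time = steer_trailing(0, {0: 4, 1: 0})   # forced onto same cluster
```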
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
SRT Evaluation • Used SPEC CPU95, 15M instrs/thread • Constrained by simulation environment • 120M instrs for 4 redundant thread pairs • Eight-issue, four-context SMT CPU • 128-entry instruction queue • 64-entry load and store queues • Default: statically partitioned among active threads • 22-stage pipeline • 64KB 2-way assoc. L1 caches • 3 MB 8-way assoc L2
SRT Performance: One Thread • One logical thread runs on two hardware contexts • Performance degradation = 30% • Per-thread store queue buys an extra 4%
SRT Performance: Two Threads • Two logical threads run on four hardware contexts • Average slowdown increases to 40% • Only 32% with per-thread store queues
Outline • SRT concepts & design • Preferential Space Redundancy • SRT Performance Analysis • Single- & multi-threaded workloads • Chip-level Redundant Threading (CRT) • Concept • Performance analysis • Summary • Current & Future Work
Chip-Level Redundant Threading • SRT typically more efficient than splitting one processor into two half-size CPUs • What if you already have two CPUs? • IBM Power4, HP PA-8800 (Mako) • Conceptually easy to run these in lockstep • Benefit: full physical redundancy • Costs: latency through centralized checker logic; overheads (misspeculation etc.) incurred twice • CRT combines the best of SRT & lockstepping • Requires multithreaded CMP cores
Chip-Level Redundant Threading (cont'd) • Leading thread A on CPU A is paired with trailing thread A on CPU B, and leading thread B on CPU B with trailing thread B on CPU A • Load values, line predictions, and stores cross between the CPUs through the LVQ, LPQ, and store comparison paths
CRT Performance • With per-thread store queues, ~13% improvement over lockstepping with 8-cycle checker latency
Summary & Conclusions • SRT is applicable in a real-world SMT design • ~30% slowdown, slightly worse with two threads • Store queue capacity can limit performance • Preferential space redundancy improves coverage • Chip-level Redundant Threading = SRT for CMPs • Looser synchronization than lockstepping • Frees up resources for other application threads
More Information • Publications • S. K. Reinhardt & S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000 • S. S. Mukherjee, M. Kontz, & S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002 • Papers available from: • http://www.cs.wisc.edu/~shubu • http://www.eecs.umich.edu/~stever • Patents • Compaq/HP filed eight patent applications on SRT