CS 7810 Lecture 25

DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
T. Austin
Proceedings of MICRO-32, November 1999

Redundancy
• If a processor's output is error-prone, reliability can be provided with redundancy
• One checker can detect errors. For recovery, we may need another checker or some other form of redundancy
[Figure: Input Program, Primary Core, Checker Core(s), Verify & Commit]

Why Redundancy?
• Soft Errors: A high-energy particle can strike a device and deposit enough charge to flip the value
  – Cosmic rays
  – Alpha particles
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]

Why Redundancy?
• Soft Errors: voltage spikes or noise
  – Crosstalk
  – di/dt
  – Lower voltages
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]

Why Redundancy?
• Allows unverified or aggressively clocked primary cores
  – Functionally incorrect core: some corner case slips through
  – Electrically incorrect core: high temperature causes a circuit to not meet the timing constraint
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]

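DIVA Microarchitecture
[Figure: primary pipeline with BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$, followed by the checker and Arch Regs]
• Example instruction: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12
• Storage Check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
• ALU Check: add 4+8 and confirm it equals 12
• If both checks succeed, write 12 into LR15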
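The two checks on this slide can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the DIVA design itself: the `CommittedInstr` record, the `OPS` table, and `check_and_commit` are hypothetical names, and the architected register file is modelled as a plain dictionary.

```python
from dataclasses import dataclass

@dataclass
class CommittedInstr:
    """What the primary core hands to the checker at commit (hypothetical record)."""
    op: str
    src1: int    # logical source register numbers
    src2: int
    dst: int     # logical destination register number
    val1: int    # operand values the primary core used
    val2: int
    result: int  # result the primary core computed

# Tiny ALU model for the checker (assumed operation set)
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def check_and_commit(instr, arch_regs):
    """Run the storage check and the ALU check; commit only if both pass."""
    # Storage check: operands must match the architected register values
    storage_ok = (arch_regs[instr.src1] == instr.val1
                  and arch_regs[instr.src2] == instr.val2)
    # ALU check: recompute the operation on the operands the primary core used
    alu_ok = OPS[instr.op](instr.val1, instr.val2) == instr.result
    if storage_ok and alu_ok:
        arch_regs[instr.dst] = instr.result
        return True
    return False  # error detected, architected state left untouched

arch_regs = {3: 4, 7: 8, 15: 0}
# LR3 + LR7 -> LR15 with values 4 + 8 = 12
print(check_and_commit(CommittedInstr("add", 3, 7, 15, 4, 8, 12), arch_regs))  # True
```

A wrong result (say 13 instead of 12) fails the ALU check, and a wrong operand value fails the storage check; in both cases LR15 is not written.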

Microarchitecture Details
• Instructions are fed to checker in order during commit
• The logic and storage checks detect errors in ALUs and datapath
• The checker core is a simple in-order pipeline – easy to design and verify
• An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a rename/decode stage to the checker
• In-order core has no stalls (need bypass for register file) – no data dependences, cache misses, branch mispredicts
• Contention for register file and data cache can degrade primary thread

Recovery
• The architected register file and data cache are ECC protected – when an error is detected, it is assumed that checker and architected state are correct
• Primary core is re-started from faulting instruction
• A fault in the primary core may result in deadlock: e.g. instruction that produces R5 is waiting for R5 to be produced (instead of R4)
• A timeout in the checker signals an error
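The recovery policy above, including the watchdog timeout for a deadlocked primary core, can be sketched as a small event loop. Everything here is an assumption made for illustration: the `run_checker` function, the per-cycle `commit_stream` encoding, and the `TIMEOUT_CYCLES` value are hypothetical and not taken from the paper.

```python
TIMEOUT_CYCLES = 4  # hypothetical watchdog threshold

def run_checker(commit_stream, timeout=TIMEOUT_CYCLES):
    """Model the checker's view of commit, one entry per cycle.

    Each entry is None (primary core presented nothing this cycle) or a
    tuple (pc, ok), where ok says whether the checks passed.
    Returns a list of events describing commits and restarts.
    """
    events = []
    idle = 0      # consecutive cycles with nothing to check
    next_pc = 0   # next instruction the architected state expects
    for item in commit_stream:
        if item is None:
            idle += 1
            if idle >= timeout:
                # Possible deadlock in the primary core: treat as an error
                events.append(("restart", next_pc, "timeout"))
                idle = 0
            continue
        idle = 0
        pc, ok = item
        if ok:
            events.append(("commit", pc))
            next_pc = pc + 1
        else:
            # Architected state is trusted; restart from the faulting instruction
            events.append(("restart", pc, "check failed"))
            next_pc = pc
    return events

stream = [(0, True), (1, False), (1, True), None, None, None, None, (2, True)]
for event in run_checker(stream):
    print(event)
```

In this trace, instruction 1 fails a check and is re-executed, and the four idle cycles that follow trip the timeout, so the primary core is restarted at instruction 2.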

Redundant Multi-Threading
• Execute two threads in parallel (CMP or SMT) – each thread maintains its own register state
• Threads execute as in a conventional processor, except
  – trailing thread commits after verifying result
  – leading thread commits stores to a buffer – these get written to cache/memory only after verification
  – load values of the leading thread are sent to trailing thread, so trailing thread never accesses data cache
  – branch outcomes are also sent to trailing thread
[Figure: Leading Thread and Trailing Thread; labels: reg results, load values, branch outcomes; store values]
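The leading/trailing protocol can be illustrated with a toy interpreter. This is a sketch under stated assumptions: the three-instruction ISA (`load`, `addi`, `store`), the function names, and the `fault_at` injection hook are all hypothetical. For brevity only store values are compared, and branch outcomes and register-result forwarding are left out.

```python
from collections import deque

def run_leading(program, memory, fault_at=None):
    """Leading thread: reads memory, forwards load values, buffers stores."""
    regs = {}
    load_values = deque()   # load values sent to the trailing thread
    store_buffer = deque()  # stores held back until verified
    for i, (op, *args) in enumerate(program):
        if op == "load":
            rd, addr = args
            regs[rd] = memory[addr]
            load_values.append(regs[rd])
        elif op == "addi":
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
            if i == fault_at:
                regs[rd] ^= 1  # inject a single-bit error (illustration only)
        elif op == "store":
            rs, addr = args
            store_buffer.append((addr, regs[rs]))
    return load_values, store_buffer

def run_trailing(program, load_values, store_buffer, memory):
    """Trailing thread: re-executes without touching memory for loads,
    then releases each store only if it matches the leading thread's."""
    regs = {}
    for op, *args in program:
        if op == "load":
            rd, _addr = args
            regs[rd] = load_values.popleft()  # never accesses the data cache
        elif op == "addi":
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "store":
            rs, addr = args
            if store_buffer.popleft() != (addr, regs[rs]):
                return False  # mismatch: error detected, store not released
            memory[addr] = regs[rs]  # verified store reaches memory
    return True

program = [("load", "r1", 0), ("addi", "r2", "r1", 5), ("store", "r2", 1)]
memory = {0: 10, 1: 0}
loads, stores = run_leading(program, memory)
print(run_trailing(program, loads, stores, memory), memory[1])  # True 15
```

Calling `run_leading` with `fault_at=1` corrupts the add in the leading thread; the trailing thread then sees a store mismatch and memory location 1 is never written.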

Fault Model
• A single error in either core can be detected
• Since loads are not replicated, the load/store datapath must be ECC protected
• For recovery, a second checker thread is required
• ECC in the checker register file will enable recovery in most cases without a second checker

RMT on SMT/CMP
+ SMT does not require inter-core traffic – values can be read from shared register file/data cache
– Single thread performance may be degraded
– Each redundant instruction executes on high-power pipeline
+ Trailing CMP core can be a simple in-order processor → low power/area overheads
+ Trailing core's frequency can be independently controlled
+ Heterogeneous CMP where cores can be dynamically employed for throughput/reliability
+ Lower probability for errors

Parallelization of Trailing Thread
[Figure: Sequential Thread; Parallel Threads 1 to 4]
• Is it more power-efficient to execute the verification thread in parallel?

Parallelization of Trailing Thread
[Figure: Sequential Thread; Parallel Threads 1 to 4]
• If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases
• If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases
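The two claims can be checked against the usual first-order CMOS model, where dynamic power is proportional to C·V²·f. The sketch below is a rough illustration with assumed numbers: the `trailing_power` function, the normalized units, the per-core leakage of 0.2, and the choice to scale leakage linearly with voltage are all assumptions, not values from the lecture.

```python
def trailing_power(n_cores, v_scale=1.0, c=1.0, v=1.0, f=1.0, leak_per_core=0.2):
    """Return (dynamic, leakage) power for verification spread over n_cores.

    Each core runs at f / n_cores so total throughput matches one core at f.
    Dynamic power per core is modelled as C * V^2 * f (first-order model).
    Leakage per core is assumed proportional to supply voltage.
    """
    volt = v * v_scale
    freq = f / n_cores
    dynamic = n_cores * c * volt ** 2 * freq
    leakage = n_cores * leak_per_core * v_scale
    return dynamic, leakage

print(trailing_power(1))               # one sequential trailing core
print(trailing_power(4))               # four cores, frequency-scaled only
print(trailing_power(4, v_scale=0.8))  # four cores, frequency-and-voltage scaled
```

With frequency scaling alone, the four cores together burn the same dynamic power as one core but four times the leakage. Lowering the voltage as well cuts dynamic power by the square of the voltage ratio, while total leakage in this model still exceeds the single-core case because four cores are powered on.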

Error Types

Acronyms!!
• MTTF & MTBF: Mean time to/between failures
• Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors)
• Many errors get masked:
  – ACE bits: these bits are required for architecturally correct execution
  – un-ACE bits: these bits do not affect the final output
• AVF: architectural vulnerability factor (the percentage of time/space that a structure holds ACE state)
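As a worked example of the AVF definition, the sketch below averages the fraction of a structure's bits that hold ACE state over a window of cycles, then derates a raw error rate by that factor. The function names and all numbers are made up for illustration; the MTTF step assumes a constant failure rate and uses the standard FIT unit (failures per 10^9 device-hours).

```python
def avf(ace_bits_per_cycle, total_bits):
    """Fraction of bit-cycles in which the structure holds ACE state."""
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)

def mttf_hours(raw_fit, avf_value):
    """MTTF in hours for a raw FIT rate derated by AVF (constant-rate assumption)."""
    effective_fit = raw_fit * avf_value
    return 1e9 / effective_fit

# Hypothetical 64-bit structure observed for 4 cycles
structure_avf = avf([16, 32, 0, 16], total_bits=64)
print(structure_avf)                    # 0.25
print(mttf_hours(1000, structure_avf))  # 4000000.0
```

Here only a quarter of the bit-cycles are ACE, so three quarters of the raw strikes are masked and the effective error rate drops from 1000 FIT to 250 FIT.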

Partial Coverage
• RMT covers faults in the entire core (almost!)
• If that is too expensive, provide error coverage in specific structures to reduce error probabilities
• Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?
