Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor 31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004. Shubu Mukherjee 1 Christopher Weaver 1 , Joel Emer 1 , Steve Reinhardt 1,2
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor. 31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004. Shubu Mukherjee (1), Christopher Weaver (1), Joel Emer (1), Steve Reinhardt (1, 2). (1) Massachusetts Microprocessor Design Center, Intel. (2) University of Michigan, Ann Arbor.
Outline • Trade off performance for lower soft error rate • MITF (mean instructions to failure) • reduce errors by keeping objects longer in protected memory • False Detected Unrecoverable Errors • processor would unnecessarily crash on such an error • techniques to avoid false errors • π (possibly incorrect) bit • anti-π bit
Alpha or Neutron Particle Strike Changes State of a Single Bit [Figure: a stored bit flips between 0 and 1]
Silent Data Corruption (SDC)
• Bit read?
  • no: benign fault, no error
  • yes: Bit has error protection?
    • detection & correction, or detection only: the error is not silent
    • no: affects program outcome?
      • yes: SDC
      • no: benign fault, no error
SDC Definitions • SDC = Silent Data Corruption • MTTF = Mean Time to Failure • SDC MTTF = mean time between two SDC events • Chip SDC Rate (inversely proportional to SDC MTTF) = rate of occurrence of SDC events = Σ over all bits [ (Circuit Soft Error Rate) X (SDC AVF) ] • Target market will typically set SDC budget • note: budget is non-zero • Circuit Soft Error Rate • determined by alpha or neutron flux, circuit parameters, etc. • AVF (Architectural Vulnerability Factor), Mukherjee, et al., MICRO '03 • fraction of strikes that affect program outcome • AVF = 0% for branch predictor • AVF = 100% for program counter • AVF < 100% for instruction queue
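The sum over all bits can be sketched numerically. This is a minimal illustration, not the authors' tooling: the `Structure` type, the bit counts, and the per-bit circuit soft error rate are hypothetical placeholders, and the sketch assumes every bit within a structure shares one error rate and one AVF. The 0% and 100% AVF values come from the slide; the 29% instruction-queue value is the SDC AVF quoted later in the talk.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str
    bits: int           # number of storage bits (hypothetical)
    ser_per_bit: float  # circuit soft error rate per bit, errors per unit time (hypothetical)
    sdc_avf: float      # SDC AVF as a fraction in [0, 1]


def chip_sdc_rate(structures):
    """Sum (circuit soft error rate) x (SDC AVF) over all bits."""
    return sum(s.bits * s.ser_per_bit * s.sdc_avf for s in structures)


def sdc_mttf(structures):
    """SDC MTTF is the inverse of the chip SDC rate (same time unit as the rate)."""
    rate = chip_sdc_rate(structures)
    return float("inf") if rate == 0 else 1.0 / rate


chip = [
    Structure("branch predictor", bits=4096, ser_per_bit=1e-3, sdc_avf=0.0),
    Structure("program counter", bits=64, ser_per_bit=1e-3, sdc_avf=1.0),
    Structure("instruction queue", bits=1000, ser_per_bit=1e-3, sdc_avf=0.29),
]
```

With these placeholder numbers the branch predictor contributes nothing to the chip SDC rate despite being the largest structure, because its AVF is zero.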
Instruction Queue's SDC AVF [chart]. Methodology similar to Mukherjee, et al., MICRO '03: CPU2000, Asim, Simpoint, Itanium®2-like.
SDC Reduction Techniques • Chip SDC Rate = Σ over all bits [ (Circuit Soft Error Rate) X (SDC AVF) ] • Conventional techniques • process technology (e.g., fully-depleted SOI) • circuit technology (e.g., radiation-hardened cells) • error detection or correction codes (e.g., parity, ECC) • Our new technique • reduce exposure to radiation to reduce SDC AVF • trade off between performance and soft error rate
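MITF = mean instructions to failure (work between two errors)

\begin{align*}
\text{MITF} &= \frac{\#\,\text{instructions committed}}{\#\,\text{errors encountered}} \\
&= \frac{\text{IPC} \times (\#\,\text{cycles})}{\#\,\text{errors encountered}} \\
&= \frac{\text{IPC} \times \text{Total time} \times \text{frequency}}{\#\,\text{errors encountered}} \\
&= \text{IPC} \times \text{MTTF} \times \text{frequency} \\
&= \frac{\text{IPC} \times \text{frequency}}{(\text{Circuit Soft Error Rate}) \times \text{AVF}} \\
&= \frac{\text{frequency}}{\text{Circuit Soft Error Rate}} \times \frac{\text{IPC}}{\text{AVF}}
\end{align*}

At a fixed frequency and circuit soft error rate, MITF is proportional to IPC / AVF.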
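The MITF relation can be checked with a small numeric sketch. All values below are hypothetical design points chosen for illustration, and `circuit_ser` is assumed to be expressed in raw errors per second so that MTTF comes out in seconds.

```python
def mttf(circuit_ser, avf):
    """Mean time to failure in seconds, given raw errors per second and AVF."""
    return 1.0 / (circuit_ser * avf)


def mitf(ipc, frequency_hz, circuit_ser, avf):
    """MITF = IPC x frequency x MTTF = (frequency / circuit SER) x (IPC / AVF)."""
    return ipc * frequency_hz * mttf(circuit_ser, avf)


# Hypothetical design points: the second gives up 10% of IPC to cut AVF by 30%.
baseline = mitf(ipc=1.0, frequency_hz=2e9, circuit_ser=1e-9, avf=0.30)
reduced = mitf(ipc=0.9, frequency_hz=2e9, circuit_ser=1e-9, avf=0.21)

# Frequency and circuit SER cancel, leaving the ratio of IPC / AVF (about 1.29).
improvement = reduced / baseline
```

Because frequency and circuit soft error rate cancel in the ratio, a design change is a net win whenever it raises IPC / AVF, even if IPC itself drops.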
Reducing SDC of an Instruction Queue (IQ) (assume protected instruction cache) [Figure: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit] • Increase IPC: fetch aggressively from IC to IQ • Reduce SDC AVF: prevent instructions from sitting needlessly in IQ • Net benefit if we improve MITF (proportional to IPC / AVF)
Squash Instructions • Goal • don’t have instructions sit needlessly in the Instruction Queue • Algorithm to Reduce Exposure to Radiation • Trigger: Cache Miss • Action: Squash all instructions in instruction queue following the Load Miss • Evaluation using • Asim Performance Model Framework • First 100 million instruction simpoint of all CPU2000 benchmarks • Itanium®2-like architecture, but scaled (note: in-order machine)
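The trigger and action above can be sketched as a single queue operation. This is an illustrative model only: the function name and the list-based queue are hypothetical, and the sketch assumes the squashed instructions are later refetched from the protected instruction cache, which the slides imply but do not spell out.

```python
def squash_after_load_miss(iq, load_index):
    """Return (kept, squashed) for an instruction queue held in program order.

    iq         -- list of instructions, oldest first
    load_index -- position in iq of the load that missed in the cache
    """
    if not 0 <= load_index < len(iq):
        raise IndexError("load_index is outside the instruction queue")
    kept = iq[: load_index + 1]      # the load and everything older stay
    squashed = iq[load_index + 1:]   # everything younger leaves the queue
    return kept, squashed


iq = ["add r1", "load r2", "mul r3", "sub r4"]
kept, squashed = squash_after_load_miss(iq, load_index=1)
# kept     == ["add r1", "load r2"]
# squashed == ["mul r3", "sub r4"]
```

The squashed instructions no longer sit exposed in the queue while the miss is outstanding, which lowers AVF at some cost in IPC.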
SDC MITF Improvement from Reducing Exposure [chart: IPC, SDC AVF, and SDC MITF]
Outline • Trade off performance for lower soft error rate • MITF (mean instructions to failure) • reduce errors by keeping objects longer in protected memory • False Detected Unrecoverable Errors • processor would unnecessarily crash on such an error • techniques to avoid false errors • π (possibly incorrect) bit • anti-π bit
Detected Unrecoverable Error (DUE)
• Bit read?
  • no: benign fault, no error
  • yes: Bit has error protection?
    • detection & correction: no error
    • detection only: affects program outcome?
      • yes: True DUE
      • no: False DUE
    • no: affects program outcome?
      • yes: SDC
      • no: benign fault, no error
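The decision tree on this slide maps directly onto a small classifier. The sketch below is only an illustration; the enum and function names are hypothetical, and the outcome labels follow the slide.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"
    DETECTION_ONLY = "detection only"
    DETECTION_AND_CORRECTION = "detection & correction"


class Outcome(Enum):
    BENIGN = "benign fault, no error"
    CORRECTED = "no error"
    SDC = "SDC"
    TRUE_DUE = "True DUE"
    FALSE_DUE = "False DUE"


def classify_strike(bit_read, protection, affects_outcome):
    """Classify a single-bit fault by walking the flowchart top to bottom."""
    if not bit_read:
        return Outcome.BENIGN
    if protection is Protection.DETECTION_AND_CORRECTION:
        return Outcome.CORRECTED
    if protection is Protection.DETECTION_ONLY:
        # Detection alone flags the error whether or not it would have mattered.
        return Outcome.TRUE_DUE if affects_outcome else Outcome.FALSE_DUE
    return Outcome.SDC if affects_outcome else Outcome.BENIGN
```

Note how the same fault that would be benign without protection becomes a False DUE once the bit has detection only.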
DUE Definitions • DUE = Detected Unrecoverable Error • MTTF = Mean Time to Failure • DUE MTTF = mean time between two DUE events • Chip DUE Rate (inversely proportional to DUE MTTF) = rate of occurrence of DUE events = Σ over all bits [ (Circuit Soft Error Rate) X (DUE AVF) ] • Target market will typically set DUE budget • note: budget is non-zero • Circuit Soft Error Rate • determined by alpha or neutron flux, circuit parameters, etc. • DUE AVF (Architectural Vulnerability Factor) • fraction of strikes that result in DUE events • Total DUE AVF = (True DUE AVF) + (False DUE AVF)
DUE AVF of Instruction Queue with Parity [chart; CPU2000, Asim, Simpoint, Itanium®2-like] • False DUE AVF: 33%
Total Soft Error Rate • Total Soft Error Rate = Σ over all bits [ (SDC Rate) + (DUE Rate) ] • Parity converts SDC to DUE • True DUE AVF (with error detection) = SDC AVF (without detection) • Parity also introduces False DUE • e.g., error flagged on wrong-path or dynamically dead instruction • Parity-protecting a bit increases overall observed soft error rate • Example: instruction queue • SDC AVF (without error detection) = 29% • DUE AVF (with error detection) = 62% • True DUE AVF = 29% • False DUE AVF = 33% • Idle & miscellaneous = 38%
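The instruction-queue example is simple arithmetic, sketched here so the breakdown is easy to verify. The percentages are the ones on the slide; the variable names are illustrative.

```python
# AVF values in percent, taken from the instruction-queue example.
sdc_avf_without_detection = 29
false_due_avf = 33

# Parity converts every would-be SDC into a True DUE.
true_due_avf = sdc_avf_without_detection

# Total DUE AVF = True DUE AVF + False DUE AVF.
total_due_avf = true_due_avf + false_due_avf

# The remainder of the bit-cycles is idle or miscellaneous state.
idle_and_misc = 100 - total_due_avf

# Ratio of observed error AVF with parity to observed error AVF without it.
increase = total_due_avf / sdc_avf_without_detection
```

At the same circuit soft error rate, adding parity therefore more than doubles the observed error rate of this structure (62% versus 29%), although the errors are now detected rather than silent.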
Reducing DUE • Chip DUE Rate = Σ over all bits [ (Circuit Soft Error Rate) X (DUE AVF) ] • DUE AVF = (True DUE AVF) + (False DUE AVF) • Techniques • convert back to SDC • process technology (e.g., fully-depleted SOI) • circuit technology (e.g., radiation-hardened cells) • error recovery techniques (e.g., ECC) • Our new techniques • exposure reduction techniques (first part of this talk) • False DUE AVF reduction
Sources of False DUE in an Instruction Queue • Instructions with uncommitted results • e.g., wrong-path, predicated-false • solution: π (possibly incorrect) bit till commit • Instruction types neutral to errors • e.g., no-ops, prefetches, branch predict hints • solution: anti-π bit • Dynamically dead instructions • instructions whose results will not be used in future • solution: π bit beyond commit
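Coping with Wrong-Path Instructions (assume parity-protected instruction queue) [Figure: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, and Data Cache; one instruction in the IQ is marked with an X; label: DECLARE ERROR ON ISSUE] • Problem: not enough information at issue to tell whether the instruction is on the wrong path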
The π (Possibly Incorrect) Bit (assume parity-protected instruction queue) [Figure: same pipeline; instructions carry a π bit, shown as inst (π); label: POST ERROR IN π BIT ON ISSUE] • At commit point, declare error only if not wrong-path instruction and π bit is set
Anti-π bit: coping with No-ops (assume parity-protected instruction queue) [Figure: same pipeline; some instructions carry an anti-π bit, shown as inst (anti-π)] • anti-π bit neutralizes the π bit • On issue, if the anti-π bit is set, then do not set the π bit
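The issue-time and commit-time rules for the π and anti-π bits can be sketched as follows. This is a simplified software model, not the hardware design: the `IQEntry` fields, the opcode names in `NEUTRAL_OPCODES`, and the function names are hypothetical, and the sketch ignores which bits of an entry were struck.

```python
from dataclasses import dataclass

# Instruction types neutral to errors (illustrative opcode names).
NEUTRAL_OPCODES = {"nop", "prefetch", "branch_hint"}


@dataclass
class IQEntry:
    opcode: str
    anti_pi: bool = False       # set at decode for neutral instruction types
    parity_error: bool = False  # parity mismatch seen when the entry is read
    pi: bool = False            # "possibly incorrect" bit carried to commit
    wrong_path: bool = False    # only known by the time of commit


def decode(opcode):
    """Create a queue entry, marking neutral instruction types with anti-pi."""
    return IQEntry(opcode=opcode, anti_pi=opcode in NEUTRAL_OPCODES)


def issue(entry):
    """On issue, post a parity error in the pi bit unless anti-pi is set."""
    if entry.parity_error and not entry.anti_pi:
        entry.pi = True


def commit(entry):
    """Return True if a DUE must be declared for this entry at commit."""
    return entry.pi and not entry.wrong_path
```

A struck no-op never sets its π bit, and a struck wrong-path instruction carries a set π bit that is simply ignored at commit, so neither raises a False DUE.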
π bit: avoiding False DUE on Dynamically Dead Instructions [Figure: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, and Data Cache; sequences of write R1, write R1 (π), and read R1 instructions] • Declare the error on reading R1, if π bit is set • If R1 isn't read (i.e., dynamically dead), then no False DUE • π bit can be used in caches & main memory …
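Carrying the π bit into the register file defers the error to the first read of the register. The sketch below models that behavior under simple assumptions: the `RegisterFile` class and the exception name are hypothetical, and each register holds exactly one π bit that is replaced on every write.

```python
class DetectedUnrecoverableError(Exception):
    """Raised when a value marked possibly incorrect is actually consumed."""


class RegisterFile:
    def __init__(self, num_regs):
        self.values = [0] * num_regs
        self.pi = [False] * num_regs  # one pi bit per register

    def write(self, reg, value, pi=False):
        """A write replaces both the value and its pi bit."""
        self.values[reg] = value
        self.pi[reg] = pi

    def read(self, reg):
        """Declare the error only when a pi-marked register is read."""
        if self.pi[reg]:
            raise DetectedUnrecoverableError(f"R{reg} read with pi bit set")
        return self.values[reg]


rf = RegisterFile(4)
rf.write(1, 7, pi=True)  # result of an instruction that saw a parity error
rf.write(1, 8)           # overwritten before any read: the first write was dead
value = rf.read(1)       # returns 8 and raises nothing
```

Because the first write to R1 is never read, its π bit is discarded along with its value and no False DUE is declared.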
Scope of the π Bit • π bit allows declaring an error on use of a value or object • rather than when the error is detected • e.g., declare error on register read, rather than when it was detected • π bit goes out of scope • when error information cannot be propagated • e.g., store writes data into cache without π bits • typically, raise error when π bit goes out of scope • Design points: increasing levels of π bit protection • π bit till register commit • π bit till register read • π bit till store commit • π bit till I/O commit
% False DUE AVF Eliminated (PI = π) [chart; CPU2000, Asim, Simpoint, Itanium®2-like] • Practical to eliminate most of the False DUE AVF
Summary • Trade off performance for lower soft error rate • MITF (mean instructions to failure), proportional to IPC / AVF • reduce errors by keeping objects longer in protected memory • False Detected Unrecoverable Errors • processor would unnecessarily crash on such an error • techniques to avoid false errors • π (possibly incorrect) bit • anti-π bit • PET (post-commit error tracking) buffer, see paper
% of False DUE Covered [chart] • Possible to eliminate most of the False DUE AVF