
Shubu Mukherjee 1 Christopher Weaver 1 , Joel Emer 1 , Steve Reinhardt 1,2 1 Massachusetts Microprocessor Design Center,

Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor. 31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004. Shubu Mukherjee 1, Christopher Weaver 1, Joel Emer 1, Steve Reinhardt 1,2





Presentation Transcript


  1. Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
     31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004
     Shubu Mukherjee (1), Christopher Weaver (1), Joel Emer (1), Steve Reinhardt (1,2)
     (1) Massachusetts Microprocessor Design Center, Intel
     (2) University of Michigan, Ann Arbor

  2. Outline
     • Trade-off performance for lower soft error rate
       • MITF (mean instructions to failure)
       • reduce errors by keeping objects longer in protected memory
     • False Detected Unrecoverable Errors
       • processor would unnecessarily crash on such an error
       • techniques to avoid false errors
         • π (possibly incorrect) bit
         • anti-π bit

  3. Alpha or Neutron Particle Strike Changes State of a Single Bit
     [Figure: a stored bit flips between 0 and 1]

  4. Silent Data Corruption (SDC)
     [Flowchart: outcome of a faulty bit]
     • Bit read? no: benign fault, no error
     • Bit read? yes: bit has error protection?
       • detection & correction: no error
       • detection only: detected, hence not SDC
       • no protection: affects program outcome? yes: SDC; no: benign fault, no error

  5. SDC Definitions
     • SDC = Silent Data Corruption
     • MTTF = Mean Time to Failure
       • SDC MTTF = time between two SDC events
     • Chip SDC Rate (inversely proportional to MTTF) = rate of occurrence of SDC events
       = Σ over all bits [ (Circuit Soft Error Rate) × (SDC AVF) ]
     • Target market will typically set SDC budget
       • note: budget is non-zero
     • Circuit Soft Error Rate
       • determined by alpha or neutron flux, circuit parameters, etc.
     • AVF (Architectural Vulnerability Factor), Mukherjee et al., MICRO '03
       • fraction of strikes that affect program outcome
       • AVF = 0% for branch predictor
       • AVF = 100% for program counter
       • AVF < 100% for instruction queue
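The chip-level sum on slide 5 can be made concrete with a small sketch. Everything numeric here is an assumption for illustration: the structure sizes, the per-bit circuit soft error rate, and the use of FIT (failures per 10^9 device-hours) as the rate unit do not come from the slides. Only the AVF values (0% for the branch predictor, 100% for the program counter, 29% for the instruction queue) are taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str
    bits: int            # hypothetical size, not from the slides
    fit_per_bit: float   # hypothetical circuit soft error rate per bit, in FIT
    sdc_avf: float       # fraction of strikes that affect program outcome


def chip_sdc_rate(structures):
    """Sum over all bits of (circuit soft error rate) x (SDC AVF)."""
    return sum(s.bits * s.fit_per_bit * s.sdc_avf for s in structures)


def mttf_hours(rate_fit):
    """The rate is inversely proportional to MTTF; 1 FIT = 1 failure per 1e9 hours."""
    return 1e9 / rate_fit


structures = [
    Structure("branch predictor", 4096, 0.001, 0.0),
    Structure("program counter", 64, 0.001, 1.0),
    Structure("instruction queue", 8192, 0.001, 0.29),
]

rate = chip_sdc_rate(structures)
print(f"chip SDC rate: {rate:.5f} FIT, SDC MTTF: {mttf_hours(rate):.3e} hours")
```

Note how the branch predictor contributes nothing to the sum regardless of its size, because its AVF is zero.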

  6. Instruction Queue’s SDC AVF
     [Figure: chart, similar to Mukherjee et al., MICRO '03; CPU2000, Asim, Simpoint, Itanium®2-like]

  7. SDC Reduction Techniques
     • Chip SDC Rate = Σ over all bits [ (Circuit Soft Error Rate) × (SDC AVF) ]
     • Conventional techniques
       • process technology (e.g., fully-depleted SOI)
       • circuit technology (e.g., radiation-hardened cells)
       • error detection or correction codes (e.g., parity, ECC)
     • Our new technique
       • reduce exposure to radiation to reduce SDC AVF
       • trade off between performance and soft error rate

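  8. MITF ∝ IPC / AVF
     MITF = mean instructions to failure (work between two errors)
     \[
     \begin{aligned}
     \text{MITF} &= \frac{\#\,\text{instructions committed}}{\#\,\text{errors encountered}} \\
     &= \frac{\text{IPC} \times (\#\,\text{cycles})}{\#\,\text{errors encountered}} \\
     &= \frac{\text{IPC} \times \text{Total time} \times \text{frequency}}{\#\,\text{errors encountered}} \\
     &= \text{IPC} \times \text{MTTF} \times \text{frequency} \\
     &= \frac{\text{IPC} \times \text{frequency}}{(\text{Circuit Soft Error Rate}) \times \text{AVF}} \\
     &= \frac{\text{IPC}}{\text{AVF}} \times \frac{\text{frequency}}{\text{Circuit Soft Error Rate}}
     \end{aligned}
     \]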
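As a quick numerical check of the MITF derivation, the sketch below computes MITF two ways: from the definition (IPC × MTTF × frequency) and from the closed form (frequency / circuit soft error rate) × (IPC / AVF). The function names and all input values are hypothetical; the only relationship assumed from the slides is MTTF = 1 / (circuit soft error rate × AVF).

```python
import math


def mttf_seconds(errors_per_second, avf):
    """MTTF = 1 / (circuit soft error rate x AVF)."""
    return 1.0 / (errors_per_second * avf)


def mitf(ipc, frequency_hz, errors_per_second, avf):
    """Closed form: (frequency / circuit soft error rate) x (IPC / AVF)."""
    return (frequency_hz / errors_per_second) * (ipc / avf)


# Hypothetical operating point, chosen only for illustration.
ipc, frequency_hz, errors_per_second, avf = 1.2, 2.0e9, 1.0e-9, 0.29

via_definition = ipc * mttf_seconds(errors_per_second, avf) * frequency_hz
via_closed_form = mitf(ipc, frequency_hz, errors_per_second, avf)

assert math.isclose(via_definition, via_closed_form, rel_tol=1e-12)
print(f"MITF = {via_closed_form:.3e} instructions")
```

At fixed frequency and circuit soft error rate, halving AVF doubles MITF, which is why the talk treats IPC / AVF as the figure of merit.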

  9. Reducing SDC of an Instruction Queue (IQ) (assume protected instruction cache)
     [Figure: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit]
     • Increase IPC: fetch aggressively from IC to IQ
     • Reduce SDC AVF: prevent instructions from sitting needlessly in IQ
     • Net benefit if we improve MITF (proportional to IPC / AVF)

  10. Squash Instructions
      • Goal
        • don’t have instructions sit needlessly in the Instruction Queue
      • Algorithm to Reduce Exposure to Radiation
        • Trigger: Cache Miss
        • Action: Squash all instructions in instruction queue following the Load Miss
      • Evaluation using
        • Asim Performance Model Framework
        • First 100 million instruction simpoint of all CPU2000 benchmarks
        • Itanium®2-like architecture, but scaled (note: in-order machine)
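The trigger/action pair on slide 10 can be sketched as a toy model. This is not the Asim implementation; the queue representation, the `seq` field, and the IPC and AVF numbers are all hypothetical. The sketch assumes squashed instructions are later refetched from the protected instruction cache, so they stop contributing exposure while the load miss is outstanding.

```python
from collections import deque


def squash_after_load_miss(iq, miss_seq):
    """Keep instructions up to and including the missing load; squash the rest.

    Returns the surviving queue and the sequence numbers to refetch later.
    """
    kept = deque(inst for inst in iq if inst["seq"] <= miss_seq)
    refetch = [inst["seq"] for inst in iq if inst["seq"] > miss_seq]
    return kept, refetch


def mitf_gain(ipc_base, avf_base, ipc_new, avf_new):
    """Relative MITF change, using MITF proportional to IPC / AVF."""
    return (ipc_new / avf_new) / (ipc_base / avf_base)


# Toy instruction queue: the load with seq 11 misses in the cache.
iq = deque({"seq": s, "op": "load" if s == 11 else "alu"} for s in range(10, 15))
kept, refetch = squash_after_load_miss(iq, miss_seq=11)
print([inst["seq"] for inst in kept], refetch)

# Hypothetical numbers: squashing costs 5% IPC but cuts AVF from 30% to 20%.
gain = mitf_gain(ipc_base=1.0, avf_base=0.30, ipc_new=0.95, avf_new=0.20)
print(f"MITF changes by a factor of {gain:.3f}")
```

The second half shows the trade-off the talk describes: a small IPC loss is acceptable whenever the AVF reduction is proportionally larger, because the ratio IPC / AVF still improves.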

  11. SDC MITF Improvement from Reducing Exposure
      [Figure: chart; SDC MITF ∝ IPC / SDC AVF]

  12. Outline
      • Trade-off performance for lower soft error rate
        • MITF (mean instructions to failure)
        • reduce errors by keeping objects longer in protected memory
      • False Detected Unrecoverable Errors
        • processor would unnecessarily crash on such an error
        • techniques to avoid false errors
          • π (possibly incorrect) bit
          • anti-π bit

  13. Detected Unrecoverable Error (DUE)
      [Flowchart: outcome of a faulty bit]
      • Bit read? no: benign fault, no error
      • Bit read? yes: bit has error protection?
        • detection & correction: no error
        • detection only: affects program outcome? yes: True DUE; no: False DUE
        • no protection: affects program outcome? yes: SDC; no: benign fault, no error

  14. DUE Definitions
      • DUE = Detected Unrecoverable Error
      • MTTF = Mean Time to Failure
        • DUE MTTF = time between two DUE events
      • Chip DUE Rate (inversely proportional to MTTF) = rate of occurrence of DUE events
        = Σ over all bits [ (Circuit Soft Error Rate) × (DUE AVF) ]
      • Target market will typically set DUE budget
        • note: budget is non-zero
      • Circuit Soft Error Rate
        • determined by alpha or neutron flux, circuit parameters, etc.
      • DUE AVF (Architectural Vulnerability Factor)
        • fraction of strikes that result in DUE events
        • Total DUE AVF = (True DUE AVF) + (False DUE AVF)

  15. DUE AVF of Instruction Queue with Parity
      [Figure: chart; CPU2000, Asim, Simpoint, Itanium®2-like; False DUE AVF = 33%]

  16. Total Soft Error Rate
      • Total Soft Error Rate = Σ over all bits [ (SDC Rate) + (DUE Rate) ]
      • Parity converts SDC to DUE
        • True DUE AVF (with error detection) = SDC AVF (without detection)
      • Parity also introduces False DUE
        • e.g., error flagged on wrong-path or dynamically dead instruction
      • Parity-protecting a bit increases overall observed soft error rate
      • Example: instruction queue
        • SDC AVF (without error detection) = 29%
        • DUE AVF (with error detection) = 62%
          • True DUE AVF = 29%
          • False DUE AVF = 33%
        • Idle & miscellaneous = 38%
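The instruction-queue example on slide 16 is simple arithmetic, worked through below with the slide's own percentages. The variable names are illustrative, and the final ratio is a derived figure that does not appear on the slide.

```python
# Values from the instruction-queue example (as fractions of all bit-cycles).
sdc_avf_without_detection = 0.29
false_due_avf = 0.33

# Parity converts every would-be SDC into a true DUE.
true_due_avf = sdc_avf_without_detection

# Total DUE AVF = True DUE AVF + False DUE AVF.
total_due_avf = true_due_avf + false_due_avf

# Whatever remains is idle or miscellaneous state.
idle_and_misc = 1.0 - total_due_avf

# Observed error rate with parity relative to the unprotected SDC rate.
increase = total_due_avf / sdc_avf_without_detection

print(f"total DUE AVF  = {total_due_avf:.0%}")
print(f"idle & misc    = {idle_and_misc:.0%}")
print(f"observed rate  = {increase:.2f}x the unprotected SDC rate")
```

This is the sense in which parity raises the observed soft error rate: the errors become visible, but a little more than half of the ones flagged are false.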

  17. Reducing DUE
      • Chip DUE Rate = Σ over all bits [ (Circuit Soft Error Rate) × (DUE AVF) ]
        • DUE AVF = (True DUE AVF) + (False DUE AVF)
      • Techniques
        • convert back to SDC
        • process technology (e.g., fully-depleted SOI)
        • circuit technology (e.g., radiation-hardened cells)
        • error recovery techniques (e.g., ECC)
      • Our new techniques
        • exposure reduction techniques (first part of this talk)
        • False DUE AVF reduction

  18. Sources of False DUE in an Instruction Queue
      • Instructions with uncommitted results
        • e.g., wrong-path, predicated-false
        • solution: π (possibly incorrect) bit till commit
      • Instruction types neutral to errors
        • e.g., no-ops, prefetches, branch predict hints
        • solution: anti-π bit
      • Dynamically dead instructions
        • instructions whose results will not be used in future
        • solution: π bit beyond commit

  19. Coping with Wrong-Path Instructions (assume parity-protected instruction queue)
      [Figure: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, Data Cache; an instruction in the IQ is struck (X); DECLARE ERROR ON ISSUE]
      • Problem: not enough information at issue

  20. The π (Possibly Incorrect) Bit (assume parity-protected instruction queue)
      [Figure: same pipeline; POST ERROR IN π BIT ON ISSUE; instructions tagged (π) travel down the pipeline]
      • At commit point, declare error only if not wrong-path instruction and π bit is set

  21. anti-π bit: coping with No-ops (assume parity-protected instruction queue)
      [Figure: same pipeline; some instructions in the IQ are tagged (anti-π)]
      • anti-π bit neutralizes the π bit
      • On issue, if the anti-π bit is set, then do not set the π bit
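The issue-time and commit-time rules for the π and anti-π bits can be put together in a minimal sketch. The `Inst` record and its field names are hypothetical, and a single `wrong_path` flag stands in for any instruction whose result is not committed; real hardware would carry these as bits alongside the instruction.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    parity_error: bool = False  # parity check failed when read out of the IQ
    anti_pi: bool = False       # set for error-neutral types (no-ops, prefetches, hints)
    wrong_path: bool = False    # only known by the time the instruction reaches commit
    pi: bool = False            # "possibly incorrect" bit


def issue(inst):
    """On issue, post the error in the pi bit unless the anti-pi bit is set."""
    if inst.parity_error and not inst.anti_pi:
        inst.pi = True
    return inst


def commit(inst):
    """At commit, declare a DUE only for a non-wrong-path instruction with pi set."""
    return inst.pi and not inst.wrong_path


cases = [
    Inst("add (correct path)", parity_error=True),
    Inst("add (wrong path)", parity_error=True, wrong_path=True),
    Inst("nop", parity_error=True, anti_pi=True),
    Inst("add (no strike)"),
]
for inst in cases:
    print(f"{inst.name:20s} DUE declared: {commit(issue(inst))}")
```

Only the first case raises an error; the wrong-path instruction and the no-op would have been False DUEs if the error had been declared at issue.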

  22. π bit: avoiding False DUE on Dynamically Dead Instructions
      [Figure: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, Data Cache; sequences of "write R1" and "read R1" instructions, some tagged (π)]
      • Declare the error on reading R1, if π bit is set
      • If R1 isn’t read (i.e., dynamically dead), then no False DUE
      • π bit can be used in caches & main memory …
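One way to picture the π bit beyond commit is a register file that stores a π bit next to each value. The class below is a hypothetical sketch, not the design from the paper: a write carries the writer's π bit into the register, a later write replaces it, and only a read of a register whose π bit is set declares the error.

```python
class DetectedUnrecoverableError(Exception):
    """Raised when a possibly incorrect value is actually consumed."""


class RegisterFile:
    def __init__(self, num_regs):
        self.values = [0] * num_regs
        self.pi = [False] * num_regs

    def write(self, reg, value, pi=False):
        # The writing instruction's pi bit travels with the value;
        # an overwrite discards the previous value and its pi bit.
        self.values[reg] = value
        self.pi[reg] = pi

    def read(self, reg):
        if self.pi[reg]:
            raise DetectedUnrecoverableError(f"R{reg} read with pi bit set")
        return self.values[reg]


rf = RegisterFile(4)

# Dynamically dead write: the struck result is overwritten before any read.
rf.write(1, 111, pi=True)
rf.write(1, 222)
print("dead write, then read:", rf.read(1))

# Live write: the struck result is read, so the error must be declared.
rf.write(1, 333, pi=True)
try:
    rf.read(1)
except DetectedUnrecoverableError as err:
    print("live write, then read:", err)
```

The first sequence produces no error at all, which is exactly the False DUE that declaring the error at commit would have caused.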

  23. Scope of the π Bit
      • π bit allows declaring an error on use of a value or object
        • rather than when the error is detected
        • e.g., declare error on register read, rather than when it was detected
      • π bit goes out of scope
        • when error information cannot be propagated
        • e.g., store writes data into cache without π bits
        • typically, raise error when π bit goes out of scope
      • Design points: increasing levels of π bit protection
        • π bit till register commit
        • π bit till register read
        • π bit till store commit
        • π bit till I/O commit
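The four design points can be compared with a deliberately simplified model. It assumes that a possibly incorrect value passes through the stages in the order listed on the slide, and that the error is raised exactly when the value reaches the point where the π bit goes out of scope. The `Stage` enum and `due_declared` function are hypothetical names, and real designs have more cases than this.

```python
from enum import IntEnum


class Stage(IntEnum):
    REGISTER_COMMIT = 1
    REGISTER_READ = 2
    STORE_COMMIT = 3
    IO_COMMIT = 4


def due_declared(design_point, furthest_stage_reached):
    """The pi bit is carried up to design_point and the error is raised there.

    If the possibly incorrect value never gets that far (for example, a
    register that is never read), no error is declared.
    """
    return furthest_stage_reached >= design_point


# A struck value that is committed to a register but never read.
dead_register = Stage.REGISTER_COMMIT
# A struck value that is read and stored to memory but never leaves via I/O.
stored_not_output = Stage.STORE_COMMIT

for point in Stage:
    print(
        f"pi bit till {point.name:15s}"
        f" dead register -> {due_declared(point, dead_register)},"
        f" stored, no I/O -> {due_declared(point, stored_not_output)}"
    )
```

Each step up in protection level removes another class of False DUE, at the cost of carrying π bits through more of the machine.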

  24. % False DUE AVF Eliminated (PI = π)
      [Figure: chart; CPU2000, Asim, Simpoint, Itanium®2-like]
      • Practical to eliminate most of the False DUE AVF

  25. Summary
      • Trade-off performance for lower soft error rate
        • MITF (mean instructions to failure) ∝ (IPC / AVF)
        • reduce errors by keeping objects longer in protected memory
      • False Detected Unrecoverable Errors
        • processor would unnecessarily crash on such an error
        • techniques to avoid false errors
          • π (possibly incorrect) bit
          • anti-π bit
          • PET (post-commit error tracking) buffer, see paper

  26. BACKUP SLIDES FOLLOW

  27. % of False DUE Covered
      [Figure: chart]
      • Possible to eliminate most of the False DUE AVF
