Microprocessor Reliability. Robert Pawlowski ECE 570 – 2/19/2013. Reliability. Involves different aspects of a processor that can affect performance and functionality. Ultimately can reduce the lifetime of the processor. Issues typically manifest themselves at the device level.
Microprocessor Reliability Robert Pawlowski ECE 570 – 2/19/2013
Reliability • Involves different aspects of a processor that can affect performance and functionality. • Ultimately can reduce the lifetime of the processor. • Issues typically manifest themselves at the device level. • Solutions can be implemented at multiple design levels.
Why the concern? • Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variability. • Gate length/doping concentration variations • Temperature • Supply voltage droops • This decreases processor yield • Decreasing device sizes → increased effect of external issues
Outline • Error Classification • Hard Errors • Soft Errors • Sources of radiation • Device/Circuit approaches • Architectural approaches • Error detection • Error correction • System level impact
Processor Error Classification • Hard Errors will result in permanent processor failure. • Processor lifetime is inversely proportional to hard error rate. • Soft Errors do not permanently damage the device.
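The inverse relationship between lifetime and hard error rate can be sketched numerically. This is a minimal illustration that assumes a constant failure rate expressed in FIT (failures per 10^9 device-hours), a common reliability unit that the slides do not mention; the function name and any sample rates are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def mttf_years(fit_rate):
    """Mean time to failure in years for a constant hard error rate.

    Assumption (not from the slides): the rate is given in FIT,
    where 1 FIT = 1 failure per 1e9 device-hours.
    """
    if fit_rate <= 0:
        raise ValueError("fit_rate must be positive")
    mttf_hours = 1e9 / fit_rate  # lifetime is the reciprocal of the rate
    return mttf_hours / HOURS_PER_YEAR
```

Under this model, doubling the hard error rate halves the expected lifetime, for example `mttf_years(2000)` is half of `mttf_years(1000)`.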
Hard Errors • Extrinsic failures • Caused by process and manufacturing defects • Occur with decreasing rate over time • No impact from micro-architecture • Intrinsic failures • Related to processor wear-out • Occur with increasing rate over time • Related to wafer packaging, process parameters, and processor design.
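The opposite trends of the two failure classes (decreasing rate for extrinsic failures, increasing rate for intrinsic wear-out) are often pictured as a bathtub curve. The sketch below uses a Weibull hazard function as one possible model; the Weibull choice, the shape values 0.5 and 3.0, and the function names are assumptions for illustration and do not come from the slides.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) = (k/s) * (t/s)**(k - 1) for t > 0.

    shape < 1 gives a rate that falls over time (extrinsic-like),
    shape > 1 gives a rate that rises over time (wear-out-like).
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def total_hazard(t, extrinsic_shape=0.5, intrinsic_shape=3.0):
    """Sum of both mechanisms; shape values are hypothetical."""
    return weibull_hazard(t, extrinsic_shape) + weibull_hazard(t, intrinsic_shape)
```

With these illustrative parameters, `total_hazard` is high very early, lower in mid-life, and high again late in life.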
Soft Errors • Occur in both memory and logic • External radiation main issue in memory • Alpha particles • High energy neutrons • Thermal neutrons • Different causes of transient errors in logic • External radiation • Supply voltage droop • Power supply fluctuations • Ground bounce, cross-talk • Process variation, temperature • Affect delay of computational paths
Radiation-Induced Soft Errors • Ionizing particle strike causes a state change • No permanent damage (unlike a hard error) • Combo logic – Single Event Transients (SET) • Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU) • Three causes of soft errors • Alpha particles • Thermal neutrons • High-energy neutrons
Alpha Particles • Emitted from impurities in packaging materials. • Create electron-hole pairs through direct ionization • Range for a 10 MeV particle < 100 µm • Typical energy 4-9 MeV • Improved manufacturing trends → reduced effect • Purified materials • Shielding layers
Neutrons • Result of cosmic ray reactions with the atmosphere • High-energy neutrons react with chip materials. • Concrete is the only shielding material • 1.4x lower flux per foot of thickness
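The shielding figure on this slide can be turned into a quick estimate. The sketch below assumes the 1.4x reduction compounds with each additional foot of concrete (an exponential attenuation model, which the slide does not state explicitly); the function names are illustrative.

```python
import math

ATTENUATION_PER_FOOT = 1.4  # from the slide: 1.4x lower flux per foot of concrete


def relative_flux(thickness_ft):
    """Fraction of the unshielded neutron flux left behind the concrete.

    Assumes the per-foot factor compounds (exponential attenuation).
    """
    if thickness_ft < 0:
        raise ValueError("thickness_ft must be non-negative")
    return ATTENUATION_PER_FOOT ** (-thickness_ft)


def thickness_for_reduction(factor):
    """Feet of concrete needed to cut the flux by `factor` (factor >= 1)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return math.log(factor) / math.log(ATTENUATION_PER_FOOT)
```

Under that assumption a 10x flux reduction needs just under 7 feet of concrete, which illustrates why shielding is rarely a practical fix for high-energy neutrons.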
Neutrons • Thermal neutrons (<<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer. • Produce ionized particles that can cause soft errors • Solution: remove BPSG from advanced processes • Mostly solved, but SEUs (single event upsets) are still found at 45 nm and 90 nm
Device-level solutions • Larger device sizes → larger capacitance • Increases the amount of charge necessary to flip a bit (critical charge) • Multiple VT design • Sensitivity to variation at low-VDD may limit effectiveness. • Body biasing is also common to both radiation hardening and variation tolerance
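The link between capacitance and critical charge can be sketched with a first-order estimate, Qcrit ≈ C_node × VDD. This approximation is an assumption brought in for illustration (real cells also depend on drive strength and pulse shape), and the function names are hypothetical.

```python
def critical_charge_fc(node_capacitance_ff, vdd_volts):
    """First-order critical charge in femtocoulombs.

    Assumed model: Qcrit ~= C_node * VDD (fF * V = fC). It ignores the
    restoring current of the cell, so it is only a rough lower-level view.
    """
    if node_capacitance_ff <= 0 or vdd_volts <= 0:
        raise ValueError("capacitance and voltage must be positive")
    return node_capacitance_ff * vdd_volts


def is_upset(collected_charge_fc, node_capacitance_ff, vdd_volts):
    """True if the charge collected from a strike exceeds the critical charge."""
    return collected_charge_fc > critical_charge_fc(node_capacitance_ff, vdd_volts)
```

In this model a larger device (more capacitance) raises the critical charge, while lowering VDD reduces it, which matches the concern about low-VDD operation.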
Circuit-level solutions • DICE cell • Used for SRAM, FFs, latches • Built-in current sensors on supply lines of memory cells.
Modular redundancy • Dual Modular Redundancy • Triple Modular Redundancy
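The difference between the two schemes can be shown in a few lines: two copies can only detect a disagreement, while three copies allow a majority vote that masks a fault in one copy. This is a minimal software sketch of the idea, with hypothetical function names, not a description of any particular hardware voter.

```python
def dmr_check(a, b):
    """DMR: returns True when both copies agree.

    A mismatch signals an error, but cannot tell which copy is wrong.
    """
    return a == b


def tmr_vote(a, b, c):
    """TMR: bitwise majority vote over three copies.

    Each output bit takes the value held by at least two of the inputs,
    so a fault confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)
```

For example, with a correct value `0b10110010` and one copy corrupted in a single bit, `dmr_check` reports a mismatch and `tmr_vote` still returns the correct value.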
Redundant Circuits • Redundancy increases area/power • DMR/TMR in sub/near-VT • Timing variation between circuits increases • Utilization of redundant lanes for parallel operation can increase throughput at low-VDD
Self-Checking Circuits • Partition circuit into smaller blocks • Error checker for each block • Use error detection codes • Berger codes • Arithmetic codes • Increases circuit delay for error computation
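As an illustration of one of the detection codes named here, a Berger code appends a check symbol equal to the number of 0 bits in the information word. The sketch below is a software model with hypothetical function names; a hardware checker would compute the same count with a counter circuit.

```python
def berger_check(word, width):
    """Berger check symbol: the number of 0 bits in the `width`-bit word."""
    mask = (1 << width) - 1
    return width - bin(word & mask).count("1")


def berger_encode(word, width):
    """Return (information word, check symbol)."""
    return word & ((1 << width) - 1), berger_check(word, width)


def berger_is_valid(word, check, width):
    """Recompute the zero count and compare it with the stored check symbol."""
    return berger_check(word, width) == check
```

If some bits of the information word flip in one direction only (for example several 1s become 0s), the zero count no longer matches the stored symbol and the error is detected.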
Circuit-Level Speculation • Uses approximated circuit implementation • Goal is to reduce critical path
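One way to picture an approximated circuit with a shorter critical path is an adder whose carry into each bit is speculated from only a few lower bits. The slide does not name a specific circuit, so the adder below, its `window` parameter, and the function names are hypothetical choices made for illustration.

```python
def speculative_add(a, b, width=16, window=4):
    """Approximate adder: the carry into bit i looks at only `window` lower bits.

    Limiting the carry chain shortens the critical path, at the cost of a
    wrong result whenever a carry must propagate further than `window` bits.
    """
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        seg_mask = (1 << (i - lo)) - 1
        seg_a = (a >> lo) & seg_mask
        seg_b = (b >> lo) & seg_mask
        carry = (seg_a + seg_b) >> (i - lo)  # carry out of the short segment
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result


def add_with_recovery(a, b, width=16, window=4):
    """Return (exact sum, mis-speculated flag); the exact add models recovery."""
    exact = (a + b) & ((1 << width) - 1)
    return exact, speculative_add(a, b, width, window) != exact
```

Most operand pairs have short carry chains, so the speculation is usually right; an input such as `0x00FF + 0x0001` forces a long carry and is flagged for recovery.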
Tunable Replica Circuits • Mirrors delay of critical path • Monitors for errors over voltage/frequency changes
Timing Speculation • Razor timing error detection • Designed for transient faults • Effective against SETs and SBUs on flip-flops • Requires error recovery
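Razor-style detection pairs each main flip-flop with a shadow latch that samples the same data slightly later, and flags an error when the two disagree. The sketch below is a simplified behavioral model that ignores setup/hold times and metastability; the function name and the timing parameters are assumptions for illustration.

```python
def razor_capture(old_value, new_value, arrival_time, clock_period, shadow_delay):
    """Model one capture by a main flip-flop and its delayed shadow latch.

    The main flip-flop samples at `clock_period`; the shadow latch samples
    at `clock_period + shadow_delay`. Data arriving after a sample point is
    missed, so that element keeps `old_value`.
    Returns (main, shadow, error_flag).
    """
    main = new_value if arrival_time <= clock_period else old_value
    shadow = new_value if arrival_time <= clock_period + shadow_delay else old_value
    return main, shadow, main != shadow
```

Data that arrives late but within the shadow window is caught and can be restored from the shadow latch. Data that arrives after the shadow window is missed by both elements, so the delay window has to cover the worst-case late arrival.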
Error Recovery Options in Scalar Processors • Clock Gating: • Global error signal • Clock gating • 1-cycle penalty
Error Recovery Options in Scalar Processors • Multiple Issue: • Error signals propagated to control unit • Instructions must be flushed • Error instruction then replayed • 2N-cycle penalty
Error Recovery Options in Scalar Processors • Counter-flow pipelining • Micro-rollback
Error correcting codes for memories • Most common is Hamming code • Check bits stored when data written • Identifies error and erroneous bit position
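The way check bits identify the erroneous bit position can be shown with the small Hamming(7,4) code. The slide does not specify a word size, so the 4-bit data word and the function names here are chosen only to keep the sketch short.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits.

    Codeword layout by 1-based position: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_syndrome(code):
    """Return 0 if no error is seen, else the 1-based position of the bad bit."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    return s1 + 2 * s2 + 4 * s3


def hamming74_correct(code):
    """Fix a single-bit error and return (data bits, error position)."""
    pos = hamming74_syndrome(code)
    fixed = list(code)
    if pos:
        fixed[pos - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]], pos
```

The check bits are computed when the data is written; on a read, the recomputed syndrome is either zero or directly equal to the position of the flipped bit.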
Error correcting codes for memories • Single-bit ECC adds area/power and delay • Low-VDD → increased delay • Hybrid VDD operation will reduce delay • Overhead increases for multi-bit ECC • Increased memory density → higher probability of MBU • Current research: increase in ratio of MBU to total SER in sub-VT
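The storage part of the ECC overhead can be estimated from the Hamming bound for single-error correction: r check bits are enough for m data bits when 2^r ≥ m + r + 1. The sketch below covers only single-error-correcting (SEC) and SEC with double-error detection (SECDED) codes; stronger multi-bit correcting codes need more check bits and are not modeled. Function names are illustrative.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    if data_bits < 1:
        raise ValueError("data_bits must be at least 1")
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


def storage_overhead(data_bits, check_bits):
    """Check bits as a fraction of the data bits."""
    return check_bits / data_bits
```

For a 64-bit word this gives 7 check bits for SEC and 8 for SECDED, a 12.5% storage overhead, before counting the encoder/decoder area, power, and delay mentioned on the slide.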
System-Level Impact • Soft errors can have a large effect on processor functionality • Increasing issue with further device scaling • All methods of error detection/correction are costly • Need to be added to system blocks wisely • SEU distribution • Effects of process variation
System-Level Impact • How to determine what blocks have the highest system-level impact? • Mostly through simulation • For radiation: all-encompassing • Includes fault injection at the circuit level • Different models have been developed • ReStore – University of Illinois at Urbana-Champaign • Focuses on system-level effect of radiation-induced errors • RAMP – IBM • Directed more towards hard errors and processor failure.
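The idea behind simulation-based fault injection can be sketched in software: flip one bit at a time in a block's input and check whether the output changes. This toy example is not the ReStore or RAMP methodology; the block under test (`low_nibble`) and the function names are hypothetical stand-ins.

```python
def inject_single_bit_faults(func, value, width):
    """Flip each input bit in turn and classify the effect on the output.

    Returns (masked, corrupted): lists of bit positions whose flip left the
    output unchanged or changed it, respectively.
    """
    golden = func(value)
    masked, corrupted = [], []
    for bit in range(width):
        faulty = func(value ^ (1 << bit))
        (masked if faulty == golden else corrupted).append(bit)
    return masked, corrupted


def low_nibble(x):
    """Hypothetical block under test: only the low 4 bits reach the output."""
    return x & 0x0F
```

Running `inject_single_bit_faults(low_nibble, 0xA5, 8)` shows that flips in the upper four bits are masked while flips in the lower four corrupt the output, which is the kind of result used to decide where protection is worth its cost.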