Coping with Physical Failures, Soft Errors, and Reliability Issues Chapter 8
What is this chapter about? • Gives an Overview of and Promising Solutions to the Causes of Manufacturing Defects and Soft Errors • Focus on • Signal Integrity • Defect-Based Tests • Process Sensors and Adaptive Design • Soft Errors • BISER • Circuit-Level Approaches • Defect and Error Tolerance
Coping with Physical Failures, Soft Errors, and Reliability Issues • Introduction • Signal Integrity • Manufacture Defects, Process Variations, and Reliability • Soft Errors • Defect and Error Tolerance • Concluding Remarks
Introduction • Defects • Random defects • Caused by manufacturing imperfections and occur in random places • Systematic defects • Caused by process or manufacturing variations Defect level (DL) is a function of process yield (Y) and fault coverage (FC)
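A widely used model relating these three quantities is the Williams-Brown equation, DL = 1 - Y^(1 - FC). The sketch below assumes that model is the one intended here; the function names are illustrative only.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Defect level under the Williams-Brown model (an assumed model).

    process_yield  -- Y, fraction of manufactured parts that are good (0 < Y <= 1)
    fault_coverage -- FC, fraction of modeled faults detected by the test (0..1)
    """
    if not 0.0 < process_yield <= 1.0:
        raise ValueError("process_yield must be in (0, 1]")
    if not 0.0 <= fault_coverage <= 1.0:
        raise ValueError("fault_coverage must be in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


def defect_level_ppm(process_yield: float, fault_coverage: float) -> float:
    """Same quantity expressed in defective parts per million shipped."""
    return defect_level(process_yield, fault_coverage) * 1e6


if __name__ == "__main__":
    # With 90% yield, raising fault coverage lowers the defect level.
    for fc in (0.90, 0.99, 0.999):
        print(f"FC = {fc:.3f} -> DL = {defect_level_ppm(0.90, fc):.0f} ppm")
```

Under this model, perfect fault coverage gives DL = 0 and zero coverage gives DL = 1 - Y, which matches the intuition that an untested lot ships every defective part.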
Concept of Signal Integrity Signal integrity is the ability of a signal to generate correct responses in a circuit. A signal with good integrity stays within safe margins for its voltage amplitude and transition time.
Basic Concept of Integrity Loss • Integrity loss: any portion of a signal that exceeds the amplitude-safe and time-safe margins. • Integrity loss is measured against Vi, one of the acceptable amplitude levels, over the time frame during which the loss occurs.
Sources of Integrity Loss • Interconnects • Power Supply Noise • Process Variations
Integrity Loss Sensors/Monitors (1) • Current Sensor • Current sensors are often used to detect the completion of asynchronous circuits.
Integrity Loss Sensors/Monitors (2) • Power Supply Noise Sensor • The voltage depends on the power/ground bounces: the higher the PSN is, the longer the propagation and the higher the voltage will be.
Integrity Loss Sensors/Monitors (3) • Noise Detector (ND) Sensor • The ND sensor is designed to detect integrity loss due to voltage violations.
Integrity Loss Sensors/Monitors (4) • Integrity Loss Sensor (ILS) • The integrity loss sensor is a delay violation sensor.
Integrity Loss Sensors/Monitors (5) • Jitter Monitor • Jitter is often defined as the time deviation of a signal from its ideal location in time.
Integrity Loss Sensors/Monitors (6) • A ring oscillator can work as a Process Variation Sensor • The variation of delay caused by PV-faults in any of the inverters in the loop results in a deviation in the frequency of the oscillator, which can be detected. • f = 1 / (2 N t_d), where N is an odd number of inverters and t_d is the delay of one inverter.
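The relationship between inverter delay and oscillation frequency can be sketched in a few lines. This example uses the standard ring-oscillator relation f = 1 / (2 N t_d); the 5% tolerance and the function names are assumptions made for illustration, not values from the chapter.

```python
def ring_oscillator_frequency(num_inverters: int, inverter_delay_s: float) -> float:
    """Ideal oscillation frequency of an N-stage inverter ring, f = 1 / (2 N t_d)."""
    if num_inverters < 3 or num_inverters % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number (>= 3) of inverters")
    if inverter_delay_s <= 0:
        raise ValueError("inverter delay must be positive")
    return 1.0 / (2 * num_inverters * inverter_delay_s)


def pv_fault_suspected(nominal_hz: float, measured_hz: float,
                       tolerance: float = 0.05) -> bool:
    """Flag a process-variation fault when the relative frequency shift
    exceeds the tolerance (5% here is an arbitrary illustrative threshold)."""
    return abs(measured_hz - nominal_hz) / nominal_hz > tolerance


if __name__ == "__main__":
    nominal = ring_oscillator_frequency(5, 20e-12)   # 20 ps per inverter
    slowed = ring_oscillator_frequency(5, 22e-12)    # every stage 10% slower
    print(f"nominal {nominal / 1e9:.2f} GHz, measured {slowed / 1e9:.2f} GHz")
    print("PV fault suspected:", pv_fault_suspected(nominal, slowed))
```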
Readout Architectures (1) • BIST-Based Architecture • When a noise or delay violation occurs (flag=1), the contents of all scan cells are scanned out through Sout for further reliability and diagnosis analysis. (Figure panels: BIST Architecture; Readout Circuitry)
Readout Architectures (2) • Scan-Based Architecture • At the driving side of an interconnect, a pattern generation BSC (PGBSC) is used to generate test patterns. At the receiving side of the interconnect, an observation BSC (OBSC) is used to detect integrity loss.
Readout Architectures (3) • Basic Concept of PV-Test Architecture • On-chip ROs with counters, embedded in a test chip, are used to detect process variation by measuring the ROs' frequency shifts.
Manufacture Defects, Process Variations, and Reliability • 100% single stuck-at fault coverage cannot guarantee perfect product quality, because there are remaining defects that are: • Timing-dependent • Sequence-dependent • Attributed to timing-dependent, non-single-stuck-at faults
Structural Tests • A Defect-Based Test Architecture
Defect-Based Tests • Small Delay Defect Tests • Bridge Defect Tests • N-Detect Tests • Tests • Tests • VLV Tests
Reliability Stress • Concept of Infant Mortality • Methods to screen infant mortality • Method I - Burn-in • ttf = C exp(E_a / (k T)), where ttf is the time to failure, C is a constant, E_a is the activation energy (eV), k is Boltzmann's constant, and T is the absolute temperature. • Method II - Elevated Voltage Stress
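To see why burn-in at elevated temperature shortens the screening time, the Arrhenius relation ttf = C exp(E_a / (k T)) can be turned into an acceleration factor between use and stress temperatures. This is a minimal sketch; the 0.7 eV activation energy and the two temperatures are illustrative assumptions, not values from the chapter.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def time_to_failure(c: float, activation_energy_ev: float, temp_k: float) -> float:
    """Arrhenius model: ttf = C * exp(E_a / (k * T))."""
    return c * math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(activation_energy_ev: float,
                        use_temp_k: float, stress_temp_k: float) -> float:
    """Ratio ttf(use) / ttf(stress); the constant C cancels out."""
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / use_temp_k - 1.0 / stress_temp_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed example: E_a = 0.7 eV, 55 C use vs. 125 C burn-in.
    af = acceleration_factor(0.7, 55 + 273.15, 125 + 273.15)
    print(f"acceleration factor ~ {af:.1f}")
```

Because the exponent depends only on the temperature difference in 1/T, one hour of burn-in under these assumed conditions ages the part as much as many hours of normal use.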
Redundancy and Memory Repair • Redundancy: • Spare rows, columns, or blocks • Repair schemes: • Pellston Technology [Wuu 2005]: If repeated errors are detected, disable the cache line (set the "not to use" bit) • Perform memory BIST at new operating conditions; exclude failing cells and resize the cache (cache size can grow or shrink, depending on whether the new conditions are more favourable or worse)
Process Sensors and Adaptive Design • In addition to the traditional test structures placed on the scribe lines, embed additional process sensors on-chip. • On-Chip Process Sensors: • Process Variation Sensor • Thermal Sensor • Dynamic Voltage Scaling
Process Variation Sensor • Ring oscillators: Many factors can affect the frequency of the ring oscillator, such as process variation, temperature, and voltage. • Analog Process Variation Sensor: The analog circuit is sensitive to different process parameters. Neither sensor can report the process variation at a specific spot on the die, and neither is likely to allow the data to be extracted and analyzed in real time.
Thermal Sensor • On-chip thermal sensors are the last line of defence against a system crash or permanent damage to the chip. • Figure 8.14: Thermal sensor example
Dynamic Voltage Scaling • Figure 8.15: Dynamic voltage scaling (DVS) scheme
Dynamic Voltage Scaling (cont’d) • Use sleep transistors and dynamic biasing to save power • Use the adaptive test method for smart binning
Soft Errors • Introduction • Sources of Soft Errors and SER Trends • Coping with Soft Errors
Introduction • Soft errors • Soft errors are transient single-event upsets (SEUs) caused by various types of radiation • Cosmic radiation is the major source of soft errors, especially in memories. • Terrestrial radiation is another source of soft errors.
Sources of Soft Errors and SER Trends • If a glitch is induced at the junction (red label) in a memory element, its state can be reversed.
Sources of Soft Errors and SER Trends • Logic circuits are less susceptible to these glitches than memories for the following reasons. • The glitch must be of sufficient strength to propagate from the location of the strike. • The glitch needs to have a functionally sensitized path to be latched. • The glitch must arrive at a latch during its latching window.
Coping with Soft Errors • As chips are susceptible to soft errors, many soft error protection schemes targeting chip designs have been proposed. • Fault tolerance • Error-resilient microarchitectures • Soft error mitigation
Fault Tolerance • Removing the source of soft errors to improve the reliability of a chip. • Three fundamental fault tolerance schemes: • Hardware (spatial) redundancy • assumes that defects and radiation particles will hit only one specific device and not another • Time (temporal) redundancy • assumes that a radiation strike will not hit the same circuitry again at a slightly later time • Information redundancy • uses an error-detecting code or error-correcting code to represent information contents
Fault Tolerance • Common fault tolerance schemes used in high-reliability systems • Duplicate and compare • used in mainframes and high-end servers • Triple modular redundancy • used for systems that cannot fail • Redundant multithreading • runs redundant copies of the same thread and compares their results
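The difference between the first two schemes is that duplicate-and-compare can only detect a mismatch, while triple modular redundancy can also mask it through a majority vote. A minimal sketch on integer bit vectors follows; the function names are illustrative, and a real design would implement the voter in hardware.

```python
def duplicate_and_compare(result_a: int, result_b: int) -> bool:
    """Return True when the two redundant copies disagree (error detected).
    Detection only: there is no way to tell which copy is wrong."""
    return result_a != result_b


def tmr_majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results.
    Each output bit takes the value held by at least two of the inputs,
    so a single faulty copy is out-voted."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    golden = 0b1010
    upset = golden ^ 0b1100          # one copy with two flipped bits
    print("mismatch detected:", duplicate_and_compare(golden, upset))
    print("voted result:", bin(tmr_majority_vote(golden, golden, upset)))
```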
Error-Resilient Microarchitectures • Two representative error-resilient processor microarchitectures • DIVA • Razor • DIVA • Dynamic Implementation Verification Architecture (DIVA) • DIVA Checker • a smaller and simpler shadow processor • contains a functional checker stage (CHK), a commit stage (CT), and a watchdog timer (WT) • DIVA Core • the main processor that fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB)
Error-Resilient Microarchitectures • Razor • Dynamic voltage scaling (DVS) is one of the most effective and widely used methods for power-aware computing. • The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation. This is accomplished with a shadow unit that has been pushed all the way down into a Razor flip-flop, shown in Figure 8.21a.
Error-Resilient Microarchitectures • Razor • A reduced-overhead Razor flip-flop with the metastability detection circuit is illustrated in Figure 8.21b.
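The voltage-tuning idea behind Razor can be pictured as a simple feedback loop: lower the supply voltage while the observed error rate stays under a target, and raise it when errors become too frequent. The sketch below is only a conceptual illustration; the target error rate, step size, and voltage limits are hypothetical values, and the actual Razor control scheme is not described in these slides.

```python
def next_supply_voltage(v_now: float, observed_error_rate: float,
                        target_error_rate: float = 0.01,
                        step: float = 0.01,
                        v_min: float = 0.8, v_max: float = 1.2) -> float:
    """One step of an error-rate-driven voltage controller (hypothetical).

    Too many timing errors -> raise the voltage to restore margin.
    Few or no errors       -> lower the voltage to save power.
    The result is clamped to the allowed [v_min, v_max] range.
    """
    if observed_error_rate > target_error_rate:
        v_next = v_now + step
    else:
        v_next = v_now - step
    return min(v_max, max(v_min, v_next))


if __name__ == "__main__":
    voltage = 1.0
    # Assumed sequence of error rates reported by the flip-flop error signals.
    for rate in (0.0, 0.0, 0.002, 0.05, 0.03, 0.0):
        voltage = next_supply_voltage(voltage, rate)
        print(f"error rate {rate:.3f} -> supply {voltage:.2f} V")
```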
Soft Error Mitigation • Soft error mitigation techniques give a design partial immunity to potential soft errors at a significantly lower cost than fault tolerance schemes. • There are three soft error mitigation methods: • (1) Built-In Soft-Error Resilience (BISER): proposed in [Mitra 2005], BISER allows the scan design to be reused to protect a device from soft errors during normal operation.
Soft Error Mitigation • Figure 8.22 shows the BISER scan cell design that reduces the impact of soft errors affecting storage elements by more than 20 times.
Soft Error Mitigation • Circuit-level approaches (2) Gate resizing for soft error mitigation [Zhou 2006] is based on physical-level design modifications. Figure 8.23 illustrates the effect of gate resizing on the amplitude and width of a 0-to-1 transient at the output of a gate.
Soft Error Mitigation • Circuit-level approaches (3) Netlist transformation for soft error mitigation [Almukhaizim 2006] is based on logic-level design modifications.
Defect and Error Tolerance • Defect Tolerance • Insert redundant circuitry in a circuit under test • The circuit can continue correct operation in the presence of defects. • Error Tolerance • Allow the circuit to continue acceptable operation in the presence of errors
Random Spot Defects • Assume a design consists of N submodules. • Each submodule has n unique positions where a defect would cause it to fail its tests. • D defects are uniformly distributed over the submodules. • The number of defects in any submodule is independent of the number of defects in other submodules.
Defect Probability • Probability that an arbitrary position on a submodule is associated with a defect is: p = D / (nN) • Probability of having d defects in a given submodule is: P(d) = C(n,d) p^d (1-p)^(n-d), where C(n,d) = n! / (d!(n-d)!)
Poisson Distribution • Because P(d) is a binomial distribution, the average number of defects in an arbitrary submodule is: E(d) = λ = np = D / N • For large n and small p, the binomial distribution can be approximated by the Poisson distribution: P(d) ≈ λ^d e^(-λ) / d!
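How close the Poisson approximation is to the exact binomial model can be checked numerically. The sketch below uses the definitions above (p = D / (nN), λ = np); the specific values of D, n, and N are made up for illustration.

```python
import math


def defect_position_probability(total_defects: int, positions_per_module: int,
                                num_modules: int) -> float:
    """p = D / (n N): chance that one position carries a defect."""
    return total_defects / (positions_per_module * num_modules)


def binomial_pmf(d: int, n: int, p: float) -> float:
    """Exact probability of d defects among n positions."""
    return math.comb(n, d) * p ** d * (1.0 - p) ** (n - d)


def poisson_pmf(d: int, lam: float) -> float:
    """Poisson approximation with mean lam = n * p."""
    return lam ** d * math.exp(-lam) / math.factorial(d)


if __name__ == "__main__":
    D, n, N = 69, 10_000, 100          # illustrative values only
    p = defect_position_probability(D, n, N)
    lam = n * p                        # equals D / N
    for d in range(5):
        print(f"d={d}: binomial {binomial_pmf(d, n, p):.5f}  "
              f"poisson {poisson_pmf(d, lam):.5f}")
```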
Example • Assume a submodule is equally likely to be defect-free or defective: P(0) = e^(-λ) = 0.5 • Thus, λ = ln 2 = 0.693. • Effective yield can increase significantly if the system can accept some defective submodules.
Probability of Having Exactly d Defects at a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ
λ        d=0 (Y)  d=1   d=2   d=3   d=4   d=5   d=6   d=7
0.105    0.90     0.09  -     -     -     -     -     -
0.223    0.80     0.18  0.02  -     -     -     -     -
0.357    0.70     0.25  0.04  0.01  -     -     -     -
0.511    0.60     0.31  0.08  0.01  -     -     -     -
0.693    0.50     0.35  0.12  0.03  -     -     -     -
0.916    0.40     0.37  0.17  0.05  0.01  -     -     -
1.204    0.30     0.36  0.22  0.09  0.03  0.01  -     -
1.609    0.20     0.32  0.26  0.14  0.06  0.02  -     -
2.303    0.10     0.23  0.27  0.20  0.12  0.05  0.02  0.01
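The table entries follow from the Poisson model with P(0) equal to the yield, so that λ = -ln(Y). A short sketch that recomputes each row under that assumption (function names are illustrative):

```python
import math


def poisson_pmf(d: int, lam: float) -> float:
    """P(d) = lam^d * exp(-lam) / d!"""
    return lam ** d * math.exp(-lam) / math.factorial(d)


def table_row(submodule_yield: float, max_defects: int = 7) -> list:
    """Probabilities of exactly 0..max_defects defects, rounded to two
    decimals, for a submodule whose defect-free probability is the yield."""
    lam = -math.log(submodule_yield)   # from P(0) = exp(-lam) = Y
    return [round(poisson_pmf(d, lam), 2) for d in range(max_defects + 1)]


if __name__ == "__main__":
    for y in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
        lam = -math.log(y)
        cells = "  ".join(f"{v:.2f}" for v in table_row(y))
        print(f"Y={y:.2f}  lambda={lam:.3f}  {cells}")
```

Cells that round to 0.00 correspond to the entries left out of the table.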
Defect Tolerance • Formerly called redundancy repair • A typical defect-tolerant design (figure: three identical modules M connected to a switch) • Two spares (identical modules) • A switch used to select one module
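The benefit of spares can be estimated with a simple probability argument: the design works if at least one of the identical modules is good. The sketch below assumes module failures are independent and that the switch itself is defect-free; both are simplifying assumptions, not claims from the chapter.

```python
def effective_module_yield(module_yield: float, num_spares: int) -> float:
    """Probability that at least one of (1 + num_spares) identical modules
    is defect-free, assuming independent failures and a perfect switch."""
    if not 0.0 <= module_yield <= 1.0:
        raise ValueError("module_yield must be in [0, 1]")
    if num_spares < 0:
        raise ValueError("num_spares must be non-negative")
    copies = num_spares + 1
    return 1.0 - (1.0 - module_yield) ** copies


if __name__ == "__main__":
    for spares in range(3):
        y = effective_module_yield(0.5, spares)
        print(f"{spares} spare(s): effective yield {y:.3f}")
```

The gain has to be weighed against the area of the spares and the switch, which this estimate ignores.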
Error Tolerance • The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips • Achieving this depends on the development of • An accurate method to estimate error rate • An effective method to predict yield
Fault-Oriented Test Methodology • Enhance effective yield based on error-rate analysis • Estimate the error rate of each modeled fault • A set of acceptable faults is identified based on their error rates (Figure labels: IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips)
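The fault-ranking step can be pictured as sorting the modeled faults by estimated error rate and splitting them at an application-defined threshold. In the sketch below the fault names, error rates, and threshold are invented for illustration; how the error rates are estimated is outside its scope.

```python
from typing import Dict, List, Tuple


def rank_faults(error_rates: Dict[str, float],
                threshold: float) -> Tuple[List[str], List[str]]:
    """Sort faults by estimated error rate (lowest first) and split them
    into acceptable (rate <= threshold) and unacceptable (rate > threshold)."""
    ranked = sorted(error_rates, key=lambda fault: error_rates[fault])
    acceptable = [f for f in ranked if error_rates[f] <= threshold]
    unacceptable = [f for f in ranked if error_rates[f] > threshold]
    return acceptable, unacceptable


if __name__ == "__main__":
    # Hypothetical per-fault error rates (fraction of inputs giving a wrong output).
    estimated = {"f1": 0.20, "f2": 0.001, "f3": 0.01, "f4": 0.05}
    ok, bad = rank_faults(estimated, threshold=0.01)
    print("acceptable faults:  ", ok)
    print("unacceptable faults:", bad)
```

Tests then only need to target the unacceptable set, so a chip containing only acceptable faults can still be shipped.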