This article discusses the challenges and limitations of self-repair technology for logic circuits in nano-electronics, including wear-out problems and potential mechanisms, as well as strategies for test, repair, and cost management. The importance of fault-tolerant computing and the implementation of error detection and correction codes are also explored.
Self Repair Technology for Logic Circuits: Architecture, Overhead and Limitations. Heinrich T. Vierhaus, BTU Cottbus, Computer Engineering Group
Outline 1. Introduction: Nano Structure Problems 2. The Problem of Wear-Out 3. Repair for Memory and FPGAs 4. Basic Logic Repair Strategies & Structures 5. Test and Repair Administration 6. De-Stressing Strategies 7. Cost, Overhead, Single Points of Failure 8. Summary and Conclusions
1. Introduction A bunch of new problems from nano structures ...
Nanoelectronic Problems. Lithography: The wavelength used to „map" structural information from masks to wafers is larger (4 times or more) than the minimum structural features (193 versus 90 / 65 / 45 nm). This requires adaptation of layouts for the correction of mapping faults. Statistical Parameter Variations: The number of atoms in MOS-transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.
New Problems with Nano-Technologies. [Figure: lithography setup: light source (193 nm wavelength), mask (reticle), resist and exposed resist on the wafer; feature sizes down to 28 nm.]
Layout Correction. Modified layouts for the compensation of mapping faults. Compensation is critical and non-ideal. Faults are not random but correlated! Requires fast fault diagnosis.
Doping Fluctuations in MOS Transistors. [Figure: MOS transistor cross-section: poly-Si gate, n+ regions, individual doping atoms in the p-substrate.] Density and distribution of doping atoms cause shifts in transistor threshold voltages!
Nanostructure Problems. Individual device characteristics such as Vth are increasingly dependent on statistical variations of underlying physical features such as doping profiles. Primary relevance: yield. A significant share of basic devices will be „out of specs" and need replacement by backup elements for yield improvement after production. Primary relevance: yield. Smaller features mean higher stress (field strength, current density) and also foster new mechanisms of early wear-out. Primary relevance: lifetime. Transient error recognition and compensation „in time" is becoming a must, due e. g. to charged particles that can discharge circuit nodes. Primary relevance: dependability.
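The yield impact of doping fluctuations can be made tangible with a small numerical sketch. The Python snippet below samples threshold voltages under a Pelgrom-style matching rule (sigma(Vth) proportional to 1/sqrt(W*L)); the matching constant, the nominal Vth and the geometries are assumed illustrative numbers, not data from the slides.

```python
import math
import random

# Assumed example values for illustration only:
A_VT_MV_UM = 3.0      # matching constant in mV*um
VTH_NOMINAL = 0.40    # nominal threshold voltage in volts

def vth_sigma(width_um: float, length_um: float) -> float:
    """Standard deviation of Vth (in volts) for a given device geometry."""
    return (A_VT_MV_UM / math.sqrt(width_um * length_um)) / 1000.0

def sample_vth(width_um: float, length_um: float) -> float:
    """Draw one device's threshold voltage from the statistical spread."""
    return random.gauss(VTH_NOMINAL, vth_sigma(width_um, length_um))

# The spread grows as the channel shrinks: fewer doping atoms per device.
for w, l in [(0.5, 0.5), (0.09, 0.09), (0.045, 0.045)]:
    print(f"W=L={w} um: sigma(Vth) = {vth_sigma(w, l) * 1000:.1f} mV")
print(f"one sampled 45 nm device: Vth = {sample_vth(0.045, 0.045):.3f} V")
```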
Fault Tolerant Computing. A fault event can be handled on three levels:
Software-based fault detection & compensation: specific; works only for transient faults!
HW logic & RT-level detection & compensation: universal; typically works for transient and permanent faults!
Transistor- and switch-level compensation: very specific; typically works for specific types of transient faults only!
2. Wear-Out Problems and Mechanisms Structures on ICs used to live longer than either their application or even their users. Not any more ...
IC Structures May Get Tired. „Wear-out" effects in nano-electronic ICs are likely to appear much earlier, causing a lot of problems for dependable long-time applications!
Wear-Out Mechanisms. Metal Migration: Metal atoms (Al, Cu) tend to migrate under high current density and high temperature. Stress Migration: Migration effects may be enhanced under mechanical stress conditions. Effect: metal lines and vias may develop line interrupts. The effect is partly reversible by changing current directions.
Transistor Degradation. Negative Bias Thermal Instability (NBTI): Reduced switching speed for p-channel MOS transistors that have operated under long-time constant negative gate bias. The effect is partly reversible. Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS transistors, induced by positive gate bias and frequent switching. Not reversible. Gate Oxide Deterioration: Induced by high field strength. Not reversible. Dielectric Breakdown: Insulating layers between metal lines may break, causing shorts between signal lines. Design technology must include a prospective „life time budget"!!
Management of Wear-Out by „Fault Tolerant Computing"? Built-in fault tolerance and error compensation are needed in nano-technologies anyway, for the management of transient faults. Wear-out induced faults may show up as „intermittent" faults first, which become more and more frequent. Faults in synchronous circuits and systems are detected „by clock cycle". Hence, for many types of fault-tolerant architecture, the detection does not even recognize whether the fault is permanent or not.
Triple Modular Redundancy. [Figure: TMR: three execution units operate on the same input signal; a comparator / voter delivers the majority result and an error-detect output.] Can detect and compensate almost any type of fault. Overhead about 200-300 %, plus additional signal delays. The voter itself is not covered but must be a „self checking checker". Standard (by law) in avionics applications!
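To make the voting mechanism concrete, here is a minimal Python sketch of a TMR majority voter; the function name and example values are our own, and in hardware the voter is of course combinational logic, not software.

```python
from collections import Counter

def tmr_vote(out1, out2, out3):
    """Majority voter for triple modular redundancy (TMR).

    Returns (majority_result, error_detected). An error is flagged
    whenever the three execution units disagree, even though the
    majority output still masks a single faulty unit.
    """
    value, count = Counter([out1, out2, out3]).most_common(1)[0]
    if count == 3:
        return value, False      # all units agree
    if count == 2:
        return value, True       # one unit disagrees: fault masked
    raise RuntimeError("all three units disagree: fault not maskable")

# Example: unit 2 delivers a wrong result, the voter masks it.
print(tmr_vote(42, 41, 42))      # -> (42, True)
```

Note how a single deviating unit is masked but still flagged, which is exactly why TMR compensates faults without interrupting operation.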
Error Detecting / Correcting Codes. [Figure: data plus a signature pass through transmission / storage; the re-computed signature is compared against the stored one for fault detection and error correction.] Often applicable to 1- or 2-bit faults only. Often limited to certain fault models (uni-directional). Becomes expensive if applied to computational units.
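As a concrete instance of such a code, the following Python sketch implements the classic Hamming(7,4) single-error-correcting code; it illustrates the slide's point that simple codes cover 1-bit faults only. The bit layout and function names are our own choices.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming code word.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    Corrects any single-bit error in the stored / transmitted word.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Return (corrected data bits, error position or 0 if none)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity group of p1
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity group of p2
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity group of p3
    pos = s1 + 2 * s2 + 4 * s3       # 1-based position of the flipped bit
    if pos:
        c[pos - 1] ^= 1              # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], pos

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # inject a single-bit fault
print(hamming74_decode(word))         # -> ([1, 0, 1, 1], 5)
```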
Can TMR and Codes Compensate Permanent Faults? Fault / error detection circuitry typically works on a clock-cycle base. It does not „know" whether a fault is transient or permanent. A permanent fault is a fault event that occurs repeatedly in several to many successive clock cycles. Error correction technology can detect and compensate such permanent faults as well as transient faults. A critical condition occurs if transient faults occur on top of permanent faults. Then the superposition of fault effects is likely to exceed the system's fault handling capacity. System components that run actively „in parallel" suffer from the same wear-out effects. Therefore there is an increase in dependability before wear-out limits are reached, but no significant life time extension!
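A repair administrator can only tell the two fault classes apart by observing the detector over many cycles. The sketch below counts errors in a sliding window and declares a fault „permanent" once it repeats often enough; the window size and threshold are assumed example parameters.

```python
class FaultClassifier:
    """Sketch: distinguish transient from (likely) permanent faults.

    A per-cycle error detector cannot tell fault types apart; only the
    history can. Here a fault that repeats in at least `threshold` of
    the last `window` clock cycles is flagged as permanent.
    """
    def __init__(self, window=16, threshold=8):
        self.window = window
        self.threshold = threshold
        self.history = []

    def observe(self, error_this_cycle: bool) -> str:
        self.history.append(error_this_cycle)
        self.history = self.history[-self.window:]
        if sum(self.history) >= self.threshold:
            return "permanent"    # repair / reconfiguration needed
        if error_this_cycle:
            return "transient"    # compensation (e.g. retry) suffices
        return "fault-free"

clf = FaultClassifier()
for cycle in range(20):
    # Simulated detector output: the fault becomes stuck from cycle 10 on.
    print(cycle, clf.observe(cycle >= 10))
```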
Redundancy and Wear-Out. During the normal life time of the system, duplication or triplication can enhance reliability significantly. But area and power consumption are also roughly tripled. And at the end of the normal operating time (out of fuel / steam), all three systems will fail shortly one after the other!! Reliability enhancement is not equal to life time extension!!
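This point can be checked with a short calculation. Assuming each unit follows a Weibull wear-out law (parameters chosen purely for illustration) and an ideal voter, TMR survives while at least 2 of 3 units work, i.e. R_tmr = 3R^2 - 2R^3:

```python
import math

ETA, BETA = 10.0, 4.0    # assumed characteristic life / wear-out shape

def r_unit(t):
    """Survival probability of one unit under a Weibull wear-out model."""
    return math.exp(-((t / ETA) ** BETA))

def r_tmr(t):
    """TMR survives while at least 2 of 3 identical units still work."""
    r = r_unit(t)
    return 3 * r**2 - 2 * r**3

for t in [2, 6, 8, 9, 10, 11, 12]:
    print(f"t={t:>2}: single unit {r_unit(t):.3f}   TMR {r_tmr(t):.3f}")
# During normal life TMR is clearly better; near t = ETA all three units
# wear out together and both curves collapse at almost the same time.
```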
Self Repair? The same three levels apply:
Software-based fault detection & compensation: specific; works only for transient faults!
HW logic & RT-level detection & compensation: universal; typically works for transient and permanent faults!
Transistor- and switch-level compensation: very specific; typically works for specific types of transient faults only!
Self repair addresses permanent faults!
3. Repair for Memory and FPGAs. Compensation of transient faults is not enough. Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not with additional transient faults on top!
Memory Test & Repair Read- / write lines Lines Line address spare column columns
Memory Test & Repair (2) Read- / Write lines Lines Line address spare column Memory BIST controller columns ... is already state-of-the-art!
Repair Mechanism: Row/Line-Shift. Little overhead for the re-configuration process. Loss of many "good" CLBs for every fault.
Distributed Backup CLBs. Minimum loss of functional CLBs. High effort for re-wiring: requires massive „embedded" computing power (32-bit CPU, 500 MHz).
Self Repair within FPGA Basic Blocks. Heterogeneous repair strategies are required (memory, logic). Logic blocks may use methods known from memory BISR. Additional repair strategies are necessary for logic elements. The basic overhead of FPGAs versus standard logic (about a factor of 10) is increased further. Repair strategies for logic may use some features already present in FPGAs (e. g. switched interconnects).
FPGAs for a Solution? The granularity of re-configurable logic blocks (CLBs) in most FPGAs is in the order of several thousand transistors. Replacement strategies, however, must work at a granularity of blocks of about 100-500 transistors for fault densities between 0.01 % and 0.1 %. An efficient FPGA repair mechanism requires detailed fault diagnosis plus specific repair schemes, which cannot all be kept as pre-computed reconfiguration schemes. Computation of specific repair schemes requires „in-system EDA" (re-placement and routing) with a massive demand for computing power. There is no source of such „always available" computing power.
Self-Repairing FPGA? [Figure: array of CLBs and wiring blocks (WB) with reconfigurable logic memory; a „virtual" CPU computes and programs a new configuration scheme.]
Advanced FPGA Structures ... are only partly re-configurable, for performance reasons!
FPGA / CPLD Repair. Looks pretty easy at first glance because of the regular architecture! Requires lines / columns of switches for configuration at inputs and between AND / OR matrices. Requires additional programmability of cross-points by double-gate transistors as in EEPROMs or Flash memory. Not fully compatible with standard CMOS. Limited number of (re-) configurations. Floating gate (FAMOS) transistors are fault-sensitive!
4. Basic Logic Repair Strategies. Repair techniques that replace failing building blocks by redundant elements from a „silent" storage are not new. IBM has been selling such computer systems, specifically for applications in banks, for decades. But always with a few (2-10) backup elements (CPUs), assuming a small number of failures (< 10) within years.
Mainframes ... will often contain „redundant" CPUs for the compensation of occasional faults. But one faulty transistor then „costs" a whole CPU, limiting the fault handling to a few (about 10) permanent fault cases.
Repair Overhead versus Element Loss. [Figure: repair overhead and loss of functioning elements plotted over the size of replaced blocks (granularity, 1 to 10M transistors): very small blocks cause prohibitive overhead, very large blocks a prohibitive loss of functioning elements at realistic fault densities; new methods and architectures must target the region in between.]
Built-in Self Repair (BISR). BISR is well understood for highly regular structures such as embedded memory blocks. BISR essentially depends on built-in self test (BIST) with high diagnostic resolution. The basic flow: fault detection, fault diagnosis, fault isolation, redundancy allocation, fault / redundancy management. Redundancy management must monitor faults, replacements and available redundancy, and must also re-establish a „working" system state after power-down states. A sketch of this flow follows below.
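The following Python sketch walks once through the BISR flow. The three callbacks stand in for the real hardware hooks (BIST engine, diagnosis logic, switch control) and are assumptions of this sketch; the status dictionary models the non-volatile status memory that must survive power-down.

```python
def bisr_cycle(status_memory, run_bist, locate_fault, allocate_spare):
    """One BISR administration pass through the phases on the slide:
    detection -> diagnosis -> isolation -> redundancy allocation ->
    redundancy management.
    """
    if not run_bist():                     # fault detection
        return "fault-free"
    block = locate_fault()                 # fault diagnosis
    status_memory.setdefault("isolated", []).append(block)  # isolation
    if not allocate_spare(block):          # redundancy allocation
        return "out of redundancy"
    status_memory["repairs"] = status_memory.get("repairs", 0) + 1
    return "repaired"                      # redundancy management done

state = {}                                 # would live in NV memory
print(bisr_cycle(state, lambda: True, lambda: 7, lambda blk: True))
print(state)   # -> {'isolated': [7], 'repairs': 1}
```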
Levels of Repair.
Transistor / Switch Level: replace transistors or transistor groups. Losses by reconfiguration (switched-off „good" devices): potentially small (20-50 %) for transistor faults. Overhead for test and diagnosis: very high. Repair overhead will dominate reliability!
Gate Level: replace gates or logic cells. Losses by reconfiguration: medium (60-90 %) for single transistor faults. Overhead for test and diagnosis: high.
Macro-Block Level: replace functional macros (ALU, FPU, CPU). Losses by reconfiguration: high, 99 % or more. Overhead for test and diagnosis: maybe acceptable.
The Fault Isolation Problem. [Figure: a driver with two loads; a gate short at one load's input.] GND-shorts of input gates affect the whole fan-in network and render redundancy useless!!
Block-Level Repair. [Figure: a group of gates with switching elements (SE) in their input / output paths.] Blocks of logic / RT elements (gates and larger) each contain a redundant element that can replace a faulty unit.
Switching Concept (2). [Figure: functional blocks 1-3 plus a replacement block; inputs and outputs are routed through switches, with dedicated test-in / test-out paths to the block currently under test.]
A Regular Switching Scheme. The scheme is regular and scalable by nature, always comprising k functional blocks of the same kind plus 1 additional block for backup. Building blocks are separated by (pass-) transistor switches at inputs and outputs, providing full isolation of a faulty block. There are always 2 additional pass-transistors between two functional blocks. The reconfiguration scheme regularly shifts functionality between blocks, which results in a simple scheme of administration. The functional access to the „spare" block can be used for testing purposes. In any state of (re-) configuration, the potentially „faulty" block is connected to test input / output terminals.
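A small model of the resulting block assignment, assuming the shift-style reconfiguration described above (the function name and return convention are our own):

```python
def shift_map(k, faulty_block=None):
    """Sketch of the regular k+1 switching scheme: logical functions
    0..k-1 map onto k+1 physical blocks. Without a fault the last
    physical block is the spare; on a fault in physical block f, all
    functions at or beyond f shift one block to the right, so that f
    is isolated and routed to the test terminals instead.
    """
    mapping = {}
    for func in range(k):
        shift = 0 if (faulty_block is None or func < faulty_block) else 1
        mapping[func] = func + shift
    block_under_test = faulty_block if faulty_block is not None else k
    return mapping, block_under_test

print(shift_map(3))     # -> ({0: 0, 1: 1, 2: 2}, 3): block 3 is spare
print(shift_map(3, 1))  # -> ({0: 0, 1: 2, 2: 3}, 1): block 1 isolated
```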
Overhead Depending on Block Size (transistor counts for 3 functional blocks + 1 backup):
Basic Element | Functional | Backup | Norm. switch | Ext. switch
2-NAND        | 12         | 4      | 18           | 24
2-AND         | 18         | 6      | 18           | 24
2-XOR         | 18         | 6      | 18           | 24
Half Adder    | 36         | 12     | 24           | 30
Full Adder    | 90         | 30     | 30           | 36
For small basic blocks, the switches make up the essential overhead (about 200 %)! For larger basic blocks, the overhead can be reduced to about 30-50 % ... not counting test and administration overhead! Hence: extract larger basic units from seemingly irregular logic netlists!!
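The overhead trend can be reproduced from the table if one counts the backup block plus the switch transistors against the functional transistors of the three working blocks; this accounting is our assumption, and test / administration overhead is again left out:

```python
# Worked overhead check for a few rows of the table above.
rows = {                # element: (functional, backup, normal switches)
    "2-NAND":     (12,  4, 18),
    "half adder": (36, 12, 24),
    "full adder": (90, 30, 30),
}
for name, (func, backup, switches) in rows.items():
    overhead = 100.0 * (backup + switches) / func
    print(f"{name:>10}: {overhead:.0f}% overhead")
# -> 2-NAND: 183% ; half adder: 100% ; full adder: 67%
# The larger the replaced block, the smaller the relative overhead.
```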
5. Test and Repair Administration. [Figure: two organisations side by side: de-centralized test and control, with BIST logic attached to each reconfigurable logic block (RLB), versus centralized control with a test generator, test analyzer, configurator and status memory, and system monitoring. The centralized control itself may be faulty!]
Blocks, Switching, Administration. [Figure: columns of switches between functional units (F-Unit) and redundancy units (Red.-Unit); configuration units with decoders allow local (re-) configuration, while a global control unit handles remote (re-) configuration.]
Combining Test and Re-Configuration. [Figure: logic under test is driven via test inputs; its outputs are compared against a reference, and the fault-detect signal advances a configuration memory / counter to the next state.]
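In software terms, the combined loop amounts to stepping a configuration counter until a block passes the comparison against the reference; the hook function below is an assumed stand-in for applying the actual test patterns.

```python
def test_and_reconfigure(blocks, apply_test, reference):
    """Sketch of the combined test / re-configuration loop: each block
    under test is driven with test patterns, its responses are compared
    against a reference, and the configuration counter steps to the
    next state on a fault.
    """
    for config, block in enumerate(blocks):
        if apply_test(block) == reference:   # compare against reference
            return config                    # keep this configuration
    raise RuntimeError("no working configuration left")

# Example with stand-in blocks: block 0 fails, block 1 passes.
blocks = ["faulty", "good"]
chosen = test_and_reconfigure(blocks, lambda b: b == "good", True)
print("selected configuration:", chosen)     # -> 1
```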