Self-Checking Fault Detection using Discrepancy Mirrors

Self-Checking Fault Detection using Discrepancy Mirrors • PDPTA 2005, Las Vegas • Ronald F. DeMara, Carthik A. Sharma • University of Central Florida

Fault Handling Overview • Failure • Manifestation of a fault • Deviation from expected behavior • Detection • Identify occurrence of fault • Fully articulating inputs • Intermittently articulating inputs • Methods • Coding based schemes • Redundancy • Isolation • Physical location of fault

[Figure: PCI-based card used for Xilinx Virtex II-Pro Based Autonomous Repair Testbed]

Ideal Detection Characteristics • Faults in the detector are covered by itself • Fault-secure • Self-testing • No “Golden Elements” • Multiple types of faults handled by same detector • Transient and Permanent faults • Logic and Interconnect faults • Minimum number of false-positives • Accuracy and reliability • Minimal power consumption • Verifiable correctness • Practical Assessment • Fitness assessment should be tractable

Discrepancy Mirror • Mechanism for Checking-the-Checker (“golden element” problem) • Makes checker part of configuration that competes for correctness [DeMara PDPTA-05] Fault Coverage

Discrepancy Mirror Circuit Fault Coverage

Discrepancy Mirror Truth Table • Discrepancy Mirror Truth Table ensures complete coverage of detector. • Single Point of Failure reduced to a stuck-at fault exposure for MATCH output (Wired-Or)

Discrepancy-Enabled Isolation

Discrepancy Mirror Approach • Selection Phase • Two candidates chosen from population • Use mutually exclusive resources • Carry out computation in tandem • Detection Phase • Discrepancy Mirror compares outputs • MATCH output signifies fault free configurations • Faults in the detector also covered • Preference Adjustment Process • Detector output over time indicates relative fitness • Relative fitness can be used to choose candidates
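The three phases above can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' implementation: the configuration names, the resource sets, the round-robin pairing (in place of fitness-guided selection), the +1/-1 fitness update, and the assumption that a single fault is always articulated are all hypothetical.

```python
from itertools import product

def run_duels(left, right, faulty, rounds=5):
    """Pair every left-half configuration with every right-half one and
    adjust fitness from the MATCH / discrepancy outcome of each duel.

    left, right: dict mapping a configuration name to its set of resource ids.
    faulty: set of faulty resource ids (a single fault is assumed).
    """
    fitness = {name: 0 for name in list(left) + list(right)}
    for _ in range(rounds):
        for (l_name, l_res), (r_name, r_res) in product(left.items(), right.items()):
            # Selection phase: the two candidates must use disjoint resources.
            assert l_res.isdisjoint(r_res)
            # Detection phase: an output is wrong iff a faulty resource is used
            # (assumption: the inputs always articulate the fault).
            l_ok = not (l_res & faulty)
            r_ok = not (r_res & faulty)
            match = l_ok == r_ok
            # Preference adjustment (hypothetical rule): reward both on MATCH,
            # penalize both on a discrepancy.
            delta = 1 if match else -1
            fitness[l_name] += delta
            fitness[r_name] += delta
    return fitness

left = {"L0": {0, 1, 2}, "L1": {3, 4, 5}, "L2": {6, 7}, "L3": {3, 8, 9}}
right = {"R0": {10, 11}, "R1": {12, 13}, "R2": {14, 15}, "R3": {16, 17}}
fitness = run_duels(left, right, faulty={3})
# Configurations listed from least to most preferred.
print(sorted(fitness, key=fitness.get))
```

A single discrepancy penalizes both participants, since the detector alone cannot say which side is wrong. Over repeated duels, only the configurations that use the faulty resource accumulate a consistently low fitness.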

CRR Arrangement in SRAM FPGA • Configurations in Population • C = CL ∪ CR • CL = subset of left-half configurations • CR = subset of right-half configurations • |CL| = |CR| = |C|/2 • Discrepancy Operator • Baseline Discrepancy Operator is a dyadic operator with binary output • Z(Ci) is the FPGA data throughput output of configuration Ci • Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual • Any fault in checker lowers that individual’s fitness so that the individual is no longer preferred and eventually undergoes repair • WTA: discrepancy operator = Equivalence • RS: discrepancy operator = Hamming Distance

Overview of FPGA operation • Competing Configurations • Configurations A and B are physically distinct • CA = subset consisting of ‘A’ configurations • CB = subset consisting of ‘B’ configurations • |CA| = |CB| = |C|/2 • Discrepancy Operator • Baseline Discrepancy Operator is a dyadic operator with binary output • Z(Ci) is the FPGA data throughput output of configuration Ci • Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual • Any fault in checker or functional logic lowers fitness of resources used by that individual, leading to isolation

[Figure: SRAM-based FPGA holding Configuration A and Configuration B, each with its own Function Logic and Discrepancy Mirror; labels: INPUT DATA, DATA OUTPUT, FEEDBACK CONTROL, CONFIGURATION BIT STREAM, OFF-CHIP EEPROM, Reconfiguration Algorithm] (NOTE: a non-volatile memory is already required to boot any SRAM FPGA from cold start ... this is not an additional chip)
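The two discrepancy operators named on these slides, equivalence (the XNOR-style binary check) and Hamming distance, can be illustrated on integer-encoded output words. This is a minimal sketch: the function names and the choice of Python integers to stand for Z(Ci) are assumptions made for illustration only.

```python
def equivalence(z_a: int, z_b: int) -> int:
    """Binary discrepancy check: 1 (MATCH) when both outputs agree, else 0.
    For single-bit outputs this is exactly an XNOR gate."""
    return 1 if z_a == z_b else 0

def hamming_distance(z_a: int, z_b: int) -> int:
    """Number of bit positions in which the two output words differ."""
    return bin(z_a ^ z_b).count("1")

# Hypothetical 4-bit outputs of two competing configurations.
z_a, z_b = 0b1011, 0b1001
print(equivalence(z_a, z_b), hamming_distance(z_a, z_b))
```

Equivalence only reports whether the outputs agree, whereas Hamming distance also grades how far apart they are.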

Discrepancy Mirror Schematic: CMOS • Pspice Schematic • 44 p- and n-channel MOS transistors • 1.5 micron minimum width • 600 nm length • Width of p-mos transistors = 3 × width of n-mos transistors

Discrepancy Mirror Schematic: Xilinx • Xilinx Schematic • Virtex-II Pro FPGA • ModelSim-II Simulator • Emulated (digital) pull-down resistor

Discrepancy Mirror Simulation: CMOS Circuit • Transient Response • Behavior conforms to specifications • Correct identification of discrepancy

Discrepancy Mirror Simulation:Xilinx ModelSim-II • Circuit Response • Output ‘High’ == 1 when input q1 == q2 • Output ‘Low’ when input q1 != q2. • In Xilinx FPGAs, ‘Low’ is not exactly equal to zero, but is a Logic ‘zero’ nevertheless.

Fault Location Experiments • Two experiments conducted • C-language program simulator • Locate fault by successive intersections • v-subsets or groups of resources • Fault identified after m comparisons – what is the value of m? • Identify number of iterations required to identify single-fault • Random inputs, Single stuck-at fault • Expected number of pairings over 100 simulations • One ‘resource’ equivalent to one CLB ( > 10 gates) • Experiment 1 • Perpetually articulating inputs • Experiment 2 • Intermittently articulating inputs

Fault Location Using Dueling • Let U denote the set of all logic resources on the FPGA • Let S denote the pool of resources suspected of being faulty; initially S = U • Each configuration i has an associated set of the resources it uses • To isolate the fault, m successive intersections of S with these resource sets are performed, at the end of which |S| = 1 • With pre-designed partitions to achieve maximal isolation • Isolation can be completed in 2n iterations, where n = | |
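The successive-intersection idea can be sketched as follows. This is not the authors' C-language simulator. It assumes a single fault, a suspect pool that starts as the full resource set, a randomly drawn resource subset per pairing, and inputs that always articulate the fault, so that a MATCH exonerates every resource used. The function name and parameters are hypothetical.

```python
import random

def locate_fault(num_resources, fault, utilization, rng, max_pairings=100_000):
    """Shrink the suspect pool S until a single resource remains.

    utilization: fraction of all resources used in one pairing (assumption).
    Returns (suspects, number_of_pairings).
    """
    universe = list(range(num_resources))
    suspects = set(universe)  # assumed starting point: every resource is suspect
    used_count = max(1, int(utilization * num_resources))
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        used = set(rng.sample(universe, used_count))
        pairings += 1
        if fault in used:
            # Discrepancy observed: the fault lies among the resources used.
            suspects &= used
        else:
            # MATCH with always-articulating inputs: everything used is fault-free.
            suspects -= used
    return suspects, pairings

suspects, pairings = locate_fault(1_000, fault=417, utilization=0.5,
                                  rng=random.Random(1))
print(suspects, pairings)
```

The faulty resource is never removed from the pool, so the loop can only end with the fault as the sole remaining suspect. The pairing count depends on the random draws and on the utilization chosen.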

Analysis with Perpetually Articulating Inputs • Perpetually Articulating Inputs • No observed discrepancy implies fault-free resources • Best Case (50% Utilized Capacity): • 11.1 pairings for 1,000 resources • 17.6 pairings for 100,000 resources • Most Demanding Case: • 63.7 pairings for 100,000 resources with 5% capacity utilization

Analysis with Intermittently Articulating Inputs • Intermittently Articulating Inputs • Inputs may be such that fault is not articulated at the outputs • No observed discrepancy does not imply fault-free resources • Only discrepant outputs provide fault-location information • 50% of the inputs articulate the fault • Best Case (45% Utilized Capacity): • 42 pairings for 1,000 resources • 64.1 pairings for 100,000 resources • Most Demanding Case: • 478 pairings for 100,000 resources with 95% capacity utilization
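The difference from the perpetually articulating case can be sketched by letting only discrepant outcomes update the suspect pool. As before, this is a sketch under assumptions rather than the authors' simulator: a single fault, random resource subsets per pairing, and a hypothetical `articulation_prob` parameter standing for the fraction of inputs that articulate the fault.

```python
import random

def locate_fault_intermittent(num_resources, fault, utilization,
                              articulation_prob, rng, max_pairings=1_000_000):
    """Locate a single fault when only some inputs expose it.

    Returns (suspects, number_of_pairings).
    """
    universe = list(range(num_resources))
    suspects = set(universe)  # assumed starting point: every resource is suspect
    used_count = max(1, int(utilization * num_resources))
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        used = set(rng.sample(universe, used_count))
        pairings += 1
        # A discrepancy needs both: the fault is in use AND the input exposes it.
        discrepancy = fault in used and rng.random() < articulation_prob
        if discrepancy:
            suspects &= used
        # No discrepancy: nothing can be concluded, so the pool is unchanged.
    return suspects, pairings

suspects, pairings = locate_fault_intermittent(
    1_000, fault=417, utilization=0.45, articulation_prob=0.5,
    rng=random.Random(1))
print(suspects, pairings)
```

Because a MATCH no longer exonerates anything, many pairings leave the pool untouched. That is consistent with the larger pairing counts reported on this slide.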

Experimental Results Summary • Number of iterations to detect faults depends on Utilized Capacity • Designs that utilize only very few resources (< 20%) or almost all (> 80%) of the resources on the FPGA pose difficult isolation problems • Each intersection exonerates (implicates) fewer individual resources • Method scales well • 11.1, 14.9, 17.6 pairings required for 1,000, 10,000, and 100,000 resources. Sub-linear increase in location time. • Current Work • Competitive Runtime Reconfiguration (CRR) framework under development which will utilize the methods outlined • Investigation of Competitive Group Testing methods to enable faster fault isolation • Analysis of characteristics of isolation, dependency on parameters, optimal partitioning methods.
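The sub-linear claim can be checked with simple arithmetic on the three reported figures. The comparison against log2 of the resource count is an interpretation added here for illustration, not something stated on the slide.

```python
import math

# Pairings reported on the slide for each resource count (best case).
reported = {1_000: 11.1, 10_000: 14.9, 100_000: 17.6}

# Ratio of reported pairings to log2(resource count).
ratios = {n: pairings / math.log2(n) for n, pairings in reported.items()}

# Growth in pairings when the resource count grows by a factor of 100.
growth = reported[100_000] / reported[1_000]

for n, r in ratios.items():
    print(f"{n:>7} resources: pairings / log2(n) = {r:.2f}")
print(f"100x more resources -> {growth:.2f}x more pairings")
```

The ratios stay close to 1, so the reported figures grow roughly with the logarithm of the resource count. A 100-fold increase in resources raises the pairing count by well under a factor of two.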

Backup Slides Follow

Accommodating Multi-bit Word Widths • Proof of concept • The present circuit works efficiently • Demonstrates important Dueling-enabled isolation method • Strategies • Use an array of detectors • attempt to minimize points of failure as word-width increases • Number of logic resources used is acceptable for smaller circuits • Create new circuit or scheme, combining fault tolerant coding-based methods with single-fault secure circuit • Current research focused on improving detector by investigating codes, and fault-secure circuits

Pull-down Resistor Considerations • Proof of concept • The present circuit works in a verifiably correct manner • Can utilize synthesized (digital) pull-down resistors, which simulate the behavior of analog resistors • Demonstrates Dueling-enabled isolation method • Can be utilized without implementation problems for Custom-VLSI designs • Alternative Approach • Alternate detector circuits for FPGA implementation are under investigation • Avoid using tri-state buffers and pull-down resistors; use native digital components available on FPGAs
