Self-Checking Fault Detection using Discrepancy Mirrors. PDPTA 2005, Las Vegas. Ronald F. DeMara, Carthik A. Sharma, University of Central Florida.
Self-Checking Fault Detection using Discrepancy Mirrors • PDPTA 2005, Las Vegas • Ronald F. DeMara, Carthik A. Sharma • University of Central Florida
Fault Handling Overview • Failure • Manifestation of a fault • Deviation from expected behavior • Detection • Identify occurrence of fault • Fully articulating inputs • Intermittently articulating inputs • Methods • Coding-based schemes • Redundancy • Isolation • Physical location of fault
[Figure: PCI-based card used for the Xilinx Virtex-II Pro based autonomous repair testbed]
Ideal Detection Characteristics • Faults in the detector are covered by itself • Fault-secure • Self-testing • No “Golden Elements” • Multiple types of faults handled by same detector • Transient and Permanent faults • Logic and Interconnect faults • Minimum number of false-positives • Accuracy and reliability • Minimal power consumption • Verifiable correctness • Practical Assessment • Fitness assessment should be tractable
Discrepancy Mirror • Mechanism for Checking-the-Checker (“golden element” problem) • Makes checker part of configuration that competes for correctness [DeMara PDPTA-05]
Discrepancy Mirror Circuit [Figure: circuit diagram, labeled "Fault Coverage"]
Discrepancy Mirror Truth Table • Discrepancy Mirror Truth Table ensures complete coverage of detector. • Single Point of Failure reduced to a stuck-at fault exposure for MATCH output (Wired-Or)
Discrepancy Mirror Approach • Selection Phase • Two candidates chosen from population • Use mutually exclusive resources • Carry out computation in tandem • Detection Phase • Discrepancy Mirror compares outputs • MATCH output signifies fault free configurations • Faults in the detector also covered • Preference Adjustment Process • Detector output over time indicates relative fitness • Relative fitness can be used to choose candidates
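The three phases above can be sketched as a small simulation loop. This is only an illustrative model under stated assumptions, not the authors' implementation: the names `duel_round` and `run_duels`, the +1/-1 fitness step, and the toy fault model (a configuration outputs a wrong value whenever it uses the faulty resource) are all hypothetical.

```python
import random

def duel_round(population, fitness, evaluate, rng):
    """One selection / detection / preference-adjustment round (hypothetical sketch)."""
    # Selection phase: candidates must use mutually exclusive resources.
    disjoint_pairs = [
        (a, b) for a in population for b in population
        if a < b and not (population[a] & population[b])
    ]
    a, b = rng.choice(disjoint_pairs)
    # Detection phase: MATCH is asserted only when both outputs agree.
    match = evaluate(a) == evaluate(b)
    # Preference adjustment: assumed +1 on MATCH, -1 on discrepancy, for both.
    step = 1 if match else -1
    fitness[a] += step
    fitness[b] += step
    return a, b, match

def run_duels(population, faulty_resource, rounds, seed=0):
    """Run several rounds and return the accumulated relative fitness."""
    rng = random.Random(seed)
    fitness = {name: 0 for name in population}

    def evaluate(name):
        # Toy fault model: correct output is 0; a faulty resource flips it to 1.
        return 1 if faulty_resource in population[name] else 0

    for _ in range(rounds):
        duel_round(population, fitness, evaluate, rng)
    return fitness
```

In this toy model, a configuration that uses the faulty resource is discrepant in every pairing it joins, so its fitness falls steadily, while fault-free configurations lose fitness only when paired against it.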
CRR Arrangement in SRAM FPGA • Configurations in Population • C = CL ∪ CR • CL = subset of left-half configurations • CR = subset of right-half configurations • |CL| = |CR| = |C|/2 • Discrepancy Operator • Baseline Discrepancy Operator is a dyadic operator with binary output [operator formulas lost in extraction; the slide labels two variants, WTA: (Equivalence) and RS: (Hamming Distance)] • Z(Ci) is FPGA data throughput output of configuration Ci • Each half-configuration evaluates using embedded checker (XNOR gate) within each individual • Any fault in checker lowers that individual’s fitness so that individual is no longer preferred and eventually undergoes repair
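The slide's operator formulas are not recoverable, so the following is only a sketch of what an equivalence-based and a Hamming-distance-based discrepancy operator could look like. The function names and the convention that 0 means "no discrepancy" are assumptions; outputs Z(Ci) are modeled as plain integers holding the output bits.

```python
def equivalence_discrepancy(z_i: int, z_j: int) -> int:
    """Binary discrepancy: 0 when the two outputs are identical, 1 otherwise.

    This is the complement of what an XNOR checker reports (XNOR gives 1 on a match).
    """
    return 0 if z_i == z_j else 1

def hamming_discrepancy(z_i: int, z_j: int) -> int:
    """Graded discrepancy: number of bit positions in which the outputs differ."""
    return bin(z_i ^ z_j).count("1")
```

For example, the outputs `0b1010` and `0b0110` are discrepant under the equivalence operator and differ in two bit positions under the Hamming-distance operator.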
Overview of FPGA operation • Competing Configurations • Configurations A and B are physically distinct • CA = subset consisting of ‘A’ configurations • CB = subset consisting of ‘B’ configurations • |CA| = |CB| = |C|/2 • Discrepancy Operator • Baseline Discrepancy Operator is a dyadic operator with binary output [formula lost in extraction] • Z(Ci) is FPGA data throughput output of configuration Ci • Each half-configuration evaluates using embedded checker (XNOR gate) within each individual • Any fault in checker or functional logic lowers fitness of resources used by that individual, leading to isolation
[Figure: SRAM-based FPGA block diagram. Labels: INPUT DATA, CONFIGURATION BIT STREAM, Configuration A, Configuration B, Function Logic A, Function Logic B, Discrepancy Mirror A, Discrepancy Mirror B, OFF-CHIP EEPROM, DATA OUTPUT, FEEDBACK CONTROL, Reconfiguration Algorithm]
Note: a non-volatile memory is already required to boot any SRAM FPGA from cold start, so this is not an additional chip.
Discrepancy Mirror Schematic: CMOS • PSpice Schematic • 44 p- and n-channel MOS transistors • 1.5 micron minimum width • 600 nm length • Width of PMOS transistors = 3 × width of NMOS transistors
Discrepancy Mirror Schematic: Xilinx • Xilinx Schematic • Virtex-II Pro FPGA • ModelSim-II Simulator • Emulated (digital) pull-down resistor
Discrepancy Mirror Simulation: CMOS Circuit • Transient Response • Behavior conforms to specifications • Correct identification of discrepancy
Discrepancy Mirror Simulation: Xilinx ModelSim-II • Circuit Response • Output ‘High’ == 1 when input q1 == q2 • Output ‘Low’ when input q1 != q2 • In Xilinx FPGAs, ‘Low’ is not exactly equal to zero, but is a logic ‘zero’ nevertheless.
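The simulated response (MATCH high only when q1 == q2) can be mimicked with a small behavioral model. The slides mention XNOR checkers, tri-state buffers, a pull-down resistor, and a wired-OR MATCH output, but the exact wiring is not given in the text. The cross-coupled arrangement below (each comparator's buffer enabled by the other comparator) is therefore an assumption made for illustration, as are all function and parameter names.

```python
def xnor(a, b, stuck_at=None):
    """XNOR comparator; stuck_at (0 or 1) models a stuck-at fault on its output."""
    return stuck_at if stuck_at is not None else int(a == b)

def tristate(data, enable):
    """Tri-state buffer: passes data when enabled, else high impedance (None)."""
    return data if enable else None

def match_output(q1, q2, stuck_a=None, stuck_b=None):
    """Assumed mirror: two comparators, cross-enabled buffers, wired-OR with pull-down."""
    x_a = xnor(q1, q2, stuck_a)
    x_b = xnor(q1, q2, stuck_b)
    # Assumption: each buffer is enabled by the opposite comparator's output.
    drive_a = tristate(x_a, enable=x_b)
    drive_b = tristate(x_b, enable=x_a)
    drivers = [d for d in (drive_a, drive_b) if d is not None]
    # The pull-down yields logic 0 when no buffer drives the MATCH line high.
    return int(any(drivers))
```

Under this assumed wiring, a single stuck-at fault in either comparator can suppress MATCH but cannot produce a false MATCH when q1 != q2, which is the self-checking property the slides describe. A stuck-at fault on the MATCH line itself is not modeled here.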
Fault Location Experiments • Two experiments conducted • C-language program simulator • Locate fault by successive intersections • v-subsets or groups of resources • Fault identified after m comparisons – what is the value of m? • Identify number of iterations required to identify single-fault • Random inputs, Single stuck-at fault • Expected number of pairings over 100 simulations • One ‘resource’ equivalent to one CLB ( > 10 gates) • Experiment 1 • Perpetually articulating inputs • Experiment 2 • Intermittently articulating inputs
Fault Location Using Dueling • Let U denote the set of all logic resources on the FPGA • Let S denote the pool of resources suspected of being faulty; initially [expression lost in extraction] • [symbol lost in extraction] denotes the set of resources used by the i-th configuration • To isolate the fault, m successive intersections [expression lost in extraction] are performed, at the end of which |S| = 1 • With pre-designed partitions to achieve maximal isolation • Isolation can be completed in 2n iterations, where n = |[set lost in extraction]|
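The successive-intersection idea can be sketched as follows for the case where every pairing that touches the fault shows a discrepancy. This is a hypothetical simulation, not the authors' C-language simulator: it assumes the suspect pool starts as all resources, draws random (not pre-designed) disjoint resource subsets, and uses made-up names such as `locate_fault` and `subset_size`.

```python
import random

def locate_fault(num_resources, faulty, subset_size, rng, max_pairings=10_000):
    """Shrink the suspect pool until a single resource remains (sketch)."""
    suspects = set(range(num_resources))  # assumption: every resource starts as a suspect
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        # Two competing configurations on mutually exclusive resources.
        drawn = rng.sample(range(num_resources), 2 * subset_size)
        in_use = set(drawn)
        pairings += 1
        if faulty in in_use:
            # Discrepancy observed: the fault lies among the resources in use.
            suspects &= in_use
        else:
            # No discrepancy: with always-articulating inputs these are exonerated.
            suspects -= in_use
    return suspects, pairings
```

A call such as `locate_fault(1000, faulty=42, subset_size=250, rng=random.Random(0))` returns the remaining suspect set and the number of pairings used; `subset_size` must not exceed half of `num_resources`.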
Analysis with Perpetually Articulating Inputs • Perpetually Articulating Inputs • No observed discrepancy implies fault-free resources • Best Case (50% Utilized Capacity): • 11.1 pairings for 1,000 resources • 17.6 pairings for 100,000 resources • Most Demanding Case: • 63.7 pairings for 100,000 resources with 5% capacity utilization
Analysis with Intermittently Articulating Inputs • Intermittently Articulating Inputs • Inputs may be such that fault is not articulated at the outputs • No observed discrepancy does not imply fault-free resources • Only discrepant outputs provide fault-location information • Best Case (45% Utilized Capacity): • 42 pairings for 1,000 resources • 64.1 pairings for 100,000 resources • Most Demanding Case: • 478 pairings for 100,000 resources with 95% capacity utilization • 50% of the inputs articulate the fault
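The difference from the perpetually articulating case can be illustrated with a variant of the intersection loop in which only discrepant pairings update the suspect pool. As before, this is a hypothetical sketch with random disjoint subsets; `articulation_prob` (the chance that an input exposes the fault) and the other names are assumptions, not values or identifiers from the original experiments.

```python
import random

def locate_fault_intermittent(num_resources, faulty, subset_size,
                              articulation_prob, rng, max_pairings=100_000):
    """Fault location when a fault is only sometimes visible at the outputs (sketch)."""
    suspects = set(range(num_resources))  # assumption: every resource starts as a suspect
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        # Resources used by the two competing, mutually exclusive configurations.
        in_use = set(rng.sample(range(num_resources), 2 * subset_size))
        pairings += 1
        # The fault shows up only if it is in use AND the input articulates it.
        discrepancy = faulty in in_use and rng.random() < articulation_prob
        if discrepancy:
            suspects &= in_use
        # A matching pairing carries no location information here, so nothing
        # is exonerated and the suspect pool is left unchanged.
    return suspects, pairings
```

Because matching pairings no longer exonerate anything, this loop needs more pairings than the perpetually articulating version for the same parameters, which is the qualitative trend the slide reports.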
Experimental Results Summary • Number of iterations to detect faults depends on Utilized Capacity • Designs that utilize very few (< 20%) or almost all (> 80%) of the resources on the FPGA pose difficult isolation problems • Each intersection exonerates (or implicates) fewer individual resources • Method scales well • 11.1, 14.9, and 17.6 pairings required for 1,000, 10,000, and 100,000 resources, respectively: a sub-linear increase in location time • Current Work • Competitive Runtime Reconfiguration (CRR) framework under development, which will utilize the methods outlined • Investigation of Competitive Group Testing methods to enable faster fault isolation • Analysis of characteristics of isolation, dependency on parameters, and optimal partitioning methods
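The sub-linear scaling claim can be checked with simple arithmetic on the three reported data points. The comparison against log2 of the resource count is an added observation for illustration, not a result stated in the slides.

```python
import math

# Reported best-case results (perpetually articulating inputs, 50% utilization).
resources = [1_000, 10_000, 100_000]
pairings = [11.1, 14.9, 17.6]

# Extra pairings needed for each tenfold increase in resource count.
increments = [round(b - a, 1) for a, b in zip(pairings, pairings[1:])]

# Ratio of pairings to log2(resources), as a rough logarithmic-growth check.
ratios = [p / math.log2(n) for p, n in zip(pairings, resources)]

# Overall growth: 100x more resources versus growth in pairings.
growth = pairings[-1] / pairings[0]
```

A hundredfold increase in resources raises the pairing count by less than a factor of two, and the pairing counts stay close to log2 of the resource count, which is consistent with the slide's description of sub-linear growth.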
Accommodating Multi-bit Word Widths • Proof of concept • The present circuit works efficiently • Demonstrates important Dueling-enabled isolation method • Strategies • Use an array of detectors • Attempt to minimize points of failure as word width increases • Number of logic resources used is acceptable for smaller circuits • Create new circuit or scheme, combining fault-tolerant coding-based methods with single-fault-secure circuit • Current research focused on improving detector by investigating codes and fault-secure circuits
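The "array of detectors" strategy can be sketched as one single-bit comparison per bit position with the per-bit results combined into one word-level MATCH. The combination by logical AND and the names `bit_match` and `word_match` are assumptions for illustration; the slides do not say how the array's outputs would be merged.

```python
def bit_match(q1_bit: int, q2_bit: int) -> int:
    """Single-bit detector: 1 when the two bits agree, 0 otherwise."""
    return int(q1_bit == q2_bit)

def word_match(word_a: int, word_b: int, width: int) -> int:
    """Word-level MATCH from an array of single-bit detectors (assumed AND-combined)."""
    per_bit = [
        bit_match((word_a >> i) & 1, (word_b >> i) & 1)
        for i in range(width)
    ]
    return int(all(per_bit))
```

Note that the combining stage is itself an additional point of failure that grows with word width, which matches the concern raised in the strategies above.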
Pull-down Resistor Considerations • Proof of concept • The present circuit works in a verifiably correct manner • Can utilize synthesized (digital) pull-down resistors, which simulate the behavior of analog resistors • Demonstrates Dueling-enabled isolation method • Can be utilized without implementation problems for custom VLSI designs • Alternative Approach • Alternate detector circuits for FPGA implementation are under investigation • Avoid using tri-state buffers and pull-down resistors; use native digital components available on FPGAs instead