Using Software Rules To Enhance FPGA Reliability
Chandru Mirchandani, Lockheed-Martin
September 7-9, 2005
MIRCHANDANI 1 P226-W/MAPLD2005
FPGA Fault Tolerance
• Historically realized through triple redundancy, error correcting codes and replicated elements
• The fault tolerance process is only as good as the tests run to validate its performance, e.g.
• When invalid data is not ignored due to an inherent fault in the lookup and compare sequence
• The testing was not rigorous enough
• The testing was not complete
• Lack of real estate and logic on the device precludes the ideal solution
• Make educated judgment calls on how much is acceptable and for how long
Reconfiguring FPGAs
• Replicated circuitry or triple redundancy can be achieved across different devices or on the same device
• Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance due to routing
• Replication could be used to one's advantage if only sub-sets of the circuit were replicated
• Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design
Types of Errors
• Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
• Transient error: the system recovers from corrupt data and resumes normal operation
• Permanent fault: fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
• For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
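The first-detection rule above can be sketched in a few lines of Python. This is only an illustration: the names `ErrorClassifier`, `ErrorKind` and `RECOVERY_ACTION`, and the choice to track detections per resource identifier, are assumptions and do not come from the slides or from Yu and McCluskey.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# Response per error kind, paraphrasing the two bullets above.
RECOVERY_ACTION = {
    ErrorKind.TRANSIENT: "recover from corrupt data and resume normal operation",
    ErrorKind.PERMANENT: "run fault diagnosis and choose a suitable configuration",
}


class ErrorClassifier:
    """Treat the first detected error as transient, any repeat as permanent.

    Assumption: detections are tracked per resource identifier (hypothetical).
    """

    def __init__(self) -> None:
        self._seen = set()

    def classify(self, resource_id: str) -> ErrorKind:
        if resource_id in self._seen:
            # Not the first detection for this resource: treat as permanent.
            return ErrorKind.PERMANENT
        self._seen.add(resource_id)
        return ErrorKind.TRANSIENT


classifier = ErrorClassifier()
first = classifier.classify("CLB_3")   # first detection on this resource
second = classifier.classify("CLB_3")  # repeat detection on the same resource
```

Here `first` is classified as transient and `second` as permanent, so only the second detection would trigger diagnosis and re-configuration.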
Software Reliability
• Develop Criteria for Design Objective Acceptance
• Prioritize tasks or functions in order of criticality
• Develop metrics to measure performance of tasks with respect to constraints
• Evaluate design options based on measured reliability metrics
Typical Software Options
• Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure…
[Figure: Processor 1 and Processor 2, with Application A1 (I-ary) and Application A1 (II-ary)]
Redundant Instances of Software
• Initially, detect, contain and recover from faults as soon as possible
• In the event this is not possible, allow control to be passed to the redundant instance within the reliability and availability requirements levied on the system
• Finally, include language-defined mechanisms to detect and prevent the propagation of errors
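A minimal sketch of the hand-over described above, using Python exceptions as the language-defined mechanism. All names here (`ContainedFault`, `run_with_failover`, `primary_instance`, `secondary_instance`) are hypothetical, and the checksum task is only a stand-in for a real critical function.

```python
class ContainedFault(Exception):
    """Raised by an instance that detected a fault it could not recover from."""


def run_with_failover(primary, secondary, data):
    """Run the primary instance; pass control to the redundant one on a fault."""
    try:
        return primary(data), "primary"
    except ContainedFault:
        # The fault is contained here and does not propagate to the caller;
        # control passes to the redundant instance.
        return secondary(data), "secondary"


def primary_instance(data: bytes) -> int:
    # Hypothetical task: an 8-bit checksum that rejects empty frames.
    if not data:
        raise ContainedFault("empty frame")
    return sum(data) % 256


def secondary_instance(data: bytes) -> int:
    # Redundant instance with a defined answer for the empty frame.
    return sum(data) % 256 if data else 0
```

Calling `run_with_failover(primary_instance, secondary_instance, frame)` returns the result together with the name of the instance that produced it, which makes the hand-over visible to the caller.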
Methodology
• Estimate the reliability based on instruction set and operational usage
• Re-design critical elements to decrease risk
• Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements
• Re-evaluate the reliability based on failure rate
• Factor in the Uncertainty in Evaluation
Task Times
FPGA System - Conceptual
• Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem
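Task Reliability Block Diagram
$$e^{-\gamma_h u_h \lambda_{hw_i} t} \cdot e^{-\gamma_s u_s \lambda_{sw_i} t} \cdot \left[ 1 - \left\{ 1 - e^{-(1-\gamma_h) \lambda_{shw_i} t} \cdot e^{-(1-\gamma_s) \lambda_{ssw_i} t} \right\}^{2} \right]$$
Diagram labels: AND, OR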
Definitions
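Parameters & Derivations
• Failure Intensity: $\lambda_{shw_i} = \lambda_{hw_i} \cdot u_h \cdot (1-\gamma_h)$
• Failure Intensity: $\lambda_{ssw_i} = \lambda_{sw_i} \cdot u_s \cdot (1-\gamma_s)$
• Common Cause: $\lambda_{hw_i} \cdot u_h \cdot \gamma_h$ and $\lambda_{sw_i} \cdot u_s \cdot \gamma_s$
• Execution Time $t$: $e_i \cdot t$
• $R_{SS_i}$: Subsystem Reliability
• System Reliability $R_S$: $R_{SS_1} \cdot R_{SS_2} \cdot R_{SS_3}$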
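The parameters above can be combined into a small numerical sketch. It assumes a duplex (two-instance) task, that the common-cause parts act in series with the redundant pair, and that the (1 − γ) factor enters once through the independent intensities. The function names and every numeric value below are hypothetical placeholders, not figures from the paper.

```python
import math


def task_reliability(lam_hw, lam_sw, u_h, u_s, gamma_h, gamma_s, t):
    """Reliability of one duplex task over execution time t (assumed model)."""
    # Common-cause intensities: one such failure defeats both instances.
    common = lam_hw * u_h * gamma_h + lam_sw * u_s * gamma_s
    # Independent intensities per instance (lambda_shw_i + lambda_ssw_i).
    independent = lam_hw * u_h * (1 - gamma_h) + lam_sw * u_s * (1 - gamma_s)
    r_common = math.exp(-common * t)
    r_instance = math.exp(-independent * t)
    # Two redundant instances in parallel: the pair fails only if both fail.
    r_pair = 1 - (1 - r_instance) ** 2
    return r_common * r_pair


def system_reliability(tasks, t):
    """R_S as the product of subsystem reliabilities, each run for e_i * t."""
    r_s = 1.0
    for task in tasks:
        r_s *= task_reliability(
            task["lam_hw"], task["lam_sw"], task["u_h"], task["u_s"],
            task["gamma_h"], task["gamma_s"], task["e"] * t,
        )
    return r_s


# Hypothetical parameters for the Reading, Parsing and Pre-Processing tasks.
tasks = [
    {"lam_hw": 1e-6, "lam_sw": 5e-6, "u_h": 0.3, "u_s": 0.5,
     "gamma_h": 0.01, "gamma_s": 0.5, "e": 0.2},
    {"lam_hw": 1e-6, "lam_sw": 8e-6, "u_h": 0.4, "u_s": 0.6,
     "gamma_h": 0.01, "gamma_s": 0.5, "e": 0.5},
    {"lam_hw": 1e-6, "lam_sw": 6e-6, "u_h": 0.3, "u_s": 0.4,
     "gamma_h": 0.01, "gamma_s": 0.5, "e": 0.3},
]
r_s = system_reliability(tasks, t=1000.0)
```

Setting both γ values to 1 collapses the model to a single series chain, while setting them to 0 gives a purely parallel pair, which is a quick sanity check on any chosen parameters.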
Extending the Rules
• The programmed design, be it the original duplex design (duplicated or diverse) or the option for re-configuration, will optimize whichever option is used to enhance fault tolerance
• For example, in the Reading Task, it is shown that the area usage and operational profile affect the predicted overall reliability of the FPGA-based design
• Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm performs, but the more efficiently or optimally the re-configured design performs in the event of a permanent failure
Further Extension
• Greater area usage has a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, the design and its associated code have a greater propensity for failures
• The common cause fractions used in the paper are relative numbers to illustrate the model
• With a redundancy of one, the fraction attributed to hardware common cause failure is 1 %. This implies that a common defect in the hardware, in this case the FPGA, has an equal chance of manifesting itself anywhere in the active area.
Assertions
• When the design is implemented on different devices, the hardware common cause fraction drops from 1 % to ¼ %, because common physical defects are then almost negligible and the only common effects are environmental, i.e. temperature, power and external stresses.
More Assertions
• The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. It changes very little from same device to different devices, since the implemented design is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼ % variation
• In the diverse design paradigm, the hardware dependence remains in relatively the same ratio, but the software fractions vary drastically: on the same device the common cause fraction is 50 %, and it drops to 10 % for diverse designs on different devices
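To see how these fractions move a duplex reliability estimate, the two diverse-design options can be compared numerically. This sketch assumes the hardware fractions for the diverse design match the 1 % and ¼ % quoted for the same-design case, and that common-cause failures act in series with the redundant pair. The intensities `A_HW`, `A_SW` and the mission time `T` are hypothetical placeholders chosen only to make the trend visible.

```python
import math


def duplex_reliability(a_hw, a_sw, gamma_h, gamma_s, t):
    """Duplex reliability with common-cause fractions gamma_h and gamma_s."""
    common = a_hw * gamma_h + a_sw * gamma_s
    independent = a_hw * (1 - gamma_h) + a_sw * (1 - gamma_s)
    r_instance = math.exp(-independent * t)
    return math.exp(-common * t) * (1 - (1 - r_instance) ** 2)


# Hypothetical usage-weighted failure intensities (per hour) and mission time.
A_HW = 1e-6
A_SW = 1e-5
T = 10_000.0

# (hardware fraction, software fraction) for each diverse-design option.
options = {
    "diverse design, same device": (0.01, 0.50),
    "diverse design, different devices": (0.0025, 0.10),
}

results = {
    name: duplex_reliability(A_HW, A_SW, gamma_h, gamma_s, T)
    for name, (gamma_h, gamma_s) in options.items()
}

for name, reliability in results.items():
    print(f"{name}: {reliability:.4f}")
```

Under this model, lowering either common-cause fraction can only raise the estimate, so the different-devices option comes out ahead for any positive intensities; the size of the gap depends entirely on the placeholder values.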
System Configuration Options
Results
Conclusions
• Cost and Schedule Slips
• Development Delays and Costs
• Adaptive Model
• Optimization and Design Constraints
Contact Address: chandru.j.mirchandani@lmco.com