Improving FPGA Design Robustness with Partial TMR
Brian Pratt (1,2); Michael Caffrey, Paul Graham (2); Eric Johnson, Keith Morgan, Michael Wirthlin (1)
(1) Brigham Young University, Department of Electrical Engineering
(2) Los Alamos National Laboratory
Motivation for Partial TMR
• Factors of fault-tolerant computing:
  • Availability
  • Reliability
  • Mitigation cost
• Full TMR
  • Expensive in terms of power, speed, area, etc.
  • Worthwhile if affordable!
[Figure: plot of MTBF against area cost, with a reliability constraint and an area constraint marked]
Motivation for Partial TMR
• Partial TMR offers:
  • Mitigation of the most sensitive design structures
  • Increased availability of a system by decreasing the number of system resets
  • Decreased mitigation cost compared with full TMR
• Suitability of partial TMR is application dependent
  • Reduced reliability compared to full TMR
Scrubbing
• Must be included with partial mitigation
• Continuously ‘read’ and ‘clean’ configuration memory
• A single bit will be upset no longer than t_s, where t_s = time for one scrub
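The read-and-clean loop can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: configuration memory is modeled as a Python list of 8-bit words, a golden copy is assumed to be available, and the names `scrub` and `inject_seu` are hypothetical (they are not part of any FPGA vendor API).

```python
import random

def inject_seu(config, rng, width=8):
    """Flip one randomly chosen bit in the (modeled) configuration memory."""
    addr = rng.randrange(len(config))
    bit = rng.randrange(width)
    config[addr] ^= 1 << bit
    return addr, bit

def scrub(config, golden):
    """One scrub pass: read every word and rewrite any that differs from
    the golden copy. Returns the number of words repaired."""
    repaired = 0
    for addr, good in enumerate(golden):
        if config[addr] != good:      # 'read' and compare
            config[addr] = good       # 'clean'
            repaired += 1
    return repaired

rng = random.Random(0)
golden = [rng.randrange(256) for _ in range(16)]  # assumed golden bitstream
config = list(golden)                             # live configuration memory

inject_seu(config, rng)           # a single upset between two scrub passes
repaired = scrub(config, golden)  # the next pass removes it
```

In this model an upset survives at most until the next call to `scrub`, which is the t_s bound stated on the slide.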
Non-Persistent Errors
• An SEU in the non-persistent cross-section will cause a temporary interruption of service
• Requires partial reconfiguration to correct
[Figure: error magnitude versus time (cycles); the output is correct again once scrubbing repairs the configuration. Error = delta between the outputs of a golden circuit and the DUT.]
Persistent Errors
• An SEU in the persistent cross-section will cause a permanent interruption of service
• Requires full system reset to correct
[Figure: error magnitude versus time (cycles); the output remains incorrect after scrubbing repairs the configuration. Error = delta between the outputs of a golden circuit and the DUT.]
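The difference between the two error classes can be reproduced with a very small simulation. This is a sketch under assumptions, not the authors' test setup: the "circuit" is a single register fed either by feed-forward logic (`2 * x`) or by an accumulator (feedback), and a configuration upset is modeled as the logic adding 1 during the cycles listed in `fault_cycles`. The function names are hypothetical.

```python
def simulate(inputs, fault_cycles, feedback):
    """Run a one-register circuit; the output each cycle is the register value."""
    state = 0
    outputs = []
    for t, x in enumerate(inputs):
        outputs.append(state)
        nxt = (state + x) if feedback else 2 * x   # feedback vs feed-forward logic
        if t in fault_cycles:
            nxt += 1          # logic is corrupted until scrubbing repairs it
        state = nxt
    return outputs

def error_trace(inputs, fault_cycles, feedback):
    """Per-cycle error = delta between golden and DUT outputs."""
    golden = simulate(inputs, frozenset(), feedback)
    dut = simulate(inputs, fault_cycles, feedback)
    return [abs(g - d) for g, d in zip(golden, dut)]

inputs = list(range(1, 11))
fault = {3, 4}   # configuration is wrong during cycles 3 and 4, then scrubbed

ff_err = error_trace(inputs, fault, feedback=False)
fb_err = error_trace(inputs, fault, feedback=True)
print(ff_err)  # [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
print(fb_err)  # [0, 0, 0, 0, 1, 2, 2, 2, 2, 2]
```

In the feed-forward case the bad value is flushed out of the register one cycle after the repair. In the feedback case the corrupted state keeps circulating, so only a reset would clear it.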
Non-Persistent Circuit Structures
• Generally consist of circuit components and routing in a feed-forward path
[Figure: circuit of Logic and FF blocks]
Persistent Circuit Structures
• Generally consist of circuit components and routing in, or contributing to, a feedback path
[Figure: circuit of Logic and FF blocks]
Partial Mitigation
• Apply a mitigation technique to just the persistent cross-section
[Figure: circuit of Logic and FF blocks, with part of the circuit marked TMR]
Limitations of Partial Mitigation
• Does not prevent all errors
• System must be corrected with configuration bitstream scrubbing
• Circuit configuration can be incorrect between scrubs
• Non-persistent errors remain
Automated Partial TMR
• Analyze an EDIF source file for feedback structures
• Protect these sections with TMR to reduce the persistent cross-section
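One common way to find feedback structures in a netlist graph is to compute strongly connected components (SCCs): any cell in an SCC with more than one member, or with a self-loop, lies on a feedback path. The slides do not say which algorithm the tool uses, so the sketch below is an assumption. It uses Tarjan's algorithm on a hypothetical netlist given as a dictionary from each cell to the cells it drives; a real flow would first parse the EDIF file into such a graph.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm (recursive form, fine for small graphs)."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:        # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

def feedback_cells(graph):
    """Cells that lie on at least one cycle of the netlist graph."""
    cells = set()
    for comp in strongly_connected_components(graph):
        if len(comp) > 1:
            cells |= comp
        else:
            (v,) = comp
            if v in graph.get(v, ()):   # single cell driving itself
                cells.add(v)
    return cells

# Hypothetical netlist: lut2 and ff2 form a loop, everything else is feed-forward.
netlist = {
    "in": ["lut1"], "lut1": ["ff1"], "ff1": ["lut2"],
    "lut2": ["ff2"], "ff2": ["lut2", "lut3"],
    "lut3": ["out"], "out": [],
}
fb = feedback_cells(netlist)   # {"lut2", "ff2"}
```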
BLTmr Partial TMR Tool
• BYU-LANL Triple Modular Redundancy: Configurable Reliability
• Limit mitigation to minimize:
  • Design resource requirements
  • Power consumption
• Mitigation focused on persistent circuit structures
BLTmr Partial TMR Tool
• Design divided into three sections:
  • Feedback, Input to FB, Output
[Figure: circuit of Logic and FF blocks]
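A possible way to derive the three sections from a netlist graph is sketched below. It assumes the set of feedback cells is already known, treats "Input to FB" as every other cell that can reach a feedback cell, and treats "Output" as everything that remains. That reading of "Output", the function name `classify`, and the example netlist are assumptions for illustration, not details confirmed by the slides.

```python
def classify(graph, feedback):
    """Label each cell as 'feedback', 'input_to_feedback' or 'output'."""
    # Reverse the edges so we can walk backwards from the feedback cells.
    preds = {}
    for src, dsts in graph.items():
        for dst in dsts:
            preds.setdefault(dst, []).append(src)

    # Every cell that can reach a feedback cell (including the feedback cells).
    reaches_fb = set(feedback)
    work = list(feedback)
    while work:
        v = work.pop()
        for p in preds.get(v, ()):
            if p not in reaches_fb:
                reaches_fb.add(p)
                work.append(p)

    all_cells = set(graph) | {d for dsts in graph.values() for d in dsts}
    sections = {}
    for cell in all_cells:
        if cell in feedback:
            sections[cell] = "feedback"
        elif cell in reaches_fb:
            sections[cell] = "input_to_feedback"
        else:
            sections[cell] = "output"
    return sections

# Hypothetical netlist with one loop (lut2 <-> ff2).
netlist = {
    "in": ["lut1"], "lut1": ["ff1"], "ff1": ["lut2"],
    "lut2": ["ff2"], "ff2": ["lut2", "lut3"],
    "lut3": ["out"], "out": [],
}
sections = classify(netlist, feedback={"lut2", "ff2"})
```

Here `in`, `lut1` and `ff1` land in the input-to-feedback section, while `lut3` and `out` fall into the output section.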
BLTmr Tool Options
• BLTmr Tool applies TMR mitigation to subsections of the design:
  • Feedback only
  • Feedback + Input to Feedback
  • FB + Input to FB + Output (full TMR)
BLTmr Tool Flow
• BYU EDIF development environment reads in user design
• Design organized into graph structure for analysis
• User may direct mitigation
• Design analyzed to classify components as described
• Circuit elements triplicated
• Voters inserted
• Mitigated design written in EDIF format
[Figure: tool flow. Original Design → Parse EDIF → Create Design Database → Analysis (Feedback, Input to FB, etc.) → Cell Triplication → Voter Insertion → Partially Mitigated Design, with User Constraints as an additional input]
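The effect of triplication plus voting can be shown behaviorally. The sketch below is not the tool's netlist transformation; it only models three copies of a combinational function followed by a bitwise majority voter, with an optional bit flip injected into one copy. `majority`, `tmr`, and `module` are hypothetical names.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)

def tmr(module, x, faulty_replica=None, flip_mask=0):
    """Evaluate three replicas of `module`, optionally corrupt one, then vote."""
    outs = [module(x) for _ in range(3)]
    if faulty_replica is not None:
        outs[faulty_replica] ^= flip_mask   # modeled upset in a single replica
    return majority(*outs)

def module(x):
    """Stand-in for a block of user logic (8-bit result)."""
    return (x * 3 + 1) & 0xFF

clean = tmr(module, 5)                                      # 16
masked = tmr(module, 5, faulty_replica=1, flip_mask=0x04)   # still 16
```

A single corrupted replica is outvoted by the other two. Two simultaneously corrupted replicas would not be, which is one reason scrubbing is still needed alongside TMR.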
Example Circuits
• Tests on two designs
  • DSP Kernel
  • Synthetic Design
    • LFSR modules feeding into an add-multiply tree
Unmitigated Fault Analysis
Design | Slices (FPGA Editor layout) | Sensitive bits (sensitivity map) | Persistent bits (persistence map)
DSP Kernel | 5,746 (46%) | 575,448 (9.9%) | 13,841 (0.24%)
Synthetic Design | 2,538 (20%) | 189,835 (3.3%) | 77,159 (1.3%)
Experimental Results – Design #1: DSP Kernel
Configuration | Slices (FPGA Editor layout) | Sensitive bits (sensitivity map) | Persistent bits (persistence map)
Unmitigated | 5,746 (46%) | 575,448 (9.90%) | 13,841 (0.24%)
Partial TMR applied to Feedback & Input to FB | 8,036 (65%) | 569,700 (9.81%) | 152 (0.0026%)
Experimental Results – Design #2: Synthetic (LFSR/Mult)
Configuration | Slices (FPGA Editor layout) | Sensitive bits (sensitivity map) | Persistent bits (persistence map)
Unmitigated | 2,538 (20%) | 189,835 (3.27%) | 77,159 (1.33%)
Full TMR applied | 11,961 (97%) | 20,256 (0.35%) | 671 (0.012%)
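The trade-off in the two result slides is easier to see as ratios. The short calculation below uses only the figures quoted on the slides. It reads the Full TMR row of the synthetic design as 20,256 sensitive bits and 671 persistent bits, on the grounds that persistent bits are a subset of sensitive bits; that reading and the dictionary layout are assumptions.

```python
# (before, after) pairs taken from the slides.
results = {
    "DSP Kernel, partial TMR (Feedback + Input to FB)": {
        "slices": (5746, 8036),
        "sensitive_bits": (575448, 569700),
        "persistent_bits": (13841, 152),
    },
    "Synthetic, full TMR": {
        "slices": (2538, 11961),
        "sensitive_bits": (189835, 20256),
        "persistent_bits": (77159, 671),
    },
}

def summarize(entry):
    """Area growth (after/before) and cross-section reductions (before/after)."""
    s0, s1 = entry["slices"]
    b0, b1 = entry["sensitive_bits"]
    p0, p1 = entry["persistent_bits"]
    return {
        "area_growth": s1 / s0,
        "sensitivity_reduction": b0 / b1,
        "persistence_reduction": p0 / p1,
    }

summary = {name: summarize(entry) for name, entry in results.items()}
for name, row in summary.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

On these numbers, partial TMR of the DSP kernel cuts persistent bits by roughly 91x for about 1.4x the slices while leaving overall sensitivity almost unchanged. Full TMR of the synthetic design cuts persistent bits by roughly 115x and sensitive bits by about 9.4x, at about 4.7x the slices.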
Experimental Results
* Full TMR could not be applied to DSP Kernel due to FPGA resource constraints
Reference: “QPro Virtex 2.5V Radiation Hardened FPGAs”, Xilinx Inc., DS028 (v1.2), Nov. 5, 2001.
Experimental Results
• GPS orbit (22,200 km altitude, 55° inclination)
• AP-8 Solar Minimum, JPL Solar Proton Quiet, CREME96 Solar Minimum
Summary of Results
* Unmitigated to Partial TMR of Feedback + Input to FB
‡ Unmitigated to Full TMR
‡‡ GPS orbit; AP-8 Solar Minimum, JPL Solar Proton Quiet, CREME96 Solar Minimum
Conclusions
• Pros: Partial TMR (BLTmr) as fault mitigation offers:
  • Increased system availability due to fewer system resets
  • More “affordable” fault mitigation than full TMR
  • Critical design areas are mitigated with an automated tool
• Cons:
  • Much of the design may be unmitigated, leaving sensitive sections
  • May result in temporary errors