Review of a Mission-Critical, Digital System for an Air Force Project
Rich Katz (1), Rod Barto (1), and Kevin Hames (2)
(1) NASA Office of Logic Design
(2) NASA Johnson Space Center
Page 1
Introduction
• January 2003: Air Force request to the NASA Office of Logic Design to independently assess an electronics design.
• Rapid, sampled assessment performed.
• Review subject: safety of critical missile electronics containing an EEPROM-based Programmable Logic Device (PLD).
Page 2
Simplified Block Diagram
[Figure: simplified block diagram. Labels: Power, Inputs, Clock, JTAG Control, EEPROM-based PLD, … “Stuff”, Rocket Motor. Annotation: All Lines High = “Bad”.]
Is it safe?
Note: The PLD and many other devices are consumer-grade COTS.
Page 3
Safety Criteria and Assessment
• No Single Point Failure Requirement
• Requirement: Probability (Mishap) < 10⁻⁶
• Air Force Criteria
• Contractor calculated reliability orders of magnitude better than requirement
• Contractor could not show work – multiple numbers in different places.
• Did not account for all single point and common mode failures, as will be shown
• Safety has issues
Page 4
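The point about common mode failures can be illustrated with a small calculation. This is a minimal sketch, not the contractor's model: the two-channel arrangement and every numeric value except the 10⁻⁶ requirement are hypothetical, chosen only to show how a single unmodeled common-mode term can dominate an estimate built on independent failures.

```python
def mishap_probability(p_channel: float, n_channels: int, p_common: float) -> float:
    """Probability that all redundant channels fail independently, or that
    one common-mode event defeats every channel at once."""
    p_independent = p_channel ** n_channels
    return 1.0 - (1.0 - p_independent) * (1.0 - p_common)


REQUIREMENT = 1e-6   # from the slide: Probability (Mishap) < 10^-6

# HYPOTHETICAL values, for illustration only.
P_CHANNEL = 1e-4     # failure probability of one channel
N_CHANNELS = 2       # number of supposedly independent channels
P_COMMON = 1e-5      # probability of a common-mode failure

optimistic = mishap_probability(P_CHANNEL, N_CHANNELS, 0.0)
realistic = mishap_probability(P_CHANNEL, N_CHANNELS, P_COMMON)

print(f"independent failures only: {optimistic:.3e}")
print(f"with common-mode term:     {realistic:.3e}")
print("meets requirement (optimistic):", optimistic < REQUIREMENT)
print("meets requirement (realistic): ", realistic < REQUIREMENT)
```

With these assumed inputs the independent-only estimate clears the requirement by about two orders of magnitude, while the estimate that includes the common-mode term misses it.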
Some Technical Areas Examined
• Testability and JTAG
• Power Supply
• Timing Analysis
• Finite State Machines
• Proper Termination of Device Pins
• Device Configuration Retention
• Quality and Manufacturer’s position
• Synchronization and Metastable States
Page 5
IEEE JTAG 1149.1 Interface
[Figure: JTAG interface to the PLD. Annotations: TRST* line not implemented. Static: not “trying” to drive TAP Controller into TEST-LOGIC-RESET state.]
• Both devices here are consumer-grade COTS
• 54ABT8996 is available but not used
• PLD internals not tested by Built-In-Test (BIT)
Page 6
I/O Pin Structure
• Structure common to all I/Os of interest
• Common Mode Failures
• “MODE” Line
• Instruction Register
• TAP Controller
• External JTAG Control
• TCK not running with TMS=‘1’ to mitigate
[Figure: I/O element. Labels: Real Data Blocked, Q, Update Register, I/O Cell Circuitry, JTAG Circuitry.]
Page 7
IEEE JTAG 1149.1 TAP Controller and Instruction Register
[Figure: TAP Controller (State Machine) and instruction register. Labels: TCK, TRST* (optional), Shift CLK, Shift Register, TDI, TDO, Reset, Parallel Latch, Chip Control.]
Shift Register is undefined in TEST-LOGIC-RESET State
Page 8
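The mitigation mentioned earlier (running TCK with TMS held at ‘1’) relies on a property of the standard TAP state machine: from any state, at most five TCK rising edges with TMS high reach TEST-LOGIC-RESET. The sketch below encodes the IEEE 1149.1 next-state table in Python to check that property. The function names are illustrative, and the model says nothing about the specific PLD's silicon.

```python
# IEEE 1149.1 TAP controller: state -> (next state if TMS=0, next state if TMS=1)
TAP_NEXT = {
    "TEST-LOGIC-RESET": ("RUN-TEST/IDLE", "TEST-LOGIC-RESET"),
    "RUN-TEST/IDLE": ("RUN-TEST/IDLE", "SELECT-DR-SCAN"),
    "SELECT-DR-SCAN": ("CAPTURE-DR", "SELECT-IR-SCAN"),
    "CAPTURE-DR": ("SHIFT-DR", "EXIT1-DR"),
    "SHIFT-DR": ("SHIFT-DR", "EXIT1-DR"),
    "EXIT1-DR": ("PAUSE-DR", "UPDATE-DR"),
    "PAUSE-DR": ("PAUSE-DR", "EXIT2-DR"),
    "EXIT2-DR": ("SHIFT-DR", "UPDATE-DR"),
    "UPDATE-DR": ("RUN-TEST/IDLE", "SELECT-DR-SCAN"),
    "SELECT-IR-SCAN": ("CAPTURE-IR", "TEST-LOGIC-RESET"),
    "CAPTURE-IR": ("SHIFT-IR", "EXIT1-IR"),
    "SHIFT-IR": ("SHIFT-IR", "EXIT1-IR"),
    "EXIT1-IR": ("PAUSE-IR", "UPDATE-IR"),
    "PAUSE-IR": ("PAUSE-IR", "EXIT2-IR"),
    "EXIT2-IR": ("SHIFT-IR", "UPDATE-IR"),
    "UPDATE-IR": ("RUN-TEST/IDLE", "SELECT-DR-SCAN"),
}


def clock_tap(state, tms_bits):
    """Apply one TCK rising edge per TMS value and return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


def clocks_to_reset(state):
    """Count TCK edges, with TMS held at 1, needed to reach TEST-LOGIC-RESET."""
    count = 0
    while state != "TEST-LOGIC-RESET":
        state = TAP_NEXT[state][1]
        count += 1
    return count


worst_case = max(clocks_to_reset(s) for s in TAP_NEXT)
print("worst-case TCK edges with TMS=1:", worst_case)

# With TCK static no edges are applied, so a disturbed state is never corrected.
print("static TCK, starting in SHIFT-IR:", clock_tap("SHIFT-IR", []))
```

The second print is the case the review flags: with TCK static and no TRST* line, nothing drives the controller back toward TEST-LOGIC-RESET.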
Power Supply
• The manufacturer states, in their “Operating Requirements for XXXXX Devices Data Sheet”: ... Slower rise times can cause incorrect device initialization and functional failure.
• Power rise time requirements are not known to the Project, nor are they in the data sheet. It cannot be shown that the device will properly initialize and function.
Page 9
Timing Analysis
• Logic design described in VHDL and synthesized
• Examination of the tool’s output file showed:
• ** PROJECT TIMING MESSAGES **
• Warning: Found ripple clock -- warning messages and Report File information on tco, tsu, and fmax may be inaccurate
• No obvious ripple counters in the design. Contractor engineers had not examined the output file and could not explain either the apparent presence of a ripple counter or the impact, if any, of the warning message above.
Page 10
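Reviewing tool reports can be partly automated. A minimal sketch, assuming the report is plain text and that warnings begin with "Warning:" as in the excerpt quoted on the slide; the function name and the "UNREVIEWED" label are illustrative, and a real flow would read the tool's report file rather than an embedded string.

```python
import re

# Report excerpt quoted on the slide.
REPORT = """\
** PROJECT TIMING MESSAGES **
Warning: Found ripple clock -- warning messages and Report File information on tco, tsu, and fmax may be inaccurate
"""


def find_warnings(report_text):
    """Return every line that starts with 'Warning:' (case-insensitive)."""
    pattern = re.compile(r"^\s*warning:", re.IGNORECASE)
    return [line.strip() for line in report_text.splitlines() if pattern.match(line)]


warnings = find_warnings(REPORT)
for w in warnings:
    # Each warning needs an engineering disposition before timing numbers are trusted.
    print("UNREVIEWED:", w)
```

A script like this only surfaces the messages. Explaining each one remains the designer's job.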
Counter: Unused States

-- SYNCHRONIZATION CONTROLLER *************************************************
sync_ctrl: PROCESS (g_rst_l_pin, g_clk_pin)
BEGIN
  IF (g_rst_l_pin = '0') THEN -- NO GLOBAL PRESET IN PLD
    ...
  ELSIF (g_clk_pin'EVENT AND g_clk_pin = '1') THEN -- CLOCK RISING EDGE
    -- SYNCHRONIZATION COUNTER
    IF (sync_cnt_rst_l = '0') THEN -- RESET
      sync_cnt <= 0 ;
    ELSE -- INCREMENT
      sync_cnt <= (sync_cnt + 1);
    END IF;
    -- SYNCHRONIZATION COUNTER RESET
    IF (sync_cnt = 48) THEN -- RESET
      sync_cnt_rst_l <= '0';
    ELSE -- DO NOT RESET
      sync_cnt_rst_l <= '1';
    END IF;

Unused states not defined
Page 12
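What "unused states not defined" can mean in hardware is sketched below with a Python model of the two assignments in the excerpt. Several things are assumptions, not facts from the slides: the register width (the declared range of sync_cnt is not shown), wrap-around on overflow, and the post-reset values (the reset branch is elided). Both assignments read the values held before the clock edge, as VHDL signal semantics require.

```python
def step(cnt, rst_l, width):
    """One rising clock edge; next values are computed from the pre-edge values."""
    # ASSUMPTION: the synthesized counter is `width` bits wide and wraps on overflow.
    next_cnt = 0 if rst_l == 0 else (cnt + 1) % (1 << width)
    next_rst_l = 0 if cnt == 48 else 1
    return next_cnt, next_rst_l


def clocks_until_zero(cnt, rst_l, width):
    """Count clock edges until sync_cnt next reads 0."""
    clocks = 0
    while True:
        cnt, rst_l = step(cnt, rst_l, width)
        clocks += 1
        if cnt == 0:
            return clocks


# Normal operation (ASSUMED start: sync_cnt = 0, sync_cnt_rst_l = '1').
print("normal period:", clocks_until_zero(0, 1, 6))

# Upset into an unused count value; recovery depends entirely on register width.
print("from 50, 6-bit register:", clocks_until_zero(50, 1, 6))
print("from 50, 8-bit register:", clocks_until_zero(50, 1, 8))
```

Under these assumptions the counter normally cycles through counts 0 to 49, because the reset flag lags the compare by one clock. From an unused count the source defines no recovery path, so the time to rejoin the normal sequence is set by whatever width the synthesis tool chose.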
Proper Termination of Device Pins
From the data sheet:
During in-system programming, each device's VPP pin must be connected to the 5.0-V power supply. During normal device operation, the VPP pin is pulled up internally and can be connected to the 5.0-V supply or left unconnected.

The contractor has had a significant number of devices fail to program (15 out of 250, or 6%); the cause is not known. However, the manufacturer states in the data sheet:
XXXX EPLDs are fully functionally tested. Complete testing of each programmable EEPROM bit and all logic functionality ensures 100% programming yield.

There is no mechanism in the system, as designed, to verify the quantity of charge stored in the EEPROM cells. There is no provision for testing whether the logic configuration will be correct, which requires the correct state of all EEPROM cells. The functionality and safety of the system, after decades of storage, cannot be guaranteed.
Page 14
Quality and Manufacturer’s Position
XXXXXX's products are not authorized for use as critical components in life support devices or systems without the express written approval of the president of XXXXXX Corporation. As used herein:
1. Life support devices or systems are devices or systems that (a) are intended for surgical implant into the body or (b) support or sustain life and whose failure to perform when properly used in accordance with instructions for use provided in the labeling can be reasonably expected to result in a significant injury to the user.
2. A critical component is any component of a life support device or system whose failure to perform can be reasonably expected to cause the failure of the life support device or system, or to affect its safety or effectiveness.
Page 15
Other Issues Page 17
Unterminated CMOS Inputs Page 18
Unterminated RS-422 Clock
[Figure labels: CLOCK, DATA]
Page 19
Transition Time Into PLD
Reset I/F Circuit – Other inputs similar
[Figure: reset interface circuit. Label: To PLD and 10 kohm pull-up.]
Page 20
Reset Input: Transition Time
Drilling Down Into Device
PLD spec is 40 ns. tR >> 40 ns.
Page 21
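A rough worst-case check of input transition time can be done by hand or in a few lines. This is a sketch under stated assumptions: it treats the rising edge as a single-pole RC charge through the 10 kohm pull-up, for which the 10%-90% rise time is ln(9)·R·C, and it uses a hypothetical lumped node capacitance of 50 pF. The actual circuit and its capacitance are not given in the slides.

```python
import math


def rc_rise_time_10_90(r_ohms, c_farads):
    """10%-90% rise time of a single-pole RC charge: ln(9) * R * C."""
    return math.log(9.0) * r_ohms * c_farads


R_PULLUP = 10e3    # 10 kohm pull-up, from the slide
C_NODE = 50e-12    # HYPOTHETICAL lumped node capacitance (not from the slides)
T_SPEC = 40e-9     # PLD input transition-time spec, from the slide

t_rise = rc_rise_time_10_90(R_PULLUP, C_NODE)
print(f"estimated rise time: {t_rise * 1e9:.0f} ns (spec: {T_SPEC * 1e9:.0f} ns)")
print("meets spec:", t_rise <= T_SPEC)
```

With the assumed capacitance the estimate is roughly 1.1 µs, more than an order of magnitude above the 40 ns spec, consistent in direction with the measured tR >> 40 ns.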
Technical Lessons [Re-]Learned
• Violations of part manufacturers’ specifications
• Did not look at tool reports and unable to explain them.
• High level design methodology used – abstract models.
• Designers not familiar with the state machine encodings used, implication of unused states.
• Consumer grade vs. military grade devices
• Military grade available in some cases – direct substitution
• Military grade available with proper part choices
• “Upscreening” issues
• Input transition time requirements not met.
• Signal and clock terminations not properly implemented
• VCC waveform susceptibility
Page 22
Factors Contributing to Project Problems
• Original contractor group sold; moved to another state
• Few original engineers followed and continuity lost
• New contractor staff not fully cognizant of design
• Worst case analysis not performed
• Not a contractual requirement
• Contractor processes did not require it internally
• No independent analysis
Page 23
Factors Contributing to Successful Review by the Independent Assessment Team (IAT)
• Safety engineer had issues and concerns; solidly backed by his management.
• Project worked to rapidly resolve issues and concerns
• Chartered Independent Assessment Team
• Established and ensured communications and transfer of data between the design group and IAT
• Took a neutral technical position and let the IAT perform its independent assessment using its own methods.
• Contractor, Project, Safety, and IAT all had safety and mission success as primary goal.
• Consensus rapidly achieved (hours) and efforts turned to improving system.
Page 24
Conclusions
• Small, seemingly inconsequential design errors can have a major impact on safety and/or mission success
• Cannot be determined by test
• PLD (and other) device mechanisms must be well understood to construct an adequate model for analysis
• Designers never saw their “design”; shielded by abstracted models
• Hardware components
• Logic designed by software that does not know circuit criticality
• Configuration storage mechanisms in the selected reconfigurable device needlessly decrease reliability and system safety
• It is noted that the PLD was not intended to be reconfigured in system.
Page 25
Conclusions (cont’d)
• Root cause of all failures must be determined as rapidly as practical.
• “Acceptable yields” do not mean an acceptable product for critical systems.
• A number was presented for system safety.
• Not all single point and common mode failures were considered, invalidating it.
• Unfamiliar with the implication of JTAG 1149.1 maloperation.
• Contractor knew how it worked but not how it failed.
Page 26