NASA Advisory NA-GSFC-2005-04
Application of Hitachi 1-Mbit Die Based EEPROM Technology to Space Applications
Jose M. Florez, Chief Engineer
Electrical Engineering Division
Applied Engineering and Technology Directorate
AETD EEPROM Advisory Briefing
Agenda
• EEPROM Problem Background
• Documented EEPROM Failures
• Types of EEPROM Failures
• Root Cause
• GSFC Advisory Recommendation
• GSFC Advisory Proposed Mitigation Techniques
• Status of GSFC Missions
• Conclusions
EEPROM Problem Background
• Memory retention failures in devices based on Electrically Erasable Programmable Read Only Memory (EEPROM) technology used by the aerospace industry are well documented.
• The failures center on systems based on the Hitachi HCN58C1001 1-Mbit commercial die.
• The Hitachi die is packaged by various vendors into either single chip packages or multi-chip modules. Space Electronics Inc. (SEI, subsequently purchased by Maxwell), Maxwell, and Austin Semiconductor are among the manufacturers of space-qualified EEPROM packages currently in use, or being baselined for use, by many space programs.
• These parts are used in custom in-house designs and are also integrated into commercially available products, such as the flight computer boards available from several vendors.
Documented EEPROM Failures
HN58C1001 Die Documented Failures in Aerospace Applications
Date of Failure | Project/Subsystem | Failure Location | Stage | Packaged by | Configuration | Failure
Nov. 1996 | Cassini AACS BAIL | JPL | EM Board Test | SEI * | 1-chip | Block of 64 bytes (half page)
Dec. 1997 | Mars 98 RAD6000 | SEI | Life Test | SEI * | 8-chip MCM | 7 blocks of 8 bytes (not consecutive) on one die
Apr. 1999 | FHLP RAD6000 | LMFS Manassas | Flight Board Test | SEI * | 8-chip MCM | Initially reported as page failure
2001 | A2100 | BAE Manassas | Board Test | Austin | 1-chip | Single bit write failure at -20 C, works if warmed to -10 C
Unknown | HIRDLS | BAE Manassas | Board Test | Austin | 1-chip | Two failures, one at -20 C, one at room temp., no other details
Unknown | CIBRS | BAE Manassas | Board Test | Austin | 1-chip | Failed at -20 C, no other details
2001 ? | (Classified) | TRW | --- | --- | --- | No details available
Jan. 2002, Aug. 2002 | Genesis RAD6000 | Flight | Flight | SEI * | 8-chip MCM | Two single bit read failures on two die in same MCM
Oct. 2002 | Deep Impact | Ball | Board Test | Austin | 4-chip MCM | Originally reported as single bit failure, but later as block of 48 bytes (not including original bit)
Dec. 2002 | MER NVMCAM | JPL | Breadboard Test | Maxwell | 4-chip MCM | Single bit read failure with degraded output timing
Dec. 2002 | MER NVMCAM | JPL | Breadboard Test | Maxwell | 4-chip MCM | Page failure
Dec. 2002 | MER NVMCAM | JPL | Flight Board Test | Maxwell | 4-chip MCM | Single bit?
Mar. 2003 | MER AVS (RAD6000) | KSC | Pre-launch Test | Maxwell | 8-chip MCM | Checksum error, repeatable. Multiple errors within one page
Apr. 2003 | MRO Electra | Cincinnati | Board Test | Maxwell | 4-chip MCM | Output buffers active
Nov. 2003 | ST-5 C&DH | GSFC | Board Test | SEI | 4-chip MCM | Single bit failure
(not listed) | Stereo/Secchi | Stanford | Flight Board Test | Austin | 1-chip MCM | Two page failures
(not listed) | GLAST/LAT | Stanford | Breadboard Test | Austin | 4-chip MCM | Single page failure
* SEI was purchased by Maxwell in 2000
Types of EEPROM Failures
• The apparent cause of the failures is a lack of data retention resulting from cell discharge, which has been termed “weak cell” or “weak bit”.
• Failures are manifested in one of two forms:
  • single bits programmed with “0” (charged state) read back as a “1” (discharged state), or
  • 128-byte pages appear corrupted when read.
• Oscilloscope pictures of “weak bits” reveal a variety of output voltage profiles:
  • Some traces appear normal for a charged level (“0”).
  • Others start as a charged level (“0”), and then drop toward a discharged state (“1”).
  • Still others oscillate between “0” and “1”, resulting in opportunities to sample either a one or a zero in alternating fashion.
Sample Failures
[Figure panels: Multiple Read Bit Oscillation; Single Read Bit Oscillation; Single Bit Cell Discharge]
Root Cause
• The exact root cause, or failure mechanism, has not been determined.
• “Weak cells” are believed to result from either process-induced defects or poor programming.
• It has been shown that the Hitachi die is very sensitive to program timing. “Weak cells” can be induced by poor timing and/or noise margin during programming. In at least one of the documented failures, the problem was traced to marginal design timing and eliminated by simply modifying the design to meet manufacturer specifications.
• Two factors that influence degradation of data retention characteristics are high temperature and an increase in the number of erase/write cycles.
GSFC Advisory
• Based on the number of reported Hitachi HCN58C1001 die based EEPROM failures, the occurrence of an EEPROM data retention failure in flight cannot be ruled out.
• For that reason, the AETD has generated NASA-Wide Advisory NA-GSFC-2005-04, which gives present and future programs recommended strategies to retire, or at a minimum mitigate, the risk that an EEPROM data retention failure causes a mission failure.
GSFC Advisory Recommendation
• The recommended approach to retire the inherent risks associated with the use of Hitachi HCN58C1001 based EEPROM technology in spaceflight applications is to limit its use to non-critical code applications only.
• All mission critical software functions must be stored in Programmable Read Only Memory (PROM) or another similar permanent storage technology. This includes boot code capable of performing such basic functions as executable code memory checking, basic command and telemetry capabilities to load and dump memory contents, and safe-hold mode.
• At start-up, the PROM resident boot code can verify the integrity of the executable mission code resident in other forms of memory storage before turning control over to it.
• Rewritable memory can then be used to obtain the benefits of re-programmability, such as patches or other modifications to the code via S/C or ground command.
• All new designs must be required to implement this recommendation in order to provide a low risk implementation.
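The start-up check described in this recommendation can be illustrated with a minimal sketch. It is not taken from the advisory: the `Image` layout, the use of CRC-32 as the integrity check, and the mode names `run_mission_code` and `safe_hold` are all assumptions made for illustration, and real boot code would live in PROM rather than Python.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical image layout: the code plus the CRC-32 recorded when it was loaded."""
    payload: bytes
    stored_crc: int


def image_is_intact(image: Image) -> bool:
    # Recompute the checksum over the stored code and compare with the recorded value.
    return zlib.crc32(image.payload) == image.stored_crc


def boot(image: Image) -> str:
    """Return the mode the PROM-resident boot code would select (names are illustrative)."""
    if image_is_intact(image):
        return "run_mission_code"
    # Stay in PROM code, where load/dump commands remain available to repair the image.
    return "safe_hold"
```

The point of the sketch is the ordering: control passes to rewritable memory only after the check succeeds, so a corrupted image leaves the system in code that cannot itself be corrupted by a weak bit.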
GSFC Advisory Proposed Mitigation Techniques
• Without the implementation of mitigation techniques, the use of Hitachi HCN58C1001 EEPROM technology for critical code will result in a high risk implementation.
• Starting from that baseline, the overall strategy centers on detecting an EEPROM bit error effectively and safely, so that the system can continue operating in a predictable manner from an alternate memory source while the error is corrected either from the S/C or from the ground.
GSFC Advisory Proposed Mitigation Techniques
• Since a large number of mitigation strategies are possible, each program must be assessed individually to determine its susceptibility to EEPROM failures. It will then be the responsibility of each program to determine the level of risk it is willing to accept.
GSFC Advisory Proposed Mitigation Techniques
• Mitigation strategies can be applied at three different levels:
  • System Architecture
  • Component Level
  • System Level Test
GSFC Advisory Proposed Mitigation Techniques
• System architecture plays a critical role in preventing an EEPROM failure from becoming a mission-ending event.
• The implementation of redundancy can reduce the risk of mission failure due to a single point failure.
• The design can incorporate the capability to provide Direct Memory Access (DMA) to the EEPROM independent of local controller or processor involvement.
• Multiple copies of the flight code can be stored in multiple EEPROM modules, with a hardware-based command determining from which EEPROM module the boot code will execute.
GSFC Advisory Proposed Mitigation Techniques
• Alternatively, multiple copies of the flight code can be stored in multiple EEPROM modules, with the choice of boot module made by a small block of decider code in the startup EEPROM. In this case the decider code must already be executing properly in order to verify and select the boot code EEPROM.
• The least desirable approach consists of a single EEPROM containing the boot code and a single copy of the executable code.
• Increasing the read access time of the EEPROM as much as feasible improves the opportunity to mitigate the “weak bit” effect, except when the output voltage oscillates.
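The multi-copy arrangements above can be sketched as a selection routine that tries a preferred module first and falls back to the others. This is only an illustration under stated assumptions: the `(payload, stored CRC-32)` representation of a copy, the `preferred` index standing in for the hardware-commanded choice, and the function name `select_copy` are hypothetical, and the logic is assumed to run from storage that is itself reliable.

```python
import zlib
from typing import Optional, Sequence, Tuple

# Hypothetical representation of one stored copy:
# (code bytes, CRC-32 recorded when the copy was written).
Copy = Tuple[bytes, int]


def select_copy(copies: Sequence[Copy], preferred: int = 0) -> Optional[int]:
    """Return the index of the first intact copy, trying `preferred` first.

    Returns None when no copy passes its check.
    """
    count = len(copies)
    for offset in range(count):
        index = (preferred + offset) % count
        payload, stored_crc = copies[index]
        if zlib.crc32(payload) == stored_crc:
            return index
    return None
```

A `None` result corresponds to the case where every copy is corrupted, which is where a PROM-resident safe-hold mode would take over.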
GSFC Advisory Proposed Mitigation Techniques
• The use of Error Detection and Correction (EDAC) is recommended, but it cannot be relied upon entirely to compensate for single “weak bits”. In the case of an oscillating bit failure, there is no guarantee that the EDAC circuit will operate properly with voltage transitions applied to its inputs.
• Checksum computation of the EEPROM contents must be performed at the highest possible rate. Usually, this function is performed by software operating continuously in the background, since normal processor utilization results in background cycles on the order of seconds to a few minutes.
• In applications that include the ability to rewrite the contents of the EEPROM in flight, a software write protect feature can be implemented.
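A background checksum of the kind described above is often written so that each idle slice processes only a small piece of memory. The sketch below shows that pattern; the class name `BackgroundChecker`, the `read(offset, length)` callback, the chunk size, and the choice of CRC-32 are assumptions for illustration, not details from the advisory.

```python
import zlib
from typing import Callable, Optional


class BackgroundChecker:
    """Verifies a memory region a chunk at a time, one chunk per idle slice."""

    def __init__(self, read: Callable[[int, int], bytes], size: int,
                 expected_crc: int, chunk: int = 256) -> None:
        self.read = read                  # hypothetical accessor: read(offset, length)
        self.size = size                  # number of bytes covered by the checksum
        self.expected_crc = expected_crc  # value recorded when the code was loaded
        self.chunk = chunk
        self.offset = 0
        self.running_crc = 0

    def step(self) -> Optional[bool]:
        """Process one chunk; return None mid-pass, True/False when a pass completes."""
        length = min(self.chunk, self.size - self.offset)
        data = self.read(self.offset, length)
        # zlib.crc32 accepts a running value, so the CRC can be built incrementally.
        self.running_crc = zlib.crc32(data, self.running_crc)
        self.offset += length
        if self.offset < self.size:
            return None
        passed = self.running_crc == self.expected_crc
        # Restart so the next call begins a fresh pass over the region.
        self.offset = 0
        self.running_crc = 0
        return passed
```

Calling `step()` from the idle loop keeps each invocation short, and the time to complete one full pass sets how quickly a retention failure would be noticed.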
GSFC Advisory Proposed Mitigation Techniques
• Component level mitigation strategies:
  • At the most basic level, designers must ensure that all manufacturer component specifications and recommendations are properly implemented in the design.
  • During procurement, component level screening must be geared towards improving data retention tests in order to improve the chance of identifying marginal parts.
  • At the card level, screening for “weak bits” can be accomplished by programming the EEPROM to all “0”s if possible, or to the program content otherwise, and verifying the content by reading the part as fast as possible while at the maximum acceptable temperature. As many read cycles as possible should be accumulated, and the read-only test should be looped to minimize the chance that an oscillating data bit escapes detection.
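The looped read-only test can be sketched as follows. The function name `screen_for_weak_bits` and the `read(address)` callback are hypothetical, and the sketch covers only the repeated-read logic; read speed and temperature are properties of the test setup and are not modeled here.

```python
from typing import Callable, Dict


def screen_for_weak_bits(read: Callable[[int], int], expected: bytes,
                         passes: int) -> Dict[int, int]:
    """Read every address `passes` times and count mismatches per address.

    `read` is a hypothetical accessor returning the byte at an address;
    `expected` is the programmed pattern (all zeroes, or the program content).
    """
    mismatches: Dict[int, int] = {}
    for _ in range(passes):
        for address, want in enumerate(expected):
            if read(address) != want:
                mismatches[address] = mismatches.get(address, 0) + 1
    return mismatches
```

Counting mismatches per address, rather than stopping at the first one, is a design choice made here so that an intermittently failing location stands out from a hard failure. A bit that reads correctly most of the time can pass a single sweep, which is why the number of passes matters.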
GSFC Advisory Proposed Mitigation Techniques
• System level test:
  • Screening ideally consists of writing all zeroes to the EEPROM, or the program content otherwise, and then executing power cycles and thermal cycling while verifying that the EEPROM data remains as initially programmed. Since the objective is to weed out devices with “weak bits” before launch, as many read cycles as possible should be accumulated, and the read-only test should be looped to minimize the chance that an oscillating data bit escapes detection.
Status of GSFC Missions
• GSFC missions currently in the design phase have retired the risk of EEPROM failure by storing critical flight code in PROM.
• The AETD has undertaken an effort to review all GSFC missions currently in the fabrication and test stages in order to assess the risk level involved with each implementation.
Conclusions
• Reports of Hitachi HCN58C1001 die based EEPROM data retention failures have been documented dating back to 1996.
• Despite these reports, EEPROM usage for critical flight software has continued to date, due in great part to a lack of design guidance to retire, or at least mitigate, the impact of these failures in flight missions.
• The AETD has generated a NASA-Wide Advisory to provide this long overdue guidance. The document has been released as NA-GSFC-2005-04.
• In addition to generating the advisory, the AETD has undertaken an effort to assess the susceptibility of all Goddard programs to EEPROM failure.
• All new developments are being required to retire the risk.
• The level of mitigation implemented in on-going programs is being assessed on a project-by-project basis.