
Reliability of Actel devices used in three programs at GSFC

Dr Henning Leidecker, Code 562, NASA/GSFC, 16 February 2005



Presentation Transcript


  1. Dr Henning Leidecker Code 562 NASA/GSFC 16 February 2005 Reliability of Actel devices used in three programs at GSFC

  2. Estimates for the success of a minimum mission: • 99.998% for ST5 --- There are three spacecraft, each using 5 Actel devices. One spacecraft has to work for at least 90 days after launch. • 98% for the Image Processor in BAT on the Swift mission. There are two boxes, each with 10 Actel devices; either box can do the job. Mission duration is two years. • 80% to 90% for Stereo. There are two spacecraft, each using 18 Actel devices in an essential manner. Both spacecraft must work for 240 days after launch.
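The redundancy arithmetic behind these three estimates can be sketched directly. A minimal sketch in Python, assuming independent device failures and a purely illustrative per-device mission reliability p = 0.999 (not a value from the slides):

```python
def series(p, n):
    """All n devices in a unit must work: P = p**n."""
    return p ** n

def any_of(p_unit, k):
    """At least one of k redundant units works: P = 1 - (1 - p_unit)**k."""
    return 1 - (1 - p_unit) ** k

# Hypothetical per-device mission reliability, for illustration only
p = 0.999

st5 = any_of(series(p, 5), 3)      # ST5: any one of 3 spacecraft, 5 devices each
swift = any_of(series(p, 10), 2)   # Swift BAT: either of 2 boxes, 10 devices each
stereo = series(p, 18) ** 2        # Stereo: BOTH spacecraft, 18 devices each
```

Even with this made-up p, the ordering matches the slide: the triple-redundant ST5 comes out most reliable, the dual-box Swift next, and Stereo, which needs both spacecraft with no redundancy, is lowest.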

  3. Definitions of 'RELIABILITY' • Reliability (black and white): a device-type is reliable when --- and only when --- it is certain to work. Otherwise it is unreliable. • Reliability (gray): the reliability of a device-type is the probability P(t) that it will work throughout the mission. • Reliability (practical): we can use the concepts of 'reliability methods' when --- and only when --- we are confident we know P(t).

  4. How do we get P(t) for the Actel devices, MEC foundry, 0.25μm technology, 'old' algorithm? • These parts do not exist as a 'defined' part until the user burns in their particular circuit. Thus, different circuits make for different parts-types, even if all start as Actel's. • User-handling has damaged these devices --- for example, ESD damage has been reported at 200 volts and below. • User programming uses a user-programmer, which may be uniquely murderous. • Different circuits, including clock speed, have essential influences on P(t). • So there is no universal P(t) valid for all Actel devices.

  5. Actel's report on failure rates: I • www.actel.com/documents/RelGuide.pdf: • High temperature operating life (HTOL): 125°C for 1,000 hours at 3.0 volts for 0.25 μm technology. • Low temperature operating life (LTOL): -55°C for 1,000 hours at 3.0 volts for 0.25 μm technology. • Failure rates in FIT (= failures per 10⁹ hr): Q1'99 & Q2'99: 145; Q3'99 & Q4'99: 74.7; Q1'00: 112.33. • www.actel.com/documents/ReliabilityReportQ304.pdf: • 0 failures in 6.35E+07 hours, for a failure rate of 14.41 FIT with a confidence of 60% for 0.25 μm MEC CMOS FPGA devices. • This implies that P(t) = exp(−t/τ) with τ = MTTF = 1/(14.41 FIT) = 69.4 Mhr.
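The conversion from a FIT rate to MTTF, and the resulting constant-failure-rate P(t), can be sketched as (the two-year mission length is chosen only for illustration):

```python
import math

def mttf_from_fit(fit):
    """MTTF in hours from a failure rate in FIT (failures per 1e9 device-hours)."""
    return 1e9 / fit

def reliability_exponential(t_hours, fit):
    """P(t) = exp(-t / MTTF) for a constant failure rate."""
    return math.exp(-t_hours / mttf_from_fit(fit))

# Actel's Q3'04 figure: 14.41 FIT at 60% confidence
tau = mttf_from_fit(14.41)                      # about 69.4 million hours
p_two_years = reliability_exponential(2 * 8760, 14.41)
print(f"MTTF = {tau / 1e6:.1f} Mhr, P(2 yr) = {p_two_years:.6f}")
```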

  6. Actel's report on failure rates: II • http://www.actel.com/documents/RelGuide.pdf • The intent of an HTOL test is to operate a device dynamically (meaning the device is powered up with I/Os and internal nodes toggling to simulate actual system use) at a high temperature (usually 125°C or 150°C) and extrapolate the failure rate to typical operating conditions. • To evaluate Actel's FPGAs, we program an actual design application into most devices (some units are burned in unprogrammed) and perform a dynamic burn-in by toggling the clock pins at 1 MHz or higher. The designs selected use 85 to 97 percent of the available logic modules and 85 to 94 percent of the I/Os.

  7. Independent Assessment Team Test---Aerospace and many others: I • A test designed by this team, run at Boeing, studied almost 1,000 Actel MEC 0.25 μm devices, programmed using various algorithms. • The design burned oscillator circuits into each device, and signalled an alarm whenever the period decreased beyond a threshold. • Also, devices were periodically removed from this test and checked using ATE: sufficient changes here would also signal an alarm.

  8. Independent Assessment Test: II • Almost 1% of these devices suffered ESD damage, despite the 'usual precautions.' This and other evidence suggests an ESD threshold as low as 200 volts and perhaps even lower. Normal wrist straps and other precautions are not enough to prevent ESD damage at this level. • More than four dozen devices signalled an alarm, despite being operated within Actel's guidelines. Each of these devices might have shown a functional failure when running some other circuit, especially a circuit that uses a fast clock. • Except for the ESD damaged devices, all oscillator circuits ran until removed from the test. In this sense, there were NO failures.

  9. Independent Assessment Test: III • Each 'alarming' device was found to contain a single antifuse that had increased its propagation delay, Tpd; this is caused by an increase in its resistance (nominally 25Ω). • The change, ΔTpd, ranges from a few nanoseconds up to a thousand nanoseconds. The distribution of 19 values collected by Rich Katz is a good fit to a log-normal distribution with a mean of 40 ns; half the values are between 13 ns and 100 ns. • Each device is removed from testing when it first 'alarms'; hence, we cannot tell whether a device that has been found to contain one antifuse with an increased Tpd is statistically more likely to contain others.
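The quoted log-normal numbers can be sanity-checked: for a log-normal, the median equals the geometric mean, so the 13 ns and 100 ns quartiles should sit roughly geometrically symmetric about the central value. A small Python check (the 0.6745 quartile z-score is a standard-normal fact, not a value from the slides):

```python
import math

# Middle half of the 19 DeltaTpd values: 13 ns to 100 ns, center ~40 ns.
geometric_center = math.sqrt(13.0 * 100.0)   # geometric midpoint of the quartiles

# For a log-normal, the quartiles sit at exp(mu +/- 0.6745 * sigma),
# so sigma of ln(DeltaTpd) follows from the quartile ratio:
sigma_ln = (math.log(100.0) - math.log(13.0)) / (2 * 0.6745)
```

The geometric midpoint works out near 36 ns, consistent (to the precision quoted) with the ~40 ns central value, which supports the log-normal description.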

  10. Independent Assessment Test: IV • Individual devices 'alarm' at times Talrm ranging from a small fraction of an hour out to thousands of hours. • This is an unusually broad distribution of times. • Thus, one or no device falls into each time-bin, and statistical fluctuations from a smooth distribution are unusually large. Careful statistical analysis is needed. • There seems to be an inverse relationship between Talrm and ΔTpd: large values of ΔTpd tend to happen at early values of Talrm, and vice versa.

  11. Independent Assessment Test: V • The 'alarm' times, Talrm, have been fit using a Weibull distribution, and also using a 2-Weibull distribution. • To fit using any single distribution (such as a Weibull) suggests that all the devices are subject to the same statistical behaviour. There is some reason to believe that the anomalous devices are a sub-population, and should be fit using one distribution, while the remaining devices should be fit using a distinct distribution.

  12. Independent Assessment Test: VI • One-Weibull fit: P[1](t | β, η) = exp[ −(t/η)^β ] • β ~ 0.16 and η ~ 8 Ghr. • Two-Weibull fit: P[2](t | β, η, ω; β', η', ω') = ω · P[1](t | β, η) + ω' · P[1](t | β', η') • β ~ 0.24, η ~ 24 hr, ω ~ 8%, and • β' ≡ 1 (exactly), η' ~ 1/(100 FIT) = 10 Mhr, ω' ~ 92%.
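Both forms can be evaluated directly; a sketch using the slide's parameters, with times in hours (the 1000-hour evaluation point is chosen only for illustration):

```python
import math

def weibull_reliability(t, beta, eta):
    """Weibull survival function: P(t) = exp[-(t/eta)**beta]."""
    return math.exp(-(t / eta) ** beta)

def two_weibull_reliability(t, beta, eta, w, beta2, eta2, w2):
    """Weighted mixture of two Weibull sub-populations (w + w2 = 1)."""
    return (w * weibull_reliability(t, beta, eta)
            + w2 * weibull_reliability(t, beta2, eta2))

# Single-Weibull fit: beta ~ 0.16, eta ~ 8 Ghr
p1 = weibull_reliability(1000.0, 0.16, 8e9)

# Two-Weibull fit: weak 8% sub-population (beta ~ 0.24, eta ~ 24 hr)
# plus a strong 92% exponential population (beta' = 1, eta' = 10 Mhr)
p2 = two_weibull_reliability(1000.0, 0.24, 24.0, 0.08, 1.0, 10e6, 0.92)
```

Note that β' ≡ 1 makes the strong sub-population exactly exponential, i.e. a constant 100 FIT failure rate, while β < 1 in the weak sub-population gives the decreasing failure rate that later slides rely on.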

  13. Independent Assessment Test: VII

  14. Independent Assessment Test: VIII

  15. Screening: I • Suppose we have P(t) for a parts-type. • Suppose we run examples for a time τ and discard failures. • Then the probability of success for the remaining parts, given this screen lasting the time τ, is P( t | τ ) = P( t ) / P( τ ) for t > τ . • This screen improves reliability when the failure rate decreases with time; has no effect for a constant failure rate; and degrades reliability for an increasing failure rate.
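The effect of such a screen under a decreasing failure rate can be illustrated with the one-Weibull fit from slide 12 (the 240-day mission and 1000-hour screen are chosen only for illustration):

```python
import math

def weibull(t, beta, eta):
    """Weibull survival function P(t) = exp[-(t/eta)**beta]."""
    return math.exp(-(t / eta) ** beta)

def screened(t, tau, beta, eta):
    """P(t | tau) = P(t) / P(tau) for t > tau: reliability of screen survivors."""
    return weibull(t, beta, eta) / weibull(tau, beta, eta)

beta, eta = 0.16, 8e9            # beta < 1: decreasing failure rate
mission = 240 * 24.0             # 240-day mission, in hours

unscreened = weibull(mission, beta, eta)
after_screen = screened(1000.0 + mission, 1000.0, beta, eta)
# With beta < 1, the survivors of the 1000-hour screen are more reliable
# over the same mission length than unscreened parts.
```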

  16. Screening: II • For all studies of Actel test data, P(t) shows a decreasing failure rate; hence, screening improves reliability. The amount of improvement after a given time depends on the details of P(t). • Since the failure rate never falls to zero, all screens, no matter how long, are 'leaky': parts may escape the screen and go on to fail in service --- this is a commonplace result. • The 2-Weibull fit predicts a given screen is more effective than the 1-Weibull fit predicts.

  17. Screening: III • How do we recognize 'failures' during a screen? • The Independent Team Testing never found P(t) for any particular user's circuits; rather, it found P(t) for the moment when the 'alarm' is triggered. • This may induce a functional failure in any given 'user circuit', or it may not. • Can we relate the available P(t) to the needs of a given user's circuit?

  18. Screening: IV • Suppose an 'alarm' event at Talrm causes a functional failure with a probability Φ, and suppose we use 'functional failures' to discard parts during a screen lasting the time τ. • P( t | τ ) = {1 − Φ · [1−P(t)]} / {1 − Φ · [1−P(τ)]} for t > τ . • Suppose Φ = 1. Then this reverts to P( t | τ ) = P( t ) / P( τ ). • Suppose Φ = 0. Then P( t | τ ) = 1. (!) • Suppose Φ = 0.5. Then P( t | τ ) improves.
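The Φ-weighted conditional reliability can be sketched directly from the slide's formula, again reusing the one-Weibull fit from slide 12 (the screen and mission times are illustrative):

```python
import math

def weibull(t, beta, eta):
    """Weibull survival function P(t) = exp[-(t/eta)**beta]."""
    return math.exp(-(t / eta) ** beta)

def screened_functional(t, tau, phi, beta, eta):
    """P(t | tau) when an 'alarm' causes a functional failure with probability
    phi, and the screen discards only functional failures."""
    p_t = weibull(t, beta, eta)
    p_tau = weibull(tau, beta, eta)
    return (1 - phi * (1 - p_t)) / (1 - phi * (1 - p_tau))

beta, eta, tau, t = 0.16, 8e9, 1000.0, 10000.0

full = screened_functional(t, tau, 1.0, beta, eta)   # phi = 1: P(t)/P(tau)
none = screened_functional(t, tau, 0.0, beta, eta)   # phi = 0: exactly 1
half = screened_functional(t, tau, 0.5, beta, eta)   # intermediate
```

The three limits on the slide fall out directly: Φ = 1 recovers the alarm-based screen, Φ = 0 makes every part (vacuously) a functional success, and Φ = 0.5 lies in between.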

  19. Screening: V • But suppose Φ increases with time? Suppose it is small during screening, so that we do not detect "bad" devices since these do not induce functional failures, and so most of these "bad" devices leak through the screen. Later, with a large value of Φ, the "bad" devices cause abundant functional failures. • Actually, it seems to be the other way around: the large values of ΔTpd tend to happen sooner, and not later. So a functional screen may be better than we might hope.

  20. Screening: VI • With no screening, the fraction of 'alarming' devices is roughly 10%. • After an alarm-based screen of 1000 hours, the remaining fraction of 'alarming' devices is roughly 2%. • After a functional-failure-based screen of 1000 hours, the fraction of devices that will go on to functionally fail is roughly 0.5% to 1%, depending on the criticality of timing for functional failures. • This makes some sense of the values Actel published on their web site, and also of such uses as MER: 64 Actel devices in use for 13 months with NO functional failures.
