12/7/2011. Steven Woody, France Antelme, Ario Bigattini, Jaseem Masood, all rights reserved.
1. 12/8/2011 Steven Woody, France Antelme, Ario Bigattini, Jaseem Masood, all rights reserved. CES 592 Telecommunications System Product Verification, Sonoma State University
Class Lecture 6:
Hardware Verification 3
Stress testing and metrics
2. Instructors
France Antelme
Ario Bigattini
Jaseem Masood
Steven Woody
Coordinator: Prof Ravi Kumar
Website: SSU-ES || CES 592
3. Elements of Hardware Verification
4. The Product Development Cycle, i.e. Stage 1 of the Product Lifecycle
5. Product Verification Phase
6. Product Verification Flow Chart
Do you adopt a single or multiple load approach?
Integration overflows, but System Test time cannot be compromised.
How could code be ported over so late? Requirements tracking is important.
How do you know which load is ready for Beta?
7. Telecom Product Verification: Product Reliability
Product Reliability is a key focus in today’s Telecom market:
The market requires very high in-service reliability, i.e. 99.999% in the Carrier Market.
Failures in the field can result in extremely high cost to the Vendor due to warranty repairs and/or replacement of large numbers of faulty units in the field, Class A recalls etc.
Loss of market space due to customers' perception of low reliability.
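To put the 99.999% figure in perspective, the following minimal sketch converts an availability percentage into a yearly downtime budget. It assumes the figure refers to service availability measured over a 365-day year; the function and constant names are illustrative and do not come from the lecture.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare "three nines", "four nines" and "five nines".
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes/year")
```

Under these assumptions, 99.999% leaves a budget of roughly 5.3 minutes of downtime per year.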
8. Telecom Product Verification: Product Reliability
Most manufacturers perform a reliability calculation on their product based on the US Department of Defense MIL-HDBK-217F specification, modified according to Telcordia SR-332 (formerly Bellcore TR-332).
This predicts a Mean Time Between Failures (MTBF) for the product, but does not actually demonstrate the time to failure or predict the failure mode.
In reality, many defense contractors, as well as Industry in general, have seen fairly wide variance between MTBF predictions and actual failures in the field.
Stress testing is designed to discover, in a short period of time, the failure modes that will occur in normal operation over the lifecycle of the product.
Reference: MIL-HDBK-217 vs. HALT/HASS
9. Telecom Product Verification: Brief introduction to Reliability theory
In regard to product reliability, there is a function defined as the hazard rate:
$$Z_d(t) = \frac{\left[\,n(t_i) - n(t_i + \Delta t_i)\,\right] / n(t_i)}{\Delta t_i} \quad \text{for } t_i < t < t_i + \Delta t_i$$
where n(t) is the number of units still surviving at time t.
The hazard rate is a measure of the instantaneous rate of failure of any given product.
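The hazard rate definition above can be evaluated directly from test data. The sketch below assumes that n(t_i) is the number of units still surviving at the start of each interval; the function name and the sample figures are hypothetical.

```python
def discrete_hazard_rates(times, survivors):
    """Compute the discrete hazard rate for each interval [t_i, t_i + dt_i).

    times     -- interval boundaries t_0 < t_1 < ... (e.g. in hours)
    survivors -- n(t_i), the number of surviving units at each boundary
    """
    if len(times) != len(survivors):
        raise ValueError("times and survivors must have the same length")
    rates = []
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        n_start = survivors[i]
        if n_start == 0:
            break  # no units left, so the hazard rate is undefined
        # Fraction of the units alive at t_i that failed during the interval.
        failed_fraction = (n_start - survivors[i + 1]) / n_start
        rates.append(failed_fraction / dt)
    return rates


if __name__ == "__main__":
    # Hypothetical data: 1000 units at t=0, 900 at 100 h, 890 at 200 h.
    print(discrete_hazard_rates([0, 100, 200], [1000, 900, 890]))
```

In this made-up data set the hazard rate drops from the first interval to the second, which is the pattern expected in the early part of a product's life.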
10. Telecom Product Verification: Product Reliability - Bathtub curve
11. Telecom Product Verification: Product Reliability theory
The theory behind Product Reliability and failure mechanisms is typically an element in most Engineering courses and is known as Reliability Theory or Reliability Engineering.
The study of reliability theory is an important tool available to any HW Product Verification team when developing a HW Test strategy.
Reference: Probabilistic Reliability: An Engineering Approach, Martin L. Shooman, McGraw-Hill
12. Telecom Product Verification: Product Reliability – Bathtub curve
There are three parts of the Hazard rate curve (also known in reliability theory as the “Bathtub curve”):
Infant Mortality or Initial failure mode: these failures occur relatively soon after the product leaves the manufacturing floor and are due to initial weaknesses or defects such as poor insulation, weak parts, bad assembly etc.
Middle life, Random failure or constant hazard rate mode: this part of the curve seems to occur when stresses due to the operating environment exceed the design capability of the equipment. It is difficult to predict deterministically when these failures will occur, and thus they are known as random failures.
Wearout or rising failure rate mode: as the product begins to reach old age, elements in the product begin to deteriorate and fail.
All products appear to exhibit a bathtub-shaped failure curve.
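The three regions can be illustrated numerically. The sketch below is not taken from the lecture: it assumes, purely for illustration, that each region can be modeled with a Weibull hazard function (shape below 1 for infant mortality, equal to 1 for random failures, above 1 for wearout) and that the three hazards add. All parameter values are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Sum of three hazards, one per region of the bathtub curve (illustrative values)."""
    infant_mortality = weibull_hazard(t, shape=0.5, scale=1_000.0)   # falling rate
    random_failures = weibull_hazard(t, shape=1.0, scale=50_000.0)   # constant rate
    wearout = weibull_hazard(t, shape=5.0, scale=80_000.0)           # rising rate
    return infant_mortality + random_failures + wearout


if __name__ == "__main__":
    for hours in (10, 1_000, 20_000, 60_000, 100_000):
        print(f"{hours:>7} h  hazard = {bathtub_hazard(hours):.6f} per hour")
```

With these assumed parameters the printed hazard is high at the start, lowest in mid-life, and rises again late in life, which is the bathtub shape described above.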
13. Telecom Product Verification: Reliability evaluation
There are two approaches to evaluating product Reliability:
Stress Testing: A small sample of EUT is subjected to a large combined load of several stresses until the first failure mode is reached. This failure is remedied and the cycle is repeated. The iteration continues until the development team deems there to be sufficient margin between the normal operating range of the product and its failure point. Reliability is thus enhanced by performing the stress tests.
Reliability Growth Demonstration Testing (RGDT) and similar approaches: A large sample of EUT is subjected to stresses within the normal operating range, and the failure rate of this sample batch is used to calculate the reliability of the product in terms of MTBF. If it is within the market requirements, no remedial action is taken. Reliability is not enhanced by this testing; the actual MTBF is merely demonstrated.
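As a rough illustration of the second approach, the sketch below computes a simple MTBF point estimate as total accumulated unit-hours divided by the number of failures, and compares it with a required value. It assumes that failed units are repaired or replaced so that every unit accumulates the full test time; the function names and figures are hypothetical, and real demonstration tests normally add statistical confidence bounds that are not shown here.

```python
def demonstrated_mtbf(units: int, hours_per_unit: float, failures: int) -> float:
    """Point estimate of MTBF: total unit-hours divided by observed failures."""
    if units <= 0 or hours_per_unit <= 0:
        raise ValueError("units and hours_per_unit must be positive")
    if failures <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return units * hours_per_unit / failures


def meets_requirement(mtbf_hours: float, required_mtbf_hours: float) -> bool:
    """True when the demonstrated MTBF is at least the market requirement."""
    return mtbf_hours >= required_mtbf_hours


if __name__ == "__main__":
    # Hypothetical test: 200 units, 1000 hours each, 4 failures observed.
    mtbf = demonstrated_mtbf(units=200, hours_per_unit=1000.0, failures=4)
    print(mtbf, meets_requirement(mtbf, required_mtbf_hours=40_000.0))
```

Note that the calculation only reports what was observed; nothing in it feeds a fix back into the design, which matches the point made above that this style of testing demonstrates rather than enhances reliability.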
14. Telecom Product Verification: Stress testing – Strategy
The central focus of Stress Testing is to increase the reliability of a product, i.e. to reduce failures in the field and increase production yields.
Stress testing is performed by applying elevated stresses to an Equipment Under Test (EUT) in order to excite, in a short time, failure modes that would ordinarily occur only after a very long period of time, i.e. those that have a very low probability of occurrence under normal operating conditions. In other words, the lifecycle from production to failure is “accelerated” to occur in a very short period of time.
Stress testing on a small test sample is also used as a substitute for testing a large sample over the normal operating range.
15. Telecom Product Verification: Hardware Stress Testing
Hardware Verification: Stress Testing:
Stress Testing requirements originated with the US Navy Procurement Department in 1979. The approach was termed Environmental Stress Screening (ESS) and was prompted by a severe reliability problem the Navy was suffering with its equipment.
There are many slightly varying versions of ESS currently in use by Industry. Some of the more common are:
HALT (Highly Accelerated Life Test)
HASS (Highly Accelerated Stress Screen)
ALT (Accelerated Life Testing)
AST (Accelerated Stress Testing)
STRIFE (STRess Including LiFE Testing) – Hewlett Packard
Not all companies apply Stress testing, but it is becoming increasingly accepted in Industry.
16. Telecom Product Verification: Hardware Stress Testing
Hardware Verification: Stress Testing:
A popular example of Stress testing is HALT/HASS testing. Gregg K. Hobbs is the main proponent of HALT and HASS testing.
Each company adapts general Stress testing principles to suit its products and experience.
Stress testing is not required by all customers in the Telecom industry, but customer-mandated stress testing requirements are becoming very common.
17. Telecom Product Verification: Hardware Stress Testing
Useful Links:
Origins of ESS and HALT:
http://www.qualitymag.com/qty/cda/articleinformation/coverstory/bnpcoverstoryitem/0,,99106,00+en-uss_01dbc.html
ALT/AST: http://rac.alionscience.com/pdf/acc.pdf
A hint at the mathematics behind Accelerated Lifecycle testing:
Accelerated life tests yield failure data fast, Test & Measurement World, 5/1/2004 (CA412942)
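The linked article is not reproduced here. As one hedged illustration of the kind of mathematics involved, a widely used model for temperature-accelerated life testing is the Arrhenius relation, in which the acceleration factor between a use temperature and a stress temperature is exp((Ea / k) * (1/T_use - 1/T_stress)), with temperatures in kelvin. The sketch below assumes this model applies; the activation energy and temperatures are illustrative values, not figures from the lecture.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev: float,
                                  use_temp_c: float,
                                  stress_temp_c: float) -> float:
    """Acceleration factor of a stress temperature relative to a use temperature."""
    t_use_k = use_temp_c + 273.15
    t_stress_k = stress_temp_c + 273.15
    if t_use_k <= 0 or t_stress_k <= 0:
        raise ValueError("temperatures must be above absolute zero")
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed values: 0.7 eV activation energy, 40 C in the field, 85 C in the chamber.
    factor = arrhenius_acceleration_factor(0.7, use_temp_c=40.0, stress_temp_c=85.0)
    print(f"acceleration factor: {factor:.1f}")
    print(f"1000 chamber hours represent about {1000 * factor:.0f} field hours")
```

The result is very sensitive to the assumed activation energy, which depends on the failure mechanism, so a figure like this is best read as an order-of-magnitude guide.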
18. Telecom Product Verification: Introduction
Hardware Verification: Stress Testing: Equipment Under Test (EUT) is subjected to external stresses to excite failures.
HALT (Highly Accelerated Life Test): the EUT is subjected to stress to determine its failure modes.
HASS (Highly Accelerated Stress Screen): the EUT is subjected to elevated stresses in order to ensure that there is high yield in production.
The point of this type of test activity is to use elevated stresses to excite, in a short period of time, failure modes that have a very low probability of occurrence. Each failure mode is remedied, and the remedy is used to drive higher yields in production and lower customer failure rates.
19. Telecom Product Verification: HALT Testing
The Highly Accelerated Life Test (HALT) is a Design Stress Test, i.e. it is applied to a product prior to release to Production.
HALT stress is generally applied to a product until it fails in some fashion. The failure is analyzed and remedied, and the product is then subjected to stress again until the next failure mode is detected. This cycle is repeated until the product is judged to be sufficiently rugged to be placed in production.
HALT stresses are considerably above normal operating parameters and generally consist of concurrent Thermal and Mechanical stresses.
HALT is designed to simulate, in a very short period of time, the stress that a product would face after many years in the standard field operating environment.
The main focus of HALT is to stimulate the Middle Life or Random failure part of the Bathtub curve in a short space of time.
20. Telecom Product Verification: Stresses applied during HALT testing
Typical stresses concurrently applied during HALT testing:
All-axis vibration
High-rate temperature cycling
Electrical stressing such as power cycling, voltage variation etc.
Humidity
These stresses are beyond the normal design operating parameters of the device under test.
21. Telecom Product Verification: Effect of doing HALT Testing
HALT has the following effects on the Bathtub curve:
It lowers the bottom part of the bathtub by removing design weaknesses and improving the margin between normal operating conditions and the stress conditions, i.e. the product is less stressed during normal operation, which means fewer random failures.
It extends the length of the Middle life, Random failure or constant hazard rate part of the Bathtub curve, so that the product has a longer life before the onset of the “wearout” phase.
22. Telecom Product Verification: HASS Testing
The Highly Accelerated Stress Screen (HASS) is a Production Stress Test, i.e. it is applied to a product during Production.
HASS assumes that a HALT test has previously been applied to the product and that remedies have been applied to the failure modes discovered.
HASS stresses generally consist of concurrent Thermal and Mechanical stresses. They are much lower than those applied during HALT, and HASS is not designed to find design failure modes.
The main focus of HASS is to stimulate the Infant mortality part of the Bathtub curve in a short space of time.
HASS is designed to excite the types of failures that arise from manufacturing process problems, e.g. cracked solder joints, and that might otherwise be discovered only after a short period of time in the field.
Highly Accelerated Stress Audit (HASA) is a variant of HASS in which only sample batches of the Production equipment are subjected to a stress screen. It is typically used where the market cannot tolerate the higher costs associated with HASS testing of every production item.
23. Telecom Product Verification: HALT/HASS Testing concept
24. Telecom Product Verification: HALT/HASS Definitions
25. Telecom Product Verification: Aim of doing HALT Testing
Clearly the aim of HALT/HASS is not to demonstrate adherence to the calculated reliability of a product.
HALT is intended to increase the margin between the actual operating range of the EUT and the point at which the equipment fails when subjected to stress beyond its normal operating range.
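The notion of margin can be made concrete with a small calculation. The sketch below tracks the thermal margin, defined here as the temperature at which the EUT failed minus the upper end of its specified operating range, across successive HALT runs. The class, the function names and all temperature values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThermalLimits:
    operating_max_c: float  # upper end of the specified operating range
    failure_point_c: float  # temperature at which the EUT failed in a HALT run


def design_margin_c(limits: ThermalLimits) -> float:
    """Margin between the failure point and the top of the operating range."""
    return limits.failure_point_c - limits.operating_max_c


def margin_history(operating_max_c: float, failure_points_c: list) -> list:
    """Margin after each HALT run, given the failure point observed in that run."""
    return [design_margin_c(ThermalLimits(operating_max_c, fp)) for fp in failure_points_c]


if __name__ == "__main__":
    # Hypothetical product rated to 55 C; failure point rises as fixes are applied.
    print(margin_history(55.0, [70.0, 85.0, 100.0]))
```

A rising sequence of margins from run to run is the outcome the HALT cycle is aiming for; the team decides when the margin is sufficient.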
26. Telecom Product Verification: Number of Samples to be tested
How many samples should be subjected to Stress Testing?
As few as possible. Stress level is substituted for sample size. Generally one sample is acceptable for each HALT run, and that single sample can likely be reused for multiple HALT runs unless it is totally destroyed by the HALT cycle.
27. Telecom Product Verification: When/How should HALT be executed
HALT should be executed before release to Production, with enough time to allow remedies to be applied and verified.
In order to execute HALT testing, the SW and HW must function well enough to run through the HALT cycle without failure.
Where in the development cycle HALT is executed is the prerogative of the Hardware development team and depends on the detailed development process followed by each team.
Generally the Product Verification team should try to execute HALT several times before release to Production so that the lowest N failure modes can be resolved.
28. Telecom Product Verification: HALT Lab
29. Telecom Product Verification: HALT chamber
30. Telecom Product Verification: HALT chamber
31. Telecom Product Verification: HALT chamber
32. Telecom Product Verification: HALT chamber
33. Telecom Product Verification: Metrics - background
Why do we need to apply metrics to our test activities?
So that we have an indication of the quality of the product before we release it to Manufacturing, i.e. the number of defects detected by Product Verification.
So that when defects are detected in the field, we can correlate them with the defects found during Product Verification. We thus have a measure of the defects that went undetected by the Product Verification team, i.e. of the quality of the Test organization, strategy etc.
So that we have a measure of which areas of the Hardware Verification cycle are most troublesome and are thus able to make iterative updates to them, i.e. continual improvement.
So that, based on metrics from previous releases, we can estimate the number of defects likely to occur in a new release and use this information to plan and schedule accordingly.
So that we can track development performance from one release to another, allowing iterative changes to be made to all processes to improve quality levels earlier in the development cycle.
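One simple way to express the second point as a number is the fraction of all known defects that Product Verification caught before release. The sketch below is an illustration under that assumption; the metric name, function name and counts are not from the lecture.

```python
def detection_effectiveness(found_in_verification: int, found_in_field: int) -> float:
    """Fraction of all known defects that were caught before release."""
    if found_in_verification < 0 or found_in_field < 0:
        raise ValueError("defect counts cannot be negative")
    total = found_in_verification + found_in_field
    if total == 0:
        raise ValueError("no defects recorded, so the ratio is undefined")
    return found_in_verification / total


if __name__ == "__main__":
    # Hypothetical release: 90 defects found in verification, 10 escaped to the field.
    print(f"{detection_effectiveness(90, 10):.0%} of known defects were caught before release")
```

Comparing this ratio from release to release gives a rough trend for how well the test strategy is working, keeping in mind that field defect counts keep growing for some time after release.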
34. Telecom Product Verification: Metrics - background
The Pareto Principle states that only a "vital few" factors are responsible for producing most of the problems. This principle can be applied to quality improvement to the extent that a great majority of problems (80%) are produced by a few key causes (20%). If we correct these few key causes, we will have a greater probability of success. From: Quality Guide - Pareto Analysis
Pareto’s principle is due to the Italian economist Vilfredo Pareto (1848–1923). While originally developed for wealth distribution in populations, it can very well be applied to testing and defect distribution. See: Vilfredo Pareto
35. Telecom Product Verification: Metrics - Pareto Analysis
Pareto Analysis
Use Pareto analysis to analyze the failures and determine which areas are likely to prove the most troublesome.
Defects must be logged on a hierarchical scale of Severity (Severity 1, 2, 3 etc.), based on the anticipated effect on the Customer and agreed between the members of the Project Team (i.e. PLM, Development, Program Management, Product Verification etc.).
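A Pareto analysis of defect data can be sketched in a few lines: sort the defect categories by count, accumulate the percentage, and report the categories needed to reach a chosen threshold. The function names, the 80% default and the sample categories below are illustrative assumptions.

```python
def pareto_table(defect_counts: dict) -> list:
    """Return (category, count, cumulative_percent) rows, largest count first."""
    total = sum(defect_counts.values())
    if total <= 0:
        raise ValueError("at least one defect is required")
    rows = []
    running = 0
    ordered = sorted(defect_counts.items(), key=lambda item: item[1], reverse=True)
    for category, count in ordered:
        running += count
        rows.append((category, count, 100.0 * running / total))
    return rows


def vital_few(defect_counts: dict, threshold_percent: float = 80.0) -> list:
    """Smallest leading set of categories whose cumulative share reaches the threshold."""
    selected = []
    for category, _count, cumulative in pareto_table(defect_counts):
        selected.append(category)
        if cumulative >= threshold_percent:
            break
    return selected


if __name__ == "__main__":
    # Hypothetical defect counts per hardware area.
    counts = {"power supply": 50, "optics": 30, "connectors": 10, "fan": 6, "labeling": 4}
    for row in pareto_table(counts):
        print(row)
    print("vital few:", vital_few(counts))
```

In this made-up data set, two of the five categories account for 80% of the defects, so those two areas would be the first candidates for corrective action.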
36. Telecom Product Verification: Metrics - Pareto Analysis
Here are a few hyperlinks to discussions about Pareto Analysis:
Quality Guide - Pareto Analysis
37. Telecom Product Verification: Metrics - Defect Tracking
Every Product Verification team must have a Defect Tracking database that is able to log defects with at least the following set of data:
HW release number or description
Severity of the Defect
Equipment Under Test down to the basic constituent sub assemblies
Area of HW test
Description of the failure mode with attached data files from the EUT
It is important that all defects found (i.e. from Physical Layer/Standards testing, Compliance Testing and Stress Testing) are logged in the database along with all supporting data.
Some teams log defects found in Unit Test as well as those found in the formal HW Verification Test phase.
This allows the Product Verification team to compile metrics for each release to gauge the performance of the Product Verification team and the test strategy.
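The minimum data set listed above maps naturally onto a record type. The sketch below shows one possible shape for such a record; the class and field names are hypothetical, and a real tracking database would carry more fields (dates, owner, status and so on).

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    """Severity levels agreed by the project team (numbering as in the slides)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class DefectRecord:
    hw_release: str          # HW release number or description
    severity: Severity       # severity of the defect
    eut: str                 # Equipment Under Test
    sub_assembly: str        # basic constituent sub-assembly of the EUT
    test_area: str           # area of HW test, e.g. "Stress Testing"
    description: str         # description of the failure mode
    data_files: List[str] = field(default_factory=list)  # attached EUT data files


if __name__ == "__main__":
    # Hypothetical example entry.
    defect = DefectRecord(
        hw_release="Rel 2.0",
        severity=Severity.SEV2,
        eut="Line card A",
        sub_assembly="DC/DC converter",
        test_area="Stress Testing",
        description="Output voltage drops out during rapid thermal cycling",
        data_files=["halt_run3_voltage.csv"],
    )
    print(defect)
```

Keeping the test area and sub-assembly as separate fields is what later makes per-area and per-card metrics straightforward to compute.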
38. Telecom Product Verification: Metrics - Defect Tracking
The metrics generated and analyzed depend on the product Development team and the Product, but the following should be considered:
Breakdown of defects by severity
Pareto Analysis of the types of Defects found on a card by card basis
Comparison of Defects found at different stages of the development cycle
Comparison of Defects from release to Release
Graphing of defects found in each area of testing
At the conclusion of each Release, the Product Verification team should analyze the results of that release in comparison with previous releases and use that data to feed back improvements into the Development Process in general and the Test Strategy in particular.
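Several of the metrics listed above reduce to counting defects along one or two dimensions. The sketch below assumes each defect is available as a (release, severity, test area) tuple; that layout, the function names and the sample data are illustrative assumptions.

```python
from collections import Counter


def defects_by_severity(defects):
    """Count defects per severity level."""
    return Counter(severity for _release, severity, _area in defects)


def defects_by_release(defects):
    """Count defects per release, for release-to-release comparison."""
    return Counter(release for release, _severity, _area in defects)


def defects_by_release_and_area(defects):
    """Count defects per (release, test area) pair, ready for graphing."""
    return Counter((release, area) for release, _severity, area in defects)


if __name__ == "__main__":
    # Hypothetical defect log: (release, severity, test area).
    log = [
        ("R1", 1, "Stress Testing"),
        ("R1", 2, "Compliance Testing"),
        ("R1", 2, "Stress Testing"),
        ("R2", 3, "Physical Layer"),
        ("R2", 2, "Stress Testing"),
    ]
    print(defects_by_severity(log))
    print(defects_by_release(log))
    print(defects_by_release_and_area(log))
```

The resulting counters can be handed to any plotting tool to produce the per-area and per-release graphs mentioned in the list.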
39. Telecom Product Verification: Metrics - Example Graphs
40. Telecom Product Verification: Testing in Production
What happens after Release to Manufacturing and large-scale production?
This is a subject of its own and is not covered in this course.
Generally Manufacturing has its own Production test suite that verifies production quality. This typically consists of a subset of the Physical Layer or Standards-based test plan.
Once a product has received Compliance/Agency approvals on a specific type of equipment, these tests are not repeated in Production. If any change is made to the design, the Independent Compliance Test Lab needs to evaluate whether the Compliance and Agency approvals have been affected and, if so, which Compliance/Agency approval tests must be repeated.
There will generally be some kind of “Burn-in” testing, in which a mild stress such as thermal cycling is applied to either all of the production equipment or a sample of it, to excite infant mortality failures caused by production process defects. Failed units are repaired and sent for retest; once they pass the Production tests, they are generally shipped to customers.
41. Telecom Product Verification: Useful References
Taylor, John R., An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, University Science Books.
Derickson, Dennis (ed.), Fiber Optic Test and Measurement, Prentice-Hall.
Agilent Technologies, Understanding Jitter and Wander Measurements and Standards, 2nd Edition.
Montrose, Mark I., EMC and the Printed Circuit Board, IEEE Press.
Montrose, Mark I., Printed Circuit Board Design Techniques for EMC Compliance, IEEE Press.
Hobbs, G., Accelerated Reliability Engineering: HALT and HASS, John Wiley & Sons, 2000.
MIL-HDBK-217, Reliability Prediction of Electronic Equipment, U.S. Department of Defense.
Bellcore TR-332, Issue 6, Reliability Prediction Procedure for Electronic Equipment, Telcordia Technologies.
Shooman, Martin L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill.
How do you know what the market needs are? Will they change by the time you are ready to deliver?
What is a better marketing strategy? First in the market or best quality?
How well understood do these requirements need to be?
Why do you need requirements? How is the organization to know if it works like the customer wants, what to test or how it is supposed to work?
Can you leverage other product implementations within the corporation?