Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data” Krishnendu Chakrabarty Department of Electrical & Computer Engineering Duke University Durham, NC
Acknowledgments • Fangming Ye, Duke University • Zhaobo Zhang, Huawei • Xinli Gu, Huawei • Sponsors: Cisco (until 2011), Huawei (since 2011)
ITC 2014 Tutorial • Tutorial 9: • TEST, DIAGNOSIS, AND ROOT-CAUSE IDENTIFICATION OF FAILURES FOR BOARDS AND SYSTEMS • Presenters: KRISHNENDU CHAKRABARTY, WILLIAM EKLOW, ZOE CONROY • The gap between working silicon and a working board/system is becoming more significant as technology scales and complexity grows. The result of this increasing gap is failures at the system level that cannot be duplicated at the component level. These failures are most often referred to as “NTFs” (No Trouble Found). The problem will only get worse as technology scales and will be compounded as new packaging techniques (SiP, SoC, 3D) extend and expand Moore’s law. This tutorial will provide detailed background on NTFs and will present DFT, test, and root-cause identification solutions at the board/system level.
Automated Diagnosis: Wishlist • Flow: failed board → functional test → automated diagnosis system → fixed board • Automated and accurate diagnosis • Reduced diagnosis and repair costs • Identify the most-effective syndromes • Accelerate product release • Self-learning: report ambiguity, develop new tests, analyze and update the current test set
What Data Should We Collect? • Pass/fail information, extent of mismatch with expected values, performance marginalities, etc. • Counter values, BIST signatures, sensor data • Example: A segment of traffic-test log • ## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch) • ……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR • …… • Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
What Can We Do With This Data? • Train machine-learning models for root-cause localization • Identify redundant syndromes (i.e., test outcome data) and do data pruning • Identify deficiencies of tests in terms of diagnostic ambiguity and provide guidance for test redesign • “Fill in the gaps” when the data is not sufficient for precise diagnosis
Key Challenges and Solutions • How to improve diagnosis accuracy? • Support-vector machines and incremental learning [Ye et al., TCAD’14] • Multiple classifiers and weighted-majority voting [Ye et al., TCAD’13] • How to speed up diagnosis? • Decision trees [Ye et al., ATS’12] • How to do diagnosis using incomplete information? • Imputation methods [Ye et al., ATS’13] • What can we learn from past diagnosis? • Root-cause and syndrome analysis [Ye et al., ETS’13] • Knowledge discovery and knowledge transfer [Ye et al., ITC’14]
Syndrome and Root-Cause • A segment of a traffic-test log • Syndromes are test outcomes parsed from the log (e.g., the ERR lines below) • Root causes are the replaced components recorded during repair, e.g., C37 • ## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch) • ……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR • …… • Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
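The parsing step can be sketched in a few lines. The regular expressions below are assumptions about the log grammar, inferred only from the excerpt shown on the slide (an 8-digit hex code in parentheses followed by an uppercase, underscore-separated syndrome name):

```python
import re

# Assumed log grammar: "(00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR"-style lines.
CODE_RE = re.compile(r"\(([0-9A-Fa-f]{8})\)")          # hex error code
NAME_RE = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+")  # UPPER_SNAKE syndrome name

def parse_syndromes(log_lines):
    """Return (code, name) pairs for every syndrome found in the log."""
    syndromes = []
    for line in log_lines:
        code = CODE_RE.search(line)
        names = NAME_RE.findall(line)
        if code and names:
            # Keep the longest underscore-separated token as the syndrome name.
            syndromes.append((code.group(1), max(names, key=len)))
    return syndromes

log = [
    "464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR",
    "Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid",
]
print(parse_syndromes(log))
```

In a real deployment the parser would be tuned to the vendor's actual log format; this sketch only shows the idea of turning free-form log lines into discrete syndrome features.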
Success Ratio • Success ratio = (number of correctly predicted root causes) / (total number of diagnosed boards) • Example: ¾ = 75% • [Figure: predicted vs. actual root causes for four boards]
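The metric is a straightforward count; the component names below are invented for illustration (C37 echoes the replaced-component example from the previous slide):

```python
def success_ratio(predicted, actual):
    """Fraction of boards whose predicted root cause matches the repair record."""
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Illustrative data: 3 of 4 predictions match, as on the slide.
predicted = ["C37", "U12", "C37", "U9"]
actual    = ["C37", "U12", "C37", "U12"]
print(success_ratio(predicted, actual))  # → 0.75
```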
Using Machine Learning (Support-Vector Machines) • A binary classifier based on statistical learning • Train the model with labeled data • Use the trained model for diagnosis • [Figure: (a) a feasible separation (Line 1); (b) the optimal separation (Line 2), which maximizes the margin]
Example • Suppose we have 6 cases (successfully debugged boards) for training • Let x1, x2, x3 be three syndromes. If the syndrome manifests itself, we record it as 1, and 0 otherwise. • Suppose that the board has two candidate root causes A and B, and we encode them as y = 1 and y = −1 • Merge the syndromes and the known root causes into matrix A = [B|C], where the left side (B) holds the syndromes and the right side (C) holds the corresponding fault classes
Example (Contd.) • SVM calculation: w1 = 1.99, w2 = 0, w3 = 0, and b = −1.00 • Therefore, the classifier is f(x) = w1x1 + w2x2 + w3x3 + b = 1.99x1 − 1.00 • Given a new failing system with syndrome [1 0 0], f(x) = 0.99 > 0. The root cause for this failing system is A. • Given another new failing system with syndrome [0 1 0], f(x) = −1.00 < 0. The root cause for this failing system is B. • [Figure: separating hyperplane between classes y = 1 and y = −1; the margin is the distance between the classifier and the support vectors]
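Plugging the slide's trained weights into the linear decision function f(x) = w·x + b reproduces both diagnoses:

```python
# Weights from the worked example: w1 = 1.99, w2 = w3 = 0, b = -1.00,
# so f(x) = 1.99*x1 - 1.00.
W = [1.99, 0.0, 0.0]
B = -1.00

def diagnose(syndromes):
    """Return root cause A if f(x) > 0, else B."""
    f = sum(w * x for w, x in zip(W, syndromes)) + B
    return "A" if f > 0 else "B"

print(diagnose([1, 0, 0]))  # f = 0.99 > 0  → root cause A
print(diagnose([0, 1, 0]))  # f = -1.00 < 0 → root cause B
```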
Results for Different Kernels Circa 2011 (Manufactured Boards) • 811 boards for training, 212 for test • Diagnostic accuracy under different kernel functions
Incremental SVMs • Initial training set → support-vector extraction → extracted support vectors • Extracted support vectors + new training cases → combined training set
Incremental Learning Flow Chart • Preparation stage: from failing boards, extract fault syndromes and repair actions as an additional training set S • Learning stage: combine S with the support vectors S* of the existing SVM model, solve the optimization problem for the combined training set S ∪ S*, and update the SVM model; repeat while more new training data arrives • Diagnosis stage: determine the root cause of new systems based on the output of the final SVM model
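The learning-stage loop can be sketched as below. This is a minimal stand-in, not the authors' solver: `train_svm` uses full-batch subgradient descent on the hinge loss for a 1-D linear classifier, the toy data are invented, and `extract_support_vectors` keeps the points on or inside the margin so only they (plus the new cases) are retrained:

```python
def train_svm(data, lam=0.01, lr=0.01, epochs=2000):
    """Train sign(w*x + b) on (x, y) pairs, y in {-1, +1}, via hinge-loss descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        gw, gb = lam * w, 0.0              # L2-regularized subgradient
        for x, y in data:
            if y * (w * x + b) < 1:        # margin violation contributes
                gw -= y * x / n
                gb -= y / n
        w, b = w - lr * gw, b - lr * gb
    return w, b

def extract_support_vectors(data, w, b, eps=0.05):
    """Keep only points on or inside the margin: y*(w*x + b) <= 1 + eps."""
    return [(x, y) for x, y in data if y * (w * x + b) <= 1 + eps]

initial = [(2.0, 1), (3.0, 1), (-2.0, -1), (-3.0, -1)]
w, b = train_svm(initial)
sv = extract_support_vectors(initial, w, b)

new_cases = [(1.5, 1), (-1.5, -1)]         # newly repaired boards
w, b = train_svm(sv + new_cases)           # retrain on the combined set only
print(all((1 if w * x + b > 0 else -1) == y for x, y in initial + new_cases))
```

The payoff shown on the next slides is training time: the combined set (support vectors + new cases) is much smaller than the full accumulated history, while the retrained model still classifies all the old cases correctly.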
Comparison of Training Time • [Chart: SVM model training time (seconds) vs. number of training cases]
Comparison of Success Rates • [Chart: SVM model success rate (percentage) vs. number of training cases]
Typical Diagnosis Systems • Number of syndromes: 1,000 per board • Diagnosis time: up to several hours per board • Often requires manual diagnosis • Challenge: how to select the useful syndromes from the complete set of syndromes
Comparison of Diagnosis Procedures • Traditional diagnosis: start diagnosis → observe ALL syndromes → predicted root cause • Decision-tree-based diagnosis: start diagnosis → observe ONE syndrome → confident about the root cause? If no, observe another syndrome; if yes, report the predicted root cause
Decision Trees • Internal nodes: can branch to two child nodes; represent syndromes • Terminal nodes: do not branch; contain class information • [Figure: example tree with syndrome nodes S1–S4 and root-cause leaves A1–A4]
Decision Trees • We may reach root cause A1 via two different test sequences • Start from the most discriminative syndrome S1 • If S1 manifests itself, we then consider syndrome S2 • If S2 manifests itself, we can determine A1 to be the root cause
Decision Trees • We may reach root cause A1 via two different test sequences • Start from the most discriminative syndrome S1 • If S1 passes, we next consider syndrome S3 • If S3 manifests itself, we then consider syndrome S4 • If S4 manifests itself, we can determine A1 to be the root cause
Training Of Decision Trees (Syndrome Identification) • Goals: rank syndromes, minimize ambiguity, reduce tree depth • Three popular splitting criteria can be used for training decision trees: Information Gain, Gini Index, Twoing
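The first two criteria are easy to compute directly. The four-board toy example below is invented; it shows a syndrome that perfectly separates two root causes, which maximizes information gain:

```python
from math import log2

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, split):
    """Entropy reduction when a binary syndrome `split` partitions the cases."""
    n = len(labels)
    yes = [l for l, s in zip(labels, split) if s]
    no  = [l for l, s in zip(labels, split) if not s]
    return (entropy(labels)
            - (len(yes) / n) * entropy(yes)
            - (len(no) / n) * entropy(no))

# Toy example: syndrome s perfectly separates root causes A and B.
labels = ["A", "A", "B", "B"]
s      = [1, 1, 0, 0]
print(gini(labels))                  # 0.5 — impure before splitting
print(information_gain(labels, s))   # 1.0 — a perfectly discriminative syndrome
```

Training greedily picks the syndrome with the best criterion value at each node, which is what makes the resulting tree start from the most discriminative syndrome.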
Diagnosis Using Decision Trees • Training Data Preparation • Extract all the fault syndromes and the repair actions from historical data • DT Architecture Design • Design inputs, outputs, splitting criterion, pruning • DT Training • Generate a tree-based predictive model and assess the performance • DT-based Diagnosis • Traverse from the root node of DTs and obtain the root cause at the leaf node
Diagnosis Using Decision Trees • Start diagnosis: observe the syndrome at the root of the DT • Adaptive diagnosis: select and observe the next syndrome based on the observation of the current syndrome • Leaf node reached? If yes, predict the root cause for the failing board
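The adaptive traversal can be sketched as follows. The tree literal reuses the S1–S4 / A1–A4 labels from the earlier slides, but its exact shape is illustrative; the key point is that `observe` is only called for the syndromes the tree actually asks for:

```python
# Internal nodes are (syndrome, subtree_if_fail, subtree_if_pass); leaves are
# root-cause strings. Illustrative shape, matching the narrated example.
TREE = ("S1",
        ("S2", "A1", "A2"),                  # S1 manifests: check S2
        ("S3", ("S4", "A1", "A4"), "A3"))    # S1 passes: check S3, then S4

def diagnose(tree, observe):
    """Traverse the tree, observing only the syndromes needed for a verdict."""
    while isinstance(tree, tuple):
        syndrome, if_fail, if_pass = tree
        tree = if_fail if observe(syndrome) else if_pass
    return tree

# Board on which syndromes S1 and S2 manifest themselves:
board = {"S1": True, "S2": True, "S3": False, "S4": False}
print(diagnose(TREE, board.__getitem__))  # → A1, after observing only S1 and S2
```

In practice `observe` would run the corresponding functional test on the board, which is why the adaptive flow needs far fewer syndromes per diagnosis than observing all of them.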
Experiments • Experiments performed on industrial boards currently in production • Tens of ASICs, hundreds of passive components • All the boards under analysis failed traffic test • A comprehensive functional test set for fault isolation, run through all components
Comparison Of Different Decision-Tree Architectures • [Chart: total number of syndromes used for diagnosis under different decision-tree architectures; decision trees reduce the total syndrome count by 5x–6x]
Comparison Of Different Decision-Tree Architectures • [Chart: average number of syndromes used per diagnosis; decision trees reduce it by 15x–18x]
Comparison Between DTs And SVMs • Success rates (SR) obtained for Board 3 • The SRs obtained by DTs are similar to those obtained by SVMs
Comparison Between DTs And ANNs • Success rates (SR) obtained for Board 3 • The SRs obtained by DTs are similar to those obtained by ANNs
Information-Theoretic Syndrome and Root-Cause Analysis for Guiding Diagnosis Analysis of diagnosis performance Feedback for guiding test improvement
Problem Statement • Lack of diagnosis-performance evaluation for individual root causes or individual syndromes • Redundant syndromes: the complete set of syndromes, merged from the sets contributed by teams 1–3, can be split into a reduced syndrome set and a redundant syndrome set
Problem Statement • Lack of diagnosis-performance evaluation for individual root causes or individual syndromes • Redundant syndromes • Ambiguous root-cause pairs: within the complete set of root causes, which root cause is hard to diagnose?
Analysis Framework • Automated diagnosis system: maps syndromes to a root-cause prediction • Syndrome analysis: minimum-redundancy maximum-relevance yields a reduced set of syndromes and a set of redundant syndromes • Root-cause analysis: class-relevance analysis (precision, recall) on synthetic boards identifies root causes with high or low ambiguity • Feedback: the test design team adds/drops tests and updates the test set
Syndrome Analysis • Problem • Which syndrome is useful for diagnosis? And, which is not? • Method • Select useful syndromes • Minimum-redundancy maximum-relevance (mRMR) method
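The mRMR selection can be sketched as a greedy search over mutual information: at each step pick the syndrome with the highest relevance to the root cause minus its average redundancy with the syndromes already selected. The toy syndrome matrix is invented: s1 is an exact copy of s0 (redundant), while s2 is weakly relevant but complementary:

```python
from math import log2

def mutual_info(a, b):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(a)
    mi = 0.0
    for x in set(a):
        for y in set(b):
            pxy = sum(1 for u, v in zip(a, b) if u == x and v == y) / n
            if pxy > 0:
                mi += pxy * log2(pxy / ((a.count(x) / n) * (b.count(y) / n)))
    return mi

def mrmr(syndromes, root_causes, k):
    """Greedy mRMR: maximize MI(s; root cause) minus mean MI(s; selected)."""
    selected, candidates = [], list(range(len(syndromes)))
    while len(selected) < k and candidates:
        def score(i):
            relevance = mutual_info(syndromes[i], root_causes)
            redundancy = (sum(mutual_info(syndromes[i], syndromes[j])
                              for j in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

causes = ["A", "A", "A", "A", "B", "B", "B", "B"]
synd = [[1, 1, 1, 0, 0, 0, 0, 0],   # s0: strongly relevant
        [1, 1, 1, 0, 0, 0, 0, 0],   # s1: redundant copy of s0
        [0, 1, 1, 1, 0, 0, 0, 1]]   # s2: weakly relevant, complementary
print(mrmr(synd, causes, 2))  # → [0, 2]: the copy s1 is skipped as redundant
```

The skipped indices (here, s1) form the redundant syndrome set that the framework feeds back to the test design team.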
Experimental Setup • Dataset • Industrial boards in high-volume production • Each board has tens of ASICs, hundreds of passive components • A comprehensive functional test run through all components
Experimental Setup • Selected diagnosis systems • Support-vector machines • Artificial neural networks • Decision trees • Weighted-majority voting
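The last of these systems combines the others. A minimal sketch of weighted-majority voting is below; the classifier weights and candidate root causes are invented for illustration:

```python
from collections import defaultdict

def weighted_majority(votes):
    """votes: (predicted_root_cause, classifier_weight) pairs.
    Returns the root cause with the largest total weight."""
    totals = defaultdict(float)
    for root_cause, weight in votes:
        totals[root_cause] += weight
    return max(totals, key=totals.get)

votes = [("A", 0.9),   # e.g., SVM vote
         ("B", 0.6),   # e.g., decision-tree vote
         ("A", 0.7),   # e.g., neural-network vote
         ("B", 0.5)]   # e.g., another engine's vote
print(weighted_majority(votes))  # → "A" (total 1.6 vs. 1.1)
```

In the referenced work the weights themselves are learned from each classifier's past accuracy; here they are fixed constants purely for the example.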
Results (Syndrome Analysis) • [Charts: syndrome-analysis results for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]
Results (Syndrome Analysis) • [Chart: syndrome-analysis results for support-vector machines]
Root-Cause Analysis • Problem • Which root cause is hard to isolate from another root cause? • Which root cause is hard to isolate from all other root causes? • Method • Screen root causes with high ambiguity • Statistical metrics (e.g., precision, recall)
Metrics for Root-Cause Analysis • TP (True Positive): root cause A predicted, root cause A actual • FP (False Positive): root cause A predicted, another root cause (B, C, …) actual • FN (False Negative): another root cause predicted, root cause A actual • TN (True Negative): another root cause predicted, another root cause actual
Metrics for Root-Cause Analysis • Precision for root cause A = TP / (TP + FP): of the boards diagnosed as A, the fraction whose actual root cause is A
Metrics for Root-Cause Analysis • Recall for root cause A = TP / (TP + FN): of the boards whose actual root cause is A, the fraction that were diagnosed as A
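Combining the definitions above, per-root-cause precision and recall can be computed directly from diagnosis records; the board outcomes below are invented for illustration:

```python
def precision_recall(predicted, actual, root_cause):
    """Per-root-cause precision and recall over a set of diagnosed boards."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == root_cause and a == root_cause)
    fp = sum(1 for p, a in pairs if p == root_cause and a != root_cause)
    fn = sum(1 for p, a in pairs if p != root_cause and a == root_cause)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative outcomes for root cause "A": 2 TP, 1 FP, 1 FN.
predicted = ["A", "A", "B", "A", "B"]
actual    = ["A", "B", "B", "A", "A"]
print(precision_recall(predicted, actual, "A"))  # precision = recall = 2/3 here
```

A root cause with low precision or low recall is exactly the "high ambiguity" case this analysis screens for, and it is flagged back to the test design team.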
Root-Cause Analysis (demo) • [Chart: predicted vs. actual root causes]