Failure Prediction in Hardware Systems


Presentation Transcript


  1. Failure Prediction in Hardware Systems
  Douglas Turnbull, Neil Alldrin
  CSE 221: Operating Systems Final Project, Fall 2003

  2. Background
  • If we can predict failure, we can take preventative action to avoid costly failures.
  • System Specifications:
    • 18 Hot-Swappable System Boards
    • 4 Processors per Board
    • 18 Sensors per Board, measuring various temperatures and voltages
  • Question: Using the sensors from a high-end server, can we predict system board failures?

  3. Sensor Logs
  • Each board has an associated sensor log:
    • About every minute, the sensors are sampled and the measurements are stored in the sensor log.
    • System board failures are also recorded in the sensor log.
  • We need to extract a data set from these logs that represents failure events (positive examples) and normal operating conditions (negative examples). We accomplish this using a windowing abstraction.
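
  The slides do not show the actual log format, so the sketch below assumes a simple comma-separated layout (timestamp, 18 sensor readings, failure flag); the LogEntry record and the load_sensor_log helper are hypothetical names introduced here for illustration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LogEntry:
        timestamp: float        # sample time (about one sample per minute)
        sensors: List[float]    # 18 temperature/voltage readings for the board
        failure: bool           # True if a board failure was recorded at this entry

    def load_sensor_log(path: str) -> List[LogEntry]:
        """Parse a comma-separated sensor log into LogEntry records."""
        entries = []
        with open(path) as f:
            for line in f:
                fields = line.strip().split(",")
                entries.append(LogEntry(
                    timestamp=float(fields[0]),
                    sensors=[float(x) for x in fields[1:19]],
                    failure=fields[19] == "1",
                ))
        return entries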

  4. Windowing Abstraction
  • Sensor Window – adjacent entries in the sensor log that are used to predict failures.
  • Potential Failure Window – an example is labeled positive if a failure occurs in the potential failure window, and negative otherwise.
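
  A minimal sketch of the windowing abstraction, reusing the hypothetical LogEntry records from the previous sketch; the window lengths below are illustrative defaults, not the values used in the project.

    from typing import List, Tuple

    def make_examples(entries: List["LogEntry"],
                      sensor_window: int = 30,
                      failure_window: int = 60) -> List[Tuple[List["LogEntry"], bool]]:
        """Slide a sensor window over the log and label each example positive
        if a failure is recorded in the potential failure window that follows it."""
        examples = []
        last_start = len(entries) - sensor_window - failure_window
        for start in range(last_start + 1):
            window = entries[start:start + sensor_window]
            future = entries[start + sensor_window:start + sensor_window + failure_window]
            label = any(e.failure for e in future)
            examples.append((window, label))
        return examples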

  5. Feature Vectors
  Feature vectors are created from the data in a sensor window. There are two types:
  • Raw Feature Vectors – a vector of all the sensor measurements in a sensor window.
  • Summary Feature Vectors – the mean, standard deviation, range, and slope of each sensor over the sensor window.
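
  A sketch of both feature vector types for one sensor window, again using the hypothetical LogEntry records; the slope is taken as the coefficient of a least-squares line fit, which the slides do not specify.

    import numpy as np

    def raw_features(window) -> np.ndarray:
        """Raw feature vector: every sensor measurement in the window, flattened."""
        return np.array([e.sensors for e in window]).ravel()

    def summary_features(window) -> np.ndarray:
        """Summary feature vector: mean, standard deviation, range, and slope per sensor."""
        data = np.array([e.sensors for e in window])      # shape: (window length, 18)
        t = np.arange(data.shape[0])
        means = data.mean(axis=0)
        stds = data.std(axis=0)
        ranges = data.max(axis=0) - data.min(axis=0)
        slopes = np.polyfit(t, data, 1)[0]                # slope of each sensor's trend line
        return np.concatenate([means, stds, ranges, slopes])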

  6. Classification
  A classifier assigns labels (positive or negative) to novel feature vectors after it has been trained on a set of feature vectors with known labels. Many classifiers could be used, such as SVMs, Bayesian mixture models, and neural networks. We use a Radial Basis Function (RBF) network, a special form of neural network, because it is computationally efficient.
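
  The slides do not describe how the RBF network is built or trained, so the following is only a minimal sketch of the general technique under common assumptions: Gaussian hidden units placed at k-means centers, with a linear output layer fit by least squares.

    import numpy as np
    from sklearn.cluster import KMeans

    class RBFNetwork:
        """Minimal RBF network sketch: Gaussian hidden units around k-means centers,
        followed by a linear output layer fit with least squares."""

        def __init__(self, n_centers: int = 20, gamma: float = 1.0):
            self.n_centers = n_centers
            self.gamma = gamma

        def _activations(self, X: np.ndarray) -> np.ndarray:
            # Gaussian radial basis: exp(-gamma * ||x - c||^2) for each center c.
            d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
            return np.exp(-self.gamma * d2)

        def fit(self, X: np.ndarray, y: np.ndarray) -> "RBFNetwork":
            self.centers = KMeans(n_clusters=self.n_centers, n_init=10).fit(X).cluster_centers_
            H = self._activations(X)
            # Output weights via least squares against the {0, 1} labels.
            self.weights, *_ = np.linalg.lstsq(H, y.astype(float), rcond=None)
            return self

        def predict(self, X: np.ndarray) -> np.ndarray:
            return (self._activations(X) @ self.weights > 0.5).astype(int)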

  7. Evaluation
  We must consider two rates when evaluating our prediction system.
  • True Positive Rate (tpr) – a measure of our ability to correctly predict true failures.
    tpr = Correctly Predicted Failures / Total Number of True Failures
  • False Positive Rate (fpr) – a measure of the number of mispredictions.
    fpr = Incorrectly Predicted Failures / Total Number of Non-Failures
  [Confusion matrix: rows are predictions (Failure, Non-failure), columns are ground truth (Failure, Non-failure).]
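
  The two rates above reduce to a few lines of code; a sketch, assuming the predictions and ground-truth labels are given as boolean arrays where True means failure.

    import numpy as np

    def tpr_fpr(predicted: np.ndarray, actual: np.ndarray):
        """tpr = correctly predicted failures / total true failures,
        fpr = incorrectly predicted failures / total non-failures."""
        predicted = np.asarray(predicted, dtype=bool)
        actual = np.asarray(actual, dtype=bool)
        tpr = (predicted & actual).sum() / actual.sum()
        fpr = (predicted & ~actual).sum() / (~actual).sum()
        return tpr, fpr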

  8. Preliminary Results
  • Observations:
    1. Summary feature vectors have lower false positive rates than raw feature vectors.
    2. Window size does not seem to matter.
  • How can we improve these results?

  9. Feature Subset Selection
  We can further improve prediction accuracy (and reduce computation) by reducing the number of features used by our classifier. Features are selected automatically using forward stepwise selection.
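
  A sketch of generic forward stepwise selection; the score function (for example, cross-validated accuracy of the classifier on the selected columns) and the stopping rule are assumptions, since the slides do not give those details.

    def forward_stepwise_selection(X, y, score_fn, max_features=None):
        """Greedily add the single feature that most improves score_fn(X[:, subset], y),
        stopping when no candidate improves the score (or max_features is reached)."""
        n_features = X.shape[1]
        limit = max_features or n_features
        selected, best_score = [], float("-inf")
        while len(selected) < limit:
            candidates = [f for f in range(n_features) if f not in selected]
            score, feature = max((score_fn(X[:, selected + [f]], y), f) for f in candidates)
            if score <= best_score:
                break
            selected.append(feature)
            best_score = score
        return selected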

  10. Results

  11. Best Results
  We find the best prediction results with summary feature vectors, using 2/3 of the summary features:
  • 0.87 True Positive Rate (tpr)
  • 0.10 False Positive Rate (fpr)
  Our data set assumes that failures and non-failures are equally likely. Since there are very few failures in most hardware systems, even a low false positive rate will produce many false positives in practice.
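
  A worked illustration of that class-imbalance point, using a purely hypothetical base rate of one failure per 10,000 windows; the counts below are not from the project's data.

    # Hypothetical base rate, for illustration only: 1 failure per 10,000 windows.
    tpr, fpr = 0.87, 0.10
    failures, non_failures = 1, 9_999

    true_alarms = tpr * failures        # about 0.87 correctly predicted failures
    false_alarms = fpr * non_failures   # about 1,000 false alarms
    precision = true_alarms / (true_alarms + false_alarms)
    print(f"precision = {precision:.4f}")   # roughly 0.0009: most alarms are false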

  12. Future Work
  • Implement other classifiers – SVMs, Bayesian mixture models
  • Develop a larger data set with more examples of failures
  • Apply the framework to other hardware systems such as personal computers
  • Modify the operating system to take advantage of failure prediction:
    • Migrate processes to other system boards
    • Run diagnostic tests
    • Turn off suspect system boards
    • Back up data

  13. The End – Questions?

  14. RBF Network

  15. Value of a Prediction System
  The value of a prediction system can be summarized as:
  Value = (benefit of a predicted failure) * tpr – (cost of a mispredicted failure) * fpr
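
  A direct transcription of the value formula into code; the benefit and cost figures in the example call are made up, while the tpr and fpr come from slide 11.

    def prediction_value(benefit: float, cost: float, tpr: float, fpr: float) -> float:
        """Value = (benefit of a predicted failure) * tpr - (cost of a mispredicted failure) * fpr."""
        return benefit * tpr - cost * fpr

    # Illustrative figures only: the benefit and cost amounts are hypothetical.
    print(prediction_value(benefit=1000.0, cost=50.0, tpr=0.87, fpr=0.10))   # 865.0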

  16. Template
