Failure Prediction in Hardware Systems
Douglas Turnbull, Neil Alldrin
CSE 221: Operating System Final Project, Fall 2003

Background
• If we can predict failure, we can take preventative action to avoid costly failures.
• System Specifications:
  • 18 hot-swappable system boards
  • 4 processors per board
  • 18 sensors per board
  • Sensors measure various temperatures and voltages
Using sensors from a high-end server, can we predict system board failures?

Sensor Logs
• Each board has an associated sensor log:
  • About every minute, the sensors are sampled and the measurements are stored in the sensor log.
  • System board failures are also recorded in the sensor log.
We need to extract a data set from these logs to represent failure events (positive examples) and normal operating conditions (negative examples). We accomplish this using a Windowing Abstraction.

Windowing Abstraction
• Sensor Window – Adjacent entries in the sensor log that are used to predict failures.
• Potential Failure Window – An example is labeled positive if a failure occurs in the potential failure window, and negative otherwise.

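The windowing step could be sketched as follows. This is a minimal illustration, not the authors' extraction code: the log layout (`LogEntry`), the parameter names `sensor_len` and `failure_len`, and the choice to place the potential failure window immediately after the sensor window are all assumptions, since the slides do not specify them.

```python
from typing import List, Sequence, Tuple

# One log entry: (sensor measurements, True if a board failure was recorded here).
# This layout is an assumption; the slides do not describe the log format.
LogEntry = Tuple[Sequence[float], bool]


def extract_examples(
    log: List[LogEntry], sensor_len: int, failure_len: int
) -> List[Tuple[List[Sequence[float]], int]]:
    """Slide a sensor window over the log and label each position.

    The label is 1 if any failure is recorded in the `failure_len` entries
    that immediately follow the sensor window (assumed placement), else 0.
    """
    examples = []
    last_start = len(log) - sensor_len - failure_len
    for start in range(last_start + 1):
        sensor_end = start + sensor_len
        sensor_window = [readings for readings, _ in log[start:sensor_end]]
        failure_window = log[sensor_end:sensor_end + failure_len]
        label = int(any(failed for _, failed in failure_window))
        examples.append((sensor_window, label))
    return examples


if __name__ == "__main__":
    # Toy log with one sensor; a failure is recorded at index 5.
    toy_log = [([float(t)], t == 5) for t in range(8)]
    for window, label in extract_examples(toy_log, sensor_len=3, failure_len=2):
        print(window, label)
```

A real pipeline would probably also subsample the negative windows, since normal operation dominates the logs; how the negatives were chosen is not stated in the slides.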
Feature Vectors
Feature vectors are created from the data in a sensor window. There are two types of feature vectors:
• Raw Feature Vectors – a vector of all the sensor measurements in a sensor window.
• Summary Feature Vectors – the mean, standard deviation, range and slope for each of the sensors in a sensor window.

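The two feature types can be illustrated with a short sketch. It assumes a sensor window is a list of samples, each sample being a list with one reading per sensor. The slides do not say which standard deviation or slope estimate was used, so the population standard deviation and a least-squares slope against the sample index are assumptions here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def raw_features(window: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate every measurement in the window, sample by sample."""
    return [value for sample in window for value in sample]


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def summary_features(window: Sequence[Sequence[float]]) -> List[float]:
    """Mean, standard deviation, range, and slope for each sensor."""
    features: List[float] = []
    for series in zip(*window):  # one time series per sensor
        features += [
            mean(series),
            pstdev(series),  # population standard deviation (assumed)
            max(series) - min(series),
            slope(series),
        ]
    return features
```

With 18 sensors per board, a summary vector always has 18 × 4 = 72 entries, while a raw vector grows with the window length (18 entries per sample).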
Classification
A classifier assigns labels (positive or negative) to novel feature vectors after it has been trained using a set of feature vectors with known labels. Many classifiers can be used, such as SVMs, Bayesian mixture models, and neural networks. We use a Radial Basis Function (RBF) network, a special form of neural network, because it is computationally efficient.

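For readers unfamiliar with RBF networks, the sketch below shows the forward pass of a generic Gaussian RBF network. The slides do not give the network's architecture or training procedure, so the Gaussian basis, the per-unit widths, the bias term, and the zero threshold are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def rbf_score(
    x: Sequence[float],
    centers: List[Sequence[float]],
    widths: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Network output: bias + sum_j w_j * exp(-||x - c_j||^2 / (2 * s_j^2))."""
    total = bias
    for center, width, weight in zip(centers, widths, weights):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        total += weight * math.exp(-sq_dist / (2.0 * width ** 2))
    return total


def classify(score: float, threshold: float = 0.0) -> int:
    """Label 1 (failure predicted) when the score exceeds the threshold."""
    return int(score > threshold)


if __name__ == "__main__":
    # Two hidden units: one centered on a "failure-like" point, one on a "normal" point.
    centers = [[0.0, 0.0], [1.0, 1.0]]
    score = rbf_score([0.1, 0.0], centers, widths=[1.0, 1.0], weights=[1.0, -1.0])
    print(classify(score))
```

One common reason such networks are cheap to train is that, once the centers and widths are fixed, the output weights can be fit by ordinary linear least squares.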
Evaluation
We must consider two rates when evaluating our prediction system.
• True Positive Rate (tpr) – A measure of our ability to correctly predict true failures.
  tpr = Correctly Predicted Failures / Total Number of True Failures
• False Positive Rate (fpr) – A measure of how often non-failures are mispredicted as failures.
  fpr = Incorrectly Predicted Failures / Total Number of Non-Failures
[Figure: confusion matrix of predictions (failure, non-failure) against ground truth (failure, non-failure)]

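The two rates follow directly from the definitions above. A minimal sketch, assuming labels are encoded as 1 for failure and 0 for non-failure (an encoding chosen here for illustration):

```python
from typing import Sequence, Tuple


def rates(truth: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (tpr, fpr) for binary labels where 1 = failure, 0 = non-failure."""
    # Correctly predicted failures.
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    # Non-failures incorrectly predicted as failures.
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    failures = sum(1 for t in truth if t == 1)
    non_failures = len(truth) - failures
    tpr = tp / failures if failures else 0.0
    fpr = fp / non_failures if non_failures else 0.0
    return tpr, fpr


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0]
    predicted = [1, 1, 0, 1, 0, 0, 0]
    print(rates(truth, predicted))  # 2 of 3 failures caught, 1 of 4 non-failures flagged
```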
Preliminary Results
Observations:
1. Summary feature vectors have lower false positive rates than raw feature vectors.
2. Window size does not seem to matter.
How can we improve these results?

Feature Subset Selection
We can further improve prediction accuracy (and reduce computation) by reducing the number of features used by our classifier. Features are selected automatically using Forward Stepwise Selection.

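Forward stepwise selection is a greedy search: start with no features and repeatedly add the single feature that most improves a score. The sketch below is generic; the `score` callback (for example, validation accuracy of a classifier trained on the chosen feature indices) and the early-stopping rule are assumptions, as the slides do not describe the selection criterion used.

```python
from typing import Callable, List, Sequence


def forward_stepwise(
    n_features: int,
    score: Callable[[Sequence[int]], float],
    max_features: int,
) -> List[int]:
    """Greedily add the feature whose inclusion gives the best score.

    `score(subset)` is a hypothetical callback that evaluates a list of
    feature indices. Selection stops at `max_features`, or earlier when no
    remaining feature improves the best score so far.
    """
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best:
            break  # no candidate helps; keep the current subset
        selected.append(top_feature)
        best = top_score
    return selected


if __name__ == "__main__":
    # Toy score: features 2, 3 and 0 help; every other feature hurts slightly.
    gain = {2: 0.5, 3: 0.3, 0: 0.1}
    toy_score = lambda subset: sum(gain.get(f, -0.05) for f in subset)
    print(forward_stepwise(5, toy_score, max_features=5))
```

Each round retrains and scores one candidate subset per remaining feature, so the cost grows roughly with the square of the number of features, which is one reason a cheap classifier is convenient here.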
Results

Best Results
We find the best prediction results with Summary Feature Vectors using 2/3 of the summary features:
• 0.87 True Positive Rate (tpr)
• 0.10 False Positive Rate (fpr)
Our data set assumes that failures and non-failures are equally likely. When one considers that there are very few failures in most hardware systems, even a low false positive rate will produce many false positives.

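The effect of rare failures can be made concrete with a little arithmetic. The sketch below applies the reported rates (tpr = 0.87, fpr = 0.10) to two made-up populations; the example counts are hypothetical and chosen only to show the trend.

```python
from typing import Tuple


def expected_alarms(
    n_failures: float, n_non_failures: float, tpr: float, fpr: float
) -> Tuple[float, float, float]:
    """Expected true alarms, false alarms, and the fraction of alarms that are real."""
    true_alarms = tpr * n_failures
    false_alarms = fpr * n_non_failures
    total = true_alarms + false_alarms
    precision = true_alarms / total if total else 0.0
    return true_alarms, false_alarms, precision


if __name__ == "__main__":
    # Balanced set, as assumed by the data set: 100 failures, 100 non-failures.
    print(expected_alarms(100, 100, tpr=0.87, fpr=0.10))
    # Hypothetical skewed set: 10 failures among 10,000 windows.
    print(expected_alarms(10, 9990, tpr=0.87, fpr=0.10))
```

On the balanced set roughly nine out of ten alarms are real. On the skewed set about 9 real alarms are buried under roughly 999 false ones, so fewer than 1% of alarms correspond to an actual failure.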
Future Work
• Implement other classifiers – SVMs, Bayesian mixture models
• Develop a larger data set with more examples of failures
• Apply the framework to other hardware systems such as personal computers
• Modify the operating system to take advantage of failure prediction:
  • Migrate processes to other system boards
  • Run diagnostic tests
  • Turn off suspect system boards
  • Back up data

The End
Questions?

RBF Network

Value of a Prediction System
The value of a prediction system can be summarized as:
Value = (benefit of predicted failure) * tpr - (cost of mispredicted failure) * fpr

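The value formula can be evaluated directly. In this sketch the benefit and cost figures are hypothetical placeholders in arbitrary units; only the formula itself and the rates 0.87 and 0.10 come from the slides.

```python
def prediction_value(benefit: float, cost: float, tpr: float, fpr: float) -> float:
    """Value = benefit of a predicted failure * tpr - cost of a misprediction * fpr."""
    return benefit * tpr - cost * fpr


def break_even_ratio(tpr: float, fpr: float) -> float:
    """Benefit/cost ratio at which the value is exactly zero (requires tpr > 0)."""
    return fpr / tpr


if __name__ == "__main__":
    # Hypothetical units: catching a failure is worth 10, a false alarm costs 1.
    print(prediction_value(benefit=10.0, cost=1.0, tpr=0.87, fpr=0.10))
    print(break_even_ratio(tpr=0.87, fpr=0.10))
```

With the reported rates the value is positive whenever the benefit of a caught failure exceeds about 0.115 times the cost of a false alarm. Note that the formula weighs the two rates equally and does not account for how rare failures are in practice.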