SAND2010-4169C
Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example
Jackson Mayo, James Brandt, Frank Chen, Vincent De Sapio, Ann Gentile, Philippe Pébay, Diana Roe, David Thompson, and Matthew Wong
Sandia National Laboratories, Livermore, CA
Acknowledgments
• This work was supported by the U.S. Department of Energy, Office of Defense Programs
• Sandia is a multiprogram laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy under contract DE-AC04-94AL85000
Overview
• OVIS project goals and techniques
• Considerations for evaluating HPC failure prediction
• Example failure mode and predictor
• Example quantification of predictor effectiveness
The OVIS project aims to discover HPC failure predictors
• Probabilistic failure prediction can enable smarter resource management and checkpointing, and extend HPC scaling
• Challenges have limited progress on failure prediction
  • Complex interactions among resources and environment
  • Scaling of data analysis to millions of observables
  • Relative sparsity of data on failures and causes
  • Need for actionable, cost-effective predictors
• OVIS (http://ovis.ca.sandia.gov) is open-source software for exploration and monitoring of large-scale data streams, e.g., from HPC sensors
• Robust, scalable infrastructure for data collection/analysis
• Analysis engines that learn statistical models from data and monitor for outliers (potential failure predictors)
  • Engines shown: correlative, graph clustering, Bayesian
• Flexible user interface for data exploration
Evaluation of HPC failure prediction confronts several challenges
• Lack of plausible failure predictors
  • Some previous studies focused on possible responses without reference to a predictor
• Lack of response cost information
  • Diverse costs may need to be estimated (downtime, hardware, labor)
• Complex temporal features of prediction and response
  • Cost of action or inaction depends on prediction timing
  • Response to an alarm (e.g., hardware replacement) can alter subsequent events
  • Historical data do not fully reveal what would have happened if alarms had been acted upon
Two general approaches offer metrics for prediction effectiveness
• Correlation of predictor with failure
  • Consider predictor as a classifier that converts available observations into a statement about future behavior
  • Simplest case: for a specific component and time frame, classifier predicts failure or non-failure (binary classification)
  • Use established metrics for classifier performance
• Cost-benefit of response driven by predictor
  • Use historical data to estimate costs of acting on predictions
  • More stringent test because even a better-than-chance predictor may not be worth acting on
  • Requires choice of response and understanding of its impact on the system; may be relatable to classifier metrics
Classifier metrics assess ability to predict failure
• Classifiers have been analyzed for signal detection, medical diagnostics, and machine learning
• Basic construct is the “receiver operating characteristic” (ROC) curve
  • Binary classifiers have an adjustable threshold separating the two possible predictions
  • Interpretation in OVIS: how extreme must an outlier be to raise an alarm?
  • Sweeping this threshold generates a tradeoff curve between false positives and false negatives
• Statistical significance of predictor can be measured
• Any definition of failure/non-failure can be used, but one motivated by costs is most relevant
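The threshold sweep described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the OVIS implementation: the names `roc_points` and `area_under_curve` and the input lists `scores` (one outlier score per case) and `failed` (whether that case ended in failure) are assumptions made for the example. It plots true-positive rate, which is one minus the false-negative rate, against false-positive rate.

```python
from typing import List, Tuple


def roc_points(scores: List[float], failed: List[bool]) -> List[Tuple[float, float]]:
    """Sweep the alarm threshold from highest (never alarm) to lowest (always alarm).

    Returns (false_positive_rate, true_positive_rate) pairs, one per distinct score.
    """
    n_pos = sum(1 for f in failed if f)
    n_neg = len(failed) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one failure and one non-failure")

    # Highest scores are alarmed first as the threshold is lowered.
    pairs = sorted(zip(scores, failed), key=lambda p: -p[0])
    points = [(0.0, 0.0)]  # threshold above every score: no alarms at all
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Cases with equal scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve; 0.5 corresponds to chance."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For example, `area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))` gives 0.75, because three of the four failure/non-failure pairs are ranked correctly.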
Cost metrics assess ability to benefit from failure prediction
• Given a predictor and a response, evaluate the net cost of using them versus not (or versus others)
  • Historical data alone may not answer this counterfactual
  • Alternatives are real-world trials and dynamical models
• Classifier thresholds are subject to cost optimization
  • ROC curves allow reading off simple cost functions: constant cost per false positive and per false negative
• Realistic costs may not match such binary labels
  • Is the cost-benefit of an alarm really governed by whether a failure occurs in the next N minutes?
  • If costs are available for each historical event, they can be used to optimize thresholds directly
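For the simple cost function named above (a constant cost per false positive and per false negative), threshold optimization reduces to a direct search over candidate thresholds. The sketch below is illustrative only: `best_threshold`, `cost_fp`, and `cost_fn` are hypothetical names, and the rule "alarm when the score exceeds the threshold" is an assumption chosen to match the wording used for the example predictor.

```python
from typing import List, Tuple


def best_threshold(scores: List[float], failed: List[bool],
                   cost_fp: float, cost_fn: float) -> Tuple[float, float]:
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN.

    An alarm is raised when a score strictly exceeds the threshold.
    Candidates are -inf (always alarm) and every observed score; the largest
    observed score as threshold means no alarm is ever raised.
    """
    candidates = [float("-inf")] + sorted(set(scores))
    best = None
    for t in candidates:
        false_pos = sum(1 for s, f in zip(scores, failed) if s > t and not f)
        false_neg = sum(1 for s, f in zip(scores, failed) if s <= t and f)
        cost = cost_fp * false_pos + cost_fn * false_neg
        # Strict comparison keeps the lowest threshold among equal-cost ties.
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

When a missed failure is expensive relative to a false alarm, the search settles on a low threshold; reversing the cost ratio pushes the optimum toward fewer alarms. If per-event costs are known, the two constants could be replaced by per-event sums inside the loop.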
Real-world failure predictor illustrates evaluation issues
• Out of memory (OOM) condition has been a cause of job failure on Sandia’s Glory cluster
• Failure predictor: abnormally high memory usage during idle time (detectable > 2 hours before failure)
• What is failure?
  • Jobs terminate abnormally before system failure event(s)
• Cost-benefit and ramifications?
  • How to evaluate cost-benefit for a given action?
  • What are the ramifications of a given action/inaction on a live system where playback is impossible?
• Attribution?
  • Event is far from the indicator/cause
Definitions and assumptions allow example quantification of OOM predictor
• Classifier predicts whether a job will terminate as COMPLETED (non-failure) or otherwise (failure)
  • Failure is predicted if memory usage (MU) on any job node during preceding idle time exceeds threshold
• Response is rebooting of any node with excess MU during idle time, thus clearing memory
• Cost of rebooting is 90 CPU-seconds
  • Does not include cycling wear or effect on job scheduling
• If a job failed, rebooting its highest-MU node during the preceding idle time would have saved it
  • Credit given for total CPU-hours of failed job
  • Unrealistic assumption because not all failures are OOM
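The classifier and response defined on this slide can be written down directly. The following is a sketch under stated assumptions: the `JobRecord` layout, the field names, and the node names are hypothetical and do not reflect the actual OVIS or scheduler data format; only the COMPLETED state and the "any node exceeds threshold" rule come from the definitions above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    final_state: str           # scheduler end state; only "COMPLETED" counts as non-failure
    idle_mu: Dict[str, float]  # hypothetical: node name -> MU during the preceding idle time
    cpu_hours: float           # total CPU-hours consumed by the job


def is_failure(job: JobRecord) -> bool:
    """A job is a failure if it terminated as anything other than COMPLETED."""
    return job.final_state != "COMPLETED"


def predicts_failure(job: JobRecord, threshold: float) -> bool:
    """Failure is predicted if MU on any job node exceeded the threshold."""
    return any(mu > threshold for mu in job.idle_mu.values())


def nodes_to_reboot(job: JobRecord, threshold: float) -> List[str]:
    """Response: reboot every node whose idle-time MU exceeded the threshold."""
    return sorted(node for node, mu in job.idle_mu.items() if mu > threshold)


def confusion_counts(jobs: List[JobRecord], threshold: float) -> Dict[str, int]:
    """Tally predictions against outcomes for one threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for job in jobs:
        alarm, failed = predicts_failure(job, threshold), is_failure(job)
        key = ("t" if alarm == failed else "f") + ("p" if alarm else "n")
        counts[key] += 1
    return counts
```

Calling `confusion_counts` over a range of thresholds yields the counts needed for an ROC curve, and `nodes_to_reboot` gives the set of responses whose cost would be charged at 90 CPU-seconds each.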
Example ROC curve measures prediction accuracy
• Predictions of job failure/non-failure are evaluated for various MU thresholds
• ROC curve shows better-than-chance accuracy
  • Area under curve is 0.562 vs. 0.5 for chance
  • Statistical significance (p ~ 0.001) via comparison to synthetic data with no MU-failure correlation
• Validates ability to predict failure in this system
[Figure: ROC curve with false-negative and false-positive axes, running from the lowest threshold (always alarm) to the highest threshold (never alarm)]
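One common way to build synthetic data with no MU-failure correlation is to shuffle the failure labels while leaving the MU values untouched. The slides do not say how the synthetic data were generated, so the permutation test below is an assumed stand-in rather than the authors' procedure, and the names `auc` and `permutation_p_value` are hypothetical.

```python
import random
from typing import List


def auc(scores: List[float], failed: List[bool]) -> float:
    """Area under the ROC curve as the fraction of (failure, non-failure)
    pairs in which the failure has the higher score; ties count one half."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failure and one non-failure")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores: List[float], failed: List[bool],
                        trials: int = 1000, seed: int = 0) -> float:
    """Estimate how often shuffled (uncorrelated) labels reach the observed AUC."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = auc(scores, failed)
    labels = list(failed)
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)  # destroys any score-failure relationship
        if auc(scores, labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

With 1000 trials the smallest reportable value is about 0.001, so resolving a p-value of that order would need more shuffles than the default used here. The pairwise AUC is quadratic in the number of jobs and is meant only for small illustrative inputs.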
Example net-benefit curve measures response effectiveness
• With stated assumptions, net benefit (saved jobs minus rebooting time) is monotonic with threshold
  • Rebooting cost is negligible
  • Routine rebooting is optimal
• More realistic treatment would reduce net benefit
  • Not all failed jobs were OOM or could be saved
  • Additional rebooting costs
• Curve bent 80/20 (80% of benefit from 20% of responses): smart reboot has potential value
[Figure: net-benefit curve, running from the lowest threshold (always reboot) to the highest threshold (never reboot)]
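Under the stated assumptions, each point on the net-benefit curve is the CPU-hours of saved jobs minus 90 CPU-seconds per reboot. The sketch below computes that quantity for a list of thresholds. The tuple layout of `jobs` and the function name `net_benefit_curve` are assumptions for illustration, and the calculation inherits the unrealistic premise that every failed job with an alarmed node would have been saved.

```python
from typing import List, Tuple

# 90 CPU-seconds per reboot, expressed in CPU-hours (from the stated assumptions).
REBOOT_COST_CPU_HOURS = 90.0 / 3600.0

# Hypothetical job summary: (idle-time MU per node, job failed?, job CPU-hours).
Job = Tuple[List[float], bool, float]


def net_benefit_curve(jobs: List[Job],
                      thresholds: List[float]) -> List[Tuple[float, int, float]]:
    """Return (threshold, reboots, net_benefit_cpu_hours) for each threshold.

    Every node with MU above the threshold is rebooted, whether or not its
    job later failed. A failed job is credited in full if its highest-MU
    node was rebooted. Each job is assumed to have at least one node.
    """
    curve = []
    for t in thresholds:
        reboots = sum(1 for mus, _, _ in jobs for mu in mus if mu > t)
        saved = sum(hours for mus, failed, hours in jobs
                    if failed and max(mus) > t)
        curve.append((t, reboots, saved - reboots * REBOOT_COST_CPU_HOURS))
    return curve
```

Comparing the `reboots` and net-benefit columns across thresholds shows how much of the maximum benefit is retained as the number of responses drops, which is the relationship the 80/20 remark refers to.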
Conclusion
• HPC failure prediction is a valuable ingredient to improve resilience and thus scaling of applications
  • System complexity makes predictors difficult to discover
• When a potential failure predictor is identified, quantification of effectiveness is challenging in itself
  • Classifier metrics evaluate correlation between predictor and failure, but do not account for feasibility/cost of response
  • Assessing practical value of predictor involves a response’s cost and impact on the system (often not fully understood)
• At least one predictor is known (idle memory usage)
  • Evaluation methodology applied to this example confirms predictivity and suggests benefit from reboot response
http://ovis.ca.sandia.gov
ovis@sandia.gov