Machine Learning for Automated Diagnosis of Distributed Systems Performance Ira Cohen HP-Labs June 2006 http://www.hpl.hp.com/personal/Ira_Cohen
Intersection of systems and ML/Data mining: A growing (research) area • Berkeley's RAD lab (Reliable Adaptable Distributed systems lab) received $7.5 million from Google, Microsoft and Sun for: "…adoption of automated analysis techniques from Statistical Machine Learning (SML), control theory, and machine learning, to radically improve detection speed and quality in distributed systems" • Workshops devoted to the area (e.g., SysML), papers in leading systems and data mining conferences • Part of IBM's "Autonomic Computing" and HP's "Adaptive Enterprise" visions • Startups (e.g., Splunk, LogLogic) • And more…
SLIC project at HP-Labs*: Statistical Learning, Inference and Control • Research objective: Provide technology enabling automated decision making, management and control of complex IT systems. • Explore statistical learning, decision theory and machine learning as the basis for automation. • *Participants/Collaborators: Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, Steve Zhang, Jeff Chase, Rob Powers, Chengdu Huang, Blaine Nelson • Today I'll focus on performance diagnosis
Intuition: Why is performance diagnosis hard? • What do you do when your PC is slow?
Why care about performance? • Answer: It costs companies BIG money • Analysts estimate that poor application performance costs U.S.-based companies approximately $27 billion each year • Revenue from performance management software products is growing at double-digit percentages every year!
Challenges today in diagnosing/forecasting IT performance problems • Distributed systems/services are complex • Thousands of systems/services/applications are typical • Multiple levels of abstraction and interactions between components • Systems/applications change rapidly • Multiple levels of responsibility (infrastructure operators, application operators, DBAs, …) --> a lot of finger pointing • Problems can take days/weeks to resolve • Loads of data, no actionable information • Operators manually search for the needle in the haystack • Multiple types of data sources, and a lack of unifying tools to even view the data • Operators hold past diagnosis efforts in their heads; the history of diagnosis efforts is mostly lost
Translation to Machine Learning Challenges • Transforming data to information: Classification, feature selection methods – with need for explanation • Adaptation: Learning with concept drift • Leveraging history: Transforming diagnosis into an information retrieval problem, clustering methods, etc. • Using multiple data sources: combining structured and semi-structured data • Scalable machine learning solutions: distributed analysis, transfer learning • Using human feedback (human in the loop): semi-supervised learning (active learning, semi-supervised clustering)
Outline • Motivation (already behind us…) • Concrete example: The state of distributed performance management today • ML challenges • Examples of research results • Bringing it all together as a tool: Providing diagnostic capabilities as a centrally managed service • Discussion/Summary
Example: A real distributed HP application architecture • A geographically distributed 3-tier application • Results shown today are from the last 19+ months of data collected from this service
Application performance "management": Service Level Objectives (SLOs) • Unhealthy = SLO violation
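A minimal sketch of how the SLO state used throughout the talk could be derived from measurements, assuming the SLO is a threshold on average response time per 5-minute epoch; the 4-second threshold and the function name are illustrative assumptions, not the service's actual SLO.

```python
import numpy as np

def slo_state(avg_response_times, threshold_secs=4.0):
    """Label each measurement epoch: 1 = unhealthy (SLO violation), 0 = healthy."""
    return (np.asarray(avg_response_times) > threshold_secs).astype(int)

# Example: five 5-minute epochs of average response time (seconds).
print(slo_state([0.8, 1.2, 6.3, 9.7, 2.1]))   # -> [0 0 1 1 0]
```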
Detection is not enough… • Triage: What are the symptoms of the problem? Who do I call? • Leverage history: Did we see similar problems in the past? What were the repair actions? Do/Did they occur in other data centers? • Problem prioritization: How many different problems are there, and how severe are they? Which are recurrent? Can we forecast these problems?
Challenge 1: Transforming data to information… • Many measurements (metrics) are available on IT systems (OpenView, Tivoli, etc.) • System/application metrics: CPU, memory, disk, network utilizations, queues, etc. • Measured on a regular basis (1–5 minutes with commercial tools) • Other semi-structured data (log files) • Where is the relevant information?
ML approach: Model using classifiers • Leverage all the data collected in the infrastructure: • Use classifiers F(M) -> SLO state • Classification accuracy is a measure of success • Use feature selection to find the metrics most predictive of the SLO state
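The talk does not specify the feature-selection algorithm; as a hedged sketch, the snippet below ranks metrics by mutual information with the SLO state (a simple filter-style stand-in) and keeps the top few.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_predictive_metrics(X, y, metric_names, k=10):
    """X: (epochs x metrics) matrix, y: 0/1 SLO state. Return the k most predictive metrics."""
    mi = mutual_info_classif(X, y, random_state=0)
    top = np.argsort(mi)[::-1][:k]
    return [(metric_names[i], float(mi[i])) for i in top]
```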
But we need an explanation, not just classification accuracy… • Our approach: Learn the joint probability distribution P(M, SLO) with Bayesian network classifiers • Inferences ("metric attribution"): Normal: the metric has a value associated with healthy behavior; Abnormal: the metric has a value associated with unhealthy behavior
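A hedged sketch of the idea, using a Gaussian naive Bayes model as a simplified stand-in for the Bayesian network classifiers in the talk: fit per-class distributions over the metrics, classify the SLO state, and attribute a metric to a violation when its observed value is more likely under the violation model than under the healthy model. Class labels and the attribution rule are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

class MetricAttributionNB:
    """Gaussian naive Bayes over system metrics with per-metric attribution."""

    def fit(self, X, y):
        # Per-class Gaussian for every metric (column of X); y in {0: healthy, 1: violation}.
        self.classes_ = np.unique(y)
        self.mu_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.sd_ = {c: X[y == c].std(axis=0) + 1e-6 for c in self.classes_}
        self.prior_ = {c: float(np.mean(y == c)) for c in self.classes_}
        return self

    def predict_proba(self, X):
        ll = np.stack([np.log(self.prior_[c]) +
                       norm.logpdf(X, self.mu_[c], self.sd_[c]).sum(axis=1)
                       for c in self.classes_], axis=1)
        ll -= ll.max(axis=1, keepdims=True)          # numerical stability
        p = np.exp(ll)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

    def attribute(self, x):
        # A metric is "attributed" to the violation when its value is more likely
        # under the violation model than under the healthy model.
        lr = norm.logpdf(x, self.mu_[1], self.sd_[1]) - norm.logpdf(x, self.mu_[0], self.sd_[0])
        return lr > 0                                 # one boolean per metric
```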
Bayesian network classifiers: Results • "Fast" (in the context of 1–5 min data collection): models take 2–10 seconds to train on days' worth of data; metric attribution takes 1–10 ms to compute • On the order of 3–10 metrics (out of hundreds) are needed to accurately capture a performance problem • Accuracy is high (~90%)* • Experiments showed the metrics are useful for diagnosing certain problems on real systems • Hard to capture multiple types of performance problems with a single model! • [Figure: Bayesian network with an SLO-state node linked to metric nodes M3, M5, M8, M30, M32]
Additional issues • How much data is needed to get accurate models? • How to detect model validity? • How to present models/results to operators?
Challenge 2: Adaptation • Systems and applications change • Reasons for performance problems change over time (and sometimes recur): is it the same problem or a different one? • Learning with "concept drift"
Adaptation: Possible approaches • Single omniscient model: "Train once, use forever" • Assumes the training data provides all the information • Online updating of a model • E.g., parameter/structure updating of Bayesian networks, online learning of neural networks, support vector machines, etc. • Potentially wasteful retraining when similar problems recur • Maintain an ensemble of models • Requires criteria for choosing a subset of models for inference • Criteria for adding new models to the ensemble • Criteria for removing models from the ensemble
Our approach: Managing an ensemble of models • Construction: periodically induce a new model; check whether the model adds new information (classification accuracy); update the ensemble of models • Inference: use the Brier score to select among models
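A minimal sketch of such an ensemble, assuming classifiers that expose a scikit-learn-style predict/predict_proba interface (e.g., the naive Bayes stand-in above); the accuracy-gain threshold, and the use of the Brier score in the construction step as well, are assumptions for illustration.

```python
import numpy as np

def brier_score(model, X, y):
    # Mean squared error between predicted P(violation) and the 0/1 SLO label (lower is better).
    return float(np.mean((model.predict_proba(X)[:, 1] - y) ** 2))

class ModelEnsemble:
    def __init__(self, min_gain=0.05):
        self.models, self.min_gain = [], min_gain

    def maybe_add(self, candidate, X_recent, y_recent):
        # Keep the new model only if it clearly beats the current best on recent data.
        best = self.best_model(X_recent, y_recent)
        if best is None or (brier_score(candidate, X_recent, y_recent)
                            < brier_score(best, X_recent, y_recent) - self.min_gain):
            self.models.append(candidate)

    def best_model(self, X_recent, y_recent):
        if not self.models:
            return None
        return min(self.models, key=lambda m: brier_score(m, X_recent, y_recent))

    def predict(self, X_recent, y_recent, X_new):
        # Winner takes all: classify new samples with the best-scoring model.
        return self.best_model(X_recent, y_recent).predict(X_new)
```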
Adaptation: Results • ~7500 samples, 5 mins/sample (one month), ~70 metrics • Classifying a sample with the ensemble of BNCs: • Used the model with the best Brier score to predict the class (winner takes all) • The Brier score was better than other measures (e.g., accuracy, likelihood) • Winner takes all was more accurate than other combination approaches (e.g., majority voting)
Adaptation: Results • The "single adaptive" model is slower to adapt to recurrent issues • It must re-learn the behavior instead of just selecting a previous model
Additional issues • Need criteria for “aging” models • Periods of “good” behavior also change: Need robustness to those changes as well.
Challenge 3: Leveraging history • It would be great to have the following system: • Diagnosis: Stuck thread due to insufficient database connections • Repair: Increase connections to +6 • Periods: … • Severity: SLO time increases up to 10 secs • Location: Americas; not seen in Asia/Pacific
Leveraging history (same example problem as above) • Main challenge: Find a representation (a signature) that captures the main characteristics of the system behavior and is: • Amenable to distance metrics • Generated automatically • In machine-readable form
Our approach to defining signatures • 1) Learn probabilistic classifiers P(SLO, M) • 2) Inferences: metric attribution • 3) Define these as the signatures of the problems
Example: Defining a signature • For a given SLO violation, the models provide a list of metrics that are attributed with the violation • In the signature, a metric's entry has value 1 if the metric is attributed with the violation, -1 if it is not attributed, and 0 if it is not relevant
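A minimal sketch of building such a signature vector; interpreting "not relevant" as "not used by the selected model" is an assumption for illustration.

```python
import numpy as np

def build_signature(attributed, relevant, n_metrics):
    """attributed, relevant: sets of metric indices. Entry 1 = attributed,
    -1 = relevant but not attributed, 0 = not relevant (assumed: not used by the model)."""
    sig = np.zeros(n_metrics, dtype=int)
    for i in relevant:
        sig[i] = 1 if i in attributed else -1
    return sig

# Example: 6 metrics, the model uses metrics {0, 2, 5}, of which {2, 5} are attributed.
print(build_signature({2, 5}, {0, 2, 5}, 6))   # -> [-1  0  1  0  0  1]
```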
Results: With signatures… • We were able to accurately retrieve past occurrences of similar performance problems, together with their diagnosis efforts (e.g., the "stuck thread" problem above) • ML technique: Information retrieval
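A hedged sketch of the retrieval step: rank annotated past signatures by their distance to the signature of the current violation. The L1 distance and the database layout are assumptions; the talk does not specify the distance metric.

```python
import numpy as np

def retrieve_similar(query_sig, signature_db, top_k=10):
    """signature_db: list of (signature vector, annotation) pairs; annotations hold
    the past diagnosis, repair actions, severity, location, etc."""
    ranked = sorted(signature_db,
                    key=lambda entry: np.abs(np.asarray(entry[0]) - np.asarray(query_sig)).sum())
    return ranked[:top_k]
```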
Results: Retrieval accuracy • [Figure: precision-recall curves for retrieval of the "Stuck Thread" problem, compared against the ideal P-R curve; top 100 retrieved: 92 vs. 51]
Results: With signatures we can also… • Automatically identify groups of different problems and their severity • Identify which are recurrent • ML technique: Clustering
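The talk does not name the clustering algorithm; as a hedged sketch, k-means over the signature vectors is used below as a stand-in for grouping violations into problem types.

```python
from sklearn.cluster import KMeans

def cluster_signatures(signatures, n_clusters=5):
    """Group signature vectors into problem clusters; returns one cluster label per signature."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(signatures)

# A problem type is recurrent if the epochs assigned to its cluster keep
# reappearing over time (e.g., they span several separate days).
```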
Additional issues • Can we generalize and abstract signatures for different systems/applications? • How to incorporate human feedback for retrieval and clustering? • Semi-supervised learning: results not shown today
Challenge 4: Combining multiple data sources • We have a lot of semi-structured text logs, e.g., • Problem tickets • Event/error logs (application/system/security/network…) • Other logs (e.g., operators actions) • Logs can help obtain more accurate diagnosis and models – sometimes system/application metrics not enough • Challenges: • Transforming logs to “features”: information extraction • Doing it efficiently!
Properties of logs • Log events have relatively short text messages • Much of the diversity in messages comes from different "parameters" (dates, machine/component names); the core message text is less varied than free text • The number of events can be huge (e.g., >100 million events per day for large IT systems) • Processing the events therefore needs to compress the logs significantly, and do so efficiently!
Our approach: Processing application error logs
Example raw error-log entries (excerpt):
2006-02-26T00:00:06.461 ES_Domain:ES_hpat615_01:2257913:Thread43.ES82|commandchain.BaseErrorHandler.logException()|FUNCTIONAL|0||FatalException occurred type=com.hp.es.service.productEntitlement.knight.logic.access.KnightIOException, message=Connection timed out, class=com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|2706||KNIGHT system unavailable: java.io.IOException
2006-02-26T00:00:06.465 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|0||com.hp.es.service.productEntitlement.knight.logic.RequestKnightResultMENUCommand message: Connection timed out causing exception type: java.io.IOException KNIGHT URL accessed: http://vccekntpro.cce.hp.com/knight/knightwarrantyservice.asmx
2006-02-26T00:00:06.466 ES_Domain:ES_hpat615_01:22579163:Thread-43.ES82|com.hp.es.service.productEntitlement.combined.errorhandling.DefaultAlwaysEIAErrorHandlerRed.handleException()|FATAL|0||com.hp.es.service.productEntitlement.knight.logic.access.KnightIOException: Connection timed out
2006-02-26T00:00:08.279 ES_Domain:ES_hpat615_01:22579163:ExecuteThread: '16' for 'weblogic.kernel.Default'.ES82|com.hp.es.service.productEntitlement.combined.MergeAllStartedThreadsCommand.setWaitingFinished()|WARNING|3709||…
Pipeline: over 4,000,000 error log entries → 200,000+ distinct error messages → similarity-based sequential clustering → 190 "feature messages"
• Significant reduction of messages: 200,000 → 190
• Accurate: clustering results were validated with a hierarchical tree clustering algorithm
• Use the count of appearances of the feature messages over 5-minute intervals as metrics for learning
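A hedged sketch of similarity-based sequential clustering for log messages, under the assumption that messages are normalized by stripping variable parameters (numbers, timestamps, thread ids) and merged into the first cluster whose representative is similar enough (Jaccard similarity over tokens); the threshold and similarity measure are illustrative choices, not the talk's exact algorithm.

```python
import re

def normalize(msg):
    """Strip variable 'parameters' so messages with the same core text look alike."""
    msg = re.sub(r"\d+", "<NUM>", msg)                  # numbers, dates, thread ids
    return set(re.split(r"[\s|:.,/=()']+", msg)) - {""}

def jaccard(a, b):
    return len(a & b) / (len(a | b) or 1)

def sequential_cluster(messages, threshold=0.7):
    """One pass over the messages; each cluster is (representative token set, count)."""
    clusters, assignments = [], []
    for msg in messages:
        tokens = normalize(msg)
        for i, (rep, count) in enumerate(clusters):
            if jaccard(tokens, rep) >= threshold:
                clusters[i] = (rep, count + 1)
                assignments.append(i)
                break
        else:
            clusters.append((tokens, 1))
            assignments.append(len(clusters) - 1)
    return clusters, assignments
```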
Learning probabilistic models • Construct probabilistic models of the log metrics using a "hybrid gamma distribution" (a gamma distribution with zeros) • [Figure: PDF over the number of appearances]
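A minimal sketch of fitting such a hybrid gamma model: estimate the probability of a zero count directly and fit a gamma distribution to the non-zero counts. The zero-inflation formulation and the use of scipy's fitter are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def fit_hybrid_gamma(counts):
    """counts: appearances of a feature message per 5-minute interval (many zeros)."""
    counts = np.asarray(counts, dtype=float)
    p_zero = float(np.mean(counts == 0))
    shape, _, scale = stats.gamma.fit(counts[counts > 0], floc=0)   # gamma over non-zero counts
    return p_zero, shape, scale

def hybrid_gamma_density(x, p_zero, shape, scale):
    # Mixed model: probability mass p_zero at x == 0, gamma density scaled by (1 - p_zero) for x > 0.
    x = np.asarray(x, dtype=float)
    return np.where(x == 0, p_zero, (1 - p_zero) * stats.gamma.pdf(x, shape, scale=scale))
```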
Results: Adding log-based metrics • Signatures using error-log metrics pointed to the right causes in 4 out of 5 "High" severity incidents in the past 2 months • System metrics were not related to the problems in these cases • From the operator incident report: "Diagnosis and Solution: Unable to start SWAT wrapper. Disk usage reached 100%. Cleaned up disk and restarted the wrapper…" • From the application error log: "CORBA access failure: IDL:hpsewrapper/SystemNotAvailableException:… com.hp.es.wrapper.corba.hpsewrapper.SystemNotAvailableException"
Additional issues • With multiple instances of an application, how do we process the logs jointly and efficiently? • Treating events as sequences in time could improve both accuracy and compression.
Challenge 5: Scaling up machine learning techniques • Large-scale distributed applications have various levels of dependencies • Multiple instances of components • Shared resources (DB, network, software components) • Thousands to millions of metrics (features) • [Figure: dependency graph over components A–E]
Challenge 5: Possible approaches • Scalable approach: Ignore dependencies between components • Putting our head in the sand? • See Werner Vogels' (Amazon's CTO) thoughts on it… • Centralized approach: Use all available data together for building models • Not scalable • A different approach: Transfer models, not metrics • Good for components that are similar and/or have similar measurements
Example: Diagnosis with multiple instances • Method 1: diagnose multiple instances by sharing measurement data (metrics) • [Figure: two instances, A and B, exchanging metrics; then scaled up to eight instances, A–H]
Diagnosis with multiple instances • Method 2: diagnose multiple instances by sharing learning experience (models) • A form of transfer learning • [Figure: two instances, A and B, exchanging models; then scaled up to eight instances, A–H]
Metric exchange: Does it help? • Building models based on the metrics of other instances • Observation: metric exchange does not improve model performance for load-balanced instances • [Figure: online prediction over time epochs for Instance 1 and Instance 2, showing violation detections and false alarms with and without exchange]
Model exchange: Does it help? • Apply models trained on other instances • Observation 1: model exchange enables quicker recognition of previously unseen problem types • Observation 2: model exchange reduces model training cost • [Figure: online prediction over time epochs; models imported from other instances improve accuracy (violation detection and false alarms with vs. without model exchange)]
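A minimal sketch of model exchange between instances, assuming each instance keeps a list of trained classifiers with a scikit-learn-style interface: before training a new model for an emerging problem, try the peers' models and pick whichever scores best (lowest Brier score) on recent local data. Function names and the selection rule are illustrative assumptions.

```python
import numpy as np

def brier(model, X, y):
    # Mean squared error between predicted P(violation) and the 0/1 SLO label.
    return float(np.mean((model.predict_proba(X)[:, 1] - y) ** 2))

def diagnose_with_model_exchange(local_models, peer_model_lists, X_recent, y_recent, X_new):
    """Pick the best model among local models and models imported from sibling
    instances, then classify the new samples with it."""
    candidates = list(local_models) + [m for peer in peer_model_lists for m in peer]
    best = min(candidates, key=lambda m: brier(m, X_recent, y_recent))
    return best.predict(X_new)
```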
Additional issues • Can we do transfer learning across similar but not identical instances, and how? • More efficient methods for detecting which data is needed from related components during diagnosis
Providing diagnosis as a web service: SLIC's IT-Rover • [Architecture figure: monitored services, metrics/SLO monitoring, signature construction engine, signature DB, retrieval engine, clustering engine, admin] • A centralized diagnosis web service allows: • Retrieval across different data centers, different services, possibly different companies • Fast deployment of new algorithms • Better understanding of real problems for further development of algorithms • The value of the portal is in the information ("Google" for systems)
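A hedged sketch of what exposing signature retrieval as a centralized web service could look like; the framework (Flask), the /retrieve endpoint, and the payload format are illustrative assumptions, not IT-Rover's actual interface.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
signature_db = []   # list of (signature vector, annotation dict) pairs, populated elsewhere

@app.route("/retrieve", methods=["POST"])
def retrieve():
    """Accept a signature (JSON list of -1/0/1 entries) and return the closest annotated matches."""
    query = np.asarray(request.get_json()["signature"])
    ranked = sorted(signature_db,
                    key=lambda entry: np.abs(np.asarray(entry[0]) - query).sum())
    return jsonify([annotation for _, annotation in ranked[:10]])

if __name__ == "__main__":
    app.run()
```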
Discussion: Additional issues, opportunities, and challenges • Beyond the "black box": Using domain knowledge • Expert knowledge • Topology information • Use known dependencies and causal relationships between components • Provide solutions in cases where SLOs are not known • Learn the relationship between business objectives and IT performance • Anomaly detection methods with feedback mechanisms • Beyond diagnosis: Automated control and decision making • HP-Labs work on applying adaptive controllers to systems/applications • IBM Labs work using reinforcement learning for resource allocation
Summary • Presented several challenges at the intersection of machine learning and automated IT diagnosis • A relatively new area for machine learning and data mining researchers and practitioners • Many more opportunities and challenges ahead, both research-wise and product/business-wise… Read more: www.hpl.hp.com/research/slic • SOSP-05, DSN-05, HotOS-05, KDD-05, OSDI-04
Publications: • Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox, "Capturing, Indexing, Clustering, and Retrieving System History", SOSP 2005. • Rob Powers, Ira Cohen, and Moises Goldszmidt, "Short term performance forecasting in enterprise systems", KDD 2005. • Moises Goldszmidt, Ira Cohen, Armando Fox and Steve Zhang, "Three research challenges at the intersection of machine learning, statistical induction, and systems", HOTOS 2005. • Steve Zhang, Ira Cohen, Moises Goldszmidt, Julie Symons, Armando Fox, "Ensembles of models for automated diagnosis of system performance problems", DSN 2005. • Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, Jeff Chase, "Correlating instrumentation data to system states: A building block for automated diagnosis and control", OSDI, 2004. • George Forman and Ira Cohen, "Beware the null hypothesis", European Conference on Machine Learning/ European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) 2005. • Ira Cohen and Moises Goldszmidt, "Properties and Benefits of Calibrated Classifiers", European Conference on Machine Learning/ European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) 2004. • George Forman and Ira Cohen, "Learning from Little: Comparison of Classifiers given Little Training", European Conference on Machine Learning/ European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) 2004.