This thesis explores the automated diagnosis of chronic problems in production systems, focusing on anomaly detection, localization, and evaluation. It presents a peer-comparison approach for anomaly detection and the incremental fusion of different instrumentation sources for localization.
Automated Diagnosis of Chronic Problems in Production Systems Soila Kavulya Thesis Committee: Christos Faloutsos, CMU; Greg Ganger, CMU; Matti Hiltunen, AT&T; Priya Narasimhan, CMU (Advisor)
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work Soila Kavulya @ March 2012
Motivation • Chronics are problems that are • Not transient • Not resulting in system-wide outage • Chronics occur in real production systems • VoIP • User’s calls fail due to version conflict between user and upgraded server • Hadoop (CMU’s OpenCloud) • A user job sporadically fails in map phase with cryptic block I/O error • User and admins spent 2 months troubleshooting • Traced to large heap size in tasktracker starving collocated datanodes • Chronics are due to a variety of root-causes • Configuration problems, bad hardware, software bugs • Thesis: Automate chronics diagnosis in production systems
Challenge for Diagnosis • Single manifestation, multiple possible causes • Due to a single node? • Due to complex interactions between nodes? • Due to multiple independent nodes?
Challenges in Production Systems • Labeled failure-data is not always available • Difficult to diagnose problems not encountered before • Sysadmins’ perspective may not correspond to users’ • No access to user configurations, user behavior • No access to application semantics • First sign of trouble is often a customer complaint • Customer complaints can be cryptic • Desired level of instrumentation may not be possible • As-is vendor instrumentation with limited control • Cost of added instrumentation may be high • Granularity of diagnosis consequently limited
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work
Objectives • “Is there a problem?” (anomaly detection) • Detect a problem despite potentially not having seen it before • Distinguish a genuine problem from a workload change • “Where is the problem?” (localization) • Drill down by analyzing different instrumentation perspectives • “What kind of problems?” (chronics) • Manifestation: exceptions, performance degradations • Root-cause: misconfiguration, bad hardware, bugs, contention • Origin: single/multiple independent sources, interacting sources • “What kind of environments?” (production systems) • Production VoIP system at AT&T • Hadoop: Open-source implementation of MapReduce
Thesis Statement Peer-comparison* enables anomaly detection in production systems despite workload changes, and the subsequent incremental fusion of different instrumentation sources enables localization of chronic problems. *Comparison of some performance metric across similar (peer) system elements
What was our Inspiration? rika (Swahili), noun. Peer, contemporary, age-set; undergoing rites of passage (marriage) at similar times.
What is a Peer? • Temporal similarity • Age-set: Born around the same time • Anomaly detection: Events within same time window • Spatial similarity • Age-set: Live in same location • Anomaly detection: Run on same node • Phase similarity • Age-set: (birth, initiation, marriage) • Anomaly detection: (map, shuffle, reduce) • Contextual similarity • Age-set: Same gender, clan • Anomaly detection: Same workload, h/w
Target Systems for Validation • VoIP system at large telecommunication provider • 10s of millions of calls per day, diverse workloads • 100s of network elements with heterogeneous hardware • 24x7 Ops team uses alarm correlation to diagnose outages • Separate team troubleshoots long-term chronics • Labeled traces available • Hadoop: Open-source implementation of MapReduce • Diverse kinds of real workloads • Graph mining, language translation • Hadoop clusters with homogeneous hardware • Yahoo! M45 & Opencloud production clusters • Controlled experiments in Amazon EC2 cluster • Long-running jobs (> 100s): Hard to label failures
In Support of Thesis Statement
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work
Goals & Non-Goals • Goals • Anomaly detection in the absence of labeled failure-data • Diagnosis based on available instrumentation sources • Differentiation of workload changes from anomalies • Non-goals • Diagnosis of system-wide outages • Diagnosis of value faults and transient faults • Root-cause analysis at code-level • Online/runtime diagnosis • Recovery based on diagnosis
Assumptions • Majority of system is working correctly • Problems manifest in observable behavioral changes • Exceptions or performance degradations • All instrumentation is locally timestamped • Clocks are synchronized to enable system-wide correlation of data • Instrumentation faithfully captures system behavior
Overview of Approach Application Logs Performance Counters End-to-end Trace Construction Anomaly Detection Localization Ranked list of root-causes
Target System #1: VoIP IP Base Elements ISP’s network Call Control Elements PSTN Access IP Access Application Servers Gateway Servers
Target System #2: Hadoop Slave Nodes Master Node Map/Reduce tasks JobTracker NameNode TaskTracker DataNode OS data OS data HDFS blocks Hadoop logs Hadoop logs
Performance Counters • For both Hadoop and VoIP • Metrics collected periodically from /proc in OS • Monitoring interval varies from 1 sec to 15 min • Examples of metrics collected • CPU utilization • CPU run-queue size • Pages in/out • Memory used/free • Context switches • Packets sent/received • Disk blocks read/written
End-to-End Trace Construction Application Logs Performance Counters End-to-end Trace Construction Anomaly Detection Localization Ranked list of root-causes
Application Logs • Each node logs each request that passes through it • Timestamp, IP address, request duration/size, phone no., … • Log formats vary across components and systems • Application-specific parsers extract relevant attributes • Construction of end-to-end traces • Pre-defined schema used to stitch requests across nodes • Match on key attributes • In Hadoop, match tasks with same task IDs • In VoIP, match calls with same sender/receiver phone no. • Incorporate time-based correlation • In Hadoop, consider block reads in same time interval as maps • In VoIP, consider calls with same phone no. within same time interval
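The stitching described above can be sketched as follows. This is a minimal Python illustration, not the thesis's implementation; the record fields ("task_id", "node", "ts") and the 60-second window are assumptions.

```python
from collections import defaultdict

def stitch_traces(records, key="task_id", window=60):
    """Group per-node log records into end-to-end traces by matching
    a key attribute and bucketing timestamps into time windows."""
    traces = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["ts"]):
        bucket = rec["ts"] // window          # time-based correlation
        traces[(rec[key], bucket)].append(rec)
    return traces

# Hypothetical records already parsed by per-component parsers.
records = [
    {"task_id": "map_001", "node": "n1", "ts": 10},
    {"task_id": "map_001", "node": "n2", "ts": 15},
    {"task_id": "map_002", "node": "n1", "ts": 95},
]
traces = stitch_traces(records)
```

Note the fixed-window bucketing is a simplification: records just either side of a window boundary would land in different traces, which a sliding-window or interval-overlap match would handle better.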
Application Logs: VoIP IP Base Element Call Control Element Application Server 10:03:59, START 973-123-8888 to 409-555-5555 192.156.1.2 to 11.22.34.1 10:03:59, STOP 10:03:59, ATTEMPT 973-123-8888 to 409-555-5555 Gateway Server 10:04:01, ATTEMPT 973-123-xxxx to 409-555-xxxx 192.156.1.2 to 11.22.34.1 • Combine per-element logs to obtain per-call traces • Approximate match on key attributes • Timestamps, caller-callee numbers, IP, ports • Determine call status from per-element codes • Zero talk-time, callback soon after call termination
Application Logs: Hadoop (1) • Peer-comparable attributes extracted from logs • Correlate traces using IDs and request schema Context similarity: TaskType Temporal similarity: Timestamps 2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts) 2009-03-06 23:06:01,612 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 2 bytes (2 raw bytes) into RAM from attempt_200903062245_0051_m_000055_0 …from ip-10-250-90-207.ec2.internal Phase similarity: MapReduce Hostnames: Spatial similarity
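Extracting peer-comparable attributes from a log line like the one above might look like this sketch; the regex encodes an assumed attempt-ID layout (job ID, task type m/r, task number, attempt number), not Hadoop's official specification.

```python
import re

# Example ReduceTask log line from the slide.
line = ("2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: "
        "attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs")

# Pull out the timestamp (temporal similarity) and the task type
# m/r (phase similarity) plus job/task IDs for correlation.
m = re.search(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"attempt_(?P<job>\d+_\d+)_(?P<type>[mr])_(?P<task>\d+)_(?P<attempt>\d+)",
    line)
attrs = m.groupdict()
# attrs["type"] is "m" for map tasks, "r" for reduce tasks
```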
Application Logs: Hadoop (2) • No global IDs for correlating logs in Hadoop & VoIP • Extract causal flows using predefined schemas Application logs Flow schema (JSON) MapReduce: { "events" : { "Map" : { "primary-key" : "MapID", "join-key" : "MapID", "next-event" : "Shuffle"},… 2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts) NoSQL Database Extract events Causal flows <time=t2,type=shuffle, reduceid=reduce1,mapid=map1,duration=2s>
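A minimal sketch of schema-driven flow extraction, assuming simplified event records and schema fields ("join-key", "next-event") in the spirit of the JSON excerpt shown; the event data is made up.

```python
import json

# Assumed miniature flow schema, not the thesis's exact format.
schema = json.loads("""
{ "events": {
    "Map":     {"join-key": "mapid",    "next-event": "Shuffle"},
    "Shuffle": {"join-key": "mapid",    "next-event": "Reduce"},
    "Reduce":  {"join-key": "reduceid", "next-event": null}
}}""")

events = [
    {"type": "Map",     "mapid": "map1", "time": 1},
    {"type": "Shuffle", "mapid": "map1", "reduceid": "reduce1", "time": 2},
    {"type": "Reduce",  "reduceid": "reduce1", "time": 3},
]

def build_flow(start, events, schema):
    """Chain events into a causal flow by following the schema's
    next-event links and joining on the declared key."""
    flow, cur = [start], start
    while True:
        nxt = schema["events"][cur["type"]]["next-event"]
        if nxt is None:
            break
        key = schema["events"][nxt]["join-key"]
        match = next((e for e in events
                      if e["type"] == nxt and e.get(key) == cur.get(key)), None)
        if match is None:
            break
        flow.append(match)
        cur = match
    return flow

flow = build_flow(events[0], events, schema)
```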
Anomaly Detection Application Logs Performance Counters End-to-end Trace Construction Anomaly Detection Localization Ranked list of root-causes
Anomaly Detection Overview • Some systems have rules for anomaly detection • Redialing number immediately after disconnection • Server-reported error codes and exceptions • If no rules available, rely on peer-comparison • Identifies peers (nodes, flows) in distributed systems • Detect anomalies by identifying “odd-man-out”
Anomaly Detection (1) • Empirically determine best peer groupings • Window size, request-flow types, job information • Best grouping minimizes false positives in fault-free runs • Peer-comparison identifies “odd-man-out” behavior • Robust to workload changes • Relies on histogram-comparison • Less sensitive to timing differences • Multiple suspects might be identified • Due to propagating errors, multiple independent problems
Anomaly Detection (2) Histograms (distributions) of flow durations: faulty node vs. normal nodes (normalized counts, total 1.0) • Histogram comparison identifies anomalous flows • Generate aggregate histogram representing majority behavior • Compare each node’s histogram against aggregate histogram: O(n) • Compute anomaly score using Kullback-Leibler divergence • Detect anomaly if score exceeds pre-specified threshold
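The comparison above can be sketched as below; the bin width, number of bins, and sample durations are illustrative assumptions, and a real deployment would tune the threshold empirically as the previous slide describes.

```python
import math
from collections import Counter

def histogram(durations, bins, bin_width=1.0):
    """Normalized histogram of flow durations (counts sum to 1.0)."""
    counts = Counter(min(int(d // bin_width), bins - 1) for d in durations)
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(bins)]

def kl_divergence(p, q, eps=1e-6):
    """Smoothed Kullback-Leibler divergence D(p || q)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def anomaly_scores(per_node, bins=10):
    """Compare each node's histogram against the aggregate over all
    nodes ('odd-man-out'); each comparison is O(n) in the bin count."""
    all_durations = [d for ds in per_node.values() for d in ds]
    agg = histogram(all_durations, bins)
    return {node: kl_divergence(histogram(ds, bins), agg)
            for node, ds in per_node.items()}

per_node = {
    "node1": [1, 1, 2, 2, 1],
    "node2": [1, 2, 1, 2, 2],
    "node3": [8, 9, 9, 8, 9],   # faulty: much longer durations
}
scores = anomaly_scores(per_node)
# node3 would be flagged if its score exceeds the chosen threshold
```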
Localization Application Logs Performance Counters End-to-end Trace Construction Anomaly Detection Localization Ranked list of root-causes
“Truth table” Request Representation Log Snippet Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock Req2: 20100901064930,FAIL,Node2,Map,ReadBlock
Identify Suspect Attributes • Assume each attribute represented as a “coin toss” • Estimate attribute distribution using Bayes • Success distribution: Prob(Attribute|Success) • Anomalous distribution: Prob(Attribute|Anomalous) • Anomaly score: KL-divergence between the two distributions • Indict attributes with highest divergence between distributions (chart: belief in Probability(Node2=TRUE) for anomalous vs. successful requests) http://www.pdl.cmu.edu/
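A toy sketch of the coin-toss scoring, with Laplace smoothing standing in as a simple placeholder for the Bayesian estimate; the request data mirrors the truth-table slide and is made up.

```python
import math

def bernoulli_kl(p, q, eps=1e-6):
    """KL divergence between two Bernoulli (coin-toss) distributions."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def indict_attributes(requests):
    """Score each attribute by how far Prob(Attribute|Anomalous)
    diverges above Prob(Attribute|Success)."""
    fail = [r for r in requests if r["status"] == "FAIL"]
    ok = [r for r in requests if r["status"] == "SUCCESS"]
    attrs = {a for r in requests for a in r["attrs"]}
    scores = {}
    for a in attrs:
        p_fail = (sum(a in r["attrs"] for r in fail) + 1) / (len(fail) + 2)
        p_ok = (sum(a in r["attrs"] for r in ok) + 1) / (len(ok) + 2)
        # Indict only attributes overrepresented among anomalous requests.
        scores[a] = bernoulli_kl(p_fail, p_ok) if p_fail > p_ok else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

requests = [
    {"status": "SUCCESS", "attrs": {"Node1", "Map", "ReadBlock"}},
    {"status": "SUCCESS", "attrs": {"Node1", "Map", "ReadBlock"}},
    {"status": "FAIL",    "attrs": {"Node2", "Map", "ReadBlock"}},
    {"status": "FAIL",    "attrs": {"Node2", "Map", "ReadBlock"}},
]
ranked = indict_attributes(requests)
```

Scoring only attributes that are more probable under failure keeps common elements (here, Map and ReadBlock appear in every request) from being indicted ahead of causal ones.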
Rank Problems by Severity • Step 1: Over all requests, indict the path with the highest anomaly score (Problem1: Node2, Map) • Step 2: Filter all requests except those matching Problem1, then repeat (Problem2: Node3, Shuffle)
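The two steps can be sketched as below, using a simplified score (fraction of failures containing the attribute minus fraction of successes) as a stand-in for the KL-based anomaly score, and single attributes rather than full paths.

```python
def score_attrs(requests):
    """Rank attributes by how much more often they occur in failed
    than in successful requests (stand-in for the KL-based score)."""
    fails = [r for r in requests if r["status"] == "FAIL"]
    oks = [r for r in requests if r["status"] == "SUCCESS"]
    attrs = {a for r in requests for a in r["attrs"]}
    def frac(rs, a):
        return sum(a in r["attrs"] for r in rs) / max(len(rs), 1)
    return sorted(((a, frac(fails, a) - frac(oks, a)) for a in attrs),
                  key=lambda kv: -kv[1])

def rank_problems(requests, max_problems=3):
    """Step 1: indict the top-scoring attribute. Step 2: filter out
    matching requests and repeat, surfacing independent problems
    in severity order."""
    remaining, problems = list(requests), []
    for _ in range(max_problems):
        if not any(r["status"] == "FAIL" for r in remaining):
            break
        top, score = score_attrs(remaining)[0]
        if score <= 0:
            break
        problems.append(top)
        remaining = [r for r in remaining if top not in r["attrs"]]
    return problems

# Made-up workload: a big problem on Node2 and a smaller one on Node3.
requests = (
    [{"status": "FAIL", "attrs": {"Node2", "Map"}}] * 5 +
    [{"status": "FAIL", "attrs": {"Node3", "Shuffle"}}] * 2 +
    [{"status": "SUCCESS", "attrs": {"Node1", "Map"}}] * 10
)
problems = rank_problems(requests)
```

The filtering in step 2 is what lets multiple ongoing problems surface: once the dominant problem's requests are removed, the smaller problem's attributes score highest.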
Incorporate Performance Counters (1) • Annotate requests on indicted nodes with performance counters based on timestamps • Identify metrics most correlated with problem • Compare distribution of metrics in successful and failed requests Requests on node2 # Timestamp,CallNo,Status,Memory(%),CPU(%) 20100901064914, 1, SUCCESS, 54, 6 20100901065030, 2, SUCCESS, 54, 6 20100901065530, 3, SUCCESS, 56, 4 20100901070030, 4, FAIL, 52, 45
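A sketch of the metric comparison, using the gap between mean values as a simple stand-in for a full distribution comparison; the annotated requests mirror the sample above, and the metric names are illustrative.

```python
from statistics import mean

# Requests already annotated with the counter samples nearest
# their timestamps (values from the node2 sample above).
requests = [
    {"status": "SUCCESS", "metrics": {"mem": 54, "cpu": 6}},
    {"status": "SUCCESS", "metrics": {"mem": 54, "cpu": 6}},
    {"status": "SUCCESS", "metrics": {"mem": 56, "cpu": 4}},
    {"status": "FAIL",    "metrics": {"mem": 52, "cpu": 45}},
]

def correlated_metrics(requests):
    """Rank metrics by the gap between their mean value in failed
    vs. successful requests."""
    ok = [r for r in requests if r["status"] == "SUCCESS"]
    fail = [r for r in requests if r["status"] == "FAIL"]
    names = requests[0]["metrics"].keys()
    gaps = {m: abs(mean(r["metrics"][m] for r in fail) -
                   mean(r["metrics"][m] for r in ok)) for m in names}
    return sorted(gaps.items(), key=lambda kv: -kv[1])

ranked = correlated_metrics(requests)
# here CPU, not memory, stands out as correlated with the failure
```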
Incorporate Performance Counters (2) • Incorporate performance counters in diagnosis (diagram: indicted path annotated with counters, e.g., Problem1: Node2, Map, High CPU)
Why Does It Work? • Real-world data backs up utility of peer-comparison • Task durations peer-comparable in >75% of jobs [CCGrid’10] • Approach analyzes both successful and failed requests • Analyzing only failed requests might elevate common elements over causal elements • Iterative approach discovers correlated attributes • Identifies problems due to conjunctions of attributes • Filtering step identifies multiple ongoing problems • Handles unencountered problems • Does not rely on historical models of normal behavior • Does not rely on signatures of known defects
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work
VoIP: Diagnosis of Real Incidents • 8 out of 10 real incidents diagnosed
VoIP: Case Studies • Incident 1: Chronic due to unsupported fax codec • Chronic nightly problem: failed calls for two customers • Customers stop using unsupported codec • Incident 2: Chronic server problem • Unrelated chronic server problem emerges: failed calls for server • Server reset
Implementation of Approach • Draco: Deployment in Production at AT&T • Example ranked output: 1. Problem1 STOP.IP-TO-PS.487.3 STOP.IP-TO-PSTN.41.0.-.- Chicago*GSXServers MemoryOverload 2. Problem2 STOP.IP-TO-PSTN.102.0.102.102 ServiceB CustomerAcme IP_w.x.y.z • ~8500 lines of C code
VoIP: Ranking Multiple Problems Draco performs better at ranking multiple independent problems
VoIP: Performance of Algorithm Running on 16-core Xeon (@ 2.4GHz), 24 GB Memory
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work
Hadoop: Target Clusters • 10- to 100-node Amazon EC2 clusters • Commercial, pay-as-you-use cloud-computing resource • Workloads under our control, problems injected by us • gridmix, nutch, sort, random writer • Can harvest logs and OS data of only our workloads • 4000-processor M45 & 64-node Opencloud clusters • Production environment • Offered to CMU as free cloud-computing resource • Diverse kinds of real workloads, problems in the wild • Massive machine-learning, language/machine-translation • Permission to harvest all logs and OS data
Hadoop: EC2 Fault Injection Injected fault on single node
Hadoop: Peer-comparison Results Without Causal Flows (true-positive rates per metric) • Different metrics detect different problems • Correlated problems (e.g., packet-loss) harder to localize
Hadoop: Peer-comparison Results With Causal Flows + Localization Correlated problems correctly identified
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work
Critique of Approach • Anomaly detection thresholds are fragile • Need to use statistical tests • Anomaly detection does not address problems at master • Peer-groups are defined statically • Assumes homogeneous clusters • Need to automate identification of peers • False positives occur if root-cause not in logs • Algorithm tends to implicate adjacent network elements • Need to incorporate more data to improve visibility
Related Work • Chronics fly under the radar • Undetected by alarm mining [Mahimkar09] • Chronics can persist undetected for long periods of time • Hard to detect using change-points [Kandula09] • Hard to demarcate problem periods [Sambasivan11] • Multiple ongoing problems at a time • Single fault assumption inadequate [Cohen05, Bodik10] • Peer-comparison on its own inadequate • Hard to localize propagating problems [Kasick10, Tan10, Kang10]
Outline • Motivation • Thesis Statement • Approach • End-to-end trace construction • Anomaly detection • Localization • Evaluation • VoIP • Hadoop • Critique & Related Work • Pending Work