A Path-based Approach to Managing Failures and Evolution

Mike Chen, Anthony Accardi (1), Emre Kıcıman, Jim Lloyd (2), Dave Patterson, Armando Fox, Eric Brewer. Affiliations: UC Berkeley, Tellme (1), Stanford Univ., eBay (2)

Need for Fast Recovery • Failures are common and costly • Daily partial site outages for large sites • Downtime: $300K to $6 million/hr • Challenges: • Lots of potential sources of faults • Multiple independent faults • Distributed runtime behavior (e.g. load balancing) • Observation: very short outages are “free” • Cost of downtime is not linear

Need for Rapid Evolution • Competition drives demand for new features and bug fixes • Switching cost is low. • Single administrative domain lowers upgrade barrier. • Challenges: • Short release cycles • Weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes. • Distributed runtime behavior • Observation: trend towards application server frameworks • E.g. J2EE, .NET, etc.

Current Approaches to Understand Systems • 2 extremes of granularity • Problems: • Dispersed execution context • Local context often insufficient • “Blackbox” components [Figure: granularity axis from the external (end to end) view of eBay to the “micro” view, e.g. code-level debuggers showing X = 3, Y = true]

“Macro” Approach • Captures the relationship between components and their aggregate behavior • Complements both end-to-end tools and “micro” analysis tools [Figure: the “macro” view of eBay's web servers (WS), app servers (App), and DB, shown between the external (end to end) view and the “micro” view, e.g. code-level debuggers]

First Step: Path-based Analysis • Paths record runtime properties of requests • components used (name, version, etc.) • timestamps • Two principles • Use paths as the core abstraction • Apply statistical analysis to a large number of paths • Focus on correctness • In addition to performance (MSR’s Magpie, HP’s WebMon and Project 5) • Example request path: 1. Web A, t = 1; 2. App A, t = 23; 3. App B, t = 30; 4. DB B, t = 56; … [Figure: a request path through Web A/B, App A/B/C, and DB A/B]
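A path can be pictured as an ordered list of observations for one request. The sketch below is only an illustration: the class and field names (`Observation`, `Path`, `request_id`) are hypothetical, and the values are the example request path from the slide (Web A at t = 1, App A at t = 23, App B at t = 30, DB B at t = 56).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Observation:
    component: str            # component name, e.g. "App A"
    timestamp: int            # time at which the request was seen there
    version: str = "unknown"  # hypothetical extra property

@dataclass
class Path:
    request_id: str
    observations: list = field(default_factory=list)

    def shape(self):
        """Components in the order the request visited them."""
        ordered = sorted(self.observations, key=lambda o: o.timestamp)
        return tuple(o.component for o in ordered)

    def interval(self, start, end):
        """Elapsed time between the observations at two components."""
        times = {o.component: o.timestamp for o in self.observations}
        return times[end] - times[start]

path = Path("req-1", [
    Observation("Web A", 1),
    Observation("App A", 23),
    Observation("App B", 30),
    Observation("DB B", 56),
])
print(path.shape())
print(path.interval("App A", "DB B"))
```

Keeping the raw observations, rather than only a summary, is what lets later analyses ask about both the shape of a path and the time spent between any two points on it.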

Architecture • Observation includes: • Component/resource names, version, … • Timestamps • Application-generic tracing • By instrumenting the application servers • E.g. < 1K lines for JBoss, a J2EE app server • Request-centric • Associate system events to user-visible events • Performance overhead • 1-3% for eBay [Figure: architecture diagram with Tracers on the Web, App, and DB tiers reporting observations per request, an Aggregator, path Storage with a Query interface, and Analysis Engines (Detection, Diagnosis, Viz) used by Ops/QA/Dev]
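The request-centric idea can be sketched as an aggregation step: tracers emit observations tagged with a request identifier, and the aggregator groups them into one path per request. The tuple layout and the function name `aggregate` are assumptions made for this example, not the interface of any of the deployed systems.

```python
from collections import defaultdict

def aggregate(observations):
    """Group (request_id, component, timestamp) tuples into per-request paths.

    Observations may arrive interleaved across requests; each path is
    returned as the list of components ordered by timestamp.
    """
    by_request = defaultdict(list)
    for request_id, component, timestamp in observations:
        by_request[request_id].append((timestamp, component))
    return {rid: [component for _, component in sorted(obs)]
            for rid, obs in by_request.items()}

# Hypothetical interleaved stream from several tracers.
stream = [
    ("r1", "Web A", 1),
    ("r2", "Web B", 2),
    ("r1", "App A", 23),
    ("r2", "App C", 9),
    ("r1", "DB B", 56),
]
paths = aggregate(stream)
print(paths)
```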

3 Path-based Frameworks • eBay Stats • 1TB raw logs/day (150GB gzipped), 200Mbps peak • 2K app servers, 40 SuperCAL machines

Talk Outline • Motivation and Approach • Failure Management • Failure detection via path anomalies • Failure diagnosis using machine learning methods • Evolution Management • Application-generic dependency tracking • Detecting and diagnosing changes • Conclusions

Failure Management • Goal: minimize impact of failures • User-visible failures => $$$ lost • 78% of recovery time is spent on detection and diagnosis [Figure: timeline after a failure showing Detection, Diagnosis, Impact Analysis, Repair, Recovery, and Feedback, with 78% marked]

Fast Recovery Challenges • Many potential causes of failures • SW bugs, hardware, configuration, network, DB, … • Multiple independent failures • Lots of data • Many small, but tolerable failures • Real-time detection/diagnosis • Root cause might not be captured in logs • Tradeoff between logging granularity and overhead • Observation: exact root cause may not be required for many recovery techniques

Failure Detection Concepts • Path collisions • Incomplete paths interrupted by other requests • Structural anomalies • Learn a set of “good” paths, and flag unseen paths • Extended to use probabilistic models [Figure: requests flowing through Web A/B, App A/B/C, and DB A/B]
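The simplest form of structural anomaly detection, learning a set of “good” paths and flagging unseen ones, can be sketched as a set-membership check. The function names and the training paths below are hypothetical.

```python
def learn_good_shapes(good_paths):
    """Remember every path shape seen during known-good operation."""
    return {tuple(path) for path in good_paths}

def is_anomalous(path, good_shapes):
    """A path is flagged when its shape was never seen in training."""
    return tuple(path) not in good_shapes

training = [
    ["Web A", "App A", "DB A"],
    ["Web B", "App B", "DB B"],
]
good = learn_good_shapes(training)
print(is_anomalous(["Web A", "App A", "DB A"], good))
print(is_anomalous(["Web A", "App A"], good))
```

An exact-match check like this flags every rare but legitimate shape, which is one motivation for the probabilistic extension.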

Structural Anomalies in Path Shapes • Probabilistic Context Free Grammar (PCFG) • Represents likely calls made by each component • Learn probabilities of rules based on observed paths • Anomalous path shapes • Score a path by calculating the deviations of P(observed calls) from average. • Detected 90% of injected faults in our experiments [Figure: sample paths over components A, B, and C, and the PCFG learned from them, with rule probabilities of p = 1 and p = .5]
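A rough sketch of the PCFG idea follows. Each grammar rule maps a component to the tuple of components it calls, and rule probabilities are estimated by counting. The scoring function is one possible reading of “deviation from average”, chosen for this example: a rule is penalized by how far its probability falls below the uniform average 1/n over the n rules learned for the same caller. The tree encoding and all names are assumptions.

```python
from collections import Counter, defaultdict

# A path shape is a call tree: (component, [child trees]).
def rules(tree):
    """Yield one (caller, callees) rule per node of the call tree."""
    component, children = tree
    yield component, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)

def learn_pcfg(trees):
    """Estimate P(callees | caller) from observed call trees."""
    counts = defaultdict(Counter)
    for tree in trees:
        for lhs, rhs in rules(tree):
            counts[lhs][rhs] += 1
    return {lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
            for lhs, c in counts.items()}

def anomaly_score(pcfg, tree):
    """Sum, over the rules used, of how far each falls below average."""
    score = 0.0
    for lhs, rhs in rules(tree):
        options = pcfg.get(lhs, {})
        # Unknown callers are treated as maximally surprising (assumption).
        average = 1.0 / len(options) if options else 1.0
        score += max(0.0, average - options.get(rhs, 0.0))
    return score

normal = ("Web", [("App", [("DB", [])])])
cached = ("Web", [("App", [])])           # App answers without calling DB
pcfg = learn_pcfg([normal, normal, normal, cached])
print(anomaly_score(pcfg, normal))
print(anomaly_score(pcfg, cached))
print(anomaly_score(pcfg, ("Web", [("DB", [])])))  # shape never observed
```

Common shapes score zero, rarer shapes score a little, and shapes using a rule that was never observed score highest.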

Failure Diagnosis Concepts • Idea: all bad paths touch the root cause • Look for path properties common to failed requests • E.g. components used in all failed paths • Extended to use probabilistic models • Limitation: • Inter-path dependency [Figure: requests through Web A/B, App A/B/C, and DB A/B, with App B labeled]
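The basic diagnosis idea, components used in all failed paths, reduces to a set intersection. This is a minimal sketch with hypothetical path data; the function name is an assumption.

```python
def common_components(failed_paths):
    """Components that appear in every failed path."""
    component_sets = [set(path) for path in failed_paths]
    if not component_sets:
        return set()
    return set.intersection(*component_sets)

failed = [
    ["Web A", "App B", "DB A"],
    ["Web B", "App B", "DB B"],
    ["Web A", "App B", "DB B"],
]
print(common_components(failed))
```

With multiple independent faults the intersection can come back empty, which is one reason to move to probabilistic models.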

Failure Diagnosis • Summarize each path into a set of features • What features of requests correlate with failures (e.g. NullPointerException)? • Request type, name, pool, host, version, DB, or a combination of these? • Different causes require different recovery techniques

Statistical Learning Techniques • Cast as feature selection problem in machine learning • Use decision trees because results are easily interpretable • Learn the tree from data (with failed paths) • The edges that lead to failed nodes are the candidates [Figure: table of paths with features (Machine: X, Y; Request Name: Login, Respond, ViewFeedback, MyFeedback, Borrow) and a class label (Success, Null-Pointer, Time-out), with the learned decision tree. Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond]
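To illustrate how a decision tree yields a diagnosis, here is a small ID3-style learner written from scratch. It is not C4.5 and not the code used in the talk. It splits on the feature that leaves the least label entropy and reports the feature combinations on the edges leading to failed leaves. The six rows are made-up data, arranged so that the result matches the diagnosis on the slide.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())

def best_feature(rows, features):
    """Pick the feature whose split leaves the lowest weighted entropy."""
    def remainder(feature):
        groups = {}
        for feats, label in rows:
            groups.setdefault(feats[feature], []).append(label)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(features, key=remainder)

def failed_leaves(rows, features, conditions=()):
    """Return the edge conditions leading to each leaf dominated by failures."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1 or not features:
        majority = Counter(labels).most_common(1)[0][0]
        return [conditions] if majority != "Success" else []
    feature = best_feature(rows, features)
    remaining = [f for f in features if f != feature]
    found = []
    for value in sorted({feats[feature] for feats, _ in rows}):
        subset = [(f, l) for f, l in rows if f[feature] == value]
        found += failed_leaves(subset, remaining,
                               conditions + ((feature, value),))
    return found

rows = [  # hypothetical paths: (features, class label)
    ({"machine": "X", "request": "Login"}, "Success"),
    ({"machine": "X", "request": "MyFeedback"}, "Null-Pointer"),
    ({"machine": "X", "request": "Respond"}, "Success"),
    ({"machine": "Y", "request": "Login"}, "Success"),
    ({"machine": "Y", "request": "MyFeedback"}, "Success"),
    ({"machine": "Y", "request": "Respond"}, "Time-out"),
]
diagnosis = failed_leaves(rows, ["machine", "request"])
for conditions in diagnosis:
    print(" and ".join(f"{k}={v}" for k, v in conditions))
```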

Diagnosis Results of Decision Trees • Recall vs precision tradeoff • Recall: % of true faults identified • Precision: 1 – false positive rate • Decision trees • C4.5 w/ adaptation: a standard decision tree algorithm • MinEntropy: a greedy variant that finds one leaf with the most failures • Actual results from eBay deployment • Association rules: data mining algorithm that computes the conditional probabilities for all combinations of features [Figure: recall vs. precision plot for the algorithms, with the “perfect” point marked]

Talk Outline • Motivation and Approach • Failure Management • Evolution Management • Application-generic dependency tracking • Detecting and diagnosing expected and unexpected changes • Conclusions

Tracking Dependency • Current approaches • Manual approaches are error-prone and slow • Static analysis captures possible system behavior, whereas runtime analysis captures the actual behavior • Paths directly capture application structure • Application-generic tracking of actual dependency • Zero changes to applications [Figure: Rubis, a J2EE auction application, hosted on Pinpoint/JBoss]
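Since each path lists the components a request actually used, a dependency table can be derived by folding observed paths together. The sketch below assumes paths arrive as (request type, components) pairs; the request and component names are invented for the example and are not taken from Rubis.

```python
from collections import defaultdict

def dependency_table(paths):
    """Map each request type to the set of components it was seen using."""
    deps = defaultdict(set)
    for request_type, components in paths:
        deps[request_type].update(components)
    return dict(deps)

observed = [  # hypothetical (request type, components on the path)
    ("ViewItem", ["WebServlet", "ItemBean", "ItemsTable"]),
    ("PlaceBid", ["WebServlet", "BidBean", "BidsTable"]),
    ("ViewItem", ["WebServlet", "ItemBean", "BidBean", "BidsTable"]),
]
table = dependency_table(observed)
for request_type in sorted(table):
    print(request_type, sorted(table[request_type]))
```

The table only contains dependencies exercised by real traffic, which is the difference from static analysis noted on the slide.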

Automatically Derived State Dependency • Paths associate requests with internal state • Coupling of requests through shared state • Easily extended to track fine-grained (e.g. row-level) state sharing [Figure: requests and the state they read (R) or write (W) for PetStore, a J2EE e-commerce application, hosted on Pinpoint/JBoss]
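Coupling through shared state can be derived from the same observations once each one records which state a request read or wrote. In this sketch two request types count as coupled when they touch the same table and at least one of them writes it. That definition, the tuple layout, and the request and table names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def state_access(observations):
    """Build table -> request type -> set of access modes ('R', 'W')."""
    access = defaultdict(lambda: defaultdict(set))
    for request, table, mode in observations:
        access[table][request].add(mode)
    return access

def coupled_requests(observations):
    """Pairs of request types that share a table which one of them writes."""
    coupled = set()
    for table, users in state_access(observations).items():
        for a, b in combinations(sorted(users), 2):
            if "W" in users[a] or "W" in users[b]:
                coupled.add((a, b, table))
    return coupled

observations = [  # hypothetical (request type, table, mode)
    ("AddToCart", "cart", "W"),
    ("ViewCart", "cart", "R"),
    ("Browse", "catalog", "R"),
    ("Search", "catalog", "R"),
]
print(coupled_requests(observations))
```

Replacing the table name with a (table, row key) pair would give the row-level tracking mentioned on the slide.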

Detecting/Diagnosing Changes • Paths provide a flexible mechanism to profile any sub-path • Take the interval between any two observations • Drill down to identify problematic sub-paths • Statistical analysis simultaneously examines thousands of sub-paths • Use non-parametric tests (e.g. Mann-Whitney) • Thousands of sub-paths tested for every Tellme release [Figure: a path drawn as a sequence of observations (obs)]
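As an illustration of the non-parametric test, the following computes the Mann-Whitney U statistic for the latencies of one sub-path under two releases, with a normal-approximation z-score. It is a simplified sketch: ties get average ranks, but the variance carries no tie correction, and the latency samples are invented.

```python
import math

def mann_whitney_u(xs, ys):
    """Return (U, z) for sample xs against sample ys.

    U counts, with ties as one half, the pairs where a value from xs
    exceeds a value from ys; z is the normal approximation of U.
    """
    combined = sorted((v, i) for i, v in enumerate(list(xs) + list(ys)))
    ranks = [0.0] * len(combined)
    pos = 0
    while pos < len(combined):
        end = pos
        while end + 1 < len(combined) and combined[end + 1][0] == combined[pos][0]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[combined[k][1]] = average_rank
        pos = end + 1
    n1, n2 = len(xs), len(ys)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean) / sd

old_release = [10, 11, 12, 13, 14]   # hypothetical sub-path latencies
new_release = [20, 21, 22, 23, 24]
u, z = mann_whitney_u(old_release, new_release)
print(u, round(z, 2))
```

Because the test uses only ranks, it makes no assumption about the latency distribution. When thousands of sub-paths are tested at once, the cutoff on |z| has to be strict enough to keep false alarms manageable.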

Detecting/Diagnosing App-level Changes • Paths enable simultaneous testing of many sub-paths • Drill down to diagnose specific slow sub-paths [Figure: box plots (outliers, upper quartile, median, lower quartile) of 2 versions of a Tellme application; change detected in 1 sub-path in 1 application]

Detecting/Diagnosing App-level Changes • Paths enable simultaneous testing of many sub-paths • Drill down to diagnose specific slow sub-paths [Figure: box plots (outliers, upper quartile, median, lower quartile) of 2 versions of 2 Tellme applications and 3 sub-paths, annotated “No changes”]

Detecting/Diagnosing App-level Changes • Paths enable simultaneous testing of many sub-paths • Drill down to diagnose specific slow sub-paths [Figure: 3 versions of 2 Tellme applications and 3 sub-paths, annotated “App fixed”]

Detecting/Diagnosing Platform Changes • Look for consistent deviation across applications [Figure: 2 versions of a Tellme platform; change detected in 1 sub-path in 1 application]

Detecting/Diagnosing Platform Changes • Look for consistent deviation across applications [Figure: 2 versions of a Tellme platform; consistent changes across all apps]

Detecting/Diagnosing Platform Changes • Look for consistent deviation across applications [Figure: 3 versions of a Tellme platform, annotated “platform fixed”]

Lessons Learned • Separate the path analysis logic from observation instrumentation • Improves maintainability and extensibility • Data is cheap • Allows the use of simple statistical algorithms • Live workload • Important to support online use of tools • Record “attempts” • Failed components/resources may not record observations properly

Summary • Paths + statistical analysis: • Improves failure detection and diagnosis to support fast recovery. • Automates dependency tracking and change analysis to support rapid and correct evolution. • Deployed and evaluated on real systems • Pinpoint, Tellme, and eBay • Future work: • Wide-area systems and systems that span multiple administrative domains

Thank You • Acknowledgements • Berkeley/Stanford ROC Research Group • Professor Michael Jordan and Alice Zheng • Shepherd Miguel Castro and anonymous reviewers • For more info: • Google, Yahoo, or MSN Search for Mike Chen

Backup Slides

Recovery Time Saving • Expected Time Saved = E(Manual Diag. + Recovery) – E(Automated & Manual Diag. + Recovery) • Use diagnosis time based on experience: • Diagnosis time: Automated = 1 min, Manual (perfect) = 15 min • Recovery time (w/ verification) = 5 min • $50K to $1 million saved [Figure: time saved (min) vs. noise filtering threshold]
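The formula can be made concrete with the times given on the slide. The model below is an assumption, not the one behind the reported savings: automated diagnosis is tried first and is right with probability `p_correct`; when it is wrong, its recovery attempt is wasted and a full manual diagnosis and recovery follow.

```python
MANUAL_DIAG = 15  # minutes, manual (perfect) diagnosis
AUTO_DIAG = 1     # minutes, automated diagnosis
RECOVERY = 5      # minutes, recovery with verification

def expected_time_saved(p_correct):
    """Expected minutes saved per failure under the assumed model."""
    manual_only = MANUAL_DIAG + RECOVERY
    automated_first = (AUTO_DIAG + RECOVERY
                       + (1 - p_correct) * (MANUAL_DIAG + RECOVERY))
    return manual_only - automated_first

for p in (1.0, 0.5, 0.0):
    print(p, expected_time_saved(p))
```

Under this model the saving is 20·p − 6 minutes, so automated diagnosis pays off only when it is correct more than 30% of the time.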

Show eBay’s Complex System Diagram • Show a few path examples

Failure Management Process • Detection • Isolation • Diagnosis • Impact Analysis • Repair • Feedback

MinEntropy • Entropy measures the randomness of data • E.g. if failures are evenly distributed (very random), then entropy is high • Rank features by the normalized entropy • E.g. if root cause is a machine failure, then entropy will be low in the host dimension. Since all types of requests will fail on that host, the entropy in the request type dimension will be higher. • Implemented at eBay • Greedy approach searches for the leaf node with most failures • Pros: fast (<1s for 100K txns and scales linearly) • Cons: • Optimized for single faults • Features may not be independent (e.g. pool and host)
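The ranking step can be sketched as follows. For each feature, the entropy of the failed transactions' values is divided by the log of the number of distinct values that feature takes, which is one possible normalization chosen for this example. This is not eBay's implementation, it omits the greedy leaf search, and the field names and transactions are invented.

```python
import math
from collections import Counter

def normalized_entropy(failed_values, n_possible):
    """Entropy of the failure distribution, scaled to the range [0, 1]."""
    if n_possible < 2:
        return 0.0
    total = len(failed_values)
    h = -sum(n / total * math.log2(n / total)
             for n in Counter(failed_values).values())
    return h / math.log2(n_possible)

def rank_features(txns, features):
    """Features ordered from most to least concentrated failures."""
    failed = [t for t in txns if t["status"] != "ok"]
    def score(feature):
        n_possible = len({t[feature] for t in txns})
        return normalized_entropy([t[feature] for t in failed], n_possible)
    return sorted(features, key=score)

# Hypothetical workload: every request type fails, but only on host h1.
txns = [{"host": h, "type": ty, "status": "failed" if h == "h1" else "ok"}
        for h in ("h1", "h2", "h3")
        for ty in ("login", "bid", "search")]
print(rank_features(txns, ["type", "host"]))
```

The host dimension ranks first because all failures fall on one host, while they are spread evenly over request types.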

MinEntropy example Alert: Build E293 causing URL error storm (not specific to any URL) in pool CGI1

Association Rules • Data mining technique to compute item sets • e.g. Shoppers who bought this item also shopped for … • Metrics • Confidence: (# of A & B) / # of A • Conditional probability of B given A • Support: (# of A & B) / total # of txns • Generates rules for all possible sets • e.g. machine=abc, txn=login => status=NullPointerException (conf: 0.1, support: 0.02) • Applied to failure diagnosis • Find all rules that have failed status on the right • Pros: looks at combinations of features • Cons: generates many rules
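The two metrics follow directly from their definitions. In this sketch each transaction is a set of `feature=value` strings; the four transactions are invented and much smaller than any real log.

```python
def support(txns, items):
    """Fraction of all transactions that contain every item in `items`."""
    return sum(items <= t for t in txns) / len(txns)

def confidence(txns, lhs, rhs):
    """Conditional probability of rhs given lhs: (# of A & B) / (# of A)."""
    n_lhs = sum(lhs <= t for t in txns)
    if n_lhs == 0:
        return 0.0
    return sum((lhs | rhs) <= t for t in txns) / n_lhs

txns = [  # hypothetical transactions
    {"machine=abc", "txn=login", "status=NullPointerException"},
    {"machine=abc", "txn=login", "status=ok"},
    {"machine=abc", "txn=bid", "status=ok"},
    {"machine=xyz", "txn=login", "status=ok"},
]
lhs = {"machine=abc", "txn=login"}
rhs = {"status=NullPointerException"}
print(confidence(txns, lhs, rhs), support(txns, lhs | rhs))
```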

Adapting Association Rules • Sample output (rules containing failures):
  TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
• Rank by the size of item sets if support and conf are equal • Smallest item set: TxnName = LeaveFeedback

Failure Management • Goal: minimize impact of failure • User-visible failures => $$$ lost • Fast failure detection and diagnosis are critical to availability • 78% of recovery time is spent on detection and diagnosis [Figure: timeline after a failure showing Detection, Diagnosis, Impact Analysis, Repair, Recovery, and Feedback]

PCFG Thresholding • Set a threshold for declaring anomalies • Static threshold: any request scoring above the 99th or 99.5th percentile • Dynamic threshold: when proportions don't match known good.
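The static threshold can be sketched as a percentile cutoff taken over anomaly scores from known-good traffic. The nearest-rank percentile definition and the function names are choices made for this example, and the scores are synthetic.

```python
import math

def percentile_threshold(good_scores, pct=99.0):
    """Nearest-rank percentile of the scores seen on known-good traffic."""
    ordered = sorted(good_scores)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def anomalies(scores, threshold):
    """Indices of requests whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

good_scores = list(range(1, 101))  # synthetic scores 1..100
threshold = percentile_threshold(good_scores, 99.0)
print(threshold, anomalies([50, 99, 100, 150], threshold))
```

By construction a 99th-percentile cutoff flags about 1% of normal requests. The dynamic variant on the slide instead reacts when the proportions no longer match known-good behavior.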

Failure Diagnosis Experiments • Data set • 10 one-minute traces, 4 with 2 independent faults • total of 14 independent faults • About 1/8 of the whole site (640 potential single-faults) • Metrics • Recall: % of true faults identified = (# of identified faults) / (# of true faults) • Precision: 1 – false positive rate = (# of identified faults) / (# of predicted faults)
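The two metrics can be computed directly from the sets of true and predicted faults. The fault labels below are invented.

```python
def recall_precision(true_faults, predicted_faults):
    """Recall = identified / true; precision = identified / predicted."""
    identified = true_faults & predicted_faults
    recall = len(identified) / len(true_faults) if true_faults else 0.0
    precision = (len(identified) / len(predicted_faults)
                 if predicted_faults else 0.0)
    return recall, precision

true_faults = {"host=X", "pool=cgi1"}                         # hypothetical
predicted = {"host=X", "db=D", "txn=login", "pool=cgi9"}      # hypothetical
print(recall_precision(true_faults, predicted))
```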

eBay’s Site • 2 physical tiers • Web server/app server + DB • Apps in both Java (WebSphere) and C++ • SuperCAL (Centralized Application Logging) • API for app developer to log anything to CAL • Platform logs common path features: cookie, host, URL, DB table(s), status, etc. • Stats • 1TB raw logs/day (150GB gzipped), 200Mbps peak • 2K app servers, 40 SuperCAL machines • How to diagnose accurately and efficiently?
