A Path-based Approach to Managing Failures and Evolution • Mike Chen, Anthony Accardi¹, Emre Kıcıman, Jim Lloyd², Dave Patterson, Armando Fox, Eric Brewer • UC Berkeley, ¹Tellme, Stanford Univ., ²eBay
Need for Fast Recovery • Failures are common and costly • Daily partial site outages for large sites. • Downtime: $300K to $6 million/hr. • Challenges: • Lots of potential sources of faults. • Multiple independent faults. • Distributed runtime behavior (e.g. load balancing). • Observation: very short outages are “free” • Cost of downtime is not linear.
Need for Rapid Evolution • Competition drives demand for new features and bug fixes • Switching cost is low. • Single administrative domain lowers upgrade barrier. • Challenges: • Short release cycles • Weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes. • Distributed runtime behavior • Observation: trend towards application server frameworks • E.g. J2EE, .NET, etc.
Current Approaches to Understand Systems • 2 extremes of granularity • Problems: • Dispersed execution context • Local context often insufficient • “Blackbox” components [Figure: granularity spectrum for eBay, from the external (end to end) view to the “micro” view, e.g. code-level debuggers (X = 3, Y = true)]
“Macro” Approach • Captures the relationship between components and their aggregate behavior • Complements both end-to-end tools and “micro” analysis tools. [Figure: eBay site with web servers (WS), application servers (App), and DB; the “macro” view sits between the external (end to end) view and the “micro” view, e.g. code-level debuggers (X = 3, Y = true)]
First Step: Path-based Analysis • Paths record runtime properties of requests • components used (name, version, etc.) • timestamps • Two principles • Use paths as the core abstraction • Apply statistical analysis to a large number of paths • Focus on correctness • In addition to performance (MSR’s Magpie, HP’s WebMon and Project 5) [Figure: request path through Web A/B, App A/B/C, DB A/B: 1. Web A, t = 1; 2. App A, t = 23; 3. App B, t = 30; 4. DB B, t = 56; …]
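A minimal sketch of how a path could be represented as a data structure, assuming a path is an ordered list of observations that each carry a component name, a version, and a timestamp. The class names, field names, the request id, and the "v1" version label are hypothetical; the component names and timestamps follow the request-path figure on the slide (Web A at t = 1, App A at t = 23, App B at t = 30, DB B at t = 56).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    component: str   # e.g. "App A"
    version: str     # hypothetical version label
    timestamp: int   # same time units as the slide's "t = ..."


@dataclass
class Path:
    request_id: str  # hypothetical identifier tying observations to one request
    observations: List[Observation] = field(default_factory=list)

    def components(self) -> List[str]:
        """Ordered list of components the request touched."""
        return [o.component for o in self.observations]

    def latency(self) -> int:
        """Elapsed time between the first and last observation."""
        return self.observations[-1].timestamp - self.observations[0].timestamp


example = Path("req-1", [
    Observation("Web A", "v1", 1),
    Observation("App A", "v1", 23),
    Observation("App B", "v1", 30),
    Observation("DB B", "v1", 56),
])
print(example.components(), example.latency())
```

Keeping each path as a plain ordered record like this is one way to let later analyses (shape anomalies, sub-path timing) work on the same data.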
Architecture • Observation includes: • Component/resource names, version, … • Timestamps • Application-generic tracing • By instrumenting the application servers • E.g. < 1K lines for JBoss, a J2EE app server • Request-centric • Associate system events to user-visible events • Performance overhead • 1-3% for eBay [Figure: architecture diagram with labels: request, Tracer (on Web, App, and DB components), observation, Aggregator, Storage, Path Query interface, Analysis Engines (Detection, Diagnosis, Viz), Ops/QA/Dev]
3 Path-based Frameworks • eBay Stats • 1TB raw logs/day (150GB gzipped), 200Mbps peak • 2K app servers, 40 SuperCAL machines
Talk Outline • Motivation and Approach • Failure Management • Failure detection via path anomalies • Failure diagnosis using machine learning methods • Evolution Management • Application-generic dependency tracking • Detecting and diagnosing changes • Conclusions
Failure Management • Goal: minimize impact of failures • User-visible failures => $$$ lost • 78% of recovery time is spent on detection and diagnosis [Figure: timeline starting at the failure, with stages Detection, Diagnosis, Repair, Recovery, Impact Analysis, and Feedback; 78% marked]
Fast Recovery Challenges • Many potential causes of failures • SW bugs, hardware, configuration, network, DB, … • Multiple independent failures • Lots of data • Many small, but tolerable failures • Real-time detection/diagnosis • Root cause might not be captured in logs • Tradeoff between logging granularity and overhead • Observation: exact root cause may not be required for many recovery techniques
Failure Detection Concepts • Path collisions • Incomplete paths interrupted by other requests. • Structural anomalies • Learn a set of “good” paths, and flag unseen paths. • Extended to use probabilistic models. [Figure: requests flowing through Web A/B, App A/B/C, DB A/B]
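The non-probabilistic form of structural anomaly detection can be sketched as a set lookup. This is an illustration under the assumption that a path shape is just its ordered component sequence; the function names and the training paths are hypothetical.

```python
def learn_good_shapes(paths):
    """Collect the distinct component sequences seen in known-good paths."""
    return {tuple(p) for p in paths}


def is_structural_anomaly(path, good_shapes):
    """Flag a path whose component sequence was never seen during training."""
    return tuple(path) not in good_shapes


good = learn_good_shapes([
    ["Web A", "App A", "DB A"],
    ["Web B", "App B", "DB B"],
])
print(is_structural_anomaly(["Web A", "App C"], good))
```

An exact-match set flags every unseen shape, including rare but legitimate ones, which is one motivation for the probabilistic extension mentioned on the slide.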
Structural Anomalies in Path Shapes • Probabilistic Context Free Grammar (PCFG) • Represents likely calls made by each component • Learn probabilities of rules based on observed paths • Anomalous path shapes • Score a path by calculating the deviations of P(observed calls) from average. • Detected 90% of injected faults in our experiments [Figure: sample paths over components A, B, C and the learned PCFG, with rule probabilities p = 1 and p = .5]
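A rough sketch of the PCFG idea, assuming each path is a call tree and each grammar rule maps a caller to the tuple of components it calls. The scoring function here (mean negative log-probability of the rules used, with a small floor for unseen rules) is a simplification chosen for illustration and is not the deviation-from-average score on the slide. The tree encoding, function names, and `floor` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def rules(tree):
    """Yield (caller, tuple_of_callees) for every node of a call tree.

    A tree is (component_name, [child_trees]).
    """
    name, children = tree
    yield name, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)


def learn(trees):
    """Estimate P(callees | caller) from observed call trees."""
    counts = defaultdict(Counter)
    for tree in trees:
        for caller, callees in rules(tree):
            counts[caller][callees] += 1
    return {caller: {callees: n / sum(c.values()) for callees, n in c.items()}
            for caller, c in counts.items()}


def score(tree, grammar, floor=1e-6):
    """Mean negative log-probability of the rules in a tree (higher = odder)."""
    logps = [-math.log(grammar.get(caller, {}).get(callees, floor))
             for caller, callees in rules(tree)]
    return sum(logps) / len(logps)


t1 = ("A", [("B", [("C", [])])])     # A calls B, B calls C
t2 = ("A", [("B", []), ("C", [])])   # A calls B and C directly
grammar = learn([t1, t2])
odd = ("A", [("C", [])])             # A calls only C: rule never observed
print(score(t1, grammar), score(odd, grammar))
```

With these two training trees, each of A's two observed rules gets probability 0.5, and the unseen shape scores much higher than a trained one.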
Failure Diagnosis Concepts • Idea: all bad paths touch the root cause • Look for path properties common to failed requests • E.g. components used in all failed paths • Extended to use probabilistic models. • Limitation: • Inter-path dependency [Figure: requests flowing through Web A/B, App A/B/C, DB A/B, with App B highlighted]
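The basic diagnosis idea (components used in all failed paths) can be sketched as a set intersection. The function name and the sample paths are hypothetical; the component names follow the slide's figure.

```python
def common_to_failures(paths):
    """Return components present in every failed path.

    `paths` is a list of (components, failed) pairs.
    """
    failed = [set(components) for components, is_failed in paths if is_failed]
    if not failed:
        return set()
    return set.intersection(*failed)


observed = [
    (["Web A", "App B", "DB A"], True),
    (["Web B", "App B", "DB B"], True),
    (["Web A", "App A", "DB A"], False),
]
print(common_to_failures(observed))
```

A strict intersection breaks down as soon as two independent faults are active at once, which is one reason to move to the probabilistic models the slide mentions.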
Failure Diagnosis • Summarize each path into features • What features of requests correlate with failures (e.g. NullPointerException)? • Request type, name, pool, host, version, DB, or a combination of these? • Different causes require different recovery techniques
Borrow Statistical Learning Techniques • Cast as a feature selection problem in machine learning • Use decision trees because results are easily interpretable • Learn the tree from data (with failed paths) • The edges that lead to failed nodes are the candidates • Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond [Figure: decision tree over the features Machine (X, Y) and Request Name (MyFeedback, Login, Respond, ViewFeedback), with class labels Success, Null-Pointer, and Time-out]
Diagnosis Results of Decision Trees • Recall vs precision tradeoff • Recall: % of true faults identified • Precision: 1 – false positive rate • Decision trees • C4.5 w/ adaptation: a standard decision tree algorithm • MinEntropy: a greedy variant that finds one leaf with the most failures • Actual results from eBay deployment • Association rules: data mining algorithm that computes the conditional probabilities for all combinations of features [Figure: recall vs. precision plot, with the “perfect” point marked]
Talk Outline • Motivation and Approach • Failure Management • Evolution Management • Application-generic dependency tracking • Detecting and diagnosing expected and unexpected changes • Conclusions
Tracking Dependency • Current approaches • Manual approaches are error-prone and slow • Static analysis captures possible system behavior, whereas runtime analysis captures the actual behavior • Paths directly capture application structure • Application-generic tracking of actual dependency • Zero changes to applications [Figure: Rubis, a J2EE auction application, hosted on Pinpoint/JBoss]
Automatically Derived State Dependency • Paths associate requests with internal state • Coupling of requests through shared state • Easily extended to track fine-grained (e.g. row-level) state sharing [Figure: PetStore, a J2EE e-commerce application, hosted on Pinpoint/JBoss; axis label: Requests; legend: R = read, W = write]
Detecting/Diagnosing Changes • Paths provide a flexible mechanism to profile any sub-path • Take the interval between any two observations • Drill down to identify problematic sub-paths • Statistical analysis simultaneously examines thousands of sub-paths • Use non-parametric tests (e.g. Mann-Whitney) • Thousands of sub-paths tested for every Tellme release [Figure: a path with observations (obs) marked along it]
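A sketch of comparing one sub-path's latencies across two releases with a Mann-Whitney U test. To stay dependency-free it uses the normal approximation without tie or continuity correction, so the p-value is only approximate for small or heavily tied samples. The function name and the latency samples are hypothetical.

```python
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic for x, approximate two-sided p-value)."""
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1        # average 1-based rank for ties
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u1 - mean) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p


old_release = [10, 11, 12, 13, 14, 15, 16, 17]   # sub-path latencies, old version
new_release = [20, 21, 22, 23, 24, 25, 26, 27]   # same sub-path, new version
u, p = mann_whitney_u(old_release, new_release)
print(u, p)
```

Because the test is rank-based, it makes no assumption about the shape of the latency distribution. When thousands of sub-paths are tested per release, some correction for multiple comparisons would likely be needed; that is not shown here.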
Detecting/Diagnosing App-level Changes • Paths enable simultaneous testing of many sub-paths • Drill down to diagnose specific slow sub-paths [Figure: box plots (outliers, upper quartile, median, lower quartile) for 2 versions of a Tellme application; change detected in 1 sub-path in 1 application]
Detecting/Diagnosing App-level Changes • Paths enable simultaneous testing of many sub-paths • Drill down to diagnose the specific slow sub-paths [Figure: box plots (outliers, upper quartile, median, lower quartile) for 2 versions of 2 Tellme applications and 3 sub-paths; no changes]
Detecting/Diagnosing App-level Changes • Paths enable simultaneous testing of many sub-paths • Drill down to diagnose the specific slow sub-paths [Figure: 3 versions of 2 Tellme applications and 3 sub-paths; app fixed]
Detecting/Diagnosing Platform Changes • Look for consistent deviation across applications [Figure: 2 versions of a Tellme platform; change detected in 1 sub-path in 1 application]
Detecting/Diagnosing Platform Changes • Look for consistent deviation across applications [Figure: 2 versions of a Tellme platform; consistent changes across all apps]
Detecting/Diagnosing Platform Changes • Look for consistent deviation across applications [Figure: 3 versions of a Tellme platform; platform fixed]
Lessons Learned • Separate the path analysis logic from observation instrumentation • Improves maintainability and extensibility • Data is cheap • Allows the use of simple statistical algorithms • Live workload • Important to support online use of tools • Record “attempts” • Failed components/resources may not record observations properly
Summary • Paths + statistical analysis: • Improves failure detection and diagnosis to support fast recovery. • Automates dependency tracking and change analysis to support rapid and correct evolution. • Deployed and evaluated on real systems • Pinpoint, Tellme, and eBay • Future work: • Wide-area systems and systems that span multiple administrative domains
Thank You • Acknowledgements • Berkeley/Stanford ROC Research Group • Professor Michael Jordan and Alice Zheng • Shepherd Miguel Castro and anonymous reviewers • For more info: • Google, Yahoo, or MSN Search for Mike Chen
Recovery Time Saving • Expected Time Saved = E(Manual Diag. + Recovery) – E(Automated & Manual Diag. + Recovery) • Use diagnosis time based on experience: • Diagnosis time: Automated = 1 min, Manual (perfect) = 15 min • Recovery time (w/ verification) = 5 min • $50K to $1 million saved [Figure: time saved (min) vs. noise filtering threshold]
Show eBay’s Complex System Diagram • Show a few path examples
Failure Management Process • Detection • Isolation • Diagnosis • Impact Analysis • Repair • Feedback
MinEntropy • Entropy measures the randomness of data • E.g. if failure is evenly distributed (very random), then entropy is high • Rank features by the normalized entropy • E.g. if root cause is a machine failure, then entropy will be low in the host dimension. Since all types of requests will fail on that host, the entropy in the request type dimension will be higher. • Implemented at eBay • Greedy approach searches for the leaf node with most failures • Pros: fast (<1s for 100K txns and scales linearly) • Cons: • Optimized for single faults • Features may not be independent (e.g. pool and host)
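A sketch of ranking features by normalized entropy. The slide does not define the normalization, so this assumes the entropy of failures over a feature's values is divided by the log of the number of distinct values that feature takes across all transactions. The function names and the sample transactions are hypothetical, and this is not presented as the eBay implementation.

```python
import math
from collections import Counter


def normalized_entropy(txns, feature):
    """Entropy of failed txns over `feature`, scaled to [0, 1]."""
    distinct = {t[feature] for t in txns}
    failed = [t[feature] for t in txns if t["failed"]]
    if len(distinct) < 2 or not failed:
        return 0.0
    counts = Counter(failed)
    h = -sum(c / len(failed) * math.log(c / len(failed))
             for c in counts.values())
    return h / math.log(len(distinct))


def rank_features(txns, features):
    """Lowest entropy first: failures concentrate in that dimension."""
    return sorted(features, key=lambda f: normalized_entropy(txns, f))


txns = [
    {"host": "h1", "type": "login", "failed": True},
    {"host": "h1", "type": "bid",   "failed": True},
    {"host": "h1", "type": "view",  "failed": True},
    {"host": "h2", "type": "login", "failed": False},
    {"host": "h2", "type": "bid",   "failed": False},
    {"host": "h3", "type": "view",  "failed": False},
]
print(rank_features(txns, ["type", "host"]))
```

In this toy data every failure is on host h1, so the host dimension has zero entropy, while failures are spread evenly over request types, matching the machine-failure example on the slide.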
MinEntropy example Alert: Build E293 causing URL error storm (not specific to any URL) in pool CGI1
Association Rules • Data mining technique to compute item sets • e.g. Shoppers who bought this item also shopped for … • Metrics • Confidence: (# of A & B) / (# of A) • Conditional probability of B given A • Support: (# of A & B) / (total # of txns) • Generates rules for all possible sets • e.g. machine=abc, txn=login => status=NullPointerException (conf: 0.1, support: 0.02) • Applied to failure diagnosis • Find all rules that have a failed status on the right • Pros: looks at combinations of features • Cons: generates many rules
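The two metrics can be computed directly from their definitions. This sketch evaluates a single rule A => B over a list of transactions; the function name and the four sample transactions are hypothetical, and it does not enumerate all item sets the way a full association-rule miner would.

```python
def support_confidence(txns, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs.

    Each txn, lhs, and rhs is a dict of feature -> value.
    """
    n_lhs = sum(1 for t in txns if lhs.items() <= t.items())
    n_both = sum(1 for t in txns
                 if lhs.items() <= t.items() and rhs.items() <= t.items())
    support = n_both / len(txns)                    # (# of A & B) / total
    confidence = n_both / n_lhs if n_lhs else 0.0   # (# of A & B) / (# of A)
    return support, confidence


txns = [
    {"machine": "abc", "txn": "login", "status": "NullPointerException"},
    {"machine": "abc", "txn": "login", "status": "Success"},
    {"machine": "xyz", "txn": "login", "status": "Success"},
    {"machine": "abc", "txn": "bid",   "status": "Success"},
]
rule_lhs = {"machine": "abc", "txn": "login"}
rule_rhs = {"status": "NullPointerException"}
print(support_confidence(txns, rule_lhs, rule_rhs))
```

Here one of the four transactions matches both sides and two match the left side, giving support 0.25 and confidence 0.5.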
Adapting Association Rules • Sample output (rules containing failures):
TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
• Rank by the size of item sets if support and conf are equal • TxnName = LeaveFeedback
Failure Management • Goal: minimize impact of failure • User-visible failures => $$$ lost • Fast failure detection and diagnosis are critical to availability • 78% of recovery time is spent on detection and diagnosis [Figure: timeline starting at the failure, with stages Detection, Diagnosis, Repair, Recovery, Impact Analysis, and Feedback]
PCFG Thresholding • Set a threshold for declaring anomalies • Static threshold: any request > 99th or 99.5th percentile • Dynamic threshold: when proportions don't match known good.
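A sketch of the static threshold, assuming anomaly scores from known-good traffic are available and using the nearest-rank percentile definition (other percentile definitions would give slightly different cutoffs). The function names and the score values are hypothetical; the dynamic threshold is not shown.

```python
import math


def percentile_threshold(good_scores, pct=99.0):
    """Nearest-rank percentile of scores observed on known-good traffic."""
    ordered = sorted(good_scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def flag_anomalies(scores, threshold):
    """Return the scores that exceed the static threshold."""
    return [s for s in scores if s > threshold]


good_scores = list(range(1, 101))            # stand-in for learned-path scores
threshold = percentile_threshold(good_scores, 99.0)
print(threshold, flag_anomalies([50, 99, 100, 250], threshold))
```

Raising the percentile from 99 to 99.5 trades fewer false alarms for a higher chance of missing mild anomalies.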
Failure Diagnosis Experiments • Data set • 10 one-minute traces, 4 with 2 independent faults • total of 14 independent faults • About 1/8 of the whole site (640 potential single-faults) • Metrics • Recall: % of true faults identified = (# of identified faults) / (# of true faults) • Precision: 1 – false positive rate = (# of identified faults) / (# of predicted faults)
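The two metrics can be computed from sets of fault identifiers. In this sketch the function name and the fault labels are hypothetical, and a fault counts as identified when it appears in both the true set and the predicted set.

```python
def recall_precision(true_faults, predicted_faults):
    """Return (recall, precision) for sets of fault identifiers."""
    identified = true_faults & predicted_faults
    recall = len(identified) / len(true_faults)
    precision = len(identified) / len(predicted_faults)
    return recall, precision


true_faults = {"host=h1", "pool=p2", "db=d3", "txn=login"}
predicted = {"host=h1", "pool=p2", "host=h9"}
print(recall_precision(true_faults, predicted))
```

With two of four true faults found and two of three predictions correct, recall is 0.5 and precision is 2/3.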
eBay’s Site • 2 physical tiers • Web server/app server + DB • Apps in both Java (WebSphere) and C++ • SuperCAL (Centralized Application Logging) • API for app developer to log anything to CAL • Platform logs common path features: cookie, host, URL, DB table(s), status, etc. • Stats • 1TB raw logs/day (150GB gzipped), 200Mbps peak • 2K app servers, 40 SuperCAL machines How to diagnose accurately and efficiently???