
Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

The Failure Trace Archive. Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems. Nezih Yigitbasi 1 , Matthieu Gallet 2 , Derrick Kondo 3, Alexandru Iosup 1 , Dick Epema 1. 1 TUDelft , 2 École Normale Supérieure de Lyon , 3 INRIA .


Presentation Transcript


  1. The Failure Trace Archive. Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems. Nezih Yigitbasi (1), Matthieu Gallet (2), Derrick Kondo (3), Alexandru Iosup (1), Dick Epema (1). (1) TU Delft, (2) École Normale Supérieure de Lyon, (3) INRIA. http://guardg.st.ewi.tudelft.nl/

  2. Failures Do Happen • … Build a computing system with 10 thousand servers with MTBF of 30 years each, watch one fail per day … Jeff Dean, Google Fellow, LADIS’09 Keynote • … Average worker deaths per MapReduce job is 1.2 … MapReduce, OSDI’04 • … 20-45% failures in TeraGrid … Khalili et al., GRID’06 • … During the month of March 2005 on one dedicated cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job ... Rob Pike et al., Google
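The "one fail per day" figure in the first quote can be checked with a short calculation. The sketch below assumes that servers fail independently at a constant rate of 1/MTBF, which is a simplification and is not stated on the slide; the function name is illustrative.

```python
def expected_failures_per_day(num_servers, mtbf_years, days_per_year=365.0):
    """Expected failures per day across a fleet.

    Assumption (not from the slides): each server fails independently at a
    constant rate of 1/MTBF, so the fleet-wide rate is num_servers / MTBF.
    """
    mtbf_days = mtbf_years * days_per_year
    return num_servers / mtbf_days


# 10,000 servers with an MTBF of 30 years each
rate = expected_failures_per_day(10_000, 30)
print(f"{rate:.2f} expected failures per day")  # about 0.91, i.e. roughly one per day
```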

  3. Are Failures Independent? • Common assumption • Is this realistic for large-scale distributed systems? • We already know that space correlations exist • Time correlations may impact • Proactive fault-tolerance solutions • Design decisions • Checkpointing & scheduling decisions (e.g., migrate computation at the beginning of a predicted peak) Reference: M. Gallet, N. Yigitbasi, B. Javadi, D. Kondo, A. Iosup, D. Epema, A Model for Space-correlated Failures in Large-scale Distributed Systems, Euro-Par 2010.

  4. Our Goals • GOAL 1: Investigate whether failures have time correlations • GOAL 2: Model the time-varying behavior of failures (peaks)

  5. Outline • Background • Our Approach • Analysis of Time-Correlation • Modeling the Peaks of Failures • Conclusions

  6. Why Not Root-Cause Analysis? • Root-cause analysis is definitely useful. Challenges: • Systems are large and complex • Not all subsystems provide detailed info • Little monitoring/debugging support • Environment-specific or temporary failures • Huge size of failure data • 19 systems

  7. Failure Trace Archive (FTA) • Provides: availability traces of diverse distributed systems of different scale; a standard format for failure events; tools for parsing & analysis • Enables: comparing models/algorithms using identical data sets; evaluation of the generality/specificity of models/algorithms across different types of systems; analysis of availability evolution across time scales; and many more … • http://fta.inria.fr

  8. FTA Schema • Hierarchical trace format • Resource centric • Event-based • Associated metadata • Codes for different components and events • Available in raw, tabbed, and MySQL formats

  9. Sample Trace [Figure: sample trace excerpt, annotated with its fields] • Identifiers for the event/component/node/platform • Type of event: unavailability/availability • Event start/stop time (UNIX time) • Node name
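To make the event-based layout concrete, here is a minimal sketch of how such events could be binned into a failure-rate time series for later analysis. The record type `TraceEvent`, its field names, the `"unavailable"`/`"available"` labels, and the hourly bin size are all hypothetical; they mirror the fields listed on the slide but are not the actual FTA column names or codes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical record; field names do NOT reflect the real FTA schema."""
    event_id: int
    node_name: str
    event_type: str     # assumed labels: "unavailable" or "available"
    start_time: float   # UNIX time, seconds
    stop_time: float    # UNIX time, seconds


def failures_per_bin(events, bin_seconds=3600):
    """Count unavailability events by the time bin in which they start.

    Returns a list of counts covering every bin from the first to the last
    observed failure, including bins with no failures.
    """
    bins = [int(e.start_time // bin_seconds)
            for e in events if e.event_type == "unavailable"]
    if not bins:
        return []
    counts = Counter(bins)
    return [counts.get(b, 0) for b in range(min(bins), max(bins) + 1)]


events = [
    TraceEvent(1, "node-a", "unavailable", 0, 50),
    TraceEvent(2, "node-b", "unavailable", 100, 400),
    TraceEvent(3, "node-a", "unavailable", 3700, 3900),
    TraceEvent(4, "node-b", "available", 5000, 9000),
    TraceEvent(5, "node-c", "unavailable", 10900, 11000),
]
print(failures_per_bin(events))  # [2, 1, 0, 1]
```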

  10. Outline • Background • Our Approach • Analysis of Time-Correlation • Modeling the Peaks of Failures • Conclusions

  11. Our Approach (1): Outline • Traces: nineteen failure traces from the FTA, mostly production systems • Analysis: use the auto-correlation of failure rate time series • Modeling: fit well-known probability distributions to the failure data to model failure peaks

  12. Our Approach (2): Traces • 100K+ hosts • ~1.2 M failure events • 15+ years of operation in total • http://fta.inria.fr

  13. Our Approach (3): Analysis • Auto-Correlation Function (ACF) • Similarity between observations as a function of the time lag between them • Mathematical tool for finding repeating patterns • Used for assessing time correlations • Values lie in [-1, 1]: values near 0 indicate weak correlation, values near -1 or +1 indicate strong (negative or positive) correlation
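As an illustration of the ACF, the sketch below computes the standard sample autocorrelation of a series at a given lag using only the Python standard library. The synthetic hourly series with a daily cycle is made up for the example and is not taken from any FTA trace; the slides do not say which estimator or tool the authors used.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (standard biased estimator).

    r(lag) = sum_t (x_t - m)(x_{t+lag} - m) / sum_t (x_t - m)^2, with m the mean.
    """
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum((series[t] - mean) * (series[t + lag] - mean)
                    for t in range(n - lag))
    return numerator / denominator


# Synthetic example: two weeks of hourly failure counts with more failures
# during "working hours" (09:00 to 17:00), i.e. a daily cycle.
hourly = [5 if 9 <= hour % 24 < 17 else 1 for hour in range(24 * 14)]

print(autocorrelation(hourly, 24))  # strongly positive: the pattern repeats daily
print(autocorrelation(hourly, 12))  # negative: busy hours line up with quiet hours
```

A peak in the ACF at lag 24 (and its multiples) is the kind of signature one would expect from a daily cycle in an hourly failure-rate series.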

  14. Our Approach (4): Modeling • We fit five probability distributions to the empirical data • Exponential, Weibull, Pareto, Log-Normal, and Gamma • Maximum likelihood estimation + Goodness of Fit Tests
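A minimal sketch of the maximum likelihood step, restricted to the two of the five distributions whose estimates have a closed form (Exponential and Log-Normal). It compares the fits by log-likelihood only, which stands in for, and is not, the goodness-of-fit tests mentioned on the slide. The function names and the synthetic sample are assumptions made for illustration.

```python
import math
import random


def fit_exponential(samples):
    """MLE for the exponential rate: 1 / sample mean."""
    return len(samples) / sum(samples)


def exponential_log_likelihood(samples, rate):
    return len(samples) * math.log(rate) - rate * sum(samples)


def fit_lognormal(samples):
    """MLE for the log-normal: mean and (population) std. dev. of log(x)."""
    if any(x <= 0 for x in samples):
        raise ValueError("log-normal requires strictly positive samples")
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    if sigma == 0:
        raise ValueError("all samples are identical; sigma is zero")
    return mu, sigma


def lognormal_log_likelihood(samples, mu, sigma):
    return sum(
        -math.log(x) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in samples
    )


# Synthetic inter-arrival times drawn from a log-normal (illustrative only).
rng = random.Random(42)
inter_arrival_times = [rng.lognormvariate(0.0, 1.5) for _ in range(2000)]

rate = fit_exponential(inter_arrival_times)
mu, sigma = fit_lognormal(inter_arrival_times)
ll_exp = exponential_log_likelihood(inter_arrival_times, rate)
ll_logn = lognormal_log_likelihood(inter_arrival_times, mu, sigma)
print("log-normal fits better" if ll_logn > ll_exp else "exponential fits better")
```

Weibull, Pareto, and Gamma fits generally need numerical optimization or extra parameter choices, so they are left out of this sketch.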

  15. Outline • Background • Our Approach • Analysis of Time-Correlation • Modeling the Peaks of Failures • Conclusions

  16. Analysis (1): Auto-correlation • Many systems exhibit moderate/strong auto-correlation for moderate/short time lags (GRID5K, LDNS, SKYPE, …) [Figure: auto-correlation plot for WEBSITES]

  17. Analysis (2): Auto-correlation • A small number of systems exhibit low auto-correlation (TeraGrid, PNNL, NOTRE-DAME) [Figure: auto-correlation plot for TERAGRID]

  18. Analysis (3): Failure Patterns • Systems with similar usage patterns have similar failure patterns [Figures: failure patterns for SKYPE and MICROSOFT, both annotated "Daily/Weekly Cycles"]

  19. Analysis (4): Workload Intensity vs Failure Rate • There is a strong correlation between the workload intensity and the failure rate in some systems [Figure: workload intensity vs failure rate for GRID5000]

  20. Outline • Background • Our Approach • Analysis of Time-Correlation • Modeling the Peaks of Failures • Conclusions

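  21. Failure Peaks (1): Model [Figure: failure peak model, with the mean μ and the peak threshold μ + kσ marked and four elements labeled 1 to 4]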

  22. Failure Peaks (2): Identification • Our goal: balance between capturing the extreme system behavior and characterizing an important part of the system failures • We use a threshold to isolate peaks: μ + kσ, where k is a positive real number • Large k => few periods, explaining only a small fraction of failures • Small k => more failures, of probably very different characteristics • We use k = 1 • Tried k = {0.5, 0.9, 1.0, 1.1, 1.25, 1.5, 2.0} • Over all traces, the average fraction of downtime and the average number of failures are close (see Technical Report)
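The threshold rule can be sketched in a few lines. This is an illustration under assumptions the slide does not settle: peaks are taken to be maximal runs of consecutive time bins whose failure rate is strictly above μ + kσ, and σ is the population standard deviation. The function name `find_peaks` and the sample series are made up.

```python
from statistics import mean, pstdev


def find_peaks(rates, k=1.0):
    """Return (threshold, peaks) for a failure-rate series.

    Assumptions (not confirmed by the slides): a peak is a maximal run of
    consecutive bins with rate strictly above mu + k * sigma, and sigma is
    the population standard deviation. Peaks are (first_index, last_index).
    """
    threshold = mean(rates) + k * pstdev(rates)
    peaks = []
    start = None
    for index, rate in enumerate(rates):
        if rate > threshold:
            if start is None:
                start = index          # a new peak begins
        elif start is not None:
            peaks.append((start, index - 1))  # the current peak ends
            start = None
    if start is not None:              # series ends while still in a peak
        peaks.append((start, len(rates) - 1))
    return threshold, peaks


rates = [1, 1, 1, 10, 12, 1, 1, 1, 9, 1]
threshold, peaks = find_peaks(rates, k=1.0)
print(round(threshold, 2), peaks)   # 8.13 [(3, 4), (8, 8)]
print(find_peaks(rates, k=2.0)[1])  # [] : a larger k leaves no peaks here
```

On this toy series, raising k from 1 to 2 removes every peak, which matches the trade-off stated on the slide: a large k isolates few periods.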

  23. Failure Peaks (3): Modeling Results (1) • On average, 50% to 95% of the system downtime is caused by failures that originate during peaks, although the fraction of peaks is below 10% for all platforms • The average peak durations are on the order of 1-2 hours • The average time between peaks is on the order of 15-80 hours • The average failure inter-arrival time (IAT) over the entire trace is about 9x the IAT during peaks

  24. Failure Peaks (4): Modeling Results (2) • Exponential distribution is not a good fit for IAT during peaks, time between peaks, and failure duration during peaks • Traditional models are not enough • Model parameters do not follow a heavy-tailed distribution • Goodness of fit test results (p-values) for the Pareto distribution are very low • Weibull and the Log-Normal provide the best fit • See the paper for the parameters

  25. Conclusions (1) Large-Scale Study: • Nineteen traces, most of which are production systems • 100K+ hosts, ~1.2 M failure events, 15+ years of operation • Four new traces available in the FTA (3 CONDOR + 1 TERAGRID) GOAL 1: Analysis • Failures exhibit strong periodic behavior & time correlation • Systems with similar usage patterns have similar failure patterns • Strong correlation between workload intensity and failure rate

  26. Conclusions (2) GOAL 2: Modeling • Peak duration, time between peaks, the failure IAT during peaks, and the failure duration during peaks • On average 50% to 95% of the system downtime is caused by the failures that originate during peaks (fraction of peaks < 10%) • Weibull & the Log-Normal distributions provide good fit

  27. The FailureTraceArchive Thank you! Questions? Comments? “M.N.Yigitbasi@tudelft.nl” http://www.st.ewi.tudelft.nl/~nezih/ • More Information: • Guard-g Project: http://guardg.st.ewi.tudelft.nl/ • The Failure Trace Archive: http://fta.inria.fr • PDS publication database: http://www.pds.twi.tudelft.nl

  29. Autocorrelation Function [Figure: autocorrelation coefficient (y-axis, -1 to +1) versus lag k (x-axis, 0 to 100); annotation: significant positive correlation at short lags]

  30. Autocorrelation Function [Figure: autocorrelation coefficient (y-axis, -1 to +1) versus lag k (x-axis, 0 to 100); annotation: no statistically significant correlation beyond this lag]

  31. Long-range Dependence • For most processes (e.g., Poisson, or compound Poisson), the autocorrelation function drops to zero very quickly • usually immediately or exponentially fast • For self-similar processes, the autocorrelation function drops very slowly • i.e., hyperbolically, toward zero, but may never reach zero
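The difference between the two decay regimes can be seen numerically. The sketch below compares an exponentially decaying autocorrelation, which is the exact form for a first-order autoregressive process, with a hyperbolically decaying one of the form k^(-β). The parameter values φ = 0.8 and β = 0.4 are arbitrary choices for illustration and are not taken from the slides or the traces.

```python
def short_range_acf(lag, phi=0.8):
    """Exponential decay, e.g. the ACF of an AR(1) process: phi ** lag."""
    return phi ** lag


def long_range_acf(lag, beta=0.4):
    """Hyperbolic decay, lag ** -beta with 0 < beta < 1 (illustrative form)."""
    return 1.0 if lag == 0 else lag ** -beta


for lag in (1, 10, 100):
    print(f"lag {lag:>3}: short-range {short_range_acf(lag):.2e}, "
          f"long-range {long_range_acf(lag):.2e}")
```

At lag 100 the exponential curve is numerically negligible while the hyperbolic one is still clearly above zero, which is the behavior the slide describes.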

  32. Autocorrelation Function [Figure: autocorrelation coefficient (y-axis, -1 to +1) versus lag k (x-axis, 0 to 100), comparing a typical long-range dependent process with a typical short-range dependent process]
