1 / 16

The Shape of Failure

Explore methods for characterizing workstation availability and fault behavior, with a focus on clustered systems and scalability. Learn how to approximate Time to Failure, match data to statistical distributions, and compare measured vs. modeled system behavior across different clusters. Discuss implications, limitations, and future research directions in workstation reliability analysis.

dbriggs
Download Presentation

The Shape of Failure

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Shape of Failure Taliver Heath, Richard Martin and Thu Nguyen Rutgers University Department of Computer Science EASY Workshop July 2001 EASY workshop 2001

  2. Goal and Motivation • Characterize workstation availability • Scalable Internet Services • built from clusters for scalability and fault isolation • but components not designed for availability • Current Availability methods ad-hoc • over-engineer and hope for the best • restless sleep next to pagers EASY workshop 2001

  3. Design Approach • Decompose system into components • Characterize fault behavior of each component in isolation • Design system so desired overall failure rate tolerates failure rates of components • This work: whole workstation is a component EASY workshop 2001

  4. Approximating the TTF • Ideal: distribution of Time to Failure (TTF) of workstation • Approximate “failure” with reboot • TTF  TTB EASY workshop 2001

  5. Methodolgy • Collect system last logs • Observe reboot times • Collect length of time between boots (TTB) • Fit observed data to multiple distributions to see which is most representative EASY workshop 2001

  6. Observed Systems • Undergrad cluster • 20 Ultra1’s open to juniors+seniors, 1 admin • Machine room cluster • 17 Ultra1’s,2 sparc20’s, operator access only, 3 admins • Industrial cluster • 8 Netra’s, 9 e450’s, 21 Ultra’s 1’s, operator access only, 1 admin EASY workshop 2001

  7. Matching to a distribution • Maximum Likelihood Estimates to approximate the distribution • Least squares fit to a quantile-quantile plot of data points to the distributions: • Exponential, Weibull, Pareto, Rayleigh • Best match is a Weibull distribution EASY workshop 2001

  8. Measured vs. Modeled: ugrad EASY workshop 2001

  9. Measured vs. Molded: machine room EASY workshop 2001

  10. Measured vs. Modeled: Industrial EASY workshop 2001

  11. Side by side comparison EASY workshop 2001

  12. Results • Workstations that have been up longer are more likely to stay up than those recently rebooted • Weibull shape <1 mean systems not memoryless • Similar results across all 3 clusters • timescales different, but shape of curves the same EASY workshop 2001

  13. Implications • OS rejuvenation? • is effect large enough to observe? • Useful lifetime < bathtub model? • Is a 3 year useful life < decay area? • All systems stay in the “flat-region”? • Load balancing? • Not clean when restarted? • Upgrades EASY workshop 2001

  14. Limitations • TTB only approximates TTF • e.g. a disk error may be a “failure” not captured • downtime not measured • Many factors aggregated • difficult to determine problematic sub-component • Independence assumption • model assumes independent experiments EASY workshop 2001

  15. Future Work • Independence assumption • Conditional probability I.e. if A reboots, is B more likely to reboot soon? • Event loggers (measurability) • Are reboots correlated with load? • What are the first-order factors? • More/longer industrial data • Diversification and comparison of systems • Same models apply to windows, linux? EASY workshop 2001

  16. More Info www.panic-lab.rutgers.edu EASY workshop 2001

More Related