
Why does the Cloud stop computing?

Why does the Cloud stop computing? Lessons from hundreds of service outages. Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar. Outages. Bugs. 2 years ago @ SoCC ’14


Presentation Transcript


  1. Why does the Cloud stop computing? Lessons from hundreds of service outages. Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar

  2. COS @ SoCC '16

  3. Outages. Bugs. 2 years ago @ SoCC ’14: Study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)

  4. Public reports! • Headline news and post-mortem reports • Providers’ transparency • Untapped information • Pros/cons: + Detailed root causes + Detailed chain of failures + Downtime durations + Zero false positives - (Very) incomplete - (High) variance

  5. COS: Cloud Outage Study • 32 services • 597 outages • between 2009 and 2015 • ~70% report downtimes • ~60% report root causes

  6. COS @ SoCC '16

  7. Downtime/year • On average • 6% of services do not reach 99% availability (>88 hours of downtime) • 78% do not reach 99.9% (>8.8 hours) • Worst year • 31% do not reach 99% • 81% do not reach 99.9% • 5-nine availability? • It’s just a dream? [Chart axis: Hours]
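The hour thresholds on the slide above follow from simple arithmetic on a year of 8,760 hours. A minimal sketch, assuming a non-leap year; the function name `max_downtime_hours` is made up for illustration and does not come from the talk:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget, in hours, allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        hours = max_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

The 99% and 99.9% budgets come out at 87.6 and 8.76 hours, which the slide rounds to 88 and 8.8. Five nines leaves a little over five minutes per year.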

  8. Root causes (sorted by count)

  9. Interesting Root Causes • Upgrade • Involves multiple layers • “a code push behaved differently in widespread use than it had during testing” • To understand/reproduce, the full ecosystem is needed

  10. Interesting Root Causes • Human mistakes • Rare now (vs. 10 years ago) • Config/Upgrade software bugs • Bugs in automation process • Similar issues? • But root cause origins are different

  11. Config vs. Upgrade Research • Upgrade #1, need more research? • Paper count in last few years ↓ • Challenges: • Multi-layer • Full ecosystem needed • Multi-year? • Reproducible bugs from industry (benchmarks)?

  12. Interesting Root Causes • Bugs • What types of bugs lead to outages? Why are they not masked? • (please see the paper) • “Cascading” bugs

  13. • “DynamoDB Storage servers query the metadata service for their membership” • “But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]” • “As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests” [Diagram labels: Storage servers; Remove self; Timeout; Busy; Metadata service]
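The chain quoted on the slide above (busy metadata service, timeout, self-removal) can be mimicked with a toy model. This is only a sketch of the failure pattern, not DynamoDB's code; the class `StorageServer`, the helper `serving_count`, and every timing value are assumptions chosen for illustration:

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_s: float      # hypothetical deadline for a membership lookup
    serving: bool = True

    def refresh_membership(self, response_time_s: float) -> None:
        # A slow (busy) metadata service is treated like a failed lookup:
        # the server stops taking requests, as in the quoted report.
        if response_time_s > self.timeout_s:
            self.serving = False


def serving_count(servers: list[StorageServer], response_time_s: float) -> int:
    """Have every server refresh its membership, then count the survivors."""
    for server in servers:
        server.refresh_membership(response_time_s)
    return sum(server.serving for server in servers)


if __name__ == "__main__":
    fleet = [StorageServer(f"s{i}", timeout_s=1.0) for i in range(5)]
    print(serving_count(fleet, response_time_s=0.4))  # fast metadata service
    print(serving_count(fleet, response_time_s=1.5))  # busy metadata service
```

Because all replicas share one dependency and one timeout, a single slowdown removes the whole fleet at once, so the redundancy offers no protection in this model.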

  14. • “Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance” • “data collection servers … had a failure” • “this inability to contact a data collection server triggered a latent memory leak bug in the storage servers…” • “EBS servers continued trying in a way that slowly consumed system memory” [Diagram labels: EBS storage servers; Memory leak; Failure; Data collection servers]
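A "latent" bug of the kind quoted above lives on a code path that only runs when a dependency fails, so normal operation never exposes it. The sketch below illustrates that pattern under stated assumptions; it is not the EBS implementation, and `LeakyReporter`, its buffer size, and its retry behaviour are invented for the example:

```python
class LeakyReporter:
    """Hypothetical reporter whose buffers are released only on success."""

    def __init__(self) -> None:
        self.pending: list[bytearray] = []

    def report(self, collector_up: bool) -> None:
        # Allocate a buffer for this report attempt.
        self.pending.append(bytearray(1024))
        if collector_up:
            # Success path: everything is released, so the leak stays hidden.
            self.pending.clear()
        # Failure path: nothing is released, so each retry holds more memory.


if __name__ == "__main__":
    reporter = LeakyReporter()
    for _ in range(100):
        reporter.report(collector_up=True)
    print(len(reporter.pending))   # buffers held while the collector is healthy

    for _ in range(100):
        reporter.report(collector_up=False)
    print(len(reporter.pending))   # buffers held after 100 failed retries
```

Held memory stays flat while the collector is healthy and grows with every retry once it fails, matching the "slowly consumed system memory" description.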

  15. (more in the paper)

  16. Where is the SPOF? Redundancies, redundancies, redundancies! Yes, we did that. So, why do outages still happen?

  17. Failure recovery chain • Failure Detection • Failover • Backups

  18. Imperfect failure recovery chain • Incomplete Failure Detection • Failover that Fails • Backups that also Fail

  19. Imperfect failure recovery chain • Incomplete error/failure detection • Undetected (specific type of) memory leaks • Load spikes of authentication requests • “an unexpected hardware behavior” [Diagram: Incomplete Failure Detection; Failover that Fails; Backups that also Fail]

  20. Imperfect failure recovery chain • Failover/recovery that fails • Bad PLC fails to activate backup power generators • Failed network switch failover • DC failover fails due to cold cache problems • Recovery/re-mirroring storm [Diagram: Incomplete Failure Detection; Failover that Fails; Backups that also Fail]

  21. Imperfect failure recovery chain • Multiple failures! • Double failures of power, network, storage or server components • Diverse failures: network+server; storage+fibre cut • Cascading bugs … • … that caused many/all redundancies to fail [Diagram: Incomplete Failure Detection; Failover that Fails; Backups that also Fail]
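One way to see why redundancy alone does not rule out the outages on the slide above is to compare independent failures with a shared (common-cause) failure such as a cascading bug. This is a back-of-the-envelope sketch using a textbook probability model, not an analysis from the talk; the probabilities `p` and `c` are made-up illustrative values:

```python
def p_all_fail_independent(p: float, n: int) -> float:
    """Probability that all n replicas fail when each fails independently with probability p."""
    return p ** n


def p_all_fail_with_common_cause(p: float, n: int, c: float) -> float:
    """Same, plus a common-cause event (probability c) that takes down every replica at once."""
    return c + (1 - c) * p ** n


if __name__ == "__main__":
    p, n, c = 0.01, 3, 0.001   # hypothetical values, for illustration only
    print(p_all_fail_independent(p, n))
    print(p_all_fail_with_common_cause(p, n, c))
```

With these assumed numbers the independent case is about one in a million, while the common-cause term alone pushes total failure to about one in a thousand. Adding replicas shrinks only the first term.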

  22. COS Database: • Email us / Check our website • More correlations between … • Root cause & downtime • Service maturity & downtime • Root cause & impacts • Root cause & fixes • Etc.

  23. Conclusion • Features and failures are racing with each other • “Biggest/worst cloud outages of 20YY” – a new year’s tradition • Hope COS tells the cause • Many more examples/details in the papers

  24. Thank you! Questions? ceres.cs.uchicago.edu ucare.cs.uchicago.edu

  25. EXTRA

  26. Manually extract outage “metadata”. Classifications:

  27. A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.

  28. #Outages/year • On average • 1/3 of the services have at least 3 unplanned outages per year • Worst year (between ’09 and ’14) • ½ of the services have at least 4 unplanned outages per year

  29. Downtime by root cause (sorted by median downtime)

  30. Maturity helps? • Does service maturity help? • Based on outage count: • In 2014, 24 outages occurred from 9-yr old services

  31. Maturity helps? • Based on downtime: • In 2014, 267 hours of downtime from 17-yr old services • More mature → more popular → more users → more complex

  32. Interesting Root Causes • Load • Spikes of non-monitored requests • User requests (monitored) • Database index accesses • Authentication requests (cryptographic consumption) • Misconfiguration • Ex: traffic redirection • Take-away: be careful with traffic-related code/configs • Recovery feedback loop

  33. Interesting Root Causes • Cross (dependencies) • Amazon Web Services • Airbnb, Bitbucket, Dropbox, Foursquare, GitHub, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine • Azure • Xbox Live and “52 other services” • Google DC (co-location) • Google Gmail, Search, Drive, YouTube • (40% drop of internet traffic for 5 mins)

  34. Studies of failures, enough?

  35. Studies of failures, enough? • Not all report “d”owntimes • Most study only a few services (data behind company walls)
