Why does the Cloud stop computing? Lessons from hundreds of service outages. Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar
COS @ SoCC '16 Outages • Bugs: 2 years ago @ SoCC ’14, a study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)
Public reports! • Headline news and post-mortem reports • Providers’ transparency • Untapped information • Pros: detailed root causes; detailed chains of failures; downtime durations; zero false positives • Cons: (very) incomplete; (high) variance
COS: Cloud Outage Study • 32 services • 597 outages • Between 2009 and 2015 • ~70% report downtimes • ~60% report root causes
Downtime/year • On average: 6% of services do not reach 99% availability (>88 hours of downtime); 78% do not reach 99.9% (>8.8 hours) • Worst year: 31% do not reach 99%; 81% do not reach 99.9% • 5-nines availability? Is it just a dream? [Chart: downtime per year, in hours]
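The hour thresholds on this slide follow from simple arithmetic on the number of hours in a year. A minimal sketch, assuming a 365-day year (the function names are illustrative, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float) -> float:
    """Availability (in percent) achieved with the given yearly downtime."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


for target in (99.0, 99.9, 99.999):
    print(f"{target}% availability allows {max_downtime_hours(target):.2f} hours/year")
```

Under this assumption, 99% allows about 87.6 hours per year and 99.9% about 8.76 hours, which the slide rounds to 88 and 8.8. Five nines leaves roughly 5 minutes per year.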
Root causes (sorted by count)
Interesting Root Causes • Upgrade: involves multiple layers • “a code push behaved differently in widespread use than it had during testing” • To understand or reproduce such failures, the full ecosystem is needed
Interesting Root Causes • Human mistakes: rare now (vs. 10 years ago) • Config/upgrade software bugs • Bugs in the automation process • Similar issues? But the root-cause origins are different
Config vs. Upgrade Research • Upgrade is #1: need more research? • Paper count in the last few years • Challenges: multi-layer; full ecosystem needed; multi-year? • Reproducible bugs from industry (benchmarks)?
Interesting Root Causes • Bugs • What types of bugs lead to outages? Why are they not masked? (Please see the paper) • “Cascading” bugs
• “DynamoDB Storage servers query the metadata service for their membership” • “But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]” • “As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests” [Diagram: metadata service busy → storage servers time out → storage servers remove themselves]
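The chain quoted above (busy metadata service, timeout, self-removal) can be sketched as a toy model. This is only an illustration under stated assumptions: the class, the function names, the 5-second timeout, and the fleet size are hypothetical and are not taken from the post-mortem.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    in_service: bool = True


def refresh_membership(server: StorageServer, response_time_s: float,
                       timeout_s: float) -> bool:
    """Model one membership query to the metadata service.

    If the response takes longer than the server's timeout, the server
    cannot confirm its membership and removes itself from service.
    """
    server.in_service = response_time_s <= timeout_s
    return server.in_service


def serving_fraction(servers: list[StorageServer]) -> float:
    """Fraction of the fleet that is still taking requests."""
    return sum(s.in_service for s in servers) / len(servers)


# Hypothetical numbers: a 5 s timeout and 4 redundant storage servers.
TIMEOUT_S = 5.0
fleet = [StorageServer(f"storage-{i}") for i in range(4)]

# Normal day: the metadata service answers in 1 s, everyone stays in service.
for s in fleet:
    refresh_membership(s, response_time_s=1.0, timeout_s=TIMEOUT_S)
healthy = serving_fraction(fleet)

# Busy metadata service: every response takes 7 s, every server times out.
for s in fleet:
    refresh_membership(s, response_time_s=7.0, timeout_s=TIMEOUT_S)
degraded = serving_fraction(fleet)

print(healthy, degraded)
```

In this model, redundancy among the storage servers does not help, because every replica depends on the same slow metadata service and applies the same timeout rule.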
• “Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance” • “data collection servers … had a failure” • “this inability to contact a data collection server triggered a latent memory leak bug in the storage servers…” • “EBS servers continued trying in a way that slowly consumed system memory” [Diagram: data collection servers fail → memory leak in EBS storage servers]
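The quotes above do not describe the mechanics of the leak. As one hypothetical way such a bug can arise, the sketch below buffers every unsent report for retry; with no cap, memory grows for as long as the collector is unreachable. The `Reporter` class and its parameters are invented for illustration and do not reflect the actual EBS code.

```python
from collections import deque


class Reporter:
    """Buffers reports and retries until the collector accepts them."""

    def __init__(self, send, max_pending=None):
        self.send = send
        # maxlen=None means unbounded; with a maxlen, the oldest entry is
        # dropped automatically when a new one is appended to a full deque.
        self.pending = deque(maxlen=max_pending)

    def report(self, record):
        self.pending.append(record)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return  # collector unreachable: keep the rest for later
            self.pending.popleft()


def collector_down(record):
    """Stand-in for a failed data collection server."""
    raise ConnectionError("data collection server unreachable")


leaky = Reporter(collector_down)                     # unbounded retry buffer
bounded = Reporter(collector_down, max_pending=100)  # capped retry buffer

for i in range(10_000):
    leaky.report(i)
    bounded.report(i)

print(len(leaky.pending), len(bounded.pending))
```

The capped variant trades completeness for safety: it loses the oldest reports during a long collector failure, but the failure of a maintenance-only dependency no longer consumes the storage server's memory.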
(more in the paper)
Where is the SPOF? Redundancies, redundancies, redundancies! Yes, we did that. So, why do outages still happen?
Failure recovery chain: Failure Detection → Failover → Backups
Imperfect failure recovery chain: Incomplete Failure Detection → Failover that Fails → Backups that also Fail
Imperfect failure recovery chain • Incomplete error/failure detection: undetected (specific types of) memory leaks; load spikes of authentication requests; “an unexpected hardware behavior”
Imperfect failure recovery chain • Failover/recovery that fails: bad PLC fails to activate backup power generators; failed network switch failover; DC failover fails due to cold-cache problems; recovery/re-mirroring storm
Imperfect failure recovery chain • Multiple failures! • Double failures of power, network, storage, or server components • Diverse failures: network + server; storage + fibre cut • Cascading bugs … that caused many/all redundancies to fail
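A back-of-the-envelope model shows why an imperfect recovery chain and common-cause failures weaken redundancy. The sketch below assumes independent stages and uses made-up probabilities; neither the formulas nor the numbers come from the study.

```python
def p_masked(p_detect: float, p_failover: float, p_backup: float) -> float:
    """Probability that a failure is masked, assuming the three stages
    (detection, failover, backup) must all work and are independent."""
    return p_detect * p_failover * p_backup


def p_all_replicas_fail(p_fail: float, n: int, p_common: float = 0.0) -> float:
    """Probability that all n replicas fail.

    p_fail:   independent failure probability of each replica
    p_common: probability of a common-cause event (for example, a
              cascading bug) that takes down every replica at once
    """
    return p_common + (1 - p_common) * p_fail ** n


# Made-up numbers, for illustration only.
print(p_masked(0.99, 0.95, 0.95))                   # chain of imperfect stages
print(p_all_replicas_fail(0.01, 3))                 # independent replicas
print(p_all_replicas_fail(0.01, 3, p_common=0.001)) # with a shared bug
```

With these numbers, roughly one failure in ten is not masked even though each stage looks reliable on its own, and a common-cause probability of 0.001 dominates the chance of losing all three replicas.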
COS Database: • Email us / check our website • More correlations between: root cause & downtime; service maturity & downtime; root cause & impacts; root cause & fixes; etc.
Conclusion • Features and failures are racing with each other • “Biggest/worst cloud outages of 20YY”: a new year’s tradition • We hope COS tells the cause • Many more examples/details in the paper
Thank you! Questions? ceres.cs.uchicago.edu • ucare.cs.uchicago.edu
Manually extract outage “metadata” • Classifications:
A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.
#Outages/year • On average: 1/3 of the services have at least 3 unplanned outages per year • Worst year (between ’09 and ’14): 1/2 of the services have at least 4 unplanned outages per year
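A statistic of this form ("a fraction of the services have at least k outages in a year") can be computed from a list of outage records. A minimal sketch with made-up records; the record layout and function name are assumptions, not the COS database schema.

```python
from collections import Counter


def fraction_with_at_least(outages, services, year, k):
    """Fraction of `services` with at least `k` recorded outages in `year`.

    `outages` is an iterable of (service, year) pairs, one per outage.
    """
    counts = Counter(svc for svc, yr in outages if yr == year)
    return sum(1 for svc in services if counts[svc] >= k) / len(services)


# Made-up records, for illustration only.
outages = [("A", 2014), ("A", 2014), ("A", 2014), ("B", 2014), ("C", 2013)]
services = ["A", "B", "C"]

print(fraction_with_at_least(outages, services, year=2014, k=3))
```

Here only service A has three or more outages in 2014, so the result is one third. Services with no recorded outage in a year still count in the denominator.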
Downtime by root cause (sorted by median downtime)
Maturity helps? • Does service maturity help? • Based on outage count: in 2014, 24 outages occurred in 9-year-old services
Maturity helps? • Based on downtime: in 2014, 267 hours of downtime came from 17-year-old services • More mature → more popular → more users → more complex
Interesting Root Causes • Load • Spikes of non-monitored requests (user requests are monitored): database index accesses; authentication requests (cryptographic consumption) • Misconfiguration, e.g., traffic redirection (take-away: be careful with traffic-related code/configs) • Recovery feedback loop
Interesting Root Causes • Cross (dependencies) • Amazon Web Services: Airbnb, Bitbucket, Dropbox, Foursquare, GitHub, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine • Azure: Xbox Live and “52 other services” • Google DC (co-location): Google Gmail, Search, Drive, YouTube (40% drop in internet traffic for 5 mins)
Studies of failures, enough?
Studies of failures, enough? • Not all report “d”owntimes • Most study only a few services (data behind company walls)