Why does the Cloud stop computing?

Lessons from hundreds of service outages
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar

COS @ SoCC '16

Outages / Bugs
• 2 years ago @ SoCC '14: study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)

Public reports!
• Headline news and post-mortem reports
• Providers' transparency
• Untapped information
• Pros/cons
  + Detailed root causes
  + Detailed chain of failures
  + Downtime durations
  + Zero false positives
  - (Very) incomplete
  - (High) variance

COS: Cloud Outage Study
• 32 services
• 597 outages
• Between 2009 and 2015
• ~70% report downtimes
• ~60% report root causes

Downtime/year (hours)
• On average
  • 6% of services do not reach 99% availability (>88 hours)
  • 78% do not reach 99.9% (>8.8 hours)
• Worst year
  • 31% do not reach 99%
  • 81% do not reach 99.9%
• 5-nine availability?
  • It's just a dream?
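The hour thresholds on this slide follow from simple arithmetic on a 365-day year. A minimal sketch, assuming a non-leap year (the function name `downtime_budget_hours` is ours, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_percent: float) -> float:
    """Maximum downtime per year (in hours) allowed at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.999):
    hours = downtime_budget_hours(target)
    print(f"{target}% availability: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

At 99% the budget is about 87.6 hours and at 99.9% about 8.76 hours, which the slide rounds to 88 and 8.8. Five nines leaves only about 5.3 minutes per year.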

Root causes (sorted by count)

Interesting Root Causes
• Upgrade
  • Involves multiple layers
  • “a code push behaved differently in widespread use than it had during testing”
  • To understand/reproduce, need the full ecosystem

Interesting Root Causes
• Human mistakes
  • Rare now (vs. 10 years ago)
  • Config/Upgrade software bugs
  • Bugs in automation process
  • Similar issues?
  • But root cause origins are different

Config vs. Upgrade Research
• Upgrade #1, need more research?
• Paper count in last few years
• Challenges:
  • Multi-layer
  • Full ecosystem needed
  • Multi-year?
  • Reproducible bugs from industry (benchmarks)?

Interesting Root Causes
• Bugs
  • What types of bugs lead to outages? Why are they not masked?
  • (pls. see paper)
  • “Cascading” bugs

• “DynamoDB Storage servers query the metadata service for their membership”
• “But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
• “As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
[Diagram: storage servers query a busy metadata service, time out, and remove themselves]
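The cascade in these quotes can be sketched as a toy model. Everything here is hypothetical (class name, timeout value, fleet size); it is not DynamoDB's implementation, only an illustration of how one slow shared dependency can take an entire redundant fleet out of service:

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    """Hypothetical storage server that needs fresh membership data to serve."""
    name: str
    timeout_s: float          # how long the server waits for membership data
    in_service: bool = True

    def refresh_membership(self, metadata_latency_s: float) -> None:
        # If the metadata service answers too slowly, the server treats the
        # request as failed and removes itself from taking requests.
        if metadata_latency_s > self.timeout_s:
            self.in_service = False


def serving_count(servers, metadata_latency_s: float) -> int:
    """Refresh every server's membership, then count servers still in service."""
    for server in servers:
        server.refresh_membership(metadata_latency_s)
    return sum(server.in_service for server in servers)


fleet = [StorageServer(f"storage-{i}", timeout_s=1.0) for i in range(5)]
print(serving_count(fleet, metadata_latency_s=0.2))  # fast metadata service
print(serving_count(fleet, metadata_latency_s=1.5))  # busy metadata service
```

With a fast metadata service all 5 servers stay in service; once responses exceed the timeout, the count drops to 0 even though no storage server has crashed.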

• “Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
• “data collection servers … had a failure”
• “this inability to contact a data collection server triggered a latent memory leak bug in the storage servers…”
• “EBS servers continued trying in a way that slowly consumed system memory”
[Diagram: data collection servers fail; EBS storage servers hit a memory leak]
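The quotes do not describe the actual EBS bug beyond "continued trying in a way that slowly consumed system memory". As a generic illustration only, a latent leak on a rarely exercised failure path might look like the following sketch, where the class, its fields, and the 1 KiB allocation are all hypothetical:

```python
class FleetReporter:
    """Hypothetical reporting client running on a storage server."""

    def __init__(self) -> None:
        self.retry_state = []  # bookkeeping kept for unfinished attempts

    def report(self, collector_reachable: bool) -> None:
        if collector_reachable:
            # Normal path: the report is delivered and bookkeeping is dropped.
            self.retry_state.clear()
            return
        # Failure path (the latent bug in this sketch): every failed attempt
        # allocates new state that is never released while the collector
        # stays unreachable.
        self.retry_state.append(bytearray(1024))


reporter = FleetReporter()
for _ in range(1000):  # collector is down for 1000 reporting rounds
    reporter.report(collector_reachable=False)
print(len(reporter.retry_state))  # grows with the number of failed attempts
```

The leak is invisible while the collector is healthy, which is why such a bug can stay latent until an unrelated failure exposes it.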

(more in the paper)

Where is the SPOF?
Redundancies, redundancies, redundancies!
Yes, we did that.
So, why do outages still happen?

Failure recovery chain
Failure detection → Failover → Backups

Imperfect failure recovery chain
Incomplete failure detection → Failover that fails → Backups that also fail

Imperfect failure recovery chain
• Incomplete error/failure detection
  • Undetected (specific type of) memory leaks
  • Load spikes of authentication requests
  • “an unexpected hardware behavior”

Imperfect failure recovery chain
• Failover/recovery that fails
  • Bad PLC fails to activate backup power generators
  • Failed network switch failover
  • DC failover fails due to cold cache problems
  • Recovery/re-mirroring storm

Imperfect failure recovery chain
• Multiple failures!
  • Double failures of power, network, storage or server components
  • Diverse failures: network+server; storage+fibre cut
  • Cascading bugs …
  • … that caused many/all redundancies to fail
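One way to read the chain is that a failure is masked only if every stage works. The sketch below multiplies per-stage success probabilities under the simplifying assumption that stages fail independently. The 0.99 figures are made up for illustration and do not come from the study, and the cascading bugs above are exactly the cases where independence does not hold:

```python
from math import prod


def masking_probability(stage_success) -> float:
    """Probability that every stage of the recovery chain succeeds,
    assuming the stages fail independently of each other."""
    return prod(stage_success)


# Hypothetical per-stage success probabilities (not measured values).
stages = {"detection": 0.99, "failover": 0.99, "backup": 0.99}
p_masked = masking_probability(stages.values())
print(f"masked: {p_masked:.4f}, outage: {1 - p_masked:.4f}")
```

Even with three stages that each work 99% of the time, roughly 3% of failures in this toy model get through the whole chain.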

COS Database
• Email us / Check our website
• More correlations between …
  • Root cause & downtime
  • Service maturity & downtime
  • Root cause & impacts
  • Root cause & fixes
  • Etc.

Conclusion
• Features and failures are racing with each other
• “Biggest/worst cloud outages of 20YY”: a new year's tradition
• Hope COS tells the cause
• Many more examples/details in the papers

Thank you! Questions?
ceres.cs.uchicago.edu
ucare.cs.uchicago.edu

EXTRA

Manually extract outage “metadata”
Classifications:

A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered outages.

#Outages/year
• On average
  • 1/3 of the services have at least 3 unplanned outages per year
• Worst year (between ’09-’14)
  • ½ of the services have at least 4 unplanned outages per year

Downtime by root cause (sorted by median downtime)

Maturity helps?
• Does service maturity help?
• Based on outage count:
  • In 2014, 24 outages occurred from 9-yr old services

Maturity helps?
• Based on downtime:
  • In 2014, 267 hours of downtime from 17-yr old services
• More mature → more popular → more users → more complex

Interesting Root Causes
• Load
  • Spikes of non-monitored requests
    • User requests (monitored)
    • Database index accesses
    • Authentication requests (cryptographic consumption)
  • Misconfiguration
    • Ex: traffic redirection
  • Take-away: be careful with traffic-related code/configs
  • Recovery feedback loop

Interesting Root Causes
• Cross (dependencies)
  • Amazon Web Services
    • Airbnb, Bitbucket, Dropbox, Foursquare, GitHub, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
  • Azure
    • Xbox Live and “52 other services”
  • Google DC (co-location)
    • Google Gmail, Search, Drive, YouTube
    • (40% drop of internet traffic for 5 mins)
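The dependencies on this slide can be written as a small graph and walked to find everything affected when one provider fails. This is a minimal sketch: the dictionary only contains the services named on the slide (the unnamed "52 other services" are left out), and the function name `affected_by` is ours:

```python
from collections import deque

# Provider -> services that depend on it, as listed on the slide.
DEPENDENTS = {
    "Amazon Web Services": [
        "Airbnb", "Bitbucket", "Dropbox", "Foursquare", "GitHub", "Heroku",
        "Instagram", "Minecraft", "Netflix", "Pinterest", "Quora", "Reddit",
        "Vine",
    ],
    "Azure": ["Xbox Live"],
    "Google DC": ["Gmail", "Search", "Drive", "YouTube"],
}


def affected_by(failed: str, dependents: dict) -> set:
    """Breadth-first walk returning every service downstream of `failed`."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(affected_by("Google DC", DEPENDENTS)))
print(len(affected_by("Amazon Web Services", DEPENDENTS)))
```

The breadth-first walk also follows indirect dependencies, so adding an entry for a service that itself hosts other services would extend the affected set without changing the function.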

Studies of failures, enough?
• Not all report “d”owntimes
• Most study only a few services (data behind company walls)
