What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

First, let’s ask Google

Cloud era No Deep Root Causes…

What does the reliability research community do? • Bug studies • A Study of Linux File System Evolution. In FAST ’13. • A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08. • Precomputing Possible Configuration Error Diagnoses. In ASE ’11. …

Open-source cloud software • Publicly accessible bug repositories

Study to solve… • What bugs “live” in the cloud? • Are there new classes of bugs unique to cloud systems? • How should cloud dependability tools evolve in the near future? • Many other questions…

Cloud Bug Study (CBS) • 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 11 people, 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21000 issues reviewed • ~3600 “vital” issues → in-depth study • Cloud Bug Study (CBS) database

Classifications • Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware failures – Types of hardware and types of hardware failures • Software bug types – Logic, error handling, optimization, config, race, hang, space, load • Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption • ~25000 annotations in total, about 7 annotations per issue
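The classification scheme above can be pictured as a set of labels attached to each issue. The following is a minimal sketch of that idea, not the actual CBS database format: the `AnnotatedIssue` class, the category keys, and the three example records are all hypothetical and exist only to show how per-category counts could be tallied.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AnnotatedIssue:
    """One issue with its annotations (hypothetical schema, not the CBS format)."""
    issue_id: str                              # made-up ID, not a real tracker entry
    system: str                                # e.g. "HBase"
    tags: dict = field(default_factory=dict)   # category -> list of labels


def count_labels(issues, category):
    """Count how often each label of one category appears across issues."""
    counts = Counter()
    for issue in issues:
        counts.update(issue.tags.get(category, []))
    return counts


# Invented example records, for illustration only.
issues = [
    AnnotatedIssue("EXAMPLE-1", "HBase",
                   {"aspect": ["reliability"], "software": ["race"],
                    "implication": ["data loss"]}),
    AnnotatedIssue("EXAMPLE-2", "HDFS",
                   {"aspect": ["performance", "scalability"],
                    "software": ["optimization"],
                    "implication": ["performance"]}),
    AnnotatedIssue("EXAMPLE-3", "Cassandra",
                   {"aspect": ["reliability"], "software": ["error handling"],
                    "implication": ["failed operation"]}),
]

aspect_counts = count_labels(issues, "aspect")
print(aspect_counts.most_common())
```

An issue may carry several labels in the same category, which is why the total number of annotations exceeds the number of issues.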

Cloud Bug Study (CBS) database • Open to public

Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion

Methodology • 6 systems, 3-year span, 2011 to 2014 • 20~30 bugs a day! Protein yeah! • 17% “vital” issues affecting real deployments • 3655 vital issues
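As a quick sanity check, the headline numbers can be related to each other with simple arithmetic. This sketch uses only the approximate counts quoted on the slides (~21000 issues reviewed, 3655 vital issues, ~25000 annotations), so the results are rough.

```python
# Approximate counts taken from the slides.
issues_reviewed = 21_000
vital_issues = 3_655
annotations = 25_000

# Share of reviewed issues that were classified as "vital".
vital_share = vital_issues / issues_reviewed

# Average number of annotations attached to each vital issue.
annotations_per_issue = annotations / vital_issues

print(f"vital share: {vital_share:.1%}")
print(f"annotations per vital issue: {annotations_per_issue:.1f}")
```

Both values round to the figures stated on the slides: about 17% vital issues and about 7 annotations per issue.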

Example issue • Title • Time to resolve • Type & Priority • Description • Discussion

Classifications for each vital issue • Aspects • Hardware types and failure modes • Software bug types • Implications • Bug scopes

Overview of results • Aspects • Hardware faults vs. Software faults • Implications

Aspects • CS = Cassandra • FL = Flume • HB = HBase • HD = HDFS • MR = MapReduce • ZK = ZooKeeper

Aspects: Reliability • Reliability (45%) • Operation & job failures/errors, data loss/corruption/staleness

Aspects: Performance • Reliability • Performance (22%)

Aspects: Availability • Reliability • Performance • Availability (16%) • Node and cluster downtime

Aspects: Security • Reliability • Performance • Availability • Security (6%)

Overview of results • Aspects (classical) • Aspects • Data consistency, scalability, topology, QoS • Hardware faults vs. Software faults • Implications

Aspects: Data consistency • Data consistency (5%) • Permanent inconsistent replicas • Various root causes: • Buggy operational protocol • Concurrency bugs and node failures

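Cassandra cross-DC synchronization • [Figure: replicas of A, B, and C across data centers; some replicas are updated (A’, B’, C’) while others remain stale (A, B, C)] • Permanent inconsistency • Background operational protocols are often buggy!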

Aspects: Scalability • Data consistency • Scalability (2%) • A small share does not mean unimportant! • Only found at scale • Large cluster size • Large data • Large load • Large failures

Large cluster • In Cassandra • Ring position changed • 100x • O(n^3) calculation • CPU explosion
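To see why an O(n^3) calculation turns into a CPU explosion only at scale, it helps to count the steps. This is a generic illustration of cubic growth, not Cassandra's actual code; the function `count_triples` and the node counts are hypothetical.

```python
def count_triples(n: int) -> int:
    """Count inner-loop steps of a triple nested loop over n nodes."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                steps += 1
    return steps


# A small test cluster looks harmless.
small_nodes = 10
print(small_nodes, count_triples(small_nodes))

# For a cluster 100x larger, use the closed form n**3 instead of looping.
large_nodes = 1000
growth = large_nodes ** 3 // small_nodes ** 3
print(f"{large_nodes // small_nodes}x more nodes -> {growth}x more steps")
```

A computation that is negligible on a ten-node test cluster costs a million times more on a cluster one hundred times larger, which is why such bugs tend to surface only in large deployments.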

Large data • In HBase • Inefficient lookup operation • Tens of minutes • [Figure: regions R1, R2, R3, … R100K]

Large load • In HDFS • 1000x small files in parallel • Not expecting small files!

Large failure • AM managing 16,000 tasks fails • Un-optimized connection • Time cost: 7+ hours • [Figure: recovery of tasks 1 … 16K]

From above examples… • Protocol algorithms must anticipate • Large cluster sizes • Large data • Large request load of various kinds • Large scale failures • The need for scalability bug detection tools

Aspects: Topology • Data consistency • Scalability • Topology (1%) • Systems have problems when deployed on certain network topologies • Cross DC • Different racks • New layering architecture • Typically unseen in pre-deployment

Aspects: QoS • Data consistency • Scalability • Topology • QoS (1%) • Fundamental for multi-tenant systems • Two main points • Horizontal/intra-system QoS • Vertical/cross-system QoS

Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. Software faults • Implications

HW faults vs. SW faults “Hardware can fail, and reliability should come from software.”

HW faults and modes • 299 issues from improper handling of node fail-stop failures • A memory card running at 25% of normal speed caused problems in an HBase deployment.

Hardware faults vs. Software faults • Hardware failures, components and modes • Software bug types

Software bug types: Logic • Logic (29%) • Many domain-specific issues

Software bug types: Error handling • Logic • Error handling (18%) • Aspirator, Yuan et al. [OSDI ’14]

Software bug types: Optimization • Logic • Error handling • Optimization (15%)

Software bug types: Configuration • Logic • Error handling • Optimization • Configuration (14%) • Automating Configuration Troubleshooting. [OSDI ’10] • Precomputing Possible Configuration Error Diagnoses. [ASE ’11] • Do Not Blame Users for Misconfigurations. [SOSP ’13]

Software bug types: Race • Race (12%) • < 50% local concurrency bugs • Buggy thread interleaving • Tons of work • > 50% distributed concurrency bugs • Reordering of messages, crashes, timeouts • More work is needed • SAMC [OSDI ’14]
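A distributed concurrency bug caused by message reordering can be illustrated by enumerating every delivery order of a small set of messages and checking whether the final state depends on that order. The sketch below is a brute-force toy, not SAMC or any tool from the cited work; the two handlers and the message format `(key, value, version)` are hypothetical.

```python
from itertools import permutations


def apply_in_arrival_order(messages):
    """Buggy handler: the last message to arrive wins."""
    state = {}
    for key, value, _version in messages:
        state[key] = value
    return state


def apply_highest_version(messages):
    """Order-insensitive handler: the highest version wins."""
    state = {}
    for key, value, version in messages:
        if key not in state or version > state[key][1]:
            state[key] = (value, version)
    return {key: value for key, (value, _version) in state.items()}


def distinct_outcomes(handler, messages):
    """Run the handler under every delivery order; collect final states."""
    return {tuple(sorted(handler(order).items()))
            for order in permutations(messages)}


# Two updates to the same key; the network may deliver them in either order.
messages = [("x", "old", 1), ("x", "new", 2)]

naive = distinct_outcomes(apply_in_arrival_order, messages)
versioned = distinct_outcomes(apply_highest_version, messages)

print("arrival-order handler outcomes:", len(naive))
print("versioned handler outcomes:", len(versioned))
```

More than one distinct outcome means the result depends on delivery order, so two replicas receiving the same messages can diverge. Exhaustive enumeration grows factorially with the number of messages, which is one reason more work is needed in this area.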

Software bug types: Hang • Hang (4%) • Classical deadlock • Un-served jobs, stalled operations, … • Root causes? • How to detect them?

Software bug types: Space • Space (4%) • Big data + leak = Big leak • Clean-up operations must be flawless.

Software bug types: Load • Load (4%) • Happen when systems face high request load • Relates to QoS and admission control
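Admission control, mentioned above as related to load issues, can be sketched as a bounded queue that rejects new requests once a limit is reached. This is a generic illustration under assumed names (`AdmissionController`, `max_pending`), not the mechanism used by any of the studied systems.

```python
from collections import deque


class AdmissionController:
    """Reject new requests when too many are already pending (toy sketch)."""

    def __init__(self, max_pending: int):
        self.max_pending = max_pending
        self.pending = deque()
        self.rejected = 0

    def submit(self, request) -> bool:
        """Return True if the request was admitted, False if rejected."""
        if len(self.pending) >= self.max_pending:
            self.rejected += 1
            return False
        self.pending.append(request)
        return True

    def serve_one(self):
        """Serve the oldest pending request, or return None if idle."""
        return self.pending.popleft() if self.pending else None


controller = AdmissionController(max_pending=3)

# A burst of five requests arrives before any can be served.
results = [controller.submit(f"req-{i}") for i in range(5)]
print(results, "rejected:", controller.rejected)

# Serving one request frees a slot for the next arrival.
served = controller.serve_one()
admitted_after_serve = controller.submit("req-5")
print(served, admitted_after_serve)
```

Rejecting early keeps queueing delay bounded under overload, at the cost of failing some requests outright.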

Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. Software faults • Implications

Implications • Failed operation (42%) • Performance (23%) • Downtimes (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%)

Root causes Every implication can be caused by all kinds of hardware and software faults!

“Killer” bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • Single Point of Failure still exists in many forms • Positive feedback loop • Buggy failover • Repeated bugs after failover • …
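The positive feedback loop mentioned above can be illustrated with a toy model: when a node fails, its load is redistributed to the survivors, which may push them over capacity as well. The model, the function `simulate_cascade`, and all numbers below are hypothetical and do not come from the study.

```python
def simulate_cascade(total_load, capacities):
    """Spread load evenly; nodes whose capacity is below their share fail.

    Returns the capacities of surviving nodes and the number of failure rounds.
    """
    alive = list(capacities)
    rounds = 0
    while alive:
        share = total_load / len(alive)
        survivors = [cap for cap in alive if cap >= share]
        if len(survivors) == len(alive):
            break  # stable: every remaining node can carry its share
        alive = survivors
        rounds += 1
    return alive, rounds


# With enough headroom, one weak node fails and the cluster stabilizes.
stable_nodes, stable_rounds = simulate_cascade(90, [40, 35, 32, 20])
print(stable_nodes, stable_rounds)

# With slightly more load, each failure overloads the next node in turn.
dead_nodes, dead_rounds = simulate_cascade(100, [40, 35, 25, 20])
print(dead_nodes, dead_rounds)
```

In the second run a single weak node takes the whole cluster down in three rounds, which is the pattern that turns a local fault into a cluster-wide outage.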

CBS database • 50+ per-system and aggregate graphs from mining the CBS database over the past year • Still more waiting to be studied…

Components with most issues • How should we enhance reliability for interactions among multiple cloud systems? • Cross-system issues are prevalent!
