This study examines the bugs and issues found in various cloud systems, focusing on reliability, performance, availability, security, consistency, scalability, topology, and QoS. The study analyzes a database of over 21,000 issues, with a detailed analysis of 3,600 vital issues. The findings provide valuable insights for improving cloud dependability tools in the future.
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria
Cloud era No Deep Root Causes…
What does the reliability research community do? • Bug studies • A Study of Linux File System Evolution. In FAST ’13. • A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08. • Precomputing Possible Configuration Error Diagnoses. In ASE ’11. …
Open sourced cloud software • Publicly accessible bug repositories
Study to solve… • What bugs “live” in the cloud? • Are there new classes of bugs unique to cloud systems? • How should cloud dependability tools evolve in the near future? • Many other questions…
Cloud Bug Study (CBS) • 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 11 people, 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21000 issues reviewed • ~3600 “vital” issues studied in depth • Cloud Bug Study (CBS) database
Classifications • Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware failures – Types of hardware and types of hardware failures • Software bug types – Logic, error handling, optimization, config, race, hang, space, load • Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption • ~25000 annotations in total, about 7 annotations per issue
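As a rough illustration of the classification scheme on this slide, one annotated issue could be modeled as a record with one list of tags per category. This is a minimal sketch only: the class name, field names, issue IDs, and example tags below are hypothetical and are not taken from the CBS database schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AnnotatedIssue:
    # All field names are hypothetical, chosen for illustration only.
    issue_id: str
    system: str                                        # e.g. "HDFS", "Cassandra"
    aspects: list = field(default_factory=list)        # e.g. ["reliability"]
    hardware: list = field(default_factory=list)       # hardware type / failure mode
    software: list = field(default_factory=list)       # e.g. ["race", "hang"]
    implications: list = field(default_factory=list)   # e.g. ["data loss"]

    def annotation_count(self) -> int:
        """Total number of tags attached to this issue."""
        return (len(self.aspects) + len(self.hardware)
                + len(self.software) + len(self.implications))


def tally(issues, category):
    """Count how often each tag appears in one annotation category."""
    counts = Counter()
    for issue in issues:
        counts.update(getattr(issue, category))
    return counts


# Two made-up issues, not real entries from any bug tracker.
issues = [
    AnnotatedIssue("example-1", "HDFS", aspects=["reliability"],
                   software=["logic"], implications=["failed operation"]),
    AnnotatedIssue("example-2", "Cassandra",
                   aspects=["scalability", "performance"],
                   software=["optimization"], implications=["performance"]),
]
aspect_counts = tally(issues, "aspects")
```

Tallying one category at a time mirrors how the per-aspect and per-bug-type percentages in the later slides would be derived from such records.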
Cloud Bug Study (CBS) database • Open to public
Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion
Methodology • 6 systems, 3-year span, 2011 to 2014 • 20~30 bugs a day! Protein yeah! • 17% “vital” issues affecting real deployments • 3655 vital issues
Example issue • Title • Time to resolve • Type & Priority • Description • Discussion
Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion
Classifications for each vital issue • Aspects • Hardware types and failure modes • Software bug types • Implications • Bug scopes
Overview of results • Aspects • Hardware faults vs. Software faults • Implications
Aspects • CS = Cassandra • FL = Flume • HB = HBase • HD = HDFS • MR = MapReduce • ZK = ZooKeeper
Aspects: Reliability • Reliability (45%) • Operation & job failures/errors, data loss/corruption/staleness
Aspects: Performance • Reliability • Performance (22%)
Aspects: Availability • Reliability • Performance • Availability (16%) • Node and cluster downtime
Aspects: Security • Reliability • Performance • Availability • Security (6%)
Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. Software faults • Implications
Aspects: Data consistency • Data consistency (5%) • Permanent inconsistent replicas • Various root causes: • Buggy operational protocol • Concurrency bugs and node failures
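Cassandra cross-DC synchronization • [Figure: replicas of A, B, and C across data centers, with updated copies (A’, B’, C’) and stale copies (A, B, C) coexisting] • Permanent inconsistency • Background operational protocols are often buggy!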
Aspects: Scalability • Data consistency • Scalability (2%) • Small number does not mean not important! • Only found at scale • Large cluster size • Large data • Large load • Large failures
Large cluster • In Cassandra • Ring position changed • 100x O(n^3) calculation • CPU explosion
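To make the cost of a cubic calculation concrete, the sketch below counts the innermost iterations of a triple nested loop over n nodes. It is a generic illustration of cubic growth, not the Cassandra code path from the slide; the function name and the cluster sizes are assumptions picked for the example.

```python
def cubic_steps(n: int) -> int:
    """Count the innermost iterations of a triple nested loop over n nodes."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                steps += 1
    return steps


small = cubic_steps(10)    # 10-node cluster
large = cubic_steps(100)   # 100-node cluster
growth = large // small    # how much more work the 10x larger cluster needs
```

A 10x increase in cluster size multiplies the work by 1000, which is why a calculation that looks harmless on a small test cluster can saturate the CPU at scale.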
Large data • In HBase • Insufficient lookup operation • Tens of minutes • [Figure: R1, R2, R3, …, R100K]
Large load • In HDFS • 1000x small files in parallel • Not expecting small files!
Large failure • AM managing 16,000 tasks fails • Un-optimized connection • Time cost: 7+ hours • [Figure: tasks 1K, 2K, 3K, 4K, 5K, …, 16K]
From above examples… • Protocol algorithms must anticipate • Large cluster sizes • Large data • Large request load of various kinds • Large scale failures • The need for scalability bug detection tools
Aspects: Topology • Data consistency • Scalability • Topology (1%) • Systems have problems when deployed on certain network topologies • Cross DC • Different racks • New layering architecture • Typically unseen in pre-deployment
Aspects: QoS • Data consistency • Scalability • Topology • QoS (1%) • Fundamental for multi-tenant systems • Two main points • Horizontal/intra-system QoS • Vertical/cross-system QoS
Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. Software faults • Implications
HW faults vs. SW faults “Hardware can fail, and reliability should come from software.”
HW faults and modes • 299 cases of improper handling of node fail-stop failure • A memory card running at 25% of normal speed causes problems in an HBase deployment.
Hardware faults vs. Software faults • Hardware failures, components and modes • Software bug types
Software bug types: Logic • Logic (29%) • Many domain-specific issues
Software bug types: Error handling • Logic • Error handling (18%) • Aspirator, Yuan et al. [OSDI ’14]
Software bug types: Optimization • Logic • Error handling • Optimization (15%)
Software bug types: Configuration • Logic • Error handling • Optimization • Configuration (14%) • Automating Configuration Troubleshooting. [OSDI ’10] • Precomputing Possible Configuration Error Diagnoses. [ASE ’11] • Do Not Blame Users for Misconfigurations. [SOSP ’13]
Software bug types: Race • Race (12%) • < 50% local concurrency bugs • Buggy thread interleaving • Tons of work • > 50% distributed concurrency bugs • Reordering of messages, crashes, timeouts • More work is needed • SAMC [OSDI ’14]
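A distributed concurrency bug of the message-reordering kind can be illustrated by applying the same set of messages to a replica in every possible arrival order and comparing the final states. This is a toy sketch under assumed message formats; it is not how SAMC works, and the function name and the three messages are hypothetical.

```python
from itertools import permutations


def apply_in_order(messages):
    """Apply messages in arrival order to a single replica's key-value state."""
    state = {}
    for op, key, value in messages:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


# Three messages that the network may deliver in any order.
messages = [("put", "k", 1), ("delete", "k", None), ("put", "k", 2)]

# Collect the distinct final states across all delivery orders.
outcomes = {
    tuple(sorted(apply_in_order(order).items()))
    for order in permutations(messages)
}
```

The six delivery orders end in three different final states (k=1, k=2, or k absent), so two replicas that receive the same messages in different orders can diverge unless the protocol enforces an order or makes the operations commute.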
Software bug types: Hang • Hang (4%) • Classical deadlock • Un-served jobs, stalled operations, … • Root causes? • How to detect them?
Software bug types: Space • Space (4%) • Big data + leak = Big leak • Clean-up operations must be flawless.
Software bug types: Load • Load (4%) • Happen when systems face high request load • Relates to QoS and admission control
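The link between load bugs and admission control can be sketched with a controller that caps the number of in-flight requests and rejects the rest. This is a minimal, single-threaded illustration; the class name, method names, and the limit of 2 are assumptions, not taken from any of the studied systems.

```python
class AdmissionController:
    """Reject new requests once a fixed number are already in flight."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0

    def try_admit(self) -> bool:
        """Admit a request if capacity remains; otherwise count a rejection."""
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def complete(self) -> None:
        """Mark one admitted request as finished, freeing a slot."""
        if self.in_flight > 0:
            self.in_flight -= 1


controller = AdmissionController(max_in_flight=2)
results = [controller.try_admit() for _ in range(3)]  # third request exceeds the cap
controller.complete()                                  # one request finishes
admitted_after_completion = controller.try_admit()     # capacity is available again
```

Rejecting early keeps the system responsive for admitted requests instead of letting an unbounded backlog degrade every request at once.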
Overview of results • Aspects (classical) • Aspects (unique) • Data consistency, scalability, topology, QoS • Hardware faults vs. Software faults • Implications
Implications • Failed operation (42%) • Performance (23%) • Downtimes (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%)
Root causes Every implication can be caused by all kinds of hardware and software faults!
“Killer” bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • Single Point of Failure still exists in many forms • Positive feedback loop • Buggy failover • Repeated bugs after failover • …
Outline • Introduction • Methodology • Overview of results • Other CBS database use cases • Conclusion
CBS database • 50+ per-system and aggregate graphs from mining the CBS database over the past year • Still more waiting to be studied…
Components with most issues How should we enhance reliability for multiple cloud system interaction? Cross-system issues are prevalent!