Dive into the realm of cloud computing to reveal and classify bugs, and explore methods to enhance system dependability for improved performance and security. This study delves into vital issues affecting deployed systems in a comprehensive one-year analysis. Discover how hardware failures, software bug types, and diverse implications impact reliability, performance, availability, and security in cloud systems.
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria Presenter: Richeng Huang 1
This is the cloud computing era! • Cloud systems are developing rapidly. • They are complex and need better dependability. What bugs do we have? How do we classify them? Are there cloud-unique bugs? How should dependability tools improve? 2
Cloud Bug Study (CBS) • 6 target systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume • 1-year study • Issues in a 3-year window: Jan 2011 to Jan 2014 • ~21,000 issues reviewed • ~3,600 (17%) “vital” issues for in-depth study • Vital: affects real deployed systems. 3
Why these 6 systems? • Distributed cloud computing framework • Scalable storage systems • Distributed key-value stores • Synchronization services • Streaming systems 4
Methodology • Issue Repositories Analysis • Issue Classifications • Cloud Bug Study DB (CBSDB) 5
Issue Repositories • Luckily, each Apache Software Foundation project maintains a highly organized issue repository • For example: ZooKeeper’s issue repository 6
Example • [Annotated issue report; labeled fields: Title, Description, Time to resolve, Discussion, Type & Priority] 7
Several Classifications • Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS • Hardware - processor, disk, memory, network, node. • Hardware failures - Corrupt, limp, stop • Software bug types – Logic, error handling, optimization, config, race, hang, space, load • Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption 8
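The classification above amounts to attaching tags from fixed vocabularies to each issue. A minimal sketch of such a record follows, assuming an issue may carry several tags per category. The `Issue` class, the `aspect_share` helper, and the sample issue IDs are hypothetical and are not the CBSDB schema; only the tag names come from the slide.

```python
from collections import Counter
from dataclasses import dataclass, field

# Tag vocabularies copied from the classification slide.
ASPECTS = {"reliability", "performance", "availability", "security",
           "consistency", "scalability", "topology", "qos"}
BUG_TYPES = {"logic", "error handling", "optimization", "config",
             "race", "hang", "space", "load"}


@dataclass
class Issue:
    """Hypothetical record for one vital issue (not the real CBSDB format)."""
    issue_id: str
    aspects: set = field(default_factory=set)
    bug_types: set = field(default_factory=set)

    def __post_init__(self):
        # Reject tags that are not part of the vocabularies above.
        unknown = (self.aspects - ASPECTS) | (self.bug_types - BUG_TYPES)
        if unknown:
            raise ValueError(f"unknown tags: {sorted(unknown)}")


def aspect_share(issues):
    """Percentage of issues carrying each aspect tag.

    Assumption: an issue may carry several aspects, so shares can sum to
    more than 100.
    """
    counts = Counter(a for issue in issues for a in issue.aspects)
    return {a: 100.0 * n / len(issues) for a, n in counts.items()}


# Made-up sample data for illustration only.
issues = [
    Issue("example-1", {"reliability"}, {"logic"}),
    Issue("example-2", {"reliability", "performance"}, {"race"}),
    Issue("example-3", {"availability"}, {"hang"}),
    Issue("example-4", {"performance"}, {"optimization"}),
]
shares = aspect_share(issues)
```

Validating tags at construction time keeps typos out of the counts, which matters when the tagging is done by hand.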
Aspects: Reliability • Reliability (45%) • Operation & job failures/errors, data loss/corruption/staleness • CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper 9
Aspects: Performance • Reliability (45%) • Performance (22%) 10
Aspects: Availability • Reliability (45%) • Performance (22%) • Availability (16%) 11
Aspects: Security • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) 12
There are new aspects in cloud systems • Classical: • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • New: Data consistency, scalability, topology, QoS 13
Aspects: Data consistency • Data consistency (5%) • Permanently inconsistent replicas • Various root causes: • Buggy operational protocols • Concurrency bugs and node failures 14
Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) Scalability: small in number but important, and hard to test at small scale 15
Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) Topology: across data centers, across different racks 16
Aspects • Reliability (45%) • Performance (22%) • Availability (16%) • Security (8%) • Data consistency (5%) • Scalability (2%) • Topology (1%) • QoS (1%) QoS: typically vertical/cross-system QoS 17
Killer Bugs • Bugs that simultaneously affect multiple nodes or even the entire cluster • Single points of failure (SPoF) still exist in many forms • Positive feedback loop • Buggy failover • Repeated bugs after failover • Distributed deadlock • … 18
Killer Bugs • The figure shows heat maps of correlation between scope of killer bugs (multiple nodes or whole cluster) and hardware/software root causes. A killer bug can be caused by multiple root causes. The number in each cell represents the bug count 19
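The cell counts in such a heat map can be produced by counting (scope, root cause) pairs, with a bug that has several root causes contributing to several cells. The sketch below assumes each bug is recorded as a scope plus a list of causes; the `heatmap_counts` function and the sample records are made up for illustration and are not data from the study.

```python
from collections import Counter


def heatmap_counts(bugs):
    """Count (scope, root cause) pairs.

    A bug with several root causes adds one to each matching cell, so the
    cell total can exceed the number of bugs.
    """
    cells = Counter()
    for scope, causes in bugs:
        for cause in set(causes):  # set() guards against duplicate tags
            cells[(scope, cause)] += 1
    return cells


# Made-up records: (scope, root causes).
bugs = [
    ("cluster", ["logic", "node"]),
    ("cluster", ["logic"]),
    ("multiple nodes", ["error handling", "disk"]),
    ("multiple nodes", ["logic"]),
]
cells = heatmap_counts(bugs)
```

Here four bugs fill six cells in total, which is why the numbers in a heat map of this kind should not be summed to get a bug count.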
Positive feedback loop • [Diagram: loop linking False Failure, Recovery, and High Load] • Example case in Cassandra: [diagram labels: More nodes, High Gossip Traffic, More False Failure] 20
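A toy model can show why such a loop is dangerous. The sketch below is not Cassandra's gossip or failure-detection logic; it assumes that load is shared evenly among live nodes, that recovering each failed node adds a fixed extra load, and that an overloaded node is wrongly declared dead. The function name and all parameters are hypothetical.

```python
def simulate(nodes, total_load, capacity, recovery_cost, initial_failures=1):
    """Return the number of nodes considered alive after each step.

    Toy assumptions: even load sharing, a fixed recovery cost per failed
    node, and one overloaded node falsely declared dead per step.
    """
    alive = nodes - initial_failures
    history = [alive]
    while alive > 0:
        failed = nodes - alive
        # Recovery work for failed nodes lands on the survivors.
        per_node = (total_load + failed * recovery_cost) / alive
        if per_node <= capacity:
            break  # stable: no survivor is overloaded
        alive -= 1  # an overloaded node misses heartbeats: a false failure
        history.append(alive)
    return history


# Same cluster and load; only the cost of recovery differs.
without_recovery_load = simulate(5, 38, 10, recovery_cost=0)
with_recovery_load = simulate(5, 38, 10, recovery_cost=5)
```

With free recovery the cluster absorbs the first failure; with costly recovery the same single failure takes down every node in this model.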
Repeated bugs after failover • A key to no-SPoF: after a successful failover, the system should resume the previously failed operation • But with software bugs, after a failover the system runs the same buggy logic again… • In HBase, a region server dies due to bad handling of corrupt region files; a live region server that takes over runs the same code and also dies. • Eventually, all region servers go offline 21
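The cascade can be reproduced in a few lines. This is a minimal sketch, not HBase code: `handle_region`, `safe_handle_region`, `failover`, and the server names are hypothetical, and quarantining the corrupt file is only one possible fix, assumed here for contrast.

```python
def handle_region(region):
    """Buggy handler: crashes on a corrupt region file."""
    if region["corrupt"]:
        raise RuntimeError(f"cannot parse {region['name']}")
    return "served"


def safe_handle_region(region):
    """One possible fix (an assumption): quarantine instead of crashing."""
    if region["corrupt"]:
        return "quarantined"
    return "served"


def failover(servers, region, handler):
    """Hand the region to the next live server until one survives it."""
    dead = []
    for server in servers:
        try:
            handler(region)
            return server, dead
        except RuntimeError:
            dead.append(server)  # this server's process dies; try the next
    return None, dead


servers = ["rs1", "rs2", "rs3"]
bad_region = {"name": "r1", "corrupt": True}

owner_buggy, dead_buggy = failover(servers, bad_region, handle_region)
owner_fixed, dead_fixed = failover(servers, bad_region, safe_handle_region)
```

Failover itself works correctly in both runs; the difference is that every server shares the same handler, so redundancy gives no protection against a deterministic bug.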
HW faults and modes • 299 issues involve improper handling of node fail-stop failures • A memory card running at 25% of normal speed caused problems in an HBase deployment. 23
Software bug types • Logic (29%) • Error handling (18%) • Optimization (15%) • Configuration (14%) • Data race (12%) • Hang (4%), e.g. deadlock • Space (4%) • Load (4%) [Chart categories: Logic, Err-h, Opt, Config, Race, Hang, Space, Load] 24
Implications • Failed operation (42%) • Performance (23%) • Downtime (18%) • Data loss (7%) • Data corruption (5%) • Data staleness (5%) [Chart categories: Opfail, Perf, Down, Loss, Corrupt, Stale] 25
Software/Hardware Faults & Implications • Still a long way from a highly dependable system. • Catch all faults! 26
Cloud Bug Study database (CBSDB) • A total of 21,399 issues (3,655 vital) • Open to the public • Bug evolution analysis. 27
System evolution Hadoop 2.0 28
Conclusion • The largest bug study of cloud systems to date • Provides insights into many intricate bugs • Unique bugs in cloud systems. • Killer bugs • Cloud Bug Study (CBS) database. 29
Comments • The study required a huge amount of human effort, which is neither efficient nor maintainable. • The study reports the distribution of issues but offers no suggestions or solutions for them. • The study analyzes only issues that have already been resolved. This information is retrievable from the repositories; experts and developers can learn the implications from the issue reports themselves. • CBSDB is not active, and maintaining it would take a large amount of time. • The authors do not explicitly say how this study should be used for future development. 30
Thoughts and Discussion from Piazza • Combine machine learning and NLP techniques for the classification and tagging task. - Hongwei Wang • They don’t provide a possible solution to the question “why are cloud systems not 100% dependable?” - Eric Badger • They say cloud systems are still far from 100% dependable. • Need an automatic analysis tool. - Sanchit Gupta 31