Cloud Testing: Towards thousands of failures and hundreds of specifications
Haryadi Gunawi
Motivation
• Real-world data loss
  • Facebook (03/09), HSBC (07/09), T-Mobile (10/09)
  • 7 out of 10 small-medium firms go out of business
• Failures in the cloud
  • Crashes, disk and network failures (permanent and transient), corruptions, etc.
  • “Millions of opportunities” for failures to occur
  • Not just single, but also combinations of failures
Testing Cloud Infrastructure
• Big and complex recovery
  • HDFS/GFS
    • HDFS: 44 out of 600 bug reports/issues pertain to recovery
  • Cassandra/Dynamo+BigTable, Zookeeper/Chubby
• Existing approaches
  • Mostly single failures
  • Random multiple failures
• How to systematically test cloud infrastructure against failures?
Google FS – Write Protocol
[Figure: Master, Client API, and DataNodes 1-3, with protocol steps numbered 1-4]
HDFS Implementation of Write
[Figure: Master, Client API, and DataNodes 1-3, with many points in the write path marked X]
Failure Service
• Goal:
  • Exercise combinations of failures systematically
• Method:
  • Identify a failure point:
    • Source location, node id, stack trace, etc.
  • Specify the maximum number of failures
  • Run the workload until all combinations of failures are exhausted (similar to model checking)
• Challenges:
  • Efficiency: 400+ failed experiments due to 6 bugs
    • Single write workload: thousands of experiments
  • Coverage: failure coverage, workload coverage, state coverage
  • Integrating the failure service into a model checker
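The method above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the failure service itself: the failure points are a fixed, hand-written list of (node id, source location) pairs, and `toy_write_workload`, `enumerate_failure_plans`, and `run_experiments` are hypothetical names. A real service would discover failure points while the system runs.

```python
from itertools import combinations


def enumerate_failure_plans(failure_points, max_failures):
    """Yield every combination of 1..max_failures failure points."""
    for k in range(1, max_failures + 1):
        for plan in combinations(failure_points, k):
            yield plan


def run_experiments(failure_points, max_failures, workload):
    """Run the workload once per plan and collect the plans it did not survive."""
    failed = []
    for plan in enumerate_failure_plans(failure_points, max_failures):
        if not workload(set(plan)):
            failed.append(plan)
    return failed


# Hypothetical failure points: (node id, source location).
FAILURE_POINTS = [
    ("datanode1", "write_block"),
    ("datanode2", "write_block"),
    ("datanode3", "write_block"),
]


def toy_write_workload(injected):
    """Toy stand-in for a write: it succeeds if at least one DataNode survives."""
    crashed_nodes = {node for node, _location in injected}
    return len(crashed_nodes) < 3


plans = list(enumerate_failure_plans(FAILURE_POINTS, 2))
failed = run_experiments(FAILURE_POINTS, 3, toy_write_workload)
print(len(plans), "plans with at most 2 failures")
print("failed plans with at most 3 failures:", failed)
```

The number of plans grows combinatorially with the number of failure points and the failure bound, which is the efficiency challenge listed above.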
Declarative Checks
• How do we conclude that there is a bug?
  • Client-observable behaviors
  • Write distributed invariant checks
• Current approach:
  • Checks are written in Java/C++
  • A check is more than 50 LOC
  • 100 checks → 5000+ LOC
• Our goal/method:
  • Encourage developers to write hundreds of specs
  • More specs, more bugs found (esp. silent bugs)
  • Use a relational logic language (Bloom)
    • Write specs in Bloom
    • Convert runtime events from inter-node protocols, disk I/Os, etc. into Bloom events (i.e., directly verify the implementation against the specs)
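To give a feel for a relational-style check, here is a minimal Python sketch. It is not Bloom and does not show Bloom syntax. The invariant (every acknowledged block has at least a minimum number of replicas), the event relations `acked` and `replicas`, and the function name `under_replicated` are all assumptions made for illustration, not taken from the slides.

```python
def under_replicated(acked, replicas, min_replicas):
    """Return acknowledged blocks stored on fewer than min_replicas distinct nodes.

    acked:    iterable of block ids that were acknowledged to the client
    replicas: iterable of (block id, node id) pairs observed at runtime
    """
    counts = {}
    # Deduplicate (block, node) pairs so a repeated event is counted once.
    for block, _node in set(replicas):
        counts[block] = counts.get(block, 0) + 1
    return sorted(b for b in set(acked) if counts.get(b, 0) < min_replicas)


# Hypothetical runtime events, already converted into tuples.
acked = ["blk_1", "blk_2"]
replicas = [
    ("blk_1", "dn1"),
    ("blk_1", "dn2"),
    ("blk_1", "dn3"),
    ("blk_2", "dn1"),
]

violations = under_replicated(acked, replicas, 3)
print("violations:", violations)
```

The check is a single query over event relations rather than code woven through the system, which is the property that makes writing many small specs practical.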
Conclusion
• Towards thousands of failures
  • Failure service
• Towards hundreds of specifications
  • Declarative checks with Bloom