1 / 36

Towards Automatically Checking Thousands of Failures with Micro-Specifications

This paper explores FATE and DESTINI, two solutions for automatically checking failure recovery in cloud systems, and discusses their findings and benefits.

ajacobson
Download Presentation

Towards Automatically Checking Thousands of Failures with Micro-Specifications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Towards Automatically Checking Thousands of Failures with Micro-Specifications Haryadi S. Gunawi, Thanh Do†, Pallavi Joshi, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Koushik Sen University of California, Berkeley † University of Wisconsin, Madison

  2. Cloud Era Solve bigger human problems Use cluster of thousands of machines 2

  3. Failures in The Cloud “The future is a world of failures everywhere” - Garth Gibson “Recovery must be a first-class operation” -Raghu Ramakrishnan “Reliability has to come from the software” - Jeffrey Dean 3

  4. 4

  5. 5

  6. Why Failure Recovery Hard? • Testing is not advanced enough against complex failures • Diverse, frequent, and multiple failures • FaceBook photo loss • Recovery is under specified • Need to specify failure recovery behaviors • Customized well-grounded protocols • Example: Paxos made live – An engineering perspective [PODC’ 07] 6

  7. Our Solutions • FTS (“FATE”) – Failure Testing Service • New abstraction for failure exploration • Systematically exercise 40,000 unique combinations of failures • DTS (“DESTINI”) – Declarative Testing Specification • Enable concise recovery specifications • We have written 74 checks (3 lines / check) • Note: Names have changed since the paper 7

  8. Summary of Findings • Applied FATE and DESTINI to three cloud systems: HDFS, ZooKeeper, Cassandra • Found 16 new bugs • Reproduced 74 bugs • Problems found • Inconsistency • Data loss • Rack awareness broken • Unavailability 8

  9. Outline • Introduction • FATE • DESTINI • Evaluation • Summary 9

  10. X1 X2 X3 Setup Stage Alloc. Req. Data Transfer Stage Failures at DIFFERENT STAGES lead to DIFFERENT FAILURE BEHAVIORS Goal: Exercise different failure recovery path M C 1 2 3 M C 1 2 3 4 Setup Stage Recovery: Recreate fresh pipeline No failures M C 1 2 3 M C 1 2 3 Data transfer Stage Recovery: Continue on surviving nodes Bug in Data Transfer Stage Recovery 10

  11. X X X X X X FATE M C 1 2 3 • A failure injection framework • target IO points • Systematically exploring failure • Multiple failures • New abstraction of failure scenario • Remember injected failures • Increase failure coverage 11

  12. Failure ID 2 3 12

  13. How Developers Build Failure ID? • FATE intercepts all I/Os • Use aspectJ to collect information at every I/O point • I/O buffers (e.g file buffer, network buffer) • Target I/O (e.g. file name, IP address) • Reverse engineer for domain specific information 13

  14. Failure ID 2 3 12

  15. A B B C B A C A A B A B C A Exploring Failure Space M C 1 2 3 M C 1 2 3 Exp #1: A AB Exp #2: B AC Exp #3: C BC 14

  16. Outline • Introduction • FATE • DESTINI • Evaluation • Summary 15

  17. DESTINI • Enable concise recovery specifications • Check if expected behaviors match with actualbehaviors • Important elements: • Expectations • Facts • Failure Events • Check Timing • Interpose network and disk protocols 16

  18. Writing specifications “Violation if expectationis different from actual facts” violationTable():- expectationTable(), NOT-IN actualTable() DataLog syntax: :- derivation , AND 17

  19. X X Correct recovery Incorrect Recovery M C 1 2 3 M C 1 2 3 incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N); 18

  20. X X Correct recovery Incorrect recovery M C 1 2 3 M C 1 2 3 incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N); BUILD EXPECTATIONS CAPTURE FACTS 19

  21. X Building Expectations M C 1 2 3 expectedNodes(B, N) :- getBlockPipe(B, N); Master Client Give me list of nodes for B [Node 1, Node 2, Node 3] 20

  22. X setupAcks (B, Pos, Ack) :- cdpSetupAck (B, Pos, Ack); goodAcksCnt (B, COUNT<Ack>) :- setupAcks (B, Pos, Ack), Ack == ’OK’; nodesCnt (B, COUNT<Node>) :- pipeNodes (B, , N, ); writeStage (B, Stg) :- nodesCnt (NCnt), goodAcksCnt (ACnt), NCnt == Acnt, Stg := “Data Transfer”; Updating Expectation M C 1 2 3 DELexpectedNodes(B, N) :- fateCrashNode(N), writeStage(B, Stage), Stage = “Data Transfer”, expectedNode(B, N) • “Client receives all acks from setup stage writeStage”  enter Data Transfer stage • Precise failure events • Different stages  different recovery behaviors  different specifications • FATE and DESTINI must work hand in hand 21

  23. X X Capture Facts Correct recovery Incorrect recovery actualNodes(B, N) :- blocksLocation(B, N, Gs), latestGenStamp(B, Gs) M C 1 2 3 M C 1 2 3 B_gs2 B_gs1 B_gs1 22

  24. Violation and Check-Timing • There is a point in time where recovery is ongoing, thus specifications are violated • Need precise events to decide when the check should be done • In this example, upon block completion • incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), • cnpComplete(B) ; 23

  25. Capture Facts, Build Expectation from IO events • No need to interpose internal functions • Specification Reuse • For the first check, # rules : #check is 16:1 • Overall, #rules: # check ratio is 3:1 Rules 24

  26. Outline • Introduction • FATE • DESTINI • Evaluation • Summary 25

  27. Evaluation • FATE: 3900 lines, DESTINI: 1200 lines • Applied FATE and DESTINI to three cloud systems • HDFS, ZooKeeper, Cassandra • 40,000 unique combination of failures • Found 16 new bugs, reproduced 74 bugs • 74 recovery specifications • 3 lines / check 26

  28. Bugs found • Reduced availability and performance • Data loss due to multiple failures • Data loss in log recovery protocol • Data loss in append protocol • Rack awareness property is broken 27

  29. Conclusion • FATE explores multiple failure systematically • DESTINI enables concise recovery specifications • FATE and DESTINI: a unified framework • Testing recovery specifications requires a failure service • Failure service needs recovery specifications to catch recovery bugs 28

  30. Thank you! QUESTIONS? Berkeley Orders of Magnitude http://boom.cs.berkeley.edu The Advanced Systems Laboratory http://www.cs.wisc.edu/adsl Downloads our full TR paper from these websites 29

  31. New Challenges • Exponential growth of multiple failures • FATE exercised 40,000 failure combinations in 80 hours 30

  32. DESTINI vs. Related works 31

  33. Filters FATE Architecture HDFS Failure Server Workload Driver while (server injects new failureIDs) { runWorkload(); // e.g hdfs.write } Fail/ No Fail? Failure Surface Java SDK

  34. N D C DESTINI DESTINI stateY(..) :- cnpEv(..), state(X); FATE

  35. Current state of the Art: • Failure exploration • Rarely deal with multiple failures • Or using random approach • System specifications • Unit test checking: cumbersome • WiDS, Pip: not integrated with failure service

  36. Static: InputStream.read() Domain: - Src : Node 1 - Dest: Node 2 - Type: Setup Static: InputStream.read() Domain: - Src : Node 2 - Dest: Node 3 - Type: Data Transfer Static: InputStream.read() Domain: - Src : Node 1 - Dest: Node 2 - Type: Data Transfer M C 1 2 3 4 M C 1 2 3 X1 Recovery 1: Recreate fresh pipeline No failures M C 1 2 3 M C 1 2 3 X2 X3 Bug in recovery 2 Recovery 2: Continue on surviving nodes 35

More Related