SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi*, Jeffrey F. Lukman†, and Haryadi S. Gunawi.
SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi*, Jeffrey F. Lukman†, and Haryadi S. Gunawi
Internet Services Cloud Systems
Reliability Complex failures “Deep bugs”
Deep Bug Example
ZooKeeper (synchronization service) Issue #335.
1. Nodes A, B, C start (w/ latest txid: 10)
2. B becomes leader
3. B crashes
4. C becomes leader
5. C commits new txid-value pair (11, X)
6. A crashes before committing the new txid 11
7. C loses quorum and C crashes
8. A and B are back online after C crashes
9. A becomes leader
10. A commits new txid-value pair (11, Y)
11. C is back online after A's new tx commit
12. C announces (11, X) to B
13. B replies with a diff starting at tx 12
14. Inconsistency: A has (11, Y), C has (11, X)
PERMANENT INCONSISTENT REPLICA
Deep Bug Characteristics
ZooKeeper (synchronization service) Issue #335: the 14-step scenario above involves
1. Out-of-order messages
2. Multiple crashes
3. Multiple reboots
• These events can HAPPEN IN ANY ORDER • The bug appears only under a specific order
Study of Deep Bugs [Figure: number of crashes and reboots (y-axis) for each bug (x-axis).] SEVERE IMPLICATIONS: INCONSISTENT REPLICAS, DATA LOSS, DOWNTIMES, ETC.
How to Catch Deep Bugs • Distributed system model checker • Re-ordering all non-deterministic events • Find which specific orderings lead to bugs
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 7 1 4 5 6 3 8 11 10 9 12 14 13
6 9 3 4 5 1 7 8 2 10 11 13 12 14
3 4 1 5 2 6 7 8 9 12 11 10 14 13
2 1 3 4 5 7 6 8 9 10 11 14 12 13
...
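The brute-force idea on this slide can be sketched in Java as follows: enumerate every ordering of a small set of events and keep the orderings that fail a check. This is only an illustration under stated assumptions; the class name `OrderingExplorer`, the integer event IDs, and the `threeBeforeOne` bug condition are hypothetical and not taken from SAMC or any existing model checker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class OrderingExplorer {
    // Returns every ordering of `events` for which `isBuggy` holds.
    static List<List<Integer>> findBuggyOrderings(List<Integer> events,
                                                  Predicate<List<Integer>> isBuggy) {
        List<List<Integer>> buggy = new ArrayList<>();
        permute(new ArrayList<>(events), 0, isBuggy, buggy);
        return buggy;
    }

    // In-place permutation by swapping; visits all n! orderings exactly once.
    private static void permute(List<Integer> order, int k,
                                Predicate<List<Integer>> isBuggy,
                                List<List<Integer>> buggy) {
        if (k == order.size()) {
            if (isBuggy.test(order)) {
                buggy.add(new ArrayList<>(order));
            }
            return;
        }
        for (int i = k; i < order.size(); i++) {
            Collections.swap(order, k, i);
            permute(order, k + 1, isBuggy, buggy);
            Collections.swap(order, k, i); // undo the swap before the next choice
        }
    }

    // Hypothetical bug condition: event 3 is delivered before event 1.
    static boolean threeBeforeOne(List<Integer> order) {
        return order.indexOf(3) < order.indexOf(1);
    }
}
```

If all 14 steps of the example could be freely re-ordered, this loop would visit 14! = 87,178,291,200 orderings, which is why exhaustive exploration needs reduction.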
What’s Wrong with Existing Model Checkers? • Last 7 years • MaceMC [NSDI’07], Modist [NSDI’09], dBug [SSV’10], Demeter [SOSP’13], etc. • BUT • Too many events • Multiple crashes and reboots • Create more messages • No model checker incorporates multiple crashes and reboots • Cannot find deep bugs! 100 events
Black-Box Approach • Existing model checkers are so slow • They treat target systems as black boxes • A large number of event orderings are generated Black Box A C D B
Semantic Knowledge • How can we make a model checker fast? • Exploit semantic knowledge • Semantic-aware model checker (SAMC) Black Box SAMC A C D B
Black Box vs. SAMC Black Box Message Processing Semantic A C D B A C D B Unnecessary Re-orderings Lead to the same state
SAMC with Crashes Crash Recovery Semantic N2 A,B N1 N3 C,D Unnecessary Re-orderings N4
Generic Reduction Algorithms • Reboot Synchronization Symmetry (RSS) • Crash-Message Independence (CMI) • Local-Message Independence (LMI) • Crash Recovery Symmetry (CRS) • Principle of Semantic Awareness
SAMC Implementation and Integration • SAMC implementation • 10,000 LOC from scratch • Apply SAMC to 3 cloud systems • 7 protocols • 10 versions
Result • Reproduced 12 old bugs • Compared to state-of-the-art techniques • Dynamic Partial Order Reduction (DPOR) • Random-DPOR • Finds bugs 2x to 340x faster • 49x on average • Found 2 new bugs • Submitted them to developers
Outline • Intuition • SAMC • Local-Message Independence • Crash-Message Independence • Crash Recovery Symmetry • Reboot Synchronization Symmetry • Evaluation
Dependency vs. Independency
A, B = Dependent: from node state S, order A, B leads to S' but order B, A leads to S''
A, B = Independent: from node state S, orders A, B and B, A both lead to S', so the second is unnecessary
INDEPENDENT = NO REORDERING, 2X SPEED-UP
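The definition on this slide can be checked directly on a toy handler: deliver two messages in both orders from the same state and compare the results. The sketch below assumes node state and messages are plain integers; `IndependenceCheck`, `keepMax`, and `overwrite` are hypothetical names used only for illustration. SAMC itself does not run both orders; it relies on semantic knowledge to decide independence up front.

```java
import java.util.function.IntBinaryOperator;

final class IndependenceCheck {
    // handler.applyAsInt(state, message) returns the node state after the message.
    // Two messages are independent at `state` if both orders end in the same state.
    static boolean independent(int state, int a, int b, IntBinaryOperator handler) {
        int afterAB = handler.applyAsInt(handler.applyAsInt(state, a), b);
        int afterBA = handler.applyAsInt(handler.applyAsInt(state, b), a);
        return afterAB == afterBA;
    }

    // Hypothetical handler: keep the largest value seen, so order does not matter.
    static int keepMax(int state, int msg) {
        return Math.max(state, msg);
    }

    // Hypothetical handler: the last message wins, so order matters.
    static int overwrite(int state, int msg) {
        return msg;
    }
}
```

With `keepMax` the pair is independent and one of the two orders can be skipped; with `overwrite` both orders must be explored.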
Black Box vs. SAMC Black Box Semantic Info A B C D A B C D All dependent Dependent Dependent 6X SPEED-UP
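One way to read the 6X figure, assuming the two "Dependent" labels mark two disjoint pairs (A with B, and C with D) and that messages from different pairs are independent: a black box must try all 4! = 24 orderings, while only the relative order inside each pair matters, which leaves 2 × 2 = 4 distinct outcomes and 24 / 4 = 6. The Java sketch below counts both numbers; the pairing and the class name `ReductionCount` are assumptions, since the slide text does not say which messages are dependent.

```java
import java.util.HashSet;
import java.util.Set;

final class ReductionCount {
    // Returns {number of all orderings, number of orderings with distinct outcomes}.
    static int[] count(String msgs) {
        int[] total = {0};
        Set<String> classes = new HashSet<>();
        explore("", msgs, total, classes);
        return new int[] {total[0], classes.size()};
    }

    // Builds every ordering of `rest` appended to `done`.
    private static void explore(String done, String rest, int[] total, Set<String> classes) {
        if (rest.isEmpty()) {
            total[0]++;
            // Assumption: only the A/B order and the C/D order affect the outcome.
            boolean aBeforeB = done.indexOf('A') < done.indexOf('B');
            boolean cBeforeD = done.indexOf('C') < done.indexOf('D');
            classes.add(aBeforeB + "/" + cBeforeD);
            return;
        }
        for (int i = 0; i < rest.length(); i++) {
            explore(done + rest.charAt(i),
                    rest.substring(0, i) + rest.substring(i + 1), total, classes);
        }
    }
}
```

Calling `ReductionCount.count("ABCD")` yields 24 total orderings and 4 distinct ones under this assumption.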
Reduction Speed-up [Figure: orderings ABCD, ABDC, ACBD, ... explored from state S, shown in two columns.]
How to Declare Message Independency? Q: Which concurrent messages are independent? A: Use message processing semantic
Message Processing Semantic in Simplified Leader Election
Belief = n3; incoming votes: Vote=1, Vote=2, Vote=4
if (vote <= belief)
    // do nothing
else
    belief = vote;
[Figure: node belief stays B = 3 while processing V = 1 and V = 2, and becomes B = 4 after V = 4.]
Removing Re-ordering via Message Processing Semantic
Belief=4; incoming votes: Vote=1, Vote=2, Vote=3
Orderings 1, 2, 3; 1, 3, 2; 2, 1, 3; ... all leave B = 4, so the re-orderings are unnecessary
if (vote <= belief)
    // do nothing
else
    belief = vote;
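The claim that every ordering ends in the same belief can be checked with a small Java sketch that runs the slide's handler over all six orders of three votes. The class name `VoteOrderings` and the integer encoding of beliefs and votes are assumptions made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

final class VoteOrderings {
    // Handler from the slide: adopt the vote only if it is above the current belief.
    static int handle(int belief, int vote) {
        if (vote <= belief) {
            return belief; // do nothing
        }
        return vote;
    }

    // Delivers three votes in all 6 orders and collects the distinct final beliefs.
    static Set<Integer> finalBeliefs(int belief, int[] votes) {
        int[][] orders = {
            {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}
        };
        Set<Integer> results = new TreeSet<>();
        for (int[] order : orders) {
            int b = belief;
            for (int idx : order) {
                b = handle(b, votes[idx]);
            }
            results.add(b);
        }
        return results;
    }
}
```

Starting from belief 4 with votes 1, 2, 3, the result set contains only 4; starting from belief 3 with votes 1, 2, 4, it also contains only 4. In both cases exploring more than one ordering adds nothing.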
Formalizing the Intuition
MESSAGE PROCESSING SEMANTIC
if (vote <= belief)
    // do nothing
else
    belief = vote;
DISCARD PATTERN
if (isDiscard(msg, ls)) {
    // do nothing;
}
DISCARD PREDICATE
boolean discardPredicate(msg, ls) {
    if (msg.vote <= ls.belief) return true;
    else return false;
}
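A typed Java version of the discard predicate might look like the sketch below. The `Msg` and `LocalState` classes and the `needsReordering` rule are assumptions: the rule shown (skip re-ordering a pair when at least one of the two messages would be discarded at the current local state) is one plausible, conservative way to use the predicate, not necessarily SAMC's exact policy.

```java
final class DiscardRule {
    // Hypothetical message type carrying a single vote.
    static final class Msg {
        final int vote;
        Msg(int vote) { this.vote = vote; }
    }

    // Hypothetical local state of the receiving node.
    static final class LocalState {
        int belief;
        LocalState(int belief) { this.belief = belief; }
    }

    // Discard predicate from the slide: votes not above the belief are ignored.
    static boolean discardPredicate(Msg msg, LocalState ls) {
        return msg.vote <= ls.belief;
    }

    // Assumed rule: a pair needs re-ordering only if neither message is discarded.
    static boolean needsReordering(Msg a, Msg b, LocalState ls) {
        return !(discardPredicate(a, ls) || discardPredicate(b, ls));
    }
}
```

The rule is safe for this handler because the belief only grows: a vote that is discarded now is still discarded after any other vote is processed first.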
Formalizing the Intuition Belief=4 Vote=2 Vote=3 Vote=1
Outline • Intuition • SAMC • Local-Message Independence • Crash-Message Independence • Crash Recovery Symmetry • Reboot Synchronization Symmetry • Evaluation
SAMC Architecture • Protocol Specific Rules (Leader Election, Atomic Broadcast, …) • Generic Reduction Policies (LMI, CMI, CRS, RSS) • Basic Reduction Techniques (Dynamic Partial Order Reduction (DPOR), Symmetry) [Figure: an interceptor on each node holds in-flight messages (a, b at local state ls1; c, d at local state ls2) until SAMC issues a call such as release(c).]
Local-Message Independence • SAMC Generic Reduction Policies • Local-Message Independence (LMI) • Crash-Message Independence (CMI) • Reboot Synchronization Symmetry (RSS) • Crash Recovery Symmetry (CRS)
Local-Message Independence • Discard pattern • Increment pattern • Constant pattern
[Figure: ack messages incrementing a counter C.]
if (msg.type == ack) {
    node.ackCount++;
}
boolean incrementPredicate(msg, ls) {
    if (msg.type == ack) return true;
    else return false;
}
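The increment pattern can be sketched in Java in the same style as the discard case. The `Type` enum, the `Msg` and `LocalState` classes, and the `independent` rule (two messages are independent when both match the increment predicate) are assumptions made for illustration, not SAMC's actual API.

```java
final class IncrementRule {
    enum Type { ACK, VOTE }

    // Hypothetical message type.
    static final class Msg {
        final Type type;
        Msg(Type type) { this.type = type; }
    }

    // Hypothetical local state holding the ack counter from the slide.
    static final class LocalState {
        int ackCount;
    }

    // Handler from the slide: each ack bumps the counter by one.
    static void handle(Msg msg, LocalState ls) {
        if (msg.type == Type.ACK) {
            ls.ackCount++;
        }
    }

    // Increment predicate from the slide.
    static boolean incrementPredicate(Msg msg, LocalState ls) {
        return msg.type == Type.ACK;
    }

    // Assumed rule: two increments commute, so their order need not be explored.
    static boolean independent(Msg a, Msg b, LocalState ls) {
        return incrementPredicate(a, ls) && incrementPredicate(b, ls);
    }
}
```

Two acks leave `ackCount` at 2 whichever arrives first, so only one of the two orders needs to be executed.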
Crash-Message Independence • SAMC Generic Reduction Policies • Local-Message Independence (LMI) • Crash-Message Independence (CMI) • Reboot Synchronization Symmetry (RSS) • Crash Recovery Symmetry (CRS)
Crash-Message Independence: Local Impact
void handleCrash() {
    if (X == follower && isQuorum())
        followerCount--; // No new messages
}
boolean localImpact(X, ls) {
    if (X == follower && isQuorum()) return true;
    else return false;
}
[Figure: leader L with followers F and in-flight messages A,B and C,D.]
Crash-Message Independence: Global Impact
void handleCrash() {
    if (X == leader || !isQuorum())
        electLeader(); // New messages created
}
boolean globalImpact(X, ls) {
    if (X == leader || !isQuorum()) return true;
    else return false;
}
[Figure: leader L with followers F and in-flight messages A,B and C,D; after the crash the nodes are shown as S.]
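The two impact predicates can be written as typed Java under a few assumptions: the crashed node X is described by its role, `isQuorum()` means that a strict majority is still alive after the crash, and a crash is re-ordered against in-flight messages only when it has global impact. The names `CrashImpact`, `Role`, `LocalState`, and `mustReorderWithMessages` are hypothetical.

```java
final class CrashImpact {
    enum Role { LEADER, FOLLOWER }

    // Hypothetical view of the cluster as seen by the node handling the crash.
    static final class LocalState {
        final int clusterSize;
        final int aliveAfterCrash;

        LocalState(int clusterSize, int aliveAfterCrash) {
            this.clusterSize = clusterSize;
            this.aliveAfterCrash = aliveAfterCrash;
        }

        // Assumption: quorum means a strict majority of the cluster is alive.
        boolean isQuorum() {
            return aliveAfterCrash > clusterSize / 2;
        }
    }

    // Local impact: only local bookkeeping changes, no new messages are sent.
    static boolean localImpact(Role crashed, LocalState ls) {
        return crashed == Role.FOLLOWER && ls.isQuorum();
    }

    // Global impact: a new election starts, so new messages are created.
    static boolean globalImpact(Role crashed, LocalState ls) {
        return crashed == Role.LEADER || !ls.isQuorum();
    }

    // Assumed rule: only crashes with global impact are re-ordered with messages.
    static boolean mustReorderWithMessages(Role crashed, LocalState ls) {
        return globalImpact(crashed, ls);
    }
}
```

By construction the two predicates are complementary, so every crash falls into exactly one of the two cases.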
Crash Recovery Symmetry, Reboot Synchronization Symmetry • SAMC Generic Reduction Policies • Crash-Message Independence (CMI) • Local-Message Independence (LMI) • Crash Recovery Symmetry (CRS) • Reboot Synchronization Symmetry (RSS)
Outline • Intuition • SAMC • Local-Message Independence • Crash-Message Independence • Crash Recovery Symmetry • Reboot Synchronization Symmetry • Evaluation
Evaluation • SAMC implementation • 10,000 LOC from scratch • Apply SAMC to 3 cloud systems • 7 protocols • 10 versions
Protocol-Specific Rules (e.g. ZooKeeper Leader Election) • Guide SAMC to remove re-orderings • 35 LOC on average per protocol
Catching Old Bugs [Table: number of executions needed to reach each bug, and the speedup.]
Reduction Ratio • ZooKeeper leader election protocol • Run black-box DPOR • After each execution, check whether SAMC would also execute it • Count the DPOR executions that SAMC executes too
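The counting procedure in these bullets can be sketched in Java as below. Executions are represented as strings of event labels, and the ratio is assumed to be the number of DPOR executions divided by the number that SAMC also runs; both the representation and that definition are assumptions, since the slide does not give a formula. The class name `ReductionRatio` is hypothetical.

```java
import java.util.List;
import java.util.Set;

final class ReductionRatio {
    // Counts the DPOR executions that SAMC would also run.
    static long countAlsoRunBySamc(List<String> dporExecutions, Set<String> samcExecutions) {
        return dporExecutions.stream().filter(samcExecutions::contains).count();
    }

    // Assumed definition: all DPOR executions divided by the shared ones.
    static double ratio(List<String> dporExecutions, Set<String> samcExecutions) {
        long shared = countAlsoRunBySamc(dporExecutions, samcExecutions);
        if (shared == 0) {
            throw new IllegalArgumentException("no shared executions");
        }
        return (double) dporExecutions.size() / shared;
    }
}
```

For example, if DPOR runs six orderings and SAMC would run two of them, the ratio under this definition is 3.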
Conclusion • Deep bugs live in the cloud • Model checkers need to incorporate complex failures to reach deep bugs • State space explosion • Semantic-aware model checking • LMI, CMI, CRS, RSS • Raises future research questions • What other semantic knowledge is useful? • How to extract it from the code automatically?
Thank You Questions? http://ucare.cs.uchicago.edu