D3S: Debug Deployed Distributed Systems. Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, Zheng Zhang. Microsoft Research Asia, Tsinghua University, Fudan University, Shanghai Jiaotong University, MIT CSAIL.
Debugging distributed systems is difficult • Bugs are difficult to reproduce • Many machines executing concurrently • Machines may fail • Network may fail
Example: Distributed lock • Distributed reader-writer locks • Lock mode: exclusive, shared • Invariant: only one client can hold a lock in the exclusive mode • Debugging is difficult because the protocol is complex • For performance, clients cache locks • For failure tolerance, locks have a lease
How do people debug? • Simulation • Model-checking • Runtime checking
State of the art in runtime checking • Step 1: Add logs: void ClientNode::OnLockAcquired(…) { … print_log(m_NodeID, lock, mode); } • Step 2: Collect logs and align them into a globally consistent sequence • Keep partial order • Step 3: Write checking scripts • Scan the logs to retrieve lock states • Check the consistency of locks
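Steps 2 and 3 can be sketched in C++ as follows. This is a minimal illustration, not the tooling the slides refer to: the `LogRecord` layout, the function names, and the choice to order records by a logical timestamp with ties broken by client ID are all assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical log record; the field names are assumptions for illustration.
struct LogRecord {
    std::uint64_t ts;   // logical timestamp used to align logs
    int client;         // node that wrote the record
    std::string lock;   // lock identifier
    bool exclusive;     // true = exclusive mode, false = shared mode
    bool acquired;      // true = acquire event, false = release event
};

// Step 2: merge per-node logs into one sequence ordered by (ts, client).
std::vector<LogRecord> AlignLogs(const std::vector<std::vector<LogRecord>>& perNode) {
    std::vector<LogRecord> all;
    for (const auto& log : perNode) all.insert(all.end(), log.begin(), log.end());
    std::stable_sort(all.begin(), all.end(), [](const LogRecord& a, const LogRecord& b) {
        return a.ts != b.ts ? a.ts < b.ts : a.client < b.client;
    });
    return all;
}

// Step 3: replay the aligned log and collect every lock that, at some point,
// had more than one exclusive holder or an exclusive holder plus shared holders.
std::set<std::string> FindConflicts(const std::vector<LogRecord>& aligned) {
    std::map<std::string, std::set<int>> exclusiveHolders, sharedHolders;
    std::set<std::string> conflicts;
    for (const auto& r : aligned) {
        auto& holders = r.exclusive ? exclusiveHolders[r.lock] : sharedHolders[r.lock];
        if (r.acquired) holders.insert(r.client);
        else holders.erase(r.client);
        const auto ex = exclusiveHolders[r.lock].size();
        const auto sh = sharedHolders[r.lock].size();
        if (ex > 1 || (ex == 1 && sh > 0)) conflicts.insert(r.lock);
    }
    return conflicts;
}
```

Even in this small form, the script has to be written and maintained by hand, and it only sees what Step 1 chose to log.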
Problems for large/deployed systems • Too much manual effort • Difficult to anticipate what to log • Too much information: slows systems down • Too little information: misses a problem • Checking large systems is challenging • A central checker cannot keep up • Snapshots must be consistent • Our focus: make runtime checking easier and feasible for deployed/large-scale systems
D3S approach • Predicate: no conflicting locks • [Figure: app nodes expose their states to checkers; the checkers evaluate the predicate and report a violation when locks conflict]
Our contributions/outline • A simple language for writing distributed predicates • Programmers can change what is being checked on-the-fly • Failure tolerant consistent snapshot for predicate checking • Evaluation with five real-world applications
Design goals • Simplicity: a sequential style for writing predicates • Parallelism: run in parallel on multiple checkers • Correctness: check consistent states in spite of failures • Solution • MapReduce model • Failure-tolerant consistent snapshot
Developers write a D3S predicate
Part 1: define the dataflow and types of states, and how states are retrieved
    V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
    V1: V0 { ( conflict: LockID ) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)
Part 2: define the logic and mapping function in each stage for predicates
    class MyChecker : vertex<V1> {
        virtual void Execute( const V0::Snapshot & snapshot ) {
            …. // Invariant logic, written in sequential style
        }
        static int64 Mapping( const V0::tuple & t ); // guidance for partitioning
    };
D3S parallel predicate checker • Lock clients expose states individually • Exposed states: (C1, L1, E), (C2, L3, S), (C5, L1, S), … • Key: LockID • Checkers reconstruct snapshots SN1, SN2, … from their partitions • One checker receives (C1, L1, E), (C5, L1, S); another receives (C2, L3, S)
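The partitioning step on this slide can be sketched as below. The tuple type and the "key modulo number of checkers" routing rule are assumptions for illustration; the slides only say that the key (here the LockID) guides partitioning.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the exposed state tuple (client, lock, mode).
enum class LockMode { Exclusive, Shared };
struct LockTuple {
    int client;
    std::int64_t lock;
    LockMode mode;
};

// Same role as Mapping() in the slides: the key that decides the partition.
std::int64_t Mapping(const LockTuple& t) { return t.lock; }

// Assumption: a checker is chosen by key modulo the number of checkers.
// Precondition: numCheckers > 0.
std::vector<std::vector<LockTuple>> Partition(const std::vector<LockTuple>& exposed,
                                              std::size_t numCheckers) {
    std::vector<std::vector<LockTuple>> perChecker(numCheckers);
    for (const auto& t : exposed) {
        const auto key = static_cast<std::uint64_t>(Mapping(t));
        perChecker[key % numCheckers].push_back(t);
    }
    return perChecker;
}
```

Because every tuple for a given lock lands on the same checker, each checker can evaluate the lock invariant on its own partition without talking to the others.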
States and dataflow
    V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
    V1: V0 { ( conflict: LockID ) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)
Source code for Boxwood client:
    class ClientNode {
        ClientID m_NodeID;
        void OnLockAcquired( LockID, LockMode );
        void OnLockReleased( LockID, LockMode );
    };
• Insert a hook into the app using binary rewriting at run time • Triggered at function boundaries to expose app states
[Figure: triggers in app → V0: exposer, set of (C, L, M) → checking function → V1 (checker), set of (lock) → final report]
Checking functions
    class MyChecker : vertex<V1> {
        void Execute( const V0::Snapshot& SN ) {
            foreach (V0::Tuple t in SN) {
                if (t.mode == EXCLUSIVE) ex[t.lock]++;
                else sh[t.lock]++;
            }
            foreach (LockID L in ex) {
                if (ex[L] > 1 || (ex[L] == 1 && sh[L] > 0)) output += V1::Tuple(L);
            }
        }
        int64 Mapping( const V0::Tuple& t ) { return t.lock; }
    };
• Written in C++, reusing app types • Execute(): runs for each snapshot • Mapping(): guides partitioning of snapshots
[Figure: triggers in app → V0: exposer, set of (C, L, M) → checking function → V1 (checker), set of (lock) → final report]
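For readers who want to try the invariant outside D3S, here is a standalone C++ sketch of the same logic as Execute(). It does not use the D3S `vertex` base class or `foreach` construct; `HeldLock`, `Mode`, and `ConflictingLocks` are hypothetical names introduced only for this example.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Stand-ins for the V0 tuple (client, lock, mode); names are illustrative.
enum class Mode { Exclusive, Shared };
struct HeldLock {
    int client;
    std::int64_t lock;
    Mode mode;
};

// Returns the locks that violate the invariant in one snapshot: more than one
// exclusive holder, or one exclusive holder together with any shared holder.
std::set<std::int64_t> ConflictingLocks(const std::vector<HeldLock>& snapshot) {
    std::map<std::int64_t, int> ex, sh;
    for (const auto& t : snapshot) {
        if (t.mode == Mode::Exclusive) ++ex[t.lock];
        else ++sh[t.lock];
    }
    std::set<std::int64_t> conflicts;
    for (const auto& entry : ex) {
        const auto it = sh.find(entry.first);
        const int shared = (it == sh.end()) ? 0 : it->second;
        if (entry.second > 1 || (entry.second == 1 && shared > 0)) {
            conflicts.insert(entry.first);
        }
    }
    return conflicts;
}
```

Applied to the exposed states from the parallel-checker slide, (C1, L1, E), (C2, L3, S), (C5, L1, S), this reports L1 as the only conflicting lock.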
Summary of checking language • Predicate • Any property calculated from a finite number of consecutive state snapshots • Highlights • Sequential programs (with mapping) • Reuse app types in the script and C++ code • Support for reducing overhead (in the paper) • Incremental checking • Sampling by time or by snapshot
Constructing consistent snapshots • Use Lamport clocks to totally order states • Problem: how does the checker know whether it has received all necessary states for a snapshot? • Solution: detect app node failures and use membership info to construct snapshots
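As a reminder of how a Lamport clock yields a total order, here is a minimal C++ sketch. It shows the textbook rule, not D3S's implementation; breaking timestamp ties by node ID is the usual convention and is an assumption here, as are the names `LamportClock`, `Stamp`, and `Before`.

```cpp
#include <algorithm>
#include <cstdint>

class LamportClock {
public:
    // Local event or message send: advance and return the new timestamp.
    std::uint64_t Tick() { return ++time_; }

    // Message receive: move past the sender's timestamp, then advance.
    std::uint64_t Receive(std::uint64_t senderTs) {
        time_ = std::max(time_, senderTs) + 1;
        return time_;
    }

    std::uint64_t Now() const { return time_; }

private:
    std::uint64_t time_ = 0;
};

// A state exposure stamped with (timestamp, node).
struct Stamp {
    std::uint64_t ts;
    int node;
};

// Total order: compare timestamps first, then node IDs to break ties.
bool Before(const Stamp& a, const Stamp& b) {
    return a.ts != b.ts ? a.ts < b.ts : a.node < b.node;
}
```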
Constructing consistent snapshots • Membership: external service or built-in heartbeats • Snapshot is correct as long as membership is correct • When no state is being exposed, an app node should report its timestamp periodically • [Figure: timeline] Node A: { (A, L0, S) } at ts=2, { } at ts=10, { (A, L1, E) } at ts=16 • Node B: { (B, L1, E) } at ts=6; ts=12; failure detected • Checker receives SA(2), SB(6), SA(10), SA(16) • M(2)={A,B}: SB(2)=?? • M(6)={A,B}: SA(6)=?? • M(10)={A,B}: SA(6)=SA(2), check(6); SB(10)=SB(6), check(10) • M(16)={A}: check(16)
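The checker-side rule in the timeline can be sketched as below: a snapshot at timestamp t is complete once every node in the membership M(t) has reported a timestamp of at least t, and a node's state at t is its most recent exposed state at or before t. The class and method names are hypothetical, tuples are stored as strings for brevity, and the sketch assumes reports from each node arrive in timestamp order.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

using Node = std::string;
using State = std::set<std::string>;  // tuples rendered as strings for brevity

class SnapshotBuilder {
public:
    // Record the state exposed by `node` at timestamp `ts`.
    void OnState(const Node& node, std::uint64_t ts, State s) {
        states_[node][ts] = std::move(s);
        OnTimestamp(node, ts);
    }

    // Record a timestamp-only report: `node` has reached `ts`.
    void OnTimestamp(const Node& node, std::uint64_t ts) {
        std::uint64_t& seen = latest_[node];
        if (ts > seen) seen = ts;
    }

    // Builds the snapshot at `ts` for the given membership. Returns false,
    // leaving *out untouched, if some member has not yet reported up to `ts`.
    bool TryBuild(std::uint64_t ts, const std::set<Node>& members,
                  std::map<Node, State>* out) const {
        std::map<Node, State> snap;
        for (const auto& n : members) {
            const auto seen = latest_.find(n);
            if (seen == latest_.end() || seen->second < ts) return false;
            State s;  // stays empty if the node exposed nothing by `ts`
            const auto hist = states_.find(n);
            if (hist != states_.end()) {
                const auto it = hist->second.upper_bound(ts);
                if (it != hist->second.begin()) s = std::prev(it)->second;
            }
            snap[n] = s;
        }
        *out = std::move(snap);
        return true;
    }

private:
    std::map<Node, std::map<std::uint64_t, State>> states_;
    std::map<Node, std::uint64_t> latest_;
};
```

Replaying the slide's timeline, and treating B's ts=12 as a timestamp-only report (an assumption about the figure): the snapshot at 6 becomes buildable once A reports ts=10, the snapshot at 10 once B reports ts=12, and the snapshot at 16 only after B is dropped from the membership.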
Experimental method • By debugging 5 real systems, we answer • Can D3S help developers find bugs? • Are predicates simple to write? • Is the checking overhead acceptable? • None of the apps were written by us!
Case study: Leader-election • Predicate • There is at most one leader in each group of replicas • Deployment • 8 machines (1 Gb Ethernet, 2 GHz Intel Xeon CPU, and 4 GB memory) • Test scenario: database app with random I/O (40 MB/s per machine at peak time) • Randomly crash & restart processes • Debugging • 3 checkers, partitioned by replica groups • Time to trigger violation: several hours
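The predicate for this case study can be sketched in the same style as the lock checker. The `ReplicaState` tuple and both function names are hypothetical; the slides state only the invariant (at most one leader per replica group) and that checking is partitioned by replica group.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical exposed state: which group a replica belongs to and whether
// it currently believes it is the leader.
struct ReplicaState {
    int node;
    int group;
    bool isLeader;
};

// Partition key: all replicas of one group go to the same checker.
int GroupKey(const ReplicaState& r) { return r.group; }

// Returns the groups that have more than one leader in this snapshot.
std::set<int> GroupsWithMultipleLeaders(const std::vector<ReplicaState>& snapshot) {
    std::map<int, int> leaders;
    for (const auto& r : snapshot) {
        if (r.isLeader) ++leaders[GroupKey(r)];
    }
    std::set<int> violations;
    for (const auto& entry : leaders) {
        if (entry.second > 1) violations.insert(entry.first);
    }
    return violations;
}
```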
Root cause of the bug • Coordinator crashed and forgot the previous answer • Must write to disk synchronously! • [Figure: coordinator, replica nodes, and failure detectors; after a failure-detector timeout, two "Leader!" claims appear and the checker catches the violation] • Report: nodes involved, sequence of related states and events
Summary of results • [Table not preserved; grouped into data center apps and wide area apps]
Performance overhead (stress test of PacificA) • Less than 8%, in most cases less than 4%. • I/O overhead < 0.5% • Overhead is negligible in other checked systems
Related work • Log analysis • Magpie[OSDI’04], Pip[NSDI’06], X-Trace[NSDI’07] • Predicate checking at replay time • WiDS Checker[NSDI’07], Friday[NSDI’07] • P2-based online monitoring • P2-monitor[EuroSys’06] • Model checking • MaceMC[NSDI’07], CMC[OSDI’04]
Conclusions • Predicate checking is effective for debugging deployed & large-scale distributed systems • D3S enables: • Changing what is monitored on-the-fly • Checking with multiple checkers • Specifying predicates in a sequential & centralized manner
Design goals • An advanced predicate checker designed for deployment & large scale • Deployment • Flexibility: change which states are checked on-the-fly • Low overhead • Large scale • Distributed checking • Failure-tolerance: continue to check correctly when • App node fails • Checking machine fails
Case study: PacificA • A BigTable-like distributed database • Replica group management • Perfect failure detection on storage node • Group reconfiguration to handle node failures • Primary-backup replication • Two-phase commit for consistent updates • Data reconciliation when re-joining a node
Case study: PacificA • A bunch of invariants stem from the design • Group consistency • Single primary in all replica groups • Data consistency • Same data for the same version number • Reliability • When committing, all replicas are already prepared • Correctness of reconciliation • After joining the group, the new node has up-to-date states • Etc… • Specify the invariants as predicates, and check them • Necessary to use multiple checkers • Result: detected 3 correctness bugs caused by atomicity violations and incorrect failure handling
Bug in RSL (Paxos server in Cosmos) • RSL • 1 primary, 4 secondaries • Two-phase commit • Leader election/failure detection • Root cause of the “live-lock”: • The preparing node only re-sends requests to the nodes that have previously responded to it • A node in “learning” never participates in prepare • Result: D is stuck in preparing for a long time • [Figure: nodes A to E with a primary; B is in learning while prepare requests are sent; the verifier detects the unstable node status] • Lesson: • A complete system is error-prone due to optimizations and supporting components • Bugs are not always visible from outside • Always-on checking catches “hidden” bugs
Chord overlay • Consistency vs. Availability: cannot get both • Global measure on the factors • See the tradeoff quantitatively for performance tuning • Capable of checking detailed key coverage • Perfect Ring: • No overlap, no hole • Aggregated key coverage is 100% ???
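A global key-coverage measure of this kind can be sketched as follows. The sketch assumes the usual Chord convention that a node owns the keys in (predecessor, self] on the ring, and it counts owners per key on a small ring; `KeyRange`, `Coverage`, and `MeasureCoverage` are hypothetical names, not part of D3S or Chord.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Keys a node believes it owns: (pred, self], walking clockwise on the ring.
struct KeyRange {
    std::uint32_t pred;
    std::uint32_t self;
};

// Fractions of the key space owned by at least one node and by two or more.
struct Coverage {
    double covered;
    double overlapped;
};

// Preconditions: ringSize > 0 and every pred/self is smaller than ringSize.
// pred == self is treated as one node owning the whole ring.
Coverage MeasureCoverage(const std::vector<KeyRange>& ranges, std::uint32_t ringSize) {
    std::vector<int> owners(ringSize, 0);
    for (const auto& r : ranges) {
        std::uint32_t k = r.pred;
        do {
            k = (k + 1) % ringSize;  // step clockwise, wrapping at the ring size
            ++owners[k];
        } while (k != r.self);
    }
    std::size_t covered = 0, overlapped = 0;
    for (int c : owners) {
        if (c >= 1) ++covered;
        if (c >= 2) ++overlapped;
    }
    return {static_cast<double>(covered) / ringSize,
            static_cast<double>(overlapped) / ringSize};
}
```

A perfect ring gives covered = 1.0 and overlapped = 0.0; a hole lowers the first number and an overlap raises the second, which is what makes the tradeoff visible quantitatively.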