D3S: Debug Deployed Distributed Systems

D3S: Debug Deployed Distributed Systems
Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, Zheng Zhang
Microsoft Research Asia, Tsinghua University, Fudan University, Shanghai Jiaotong University, MIT CSAIL

Debugging distributed systems is difficult
• Bugs are difficult to reproduce
  • Many machines executing concurrently
  • Machines may fail
  • Network may fail

Example: Distributed lock
• Distributed reader-writer locks
  • Lock mode: exclusive, shared
  • Invariant: only one client can hold a lock in the exclusive mode
• Debugging is difficult because the protocol is complex
  • For performance, clients cache locks
  • For failure tolerance, locks have a lease

How do people debug?
• Simulation
• Model-checking
• Runtime checking

State-of-the-art of runtime checking
• Step 1: Add logs
    void ClientNode::OnLockAcquired(…) {
      …
      print_log( m_NodeID, lock, mode );
    }
• Step 2: Collect logs, align them into a globally consistent sequence
  • Keep partial order
• Step 3: Write checking scripts
  • Scan the logs to retrieve lock states
  • Check the consistency of locks

Problems for large/deployed systems
• Too much manual effort
• Difficult to anticipate what needs to be logged
  • Too much information: slows systems down
  • Too little information: misses a problem
• Checking a large system is challenging
  • A central checker cannot keep up
  • Snapshots must be consistent
• Our focus: make runtime checking easier and feasible for deployed/large-scale systems

D3S approach
[Figure: application nodes expose their states to checkers. The checkers evaluate the predicate "no conflict locks" over the collected states and report a violation when two states conflict.]

Our contributions/outline
• A simple language for writing distributed predicates
• Programmers can change what is being checked on-the-fly
• Failure-tolerant consistent snapshot for predicate checking
• Evaluation with five real-world applications

Design goals
• Simplicity: a sequential style for writing predicates
• Parallelism: run in parallel on multiple checkers
• Correctness: check consistent states in spite of failures
• Solution
  • MapReduce model
  • Failure-tolerant consistent snapshot

Developers write a D3S predicate

Part 1: define the dataflow and types of states, and how states are retrieved
    V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
    V1: V0 → { ( conflict: LockID ) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

Part 2: define the logic and mapping function in each stage for predicates
    class MyChecker : vertex<V1> {
      virtual void Execute( const V0::Snapshot & snapshot ) {
        …. // Invariant logic, written in sequential style
      }
      static int64 Mapping( const V0::Tuple & t ); // guidance for partitioning
    };

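D3S parallel predicate checker
[Figure: lock clients expose their states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), … The exposed states are partitioned by key (LockID) across the checkers: one checker receives (C1, L1, E) and (C5, L1, S), another receives (C2, L3, S). Each checker reconstructs the snapshots SN1, SN2, … for its keys.]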
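The key-based partitioning on this slide can be sketched as follows. This is a minimal illustration, not the D3S implementation: `Tuple`, `mapping`, and `partition` are hypothetical names, and choosing a checker by taking the key modulo the checker count is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an exposed state tuple (client, lock, mode).
enum class LockMode { Shared, Exclusive };

struct Tuple {
    int client;
    std::int64_t lock;
    LockMode mode;
};

// Plays the role of Mapping(): tuples with the same key reach the same checker.
std::int64_t mapping(const Tuple& t) { return t.lock; }

// Assumption: the checker index is the key modulo the checker count.
// Keys are assumed to be non-negative.
std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& exposed,
                                          std::size_t num_checkers) {
    assert(num_checkers > 0);
    std::vector<std::vector<Tuple>> parts(num_checkers);
    for (const Tuple& t : exposed) {
        std::size_t idx = static_cast<std::size_t>(mapping(t)) % num_checkers;
        parts[idx].push_back(t);
    }
    return parts;
}
```

Because all tuples for one lock land on the same checker, each checker can evaluate the lock invariant for its keys without talking to the others.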

States and dataflow
    V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
    V1: V0 → { ( conflict: LockID ) } as final
    after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
    after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

Source code for Boxwood client:
    class ClientNode {
      ClientID m_NodeID;
      void OnLockAcquired( LockID, LockMode );
      void OnLockReleased( LockID, LockMode );
    };

[Dataflow: triggers in app → V0: exposer, set of (C, L, M) → checking function → V1 (checker), set of (lock) → final report]
• Insert a hook to the app using binary rewrite at run time
• Triggered at function boundaries to expose app states

Checking functions
    class MyChecker : vertex<V1> {
      void Execute( const V0::Snapshot& SN ) {
        // Count exclusive and shared holders of each lock in this snapshot.
        foreach (V0::Tuple t in SN) {
          if (t.mode == EXCLUSIVE) ex[t.lock]++;
          else sh[t.lock]++;
        }
        // Report a lock with two exclusive holders, or one exclusive holder plus shared holders.
        foreach (LockID L in ex) {
          if (ex[L] > 1 || (ex[L] == 1 && sh[L] > 0))
            output += V1::Tuple(L);
        }
      }
      int64 Mapping( const V0::Tuple& t ) { return t.lock; }
    };
• Written in C++, reusing app types
• Execute(): runs for each snapshot
• Mapping(): guides partitioning of snapshots
[Dataflow: triggers in app → V0: exposer, set of (C, L, M) → checking function → V1 (checker), set of (lock) → final report]
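The invariant logic in Execute() can be tried outside the D3S runtime. The sketch below is a standalone approximation under stated assumptions: `LockTuple` and `find_conflicts` are hypothetical names, a snapshot is modeled as a plain vector, and standard containers replace the `foreach` and `output` constructs of the checking language.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

enum class LockMode { Shared, Exclusive };

// Hypothetical stand-in for V0::Tuple.
struct LockTuple {
    int client;
    std::int64_t lock;
    LockMode mode;
};

// Returns the locks that break the reader-writer invariant in one snapshot:
// more than one exclusive holder, or an exclusive holder alongside shared holders.
std::set<std::int64_t> find_conflicts(const std::vector<LockTuple>& snapshot) {
    std::map<std::int64_t, int> ex;  // exclusive holders per lock
    std::map<std::int64_t, int> sh;  // shared holders per lock
    for (const LockTuple& t : snapshot) {
        if (t.mode == LockMode::Exclusive) ++ex[t.lock];
        else ++sh[t.lock];
    }
    std::set<std::int64_t> conflicts;
    for (const auto& entry : ex) {
        auto it = sh.find(entry.first);
        int shared = (it == sh.end()) ? 0 : it->second;
        if (entry.second > 1 || (entry.second == 1 && shared > 0)) {
            conflicts.insert(entry.first);
        }
    }
    return conflicts;
}
```

With the states from the earlier slide, (C1, L1, E), (C2, L3, S), (C5, L1, S), the function reports lock 1, since C1 holds it exclusively while C5 holds it in shared mode.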

Summary of checking language
• Predicate
  • Any property calculated from a finite number of consecutive state snapshots
• Highlights
  • Sequential programs (w/ mapping)
  • Reuse app types in the script and C++ code
• Support for reducing the overhead (in the paper)
  • Incremental checking
  • Sampling the time or snapshots

Constructing consistent snapshots
• Use Lamport clocks to totally order states
• Problem: how does the checker know whether it has received all necessary states for a snapshot?
• Solution: detect app node failures and use membership info to construct snapshots
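For readers unfamiliar with Lamport clocks, a minimal sketch follows. The class and function names are hypothetical, and breaking timestamp ties by node id is the standard textbook way to get a total order, assumed here rather than taken from the slides.

```cpp
#include <algorithm>
#include <cstdint>

class LamportClock {
public:
    // Local event or message send: advance the clock and return the new timestamp.
    std::uint64_t tick() { return ++time_; }

    // Message receipt: move past the sender's timestamp, then return the new time.
    std::uint64_t receive(std::uint64_t sender_ts) {
        time_ = std::max(time_, sender_ts) + 1;
        return time_;
    }

    std::uint64_t now() const { return time_; }

private:
    std::uint64_t time_ = 0;
};

// A timestamped state as the checker might see it.
struct Stamp {
    std::uint64_t ts;
    int node;
};

// Total order over stamps: by timestamp, ties broken by node id (assumption).
bool before(const Stamp& a, const Stamp& b) {
    return a.ts != b.ts ? a.ts < b.ts : a.node < b.node;
}
```

The resulting order respects causality: if one exposed state could have influenced another, it carries the smaller timestamp.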

Constructing consistent snapshots
• Membership: external service or built-in heart-beats
• Snapshot is correct as long as membership is correct
• When no state is being exposed, an app node should report its timestamp periodically
[Figure: timeline of app nodes A and B and the checker.
  Node A exposes { (A, L0, S) } at ts=2, { } at ts=10, and { (A, L1, E) } at ts=16.
  Node B exposes { (B, L1, E) } at ts=6; the timeline also marks ts=12 on B, and B's failure is detected.
  The checker receives S_A(2), S_B(6), S_A(10), S_A(16):
    M(2)={A,B}, S_B(2)=??
    M(6)={A,B}, S_A(6)=??
    M(10)={A,B}, S_A(6)=S_A(2), check(6); S_B(10)=S_B(6), check(10)
    M(16)={A}, check(16)]
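The rule illustrated on this slide is that a timestamp can be checked once every current member has reported a timestamp at or beyond it, and that a node's state at that time is its most recently exposed state. A rough sketch, under stated assumptions: `SnapshotBuilder` and its methods are hypothetical names, states are modeled as sets of strings, reports from one node are assumed to arrive in timestamp order, and the membership set is passed in by the caller rather than obtained from a membership service.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

using Timestamp = std::uint64_t;
using NodeId = std::string;
using State = std::set<std::string>;  // hypothetical: a node's exposed tuples

class SnapshotBuilder {
public:
    // A node exposed a new state at logical time ts.
    void on_state(const NodeId& node, Timestamp ts, State state) {
        history_[node][ts] = std::move(state);
        on_timestamp(node, ts);
    }

    // A node with nothing new to expose reported its current timestamp.
    void on_timestamp(const NodeId& node, Timestamp ts) {
        Timestamp& latest = latest_[node];
        if (ts > latest) latest = ts;
    }

    // Fills `out` with the snapshot at ts for the given membership and returns true.
    // Returns false if some member has not yet reported a timestamp >= ts.
    bool snapshot(Timestamp ts, const std::set<NodeId>& members,
                  std::map<NodeId, State>& out) const {
        std::map<NodeId, State> snap;
        for (const NodeId& node : members) {
            auto seen = latest_.find(node);
            if (seen == latest_.end() || seen->second < ts) return false;
            State state;  // stays empty if the node exposed nothing up to ts
            auto hist = history_.find(node);
            if (hist != history_.end()) {
                // First entry strictly after ts; its predecessor is the state at ts.
                auto next = hist->second.upper_bound(ts);
                if (next != hist->second.begin()) state = std::prev(next)->second;
            }
            snap[node] = state;
        }
        out.swap(snap);
        return true;
    }

private:
    std::map<NodeId, std::map<Timestamp, State>> history_;
    std::map<NodeId, Timestamp> latest_;
};
```

Replaying the slide's timeline: after S_A(2) and S_B(6) arrive, timestamp 6 is not yet checkable because A has only reported up to 2. Once S_A(10) arrives it becomes checkable with S_A(6)=S_A(2). Timestamp 16 is checkable with membership {A} even though B never reports again.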

Experimental method
• By debugging 5 real systems, we answer
  • Can D3S help developers find bugs?
  • Are predicates simple to write?
  • Is the checking overhead acceptable?
• None of the apps were written by us!

Case study: Leader-election
• Predicate
  • There is at most one leader in each group of replicas
• Deployment
  • 8 machines (1 Gb Ethernet, 2 GHz Intel Xeon CPU, and 4 GB memory)
  • Test scenario: database app with random I/O (40 MB/s per machine at peak time)
  • Randomly crash & restart processes
• Debugging
  • 3 checkers, partitioned by replica groups
  • Time to trigger violation: several hours
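The predicate in this case study, at most one leader per replica group, might look like the following when written as a standalone function. This is a sketch only: `ReplicaState`, its fields, and both function names are hypothetical, since the slides do not show the actual predicate source.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical exposed state: which group a replica belongs to and whether it
// currently believes it is the leader.
struct ReplicaState {
    int node;
    int group;
    bool is_leader;
};

// Partition key: all replicas of one group are checked by the same checker.
int mapping(const ReplicaState& r) { return r.group; }

// Returns the groups that have more than one leader in this snapshot.
std::set<int> groups_with_multiple_leaders(const std::vector<ReplicaState>& snapshot) {
    std::map<int, int> leaders;  // leader count per group
    for (const ReplicaState& r : snapshot) {
        if (r.is_leader) ++leaders[r.group];
    }
    std::set<int> violations;
    for (const auto& entry : leaders) {
        if (entry.second > 1) violations.insert(entry.first);
    }
    return violations;
}
```

Partitioning by group matches the deployment above, where the 3 checkers split the work by replica group.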

Root cause of the bug
• Coordinator crashed and forgot the previous answer
• Must write to disk synchronously!
[Figure: a coordinator, three replica nodes, and failure detectors. After a failure detector timeout, two replica nodes are marked "leader!". The checker catches the violation.]
Report: nodes involved, sequence of related states and events.

Summary of results
[Table: results grouped into data center apps and wide area apps]

Performance overhead (stress test of PacificA)
• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in other checked systems

Related work
• Log analysis
  • Magpie [OSDI’04], Pip [NSDI’06], X-Trace [NSDI’07]
• Predicate checking at replay time
  • WiDS Checker [NSDI’07], Friday [NSDI’07]
• P2-based online monitoring
  • P2-monitor [EuroSys’06]
• Model checking
  • MaceMC [NSDI’07], CMC [OSDI’04]

Conclusions
• Predicate checking is effective for debugging deployed & large-scale distributed systems
• D3S enables:
  • Changing what is monitored on-the-fly
  • Checking with multiple checkers
  • Specifying predicates in a sequential & centralized manner

Thanks & Q/A

Design goals
• An advanced predicate checker designed for deployment & large scale
• Deployment
  • Flexibility: change which states are checked on-the-fly
  • Low overhead
• Large scale
  • Distributed checking
  • Failure-tolerance: continue to check correctly when
    • App node fails
    • Checking machine fails

Case study: PacificA
• A BigTable-like distributed database
• Replica group management
  • Perfect failure detection on storage node
  • Group reconfiguration to handle node failures
• Primary-backup replication
  • Two-phase commit for consistent updates
  • Data reconciliation when re-joining a node

Case study: PacificA
• A bunch of invariants stem from the design
  • Group consistency
    • Single primary in all replica groups
  • Data consistency
    • Same data for the same version number
  • Reliability
    • When committing, all replicas are already prepared
  • Correctness of reconciliation
    • After joining the group, the new node has up-to-date states
  • Etc…
• Specify the invariants as predicates, and check them
  • Necessary to use multiple checkers
• Result: detected 3 correctness bugs caused by atomicity violation and incorrect failure handling

Bug in RSL (Paxos server in Cosmos)
• RSL
  • 1 primary, 4 secondaries
  • Two phase commit
  • Leader election/failure detection
• Root cause of the “live-lock”:
  • Prepare node only re-sends requests to the nodes that have previously responded to it
  • A node in “learning” never participates in prepare
  • Result: D is stuck in preparing for a long time
[Figure: nodes A, B, C, D, E with B in “learning”; prepare messages; primary; the verifier detects the unstable node status.]
• Lesson:
  • Complete system is error-prone due to optimization and supporting components
  • Bugs are not always visible to outside
  • Always-on checking catches “hidden” bugs

Chord overlay
• Consistency vs. Availability: cannot get both
  • Global measure on the factors
  • See the tradeoff quantitatively for performance tuning
  • Capable of checking detailed key coverage
• Perfect Ring:
  • No overlap, no hole
  • Aggregated key coverage is 100% ???
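One way to measure "no overlap, no hole" is to collect the key range each node claims and compare the number of distinct keys covered with the number of keys claimed. The sketch below assumes a small integer key space and a hypothetical `KeyRange` tuple exposed by each node; the slides do not show how the actual Chord predicate is written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical exposed state: a node claims keys [begin, begin + length) modulo ring_size.
struct KeyRange {
    std::uint64_t begin;
    std::uint64_t length;
};

struct Coverage {
    std::uint64_t covered;  // distinct keys claimed by at least one node
    std::uint64_t claimed;  // sum of all claimed range lengths
};

// Assumption: ring_size is small enough that begin + length does not overflow.
Coverage measure(const std::vector<KeyRange>& ranges, std::uint64_t ring_size) {
    assert(ring_size > 0);
    std::vector<std::pair<std::uint64_t, std::uint64_t>> segs;  // linear [lo, hi)
    std::uint64_t claimed = 0;
    for (const KeyRange& r : ranges) {
        assert(r.begin < ring_size && r.length <= ring_size);
        if (r.length == 0) continue;
        claimed += r.length;
        std::uint64_t end = r.begin + r.length;
        if (end <= ring_size) {
            segs.push_back({r.begin, end});
        } else {  // the range wraps around the ring: split it in two
            segs.push_back({r.begin, ring_size});
            segs.push_back({0, end - ring_size});
        }
    }
    std::sort(segs.begin(), segs.end());
    std::uint64_t covered = 0;
    std::uint64_t reach = 0;  // end of the region counted so far
    for (const auto& s : segs) {
        std::uint64_t lo = std::max(s.first, reach);
        if (s.second > lo) {
            covered += s.second - lo;
            reach = s.second;
        }
    }
    return {covered, claimed};
}

// Perfect ring: every key is covered (no hole) and no key is claimed twice (no overlap).
bool is_perfect_ring(const Coverage& c, std::uint64_t ring_size) {
    return c.covered == ring_size && c.claimed == c.covered;
}
```

Here `ring_size - covered` counts keys in holes and `claimed - covered` counts overlapping claims, which gives the two quantities the consistency/availability tradeoff is measured against.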
