Distributed Debugging Presenter: Chi-Hung Lu
Problems • Distributed applications are hard to validate • Distribution of application state across many distinct execution environments • Protocols involve complex interactions among a collection of networked machines • Need to handle failures ranging from network problems to crashing nodes • Intricate sequences of events can trigger complex errors as a result of mishandled corner cases
Approaches • Logging-based debugging: X-Trace, Bi-directional Distributed BackTracker (BDB), Pip • Deterministic replay: WiDS, Friday, Jockey • Model checking: MaceMC
X-Trace: A Pervasive Network Tracing Framework R. Fonseca et al., NSDI '07
Problem Description • It is difficult to diagnose the source of a problem in an Internet application • Current network diagnostic tools each focus on one particular protocol • They do not share information about the application among users, service operators, and network operators
Examples • traceroute • Can locate IP connectivity problems • Cannot reveal proxy or DNS failures • HTTP monitoring suite • Can locate application problems • Cannot diagnose routing problems
Examples • [Figure: User, DNS Server, Proxy, Web Server]
X-Trace • An integrated tracing framework • Records the network paths that were taken • Invoke X-Trace when initiating an application task • Insert X-Trace metadata with a task identifier in the request • Propagate the metadata down to lower layers through protocol interfaces
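The propagation step can be sketched in a few lines of Python. This is a minimal illustration, not the X-Trace implementation: the `Metadata` fields, the id generator, and the example task identifier are all assumptions, and `push_next` / `push_down` are only loosely modeled on the propagation primitives described in the X-Trace paper.

```python
from dataclasses import dataclass
from itertools import count

_next_op_id = count(1)  # hypothetical source of unique operation ids


@dataclass(frozen=True)
class Metadata:
    task_id: str    # shared by every operation in the task
    op_id: int      # id of this operation
    parent_id: int  # operation that caused this one (0 for the root)
    edge: str       # "root", "next" (same layer) or "down" (lower layer)


def start_task(task_id):
    """Create the metadata for the operation that initiates a task."""
    return Metadata(task_id, next(_next_op_id), 0, "root")


def push_next(md):
    """Propagate metadata to the next hop in the same layer."""
    return Metadata(md.task_id, next(_next_op_id), md.op_id, "next")


def push_down(md):
    """Propagate metadata to the layer below."""
    return Metadata(md.task_id, next(_next_op_id), md.op_id, "down")


# A simple HTTP request through a proxy (illustrative labels only).
http_client = start_task("task-42")
tcp_to_proxy = push_down(http_client)   # HTTP hands the request to TCP
http_proxy = push_next(http_client)     # next HTTP hop: the proxy
tcp_to_server = push_down(http_proxy)
http_server = push_next(http_proxy)
```

Every operation carries the same `task_id`, while `parent_id` and `edge` record how it was reached, which is the information needed later to rebuild the tree.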
Task Tree • X-Trace tags all network operations resulting from a particular task with the same task identifier • The task tree is the set of network operations connected with an initial task • The task tree can be reconstructed from the reports collected as trace data
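Reconstruction from reports might look like the sketch below. The report format (dicts with `task_id`, `op_id`, `parent_id`, `label`) is an assumption made for illustration, not the X-Trace report format. Reports can arrive in any order and from several tasks; an operation whose parent report is missing is treated as a root, so a lost report splits the tree rather than breaking reconstruction.

```python
from collections import defaultdict


def build_task_tree(reports, task_id):
    """Group the reports of one task into a parent -> children mapping."""
    labels, parent_of = {}, {}
    for r in reports:
        if r["task_id"] == task_id:
            labels[r["op_id"]] = r["label"]
            parent_of[r["op_id"]] = r["parent_id"]

    children = defaultdict(list)
    roots = []
    for op in sorted(labels):
        parent = parent_of[op]
        if parent in labels:
            children[parent].append(op)
        else:
            roots.append(op)  # true root, or orphan after a lost report
    return labels, children, roots


def render(labels, children, roots, depth=0):
    """Return the tree as indented lines, one per operation."""
    lines = []
    for op in roots:
        lines.append("  " * depth + labels[op])
        lines.extend(render(labels, children, children[op], depth + 1))
    return lines


# Hypothetical reports, deliberately out of order and mixed with another task.
reports = [
    {"task_id": "t1", "op_id": 3, "parent_id": 1, "label": "HTTP proxy"},
    {"task_id": "t1", "op_id": 1, "parent_id": 0, "label": "HTTP client"},
    {"task_id": "t2", "op_id": 9, "parent_id": 0, "label": "unrelated task"},
    {"task_id": "t1", "op_id": 2, "parent_id": 1, "label": "TCP client-proxy"},
    {"task_id": "t1", "op_id": 4, "parent_id": 3, "label": "HTTP server"},
]

tree_lines = render(*build_task_tree(reports, "t1"))
```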
An example of the task tree • A simple HTTP request through a proxy
X-Trace Components • Data: X-Trace metadata, network path, task tree • Report: used to reconstruct the task tree
Propagation of X-Trace Metadata • The propagation of X-Trace metadata through the task tree
Usage Scenario (1) • Web request and recursive DNS queries
Usage Scenario (2) • A request fault annotated with user input
Usage Scenario (3) • A client and a server communicate over I3 overlay network
Usage Scenario (3) • Internet Indirect Infrastructure (I3)
Usage Scenario (3) • Tree for normal operation
Usage Scenario (3) • The receiver host fails
Usage Scenario (3) • Middlebox process crash
Usage Scenario (3) • The middlebox host fails
Discussion • Report loss • Non-tree request structures • Partial deployment • Managing report traffic • Security Considerations
WiDS Checker: Combating Bugs in Distributed Systems X. Liu et al., NSDI '07
Problem Description • Log mining is both labor-intensive and fragile • Latent bugs are often distributed across multiple nodes • Logs reflect incomplete information about an execution • Distributed applications are non-deterministic
Goals • Efficiently verify application properties • Provide fairly complete information about an execution • Reproduce the buggy runs deterministically and faithfully
Approach • Log the actual execution of a distributed system • Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed from logs • Output a violation report along with message traces • An execution is interpreted as a sequence of events, which are dispatched to the corresponding handling routines
Components • A versatile scripting language: lets a developer refine system properties into straightforward assertions • A checker: inspects the execution for violations
Architecture • Components of WiDS Checker
Architecture • Reproduce real runs: log all non-deterministic events using Lamport’s logical clock • Check user-defined predicates: a versatile scripting language specifies the system states being observed and the predicates for invariants and correctness • Screen out false alarms with auxiliary information (for liveness properties) • Trace root causes using a visualization tool
Programming with WiDS • WiDS APIs are mostly member functions of the WiDSObject class • The WiDS runtime maintains an event queue to buffer pending events and dispatches them to the corresponding handling routines
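The event-queue model can be illustrated with a small Python sketch. It is not the WiDS API: the `Runtime` class, its `on` / `post` / `run` methods, and the event kinds are hypothetical names chosen for this example.

```python
import heapq


class Runtime:
    """Toy event loop: buffers pending events and dispatches them in time order."""

    def __init__(self):
        self._queue = []     # heap of (time, sequence, kind, payload)
        self._seq = 0        # tie-breaker so equal-time events stay FIFO
        self._handlers = {}  # event kind -> handling routine

    def on(self, kind, handler):
        self._handlers[kind] = handler

    def post(self, time, kind, payload):
        heapq.heappush(self._queue, (time, self._seq, kind, payload))
        self._seq += 1

    def run(self):
        while self._queue:
            time, _, kind, payload = heapq.heappop(self._queue)
            self._handlers[kind](time, payload)


trace = []
rt = Runtime()


def on_message(now, payload):
    trace.append((now, "message", payload))
    rt.post(now + 5, "timer", "retry " + payload)  # handlers may post new events


def on_timer(now, payload):
    trace.append((now, "timer", payload))


rt.on("message", on_message)
rt.on("timer", on_timer)
rt.post(2, "message", "ping")
rt.post(1, "timer", "heartbeat")
rt.run()
```

Because every source of activity goes through one queue, the same loop can be driven by a live system, a test script, or a recorded log.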
Enabling Replay • Logging: log all WiDS non-determinism; redirect OS calls and log the results; embed a Lamport clock in each outgoing message • Checkpoint: supports partial replay; saves the WiDS process context • Replay: start from the beginning or from a checkpoint; replay events in serialized Lamport order
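A minimal sketch of Lamport-clock logging and serialized replay order follows. The `Node` class, the log entry layout, and the node-id tie-break are assumptions for illustration; the sketch covers only the ordering idea, not checkpoints or OS-call redirection.

```python
class Node:
    """Toy node that stamps every logged event with a Lamport clock."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0
        self.log = []  # entries are (clock, node_id, description)

    def local_event(self, what):
        self.clock += 1
        self.log.append((self.clock, self.node_id, what))

    def send(self, what):
        self.clock += 1
        self.log.append((self.clock, self.node_id, "send " + what))
        return (self.clock, what)  # the clock travels inside the message

    def receive(self, message):
        sent_clock, what = message
        self.clock = max(self.clock, sent_clock) + 1
        self.log.append((self.clock, self.node_id, "recv " + what))


def replay_order(nodes):
    """Merge per-node logs into one sequence ordered by (clock, node id)."""
    return sorted(entry for node in nodes for entry in node.log)


a, b = Node("A"), Node("B")
b.local_event("init")
ping = a.send("ping")
b.local_event("tick")
b.receive(ping)
pong = b.send("pong")
a.receive(pong)

order = replay_order([a, b])
```

Sorting by clock guarantees that every send is replayed before its matching receive; breaking ties by node id makes the order total, so repeated replays see the same sequence.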
Checker • Observe memory state • Define states and evaluate predicates • Refresh database for each event • Maintain history • Re-evaluate modified predicates • Auxiliary information for violations • Liveness properties are only guaranteed to hold eventually
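The per-event checking loop can be sketched as below. This is an illustration under simplifying assumptions, not the WiDS Checker: state is a plain dictionary rather than a database, every predicate is re-evaluated on every event, and the event format and the `at_most_one_leader` predicate are hypothetical. It covers safety invariants only; a liveness property would need extra information to decide whether a violation is real or merely not yet satisfied.

```python
def check(events, predicates):
    """Replay events, refresh the observed state, and record violations."""
    state = {}       # node -> {variable: value}, refreshed on each event
    violations = []  # (event index, predicate name)
    for step, (node, key, value) in enumerate(events):
        state.setdefault(node, {})[key] = value
        for name, predicate in predicates.items():
            if not predicate(state):
                violations.append((step, name))
    return violations


def at_most_one_leader(state):
    """Safety invariant: no two nodes believe they are the leader."""
    leaders = [n for n, s in state.items() if s.get("role") == "leader"]
    return len(leaders) <= 1


# Hypothetical replayed events: (node, variable, new value).
events = [
    ("A", "role", "leader"),
    ("B", "role", "follower"),
    ("B", "role", "leader"),    # B promotes itself before A steps down
    ("A", "role", "follower"),
]

violations = check(events, {"at_most_one_leader": at_most_one_leader})
```

The index stored with each violation points at the event that broke the invariant, which is the starting point for tracing the root cause in the message flow.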
Visualization Tools • Message flow graph
Evaluation • Benchmark and result summary
Performance • Running time for evaluating predicates
Logging Overhead • Percentage of logging time