Distributed Debugging

Distributed Debugging. Presenter: Chi-Hung Lu. Problems: distributed applications are hard to validate; application state is distributed across many distinct execution environments; protocols involve complex interactions among a collection of networked machines.


Presentation Transcript


  1. Distributed Debugging Presenter: Chi-Hung Lu

  2. Problems • Distributed applications are hard to validate • Distribution of application state across many distinct execution environments • Protocols involve complex interactions among a collection of networked machines • Need to handle failures ranging from network problems to crashing nodes • Intricate sequences of events can trigger complex errors as a result of mishandled corner cases

  3. Approaches • Logging-based Debugging: X-Trace, Bi-directional Distributed BackTracker (BDB), Pip • Deterministic Replay: WiDS, Friday, Jockey • Model Checking: MaceMC

  4. X-Trace: A Pervasive Network Tracing Framework R. Fonseca et al., NSDI '07

  5. Problem Description • It is difficult to diagnose the source of a problem in an Internet application • Current network diagnostic tools each focus on only one particular protocol • They do not share information about the application among the user, the service, and the network operators

  6. Examples • traceroute: can locate IP connectivity problems, but cannot reveal proxy or DNS failures • HTTP monitoring suite: can locate application problems, but cannot diagnose routing problems

  7-10. Examples • [Figure: User, DNS Server, Proxy, Web Server]

  11. X-Trace • An integrated tracing framework • Records the network paths that were taken • Invoke X-Trace when initiating an application task • Insert X-Trace metadata with a task identifier in the request • Propagate the metadata down to lower layers through protocol interfaces
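The insertion and propagation steps on this slide can be sketched roughly as follows. This is a minimal illustration, not X-Trace's actual metadata format or API: the `XTraceMetadata` fields and the `start_task` and `propagate` helpers are assumed names chosen for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class XTraceMetadata:
    # Field names are illustrative assumptions, not the real wire format.
    task_id: str                     # shared by every operation of one task
    op_id: str                       # identifies this particular operation
    parent_id: Optional[str] = None  # operation that caused this one


def new_id() -> str:
    """Return a short random identifier."""
    return uuid.uuid4().hex[:8]


def start_task() -> XTraceMetadata:
    """Create metadata when an application task is initiated."""
    return XTraceMetadata(task_id=new_id(), op_id=new_id())


def propagate(md: XTraceMetadata) -> XTraceMetadata:
    """Derive metadata for an operation caused by `md`, e.g. at a lower layer."""
    return XTraceMetadata(task_id=md.task_id, op_id=new_id(), parent_id=md.op_id)


# An HTTP request hands its metadata down to TCP, which hands it down to IP.
http = start_task()
tcp = propagate(http)
ip = propagate(tcp)
```

Every derived operation keeps the same task identifier, while the parent link records which operation caused it.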

  12. Task Tree • X-Trace tags all network operations resulting from a particular task with the same task identifier • The task tree is the set of network operations connected with an initial task • The task tree can be reconstructed after collecting trace data from reports
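As a rough sketch of the reconstruction step, assume each report carries a task identifier, an operation identifier, and the identifier of the operation that caused it. The report layout and the `build_task_trees` and `render` helpers below are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def build_task_trees(reports):
    """Group reports by task, then map each parent operation to its children."""
    trees = defaultdict(lambda: defaultdict(list))
    for r in reports:
        trees[r["task_id"]][r["parent_id"]].append(r["op_id"])
    return trees


def render(children, parent=None, depth=0):
    """Return the tree below `parent` as indented lines, depth first."""
    lines = []
    for op in children.get(parent, []):
        lines.append("  " * depth + op)
        lines.extend(render(children, op, depth + 1))
    return lines


# Reports may arrive out of order and interleaved with other tasks.
reports = [
    {"task_id": "t1", "op_id": "tcp-1", "parent_id": "http-client"},
    {"task_id": "t1", "op_id": "http-client", "parent_id": None},
    {"task_id": "t1", "op_id": "http-proxy", "parent_id": "http-client"},
    {"task_id": "t2", "op_id": "dns-query", "parent_id": None},
]
trees = build_task_trees(reports)
tree_lines = render(trees["t1"])
```

Because the tree is rebuilt from identifiers rather than from arrival order, reports can be collected independently at each node and joined offline.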

  13. An example of the task tree • A simple HTTP request through a proxy

  14. X-Trace Components • Data • X-Trace metadata • Network path • Task tree • Report • Reconstruct task tree

  15-16. Propagation of X-Trace Metadata • The propagation of X-Trace metadata through the task tree

  17. The X-Trace metadata

  18-19. Operation of X-Trace Metadata

  20-22. X-Trace Report Architecture

  23. Usage Scenario (1) • Web request and recursive DNS queries

  24. Usage Scenario (2) • A request fault annotated with user input

  25. Usage Scenario (3) • A client and a server communicate over the I3 overlay network

  26-28. Usage Scenario (3) • Internet Indirect Infrastructure (I3)

  29. Usage Scenario (3) • Tree for normal operation

  30. Usage Scenario (3) • The receiver host fails

  31. Usage Scenario (3) • Middlebox process crash

  32. Usage Scenario (3) • The middlebox host fails

  33. Discussion • Report loss • Non-tree request structures • Partial deployment • Managing report traffic • Security Considerations

  34. WiDS Checker: Combating Bugs in Distributed Systems X. Liu et al., NSDI '07

  35. Problem Description • Log mining is both labor-intensive and fragile • Latent bugs often are distributed across multiple nodes • Logs reflect incomplete information of an execution • Non-determinism of distributed applications

  36. Goals • Efficiently verify application properties • Provide fairly complete information about an execution • Reproduce the buggy runs deterministically and faithfully

  37. Approach • Log the actual execution of a distributed system • Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed by logs • Output violation report along with message traces • An execution is interpreted as a sequence of events, which are dispatched to corresponding handling routines

  38. Components • A versatile scripting language: allows a developer to refine system properties into straightforward assertions • A checker: inspects for violations

  39. Architecture • Components of WiDS Checker

  40. Architecture • Reproduce real runs: log all non-deterministic events using Lamport’s logical clock • Check user-defined predicates: a versatile scripting language specifies the system states being observed and the predicates for invariants and correctness • Screen out false alarms with auxiliary information (for liveness properties) • Trace root causes using a visualization tool

  41. Programming with WiDS • WiDS APIs are mostly member functions of the WiDSObject class • The WiDS runtime maintains an event queue to buffer pending events and dispatches them to corresponding handling routines
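The event-queue model described on this slide can be illustrated with a small dispatcher. This is only a sketch of the general pattern: the `EventLoop` class and its `on`, `post`, and `run` methods are invented for the example and are not the WiDS API.

```python
from collections import deque


class EventLoop:
    """Buffers pending events and dispatches them to registered handlers."""

    def __init__(self):
        self.queue = deque()   # pending (kind, payload) events, oldest first
        self.handlers = {}     # event kind -> handling routine

    def on(self, kind, handler):
        """Register the handling routine for one kind of event."""
        self.handlers[kind] = handler

    def post(self, kind, payload):
        """Buffer an event until the loop gets to it."""
        self.queue.append((kind, payload))

    def run(self):
        """Dispatch buffered events in arrival order until none remain."""
        while self.queue:
            kind, payload = self.queue.popleft()
            self.handlers[kind](payload)


seen = []
loop = EventLoop()
loop.on("message", lambda p: seen.append(("message", p)))
loop.on("timer", lambda p: seen.append(("timer", p)))
loop.post("timer", 1)
loop.post("message", "ping")
loop.run()
```

Funnelling every event through one queue gives a single place at which events can be logged, reordered, or replayed.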

  42. Enabling Replay • Logging: log all WiDS nondeterminism; redirect OS calls and log the results; embed a Lamport clock in each outgoing message • Checkpoint: support partial replay; save the WiDS process context • Replay: start from the beginning or a checkpoint; replay events in serialized Lamport order
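A minimal sketch of the logging and replay-ordering idea, assuming each logged event is stamped with a Lamport clock value. The `Node` class, the shared `log` list, and the tie-break by node name are assumptions made for the example, not details taken from WiDS.

```python
class Node:
    """A process that stamps every logged event with a Lamport clock."""

    def __init__(self, name, log):
        self.name = name
        self.clock = 0
        self.log = log  # one shared list stands in for per-node log files

    def local_event(self, what):
        self.clock += 1
        self.log.append((self.clock, self.name, what))

    def send(self, payload):
        self.clock += 1
        self.log.append((self.clock, self.name, f"send {payload}"))
        return (self.clock, payload)  # clock embedded in the outgoing message

    def receive(self, message):
        sent_at, payload = message
        # Advance past the sender's clock so the receive orders after the send.
        self.clock = max(self.clock, sent_at) + 1
        self.log.append((self.clock, self.name, f"recv {payload}"))


def replay_order(log):
    """Serialize events by Lamport clock, breaking ties by node name."""
    return sorted(log, key=lambda e: (e[0], e[1]))


log = []
a, b = Node("A", log), Node("B", log)
b.local_event("timer")
msg = a.send("ping")
b.receive(msg)
a.local_event("timer")
ordered = replay_order(log)
```

Sorting by clock value yields one total order in which every send precedes its matching receive, which is what a centralized replayer needs.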

  43. Checker • Observe memory state • Define states and evaluate predicates • Refresh database for each event • Maintain history • Re-evaluate modified predicates • Auxiliary information for violations • Liveness properties are only guaranteed to be true eventually
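The per-event checking loop can be sketched as below. This is a simplified illustration with assumed names (`check`, `at_most_one_leader`, and the event tuple layout); it re-evaluates every predicate after every event, whereas the slide describes re-evaluating only the predicates whose inputs were modified.

```python
def check(events, predicates):
    """Apply events to a state table and evaluate predicates after each one."""
    state = {}
    violations = []
    for step, (node, key, value) in enumerate(events):
        state[(node, key)] = value  # refresh the observed state for this event
        for name, predicate in predicates.items():
            if not predicate(state):
                violations.append((step, name))
    return violations


def at_most_one_leader(state):
    """Safety property: no two nodes believe they are leader at the same time."""
    leaders = [n for (n, k), v in state.items() if k == "role" and v == "leader"]
    return len(leaders) <= 1


events = [
    ("A", "role", "leader"),
    ("B", "role", "follower"),
    ("B", "role", "leader"),    # A has not stepped down yet
    ("A", "role", "follower"),
]
violations = check(events, {"at_most_one_leader": at_most_one_leader})
```

Recording the step at which a predicate first fails points directly at the event to inspect in the message trace.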

  46. Visualization Tools • Message flow graph

  47. Evaluation • Benchmark and result summary

  48. Performance • Running time for evaluating predicates

  49. Logging Overhead • Percentage of logging time
