
Pip: Detecting the Unexpected in Distributed Systems


Presentation Transcript


  1. Pip: Detecting the Unexpected in Distributed Systems
  Patrick Reynolds • Chip Killian • Amin Vahdat • Janet Wiener • Jeff Mogul • Mehul Shah
  http://issg.cs.duke.edu/pip/
  reynolds@cs.duke.edu

  2. Motivation
  • Distributed systems exhibit complex behaviors
  • Some behaviors are unexpected
    • Structural bugs: placement or timing of processing and communication
    • Performance problems: throughput bottlenecks, over- or under-consumption of resources, unexpected interdependencies
  • Parallel, inter-node behavior is hard to capture with serial, single-node tools
    • Not captured by traditional debuggers or profilers
    • Not captured by unstructured log files

  3. Motivation
  Three target audiences:
  • Primary programmer: debugging or optimizing his/her own system
  • Secondary programmer: inheriting a project or joining a programming team; learning how the system behaves
  • Operator: monitoring a running system for unexpected behavior; performing regression tests after a change

  4. Motivation
  • Programmers wish to examine and check system-wide behaviors
    • Causal paths
    • Components of end-to-end delay
    • Attribution of resource consumption
  • Unexpected behavior might indicate a bug
  [Figure: a causal path through a web server, app server, and database, annotated with measurements such as 500 ms of delay and 2000 page faults]

  5. Pip overview
  Pip:
  • Captures events from a running system
  • Reconstructs behavior from events
  • Checks behavior against expectations
  • Displays unexpected behavior
    • Both structure and resource violations
  Goal: help programmers locate and explain bugs
  [Figure: the application produces a behavior model; the Pip checker compares it against expectations and reports unexpected structure and resource violations, which can be explored in the Pip explorer visualization GUI]

  6. Outline
  • Expressing expected behavior
  • Building a model of actual behavior
  • Exploring application behavior
  • Results: FAB, RanSub, SplitStream

  7. Describing application behavior
  • Application behavior consists of paths
    • All events, on any node, related to one high-level operation
    • The definition of a path is programmer-defined
    • A path is often causal, related to a user request
  [Figure: a timeline across WWW, App server, and DB nodes showing processing such as "Parse HTTP", "Run application", "Query", and "Send response"]

  8. Describing application behavior
  • Within paths are tasks, messages, and notices
    • Tasks: processing with start and end points
    • Messages: send and receive events for any communication; includes network, synchronization (lock/unlock), and timers
    • Notices: time-stamped strings; essentially log entries
  [Figure: the same WWW/App server/DB timeline, now with notices such as "Request = /cgi/…", "2096 bytes in response", and "done with request 12"]
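  To make the path model concrete, here is a minimal Python sketch of the data structures this slide implies. The class and field names are illustrative stand-ins, not Pip's actual schema.

      # Hypothetical sketch of the path/task/message/notice model; names are
      # illustrative, not Pip's actual data structures.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class Task:                 # processing with start and end points
          name: str
          start: float            # timestamps in seconds
          end: float

      @dataclass
      class Message:              # any communication: network, locks, timers
          name: str
          send_host: str
          recv_host: str
          send_time: float
          recv_time: float

      @dataclass
      class Notice:               # time-stamped string, essentially a log entry
          text: str
          time: float

      @dataclass
      class Path:                 # all events related to one high-level operation
          path_id: int
          tasks: List[Task] = field(default_factory=list)
          messages: List[Message] = field(default_factory=list)
          notices: List[Notice] = field(default_factory=list)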

  9. Expectations: recognizers
  • Application behavior consists of paths
  • Each recognizer matches paths
    • A path can match more than one recognizer
  • A recognizer can be a validator, an invalidator, or neither
  • Any path matching zero validators or at least one invalidator is unexpected behavior (bug?)

  Example recognizers:
      validator CGIRequest
          task("Parse HTTP") limit(CPU_TIME, 100ms);
          notice(m/Request URL: .*/);
          send(AppServer);
          recv(AppServer);

      invalidator DatabaseError
          notice(m/Database error: .*/);
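  The classification rule on this slide is simple enough to state in code. The sketch below is an illustration of that rule only; the Recognizer type and matches() predicate are hypothetical, not Pip's checker API.

      from dataclasses import dataclass
      from typing import Callable, List

      @dataclass
      class Recognizer:
          name: str
          kind: str                          # "validator", "invalidator", or "neither"
          matches: Callable[[dict], bool]    # predicate over a reconstructed path

      def classify(path: dict, recognizers: List[Recognizer]) -> str:
          """A path matching zero validators, or at least one invalidator,
          is unexpected behavior (and possibly a bug)."""
          matched = [r for r in recognizers if r.matches(path)]
          has_validator = any(r.kind == "validator" for r in matched)
          has_invalidator = any(r.kind == "invalidator" for r in matched)
          return "unexpected" if (not has_validator or has_invalidator) else "valid"

      # usage: an invalidator that matches any path containing a database-error notice
      db_error = Recognizer("DatabaseError", "invalidator",
                            lambda p: any("Database error" in n for n in p["notices"]))
      print(classify({"notices": ["Database error: timeout"]}, [db_error]))  # unexpected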

  10. Expectations: recognizer language
  • repeat: matches a ≤ n ≤ b copies of a block
  • xor: matches any one of several blocks
  • call: includes another recognizer (macro)
  • future: block matches now or later
  • done: forces a named block to match

  Examples:
      repeat between 1 and 3 { … }

      xor {
          branch: …
          branch: …
      }

      future F1 { … }
      …
      done(F1);
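  To illustrate what "repeat" and "xor" mean when matching a path, here is a toy combinator sketch in Python. It matches a flat list of event names and is not Pip's matcher; the example recognizer at the end is made up.

      def literal(name):
          """Match exactly one event with the given name; return resume positions."""
          def match(events, i):
              return [i + 1] if i < len(events) and events[i] == name else []
          return match

      def repeat(lo, hi, block):
          """Match between lo and hi consecutive copies of block."""
          def match(events, i):
              results = [i] if lo == 0 else []
              frontier, count = [i], 0
              while frontier and count < hi:
                  frontier = [j for k in frontier for j in block(events, k)]
                  count += 1
                  if count >= lo:
                      results.extend(frontier)
              return results
          return match

      def xor(*branches):
          """Match any one of several alternative blocks."""
          def match(events, i):
              return [j for b in branches for j in b(events, i)]
          return match

      # Between 1 and 3 "Query" tasks, then either "Send response" or
      # "Report error", consuming the whole path.
      def example_recognizer(events):
          step1 = repeat(1, 3, literal("Query"))(events, 0)
          step2 = [j for i in step1
                     for j in xor(literal("Send response"),
                                  literal("Report error"))(events, i)]
          return len(events) in step2

      print(example_recognizer(["Query", "Query", "Send response"]))  # True
      print(example_recognizer(["Query"] * 4 + ["Send response"]))    # False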

  11. Expectations: aggregate expectations
  • Recognizers categorize paths into sets
  • Aggregates make assertions about sets of paths
    • Count, unique count, resource constraints
    • Simple math and set operators

  Examples:
      assert(instances(CGIRequest) > 4);
      assert(max(CPU_TIME, CGIRequest) < 500ms);
      assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));
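  The three example assertions above translate directly into checks over the set of paths a recognizer matched. A small Python sketch of that evaluation, assuming per-path resource totals in seconds (the dictionary keys are made up):

      def check_aggregates(cgi_paths):
          """cgi_paths: per-path resource totals for paths matched by CGIRequest,
          e.g. {"CPU_TIME": 0.042, "REAL_TIME": 0.180} (seconds)."""
          cpu = [p["CPU_TIME"] for p in cgi_paths]
          real = [p["REAL_TIME"] for p in cgi_paths]
          assert len(cgi_paths) > 4, "instances(CGIRequest) > 4"
          assert max(cpu) < 0.500, "max(CPU_TIME, CGIRequest) < 500ms"
          assert max(real) <= 3 * (sum(real) / len(real)), \
              "max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest)"

      check_aggregates([{"CPU_TIME": 0.04, "REAL_TIME": 0.18}] * 5)  # passes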

  12. Outline
  • Expressing expected behavior
  • Building a model of actual behavior
  • Exploring application behavior
  • Results

  13. Building a behavior model
  Sources of events:
  • Annotations in source code: the programmer inserts statements manually
  • Annotations in middleware: the middleware inserts annotations automatically; faster and less error-prone
  • Passive tracing or interposition: easier, but less information
  • Or any combination of the above
  The model consists of paths constructed from events recorded by the running application.

  14. Annotations
  • Set path ID
  • Start/end task
  • Send/receive message
  • Notice
  [Figure: the WWW/App server/DB timeline again, showing where each annotation type records tasks such as "Parse HTTP", "Run application", "Query", and "Send response" and notices such as "Request = /cgi/…", "2096 bytes in response", and "done with request 12"]
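  As a rough illustration of where the four annotation types go in application code, here is a Python-flavored sketch. Pip's real annotations are library calls inserted into the application (or by middleware); the pip_* names and the print-based backend below are invented for this example.

      def _log(*args):                       # stand-in backend that just prints;
          print(*args)                       # real annotations record events for later reconciliation

      def pip_set_path_id(path_id):        _log("set_path_id", path_id)
      def pip_start_task(name):            _log("start_task", name)
      def pip_end_task(name):              _log("end_task", name)
      def pip_notice(text):                _log("notice", text)
      def pip_send_message(dest, msg_id):  _log("send", dest, msg_id)
      def pip_recv_message(src, msg_id):   _log("recv", src, msg_id)

      def handle_request(request_id, url):
          """Where the four annotation types go in a simple request handler."""
          pip_set_path_id(request_id)        # set path ID: tie later events to one path

          pip_start_task("Parse HTTP")       # start/end task
          pip_end_task("Parse HTTP")

          pip_notice("Request = %s" % url)   # notice: time-stamped string

          pip_send_message("AppServer", request_id)   # send/receive message pair
          pip_recv_message("AppServer", request_id)   # (reply from the app server)

      handle_request(12, "/cgi/example")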

  15. Automating expectations and annotations
  • Expectations can be generated from the behavior model
    • Create a recognizer for each actual path
    • Eliminate repetition
    • Strike a balance between over- and under-specification
  • Annotations can be generated by middleware
    • Automatic annotations in Mace, Sandstorm, J2EE, FAB
    • Several of our test systems use Mace annotations
  [Figure: the annotated application produces a behavior model; generated expectations and the model feed the Pip checker, which reports unexpected behavior]
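  A minimal sketch of the "create a recognizer for each actual path, eliminate repetition" idea: collapse immediately repeated tasks into repeat blocks. The output loosely mimics the expectations language shown earlier and is not what Pip actually emits.

      from itertools import groupby

      def recognizer_from_path(name, task_names):
          """Generate a recognizer-like text from one observed path,
          collapsing consecutive duplicate tasks into repeat blocks."""
          lines = ["validator %s" % name]
          for task, group in groupby(task_names):
              n = len(list(group))
              if n == 1:
                  lines.append('    task("%s");' % task)
              else:
                  lines.append('    repeat %d { task("%s"); }' % (n, task))
          return "\n".join(lines)

      print(recognizer_from_path("ObservedRequest",
                                 ["Parse HTTP", "Query", "Query", "Send response"]))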

  16. Checking expectations
  [Figure: processing pipeline from Application → Traces → Reconciliation → Events database → Path construction → Paths → Expectation checking (with Expectations as input) → Categorized paths]
  • Reconciliation: match start/end task and send/receive message events
  • Path construction: organize events into causal paths
  • Expectation checking: for each path P and each recognizer R, does R match P? Then check each aggregate
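  A simplified Python sketch of the reconciliation and path-construction steps: group events by path ID, pair start/end and send/receive events, and order everything by timestamp. The event fields are illustrative, and the code assumes well-formed traces.

      from collections import defaultdict

      def build_paths(events):
          """events: dicts like {"path_id": 12, "host": "www", "time": 0.01,
          "kind": "start_task"|"end_task"|"send"|"recv"|"notice", "name": "Parse HTTP"}"""
          by_path = defaultdict(list)
          for e in events:
              by_path[e["path_id"]].append(e)

          paths = {}
          for pid, evs in by_path.items():
              evs.sort(key=lambda e: e["time"])
              tasks, messages, notices = [], [], []
              open_tasks, pending_sends = {}, {}
              for e in evs:
                  if e["kind"] == "start_task":
                      open_tasks[(e["host"], e["name"])] = e
                  elif e["kind"] == "end_task":            # reconcile start/end
                      start = open_tasks.pop((e["host"], e["name"]))
                      tasks.append((e["name"], start["time"], e["time"]))
                  elif e["kind"] == "send":
                      pending_sends[e["name"]] = e
                  elif e["kind"] == "recv":                # reconcile send/recv
                      send = pending_sends.pop(e["name"])
                      messages.append((e["name"], send["host"], e["host"],
                                       send["time"], e["time"]))
                  else:                                    # notice
                      notices.append((e["name"], e["time"]))
              paths[pid] = {"tasks": tasks, "messages": messages, "notices": notices}
          return paths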

  17. Exploring behavior
  • The expectations checker generates lists of valid and invalid paths
  • Explore both sets
    • Why did invalid paths occur?
    • Is any unexpected behavior misclassified as valid?
      • Insufficiently constrained expectations
      • Pip may be unable to express all expectations
  • Two ways to explore behavior
    • SQL queries over tables: paths, threads, tasks, messages, notices
    • Visualization
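  An example of the kind of SQL exploration the slide describes, run against a tiny in-memory SQLite stand-in. The slide names the tables (paths, threads, tasks, messages, notices); the columns and sample data below are assumptions, not Pip's actual schema.

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE tasks (path_id INTEGER, name TEXT, cpu_time REAL)")
      conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", [
          (1, "Parse HTTP", 0.004), (1, "Query", 0.120),
          (2, "Parse HTTP", 0.005), (2, "Query", 0.450),
      ])

      # Which task names consume the most CPU time?
      query = """
          SELECT name, COUNT(*) AS n, AVG(cpu_time) AS avg_cpu, MAX(cpu_time) AS max_cpu
          FROM tasks
          GROUP BY name
          ORDER BY max_cpu DESC
      """
      for row in conn.execute(query):
          print(row)   # e.g. ('Query', 2, 0.285, 0.45)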

  18. Visualization: causal paths
  [Screenshot of the explorer GUI showing a causal view of one path: timing and resource properties for one task, and the tasks, messages, and notices caused on that thread]

  19. Visualization: communication graph
  • Graph view of all host-to-host network traffic

  20. Visualization: performance graphs
  • Plot per-task or per-path resource metrics
    • Cumulative distribution (CDF), probability density (PDF), or vs. time
  • Click on a point to see its value and the task/path represented
  [Example graph axes: delay (ms) vs. time (s)]
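  A stand-alone illustration (not part of Pip) of two of the plot styles named above, using matplotlib with made-up per-path delay samples: a CDF of delays and delay vs. time.

      import matplotlib.pyplot as plt

      samples = [(0.5, 12.0), (1.1, 15.5), (2.3, 9.8), (3.0, 40.2), (4.2, 14.1)]  # (time_s, delay_ms)
      times = [t for t, _ in samples]
      delays = sorted(d for _, d in samples)
      cdf = [(i + 1) / len(delays) for i in range(len(delays))]

      fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
      ax1.step(delays, cdf, where="post")                 # cumulative distribution
      ax1.set(xlabel="Delay (ms)", ylabel="CDF")
      ax2.plot(times, [d for _, d in samples], "o")       # delay vs. time
      ax2.set(xlabel="Time (s)", ylabel="Delay (ms)")
      plt.tight_layout()
      plt.show()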

  21. Pip vs. printf
  • Both record interesting events to check off-line
  • Pip imposes structure and automates checking
  • Generalizes ad hoc approaches

  22. Results
  • We have applied Pip to several distributed systems:
    • FAB: distributed block store
    • SplitStream: DHT-based multicast protocol
    • RanSub: tree-based protocol used to build higher-level systems
    • Others: Bullet, SWORD, Oracle of Bacon
  • We have found unexpected behavior in each system
  • We have fixed bugs in some systems … and used Pip to verify that the behavior was fixed

  23. Results: SplitStream (DHT-based multicast protocol)
  13 bugs found, 12 fixed
  • 11 found using expectations, 2 found using the GUI
  • Structural bug: some nodes have up to 25 children when they should have at most 18
    • This bug was fixed and later reoccurred
    • Root cause #1: variable shadowing
    • Root cause #2: failed to register a callback
    • How discovered: first in the explorer GUI, then confirmed with automated checking

  24. Results: FAB (distributed block store)
  1 bug (so far), fixed
  • Four protocols checked: read, write, Paxos, membership
  • Performance bug: nodes seeking a quorum call self and peers in arbitrary order
    • Should call self last, to overlap computation
    • For cached blocks, should call self second-to-last

  25. Results: RanSub (tree-based protocol)
  2 bugs found, 1 fixed
  • Structural bug: during the first round of communication, parent nodes send summary messages before hearing from all children
    • Root cause: uninitialized state variables
  • Performance bug: linear increase in end-to-end delay for the first ~2 minutes
    • Suspected root cause: a data structure listing all discovered nodes

  26. Future work
  • Further automation of annotations and tracing
  • Explore tradeoffs between black-box and annotated behavior models
  • Extensible annotations
    • Application-specific schema for notices
  • Composable expectations for large systems

  27. Related work
  • Expectations-based systems
    • PSpec [Perl, 1993]
    • Meta-level compilation [Engler, 2000]
    • Paradyn [Miller, 1995]
  • Causal paths
    • Pinpoint [Chen, 2002]
    • Magpie [Barham, 2004]
    • Project5 [Aguilera, 2003]
  • Model checking
    • MaceMC [Killian, 2006]
    • VeriSoft [Godefroid, 2005]

  28. Conclusions
  • Finding unexpected behavior can help us find bugs
    • Both structure and performance bugs
  • Expectations serve as a high-level external specification
    • Summary of inter-component behavior and timing
    • Regression test for structure and performance
  • Some bugs not exposed by expectations can be found through exploring: queries and visualization

  29. Extra slides

  30. Resource metrics
  • Real time
  • User time, system time
    • CPU time = user time + system time
    • Busy time = CPU time / real time
  • Major and minor page faults (paging and allocation)
  • Voluntary and involuntary context switches
  • Message size and latency
  • Number of messages sent
  • Causal depth of path
  • Number of threads and hosts in the path
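  A tiny worked example of the two derived metrics above; the measurements are made up.

      user_ms, system_ms, real_ms = 300, 100, 800   # made-up per-task measurements

      cpu_ms = user_ms + system_ms      # CPU time = user time + system time
      busy = cpu_ms / real_ms           # busy time: fraction of wall-clock time on CPU

      print(cpu_ms)   # 400
      print(busy)     # 0.5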
