Pip: Detecting the Unexpected in Distributed Systems
Janet Wiener, Jeff Mogul, Mehul Shah, Patrick Reynolds, Chip Killian, Amin Vahdat
http://issg.cs.duke.edu/pip/
reynolds@cs.duke.edu
Motivation
• Distributed systems exhibit complex behaviors
• Some behaviors are unexpected
• Structural bugs
• Placement or timing of processing and communication
• Performance problems
• Throughput bottlenecks
• Over- or under-consumption of resources
• Unexpected interdependencies
• Parallel, inter-node behavior is hard to capture with serial, single-node tools
• Not captured by traditional debuggers, profilers
• Not captured by unstructured log files
Pip - November 2005
Motivation
Three target audiences:
• Primary programmer
  • Debugging or optimizing his/her own system
• Secondary programmer
  • Inheriting a project or joining a programming team
  • Learning how the system behaves
• Operator
  • Monitoring running system for unexpected behavior
  • Performing regression tests after a change
Motivation
• Programmers wish to examine and check system-wide behaviors
• Causal paths
• Components of end-to-end delay
• Attribution of resource consumption
• Unexpected behavior might indicate a bug
[Figure: Web server, App server, and Database, with labels "500ms" and "2000 page faults"]
Pip overview
Pip:
• Captures events from a running system
• Reconstructs behavior from events
• Checks behavior against expectations
• Displays unexpected behavior
  • Both structure and resource violations
Goal: help programmers locate and explain bugs
[Figure: diagram labeled Application, Behavior model, Expectations, Pip checker, Unexpected structure, Resource violations, Pip explorer: visualization GUI]
Outline
• Expressing expected behavior
• Building a model of actual behavior
• Exploring application behavior
• Results
  • FAB
  • RanSub
  • SplitStream
Describing application behavior
• Application behavior consists of paths
• A path is all events, on any node, related to one high-level operation
• The definition of a path is up to the programmer
• A path is often causal, related to a user request
[Figure: timeline across WWW, App server, and DB with tasks Parse HTTP, Send response, Run application, and Query]
Describing application behavior
• Within paths are tasks, messages, and notices
• Tasks: processing with start and end points
• Messages: send and receive events for any communication
  • Includes network, synchronization (lock/unlock), and timers
• Notices: time-stamped strings; essentially log entries
[Figure: timeline across WWW, App server, and DB with tasks Parse HTTP, Send response, Run application, and Query, annotated with the notices “Request = /cgi/…”, “2096 bytes in response”, and “done with request 12”]
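The path structure described on this slide can be sketched as a small data model. This is only an illustration: the class and field names (`Task`, `Message`, `Notice`, `Path`, `hosts`) are assumptions made for the example, not Pip's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Task:
    """Processing with a start and an end point on one host."""
    name: str
    host: str
    start: float  # seconds since the path began
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Message:
    """A matched send/receive pair (network, lock/unlock, or timer)."""
    sender: str
    receiver: str
    send_time: float
    recv_time: float


@dataclass
class Notice:
    """A time-stamped string, essentially a log entry."""
    host: str
    time: float
    text: str


@dataclass
class Path:
    """All events, on any node, that belong to one high-level operation."""
    path_id: int
    tasks: List[Task] = field(default_factory=list)
    messages: List[Message] = field(default_factory=list)
    notices: List[Notice] = field(default_factory=list)

    def hosts(self) -> Set[str]:
        """Return every host that this path touches."""
        found = {t.host for t in self.tasks}
        found |= {n.host for n in self.notices}
        for m in self.messages:
            found.update((m.sender, m.receiver))
        return found


# Hypothetical path for request 12, loosely following the slide's figure.
path = Path(path_id=12)
path.tasks.append(Task("Parse HTTP", "WWW", start=0.0, end=0.25))
path.messages.append(Message("WWW", "App server", send_time=0.25, recv_time=0.5))
path.notices.append(Notice("WWW", 2.0, "done with request 12"))
```

Keeping tasks, messages, and notices as separate lists mirrors the three event kinds named on the slide; a real implementation would likely also record threads and resource usage.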
Expectations: Recognizers
• Application behavior consists of paths
• Each recognizer matches paths
  • A path can match more than one recognizer
• A recognizer can be a validator, an invalidator, or neither
• Any path matching zero validators or at least one invalidator is unexpected behavior: bug?

    validator CGIRequest
      task("Parse HTTP") limit(CPU_TIME, 100ms);
      notice(m/Request URL: .*/);
      send(AppServer);
      recv(AppServer);

    invalidator DatabaseError
      notice(m/Database error: .*/);
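The classification rule on this slide (a path is unexpected if it matches zero validators or at least one invalidator) can be sketched in a few lines. The `Recognizer` class and the representation of a path as a list of notice strings are simplifying assumptions for the example; Pip's recognizers match tasks and messages as well.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Recognizer:
    name: str
    kind: str  # "validator", "invalidator", or "recognizer" (neither)
    matches: Callable[[List[str]], bool]


def is_unexpected(notices: List[str], recognizers: List[Recognizer]) -> bool:
    """Apply the rule: zero validators matched, or any invalidator matched."""
    matched = [r for r in recognizers if r.matches(notices)]
    has_validator = any(r.kind == "validator" for r in matched)
    has_invalidator = any(r.kind == "invalidator" for r in matched)
    return (not has_validator) or has_invalidator


def notice_matcher(pattern: str) -> Callable[[List[str]], bool]:
    """Build a matcher that succeeds if any notice matches the regex."""
    regex = re.compile(pattern)
    return lambda notices: any(regex.fullmatch(n) for n in notices)


recognizers = [
    Recognizer("CGIRequest", "validator", notice_matcher(r"Request URL: .*")),
    Recognizer("DatabaseError", "invalidator", notice_matcher(r"Database error: .*")),
]

ok_path = ["Request URL: /cgi/search"]
failed_path = ["Request URL: /cgi/search", "Database error: timeout"]
unknown_path = ["some other notice"]
```

Here `ok_path` is expected, `failed_path` is unexpected because an invalidator matched, and `unknown_path` is unexpected because no validator matched.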
Expectations: Recognizer language
• repeat: matches between a and b copies of a block
• xor: matches any one of several blocks
• call: include another recognizer (macro)
• future: block matches now or later
• done: force named block to match

    repeat between 1 and 3 { … }
    xor { branch: … branch: … }
    future F1 { … }
    …
    done(F1);
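To make the `repeat` and `xor` semantics concrete, here is a minimal backtracking matcher over a flat list of event names. It is a sketch under strong assumptions: the tuple encoding of patterns is invented for this example, and `call`, `future`, and `done` are not modeled.

```python
def match_from(pattern, events, pos):
    """Yield every position where `pattern` can finish if it starts at `pos`."""
    if not pattern:
        yield pos
        return
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):
        # A literal event name must equal the next event.
        if pos < len(events) and events[pos] == head:
            yield from match_from(rest, events, pos + 1)
    elif head[0] == "repeat":
        _, lo, hi, block = head
        if lo == 0:
            yield from match_from(rest, events, pos)
        ends = {pos}
        for count in range(1, hi + 1):
            # Positions reachable after one more copy of the block.
            ends = {e for p in ends for e in match_from(block, events, p)}
            if not ends:
                break
            if count >= lo:
                for e in ends:
                    yield from match_from(rest, events, e)
    elif head[0] == "xor":
        # Try each branch in turn; any one of them may match.
        for branch in head[1]:
            for e in match_from(branch, events, pos):
                yield from match_from(rest, events, e)


def matches(pattern, events):
    """True if the pattern consumes the whole event list."""
    return any(end == len(events) for end in match_from(pattern, events, 0))


# Hypothetical recognizer: parse, then 1 to 3 queries, then one of two endings.
pattern = [
    "Parse HTTP",
    ("repeat", 1, 3, ["Query"]),
    ("xor", [["Send response"], ["Send error"]]),
]
```

For example, `matches(pattern, ["Parse HTTP", "Query", "Query", "Send response"])` succeeds, while a path with zero or four `Query` events does not.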
Expectations: Aggregate expectations
• Recognizers categorize paths into sets
• Aggregates make assertions about sets of paths
  • Count, unique count, resource constraints
  • Simple math and set operators

    assert(instances(CGIRequest) > 4);
    assert(max(CPU_TIME, CGIRequest) < 500ms);
    assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));
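The three aggregate assertions above can be mirrored over a list of already-categorized paths. The dictionary layout, the millisecond fields, and the sample numbers are made up for illustration.

```python
from statistics import mean

# Hypothetical categorized paths; the last one is a deliberate slow outlier.
paths = [
    {"recognizer": "CGIRequest", "cpu_ms": 80, "real_ms": 200},
    {"recognizer": "CGIRequest", "cpu_ms": 120, "real_ms": 250},
    {"recognizer": "CGIRequest", "cpu_ms": 95, "real_ms": 220},
    {"recognizer": "CGIRequest", "cpu_ms": 110, "real_ms": 240},
    {"recognizer": "CGIRequest", "cpu_ms": 100, "real_ms": 2000},
]


def values(paths, recognizer, metric):
    """Collect one metric over all paths matched by one recognizer."""
    return [p[metric] for p in paths if p["recognizer"] == recognizer]


def check_aggregates(paths):
    """Evaluate the slide's three assertions; an empty set fails by choice."""
    cpu = values(paths, "CGIRequest", "cpu_ms")
    real = values(paths, "CGIRequest", "real_ms")
    return {
        "instances(CGIRequest) > 4": len(cpu) > 4,
        "max(CPU_TIME, CGIRequest) < 500ms": bool(cpu) and max(cpu) < 500,
        "max(REAL_TIME, CGIRequest) <= 3*avg": bool(real) and max(real) <= 3 * mean(real),
    }


results = check_aggregates(paths)
```

With this sample data the first two assertions hold and the third fails, because the 2000 ms path is more than three times the 582 ms average.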
Outline
• Expressing expected behavior
• Building a model of actual behavior
• Exploring application behavior
• Results
Building a behavior model
Sources of events:
• Annotations in source code
  • Programmer inserts statements manually
• Annotations in middleware
  • Middleware inserts annotations automatically
  • Faster and less error-prone
• Passive tracing or interposition
  • Easier, but less information
• Or any combination of the above
Model consists of paths constructed from events recorded by the running application
Annotations
• Set path ID
• Start/end task
• Send/receive message
• Notice
[Figure: timeline across WWW, App server, and DB with tasks Parse HTTP, Send response, Run application, and Query, annotated with the notices “Request = /cgi/…”, “2096 bytes in response”, and “done with request 12”]
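The four annotation kinds listed above could be recorded by a small helper like the one below. The `Annotator` class and its method names are hypothetical and written in Python for illustration; they are not Pip's annotation library.

```python
import time
from contextlib import contextmanager


class Annotator:
    """Records annotation events for one host (illustrative only)."""

    def __init__(self, host, clock=time.monotonic):
        self.host = host
        self.clock = clock
        self.path_id = None
        self.events = []

    def _record(self, kind, detail):
        self.events.append({
            "host": self.host,
            "path_id": self.path_id,
            "kind": kind,
            "detail": detail,
            "time": self.clock(),
        })

    def set_path_id(self, path_id):
        """Later events on this host are attributed to this path."""
        self.path_id = path_id

    @contextmanager
    def task(self, name):
        """Emit start/end task events around a block of work."""
        self._record("start_task", name)
        try:
            yield
        finally:
            self._record("end_task", name)

    def send(self, message_id):
        self._record("send", message_id)

    def receive(self, message_id):
        self._record("receive", message_id)

    def notice(self, text):
        self._record("notice", text)


www = Annotator("WWW")
www.set_path_id(12)
with www.task("Parse HTTP"):
    www.notice("done with request 12")
www.send("m1")
```

Using a context manager for tasks guarantees that every start event gets a matching end event even if the task body raises, which simplifies the later matching step.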
Automating expectations and annotations
• Expectations can be generated from behavior model
  • Create a recognizer for each actual path
  • Eliminate repetition
  • Strike a balance between over- and under-specification
• Annotations can be generated by middleware
  • Automatic annotations in Mace, Sandstorm, J2EE, FAB
  • Several of our test systems use Mace annotations
[Figure: diagram labeled Application, Annotations, Behavior model, Expectations, Pip checker, Unexpected behavior]
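One possible way to "create a recognizer for each actual path" and "eliminate repetition" is to collapse runs of identical events and merge paths that differ only in run length. This heuristic is an assumption made for the example, not a description of Pip's generator.

```python
from itertools import groupby


def signature(events):
    """Collapse runs of identical consecutive events into (name, run_length)."""
    return tuple((name, len(list(run))) for name, run in groupby(events))


def generalize(observed_paths):
    """Merge paths with the same event shape into one template.

    Each template maps a shape (the sequence of distinct event names) to
    per-event (min, max) repeat bounds seen across all observed paths.
    """
    templates = {}
    for events in observed_paths:
        sig = signature(events)
        shape = tuple(name for name, _ in sig)
        counts = [n for _, n in sig]
        if shape not in templates:
            templates[shape] = [(n, n) for n in counts]
        else:
            templates[shape] = [
                (min(lo, n), max(hi, n))
                for (lo, hi), n in zip(templates[shape], counts)
            ]
    return templates


observed = [
    ["Parse", "Query", "Send"],
    ["Parse", "Query", "Query", "Query", "Send"],
    ["Parse", "Send"],
]
templates = generalize(observed)
```

The first two paths merge into one template where `Query` repeats between 1 and 3 times, while the query-free path stays separate. Widening bounds like this is where the over- versus under-specification tradeoff shows up: wider bounds accept more paths, including possibly buggy ones.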
Checking expectations
Pipeline: Application → Traces → Reconciliation → Events database → Path construction → Paths → Expectation checking (with Expectations as input) → Categorized paths
• Reconciliation: match start/end task, send/receive message
• Path construction: organize events into causal paths
• Expectation checking: for each path P, for each recognizer R, does R match P? Check each aggregate
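The reconciliation and path-construction stages can be sketched as grouping events by path ID and pairing start/end and send/receive events. The event dictionary layout is hypothetical, and the sketch assumes timestamps are comparable across hosts, which a real system cannot take for granted.

```python
from collections import defaultdict


def build_paths(events):
    """Group events by path ID, then pair up tasks and messages."""
    by_path = defaultdict(list)
    for ev in events:
        by_path[ev["path_id"]].append(ev)

    paths = {}
    for path_id, evs in by_path.items():
        evs = sorted(evs, key=lambda e: e["time"])
        open_tasks, sends = {}, {}
        tasks, messages, unmatched = [], [], []
        for ev in evs:
            kind, key = ev["kind"], (ev["host"], ev["name"])
            if kind == "start_task":
                open_tasks[key] = ev
            elif kind == "end_task" and key in open_tasks:
                start = open_tasks.pop(key)
                tasks.append((ev["host"], ev["name"], start["time"], ev["time"]))
            elif kind == "send":
                sends[ev["name"]] = ev  # "name" holds the message ID
            elif kind == "recv" and ev["name"] in sends:
                sent = sends.pop(ev["name"])
                messages.append((sent["host"], ev["host"], ev["name"]))
            else:
                unmatched.append(ev)
        # Tasks never ended and messages never received are also unmatched.
        unmatched.extend(open_tasks.values())
        unmatched.extend(sends.values())
        paths[path_id] = {"tasks": tasks, "messages": messages, "unmatched": unmatched}
    return paths


events = [
    {"path_id": 12, "host": "WWW", "kind": "start_task", "name": "Parse HTTP", "time": 0.0},
    {"path_id": 12, "host": "WWW", "kind": "end_task", "name": "Parse HTTP", "time": 0.1},
    {"path_id": 12, "host": "WWW", "kind": "send", "name": "m1", "time": 0.2},
    {"path_id": 12, "host": "App server", "kind": "recv", "name": "m1", "time": 0.3},
    {"path_id": 13, "host": "WWW", "kind": "start_task", "name": "Parse HTTP", "time": 0.4},
]
paths = build_paths(events)
```

Path 12 reconciles cleanly, while path 13 ends up with one unmatched start event, which is the kind of structural anomaly a checker would want to surface.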
Exploring behavior
• Expectations checker generates lists of valid and invalid paths
• Explore both sets
  • Why did invalid paths occur?
  • Is any unexpected behavior misclassified as valid?
    • Insufficiently constrained expectations
    • Pip may be unable to express all expectations
• Two ways to explore behavior
  • SQL queries over tables
    • Paths, threads, tasks, messages, notices
  • Visualization
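As an illustration of the "SQL queries over tables" approach, the sketch below loads two tiny tables into an in-memory SQLite database and asks which invalid paths spent the most time in tasks. The table and column names are assumptions; the slide names the tables (paths, threads, tasks, messages, notices) but not their schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE paths (path_id INTEGER PRIMARY KEY, valid INTEGER);
    CREATE TABLE tasks (path_id INTEGER, host TEXT, name TEXT, real_ms REAL);
""")
conn.executemany("INSERT INTO paths VALUES (?, ?)", [(12, 1), (13, 0)])
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [
        (12, "WWW", "Parse HTTP", 5.0),
        (12, "DB", "Query", 40.0),
        (13, "WWW", "Parse HTTP", 6.0),
        (13, "DB", "Query", 900.0),
    ],
)

# Total task time per invalid path, slowest first.
rows = conn.execute("""
    SELECT t.path_id, SUM(t.real_ms) AS total_ms
    FROM tasks AS t
    JOIN paths AS p ON p.path_id = t.path_id
    WHERE p.valid = 0
    GROUP BY t.path_id
    ORDER BY total_ms DESC
""").fetchall()
conn.close()
```

Flipping the filter to `p.valid = 1` gives the complementary query the slide suggests: looking through valid paths for behavior that the expectations failed to flag.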
Visualization: causal paths
[Figure callouts:]
• Causal view of path
• Caused tasks, messages, and notices on that thread
• Timing and resource properties for one task
Visualization: communication graph
• Graph view of all host-to-host network traffic
Visualization: performance graphs
• Plot per-task or per-path resource metrics
• Cumulative distribution (CDF), probability density (PDF), or metric vs. time
• Click on a point to see its value and the task/path represented
[Figure: example plot with axes Delay (ms) and Time (s)]
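For readers unfamiliar with the CDF view, an empirical CDF over a per-path metric can be computed as below. This is a generic sketch, not the explorer's plotting code; the function name is an assumption.

```python
def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) pairs in sorted order.

    Repeated values produce one point each, so the last point for a
    repeated value carries the correct cumulative fraction.
    """
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Hypothetical per-path delays in milliseconds.
delays_ms = [30, 10, 20, 40]
cdf = empirical_cdf(delays_ms)
```

Each point says what fraction of paths finished within that delay, which makes a long tail of slow paths easy to spot.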
Pip vs. printf
• Both record interesting events to check off-line
• Pip imposes structure and automates checking
• Generalizes ad hoc approaches
Results
• We have applied Pip to several distributed systems:
  • FAB: distributed block store
  • SplitStream: DHT-based multicast protocol
  • RanSub: tree-based protocol used to build higher-level systems
  • Others: Bullet, SWORD, Oracle of Bacon
• We have found unexpected behavior in each system
• We have fixed bugs in some systems
  • … and used Pip to verify that the behavior was fixed
Results: SplitStream (DHT-based multicast protocol)
13 bugs found, 12 fixed
• 11 found using expectations, 2 found using GUI
• Structural bug: some nodes have up to 25 children when they should have at most 18
  • This bug was fixed and later recurred
  • Root cause #1: variable shadowing
  • Root cause #2: failed to register a callback
  • How discovered: first in the explorer GUI, confirmed with automated checking
Results: FAB (distributed block store)
1 bug (so far), fixed
• Four protocols checked: read, write, Paxos, membership
• Performance bug: nodes seeking quorum call self and peers in arbitrary order
  • Should call self last, to overlap computation
  • For cached blocks, should call self second-to-last
Results: RanSub (tree-based protocol)
2 bugs found, 1 fixed
• Structural bug: during first round of communication, parent nodes send summary messages before hearing from all children
  • Root cause: uninitialized state variables
• Performance bug: linear increase in end-to-end delay for the first ~2 minutes
  • Suspected root cause: data structure listing all discovered nodes
Future work
• Further automation of annotations and tracing
• Explore tradeoffs between black-box and annotated behavior models
• Extensible annotations
• Application-specific schema for notices
• Composable expectations for large systems
Related work
• Expectations-based systems
  • PSpec [Perl, 1993]
  • Meta-level compilation [Engler, 2000]
  • Paradyn [Miller, 1995]
• Causal paths
  • Pinpoint [Chen, 2002]
  • Magpie [Barham, 2004]
  • Project5 [Aguilera, 2003]
• Model checking
  • MaceMC [Killian, 2006]
  • VeriSoft [Godefroid, 2005]
Conclusions
• Finding unexpected behavior can help us find bugs
  • Both structure and performance bugs
• Expectations serve as a high-level external specification
  • Summary of inter-component behavior and timing
  • Regression test for structure and performance
• Some bugs not exposed by expectations can be found through exploring: queries and visualization
Resource metrics
• Real time
• User time, system time
• CPU time = user + system
• Busy time = CPU time / real time
• Major and minor page faults (paging and allocation)
• Voluntary and involuntary context switches
• Message size and latency
• Number of messages sent
• Causal depth of path
• Number of threads, hosts in path
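The two derived metrics in this list follow directly from the raw timings. A minimal sketch, with a hypothetical function name and a guard for zero-length intervals that the slide does not specify:

```python
def derived_metrics(real_s, user_s, system_s):
    """Compute CPU time and busy time from raw per-task timings (seconds)."""
    cpu_time = user_s + system_s
    # Busy time is the fraction of wall-clock time spent on the CPU.
    busy_time = cpu_time / real_s if real_s > 0 else 0.0
    return {"cpu_time": cpu_time, "busy_time": busy_time}


# A task that ran for 2 s of wall-clock time, using 0.5 s user and 0.25 s system.
metrics = derived_metrics(real_s=2.0, user_s=0.5, system_s=0.25)
```

A busy time well below 1.0, as in this example (0.375), suggests the task spent most of its time blocked rather than computing.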