Distributed Self-Propelled Instrumentation
Alex Mirgorodskiy, VMware, Inc.
Barton P. Miller, University of Wisconsin-Madison
Motivation
Diagnosis of production systems is hard
• Problems are difficult to reproduce
• Intermittent or environment-specific (anomalies)
• “Rare but dangerous”
• Systems are large collections of black boxes
• Many distributed components, different vendors
• Little support for monitoring/debugging
• Collected data are difficult to analyze
• High volume
• High concurrency
Common Environments: E-commerce Systems
[Figure: multi-tier topology. Clients (client1-client4) reach the site through the Internet, a firewall, a load balancer, and a proxy server, which dispatch requests to Web server processes (W1-W3), App server processes (A1-A3), and Database server processes (D1, D2)]
• Multi-tier: clients, Web servers, DB servers, business logic
• Hard to debug: vendors have SWAT teams to fix bugs
• Some companies get paid $1000/hour
Common Environments: Clusters, HPC, and the Grid
[Figure: Condor job flow. A job description is given to submit on the submitting host (schedd, shadow); the central manager (negotiator, collector) matches it to an execution host (startd, starter), which runs the user job]
• Clusters and HPC systems
• Large-scale: failures happen often (MTTF: 30-150 hours)
• Complex: processing a Condor job involves 10+ processes
• The Grid: beyond a single supercomputer
• Decentralized
• Heterogeneous: different schedulers, architectures
• Hard to detect failures, let alone debug them
Approach
[Figure: an agent propagates across processes P, Q, and R on hosts A and B over the network]
• User provides activation and deactivation events
• Agent propagates through the system
• Collects distributed control-flow traces
• Framework analyzes traces automatically
• Separates traces into flows (e.g., HTTP requests)
• Identifies anomalous flows and the causes of anomalies
Self-Propelled Instrumentation: Overview
• The agent sits inside the process
• Agent = small code fragment
• The agent propagates through the code
• Receives control
• Inserts calls to itself ahead of the control flow
• Crosses process, host, and kernel boundaries
• Returns control to the application
• Key features
• On-demand distributed deployment
• Application-agnostic distributed deployment
Within-process Propagation
• Inject instrumenter.so
• Activate
• Analyze: build call graph/CFG with Dyninst
• Propagate
[Figure: disassembly of a.out. A direct call to foo at 0x8400 is redirected to Patch1 (call instrument(foo); call foo; jmp 0x8405), and an indirect call *%eax at 0x8444 is redirected to Patch2 (call instrument(%eax); call *%eax; jmp 0x8446)]
• Dynamic, low-overhead control-flow tracing
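The patching step can be illustrated with a small model in C++ (a sketch only: the CallSite and Patch types are hypothetical stand-ins for the agent's data structures, not the instrumenter.so or Dyninst API). For each call site that the control flow is about to reach, the agent builds a patch that first calls instrument() on the callee, then performs the original call, and finally jumps back to the instruction after the call.

// Sketch of the within-process propagation step; CallSite/Patch are
// hypothetical types, not the real agent or Dyninst API.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct CallSite {
    uint64_t    addr;        // address of the original call instruction
    uint64_t    fallthrough; // address of the instruction after the call
    std::string callee;      // "foo" for direct calls, "*%eax" for indirect ones
};

struct Patch {
    std::string instrumentCall; // call into the agent, e.g. instrument(foo)
    std::string originalCall;   // the relocated original call
    uint64_t    jumpBack;       // where control returns in a.out
};

// Build one patch per call site found ahead of the control flow.  In the real
// agent the call instruction itself is overwritten with a jump to the patch.
std::vector<Patch> buildPatches(const std::vector<CallSite>& sites) {
    std::vector<Patch> patches;
    for (const CallSite& s : sites) {
        patches.push_back({"instrument(" + s.callee + ")",
                           "call " + s.callee,
                           s.fallthrough});
    }
    return patches;
}

int main() {
    // The two call sites from the slide: a direct call to foo and an
    // indirect call through %eax (addresses taken from the figure).
    std::vector<CallSite> sites = {{0x8400, 0x8405, "foo"},
                                   {0x8444, 0x8446, "*%eax"}};
    for (const Patch& p : buildPatches(sites)) {
        std::cout << p.instrumentCall << "; " << p.originalCall
                  << "; jmp 0x" << std::hex << p.jumpBack << "\n";
    }
}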
Cross-process Propagation
[Figure: process P (a.out + agent.so) on host A communicates with process Q on host B over a socket. P's agent intercepts send(msg), identifies the peer, and asks spDaemon on host B to inject the agent: recv(portQ); pidQ = port2pid(portQ); hijack(pidQ, agent.so); send(done). P then sends a mark message followed by msg; Q's agent intercepts recv and, once it sees the mark, starts propagating before returning control]
• On-demand distributed deployment
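The daemon side of the hand-off can be sketched as follows. This is a sketch under stated assumptions: port2pid() and hijack() are placeholders for the real port-lookup and injection machinery, and the request/acknowledgment format is invented. spDaemon receives the peer port, maps it to a pid, injects agent.so, and acknowledges so the sender can forward the mark and the original message.

// Sketch of the spDaemon request handling from the figure; port2pid() and
// hijack() are hypothetical stand-ins, not the tool's real interfaces.
#include <iostream>
#include <map>
#include <string>

// Map a local TCP port to the pid of the process bound to it
// (the real daemon would consult the OS, e.g. /proc on Linux).
int port2pid(int port, const std::map<int, int>& portTable) {
    auto it = portTable.find(port);
    return it == portTable.end() ? -1 : it->second;
}

// Load agent.so into the target process; here we only report the action.
bool hijack(int pid, const std::string& library) {
    std::cout << "injecting " << library << " into pid " << pid << "\n";
    return true;
}

// Handle one request from the agent in process P: "start tracing whoever is
// listening on portQ", then acknowledge so P can forward the mark and msg.
std::string handleRequest(int portQ, const std::map<int, int>& portTable) {
    int pidQ = port2pid(portQ, portTable);
    if (pidQ < 0 || !hijack(pidQ, "agent.so")) return "fail";
    return "done";
}

int main() {
    std::map<int, int> portTable = {{8080, 4242}};  // port -> pid (made up)
    std::cout << handleRequest(8080, portTable) << "\n";
}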
PDG for a Simple Socket Program
[Figure: PDG for a client/server socket program. Client events (start, connect, send, recv, stop) and server events (accept, recv, send, close), with cross-process edges linking each send to its matching recv]
• PDG: Parallel Dynamic Program Dependence Graph
• Nodes: observed events
• Intra-process edges: link consecutive events
• Cross-process edges: link sends with matching recvs
• PDGs from real systems are more complex
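A minimal sketch of PDG construction, assuming each traced event records its process id and, for sends and recvs, a message id used for matching (the Event and Edge types and the example trace are illustrative, not the tool's trace format).

// Sketch: build PDG edges from per-process event traces.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Event {
    int         id;     // global event id (node in the PDG)
    int         pid;    // process the event was observed in
    std::string kind;   // "send", "recv", or a local event name
    int         msgId;  // message id for send/recv, -1 otherwise
};

struct Edge { int from, to; };

std::vector<Edge> buildPDG(const std::vector<Event>& trace) {
    std::vector<Edge> edges;
    std::map<int, int> lastInProcess;  // pid -> most recent event id
    std::map<int, int> pendingSend;    // msgId -> send event id

    for (const Event& e : trace) {
        // Intra-process edge: link consecutive events of the same process.
        auto it = lastInProcess.find(e.pid);
        if (it != lastInProcess.end()) edges.push_back({it->second, e.id});
        lastInProcess[e.pid] = e.id;

        // Cross-process edge: link a send with its matching recv.
        if (e.kind == "send") pendingSend[e.msgId] = e.id;
        if (e.kind == "recv" && pendingSend.count(e.msgId))
            edges.push_back({pendingSend[e.msgId], e.id});
    }
    return edges;
}

int main() {
    // Client (pid 1) and server (pid 2) from the slide, one message each way.
    std::vector<Event> trace = {
        {0, 1, "start", -1}, {1, 1, "connect", -1}, {2, 1, "send", 10},
        {3, 2, "accept", -1}, {4, 2, "recv", 10},   {5, 2, "send", 11},
        {6, 1, "recv", 11},  {7, 1, "stop", -1}};
    for (const Edge& e : buildPDG(trace))
        std::cout << e.from << " -> " << e.to << "\n";
}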
Automated Diagnosis
• Challenge for manual examination: high volume of trace data
• Automated approach: find anomalies
• Normal behavior is often repetitive
• Pathological problems are often easy to find
• Focus on anomalies: infrequent bad behavior
Overview of the Approach
[Figure: flows Φ1-Φ4; one is flagged as anomalous and traced back to its cause]
• Obtain a collection of control flows
• E.g., per-request traces in a Web server
• Anomaly detection: find an unusual flow
• Summarize each flow as a profile
• Assign suspect scores to profiles
• Root cause analysis: find why a profile is anomalous
• Identify the function responsible for the anomaly
Anomaly Detection: Distributed Profiles
[Figure: a flow h spanning two processes, with events main entry, foo entry, foo exit, send, recv, bar exit, main exit. The time between consecutive events is attributed to the current function, giving per-function times t_main, t_foo, and t_bar; normalizing them yields the time profile p_t(h) = <0.4, 0.3, 0.3>]
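A sketch of how a flow can be summarized as a normalized time profile, assuming the trace records timestamped entry and exit events and that the time between consecutive events is charged to the function currently on top of the call stack. The function names and timestamps below are made up to reproduce the <0.4, 0.3, 0.3> profile from the slide.

// Sketch: summarize one flow as a normalized time profile p_t(h).
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Event {
    double      t;     // timestamp in seconds
    std::string func;  // function the event belongs to
    bool        entry; // true for entry, false for exit
};

std::map<std::string, double> timeProfile(const std::vector<Event>& flow) {
    std::map<std::string, double> seconds;
    std::vector<std::string> stack;
    double prev = flow.empty() ? 0.0 : flow.front().t;

    for (const Event& e : flow) {
        // Charge the interval since the previous event to the current function.
        if (!stack.empty()) seconds[stack.back()] += e.t - prev;
        prev = e.t;
        if (e.entry) stack.push_back(e.func);
        else if (!stack.empty()) stack.pop_back();
    }
    double total = 0;
    for (auto& [f, s] : seconds) total += s;
    if (total > 0)
        for (auto& [f, s] : seconds) s /= total;  // normalize to fractions
    return seconds;
}

int main() {
    std::vector<Event> flow = {
        {0, "main", true}, {2, "foo", true}, {5, "foo", false},
        {6, "bar", true},  {9, "bar", false}, {10, "main", false}};
    for (auto& [f, frac] : timeProfile(flow))
        std::cout << f << ": " << frac << "\n";  // main 0.4, foo 0.3, bar 0.3
}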
Anomaly Detection: Suspect Scores
[Figure: profiles g, h, and g_k shown as points; σ(g, g_k) is the distance between profiles g and g_k]
• σ(g) = distance to a common or known-normal node
• Can detect multiple anomalies
• Does not require known examples of prior runs
• Unsupervised algorithm
• Can use such examples for fewer false positives
• One-class ranking algorithm
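One plausible instantiation of the suspect score, shown as a sketch rather than the paper's exact formulation: treat each profile as a point, measure pairwise distances (Manhattan distance here, an assumption), and score a flow by its distance to the nearest other profile. The one-class variant would measure the distance to the nearest known-normal profile instead.

// Sketch of unsupervised suspect-score computation over time profiles.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

using Profile = std::vector<double>;  // normalized time profile p_t(h)

// Manhattan distance between two profiles of equal length.
double dist(const Profile& a, const Profile& b) {
    double d = 0;
    for (size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

// A profile is suspicious if it is far from every other profile.
double suspectScore(const std::vector<Profile>& all, size_t g) {
    double best = std::numeric_limits<double>::infinity();
    for (size_t k = 0; k < all.size(); ++k)
        if (k != g) best = std::min(best, dist(all[g], all[k]));
    return best;
}

int main() {
    std::vector<Profile> flows = {
        {0.40, 0.30, 0.30}, {0.42, 0.28, 0.30},
        {0.41, 0.31, 0.28}, {0.10, 0.10, 0.80}};  // the last one is unusual
    for (size_t g = 0; g < flows.size(); ++g)
        std::cout << "flow " << g << " score " << suspectScore(flows, g) << "\n";
}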
Finding the Cause: Coverage Analysis
[Figure: call trees for the anomalous flow (Anom), a normal flow (Norm), and their difference Δ = Anom - Norm]
• Find call paths taken only in the anomalous flow
• Δ = {main→A, main→A→B, main→A→C, main→D→E, main→D→C}
• Such paths are correlated with the failure
• Likely location of the problem
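The coverage difference itself is a set difference over call paths. In the sketch below, paths are encoded as strings such as "main->A->B"; the Norm path set is invented for illustration so that the output reproduces the Δ shown above.

// Sketch: Δ = Anom - Norm over sets of call paths.
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> coverageDiff(const std::set<std::string>& anom,
                                      const std::set<std::string>& norm) {
    std::vector<std::string> delta;
    for (const std::string& path : anom)
        if (!norm.count(path)) delta.push_back(path);  // taken only in Anom
    return delta;
}

int main() {
    std::set<std::string> anom = {"main", "main->A", "main->A->B", "main->A->C",
                                  "main->D", "main->D->E", "main->D->C"};
    std::set<std::string> norm = {"main", "main->D", "main->D->B"};  // made up
    for (const std::string& p : coverageDiff(anom, norm))
        std::cout << p << "\n";  // the five paths of Δ from the slide
}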
Finding the Cause: Coverage Analysis
[Figure: the difference Δ and the filtered set Δ' obtained by retaining shortest prefixes and merging leaves]
• Limitation of coverage analysis: too many reports
• Noise in the trace, different input, workload
• Can eliminate effects of earlier differences
• Retain the shortest prefixes in Δ
• Merge leaves
• Can rank paths by time of occurrence or by length
• Puts the cause ahead of the symptoms or simplifies manual examination
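Retaining the shortest prefixes can be sketched as dropping any path whose proper prefix is already in Δ, so only the first point of divergence is reported. The string-based path encoding and the helper below are illustrative.

// Sketch: keep only the shortest-prefix paths of Δ.
#include <iostream>
#include <set>
#include <string>
#include <vector>

// True if some proper prefix of path (at a call boundary) is already in Δ.
bool hasPrefixInDelta(const std::string& path, const std::set<std::string>& delta) {
    for (size_t pos = path.rfind("->"); pos != std::string::npos;
         pos = path.rfind("->", pos - 1)) {
        if (delta.count(path.substr(0, pos))) return true;
        if (pos == 0) break;
    }
    return false;
}

std::vector<std::string> shortestPrefixes(const std::set<std::string>& delta) {
    std::vector<std::string> kept;
    for (const std::string& p : delta)
        if (!hasPrefixInDelta(p, delta)) kept.push_back(p);
    return kept;
}

int main() {
    std::set<std::string> delta = {"main->A", "main->A->B", "main->A->C",
                                   "main->D->E", "main->D->C"};
    for (const std::string& p : shortestPrefixes(delta)) std::cout << p << "\n";
    // Prints main->A, main->D->C, main->D->E; the last two can then be
    // merged into main->D->{C,E} ("merge leaves").
}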
Separating Concurrent Flows
• Concurrency produces interleaved traces
• Servers switch from one request to another
• Analyzing interleaved traces is difficult
• Irrelevant details from other users
• High trace variability → everything is an anomaly
• Solution: separate traces into flows
Flow-Separation Algorithm
[Figure: two browsers interleaved at one Web server. browser1: click1, connect, send URL, recv page, show page; browser2: click2, connect, send URL, recv page, show page; Web server: select, accept, recv URL, send page for each request]
• Decide when two events are in the same flow
• Keep (send → recv) and (local → non-recv) edges
• Remove all other edges
• Flow = events reachable from a start event
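A sketch of the flow-separation rules, assuming each event records a process id and a message id for matching sends with recvs. The Event fields and the browser/server trace below are illustrative; only browser1's request is shown.

// Sketch: keep (send -> matching recv) and (previous event -> non-recv in the
// same process) edges, then take everything reachable from a start event.
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Event {
    int         id, pid;
    std::string kind;   // "send", "recv", or "local"
    int         msgId;  // matches a send with its recv; -1 for local events
};

std::vector<int> flowFrom(int start, const std::vector<Event>& trace) {
    std::map<int, std::vector<int>> next;  // retained edges
    std::map<int, int> sendOf;             // msgId -> send event id
    std::map<int, int> prevInProc;         // pid -> previous event id
    for (const Event& e : trace) {
        if (e.kind == "send") sendOf[e.msgId] = e.id;
        if (e.kind == "recv" && sendOf.count(e.msgId))
            next[sendOf[e.msgId]].push_back(e.id);   // send -> recv
        if (prevInProc.count(e.pid) && e.kind != "recv")
            next[prevInProc[e.pid]].push_back(e.id); // local -> non-recv
        prevInProc[e.pid] = e.id;
    }
    // Flow = events reachable from the start event.
    std::vector<int> flow;
    std::queue<int> q;
    std::map<int, bool> seen;
    for (q.push(start), seen[start] = true; !q.empty(); q.pop()) {
        flow.push_back(q.front());
        for (int n : next[q.front()])
            if (!seen[n]) { seen[n] = true; q.push(n); }
    }
    return flow;
}

int main() {
    // browser1 (pid 1) and the Web server (pid 2); browser2 is omitted.
    std::vector<Event> trace = {
        {0, 1, "local", -1 /* click1 */},   {1, 1, "send", 10 /* send URL */},
        {2, 2, "recv", 10 /* recv URL */},  {3, 2, "send", 11 /* send page */},
        {4, 1, "recv", 11 /* recv page */}, {5, 1, "local", -1 /* show page */}};
    for (int id : flowFrom(0, trace)) std::cout << id << " ";
    std::cout << "\n";  // 0 1 2 3 4 5: one request, one flow
}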
Limitation
[Figure: client1 and client2 send requests to a server that enqueues them (enQ1, enQ2) and later dequeues them (deQ1, deQ2)]
• Rules are violated for programs with queues
• enQ1 and deQ1 must belong to the same flow
• They are assigned to different flows by our application-independent algorithm
Addressing the Limitation: Directives
[Figure: the same queue scenario; the enQ1/deQ1 and enQ2/deQ2 pairs are joined by directives]
• Pair events using custom <evt, joinattr> directives
• evt: location in the code
• joinattr: related events have equal attribute values
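A sketch of how a directive could be applied, assuming each matched event exposes the value of joinattr (e.g., a request id). The Directive and Event types and the code locations below are made up for illustration; they are not the tool's real directive syntax.

// Sketch: events at a directive's locations with equal joinattr values are
// grouped together and then merged into one flow.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Directive {
    std::set<std::string> evts;  // code locations, e.g. enqueue and dequeue
};

struct Event {
    int         id;
    std::string location;
    long        joinattr;  // attribute value extracted at this event
};

// Group matching events by attribute value; each group is one set of events
// that the application-independent rules alone would have split apart.
std::map<long, std::vector<int>> applyDirective(const Directive& d,
                                                const std::vector<Event>& evts) {
    std::map<long, std::vector<int>> groups;
    for (const Event& e : evts)
        if (d.evts.count(e.location)) groups[e.joinattr].push_back(e.id);
    return groups;
}

int main() {
    Directive d{{"server.c:enqueue", "server.c:dequeue"}};
    std::vector<Event> evts = {{1, "server.c:enqueue", 100 /* enQ1 */},
                               {2, "server.c:enqueue", 200 /* enQ2 */},
                               {3, "server.c:dequeue", 100 /* deQ1 */},
                               {4, "server.c:dequeue", 200 /* deQ2 */}};
    for (auto& [attr, ids] : applyDirective(d, evts)) {
        std::cout << "request " << attr << ":";
        for (int id : ids) std::cout << " e" << id;
        std::cout << "\n";  // enQ1/deQ1 joined; enQ2/deQ2 joined
    }
}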
Experimental Study: Condor
[Figure: Condor job lifecycle. A job description is passed to condor_submit on the submitting host (schedd, shadow); the central manager (negotiator, collector) matches the job to the execution host (startd, starter), which runs the user job; the job output is returned to the submitting host]
Job-run-twice Problem
• Fault handling in Condor
• Any component can fail
• Detect the failure
• Restart the component
• Bug in the shadow daemon
• Symptom: the user job ran twice
• Cause: intermittent crash after the shadow reported successful job completion
Debugging Approach
• Insert an intermittent fault into the shadow
• Submit a cluster of several jobs
• Start tracing condor_submit
• Propagate into schedd, shadow, collector, negotiator, startd, starter, mail, and the user job
• Separate the trace into flows
• Processing each job is a separate flow
• Identify the anomalous flow
• Use unsupervised and one-class algorithms
• Find the cause of the anomaly
Finding Anomalous Flow
• Suspect scores for composite profiles
• Without prior knowledge, Flows 1 and 5 are unusual
• Infrequent but normal activities
• Use prior known-normal traces to filter them out
• Flow 3 is a true anomaly
Finding the Cause
• Computed the coverage difference: 900+ call paths
• Filtered the differences: 37 call paths left
• Ranked the differences: 14th path by time, 1st by length
• As called by schedd: main → DaemonCore::Driver → DaemonCore::HandleDC_SERVICEWAITPIDS → DaemonCore::HandleProcessExit → Scheduler::child_exit → DaemonCore::GetExceptionString
• Called when the shadow terminates with a signal
• Last function called by the shadow = failure location
Conclusion
• Self-propelled instrumentation
• On-demand, low-overhead control-flow tracing
• Across process and host boundaries
• Automated root cause analysis
• Finds anomalous control flows
• Finds the causes of anomalies
• Separation of concurrent flows
• Requires little application-specific knowledge
Related Publications
• A.V. Mirgorodskiy and B.P. Miller, "Diagnosing Distributed Systems with Self-Propelled Instrumentation", under submission. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy07DistDiagnosis.pdf
• A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Problem Diagnosis in Large-Scale Computing Environments", SC'06, Tampa, FL, November 2006. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy06ProblemDiagnosis.pdf
• A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf