1 / 29

Distributed Self-Propelled Instrumentation

Distributed Self-Propelled Instrumentation. Alex Mirgorodskiy VMware, Inc . Barton P. Miller University of Wisconsin-Madison. Motivation. Diagnosis of production systems is hard. Problems are difficult to reproduce Intermittent or environment-specific (anomalies) “Rare but dangerous”

karenkbrown
Download Presentation

Distributed Self-Propelled Instrumentation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Distributed Self-Propelled Instrumentation Alex Mirgorodskiy VMware, Inc. Barton P. Miller University of Wisconsin-Madison

  2. Motivation Diagnosis of production systems is hard • Problems are difficult to reproduce • Intermittent or environment-specific (anomalies) • “Rare but dangerous” • Systems are large collections of black boxes • Many distributed components, different vendors • Little support for monitoring/debugging • Collected data are difficult to analyze • High volume • High concurrency

  3. Common Environments W1 A1 client4 D1 client1 Internet Load balancer A2 W2 firewall client2 D2 proxy server A3 W3 E-commerce systems • Multi-tier: Clients, Web, DB servers, Business Logic • Hard to debug: vendors have SWAT teams to fix bugs • Some companies get paid $1000/hour client3 Database Server processes Web Server processes App Server processes

  4. Common Environments Central Manager Execution Host Submitting host Job description negotiator startd submit collector starter schedd user job shadow • Clusters and HPC systems • Large-scale: failures happen often (MTTF: 30 – 150 hours) • Complex: processing a Condor job involves 10+ processes • The Grid: Beyond a single supercomputer • Decentralized • Heterogeneous: different schedulers, architectures • Hard to detect failures, let alone debug them

  5. Approach Agent Host A Process P Host B network Process R Process Q • User provides activation and deactivation events • Agent propagates through the system • Collects distributed control-flow traces • Framework analyzes traces automatically • Separates traces into flows (e.g., HTTP requests) • Identifies anomalous flows and the causes of anomalies

  6. Self-Propelled Instrumentation: Overview • The agent sits inside the process • Agent = small code fragment • The agent propagates through the code • Receives control • Inserts calls to itself ahead of the control flow • Crosses process, host, and kernel boundaries • Returns control to the application • Key features • On-demand distributed deployment • Application-agnostic distributed deployment

  7. Within-process Propagation Inject instrumenter.so Propagate Analyze: build call graph/CFG with Dyninst Activate a.out 83f0: 83f1: 83f3: 8400: 8405: 8413: 8414: push mov ... call mov pop ret %ebp %esp,%ebp foo %ebp,%esp %ebp bar Patch1 call call jmp instrument(foo) foo 0x8405 jmp patch jmp 8430: 8431: 8433: 8444: 8446: 8449: 844b: 844c: push mov ... call mov xor pop ret %ebp %esp,%ebp *%eax %ebp,%esp %eax,%eax %ebp foo Patch2 call call jmp instrument(%eax) *%eax 0x8446 jmp Dynamic, low-overhead control flow tracing

  8. Cross-process Propagation Host B Host A spDaemon recv(portQ) pidQ=port2pid(portQ) hijack(pidQ, agent.so) send(done) Process P Process Q a.out agent.so agent.so a.out send(msg) get peer inject send(mark) send(msg) jmp back if at mark propagate recv(msg) jmp back recv(msg) mark msg socket On-demand distributed deployment

  9. PDG for a Simple Socket Program start connect send recv stop client server close accept recv send • PDG: Parallel Dynamic Program Dependence Graph • Nodes: observed events • Intra-process edges: link consecutive events • Cross-process edges: link sends with matching recvs • PDGs from real systems are more complex

  10. PDG for One Condor Job

  11. Automated Diagnosis • Challenge for manual examination • High volume of trace data • Automated Approach: find anomalies • Normal behavior often is repetitive • Pathological problems often are easy to find • Focus on anomalies: infrequent bad behavior

  12. Overview of the Approach Φ1 • Obtain a collection of control flows • E.g., per-request traces in a Web server • Anomaly detection: find an unusual flow • Summarize each flow as a profile • Assign suspect scores to profiles • Root cause analysis: find why a profile is anomalous • Function responsible for the anomaly Φ2 Flows Φ3 cause Φ4

  13. Anomaly Detection: Distributed Profiles main entry foo entry foo exit main exit send h tbar tmain tfoo 2s 2s 1s 1s pt(h) = <0.4, 0.3, 0.3> 1s 3s bar exit main exit recv

  14. Anomaly Detection: Suspect Scores gk σ(g,gk) g h • σ(g) = distance to a common or known-normal node • Can detect multiple anomalies • Does not require known examples of prior runs • Unsupervised algorithm • Can use such examples for fewer false positives • One-class ranking algorithm

  15. Finding the Cause: Coverage Analysis Anom Norm Δ = Anom - Norm main main main A D D A D B C E C B C E C • Find call paths taken only in the anomalous flow • Δ = {main→A, main→A→B, main→A→C, main→D→E, main→D→C} • Correlated with the failure • Likely location of the problem

  16. Finding the Cause: Coverage Analysis Δ’ Δ = Anom - Norm main main A D A D E,C B C E C • Limitation of coverage analysis: too many reports • Noise in the trace, different input, workload • Can eliminate effects of earlier differences • Retain the shortest prefixes in Δ • Merge leaves • Can rank paths by the time of occurrence or length • Put the cause ahead of the symptoms or simplify manual examination

  17. PDG for One Condor Job

  18. PDG for Two Condor Jobs

  19. Separating Concurrent Flows • Concurrency produces interleaved traces • Servers switch from one request to another • Analyzing interleaved traces is difficult • Irrelevant details from other users • High trace variability → everything is an anomaly • Solution: separate traces into flows

  20. Flow-Separation Algorithm send URL recv page show page click1 connect • Decide when two events are in the same flow • (send → recv) and (local → non-recv) • Remove all other edges • Flow = events reachable from a start event browser1 recv URL send page select select accept Web server accept select select recv URL send page browser2 click2 connect send URL recv page show page

  21. Limitation client1 • Rules violated for programs with queues • enQ1 and deQ1 must belong to the same flow • Assigned to different flows by our application-independent algorithm deQ1 enQ1 server enQ2 deQ2 client2

  22. Addressing the Limitation: Directives client1 • Pair events using <evt,joinattr> custom directives • Evt: location in the code • Joinattr: related events have equal attr values deQ1 enQ1 server enQ2 deQ2 client2

  23. Experimental Study: Condor Submitting host Central Manager Execution Host Job description negotiator startd Condor_submit collector starter schedd user job shadow Job output

  24. Job-run-twice Problem • Fault handling in Condor • Any component can fail • Detect the failure • Restart the component • Bug in the shadow daemon • Symptoms: user job ran twice • Cause: intermittent crash after shadow reported successful job completion

  25. Debugging Approach • Insert an intermittent fault into shadow • Submit a cluster of several jobs • Start tracing condor_submit • Propagate into schedd, shadow, collector, negotiator, startd, starter, mail, the user job • Separate the trace into flows • Processing each job is a separate flow • Identify anomalous flow • Use unsupervised and one-class algorithms • Find the cause of the anomaly

  26. Finding Anomalous Flow • Suspect scores for composite profiles • Without prior knowledge, Flows 1 and 5 are unusual • Infrequent but normal activities • Use prior known-normal traces to filter them out • Flow 3 is a true anomaly

  27. Finding the Cause • Computed coverage difference • 900+ call paths • Filtered the differences • 37 call paths left • Ranked the differences • 14th path by time / 1st by length as called by schedd: main → DaemonCore::Driver → DaemonCore::HandleDC_SERVICEWAITPIDS → DaemonCore::HandleProcessExit → Scheduler::child_exit → DaemonCore::GetExceptionString • Called when shadow terminates with a signal • Last function called by shadow = failure location

  28. Conclusion • Self-propelled instrumentation • On-demand, low-overhead control-flow tracing • Across process and host boundaries • Automated root cause analysis • Finds anomalous control flows • Finds the causes of anomalies • Separation of concurrent flows • Little application-specific knowledge

  29. Related Publications • A.V. Mirgorodskiy and B.P. Miller, “Diagnosing Distributed Systems with Self-Propelled Instrumentation", Under submission, • ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy07DistDiagnosis.pdf • A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, “Problem Diagnosis in Large-Scale Computing Environments", SC’06, Tampa, FL, November 2006, • ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy06ProblemDiagnosis.pdf • A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005, • ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf

More Related