Scalable Analysis of Distributed Workflow Traces

This paper discusses the motivation, related work, and objective of analyzing distributed workflow traces using NetLogger. It explores the challenges of debugging and optimizing large-scale applications and presents solutions for collecting, managing, and analyzing log data. The paper also introduces a tool for detecting anomalous workflows and highlights key differences in the NetLogger approach.

barnettm

Presentation Transcript


  1. Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory

  2. Outline • Motivation / Why do we care? • Related Work / What have others done? • NetLogger’s Objective / What would we like to do? • Background / What is NetLogger? • How does NetLogger address the problems? • What are the results / costs of the solution?

  3. Motivation • Large-scale applications are widely used in science and business. • Astronomy, Biology, Weather Models, etc. • Large-scale apps are complex and difficult to debug and optimize. • Large number of concurrent operations • Distributed resources • Hard to find bottlenecks

  4. Related Work • Applications can be “tightly coupled”, “loosely coupled” or “uncoupled”. • Tools have mostly focused on tightly coupled applications. • Profiling and tracing code segments (TAU, Paraver, FPMPI, Intel Trace Collector). • Tools extended to loosely coupled apps • SvPablo – Automatic code instrumentation, with statistics collected for sections of source code. • Prophesy – Automatic code instrumentation and a database of performance information, with tunable granularity. • Paradyn – Dynamic insertion of instrumentation at runtime. Designed for message-passing and pthreads programs.

  5. End Objective • Focus on loosely coupled and uncoupled applications. • We would like a tool that can combine performance information of multiple resources and application components and expose their interactions.

  6. NetLogger Background • Log Generation – Calls to logger libraries are added to source code at critical points to create event logs. • Log Management – The various logs are collected and merged based on event timestamps. • Visualization and Analysis – Events, system stats and “lifelines” are displayed.
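
The log management step on this slide (merging logs by event timestamp) can be illustrated with a minimal Python sketch. This is not NetLogger code: the `Event` record, its fields, and the `merge_logs` helper are hypothetical names chosen for illustration, and the sketch assumes each per-host log is already sorted by timestamp.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    # Hypothetical event record; field names are assumptions, not NetLogger's format.
    ts: float   # event timestamp in seconds
    host: str   # host that produced the event
    name: str   # event name


def merge_logs(*logs: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge per-host logs (each sorted by ts) into one time-ordered stream."""
    return heapq.merge(*logs, key=lambda e: e.ts)


host_a = [Event(1.0, "a", "job.start"), Event(4.0, "a", "job.end")]
host_b = [Event(2.5, "b", "io.start"), Event(3.0, "b", "io.end")]
merged = list(merge_logs(host_a, host_b))
```

Because `heapq.merge` is lazy, the merged stream can be consumed without loading every log into memory, provided the inputs are themselves iterators over sorted files.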

  7. Extensions to NetLogger • Scaling NetLogger to large-scale systems (100s of machines) • Collecting distributed log files • Evaluating large log data sets • Addition of workflow identifiers

  8. Log Collection and Management • Netlogd • Collection daemon which accepts logs across the network (UDP or TCP) • Nlforward • For finer-grain instrumentation, events can be written to local disk and forwarded in batches • Nldemux • Server-side tool to scan incoming logs • Split events into separate files • Allows for log file rollovers.
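
The idea of splitting an incoming event stream into separate files with rollover can be sketched as follows. The slide does not say how Nldemux chooses a destination file or when it rolls over, so the routing key (event-name prefix), the `max_per_file` limit, and the in-memory lists standing in for files are all assumptions made for illustration.

```python
from collections import defaultdict


def demux(events, key, max_per_file=1000):
    """Group events by key(event); start a new chunk ("rollover") when one is full.

    Each chunk is a list standing in for one output file (an assumption;
    a real tool would write to disk).
    """
    files = defaultdict(lambda: [[]])
    for ev in events:
        chunks = files[key(ev)]
        if len(chunks[-1]) >= max_per_file:
            chunks.append([])  # roll over to a fresh chunk
        chunks[-1].append(ev)
    return dict(files)


# Hypothetical events, routed by the prefix of the event name.
events = [
    {"event": "job.start"},
    {"event": "io.read"},
    {"event": "job.end"},
    {"event": "job.retry"},
]
out = demux(events, key=lambda e: e["event"].split(".")[0], max_per_file=2)
```

With `max_per_file=2`, the three `job.*` events land in two chunks and the single `io.*` event in one.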

  9. Sifting Through the Data • The huge amount of log data from just 5 nodes obscures important events.

  10. Anomalous Workflow Detection Tool • Define a linear sequence of events in a configuration file. • Mark any workflow lifeline that is missing these events. • Problems: • We would like some context for normal behavior (solved by an option to include neighbors of anomalous lifelines). • Too many events to keep them all in memory for scanning.
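
The basic check on this slide (flag any lifeline that is missing an expected event) can be sketched in a few lines of Python. The event names in `EXPECTED`, the `(workflow_id, event_name)` input shape, and the `find_incomplete` helper are hypothetical; the sketch tests only for presence of each expected event, not for ordering.

```python
from collections import defaultdict

# Hypothetical expected sequence; in the tool this comes from a configuration file.
EXPECTED = ["job.start", "io.read", "compute", "job.end"]


def find_incomplete(events, expected=EXPECTED):
    """Return {workflow_id: [missing event names]} for incomplete lifelines.

    `events` is an iterable of (workflow_id, event_name) pairs.
    """
    seen = defaultdict(set)
    for wf_id, name in events:
        seen[wf_id].add(name)
    required = set(expected)
    return {
        wf_id: [e for e in expected if e not in names]
        for wf_id, names in seen.items()
        if not required <= names
    }


events = [
    ("wf1", "job.start"), ("wf1", "io.read"), ("wf1", "compute"), ("wf1", "job.end"),
    ("wf2", "job.start"), ("wf2", "io.read"),
]
incomplete = find_incomplete(events)
```

This naive version keeps every lifeline's event set in memory, which is exactly the scaling problem the slide raises.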

  11. Solutions • Solution 1 • Create a histogram with 100 bins for normal workflow execution times. • Time out after the 99th percentile. • Runs in a fixed memory footprint. • Supports additional parameters (min time, max time, etc.) • Solution 2 • Calculate a running mean and standard deviation of workflow runtimes. • Assumes a normal distribution of times.
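
Both solutions can be sketched in fixed memory. The code below is an illustration under stated assumptions, not the tool's implementation: class and parameter names are hypothetical, the histogram uses equal-width bins between `min_time` and `max_time`, and the running statistics use Welford's online algorithm, which the slide does not name. The three-standard-deviation cutoff is likewise an assumed default.

```python
import math


class TimeoutHistogram:
    """Fixed-size histogram of workflow durations (Solution 1)."""

    def __init__(self, min_time=0.0, max_time=100.0, bins=100, percent=99):
        self.min_time, self.max_time = min_time, max_time
        self.width = (max_time - min_time) / bins
        self.counts = [0] * bins
        self.total = 0
        self.percent = percent

    def add(self, duration):
        idx = int((duration - self.min_time) / self.width)
        idx = min(max(idx, 0), len(self.counts) - 1)  # clamp outliers into edge bins
        self.counts[idx] += 1
        self.total += 1

    def timeout(self):
        """Upper edge of the bin where the cumulative count reaches the percentile."""
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if self.total and running * 100 >= self.percent * self.total:
                return self.min_time + (i + 1) * self.width
        return self.max_time


class RunningStats:
    """Running mean and standard deviation via Welford's algorithm (Solution 2)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x, k=3.0):
        # Assumes roughly normal runtimes; k is an assumed threshold.
        return self.n > 1 and abs(x - self.mean) > k * self.stddev()


hist = TimeoutHistogram()
for i in range(100):
    hist.add(i + 0.5)  # one sample per bin

stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(x)
```

The histogram needs only 100 counters regardless of how many workflows are observed, and the running statistics need three numbers; the trade-off is that the second approach is only meaningful when runtimes are close to normally distributed.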

  12. NetLogger Workflow-logging Architecture

  13. New Log Visualization • 3 incomplete events from the previous picture are shown in blue, with context events shown in red. • Able to detect several errors in the SNFactory workflow application.

  14. Key Differences in NetLogger • Use of “Lifelines” to trace sequence of actions. • Workflow anomaly detection. • Facilitate log collection from multiple locations. • Manual instrumentation of source code. • Must have source code and understand it.

  15. The End. • Questions? • Comments?
