This presentation covers the motivation, related work, and objective of analyzing distributed workflow traces using NetLogger. It explores the challenges of debugging and optimizing large-scale applications and presents approaches for collecting, managing, and analyzing log data. It also introduces a tool for detecting anomalous workflows and highlights key differences of the NetLogger approach.
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory
Outline • Motivation / Why do we care? • Related Work / What have others done? • NetLogger’s Objective / What would we like to do? • Background / What is NetLogger? • How does NetLogger address the problems? • What are the results / costs of the solution?
Motivation • Large-scale applications are widely used in science and business. • Astronomy, Biology, Weather Models, etc. • Large-scale apps are complex and difficult to debug and optimize. • Large number of concurrent operations • Distributed resources • Hard to find bottlenecks
Related Work • Applications can be "tightly coupled", "loosely coupled", or "uncoupled". • Tools have mostly focused on tightly coupled applications. • Profiling and tracing of code segments (TAU, Paraver, FPMPI, Intel Trace Collector) • Tools extended to loosely coupled apps • SvPablo – Automatic code instrumentation; statistics collected for sections of source code. • Prophesy – Automatic code instrumentation and a database of performance information, with tunable granularity. • Paradyn – Dynamic instrumentation inserted at runtime; designed for message-passing and pthreads programs.
End Objective • Focus on loosely coupled and uncoupled applications. • We would like a tool that can combine performance information from multiple resources and application components and expose their interactions.
NetLogger Background • Log Generation – calls to logger libraries are added to source code at critical points to create event logs. • Log Management – the various logs are collected and merged based on event timestamps. • Visualization and Analysis – events, system stats, and "lifelines" are displayed.
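A minimal sketch of the log-generation step (illustrative Python, not the actual NetLogger API; the log_event helper, event names, and field names are assumptions): each instrumented point writes a timestamped event tagged with a workflow identifier, so that logs from many components can later be merged and grouped into lifelines.

import json
import socket
import time

def log_event(name, workflow_id, **fields):
    # Emit one timestamped event record in a simple name=value style.
    event = {
        "ts": time.time(),         # timestamp used later when merging logs
        "event": name,             # e.g. "transfer.start" / "transfer.end"
        "wf.id": workflow_id,      # ties events from many hosts into one lifeline
        "host": socket.gethostname(),
    }
    event.update(fields)
    # This sketch appends JSON lines to a local file; a real deployment
    # would hand the event to the NetLogger library or a forwarder.
    with open("app.events.log", "a") as f:
        f.write(json.dumps(event) + "\n")

# Instrumentation at critical points in the application code:
log_event("transfer.start", "wf-001", file="input.dat")
# ... perform the transfer ...
log_event("transfer.end", "wf-001", bytes=1048576)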
Extensions to NetLogger • Scaling NetLogger to large-scale systems (100s of machines) • Collecting distributed log files • Evaluating large log data sets • Addition of workflow identifiers
Log Collection and Management • netlogd • Collection daemon that accepts logs across the network (UDP or TCP) • nlforward • For finer-grained instrumentation, events can be written to local disk and forwarded in batches • nldemux • Server-side tool to scan incoming logs • Splits events into separate files • Supports log-file rollover
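To illustrate the collection side, here is a minimal sketch of a netlogd/nldemux-style collector (the port number, JSON record format, and per-host file naming are assumptions for this example, not the behavior of the real tools): it accepts event records over UDP and splits them into separate files by sending host.

import json
import socket

LISTEN_PORT = 14830  # arbitrary port chosen for this sketch

def run_collector():
    # Accept event records over UDP and demultiplex them into
    # one log file per sending host.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", LISTEN_PORT))
    while True:
        data, (host, _port) = sock.recvfrom(65535)
        try:
            event = json.loads(data.decode())
        except ValueError:
            continue  # skip malformed records
        with open("events.%s.log" % host, "a") as f:
            f.write(json.dumps(event) + "\n")

if __name__ == "__main__":
    run_collector()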
Sifting Through the Data • The huge amount of log data from just 5 nodes obscures important events.
Anomalous Workflow Detection Tool • Define a linear sequence of events in a configuration file. • Mark any workflow lifeline that is missing these events. • Problems: • We would like some context for normal behavior (solved by an option to include neighbors of anomalous lifelines). • Too many events to keep them all in memory for scanning.
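The detection rule can be sketched as follows (the expected event sequence and record layout are illustrative assumptions): a lifeline is flagged as anomalous if it is missing any event from the configured linear sequence.

EXPECTED_SEQUENCE = ["job.submit", "transfer.start", "transfer.end", "job.done"]

def find_anomalous_lifelines(events):
    # events: iterable of dicts with at least "wf.id" and "event" keys.
    # Returns the workflow ids whose lifelines are missing expected events.
    seen = {}  # workflow id -> set of observed event names
    for ev in events:
        seen.setdefault(ev["wf.id"], set()).add(ev["event"])
    return {wf for wf, names in seen.items()
            if not set(EXPECTED_SEQUENCE) <= names}

Even this per-workflow state can grow too large for big traces, which motivates the fixed-memory solutions on the next slide.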
Solutions • Solution 1 • Create a histogram with 100 bins of normal workflow execution times. • Time out workflows that run past the 99th percentile. • Runs in a fixed memory footprint. • Supports additional parameters (min time, max time, etc.) • Solution 2 • Calculate a running mean and standard deviation of workflow runtimes. • Assumes a statistically normal distribution of times.
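Both solutions can be sketched in a few lines of Python (the class names, duration bounds, and 3-sigma threshold are illustrative assumptions, not the tool's implementation). Solution 1 keeps a fixed 100-bin histogram of completed workflow durations and treats the 99th-percentile bin edge as the timeout; Solution 2 maintains a running mean and standard deviation (e.g. with Welford's algorithm) and flags durations several standard deviations above the mean.

import math

class HistogramTimeout:
    # Solution 1 sketch: fixed 100-bin histogram of workflow durations;
    # the timeout is the 99th-percentile bin edge. Memory use is constant.
    def __init__(self, min_time=0.0, max_time=600.0, bins=100):
        self.min_time, self.max_time = min_time, max_time
        self.counts = [0] * bins
        self.total = 0

    def add(self, duration):
        width = (self.max_time - self.min_time) / len(self.counts)
        i = int((duration - self.min_time) / width)
        self.counts[min(max(i, 0), len(self.counts) - 1)] += 1
        self.total += 1

    def timeout(self):
        # Duration above which a workflow is considered anomalous.
        target, cum = 0.99 * self.total, 0
        width = (self.max_time - self.min_time) / len(self.counts)
        for i, c in enumerate(self.counts):
            cum += c
            if cum >= target:
                return self.min_time + (i + 1) * width
        return self.max_time

class RunningStatsTimeout:
    # Solution 2 sketch: running mean/stddev via Welford's algorithm;
    # assumes workflow durations are roughly normally distributed.
    def __init__(self, num_sigmas=3.0):
        self.n, self.mean, self.m2, self.num_sigmas = 0, 0.0, 0.0, num_sigmas

    def add(self, duration):
        self.n += 1
        delta = duration - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (duration - self.mean)

    def timeout(self):
        std = math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
        return self.mean + self.num_sigmas * std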
New Log Visualization • The 3 incomplete events from the previous picture are shown in blue, with context events shown in red. • Able to detect several errors in the SNFactory workflow application.
Key Differences in NetLogger • Use of “Lifelines” to trace sequence of actions. • Workflow anomaly detection. • Facilitate log collection from multiple locations. • Manual instrumentation of source code. • Must have source code and understand it.
The End. • Questions? • Comments?