Path-Based Failure and Evolution Management in Self-* Systems

CSE 598B: Self-* Systems
Path-Based Failure and Evolution Management
Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer (UC Berkeley, Stanford U, Tellme Networks, eBay Inc.)
Presented by: Arjun R. Nath

The Problem...
• Computing systems are increasing in complexity.
• They are tending towards large, complex, distributed systems.
• Sometimes thousands of machines are involved.
• Basic system management is becoming increasingly difficult.
• Everything from detecting and diagnosing failures to understanding application behaviour is becoming very difficult.

...The Problem
• Existing techniques such as code-level debuggers, program slicing, process profiling and application logs fail to characterize overall system behaviour.
• Distributed debuggers are available but focus on a homogeneous subset of the system.

Goal of the paper
• Techniques to help us understand large distributed systems.
• Improve availability, reliability and manageability.
Why are we looking at this paper? (Self-* context)
• This paper is about techniques for monitoring large, complex, distributed systems.

Two main principles
• Path-Based Measurement:
  - Model the system as a collection of paths through heterogeneous components.
  - Make local observations along the paths and store them. These can be accessed via queries and visualization techniques. (Focus is on correctness rather than performance)
• Statistical Behaviour Analysis:
  - Large volumes of system requests are stored for statistical analysis, using classical techniques to identify deviations from normal behaviour.
  - This can be applied to live systems or used for offline analysis.

What is a "Path"?
• A path is associated with a request and records its control flow and resources.
• Paths may have inter-path dependencies: shared state, shared database tables, shared filesystems, shared memory.
• Multiple paths may be grouped together in sessions.
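The path notion above can be pictured as a small data structure. This is only a minimal sketch, not the representation used by the paper's tools: the class names, field names and the `shared_resources` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Observation:
    """One local observation made by a component while serving a request."""
    component: str                      # hypothetical component name, e.g. "web"
    start_ms: float                     # when the component started its work
    duration_ms: float                  # how long the component took
    error: Optional[str] = None         # set if the component reported a failure
    resources: List[str] = field(default_factory=list)  # e.g. database tables


@dataclass
class Path:
    """All observations for one request, optionally grouped into a session."""
    request_id: str
    session_id: Optional[str] = None
    observations: List[Observation] = field(default_factory=list)

    def components(self) -> List[str]:
        # Control flow: components in the order the request reached them.
        ordered = sorted(self.observations, key=lambda o: o.start_ms)
        return [o.component for o in ordered]

    def failed(self) -> bool:
        return any(o.error is not None for o in self.observations)


def shared_resources(a: Path, b: Path) -> Set[str]:
    """Resources touched by both paths: a possible inter-path dependency."""
    ra = {r for o in a.observations for r in o.resources}
    rb = {r for o in b.observations for r in o.resources}
    return ra & rb
```

Two paths in the same session that both touch a hypothetical `orders_table` would show up in `shared_resources`, which is the kind of inter-path dependency the slide lists.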

Coarse grained paths

Fine grained paths

How do paths help?
• Failure Management
• Evolution (of the system)

Failure Management...
• Detection:
  - Reduce downtime associated with detection delays.
  - Using paths can help in noticing developing problems before they become severe.
  - The key is to define "normal" behaviour statistically and then check for deviations.
• Diagnosis:
  - Isolate problems using solely the recorded path observations, then drive the diagnosis process with the path information.
  - Paths help identify which components are involved in a given failure and aid in identifying causes.
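The detection and diagnosis ideas can be sketched in a few lines. The slides do not name a specific statistical test, so the z-score check and the per-component failure ratio below are stand-ins chosen for illustration; the function names, the threshold of 3.0 and the input shapes are assumptions.

```python
from collections import Counter
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Detection: flag a value that deviates strongly from 'normal'.

    'Normal' is defined here by the mean and standard deviation of a
    baseline sample (an assumed, simple model of normal behaviour).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def rank_suspects(paths):
    """Diagnosis: rank components by the share of their paths that failed.

    `paths` is an iterable of (components, failed) pairs, where
    `components` lists the components a path went through.
    """
    seen = Counter()
    failed = Counter()
    for components, did_fail in paths:
        for component in set(components):
            seen[component] += 1
            if did_fail:
                failed[component] += 1
    scores = {c: failed[c] / seen[c] for c in seen}
    # Highest failure ratio first; ties broken by name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A component that appears mostly in failed paths rises to the top of `rank_suspects`, which only narrows the search; it does not prove that the component is the cause.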

...Failure Management
• Impact Analysis:
  - Helps in knowing the scale of the problem, and so in estimating time-to-repair.
  - Shows which other paths are at risk.

Evolution (of the system)
• It's very difficult to get an overall picture of how a complex distributed system changes with time:
  - Software/hardware upgrades, patches, code changes etc.
  - Systems evolve through changes to their components and also through changes in how they interact.
• Paths help in revealing system structure and dependencies and in tracking changes.
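One way to picture how paths reveal structural change is to compare the component-to-component edges seen before and after an upgrade. This is a hedged sketch rather than the paper's method; `edges`, `structure_diff` and the component names are hypothetical.

```python
def edges(paths):
    """Collect the (caller, callee) pairs observed across a set of paths.

    Each path is given as the ordered list of components it went through.
    """
    found = set()
    for components in paths:
        found.update(zip(components, components[1:]))
    return found


def structure_diff(before, after):
    """Report interactions that appeared or disappeared between two periods."""
    old, new = edges(before), edges(after)
    return {"added": new - old, "removed": old - new}


if __name__ == "__main__":
    before = [["web", "app", "db"]]
    after = [["web", "app", "cache"], ["web", "app", "db"]]
    print(structure_diff(before, after))
```

Here the diff reports a new `app` to `cache` interaction, the sort of dependency change that is otherwise hard to see across many components.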

Implementation

Implementation: Architecture

...Implementation...
• Tracers track a request through the target system.
• Each request has an associated identifier that is maintained throughout the path.
• IDs may be stored in extensible headers (HTTP, SOAP).
• Tracers are platform-specific but can be generic to applications using the same platform (J2EE, .NET).
• Pinpoint, ObsLogs and SuperCal all have tracers.
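A tracer of this kind can be sketched as a wrapper that keeps one request ID in the headers and reports an observation per component. This is an illustration only, not the tracer of Pinpoint, ObsLogs or SuperCal: the `Tracer` class, the `X-Request-Id` header name and the list used as a sink are assumptions.

```python
import time
import uuid

TRACE_HEADER = "X-Request-Id"  # hypothetical header name, not from the paper


class Tracer:
    """Records one observation per request handled by a component."""

    def __init__(self, component, sink):
        self.component = component
        self.sink = sink  # list-like object standing in for the aggregator

    def handle(self, headers, work):
        # Reuse the ID if an upstream component set it; otherwise create one.
        request_id = headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
        start = time.monotonic()
        error = None
        try:
            return work(headers)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Always report, whether the work succeeded or failed.
            self.sink.append({
                "request_id": request_id,
                "component": self.component,
                "duration_s": time.monotonic() - start,
                "error": error,
            })
```

Because the ID travels in the headers, a frontend tracer and a backend tracer that see the same request report the same `request_id`, which is what lets their observations be joined into one path later.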

…Implementation: tools.. Three systems that support path-based analysis

...Implementation
• Aggregator and Repository:
  - The Aggregator receives observations from tracers and reconstructs paths using IDs.
  - It stores these in the Repository.
  - There may also be a Central Repository that collects from distributed repositories.
• Analysis Engines and Visualization:
  - Single and multi-path analysis
  - Dedicated engines for various statistical tests
  - Support for some data mining tools
  - Visualization: Tukey’s boxplots generated using Octave
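The aggregator's job of rebuilding paths from scattered observations can be sketched as a group-by on the request ID. This is a minimal sketch under assumed input: each observation is a dict with hypothetical `request_id`, `component` and `start` keys, and a plain dict stands in for the repository.

```python
from collections import defaultdict


def reconstruct_paths(observations):
    """Group observations by request ID and order each group by start time.

    Returns a dict mapping request ID to its ordered observations, which
    stands in for the repository here.
    """
    paths = defaultdict(list)
    for obs in observations:
        paths[obs["request_id"]].append(obs)
    for obs_list in paths.values():
        obs_list.sort(key=lambda o: o["start"])
    return dict(paths)


if __name__ == "__main__":
    incoming = [
        {"request_id": "r1", "component": "db", "start": 3},
        {"request_id": "r2", "component": "web", "start": 1},
        {"request_id": "r1", "component": "web", "start": 0},
    ]
    for request_id, path in reconstruct_paths(incoming).items():
        print(request_id, [o["component"] for o in path])
```

Observations can arrive out of order and interleaved across requests; sorting within each group restores the order in which the request moved through the components.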

...Implementation
• A trend specific to recognition time in Tellme application A suggests a regression in a speech grammar in that application.
• The Tukey boxplots shown illustrate a distribution’s center, spread, and asymmetries by using rectangles to show the upper and lower quartiles and the median, and explicitly plotting each outlier.
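The numbers behind a Tukey boxplot can be computed directly. The slides used Octave; the sketch below uses Python's standard library instead and assumes the common 1.5 × IQR rule for outliers, which the slides do not state. Quartile conventions differ between tools, so exact values may not match Octave's.

```python
from statistics import quantiles


def tukey_summary(values, k=1.5):
    """Return the quartiles, median and outliers that a boxplot would draw.

    Points further than k * IQR beyond the quartiles count as outliers
    (k = 1.5 is the usual convention, assumed here).
    """
    q1, median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


if __name__ == "__main__":
    # Hypothetical recognition times; the last one is far from the rest.
    print(tukey_summary([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

Comparing such summaries across releases is one way a shift in the median or a new cluster of outliers, like the recognition-time trend described above, would become visible.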

Limitations and constraints
• Cannot resolve fault causes at a very detailed level.
• Overheads can be high for fine-grained paths.
• Need to decide which observations to include in paths. This is an iterative process.
• Can be difficult to implement, especially for existing systems.

It's important to understand that path-based analysis is an aid to fault detection and recovery and not a solution in itself. It is meant to be used in combination with traditional fault handling techniques.

Conclusion
• As systems get more complex, path-based analysis tools will have increasing importance.
• Path-based fault analysis complements traditional techniques.
• Hardly any fully functional path-based fault management tools are available.
• This paper:
  - Has breadth but lacks depth in some places.
  - Needs more data from production environment experiments.
  - Should have concentrated on one or two implementations and included more details.
  - Gives little information on SuperCal and ObsLogs.

Other related stuff
• “Pinpoint” project at Stanford: http://swig.stanford.edu/pinpoint.shtml (some interesting papers here)
• Magpie project (Microsoft)
• Quest Software: JProbe, a Java performance profiler
• Borland's OptimizeIt Enterprise Suite

That’s all folks, • Thank You
