Pythia Detection, Localization, and Diagnosis of Performance Problems using perfSONAR

Pythia Detection, Localization, and Diagnosis of Performance Problems using perfSONAR • Constantine Dovrolis (PI) • Partha Kanuparthy, Sajjad Zarifzadeh, Madhwaraj GK • Georgia Institute of Technology

Basics • Pythia is a data-analysis tool, utilizing data collected through perfSONAR • Our focus: performance problems • Objectives: detection, localization, diagnosis • Funded by DoE: started Sep/2011

One tool, three objectives • Detection: “noticeable loss rate between ORNL and SLAC on 07/11/11 at 09:00:02 EDT” • Localization: “it happened at DENV-SLAC link” • Diagnosis: “it was due to insufficient router buffers”

Pythia – System architecture • Centralized process pulls data from perfSONAR measurement archives (MAs): OWAMP, traceroute, .. • [Figure: Pythia server connected to OWAMP MA 1, OWAMP MA 2, OWAMP MA 3, a traceroute MA, a BWCTL MA, ... MA; stage labels: Preprocessing, Detection, Localization, Diagnosis]

Detection “noticeable loss rate between ORNL and SLAC on 07/11/11 at 09:00:02 EDT”

Detection • Look for statistically significant deviations from baseline one-way delays (OWDs) • But the baseline can change abruptly • Scalability requirement: only a single pass through the OWAMP timeseries is allowed • [Figure: OWD timeseries with the baseline marked; congestion on NY-CLEV]

Detection (cont'd) • Dynamic estimation of the baseline OWD • Based on a kernel density estimator over a sliding window • Identify level shifts (e.g., NTP clock shifts, routing changes) • All statistically significant deviations are considered potential performance problems • [Figure: congestion on NY-CLEV]
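
The detection idea above can be sketched in a few lines. This is a minimal illustration, not Pythia's implementation: the function names, the window length, the kernel bandwidth, and the fixed 5 ms threshold (a stand-in for the statistical-significance test, which the slides do not specify) are all assumptions. The baseline is taken as the mode of a Gaussian kernel density estimate over the most recent samples, and the timeseries is read once.

```python
import math
from collections import deque


def kde_mode(samples, bandwidth):
    """Return the sample at which a Gaussian kernel density estimate is highest."""
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return max(samples, key=density)


def detect_deviations(owds, window=20, bandwidth=0.5, threshold=5.0):
    """Single pass over OWD samples (ms); return indices far above the baseline.

    window, bandwidth and threshold are hypothetical tuning values.
    """
    recent = deque(maxlen=window)   # sliding window of the latest samples
    flagged = []
    for i, owd in enumerate(owds):
        if len(recent) == window:   # wait until the window is full
            baseline = kde_mode(recent, bandwidth)
            if owd - baseline > threshold:
                flagged.append(i)
        recent.append(owd)          # flagged samples also enter the window
    return flagged


# Synthetic trace: ~10 ms baseline, a short 25 ms excursion, then back to baseline.
trace = [10.0 + 0.1 * (i % 3) for i in range(30)] + [25.0] * 3 + [10.0] * 5
events = detect_deviations(trace)
```

Because flagged samples still enter the window, a sustained shift (such as a routing change) eventually becomes the new baseline once it dominates the window; a real detector would need an explicit level-shift test to tell that apart from congestion.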

Some detection results • Detection outputs congestion events: longer than 10 s, with start and end timestamps • ESnet data: 12 days, 33 monitors • Internet2 data: 22 days, 9 monitors

How long are the observed congestion events? • ESnet, Internet2: 90% of events are 10-20 s long • This is sufficient to affect application performance • Delay increases by tens of milliseconds • Some events are common across paths • [Figure panels: ESnet, Internet2]

Are lossy events common? • ESnet: no lossy congestion events • Internet2: 6 of 2268 events are lossy (< 0.1% loss rate as sampled by OWAMP) • [Figure panel: Internet2]

Localization “it happened at DENV-SLAC link”

Network tomography • Given N sensors, monitor N(N-1) directed paths in terms of OWD and L3 routing (traceroute) • Given path measurements {m_{i,j}}, infer link measurements {x_l}, so that the following path-metric constraints are satisfied

Prior work in net-tomography • Either analogue tomography, i.e., the link and path metrics are real numbers • Example: path delay = sum (link delays) • Very sensitive to measurement noise, requires long measurements • Or binary tomography, i.e., the link and path metrics are Boolean (Good vs Bad) • Example: path is Bad if at least one of its links is Bad (lossy) • More robust, but its outcome is of limited resolution

What happens in practice? • In practice, path measurements are always noisy, and they have to be short (due to non-stationarities) • So, two paths may go through the same bottleneck even if their path measurements are not exactly equal

Example • Lossy paths: P(1,4): 15%, P(2,4): 5%, P(3,4): 7% • Boolean tomography would infer that link (4,5) is the only lossy link (why?) • With a=0.5, paths P(2,4) and P(3,4) are a-similar • Then a more plausible solution is that link (4,5) has a loss rate in [5%, 7%], while link (1,4) has a loss rate in [8%, 10%]
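
The arithmetic behind this example can be reproduced with a short sketch. Several things here are assumptions, since the slide's figure and the definition of a-similarity are not in the transcript: two measurements are treated as a-similar when their difference is at most a times the larger one; the shared link's range is taken as the [min, max] of the a-similar path measurements; and loss rates are treated as additive along a path. The function names are hypothetical, and this is not the range tomography algorithm itself.

```python
def a_similar(m1, m2, a):
    """Assumed definition: the two measurements differ by at most a * max(m1, m2)."""
    return abs(m1 - m2) <= a * max(m1, m2)


def shared_link_range(measurements):
    """Range for a link shared by a-similar paths: [min, max] of their measurements."""
    return (min(measurements), max(measurements))


def residual_range(path_value, shared_range):
    """Range left for the rest of a path after subtracting the shared link (SUM metric)."""
    low, high = shared_range
    return (path_value - high, path_value - low)


# Loss rates in percent, taken from the example.
loss = {"P(1,4)": 15, "P(2,4)": 5, "P(3,4)": 7}
a = 0.5

similar = a_similar(loss["P(2,4)"], loss["P(3,4)"], a)          # 5% vs 7%
link_45 = shared_link_range([loss["P(2,4)"], loss["P(3,4)"]])   # shared link (4,5)
link_14 = residual_range(loss["P(1,4)"], link_45)               # remaining link (1,4)
```

Under these assumptions the 15% path is not a-similar to either of the other two, which is why it gets a second lossy link instead of being explained by link (4,5) alone.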

We propose: Range Tomography • For each link l, estimate a range [s_l, e_l].

We solved two instances of the range tomography problem • MIN function (e.g., avail-bw or capacity): the path metric is the minimum of its link metrics • SUM function (e.g., queueing delay): the path metric is the sum of its link metrics • The loss rate metric can be approximated by SUM if link loss rates are small and independent
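
The last point can be checked numerically. With independent per-link loss, a packet survives a path only if it survives every link, so the exact path loss rate is one minus the product of the per-link survival probabilities; for small rates this is close to the plain sum. The sketch below compares the two; the function names and the sample rates are illustrative assumptions.

```python
import math


def path_loss_exact(link_losses):
    """Path loss rate when each link drops packets independently."""
    return 1.0 - math.prod(1.0 - p for p in link_losses)


def path_loss_sum(link_losses):
    """SUM approximation: add the per-link loss rates."""
    return sum(link_losses)


small = [0.01, 0.02]   # 1% and 2% link loss
large = [0.30, 0.40]   # 30% and 40% link loss

small_error = abs(path_loss_sum(small) - path_loss_exact(small))
large_error = abs(path_loss_sum(large) - path_loss_exact(large))
```

For the 1% and 2% links the exact rate is about 0.0298 against a sum of 0.03, while for the 30% and 40% links the sum overshoots the exact rate by more than 0.1, which is why the approximation is restricted to small loss rates.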

The location of bad links (ESnet) • ESnet: 9 congestion events • 1 bad link localized for each • Up to 75 paths affected by an event • [Figure: bad link marked]

The location of bad links (Internet2) • Internet2: 266 congestion events in 22 days • 3 bad links: 1 case • 2 bad links: 6 cases • 1 bad link: the rest • A few bad links account for about 90% of events: ge-6-2-0.0.rtr.kans (58% of events), ge-1-2-0.0.rtr.chic (25%), xe-1-1-0.0.rtr.hous (6%) • [Figure: timeline of bad links, peaking around 7 March 2011]

A case with two bad links at the same time • Internet2: event with two bad links • 28 Feb 2011, 00:10:51 GMT • Localized bad links: ge-6-2-0.0-rtr.KANS, ge-6-1-0.0-rtr.LOSA • Predicted bad link performance (avg): 26 ms and 57 ms • [Figure: paths CHIC to LOSA, ATLA to KANS, HOUS to LOSA]

Diagnosis “it was due to insufficient router buffers”

Diagnosis: Approach • How can we go from a set of observed symptoms to the underlying root cause? • Most existing network problem diagnosis systems take a machine-learning approach, but that requires many training examples • Most existing diagnosis systems do not focus on network performance problems • Our focus: use a model-based approach to associate each root cause with an expected set of symptoms (a signature)

Which pathologies do we currently consider? • Various congestion types • Routing events and anomalies • Various loss-episode types • Reordering causes • Various end-host effects

Congestion types • “Overload”: persistent queue build-up • “Bursty traffic”: intermittent queues (high jitter) • Very small buffers • Excessive buffers • [Figure panels: Overload: ESnet; Bursty: PlanetLab; Excessive buffer: Home link; Bursty: Home link]

Loss nature • Random losses: observed losses do not have significant correlation with queueing delays of “nearby” packets • Otherwise: non-random losses • [Figure panels: Random losses: Home link; Non-random loss: ESnet]
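
One way to picture the random versus non-random distinction is to compare the queueing delays of packets adjacent to each loss against the typical delay over the whole trace. The sketch below does that with medians. It is an assumption-laden illustration, not Pythia's test: the slides do not say how "significant correlation" is measured, and the function name, the neighbourhood size k, and the ratio threshold are hypothetical.

```python
from statistics import median


def classify_losses(delays, k=2, ratio=2.0):
    """delays: per-packet queueing delay in ms, with None where the packet was lost.

    k (neighbourhood size) and ratio (threshold) are hypothetical parameters.
    """
    received = [d for d in delays if d is not None]
    lost_idx = [i for i, d in enumerate(delays) if d is None]
    if not lost_idx or not received:
        return "no losses"
    # Collect delays of received packets within k positions of each loss.
    nearby = []
    for i in lost_idx:
        for j in range(max(0, i - k), min(len(delays), i + k + 1)):
            if delays[j] is not None:
                nearby.append(delays[j])
    if not nearby:
        return "unknown"
    # Elevated delay around losses suggests the loss came from a full queue.
    if median(nearby) > ratio * median(received):
        return "non-random"
    return "random"


queue_overflow = [1, 1, 1, 1, 1, 1, 1, 1, 20, 30, None, 35, 25, 1, 1, 1, 1, 1, 1, 1]
isolated_drop = [1, 1, 1, None, 1, 1, 1, 1, 1, 1]
```

In the first trace the loss sits inside a delay spike and is classified as non-random; in the second the neighbours look like every other packet, so the loss is classified as random.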

End-host effects • Delays and losses induced by: context switches; clock synchronization (NTP); OS virtualization (e.g., PlanetLab) • [Figure panels: PlanetLab: end-host noise; Internet2: context switch]

Pythia Diagnosis Tree • Input: detected events (delay, loss, reordering) • [Figure: decision tree with branches for end-host effects, NTP vs. route events, reordering nature, loss events, and congestion; not shown: unknown type]

Diagnosis of ESnet events • About 700 paths • End-host events: 53% of total • Diagnosed network events: 1653 • No buffer-based congestion events • TBD: reordering and routing/clock-syncs

Pythia: Work in progress • Diagnose more performance problems and improve existing tests • Unsupervised clustering to identify unknown events • Open-source system implementation: detection, localization, diagnosis • Real-time data collection framework: ESnet, I2, PL-testbed, broadband networks • Create a front-end for users/operators

Q&A • For any additional questions and for related papers, please email me: constantine@gatech.edu
