Correlations in E2E Network Metrics: Impact on Large Scale Network Monitoring

Praveen Yalagandula, Sung-Ju Lee, Puneet Sharma, Sujata Banerjee • HP Labs, Palo Alto • http://networking.hpl.hp.com

Motivation • Large scale E2E network monitoring • Application management, Flow control, Fault Diagnosis, etc. • A key question: At what granularity should we measure? • Coarse-grained: lower cost but higher inaccuracy • Fine-grained: lower inaccuracy but higher cost • Observation: Heterogeneity in measurement costs • PING < TRACEROUTE < PATHRATE • Our investigation • Are different E2E network metrics correlated? • Can we leverage such dependencies (if any) to lower monitoring cost while maintaining high accuracy?

Our Approach • We consider two correlations in the current work • Changes in Hop and Latency → Changes in Route • Changes in Route → Changes in Capacity • We use data from the S3 deployment on PlanetLab • ~2 years of data • E2E measurements: Traceroute and Pathrate (capacity) • On thousands of paths • Perform Cost vs. Accuracy analysis for two cases • Base: Only higher-cost measurements are performed • Strategy: • Perform lower-cost measurements • If a change is detected, perform higher-cost measurements

State-of-the-art • Correlations assumed by previous systems • GNP, Vivaldi, and other coordinate-based systems • Correlation in latencies across paths • NetQuest • Correlation between hop changes and route changes • CoDeeN • Correlation between route changes and capacity • Our work • Quantify the correlations • Perform accuracy vs. cost tradeoff analysis

Outline • Motivation: Quantify & leverage metric correlations • S3: Scalable Sensing Service • Deployment on PlanetLab • Correlations: • Changes in Hop and Latency → Changes in Route • Changes in Route → Changes in Capacity • Cost-Accuracy Tradeoff Analysis • Summary and Future work

S3: Architecture • Sensor pods • Collection of sensors • Measure system state from a node's view • Backplane • Programmable fabric • Connects pods and aggregates measured system state • Inference Engines • Infer information about O(n^2) E2E paths by measuring a few paths • Schedule measurements on pods • Aggregate data on the backplane • Applications

Sensor Pod • [Figure: sensor pod block diagram. Labels: Configuration & Data Repository; SNMP Agent; Controller; Secure Web Interface; API: query, control, and notification; sensors: Load, Memory, Capacity, Lossrate, Bandwidth, Latency]

S3 Deployment on PlanetLab • Running since January 2006 • All-pair network metrics • Latency: Inferred by Netvigator • Lossrate: Measured using the Tulip lossrate tool • Available Bandwidth: Measured using Spruce and PathChirp • Capacity: Measured using Pathrate • Stats: ~14 GB raw data every day, ~1 GB compressed

Two correlations quantified • Changes in hop and latency → changes in route (HL → R)? • PING can be used to measure both hops and latency • Original TTL - Remaining TTL value = Number of hops • A change in the number of hops always means a change in the route • But does a change in the route → a change in the number of hops? • Obviously NO; but how often does this happen, and how does it affect monitoring accuracy? • Changes in route → changes in capacity (R → C)? • Capacity can change when the route has not changed • Cap limits • Especially in PlanetLab • Becoming common in other networks: e.g., Cable networks • Same route, but a link was upgraded or a link-level change is not visible in the IP route • Question: • How often does this happen, and how does it affect monitoring accuracy?
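
The TTL rule on this slide can be sketched as follows. The receiver of a PING reply does not see the original TTL directly, so this sketch assumes a common heuristic that is not stated on the slide: the original TTL is taken to be the smallest typical OS default (64, 128, or 255) that is not below the remaining TTL. The function names are hypothetical. Note that the TTL in an echo reply counts hops on the reverse path.

```python
# Typical initial TTL values used by common operating systems (assumption,
# not from the slides).
COMMON_INITIAL_TTLS = (64, 128, 255)


def infer_initial_ttl(remaining_ttl):
    """Guess the sender's original TTL from the TTL seen in the reply."""
    for candidate in COMMON_INITIAL_TTLS:
        if remaining_ttl <= candidate:
            return candidate
    raise ValueError("TTL must be at most 255")


def hops_from_ttl(remaining_ttl, initial_ttl=None):
    """Number of hops = original TTL - remaining TTL."""
    if initial_ttl is None:
        initial_ttl = infer_initial_ttl(remaining_ttl)
    if not 0 <= remaining_ttl <= initial_ttl:
        raise ValueError("remaining TTL must lie in [0, initial TTL]")
    return initial_ttl - remaining_ttl
```

For example, a reply arriving with TTL 52 would be read as 12 hops under the assumed default of 64. The guess is wrong whenever a host uses a non-standard initial TTL, which is why the explicit `initial_ttl` argument is provided.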

S3 Dataset • HL → R • Use Traceroute measurements • Performed at each node to 20 landmark nodes • Landmark nodes (20) chosen across the globe • Performed once every 30 minutes • R → C • Use Traceroute and Pathrate measurements • Each node performs Pathrate to all other nodes • In a round-robin fashion • Takes about a day (avg.) to complete a round of measurements • We use Pathrate measurements iff (0 < COV < 1)

Defining metric changes • Route changes (R) • R=1: If the current route does not match the previous sample • Else R=0 • Sometimes routers do not respond: '*' in output • We ignore those hops during the above route change detection • Latency changes (L) • L=1: If the current latency differs from the previous sample by p% or more • Else L=0 • We use p=5% for this analysis • Hop changes (H) • H=1: If the current number of hops does not match the previous sample • Else H=0
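
The three indicators defined above can be sketched as small functions. Two details are assumptions, since the slide does not spell them out: routes are compared position by position, a position is skipped when either sample shows '*', and routes of different lengths count as changed; the latency difference is measured relative to the previous sample. All function names are hypothetical.

```python
def route_changed(prev_route, curr_route):
    """R: 1 if the routes differ, ignoring hops where a router did not respond."""
    if len(prev_route) != len(curr_route):
        return 1
    for prev_hop, curr_hop in zip(prev_route, curr_route):
        if prev_hop == "*" or curr_hop == "*":
            continue  # unresponsive router: skip this position
        if prev_hop != curr_hop:
            return 1
    return 0


def latency_changed(prev_ms, curr_ms, p=5.0):
    """L: 1 if latency differs from the previous sample by p% or more."""
    return int(abs(curr_ms - prev_ms) * 100 >= p * prev_ms)


def hops_changed(prev_hops, curr_hops):
    """H: 1 if the hop count differs from the previous sample."""
    return int(prev_hops != curr_hops)
```

With these definitions, `route_changed(["a", "b", "c"], ["a", "*", "c"])` is 0 because the only differing position is an unresponsive hop.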

Case counts • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route • Measurements where the route changed but hops did not change → If we use changes in hops to detect route changes, we will miss these

Case counts • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route • Measurements where the route changed but neither hops nor latency changed → If we use changes in hops and/or latency to detect route changes, we will miss these

Case counts • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route • Overall, these two numbers are small → changes in hop and latency can be a good indicator of changes in route
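
The case counting on these slides can be sketched for a single path as below. Each sample is an (H, L, R) triple of 0/1 indicators; the deck averages the resulting fractions across all paths, which this sketch leaves out. The function names and the sample data are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter


def case_fractions(indicators):
    """Fraction of samples falling into each observed (H, L, R) combination."""
    counts = Counter(indicators)
    total = len(indicators)
    return {case: n / total for case, n in counts.items()}


def missed_fraction(indicators, use_latency):
    """Fraction of samples where the route changed but the cheap signal was silent."""
    missed = 0
    for h, l, r in indicators:
        signalled = h == 1 or (use_latency and l == 1)
        if r == 1 and not signalled:
            missed += 1
    return missed / len(indicators)


# Made-up indicator triples for one path (illustrative only).
path_samples = [
    (0, 0, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1),
    (0, 0, 0), (0, 1, 0), (1, 0, 1), (0, 0, 0),
]
hop_only_miss = missed_fraction(path_samples, use_latency=False)
hop_lat_miss = missed_fraction(path_samples, use_latency=True)
```

Here `hop_only_miss` corresponds to the "route changed but hops did not" case and `hop_lat_miss` to the "route changed but neither hops nor latency changed" case.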

Cost-Accuracy Tradeoff • What if we perform only PING and then perform Traceroute only when a hop or latency change is observed? • Reduces cost: PING is relatively inexpensive • Increases inaccuracy: Might miss some route changes • Base method: Traceroutes every T seconds • Strategy: • Perform Traceroutes every s·T seconds • We refer to s as the sampling factor • Perform PING every t seconds when a Traceroute is not performed • Further, perform a Traceroute if a change in hop/latency is observed
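
The strategy above can be replayed over a recorded trace of one path to estimate its cost and inaccuracy. This is a minimal sketch under several assumptions that are not in the slides: the trace has one sample per base period T, pings run at the same sample points, the per-measurement costs `ping_cost` and `traceroute_cost` are made-up units, cost is reported relative to the base method, and inaccuracy is the fraction of samples at which the last route the monitor recorded differs from the true route.

```python
def simulate_strategy(samples, s, use_latency=True, p=5.0,
                      ping_cost=1.0, traceroute_cost=10.0):
    """Replay one path's trace; return (relative_cost, inaccuracy).

    samples: list of (hops, latency_ms, route), one per base period T.
    s: sampling factor; a traceroute is scheduled on every s-th sample.
    """
    known_route = None
    last_hops = last_latency = None
    cost = 0.0
    stale = 0
    for i, (hops, latency, route) in enumerate(samples):
        do_traceroute = i % s == 0  # scheduled traceroute
        if not do_traceroute:
            cost += ping_cost  # cheap probe between scheduled traceroutes
            hop_change = hops != last_hops
            lat_change = (use_latency and
                          abs(latency - last_latency) * 100 >= p * last_latency)
            do_traceroute = hop_change or lat_change  # triggered traceroute
        if do_traceroute:
            cost += traceroute_cost
            known_route = route
        last_hops, last_latency = hops, latency
        if known_route != route:
            stale += 1  # monitor's view of the route is out of date
    base_cost = traceroute_cost * len(samples)
    return cost / base_cost, stale / len(samples)
```

With s=1 the sketch reduces to the base method (relative cost 1, inaccuracy 0). Larger s lowers cost, and the remaining inaccuracy comes only from route changes that alter neither hop count nor latency.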

Cost-Accuracy Tradeoff • [Plots not recoverable; values annotated on the plots: 0.25, 0.08, 0.33, 0.12] • Plain case: Cost decreases with reduced sampling • Plain case: Inaccuracy increases with reduced sampling. If the wrong frequency is chosen, we can have very high inaccuracy! • Hop & Hop-Lat strategies: Bounded inaccuracies even when traceroutes are performed only when changes are detected with pings

Defining capacity changes for a path • Pathrate gives an estimate of capacity (with some error) • Link-Mapping based change detection • Map the result of the Pathrate measurement to one of several known link types • C=1: If the current link type is different from the previous link type • Percent-Change • C=1: If the current value differs from the previous value by p% or more • We use p=10% for our analysis
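
The two capacity-change definitions can be sketched as below. The slide does not list the known link types or the mapping rule, so both are assumptions here: the table holds a few standard nominal link rates, and an estimate is mapped to the link type whose nominal rate is closest on a logarithmic scale. All names are hypothetical.

```python
import math

# Nominal rates in Mbps for a few standard link types (illustrative table;
# the link types used in the original analysis are not given in the slides).
LINK_TYPES_MBPS = {
    "T1": 1.544,
    "Ethernet": 10.0,
    "T3": 44.736,
    "FastEthernet": 100.0,
    "OC-3": 155.52,
    "GigabitEthernet": 1000.0,
}


def map_to_link_type(capacity_mbps):
    """Return the link type whose nominal rate is nearest in log scale."""
    return min(
        LINK_TYPES_MBPS,
        key=lambda name: abs(math.log(capacity_mbps / LINK_TYPES_MBPS[name])),
    )


def capacity_changed_link_mapping(prev_mbps, curr_mbps):
    """C: 1 if the mapped link type differs from the previous sample's."""
    return int(map_to_link_type(prev_mbps) != map_to_link_type(curr_mbps))


def capacity_changed_percent(prev_mbps, curr_mbps, p=10.0):
    """C: 1 if capacity differs from the previous value by p% or more."""
    return int(abs(curr_mbps - prev_mbps) * 100 >= p * prev_mbps)
```

Link mapping absorbs estimation noise around a nominal rate (97 Mbps and 92 Mbps both map to FastEthernet), whereas the percent-change rule reacts to any sufficiently large relative difference regardless of link type.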

Case counts • Averaged across all paths • C: Change in Capacity; R: Change in Route • R & C take the same value in only 63% and 58% of cases → Modest positive correlation

Cost-Accuracy Tradeoff • Link-Mapping

Cost-Accuracy Tradeoff • Percent-Change

Conclusions and Next Steps • Methodology for correlation quantification • Case counting • Cost-Accuracy tradeoff analysis • Hop & Latency changes → Route changes • Route changes → Capacity changes • Promising results in both cases • Low-cost measurements can be used to trigger high-cost measurements • Further steps • Other correlations: Capacity and Available Bandwidth correlation • Application-level inaccuracy, i.e., impact on E2E apps

Ongoing work http://networking.hpl.hp.com/s-cube Email: s-cube@hpl.hp.com
