This paper investigates the correlations between different E2E network metrics and explores the possibility of leveraging these dependencies to lower monitoring costs while maintaining high accuracies. The study analyzes changes in hop, latency, route, and capacity to quantify the correlation and performs a cost vs. accuracy tradeoff analysis. The findings provide insights into optimizing network monitoring strategies.
Correlations in E2E Network Metrics: Impact on Large Scale Network Monitoring Praveen Yalagandula Sung-Ju Lee Puneet Sharma Sujata Banerjee HP Labs, Palo Alto http://networking.hpl.hp.com
Motivation • Large scale E2E network monitoring • Application management, Flow control, Fault Diagnosis, etc. • A key question: What granularity should we measure? • Coarse-grained: lower cost but higher inaccuracy • Fine-grained: lower inaccuracy but higher cost • Observation: Heterogeneity in measurement costs • PING < TRACEROUTE < PATHRATE • Our investigation • Are different E2E network metrics correlated? • Can we leverage such dependencies (if any) to • Lower monitoring cost while maintaining high accuracies?
Our Approach • We consider two correlations in the current work • Changes in Hop and Latency ↔ Changes in Route • Changes in Route ↔ Changes in Capacity • We use data from the S3 deployment on PlanetLab • ~2 years of data • E2E measurements: Traceroute and Pathrate (capacity) • On thousands of paths • Perform Cost vs. Accuracy analysis for two cases • Base: Only higher cost measurements are performed • Strategy: • Perform lower cost measurements • If a change is detected, perform higher cost measurements
State-of-the-art • Correlations assumed by previous systems • GNP, Vivaldi, and other coordinate-based systems • Correlation in latencies across paths • NetQuest • Correlation between hop changes and route changes • CoDeeN • Correlation between route changes and capacity • Our work • Quantify the correlation • Perform accuracy vs. cost tradeoff analysis
Outline • Motivation: Quantify & leverage metric correlations • S3: Scalable Sensing Service • Deployment on PlanetLab • Correlations: • Changes in Hop and Latency ↔ Changes in Route • Changes in Route ↔ Changes in Capacity • Cost-Accuracy Tradeoff Analysis • Summary and Future work
S3: Architecture • Sensor pods • Collection of sensors • Measure system state from a node’s view • Backplane • Programmable fabric • Connects pods and aggregates measured system state • Inference Engines • Infer information for O(n²) E2E paths by measuring a few paths • Schedule measurements on pods • Aggregate data on the backplane • Applications
Sensor Pod • [Diagram] Components: Configuration & Data Repository, SNMP Agent, Controller, Secure Web Interface • API: query, control, and notification • Sensors: Load, Memory, Capacity, Lossrate, Bandwidth, Latency
S3 Deployment on PlanetLab • Running since January 2006 • All-pair network metrics • Latency: Inferred by Netvigator • Lossrate: Measured using Tulip lossrate tool • Available Bandwidth: Measured using Spruce and PathChirp • Capacity: Measured using Pathrate • Stats: ~14 GB raw data every day, ~1 GB compressed
Two correlations quantified • Changes in hop and latency ↔ changes in route (HLR)? • PING can be used to measure both hops and latency • Original TTL - Remaining TTL value = Number of hops • A change in the number of hops always means a change in the route • But does a change in the route imply a change in the number of hops? • Obviously NO; but how often does this happen, and how does it affect monitoring accuracy? • Changes in route ↔ changes in capacity (RC)? • Capacity can change when the route has not changed • CAP limits • Especially in PlanetLab • Becoming common in other networks: e.g., cable networks • Same route, but a link was upgraded or a link-level change is not visible in the IP route • Question: • How often does this happen, and how does it affect monitoring accuracy?
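The TTL arithmetic on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the S3 implementation: the function name `hops_from_ttl` and the list of common initial TTL values (64, 128, 255) are assumptions, used only to guess the original TTL when the sender's value is not known.

```python
# Assumption: senders start from one of these common initial TTL values.
COMMON_INITIAL_TTLS = (64, 128, 255)


def hops_from_ttl(remaining_ttl, original_ttl=None):
    """Return the hop count as original TTL minus remaining TTL."""
    if not 0 <= remaining_ttl <= 255:
        raise ValueError("remaining TTL must be between 0 and 255")
    if original_ttl is None:
        # Guess: the smallest common initial TTL that is >= the observed value.
        original_ttl = min(t for t in COMMON_INITIAL_TTLS if t >= remaining_ttl)
    if remaining_ttl > original_ttl:
        raise ValueError("remaining TTL cannot exceed the original TTL")
    return original_ttl - remaining_ttl
```

For example, a reply arriving with TTL 52 would be read as 12 hops under the guess that the sender started at 64. If the guess is wrong, the absolute hop count is off, but a change in the count between two samples is still visible as long as the sender's initial TTL stays the same.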
S3 Dataset • HL ↔ R • Use Traceroute measurements • Performed at each node to 20 landmark nodes • Landmark nodes (20) chosen across the globe • Performed once every 30 minutes • R ↔ C • Use Traceroute and Pathrate measurements • Each node performs Pathrate to all other nodes • In a round-robin fashion • Takes about a day (avg.) to complete a round of measurements • We use Pathrate measurements iff (0 < COV < 1)
Defining metric changes • Route changes (R) • R=1: If the current route does not match the previous sample • Else R=0 • Sometimes routers do not respond: ‘*’ in output • We ignore those hops during the above route change detection • Latency changes (L) • L=1: If the current latency differs from the previous sample by p% or more • Else L=0 • We use p=5% for this analysis • Hop changes (H) • H=1: If the current number of hops does not match the previous sample • H=0: otherwise
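The three change indicators defined on this slide could be written as follows. This is a minimal sketch under stated assumptions: routes are lists of hop addresses, the function names are hypothetical, and "ignoring" a non-responding hop is interpreted as skipping any position where either sample shows `*`. The slides do not spell out that last detail.

```python
def route_changed(prev_route, curr_route):
    """R: 1 if the routes differ, skipping hops where either sample is '*'."""
    if len(prev_route) != len(curr_route):
        return 1  # a different hop count always means a different route
    for prev_hop, curr_hop in zip(prev_route, curr_route):
        if prev_hop == "*" or curr_hop == "*":
            continue  # assumption: unanswered hops are not compared
        if prev_hop != curr_hop:
            return 1
    return 0


def latency_changed(prev_ms, curr_ms, p=5.0):
    """L: 1 if the latency differs from the previous sample by p% or more."""
    return int(abs(curr_ms - prev_ms) * 100 >= p * prev_ms)


def hop_changed(prev_hops, curr_hops):
    """H: 1 if the hop count differs from the previous sample."""
    return int(prev_hops != curr_hops)
```

With p=5, a move from 100 ms to 105 ms counts as a latency change, while a move from 100 ms to 104 ms does not.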
Case counts • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route • Highlighted: measurements where the route changed but hops did not change • If we use changes in hops to detect route changes, we will miss these
Case counts • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route • Highlighted: measurements where the route changed but neither hops nor latency changed • If we use changes in hops and/or latency to detect route changes, we will miss these
Case counts • Averaged across all paths • H: Change in hops; L: Change in Latency; R: Change in route • Overall, these two numbers are small, so changes in hop and latency can be a good indicator of changes in route
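The case-counting method can be illustrated with a short Python sketch. It assumes each measurement on a path has already been reduced to a tuple of (H, L, R) flags. The function names are hypothetical, and the sketch handles a single path, whereas the slides report averages across all paths.

```python
from collections import Counter


def case_counts(samples):
    """samples: iterable of (H, L, R) flags. Returns the fraction per case."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {case: n / total for case, n in counts.items()}


def missed_route_changes(fractions, use_latency=True):
    """Fraction of samples where R=1 but the low-cost signal stays silent."""
    missed = 0.0
    for (h, l, r), frac in fractions.items():
        flagged = bool(h) or (use_latency and bool(l))
        if r and not flagged:
            missed += frac
    return missed
```

Calling `missed_route_changes(fractions, use_latency=False)` gives the share of route changes that hop changes alone would miss. The default adds latency changes as a second trigger, which can only lower that share.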
Cost-Accuracy Tradeoff • What if we perform only PING and then perform Traceroute only when a hop or latency change is observed? • Reduces cost: PING is relatively inexpensive • Increases inaccuracy: Might miss some route changes • Base method: Traceroutes every T seconds • Strategy: • Perform Traceroutes every s·T seconds • We refer to s as the sampling factor • Perform PING every t seconds when a Traceroute is not performed • Further, perform a Traceroute if a change in hop/latency is observed
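The strategy on this slide can be sketched as a small replay over recorded samples. This is a simplified illustration, not the authors' evaluation code. It assumes one observation per base period (t = T), hypothetical unit costs of 1 for a PING and 10 for a Traceroute, and measures inaccuracy as the fraction of slots in which the last traced route is stale.

```python
def simulate(samples, s, ping_cost=1.0, traceroute_cost=10.0, p=5.0):
    """Replay samples of (hops, latency_ms, route), one per base period T.

    Returns (cost relative to the base method, fraction of stale slots).
    The unit costs are illustrative assumptions, not measured values.
    """
    known_route = None
    prev_hops = prev_lat = None
    cost = 0.0
    stale = 0
    for i, (hops, lat, route) in enumerate(samples):
        do_trace = i % s == 0  # scheduled Traceroute every s*T seconds
        if not do_trace:
            cost += ping_cost
            hop_change = hops != prev_hops
            lat_change = abs(lat - prev_lat) * 100 >= p * prev_lat
            do_trace = hop_change or lat_change  # triggered Traceroute
        if do_trace:
            cost += traceroute_cost
            known_route = route
        prev_hops, prev_lat = hops, lat
        if known_route != route:
            stale += 1
    base_cost = traceroute_cost * len(samples)
    return cost / base_cost, stale / len(samples)
```

With s=1 the replay reduces to the base method: relative cost 1.0 and no stale slots. Larger values of s trade cost for the risk of missing route changes that leave both hop count and latency unchanged.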
Cost-Accuracy Tradeoff • [Plot; annotated values: 0.25, 0.08, 0.33, 0.12] • Plain case: Cost decreases with reduced sampling • Plain case: Inaccuracy increases with reduced sampling. If the wrong frequency is chosen, inaccuracy can be very high! • Hop & Hop-Lat strategies: Inaccuracy stays bounded even when traceroutes are performed only when pings detect a change
Defining capacity changes for a path • Pathrate gives an estimate of capacity (with some error) • Link-Mapping based change detection • Map the result of each Pathrate measurement to one of several known link types • C=1: If the current link type is different from the previous link type • Percent-Change • C=1: If the current value differs from the previous value by p% or more • We use p=10% for our analysis
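Both capacity-change definitions can be sketched in Python. The link-type table below is hypothetical (the slides do not list the types actually used), and snapping an estimate to the nearest type on a log scale is likewise an assumption made for illustration.

```python
import math

# Hypothetical link capacities in Mbps; the slides do not list the real table.
LINK_TYPES_MBPS = (1.5, 10.0, 45.0, 100.0, 155.0, 622.0, 1000.0)


def map_to_link_type(capacity_mbps):
    """Snap a positive Pathrate estimate to the nearest link type (log scale)."""
    return min(LINK_TYPES_MBPS, key=lambda t: abs(math.log(capacity_mbps / t)))


def capacity_changed_link(prev_mbps, curr_mbps):
    """Link-Mapping: C=1 if the mapped link type differs from the previous one."""
    return int(map_to_link_type(prev_mbps) != map_to_link_type(curr_mbps))


def capacity_changed_percent(prev_mbps, curr_mbps, p=10.0):
    """Percent-Change: C=1 if the value differs from the previous by p% or more."""
    return int(abs(curr_mbps - prev_mbps) * 100 >= p * prev_mbps)
```

Estimates of 96 Mbps and 104 Mbps both map to the 100 Mbps type and differ by less than 10%, so neither definition reports a change. In general, link mapping absorbs estimation noise around a known link type, while percent-change also reacts to shifts that stay within one type.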
Case counts • Averaged across all paths • C: Change in Capacity; R: Change in Route • R & C take the same value in only 63% and 58% of cases • Modest positive correlation
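The agreement figure quoted on this slide is the share of samples in which R and C take the same value. A minimal sketch, assuming each sample is a pair of (R, C) flags and using a hypothetical function name:

```python
def agreement(pairs):
    """pairs: iterable of (R, C) flags. Returns the fraction where R == C."""
    pairs = list(pairs)
    return sum(1 for r, c in pairs if r == c) / len(pairs)
```

For instance, `agreement([(0, 0), (1, 1), (1, 0), (0, 1), (0, 0)])` counts three matching samples out of five.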
Cost-Accuracy Tradeoff • Link-Mapping
Cost-Accuracy Tradeoff • Percent-Change
Conclusions and Next Steps • Methodology for correlation quantification • Case counting • Cost-Accuracy tradeoff analysis • Hop & Latency changes ↔ Route changes • Route changes ↔ Capacity changes • Promising results in both cases • Low cost measurements can be used to trigger high cost measurements • Further steps • Other correlations: Capacity and Available Bandwidth correlation • Application-level inaccuracy, i.e., impact on E2E apps
Ongoing work http://networking.hpl.hp.com/s-cube Email: s-cube@hpl.hp.com