Server-based Characterization and Inference of Internet Performance Venkat Padmanabhan Lili Qiu Helen Wang Microsoft Research UCLA/IPAM Workshop March 2002
Outline • Overview • Server-based characterization of performance • Server-based inference of performance • Passive Network Tomography • Summary and future work
Overview • Goals • characterize end-to-end performance • infer characteristics of interior links • Approach: server-based monitoring • passive monitoring relatively inexpensive • enables large-scale measurements • diversity of network paths
[Diagram: a Web server sending DATA to clients and receiving ACKs; the traffic is monitored passively at the server]
Research Questions • Server-based characterization of end-to-end performance • correlation with topological metrics • spatial locality • temporal stability • Server-based inference of internal link characteristics • identification of lossy links
Related Work • Server-based passive measurement • 1996 Olympics Web server study (Berkeley, 1997 & 1998) • characterization of TCP properties (Allman 2000) • Active measurement • NPD (Paxson 1997) • stationarity of Internet path properties (Zhang et al. 2001)
Experiment Setting • Packet sniffer at microsoft.com • 550 MHz Pentium III • sits on spanning port of Cisco Catalyst 6509 • packet drop rate < 0.3% • traces up to 2+ hours long, 20-125 million packets, 50-950K clients • Traceroute source • sits on a separate Microsoft network, but all external hops are shared • infrequent and in the background
Topological Metrics and Loss Rate • Topological distance is a poor predictor of packet loss rate • All links are not equal ⇒ need to identify the lossy links
Spatial Locality • Do clients in the same cluster see similar loss rates? • Loss rate is quantized into buckets • 0-0.5%, 0.5-2%, 2-5%, 5-10%, 10-20%, 20+% • suggested by Zhang et al. (IMW 2001) • Focus on lossy clusters • average loss rate > 5% • Spatial locality ⇒ there may be a shared cause for packet loss
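As a concrete illustration, here is a minimal Python sketch of the bucket quantization and the lossy-cluster test described above; the bucket edges come from this slide, while the function and variable names are illustrative.

```python
# Quantize a measured loss rate into the buckets used for the locality analysis:
# 0-0.5%, 0.5-2%, 2-5%, 5-10%, 10-20%, 20%+
import bisect

BUCKET_EDGES = [0.005, 0.02, 0.05, 0.10, 0.20]   # upper edges of the first five buckets

def loss_bucket(loss_rate):
    """Return the bucket index (0..5) for a loss rate expressed as a fraction."""
    return bisect.bisect_right(BUCKET_EDGES, loss_rate)

def lossy_clusters(cluster_to_losses, threshold=0.05):
    """Clusters whose average client loss rate exceeds 5% ('lossy clusters')."""
    return {c for c, losses in cluster_to_losses.items()
            if sum(losses) / len(losses) > threshold}
```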
Temporal Stability • Loss rate again quantized into buckets • Metric of interest: stability period (i.e., time until transition into new bucket) • Median stability period ≈ 10 minutes • Consistent with previous findings based on active measurements
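A small sketch, under the same assumed bucket boundaries, of how the stability period (time until the bucketed loss rate changes) could be computed from a time-ordered series of loss-rate samples:

```python
# Stability period = time until the loss rate transitions into a new bucket.
# `samples` is a time-ordered list of (timestamp_seconds, loss_rate) pairs.
import bisect, statistics

BUCKET_EDGES = [0.005, 0.02, 0.05, 0.10, 0.20]   # same buckets as before

def stability_periods(samples):
    periods = []
    start_t = samples[0][0]
    start_b = bisect.bisect_right(BUCKET_EDGES, samples[0][1])
    for t, rate in samples[1:]:
        b = bisect.bisect_right(BUCKET_EDGES, rate)
        if b != start_b:                  # transition into a new bucket
            periods.append(t - start_t)   # record the completed stable period
            start_t, start_b = t, b
    return periods

# The slide's metric is the median of these periods, e.g.:
#   statistics.median(stability_periods(samples))   # ~600 s in the traces studied
```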
Putting it all together • All links are not equal ⇒ need to identify the lossy links • Spatial locality of packet loss rate ⇒ lossy links may well be shared • Temporal stability ⇒ worthwhile to try and identify the lossy links
Passive Network Tomography • Goal: determine characteristics of internal network links using end-to-end, passive measurements • We focus on the link loss rate metric • primary goal: identifying lossy links • Why is this interesting? • locating trouble spots in the network • keeping tabs on your ISP • server placement and server selection
[Diagram: a Web server and a client connected through multiple ISPs (AT&T, Sprint, C&W, Earthlink, UUNET, AOL, Qwest); the server wonders "Why is it so slow?" while the client complains "Darn, it's slow!"]
Related Work • MINC (Caceres et al. 1999) • multicast-based active probing • Striped unicast (Duffield et al. 2001) • unicast-based active probing • Passive measurement (Coates et al. 2002) • look for back-to-back packets • Shared bottleneck detection • Padmanabhan 1999, Rubenstein et al. 2000, Katabi et al. 2001
Active Network Tomography • [Diagram: a source S and two receivers A and B, probed with multicast probes and with striped unicast probes]
Problem Formulation • Collapse linear chains into virtual links • Each client path gives one equation: (1-l1)*(1-l2)*(1-l4) = (1-p1), (1-l1)*(1-l2)*(1-l5) = (1-p2), …, (1-l1)*(1-l3)*(1-l8) = (1-p5) • Under-constrained system of equations • [Diagram: tree rooted at the server with link loss rates l1-l8; clients at the leaves observe path loss rates p1-p5]
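To make the formulation concrete, here is a small sketch using an illustrative three-client slice of the tree on this slide; each client path yields one equation, which becomes linear after taking logs:

```python
# Each client path j gives   prod_{i on path j} (1 - l_i) = (1 - p_j),
# i.e. after taking logs:    sum_i A[j][i] * L_i = P_j
# with L_i = log(1/(1-l_i)) and P_j = log(1/(1-p_j)).
import math

paths = {                       # client -> links on its path from the server (illustrative)
    "p1": ["l1", "l2", "l4"],
    "p2": ["l1", "l2", "l5"],
    "p5": ["l1", "l3", "l8"],
}
path_loss = {"p1": 0.03, "p2": 0.06, "p5": 0.10}   # hypothetical observed end-to-end loss

links = sorted({l for ls in paths.values() for l in ls})
A = [[1 if l in paths[c] else 0 for l in links] for c in paths]    # routing matrix
P = [math.log(1.0 / (1.0 - path_loss[c])) for c in paths]          # log-domain path losses

# 3 equations in 6 unknowns here: the system is under-constrained, which is why
# the inference techniques on the following slides are needed.
```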
#1: Random Sampling • Randomly sample the solution space • Repeat this several times • Draw conclusions based on overall statistics • How to do random sampling? • determine loss rate bound for each link using best downstream client • iterate over all links: • pick loss rate at random within bounds • update bounds for other links • Problem: little tolerance for estimation error • [Tree diagram as on the Problem Formulation slide]
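A minimal sketch (not the authors' code) of this random-sampling procedure, working in the log domain and reusing the paths/path_loss structures from the previous sketch:

```python
# Random sampling of the solution space: bound each link's loss by the least
# unexplained loss of any path through it, sample within the bound, update the
# remaining bounds, and repeat many times; links that look lossy in many
# iterations are reported as lossy.
import collections, math, random

def random_sampling(paths, path_loss, n_iters=500, lossy_thresh=0.05):
    links = sorted({l for ls in paths.values() for l in ls})
    votes = collections.Counter()
    for _ in range(n_iters):
        # Unexplained path loss in the log domain: P_j = log(1/(1-p_j))
        remaining = {c: math.log(1.0 / (1.0 - path_loss[c])) for c in paths}
        sample = {}
        for link in random.sample(links, len(links)):    # visit links in random order
            # A link cannot be lossier than the best (least lossy) path through it.
            bound = min(remaining[c] for c in paths if link in paths[c])
            sample[link] = random.uniform(0.0, bound)
            for c in paths:                              # shrink other links' bounds
                if link in paths[c]:
                    remaining[c] -= sample[link]
        for link, L in sample.items():
            if 1.0 - math.exp(-L) > lossy_thresh:        # convert back to a loss rate
                votes[link] += 1
    return votes      # links with many votes across iterations are flagged as lossy
```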
#2: Linear Optimization • Goals: • parsimonious explanation • robust to estimation error • Li = log(1/(1-li)), Pj = log(1/(1-pj)) • minimize Σi Li + Σj |Sj| subject to: L1+L2+L4 + S1 = P1, L1+L2+L5 + S2 = P2, …, L1+L3+L8 + S5 = P5, Li >= 0 • Can be turned into a linear program • [Tree diagram as on the Problem Formulation slide]
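A sketch of the linear-program reformulation using scipy (an assumed dependency); the absolute values |Sj| are handled by splitting each slack into a positive and a negative part:

```python
# minimize  sum_i L_i + sum_j |S_j|
# s.t.      sum_{i on path j} L_i + S_j = P_j,   L_i >= 0
# Split S_j = Sp_j - Sm_j with Sp_j, Sm_j >= 0 so the objective is linear.
import numpy as np
from scipy.optimize import linprog

def solve_lp(A, P):
    """A: (n_paths x n_links) 0/1 routing matrix; P: log-domain path loss vector."""
    n_paths, n_links = A.shape
    # Decision vector x = [L_1..L_n, Sp_1..Sp_m, Sm_1..Sm_m]
    c = np.ones(n_links + 2 * n_paths)
    A_eq = np.hstack([A, np.eye(n_paths), -np.eye(n_paths)])
    res = linprog(c, A_eq=A_eq, b_eq=P, bounds=[(0, None)] * len(c))
    L = res.x[:n_links]
    return 1.0 - np.exp(-L)        # inferred per-link loss rates
```

Links whose inferred loss rate exceeds a chosen threshold (e.g. 5%) would then be reported as lossy.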
#3: Bayesian Inference • Basics: • D: observed data • sj: # packets successfully sent to client j • fj: # packets that client j fails to receive • Θ: unknown model parameters • li: packet loss rate of link i • Goal: determine the posterior P(Θ|D) • inference is based on loss events, not loss rates • Bayes theorem: P(Θ|D) = P(D|Θ)P(Θ)/∫P(D|Θ)P(Θ)dΘ • hard to compute since Θ is multidimensional • [Tree diagram with link loss rates l1-l8; each client j observes (sj, fj)]
Gibbs Sampling • Markov Chain Monte Carlo (MCMC) • construct a Markov chain whose stationary distribution is P(Θ|D) • Gibbs Sampling: defines the transition kernel • start with an arbitrary initial assignment of li • consider each link i in turn • compute P(li|D) assuming lj is fixed for j≠i • draw sample from P(li|D) and update li • after burn-in period, we obtain samples from the posterior P(Θ|D)
Gibbs Sampling Algorithm 1) Initialize link loss rates arbitrarily 2) For j = 1 : burn-in • for each link i, compute P(li|D, {li'}) and sample a new li from it, where li is the loss rate of link i and {li'} = {lj : j ≠ i} 3) For j = 1 : realSamples • for each link i, compute P(li|D, {li'}) and sample a new li from it • Use all the samples obtained at step 3 to approximate P(Θ|D)
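A minimal Gibbs-sampling sketch under a uniform prior on each link loss rate; the conditional P(li|D, {li'}) is approximated by discretizing li on a grid, and the data structures (paths, per-client (sj, fj) counts) are illustrative:

```python
# Gibbs sampling over link loss rates.  paths: client -> links on its path;
# data: client -> (s_j, f_j) = packets received / lost by client j.
import math, random

GRID = [g / 200 for g in range(101)]          # candidate loss rates 0.0 .. 0.5

def gibbs(paths, data, burn_in=200, n_samples=500):
    links = sorted({l for ls in paths.values() for l in ls})
    rates = {l: random.uniform(0.0, 0.1) for l in links}   # arbitrary initial assignment
    samples = {l: [] for l in links}
    for it in range(burn_in + n_samples):
        for link in links:                    # resample each link in turn
            logp = []
            for cand in GRID:
                rates[link] = cand
                ll = 0.0
                for c, (s, f) in data.items():
                    if link not in paths[c]:
                        continue              # clients not behind this link cancel out
                    p_ok = math.prod(1.0 - rates[k] for k in paths[c])
                    p_ok = min(max(p_ok, 1e-12), 1.0 - 1e-12)
                    ll += s * math.log(p_ok) + f * math.log(1.0 - p_ok)
                logp.append(ll)
            m = max(logp)
            weights = [math.exp(v - m) for v in logp]      # normalize in log space
            rates[link] = random.choices(GRID, weights=weights)[0]
            if it >= burn_in:
                samples[link].append(rates[link])
    return samples        # approximate posterior marginals of each link's loss rate
```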
Experimental Evaluation • Simulation experiments • Internet traffic traces
Simulation Experiments • Advantage: no uncertainty about link loss rate • Methodology • Topologies used: • randomly-generated: 20 - 3000 nodes, max degree = 5-50 • real topology obtained by tracing paths to microsoft.com clients • randomly-generated packet loss events at each link • a fraction f of the links are good, and the rest are “bad” • LM1: good links: 0 – 1%, bad links: 5 – 10% • LM2: good links: 0 – 1%, bad links: 1 – 100% • Goodness metrics: • Coverage: # correctly inferred lossy links • False positives: # incorrectly inferred lossy links
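A sketch of the two loss models and the two goodness metrics, with illustrative function names (f is the fraction of good links from the slide):

```python
# LM1: good links lose 0-1% of packets, bad links 5-10%.
# LM2: good links 0-1%, bad links 1-100%.
import random

def assign_loss_rates(links, f=0.95, model="LM1"):
    rates, true_lossy = {}, set()
    for l in links:
        if random.random() < f:                      # a fraction f of links are good
            rates[l] = random.uniform(0.0, 0.01)
        else:
            lo, hi = (0.05, 0.10) if model == "LM1" else (0.01, 1.0)
            rates[l] = random.uniform(lo, hi)
            true_lossy.add(l)                        # ground-truth lossy link
    return rates, true_lossy

def goodness(inferred_lossy, true_lossy):
    coverage = len(inferred_lossy & true_lossy)      # correctly inferred lossy links
    false_positives = len(inferred_lossy - true_lossy)
    return coverage, false_positives
```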
Simulation Results High confidence in top few inferences
Internet Traffic Traces • Challenge: validation • Divide client traces into two: tomography set and validation set • Tomography data set => loss inference • Validation set => check if clients downstream of the inferred lossy links experience high loss • Results • false positive rate is between 5 – 30% • likely candidates for lossy links: • links crossing an inter-AS boundary • links having a large delay (e.g. transcontinental links) • links that terminate at clients • example lossy links: • San Francisco (AT&T) → Indonesia (Indo.net) • Sprint → PacBell in California • Moscow → Tyumen, Siberia (Sovam Teleport)
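A sketch of the validation step described above, assuming the paths structure from the earlier sketches and per-client loss rates measured on the validation set; the 5% threshold and the 50/50 split are illustrative:

```python
# Split clients into a tomography set and a validation set, then count an
# inferred lossy link as a false positive if no validation client downstream
# of it actually experiences high loss.
import random

def split_clients(clients, frac=0.5, seed=0):
    shuffled = list(clients)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]            # (tomography set, validation set)

def false_positive_rate(inferred_lossy, paths, val_loss, thresh=0.05):
    """val_loss: validation client -> measured end-to-end loss rate."""
    false_pos = 0
    for link in inferred_lossy:
        downstream = [c for c in val_loss if link in paths[c]]
        if downstream and max(val_loss[c] for c in downstream) <= thresh:
            false_pos += 1
    return false_pos / len(inferred_lossy) if inferred_lossy else 0.0
```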
Summary • Poor correlation between topological metrics & performance • Significant spatial locality and temporal stability • Passive network tomography is feasible • Tradeoff between computational cost and accuracy • Future directions • real-time inference • selective active probing • Acknowledgements: • MSR: Dimitris Achlioptas, Christian Borgs, Jennifer Chayes, David Heckerman, Chris Meek, David Wilson • Infrastructure: Rob Emanuel, Scott Hogan http://www.research.microsoft.com/~padmanab