Explore forecasting & anomaly detection techniques for network optimization, using active BW measurements & Netflow analysis. Learn about ABwE, Pathchirp, and implementing K-S algorithm. Discover uses for Grid Middleware and improving end-to-end performance monitoring.
Forecasting Network Performance Les Cottrell, Grid Performance Workshop, Edinburgh, June 22-34, 2005 http://www.slac.stanford.edu/grp/scs/net/talk05/predict-edinburgh05.ppt Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)
Outline • Why do we want forecasting & anomaly detection? • What are we using for the input data? • And what are the problems? • How do we make forecasts and detect anomalies? • First approaches • The real world • Results • Conclusions & Futures • Possible uses
Uses of Techniques • Automated problem identification: • Alerts for network administrators, e.g. • Bandwidth changes in time-series, iperf, SNMP • Alerts for systems people • OS/Host metrics • Anomalies for security • Forecasts (are a fallout of the techniques) for Grid Middleware, e.g. replica manager, data placement
Using Active IEPM-BW measurements • Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. HEP tiered model • Makes regular measurements • Ping (RTT, connectivity), traceroute • pathchirp, ABwE (packet pair dispersion) • iperf (single & multi-stream), thrulay • bbftp (file transfer application) • Looking at GridFTP, but it is complex, requiring certificates to be renewed • Lots of analysis and visualization • Running at CERN, SLAC, FNAL, BNL, Caltech to about 40 remote sites • http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
ABwE/abing • Uses packet pair dispersion of 20 packets to provide: • Capacity, cross-traffic (X-traffic), available bandwidth • At 3 minute intervals • Very noisy time series data [Figure: capacity time series with 1 hour moving average]
Pathchirp/Rice/INCITE • From PAM paper, pathchirp is more accurate but: • Takes ten times as long (10s vs 1s) • Generates more network traffic (~factor of 10) • Pathload needs a further factor of 10 more • IEPM-BW now supports both
BUT… • Packet pair dispersion relies on accurate timing of inter-packet separation • At > 1Gbps this is getting beyond the resolution of Unix clocks • AND 10GE NICs are offloading functions • Coalescing interrupts, Large Send & Receive Offload, TCP Offload Engine (TOE) • Need to work with TOE vendors • Turn off offload • Do timing in NICs
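To see why timing resolution becomes the limit, the back-to-back separation of two packets on a link of capacity C can be worked out directly. The 1500-byte packet size S is an assumption chosen for illustration, not a figure from the talk:

```latex
\Delta t = \frac{8 S}{C}, \qquad
S = 1500\ \text{bytes}: \quad
\Delta t_{1\,\text{Gbit/s}} = \frac{12000\ \text{bits}}{10^{9}\ \text{bits/s}} = 12\ \mu\text{s}, \qquad
\Delta t_{10\,\text{Gbit/s}} = \frac{12000\ \text{bits}}{10^{10}\ \text{bits/s}} = 1.2\ \mu\text{s}
```

Under that assumption, a receiver must resolve separations of roughly a microsecond at 10 Gbits/s, and interrupt coalescing or offload in the NIC can easily blur intervals of that size.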
Iperf vs thrulay • Iperf has multi streams • Thrulay more manageable & gives RTT • They agree well • Throughput ~ 1/avg(RTT) [Figure: thrulay achievable throughput (Mbits/s) with minimum, average and maximum RTT (ms)]
BUT… • At 10Gbits/s on a transatlantic path, slow start takes over 6 seconds • To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (7 Gbits/s × 60 s = 420 Gbits, about 52.5 GBytes, at today's typical performance of 7Gbits/s)
Passive • Use Netflow records at border • Per flow provide start/stop time, bytes/packets etc. • Collect records for several weeks • Divide by remote site, add parallel streams • Fold data onto one week, see bands at known capacities
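A minimal sketch of the grouping and folding steps, assuming each flow record has already been reduced to a hypothetical tuple `(site, start_s, stop_s, nbytes)`; real Netflow export records carry more fields and need parsing first. Flows to the same site that start in the same time bin are summed as a simple stand-in for "add parallel streams", and week boundaries are taken relative to time zero:

```python
from collections import defaultdict

WEEK_S = 7 * 24 * 3600  # seconds in one week


def fold_flows(flows, bin_s=3600):
    """Group flows by remote site and fold their throughput onto one week.

    flows: iterable of (site, start_s, stop_s, nbytes) tuples (hypothetical layout).
    Returns {(site, bin_within_week): [Mbit/s, ...]} with one entry per week seen.
    """
    # Step 1: sum the throughput of flows to the same site in the same time bin.
    per_bin = defaultdict(float)
    for site, start_s, stop_s, nbytes in flows:
        duration = stop_s - start_s
        if duration <= 0:
            continue  # skip records without a usable duration
        per_bin[(site, int(start_s // bin_s))] += nbytes * 8 / duration / 1e6

    # Step 2: fold absolute bins onto a single week.
    bins_per_week = WEEK_S // bin_s
    folded = defaultdict(list)
    for (site, abs_bin), mbps in per_bin.items():
        folded[(site, abs_bin % bins_per_week)].append(mbps)
    return dict(folded)


flows = [
    ("A", 0, 10, 125_000_000),               # 100 Mbit/s
    ("A", 5, 15, 125_000_000),               # parallel stream, 100 Mbit/s
    ("A", WEEK_S, WEEK_S + 10, 250_000_000),  # one week later, 200 Mbit/s
]
result = fold_flows(flows)
```

Plotting the folded values per site against bin-within-week is one way the bands at known capacities might become visible.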
Netflow 2/2 • Use existing traffic, no extra traffic • Works on fast networks
Anomaly Detection • An anomaly is when the actual value differs significantly from the expected value • So need forecasts to find anomalies • Focus has been on ABwE time-series measurements: • Packet pair dispersion on 20 packets • Send 20 packet pairs back to back and measure one-way packet separation at remote end • Minimum gives an indication of bottleneck capacity of link • Measurement every 3 minutes • Low network impact BUT very noisy, so a hard test case
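The relation between the minimum packet separation and bottleneck capacity can be sketched as follows. This is the generic packet-pair estimate (packet size divided by the smallest observed separation), not ABwE's actual code, and the function name and packet size are illustrative assumptions:

```python
def bottleneck_capacity_mbps(packet_bytes, separations_s):
    """Estimate bottleneck capacity from packet-pair separations.

    packet_bytes: size of each probe packet in bytes (assumed known).
    separations_s: one-way separations measured at the remote end, in seconds.
    The minimum separation is taken as the pair least disturbed by cross-traffic.
    """
    min_separation = min(separations_s)
    return packet_bytes * 8 / min_separation / 1e6


# 1500-byte packets whose tightest pair arrives 12 microseconds apart
estimate = bottleneck_capacity_mbps(1500, [12e-6, 15e-6, 20e-6])
```

Larger separations in the same sample reflect queuing behind cross-traffic, which is one reason the resulting time series is noisy.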
Plateau, most intuitive • Each observation: • If outside history buffer mean mh ± b*sh (sh = history buffer standard deviation) then add to trigger buffer • Else add to history, and remove oldest from trigger buffer • When trigger buffer > t points then trigger issued • Check if (mh − mt) / mh > D (mt = trigger buffer mean) & trigger buffer 90% filled in last T mins, then have trigger • Move trigger buffer to history buffer • History length = 1 day, t = trigger length = 3 hours • b = standard deviations = 2 [Figure: observations, events (*), trigger buffer % full, history mean, and history mean − 2 * stdev]
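The buffer logic above can be sketched roughly as below. This is a simplified illustration rather than the SLAC implementation: it assumes regularly spaced, positive observations, treats "trigger buffer full" as reaching `trigger_len` points, and leaves out the "90% filled in the last T minutes" check. The function and argument names are hypothetical:

```python
from collections import deque
from statistics import mean, pstdev


def plateau(observations, history_len, trigger_len, b=2.0, D=0.3):
    """Return indices at which a plateau-style event is declared."""
    history = deque(observations[:history_len], maxlen=history_len)
    trigger = deque()
    events = []
    for i in range(history_len, len(observations)):
        x = observations[i]
        mh, sh = mean(history), pstdev(history)
        if abs(x - mh) > b * sh:
            trigger.append(x)          # outside mh +/- b*sh
        else:
            history.append(x)          # normal point joins the history
            if trigger:
                trigger.popleft()      # and ages out the oldest trigger point
        if len(trigger) >= trigger_len:
            mt = mean(trigger)
            if (mh - mt) / mh > D:     # relative drop exceeds threshold D
                events.append(i)
            history.extend(trigger)    # trigger points become the new history
            trigger.clear()
    return events


steady = [100, 102, 98, 101, 99] * 4
events = plateau(steady + [50] * 10, history_len=20, trigger_len=5)
```

Because only (mh − mt) / mh > D is tested, this sketch reports drops only; step increases fill the trigger buffer but do not raise an event.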
K-S (Kolmogorov-Smirnov) • For each observation: compare the previous 100 observations with the next 100 observations • Compare the vertical difference in CDFs • How does it differ from random CDFs? • Expressed as % difference [Figure: K-S compared with Plateau]
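A minimal sketch of the sliding two-sample comparison, assuming the statistic is the usual maximum vertical distance between the two empirical CDFs (1.0 corresponds to 100% difference). The function names, window and threshold arguments are illustrative, not taken from the C implementation:

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)  # fraction of a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def ks_scan(series, window=100, threshold=0.7):
    """Compare the `window` points before each index with the `window` after."""
    hits = []
    for i in range(window, len(series) - window + 1):
        d = ks_statistic(series[i - window:i], series[i:i + window])
        if d >= threshold:
            hits.append((i, d))
    return hits


series = [100] * 10 + [50] * 10
hits = ks_scan(series, window=5, threshold=0.7)
best = max(hits, key=lambda h: h[1])
```

The statistic peaks at the index where the step occurs, which fits the observation that K-S gives a good time estimate for an event. Each evaluation sorts both windows, so cost grows with the window size.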
Compare • Results between K-S & plateau very similar, using K-S coefficient threshold = 70% • Current plateau only finds negative changes • Useful to see when condition returns to normal • K-S implemented in C and executes faster than Plateau (in Perl), depends on parameters • K-S more formalized • Plateau and K-S work well for non seasonal observations (e.g. small changes day/night)
Seasons & false alerts • Congestion on Monday following a quiet weekend causes a high forecast, which gives an alert • Also, a history buffer that does not span a whole day causes the history mean to be out of sync with observations
Diurnal Variation People arriving at work between 19:00 & 22:00 PDT (7:00 & 10:00 PK time) cause sudden drop in dynamic capacity
Effect on events • Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time) • Causes more anomalous events around this time
Seasonal Changes • Use Holt-Winters (H-W) technique: • Uses triple exponential weighted moving average • EWMA(i) = Obs(i) * a + (1-a) * EWMA(i-1) • Three terms, each with its own smoothing parameter, that take into account local smoothing, long term seasonal smoothing, and trends
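As a rough sketch of how the three terms interact, the following assumes the multiplicative form of triple exponential smoothing, with `alpha` for the level, `gamma` for the trend and `beta` for the seasonal index. The initialization from the first two seasons is a simplification chosen for brevity, and observations are assumed to be strictly positive:

```python
def holt_winters_forecasts(y, season_len, alpha, gamma, beta):
    """One-step-ahead forecasts for y[season_len:], multiplicative seasonality."""
    L = season_len
    if len(y) < 2 * L:
        raise ValueError("need at least two full seasons of data")
    mean1 = sum(y[:L]) / L
    mean2 = sum(y[L:2 * L]) / L
    level = mean1                         # simplified initial level
    trend = (mean2 - mean1) / L           # simplified initial trend per step
    season = [v / mean1 for v in y[:L]]   # initial seasonal indices

    forecasts = []
    for t in range(L, len(y)):
        idx = t % L
        # Forecast for time t, made from the state at time t-1.
        forecasts.append((level + trend) * season[idx])
        prev_level = level
        level = alpha * y[t] / season[idx] + (1 - alpha) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
        season[idx] = beta * y[t] / level + (1 - beta) * season[idx]
    return forecasts


y = [10, 20, 30, 40] * 3
forecasts = holt_winters_forecasts(y, season_len=4, alpha=0.5, gamma=0.1, beta=0.3)
```

For network data the season would typically be one week of regularly spaced bins, so `season_len` is the number of bins per week.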
H-W Implementation • Need regularly spaced data (else going back one season is difficult, and gets out of sync): • Interpolate data: select bin size • Average points in bin • If no points in first week bin then get data from future weeks • For following weeks, missing data bins filled from previous week • Initial values for smoothing from NIST “Engineering Statistics Handbook” • Choose parameters by minimizing (1/N) Σ (Ft − yt)² • Ft = forecast for time t as function of parameters, yt = observation at time t
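The regularization step might look like the sketch below. It assumes samples arrive as `(time_s, value)` pairs with time measured from the start of the series; the function and argument names are hypothetical:

```python
def regularize(samples, bin_s, n_bins, bins_per_week):
    """Average samples into fixed bins and fill empty bins week by week."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, value in samples:
        i = int(t // bin_s)
        if 0 <= i < n_bins:
            sums[i] += value
            counts[i] += 1
    series = [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]

    # First week: an empty bin borrows from the same bin in a future week.
    for i in range(min(bins_per_week, n_bins)):
        if series[i] is None:
            j = i + bins_per_week
            while j < n_bins and series[j] is None:
                j += bins_per_week
            if j < n_bins:
                series[i] = series[j]

    # Following weeks: an empty bin is filled from the previous week.
    for i in range(bins_per_week, n_bins):
        if series[i] is None:
            series[i] = series[i - bins_per_week]
    return series


samples = [(0, 10), (0.5, 20), (2, 5), (3, 1), (4, 7), (6, 2), (8, 9)]
series = regularize(samples, bin_s=1, n_bins=9, bins_per_week=3)
```

A bin stays `None` only if no week ever has data for that slot, which a caller would need to handle before fitting.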
H-W Implementation • Three implementations evaluated (two new) • FNAL (Maxim Grigoriev) • Inspiration for evaluating this method • Part of RRD (Brutlag) • Limited control over what it produces and how it works • SLAC • Implemented NIST formulation, different formulation/parameter values from Brutlag/FNAL, also added minimization of the sum of squares to choose parameters
Example • Local smoothing 99% weight for last 24 hours • Linear trend 50% last 24 hours • Seasonal mainly from last week, but includes several weeks • Within an 80 minute window, 80% points outside deviation envelope ≡ event [Figure: observations, 1 hr average, forecast and deviations, showing weekdays and weekend]
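The event rule in the last bullet can be sketched as a sliding window over the forecast residuals. The sketch assumes the forecast and the deviation envelope are already available as lists aligned with the observations, and that the window is given in points (at 3 minute spacing, 80 minutes is roughly 27 points); all names are illustrative:

```python
def envelope_events(observations, forecast, deviation, window, fraction=0.8):
    """Indices where at least `fraction` of the last `window` points
    lie outside forecast +/- deviation."""
    outside = [abs(o - f) > d
               for o, f, d in zip(observations, forecast, deviation)]
    events = []
    for end in range(window, len(outside) + 1):
        if sum(outside[end - window:end]) >= fraction * window:
            events.append(end - 1)  # index of the newest point in the window
    return events


observations = [10] * 10 + [2] * 10
forecast = [10] * 20
deviation = [1] * 20
events = envelope_events(observations, forecast, deviation, window=5)
```

Requiring a sustained fraction of outliers rather than a single one is what keeps isolated noisy measurements from raising an event.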
Evaluation • Created a library of time series for 100 days from June through Sep 2004 for 40 hosts • Analyzed using Plateau and saved all events where trigger buffer filled (no filters on size of step) • 23 hosts had 120 candidate events • Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds • Classify ~120 events as to whether interesting • Large, sharp drop in bandwidth, persist for >> 3hrs
Results • K-S shows similar results to Plateau • Adjusting parameters to reduce false positives increases missed events • E.g. for plateau with trigger buffer = 3 hrs filled to 90% in < 220 minutes, history buffer = 1 day, effect of threshold D = (mh-mt)/mh [Results shown for: Plateau (b=2); K-S with ± 100 observations]
Conclusions • A few paths (10%) have strong seasonal effects • Plateau & K-S work well if only weak seasonal effects • K-S detects both steps down & up, also gives accurate time estimate of event (good for correlations) • H-W promising for seasonal effects, but • Is more complex, and requires more parameters which may not be easy to estimate • Requires regular data (interpolation step) • CPU time can depend critically on parameters chosen, e.g. increasing K-S range from ±100 to say ±400 increases CPU time by factor 14 • H-W works, still need to quantify its effectiveness • Looking at PCA to evaluate multiple metrics simultaneously (e.g. fwd & bwd traffic, RTT, multiple paths)
Future Work • Future Development in PCA • Enable looking at multiple measurements simultaneously • E.g. RTT, loss, capacity …; multiple routes • Neural networks to interpolate heavyweight/infrequent measurements based on lightweight, more frequent ones • Continue Netflow passive exploration
Some Uses: • Detect anomalies reliably (few false positives, few misses): • Make extra measurements related to anomaly, e.g. ping, traceroute, performance history etc. • Notify people (e.g. via email) • Forecast into future taking account of diurnal changes: • Make long-term (hours – days) integrated estimates of performance with probabilities • Use for data location selection
Apply forecasts to Router utilizations to find bottlenecks • Get measurements from Internet2/ESnet/Geant SONAR project via NMWG web services • Save as time series, forecast for each interface • For given path and duration forecast most probable bottlenecks • Use MPLS to apply QoS at bottlenecks (rather than for the entire path) for selected applications
More information • SLAC Plateau implementation • www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf • SLAC H-W implementation • www-iepm.slac.stanford.edu/monitoring/forecast/hw.html • Eng. Statistics Handbook • http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm • IEPM-BW Measurement Infrastructure • http://www-iepm.slac.stanford.edu/
Events • Can look at residuals (Ft – yt), or χ² • Could use K-S or plateau on: residuals, or on the local smoothing (i.e. after removing long term seasonal effects)
Mark Burgess Method • A two dimensional time-series approach in order to classify a periodic, adaptive threshold for service level anomaly detection • An iterative algorithm is applied to history analysis on this periodic time to provide a smooth roll-off in the significance of the data with time. • This method was originally designed to detect anomalous behavior on a single host.
Compare with KS • Iperf from SLAC to Caltech, Feb & Mar 05 • KS technique works very well for long-term anomalous variations in Internet end-to-end traffic • Mark Burgess technique detects anomalies for each and every unwanted huge spike/variation (real time) [Figures: KS result; Mark Burgess technique result]
PCA • PCA (principal component analysis) is a coordinate transformation method that maps a given set of data points onto new axes. These axes are called the principal axes or principal components. • For network anomaly detection PCA divides the data into normal & abnormal subspaces • Procedure • Arrange the data in matrix form • Zero-mean the matrix data • Calculate the covariance matrix • Calculate the principal components • Applying the formula (I − PPᵀ)(data matrix) yields the result, where P is the matrix of principal components.
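The procedure can be illustrated with a small pure-Python sketch. To stay short it keeps only the leading principal component as the "normal" subspace and finds it by power iteration, so (I − PPᵀ) reduces to subtracting each row's projection onto one unit vector p. That restriction, the function names and the toy data are assumptions for illustration; the data must not be constant:

```python
import math


def top_component(rows, iterations=200):
    """Leading principal axis of zero-mean rows, found by power iteration."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_norms(data):
    """Size of each row's component outside the normal subspace."""
    n, d = len(data), len(data[0])
    means = [sum(r[j] for r in data) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in data]  # zero-mean
    p = top_component(centred)
    norms = []
    for r in centred:
        proj = sum(a * b for a, b in zip(r, p))
        resid = [a - proj * b for a, b in zip(r, p)]  # (I - p p^T) r
        norms.append(math.sqrt(sum(x * x for x in resid)))
    return norms


# Two correlated metrics, with one measurement that breaks the correlation.
data = [(i, i) for i in range(20)] + [(10, 0)]
norms = residual_norms(data)
```

Rows with a large residual norm are the candidates for anomalies. With several metrics, more components would be kept in P and a threshold on the residual would need to be chosen.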
PCA Results • Caught all the events that were detected by HW, Plateau and KS • Can work on multiple parameters • Tested PCA on six routes so far: SLAC-FZK, SLAC-DESY, SLAC-CALTECH, SLAC-NIIT, SLAC-BINP, SLAC-UMICH [Figure: PCA results on SLAC-BINP (June-Sep, 2004), marking anomalous and good events; annotations: due to 10% rise in dbcap, 10% rise in RTT]