Terapaths Monitoring DWMI: Datagrid Wide Area Monitoring Infrastructure
Les Cottrell & Yee-Ting Li, SLAC
US-LHC End-To-End Networking Meeting, FNAL, October 25, 2006
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM), and by Internet2

Active E2E Monitoring
Active IEPM-BW measurements
• Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model
• Makes regular measurements with probe tools:
 – ping (RTT, connectivity), owamp (one-way delay), traceroute (routes)
 – pathchirp, pathload (available bandwidth)
 – iperf (single & multi-stream), thrulay (achievable throughput)
• Also supports bbftp, bbcp (file-transfer applications, not network probes)
• Looking at GridFTP, but it is complex and requires renewing certificates
• Choice of probes depends on the importance of the path (see the sketch after this list), e.g.:
 – For major paths (tier 0, 1 & some 2) use the full suite
 – For tier 3 use just ping and traceroute
• Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV, to about 40 remote sites
• http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
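The slides describe the probe suite but not the scheduler that drives it. As a rough, hypothetical sketch of such a measurement wrapper (the host names, tier assignments and simplified probe commands below are illustrative assumptions, not the actual IEPM-BW configuration):

```python
import subprocess
import time

# Illustrative probe suites per the slide: full suite for major paths,
# ping + traceroute only for tier 3. Commands are simplified examples;
# some real probes (e.g. pathchirp) need their own sender/receiver setup.
FULL_SUITE = ["ping -c 10 {h}", "traceroute {h}", "iperf -c {h} -t 20"]
LIGHT_SUITE = ["ping -c 10 {h}", "traceroute {h}"]

TARGETS = {  # hypothetical targets and their suites
    "tier1.example.org": FULL_SUITE,
    "tier3.example.edu": LIGHT_SUITE,
}

def run_probes():
    for host, suite in TARGETS.items():
        for cmd in suite:
            try:
                out = subprocess.run(cmd.format(h=host).split(),
                                     capture_output=True, text=True,
                                     timeout=300)
                # The real infrastructure stores results for the analysis
                # and web tables; here we just timestamp the first line.
                print(time.time(), host, cmd.split()[0],
                      out.stdout.splitlines()[:1])
            except (subprocess.TimeoutExpired, FileNotFoundError) as e:
                print(time.time(), host, cmd.split()[0], "FAILED:", e)

if __name__ == "__main__":
    run_probes()
```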
IEPM-BW Measurement Topology
• 40 target hosts in 13 countries
• Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
• Traverse ~50 ASes and 15 major Internet providers
• 5 targets at PoPs, the rest at end sites
• Added Sunnyvale for UltraLight
• Covers all US ATLAS tier 0, 1 and 2 sites
• Adding FZK Karlsruhe
Examples of real data
[Time-series panels: Caltech: thrulay (0–800 Mbps, Nov 05–Mar 06); UToronto: iperf (0–250 Mbps, Nov 05–Jan 06); UTDallas: pathchirp vs. thrulay vs. iperf (0–120 Mbps, Mar 10–Mar 20, 2006)]
• Misconfigured windows
• New path
• Very noisy
• Seasonal effects: daily & weekly
• Some effects are seasonal, others are not
• Events may affect multiple metrics
• Events can be caused by host or site congestion
• Few route changes result in bandwidth changes (~20%)
• Many significant events are not associated with route changes (~50%)
Scatter plots & histograms
• Scatter plots: quickly identify correlations between metrics
• Histograms: quickly identify variability or multimodality
[Figure: scatter plots of thrulay vs. pathchirp and iperf (Mbps), and throughput (Mbits/s) vs. RTT (ms); histograms of pathchirp and thrulay throughput]
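As a small illustration of what these plots check for, the following sketch (synthetic numbers, not measured data) computes the correlation between two metrics and a crude text histogram with NumPy:

```python
import numpy as np

# Hypothetical paired samples on one path, in Mbits/s: achievable
# throughput (thrulay) and available bandwidth (pathchirp).
thrulay = np.array([720., 735., 690., 410., 402., 398., 715.])
pathchirp = np.array([750., 760., 710., 430., 420., 415., 740.])

# A strong linear correlation suggests both metrics see the same events.
r = np.corrcoef(thrulay, pathchirp)[0, 1]
print(f"correlation: {r:.2f}")

# Two well-separated peaks in the histogram indicate multimodality,
# e.g. a path alternating between two performance regimes.
counts, edges = np.histogram(thrulay, bins=5)
for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:6.0f}-{hi:6.0f} Mbps: {'#' * c}")
```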
Changes in network topology (BGP) can result in dramatic changes in performance
[Figure: snapshot of a traceroute summary table, with per-hour samples of traceroute trees generated from the table, alongside an ABwE time series (one measurement per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am) showing dynamic bandwidth capacity (DBC), cross-traffic (XT) and available bandwidth = DBC − XT. The plot shows the drop in performance when the path changed from the original SLAC–CENIC–Caltech route to SLAC–ESnet–Los-Nettos (100 Mbps)–Caltech, the Los-Nettos 100 Mbits/s segment in the path, and the return to the original path; the changes were detected by IEPM-Iperf and ABwE.]
Notes:
1. Caltech misrouted via the Los-Nettos 100 Mbps commercial network, 14:00–17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify
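As a worked example of the available-bandwidth relation in the plot (with made-up numbers): if ABwE estimates the dynamic bandwidth capacity of the bottleneck at DBC = 100 Mbits/s (the Los-Nettos segment) and observes cross-traffic of XT = 60 Mbits/s, the available bandwidth is DBC − XT = 100 − 60 = 40 Mbits/s.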
However…
• Elegant graphics are great for understanding problems, BUT:
 – There can be thousands of graphs to look at (many site pairs, many devices, many metrics)
 – Need automated problem recognition AND diagnosis
• So we are developing tools to reliably detect significant, persistent changes in performance
 – Initially using a simple plateau algorithm to detect step changes (a sketch follows this list)
 – Holt-Winters for forecasting where there are seasonal effects
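The slide names the plateau algorithm without detail; the following is a minimal sketch of this style of step-change detector. The window sizes and threshold are illustrative assumptions, not the production values:

```python
import numpy as np

def plateau_step_events(series, history=20, trigger=5, threshold=0.3):
    """Flag a step change when `trigger` consecutive samples deviate from
    the trailing-`history` mean by more than `threshold` (fractional).
    All three parameters here are illustrative assumptions."""
    events = []
    run = 0
    for i in range(history, len(series)):
        baseline = np.mean(series[i - history:i])
        if abs(series[i] - baseline) > threshold * baseline:
            run += 1
            if run == trigger:          # sustained deviation -> plateau event
                events.append(i - trigger + 1)
        else:
            run = 0
    return events

# Synthetic throughput trace with a drop at sample 30.
trace = np.concatenate([np.full(30, 800.) + np.random.randn(30) * 10,
                        np.full(30, 400.) + np.random.randn(30) * 10])
print(plateau_step_events(trace))   # reports the onset of the drop
```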
Seasonal effects on events
• Bandwidth drops between 19:00 & 22:00 Pacific Time (7:00–10:00am Pakistan time)
• Causes more anomalous events around this time
Forecasting
• Over-provisioned paths should have fairly flat time series
 – Short/local-term smoothing
 – Long-term linear trends
 – Seasonal smoothing
• But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths
• Use Holt-Winters triple exponentially weighted moving averages (a sketch follows)
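Holt-Winters is a standard technique; a compact additive-seasonality version is sketched below. The smoothing coefficients and the crude initialisation are illustrative, not the values tuned for the production paths:

```python
import math

def holt_winters_additive(x, period, alpha=0.5, beta=0.1, gamma=0.1):
    """Triple exponential smoothing with additive seasonality.
    Returns one-step-ahead forecasts; alpha/beta/gamma are illustrative."""
    level = x[0]
    trend = x[1] - x[0]
    season = [x[i] - level for i in range(period)]   # crude initialisation
    forecasts = []
    for i, obs in enumerate(x):
        s = season[i % period]
        forecasts.append(level + trend + s)          # forecast before update
        last_level = level
        level = alpha * (obs - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[i % period] = gamma * (obs - level) + (1 - gamma) * s
    return forecasts

# Example: hourly samples with a diurnal (period=24) pattern.
data = [500 + 100 * math.sin(2 * math.pi * (i % 24) / 24) for i in range(72)]
print(holt_winters_additive(data, period=24)[-5:])
```

Large, persistent residuals between the forecast and the observed series are then candidates for the event detection and alerting described next.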
Experimental alerting
• False positives are down to a reasonable level (a few per week), so we are sending alerts to developers
• Events saved in a database
 – With links to traceroutes, event analysis, time series
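The slide only states that events are stored with links and that alerts go to developers; a minimal standard-library sketch of that flow (the schema, addresses and field names are hypothetical) might look like:

```python
import sqlite3
import smtplib
from email.message import EmailMessage

def record_and_alert(db_path, event, developers):
    """Save an event with its analysis links, then email the developers.
    `event` is a hypothetical dict with ts/src/dst/metric/url fields."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS events
                   (ts REAL, src TEXT, dst TEXT, metric TEXT, url TEXT)""")
    con.execute("INSERT INTO events VALUES (?,?,?,?,?)",
                (event["ts"], event["src"], event["dst"],
                 event["metric"], event["url"]))
    con.commit()

    msg = EmailMessage()
    msg["Subject"] = (f"IEPM event: {event['metric']} "
                      f"{event['src']} -> {event['dst']}")
    msg["From"] = "iepm-alerts@example.org"    # hypothetical address
    msg["To"] = ", ".join(developers)
    msg.set_content(f"Traceroutes, analysis, time series: {event['url']}")
    with smtplib.SMTP("localhost") as s:       # assumes a local MTA
        s.send_message(msg)
```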
Passive
• Active monitoring
 – Pro: regularly spaced data on known paths; measurements can be made on demand
 – Con: adds traffic to the network, and can interfere with real data and with other measurements
• What about passive?
Netflow et al.
• The switch/router identifies a flow by src/dst ports and protocol
• Cuts a record for each flow:
 – src, dst, ports, protocol, QoS, start & end time
• Collect the records and analyze
• Can be a lot of data to collect each day, and needs a lot of CPU:
 – Hundreds of MBytes to GBytes
• No intrusive traffic; measures real traffic, collaborators, applications
• No accounts/passwords/certificates/keys
• No reservations etc.
• Characterize traffic: top talkers, applications, flow lengths etc. (a sketch follows this list)
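As a small illustration of the "top talkers" characterization, the sketch below aggregates bytes per source from exported flow records. The CSV layout (src/dst/bytes columns in a `flows.csv` file) is an assumption; real collector exports (e.g. from flow-tools or nfdump) differ in format:

```python
import csv
from collections import Counter

def top_talkers(flow_csv, n=10):
    """Sum bytes by source address over exported flow records.
    Assumes a CSV with 'src' and 'bytes' columns (hypothetical format)."""
    totals = Counter()
    with open(flow_csv) as f:
        for rec in csv.DictReader(f):
            totals[rec["src"]] += int(rec["bytes"])
    return totals.most_common(n)

# Usage (assumes an exported flows.csv exists):
for host, nbytes in top_talkers("flows.csv"):
    print(f"{host:20s} {nbytes / 1e9:8.2f} GB")
```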
Application to LHCnet
• The LHC-OPN requires edge routers to provide Netflow data
• SLAC is developing Netflow visualization at BNL
 – Allows selection of destinations and services
 – Displays time series, tables, pie charts, spider plots
• Will be ported to other LHCOPN sites
[Screenshot: visualization interface with controls to choose aggregation and services]
Netflow limitations
• Use of dynamic ports makes it harder to detect the application:
 – GridFTP, bbcp, bbftp can use fixed ports (but may not)
 – P2P often uses dynamic ports
• Discriminate the type of flow based on headers (not relying on ports):
 – Types: bulk data, interactive, …
 – Discriminators: inter-arrival time, length of flow, packet length, volume of flow
 – Use machine learning/neural nets to cluster flows (a sketch follows this list)
 – E.g. http://www.pam2004.org/papers/166.pdf
• Aggregation of parallel flows (needs care, but not difficult)
• Can be used for giving performance forecasts
• Unclear if it can be used for detecting steps in performance
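To make the clustering idea concrete, here is a minimal sketch using k-means over the flow discriminators the slide lists. The feature values are invented, and k-means stands in for whatever learning method is actually used (the cited PAM 2004 paper describes its own approach):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-flow features, in the spirit of the slide:
# [mean packet inter-arrival time (s), flow duration (s),
#  mean packet length (bytes), total volume (bytes)]
flows = np.array([
    [0.50,  120.0,   80.0, 2.0e4],   # interactive-looking (e.g. ssh)
    [0.45,   90.0,  100.0, 1.5e4],
    [0.001, 300.0, 1460.0, 2.0e9],   # bulk-data-looking (e.g. GridFTP)
    [0.002, 250.0, 1400.0, 1.5e9],
])

# Standardize features so volume does not dominate the distance metric.
X = (flows - flows.mean(axis=0)) / flows.std(axis=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # flows in the same cluster share a label, regardless of port
```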
perfSONAR (pS)
• See Joe Metzger's talk later today
• SLAC/IEPM has formally joined pS
 – De-centralized management of pS lets us concentrate on analysis rather than deployment/maintenance, provides a rich source of data to analyze, and leverages IEPM group skills
 – perfSONAR allows transparent data access
• Brings US HEP influence closer to pS
• Make iepm-bw data available via the pS infrastructure
• Port our analysis tools to work with pS
 – Test perfSONAR APIs
 – Provide useful features such as analysis, visualization, event detection, alerting and diagnosis
• pS enables a unified representation of both end-to-end and router metrics
 – Worry about finding correlations for diagnosis rather than how to gather the data
Questions, more information
• Comparisons of active infrastructures:
 – www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html
• Some active public measurement infrastructures:
 – www-iepm.slac.stanford.edu/
 – www-iepm.slac.stanford.edu/pinger/
 – e2epi.internet2.edu/owamp/
 – amp.nlanr.net/ (no longer funded)
• Monitoring tools:
 – www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
 – www.caida.org/tools/
 – Google for iperf, thrulay, bwctl, pathload, pathchirp, pathneck
• Event detection:
 – www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc
• Netflow:
 – Internet2 backbone: http://netflow.internet2.edu/weekly/
 – SLAC: www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
 – BNL (SLAC developed, work in progress): http://iepmbw.bnl.org/netflow/index.html