Lessons Learned Monitoring. Les Cottrell, SLAC. ESnet R&D Advisory Workshop, April 23, 2007, Arlington, Virginia. Partially funded by DOE and by Internet2.
Uses of Measurements • Automated problem identification & troubleshooting: • Alerts for network administrators, e.g. • Baselines, bandwidth changes in time-series, iperf, SNMP • Alerts for systems people • OS/Host metrics • Forecasts for Grid Middleware, e.g. replica manager, data placement • Engineering, planning, SLA (set & verify), expectations • Also (not addressed here): • Security: spot anomalies, intrusion detection • Accounting
History • PingER (1994), IEPM-BW (2001) • E2E, active, regular measurements from the end user's view • All hosts owned by individual sites • Core mainly centrally designed & developed (homogeneous), with contributions from FNAL, GATech, NIIT (close collaboration) • Why are you monitoring? Network trouble management, planning, auditing/setting SLAs and Grid forecasting are very different, though they may use the same measurements
PingER (1994) • PingER project originally (1995) for measuring network performance for US, Europe and Japanese HEP community - now mainly R&E sites • Extended this century to measure Digital Divide: • Collaboration with ICTP Science Dissemination Unit http://sdu.ictp.it • ICFA/SCIC: http://icfa-scic.web.cern.ch/ICFA-SCIC/ • >120 countries (99% world’s connected population) • >35 monitor sites in 14 countries • Uses ubiquitous ping facility • Monitor 44 sites in S. Asia • Most extensive active E2E monitoring in world
PingER Design Details • PingER design (1994: no web services, no RRD, security not a big issue, etc.) • Simple: no remote software (ping is everywhere), no probe development; monitor host install is a 0.5 day effort for a sys-admin • Data centrally gathered, archived and analyzed, so the hard jobs (archiving, analysis, viz) do NOT require distribution, only one copy • Database is flat ASCII files (raw data and analyzed data, one file per pair per day); compression saves a factor of 6 (90GB) • Data available via web (lots of use, some uses unexpected, often analysis with Excel)
PingER Lessons • Measurement code rewritten twice: once to add extra data, once to document (perldoc), parameterize and simplify installation • Gathering code (uses LYNX or FTP) pulls from the archive; no major mods in 10 years • Most of the development effort went into downloading, analyzing, visualizing and managing the data • New ways to use the data (jitter, out-of-order, duplicates, derived throughput, MOS) all required studying the data, then implementing and integrating • Dirty data (pathologies not related to the network) require filtering or filling before analysis • Had to develop an easy make/install download, instructions, FAQ; new installs still require communication: • pre-reqs, getting name registration, getting cron jobs running, getting the web server running, unblocking, clarifying documentation (often for non-native English speakers) • Documentation (tutorials, help, FAQs), publicity (brochures, papers, maps, presentations/travel), getting funding/writing proposals • Monitor availability of (developed tools to simplify/automate): • monitor sites (hosts stop working: security blocks, hosts replaced, site forgets); nudge contacts • critical remote sites (beacons); choose a new one (automatically updates monitor sites) • Validate/update meta data (name, address, institute, lat/long, contact …) in the database (need easy update)
IEPM-BW (2001) • 40 target hosts in 13 countries • Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s • Traverse ~50 ASes, 15 major Internet providers • 5 targets at PoPs, the rest at end sites • Added Sunnyvale for UltraLight • Covers all USATLAS tier 0, 1, 2 sites • Recently added FZK, QAU • Main author (Connie Logg) retired
IEPM Design Details • IEPM-BW (2001): • More focused (than PingER), fewer sites (e.g. BaBar collaborators), more intense, more probe tools (iperf, thrulay, pathload, traceroute, owamp, bbftp …), more flexibility • Complete code set (measurement, archive, analyze, viz) at each monitoring site; data distributed • Needs a dedicated host • Remote sites need code installed • Originally executed remotely via ssh, but still needed code installed • Security, accounts (require training), recovery problems • Major changes with time: • Use servers rather than ssh for remote hosts • Use MySQL for configuration databases rather than requiring perl scripts • Provide management tools for configuration data etc. • Add/replace probes
IEPM Lessons (1) • Problems & recommendations: • Need the right versions of MySQL, gnuplot, perl (and modules) installed on hosts • All possible failure modes for probe tools need to be understood and accommodated • Timeout everything, clean up hung processes (see the sketch after this list) • Keep logfiles for a day or so for debugging • Review how processes run with Netflow (mainly manual) • Scheduling: • don't run file transfer, iperf, thrulay, pathload at the same time on the same path • Limit duration and frequency of intensive probes so they do not impact the network • Host loses a disk, upgrades the OS, loses DNS, applications get upgraded (e.g. gnuplot), the IEPM database gets zapped, etc. • Need backups • Have a local host as a target for sanity checks (e.g. to spot monitoring-host-based issues) • Monitor the monitoring host's load (e.g. Ganglia, Nagios …)
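The "timeout everything, clean up hung processes" lesson can be illustrated with a minimal sketch (an illustration in Python, not the actual IEPM-BW perl code; the command, timeout and log path are placeholders):

import os, signal, subprocess

def run_probe(cmd, timeout_s=300, logfile="probe.log"):
    # Run a probe in its own process group so a hung probe and any children
    # it spawned can be killed together; keep the log around for debugging.
    with open(logfile, "a") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT,
                                preexec_fn=os.setsid)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)   # clean up hung probe
            proc.wait()                                       # reap it
            log.write("TIMEOUT: %s after %ds\n" % (" ".join(cmd), timeout_s))
    return proc.returncode

# Example: run_probe(["ping", "-c", "10", "some.remote.host"], timeout_s=60)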
IEPM Lessons (2) • Different paths need different probes (performance and interest related) • Experiences with probes (a lot of work to understand, analyze & compare): • owamp vs ping: owamp needs a server and accurate time; ping gives only round trip but is available everywhere, though it may be blocked • Traceroute: need to analyze the significance of results • Packet pair separation: • ABwE noisy, inaccurate especially on Gbps paths • pathchirp better, pathload best (most intense, approaches iperf); problems at 10 Gbps; look at pathneck • TCP: • thrulay gives more information and is more manageable than iperf • need to keep TCP buffers optimized/updated • File transfer: • disk-to-disk is close to iperf/thrulay • disk transfers measure the file/disk system, not the network, but the end user view is important • Adding new hosts is still not easy
Other Lessons • Traceroute is no good for layers 1 & 2 • Packet pair spacing at 10 Gbps is finer than host timing granularity • Forecasting is hard if the path is congested; need to account for diurnal etc. variations • A net admin cannot review thousands of graphs each day: • need event detection, alert notification, and diagnosis assistance • Comparing QoS vs best effort requires adding path reservation • Keeping TCP buffer parameters optimized is difficult • Network & configurations are not static • Passive/Netflow is valuable and complementary
PerfSONAR • Our future focus (for us, 3rd generation): • Open source, open community • Both end users (LHC, GATech, SLAC, Delaware) and network providers (ESnet, I2, GEANT, Eu NRENs, Brazil, …) • Many developers from multiple fields • Requires from the get-go: shared code, documentation, collaboration • Hopefully not as dependent on funding as a single team, so persistent? • Transparent gathering and storage of measurements, both from NRENs and end users • Sharing of information across autonomous domains • Uses standard formats • More comprehensive view • AAA to provide protection of sensitive data • Reduces debugging time • Access to multiple components of the path • No need to play telephone tag • Currently mainly middleware; needs: • Data mining and viz • Topology also at layers 1 & 2 • Forecasting • Event detection and event diagnosis
E.g. Using Active IEPM-BW Measurements • Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model • Makes regular measurements with probe tools: • ping (RTT, connectivity), owamp (1-way delay), traceroute (routes) • pathchirp, pathload (available bandwidth) • iperf (one & multi-stream), thrulay (achievable throughput) • supports bbftp, bbcp (file transfer applications, not network) • Looking at GridFTP, but it is complex, requiring certificate renewal • Choice of probes depends on the importance of the path, e.g. • For major paths (tier 0, 1 & some 2) use the full suite • For tier 3 use just ping and traceroute • Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV to about 40 remote sites • http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html
IEPM-BW Measurement Topology • [World map of the measurement topology; labels include Taiwan and TWaren] • 40 target hosts in 13 countries • Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s • Traverse ~50 ASes, 15 major Internet providers • 5 targets at PoPs, the rest at end sites • Added Sunnyvale for UltraLight • Adding FZK Karlsruhe
Probes: Ping/traceroute • Ping still useful: • Is the path connected / node reachable? • RTT, jitter, loss • Great for low performance links (e.g. Digital Divide), e.g. AMP (NLANR)/PingER (SLAC) • Nothing to install, but may be blocked (a minimal parsing sketch follows this slide) • OWAMP/I2 similar but one way • But needs a server installed at the other end and good timers • Now built into IEPM-BW • Traceroute: • Needs good visualization (traceanal/SLAC) • No use for dedicated λ / layer 1 or 2 paths • However, still want to know the topology of paths
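As an illustration of how little is needed to use ping as a probe, here is a minimal sketch (not PingER code; it assumes Linux-style ping output and a hypothetical host name):

import re, subprocess

def ping_stats(host, count=10):
    # Run ping and pull packet loss and min/avg/max RTT out of its summary lines.
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    loss = re.search(r"([\d.]+)% packet loss", out)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", out)   # min/avg/max in ms
    return {"loss_pct": float(loss.group(1)) if loss else None,
            "rtt_ms": tuple(map(float, rtt.groups())) if rtt else None}

# Example: ping_stats("some.remote.host")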
Probes: Packet Pair Dispersion • [Diagram: packets sent back-to-back get a minimum spacing at the bottleneck, and that spacing is preserved on the higher speed links downstream; used by pathload, pathchirp and ABwE to estimate available bandwidth] • Send packets with known separation • See how the separation changes due to the bottleneck • Can be minimally intrusive, e.g. ABwE uses only 20 packets/direction and is fast (< 1 sec) • From the PAM paper, pathchirp is more accurate than ABwE, but • Ten times as long (10s vs 1s) • More network traffic (~factor of 10) • Pathload another factor of 10 more • http://www.pam2005.org/PDF/34310310.pdf • IEPM-BW now supports ABwE, pathchirp, pathload • A toy capacity calculation follows this slide
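A toy calculation of the packet-pair idea (an illustration of the principle, not ABwE/pathchirp/pathload themselves; the numbers are made up):

def bottleneck_capacity_bps(packet_bytes, dispersion_s):
    # The bottleneck link spaces back-to-back packets by one serialization
    # time, so capacity = packet size / observed dispersion.
    return 8.0 * packet_bytes / dispersion_s

# Example: 1500-byte packets arriving 120 microseconds apart
# => 8 * 1500 / 120e-6 = 100 Mbits/s bottleneck
print(bottleneck_capacity_bps(1500, 120e-6) / 1e6)   # 100.0 (Mbits/s)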
BUT… • Packet pair dispersion relies on accurate timing of inter packet separation • At > 1Gbps this is getting beyond resolution of Unix clocks • AND 10GE NICs are offloading function • Coalescing interrupts, Large Send & Receive Offload, TOE • Need to work with TOE vendors • Turn off offload (Neterion supports multiple channels, can eliminate offload to get more accurate timing in host) • Do timing in NICs • No standards for interfaces • Possibly use packet trains, e.g. pathneck
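To see the scale of the timing problem (a back-of-envelope check, not a figure from the slide): at 10 Gbits/s the dispersion imposed on 1500-byte packets is

  Δt = (8 × 1500 bits) / (10 × 10^9 bits/s) = 1.2 µs

which is comparable to, or below, the timing granularity available to user-level code on typical hosts, so the packet spacing can no longer be measured reliably in software.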
Achievable Throughput • Use TCP or UDP to send as much data as possible, memory to memory, from source to destination • Tools: iperf (bwctl/I2), netperf, thrulay (from Stas Shalunov/I2), udpmon … • Pseudo file copy: bbcp also has a memory-to-memory mode to avoid disk/file problems
BUT… • At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds • To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (5.25 GBytes at 7 Gbits/s, today's typical performance) • Needs scheduling to scale, and even then … • It's not disk-to-disk or application-to-application • So use bbcp, bbftp, or GridFTP
AND … • For testbeds such as UltraLight, UltraScienceNet etc. one has to reserve the path • So the measurement infrastructure needs the capability to reserve the path (so need an API to the reservation application) • OSCARS from ESnet is developing a web services interface (http://www.es.net/oscars/): • For lightweight probes, have a "persistent" reservation capability • For more intrusive probes, must reserve just before making the measurement
Examples of real data • [Time-series plots: Caltech (thrulay, Nov 05 - Mar 06, 0-800 Mbps); UToronto (iperf, Nov 05 - Jan 06, 0-250 Mbps); UTDallas (pathchirp, thrulay, iperf, Mar-10-06 to Mar-20-06, 0-120 Mbps)] • Annotations on the plots: misconfigured windows, new path, very noisy, seasonal effects (daily & weekly) • Some are seasonal, others are not • Events may affect multiple metrics • Events can be caused by host or site congestion • Few route changes result in bandwidth changes (~20%) • Many significant events are not associated with route changes (~50%)
Scatter plots & histograms • Scatter plots: quickly identify correlations between metrics [example: thrulay throughput (Mbps) vs RTT (ms), and thrulay vs pathchirp & iperf (Mbps)] • Histograms: quickly identify variability or multimodality [example: distributions of pathchirp and thrulay throughput (Mbits/s)]
Changes in network topology (BGP) can result in dramatic changes in performance • [Figure: snapshot of the traceroute summary table by hour, and samples of traceroute trees generated from the table, showing the remote host reached via Los-Nettos (100 Mbps)] • Notes: 1. Caltech misrouted via the Los-Nettos 100 Mbps commercial net 14:00-17:00; 2. ESnet/GEANT working on routes from 2:00 to 14:00; 3. A previous occurrence went unnoticed for 2 months; 4. Next step is to auto-detect and notify • [Figure: ABwE measurements, one per minute for 24 hours (Thurs Oct 9 9:00am to Fri Oct 10 9:01am), showing the drop in performance when the original path SLAC-CENIC-Caltech changed to SLAC-ESnet-LosNettos (100 Mbps)-Caltech and the return to the original path; traces show dynamic BW capacity (DBC), cross-traffic (XT) and available BW = DBC - XT; the ESnet-LosNettos segment limits the path to 100 Mbits/s; changes detected by IEPM-Iperf and ABwE]
On the other hand • Route changes may affect the RTT (shown in yellow) • Yet have no noticeable effect on available bandwidth or throughput • [Figure: time series of available bandwidth and achievable throughput with route changes marked]
However… • Elegant graphics are great for understanding problems, BUT: • There can be thousands of graphs to look at (many site pairs, many devices, many metrics) • Need automated problem recognition AND diagnosis • So developing tools to reliably detect significant, persistent changes in performance • Initially using a simple plateau algorithm to detect step changes (a sketch of the idea follows this slide)
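A minimal sketch of a plateau-style step-change detector (an illustration of the general idea only; the window lengths and threshold are made-up values, and this is not the IEPM-BW implementation):

from collections import deque

def plateau_detector(samples, hist_len=50, trig_len=5, threshold=0.3):
    # Compare a short recent window against a longer history buffer and
    # flag indices where the recent mean drops persistently below history.
    history, trigger = deque(maxlen=hist_len), deque(maxlen=trig_len)
    for i, x in enumerate(samples):
        trigger.append(x)
        if len(history) == hist_len and len(trigger) == trig_len:
            h_mean = sum(history) / hist_len
            t_mean = sum(trigger) / trig_len
            if h_mean > 0 and (h_mean - t_mean) / h_mean > threshold:
                yield i                        # step change (drop) detected
        history.append(x)

# Example: a 400 Mbps baseline stepping down to 200 Mbps
data = [400.0] * 60 + [200.0] * 20
print(next(plateau_detector(data)))            # index of the first flagged sample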
Seasonal Effects on events • Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time) • Causes more anomalous events around this time
Forecasting • Over-provisioned paths should have pretty flat time series: • Short/local-term smoothing • Long-term linear trends • Seasonal smoothing • But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths • Use Holt-Winters triple exponential weighted moving averages (a compact sketch follows this slide)
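A compact sketch of additive Holt-Winters triple exponential smoothing, the technique named above (textbook form; the smoothing constants and season length are illustrative, not the values used in IEPM-BW):

def holt_winters_forecast(series, season_len, alpha=0.3, beta=0.05, gamma=0.1):
    # One-step-ahead forecasts with level, trend and additive seasonal terms.
    # Assumes len(series) >= 2 * season_len for the initializations below.
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len]) -
             sum(series[:season_len])) / season_len ** 2
    seasonal = [x - level for x in series[:season_len]]
    forecasts = []
    for i, x in enumerate(series):
        s = seasonal[i % season_len]
        forecasts.append(level + trend + s)            # forecast before seeing x
        last_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[i % season_len] = gamma * (x - level) + (1 - gamma) * s
    return forecasts

# Example: hourly bandwidth samples with a daily cycle would use season_len=24;
# large deviations of new samples from the forecast can be flagged as events.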
Experimental Alerting • Have false positives down to reasonable level (few per week), so sending alerts to developers • Saved in database • Links to traceroutes, event analysis, time-series
Passive • Active monitoring: • Pro: regularly spaced data on known paths; measurements can be made on demand • Con: adds traffic to the network, can interfere with real data and with other measurements • What about passive?
Netflow et al. • The switch identifies a flow by src/dst ports and protocol • Cuts a record for each flow: • src, dst, ports, protocol, TOS, start and end time • Collect the records and analyze • Can be a lot of data to collect each day, needs a lot of CPU: • Hundreds of MBytes to GBytes • No intrusive traffic; measures real traffic, collaborators, applications • No accounts/pwds/certs/keys • No reservations etc. • Characterize traffic: top talkers, applications, flow lengths etc. • LHC-OPN requires edge routers to provide Netflow data • Internet2 backbone: • http://netflow.internet2.edu/weekly/ • SLAC: • www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
Typical day's flows • Very much work in progress • Look at the SLAC border • Typical day: • ~28K flows/day • ~75 sites with > 100KB bulk-data flows • A few hundred flows > 1 GByte • Collect records for several weeks • Filter on 40 major collaborator sites, big (> 100 KBytes) flows, bulk transport apps/ports (bbcp, bbftp, iperf, thrulay, scp, ftp …) • Divide by remote site, aggregate parallel streams (see the sketch after this slide) • Look at the throughput distribution
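A minimal sketch of the "aggregate parallel streams" step (illustrative only; the record fields and the 5-second grouping window are assumptions, not the SLAC analysis code or the Netflow export format):

from collections import defaultdict

def aggregate_transfers(flows, gap_s=5.0):
    # flows: dicts with keys "site", "start", "end" (seconds), "bytes", one per
    # stream. Streams to the same site starting within gap_s seconds of each
    # other are treated as one parallel transfer; return Mbits/s per transfer.
    by_site = defaultdict(list)
    for f in sorted(flows, key=lambda f: (f["site"], f["start"])):
        groups = by_site[f["site"]]
        if groups and f["start"] - groups[-1]["start"] <= gap_s:
            groups[-1]["bytes"] += f["bytes"]                  # parallel stream
            groups[-1]["end"] = max(groups[-1]["end"], f["end"])
        else:
            groups.append(dict(f))
    return {site: [8 * g["bytes"] / max(g["end"] - g["start"], 1e-3) / 1e6
                   for g in groups]
            for site, groups in by_site.items()}

# Example: a 4-stream bbftp transfer collapses to one throughput number for its site.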
Netflow et al. • Throughput peaks at known capacities and RTTs • The RTT-related peaks might suggest windows are not optimized: peaks appear at the default OS window size (BW = Window / RTT); a worked example follows this slide
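As an illustration of BW = Window / RTT (numbers chosen for illustration, not taken from the slide): with a common 64 KByte default TCP window and a 100 ms RTT,

  BW = Window / RTT = (64 × 1024 × 8 bits) / 0.1 s ≈ 5.2 Mbits/s

so throughput peaks near a few Mbits/s on long-RTT paths are a strong hint that the sending hosts are still using default window sizes.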
How many sites have enough flows? • In May '05, found 15 sites at the SLAC border with > 1440 flows (one per 30 minutes) • Maybe enough for time-series forecasting of seasonal effects • Three sites (Caltech, BNL, CERN) were actively monitored • The rest were "free" • Only 10% of sites show big seasonal effects in active measurements • The remainder need fewer flows • So promising
Mining data for sites • Real application use (bbftp) for 4 months • Gives rough idea of throughput (and confidence) for 14 sites seen from SLAC
Multi months • [Plot: bbcp throughput from SLAC to Padova over several months] • Fairly stable with time, but large variance • Many non-network-related factors
Netflow limitations • Use of dynamic ports makes it harder to detect the application: • GridFTP, bbcp, bbftp can use fixed ports (but may not) • P2P often uses dynamic ports • Discriminate the type of flow based on headers (not relying on ports): • Types: bulk data, interactive … • Discriminators: inter-arrival time, length of flow, packet length, volume of flow • Use machine learning/neural nets to cluster flows (a sketch follows this slide) • E.g. http://www.pam2004.org/papers/166.pdf • Aggregation of parallel flows (needs care, but not difficult) • Can be used for giving performance forecasts • Unclear if it can be used for detecting steps in performance
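A minimal sketch of clustering flows by such discriminators (a generic k-means illustration with scikit-learn and made-up feature values, not the method of the paper cited above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per flow: [mean inter-arrival time (s), flow duration (s),
#                    mean packet length (bytes), flow volume (bytes)]
flows = np.array([
    [0.200, 300.0,   80.0, 1.2e5],   # interactive-looking
    [0.001, 120.0, 1450.0, 1.5e9],   # bulk-data-looking
    [0.002,  90.0, 1400.0, 9.0e8],
    [0.150, 600.0,  100.0, 4.0e5],
])
X = StandardScaler().fit_transform(flows)            # put features on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # two clusters, e.g. bulk-data vs interactive flows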
Conclusions • Some tools fail at higher speeds • Throughputs often depend on non-network factors: • Host: interface speeds (DSL, 10Mbps Enet, wireless), loads, resource congestion • Configurations (window sizes, hosts, number of parallel streams) • Applications (disk/file vs mem-to-mem) • Looking at distributions by site, often multi-modal • Predictions may have large standard deviations • Need automated assist to diagnose events
In Progress • Working on Netflow viz (currently at BNL & SLAC) then work with other LHC sites to deploy • Add support for pathneck • Look at other forecasters: e.g. ARMA/ARIMA, maybe Kalman filters, neural nets • Working on diagnosis of events • Multi-metrics, multi-paths • Signed collaborative agreement with Internet2 to collaborate with PerfSONAR • Provide web services access to IEPM data • Provide analysis forecasting and event detection to PerfSONAR data • Use PerfSONAR (e.g. router) data for diagnosis • Provide viz of PerfSONAR route information • Apply to LHCnet • Look at layer 1 & 2 information
Questions, More information • Comparisons of Active Infrastructures: • www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html • Some active public measurement infrastructures: • www-iepm.slac.stanford.edu/ • www-iepm.slac.stanford.edu/pinger/ • e2epi.internet2.edu/owamp/ • amp.nlanr.net/ • Monitoring tools • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html • www.caida.org/tools/ • Google for iperf, thrulay, bwctl, pathload, pathchirp • Event detection • www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc
Outline • Deployment, keeping in sync, management, timeouts, killing hung processes, host OS/env different • Implementation: • MySQL dbs for data and configuration (host, tools, plotting etc.) info • Scheduler, prevents backup • Log files, analyze for troubles • Local target as a sanity check on monitor