SLAC and PerfSONAR. Yee-Ting Li PerfSONAR developers workshop October 2006. SLAC IEPM. SLAC used to be primarily a High Energy Particle Physics institute Now beginning to diversify into other sciences Photon Science (SSRL and LCLS) Impact on chemistry and molecular biology
SLAC and PerfSONAR Yee-Ting Li PerfSONAR developers workshop October 2006
SLAC IEPM • SLAC used to be primarily a High Energy Particle Physics institute • Now beginning to diversify into other sciences • Photon Science (SSRL and LCLS) • Impact on chemistry and molecular biology • First US-based webpage at SLAC! • Internet End-to-end Performance Monitoring Group • Focus on problem detection and long-term performance/trend analysis • Origins in PingER monitoring • Currently deploying more intrusive IEPM-BW tests
PingER • PingER project originally (1995) for measuring network performance for the US, European and Japanese HEP community • Extended this century to measure the Digital Divide • Last year added monitoring sites in S. Africa, Pakistan & India • Uses ICMP to determine: • RTT • Loss • Connectivity • Derived TCP throughput, i.e. proportional to 1/sqrt(loss)
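The 1/sqrt(loss) relation on this slide can be illustrated with a short sketch. It assumes the commonly used form throughput ≈ C * MSS / (RTT * sqrt(loss)); the slide only gives the 1/sqrt(loss) dependence, so the function name, the MSS default of 1460 bytes, and the constant `c` (1.0 here; sqrt(3/2) is another common choice) are assumptions for illustration, not PingER's actual code.

```python
import math


def derived_tcp_throughput_bps(rtt_s, loss, mss_bytes=1460, c=1.0):
    """Estimate TCP throughput (bits/s) from ping RTT and loss.

    Assumed model: c * MSS / (RTT * sqrt(loss)).
    rtt_s is the round-trip time in seconds, loss is a fraction in (0, 1].
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss <= 1:
        raise ValueError("loss must be in (0, 1]; the model is undefined at zero loss")
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))
```

Under this model, quadrupling the loss rate halves the derived throughput, which is why small loss changes on long-RTT paths matter so much.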
PingER: Deployment • ~120 countries • 99% of the world’s connected population • 35 monitor sites in 14 countries • Over 600 nodes currently being monitored worldwide
PingER: Digital Divide • Years behind Europe: • 6 yrs: Russia, Latin America • 7 yrs: Mid-East, SE Asia • 10 yrs: South Asia • 11 yrs: Central Asia • 12 yrs: Africa
IEPM-BW • Developed as an exhibit for SC2001 • Conducts tests using various tools • Achievable BW: Iperf, thrulay • Estimated BW: pathchirp, pathload, abwe • File Transfer: bbcp, bbftp, gridftp • Latency/Loss: ping, traceroute, owamp • MySQL backend with Web-based front end • Collection of scripts to: • Start/stop daemons • Conduct analysis (and produce web-accessible graphs) • Forecasting and event detection (and notification)
IEPM-BW: Deployment • Running from CERN, SLAC, FNAL, BNL, Caltech, Taiwan to about 40 remote sites (in a semi-mesh) • 40 target hosts in 13 countries • Bottlenecks vary from 0.5 Mbit/s to 10 Gbit/s • Traverse ~50 ASes, 15 major Internet providers • 5 targets at PoPs, rest at end sites
IEPM-BW: Presentation • Timeseries plots
IEPM-BW: Presentation • Diurnal Plots
IEPM-BW: Presentation • CDF Diagrams
IEPM-BW: Topology • Topology
IEPM-BW: Event Detection • Automated problem identification: • Administrators cannot review hundreds of graphs each day • Alerts for network administrators • Changes in time-series, loss, latency, iperf, SNMP • Alerts for systems people • OS/Host metrics • Anomalies for security • Anomalous event detection • A run of missing measurements (network out?) • Determine that something ‘wrong’ has happened; measured value differs significantly from expected value • Forecasts • Given trends in previous measurements, determine what is within tolerance of being ‘okay’
Event Detection: Plateau (figure: observations over time, marking the history mean, history mean - 2 * stdev, trigger buffer % full, and detected events) • Circular (history) buffer of observations • Define trigger buffer of results • Trigger buffer fills if an observation deviates significantly from the mean of the circular buffer • Event occurs when trigger buffer exceeds threshold • Filters: • Check if (mh - mt) / mh > D and 90% trigger in last T mins; if so, raise the event • Move trigger buffer to history buffer • Parameters: history length = 1 day, t = trigger length = 3 hours, standard deviations = 2
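The plateau scheme on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the IEPM-BW implementation: buffers are sized in observations rather than in time, mh and mt are taken to be the history and trigger buffer means, only drops below the history mean are treated as deviations, and the "in last T mins" time filter is omitted. The class name, the `warmup` parameter, and the default `min_rel_change` of 0.2 are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class PlateauDetector:
    """Simplified plateau (step-change) detector for a throughput-like metric."""

    def __init__(self, history_len=288, trigger_len=36, n_sigma=2.0,
                 trigger_fill=0.9, min_rel_change=0.2, warmup=10):
        self.history = deque(maxlen=history_len)   # circular buffer of "normal" data
        self.trigger = deque(maxlen=trigger_len)   # recent deviating observations
        self.n_sigma = n_sigma                     # deviations counted in stdevs
        self.needed = max(1, round(trigger_fill * trigger_len))
        self.min_rel_change = min_rel_change       # plays the role of D (assumed value)
        self.warmup = warmup                       # observations needed before testing

    def update(self, value):
        """Feed one observation; return True if an event is raised."""
        if len(self.history) < self.warmup:
            self.history.append(value)
            return False
        m_h = mean(self.history)
        floor = m_h - self.n_sigma * pstdev(self.history)
        if value >= floor:
            # Normal observation: goes to history and drains the trigger buffer.
            self.history.append(value)
            if self.trigger:
                self.trigger.popleft()
            return False
        self.trigger.append(value)
        if len(self.trigger) < self.needed:
            return False
        # Trigger buffer is full enough: apply the relative-change filter.
        m_t = mean(self.trigger)
        event = m_h > 0 and (m_h - m_t) / m_h > self.min_rel_change
        # Move the trigger buffer into history so the new level becomes the baseline.
        self.history.extend(self.trigger)
        self.trigger.clear()
        return event
```

Draining one trigger entry per normal observation is one possible way to stop isolated outliers from accumulating into an event; other policies are equally plausible.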
Event Detection: K-S • For each observation: compare the previous 100 observations with the next 100 observations • Compare the vertical difference in CDFs • How does it differ from random CDFs? • Expressed as % difference • Define threshold for % difference
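A minimal sketch of the K-S comparison described above, in pure Python. It computes the two-sample Kolmogorov-Smirnov statistic (the largest vertical gap between two empirical CDFs) for the window before and after each point. The function names and the `threshold` default are assumptions; the slide does not give the threshold used, and the comparison against random CDFs is not modelled here.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest vertical distance between the empirical CDFs of a and b (0..1)."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)   # empirical CDF of a at x
        fb = bisect_right(sb, x) / len(sb)   # empirical CDF of b at x
        d = max(d, abs(fa - fb))
    return d


def ks_scan(series, window=100, threshold=0.5):
    """Return (index, statistic) pairs where the windows on either side differ.

    For each index i, the previous `window` observations are compared with
    the next `window` observations; pairs exceeding `threshold` are reported.
    """
    hits = []
    for i in range(window, len(series) - window + 1):
        d = ks_statistic(series[i - window:i], series[i:i + window])
        if d > threshold:
            hits.append((i, d))
    return hits
```

Because the test needs the next 100 observations, an event can only be reported after that many further measurements have arrived, so this method suits offline or delayed analysis better than immediate alerting.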
Event Detection: Holt-Winters • Use Holt-Winters (H-W) technique: • Uses triple exponentially weighted moving average • Three parameters (α, β, γ) that take into account local smoothing, long-term seasonal smoothing, and trends respectively • Choose parameters by minimizing (1/N) Σ (F_t - y_t)^2 • F_t = forecast for time t as a function of the parameters, y_t = observation at time t • H-W is a forecasting technique; need to complement it with a method to identify events • If a percentage of residuals are outside twice the EWMA of the absolute deviation, then generate event (HWE) • Apply Plateau on H-W residuals (PHR) and K-S on H-W residuals (KHR)
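The Holt-Winters steps above can be sketched as below. This is an illustrative additive H-W with one-step-ahead forecasts, a coarse grid search over the three weights that minimises (1/N) Σ (F_t - y_t)^2, and a residual check against twice an EWMA of the absolute deviation. The function names, the descriptive weight names (used instead of Greek letters), the grid values, and `dev_w` are all assumptions; the initialisation from the first season is one simple choice among several.

```python
from itertools import product


def holt_winters(y, period, level_w, trend_w, season_w):
    """One-step-ahead additive Holt-Winters forecasts for y[period:]."""
    level = sum(y[:period]) / period
    trend = 0.0
    season = [v - level for v in y[:period]]   # initial seasonal offsets
    forecasts = []
    for t in range(period, len(y)):
        s = season[t % period]
        forecasts.append(level + trend + s)    # forecast made before seeing y[t]
        new_level = level_w * (y[t] - s) + (1 - level_w) * (level + trend)
        trend = trend_w * (new_level - level) + (1 - trend_w) * trend
        season[t % period] = season_w * (y[t] - new_level) + (1 - season_w) * s
        level = new_level
    return forecasts


def mse(y, period, weights):
    """Mean squared one-step forecast error, (1/N) * sum((F_t - y_t)^2)."""
    f = holt_winters(y, period, *weights)
    return sum((ft - yt) ** 2 for ft, yt in zip(f, y[period:])) / len(f)


def fit_by_grid(y, period, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the (level, trend, season) weights with the lowest MSE on a coarse grid."""
    return min(product(grid, repeat=3), key=lambda w: mse(y, period, w))


def outlier_fraction(residuals, dev_w=0.1, k=2.0):
    """Fraction of residuals larger than k times the EWMA of absolute deviation."""
    dev = abs(residuals[0])
    flagged = 0
    for r in residuals[1:]:
        if abs(r) > k * dev:
            flagged += 1
        dev = dev_w * abs(r) + (1 - dev_w) * dev   # update after testing r
    return flagged / max(1, len(residuals) - 1)
```

An HWE-style event would then be raised when `outlier_fraction` over a recent window of residuals (observation minus forecast) exceeds a chosen percentage; the same residuals could instead be passed to a plateau or K-S detector, as in PHR and KHR.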
Event Diagnosis • Once we get alert(s) of events, how do we correlate them to diagnose problems? • Define heuristics of ‘effect and cause’ • Define probabilities to pinpoint the location of the problem • First pass: narrow down where the problem occurs at a high level • End-host or network? • Next step: define heuristics for the location of problems in a network path and in subsystems on hosts • Interrogate using tools such as pS, ganglia, nagios • Cross-correlate with other measurements (e.g. meshed traceroutes)
PerfSONAR • De-centralised network monitoring • Reduces overhead for us at IEPM to gather network statistics • Unified access to network information • Should enable easier methods to gather and use the network information • However, not all sites may provide the most useful information for our purposes • Define/recommend a base set of MPs? (e.g. ping, traceroute, port up?…) • Middleware platform • Therefore requires applications to prove usefulness of design • Alarm services (event detection), trend analysis etc.
PerfSONAR Interests to SLAC/IEPM • More statistics allow us to better understand Internet performance • Event Diagnosis - pS enables easier gathering of network performance data • Backbone and end-to-end data allow us to corroborate suspicions • First need event detection in order to identify where problems are seen • Grid software development • SLAC will become an LHC ATLAS Tier-2 site • Network Services • Use of network metrics to help replica management, light path reservations etc.
PerfSONAR Questions • Test and possibly extend NMWG schemas to support the metrics that we are interested in • Interface for recurring and scheduled test initialisation • Waiting on AAA? • Conflicting tests? • Porting of our visualisation and analysis tools • Currently decoupling and modularising analysis tools from the IEPM-BW infrastructure • API • Input: use NMWG/pS • Output: Extend perfSONAR API to support ‘alerts’? • Access patterns for data: • We are more interested in gathering large windows of data than individual results • Too slow to gather data dynamically? • Should we cache data locally for our analysis?
PerfSONAR: Installation • Java Version • Relatively easy; however, I have worked with Java and web services in the past • Documentation could do with more detail • What are all the ‘extra’ packages actually for? E.g. eXist • Had to install separately; why couldn’t the perfSONAR install do that? • List of prerequisites/requirements • Machine types • Security requirements/ports opened etc.
PerfSONAR: SQL-MA • Idea was to create an IEPM-BW MA • Provide extra characteristics • Easiest way to enable NMWG-compliant reports • Tests NMWG for our purposes • SQL-MA • All data currently in MySQL tables! • Installation problems • Different snapshots give different errors! • Difficult to get help due to time-zone differences • Security policies at SLAC prevent quick and easy access for non-SLAC users • Class diagrams seem to make sense • Will report on how easy it is to actually get it working!
PerfSONAR: Security Issues • SLAC (DOE) does not allow us to run application servers individually (e.g. ports are blocked) • We are currently deploying pS on a ‘community’ Tomcat installation • Running two instances of Tomcat for LS and MA is not possible for us • SLAC has a ‘prove that you need it’ attitude to allowing external access to network data
Summary • De-centralised management of pS allows us to concentrate more on analysis rather than deployment/maintenance • IEPM would like specific tools that have proven to be the most useful for diagnosis • Latency (connectivity) and traceroute • Extend to other metrics such as throughput etc. • PerfSONAR allows transparent data access • pS enables the unification of both end-to-end and router metric representation • Worry about finding correlations for diagnosis rather than determining ‘how’ to gather the data • Porting of our analysis tools • Test perfSONAR APIs • Provide useful features such as event detection, other UI4 examples etc.