
SLAC and PerfSONAR

SLAC and PerfSONAR. Yee-Ting Li, PerfSONAR developers workshop, October 2006. SLAC IEPM: SLAC used to be primarily a High Energy Particle Physics institute and is now beginning to diverge into other sciences, such as Photon Science (SSRL and LCLS), with impact on chemistry and molecular biology.


Presentation Transcript


  1. SLAC and PerfSONAR • Yee-Ting Li • PerfSONAR developers workshop, October 2006

  2. SLAC IEPM • SLAC used to be primarily a High Energy Particle Physics institute • Now beginning to diverge into other sciences • Photon Science (SSRL and LCLS) • Impact on chemistry and molecular biology • First US-based web page at SLAC! • Internet End-to-end Performance Monitoring Group • Focus on problem detection and long-term performance/trend analysis • Origins in PingER monitoring • Currently deploying more intrusive IEPM-BW tests

  3. PingER • PingER project originally (1995) for measuring network performance for the US, European and Japanese HEP community • Extended this century to measure the Digital Divide • Last year added monitoring sites in S. Africa, Pakistan & India • Uses ICMP to determine: • RTT • Loss • Connectivity • Derived TCP throughput, i.e. proportional to 1/sqrt(loss)
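The "derived TCP throughput" bullet can be illustrated with the well-known Mathis et al. approximation, in which throughput scales as MSS / (RTT * sqrt(loss)). The sketch below is only an illustration of that relationship: the function name, the default MSS of 1460 bytes and the constant sqrt(3/2) are assumptions, not values taken from the slides.

```python
import math


def derived_tcp_throughput_bps(rtt_s, loss, mss_bytes=1460, c=math.sqrt(1.5)):
    """Mathis-style estimate of achievable TCP throughput in bits per second.

    rtt_s     -- round-trip time in seconds (as measured by ping)
    loss      -- packet loss probability, 0 < loss <= 1
    mss_bytes -- assumed maximum segment size (hypothetical default)
    c         -- assumed proportionality constant, sqrt(3/2)
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss <= 1:
        raise ValueError("loss must be in (0, 1]")
    # Throughput falls with RTT and with the square root of the loss rate.
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


if __name__ == "__main__":
    # 100 ms RTT with 1% loss
    print(derived_tcp_throughput_bps(0.1, 0.01))
```

Because of the square root, cutting loss by a factor of four only doubles the estimate, while halving the RTT doubles it directly.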

  4. PingER: Deployment • ~120 countries • 99% of the world’s connected population • 35 monitor sites in 14 countries • Over 600 nodes currently being monitored worldwide

  5. PingER: Digital Divide • Years behind Europe: • 6 yrs: Russia, Latin America • 7 yrs: Mid-East, SE Asia • 10 yrs: South Asia • 11 yrs: Cent. Asia • 12 yrs: Africa

  6. IEPM-BW • Developed as an exhibit for SC2001 • Conducts tests using various tools • Achievable BW: iperf, thrulay • Estimated BW: pathchirp, pathload, abwe • File Transfer: bbcp, bbftp, gridftp • Latency/Loss: ping, traceroute, owamp • MySQL backend with Web-based front end • Collection of scripts to: • Start/stop daemons • Conduct analysis (and produce web-accessible graphs) • Forecasting and Event detection (and notification)

  7. IEPM-BW: Deployment • Running at CERN, SLAC, FNAL, BNL, Caltech, Taiwan to about 40 remote sites (in a semi-mesh) • 40 target hosts in 13 countries • Bottlenecks vary from 0.5 Mbits/s to 10 Gbits/s • Traverse ~50 ASes, 15 major Internet providers • 5 targets at PoPs, rest at end sites

  8. IEPM-BW: Presentation • Timeseries plots

  9. IEPM-BW: Presentation • Diurnal Plots

  10. IEPM-BW: Presentation

  11. IEPM-BW: Presentation • CDF Diagrams

  12. IEPM-BW: Topology • Topology

  13. IEPM-BW: Event Detection • Automated problem identification: • Administrators cannot review hundreds of graphs each day • Alerts for network administrators • Changes in time-series, loss, latency, iperf, SNMP • Alerts for systems people • OS/Host metrics • Anomalies for security • Anomalous event detection • A series of no measurements (network out?) • Determine that something ‘wrong’ has happened; measured value significantly differs from expected value • Forecasts • Given trends in previous measurements, determine what is within tolerance of being ‘okay’

  14. Event Detection: Plateau • Circular (history) buffer of observations • Define a trigger buffer of results • Trigger buffer fills if an observation deviates significantly from the mean of the circular buffer (e.g. falls below history mean - 2 * stdev) • Event occurs when the trigger buffer exceeds a threshold • Filters: • If (mh - mt) / mh > D and the trigger buffer is 90% full in the last T mins, then raise an event (mh = history mean, mt = trigger mean) • Move trigger buffer to history buffer • Parameters: history length = 1 day, t = trigger length = 3 hours, standard deviations = 2 • [Figure: observations over time, with history mean, history mean - 2 * stdev, trigger % full and events marked]
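The plateau idea on this slide can be sketched roughly as follows. This is a simplified illustration, not the IEPM-BW implementation: the class name, the count-based buffer lengths (the slide uses time windows of 1 day and 3 hours), the `min_drop` default and the rule that an in-range observation drains one entry from the trigger buffer are all assumptions. The "90% in the last T mins" filter is left out.

```python
from collections import deque
from statistics import mean, pstdev


class PlateauDetector:
    """Simplified plateau detector: history buffer plus trigger buffer."""

    def __init__(self, history_len=100, trigger_len=10, n_stdev=2.0, min_drop=0.2):
        self.history = deque(maxlen=history_len)  # circular buffer of "normal" data
        self.trigger = []                         # recent suspicious observations
        self.trigger_len = trigger_len
        self.n_stdev = n_stdev                    # deviation threshold in stdevs
        self.min_drop = min_drop                  # assumed threshold on (mh - mt) / mh

    def observe(self, value):
        """Feed one observation; return True when an event is declared."""
        if len(self.history) < 2:
            self.history.append(value)            # still warming up
            return False
        m_h = mean(self.history)
        s_h = pstdev(self.history)
        if value < m_h - self.n_stdev * s_h:
            self.trigger.append(value)            # significant drop: fill trigger buffer
        else:
            self.history.append(value)
            if self.trigger:
                self.trigger.pop(0)               # assumption: normal data drains it
            return False
        if len(self.trigger) < self.trigger_len:
            return False
        m_t = mean(self.trigger)
        is_event = m_h > 0 and (m_h - m_t) / m_h > self.min_drop
        # The new plateau becomes the baseline: move trigger data into history.
        self.history.extend(self.trigger)
        self.trigger.clear()
        return is_event
```

Feeding it a steady throughput series followed by a sustained drop raises a single event once the trigger buffer fills; isolated outliers drain away without raising one.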

  15. Event Detection: K-S • For each observation: compare the previous 100 observations with the next 100 observations • Compare the vertical difference in CDFs • How does it differ from random CDFs? • Expressed as % difference • Define threshold for % difference
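A minimal sketch of the windowed K-S comparison: for each point, the largest vertical gap between the empirical CDFs of the preceding and following windows is computed and compared with a threshold. The function names and the default threshold are hypothetical, and the comparison against random CDFs mentioned on the slide is not modelled here.

```python
from bisect import bisect_right


def ks_distance(before, after):
    """Maximum vertical gap between the empirical CDFs of two samples (0..1)."""
    a, b = sorted(before), sorted(after)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_events(series, window=100, threshold=0.7):
    """Indices where the windows before and after differ by more than threshold."""
    events = []
    for i in range(window, len(series) - window + 1):
        d = ks_distance(series[i - window:i], series[i:i + window])
        if d > threshold:
            events.append(i)
    return events
```

Multiplying the distance by 100 gives the "% difference" form used on the slide. Note that this approach needs `window` observations after the point of interest, so detection lags the change by that much.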

  16. Event Detection: Holt-Winters • Use Holt-Winters (H-W) technique: • Uses triple exponentially weighted moving average • Three parameters (α, β, γ) that take into account local smoothing, long term seasonal smoothing, and trends respectively • Choose parameters by minimizing $(1/N)\sum_t (F_t - y_t)^2$ • $F_t$ = forecast for time t as a function of the parameters, $y_t$ = observation at time t • H-W is a forecasting technique; need to complement it with a method to identify events • If a percentage of residuals are outside twice the EWMA of absolute deviation, then generate event (HWE) • Apply Plateau on H-W residuals (PHR) and K-S on H-W residuals (KHR)
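The forecasting step and the mean-squared-error objective can be sketched with the textbook additive Holt-Winters recursion. The function names and the initialisation from the first two seasons are assumptions. The code follows the common convention that `alpha` smooths the level, `beta` the trend and `gamma` the seasonal term, which may not match the ordering used on the slide.

```python
def holt_winters_forecasts(y, period, alpha, beta, gamma):
    """One-step-ahead additive Holt-Winters forecasts for y[period:]."""
    if len(y) < 2 * period:
        raise ValueError("need at least two full seasons")
    # Assumed initialisation: first-season mean, season-to-season slope,
    # and first-season deviations as the seasonal profile.
    level = sum(y[:period]) / period
    trend = (sum(y[period:2 * period]) - sum(y[:period])) / period ** 2
    season = [v - level for v in y[:period]]
    forecasts = []
    for t in range(period, len(y)):
        s = season[t % period]
        forecasts.append(level + trend + s)  # forecast made before seeing y[t]
        prev_level = level
        level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - level) + (1 - gamma) * s
    return forecasts


def mean_squared_error(y, period, alpha, beta, gamma):
    """Objective to minimise when choosing parameters: mean of (F_t - y_t)^2."""
    f = holt_winters_forecasts(y, period, alpha, beta, gamma)
    actual = y[period:]
    return sum((ft - yt) ** 2 for ft, yt in zip(f, actual)) / len(f)
```

Parameters could then be chosen by evaluating `mean_squared_error` over a grid of candidate values and keeping the smallest; the residuals `y[t] - F_t` are what the HWE, PHR and KHR variants would inspect.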

  17. Event Detection: Holt-Winters

  18. Event Diagnosis • Once we get alert(s) of Events, how do we correlate them to diagnose problems? • Define heuristics of ‘effect and cause’ • Define probabilities to pin-point the location of the problem • First pass: narrow down where the problem occurs at a high level • End-host or network? • Next step: define heuristics for the location of problems in a network path and subsystems on hosts • Interrogate using tools such as pS, ganglia, nagios • Cross-correlate with other measurements (e.g. meshed traceroutes)

  19. PerfSONAR • De-centralised network monitoring • Reduces overhead for us at IEPM to gather network statistics • Unified access to network information • Should enable easier methods to gather and use the network information • However, not all sites may provide the most useful information for our purposes • Define/recommend a base set of MPs? (e.g. ping, traceroute, port up?…) • Middleware platform • Therefore requires applications to prove usefulness of design • Alarm services (event detection), trend analysis etc.

  20. PerfSONAR Interests to SLAC/IEPM • More statistics allow us to better understand Internet performance • Event Diagnosis - pS enables easier gathering of network performance data • Backbone and End-to-end allows us to corroborate suspicions • First need event detection in order to identify where problems are seen • Grid software development • SLAC will become an LHC ATLAS Tier-2 site • Network Services • Use of network metrics to help replica management, light path reservations etc.

  21. PerfSONAR Questions • Test and possibly extend NMWG schemas to support the metrics that we are interested in • Interface for recurring and scheduled test initialisation • Waiting on AAA? • Conflicting tests? • Porting of our visualisation and analysis tools • Currently untying and modularising analysis tools from the IEPM-BW infrastructure • API • Input: use NMWG/pS • Output: Extend perfSONAR API to support ‘alerts’? • Access patterns for data: • We are more interested in gathering large windows of data rather than individual results • Too slow to gather data dynamically? • Should we cache data locally for our analysis?

  22. PerfSONAR: Installation • Java Version • Relatively easy; however, I have worked with Java and web services in the past • Documentation could do with more detail • What are all the ‘extra’ packages actually for? E.g. eXist • Had to install separately; why couldn’t the perfSONAR install do that? • List of prerequisites/requirements • Machine types • Security requirements/ports opened etc.

  23. PerfSONAR: SQL-MA • Idea was to create an IEPM-BW MA • Provide extra characteristics • Easiest way to enable NMWG-compliant reports • Tests NMWG for our purposes • SQL-MA • All data currently in MySQL tables! • Installation problems • Different snapshots give different errors! • Difficult to get help due to time-zone differences • Security policies at SLAC prevent quick and easy access to non-SLAC users • Class diagrams seem to make sense • Will report on how easy it is to actually get it working!

  24. PerfSONAR: Security Issues • SLAC (DOE) does not allow us to run application servers individually (e.g. ports are blocked) • We are currently deploying pS on a ‘community’ Tomcat installation • Running two instances of Tomcat for LS and MA is not possible for us • SLAC has a ‘prove that you need it’ attitude to allowing external access to network data

  25. Summary • De-centralised management of pS allows us to concentrate more on analysis rather than deployment/maintenance • IEPM would like specific tools that have proven to be the most useful for diagnosis • Latency (connectivity) and traceroute • Extend to other metrics such as throughput etc. • PerfSONAR allows transparent data access • pS enables the unification of both end-to-end and router metric representation • Worry about finding correlations for diagnosis rather than determining ‘how’ to gather the data • Porting of our analysis tools • Test perfSONAR APIs • Provide useful features such as event detection, other UI4 examples etc.
