
MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance


Presentation Transcript


  1. MAGGIE Monitoring and Analysis for the Global Grid and Internet End-to-end performance. Warren Matthews, Stanford Linear Accelerator Center (SLAC)

  2. Abstract The ambitious distributed computing goals of data-intensive science require careful study of end-to-end performance across the networks involved. Since 1995, the Internet End-to-end Performance Monitoring (IEPM) group at the Stanford Linear Accelerator Center (SLAC) has been tracking connectivity between High Energy and Nuclear Physics (HENP) laboratories and their collaborating Universities and Institutes around the world. In this talk, results from measurements will be presented and long term trends will be discussed. In particular, the development of a large end-to-end performance monitoring infrastructure involving automatic trouble-detection and notification will be featured.

  3. Overview • Motivation for MAGGIE • High Performance Networks • Network Monitoring • Results • Publishing • Trouble-shooting and Fault Finding

  4. Motivation • High Energy and Nuclear Physics • BaBar database contains ~1.5 billion particle physics events – over 750 TB • Increasing at 100 events per second – 8 MBps • 100s of TB exported to BaBar centers and 100s of TB of Monte Carlo simulations imported • LHC will be an order of magnitude larger • Future of HENP is a distributed data grid

  5. More Motivation • Also other data intensive science • Astronomy, genetics • Other demanding applications • High-Res medical scans • Video-on-demand • Other fields • Digital Divide • Malaria Centers in Africa, SARS, AIDS.

  6. High Performance Networks • SLAC has 2xOC12 (622Mbps) connections to Energy Sciences Network (ESnet) and California Research and Education Network (CALREN) • ESnet provides connectivity to labs, commercial and international • CALREN provides connectivity to UC sites and Abilene • High capacity well engineered networks • Bandwidth is required but not sufficient

  7. [Image of the ESnet network, taken from the ESnet web site]

  8. Abilene Backbone • PDF map on the Internet2 web site • [Image taken from the Internet2 web site]

  9. Monitoring Projects (1/2) • Active (and over-active) • PingER/HEP (SLAC, FNAL) • PingER/eJDS (SLAC, ICTP) • AMP and AMP-IPV6 (NLANR) • RIPE-TT (RIPE) • Surveyor (Internet2, Wisconsin) • NASA • IEPM-BW (SLAC, FNAL) • NIMI (ICIR, PSC) • MAGGIE (ICIR, PSC, SLAC, LBL, ANL)

  10. Monitoring Projects (2/2) • Passive • Netflow (Cisco, IETF) • SCNM (LBNL) • IPEX (XIWT, Telcordia) • NetPhysics • Also home-grown system.

  11. End-to-end Monitoring • In reality most projects measure End-to-end performance • End-host effects are unavoidable • Internet2 End-to-end Performance Initiative • Most useful to users • Performance Evaluation System (PIPES) • MAGGIE

  12. MAGGIE • [Architecture diagram. Components labelled: IEPM-BW measurement engine (SLAC, FNAL), NIMI security and scheduling (ICIR, PSC), other tools, analysis engine, fault finding, publishing (NMWG). Participating sites labelled: ICIR, PSC, SLAC, FNAL, ANL, LBNL, UCL, RIPE; related efforts: AMP, SciDAC.]

  13. IEPM-BW • SLAC package for monitoring and analysis • Currently 10 monitoring sites • SLAC, FNAL, GATech (SOX), INFN (Milan), NIKHEF, APAN (Japan) • UMich, Internet2 (Michigan), UManchester, UCL (Both UK) • 2-36 targets

  14. [Network map of monitoring sites and monitored hosts across ESnet, Abilene, CalREN, Geant, JAnet, NNW, CESnet, SOX and APAN (backbone hubs SEA, SNV, CHI, NY, ATL, HSTN, CLV, IPLS). Hosts shown include KEK, LANL, EDG, CERN, NIKHEF, TRIUMF, FNAL, IN2P3, NERSC, ANL, PPDG/GriPhyN, ORNL, RAL, JLAB, UManc, SLAC, UCL, DL, BNL, RIKEN, Stanford, INFN-Roma, INFN-Milan, CALTECH, SDSC, UTDallas, I2, UFL, UMich, Rice and NCSA; monitoring sites are marked.]

  15. Measurement Engine • Ping, Traceroute • Iperf, Bbftp, Bbcp (mem and disk) • ABwE • GridFTP, UDPmon • Web100 • Passive (Netflow)
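
As a rough illustration of how a measurement engine drives tools like those listed above, the sketch below (not IEPM-BW code; the default target host and the output parsing are assumptions) runs ping against a host and summarizes the round-trip times:

  #!/usr/bin/perl
  # Hedged sketch: wrap one active measurement tool (ping) and summarize its output.
  use strict;
  use warnings;

  my $target = shift @ARGV || 'www-iepm.slac.stanford.edu';   # illustrative default
  my @output = `ping -c 5 $target 2>&1`;

  my @rtts;
  for my $line (@output) {
      # Typical Linux ping reply: "64 bytes from ...: icmp_seq=1 ttl=53 time=12.3 ms"
      push @rtts, $1 if $line =~ /time=([\d.]+)\s*ms/;
  }

  if (@rtts) {
      my $sum = 0;
      $sum += $_ for @rtts;
      printf "%s: %d replies, mean RTT %.2f ms\n", $target, scalar(@rtts), $sum / @rtts;
  } else {
      print "$target: no replies (host down or ICMP blocked?)\n";
  }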

  16. The PingER project has been tracking ping times to HEP collaborators since early 1995

  17. Traffic • Typically, Internet traffic is 70% HTTP

  18. Conclusions from IEPM-BW • Bbftp vs bbcp => Implementation • Iperf vs bbftp => Disk, CPU • Packet loss < 0.1% • TCP/IP parameters must be tuned (see the sketch below) • Web100 • FAST, Tsunami • LSR
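
To illustrate the "TCP/IP parameters must be tuned" point, here is a minimal sketch (not from the talk; the 4 MB buffer size is only illustrative) that requests larger socket buffers so the TCP window can fill a high bandwidth-delay-product path:

  #!/usr/bin/perl
  # Hedged sketch: enlarge the TCP socket buffers, one of the parameters that must
  # be tuned for high bandwidth-delay-product paths. Sizes are illustrative and are
  # still capped by the operating system's own limits.
  use strict;
  use warnings;
  use IO::Socket::INET;
  use Socket qw(SOL_SOCKET SO_SNDBUF SO_RCVBUF);

  my $sock = IO::Socket::INET->new(Proto => 'tcp') or die "socket: $!";
  $sock->setsockopt(SOL_SOCKET, SO_SNDBUF, 4 * 1024 * 1024) or die "SO_SNDBUF: $!";
  $sock->setsockopt(SOL_SOCKET, SO_RCVBUF, 4 * 1024 * 1024) or die "SO_RCVBUF: $!";
  printf "send buffer: %d bytes\n", $sock->getsockopt(SOL_SOCKET, SO_SNDBUF);
  printf "recv buffer: %d bytes\n", $sock->getsockopt(SOL_SOCKET, SO_RCVBUF);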

  19. Publishing • Usual method is on the web • Too much to review frequently • Also time delay • Want to resolve problems before users complain • Alarm System based on Web Services • GGF NMWG/OGSA

  20. Demo • Web service is fully described by WSDL • http://www-iepm.slac.stanford.edu/tools/soap/MAGGIE.html • Path.delay.oneWay (Demo)

  21. Troubleshooting • RIPE-TT Testbox Alarm • AMP Automatic Event Detection • Our approach is based on diurnal changes

  22. Diurnal Changes (1/2) • Parameterize performance in terms of the hour of the day and the variability within that hourly bin • e.g. the median and standard deviation of measurements taken on Mondays 7pm-8pm • AMP uses the mean and variance • RIPE-TT uses a rolling average and breaks the day into 4 periods

  23. Diurnal Changes (2/2) • Measurements can be classified in terms of how they differ from historical value • “Concerned” if latest measurement is more than 1 s.d. from median • “Alarmed” if latest measurement is more than 2 s.d. from median • Recent problems are flagged due to difference from historical value • Compare to measurement in previous bin (e.g. Monday 6pm-7pm) to reduce false-positives
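
A minimal sketch of this classification, assuming the per-bin history (median and standard deviation for each day-of-week/hour bin) has already been computed; the bin layout and the numbers are made up, only the 1 s.d. / 2 s.d. thresholds come from the slides:

  #!/usr/bin/perl
  # Hedged sketch: classify the latest measurement against its hourly bin.
  # %history maps "day:hour" bins to a median and standard deviation of past
  # measurements; the values here are invented for illustration.
  use strict;
  use warnings;

  my %history = (
      'Mon:19' => { median => 85.0, sd => 12.0 },   # Monday 7pm-8pm
      'Mon:18' => { median => 90.0, sd => 10.0 },   # previous bin, Monday 6pm-7pm
  );

  # "Concern" beyond 1 s.d. from the median, "Alarm" beyond 2 s.d.
  sub classify {
      my ($value, $bin) = @_;
      my $dev = abs($value - $bin->{median}) / $bin->{sd};
      return $dev > 2 ? 'Alarm' : $dev > 1 ? 'Concern' : 'Within boundaries';
  }

  my $latest = 40.0;                                # latest measurement (illustrative)
  my $status = classify($latest, $history{'Mon:19'});

  # One reading of the false-positive check: if the value would be ordinary
  # for the previous bin, do not raise an alarm.
  if ($status ne 'Within boundaries'
      and classify($latest, $history{'Mon:18'}) eq 'Within boundaries') {
      $status = 'Within boundaries';
  }
  print "latest=$latest status=$status\n";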

  24. Limitations • Could be over an hour before an alarm is generated • Need more frequent but sufficiently low-impact measurements to allow finer-grained troubleshooting • Migrating to ABwE

  25. Trouble Detection
  $ tail maggie.log
  04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)
  04/28/2003 16:25:45 (1:16) gnt4 3.83 Concern (CThresh=87.08)
  04/28/2003 17:55:21 (1:17) gnt4 169.57 Within boundaries
  Fields: date and time, bin, node, throughput (iperf), status
  • Only write to the log if an alarm is triggered
  • Keep writing to the log until the alarm is cleared
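
For illustration, a small sketch (not part of IEPM-BW; the file name and the regular expression are assumptions) that reads maggie.log in the format above and reports nodes whose most recent entry is still a Concern or an Alarm:

  #!/usr/bin/perl
  # Hedged sketch: parse maggie.log lines such as
  #   04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)
  # keeping only the latest entry per node.
  use strict;
  use warnings;

  my %latest;   # node => [date and time, throughput, status]
  open my $log, '<', 'maggie.log' or die "maggie.log: $!";
  while (my $line = <$log>) {
      if ($line =~ m{^(\S+ \S+) \((\d+:\d+)\) (\S+) ([\d.]+) (Alarm|Concern|Within boundaries)}) {
          my ($when, $bin, $node, $tput, $status) = ($1, $2, $3, $4, $5);
          $latest{$node} = [$when, $tput, $status];
      }
  }
  close $log;

  for my $node (sort keys %latest) {
      my ($when, $tput, $status) = @{ $latest{$node} };
      print "$node: $status (throughput $tput at $when)\n" if $status ne 'Within boundaries';
  }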

  26. Trouble Status • Tempted to make a color-coded web page • All the hard work is still left to do • Use knowledge to see common points of failure • A production table would be very large • Instead figure out where to flag problems

  27. Net Rat • Inform on possible problem locations • Starting point for human intervention • No measurement is ‘authoritative’ • Cannot even fully trust a single measurement • Multiple tools and multiple measurement points: cross-reference • Trigger further measurements (NIMI)

  28. Net Rat Methodology (1/4) • If the last measurement was within 1 s.d. • Mark each hop as good • Hop.performance = good • If the last measurement was a “Concern” • Mark each hop as acceptable • If the last measurement was an “Alarm” • Mark each hop as poor

  29. Net Rat Methodology (2/4) • Measurement generates an alarm • Set each hop.performance = poor

  30. Net Rat Methodology (3/4) • Other measurements from same site do not generate alarms. • Set each hop.performance = good • Immediately ruled out problem in local LAN or host machine

  31. Net Rat Methodology (4/4) • Different site monitors same target • No alarm is generated • Set each hop.performance = good • Pinpointed possible problem in intermediate network. • Of course it couldn’t be that simple
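
A rough sketch of how these rules might be combined (the host names, paths and data layout are assumptions, not the Net Rat implementation): an alarmed measurement marks every hop on its path poor, an alarm-free measurement marks its hops good, and any hop left poor is a suspected problem location:

  #!/usr/bin/perl
  # Hedged sketch of the Net Rat cross-referencing idea. Paths, host names and
  # statuses are invented for illustration.
  use strict;
  use warnings;

  my @measurements = (
      { path => [qw(slac-gw esnet-snv esnet-chi cern-gw target-a)],  status => 'Alarm' },
      { path => [qw(slac-gw esnet-snv esnet-sea nersc-gw target-b)], status => 'OK'    },
      { path => [qw(fnal-gw esnet-aoa cern-gw target-a)],            status => 'OK'    },
  );

  # Alarms mark hops poor; any alarm-free path through a hop overrides it to good.
  my %hop;
  for my $m (@measurements) {
      for my $h (@{ $m->{path} }) {
          if ($m->{status} eq 'Alarm') {
              $hop{$h} //= 'poor';
          } else {
              $hop{$h} = 'good';
          }
      }
  }

  my @suspects = sort grep { $hop{$_} eq 'poor' } keys %hop;
  print @suspects
      ? "Possible problem location(s): @suspects\n"    # here: esnet-chi
      : "No hop is implicated by the current measurements\n";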

  32. Arena • Report findings to informant database • Internet2 Arena database • PingER Nodes database • PIPES Culprit/Contact Database

  33. Toward a Monitoring Infrastructure • Certainly the need • DOE Science Community • Grid • Troubleshooting / E2Epi • Many of the ingredients • Many monitoring projects • Many tools • PIPES • MAGGIE

  34. Summary “It is widely believed that a ubiquitous monitoring infrastructure is required”.

  35. [Diagram of the links between the monitoring infrastructure components: IEPM-BW, ESnet, ABwE, AMP, NIMI, RIPE-TT, E2E PI, SLAC web services, GGF NMWG, Arena, troubleshooting.]

  36. Credits • Les Cottrell • Connie Logg, Jerrod Williams • Jiri Navratil • Fabrizio Coccetti • Brian Tierney • Frank Nagy, Maxim Grigoriev • Eric Boyd, Jeff Boote • Vern Paxson, Andy Adams • Iosif Legrand • Jim Ferguson, Steve Englehart • Local admins and other volunteers • DoE/MICS

  37. Output from the demo on slide 20

  #!/usr/bin/perl
  use SOAP::Lite;
  my $answer = SOAP::Lite
      -> service('http://www-iepm.slac.stanford.edu/tools/soap/wsdl/profile_07.wsdl')
      -> pathDelayOneWay("tt81.ripe.net:tt28.ripe.net","");
  print $answer->{NetworkTestTool}->{toolName},"\n";
  print $answer->{NetworkTestInfo}->{time},"\n";
  print $answer->{NetworkPathDelayStatistics}->{value},"\n";

  % ./soap_client.pl
  ripe-tt
  20030628215739.91155397892
  0.075347
