IEPM-BW

IEPM-BW • Warren Matthews (SLAC) • Presented at the UCL Monitoring Infrastructure Workshop, London, May 15-16, 2003.

Overview / Goals • IEPM-BW monitoring and results • Other measurements • Publishing • Troubleshooting Tools • Further work

IEPM-BW • SLAC package for monitoring and analysis • Currently 10 monitoring sites • SLAC, FNAL, GATech (SOX), INFN (Milan), NIKHEF, APAN (Japan) • Manchester, UMich, UCL, Internet2 • 2-36 targets

[Figure: map of IEPM-BW monitoring sites and monitored targets. Labels include sites (e.g. SLAC, FNAL, CERN, KEK, RAL, BNL, CALTECH) and networks (e.g. ESnet, Abilene, CalREN, JAnet, Geant, APAN, CESnet); the legend marks each "Monitoring Site".]

Measurement Engine • Ping, Traceroute • Iperf, Bbftp, Bbcp (mem and disk) • Abwe • Gridftp, UDPmon • Web100 • Passive (netflow)

Other Projects (U.S.) • PingER (SLAC, FNAL) • eJDS (SLAC, ICTP) • AMP (NLANR) • NIMI (ICIR, PSC) • MAGGIE (ICIR, PSC, SLAC, LBL, ANL) • NASA, SCNM (LBL) • Surveyor (Internet2) • E2e PI and PIPES (Internet2) • Also SLAC has a RIPE-TT box

Publishing • Web Service • SOAP::Lite perl module • Python • Java • NMWG • OGSA

Publishing • NMWG Properties document • Path.delay.roundtrip (Demo) • Hop.bandwidth.capacity (tracespeed) • Guthrie (demo) • Almost 1000 nodes in database • PingER Networks • Arena

Advisor Screenshot taken from the talk by Jim Ferguson at the e2e workshop, Miami Feb 2003.

MonALISA • Front-end visualization • Vital component for development of the LHC Computing Model • JINI/JAVA and WSDL/SOAP • demo

Troubleshooting • RIPE-TT Testbox Alarm • AMP Automatic Event Detection • Our approach is diurnal changes

Diurnal Changes (1/4) • Either performance varies during the day • Or it doesn’t • No variation is the special case of variation=0

Diurnal Changes (2/4) • Either performance (within the bin) is variable • Or it isn’t • No variation is the special case of variation=0

Diurnal Changes (3/4) • Parameterize performance in terms of hour and variability within that hourly bin • Measurements can be classified in terms of how they differ from the historical value • Recent problems are flagged because they differ from the historical value • Compare with the measurement in the previous bin to reduce false positives
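The hourly binning described above can be sketched in Python. This is only an illustration: the function names and sample values are hypothetical, and the (weekday, hour) key uses Python's convention of Monday = 0, which may not match the bin labels used in the IEPM-BW logs.

```python
from collections import defaultdict
from datetime import datetime

def hourly_bin(ts):
    """Return a (weekday, hour) key; weekday follows Python (Monday == 0)."""
    return (ts.weekday(), ts.hour)

def group_by_bin(measurements):
    """Group (datetime, value) pairs by hourly bin, oldest first."""
    bins = defaultdict(list)
    for ts, value in sorted(measurements):
        bins[hourly_bin(ts)].append(value)
    return bins

# Hypothetical throughput samples, for illustration only.
history = [
    (datetime(2003, 4, 21, 19, 5), 160.2),   # Monday 7pm-8pm
    (datetime(2003, 4, 28, 19, 10), 158.7),  # Monday 7pm-8pm, one week later
    (datetime(2003, 4, 28, 20, 10), 91.4),   # Monday 8pm-9pm
]
bins = group_by_bin(history)
print(bins[(0, 19)])
```

Keeping each bin as a time-ordered list makes it straightforward to look at only the most recent measurements in that bin.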

Diurnal Changes (4/4) • Calculate the median and standard deviation of the last five measurements in the bin • e.g. Monday 7pm-8pm • “Concerned” if the latest measurement is more than 1 s.d. from the median • “Alarmed” if the latest measurement is more than 2 s.d. from the median
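A minimal Python sketch of the threshold rule above. Several details are assumptions that the slides do not state: the sample standard deviation is used, deviations in either direction count, and a bin with fewer than two prior measurements is treated as normal. The function name and test values are hypothetical.

```python
import statistics

def classify(bin_history, latest, window=5):
    """Classify the latest measurement against the last `window` values in its bin."""
    recent = bin_history[-window:]
    if len(recent) < 2:
        # Assumption: too little history to estimate variability.
        return "Within boundaries"
    median = statistics.median(recent)
    sd = statistics.stdev(recent)  # assumption: sample standard deviation
    deviation = abs(latest - median)
    if deviation > 2 * sd:
        return "Alarm"
    if deviation > sd:
        return "Concern"
    return "Within boundaries"

# Hypothetical history for one bin (e.g. Monday 7pm-8pm).
bin_history = [100, 102, 98, 101, 99]
for value in (100.5, 98, 90):
    print(value, classify(bin_history, value))
```

For throughput, only drops below the median may be of interest; a one-sided test would replace `abs(latest - median)` with `median - latest`.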

Trouble Detection
$ tail maggie.log
04/28/2003 14:58:47 (1:14) gnt4 0.51 Alarm (AThresh=38.33)
04/28/2003 16:25:45 (1:16) gnt4 3.83 Concern (CThresh=87.08)
04/28/2003 17:55:21 (1:17) gnt4 169.57 Within boundaries
Columns: Date and Time, Bin, Node, Throughput (iperf), Status
• Only write to the log if an alarm is triggered • Keep writing to the log until alarm is cleared
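The two logging rules can be sketched as a small state machine in Python. This is an illustration, not the IEPM-BW code: the class name is hypothetical, lines are kept in memory rather than written to maggie.log, and it assumes that a "Concern" also opens an episode and that the measurement which clears the alarm is itself logged (as the "Within boundaries" line in the log suggests).

```python
class AlarmLog:
    """Log measurements only while an alarm episode is open."""

    def __init__(self):
        self.active = False  # True while an episode is open
        self.lines = []      # stand-in for the log file

    def record(self, timestamp, node, value, status):
        abnormal = status != "Within boundaries"
        # Write when an episode starts, continues, or is being cleared.
        if abnormal or self.active:
            self.lines.append(f"{timestamp} {node} {value} {status}")
        self.active = abnormal

log = AlarmLog()
# First and last readings are hypothetical; the middle three come from the log above.
log.record("04/28/2003 13:30:00", "gnt4", 170.00, "Within boundaries")  # not logged
log.record("04/28/2003 14:58:47", "gnt4", 0.51, "Alarm")
log.record("04/28/2003 16:25:45", "gnt4", 3.83, "Concern")
log.record("04/28/2003 17:55:21", "gnt4", 169.57, "Within boundaries")  # clears, logged
log.record("04/28/2003 19:25:00", "gnt4", 168.00, "Within boundaries")  # not logged
print("\n".join(log.lines))
```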

Trouble Status • Tempted to make color-coded web page • All the hard work still left to do • Use knowledge to see common point of failure • Production table would be >> 36x700 • Instead figure out where to flag

Net Rat • Alarm System • Multiple tools • Multiple measurement points • Cross reference • Trigger further measurements • Starting point for human intervention • Informant database • hop.performance • No measurement is ‘authoritative’ • Cannot even believe a measurement

Limitations • Could be over an hour before alarm is generated • More frequent measurements impact the network and measurements overlap • Low impact tools allow finer grained measurement

Where next? • GLUE, OGSA, CIM • Work with Other Projects • Publishing and troubleshooting • Discovery • Security

Toward a Monitoring Infrastructure • Certainly the need • DOE Science Community • Japanese Earth Simulator • Grid • Troubleshooting / E2Epi • Many of the ingredients • Many monitoring projects • PIPES • MAGGIE

Summary “It is widely believed that a ubiquitous monitoring infrastructure is required”.

Links • This talk • IEPM-BW • PingER • ABwE • AMP • NIMI • MAGGIE • RIPE-TT • Surveyor • E2E PI • SLAC Web Services • GGF NMWG • Arena • MonALISA • Advisor • Troubleshooting

Credits • Les Cottrell • Connie Logg, Jerrod Williams • Jiri Navratil • Fabrizio Coccetti • Brian Tierney • Frank Nagy, Maxim Grigoriev • Eric Boyd, Jeff Boote • Vern Paxson, Andy Adams • Iosif Legrand • Jim Ferguson, Steve Englehart • Local admins and other volunteers • DoE/MICS

Demos • This is the output from the “Publishing” Demo on slide 9.
$ more soap_client.pl
#!/usr/local/bin/perl
use SOAP::Lite;
print SOAP::Lite
  -> service('http://www-iepm.slac.stanford.edu/tools/soap/wsdl/profile_0002.wsdl')
  -> hopBandwidthCapacity("brdr.slac.stanford.edu:i2-gateway.stanford.edu");
$ ./soap_client.pl
1000Mb

Demos • This is the output from the “tracespeed” demo on slide 9.
$ ./tracespeed thunderbird.internet2.edu
 0 doris 10Mb
 1 core (134.79.122.32) 1000Mb
 2 brdr (134.79.235.45) 1000Mb
 3 i2-gateway.stanford.edu (192.68.191.83) No Data.
 4 stan.pos.calren2.net (171.64.1.213) No Data.
 5 sunv--stan.pos.calren2.net (198.32.249.73) No Data.
 6 abilene--qsv.pos.calren2.net (198.32.249.162) No Data.
 7 kscyng-snvang.abilene.ucaid.edu (198.32.8.103) No Data.
 8 iplsng-kscyng.abilene.ucaid.edu (198.32.8.80) No Data.
 9 so-0-2-0x1.aa1.mich.net (192.122.183.9) No Data.
10 so-0-0-0x0.ucaid2.mich.net (198.108.90.118) No Data.
11 thunderbird.internet2.edu (207.75.164.95) No Data.

Aside: NetRat (1/5) • If last measurement was within 1 s.d. • Mark each hop as good • hop.performance = good • If last measurement was a “Concern” • Mark each hop as acceptable • If last measurement was an “Alarm” • Mark each hop as poor

Aside: NetRat (2/5) • Measurement generates an alarm • Set each hop.performance = poor

Aside: NetRat (3/5) • Other measurements from same site do not generate alarms. • Set each hop.performance = good • Immediately ruled out problem in local LAN or host machine

Aside: NetRat (4/5) • Different site monitors same target • No alarm is generated • Set each hop.performance = good • Pinpointed possible problem in intermediate network.
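The cross-referencing in the NetRat slides can be sketched in Python as follows. This is a simplified illustration, not the NetRat implementation: the function names and hop names are hypothetical, and it assumes that a healthy measurement over a hop always overrides an earlier "poor" rating.

```python
def rate_hops(alarmed_paths, healthy_paths):
    """Return a hop -> hop.performance rating from alarmed and healthy paths."""
    hop_performance = {}
    for path in alarmed_paths:
        for hop in path:
            hop_performance[hop] = "poor"
    # Assumption: any measurement without an alarm clears the hops it crosses.
    for path in healthy_paths:
        for hop in path:
            hop_performance[hop] = "good"
    return hop_performance

def suspects(hop_performance):
    """Hops still rated poor after cross-referencing."""
    return sorted(h for h, rating in hop_performance.items() if rating == "poor")

# Hypothetical topology: site A raises an alarm measuring the target.
alarmed = [["a-host", "a-border", "transit-1", "target-border", "target-host"]]
healthy = [
    ["a-host", "a-border", "transit-2", "other-target"],             # same site, other target
    ["b-host", "b-border", "transit-3", "target-border", "target-host"],  # other site, same target
]
ratings = rate_hops(alarmed, healthy)
print(suspects(ratings))
```

In this example the local hops are cleared by site A's other measurement and the target's hops by site B, which leaves only the intermediate hop as a starting point for human intervention.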
