The PERT and Network Performance Monitoring

The PERT and Network Performance Monitoring • EVN-NREN • Amsterdam, 28/01/05 • Toby Rodwell, Network Engineer, DANTE

Network Performance Problems • Historically, long distance circuits (the “wide-area”) have been the bottleneck in a network • In recent years, the capacity of long distance circuits has significantly increased • End-to-end performance bottlenecks may now occur at any point in a system – end-system (application, OS, hardware), LAN or WAN • As such, it is becoming increasingly difficult for a non-expert end-user to diagnose their network performance issues

Origins of the PERT • Conception of the PERT … Jan 01 Internet2 Meeting • Performance Enhancement and Response Team • To provide a support structure to investigate and resolve problems in the performance of applications over computer networks • Comparable to CERT structure • Realization of the PERT … Dec 2002 TERENA meeting • GARR, TERENA, DANTE, SWITCH, CESnet, HEAnet and UKERNA committed to a practical trial of a basic PERT

The GEANT PERT • PERT 2002-2004 • Informal, unregulated access to PERT; anybody could request PERT’s help • PERT communicated via e-mail list • Primary purpose of investigation was to improve PERT’s knowledge and experience • Problems were addressed on a best efforts basis • No dedicated monitoring tools • Roundup tracking system (off-the-shelf) used

GEANT2 PERT • A development of the existing PERT • Pilot phase Nov 04 – Feb 05 • Fully operational from Mar 05 • A virtual team consisting of • Case Managers, who receive new requests and manage unresolved issues • Subject specialists, who can be called upon to help resolve complex issues • Monitoring tools • During the course of the GEANT2 project a monitoring infrastructure will be developed and deployed, which should be of particular help with performance troubleshooting

PERT Staff • Case Managers • Part-time staff provided by GEANT2 project participants • On a roster to ensure continuous cover during normal working hours (once PERT fully operational) • Cross-discipline experts who are capable of identifying the locations of performance bottlenecks • Subject Matter Experts (SMEs) • Unfunded volunteers from a potentially wide variety of organizations who provide help on a best efforts basis • Have specialist knowledge in one or more subjects and so can precisely diagnose the cause of a given problem and help the end-users resolve it

Pilot PERT Systems • Issue Tracker • Record of PERT issues (cases) and their investigation • Uses open-source “Roundup” software • Publicly accessible at http://roundup.geant2.net:8080/pert (eVLBI performance case issue4) • PERT Diary • For assessing the performance of the PERT and highlighting issues • Uses TWiki open-source software (user editable website) • Publicly accessible at http://cemp1.switch.ch/cgi-bin/twiki/view/PERTDiary/WebHome

PERT Systems • PERT Ticket System • Similar to Trouble Ticket systems used by NOCs • Optimised for the collaborative nature of PERT investigations (will collect and record e-mails and Instant Messaging threads) • May directly contact SMEs who have expressed interest in a particular subject • Knowledge Base • Known performance issues, with possible ways to address them • Successful diagnostic strategies

Lessons Learned to Date • Identify a technical contact at each end • Determine the scope of testing possible • If production machines are involved, some configuration changes may not be acceptable for testing purposes • Wherever possible, use methods that minimise the number of variables • e.g. sink data to /dev/null, transfer memory to memory rather than to disk
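The memory-to-memory idea in the last bullet can be illustrated with a small Python sketch. It is not a PERT tool: the function names, the 8 MiB default and the use of the loopback address are assumptions made for illustration, and a real test would run the sender and the receiver on the two hosts under investigation. The sender transmits from an in-memory buffer and the receiver discards everything it reads, so disk speed plays no part in the measurement.

```python
import socket
import threading
import time


def _sink(server_sock, result):
    """Accept one connection and discard all received data (like /dev/null)."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # sender closed the connection
                break
            total += len(chunk)
    result["bytes"] = total


def memory_to_memory_test(total_bytes=8 * 1024 * 1024, host="127.0.0.1"):
    """Send total_bytes from memory to a discarding receiver.

    Returns (bytes_received, seconds_elapsed). The loopback host is an
    assumption for illustration only.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=_sink, args=(server, result))
    receiver.start()

    payload = bytes(65536)  # zero-filled buffer held in memory, never read from disk
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(len(payload), total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()  # wait until the receiver has drained the connection
    elapsed = time.perf_counter() - start
    server.close()
    return result["bytes"], elapsed


if __name__ == "__main__":
    received, seconds = memory_to_memory_test()
    print(f"{received * 8 / seconds / 1e6:.0f} Mbps over loopback")
```

Over loopback the figure mainly reflects the end-system itself, which is still useful as a baseline before the network is blamed.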

Contacting the PERT • Normally via NREN • Selected pan-European projects (including EVN) may contact PERT directly • Because the PERT does not provide a 24x7 rapid-response service, suspected network failures are best reported to NREN/GEANT NOCs • E-mail address – pert-report@geant2.net

GEANT Network Monitoring

Monitoring Tools • GEANT status monitoring • 5 minute polling – state of equipment, circuits and services • Failed hardware or circuits detected within 10 minutes and action taken by GEANT NOC, 24x7 • GEANT traffic statistics collection • 5 minute polling of router interface counters (default and customised) • Collected data stored in a Round Robin Database (RRD), which is kept at a constant size by aggregating data as it ages • GEANT traffic statistics display • For quick, real-time view – Weathermap • For back history and specialist counters – Taksometro http://stats.geant.net/
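The two mechanisms on this slide, turning 5-minute counter readings into traffic rates and ageing old data by aggregation, can be sketched in Python as below. This is a simplified illustration and not the code behind the GEANT tools or RRDtool: the function names, the 64-bit counter width and the sample readings are all assumptions.

```python
POLL_INTERVAL_S = 300  # 5-minute polling, as stated on the slide


def rate_bps(prev_octets, curr_octets, interval_s=POLL_INTERVAL_S, counter_bits=64):
    """Average bit rate between two readings of an interface octet counter.

    The modulo handles a single counter wrap-around; the counter width
    is an assumption and must match the counter actually polled.
    """
    delta = (curr_octets - prev_octets) % (1 << counter_bits)
    return delta * 8 / interval_s


def consolidate(samples, factor):
    """Average each run of `factor` samples into one coarser sample.

    This mimics how a round robin database keeps a fixed size: old,
    fine-grained values are replaced by fewer averaged ones.
    """
    out = []
    for i in range(0, len(samples), factor):
        group = samples[i:i + factor]
        out.append(sum(group) / len(group))
    return out


# Hypothetical octet-counter readings taken 5 minutes apart.
readings = [0, 375_000_000, 750_000_000, 1_500_000_000, 2_250_000_000]
rates = [rate_bps(a, b) for a, b in zip(readings, readings[1:])]
print(rates)                  # four 5-minute averages, in bit/s
print(consolidate(rates, 2))  # the same period as two 10-minute averages
```

Averaging keeps the long-term trend but smooths out short bursts, which is why the older history shown by such tools is less detailed than the most recent data.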

Monitoring Tools - Taksometro

Monitoring Tools Taksometro

Monitoring Tools ‘Weathermap’ Kairos

Monitoring Tools ‘Weathermap’ Kairos • Hyperlinked traffic chart

Monitoring Tools – Synagon (GEANT Ops only) • Before • After

Any Questions? Thank you. toby.rodwell@dante.org.uk

Example Case • … from last year. • Project has since moved on, but the sequence of events is still instructive • EVN throughput test • Test the download of a 430MB file from the JIVE website in Dwingeloo to the University of Oxford • Problems with the systems in Oxford, therefore the test was done between JIVE and a GÉANT workstation.

Example Case • Initial transfer test: • Via http, using wget • Took 5 minutes to complete the 430MB transfer (approximately 10Mbps throughput) • PERT case opened • Potential causes • Ethernet interfaces not in full-duplex mode • Insufficiently large TCP buffers
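As a check on the throughput figure, the arithmetic can be written out in Python. The sketch assumes the 430MB file size given for this test and takes 1MB as 10^6 bytes; the function name is illustrative.

```python
def throughput_mbps(size_megabytes, seconds):
    """Average throughput in megabits per second (1 MB taken as 10**6 bytes)."""
    bits = size_megabytes * 1_000_000 * 8
    return bits / seconds / 1_000_000


rate = throughput_mbps(430, 5 * 60)
print(f"{rate:.1f} Mbps")  # prints "11.5 Mbps"
```

The result, about 11.5Mbps, agrees with the slide's rounded figure of approximately 10Mbps.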

Example Case • The maximum TCP receive buffer sizes on the GEANT workstations were already reasonable • wget uses the default TCP buffer size • Default TCP buffer size increased on two receivers (ws4.uk: Linux -> 8MB, ws1.de: Unix -> 196kB) • Dramatic improvement: 40Mbps • Could not access the JIVE webserver to increase the Tx buffer (critical production machine) • Access was granted to the JIVE FTP server, where the Tx buffer was increased to 2MB • Improvement: 90Mbps
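Buffer size matters because a TCP sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by roughly window / RTT. The Python sketch below shows that relationship in both directions. The slides do not state the round-trip time of the tested path, so the 20 ms used here is a made-up value for illustration, as are the function names; the numbers are not a reconstruction of the case.

```python
def max_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput for a given window and round-trip time."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


def required_buffer_bytes(target_mbps, rtt_seconds):
    """Bandwidth-delay product: buffer needed to sustain target_mbps."""
    return target_mbps * 1_000_000 * rtt_seconds / 8


ASSUMED_RTT_S = 0.020  # hypothetical 20 ms round trip, NOT taken from the slides

# Ceiling imposed by a 64 KiB window at the assumed RTT.
print(f"{max_throughput_mbps(64 * 1024, ASSUMED_RTT_S):.1f} Mbps")

# Buffer needed to sustain 90 Mbps at the assumed RTT.
print(f"{required_buffer_bytes(90, ASSUMED_RTT_S):.0f} bytes")
```

Both ends count: the receive buffer limits the window the receiver can advertise, and the transmit buffer limits how much unacknowledged data the sender can hold, which is consistent with the further gain seen once the Tx buffer on the FTP server was raised.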
