Network Monitoring in the BaBar Experiment
S. Luitz, D. Millsom, D. Salomoni
CHEP2000 - Padova, February 2000
Summary
• The BaBar Data Acquisition Network
• A Typical Scenario...
• Traffic Monitoring and Recording
• Traffic Dump Analysis Tools
• Real-Time Analysis of Traffic
• Conclusions and Outlook
The BaBar Data Acquisition Network (1)
• ca. 200 VME single-board computers (VxWorks): 100 Mbit/s full-duplex Ethernet
• 78 Sun Ultra 5 "farm" workstations for Level-3 trigger and fast monitoring: 2 x 100 Mbit/s full-duplex Ethernet each ("dual-homed")
• 5 Sun Ultra 60 application servers (e.g. run control): 100 Mbit/s full-duplex Ethernet
• 15 Sun Ultra 5 display console machines: 10 or 100 Mbit/s Ethernet
The BaBar Data Acquisition Network (2)
• 1 Sun E 450 (4 CPUs, 780 Gbyte RAID) central boot/NFS/database/data buffer server: 2 x 1 Gbit/s Ethernet
• Various development and user workstations
• 3 Cisco Cat 5500 switches
• 2 VLANs / IP subnets:
  • dedicated real-time DAQ network (35-40 MByte/s)
  • general-purpose / data transfer network
A Typical Scenario
• Problem:
  • Shift crew reports at 23:50: "Run control server problem ca. 45 min ago"
  • A look at the system logs shows NFS timeouts at 23:08 but no network-related events (like spanning tree reconfigurations)
  • Central network monitoring shows "normal" traffic
• What was going on? Did someone/something overload the NFS server? Database access? ...?
• Server-based performance monitoring is very poor!
• Wouldn't it be nice to be able to take a close look at the network traffic around 23:05?
Traffic Monitoring and Recording (1)
• We can! Even with free software tools!
• Configure the switch to forward all traffic in the BaBar general-purpose VLAN/subnet to a monitoring port (SPAN)
• Standard protocol analyzers are no good: small buffers, and what to trigger on?
• Sun E 250 with 72 Gbyte disk and Gigabit Ethernet as traffic recorder and protocol analyzer
• Record packet headers into a "circular" disk buffer
Traffic Monitoring and Recording (2)
• Use tcpdump (ftp://ftp.ee.lbl.gov) to capture packet headers and write them to files
• In our environment:
  • We can't monitor the real-time network: the switch backplane capacity could be exceeded at peak
  • We have 3 switches, but at present we monitor only the switch to which the file server is connected
• Typical captured data rate during normal operation: 4 Gbytes / hour
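The "circular" disk buffer and the quoted capture rate together determine how far back in time the recorder can look. The following Python sketch works through that arithmetic and shows one possible way to evict the oldest dump files. The function names, the per-file bookkeeping and the idea of splitting the capture into separate files are assumptions made for illustration; the slides do not describe how the buffer was actually implemented.

```python
from collections import deque

DISK_GBYTES = 72          # recorder disk size quoted on the slides
RATE_GBYTES_PER_HOUR = 4  # typical capture rate quoted on the slides


def retention_hours(disk_gbytes, rate_gbytes_per_hour, reserve_fraction=0.0):
    """Upper bound on the history kept, assuming a constant capture rate.

    reserve_fraction is a hypothetical share of the disk kept free for
    the operating system and for analysis output.
    """
    usable = disk_gbytes * (1.0 - reserve_fraction)
    return usable / rate_gbytes_per_hour


def add_to_ring(ring, new_file, capacity):
    """Append (name, size) to the ring and evict the oldest files on overflow.

    ring is a deque of (name, size) tuples, oldest first. The names of the
    evicted files are returned so that the caller can delete them from disk.
    The newest file is never evicted.
    """
    ring.append(new_file)
    evicted = []
    while sum(size for _, size in ring) > capacity and len(ring) > 1:
        evicted.append(ring.popleft()[0])
    return evicted
```

With the numbers from the slides, `retention_hours(DISK_GBYTES, RATE_GBYTES_PER_HOUR)` gives 18 hours as an upper bound, which would comfortably cover a problem reported 45 minutes after the fact.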
Analysis Tools (1)
• How to look at Gigabytes of recorded network data?
• Use tcpdump to filter the dump file (e.g. "host bbr-srv02 and host bbr-srv03 and port nfs") into a smaller file
• Use tcpslice (ftp://ftp.ee.lbl.gov) to isolate time intervals from the dump files
• Use tcptrace (http://jarok.cs.ohiou.edu/software/tcptrace/tcptrace.html) to automatically analyze TCP connections and plot throughput graphs
• Look at low-rate events directly with tcpdump
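To make the time-slicing step concrete, here is a minimal Python sketch of the idea behind isolating a time interval from a dump file. It is not tcpslice's implementation. It assumes the classic pcap file layout (24-byte global header followed by records with a 16-byte header) in little-endian byte order with microsecond timestamps; a dump written on a big-endian machine would need the byte order swapped. The names `slice_pcap` and `make_record` are hypothetical.

```python
import struct

# Classic pcap layout (assumed): magic, version major/minor, thiszone,
# sigfigs, snaplen, link type.
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")
# Per-packet record header: ts_sec, ts_usec, captured length, original length.
PCAP_RECORD_HEADER = struct.Struct("<IIII")
PCAP_MAGIC = 0xA1B2C3D4


def slice_pcap(data, start, end):
    """Return pcap bytes holding only the records with start <= t < end.

    data is the full content of a dump file; start and end are Unix
    timestamps in seconds.
    """
    magic = PCAP_GLOBAL_HEADER.unpack_from(data, 0)[0]
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian microsecond pcap file")
    out = [data[:PCAP_GLOBAL_HEADER.size]]
    offset = PCAP_GLOBAL_HEADER.size
    while offset + PCAP_RECORD_HEADER.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = PCAP_RECORD_HEADER.unpack_from(
            data, offset)
        record_end = offset + PCAP_RECORD_HEADER.size + incl_len
        if record_end > len(data):
            break  # truncated final record, stop here
        if start <= ts_sec + ts_usec / 1e6 < end:
            out.append(data[offset:record_end])
        offset = record_end
    return b"".join(out)


def make_record(ts_sec, ts_usec, payload):
    """Build one pcap record (helper for experiments and tests)."""
    header = PCAP_RECORD_HEADER.pack(ts_sec, ts_usec, len(payload), len(payload))
    return header + payload
```

Because only the record headers are inspected, a pass like this stays cheap even on multi-gigabyte dumps, although a real tool would stream the file instead of holding it in memory.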
Analysis Tools (2)
• Sample tcptrace output for a connection (NFS); slide annotation: port 2049 on host h is the NFS port on the server

    TCP connection 4:
        host g:        BBR-SRV03.SLAC.Stanford.EDU:32769
        host h:        BBR-SRV02.SLAC.Stanford.EDU:2049
        complete conn: yes
        first packet:  Fri Jan 28 23:24:35.019938 2000
        last packet:   Fri Jan 28 23:24:35.027876 2000
        elapsed time:  0:00:00.007938
        total packets: 11
        filename:      srv02srv03.dump
      g->h:                            h->g:
        total packets:           6       total packets:           5
        ack pkts sent:           5       ack pkts sent:           5
        pure acks sent:          3       pure acks sent:          2
        unique bytes sent:      44       unique bytes sent:      28
        actual data pkts:        1       actual data pkts:        1
        actual data bytes:      44       actual data bytes:      28
        data xmit time:      0.000 secs  data xmit time:      0.000 secs
        idletime max:          4.4 ms    idletime max:          4.1 ms
        throughput:           5543 Bps   throughput:           3527 Bps

• Not much happened!
• Much more info available, edited to fit ...
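The throughput figures in this sample are consistent with dividing the unique bytes sent in each direction by the elapsed time of the connection ("Bps" being bytes per second). The short Python check below reproduces them from the timestamps shown; the function name `throughput_bps` is chosen here for illustration, and the formula is inferred from the numbers on the slide rather than taken from the tcptrace documentation.

```python
from datetime import datetime

# First and last packet times from the sample output (Fri Jan 28 2000).
first_packet = datetime(2000, 1, 28, 23, 24, 35, 19938)
last_packet = datetime(2000, 1, 28, 23, 24, 35, 27876)

# Connection duration in seconds.
elapsed = (last_packet - first_packet).total_seconds()


def throughput_bps(unique_bytes, elapsed_seconds):
    """Average throughput in bytes per second over the whole connection."""
    return unique_bytes / elapsed_seconds


g_to_h = throughput_bps(44, elapsed)  # host g -> host h
h_to_g = throughput_bps(28, elapsed)  # host h -> host g
```

Rounded to whole bytes per second, the two values match the 5543 Bps and 3527 Bps printed on the slide.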
Analysis Tools (3)
• Figure: throughput between two hosts
• Yellow dots: instantaneous rate; quantization is due to the time resolution of the packet timestamps (Gbit!)
• Red line: averaged rates
Analysis Tools (4)
• The network dump can answer, for example, the following questions (and many more):
  • Who (UID, GID) has read the 25 Gbyte data file over NFS?
  • Were NFS timeouts correlated with a high NFS transaction volume/rate?
  • Which hosts were accessing the file server?
  • Do we have hosts/software with configuration problems? (Wrong subnet masks, applications using incorrect subnet broadcast addresses)
• However, the analysis of the files is complicated; we'd like to have better tools!
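As an illustration of the configuration-problem question, the Python sketch below flags senders that use an unexpected IP broadcast address. It assumes that (source, destination) address pairs of frames sent to the Ethernet broadcast address have already been extracted from the dump by other means; the function name and the example addresses (from the 192.0.2.0/24 documentation range) are hypothetical and not taken from the BaBar network.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def wrong_broadcasts(broadcast_packets, subnet):
    """Return the (src, dst) pairs whose IP destination is not a valid
    broadcast address for the given subnet.

    broadcast_packets: iterable of (src_ip, dst_ip) strings taken from
    frames that were sent to the Ethernet broadcast address.
    subnet: the correct subnet in CIDR notation, e.g. "192.0.2.0/24".
    """
    net = ipaddress.ip_network(subnet)
    allowed = {net.broadcast_address, LIMITED_BROADCAST}
    suspects = []
    for src, dst in broadcast_packets:
        if ipaddress.ip_address(dst) not in allowed:
            suspects.append((src, dst))
    return suspects


# Hypothetical example: the second sender behaves as if the mask were /25,
# the fourth as if it were /16.
example = [
    ("192.0.2.10", "192.0.2.255"),
    ("192.0.2.11", "192.0.2.127"),
    ("192.0.2.12", "255.255.255.255"),
    ("192.0.2.13", "192.0.255.255"),
]
suspects = wrong_broadcasts(example, "192.0.2.0/24")
```

A sender that shows up in `suspects` is a candidate for a wrong subnet mask or a hard-coded broadcast address in an application.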
Real-Time Analysis of Traffic
• A very interesting and promising free tool is NTOP (www.ntop.org)
• Captures packets, analyzes the protocol headers in real time and dynamically generates web pages, e.g.:
  • Protocols and their distribution
  • Hosts, host info, data sources and destinations
  • Throughput graphs
  • Traffic matrix
• Still in development, not perfectly stable yet
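Two of the views listed above, the traffic matrix and the protocol distribution, amount to simple aggregations over decoded packet headers. The Python sketch below shows the idea on already decoded tuples; it is not NTOP's implementation, and the tuple layout and the function name `summarize` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def summarize(packets):
    """Aggregate decoded packet headers.

    packets: iterable of (src, dst, protocol, nbytes) tuples.
    Returns (matrix, shares): matrix[src][dst] is the number of bytes
    sent from src to dst, shares[protocol] is that protocol's fraction
    of all bytes seen.
    """
    matrix = defaultdict(Counter)
    protocols = Counter()
    for src, dst, proto, nbytes in packets:
        matrix[src][dst] += nbytes
        protocols[proto] += nbytes
    total = sum(protocols.values())
    shares = {p: n / total for p, n in protocols.items()} if total else {}
    return matrix, shares
```

A real-time tool would update these counters packet by packet and render them periodically, whereas an offline analysis can run the same aggregation over a recorded dump.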
Real-Time Monitoring
• Figure: NTOP example
Conclusions and Outlook
• Network traffic recording and analysis
  • is feasible (with some restrictions) even in high-performance switched network environments
  • we are looking forward to the next generation of switches and workstations capable of monitoring at gigabit speeds
  • has proven to be very helpful in understanding host and network performance problems and in troubleshooting the computing infrastructure
• Powerful free software tools are available
  • but having multiple command-line-based programs makes the analysis of network traffic log files quite a complicated procedure
• The ultimate tool would be a PAW-like program for networks which allows filtering and plotting with a simple command language