Advanced Monitoring Techniques for the ATLAS TDAQ Network
Matei Ciobotaru (CERN; University of California, Irvine; “Politehnica” University of Bucharest)
On behalf of the ATLAS Networking Group: B. Martin, A. Al-Shabibi, S. Batraneanu, S. Stancu, L. Leahu, L. Darlea, M. Ivanovici
The ATLAS TDAQ Network – Role
• The ATLAS Trigger and Data Acquisition (TDAQ) network handles the data transfers from the ATLAS detector to the analysis and storage nodes
• Built with Gigabit Ethernet switches and routers
• Sustained rates of 150 Gbit/s
• The experiment relies on the network to function 24/7 with a minimal number of failures
[Figure: ATLAS detector; TDAQ system]
The ATLAS TDAQ Network – Photos
• Almost 3000 devices and 5000 network connections…
• How to make sure everything is working correctly?
[Photos: 2500 computers installed in 90 racks; 2 concentrator switches per rack; 5 “big” chassis-based devices at the core]
Inside this talk
• Requirements in terms of network management
• Commercial software we are using
• Tools we developed in-house
• Services for users, integration with ATLAS
• Plans for the future
• The big picture
ATLAS Requirements
• Installation
  • Ease the equipment registration, inventory and verification
  • Configure the devices
• Operation
  • Check the state of health of devices and links
  • Monitor traffic conditions, raise alarms when needed
  • Assist the user in navigating the realm of information
  • Integration with the ATLAS TDAQ software
• Diagnostics
  • Provide aids to the admin in case something goes wrong
  • Be able to suggest solutions to problems
• Manage a large local area network which has to be very reliable and which has very high throughput requirements
[Diagram label: Complexity]
Equipment registration
• ATLAS equipment needs to be registered in four databases
• Only some databases support batch registrations; others require manual intervention, which may lead to inconsistencies
• Developed a web application to cope with this situation
  • Central place for querying all the information about a device
  • Ability to cross-check the data across all databases, to detect incomplete/incorrect registrations
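The cross-check across databases can be illustrated with a minimal sketch. It assumes each database can be exported as a mapping from device name to a record of fields; the database names, device names and the second MAC address below are hypothetical, and this is not the code of the web application itself.

```python
def cross_check(databases):
    """Return {device: [problem, ...]} for incomplete or incorrect registrations.

    `databases` maps a database name to {device_name: {field: value}}.
    (Assumed export format, for illustration only.)
    """
    all_devices = set()
    for records in databases.values():
        all_devices.update(records)

    problems = {}
    for device in sorted(all_devices):
        issues = []
        # Incomplete registration: the device is absent from a database.
        for db_name in sorted(databases):
            if device not in databases[db_name]:
                issues.append("missing in %s" % db_name)
        # Incorrect registration: a shared field has conflicting values.
        seen = {}
        for db_name in sorted(databases):
            for field, value in databases[db_name].get(device, {}).items():
                seen.setdefault(field, {})[db_name] = value
        for field in sorted(seen):
            if len(set(seen[field].values())) > 1:
                issues.append("conflicting %s" % field)
        if issues:
            problems[device] = issues
    return problems


# Hypothetical exports from two of the four databases.
example = {
    "inventory": {"sw-rack-01": {"mac": "00:90:27:8F:94:E3"},
                  "sw-rack-02": {"mac": "00:90:27:8F:94:E4"}},
    "network": {"sw-rack-01": {"mac": "00:90:27:8F:94:E3"}},
}
report = cross_check(example)
```

Here `report` flags only `sw-rack-02`, which was registered in one database but not in the other.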
Equipment inventory
• Network diagrams for ATLAS are made in Microsoft Visio using the NetDesign package
• We created tools which discover what really exists in the network (what is connected where)
• Developed an application which compares the two data sources (Visio and auto-discovery); mismatches are detected and corrected in the field if necessary
• For the network documentation, we also automatically generate a printable “report” with all the connectivity
[Diagram labels: Visio; Network Discovery]
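Comparing the documented and the discovered connectivity amounts to a set difference over links. The sketch below assumes both sources can be reduced to pairs of (device, port) endpoints; the function names and the sample device names are hypothetical, not those of the actual application.

```python
def normalize(link):
    """Treat a link as undirected: (A, B) and (B, A) are the same cable."""
    end_a, end_b = link
    return tuple(sorted((end_a, end_b)))


def compare_connectivity(documented, discovered):
    """Compare documented links (e.g. from the diagrams) with discovered ones.

    Each link is a pair of (device, port) endpoints.
    Returns (missing_in_field, undocumented), both sorted.
    """
    doc = {normalize(link) for link in documented}
    disc = {normalize(link) for link in discovered}
    return sorted(doc - disc), sorted(disc - doc)


# Hypothetical sample data.
documented = [
    (("core-1", "1/1"), ("rack-01", "49")),
    (("core-1", "1/2"), ("rack-02", "49")),
]
discovered = [
    (("rack-01", "49"), ("core-1", "1/1")),   # same cable, reversed order
    (("core-1", "1/3"), ("rack-02", "49")),   # plugged into the wrong port
]
missing_in_field, undocumented = compare_connectivity(documented, discovered)
```

A link listed as missing in the field together with a similar undocumented link usually points to a cable plugged into the wrong port.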
Network configuration (1)
• In ATLAS we have more than 200 switches
  • Different vendors
  • Different mechanisms for configuration and monitoring (telnet, SNMP, web)
• Q: How to access all devices in a transparent manner?
• A: Bring them all under a common denominator (common interface)
• Q: How to automate network management tasks?
• A: Write scripts (little programs)
• switches + scripting = sw_script http://cern.ch/ciobota/projects/sw_script/
• sw_script = set of Python modules which can be used as building blocks for network management solutions
  • Common programming interface to all devices (object-oriented)
  • “Intelligent” tools for configuration and monitoring can be developed
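The idea of a common object-oriented interface can be sketched as below. This is only an illustration of the pattern, not sw_script's actual class hierarchy: the class names, `get_link_state` and `ports_down` are assumptions, and canned data stands in for real SNMP or command-line access.

```python
from abc import ABC, abstractmethod


class Switch(ABC):
    """Vendor-neutral interface; callers never see the access mechanism."""

    def __init__(self, ip_address):
        self.ip_address = ip_address

    @abstractmethod
    def get_port_names(self):
        """Return the list of port names on the device."""

    @abstractmethod
    def get_link_state(self, port):
        """Return 'up' or 'down' for the given port."""

    def ports_down(self):
        """A generic tool, written once against the common interface."""
        return [p for p in self.get_port_names()
                if self.get_link_state(p) != "up"]


class SnmpSwitch(Switch):
    """Hypothetical device read over SNMP (a canned status table stands in)."""

    def __init__(self, ip_address, oper_status):
        super().__init__(ip_address)
        self._oper_status = oper_status  # port -> 1 (up) or 2 (down)

    def get_port_names(self):
        return sorted(self._oper_status)

    def get_link_state(self, port):
        return "up" if self._oper_status[port] == 1 else "down"


class CliSwitch(Switch):
    """Hypothetical device read via command-line output (canned text)."""

    def __init__(self, ip_address, cli_output):
        super().__init__(ip_address)
        # Each line is assumed to look like "<port> <state>".
        self._states = dict(line.split() for line in cli_output.splitlines())

    def get_port_names(self):
        return sorted(self._states)

    def get_link_state(self, port):
        return self._states[port]


devices = [
    SnmpSwitch("192.168.100.59", {"1/1": 1, "1/2": 2}),
    CliSwitch("192.168.100.60", "A1 up\nA2 up"),
]
down = {d.ip_address: d.ports_down() for d in devices}
```

The loop at the end treats both devices identically, which is the point of the common denominator: vendor differences stay inside the subclasses.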
Interactive session with sw_script
    # Start the Python interpreter
    $ python2.5
    # Load the sw_script module
    >>> import sw_script
    # Create an object associated with the switch (a Cisco device in this case)
    >>> sw = sw_script.Cisco_Catalyst_6500_Switch(ip_address="192.168.100.59")
    # List the ports available on this device
    >>> sw.get_port_names()
    ['1/1', '1/2', '1/3', '1/4', ....
    # Get all the information available for an interface
    >>> sw.get("1/4")
    [('rx_packets', 519.0), ('rx_bytes', 127937.0), ('rx_discards', 0.0),
     ('rx_errors', 0.0), ('tx_packets', 11199.0), ('tx_bytes', 1111661.0),
     ('tx_discards', 0.0), ('tx_errors', 0.0),
     ('description', 'GigabitEthernet1/4'), ('link_state', 'up'),
     ('mac_addr', ['00:90:27:8F:94:E3'])]
    # Set the description (ifAlias) of an interface
    >>> sw.set_interface_alias("1/4", "Uplink to Core Router")
    # Show the serial number of this device
    >>> print sw.get_serial_number()
    FOC0913U075
sw_script is responsible for more than half of our network management toolbox
• Features
  • Supports devices from different vendors
  • Network topology auto-discovery
  • Can do traffic monitoring in real-time
  • Works as a module, can be easily embedded into other apps
Network configuration (2)
• In ATLAS, we have programs which use sw_script to perform configuration changes on devices:
  • defining VLANs
  • enabling protocols: spanning tree, time synchronization, etc.
  • setting interface aliases (descriptions)
• We use Python scripts to perform unattended firmware upgrades
• For keeping track of configuration files we plan to use ZipTie (open-source software)
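A configuration script of this kind might look like the following minimal sketch. It relies only on the two calls shown in the interactive session (`get_port_names` and `set_interface_alias`); the `apply_aliases` helper, its `dry_run` flag and the `FakeSwitch` stand-in are hypothetical and exist only so that the example runs without a real device.

```python
def apply_aliases(switch, aliases, dry_run=True):
    """Set interface descriptions; return the list of (port, alias) changes.

    All ports are validated before anything is written, so a typo in the
    input cannot leave the device half-configured.
    """
    known_ports = set(switch.get_port_names())
    unknown = sorted(set(aliases) - known_ports)
    if unknown:
        raise KeyError("unknown ports: %s" % ", ".join(unknown))

    changes = sorted(aliases.items())
    if not dry_run:
        for port, alias in changes:
            switch.set_interface_alias(port, alias)
    return changes


class FakeSwitch(object):
    """Stand-in for a real switch object (illustration only)."""

    def __init__(self, ports):
        self._ports = list(ports)
        self.aliases = {}

    def get_port_names(self):
        return list(self._ports)

    def set_interface_alias(self, port, alias):
        self.aliases[port] = alias


sw = FakeSwitch(["1/1", "1/2", "1/3", "1/4"])
planned = apply_aliases(sw, {"1/4": "Uplink to Core Router"})          # dry run
applied = apply_aliases(sw, {"1/4": "Uplink to Core Router"}, dry_run=False)
```

Running with `dry_run=True` first gives a list of planned changes to review before anything is written to the device.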
Basic monitoring
• Spectrum, a software package from Computer Associates, is used for device health and traffic monitoring (it is also used by the CERN IT department)
  • Monitors devices, raises alarms in case of failures
  • Auto-discovery for almost all network connections
  • Historical info: gathers statistics from all devices
  • Throughput and error rates saved every 30 seconds
• Limitations
  • The Spectrum GUI is hard to use
  • It is not easy to integrate with 3rd party apps
  • Limited support for network performance monitoring
  • Basic support for querying historical traffic data
  • No support for device configuration
  • Virtually no features for diagnostics
• We developed software to fill in the gaps
[Screenshot: Spectrum GUI]
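Throughput figures of this kind are typically derived from two successive readings of an interface byte counter. The sketch below shows the arithmetic for a 30-second polling interval; it is a generic illustration and not a description of how Spectrum computes its values. The function name and the single-wrap assumption are ours.

```python
def throughput_bps(prev_bytes, curr_bytes, interval_s=30.0, counter_bits=64):
    """Average throughput in bit/s between two readings of a byte counter.

    Assumes the counter wrapped at most once between the two polls.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The counter wrapped around between the two polls.
        delta += 2 ** counter_bits
    return delta * 8 / float(interval_s)


# 3.75 GB transferred in 30 s corresponds to a saturated Gigabit link.
rate = throughput_bps(0, 3750000000, interval_s=30.0)
```

With 32-bit counters a saturated Gigabit link wraps in roughly 34 seconds, so a 30-second interval is close to the limit at which the single-wrap assumption still holds.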
Navigating in the realm of monitoring data
• Spectrum produces 3 plots for each network interface. With 5000 ports, that makes 15000 plots to look at…
• We developed tools to browse, query and analyze the traffic plots.
Integration with ATLAS software
• Network Panel
  • Shows network monitoring information relevant to an ATLAS data acquisition run
• Alarm Watcher
  • Forwards alarms from Spectrum into the ATLAS “official” messaging channels
• IS Feeder
  • Publishes network statistics to the Information Services, a monitoring sub-system in ATLAS
[Screenshot: the Network Panel]
Network visualization – 2D approach
• Application which shows a topological map of the network
• Colors the connections in real-time according to their state and usage
• Overloaded links are easy to spot
• Good navigation features (zoom, pan)
• Based on GUESS, a Java application for visualizing graphs
  • http://graphexploration.cond.org/
• We developed a network monitoring plug-in for GUESS
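Coloring a connection by state and usage can be reduced to a small mapping function. The sketch below is an assumption about how such a rule might look; the color names and the 50% and 90% thresholds are hypothetical and are not taken from the GUESS plug-in.

```python
def link_color(link_state, utilization):
    """Map a link's state and utilization (0.0 to 1.0) to a display color.

    Thresholds are illustrative placeholders.
    """
    if link_state != "up":
        return "gray"      # link down or state unknown
    if utilization >= 0.9:
        return "red"       # overloaded
    if utilization >= 0.5:
        return "orange"    # busy
    return "green"         # lightly used


def utilization(throughput_bps, capacity_bps=1e9):
    """Fraction of the link capacity in use (Gigabit Ethernet by default)."""
    return throughput_bps / float(capacity_bps)


color = link_color("up", utilization(9.5e8))
```

Checking the state before the load keeps a failed link from being drawn as merely idle.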
Network visualization – 3D approach (1)
• 3D model of the network
• Racks, switches and computers are the “furniture” in the 3D space
• Each object contains a panel with traffic information (updated in real-time)
• Containers (racks, rooms) show aggregate values
• Navigation similar to Google Earth
• Technologies used: X3D, Java and the Octaga Player
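The aggregate values shown by containers can be computed by summing the traffic of the objects they hold. A minimal sketch, assuming per-device readings tagged with their rack; the tuple layout and the names are hypothetical.

```python
from collections import defaultdict


def aggregate_by_rack(readings):
    """Sum traffic per rack.

    `readings` is an iterable of (rack, device, rx_bps, tx_bps) tuples.
    Returns {rack: {"rx_bps": total, "tx_bps": total}}.
    """
    totals = defaultdict(lambda: {"rx_bps": 0.0, "tx_bps": 0.0})
    for rack, _device, rx_bps, tx_bps in readings:
        totals[rack]["rx_bps"] += rx_bps
        totals[rack]["tx_bps"] += tx_bps
    return dict(totals)


# Hypothetical sample readings.
readings = [
    ("rack-01", "pc-1", 100.0, 50.0),
    ("rack-01", "pc-2", 200.0, 25.0),
    ("rack-02", "pc-3", 10.0, 5.0),
]
per_rack = aggregate_by_rack(readings)
```

The same function applied to (room, rack, ...) tuples would give the room-level totals.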
Real-time traffic monitoring
[Screenshots]
• Real-time global top (most active connections)
• Connections for one switch (with traffic values)
• The ATLAS applications currently running in the network
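A "global top" view boils down to selecting the N busiest connections from the latest readings. A minimal sketch using the standard library; the connection labels and the `top_connections` name are hypothetical.

```python
import heapq


def top_connections(rates, n=10):
    """Return the n most active connections as (name, bit/s) pairs,
    busiest first. `rates` maps a connection name to its current rate."""
    return heapq.nlargest(n, rates.items(), key=lambda item: item[1])


# Hypothetical current readings in bit/s.
rates = {
    "core-1:1/1": 9.0e8,
    "core-1:1/2": 1.0e8,
    "rack-01:49": 5.0e8,
}
top2 = top_connections(rates, n=2)
```

`heapq.nlargest` avoids sorting all 5000 connections when only a handful are displayed.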
Diagnostics
• For immediate response, we look in Spectrum and in the sw_script web pages
• Human inspection of traffic plots (aggregates): we search for abnormal patterns and correlations between plots
• We have a collection of scripts to test different things
  • Checking that machines are configured properly and connections are OK
  • For bandwidth-related issues we use iperf
• All the network operations are documented in a knowledge base (wiki)
Plans for the future
• Better visualization techniques for traffic plots
• Analysis tools for monitoring data: pattern detection and recognition (periodic events, monotonic variations, etc.)
• Add support for sFlow, the standard for statistical sampling, which is very useful for diagnosing network congestion
• Design and implement an expert system which will help us troubleshoot network issues
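As an illustration of what detecting a monotonic variation could mean in practice, the sketch below finds the longest strictly rising run in a series of traffic samples. It is one simple possibility under our own assumptions, not the planned analysis tool; the function names and the `min_run` threshold are hypothetical.

```python
def longest_rising_run(samples):
    """Length, in samples, of the longest strictly increasing run."""
    if not samples:
        return 0
    best = current = 1
    for prev, curr in zip(samples, samples[1:]):
        current = current + 1 if curr > prev else 1
        best = max(best, current)
    return best


def has_monotonic_rise(samples, min_run=5):
    """Flag a series that keeps rising for at least `min_run` samples."""
    return longest_rising_run(samples) >= min_run


series = [1, 2, 3, 2, 3, 4, 5, 1]
run = longest_rising_run(series)
```

On noisy data, smoothing the samples first (for example with a moving average) would make such a check less sensitive to small dips.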
The big picture
[Diagram: how the monitoring tools fit together]
• Legend: Commercial package; In-house development
• Components: Spectrum; sw_script & co.
• Functions shown: Browse, search and aggregate; 2D and 3D network visualization; Dynamic web-pages; Historical traffic data; Real-time traffic info; Device health monitoring; ATLAS software – network status and alarms; Equipment auto-discovery, inventory and registration; Equipment configuration