Fabric monitoring for LCG-1 in the CERN Computer Center

Jan van Eldik, CERN-IT/FIO/SM. 7th GridPP Collaboration meeting, July 1, 2003. Outline: fabric monitoring developments at CERN; architectural overview; deployment status and plans for LCG-1; outlook.

Presentation Transcript


  1. Fabric monitoring for LCG-1 in the CERN Computer Center. Jan van Eldik, CERN-IT/FIO/SM. 7th GridPP Collaboration meeting, July 1, 2003

  2. Outline • Fabric monitoring developments at CERN • Architectural overview • Deployment: status & plans for LCG-1 • Outlook

  3. Fabric Monitoring at CERN • Improved fabric management is a key part of the LCG programme • EDG WP4 develops tools for automated installation, configuration, fabric monitoring, fault tolerance • IT/FIO Supervision & Monitoring section: develop and deploy a monitoring solution for the LHC era • A lot of expertise: EDG WP4 monitoring developments, PVSS SCADA studies, SNMP studies, operator alarm displays, … • Architecture based on functional requirements gathered by the PEM project • Important objective: fabric monitoring for LCG-1 at CERN

  4. Requirements and architecture • Both for performance and exception monitoring • Local and global consumers • Scalable, extensible, robust
     [Diagram labels: Monitored nodes, Sensor, Monitoring Sensor Agent, Cache, Local Consumer, Measurement Repository, Database, Global Consumer, Consumer]

  5. EDG WP4 implementation
     • Monitoring Sensor Agent (MSA)
       • Calls plug-in sensors to sample configured metrics
       • Stores all collected data in a local disk buffer
       • Sends the collected data to the global repository
     • Plug-in sensors
       • Programs/scripts that implement a simple sensor-agent ASCII text protocol
       • A C++ interface class is provided on top of the text protocol to facilitate implementation of new sensors
     • The local cache
       • Assures data is collected also when the node cannot connect to the network
       • Allows for node autonomy for local repairs
     • Transport
       • Transport is pluggable
       • Two protocols, over UDP and TCP, are currently supported; only the latter can guarantee delivery
     • Measurement Repository (MR)
       • The data is stored in a database
       • A memory cache guarantees fast access to the most recent data, which is normally what is used for fault tolerance correlations
     • Repository API
       • SOAP RPC
       • Query history data
       • Subscription to new data
     • Database
       • Proprietary flat-file database
       • Oracle
       • Open source interface to be developed
     [Diagram labels: Monitored nodes, Sensor, Monitoring Sensor Agent (MSA), Cache, Local Consumer, Measurement Repository (MR), Database, Global Consumer, Consumer]
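The agent behaviour on this slide (sample plug-in sensors, keep every record locally, forward to the repository over a pluggable transport) can be sketched roughly as follows. This is an illustrative Python sketch only, not the EDG WP4 code: the class `SensorAgent`, its methods, the record fields, and the in-memory list standing in for the local disk buffer are all assumptions made for the example, and the real sensor-agent text protocol is not reproduced here.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical types: a sensor returns one numeric sample, and a transport
# returns True only when delivery of a record was confirmed.
Sensor = Callable[[], float]
Transport = Callable[[dict], bool]


class SensorAgent:
    """Toy agent: samples sensors, buffers locally, forwards when possible."""

    def __init__(self, node: str, sensors: Dict[str, Sensor], transport: Transport):
        self.node = node
        self.sensors = sensors          # metric name -> plug-in sensor
        self.transport = transport      # pluggable: could wrap UDP or TCP
        self.local_buffer: List[dict] = []  # stands in for the local disk buffer
        self.unsent: List[dict] = []        # records not yet confirmed delivered

    def sample(self, now: Optional[float] = None) -> None:
        """Call every configured sensor once and store the results locally."""
        ts = time.time() if now is None else now
        for metric, read in self.sensors.items():
            record = {"node": self.node, "metric": metric, "ts": ts, "value": read()}
            self.local_buffer.append(record)
            self.unsent.append(record)

    def flush(self) -> int:
        """Try to send pending records; keep those the transport did not confirm."""
        sent = 0
        still_unsent = []
        for record in self.unsent:
            if self.transport(record):
                sent += 1
            else:
                still_unsent.append(record)
        self.unsent = still_unsent
        return sent
```

Because sampling never depends on the transport succeeding, data keeps accumulating while the node is disconnected and is forwarded on a later `flush()`, which is the role the slide gives to the local cache.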

  6. Deployment status in CERN CC • MSA with sensors for performance and exception monitoring, measuring 100-150 quantities per box • Deployed on ~1500 RedHat Linux nodes • 30 clusters, with specific configuration files

  7. Status of exception monitoring • ~50 possible alarms per monitored node: HighLoad, DaemonDead, FileSysFull, install / config problems • Operator alarm displays • PVSS-based, developed as part of PVSS-tests • WP4 alarm display under active development
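To make the idea of per-node exception alarms concrete, here is a minimal Python sketch of how samples might be turned into alarms of the kinds named on the slide (HighLoad, DaemonDead, FileSysFull). The threshold values, the metric keys (`load_avg`, `daemons`, `fs_used_percent`), and the function `evaluate_alarms` are hypothetical; the slides do not say how the deployed sensors decide when to raise an alarm.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration.
LOAD_LIMIT = 10.0
FS_FULL_PERCENT = 95.0


def evaluate_alarms(sample: Dict[str, object]) -> List[str]:
    """Return the alarms raised by one node sample.

    Assumed sample layout (not taken from the slides):
      load_avg:         float
      daemons:          {daemon name: True if running}
      fs_used_percent:  {mount point: percent used}
    """
    alarms = []
    if sample.get("load_avg", 0.0) > LOAD_LIMIT:
        alarms.append("HighLoad")
    for daemon, running in sample.get("daemons", {}).items():
        if not running:
            alarms.append(f"DaemonDead:{daemon}")
    for mount, used in sample.get("fs_used_percent", {}).items():
        if used >= FS_FULL_PERCENT:
            alarms.append(f"FileSysFull:{mount}")
    return alarms
```

An operator display would then only need to show the returned alarm names per node, whatever the underlying checks are.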

  8. PVSS operator alarm display

  9. WP4 operator alarm display

  10. Performance monitoring • WP4 Measurement Repository with Oracle backend is currently being deployed in the CERN CC for LCG-1 • Data access • C-API to the repository is available, Perl and Java implementations to be done • Simple CLI is being delivered • GUI is being delivered
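The kind of data access a repository client needs (history queries, plus fast lookup of the most recent values, as the transcript describes for the Measurement Repository) can be illustrated with a small in-memory stand-in. This Python sketch is an assumption-laden toy: `MeasurementStore` and its methods are invented for the example and do not mirror the C-API, the SOAP interface, or the Oracle schema, none of which are detailed in the slides.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (timestamp, value)
Key = Tuple[str, str]         # (node, metric)


class MeasurementStore:
    """Toy repository: full history plus a cache of the latest value per metric."""

    def __init__(self) -> None:
        self._history: Dict[Key, List[Point]] = defaultdict(list)  # stands in for the database
        self._latest: Dict[Key, Point] = {}                        # stands in for the memory cache

    def insert(self, node: str, metric: str, ts: float, value: float) -> None:
        key = (node, metric)
        self._history[key].append((ts, value))
        current = self._latest.get(key)
        # Samples may arrive out of order; only a newer timestamp updates the cache.
        if current is None or ts >= current[0]:
            self._latest[key] = (ts, value)

    def latest(self, node: str, metric: str) -> Optional[Point]:
        """Most recent sample, served from the cache without scanning history."""
        return self._latest.get((node, metric))

    def history(self, node: str, metric: str, start: float, end: float) -> List[Point]:
        """All samples with start <= timestamp <= end, in time order."""
        points = self._history.get((node, metric), [])
        return sorted(p for p in points if start <= p[0] <= end)
```

A CLI or GUI consumer would sit on top of calls like `latest()` and `history()`, whichever language binding is used.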

  11. Anamon

  12. Open issues • Current solution is still very node-centric • Not much experience with consumers • No correlation engines, no corrective actions yet… • Integration with configuration system to be done

  13. Summary and Outlook • Fabric monitoring infrastructure for LCG-1 at CERN is being deployed • Monitoring Sensor Agent has been operating very well • Measurement Repository will now be challenged • Consumers can start consuming… • An interesting 6-month period awaits us!
