Fabric monitoring for LCG-1 in the CERN Computer Center. Jan van Eldik, CERN-IT/FIO/SM. 7th GridPP Collaboration meeting, July 1, 2003. Outline: Fabric monitoring developments at CERN; Architectural overview; Deployment: status & plans for LCG-1; Outlook.
Fabric monitoring for LCG-1 in the CERN Computer Center
Jan van Eldik, CERN-IT/FIO/SM
7th GridPP Collaboration meeting, July 1, 2003
Outline
• Fabric monitoring developments at CERN
• Architectural overview
• Deployment: status & plans for LCG-1
• Outlook
Fabric Monitoring at CERN
• Improved fabric management is a key part of the LCG programme
• EDG WP4 develops tools for automated installation, configuration, fabric monitoring, and fault tolerance
• IT/FIO Supervision & Monitoring section: develop and deploy a monitoring solution for the LHC era
• A lot of expertise: EDG WP4 monitoring developments, PVSS SCADA studies, SNMP studies, operator alarm displays, …
• Architecture based on functional requirements gathered by the PEM project
• Important objective: fabric monitoring for LCG-1 at CERN
Requirements and architecture
• Both for performance and exception monitoring
• Local and global consumers
• Scalable, extensible, robust
[Diagram labels: Monitored nodes; Monitoring Sensor Agent; Sensors; Cache; Local Consumer; Measurement Repository; Database; Global Consumers]
EDG WP4 implementation
• Repository API
  • SOAP RPC
  • Query history data
  • Subscription to new data
• Monitoring Sensor Agent (MSA)
  • Calls plug-in sensors to sample configured metrics
  • Stores all collected data in a local disk buffer
  • Sends the collected data to the global repository
• Transport
  • Transport is pluggable
  • Two protocols are currently supported, one over UDP and one over TCP; only the TCP-based one can guarantee delivery
• Measurement Repository (MR)
  • The data is stored in a database
  • A memory cache guarantees fast access to the most recent data, which is normally what fault tolerance correlations use
• Plug-in sensors
  • Programs or scripts that implement a simple sensor-agent ASCII text protocol
  • A C++ interface class is provided on top of the text protocol to ease the implementation of new sensors
• Database
  • Proprietary flat-file database
  • Oracle
  • Open source interface to be developed
• The local cache
  • Ensures data is still collected when the node cannot connect to the network
  • Allows node autonomy for local repairs
[Diagram labels: Monitored nodes; Monitoring Sensor Agent (MSA); Sensors; Cache; Local Consumer; Measurement Repository (MR); Database; Global Consumers]
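As a rough illustration of the agent behaviour described on this slide (sample plug-in sensors, buffer locally, forward through a pluggable transport), here is a minimal Python sketch. It is not the WP4 code: the class names `SensorAgent` and `FlakyTransport`, the sample layout, and the in-memory buffer standing in for the local disk buffer are all assumptions made for the example, and the sensor-agent text protocol is not modelled.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List

# Hypothetical types for this sketch; not taken from the WP4 implementation.
Sample = Dict[str, object]
Sensor = Callable[[], float]
Transport = Callable[[Sample], bool]  # returns True when delivery succeeded


class SensorAgent:
    """Samples configured metrics, buffers them locally, forwards when possible."""

    def __init__(self, node: str, sensors: Dict[str, Sensor], transport: Transport):
        self.node = node
        self.sensors = sensors
        self.transport = transport            # pluggable: any callable will do
        self.buffer: Deque[Sample] = deque()  # stand-in for the local disk buffer

    def sample(self) -> None:
        """Call every plug-in sensor once and store the results locally."""
        now = time.time()
        for metric, sensor in self.sensors.items():
            self.buffer.append(
                {"node": self.node, "metric": metric, "value": sensor(), "time": now}
            )

    def flush(self) -> int:
        """Send buffered samples in order; stop at the first failed delivery."""
        sent = 0
        while self.buffer:
            if not self.transport(self.buffer[0]):
                break                         # keep the data for a later retry
            self.buffer.popleft()
            sent += 1
        return sent


class FlakyTransport:
    """Toy transport that only delivers while `up` is True."""

    def __init__(self) -> None:
        self.up = False
        self.received: List[Sample] = []

    def __call__(self, sample: Sample) -> bool:
        if not self.up:
            return False
        self.received.append(sample)
        return True
```

With the transport down, `sample()` keeps filling the buffer and `flush()` returns 0; once `up` is set to True, the next `flush()` drains the backlog. This mirrors the node-autonomy point: collection does not depend on the network being reachable.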
Deployment status in CERN CC
• MSA with sensors for performance and exception monitoring, measuring 100-150 quantities per box
• Deployed on ~1500 RedHat Linux nodes
• 30 clusters, with specific configuration files
Status of exception monitoring
• ~50 possible alarms per monitored node: HighLoad, DaemonDead, FileSysFull, install / config problems
• Operator alarm displays
  • PVSS-based, developed as part of PVSS-tests
  • WP4 alarm display under active development
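To make the idea of per-node alarms concrete, the sketch below evaluates a few rules against one node's latest metric values. Only the alarm names (HighLoad, DaemonDead, FileSysFull) come from the slide; the metric names, the thresholds, and the `raised_alarms` helper are assumptions chosen for illustration and do not reflect the actual CERN configuration.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]

# Alarm names are from the slide; metric names and thresholds are assumptions.
ALARM_RULES: Dict[str, Callable[[Metrics], bool]] = {
    "HighLoad": lambda m: m.get("load_avg_1min", 0.0) > 10.0,
    "DaemonDead": lambda m: m.get("sshd_running", 1.0) == 0.0,
    "FileSysFull": lambda m: m.get("root_fs_used_pct", 0.0) >= 95.0,
}


def raised_alarms(metrics: Metrics) -> List[str]:
    """Return the names of all alarms whose rule fires, in alphabetical order."""
    return sorted(name for name, rule in ALARM_RULES.items() if rule(metrics))
```

A missing metric falls back to a healthy default here, so an incomplete sample raises no alarm; a real deployment might prefer to treat missing data as an exception in its own right.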
Performance monitoring
• WP4 Measurement Repository with Oracle backend is currently being deployed in the CERN CC for LCG-1
• Data access
  • C-API to the repository is available, Perl and Java implementations to be done
  • Simple CLI is being delivered
  • GUI is being delivered
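The slides do not show the repository API itself. As a hedged illustration of the access patterns the talk lists for it (querying history data, subscribing to new data, and fast access to the most recent values), here is a small in-memory stand-in in Python. `ToyRepository` and all of its method names are hypothetical; the real repository is accessed through SOAP RPC and the C-API and stores its data in a database.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]        # (node, metric)
Point = Tuple[float, float]  # (timestamp, value)
Callback = Callable[[str, str, float, float], None]


class ToyRepository:
    """In-memory stand-in for a measurement repository (hypothetical API)."""

    def __init__(self) -> None:
        self._history: Dict[Key, List[Point]] = defaultdict(list)
        self._latest: Dict[Key, Point] = {}   # plays the role of the memory cache
        self._subscribers: List[Callback] = []

    def store(self, node: str, metric: str, timestamp: float, value: float) -> None:
        """Record a sample and notify subscribers; assumes samples arrive in time order."""
        key = (node, metric)
        self._history[key].append((timestamp, value))
        self._latest[key] = (timestamp, value)
        for callback in self._subscribers:
            callback(node, metric, timestamp, value)

    def latest(self, node: str, metric: str) -> Optional[Point]:
        """Most recent sample for a metric, or None if nothing was stored."""
        return self._latest.get((node, metric))

    def query(self, node: str, metric: str, start: float, end: float) -> List[Point]:
        """History query: all samples with start <= timestamp <= end."""
        points = self._history.get((node, metric), [])
        return [(t, v) for t, v in points if start <= t <= end]

    def subscribe(self, callback: Callback) -> None:
        """Register a consumer to be called for every new sample."""
        self._subscribers.append(callback)
```

A consumer that only needs current state would call `latest`, while one that plots trends would call `query` over a time window; a subscribing consumer receives each sample as it is stored.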
Open issues
• Current solution is still very node-centric
• Not much experience with consumers
• No correlation engines, no corrective actions yet…
• Integration with configuration system to be done
Summary and Outlook
• Fabric monitoring infrastructure for LCG-1 at CERN is being deployed
• Monitoring Sensor Agent has been operating very well
• Measurement Repository will now be challenged
• Consumers can start consuming…
• An interesting six-month period awaits us!