Monitoring and Fault Tolerance Helge Meinhard / CERN-IT OpenLab workshop 08 July 2003
Monitoring and Fault Tolerance: Context [Context diagram: a Node surrounded by the systems acting on it: Installation System, Configuration System, Monitoring System, Fault Mgmt System]
History (1) • In the 1990s, “massive” deployments of Unix boxes required automated monitoring of system state • Answer: SURE • Pure exception/alarm system • No archiving of values, hence not useful for performance monitoring • Not scalable to O(1000) nodes
History (2) • PEM project at CERN (1999/2000) took fresh look at fabric mgmt, in particular monitoring • PEM tool survey: Commercial tools found not flexible enough and too expensive; free solutions not appropriate • Architecture, design and implementation from scratch
History (3) • 2001 - 2003: European DataGrid project with work package on Fabric Management • Subtasks: configuration, installation, monitoring, fault tolerance, resource management, gridification • Profited from PEM work, developed ideas further
History (4) • In 2001, some doubts about ‘do-it-all-ourselves’ approach of EDG WP4 • Parallel to EDG WP4, project launched to investigate whether commercial SCADA system could be used • Architecture deliberately kept similar to WP4
Monitoring and FT architecture (1) • Monitoring: Captures the actual state of a system non-intrusively (it is supposed not to change that state) • Fault Tolerance: Reads and correlates data from the monitoring system and triggers corrective actions (state-changing)
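To make this division of roles concrete, here is a minimal sketch (the type and function names are illustrative only, not the WP4 or PVSS interfaces): monitoring produces read-only samples, fault tolerance consumes them and may change system state.

```python
# Illustrative only: MetricSample and the functions below are a sketch of the
# division of roles, not the actual WP4 or PVSS data structures.
from dataclasses import dataclass
import time

@dataclass(frozen=True)        # frozen: a sample is a read-only observation
class MetricSample:
    node: str                  # e.g. "lxbatch0042"
    metric: str                # e.g. "load1"
    timestamp: float
    value: float

def observe_load(node: str) -> MetricSample:
    """Monitoring side: capture state without modifying it (Linux /proc read)."""
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return MetricSample(node, "load1", time.time(), load1)

def react(sample: MetricSample) -> None:
    """Fault-tolerance side: read the sample and decide on a state-changing action."""
    if sample.metric == "load1" and sample.value > 50.0:
        print(f"{sample.node}: would trigger a corrective action here")
```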
Monitoring and FT architecture (2) [Architecture diagram: Sensors feed the Monitoring Sensor Agent (MSA), which writes to a local cache DB; local consumers read the data through an API; the MSA forwards metrics to the Monitoring Repository (MR)] • WP4: MR code with lower layer as flat file archive, or using Oracle • CCS: PVSS system
Monitoring and FT architecture (3) • MSA controls communication with Monitoring Repository, configures sensors, requests samples, listens to sensors • Sensors send metrics on request or spontaneously to MSA • Communication MSA – MR: UDP or TCP based
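As an illustration of this transport, the sketch below sends one sample from the MSA to the repository over UDP; the wire format, host and port are invented for the example and are not the actual WP4 protocol.

```python
# Hypothetical MSA-to-repository transport: message format, host and port are
# invented for illustration; the real WP4 protocol is not reproduced here.
import socket
import time

MR_HOST, MR_PORT = "127.0.0.1", 12409      # placeholder repository endpoint

def send_sample(node: str, metric: str, value: float) -> None:
    """Forward one metric sample from the MSA to the Monitoring Repository."""
    msg = f"{node} {metric} {int(time.time())} {value}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(msg.encode("ascii"), (MR_HOST, MR_PORT))

# The MSA would also keep recent samples in its local cache DB so that local
# consumers can still read them through the API if the repository is unreachable.
send_sample("lxbatch0042", "load1", 3.2)
```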
Monitoring and FT architecture (4) • FT system subscribes to metrics from the monitoring subsystem • Rule-based correlation engine decides when to fire actuators • Actuators controlled by an Actuator Agent; all actions logged by the monitoring system
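A minimal sketch of such a rule-based correlation loop, assuming a rule maps the latest metrics of a node to the name of an actuator (or to nothing); all interfaces are invented for illustration and are not the WP4 fault-tolerance API.

```python
# Sketch of a rule-based correlation engine and actuator agent; names are
# illustrative, not the WP4 fault-tolerance interfaces.
from typing import Callable, Dict, List, Optional

Rule = Callable[[Dict[str, float]], Optional[str]]   # metrics -> actuator name or None

def swap_full_rule(metrics: Dict[str, float]) -> Optional[str]:
    """Example rule: fire an actuator when swap usage exceeds 95%."""
    if metrics.get("swapUsedPct", 0.0) > 95.0:
        return "clean_up_swap"
    return None

class ActuatorAgent:
    """Runs actuators for the correlation engine; every action is reported
    back so that it ends up logged by the monitoring system."""
    def __init__(self, log_action: Callable[[str, str], None]):
        self.log_action = log_action

    def fire(self, node: str, actuator: str) -> None:
        # A real actuator would run a recovery script or restart a daemon here.
        self.log_action(node, actuator)

def correlate(node: str, metrics: Dict[str, float],
              rules: List[Rule], agent: ActuatorAgent) -> None:
    """Feed the latest metrics of one node through all rules."""
    for rule in rules:
        actuator = rule(metrics)
        if actuator is not None:
            agent.fire(node, actuator)

agent = ActuatorAgent(log_action=lambda n, a: print(f"{n}: fired {a}"))
correlate("lxbatch0042", {"swapUsedPct": 97.3}, [swap_full_rule], agent)
```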
Deployment (1) • End 2001: Put early versions of MSA and sensors on big clusters (~800 Linux machines), sending data (~100 metrics per machine, 1/min…1/day) to a PVSS-based repository • At the same time, ~300 machines started sending performance metrics into flat file WP4 repository
Deployment (2) • Sensors refined over time (metrics added according to operational needs) • Both exception- and performance-oriented sensors now deployed in parallel (some 150 metrics per node) • More special machines added; currently ~1500 machines being monitored • Test in May 2003: some 500 metric changes per second into the repository (~150 changes/s after “smoothing”)
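The quoted “smoothing” is presumably a filter that suppresses updates which do not change a value significantly, a common approach in SCADA systems; the sketch below shows that idea with a purely illustrative threshold.

```python
# A dead-band filter as one plausible reading of the "smoothing" step above:
# only forward a metric change when it differs enough from the last value sent.
# Threshold and behaviour are illustrative, not the configuration actually used.
from typing import Dict, Tuple

class DeadBandFilter:
    def __init__(self, rel_threshold: float = 0.05):
        self.rel_threshold = rel_threshold
        self.last_sent: Dict[Tuple[str, str], float] = {}   # (node, metric) -> value

    def accept(self, node: str, metric: str, value: float) -> bool:
        """Return True if this change should be forwarded to the repository."""
        previous = self.last_sent.get((node, metric))
        if previous is not None and abs(value - previous) <= self.rel_threshold * abs(previous):
            return False                        # change too small: suppress it
        self.last_sent[(node, metric)] = value  # significant change: forward it
        return True
```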
Deployment (3) • Repository requirements: repository API implementation, Oracle based, fully functional alarm display for operators • Currently using both an Oracle-MR based repository and a PVSS based one • Operators use the PVSS based alarm screen as an alternative to the SURE display
Deployment (4) • Interfaces: C API available, simple command line interface by end July, prototype Web access to time series of a metric available • Fault tolerance: Just starting to look at WP4 prototype • Configuration of monitoring: ad-hoc, to be migrated to CDB
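The interfaces above are listed without their calling conventions; purely as a hypothetical illustration of the kind of “time series of a metric” query a local consumer might issue through such an interface, a sketch:

```python
# Hypothetical consumer of a repository interface: the function names and the
# returned data are invented and stand in for calls to the C API or the Web
# prototype mentioned above.
from typing import List, Tuple

TimeSeries = List[Tuple[float, float]]            # (timestamp, value) pairs

def fetch_series(node: str, metric: str, since: float) -> TimeSeries:
    """Stand-in for a real repository query; returns made-up sample data."""
    return [(since + 60.0 * i, 1.0 + 0.1 * i) for i in range(10)]

def summarise(series: TimeSeries) -> str:
    values = [v for _, v in series]
    return (f"n={len(values)} min={min(values):.2f} "
            f"max={max(values):.2f} avg={sum(values) / len(values):.2f}")

print(summarise(fetch_series("lxbatch0042", "load1", since=0.0)))
```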
Outlook • Near term: Production services for LCG-1 • Add more machines (e.g. network), metrics • Software and service monitoring • Medium term (end 2003): Monitoring for Solaris and Windows, … • 2004 or 2005: Review of chosen solution for monitoring and FT • Some of 1999 arguments no longer valid • Will look at commercial and freeware solutions
Machine control • High level: interplay of State Management System, Configuration Management, Monitoring, Fault Tolerance, … • Low level: • Past: CPU boxes had no console access at all (5 rolling tables with monitors and keyboards per 500…1000 machines); disk and tape servers had analog KVM switches • Future: various options investigated with a benefit/cost analysis; will go to serial consoles on all machines, 1 head node per 50…100 machines with serial multiplexers