Lemon: Computer Monitoring at CERN
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
Presented by Harry Renshall, CERN-IT/FIO-FS
Outline
• Lemon
• Structure
• Deployment at CERN
• Use cases
• Alarms
• Web visualization
• Summary
HEPiX, 9-13/05/2005, Karlsruhe
Lemon – LHC Era Monitoring
• Lemon is a software package containing tools for monitoring the status and performance of computers:
  • Distributed monitoring system scalable to ~10k nodes
  • Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
  • Facilitates early error detection and problem prevention
  • Provides persistent storage of the monitoring data
  • Executes corrective actions and sends notifications
  • Offers a framework for creating further monitoring sensors
  • Most of the functionality is site independent
• It is used at CERN by:
  • System administrators, service managers, cluster responsibles
  • Developers and service/data challenges
  • Managers and general users
• Link: http://cern.ch/lemon
Lemon - schema
[Architecture diagram. Labels: Nodes; Sensor (x3); Monitoring Agent; TCP/UDP; Monitoring Repository; Repository backend; SQL; SOAP (x2); Correlation Engines; Lemon CLI; RRDTool / PHP; apache; HTTP; Web browser; User]
Components
• MSA – Monitoring Sensor Agent
  • Spawns multiple Monitoring Sensors (MS) that measure data at defined intervals, and sends the data to the Monitoring Repository
• MS – Monitoring Sensor
  • Uses a standard C++ or Perl API, so it is easy to write your own sensor
  • Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security, alarms (260 metrics in total)
• MR – Monitoring Repository
  • Stores data in an Oracle database (the full history), backed up to tape in Castor
  • Flat-file version available as well (with most functionality preserved)
  • We run two of them on two independent machines with two databases with failover (aiming for High Availability with Oracle Real Application Cluster)
• LRF – Lemon RRD Framework
  • Caches the data in an easily accessible way (RRD files) for web graphics
  • In connection with the Quattor Configuration DB, provides service and cluster overview
  • RRD stands for Round Robin Database (time-aging data with predefined binning), developed by Tobias Oetiker at ETH Zurich (http://www.rrdtool.org)
• LAG – Lemon Alarm Gateway
  • Generic gateway for alarms
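The "time-aging data with predefined binning" idea behind RRD files can be illustrated with a toy sketch. This is not rrdtool's file format or API; the class name `RoundRobinArchive` and its parameters are assumptions made for illustration only. Each archive keeps a fixed number of bins, each bin is the average of a fixed number of samples, and the oldest bin is overwritten when the archive is full.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive: keeps the newest `rows` bins,
    each bin averaging `steps_per_bin` consecutive samples."""

    def __init__(self, steps_per_bin, rows):
        self.steps_per_bin = steps_per_bin
        self.bins = deque(maxlen=rows)  # oldest bin is dropped automatically
        self._pending = []              # samples not yet consolidated into a bin

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps_per_bin:
            self.bins.append(sum(self._pending) / len(self._pending))
            self._pending = []


# Hypothetical layout: a fine archive (raw samples) and a coarse one (4-sample averages).
fine = RoundRobinArchive(steps_per_bin=1, rows=4)
coarse = RoundRobinArchive(steps_per_bin=4, rows=2)

for load in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]:
    fine.update(load)
    coarse.update(load)

print(list(fine.bins))    # the four newest raw samples
print(list(coarse.bins))  # two averages over four samples each
```

Storage stays constant regardless of how long the sensor runs: recent data is kept at full resolution, older data only as coarser averages.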
Lemon at CERN
• Lemon monitors about 2200 computers in ~100 clusters
• On average it collects about 70 metrics from each host
• Part of the ELFms tools
• Integrated with the Sure alarm system
• Collecting about 1.5 GB/day
• Integrated with CDB for configuration
• Leaf (LHC-Era Automated Fabric) for scheduling of interventions
[Diagram labels: Node Configuration Management; Node Management]
• Configuration
  • Derived from the Configuration Database (CDB)
  • Individual configuration per cluster/host
  • Hierarchical structure
  • Monitoring state is derived from CDB
  • Leaf tools allow scheduled downtimes, interventions, on-demand changes
• Alarm system
  • Sure: legacy system receiving alarms from Lemon
  • Integration with the new LASER system (LHC alarm system) is ongoing
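A hierarchical, per-cluster and per-host configuration can be pictured as layered overrides. The sketch below is only an illustration of that idea; it does not reproduce the CDB schema, and the function name `resolve_config` and all setting names are hypothetical.

```python
def resolve_config(defaults, cluster_cfg, host_cfg):
    """Merge settings so that host values override cluster values,
    which in turn override site-wide defaults."""
    merged = dict(defaults)
    merged.update(cluster_cfg)
    merged.update(host_cfg)
    return merged


# Hypothetical settings, invented for this example.
defaults = {"load_interval_s": 300, "monitoring_enabled": True}
cluster = {"load_interval_s": 60}        # this cluster samples load more often
host = {"monitoring_enabled": False}     # e.g. host under a scheduled intervention

cfg = resolve_config(defaults, cluster, host)
print(cfg)
```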
Computer Center Overview
• Entry page displays a status overview of the key services
• Allows choosing an individual cluster, rack, host or other category
Use(ful) cases (I)
• Kernel upgrade
  • Kernel version is "measured" on the boot of the machine
  • Automatic tools for upgrading the kernel on a cluster retrieve information from Lemon and schedule the reboot of a machine based on this info
  • Web interface allows monitoring of the progress
[Figure: reboot occurrence history graph]
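The kernel upgrade use case amounts to comparing the kernel version measured at boot with the target version. A minimal sketch follows, assuming the measured versions have already been retrieved as a host-to-version mapping; the function names, host names and version strings are invented for illustration and are not part of the Lemon tools.

```python
def hosts_needing_reboot(kernel_by_host, target_kernel):
    """Return the hosts whose measured kernel differs from the target, sorted by name."""
    return sorted(h for h, k in kernel_by_host.items() if k != target_kernel)


def upgrade_progress(kernel_by_host, target_kernel):
    """Fraction of hosts already running the target kernel."""
    done = sum(1 for k in kernel_by_host.values() if k == target_kernel)
    return done / len(kernel_by_host)


# Hypothetical measurements for a four-node cluster.
measured = {
    "node01": "2.4.21-27",
    "node02": "2.4.21-20",
    "node03": "2.4.21-27",
    "node04": "2.4.21-20",
}
target = "2.4.21-27"

pending = hosts_needing_reboot(measured, target)
progress = upgrade_progress(measured, target)
print(pending, progress)
```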
Use(ful) case (II)
• Searching for a host
  • High load, network usage,…
  • Metric distributions allow identification of hosts with problematic performance
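One simple way to pick problematic hosts out of a metric distribution is to compare each host with the median of the cluster. This is a sketch of that general idea, not the method used by the Lemon web interface; the function name, the factor of 3 and the sample values are assumptions.

```python
from statistics import median


def flag_outliers(metric_by_host, factor=3.0):
    """Return hosts whose value exceeds `factor` times the median over all hosts."""
    threshold = factor * median(metric_by_host.values())
    return sorted(h for h, v in metric_by_host.items() if v > threshold)


# Hypothetical load averages for five hosts.
load = {"node01": 0.9, "node02": 1.1, "node03": 1.0, "node04": 12.5, "node05": 0.8}

suspects = flag_outliers(load)
print(suspects)
```

The median is used rather than the mean so that a single overloaded host does not shift the threshold it is compared against.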
Integration of Web interface
• The web interface has been adapted, through various plug-ins, to accommodate additional information and links that help with management of the computer center
• Examples:
  • Configuration database browser (browses external XML config files)
  • ITCM (Remedy) ticket: external error tracking database
  • CC tracker (synoptic view of the computer center) with XML-defined geometry
  • Alarm display
  • Metric information display
  • Raw data grapher (JPgraph)
• External functionalities are customizable
Computer Center display
• Lemon Web Interface is interfaced with the Computer Center database of objects
• Provides search of objects as well as listing
• Interfaced through the XML-defined geometry of the computer center
• Generic design
Automatic recovery actions
• Alarm Sensor
  • For defined values of measured metrics, an actuator is called with a predefined action
  • An example: ssh daemon dead, action /sbin/service sshd start
  • Definition: metric X, field Y != reference value Z => call actuator
    • If the action succeeds, log only
    • Else call the action again, up to a maximum number of times
  • Each occurrence is logged in the Monitoring Repository
  • Already about 70 predefined alarms with automatic recovery actions
  • After the first month of deployment it reduced the number of problem tickets by half
• Correlation engine
  • Allows a wide definition of alarms and recovery actions (in development)
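The alarm rule above (field != reference value => call actuator, retry up to a maximum) can be sketched as a small retry loop. This is one reading of the slide, not the Alarm Sensor's actual code; `handle_alarm`, its parameters and the simulated service are hypothetical, and the real action (for example /sbin/service sshd start) is replaced by a stand-in function so the example runs anywhere.

```python
def handle_alarm(read_field, reference, action, max_attempts=3, log=print):
    """If the measured field differs from the reference value, run the
    recovery action until the field matches again or max_attempts is reached.
    Returns (recovered, attempts_used)."""
    if read_field() == reference:
        return True, 0  # no alarm condition, nothing to do
    for attempt in range(1, max_attempts + 1):
        log(f"alarm raised, running recovery action (attempt {attempt})")
        action()
        if read_field() == reference:
            log("recovery succeeded")
            return True, attempt
    log("recovery failed, alarm stays raised")
    return False, max_attempts


# Simulated service state standing in for "is sshd running?" (1 = yes, 0 = no).
state = {"sshd_running": 0}


def read_sshd():
    return state["sshd_running"]


def restart_sshd():
    state["sshd_running"] = 1  # stand-in for: /sbin/service sshd start


messages = []
recovered, attempts = handle_alarm(read_sshd, 1, restart_sshd, log=messages.append)

# An action that never fixes the problem exhausts all attempts.
failed, failed_attempts = handle_alarm(lambda: 0, 1, lambda: None, log=messages.append)
print(recovered, attempts, failed, failed_attempts)
```

Passing `log` as a parameter mirrors the slide's point that every occurrence is recorded; here the messages are simply collected in a list.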
Remedy Ticket tracking
• Error trending metric with values on the number of interventions/occurrences of problems
• Several categories created by:
  • Hardware
  • Software
  • Clustered by contract type/cluster
• Reporting problems, whether scheduled or not and whether the system was rebooted
• Allows tracking of interventions per type of problem
• Web interface to show the trend
[Figure: ITCM (Remedy) tickets occurrence]
Database (Oracle) Monitoring
• In cooperation with the ADC group at CERN we have developed a sensor for measuring performance entities in the Oracle database:
  • Number of logons, cursors, logical and physical I/O, user commits, index usage, parse statistics, …
• Allows identification of bottlenecks and gives an overview of the stability of the system
• Works with both the 9i and 10g versions of Oracle
• Integration into services/RAC
• Configuration of service integrated with Oracle Enterprise Repository
Service challenges, GRID VOs
• Lemon allows
  • Virtual clusters
    • Clusters defined on request by service managers
    • Or defined by scripts, updated dynamically on demand
    • Or defined for a specific purpose
    • An example: Atlas DC04 challenge, Network challenges,…
  • Clusters defined dynamically
    • An example: hosts running GRID jobs on the batch cluster belonging to the given Virtual Organization
• Provides hooks in Lemon for defining any dynamic grouping of hosts
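A dynamically defined cluster, such as "hosts currently running jobs of a given Virtual Organization", is essentially a grouping recomputed from live data. The sketch below illustrates that grouping only; it does not use Lemon's hooks, and the function name, host names and job list are invented for the example.

```python
from collections import defaultdict


def group_hosts_by_vo(running_jobs):
    """Build virtual clusters: VO name -> sorted list of hosts running that VO's jobs.
    `running_jobs` is an iterable of (host, vo) pairs, one per running job."""
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Hypothetical snapshot of jobs on a batch cluster.
jobs = [
    ("node01", "atlas"),
    ("node02", "cms"),
    ("node01", "cms"),
    ("node03", "atlas"),
    ("node03", "atlas"),
]

virtual_clusters = group_hosts_by_vo(jobs)
print(virtual_clusters)
```

A host can belong to several virtual clusters at once, and membership changes whenever the job snapshot changes.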
Summary
• Lemon provides monitoring information about the computers in the Computer Center at CERN
• Thanks to its integration with Sure (alarm system) it allows fast and easy identification and repair of problems. We will convert to a new accelerator alarm system (LASER) this year. Lemon provides LAG (Lemon Alarm Gateway) to feed alarms into arbitrary alarm systems.
• In connection with CDB it allows an easier overview of services and visualisation of their performance
• In connection with Remedy (ITCM, problem tracking) it allows an overview of the problems for a given service
• It has been a useful tool for general monitoring of performance and also for system administrators debugging problems
• Lemon is also used and developed elsewhere: BARC institute in India, Accelerator department at CERN, CMS is adopting it for its online farm monitoring,…
• Lemon is used for GridICE and can provide data to MonALISA