Lemon Monitoring
Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
CERN-IT/FIO-FS
LCG Operations Workshop, Bologna, 24-26 May 2005
Outline
• Lemon
• Structure and design
• How it works, deployment
• Use cases, web interface
• Installation and setup
• Summary
LCG Operations Workshop 24-26/05/2005 Bologna
Lemon: LHC Era Monitoring
• Lemon is a set of tools for monitoring the status and performance of computers:
  • Distributed monitoring system scalable to ~10k nodes
  • Provides active monitoring of software and hardware on centrally managed clusters in the Computer Center
  • Facilitates early error detection and problem prevention
  • Executes corrective actions and sends notifications
  • Provides persistent storage of the monitoring data
  • Offers a framework for creating further monitoring sensors
  • Site-independent functionality
• Link: http://cern.ch/lemon
• Part of the ELFms tool suite: http://cern.ch/elfms
Lemon Use
• It is used inside and outside CERN by:
  • System administrators, service managers, cluster responsibles
  • Developers and service/data challenges
  • Managers and general users
• Deployments outside CERN:
  • EDG testbeds
  • Accelerator (AB) department at CERN
  • CMS online
  • GridICE
  • BARC India (development partner)
Lemon architecture
[Figure: Lemon architecture diagram. Components shown: Sensors and the Monitoring Agent on the nodes, Monitoring Repository with its repository backend, Correlation Engines, RRDTool / PHP behind apache, Lemon CLI, web browser, user. Connection labels: TCP/UDP, SOAP, HTTP.]
Components
• Lemon is a typical server/client application with the following components:
• MSA: Monitoring Sensor Agent (Lemon Agent)
  • Daemon on a client machine that spawns multiple Monitoring Sensors, which measure data at defined intervals, and sends the data to the Monitoring Repository
• MS: Monitoring Sensor
  • Uses a standard C++ or Perl API, so it is easy to write your own sensor
  • Several sensors exist for performance, process, hardware and software monitoring, grid VO job reporting, database monitoring, security and alarms (260 metrics in total)
• MR: Monitoring Repository
  • Server application that receives samples and processes/validates them
  • Stores the full monitoring history data
  • Two implementations: flat-file based or Oracle DB based
• LRF: Lemon RRD Framework
  • Pre-processes data into rrd files and creates cluster summaries
  • These are used for web graphics
  • Provides service and cluster overviews in its web displays
• LAG: Lemon Alarm Gateway
  • Generic gateway for alarms (in development)
  • Gateways to MonALISA and GridICE exist
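The agent/sensor/repository split described on this slide can be illustrated with a toy, in-process sketch. It is not the Lemon sensor API (which the slide says is C++ or Perl): the names `Metric`, `ToyAgent`, `tick` and the sample tuple layout are hypothetical, and the "repository" is just a Python list standing in for the network transport.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical sample layout: (host, metric id, timestamp, value as text).
Sample = Tuple[str, int, float, str]


@dataclass
class Metric:
    metric_id: int
    name: str
    interval: float              # seconds between two measurements
    sample: Callable[[], str]    # the "sensor" call that measures a value


class ToyAgent:
    """Samples each metric when its interval has elapsed and forwards the result."""

    def __init__(self, host: str, metrics: List[Metric],
                 send: Callable[[Sample], None]) -> None:
        self.host = host
        self.metrics = metrics
        self.send = send
        # Every metric is due immediately on the first tick.
        self._next_due: Dict[int, float] = {m.metric_id: 0.0 for m in metrics}

    def tick(self, now: float) -> int:
        """Sample all metrics that are due at time `now`; return how many were sent."""
        sent = 0
        for m in self.metrics:
            if now >= self._next_due[m.metric_id]:
                self.send((self.host, m.metric_id, now, m.sample()))
                self._next_due[m.metric_id] = now + m.interval
                sent += 1
        return sent


# Stand-in for the Monitoring Repository: collect samples in a list.
repository: List[Sample] = []
agent = ToyAgent(
    "node01",
    [Metric(1, "uptime", 60.0, lambda: "12345"),   # made-up values
     Metric(2, "load", 30.0, lambda: "0.42")],
    repository.append,
)
counts = [agent.tick(t) for t in (0.0, 30.0, 60.0)]
```

Explicit timestamps are passed to `tick` so the sketch stays deterministic; a real daemon would drive it from a clock and send samples over TCP/UDP to the repository.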
Lemon at CERN
• Lemon monitors about 2200 computers in ~100 clusters
• On average it collects about 70 metrics from each host
• Integrated with the Sure alarm system
• Collects about 1.5 GB/day
• LEAF (LHC-Era Automated Fabric) for high-level intervention scheduling
[Figure labels: Node Configuration Management, Node Management]
• Configuration
  • Derived from the Quattor Configuration Database (CDB)
  • Individual configuration per cluster/host
  • Hierarchical structure
• Alarm system
  • Sure: legacy system receiving alarms from Lemon
  • Integration with the new LASER system (LHC alarm system) via LAG is ongoing
Web interface
• Cluster view displays accumulated statistics and status for all machines in the cluster
• Host view gives an overview of the host status with basic metrics
• Other views available:
  • Rack view
  • Hardware type view
  • Other views can be added; user-defined views are being worked on
• With the newest version (to be released soon):
  • Generic entry page displaying a status overview of the key services
  • Configurable views
• In development: database services monitoring with a database-specific view
Use(ful) case
• Kernel upgrade
  • Kernel version is "measured" at boot of the machine
  • Automatic tools for upgrading the kernel on a cluster retrieve this information from Lemon and schedule the reboot of a machine based on it
  • Web interface allows monitoring of the progress
[Figure: Reboot occurrence history graph]
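A minimal sketch of the selection step in this use case, assuming the measured kernel versions have already been retrieved from Lemon into a plain dictionary. The function names, host names, version strings and the batching policy are all hypothetical; the slide does not describe how the actual tools query Lemon or schedule reboots.

```python
from typing import Dict, List


def hosts_needing_reboot(running_kernel: Dict[str, str], target: str) -> List[str]:
    """Return hosts whose measured kernel version differs from the target."""
    return sorted(host for host, version in running_kernel.items()
                  if version != target)


def schedule_in_batches(hosts: List[str], batch_size: int) -> List[List[str]]:
    """Split hosts into reboot batches so the whole cluster is never down at once."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Made-up measurements: host name -> kernel version reported at last boot.
measured = {
    "node01": "2.4.21-20",
    "node02": "2.4.21-27",
    "node03": "2.4.21-20",
    "node04": "2.4.21-20",
}
pending = hosts_needing_reboot(measured, target="2.4.21-27")
batches = schedule_in_batches(pending, batch_size=2)
```

Because the kernel version is re-measured at every boot, rerunning the selection after each batch shows the remaining hosts shrinking, which is the progress the web interface lets you follow.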
Computer Center display
• Lemon Web Interface can be interfaced with a Computer Center database of objects (racks, silos, …)
• Provides search of objects as well as listing
• Interfaced through an XML-defined geometry of the computer center
• Generic design that can be used anywhere:

<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <DOORS R="0" G="255" B="0">
      <DOOR X="63" Y="39" LX="64" LY="39" />
      <DOOR X="34" Y="0" LX="36" LY="0" />
    </DOORS>
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
    <WALLS R="0" G="0" B="0">
      <WALL X="0" Y="0" LX="0" LY="60" />
      <WALL X="0" Y="0" LX="76" LY="0" />
    </WALLS>
    <STEPS R="255" G="163" B="0">
      <STEP X="47" Y="36" LX="52" LY="37" />
      <STEP X="47" Y="37" LX="52" LY="38" />
    </STEPS>
  </ROOM>
</CC>
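As an illustration of how such a geometry file could be consumed, the following sketch lists the racks in a trimmed copy of the XML above using Python's standard `xml.etree.ElementTree`. The function name `list_racks` is hypothetical, and reading (X, Y) and (LX, LY) as two corners of an object is an assumption; the slide does not define the attribute semantics.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Trimmed copy of the geometry shown on the slide (racks only).
GEOMETRY = """<?xml version="1.0" ?>
<CC>
  <ROOM ID="0513-S-0034" DESCRIPTION="Tape Vault" R="0" G="0" B="0">
    <RACKS R="0" G="0" B="203">
      <RACK ID="EA01" X="73" Y="9" LX="75" LY="10" PLANNED="0"/>
      <RACK ID="EA03" X="73" Y="8" LX="75" LY="9" PLANNED="0"/>
    </RACKS>
  </ROOM>
</CC>
"""


def list_racks(xml_text: str) -> List[Dict[str, object]]:
    """Return one record per RACK element, tagged with the room it sits in."""
    root = ET.fromstring(xml_text)
    racks = []
    for room in root.iter("ROOM"):
        for rack in room.iter("RACK"):
            racks.append({
                "room": room.get("ID"),
                "id": rack.get("ID"),
                # Assumption: (X, Y) and (LX, LY) are two corners of the rack.
                "corners": ((int(rack.get("X")), int(rack.get("Y"))),
                            (int(rack.get("LX")), int(rack.get("LY")))),
                # Kept as the raw integer; its meaning is not given on the slide.
                "planned": int(rack.get("PLANNED", "0")),
            })
    return racks


racks = list_racks(GEOMETRY)
```

A search by object ID, as mentioned on the slide, then reduces to filtering this list on the `id` field.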
Service challenges, GRID VOs
• Lemon allows for:
  • Virtual clusters
    • Clusters defined on request by service managers
    • or defined by scripts and updated dynamically on demand
    • or defined for a specific purpose
    • Examples: Alice MDC, network challenges, …
  • Clusters defined dynamically
    • Example: hosts running GRID jobs on the batch cluster that belong to a given Virtual Organization
    • Hooks in Lemon for defining any dynamic grouping of hosts
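The dynamic grouping idea can be sketched as follows, assuming a script has already obtained a list of (host, VO) pairs for the jobs currently running on the batch cluster. The function name, the input format and the host names are hypothetical; the slide does not describe the actual hooks Lemon provides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_hosts_by_vo(running_jobs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build one dynamic cluster per Virtual Organization from (host, vo) pairs."""
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)   # a host running several jobs is listed once per VO
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Made-up snapshot of running jobs.
jobs = [
    ("node01", "alice"),
    ("node02", "cms"),
    ("node01", "cms"),
    ("node03", "alice"),
    ("node03", "alice"),
]
dynamic_clusters = group_hosts_by_vo(jobs)
```

Rerunning the grouping on each new snapshot keeps the cluster membership current, and a host may appear in several VO clusters at once.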
Automatic recovery actions and Alarms
• Alarm Sensor
  • For defined values of measured metrics, an actuator is called with a predefined action
  • Example: ssh daemon dead, action: /sbin/service sshd start
  • Definition: metric X, field Y <op> reference value Z => call actuator
  • <op> can be ==, <, >, regexp, range, etc.
  • On success, log only; otherwise call the action up to a maximum number of times
  • Each occurrence is logged in the Monitoring Repository
  • About 70 predefined alarms with automatic recovery actions already exist
  • After the first month of deployment, the number of problem tickets was reduced by half
• Correlation engine (CMDaemon)
  • Allows "global" correlations and, in the future, client/server alarms and recovery actions
• Lemon Alarm Gateway (LAG)
  • Can be used to feed alarms into arbitrary alarm systems (under development)
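The rule form "metric X, field Y <op> reference value Z => call actuator" can be sketched as below. This is an illustrative reading of the slide, not Lemon's alarm sensor: `AlarmRule`, `evaluate`, the metric name and the retry-until-success behaviour are assumptions, and the actuator is a stub rather than a real call to /sbin/service.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Comparison operators named on the slide; "regexp" and "range" semantics are assumed.
OPERATORS: Dict[str, Callable[[object, object], bool]] = {
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "regexp": lambda value, pattern: re.search(pattern, str(value)) is not None,
    "range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


@dataclass
class AlarmRule:
    metric: str
    field: str
    op: str
    reference: object
    actuator: Callable[[], bool]   # returns True when the recovery action worked
    max_attempts: int = 3


def evaluate(rule: AlarmRule, sample: Dict[str, object], log: List[str]) -> bool:
    """Raise the alarm if the rule matches; retry the actuator until it succeeds."""
    if not OPERATORS[rule.op](sample[rule.field], rule.reference):
        return False
    log.append(f"alarm: {rule.metric}.{rule.field} {rule.op} {rule.reference!r}")
    for attempt in range(1, rule.max_attempts + 1):
        ok = rule.actuator()
        log.append(f"actuator attempt {attempt}: {'ok' if ok else 'failed'}")
        if ok:
            break
    return True


# Stub actuator standing in for "/sbin/service sshd start": fails once, then works.
outcomes = iter([False, True])
rule = AlarmRule("sshd_processes", "count", "==", 0, actuator=lambda: next(outcomes))
log: List[str] = []
raised = evaluate(rule, {"count": 0}, log)
quiet = evaluate(rule, {"count": 1}, [])
```

In this sketch every alarm and every actuator attempt is appended to `log`, mirroring the slide's point that each occurrence is recorded in the Monitoring Repository.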
Installation and setup (I)
Lemon installation consists of three steps:
• Server installation
• Client installation
• Web interface installation
1. Server installation:
• Install the edg-fabricMonitoring-server rpm ("flat file" server)
• Configure the receiving port in /etc/edg-fmon-server.conf
• Start the server daemon
2. Client installation:
• Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
• Configure the server and its port in /etc/edg-fmon-agent.conf
• Start the client daemon on all monitored hosts
Installation and setup (II)
3. Web interface installation:
• Install and start the apache server (with php) on your server
• Install the rrdtool and lrf (Lemon RRD Framework) rpms
• Configure your clusters in the clusters.conf file and start the lemonmrd daemon
• Drink Champagne… you have Lemon up and running! ;-)
• You can do all this on your laptop!
• Possible additional components:
  • Computer center synoptic view through an xml file
  • Problem tracking system integration (through a php plug-in to your DB/application)
  • Quattor CDB configuration view through CDB xml profiles
  • Oracle-based Repository (for very large installations, with high scalability and increased functionality)
  • Other, new components are easy to add
• View detailed instructions at: http://cern.ch/lemon/doc/installation/installation.html
Summary
• Lemon provides monitoring information about the farms in Computer Centers (or your laptop).
• Lemon provides a framework for recovery actions and alarms.
• Lemon is easy to install (…and it is easy to add your own metrics and visualize them).
• It is flexible with respect to your needs: you can add clusters and views, and specify your own definitions of virtual and dynamic clusters.
• It has been a useful tool for general performance monitoring and for system administrators debugging problems.
• For more information, see http://cern.ch/lemon