Lemon: Computer Monitoring at CERN
Miroslav Siket, CERN-IT/FIO-FS
Outline
• Lemon – what is it?
• Structure
• Functionality
• Metrics
• Alarms
• Web visualization
Sysadmin Introduction at CERN
Lemon – LHC Era Monitoring
• Lemon is a software package containing tools for monitoring the status and performance of computers (currently limited to Linux and Solaris)
• It contains the following components:
  • Sensors (they measure individual metrics [values])
  • MSA (Monitoring Sensor Agent)
  • Monitoring Repository (a daemon that receives the metrics)
  • Monitoring Repository Backend (storage)
  • LRF (Lemon RRD tool Framework: caching and web presentation tools)
  • Correlation Engines
  • Lemon Client (a tool for retrieving data)
  • LAG (Laser Alarm Gateway: a tool for passing alarms to the Laser system)
• See http://cern.ch/lemon for more info
Lemon – schema
[Figure: architecture diagram. Component labels: Sensor (three shown), Monitoring Agent, Nodes, Monitoring Repository, Repository backend, SQL, Correlation Engines, RRDTool / PHP, apache, Lemon CLI, Web browser, User. Link labels: TCP/UDP, SOAP, HTTP.]
Sensor (MS) and Sensor Agent (MSA)
• A sensor measures data on request from the MSA
• The MSA receives the data from the sensor through a pipe
• The MSA sends the data to the Monitoring Repository (MR) over a UDP socket
• Typical communication between the two:
  • MSA forks the sensor process
  • MSA: INI 1 LoadAvg
  • MSA: GET 1
  • Sensor: PUT 1 0.42
  • MSA: sends a UDP packet to the MR
• The MSA controls the sampling frequency and status of its individual sensors (there are usually several)
• You can write sensors yourself (bash, C++, Perl,…)
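The exchange above can be sketched from the sensor's side. This is a minimal illustration only, assuming a line-based protocol in which `INI <id> <name>` registers a metric and `GET <id>` is answered with `PUT <id> <value>`, as the slide's example suggests. The real Lemon protocol may carry more commands and fields; the `READERS` table, the `handle_line` function and the two-decimal formatting are hypothetical.

```python
import os
import sys

# Hypothetical table mapping metric names to reader functions.
# "LoadAvg" is the name used on the slide; the reader itself is an assumption.
READERS = {
    "LoadAvg": lambda: os.getloadavg()[0],  # 1-minute load average (Unix only)
}


def handle_line(line, active, readers=READERS):
    """Process one request line from the agent; return a reply line or None."""
    parts = line.split()
    if len(parts) == 3 and parts[0] == "INI":
        # "INI <id> <name>": bind an id to a known metric reader.
        _, metric_id, name = parts
        if name in readers:
            active[metric_id] = readers[name]
        return None
    if len(parts) == 2 and parts[0] == "GET":
        # "GET <id>": sample the metric and answer "PUT <id> <value>".
        reader = active.get(parts[1])
        if reader is not None:
            return f"PUT {parts[1]} {reader():.2f}"
    return None  # unknown or malformed requests are ignored in this sketch


def main():
    active = {}  # id -> reader, filled in by INI requests
    for line in sys.stdin:  # stdin stands in for the pipe from the agent
        reply = handle_line(line, active)
        if reply is not None:
            print(reply, flush=True)  # flush so the agent sees the reply at once


if __name__ == "__main__":
    main()
```

Feeding the script the two lines `INI 1 LoadAvg` and `GET 1` on standard input should produce a single `PUT 1 <value>` line. Forwarding that value to the repository over UDP is the agent's job and is left out here.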
Metrics
• Measured metrics (about 255):
  • Status: OS, disk DMA, RPM ok?, ethlink,…
  • Daemons: sshd, ntpd, syslogd, friod,… alive
  • File sizes: /etc/nologin, /afs/cern.ch,…
  • Security: sshd md5chksum,…
  • Performance: CPU utilization, memory utilization, network bandwidth use,…
  • Misc: number of jobs per virtual organization, SMART status, temperature,… (see the list at http://cern.ch/lemon-status/metric_descriptions.php)
• The status of the MSA can be checked in /var/log/edg-fmon-agent.log on each machine (the log file of the edg-fmon-agent daemon)
Lemon at CERN
• Lemon monitors about 2100 computers in 100 clusters
• On average it collects about 70 metrics from each host
• It collects about 1 GB of data per day
• It is part of ELFms
• It is integrated with the Sure alarm system
• It is integrated with CDB
[Figure labels: Node Configuration Management, Node Management]
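The figures above give a rough sense of scale. The short calculation below is a back-of-the-envelope sketch, assuming that "1 GB" means 10^9 bytes and that the daily volume is spread evenly over hosts and metrics. Neither assumption comes from the slides.

```python
HOSTS = 2100            # from the slide
METRICS_PER_HOST = 70   # from the slide (average)
BYTES_PER_DAY = 1e9     # "about 1 GB/day", taking 1 GB = 10**9 bytes (assumption)

# Number of distinct (host, metric) series being collected.
metric_streams = HOSTS * METRICS_PER_HOST

# Average daily volume per host and per series, assuming an even spread.
bytes_per_host_day = BYTES_PER_DAY / HOSTS
bytes_per_stream_day = BYTES_PER_DAY / metric_streams

print(metric_streams)               # 147000
print(round(bytes_per_host_day))    # 476190
print(round(bytes_per_stream_day))  # 6803
```

So the installation tracks on the order of 147,000 metric series, at roughly half a megabyte per host per day.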
Sure system
• The Sure sensor checks the values of individual metrics against reference values and raises an alarm when the conditions are met
• Examples:
  • Loadavg > 20: raises the Load_high alarm
  • # of sshd daemons < 1: raises the sshd_dead alarm
  • # of SMART failures in /var/log/messages > 0: raises the smart_failure alarm
• Alarms are sent to the Sure servers
• Operators acknowledge alarms, log them and, if unable to resolve them, notify the responsible person
• Sysadmins receive ITCM tickets; for each alarm there is a procedure describing how to handle it
• Special case: the NO_CONTACT alarm
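The threshold checks in the examples can be expressed as a small rule table. The following is an illustrative sketch, not the Sure implementation: the alarm names and reference values are taken from the slide, while the `Rule` class, the metric keys (`loadavg`, `sshd_count`, `smart_failures`) and the `raised_alarms` function are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    metric: str                              # hypothetical metric key
    compare: Callable[[float, float], bool]  # comparison against the reference
    reference: float                         # reference value from the slide
    alarm: str                               # alarm name from the slide


RULES = [
    Rule("loadavg", operator.gt, 20, "Load_high"),
    Rule("sshd_count", operator.lt, 1, "sshd_dead"),
    Rule("smart_failures", operator.gt, 0, "smart_failure"),
]


def raised_alarms(sample, rules=RULES):
    """Return the alarms whose condition holds for one sample of metrics."""
    alarms = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped, not treated as failures.
        if value is not None and rule.compare(value, rule.reference):
            alarms.append(rule.alarm)
    return alarms


print(raised_alarms({"loadavg": 25.0, "sshd_count": 1, "smart_failures": 0}))
# ['Load_high']
```

Keeping the rules as data rather than code makes it easy to pair each alarm with its handling procedure. How Sure actually stores its rules is not described in the slides.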
Web visualization and framework
• LRF pre-processes part of the data from the Monitoring Repository and stores it in RRD files for fast visualization
• It groups the logical units (nodes) into clusters based on:
  • CDB [configuration database] definition
  • user-defined clusters
  • HW type
  • Racks
• A PHP-based web interface displays the pre-processed data on demand and, combined with CDB and status information, gives a general overview
• Check it at http://cern.ch/lemon-status
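Grouping nodes and summarizing a metric per group can be sketched as follows. This is only an illustration of the idea, assuming each node record carries its cluster and rack as plain fields. The node names, field names and values are made up, and the real LRF reads its grouping from CDB and writes to RRD files rather than to a dictionary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical node records; names and values are invented for illustration.
NODES = [
    {"name": "node001", "cluster": "batch", "rack": "R01", "loadavg": 0.5},
    {"name": "node002", "cluster": "batch", "rack": "R02", "loadavg": 1.5},
    {"name": "node003", "cluster": "disk", "rack": "R01", "loadavg": 3.0},
]


def cluster_average(nodes, group_by, metric):
    """Average one metric over nodes that share the same grouping key."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node[group_by]].append(node[metric])
    return {key: mean(values) for key, values in groups.items()}


print(cluster_average(NODES, "cluster", "loadavg"))  # {'batch': 1.0, 'disk': 3.0}
print(cluster_average(NODES, "rack", "loadavg"))     # {'R01': 1.75, 'R02': 1.5}
```

The same records can be regrouped by any key, which mirrors how the slide lists several alternative groupings (CDB definition, user-defined clusters, HW type, racks).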
Summary
• Lemon provides monitoring information about the computers in the Computer Center at CERN
• Thanks to its integration with Sure (the alarm system), it allows fast and easy identification and repair of problems
• In connection with CDB, it gives an easier overview of services and visualizes their performance
• In connection with Remedy (ITCM), it gives an overview of the problems for a given service