Lemon Tutorial: Lemon Overview
Miroslav Siket, Dennis Waldron
http://cern.ch/lemon
CERN-IT/FIO-FD

Tutorial
• Why?
  • The number of services is expanding. There is more to monitor every day.
• For whom?
  • Service managers, to configure monitoring of their services
  • Developers, to simplify their life when writing sensors
  • Site managers, to set up their monitoring instances

Tutorial Outline
• Architecture
• Writing sensors
• Running and configuring the agent
• Using Lemon tools
• Running Lemon server(s)
• Running and configuring the web interface
• Running the alarm system

Architecture

Architecture II
• Three layers:
  • Data producing/consuming
  • Data manipulation
  • Data storage

Client side
• Agent
  • forks sensors and communicates with them using a custom protocol over bi-directional "pipes"
  • configures metric instances of the metric classes of a sensor and pulls metrics from them
  • checks on the status of sensors
  • sends data to servers using TCP or UDP
  • monitors itself with the internal MSA sensor
  • caches data locally
The default Linux client distribution comes with the agent and the Linux and file sensors.
Footprint:
• agent: 5.5MB and 0.02% of CPU utilization*
• core sensors (Linux, file, exception): 10MB, 0.2% of CPU*
• parseLog: 9.4MB
C++ and Perl APIs are currently available.
* i386, SLC3/4, RHES3/4; average over the CERN CC

Server side
• Two implementations:
  • Oracle based: OraMon
    • optimized for high performance and for large Computer Centers
    • runs on Oracle 9i+ (with the alarm system on 10g)
    • validation of metric samples, metadata information
  • Flat-file based: FlatMon (edg-fmon-server)
    • uses OS files for storing data
    • for smaller sites (scalable to 1000 machines max.)
• General features:
  • multithreaded UDP/TCP server
  • built-in authentication mechanism

Server side - planning
Space considerations
• About 400kB of data per machine per day (Oracle Enterprise edition with compression); 700kB without compression (XE, Standard)
• About 1.2MB per machine per day for FlatMon
CPU considerations
• A dual PIV, 3GHz machine with 4GB of memory running the Oracle DB server + OraMon requires about 15% CPU for 4000 monitored machines
• Adding the alarm system on Oracle requires an additional 5% of CPU
• FlatMon saturates the above machine with 1000 monitored hosts
• OraMon/FlatMon require about 105MB of memory
Functionality considerations
• FlatMon does not provide metric checks and has no metadata concept
• The Lemon Alarm System (LAS) runs on Oracle as PL/SQL procedures and requires Oracle 10g; it is integrated with the OraMon schema in the Oracle database
• For an HA architecture, use Oracle RAC and multiple OraMon servers

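The space figures above lend themselves to a quick back-of-the-envelope estimate. The following is only a sketch: the per-machine daily volumes are the ones quoted on the slide, while the function name, the dictionary keys and the use of decimal units (1 GB = 1,000,000 kB) are assumptions made for illustration.

```python
# Per-machine daily data volumes quoted on the planning slide, in kB.
# The dictionary keys are illustrative labels, not Lemon identifiers.
KB_PER_MACHINE_PER_DAY = {
    "oramon_compressed": 400,    # Oracle Enterprise edition with compression
    "oramon_uncompressed": 700,  # Oracle XE / Standard, no compression
    "flatmon": 1200,             # FlatMon, about 1.2MB
}


def storage_gb(backend, machines, days):
    """Estimate repository growth in GB, assuming 1 GB = 1,000,000 kB."""
    return KB_PER_MACHINE_PER_DAY[backend] * machines * days / 1_000_000


if __name__ == "__main__":
    for backend in KB_PER_MACHINE_PER_DAY:
        estimate = storage_gb(backend, machines=1000, days=365)
        print(f"{backend}: about {estimate:.0f} GB per year for 1000 machines")
```

The estimate ignores indexes, metadata and retention policies, so it should be read as a lower bound for planning rather than a sizing rule.
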
User/administration tools
Lemon-cli
• Retrieves monitoring data from the local machine cache
• Allows retrieving data from the server
• Currently uses the SOAP interface (to be retired soon)
Lemon-host-check
• Checks the status of the machine based on the values of exceptions
• Checks the status of the monitoring agent and sensors
• Manages the status of exceptions

Configuration management
At CERN we use the Quattor Configuration Database
• Configuration is stored in hierarchical templates per domain/cluster/node
• The NCM framework is used to download the configuration XML profile to nodes
• NCM components are used:
  • For agent/sensor configuration: the fmonagent component
  • For server configuration (metadata): the oramonserver component
For smaller sites with homogeneous structures
• Use the default agent and sensor rpms from Lemon
• Use rpms for custom sensors/settings

Lemon RRD framework
• User front-end for visualization and caching of monitoring data
• Two layers
  • Pre-processing: consumes monitoring data and creates rrd files per machine/cluster/… (aging, averages) - lemonmrd
  • Visualization: uses rrd files for fast visualization, or direct access to the monitoring repository (status web pages)
• Different plugins/options available:
  • Synoptic display of the Computer Center (XML driven)
  • Lemon Alarm GUI
  • Quattor .tpl file browser, …
Requirements
• Web server with PHP (v5+ if you want to use LAS)
• rrdtool rpm
• 500kB of space per machine's rrd file

Automatic recovery actions and alarms
• Sensor exception
  • For defined values of measured metrics, an actuator is called with a predefined action
  • An example: ssh daemon dead; action: /sbin/service sshd start
  • Definition: metric X, field Y <op> reference value Z => call actuator
  • <op> can be ==, <, >, regexp, range, +, -, *, /, etc.
  • Each occurrence is logged in the Monitoring Repository
  • Already about 230 predefined exceptions with automatic recovery actions
• Exceptions are the basis for alarms in the Lemon Alarm System
  • Allow multi-valued metrics and on-behalf metrics
  • Allow corrective actions (actuators) up to n times or within a given time window
  • Allow distinguishing the alarm state (failed actuator, silenced, …)
• Example:
  • (10004:7 > 100 && (10005:3 - 34:5) > 100:56)
  • On behalf: (soap_srvx:302:1 > 10)

Lemon Alarm System
Newest addition to Lemon
Built on top of the OraMon schema in the Oracle database
Comes in two pieces:
• PL/SQL stored procedures (requires Oracle 10g) to consume exceptions and to produce alarms
• GUI: web-based interface based on AJAX, part of LRF
Features
• Reduction of alarms (by type or by node/cluster)
• Possibility to hide/inhibit alarms
• Access control
• History tracking
• Future: notifications, RSS feeds

Software distribution
RPM
• Direct download from http://lemon.web.cern.ch/lemon/downloads.shtml or http://linuxsoft.cern.ch/lemon/
• YUM setup with /etc/yum.repos.d/lemon.repo
    [lemon]
    name=Lemon
    baseurl=http://linuxsoft.cern.ch/lemon/linux/RPMS/i386/sl4/stable/
    enabled=1
    gpgcheck=1
    gpgkey=http://linuxsoft/lemon/RPM-GPG-KEY-lemon
• APT setup with /etc/apt/sources.list.d/lemon.list
    # Lemon stable
    rpm http://linuxsoft.cern.ch/lemon linux/RPMS/i386/sl4 lemon_stable_sl4
Source code
• CVS
    CVSROOT=:pserver:anonymous@isscvs.cern.ch:/local/reps/elfms

Future and additional information
Things not covered/under development
• XML gateway with APIs for several languages (C++, Perl, Python, Java, …)
• Python Sensor API
• LAS notification, RSS feeds
• Encryption of data between agent and server
• Authentication for user access
• Service views for LRF
Check the web pages at http://cern.ch/lemon for additional information

Lemon Tutorial: Sensor How-To

Outline
• Terminology
• Examples of existing sensors
• Considerations
• Live Examples
  • Hello World
  • Service based monitoring
• Do’s and Don’ts

Terminology
• Sensor:
  • A process or script which is connected to the lemon-agent via a bi-directional pipe and collects information on behalf of the agent. Sensors implement metric classes.
• Metric Class:
  • The equivalent of a class in OOP (Object-Oriented Programming).
• Metric Instance:
  • An instance (an object) of a metric class which has its own configuration data.
• Metric ID:
  • A unique identifier associated with a particular metric instance of a particular metric class.

Existing sensors
• At CERN:
  • Approx. 40 active sensors defined, providing 264 metrics and 227 exceptions.
• The default installation of the Lemon agent comes with three sensors:
  • MSA (built-in): self-monitoring of the agent.
  • Linux: performance, file system and process monitoring.
  • File: file tests, e.g. size, mtime, ctime.
  • Together they provide 135 metrics (51% of all CERN metrics)
• Other officially distributed sensors include:
  • exception: correlation sensor for generating alarms.
  • remote: provides ping and http web server checks.
  • oracle: Oracle database statistics monitoring.
  • parselog: log file parsing sensor.
  • All available from the Lemon software repository http://linuxsoft.cern.ch/lemon/
• Other contributed sensors are available from CVS:
    CVSROOT=:pserver:anonymous@isscvs.cern.ch:/local/reps/elfms/sensors

Considerations
• Question: What is your goal? How do you intend to use the monitoring information you collect?
• Is it for:
  • Pure data collection?
    • OK
  • Graphs displayed on the Lemon status pages?
    • Collecting data does not give you graphs immediately. This is not automatic!
  • Information to be alarmed?
    • Make sure the structure of the data you collect can be alarmed!
• Data that cannot be alarmed:
  • Timestamps as strings - NO
  • Timestamps as numbers - NO
  • Parsing of complex strings - NO

Considerations (II) - Use Case
• Grid Certificate Expiry Use Case
Outline: you wish to be notified, or to raise an alarm, if the Grid Certificate on a machine will expire in the next two weeks.
• You need 1 metric and 1 exception
  • The metric will record the expiry time of the certificate.
  • The exception will check the metric and decide if it expires in the next two weeks.
• The metric needs to be structured in such a way that the correlation unit of the exception sensor can understand it.
• Can I record the data as a:
  • String, e.g. “Sun Oct 8 16:05:47 2006”? NO (cannot be converted to a number)
  • UnixTime, e.g. “1160316347”? NO (the correlation unit doesn’t understand time, yet!)
• Solution:
  • Record the number of seconds until the certificate expires.
  • E.g. 1814400 seconds (3 weeks) can be alarmed mathematically: if metric < 1209600 (2 weeks) then raise alarm

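The solution above can be sketched in a few lines. This is an illustration only, not the Lemon sensor API: the function names are hypothetical, and the certificate expiry time is assumed to be already available as a Unix timestamp. The point is that the recorded value is a plain number that a threshold comparison can alarm on.

```python
import time

# 14 days in seconds: the 1209600 threshold used on the slide.
TWO_WEEKS = 14 * 24 * 3600


def seconds_until_expiry(expiry_unixtime, now=None):
    """Value to record as the metric: seconds left before the certificate expires.

    `now` is injectable so the calculation can be checked deterministically.
    """
    if now is None:
        now = time.time()
    return int(expiry_unixtime - now)


def should_alarm(metric_value, threshold=TWO_WEEKS):
    """Mimics the exception: raise an alarm if metric < threshold."""
    return metric_value < threshold


if __name__ == "__main__":
    expiry = 1160316347  # the UnixTime example from the slide
    remaining = seconds_until_expiry(expiry, now=expiry - 1814400)
    print(remaining, should_alarm(remaining))
```

With three weeks remaining the value is above the threshold and no alarm is raised; once the value drops below 1209600 the comparison becomes true.
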
Considerations (III)
• Misconception:
  • That in Lemon a metric has to be related to one and only one distinct piece of information (1 to 1 mapping)
• Not true:
  • A metric can be associated with multiple values and can have multiple rows, with each row identified by a unique key.

Considerations (IV) - Use Case
• Recording partition information
Outline: you would like to know the total size, the space used in megabytes, the space used as a percentage and the mount options of all mounted partitions on a machine.
• Under the idea of a 1 to 1 mapping, that’s 4 metrics per partition. An average machine may have 7 partitions (4 x 7 = 28 metrics in total).
• Why not:
  • Convert the data into a multi-valued metric?
  • 7 metrics each reporting 4 values. So,
    • Metric 1 total_space
    • Metric 2 space_used_mb
    • Metric 3 space_used_perc
    • Metric 4 mount_options
    Becomes:
    • Metric A total_space space_used_mb space_used_perc mount_options
• Go one step further:
  • Convert the data into a multi-valued, multi-rowed metric
  • 1 metric reporting the values for all mount points. So,
    • Metric A total_space space_used_mb space_used_perc mount_options
    Becomes:
    • Metric B mountname1 total_space space_used_mb space_used_perc mount_options
    • Metric B mountname2 total_space space_used_mb space_used_perc mount_options
    • ….
• Benefits:
  • Monitoring of new mount points is dynamic: there is no need for reconfiguration and no need to go through a registration process to get new metric ids.

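The multi-valued, multi-rowed layout can be pictured as one row per mount point, keyed by the mount name. The sketch below only illustrates that data shape; the input dictionary, its keys and the function name are assumptions made for the example and are not part of the Lemon sensor API.

```python
def partition_rows(partitions):
    """Turn per-partition data into rows of one multi-valued, multi-rowed metric.

    Each row is (mountname, total_space, space_used_mb, space_used_perc,
    mount_options), with the mount name acting as the unique row key.
    """
    rows = []
    for mount, info in sorted(partitions.items()):
        used_perc = round(100.0 * info["used_mb"] / info["total_mb"], 1)
        rows.append((mount, info["total_mb"], info["used_mb"], used_perc, info["options"]))
    return rows


if __name__ == "__main__":
    # Hypothetical sample data; a real sensor would read this from the system.
    sample = {
        "/tmp": {"total_mb": 2000, "used_mb": 500, "options": "rw"},
        "/": {"total_mb": 10000, "used_mb": 8100, "options": "rw,noatime"},
    }
    for row in partition_rows(sample):
        print("Metric B", *row)
```

A newly mounted partition simply shows up as an additional row, which is the dynamic behaviour listed under the benefits.
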
Example 1 - Hello World
Objective: To create a Perl sensor which records the value “Hello World” into Lemon.
• Simple sensor to demonstrate:
  • The generic build framework for sensors.
  • How to register your Perl module with the API.
  • How to register the metric classes that your module provides.
  • How to store the text “Hello World” into Lemon for the machine on which the sensor runs.
  • Running and debugging your sensor on the command line.
• Functions used:
  • registerVersion()
  • registerMetric()
  • storeSample01()
• Documented at: http://lemon.web.cern.ch/lemon/doc/howto/sensor_tutorial.shtml

Example 2 - Service Monitoring
Objective: To check if a webpage is available on a remote web server and record the HTTP response code under a service name.
• Demonstrates:
  • The basics of on-behalf reporting
  • The ability to parse configuration arguments
  • The ability to log messages
• Functions used:
  • registerMetric()
  • getParam()
  • log()
  • storeSample03()

Do’s and Don’ts
• Don’t:
  • Call die() or exit() from inside your sensor.
  • Open or write to files in locations writeable by non-root users, such as /tmp/
  • Read from filehandles (e.g. sockets) that may block. This will make your sensor unresponsive to requests from the agent.
  • Rely on, or have dependencies on, files on remote file systems such as AFS (Andrew File System). Your sensor should aim to have as few dependencies as possible.
• Do:
  • Document your sensor. Refer to the sensor tutorial to see how this can be done automatically for you.
  • If you have the ability to use a timeout around calls to databases and services like LSF, use it!
  • Make your metric classes configurable; avoid hard-coded paths to non-standard files.
  • Try to make your sensors as generic as possible so that others can benefit from your work.

Lemon Tutorial: Sensor Exception

Outline
• What is it?
• Configuration
• Correlation examples
• Actuators
• Dealing with transient alarms

What is it?
• Sensor-exception
  • An officially supported Lemon sensor coded in C++.
  • Developed in collaboration between CERN and BARC.
  • Implements the Lemon alarm protocol.
  • Has a LEX & YACC correlation engine which allows it to evaluate 1 or more metrics to determine if a problem exists on a machine.
  • Supports reporting alarms on behalf of other monitored entities.
  • Allows corrective actions (actuators) up to n times or within a given time window.
  • Is the primary interface for inserting alarms into the Lemon framework. The output of the sensor is used by LAS and lemon-host-check.
  • Provides one and only one metric class, “alarm.exception”
• Full documentation at:
  • http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml

Configuration
• The sensor has 6 configuration options:
  • Correlation
    • The power behind the sensor’s exception capabilities
    • Tells the sensor which metrics are involved in the alarm and how they should be evaluated
  • Actuator
    • The path to an actuator to run if the correlation string is true.
  • MaxRuns
    • The maximum number of times an actuator can run consecutively before a final alarm is generated
  • Timeout
    • The maximum number of seconds that an actuator is allowed to run before being terminated by the sensor.
  • MinOccurs
    • The minimum number of consecutive times a problem must be present before raising an alarm.
    • Good for dealing with transient alarms.
  • Silent
    • Defines whether the exception should run in silent mode. A silent exception will continue to be evaluated but the result will not be displayed on LAS or lemon-host-check.
    • Good for testing and deployment of new alarms.

Configuration (II)
• The basic format of a correlation is:
    [entity_name]:<metric_id>:<field_position> <operator> <reference_value> ...
• Where,
  • entity_name
    • An optional parameter, used for reporting on behalf of other entities
    • The name of the entity (wildcards ‘*’ are supported)
  • metric_id
    • The id of the metric to check
  • field_position
    • The field to use within the metric.
    • Allows the correlation to extract a single value from a multi-valued metric
  • operator
    • E.g. ==, !=, >, <, eq, ne, regex, !regex …
  • reference_value
    • A string or number against which metric_id:field_position is compared

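To make the format concrete, here is a small sketch that splits a single correlation term into its parts. It is not the LEX & YACC engine used by the sensor: it handles one term only (no &&, || or parentheses), supports just the operators listed above, and the function name and returned dictionary layout are assumptions made for illustration.

```python
import re

# One correlation term: [entity_name]:<metric_id>:<field_position> <operator> <reference_value>
TERM = re.compile(
    r"""
    ^\s*
    (?:(?P<entity>[^\s:]+):)?        # optional entity name, followed by ':'
    (?P<metric_id>\d+):              # metric id
    (?P<field>\d+)                   # field position
    \s*(?P<op>==|!=|>|<|eq|ne|!regex|regex)\s*
    (?P<ref>.+?)\s*$                 # reference value
    """,
    re.VERBOSE,
)


def parse_term(text):
    """Split one correlation term into entity, metric id, field, operator, reference."""
    match = TERM.match(text)
    if match is None:
        raise ValueError(f"not a valid correlation term: {text!r}")
    ref = match.group("ref")
    if len(ref) >= 2 and ref[0] == ref[-1] and ref[0] in "'\"":
        ref = ref[1:-1]              # quoted string reference
    else:
        try:
            ref = float(ref)         # numeric reference
        except ValueError:
            pass                     # leave unquoted non-numeric text as a string
    return {
        "entity": match.group("entity"),
        "metric_id": int(match.group("metric_id")),
        "field": int(match.group("field")),
        "op": match.group("op"),
        "ref": ref,
    }


if __name__ == "__main__":
    print(parse_term("9104:5 > 80"))
    print(parse_term("soap_srvx:302:1 > 10"))
```

The second call shows the on-behalf form, where the leading entity name is picked up separately from the metric id and field position.
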
Correlation Example (I)
• Objective:
  • To run an actuator when the occupancy of the /tmp partition is greater than 80%.
• Involved Metrics
  • 9104 (system.partitionInfo)
  • Field 1 = mountname, field 5 = percentage occupancy
• Correlation
    Correlation ((9104:1 eq '/tmp') && (9104:5 > 80))
    Actuator /usr/local/sbin/clean-tmp-partition -o 75
    MaxRuns 3 900
    Timeout 300

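The logic of this correlation can be mimicked against a multi-rowed metric as follows. This is a sketch under a stated assumption: both conditions are taken to apply to the same row of metric 9104, which the slide does not spell out. Fields 2 to 4 are not described in the source, so they are left as None placeholders, and the function names are hypothetical.

```python
def field(row, position):
    """Return a field using the 1-based positions used in correlation strings."""
    return row[position - 1]


def tmp_over_threshold(rows, mount="/tmp", limit=80):
    """True if some row has field 1 equal to `mount` and field 5 above `limit`."""
    for row in rows:
        if field(row, 1) == mount and field(row, 5) > limit:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical rows of metric 9104; fields 2-4 are placeholders.
    rows = [
        ("/", None, None, None, 95),
        ("/tmp", None, None, None, 85),
    ]
    print(tmp_over_threshold(rows))
```

A full root partition alone does not trigger the actuator here; only a /tmp row above 80% does.
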
Correlation Example (II)
• Objective:
  • To raise an alarm “lemon_agent_wrong” if the memory utilisation, CPU utilisation or number of errors in the agent’s log file is not within acceptable limits.
• Correlation
    10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
If the uptime of the agent (10004:1) is greater than 600 seconds AND at least one of the following holds:
• the CPU utilisation of the sensors (10004:7) over the last sampling interval is greater than 10%
• OR the memory consumed by the sensors (10004:8) is greater than 150 megabytes for machines of architecture type (4109:3) i386, or 600 megabytes for machines of architecture type x86_64
• OR the number of warning messages (10007:2) recorded over the last sampling interval is greater than 50
• OR the number of error messages (10007:3) recorded over the last sampling interval is greater than 10
• OR the number of fatal messages (10007:4) recorded over the last sampling interval is greater than 0
then raise an alarm.

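The grouping of the && and || terms is easier to check when written as an ordinary boolean expression. The sketch below mirrors the correlation one term at a time; the function and parameter names are hypothetical, and the memory value is assumed to be in kB because the slide equates 150000 with 150 megabytes.

```python
import re


def lemon_agent_wrong(uptime, cpu_perc, mem_kb, arch, warnings, errors, fatals):
    """Mirror of the correlation string, one Python term per correlation term."""
    return uptime > 600 and (                                    # 10004:1 > 600
        cpu_perc > 10                                            # 10004:7 > 10
        or (mem_kb > 150000 and arch == "i386")                  # 10004:8, 4109:3 eq 'i386'
        or (mem_kb > 600000 and re.search("64", arch) is not None)  # 4109:3 regex '64'
        or warnings > 50                                         # 10007:2 > 50
        or errors > 10                                           # 10007:3 > 10
        or fatals > 0                                            # 10007:4 > 0
    )


if __name__ == "__main__":
    # 200 MB of sensor memory is too much on i386 but acceptable on x86_64.
    print(lemon_agent_wrong(700, 5, 200000, "i386", 0, 0, 0))
    print(lemon_agent_wrong(700, 5, 200000, "x86_64", 0, 0, 0))
```

The uptime condition guards every other term, so nothing is reported during the first 600 seconds after the agent starts.
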
Actuators
• Information:
  • Run as forked processes.
  • Are connected to the sensor via a pipe.
  • All information written to stdout or stderr by the actuator is caught and recorded in the agent’s log file.
  • All actuator attempts are logged centrally and recorded locally in the agent’s log file.
• Running shell style actuators:
  • The system call used to run an actuator doesn’t provide shell style conveniences.
  • To use shell style syntax like *, &&, | etc. you must define your actuator like this:
      Actuator /bin/sh -c \\" /bin/echo 'This is a demo message from $HOSTNAME' \\"

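The difference between running a command directly and running it through /bin/sh -c can be reproduced outside Lemon. The sketch below uses Python's subprocess module purely as an analogy for the behaviour described above; it is not how the sensor launches actuators, the helper names are hypothetical, and it assumes a POSIX system with /bin/echo and /bin/sh.

```python
import shlex
import subprocess

COMMAND = "/bin/echo first && /bin/echo second"


def run_direct(command):
    """Run without a shell: '&&' is passed to /bin/echo as a literal argument."""
    result = subprocess.run(shlex.split(command), capture_output=True, text=True, timeout=5)
    return result.stdout


def run_in_shell(command):
    """Run through /bin/sh -c: the shell interprets '&&' and runs both commands."""
    result = subprocess.run(["/bin/sh", "-c", command], capture_output=True, text=True, timeout=5)
    return result.stdout


if __name__ == "__main__":
    print(repr(run_direct(COMMAND)))
    print(repr(run_in_shell(COMMAND)))
```

In both cases stdout is captured rather than printed by the child, similar in spirit to the way actuator output ends up in the agent's log file.
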
Dealing with transient alarms
• Why do we get transient alarms?
  • By default, monitoring isn’t very tolerant of outside interventions
  • There may be network issues.
  • A resource may be temporarily unavailable.
• What can be done?
  • Use the configuration option MinOccurs
  • MinOccurs gives an exception a level of tolerance: a delay between detecting a problem and raising an alarm

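The effect of MinOccurs can be sketched as a counter of consecutive detections. This illustrates the idea only and is not the sensor's implementation; the class and method names are hypothetical, and resetting the counter on the first clean sample is an assumption that follows from the word "consecutive" in the option's description.

```python
class MinOccursFilter:
    """Raise an alarm only after a problem is seen `min_occurs` times in a row."""

    def __init__(self, min_occurs):
        self.min_occurs = min_occurs
        self.consecutive = 0

    def update(self, problem_present):
        """Feed one evaluation result; return True if the alarm should be raised."""
        if problem_present:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a clean sample resets the count
        return self.consecutive >= self.min_occurs


if __name__ == "__main__":
    alarm_filter = MinOccursFilter(min_occurs=3)
    samples = [True, True, False, True, True, True, True]
    print([alarm_filter.update(sample) for sample in samples])
```

A short glitch of one or two samples never reaches the threshold, while a persistent problem raises the alarm after the third consecutive detection.
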
Lemon Tutorial: Quattor and Non-Quattor Configuration of the lemon-agent

Outline
• What is the agent?
• How to install the agent
• Configuring the agent
• Demonstration

What is the agent?
• A daemon on every monitored machine that is responsible for:
  • Launching sensors, scheduling requests and communicating with sensors.
  • Checking on the status of sensors.
  • Sending sensor information to the central Lemon servers using TCP and/or UDP.
  • Monitoring itself with the internal MSA sensor.
  • Caching data locally for use by other Lemon tools, e.g. lemon-host-check and lemon-cli
• Full documentation at: http://lemon.web.cern.ch/lemon/docs.shtml

Configuring the agent
• Two supported ways:
  • Quattor
    • Configuration is stored in hierarchical templates per domain/cluster/node
    • The NCM framework is used to download the configuration XML profile to nodes
    • NCM components are used to convert the XML profile information into the agent’s native configuration file structure.
    • Documented at: http://cern.ch/lemon/doc/howto/lemon_cdb_howto.shtml
  • Non-Quattor
    • Best suited for homogeneous sites.
    • Use the default agent and sensor rpms from Lemon
    • Use rpms for custom sensors/settings
• The agent supports a modular style of configuration, where configuration files are placed into subdirectories depending on their purpose:
  • /etc/lemon/agent/metrics/ <- metric configuration
  • /etc/lemon/agent/sensors/ <- sensor configuration
  • /etc/lemon/agent/transports/ <- transport configuration
• Both the Quattor and Non-Quattor styles of configuration can live together on the same machine.

Demonstration
• Installation of the agent and default sensors
  • rpm -Uvh edg-fabricMonitoring-agent-2.13.0-2.i386.rpm
  • rpm -Uvh lemon-sensor-exception-1.2.1-2.i386.rpm
• Configuration of:
  • General agent settings (/etc/lemon/agent/general.conf)
  • Servers (transports) (/etc/lemon/agent/udp.conf)
• Defining a new sensor
• Defining a new metric

Lemon Tutorial: lemon-host-check

Outline
• What is it?
• Demonstration

What is it?
• Lemon-host-check is:
  • The latest Lemon tool.
  • A tool for checking the current status of all configured exceptions on the machine.
  • A tool for managing the state of exceptions, with the ability to turn exceptions off and on, on the fly, without the need to reconfigure the agent.
  • The first command you should run whenever you believe monitoring is incorrect!
• Works by instructing the local agent to refresh all metrics contributing towards exceptions (raw metrics) and then requesting a refresh of all exceptions.
  • Uses fresh monitoring data.
• Fully documented at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml

Demonstration
• Installation of lemon-host-check
  • rpm -Uvh lemon-host-check-1.0.1-7.noarch.rpm
  • rpm -Uvh edg-fabricMonitoring-mrs-1.0.8-1.i386.rpm
• Show how to:
  • Interpret the information returned by lemon-host-check
  • Enable and disable exceptions
  • View pre-alarms, running actuators and disabled metrics

Lemon Tutorial: FlatMon and OraMon servers

Authentication
• Available in both the flat-file (FlatMon) and Oracle-based (OraMon) servers
• Used for both TCP and UDP (stateless connections)
• Uses the OpenSSL libraries with public key methods to authenticate: sign() and verify() methods
• Supports RSA/SHA1, MD5 and DSA/DSS1 algorithms with different key sizes (default = 1024 bit)
  • Fastest: RSA/SHA1
  • X509 would introduce too much overhead
• Three levels:
  • 0: no authentication
  • 1: authentication of signed packets; non-signed packets are also accepted
  • 2: full enforcement of authentication

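The three levels can be read as an acceptance policy. The sketch below is one interpretation, not the server's code: the function name and arguments are hypothetical, signature verification itself is left out (the servers use OpenSSL for that), and rejecting a signed packet whose signature fails verification at level 1 is an assumption.

```python
def accept_packet(level, signed, signature_valid=False):
    """Decide whether a sample packet is accepted under an authentication level.

    level 0: no authentication, everything is accepted
    level 1: signed packets must verify, non-signed packets are still accepted
    level 2: only packets with a valid signature are accepted
    """
    if level not in (0, 1, 2):
        raise ValueError(f"unknown authentication level: {level}")
    if level == 0:
        return True
    if signed:
        return signature_valid
    return level == 1


if __name__ == "__main__":
    for level in (0, 1, 2):
        print(level, accept_packet(level, signed=False), accept_packet(level, signed=True, signature_valid=True))
```

Level 1 is useful while keys are still being rolled out to clients, since unsigned agents keep reporting until level 2 is switched on.
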
Authentication - schema
(Diagram of nodes and servers; labels as extracted.)
• Node1: [rsa_encrypt(s.pub_key)] rsa_sign(n1.sec_key)
• Node2: [rsa_encrypt(s.pub_key)] rsa_sign(n2.sec_key)
• Node3: [rsa_encrypt(s.pub_key)] rsa_sign(n3.sec_key)
• Server1: rsa_verify(metric,n3.pub_key) [rsa_decrypt(s.sec_key)]
• Server2: rsa_verify(metric,n1.pub_key) [rsa_decrypt(s.sec_key)]
Legend:
• s.pub_key: server’s public key
• n(x).sec_key: agent’s secret key
• n(x).pub_key: agent’s public key

Setup of FlatMon
Quick overview:
• Install the server rpm
• Set up the /etc/lemon/server/edg-fmon-server.conf file
• Set up the /etc/lemon/server/keys directory with client keys
• Check authentication
• Check data arriving at the server
• Check log files for problems

Setup of OraMon
Quick overview:
• Rpm installation (lemon-ora-admin, lemon-OraMon)
• DBA creation of the schema (use an adapted lemon_user.sql)
• Setting up the schema for OraMon with lemon-ora.admin
• Configuring metadata information (/etc/oramon-server.conf)
• Configuring OraMon:
  • System settings with /etc/sysconfig/OraMon
  • Access settings with /etc/lemon/server/lemon-oramon-server.conf
• Checking the log file for problems
• Checking data with lemon-ora.retrieve
• Changing the metadata
• http://lemon.web.cern.ch/lemon/doc/components/lemon-ora-admin/index.html
• http://lemon.web.cern.ch/lemon/doc/components/oramon/index.html

Lemon Tutorial: LRF