Lemon/LAS for System Administrators
Overview
Miroslav Siket
http://cern.ch/lemon
CERN-IT/FIO-FD
[Figure: Lemon architecture diagram. Labels shown: RRDTool / PHP, apache, HTTP, TCP/UDP, Application Server, Oracle Database, Monitoring Agent, Web browser, Sensor (x3), Lemon schema, Repository backend, SQL, Nodes, Lemon CLI, Lemon-host-check, User]
Lemon Tutorial
Lemon/LAS building blocks
• Oracle DB server
  • Runs the LAS logic and stores the LAS data (PL/SQL)
• Lemon-server (application server)
  • Inserts exceptions into the Oracle DB
• Web server
  • Provides access to the LAS data in the Oracle DB for the LAS GUI (business logic) and lemon-cli
• Remote monitoring: ping, http
• SURE gateways for UIMON/AFS
Lemon/LAS hardware
• Two independent instances
  • Primary
    • Oracle DB: lemonrac (dbsrvd102, dbsrvd103)
    • Application server (lemon-server): lxmred0803/0603
    • Web server: lemonweb (lxmrec1601)
  • Secondary
    • Oracle DB and OraMon: lemondb2
    • Web server: lemonweb02
• Remote monitoring machines
  • lxmred0803 and lxmred0603
lemon-cli
• Command-line tool for extracting raw (uninterpreted) data from Lemon.
• Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from a remote server.
• Limitations
  • The local cache is limited to seven days' worth of history (purged every day by the agent).
  • The local cache contains much more information than is recorded at the server.
    • Why? Smoothing!
    • Smoothing is a mechanism that lets the agent be selective about the information it sends to the central servers.
• If the information you want is less than 7 days old, use the local cache!
• Full documentation at: http://cern.ch/lemon/doc/components/lemon-cli.shtml
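The slide does not say which rule the agent uses to decide what to send, so the following is only a minimal sketch of one plausible smoothing policy. It assumes a sample is forwarded when its value changes or when a maximum age has elapsed since the last forwarded sample. The names `Smoother`, `smooth` and `max_age` are hypothetical and are not part of Lemon.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, str]  # (timestamp in seconds, metric value)


@dataclass
class Smoother:
    """Hypothetical smoothing policy, not the agent's actual algorithm.

    A sample is forwarded when its value differs from the last forwarded
    value, or when at least max_age seconds have passed since that sample.
    """
    max_age: float
    last_value: Optional[str] = None
    last_sent: Optional[float] = None

    def should_send(self, timestamp: float, value: str) -> bool:
        changed = value != self.last_value
        stale = self.last_sent is None or timestamp - self.last_sent >= self.max_age
        if changed or stale:
            # Remember what was forwarded so later samples compare against it.
            self.last_value = value
            self.last_sent = timestamp
            return True
        return False


def smooth(samples: List[Sample], max_age: float) -> List[Sample]:
    """Return the subset of samples that would be sent to the server."""
    smoother = Smoother(max_age=max_age)
    return [s for s in samples if smoother.should_send(*s)]
```

Under this assumed policy every sample still lands in the local cache, while the server only receives the ones `smooth` returns, which is why the local cache can hold more detail than the server.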
lemon-cli (II) - Examples
• Resolving a metric id to a name
  • lemon-cli -m syslog
  • Displays all the metrics whose name contains 'syslog'
• Referencing time periods (--end, --start), e.g.
  • 1h = 1 hour
  • 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
  • Also supports log file timestamps, e.g. Thu 02 Nov 2006 10:45:00 (no guarantees!)
• When querying remotely, -n accepts the same node name expansion criteria as wassh, e.g.
  • lemon-cli -m 10005 -n lxb[0001-1000] --server
• All alarms can be seen on the machine using
  • lemon-cli --class "alarm.exception"
  • 1 005, 1 135 and 1 000 are alarms
• lemon-host-check interprets all the codes for you!
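To make the time period notation concrete, here is a small sketch that converts a period such as 2d3h36m44s into seconds. It assumes the units always appear in the order days, hours, minutes, seconds, as in the examples above. `parse_period` is a hypothetical helper written for illustration; it is not lemon-cli's own parser.

```python
import re

# Units in the order used by the slide's examples: days, hours, minutes, seconds.
_UNIT_SECONDS = (86400, 3600, 60, 1)
_PERIOD_RE = re.compile(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_period(text: str) -> int:
    """Convert a period like '2d3h36m44s' into a number of seconds."""
    match = _PERIOD_RE.fullmatch(text)
    # The pattern also matches the empty string, so reject that explicitly.
    if not text or match is None:
        raise ValueError(f"not a valid period: {text!r}")
    return sum(
        int(amount) * seconds
        for amount, seconds in zip(match.groups(), _UNIT_SECONDS)
        if amount is not None
    )
```

For example, `parse_period("1h")` gives 3600 and `parse_period("2d3h36m44s")` gives 185804.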
lemon-host-check (I)
• Aim: to provide a command-line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.
• Uses the information recorded in the agent's local cache (requires /var/ to be writable!).
• Makes sure that the information reported to you is up to date (fresh!).
• Checks that all sensors are running, and that one and only one agent process is running.
• You must be logged in as root!
• Full documentation at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml
lemon-host-check (II) - Examples
• Check for active alarms on the machine
  • lemon-host-check
• Disable the alarms "syslogd" and "klogd"
  • lemon-host-check --disable "30023,30032"
• Show alarms even if they are disabled
  • lemon-host-check --force
• Disable all alarms for the next 1 hour, 30 minutes and 23 seconds
  • lemon-host-check --disable-all --duration 1h30m23s "demo intervention"
• View a list of all disabled alarms
  • lemon-host-check --list
• Enable all alarms
  • lemon-host-check --enable-all
• Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd --co fmonagent run.
lemon-host-check (III)
• Pre-alarms
  • A recent concept added to Lemon.
  • Aimed at dealing with transient alarms.
  • Real use case:
    • high_load (30008) has pre-alarm capabilities! When high load is detected on the machine, a pre-alarm is raised (not visible on LAS). If the alarm persists for more than 10 minutes, it becomes a proper alarm. This allows high-load spikes on machines/clusters such as lxplus to be ignored.
  • Not visible by default in lemon-host-check
• Caution:
  • If you have a high_load alarm and restart the agent, the alarm will disappear! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (a new ITCM ticket).
  • Don't restart the agent unless you absolutely need to (reconfiguration, errors in the edg-fmon-agent.log, ...).
  • If you have to restart, use 'lemon-host-check --show-all' afterwards. Note: make sure to check the status of the alarm! You need to ignore the disabled ones, if any.
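The pre-alarm behaviour described above can be pictured as a small state machine. The sketch below is an illustration only and assumes the agent keeps the "first seen" time in memory, which would explain why a restart makes the alarm disappear for another 10 minutes. The class `PreAlarm` and its method names are hypothetical and do not come from the Lemon code.

```python
from typing import Optional

PRE_ALARM_DELAY = 10 * 60  # seconds; the 10 minutes quoted on the slide


class PreAlarm:
    """Hypothetical model of a pre-alarm: a condition must persist
    for longer than `delay` seconds before it counts as a proper alarm."""

    def __init__(self, delay: float = PRE_ALARM_DELAY) -> None:
        self.delay = delay
        # Assumed to live only in memory, so it is lost when the agent restarts.
        self.first_seen: Optional[float] = None

    def update(self, now: float, condition: bool) -> str:
        """Feed one observation and return 'ok', 'pre-alarm' or 'alarm'."""
        if not condition:
            self.first_seen = None  # a transient spike is forgotten
            return "ok"
        if self.first_seen is None:
            self.first_seen = now
        if now - self.first_seen > self.delay:
            return "alarm"
        return "pre-alarm"
```

In this model, creating a fresh `PreAlarm` plays the role of an agent restart: a condition that was already a proper alarm drops back to a pre-alarm and needs another full delay before it resurfaces.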
lemon-host-check (IV)
• Common errors:
  • "No monitoring agent process running" / "Too many monitoring agent processes running"
    • service edg-fmon-agent restart
    • If that fails, contact project-elfms-lemon@cern.ch
  • "Possible false exception"
    • lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it could not find out whether an alarm was present for a particular exception, it assumes the worst-case scenario: that an alarm does exist but may not be real (possibly false).
    • Why?
      • The agent may be too busy to answer lemon-host-check
      • Some sensors may have failed to retrieve the necessary information
    • Solution
      • Re-run lemon-host-check
      • If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If there are any, run spma_wrapper.sh on the machine to get the latest sensor code, if any, then ncm-ncd --co fmonagent to reconfigure the agent.
      • Try again
      • If it is still failing, contact the service manager and CC project-elfms-lemon@cern.ch
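The "possible false exception" logic amounts to a query with a timeout and a pessimistic default. The following is a rough sketch of that idea, not the tool's actual implementation: `check_exception` and the `query` callable are hypothetical, and `query` is assumed to return True (alarm present), False (no alarm) or None (the agent could not tell).

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional


def check_exception(query: Callable[[], Optional[bool]], timeout: float = 60.0) -> str:
    """Ask the agent about one exception, assuming the worst if it cannot answer.

    Returns 'alarm', 'ok' or 'possible false exception'.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(query)
    try:
        active = future.result(timeout=timeout)
    except FutureTimeout:
        # The agent was too busy to answer in time: report the worst case.
        return "possible false exception"
    finally:
        # Do not block on a query that is still running.
        executor.shutdown(wait=False)
    if active is None:
        # The agent answered but had no data, e.g. a sensor failed.
        return "possible false exception"
    return "alarm" if active else "ok"
```

The design choice mirrors the slide: an unanswered question is reported as a possible alarm rather than silently treated as healthy, so re-running the check is the first step before escalating.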
FAQ
Are monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
• Linux (lemon agent, ping, http check)
• Solaris (lemon agent, UIMON)
• Windows (ping, http)
Is there any limitation that we should be aware of on the other OSs/platforms?
• AFS machines have their own monitoring tools; no information is available
• UIMON-monitored machines run the UIMON process and a multiplexer to send alarms
Is there any load balancing (DNS) and/or redundancy? Is the front-/backend part of the failover?
• Yes, HA for lemon-server on lxmred0803 and lxmred0603
• Oracle RAC (dbsrvd102/103)
• Two independent instances (lemondb2/lemonweb02 and lemonrac/lxmred0803/0603)
FAQ (II)
What should we do in the case of a piquet call about a failure on these server(s)?
• The operators' LAS procedures do not have any piquet actions defined. All other failures are covered by the standard OS/hardware procedures that they already have. There is nothing LAS-specific for them.
How should the correlation rules be interpreted? Could you explain the syntax found in the Remedy ticket?
• Full documentation with examples at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
• Example: (lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)
A means to detect when a node started to be "alarmed" and when this stopped.
• The /var/log/ncm/component-setodesiredstate.log* log file on the machine in question
What should be expected from them if no alarm can be displayed anymore at 3:00 AM and they have been called by the operator?
• No piquet service is defined for LAS. If LAS does not work, operators have procedures for finding out the state of the LAS: check http://lemon.web.cern.ch/lemon/cern/las_procedures.shtml