LAS for System Administrators LAS overview Miroslav Siket, Dennis Waldron http://cern.ch/lemon CERN-IT/FIO-FD
LAS building blocks
• Oracle DB server
  • running LAS logic and storing LAS data (PL/SQL)
• OraMon: application server
  • Inserting exceptions into the Oracle DB
• Web server
  • Providing access to LAS data from the Oracle DB to the LAS GUI (business logic)
• Remote monitoring: ping, http
• SURE gateways for UIMON/AFS
Lemon Tutorial
LAS hardware
• Two independent instances
  • Primary
    • Oracle DB and OraMon: lemondb1
    • Web server: lemonweb02
  • Secondary
    • Oracle DB and OraMon: lemondb2
    • Web server: lemonweb01
• Remote monitoring machines
  • lxfsrk4104 (aliased as lemonmr and lemonr01)
  • lxservb01 (aliased as lemonr02)
Oracle DB server check
• Log in to the machine (lemondb1 or lemondb2):
  > source ~oracle/.oraprofile.LEMON*
  > tnsping LEMON_A   (LEMON_C for lemondb2)
• Check the output of the tnsping command. Example: OK (0 ms)
OraMon check
• Already checked by the LAS GUI
• lemon-host-check
• ORAMON_WRONG procedure
• Log file: /var/log/OraMon.log
Apache web server check
• Already checked by the LAS GUI
• lemon-host-check
• HTTPD_WRONG procedure
• Log file: /var/log/httpd/error_log
Remote monitoring check
• Runs as a sensor (remote) on the remote monitoring machines
• lemon-host-check
• Agent log file: /var/log/edg-fmon-agent.log
SURE gateways for UIMON/SURE
• Runs as a sensor (suregateway) on the remote monitoring machines
  • Agent process and log file
• ISSUE: AFS machines
  • Use the lemon-sure-multiplexer process as a gateway
  • lxfsrk4104 only
  • Check that the daemon exists; log file: /var/log/lemon-sure-multiplexer.log
lemon-cli
• Command line tool for extracting raw (uninterpreted) data from lemon.
• Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from a remote server over SOAP (aliased as lemonmr, physical machine: lxfsrk4104).
• Limitations
  • The local cache is limited to seven days' worth of history (purged every day by the agent).
  • Remote server queries are limited to 20,000 returned results.
    • This limitation will be removed when the new lemon API is deployed (end of Q4, beginning of Q1 2007).
• The local cache contains much more information than is recorded at the server.
  • Why? Smoothing!
  • Smoothing is a mechanism which allows the agent to be selective about the information it sends to the central servers.
  • If the information you want is less than 7 days old, use the local cache!
• Full documentation at: http://cern.ch/lemon/doc/components/lemon-cli.shtml
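To make the effect of smoothing concrete, here is a minimal Python sketch of one possible policy: forward a sample only when its value has changed noticeably or when enough time has passed since the last forwarded sample. The slides do not describe the agent's actual rule, so the function name, the `max_diff` and `max_time` parameters, and the sample data are all assumptions made for illustration.

```python
def smooth(samples, max_diff, max_time):
    """Return the (timestamp, value) samples that would be forwarded.

    Hypothetical policy (an assumption, not the agent's documented rule):
    forward the first sample, then forward a sample only if its value
    differs from the last forwarded value by more than max_diff, or if at
    least max_time seconds have passed since the last forwarded sample.
    """
    sent = []
    for ts, value in samples:
        if not sent:
            sent.append((ts, value))
            continue
        last_ts, last_value = sent[-1]
        if abs(value - last_value) > max_diff or ts - last_ts >= max_time:
            sent.append((ts, value))
    return sent


# Made-up load samples taken once a minute; every one stays in the local cache.
local_cache = [(0, 1.00), (60, 1.01), (120, 1.02), (180, 1.50),
               (240, 1.51), (300, 1.52), (360, 1.52)]
forwarded = smooth(local_cache, max_diff=0.1, max_time=300)
print(len(local_cache), "samples cached locally,", len(forwarded), "forwarded")
```

Under this assumed policy only the first sample and the jump at t=180 reach the server, which is why the local cache is the richer source for recent history.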
lemon-cli (II) - Examples
• Resolving a metric id to a name
  • lemon-cli -m syslog
  • Displays all the metrics whose name contains 'syslog'
• Referencing time periods (--end, --start), e.g.
  • 1h = 1 hour
  • 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
  • Also supports log file timestamps, e.g. Thu 02 Nov 2006 10:45:00 (no guarantees!)
• When querying remotely, -n accepts the same node name expansion criteria as wassh, e.g. lemon-cli -m 10005 -n lxb[0001-1000] --server
• All alarms can be seen on the machine using
  • lemon-cli --class "alarm.exception"
  • 1 005, 1 135 and 1 000 are alarms
  • lemon-host-check interprets all the codes for you!
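The two notations used in these examples, duration strings such as 2d3h36m44s and node ranges such as lxb[0001-1000], can be illustrated with a short Python sketch. This is not lemon-cli or wassh code: the function names are hypothetical, and the range expansion handles only a single zero-padded numeric range, which is an assumption about the simplest form of the syntax.

```python
import re
from datetime import timedelta

_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text):
    """Convert a string such as '2d3h36m44s' into a timedelta."""
    if not re.fullmatch(r"(\d+[dhms])+", text):
        raise ValueError(f"not a duration: {text!r}")
    parts = {}
    for amount, unit in re.findall(r"(\d+)([dhms])", text):
        key = _UNITS[unit]
        parts[key] = parts.get(key, 0) + int(amount)
    return timedelta(**parts)


def expand_nodes(pattern):
    """Expand 'lxb[0001-0003]' into ['lxb0001', 'lxb0002', 'lxb0003']."""
    match = re.fullmatch(r"(.*?)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # plain node name, nothing to expand
    prefix, first, last, suffix = match.groups()
    width = len(first)  # keep the zero padding of the lower bound
    return [f"{prefix}{n:0{width}d}{suffix}"
            for n in range(int(first), int(last) + 1)]


print(parse_duration("2d3h36m44s"))
print(len(expand_nodes("lxb[0001-1000]")), "node names")
```

The sketch is only meant to show what the strings denote; the real tools may accept more forms (for example comma-separated lists) than this covers.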
lemon-host-check (I)
• Aim: to provide a command line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.
• Uses the information recorded in the agent's local cache (requires /var/ to be writeable!).
• Makes sure that the information reported to you is up to date (fresh!).
• Checks that all sensors are running, and that one and only one agent process is running.
• Must be logged in as root!
• Full documentation at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml
lemon-host-check (II) - Examples
• Check for active alarms on the machine
  • lemon-host-check
• Disable the alarms "syslogd and klogd"
  • lemon-host-check --disable "30023,30032"
• Show alarms even if they are disabled
  • lemon-host-check --force
• Disable all alarms for the next 1 hour, 30 minutes and 23 seconds
  • lemon-host-check --disable-all --duration 1h30m23s "demo intervention"
• View a list of all disabled alarms
  • lemon-host-check --list
• Enable all alarms
  • lemon-host-check --enable-all
• Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd --co fmonagent run.
lemon-host-check (III)
• Pre-alarms
  • A recent concept added to lemon.
  • Aimed at dealing with transient alarms.
• Real use case:
  • high_load (30008) has pre-alarm capabilities! When high load is detected on the machine, a pre-alarm is raised (not visible on LAS). If the alarm persists for more than 10 minutes it becomes a proper alarm. This allows high load spikes on machines/clusters such as lxplus to be ignored.
  • Pre-alarms are not visible by default in lemon-host-check.
• Caution:
  • If you have a high_load alarm and restart the agent, the alarm will disappear! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (a new ITCM ticket).
  • Don't restart the agent unless you absolutely need to (reconfiguration, errors in the edg-fmon-agent.log, ...).
  • If you have to restart, use 'lemon-host-check --show-all' afterwards. Note: make sure to check the status of the alarm! You need to ignore the disabled ones, if any.
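The pre-alarm behaviour, including why an agent restart makes a high_load alarm vanish for 10 minutes, can be sketched in a few lines of Python. This is an illustration of the rule described on the slide, not the agent's implementation; the function name, the state labels and the idea that a restart simply resets the detection time are assumptions.

```python
from datetime import datetime, timedelta

PROMOTION_DELAY = timedelta(minutes=10)  # the 10 minutes quoted for high_load


def alarm_state(first_detected, now, delay=PROMOTION_DELAY):
    """Return 'ok', 'pre-alarm' or 'alarm' for a single exception.

    first_detected is the time the condition was first seen by the
    currently running agent, or None if the condition is not present.
    """
    if first_detected is None:
        return "ok"
    if now - first_detected > delay:
        return "alarm"
    return "pre-alarm"


start = datetime(2006, 11, 2, 10, 45)
print(alarm_state(start, start + timedelta(minutes=5)))   # still a pre-alarm
print(alarm_state(start, start + timedelta(minutes=11)))  # now a proper alarm

# Assumed effect of a restart: the agent forgets when the condition started,
# so the same unresolved problem goes back to being a pre-alarm.
restarted_at = start + timedelta(minutes=12)
print(alarm_state(restarted_at, restarted_at + timedelta(minutes=1)))
```

Seen this way, the advice on the slide follows directly: after a restart the alarm is not gone, its timer has only started again.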
lemon-host-check (IV)
• Common errors:
No monitoring agent process running / Too many monitoring agent processes running
• service edg-fmon-agent restart
• If that fails, contact project-elfms-lemon@cern.ch
Possible false exception
• lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it could not find out whether an alarm was present for a particular exception, it assumes the worst case: that an alarm does exist but may not be real (possibly false).
• Why?
  • The agent may be too busy to answer lemon-host-check.
  • Some sensors may have failed to retrieve the necessary information.
• Solution
  • Re-run lemon-host-check.
  • If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If there are any, run spma_wrapper.sh on the machine to get the latest sensor code (if any), then ncm-ncd --co fmonagent to reconfigure the agent.
  • Try again.
  • If it is still failing, contact the service manager and CC project-elfms-lemon@cern.ch
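The "possible false exception" rule is a worst-case default: an exception for which no answer arrived in time is reported as a possible alarm rather than silently dropped. A minimal Python sketch of that decision follows; the function name, the dictionary layout and the status strings are hypothetical, and only the rule itself comes from the slide.

```python
def classify(answers, exception_ids):
    """Map each exception id to 'alarm', 'ok' or 'possible false exception'.

    answers holds the ids for which the agent replied before the timeout,
    mapped to True (alarm present) or False (no alarm). Ids missing from
    answers got no reply, so the worst case is assumed for them.
    """
    report = {}
    for eid in exception_ids:
        if eid not in answers:
            report[eid] = "possible false exception"
        elif answers[eid]:
            report[eid] = "alarm"
        else:
            report[eid] = "ok"
    return report


# 30008 answered with an alarm, 30023 answered clean, 30032 never answered.
print(classify({30008: True, 30023: False}, [30008, 30023, 30032]))
```

Re-running the check, as the slide suggests, amounts to giving the unanswered ids another chance to reply.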
FAQ
Are the monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
• Linux (lemon agent, ping, http check)
• Solaris (lemon agent, UIMON)
• Windows (ping, http)
Is there any limitation that we should be aware of on the other OSs/platforms?
• AFS machines have their own monitoring tools; no information is available.
• UIMON-monitored machines run the UIMON process and a multiplexer to send alarms to the suregateway sensor on the remote monitoring machines.
We knew node polling in SURE; what is implemented in Lemon?
• The remote sensor on the remote monitoring machines
Is there any load balancing (DNS) and/or redundancy? Are the front end and back end part of the failover?
• No, just two independent instances running in parallel.
• In future (with RAC) there will be failover for OraMon and only one Oracle DB.
FAQ (II)
What should we do in case of a piquet call about a failure on these server(s)?
• The operators' LAS procedures do not have any piquet actions defined. All other failures are covered by the standard OS/hardware procedures that they already have. There is nothing LAS-specific for them.
How should the correlation rules be interpreted? Could you explain the syntax found in the Remedy ticket?
• Full documentation with examples at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
• Example: (lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)
LAS reduction rules and multi-host tickets: a direct mapping?
• Several use cases, e.g. 12 x spma_wrong on 12 nodes of cluster YYY:
  • One LAS item if the number of machines reaches 51% of the active nodes in the cluster
  • Several LAS items if they appear in bursts and the alarm has already been reduced
  • Individual machine LAS items if below 51%
  • If new machines appear, there will be a new reduced LAS item for each set of them
Is there a means to detect when a node started to be "alarmed" and when this stopped?
• The /var/log/ncm/component-setodesiredstate.log* log file on the machine in question
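The 51% reduction rule can be written down as a small Python sketch. It covers only the two simplest cases from the list above (one reduced item at or above the threshold, individual items below it) and deliberately ignores the burst and "new machines" cases; the function name and the returned tuples are hypothetical and do not reflect how LAS stores its items.

```python
def las_items(alarm, alarmed_nodes, active_nodes, threshold_percent=51):
    """Return the LAS items raised for one alarm on one cluster.

    alarmed_nodes: names of the nodes currently raising the alarm.
    active_nodes: number of active nodes in the cluster.
    """
    if active_nodes <= 0:
        raise ValueError("cluster has no active nodes")
    # Integer arithmetic avoids floating point rounding at the boundary.
    if len(alarmed_nodes) * 100 >= threshold_percent * active_nodes:
        return [(alarm, tuple(alarmed_nodes))]        # one reduced item
    return [(alarm, (node,)) for node in alarmed_nodes]  # one item per node


nodes = [f"node{n:02d}" for n in range(1, 13)]  # 12 made-up node names
print(len(las_items("spma_wrong", nodes, active_nodes=20)))  # 60% of cluster
print(len(las_items("spma_wrong", nodes, active_nodes=30)))  # 40% of cluster
```

With 12 alarmed nodes, a cluster of 20 active nodes yields a single reduced item, while a cluster of 30 yields 12 individual ones.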
FAQ (III)
What should be expected from them if no alarm can be displayed any more at 3:00 AM and they have been called by the operator?
• No piquet service is defined for LAS. If LAS does not work, the operators have procedures for finding out the state of LAS; see http://lemon.web.cern.ch/lemon/cern/las_procedures.shtml
QUESTIONS?