Nagios on Tier1 farm
Jonathan Wheeler, RAL Tier1 Fabric Team
20th June 2008
Overview
• What we had before (Sure)
• Introduction to Nagios and how it is configured for the farm
• What might we do next
Sure monitoring - 1
• Consists of a server and clients
• Communication via sysreq command
• Required scripts set up for each client to run checks and report results to server
Sure monitoring - 2
3 main tasks:
• check host alive
  • active, using ping
  • passive, accepting heartbeat messages
• receive alarm messages
• receive “backup started” and “backup finished” messages
Sure monitoring - 3
Problems:
• configuration not directly under Tier1 control
• requires locally written and locally maintained scripts
• limited view of farm alarms and state
• alarms only visible on server screen
Introduction to Nagios
• highly configurable
• under active development (Nagios 2.11 legacy, Nagios 3.0.2 latest stable)
• active user community (mailing list)
• some commercial offerings
• extensive documentation included with the installation
• allows local extensions
Introduction to Nagios – basics - 1
Nagios:
• schedules test commands, for example: is the space used in the /var filesystem larger than the permitted limit?
• accepts results as a return code (0 – OK, 1 – warning, 2 – critical, 3/-1 – unknown) and a single-line message
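The return-code convention above can be illustrated with a minimal sketch of a disk-space test. This is not one of the Tier1 plugins; the function names and the 80%/90% thresholds are assumptions chosen for illustration.

```python
import shutil

# Return codes from the slide: 0 OK, 1 warning, 2 critical, 3 unknown.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(path, percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to (return code, single-line message).

    The warn/crit thresholds are illustrative defaults, not Tier1 values.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {path} is {percent_used:.0f}% full"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {path} is {percent_used:.0f}% full"
    return OK, f"DISK OK - {path} is {percent_used:.0f}% full"


def check_filesystem(path="/var", warn=80.0, crit=90.0):
    """Measure usage of `path` and classify it; UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return classify(path, 100.0 * usage.used / usage.total, warn, crit)


def main():
    """Print the one-line message and return the code for sys.exit()."""
    code, message = check_filesystem()
    print(message)
    return code
```

Run as a plugin, the script would finish with `sys.exit(main())` so that Nagios sees the return code and the one-line message.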
Introduction to Nagios – basics - 2
Nagios (continued):
• displays results via a Web interface to authorised users
• sends notifications via e-mail, SMS, RSS, Morse code, jungle drums etc.
• may run an event handler, e.g. if a test fails, then put this batch node offline
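As a rough sketch of the event-handler idea, the decision logic might look like the following. It assumes the handler is passed the service state and state type (Nagios can supply these through macros such as $SERVICESTATE$ and $SERVICESTATETYPE$); the `offline-batch-node` command is a hypothetical placeholder, not a real tool or the Tier1 script.

```python
def should_act(state, state_type):
    """Act only on a confirmed (HARD) CRITICAL result, not on a first failure."""
    return state == "CRITICAL" and state_type == "HARD"


def handle_event(state, state_type, host, offline_command=("offline-batch-node",)):
    """Return the command line to run for this event, or None for no action.

    `offline_command` is a hypothetical placeholder for whatever the batch
    system provides to take a node out of service.
    """
    if not should_act(state, state_type):
        return None
    return [*offline_command, host]
```

Returning the command line rather than executing it keeps the sketch easy to test; a real handler would pass the result to something like `subprocess.run`.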
Introduction to Nagios – networked clients
• Nagios server can use the check_nrpe command to run a test on a networked client
• client must be running the nrpe process to
  • accept and run check requests
  • collect the results and return them to the server
• Nagios server can also use ssh or smtp to perform checks (little experience on Tier1)
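A minimal sketch of how a server-side script might call check_nrpe is shown below. The `-H`, `-c` and `-t` options are standard check_nrpe options; the plugin path and the command name `check_var` are assumptions, and the command must be defined in the client's nrpe configuration.

```python
import subprocess

UNKNOWN = 3  # Nagios return code for "unknown"

# Assumed install location; it varies between distributions.
DEFAULT_PLUGIN = "/usr/lib/nagios/plugins/check_nrpe"


def build_nrpe_command(host, command, timeout=10, plugin=DEFAULT_PLUGIN):
    """Build the check_nrpe argument list for one remote check."""
    return [plugin, "-H", host, "-c", command, "-t", str(timeout)]


def run_nrpe_check(host, command, timeout=10, plugin=DEFAULT_PLUGIN):
    """Run the remote check and return (return code, first line of output)."""
    argv = build_nrpe_command(host, command, timeout, plugin)
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout + 5)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing plugin or a hung call is reported as UNKNOWN.
        return UNKNOWN, f"NRPE UNKNOWN - {exc}"
    lines = done.stdout.splitlines()
    return done.returncode, (lines[0] if lines else "")
```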
Single server, many clients
[Diagram: one Nagios server, four Nagios clients]
Introduction to Nagios – slave servers
Running scheduled checks and the web server puts a heavy load on a single Nagios server.
Tier1 uses master and slave servers:
• master keeps all results, runs the web server and sends notifications
• slaves schedule tests, run them and return results to the master (using the send_nsca command to the nsca daemon)
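The hand-off from slave to master can be sketched as follows. send_nsca reads one tab-separated line per service result (host, service description, return code, plugin output) on standard input; the master host name, the service name and the configuration path used here are assumptions for illustration.

```python
def format_passive_result(host, service, return_code, output):
    """Format one service check result as a send_nsca input line."""
    # Tabs and newlines would break the tab-separated format, so flatten them.
    clean = " ".join(output.split())
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def build_send_nsca_command(master, config="/etc/nagios/send_nsca.cfg"):
    """Build the send_nsca argument list; the config path is an assumed default."""
    return ["send_nsca", "-H", master, "-c", config]
```

A slave would then pipe the formatted line into the command, for example with `subprocess.run(argv, input=line, text=True)`.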
Introduction to Nagios – “freshness”
If a slave server has crashed:
• master server checks whether tests have been run to schedule (freshness checking)
• if a test is stale (results not returned on schedule), the master runs the test itself (forced check)
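The freshness rule amounts to a simple age comparison. The sketch below is an illustration of that logic, not Nagios's own implementation; the threshold value and the check names are assumptions.

```python
def is_stale(last_result_time, freshness_threshold, now):
    """A result is stale if it is older than the freshness threshold (seconds)."""
    return (now - last_result_time) > freshness_threshold


def checks_to_force(last_results, freshness_threshold, now):
    """Return the names of checks whose latest result is stale.

    `last_results` maps a check name to the Unix time of its last result.
    """
    return sorted(name for name, when in last_results.items()
                  if is_stale(when, freshness_threshold, now))
```

For example, with a 300-second threshold, a result last received 700 seconds ago is listed for a forced check, while one received 100 seconds ago is not.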
Master and slave servers; many clients
[Diagram: one master server, three slave servers, nine clients]
Introduction to Nagios – clearing alarms
If the check condition has been corrected and you want to clear the alarm before the next scheduled test:
• can force a check (from master or slave) by issuing an appropriately formatted command to the server
• scripts are available to do this
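One way to issue such a command is through the Nagios external command file, using the SCHEDULE_FORCED_SVC_CHECK command. The sketch below is not the Tier1 script; the service name in the usage note is hypothetical, and the location of the command file depends on the installation, so it is passed in as a parameter.

```python
import time


def format_forced_check(host, service, when=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK external command line.

    Layout: [timestamp] SCHEDULE_FORCED_SVC_CHECK;host;service;check_time
    """
    if when is None:
        when = int(time.time())
    return f"[{when}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{when}\n"


def submit_command(command_line, command_file):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

For instance, `format_forced_check("ganglia0430", "disk-var")` builds a request to re-run a hypothetical `disk-var` service on that host immediately.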
Introduction to Nagios - configuration
In our configuration Nagios knows about:
• hosts
• host groups
• services (for checking)
• contacts and contact groups
• time periods (when tests are valid, when to send contact messages)
Introduction to Nagios - configuration
• Configuration is made simpler by extensive use of templates, for example:
  • define a template for a generic host
  • use it to define many other hosts, only changing parameters that are different (e.g. host name, address, group to which it belongs)
• templates can be recursive (a template can itself use another template)
# Generic host definition template
define host{
    name                          generic-host  ; Name of host template
    notifications_enabled         1             ; Host notifications are enabled
    event_handler_enabled         1             ; Host event handler is enabled
    flap_detection_enabled        1             ; Flap detection is enabled
    process_perf_data             1             ; Process performance data
    retain_status_information     1             ; Retain status information
    retain_nonstatus_information  1             ; Retain non-status information
    register                      0             ; Template definition
    check_command                 check-host-alive
    max_check_attempts            10
    notification_interval         720
    notification_period           24x7
    notification_options          d,u,r
    }
define host{
    use             generic-host
    host_name       ganglia0430
    parents         swt-5530-0
    alias           Ganglia Host
    hostgroups      aux-services
    contact_groups  thorne
    address         130.246.183.173
    }
define host{
    use             generic-host
    host_name       shelob
    parents         swt-4400-1
    alias           CSF Webserver
    ……………
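How a host definition picks up values from its template can be sketched as a merge along the `use` chain. This is a simplified illustration, not Nagios's own parser: it assumes one template per `use` line, treats `name` and `register` as not inherited, and represents each definition as a plain dictionary.

```python
NOT_INHERITED = ("name", "register")  # belong to the template itself


def resolve(definition, templates, _seen=()):
    """Return `definition` with values from its template chain filled in.

    `templates` maps a template name to its definition dictionary.
    Values set locally override inherited ones.
    """
    merged = {}
    parent_name = definition.get("use")
    if parent_name is not None:
        if parent_name in _seen:
            raise ValueError(f"template loop at {parent_name!r}")
        parent = resolve(templates[parent_name], templates,
                         _seen + (parent_name,))
        merged.update((key, value) for key, value in parent.items()
                      if key not in NOT_INHERITED)
    merged.update(definition)
    merged.pop("use", None)
    return merged


templates = {
    "generic-host": {
        "name": "generic-host",
        "register": "0",
        "check_command": "check-host-alive",
        "max_check_attempts": "10",
    },
}
host = {"use": "generic-host", "host_name": "ganglia0430",
        "address": "130.246.183.173"}
resolved = resolve(host, templates)
```

Here `resolved` carries the host's own `host_name` and `address` together with the `check_command` and `max_check_attempts` taken from the template.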
Introduction to Nagios - plugins
• Test scripts are known as plugins
• Can be written in any suitable language: shell script, Perl, C, Pascal
• About 60 standard plugins (available by RPM from Dag Wieers’ repository)
• More than 30 locally written plugins
• plus more than 14 specially written for Castor
Nagios links
• Nagios home page: http://www.nagios.org/
• For locally written plugins: http://cvs.gridpp.rl.ac.uk/viewcvs/viewcvs.cgi/nagios/plugins/
• For GridPP information about Nagios: http://www.gridpp.ac.uk/wiki/Nagios