R. Lange, M. Giacchini: Monitoring a Control System Using Nagios Monitoring a Control System Using Nagios Ralph Lange, BESSY – Mauro Giacchini, LNL
What is the Situation?
Machine Status vs. Controls Infrastructure Status
• Machine status:
  • usually handled in the Control Room by an operator
  • uses the Alarm Handler or other EPICS tools
  • based on Channel Access connections
• Control System infrastructure can be comparably complex; its status:
  • needs to be handled outside the Control Room
  • with tools that allow remote access
  • using different types of connections/checks: ping, SNMP, HTTP, Channel Access, disk usage, ...
• BESSY was starting to see an increasing number of failures due to ageing hardware
• One summer day Mauro (preparing an EPICS training in the hot Italian summer) asked me if I knew Nagios ...
What is Nagios?
Nagios (“nah-ghee-ose”)
• Open source monitoring framework
  • widely used & actively developed: www.nagios.org
• Host and service problem detection and recovery
• Provides a wide set of basic plugins (checks)
  • easy to develop custom plugins
• Active vs. passive checks
• Centralized vs. distributed deployment
  • also allows redundant Nagios daemons
• High configurability
  • service dependencies, fine-grained notification options
• Web interface
  • status view, administration (e.g. analysis, downtime scheduling)
The Plugin (Check) Interface
Plugins (Checks)
• Checks are command line programs that follow a convention for arguments, stdout output, and return code: nagiosplugins.org
  • Output: one line of status info
  • Return code: OK / WARNING / CRITICAL / UNKNOWN
• Can be written in any (i.e. your favourite) compiled or interpreted language
• Are configured into Nagios for local or remote execution
Passive Checks
• An external application can write check results (following a certain format) into a file (or a pipe)
• Nagios reads from this and accepts the results (if configured)
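The plugin convention above (one line of status on stdout, state in the return code) can be sketched in a few lines. This is a minimal illustration, not one of the plugins used at BESSY: the disk-usage check, the function names, and the 80%/90% thresholds are assumptions chosen for the example. The numeric return codes 0 to 3 are the ones the convention assigns to OK, WARNING, CRITICAL, and UNKNOWN.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style check: disk usage of one filesystem."""
import shutil
import sys

# Return codes of the plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to (return code, one-line status text).

    The thresholds are illustrative defaults, not values from the talk.
    """
    if used_percent >= crit:
        code = CRITICAL
    elif used_percent >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {used_percent:.1f}% used"


def check_disk(path, warn=80.0, crit=90.0):
    """Measure one filesystem; a failed measurement is reported as UNKNOWN."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)


def main(argv):
    """Print exactly one status line and return the code to exit with."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_disk(path)
    print(line)
    return code
```

An executable script would end with `sys.exit(main(sys.argv))`, so that Nagios reads the state from the process exit code.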
Nagios + CA Plugin = NAL
Nagios Channel Access Plugins
• caget type plugin (active check) by Mauro Giacchini (LNL)
• camonitor type daemon (passive check) by Debby Quock (APS)
• Integrate data available through CA into the Nagios monitoring framework
• Can check the health of EPICS integrated VME crates, VME IOCs, soft IOCs, PLCs, CA gateways, CA archivers, ... as well as OPI machine and server health, disk status, network device status, NTP, DNS, web services etc.
• Allows NAL (Nagios Alarm Handler) to be the central monitoring system for all control system infrastructure, whereas the ALH in the control room provides similar functionality for the controlled facility
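A camonitor type daemon hands its results to Nagios as passive checks. The sketch below shows only that last step, formatting and writing one result line with the Nagios external command `PROCESS_SERVICE_CHECK_RESULT`. It is not the APS daemon: the function names, and the host and service names in the usage note, are hypothetical, and the Channel Access side is left out entirely.

```python
import time


def passive_result_line(host, service, code, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line.

    host and service must match names configured in Nagios; code is the
    usual 0-3 state; output is the one-line status text.
    """
    if timestamp is None:
        timestamp = int(time.time())
    # Each command occupies one line, so fold any line breaks in the text.
    text = " ".join(str(output).splitlines())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{code};{text}\n"
    )


def submit(command_file, line):
    """Write one result line to the Nagios external command file.

    Assumption: command_file is the named pipe configured in Nagios;
    its location depends on the installation.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, `passive_result_line("ioc01", "CA heartbeat", 2, "no updates for 30 s", timestamp=1200000000)` yields `[1200000000] PROCESS_SERVICE_CHECK_RESULT;ioc01;CA heartbeat;2;no updates for 30 s` followed by a newline.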
Current Configuration at BESSY
Servers
• All machines: ping, disk usage, load, processes, users, SSH
• Some: DNS (foreign and internal addresses), NTP
vxWorks IOCs
• Ping, CPU load, memory usage, FD usage
Services
• Wikis, web server, help pages, issue trackers (Trac/Redmine), elog
• Oracle servers: Ping, ODB Telnet, ODB TNS for important DBs
=> 296 checks on 111 hosts
Screen Shots: Tactical Overview
Screen Shots: Service Detail
Screen Shots: Service Detail
Screen Shots: Availability Report
Screen Shots: Service Trends
Firefox/Thunderbird Plugin
• Highly configurable, many filtering options
• New alarm starts blinking and may play sound
• Mouse-over opens a pop-up showing the current alarms
• Clicking an alarm opens the related Nagios page in a tab
Experiences
• Nagios is a very stable and reliable framework; configuration is flexible, and there are many options and plugins
• The off-control-room, web-based, email-notification approach fits our controls group better than ALH
• Manual configuration can be tedious; some parts could (should!) be generated from our RDB
• Found some network problems, one running system clock, two disks filling up, IOC load and memory saturation on a number of mv162s (which were replaced by mv2100s)
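Generating part of the configuration, as suggested above, could look roughly like this. The sketch assumes the RDB query has already produced rows of (host name, address, list of checks); that row layout, the host names, the addresses, and the function names are all hypothetical. `generic-host` and `generic-service` are the template names from the Nagios sample configuration and may differ in a real installation.

```python
def define_block(kind, directives):
    """Render one Nagios object definition from (directive, value) pairs."""
    lines = [f"define {kind} {{"]
    for key, value in directives:
        lines.append(f"    {key:<22}{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def render(rows):
    """Render host and service definitions.

    rows: iterable of (host_name, address, services), where services is
    a list of (service_description, check_command) pairs. This stands in
    for whatever the RDB query would return.
    """
    blocks = []
    for host, address, services in rows:
        blocks.append(define_block("host", [
            ("use", "generic-host"),  # template name is an assumption
            ("host_name", host),
            ("address", address),
        ]))
        for description, command in services:
            blocks.append(define_block("service", [
                ("use", "generic-service"),  # template name is an assumption
                ("host_name", host),
                ("service_description", description),
                ("check_command", command),
            ]))
    return "\n".join(blocks)


# Hypothetical rows; real ones would come from the RDB.
EXAMPLE_ROWS = [
    ("ioc01", "192.0.2.10", [("PING", "check_ping!100.0,20%!500.0,60%")]),
    ("srv01", "192.0.2.20", [("PING", "check_ping!100.0,20%!500.0,60%"),
                             ("SSH", "check_ssh")]),
]
```

Writing `render(EXAMPLE_ROWS)` to a file that Nagios includes would keep the generated objects separate from hand-maintained ones.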
Next Steps
To be configured:
• Soft IOCs, CA Gateways, VME crates (Wiener), Embedded Controllers
• NFS shares usage, switches/routers, printers
Checks to be written:
• Conserver (IOC console access)
• CA Archiver (through ArchiveManager web interface)
• CA access rights (based on cainfo)
Collaborate:
• Integrate CA check plugin development
• Agree on a common place for our plugins (APS? Sourceforge? Nagios?)
LivEPICS Example
Live Example:
Mauro Giacchini's LivEPICS distribution includes Nagios 3.0 (configured to look at the EPICS Base example app channels)
Go check it out – now!