Monitoring a Control System Using Nagios


  1. R. Lange, M. Giacchini: Monitoring a Control System Using Nagios Monitoring a Control System Using Nagios Ralph Lange, BESSY – Mauro Giacchini, LNL

  2. What is the Situation?
     Machine Status vs. Controls Infrastructure Status
     • Machine status:
       • usually handled in the Control Room by an operator
       • uses the Alarm Handler or other EPICS tools
       • based on Channel Access connections
     • Control System infrastructure can be comparably complex, its status:
       • needs to be handled outside the Control Room
       • with tools that allow remote access
       • using different types of connections/checks: ping, snmp, http, Channel Access, disk usage, ...
     • BESSY was starting to have an increasing number of failures due to ageing hardware
     • One summer day Mauro (preparing an EPICS training in hot Italian summer) was asking me if I knew Nagios ...

  3. What is Nagios?
     Nagios (“nah-ghee-ose”)
     • Open source monitoring framework
       • widely used & actively developed: www.nagios.org
     • Host and service problems detection and recovery
     • Provides wide set of basic plugins (checks)
       • easy to develop custom plugins
     • Active vs. passive checks
     • Centralized vs. distributed deployment
       • also allows redundant Nagios daemons
     • High configurability
       • service dependencies, fine-grained notification options
     • Web interface
       • status view, administration (e.g. analysis, downtime scheduling)

  4. The Plugin (Check) Interface
     Plugins (Checks)
     • Checks are command line programs that follow a convention for arguments, stdout output, and return code: nagiosplugins.org
       • Output: one line of status info
       • Return code: OK / WARNING / CRITICAL / UNKNOWN
     • Can be written in any (i.e. your favourite) compiled or interpreted language
     • Are configured into Nagios for local or remote execution
     Passive Checks
     • An external application can write check results (following a certain format) into a file (or a pipe)
     • Nagios reads from this and accepts the results (if configured)
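The plugin convention on this slide can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the disk-usage check, the thresholds, and the names `check_usage`, `disk_percent_used`, and `passive_result_line` are assumptions made for the example. The numeric return codes (0 to 3) and the `PROCESS_SERVICE_CHECK_RESULT` line format follow the standard Nagios plugin and external-command conventions.

```python
import shutil
import time

# Return codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def disk_percent_used(path="/"):
    """Return the used fraction of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_usage(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to (return code, one-line status text).

    The thresholds are hypothetical defaults chosen for this sketch.
    """
    if percent_used is None:
        return UNKNOWN, "USAGE UNKNOWN - no value available"
    if percent_used >= crit:
        code = CRITICAL
    elif percent_used >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"USAGE {STATUS_NAMES[code]} - {percent_used:.1f}% used"


def passive_result_line(host, service, code, text, now=None):
    """Format a result as a passive service check for the Nagios command file."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{text}"
```

Used as an active check, a script would print the status text from `check_usage(disk_percent_used("/"))` and exit with the returned code. Used as a passive check, an external application would instead append the output of `passive_result_line(...)` to the command file (or pipe) that Nagios is configured to read.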

  5. Nagios + CA Plugin = NAL
     Nagios Channel Access Plugins
     • caget type plugin (active check) by Mauro Giacchini (LNL)
     • camonitor type daemon (passive check) by Debby Quock (APS)
     • Integrate data available through CA into the Nagios monitoring framework
     • Can check the health of EPICS integrated VME crates, VME IOCs, soft IOCs, PLCs, CA gateways, CA archivers, ... as well as OPI machine and server health, disk status, network device status, NTP, DNS, web services etc.
     • Allows NAL (Nagios Alarm Handler) to be the central monitoring system for all control system infrastructure, whereas the ALH in the control room provides similar functionality for the controlled facility
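To give a rough idea of what a caget type active check could look like, here is a minimal sketch. It is not the LNL plugin: the function names, the thresholds, and the PV name in the usage note are hypothetical, and it assumes the EPICS Base `caget` tool is on the PATH with its terse (`-t`) and wait-time (`-w`) options. Anything that cannot be parsed as a finite number is reported as UNKNOWN, so the sketch does not rely on a particular `caget` error message.

```python
import math
import subprocess

# Return codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def read_pv(pv_name, timeout=2.0):
    """Read one PV through `caget`; return its text value, or None on failure."""
    try:
        result = subprocess.run(
            ["caget", "-t", "-w", str(timeout), pv_name],
            capture_output=True, text=True, timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # caget missing, not executable, or hung
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def evaluate(pv_name, raw_value, warn, crit):
    """Turn a raw caget value into (return code, one-line status text)."""
    if raw_value is None:
        return UNKNOWN, f"CA UNKNOWN - {pv_name} not readable"
    try:
        value = float(raw_value)
    except ValueError:
        return UNKNOWN, f"CA UNKNOWN - {pv_name} returned non-numeric '{raw_value}'"
    if math.isnan(value):
        return UNKNOWN, f"CA UNKNOWN - {pv_name} returned NaN"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"CA {STATUS_NAMES[code]} - {pv_name} = {value:g}"
```

A plugin script built on this would call something like `evaluate(pv, read_pv(pv), warn, crit)` for a PV such as a hypothetical `ioc01:LOAD`, print the text, and exit with the code. Keeping the Channel Access read separate from the evaluation makes the threshold logic testable without a running IOC.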

  6. Current Configuration at BESSY
     Servers
     • All machines: ping, disk usage, load, processes, users, SSH
     • Some: DNS (foreign and internal addresses), NTP
     vxWorks IOCs
     • Ping, CPU load, memory usage, FD usage
     Services
     • Wikis, web server, help pages, issue trackers (Trac/Redmine), elog
     • Oracle servers: Ping, ODB Telnet, ODB TNS for important DBs
     => 296 checks on 111 hosts

  7. Screen Shots: Tactical Overview

  8. Screen Shots: Service Detail

  9. Screen Shots: Service Detail

  10. Screen Shots: Availability Report

  11. Screen Shots: Service Trends

  12. Firefox/Thunderbird Plugin
     • Highly configurable, many filtering options
     • New alarm starts blinking and may play sound
     • Mouse-over opens a pop-up showing the current alarms
     • Clicking an alarm opens the related Nagios page in a tab

  13. Experiences
     • Nagios is a very stable and reliable framework, configuration is flexible, options and plugins are many
     • Off control room, web based, email notification approach fits our controls group better than ALH
     • Manual configuration can be tedious, some parts could (should!) be generated from our RDB
     • Found some network problems, one running system clock, two disks filling up, IOC load and memory saturation on a number of mv162s (which were replaced by mv2100s)
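The point about generating configuration from the RDB could be approached along these lines. This is a sketch under stated assumptions, not the BESSY setup: the row layout, host name, and address are made up, and the `generic-host` and `generic-service` templates and the `check_ping` command arguments are borrowed from the Nagios sample configuration. Only the `define <type> { ... }` object syntax is standard Nagios.

```python
def render_object(kind, directives):
    """Render one Nagios object definition from (directive, value) pairs."""
    lines = [f"define {kind} {{"]
    lines += [f"    {key} {value}" for key, value in directives]
    lines.append("}")
    return "\n".join(lines)


def render_config(rows):
    """Render host and service definitions for rows fetched from a database.

    Each row is a dict with 'name', 'address', and 'checks', where 'checks'
    is a list of (service description, check command) pairs. This layout is
    hypothetical; a real RDB schema would need its own mapping.
    """
    blocks = []
    for row in rows:
        blocks.append(render_object("host", [
            ("use", "generic-host"),  # template name assumed from the sample config
            ("host_name", row["name"]),
            ("address", row["address"]),
        ]))
        for description, command in row["checks"]:
            blocks.append(render_object("service", [
                ("use", "generic-service"),  # template name assumed as well
                ("host_name", row["name"]),
                ("service_description", description),
                ("check_command", command),
            ]))
    return "\n\n".join(blocks) + "\n"


# Made-up example row standing in for an RDB query result.
EXAMPLE_ROWS = [
    {
        "name": "ioc01",
        "address": "192.0.2.10",
        "checks": [("PING", "check_ping!100.0,20%!500.0,60%")],
    },
]
```

Calling `render_config(EXAMPLE_ROWS)` yields text that could be written to a file included from the main Nagios configuration, so that adding a host in the database no longer requires editing the definitions by hand.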

  14. Next Steps
     To be configured:
     • Soft IOCs, CA Gateways, VME crates (Wiener), Embedded Controllers
     • NFS shares usage, switches/routers, printers
     Checks to be written:
     • Conserver (IOC console access)
     • CA Archiver (through ArchiveManager web interface)
     • CA access rights (based on cainfo)
     Collaborate:
     • Integrate CA check plugin development
     • Agree on a common place for our plugins (APS? Sourceforge? Nagios?)

  15. LivEPICS Example
     Live Example: Mauro Giacchini's LivEPICS distribution includes Nagios 3.0 (configured to look at the EPICS Base example app channels)
     Go check it out – now!
