
Extensible Monitoring with Nagios and Messaging Middleware

Extensible Monitoring with Nagios and Messaging Middleware. LISA 2012. Jonathan Reams <jreams@columbia.edu>. Symon Says Nagios Project: replace a 12-year-old home-grown monitoring system (very customized, very engineered, very unsupported, ~17,000 checks) under a mandate to move to Nagios.


Presentation Transcript


  1. Extensible Monitoring with Nagios and Messaging Middleware LISA 2012 Jonathan Reams <jreams@columbia.edu>

  2. Symon Says Nagios Project • Replace 12-year-old home-grown monitoring system • Very customized • Very engineered • Very unsupported • ~17,000 checks • Mandate to move to Nagios

  3. False Start • Installed Nagios • Ported checks from old system to new • Went out for coffee • Problems • High check latency • High load

  4. Stock Nagios

  5. Nagios Problems • Trapped on one host: • Check results • Status data • Configuration data • Nagios isn’t a great executor • Forks 2 processes per check • Everything is basically synchronous – async is achieved with multiple processes • Data format is simple but non-standard

  6. Nagios Problems • Implementation is all in C – hard to customize • Can be I/O bound by reading/writing check result files • Cannot query data from status file/configuration without reading/parsing all of it • Input via FIFO gives no feedback and has a limited buffer size
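To make the FIFO complaint concrete, here is a minimal sketch of what a stock external command looks like. It assumes the standard Nagios external command line format (`[timestamp] COMMAND;arg;arg...`) and the `ACKNOWLEDGE_SVC_PROBLEM` command; the helper name `format_ack_command` is hypothetical, and the numeric `sticky` value is passed through unchanged, so check the Nagios documentation for your version before relying on it.

```python
import time


def format_ack_command(host, service, author, comment, *, sticky=2,
                       notify=False, persistent=False, now=None):
    """Build one line for the Nagios external command file (hypothetical helper)."""
    ts = int(time.time() if now is None else now)
    fields = [
        "ACKNOWLEDGE_SVC_PROBLEM",
        host,
        service,
        str(sticky),                  # passed through as-is (assumption: 2 = sticky)
        "1" if notify else "0",
        "1" if persistent else "0",
        author,
        comment,                      # last field, free text
    ]
    return "[{}] {}\n".format(ts, ";".join(fields))


line = format_ack_command("localhost", "rotate-unix", "jreams",
                          "Stop alerting me!!", now=1355074576)
print(line, end="")
```

A client would write this line to the command FIFO and get nothing back: no acknowledgement that the line was parsed, and no error if the host or service does not exist.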

  7. Nagios Problems Communication is hard!

  8. My Solution NagMQ A ZeroMQ-based API for Nagios

  9. Background on ZeroMQ • Broker-less messaging kernel in a single library • Emulates Berkeley socket API • Supports IPC/TCP/Multicast transports • Fanout, pub/sub, pipeline, and request/reply messaging patterns • All I/O is asynchronous after connections are established with dedicated I/O threads • Bindings available for a large number of operating systems and languages • Agnostic of data being sent – no defined data format

  10. NagMQ

  11. Event Publisher & Commands • Host check result from publisher: host_check_processed localhost { "host_name": "localhost", "check_type": 0, "check_options": 0, "scheduled_check": 1, "reschedule_check": 1, "current_attempt": 1, "max_attempts": 1, "state": 0, "last_state": 0, "last_hard_state": 0, "last_check": 1354996955, "last_state_change": 1337098090, "latency": 1.63600, "timeout": 60, "type": "host_check_processed", "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 }, "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 }, "early_timeout": 0, "execution_time": 0.07324, "return_code": 0, "output": "Host up", "long_output": null, "perf_data": null, "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 } } • Command to add an acknowledgement to a service problem: {'comment_data': 'Stop alerting me!!', 'notify_contacts': False, 'author_name': 'jreams', 'persistent_comment': False, 'host_name': 'localhost', 'service_description': 'rotate-unix', 'time_stamp': {'tv_sec': 1355074576}, 'type': 'acknowledgement'}
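Because the publisher payload is plain JSON, a consumer needs nothing beyond a JSON parser to work with it. The sketch below parses a trimmed copy of the host check result shown on the slide and derives a few values from it. The function names (`timeval_to_usec`, `summarize_check`) and the summary layout are illustrative assumptions, not part of NagMQ; receiving the message from the socket is left out.

```python
import json

# Trimmed copy of the host_check_processed payload from the slide.
SAMPLE = """
{ "host_name": "localhost", "state": 0, "last_state": 0,
  "type": "host_check_processed",
  "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
  "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
  "return_code": 0, "output": "Host up" }
"""


def timeval_to_usec(tv):
    """Convert a {"tv_sec", "tv_usec"} object to integer microseconds."""
    return tv["tv_sec"] * 1_000_000 + tv.get("tv_usec", 0)


def summarize_check(payload):
    """Parse one check-result payload and pull out the fields a UI might show."""
    event = json.loads(payload)
    elapsed_usec = (timeval_to_usec(event["end_time"])
                    - timeval_to_usec(event["start_time"]))
    return {
        "host": event["host_name"],
        "state": event["state"],
        "output": event["output"],
        # Time between start_time and end_time, in seconds.
        "wall_seconds": elapsed_usec / 1_000_000,
        "state_changed": event["state"] != event["last_state"],
    }


summary = summarize_check(SAMPLE)
print(summary)
```

Working in integer microseconds avoids losing precision when subtracting two large epoch timestamps as floats.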

  12. State Data • Request: {'keys': ['host_name', 'services', 'hosts', 'service_description', 'current_state', 'members', 'type', 'name', 'problem_has_been_acknowledged', 'plugin_output', 'checks_enabled', 'notifications_enabled', 'event_handler_enabled'], 'include_services': True, 'host_name': 'localhost'} • Response: [{'checks_enabled': True, 'notifications_enabled': True, 'current_state': 0, 'plugin_output': 'Host up', 'problem_has_been_acknowledged': 0, 'event_handler_enabled': True, 'host_name': 'localhost', 'services': ['rotate-unix'], 'type': 'host'}, {'checks_enabled': False, 'notifications_enabled': True, 'current_state': 1, 'plugin_output': 'You are now on call', 'problem_has_been_acknowledged': False, 'event_handler_enabled': True, 'host_name': 'localhost', 'service_description': 'rotate-unix', 'type': 'service'}]
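The request/response pair above can be handled with ordinary dictionaries. The following sketch builds a request in the shape shown on the slide and scans a response for unacknowledged problems. The helper names (`build_state_request`, `find_problems`) are hypothetical, the `service@host` label simply mirrors the nag.py output later in the deck, and the transport that carries the request to NagMQ is omitted.

```python
def build_state_request(host_name, keys, include_services=True):
    """Build a state request shaped like the one on the slide."""
    return {
        "host_name": host_name,
        "include_services": include_services,
        "keys": list(keys),
    }


def find_problems(response):
    """Return (label, state, output) for each non-OK, unacknowledged object."""
    problems = []
    for obj in response:
        if obj.get("current_state", 0) == 0:
            continue  # state 0 means up/OK
        if obj.get("problem_has_been_acknowledged"):
            continue  # somebody already acknowledged it
        label = obj["host_name"]
        if obj.get("type") == "service":
            label = "{}@{}".format(obj["service_description"], label)
        problems.append((label, obj["current_state"],
                         obj.get("plugin_output", "")))
    return problems


request = build_state_request(
    "localhost",
    ["host_name", "service_description", "current_state", "type",
     "problem_has_been_acknowledged", "plugin_output"],
)

# Abbreviated copy of the response from the slide.
response = [
    {"current_state": 0, "plugin_output": "Host up",
     "problem_has_been_acknowledged": 0, "host_name": "localhost",
     "services": ["rotate-unix"], "type": "host"},
    {"current_state": 1, "plugin_output": "You are now on call",
     "problem_has_been_acknowledged": False, "host_name": "localhost",
     "service_description": "rotate-unix", "type": "service"},
]

print(find_problems(response))
```

Listing only the needed keys in the request keeps the response small, which is the point of querying state instead of parsing the whole status file.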

  13. Some examples • Distributed check execution (mqexec) • Custom user interfaces (nag.py, etc) • High availability (haagent.py, halib.py)

  14. mqexec

  15. mqexec • Asynchronous command executor • Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes the command line specified • Can filter which commands to execute based on any attribute in the message • Receives messages as • Fair-queued worker pool (pull from MQ broker) • Individual worker (subscribe directly to NagMQ) • Sends results back to command interface of NagMQ
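The filter-then-execute behaviour described above can be sketched in a few lines of Python. This is not mqexec itself (which is a separate program) but an illustration of the idea: match each incoming message against per-attribute patterns, then run the matching command lines concurrently without blocking. The `command_line` field name, the filter format, and the shape of the result dictionary are assumptions made for this example.

```python
import asyncio
import re
import shlex
import sys


def matches(message, filters):
    """True when every filtered attribute is present and matches its regex."""
    for key, pattern in filters.items():
        value = message.get(key)
        if value is None or not re.search(pattern, str(value)):
            return False
    return True


async def run_check(message, timeout=60):
    """Run the message's command line and return an illustrative result dict."""
    argv = shlex.split(message["command_line"])  # field name is an assumption
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return {"host_name": message.get("host_name"), "return_code": 3,
                "output": "check timed out", "early_timeout": 1}
    return {"host_name": message.get("host_name"),
            "return_code": proc.returncode,
            "output": out.decode(errors="replace").strip(),
            "early_timeout": 0}


async def main(messages, filters):
    """Execute all matching messages concurrently and collect the results."""
    wanted = [m for m in messages if matches(m, filters)]
    return await asyncio.gather(*(run_check(m) for m in wanted))


if __name__ == "__main__":
    demo = {"type": "host_check_initiate", "host_name": "localhost",
            "command_line": shlex.quote(sys.executable)
                            + " -c \"print('Host up')\""}
    print(asyncio.run(main([demo], {"type": r"_check_initiate$"})))
```

In a real deployment the results would be sent to NagMQ's command interface rather than printed, and the messages would arrive from a subscription or a broker-fed worker queue.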

  16. Performance: Stock Nagios

  17. Performance: NagMQ/mqexec

  18. User Interfaces • Command-line: $ nag.py -c 'Stop alerting me!!' add ack localhost [localhost]: No problem found [uptime@localhost]: Acknowledgement added • Python/JavaScript/Twitter Bootstrap web interface using NagMQ (see demo) • Interface to Twitter

  19. High Availability – Stock Nagios

  20. High Availability - NagMQ

  21. High Availability - NagMQ • Use regular program_status to provide heartbeat • Retrieve active state from state interface to bring passive node into sync with active node on startup • Subscribe to and send check result messages, acknowledgements, downtimes, and adaptive changes to command interface • Passive host’s mqexec(s) run checks for whatever host is active • Use VIFs owned by the message broker to direct traffic to active host
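The heartbeat part of this scheme amounts to a small piece of bookkeeping on the passive node. The sketch below is not taken from haagent.py or halib.py; it only illustrates the idea of treating each program_status message as a heartbeat and declaring the active node dead after a period of silence. The class name, the `max_silence` threshold, and the use of local receive time are assumptions.

```python
class HeartbeatMonitor:
    """Track program_status messages from the active node (illustrative only)."""

    def __init__(self, max_silence=30.0):
        self.max_silence = max_silence  # seconds without a heartbeat (assumed value)
        self.last_seen = None           # local receive time of the last heartbeat

    def observe(self, message, received_at):
        """Record a heartbeat if the message is a program_status event."""
        if message.get("type") == "program_status":
            self.last_seen = received_at

    def active_is_alive(self, now):
        """True while the last heartbeat is recent enough."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.max_silence


monitor = HeartbeatMonitor(max_silence=30.0)
monitor.observe({"type": "host_check_processed"}, received_at=100.0)  # ignored
monitor.observe({"type": "program_status"}, received_at=105.0)        # heartbeat
print(monitor.active_is_alive(now=120.0))
print(monitor.active_is_alive(now=140.0))
```

Using the local receive time rather than a timestamp inside the message keeps the check independent of clock skew between the two nodes.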

  22. Why not use one of these? • LiveStatus – live state query module with check execution workers • Mod_gearman – distributed check execution based on the Gearman job queue • Merlin – database/distributed backend for Nagios • NDOUtils – database backend for Nagios • NSCA – allows check/command submission over network • NRPE – remote check executor

  23. API – not a product • NagMQ is just an interface into Nagios, not a product • Better communication with clients comes from larger ZeroMQ project – leaving NagMQ to focus on Nagios • Implement ad-hoc tools for Nagios without having to write any compiled code • Doing expensive data processing of monitoring data doesn’t have to create latency in monitoring system • Re-use one interface for many tools

  24. Future Work • Pluggable authentication/encryption for NagMQ • Pluggable parser/emitter for custom data formats (XML, YAML, etc.) • NDOUtils database replacement • More user interfaces (Jabber, SMS, email gateway, REST API) • Nagios 4

  25. NagMQ https://github.com/jbreams/nagmq Jonathan Reams jbreams@gmail.com
