Extensible Monitoring with Nagios and Messaging Middleware. LISA 2012 Jonathan Reams < jreams@columbia.edu >. Symon Says Nagios Project. Replace 12-year-old home grown monitoring system Very customized Very engineered Very unsupported ~17,000 checks Mandate to move to Nagios.
Extensible Monitoring with Nagios and Messaging Middleware LISA 2012 Jonathan Reams <jreams@columbia.edu>
Symon Says Nagios Project
• Replace 12-year-old home-grown monitoring system
• Very customized
• Very engineered
• Very unsupported
• ~17,000 checks
• Mandate to move to Nagios
False Start
• Installed Nagios
• Ported checks from old system to new
• Went out for coffee
• Problems
  • High check latency
  • High load
Nagios Problems
• Trapped on one host:
  • Check results
  • Status data
  • Configuration data
• Nagios isn’t a great executor
• Forks 2 processes per check
• Everything is basically synchronous – async is achieved with multiple processes
• Data format is simple but non-standard
Nagios Problems
• Implementation is all in C – hard to customize
• Can be I/O bound by reading/writing check result files
• Cannot query data from status file/configuration without reading/parsing all of it
• Input via FIFO gives no feedback and has a limited buffer size
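To make the FIFO point concrete, here is a minimal sketch of submitting an acknowledgement through the Nagios external command file. The `[timestamp] COMMAND;arg;arg` line format and the `ACKNOWLEDGE_SVC_PROBLEM` argument order are standard Nagios; the command file path is an assumption that varies by install, and `format_external_command` and `submit` are hypothetical helper names, not part of Nagios or NagMQ.

```python
import time


def format_external_command(name, args, now=None):
    """Build one line in Nagios external command syntax: [epoch] NAME;arg1;arg2"""
    timestamp = int(time.time() if now is None else now)
    fields = [name] + [str(arg) for arg in args]
    return "[{}] {}\n".format(timestamp, ";".join(fields))


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the line to the command FIFO (path is an assumed default).

    The write either succeeds or raises; Nagios never reports back whether
    the command was parsed, accepted, or silently dropped.
    """
    with open(command_file, "w") as fifo:
        fifo.write(line)


# Arguments: host, service, sticky, notify, persistent, author, comment
ack = format_external_command(
    "ACKNOWLEDGE_SVC_PROBLEM",
    ["localhost", "rotate-unix", 1, 0, 0, "jreams", "Stop alerting me!!"],
    now=1355074576,
)
print(ack, end="")
```

`submit` is defined but not called here, since it needs a running Nagios instance. The caller gets no return value to inspect, which is the lack of feedback the slide describes.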
Nagios Problems
Communication is hard!
My Solution
NagMQ: a ZeroMQ-based API for Nagios
Background on ZeroMQ
• Broker-less messaging kernel in a single library
• Emulates the Berkeley socket API
• Supports IPC/TCP/multicast transports
• Fanout, pub/sub, pipeline, and request/reply messaging patterns
• All I/O is asynchronous after connections are established, handled by dedicated I/O threads
• Bindings available for a large number of operating systems and languages
• Agnostic of the data being sent – no defined data format
Event Publisher & Commands

Host check result from publisher:

host_check_processed localhost
{
  "host_name": "localhost",
  "check_type": 0,
  "check_options": 0,
  "scheduled_check": 1,
  "reschedule_check": 1,
  "current_attempt": 1,
  "max_attempts": 1,
  "state": 0,
  "last_state": 0,
  "last_hard_state": 0,
  "last_check": 1354996955,
  "last_state_change": 1337098090,
  "latency": 1.63600,
  "timeout": 60,
  "type": "host_check_processed",
  "start_time": { "tv_sec": 1354996955, "tv_usec": 636453 },
  "end_time": { "tv_sec": 1354996964, "tv_usec": 161965 },
  "early_timeout": 0,
  "execution_time": 0.07324,
  "return_code": 0,
  "output": "Host up",
  "long_output": null,
  "perf_data": null,
  "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 }
}

Command to add an acknowledgement to a service problem:

{'comment_data': 'Stop alerting me!!',
 'notify_contacts': False,
 'author_name': 'jreams',
 'persistent_comment': False,
 'host_name': 'localhost',
 'service_description': 'rotate-unix',
 'time_stamp': {'tv_sec': 1355074576},
 'type': 'acknowledgement'}
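The two message shapes on this slide can be handled with nothing more than a JSON library. The sketch below assumes the publisher delivers an envelope string (message type, then host name) separately from the JSON payload, and that commands go over the wire as JSON; `parse_event`, `wall_time_usec`, and `build_acknowledgement` are hypothetical helper names, not NagMQ functions.

```python
import json


def parse_event(envelope, payload):
    """Split the envelope into (type, target) and decode the JSON payload."""
    event_type, _, target = envelope.partition(" ")
    return event_type, target, json.loads(payload)


def timeval_to_usec(tv):
    """Convert a {"tv_sec": ..., "tv_usec": ...} object to integer microseconds."""
    return tv["tv_sec"] * 1_000_000 + tv.get("tv_usec", 0)


def wall_time_usec(event):
    """Microseconds elapsed between the event's start_time and end_time."""
    return timeval_to_usec(event["end_time"]) - timeval_to_usec(event["start_time"])


def build_acknowledgement(host, service, author, comment, now):
    """Serialize an acknowledgement command using the fields shown on the slide."""
    return json.dumps({
        "type": "acknowledgement",
        "host_name": host,
        "service_description": service,
        "author_name": author,
        "comment_data": comment,
        "notify_contacts": False,
        "persistent_comment": False,
        "time_stamp": {"tv_sec": now},
    })


# Payload trimmed to the fields used below; values are taken from the slide.
sample_payload = json.dumps({
    "host_name": "localhost",
    "type": "host_check_processed",
    "start_time": {"tv_sec": 1354996955, "tv_usec": 636453},
    "end_time": {"tv_sec": 1354996964, "tv_usec": 161965},
    "return_code": 0,
    "output": "Host up",
})

event_type, target, event = parse_event("host_check_processed localhost", sample_payload)
elapsed = wall_time_usec(event)
ack_json = build_acknowledgement(
    "localhost", "rotate-unix", "jreams", "Stop alerting me!!", 1355074576
)
print(event_type, target, elapsed)
```

Working in integer microseconds avoids floating-point rounding when subtracting the two timestamps.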
State Data

Request:

{'keys': ['host_name', 'services', 'hosts', 'service_description',
          'current_state', 'members', 'type', 'name',
          'problem_has_been_acknowledged', 'plugin_output',
          'checks_enabled', 'notifications_enabled',
          'event_handler_enabled'],
 'include_services': True,
 'host_name': 'localhost'}

Response:

[{'checks_enabled': True,
  'notifications_enabled': True,
  'current_state': 0,
  'plugin_output': 'Host up',
  'problem_has_been_acknowledged': 0,
  'event_handler_enabled': True,
  'host_name': 'localhost',
  'services': ['rotate-unix'],
  'type': 'host'},
 {'checks_enabled': False,
  'notifications_enabled': True,
  'current_state': 1,
  'plugin_output': 'You are now on call',
  'problem_has_been_acknowledged': False,
  'event_handler_enabled': True,
  'host_name': 'localhost',
  'service_description': 'rotate-unix',
  'type': 'service'}]
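A client of the state interface mostly builds a request like the one above and then picks through the list that comes back. The following is a rough sketch of that client-side handling, assuming the request is serialized as JSON; `build_state_request`, `index_state`, and `unacknowledged_problems` are hypothetical helper names, and the sample response is trimmed from the slide.

```python
import json


def build_state_request(host_name, keys, include_services=True):
    """Serialize a state request limited to the listed keys."""
    return json.dumps({
        "host_name": host_name,
        "include_services": include_services,
        "keys": list(keys),
    })


def index_state(objects):
    """Index response objects by (type, host_name, service_description)."""
    return {
        (obj["type"], obj.get("host_name"), obj.get("service_description")): obj
        for obj in objects
    }


def unacknowledged_problems(objects):
    """Return objects in a non-OK state that nobody has acknowledged yet."""
    return [
        obj for obj in objects
        if obj.get("current_state", 0) != 0
        and not obj.get("problem_has_been_acknowledged")
    ]


request = build_state_request(
    "localhost",
    ["host_name", "service_description", "current_state", "type",
     "problem_has_been_acknowledged", "plugin_output"],
)

# Response objects copied (and trimmed) from the slide.
response = [
    {"current_state": 0, "plugin_output": "Host up",
     "problem_has_been_acknowledged": 0, "host_name": "localhost",
     "type": "host"},
    {"current_state": 1, "plugin_output": "You are now on call",
     "problem_has_been_acknowledged": False, "host_name": "localhost",
     "service_description": "rotate-unix", "type": "service"},
]

by_key = index_state(response)
problems = unacknowledged_problems(response)
print([p["service_description"] for p in problems])
```

The response mixes `0` and `False` for `problem_has_been_acknowledged`, so the filter relies on truthiness rather than comparing against a specific value.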
Some examples
• Distributed check execution (mqexec)
• Custom user interfaces (nag.py, etc.)
• High availability (haagent.py, halib.py)
mqexec
• Asynchronous command executor
• Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes the command line specified in each
• Can filter which commands to execute based on any attribute in the message
• Receives messages as either:
  • Fair-queued worker pool (pull from MQ broker)
  • Individual worker (subscribe directly to NagMQ)
• Sends results back to the command interface of NagMQ
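The filter-then-execute step of a worker like this can be sketched in a few lines. This is not mqexec's actual code: it runs synchronously for brevity, the `command_line` and `timeout` message fields are assumptions, the result fields are borrowed from the `host_check_processed` event, and `matches` and `execute` are hypothetical names.

```python
import re
import shlex
import subprocess
import sys
import time


def matches(message, filters):
    """True when every filtered attribute is present and matches its regex."""
    for key, pattern in filters.items():
        value = message.get(key)
        if value is None or re.search(pattern, str(value)) is None:
            return False
    return True


def execute(message):
    """Run the message's command line and return a check-result dictionary."""
    start = time.time()
    early_timeout = 0
    try:
        proc = subprocess.run(
            shlex.split(message["command_line"]),
            capture_output=True,
            text=True,
            timeout=message.get("timeout", 60),
        )
        return_code, output = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # 3 is the plugin exit code for UNKNOWN
        return_code, output, early_timeout = 3, "Check timed out", 1
    except OSError as exc:
        return_code, output = 3, "Could not run check: {}".format(exc)
    return {
        "host_name": message.get("host_name"),
        "service_description": message.get("service_description"),
        "return_code": return_code,
        "output": output,
        "early_timeout": early_timeout,
        "execution_time": time.time() - start,
    }


# A made-up service_check_initiate message; the "check" just prints OK.
message = {
    "type": "service_check_initiate",
    "host_name": "localhost",
    "service_description": "rotate-unix",
    "command_line": shlex.quote(sys.executable) + " -c \"print('OK')\"",
    "timeout": 10,
}

filters = {"type": "_check_initiate$", "host_name": "^local"}
result = execute(message) if matches(message, filters) else None
print(result)
```

Because the filter accepts any message attribute, one worker can be restricted to a set of hosts while another takes only event handlers, without changing the Nagios configuration.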
User Interfaces
• Command-line:
    $ nag.py -c 'Stop alerting me!!' add ack localhost
    [localhost]: No problem found
    [uptime@localhost]: Acknowledgement added
• Python/JavaScript/Twitter Bootstrap web interface using NagMQ (see demo)
• Interface to Twitter
High Availability - NagMQ
• Use regular program_status messages to provide a heartbeat
• Retrieve active state from the state interface to bring the passive node into sync with the active node on startup
• Subscribe to check result messages, acknowledgements, downtimes, and adaptive changes, and send them to the command interface
• Passive host’s mqexec(s) run checks for whichever host is active
• Use VIFs owned by the message broker to direct traffic to the active host
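The heartbeat part of this design reduces to tracking when the last program_status message arrived. Below is a minimal sketch of that bookkeeping on the passive node, not the logic in haagent.py or halib.py; `HeartbeatMonitor`, its 30-second timeout, and the injectable clock are all assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks heartbeats from the active node and reports when they stop."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        # Start the timer at construction so a freshly started passive node
        # waits one full timeout before declaring the active node dead.
        self.last_seen = clock()

    def beat(self):
        """Call whenever a program_status message arrives from the active node."""
        self.last_seen = self.clock()

    def active_is_alive(self):
        """True while the most recent heartbeat is within the timeout."""
        return self.clock() - self.last_seen <= self.timeout


# Drive the monitor with a fake clock so the example is deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout=30, clock=lambda: fake_now[0])

monitor.beat()                      # heartbeat received at t=0
fake_now[0] = 20.0
alive_at_20 = monitor.active_is_alive()

fake_now[0] = 45.0                  # no heartbeat since t=0
alive_at_45 = monitor.active_is_alive()

print(alive_at_20, alive_at_45)
```

In a real agent, the point where `active_is_alive()` turns false is where the passive node would begin taking over; how that failover happens is outside this sketch.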
Why not use one of these?
• LiveStatus – live state query module with check execution workers
• Mod_gearman – distributed check execution based on the Gearman job queue
• Merlin – database/distributed backend for Nagios
• NDOutils – database backend for Nagios
• NSCA – allows check/command submission over the network
• NRPE – remote check executor
API – not a product
• NagMQ is just an interface into Nagios, not a product
• Better communication with clients comes from the larger ZeroMQ project, leaving NagMQ to focus on Nagios
• Implement ad-hoc tools for Nagios without having to write any compiled code
• Expensive processing of monitoring data doesn’t have to create latency in the monitoring system
• Re-use one interface for many tools
Future Work
• Pluggable authentication/encryption for NagMQ
• Pluggable parser/emitter for custom data formats (XML, YAML, etc.)
• NDOutils database replacement
• More user interfaces (Jabber, SMS, email gateway, REST API)
• Nagios 4
NagMQ https://github.com/jbreams/nagmq Jonathan Reams jbreams@gmail.com