Nagios Demonstration

Nagios Demonstration. Tom Wlodek. SLAC Tier2 workshop, 2007-11-29.

Presentation Transcript


  1. Nagios Demonstration. Tom Wlodek, SLAC Tier2 workshop, 2007-11-29

  2. BNL Nagios Hardware (current). Diagram: Nagios server rnagios01; NRPE server outside the firewall (Grid02.usatlas.org); NRPE server inside the firewall (Gridmon…); several NRPE servers on monitored machines; RT/AT machine (Rt-racf.bnl.gov)

  3. Diagram: OSG Footprints, RT, AT, Nagios, and the machines and services monitored by Nagios
     • RT can exchange problem reports with external ticketing systems (OSG Footprints).
     • Problems reported to RT are reflected in the asset's history.
     • In case of a failure of a critical machine or service, Nagios notifies experts and/or opens an RT problem report to keep track of the problem resolution.
     • Information about assets stored in AT is used by Nagios to monitor the BNL machines and services, and to keep an up-to-date list of the administrators who are to be notified in case of problems.
     • No AT support anymore!

  4. Coming changes to Nagios
     • The Nagios server will be split into two:
       • an internal RACF server (BNL stuff)
       • an external server (Tier2/3, OSG services, USAtlas)
     • The split has been delayed for lack of suitable hardware, but I hope that the problems have been fixed now.
     • Once the split is completed, the Tier2 admins will be given Nagios administrator rights.

  5. Future Hardware. Diagram: Nagios internal server; Nagios external server; NRPE server outside the firewall (Grid02…); NRPE server inside the firewall (Gridmon…); NRPE servers on monitored machines; RT machine (Rt-racf.bnl.gov)

  6. Current Nagios in a nutshell
     • https://web.racf.bnl.gov/nagios/
     • Bookmark this page and visit it often.
     • We are currently monitoring ~500 services on ~260 hosts, and counting…

  7. Service dependencies. Diagram: parent services and a child service

  8. Service dependencies
     • Currently some service dependencies are defined in Nagios.
     • More need to be defined/discovered.
     • Discovering and declaring service dependencies is a never-ending task.
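
To illustrate why declared dependencies matter, here is a minimal sketch that, given a parent/child map, picks out the failing services that have no failing ancestor. The service names and the `root_causes` helper are hypothetical; Nagios applies its own dependency logic from its object configuration, and this only shows the idea.

```python
from typing import Dict, List, Set


def root_causes(parents: Dict[str, List[str]], failing: Set[str]) -> Set[str]:
    """Return the failing services that have no failing (transitive) parent."""

    def has_failing_ancestor(service: str, seen: Set[str]) -> bool:
        for parent in parents.get(service, []):
            if parent in seen:  # guard against cycles in the declared map
                continue
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return {s for s in failing if not has_failing_ancestor(s, {s})}


# Hypothetical dependency map: child service -> list of parent services.
example_parents = {
    "srm": ["gridftp", "dcache"],
    "gridftp": ["network"],
    "dcache": [],
}
```

With `failing = {"srm", "gridftp", "network"}` only `network` is reported, which is the service an expert would want to look at first.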

  9. False alarms in Nagios
     • Sometimes probes report false alarms.
     • Many of those false alarms were caused by problems in the BNL firewalls. We eliminated them by adding a second network interface to the Nagios server.
     • Some level of false alarms still persists, probably still caused by the firewalls. It is hard to eliminate them. I am working on making the probes smarter. A fix to the BNL firewalls should also bring relief.

  10. Nagios: “Tactical overview” https://web.racf.bnl.gov/nagios/ Visit this page daily, especially if you are a member of the management group or an operator.

  11. We need to formalize the Nagios operations
      • Operators should monitor the “Tactical overview” page for new alerts and notify experts when they see one.
      • Upon receiving a Nagios alert (by e-mail and/or pager and/or operator call), the expert should visit the Nagios page and acknowledge the problem.
      • The expert should then take ownership of the corresponding RT ticket and check the status of the parent service (if applicable).
      • Close the RT ticket, if applicable.
      • Reschedule a new check of the Nagios service to clear the alert from the Nagios page.
      • Fix the problem and leave a record of the solution in RT.
      • Delete comments from the Nagios page (if applicable).
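
The acknowledge and reschedule steps above are normally done from the web page, but they can also be scripted through Nagios external commands. The sketch below only formats the command lines for `ACKNOWLEDGE_SVC_PROBLEM` and `SCHEDULE_FORCED_SVC_CHECK`; the helper names, host, and service are hypothetical, and the meaning of the `sticky` value differs between Nagios versions, so check the documentation for the installed release.

```python
import time


def external_command(name, *args, now=None):
    """Format one Nagios external command line: [timestamp] NAME;arg1;arg2;..."""
    ts = int(time.time() if now is None else now)
    return "[{}] {};{}".format(ts, name, ";".join(str(a) for a in args))


def acknowledge_service(host, service, author, comment, sticky=1, now=None):
    """Acknowledge a service problem.

    notify=1 sends a notification about the acknowledgement;
    persistent=1 keeps the comment across Nagios restarts.
    """
    return external_command(
        "ACKNOWLEDGE_SVC_PROBLEM", host, service, sticky, 1, 1, author, comment, now=now
    )


def force_recheck(host, service, now=None):
    """Ask Nagios to re-run the service check immediately."""
    ts = int(time.time() if now is None else now)
    return external_command("SCHEDULE_FORCED_SVC_CHECK", host, service, ts, now=ts)
```

Each returned line would be written, followed by a newline, to the Nagios command file, whose location is site-specific and set by `command_file` in `nagios.cfg`.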

  12. Useful things to know
      • How to schedule a shutdown of a service or group of services?
      • How to disable checks for a particular service or group of services?
      • How to stop notifications for a service or service group?
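
Besides the web interface, these three tasks map onto Nagios external commands. The sketch below builds the command lines for `SCHEDULE_SVC_DOWNTIME`, `DISABLE_SVC_CHECK`, and `DISABLE_SVC_NOTIFICATIONS`; the Python helper names and the example host and service are hypothetical, and the slide itself does not say which method the talk demonstrated.

```python
import time


def external_command(name, *args, now=None):
    """Format one Nagios external command line: [timestamp] NAME;arg1;arg2;..."""
    ts = int(time.time() if now is None else now)
    return "[{}] {};{}".format(ts, name, ";".join(str(a) for a in args))


def schedule_downtime(host, service, start, hours, author, comment, now=None):
    """Schedule fixed downtime for one service, starting at Unix time `start`."""
    end = start + int(hours * 3600)
    # Fields: start; end; fixed=1; trigger_id=0 (no trigger); duration in seconds.
    return external_command(
        "SCHEDULE_SVC_DOWNTIME", host, service,
        start, end, 1, 0, end - start, author, comment, now=now,
    )


def disable_checks(host, service, now=None):
    """Stop active checks of one service."""
    return external_command("DISABLE_SVC_CHECK", host, service, now=now)


def disable_notifications(host, service, now=None):
    """Stop notifications for one service."""
    return external_command("DISABLE_SVC_NOTIFICATIONS", host, service, now=now)
```

The lines are submitted by writing them to the Nagios command file. Nagios also provides service-group variants of these commands, which take a service group name instead of a host and service pair.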

  13. We need to formalize the Nagios operations (cont'd)
      • We need to enforce two rules:
        • No abandoned RT tickets (mostly works OK)
        • No unacknowledged Nagios alarms
      • Acknowledged problems should remain acknowledged for at most a time T (one week?). After that they ought to be fixed or removed from Nagios. The length of the interval T is negotiable, but we should agree on some number.
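
The "at most T" rule could be checked mechanically. This is a minimal sketch, assuming the acknowledgement times have already been extracted from Nagios by some other means; the `Ack` record and the function name are hypothetical, and the one-week default is only the value floated on the slide.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class Ack(NamedTuple):
    host: str
    service: str
    acknowledged_at: datetime


def stale_acknowledgements(
    acks: Iterable[Ack], now: datetime, limit: timedelta = timedelta(weeks=1)
) -> List[Ack]:
    """Return acknowledgements that have been open for longer than `limit`."""
    return [a for a in acks if now - a.acknowledged_at > limit]
```

Anything the function returns is a candidate for either a fix or removal from Nagios.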

  14. RSV probes and Nagios. There are 3 ways to integrate RSV probes into Nagios:
      • Run the RSV probe directly from Nagios. This can be done (and is done) for simple probes; more complex ones will exceed the Nagios timeout.
      • Make the RSV probes report results to a central OSG database, and make Nagios read the database. The RSV authors do not seem to like it.
      • Make the RSV probes report directly to Nagios. BNL security experts do not like it, since it would imply changing the current authentication methods.
      So…

  15. We will combine methods 2 and 3. Diagram: RSV probes running in OSG land; interface DB; BNL firewall; Nagios
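
One reading of this diagram is that probes write their results to an interface database and a Nagios check reads the latest result back. The sketch below shows what such a check could look like, using SQLite as a stand-in; the table name `rsv_results`, its columns, and the status strings are assumptions, not the actual BNL or RSV schema. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
import sqlite3

# Standard Nagios plugin exit codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def check_from_db(conn, probe, host, now, max_age=3600):
    """Return (exit_code, message) for the newest stored result of a probe.

    `now` and the stored `reported_at` values are Unix timestamps.
    Missing or stale results are reported as UNKNOWN rather than OK.
    """
    row = conn.execute(
        "SELECT status, reported_at FROM rsv_results "
        "WHERE probe = ? AND host = ? ORDER BY reported_at DESC LIMIT 1",
        (probe, host),
    ).fetchone()
    if row is None:
        return 3, "UNKNOWN - no result recorded for {} on {}".format(probe, host)
    status, reported_at = row
    if now - reported_at > max_age:
        return 3, "UNKNOWN - last result is {} s old".format(now - reported_at)
    if status not in NAGIOS_CODES:
        status = "UNKNOWN"
    return NAGIOS_CODES[status], "{} - {} on {}".format(status, probe, host)
```

A real plugin would print the message and exit with the returned code. Because the check only reads a stored row, it returns quickly even when the probe itself is slow.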

  16. I need feedback from you!
      • What should be monitored?
      • Who should be on the call list?
      • What should the notification policy be? E-mail? Pager?
      • We can define event handlers to correct common error conditions. Do you want/need them?
      • Etc…
