Learn about LBLnet's Nagios NMS design, challenges faced, benefits, and future goals. Explore Nagios features, extensibility, monitoring of network devices, trap handling, and notification configuration.
Nagios – Our Open Source Network Management Solution Presenter: Ling Zhang LBLnet Services Group Information Technologies and Services Division LBNL
Contributors Nagios software design and development: Ethan Galstad (www.nagios.org) System integration, configuration, testing: Ling Zhang, Greg Bell, Harper Mann, Cedric Hui, Clark Wood, Mike Bennett ITSD/LBNL
Goals for this talk • To explain: • LBLnet’s point of view of a Network Management System • the network monitoring problems we encountered • the design of our Nagios network monitoring system • To discuss: • the benefits of the Nagios system • our future development goals
Our point of view of an NMS • Proactive network management • Alarm Panel • Connectivity • Performance • Fault isolation • Trend Analysis • Capacity planning • Notification • Precise • Fast
Background Information • Network monitoring tools we have tested and/or used before: • Sun Net Manager • Spectrum • WhatsUp Gold • Netmon • SNMPc • IPmonitor • HP OpenView • OpenNMS • InCharge • Home-grown scripts • MRTG/RRDtool • etc.
Background Information Our fair share of problems with NMS: • Notification storms: 65 notifications were received during a single router up/down event. The router has 20 active interfaces and 32 downstream monitored devices • False alarms • Integration with existing systems (MRTG, trouble ticket system) • Tech support: our longest outstanding ticket has been open for 2 years and counting • Budget
In Search of a Better NMS • Accurate and efficient fault detection • Good performance • Extensible • Can be integrated with our existing system • Low maintenance • Fits our budget
Features of Nagios • Open source system that runs on most Unix systems • Highly extensible • Reliable dependency monitoring • Excellent service monitoring capabilities • Ability to schedule maintenance periods • Flexible notification
Our Nagios Topology • [Figure: LBLnet NMS diagram]
Nagios Extensibility • Plugins • Event handlers • External commands
Nagios Extensibility - Plugins • Compiled executables or scripts (Perl, shell, etc.) • Run by the Nagios process • Check device or service status
Example:
define host {
    host_name      switch1
    address        1.2.3.4
    check_command  ping_switch
}
define service {
    host_name            switch1
    service_description  CPU Util
    check_command        get_cpu_util
}
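A plugin such as get_cpu_util reports its result to Nagios through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) plus one line of text output. The sketch below illustrates that contract only. The thresholds, the function names, and the way the CPU reading is obtained are assumptions for illustration, not the plugin used at LBLnet.

```python
import sys

# Standard plugin exit codes that Nagios maps to service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_cpu(util_percent, warn=80.0, crit=95.0):
    """Return (exit_code, one_line_output) for a CPU utilization reading.

    The warn/crit thresholds are hypothetical defaults.
    """
    if util_percent is None or not 0.0 <= util_percent <= 100.0:
        return UNKNOWN, "CPU UNKNOWN - no valid reading"
    if util_percent >= crit:
        code = CRITICAL
    elif util_percent >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"CPU {LABELS[code]} - utilization {util_percent:.0f}%"


def main(argv):
    """Hypothetical entry point: argv[1] carries the reading in percent.

    A real plugin would query the device (for example via SNMP) instead.
    """
    try:
        reading = float(argv[1])
    except (IndexError, ValueError):
        reading = None
    code, message = evaluate_cpu(reading)
    print(message)
    return code

# A standalone script would finish with: sys.exit(main(sys.argv))
```

Nagios only looks at the exit code to decide the service state; the printed line is what appears in the web interface and in notifications.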
Services Monitored by Nagios • Nagios uses plugins to check service status • DHCP • DNS • FTP • HTTP • HTTPS • IMAP • NTP • Radius • SMTP • SQL • TFTP • WINS • etc.
Nagios Extensibility – Event Handlers • Compiled executables or scripts • Run by the Nagios process • Triggered by a host or service status change
Example:
define service {
    host_name            somehost
    service_description  HTTP
    max_check_attempts   4
    check_command        check_http
    event_handler        restart-httpd
    ...other service variables...
}
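An event handler like restart-httpd usually receives the service state, the state type (SOFT or HARD), and the current check attempt through the Nagios macros $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$. The sketch below shows one possible decision rule: act once on a late soft failure, before the problem turns hard and notifications go out. The command definition that passes these macros, the choice of attempt 3, and the restart action itself are assumptions, not the handler shown in the talk.

```python
def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should try to restart the service.

    soft_attempt is a hypothetical choice: one check before
    max_check_attempts (4 in the example service) is reached.
    """
    if state != "CRITICAL":
        return False  # OK, WARNING and UNKNOWN need no restart
    if state_type == "HARD":
        return True   # retries are exhausted, try a restart as a last resort
    return state_type == "SOFT" and attempt == soft_attempt


def handle(args, restart):
    """args holds the three macro values; restart is a caller-supplied action."""
    state, state_type, attempt = args[0], args[1], int(args[2])
    if should_restart(state, state_type, attempt):
        restart()
        return True
    return False
```

Passing the restart action in as a callable keeps the decision logic testable without touching a real web server.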
Nagios Extensibility – External Commands • A predefined set of commands issued externally to control the behavior of Nagios • Control notification, check scheduling, and program start/stop • Issued by external applications (CGI, snmptrapd, etc.) • Read by the Nagios core process at run time Example • A user disabled monitoring of switch1 from the web interface • The CGI wrote the command “disable monitor switch1” to the command file • The Nagios process read this command and stopped scheduling checks for switch1
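On disk, an external command is a single line of the form "[timestamp] COMMAND;arg1;arg2" written to the Nagios command file, which is a named pipe. The slide paraphrases the command as “disable monitor switch1”; the sketch below assumes the closest real command is DISABLE_HOST_CHECK. The helper names and the idea of passing the pipe path as a parameter are illustrative assumptions.

```python
import time


def format_external_command(name, *args, now=None):
    """Build one command-file line: [epoch seconds] NAME;arg;arg..."""
    timestamp = int(time.time() if now is None else now)
    fields = ";".join([name, *map(str, args)])
    return f"[{timestamp}] {fields}\n"


def submit(command_line, command_file):
    """Write the line to the command file.

    command_file is site specific (it is set by the command_file
    directive in the main Nagios configuration), so it is a parameter here.
    """
    with open(command_file, "w") as pipe:
        pipe.write(command_line)


# Example line for the scenario on the slide (timestamp fixed for clarity).
line = format_external_command("DISABLE_HOST_CHECK", "switch1", now=1089925417)
```

Because the interface is just a text line in a pipe, any program that can write a file (a CGI, a trap receiver, a cron job) can steer Nagios at run time.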
Monitoring Network Devices • Ping • Measures system responsiveness via average RTT • SNMP get • CPU • Temperature • Interface/port status • System up time • Power supply status • Throughput • Packet discard rate • etc. • SNMP trap
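Values such as throughput and packet discard rate are not read directly with SNMP get. The device exposes ever-increasing counters, and the rate has to be derived from two samples. The sketch below shows that derivation, assuming 32-bit counters, at most one wrap between polls, and no device reboot in between. The function names are hypothetical.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two counter samples, tolerating a single wrap."""
    modulus = 1 << bits
    return (current - previous) % modulus


def rate_per_second(previous, current, interval_seconds, bits=32):
    """Average events per second between two polls of the same counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta(previous, current, bits) / interval_seconds


# Two polls 300 seconds apart; the counter wrapped past 2**32 in between.
discards_per_second = rate_per_second(4294967290, 10, 300)
```

A plugin can then compare the derived rate against warning and critical thresholds in the same way as any other reading.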
Nagios Trap Handling • Requires Net-SNMP or another trap receiver daemon • The trap receiver notifies Nagios about received traps via external commands • Nagios calls event handlers and/or notifies users
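One way to wire this up is a small script that the trap receiver runs for each trap and that turns the trap into a passive check result for Nagios. The sketch below assumes the Net-SNMP traphandle input layout (sender host name on the first line, transport address on the second, then one "OID value" pair per line) with numeric OID output. The service name "SNMP Traps", the mapping of linkDown/linkUp to states, and the assumption that the sender name matches the Nagios host_name are all illustrative.

```python
import time

SNMP_TRAP_OID = "1.3.6.1.6.3.1.1.4.1.0"   # snmpTrapOID.0
LINK_DOWN = "1.3.6.1.6.3.1.1.5.3"
LINK_UP = "1.3.6.1.6.3.1.1.5.4"

# Hypothetical mapping from trap type to (plugin return code, text).
TRAP_TO_STATE = {
    LINK_DOWN: (2, "linkDown trap received"),
    LINK_UP: (0, "linkUp trap received"),
}


def parse_trap(lines):
    """Split handler input into (hostname, {oid: value})."""
    hostname = lines[0].strip()
    varbinds = {}
    for line in lines[2:]:  # line 1 is the transport address, unused here
        oid, _, value = line.strip().partition(" ")
        if oid:
            varbinds[oid.lstrip(".")] = value.strip()
    return hostname, varbinds


def passive_result(lines, service="SNMP Traps", now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    hostname, varbinds = parse_trap(lines)
    trap_oid = varbinds.get(SNMP_TRAP_OID, "").lstrip(".")
    code, text = TRAP_TO_STATE.get(
        trap_oid, (3, f"unmapped trap {trap_oid or 'unknown'}"))
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{hostname};{service};{code};{text}\n")
```

The resulting line is written to the Nagios command file; from there on Nagios treats the trap like any other state change, so event handlers and notifications apply unchanged.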
Dependency Configuration
define host {
    use        switch-tmpl
    host_name  switch1
    address    1.2.3.10
    parents    router1
}
define host {
    use        switch-tmpl
    host_name  switch2
    address    1.2.3.20
    parents    switch1
}
define host {
    use        switch-tmpl
    host_name  switch3
    address    1.2.3.30
    parents    switch1
}
define host {
    use        switch-tmpl
    host_name  switch4
    address    1.2.3.40
    parents    switch2
}
[Figure: dependency diagram]
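The parents directives are what let Nagios tell a failed device apart from devices that are merely cut off behind it, which is how a notification storm is avoided. The sketch below is a simplified model of that reasoning, not the Nagios implementation: a host that fails its check is DOWN if at least one parent is still up, and UNREACHABLE if all of its parents have failed. It assumes the parent graph has no loops.

```python
# Parent relationships from the configuration above.
PARENTS = {
    "switch1": ["router1"],
    "switch2": ["switch1"],
    "switch3": ["switch1"],
    "switch4": ["switch2"],
}


def classify(failed, parents):
    """Map each host that failed its check to DOWN or UNREACHABLE."""
    states = {}

    def state_of(host):
        if host not in failed:
            return "UP"
        if host not in states:
            host_parents = parents.get(host, [])
            blocked = bool(host_parents) and all(
                state_of(parent) != "UP" for parent in host_parents)
            states[host] = "UNREACHABLE" if blocked else "DOWN"
        return states[host]

    return {host: state_of(host) for host in sorted(failed)}


# switch1 dies, so nothing behind it answers either.
result = classify({"switch1", "switch2", "switch3", "switch4"}, PARENTS)
```

In this scenario only switch1 is classified as DOWN. If contacts are configured to be notified for DOWN but not for UNREACHABLE, one failure produces one notification instead of four.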
Nagios Notification • Similar to event handlers • Triggered by host/service status change • Calls third party notification tools (sendmail, qpage, etc.) • Supports email, page, instant messaging etc.
Nagios Notification format
• Email
Subject: switch3 (1.2.3.30) DOWN
Host: switch3
Address: 1.2.3.30
Date/Time: Thu Jul 15 14:03:37 PDT 2004
Additional Info: (No Information Returned From Host Check)
• Page
DOWNswitch3(1.2.3.30)
Maintenance Scheduling • Schedule a maintenance window via Nagios web interface • Uses external commands • Fixed window • Float window • Dependency aware
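Behind the web interface, a maintenance window becomes a SCHEDULE_HOST_DOWNTIME external command. The sketch below formats that command for both window types: a fixed window runs exactly from start to end, while a float window (flexible downtime in Nagios terms) begins when the host first goes down between start and end and then lasts for the given duration. The helper name, the author name, and the comment text are hypothetical.

```python
import time


def schedule_host_downtime(host, start, end, author, comment,
                           flexible_duration=None, now=None):
    """Format a SCHEDULE_HOST_DOWNTIME line.

    start/end are epoch seconds. With flexible_duration (seconds) the
    window floats inside [start, end]; otherwise it is fixed.
    """
    fixed = 0 if flexible_duration else 1
    duration = flexible_duration if flexible_duration else end - start
    timestamp = int(time.time() if now is None else now)
    trigger_id = 0  # 0 means the downtime is not triggered by another one
    return (f"[{timestamp}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
            f"{fixed};{trigger_id};{duration};{author};{comment}\n")


# One-hour fixed window, and a 30-minute floating window in the same slot.
fixed_cmd = schedule_host_downtime(
    "switch1", 1000, 4600, "ling", "IOS upgrade", now=900)
float_cmd = schedule_host_downtime(
    "switch1", 1000, 4600, "ling", "IOS upgrade",
    flexible_duration=1800, now=900)
```

While the downtime is active, Nagios suppresses notifications for the host, so planned work does not page anyone.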
Monitoring Subnets with Redundant Network Connections • Challenge: • Monitoring interface status • Monitoring standby status at the same time • Solution: • Monitor interface up/down status via ping • Monitor HSRP status via the HSRP MIB
Performance of Nagios • False alarms • False positive • False negative • Unnecessary • Notification delay • Before: 303 sec • After: 221 sec
Money and Time Saved • Software package cost • InCharge ($$$) • IPmonitor ($1500) • Nagios ($0) • Software maintenance contract cost • InCharge (>$15,000) • IPmonitor ($500) • Nagios ($0) • Time saved from fewer unnecessary alarms (compared to IPmonitor) • 20 man-hours/month
Future development of Nagios • Performance Monitoring • Network element out of resources • Interface buffer drops • Duplex mismatch • Has to be done by inference • Assume heterogeneous network equipment • No use of host SNMP • Derive from combination of interface error types and rates • Integrating with other NMS elements • Syslog • MRTG/RRDtool • Trouble ticket System • Database • Topology discovery
Conclusion • Nagios fits our Network Management needs because: • Accurate and efficient fault detection • Extensibility • Can be easily integrated with our existing system • Low maintenance • Fits our budget
Thanks! • We are happy to share • Questions / comments • send to lblnet@lbl.gov