Learn how to monitor grid nodes and applications using Nagios and RRDtool. Monitor node status, memory usage, network load, and more. Access historical data and receive automated notifications. Avoid unnecessary checks and focus on root cause failures. Improve system state response with RRDtool's round-robin database.
Grid Monitoring using Nagios and RRDtool Ian Stokes-Rees Oxford Particle Physics
In a perfect world … • Individual node status • Is it up? • What is its load? • What is the memory and swap usage? • NFS and network load? • Are the partitions full? • Are applications and services running properly? • Amalgamated node status • Same info, but across groups of nodes
In a perfect world … • Historical information • Trends • Notification of service states • e.g. Storage down to 100 megs free = Warning • Storage down to 10 megs free = Critical • sshd no longer running = Failure • notify by email, pager, mobile • Easy access to monitoring information • web, email, digest, mobile
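The storage thresholds on this slide can be sketched as a small classifier. This is a minimal illustration only: the function name `storage_state` and the choice to treat the limits as inclusive are assumptions, not part of the talk.

```python
def storage_state(free_mb, warn_mb=100, crit_mb=10):
    """Map free storage in megabytes to a notification level.

    Thresholds follow the slide: 100 MB free = Warning, 10 MB free = Critical.
    Treating the limits as inclusive is an assumption.
    """
    if free_mb <= crit_mb:
        return "CRITICAL"
    if free_mb <= warn_mb:
        return "WARNING"
    return "OK"


for free in (250, 80, 5):
    print(free, "MB free ->", storage_state(free))
```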
In a perfect world … • Avoidance of “Too many red flashing lights” • “Just the facts, ma’am” – only want root cause failures to be reported, not a cascade of every downstream failure. • also includes avoiding unnecessary checks • e.g. HTTP responding, therefore no need to ping • e.g. power outage, doesn’t ping, so don’t bother trying anything else • Other wish list requirements?
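One way to picture "root cause only" reporting is to keep just the failed nodes whose parent is still up. The sketch below assumes each node has at most one parent; the node names and the `root_causes` helper are hypothetical.

```python
def root_causes(parents, down):
    """Return failed nodes whose parent has not itself failed.

    parents: dict mapping node -> parent node (None for the top of the tree)
    down:    set of nodes currently failing their checks
    """
    return sorted(node for node in down if parents.get(node) not in down)


# Hypothetical topology: three nodes behind one switch.
parents = {"switch": None, "node1": "switch", "node2": "switch", "node3": "node2"}
down = {"switch", "node1", "node2", "node3"}

print(root_causes(parents, down))  # ['switch']
```

Only the switch is reported; the three nodes behind it are treated as downstream failures.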
Aspects of Current Grid Monitoring • LDAP (Lightweight Directory Access Protocol) is the current foundation for MDS. Designed for frequent reads and infrequent writes. • MDS (Monitoring and Discovery Service) uses LDAP for maintaining static and dynamic system details. • R-GMA (Relational Grid Monitoring Architecture) is meant to address shortcomings of the LDAP based MDS system by using a hierarchy of relational databases. Now being deployed. • GRIS (Grid Resource Information Service) stores details about the state of “the grid” (at least from the local node) • GIIS (Grid Index Information Service) ties together several GRISes • HBM (Heart Beat Monitor) monitors Globus services – seems to have died a quiet death
Existing Grid Monitoring Lacks… • Historical information for trends • Simple interface for accessing information • Automated response to changes in system state Here is where RRDtool and Nagios can contribute
RRDtool www.rrdtool.com • Round Robin Database for time series data storage • Command line based • From the author of MRTG • Made to be faster and more flexible than MRTG • Includes CGI and Graphing tools, plus APIs • Solves the Historical Trends and Simple Interface problems
Define Data Sources (Inputs) • DS:speed:COUNTER:600:U:U • DS:fuel:GAUGE:600:U:U • DS = Data Source • speed, fuel = “variable” names • COUNTER, GAUGE = variable type • 600 = heart beat – UNKNOWN returned for interval if nothing received after this amount of time • U:U = limits on minimum and maximum variable values (U means unknown and any value is permitted)
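The difference between the two variable types can be sketched as follows: a COUNTER is stored as a rate of change per second, a GAUGE is stored as the reading itself, and a gap longer than the heartbeat gives UNKNOWN. This is a simplified model: real RRDtool also handles counter wrap-around and normalises readings to step boundaries, which is not shown. The function names are hypothetical.

```python
HEARTBEAT = 600  # seconds, as in DS:speed:COUNTER:600:U:U


def counter_rate(prev, curr, heartbeat=HEARTBEAT):
    """COUNTER: per-second rate between two (timestamp, value) readings.

    Returns None (UNKNOWN) when the readings are further apart than the heartbeat.
    """
    (t0, v0), (t1, v1) = prev, curr
    if t1 - t0 > heartbeat:
        return None
    return (v1 - v0) / (t1 - t0)


def gauge_value(curr):
    """GAUGE: the reading is stored as it is."""
    return curr[1]


# Odometer readings in km, 300 seconds apart: 12 km in 5 minutes.
print(counter_rate((920804700, 12345), (920805000, 12357)))  # 0.04 (km/s)
# Fuel readings are not differentiated.
print(gauge_value((920805000, 5.8)))  # 5.8
```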
Define Archives (Outputs) • RRA:AVERAGE:0.5:1:24 • RRA:AVERAGE:0.5:6:10 • RRA = Round Robin Archive • AVERAGE = consolidation function • 0.5 = up to 50% of the primary samples in a consolidation interval may be UNKNOWN before the consolidated value itself becomes UNKNOWN • 1:24 = this RRA keeps each sample (average over one 5 minute primary sample), 24 times (which is 2 hours' worth) • 6:10 = this RRA keeps an average over every six 5 minute primary samples (30 minutes), 10 times (which is 5 hours' worth) • Clear as mud! • all depends on the original step size, which defaults to 5 minutes
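The arithmetic behind the RRA numbers is step size × steps per entry × number of entries. A minimal sketch, assuming the default 300 second step; the helper name `rra_span` is hypothetical.

```python
STEP = 300  # seconds, the default primary step size


def rra_span(steps, rows, step=STEP):
    """Return (seconds per archive entry, total seconds retained)."""
    resolution = steps * step
    return resolution, resolution * rows


for steps, rows in [(1, 24), (6, 10)]:
    resolution, span = rra_span(steps, rows)
    print(f"RRA {steps}:{rows} -> one entry per {resolution // 60} min, "
          f"{span / 3600:g} h retained")
```

This reproduces the figures on the slide: 5 minutes per entry for 2 hours, and 30 minutes per entry for 5 hours.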
RRDtool Database Format • RRD File, --step 300 (5 minute input step size) • RRA 1:24 – recent data stored once every 5 minutes for the past 2 hours • RRA 6:10 – medium length data averaged to one entry per half hour for the last 5 hours • RRA 288:365 – old data averaged to one entry per day for the last 365 days
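The "round robin" idea is a fixed-size buffer in which the newest consolidated entry overwrites the oldest. The sketch below is a toy model of one AVERAGE archive, not RRDtool's file format: it ignores UNKNOWN values and timestamps, and the class name is hypothetical.

```python
from collections import deque


class RoundRobinArchive:
    """Toy model of an AVERAGE RRA: `steps` primary samples per row, `rows` rows kept."""

    def __init__(self, steps, rows):
        self.steps = steps
        self.rows = deque(maxlen=rows)  # oldest row is dropped automatically
        self.pending = []               # primary samples not yet consolidated

    def add(self, value):
        self.pending.append(value)
        if len(self.pending) == self.steps:
            self.rows.append(sum(self.pending) / self.steps)
            self.pending.clear()


rra = RoundRobinArchive(steps=6, rows=10)   # like RRA:AVERAGE:0.5:6:10
for sample in range(120):                   # 120 five-minute samples = 10 hours
    rra.add(float(sample))

print(len(rra.rows))   # 10: only the last 5 hours survive
print(rra.rows[0])     # 62.5: average of samples 60..65
```

The file never grows: after ten hours of input the archive still holds ten rows.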
RRDtool Example • Monitoring a car – fuel in the tank plus odometer
    Time   Odometer   Fuel    Note
    12:05  12345 KM   7.0 L
    12:10  12357 KM   5.8 L
    12:15  12363 KM   5.2 L   STOP
    12:20  12363 KM   5.2 L
    12:25  12363 KM   5.2 L   RESTART
    12:30  12373 KM   4.2 L
    12:35  12383 KM   3.2 L
    12:40  12393 KM   2.2 L
    12:45  12399 KM   1.6 L
    12:50  12405 KM   9.0 L   REFUEL
    12:55  12411 KM   8.4 L
    13:00  12415 KM   8.0 L
    13:05  12420 KM   7.5 L
    13:10  12422 KM   7.3 L
    13:15  12423 KM   7.2 L
RRDtool Example • Create an RRD to store distance and fuel
    rrdtool create car.rrd --start 920804400 \
        DS:speed:COUNTER:600:U:U \
        DS:fuel:GAUGE:600:U:U \
        RRA:AVERAGE:0.5:1:24 \
        RRA:AVERAGE:0.5:6:10
• --start defines the earliest time the RRD accepts; updates must carry a later timestamp
RRDtool Example • Input data:
    rrdtool update car.rrd 920804700:12345:7.0 920805000:12357:5.8
    rrdtool update car.rrd 920805300:12363:5.2 920805600:12363:5.2
    rrdtool update car.rrd 920805900:12363:5.2 920806200:12373:4.2
    rrdtool update car.rrd 920806500:12383:3.2 920806800:12393:2.2
    rrdtool update car.rrd 920807100:12399:1.6 920807400:12405:9.0
    rrdtool update car.rrd 920807700:12411:8.4 920808000:12415:8.0
    rrdtool update car.rrd 920808300:12420:7.5 920808600:12422:7.3
    rrdtool update car.rrd 920808900:12423:7.2
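Each update argument has the form timestamp:speed:fuel, matching the order of the DS definitions, with timestamps 300 seconds apart starting one step after --start. A small sketch of how such arguments could be generated from readings; the helper `update_args` is hypothetical and only the first three readings are used.

```python
START = 920804400  # value passed to --start
STEP = 300         # seconds between readings

# (odometer km, fuel litres) for 12:05, 12:10, 12:15
readings = [(12345, 7.0), (12357, 5.8), (12363, 5.2)]


def update_args(readings, start=START, step=STEP):
    """Build 'timestamp:km:fuel' strings, the first one a full step after start."""
    return [f"{start + step * (i + 1)}:{km}:{fuel}"
            for i, (km, fuel) in enumerate(readings)]


print("rrdtool update car.rrd " + " ".join(update_args(readings)))
```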
RRDtool Graphing • Now with data in the RRD, RRDtool can generate graphs:
    rrdtool graph speed.gif \
        --start 920804400 --end 920808000 \
        --vertical-label m/s \
        DEF:myspeed=car.rrd:speed:AVERAGE \
        DEF:myfuel=car.rrd:fuel:AVERAGE \
        CDEF:realspeed=myspeed,1000,\* \
        LINE2:realspeed#FF0000 \
        LINE2:myfuel#00FF00
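The CDEF expression is written in reverse Polish notation: `myspeed,1000,*` means "push myspeed, push 1000, multiply", which turns km/s into m/s. The evaluator below is a minimal sketch supporting only the four arithmetic operators, not RRDtool's full RPN set; `eval_rpn` is a hypothetical name.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def eval_rpn(expr, variables):
    """Evaluate a comma-separated RPN expression such as 'myspeed,1000,*'."""
    stack = []
    for token in expr.split(","):
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    return stack.pop()


# 0.04 km/s (12 km in 300 s) expressed in m/s
print(eval_rpn("myspeed,1000,*", {"myspeed": 0.04}))  # 40.0
```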
RRDtool Graphing Output • Much more interesting graphs possible • Multiple RRDs may be used as sources for variables • Auto-interpolation of points • Functions and calculations can be applied to variables • Legends, labels, and text can be inserted
Nagios www.nagios.org • Instantaneous service level monitoring • Web based interface • Somewhat complicated set of configuration files to manually edit • Automated notification of change in service level (email, phone, etc.) • Defines OK, WARNING, CRITICAL, and UNKNOWN service states
Nagios Host Definitions • Define details about each node and its place in the network hierarchy:
    define host{
        host_name               tbce01
        alias                   Testbed CE
        address                 163.1.243.105
        parents                 edg-testbed
        notifications_enabled   1
        process_perf_data       1
        check_command           check-host-alive
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
    }
Nagios Service Definitions • Define details about each service:
    define service{
        name                    ping
        check_command           check_ping!100.0,20%!500.0,60%
        contact_groups          linux-admins
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
    }
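The two arguments after `check_ping!` are the warning and critical thresholds, each a pair of round-trip time in milliseconds and packet loss in percent. The sketch below shows how such a pair could be interpreted; the function names are hypothetical, and treating a reading equal to the threshold as a breach is an assumption rather than documented plugin behaviour.

```python
def parse_threshold(arg):
    """Split '100.0,20%' into (round-trip ms, packet loss %)."""
    rta, loss = arg.split(",")
    return float(rta), float(loss.rstrip("%"))


def ping_state(rta_ms, loss_pct, warn="100.0,20%", crit="500.0,60%"):
    """Classify one ping measurement against the thresholds from the slide."""
    warn_rta, warn_loss = parse_threshold(warn)
    crit_rta, crit_loss = parse_threshold(crit)
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return "CRITICAL"
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return "WARNING"
    return "OK"


print(ping_state(12.5, 0))    # OK
print(ping_state(150.0, 0))   # WARNING
print(ping_state(20.0, 80))   # CRITICAL
```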
Nagios Service and Host Polling • Pull model, where Nagios server executes command to fetch host or service status • Requires remote hosts and services to cooperate • NRPE installed on clients allows server to execute “plugins” to poll for information • Alternatively use existing client reporting mechanisms (ping, wget, http) • Server responsible for configuration of polling intervals and details to be polled
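In the pull model the server runs a plugin and reads its result. By Nagios plugin convention the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the first line of output is the status text. A minimal sketch of the decision part of such a plugin; `check_load` and its thresholds are hypothetical, and a real plugin would pass the code to `sys.exit`.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def check_load(load1, warn=5.0, crit=10.0):
    """Return (exit code, status line) for a 1-minute load average reading."""
    if load1 is None:
        return UNKNOWN, "LOAD UNKNOWN - no reading available"
    if load1 >= crit:
        return CRITICAL, f"LOAD CRITICAL - load1={load1:.2f}"
    if load1 >= warn:
        return WARNING, f"LOAD WARNING - load1={load1:.2f}"
    return OK, f"LOAD OK - load1={load1:.2f}"


code, text = check_load(0.42)
print(code, text)  # 0 LOAD OK - load1=0.42
```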
Nagios Service and Host Reporting • Push model, where services and hosts decide when to report status to Nagios server • push data when available/relevant • generally full access to node-local data • requires configuring every node independently • authentication of nodes at server • nodes need to know who to send data to
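One common way to realise the push model is Nagios passive checks, where a node submits a `PROCESS_SERVICE_CHECK_RESULT` external command. The slide does not name the mechanism, so this choice is an assumption. The sketch only formats the command line; the service name `disk` is hypothetical and delivery to the server is not shown.

```python
def passive_result(host, service, code, output, now):
    """Format a passive service check result as a Nagios external command."""
    return (f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{output}")


# Node tbce01 reports a WARNING (return code 1) for a hypothetical 'disk' service.
print(passive_result("tbce01", "disk", 1, "DISK WARNING - 80 MB free", 920804700))
```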
Finally, some other monitors • NWS (Network Weather Service) attempts to predict network utilisation from historical information • Ganglia cluster monitoring system, provides aggregate graphs of cluster performance – Globus/EDG tie-ins underway • Map Center EDG project to monitor Grid status and services • ActiveMap, GridPortal, and InfoPortal* appear to be inactive projects