Understand how performance data is stored and accessed in Nagios XI, and learn how to create and interpret performance graphs. This presentation is a valuable reference for Nagios users.
Leveraging and Understanding Performance Data and Graphs Troy Lea troy@box293.com Twitter: @Box293 http://exchange.nagios.org/directory/Owner/Box293/1
About Me IT Consultant Nagios Developer Love tinkering with Nagios Why Nagios XI? It’s a virtual appliance - ready to go
About This Presentation Understanding how performance data is stored in the back end and how Nagios accesses it Goal is to give you key pieces of information A good reference for understanding concepts This presentation is centered around Nagios XI Valid for other Nagios implementations
Basic Concepts - Part 2 ./check_nt -H SERVER -s "" -p 12489 -v USEDDISKSPACE -l C -w 80 -c 95 C:\ - total: 39.99 Gb - used: 25.28 Gb (63%) - free 14.71 Gb (37%) | 'C:\ Used Space'=25.28Gb;32.00;38.00;0.00;39.99
Basic Concepts - Part 3 Service check command is executed by the monitoring engine Monitoring engine receives the result of the check Data received has performance data Performance data is anything after the | (pipe) The performance data is inserted into an RRD file When viewing the performance graph, PNP4Nagios retrieves the performance data from the RRD file and generates a pretty graph Every time the service check receives performance data, it inserts this performance data into the RRD file which allows you to look at trends over time
Plugins The power of Nagios is in the plugins! Monitor what you want, how you want! Resources available that clearly define the guidelines around creating plugins Nagios Plug-in Developer Guidelines http://nagiosplug.sourceforge.net/developer-guidelines.html PNP Documentation http://docs.pnp4nagios.org/pnp-0.4/doc_complete
Plugin Output Explained - Part 1 Plugins produce data divided into two parts The pipe symbol “|” is used as a delimiter Example (check_icmp): OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; Data to the left of the pipe symbol is processed by the monitoring engine Data to the right of the pipe symbol is used for inserting into RRD and XML files
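The split at the pipe symbol can be tried out on the command line. The following is a minimal sketch, not how Nagios itself does it: it takes the check_icmp sample line from this slide and separates it with bash parameter expansion. The variable names are made up for the illustration.

```shell
#!/bin/bash
# Split one line of plugin output at the first pipe symbol.
# The sample line is the check_icmp example from the slide.
output="OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;"

# Everything before the first "|" is the human-readable status text.
status_text="${output%%|*}"
# Everything after the first "|" is the performance data.
perf_data="${output#*|}"

# Trim the single space on either side of the pipe in this sample.
status_text="${status_text% }"
perf_data="${perf_data# }"

echo "Status text:      $status_text"
echo "Performance data: $perf_data"
```

If the output contains no pipe at all, both variables end up holding the whole line, which is a quick way to spot a plugin that returns no performance data.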
Plugin Output Explained - Part 2 The exit code Nagios receives from the plugin determines the state of the service 0 = OK 1 = WARNING 2 = CRITICAL 3 = UNKNOWN The exit code is not “visible” when running a check from the command line or looking at the output returned from the plugin
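Although the exit code is not printed, the shell keeps it in $? straight after the plugin finishes. A small sketch of how to look at it, using a stand-in function (fake_plugin, invented for this example) in place of a real plugin:

```shell
#!/bin/bash
# Map a plugin exit code to the service state listed on the slide.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "not a standard plugin exit code" ;;
    esac
}

# Stand-in for a real plugin: prints a status line and exits with code 2.
fake_plugin() {
    echo "CRITICAL - disk almost full"
    return 2
}

# Capture the exit code; $? holds the status of the command that just ran.
code=0
fake_plugin || code=$?
echo "Exit code: $code ($(state_name "$code"))"
```

With a real plugin the same idea applies: run the check, then run echo $? as the very next command.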
Plugin Output Explained - Part 3 No performance data = no pretty graphs You can create a plugin using whatever language and tools are available All that matters is the end result which is returned back to Nagios when the plugin has finished running
Plugin Output Explained - Part 4 Examples: Shell script Something you might want to check on the Nagios host itself Perl script Remotely checking a device using SNMP OR using third party APIs like the VMware vSphere SDK to remotely access virtual environments Visual Basic script Using NSClient on a Windows host to perform a check (like RDP usage)
Performance Data Specifics - Part 1 Fields marked with an asterisk (*) are required, everything else is optional In this instance, rta is the FIRST DS, or DS 1
Performance Data Specifics - Part 2 Multiple DS Each DS is separated by a space rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; The label can contain spaces, but it MUST then be enclosed in single quotes 'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
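To see the individual fields of a DS, the string can be taken apart at the "=" and the semicolons. This is only a sketch under the assumption that the field order is label=value;warn;crit;min;max, as in the Nagios plug-in guidelines; parse_ds is a name invented here and it handles one DS at a time.

```shell
#!/bin/bash
# Break one datasource (DS) into its fields.
# Assumed field order: 'label'=value[UOM];[warn];[crit];[min];[max]
parse_ds() {
    local ds="$1"
    local quote="'"
    local label="${ds%%=*}"     # text before the first "="
    local rest="${ds#*=}"       # text after the first "="
    local value warn crit min max

    # Split the remainder on semicolons; empty fields stay empty.
    IFS=';' read -r value warn crit min max <<< "$rest"

    # Strip the single quotes that wrap a label containing spaces.
    label="${label#"$quote"}"
    label="${label%"$quote"}"

    echo "label=$label value=$value warn=$warn crit=$crit min=$min max=$max"
}

parse_ds "'Round Trip Average'=2.687ms;3000.000;5000.000;0;"
parse_ds "pl=0%;80;100;;"
```

The value still carries its unit (ms, %), which is how the unit of measurement reaches the graph.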
Basic Plugin - Part 1 Example shell script demonstrating how a plugin outputs performance data
# Pick two random numbers: 1-100 and 1-1000
NUMBER1=$(( (RANDOM % 100) + 1 ))
NUMBER2=$(( (RANDOM % 1000) + 1 ))
# Status text, then the pipe, then one DS per number
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"
# Exit code 0 = OK
exit 0
Basic Plugin - Part 2 Here is the output each time it is run: OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;; OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;;;; 'Number 2'=758;;;; OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;;;; 'Number 2'=60;;;; OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;;;; 'Number 2'=338;;;; OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;;;; 'Number 2'=612;;;;
Basic Plugin - Part 3 Performance data displayed as a pretty graph Demonstration of how you can generate performance data in a plugin
Basic Plugin - Part 4 Now let's add warning and critical thresholds to the performance data string Number1 WARNING @ 50 CRITICAL @ 75 Number2 WARNING @ 500 CRITICAL @ 750
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
Basic Plugin - Part 5 Here is the output each time it is run: OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;; OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;50;75;; 'Number 2'=758;500;750;; OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;50;75;; 'Number 2'=60;500;750;; OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;50;75;; 'Number 2'=338;500;750;; OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;50;75;; 'Number 2'=612;500;750;;
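Basic Plugin - Part 6 This demonstrates that the performance data does not have any effect on the state of the service: every result is still reported as OK, even when a number is above its threshold, because the state comes from the exit code Warning and Critical thresholds are stored inside the .xml file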
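For the state to follow the thresholds, the plugin has to do the comparison itself and return the matching exit code. A minimal sketch for Number 1 only, assuming that a value at or above a threshold triggers that state (the slides do not say whether the comparison is inclusive); number_check is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: compare the value with its thresholds and return the
# matching exit code. Assumption: value >= threshold triggers the state.
number_check() {
    local value="$1" warn=50 crit=75
    local state="OK" code=0
    if [ "$value" -ge "$crit" ]; then
        state="CRITICAL"; code=2
    elif [ "$value" -ge "$warn" ]; then
        state="WARNING"; code=1
    fi
    # The same thresholds are also written into the performance data.
    echo "$state - Number 1: $value | 'Number 1'=$value;$warn;$crit;;"
    return "$code"
}

NUMBER1=$(( (RANDOM % 100) + 1 ))
code=0
number_check "$NUMBER1" || code=$?
echo "Exit code: $code"
# A real plugin would finish with "exit $code" so that Nagios sees the state.
```

The thresholds now appear twice on purpose: once in the logic that sets the state, and once in the performance data where they only serve the graph.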
.rrd and .xml files Used for recording the results from Nagios checks Useful for observing daily trends of your environment Invaluable for helping resolve performance issues RRD = Round Robin Database XML = Information about the Nagios check PNP4Nagios uses the RRD and XML files to generate pretty graphs
Location of .rrd and .xml files When a service check returns performance data, Nagios dumps this into: /usr/local/nagios/var/spool/perfdata A background process detects the spooled data and creates / updates the relevant .rrd and .xml The Performance Data files live in: /usr/local/nagios/share/perfdata/<host>
Extract .rrd data You can extract data from an .rrd file Example (from the CLI): rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h
.rrd and .xml Gotchya - Part 1 The .xml file can contain sensitive data <NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
.rrd and .xml Gotchya - Part 2 Perhaps use a central credential file <NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
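One way to check whether a known password has ended up in the .xml files is to search the perfdata folder for it. A rough sketch, assuming GNU grep and the perfdata path used in these slides; find_leaks is a made-up helper and the password is the example one from the previous slide.

```shell
#!/bin/bash
# List .xml files under a perfdata directory that contain a given string.
find_leaks() {
    local secret="$1"
    local dir="${2:-/usr/local/nagios/share/perfdata}"
    # -r recurse, -l print file names only, -F treat the secret literally
    grep -rlF --include='*.xml' -- "$secret" "$dir"
}

# Only search where the perfdata directory actually exists.
if [ -d /usr/local/nagios/share/perfdata ]; then
    find_leaks 'Str0ngPassw0rd' || echo "No .xml file contains that string"
fi
```

Any file it lists was written by a check whose command line carried the password, which makes that check a candidate for a central credential file.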
.rrd and .xml Gotchya - Part 3 RRD data is averaged out over time Looking at performance graphs for the past day / week / month / year will show results with less spiky data This generally only occurs with data that has lots of peaks and troughs Constant data like disk space used will generally not average out that much It all depends on your environment! Take these factors into consideration when reviewing RRD data; it’s all relative!
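The effect of averaging is easy to reproduce with a few numbers. The sketch below is not rrdtool itself: it assumes the archive uses averaging and that four fine-grained samples are merged into one point. The sample values and the consolidate helper are invented for the illustration.

```shell
#!/bin/bash
# Average every N samples into one value, similar in spirit to how an
# RRD archive consolidates fine-grained data into coarser steps.
consolidate() {
    local n="$1"; shift
    printf '%s\n' "$@" | awk -v n="$n" '
        { sum += $1; count++ }
        count == n { printf "%.1f\n", sum / n; sum = 0; count = 0 }
    '
}

# Hypothetical samples with a single spike to 100.
samples=(10 10 100 10 10 10 10 10)
echo "Raw samples: ${samples[*]}"
echo "After averaging 4 samples per point:"
consolidate 4 "${samples[@]}"
```

The spike of 100 shows up as 32.5 once four samples are merged, while the flat stretch stays at 10.0, which is why constant data such as disk usage looks much the same at every time range.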
Graphs - How Templates Are Used - Part 1 http://docs.pnp4nagios.org/pnp-0.4/tpl
Graphs - How Templates Are Used - Part 2 PNP4Nagios queries the XML file for the <TEMPLATE> tag Each datasource has its own <TEMPLATE> tag <TEMPLATE>check-host-alive</TEMPLATE> The template name can also be given as a trailing string in the performance data (good for distributed monitoring) OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp]
Graphs - How Templates Are Used - Part 3 From the example graphs: <TEMPLATE>check-host-alive</TEMPLATE> <TEMPLATE>check_local_load_alt</TEMPLATE> PNP4Nagios looks for a php file with this name in the following folders: /usr/local/nagios/share/pnp/templates.dist /usr/local/nagios/share/pnp/templates
Graphs - How Templates Are Used - Part 4 check-host-alive /usr/local/nagios/share/pnp/templates.dist/check-host-alive.php exists This PHP file generates the performance graph check_local_load_alt check_local_load_alt.php does NOT exist so the default template is used: /usr/local/nagios/share/pnp/templates.dist/default.php
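The lookup described on these slides can be imitated with a short script to predict which file PNP4Nagios will end up using. This is a sketch only: it assumes the custom templates folder is checked before templates.dist, and resolve_template and PNP_DIR are names invented for the example.

```shell
#!/bin/bash
# Work out which PHP template file would be used for a template name.
# PNP_DIR defaults to the path from the slides; override it for testing.
PNP_DIR="${PNP_DIR:-/usr/local/nagios/share/pnp}"

resolve_template() {
    local name="$1" dir
    # Assumption: custom templates take precedence over the shipped ones.
    for dir in "$PNP_DIR/templates" "$PNP_DIR/templates.dist"; do
        if [ -f "$dir/$name.php" ]; then
            echo "$dir/$name.php"
            return 0
        fi
    done
    # No file with that name: fall back to the default template.
    echo "$PNP_DIR/templates.dist/default.php"
}

resolve_template "check-host-alive"
resolve_template "check_local_load_alt"
```

On a system laid out as in the slides, the first call would print the check-host-alive.php path and the second would fall back to default.php.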
Graphs - Creating Your Own Template - Part 1 The check_command name is what Nagios uses to insert into the <TEMPLATE> tag in the XML file (how PNP determines which template to use) So for this example I have created a copy of an existing command check_xi_service_nsclient_alt
Graphs - Creating Your Own Template - Part 2 The service definition using the new command
Graphs - Creating Your Own Template - Part 3 The graph currently being generated Default Template being used Check Command being used .rrd and .xml files currently contain valid data
Graphs - Creating Your Own Template - Part 4 Copy the file: /usr/local/nagios/share/pnp/templates.dist/default.php To the following location with the name: /usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php Edit check_xi_service_nsclient_alt.php
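The copy step could be scripted along these lines. Treat it as a sketch: copy_default_template and PNP_DIR are names chosen for this example, and the refusal to overwrite an existing template is an added safety check rather than something the slides call for.

```shell
#!/bin/bash
# Copy the default template to a new file named after the check command.
# PNP_DIR defaults to the path from the slides; override it for testing.
PNP_DIR="${PNP_DIR:-/usr/local/nagios/share/pnp}"

copy_default_template() {
    local command_name="$1"
    local target="$PNP_DIR/templates/$command_name.php"
    # Refuse to overwrite a custom template that already exists.
    if [ -e "$target" ]; then
        echo "Already exists: $target" >&2
        return 1
    fi
    cp "$PNP_DIR/templates.dist/default.php" "$target" && echo "$target"
}

# Only attempt the copy where the default template is actually present.
if [ -f "$PNP_DIR/templates.dist/default.php" ]; then
    copy_default_template "check_xi_service_nsclient_alt" || true
fi
```

The new file name has to match the check_command name exactly, because that name is what ends up in the <TEMPLATE> tag.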
Graphs - Creating Your Own Template - Part 5 In the graph we are removing the bottom two lines: "Default Template" and "Check Command <command name>" These are lines 62 and 63 of the file: $def[$i] .= 'COMMENT:"Default Template\r" '; $def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" '; Delete both lines and save check_xi_service_nsclient_alt.php
Graphs - Creating Your Own Template - Part 6 How easy was that! • Updated graph • Template Name and Check Command removed
PNP Templates In Detail - Part 1 Let's get into specifics Template we just modified It’s not that complicated! (LOL)
PNP Templates In Detail - Part 2 .rrd files can have multiple datasources (DS) Round Trip Time and Packet Loss for example
PNP Templates In Detail - Part 3 Example of .rrd file with five DS Two graphs generated using these DS
PNP Templates In Detail - Part 4 Default Template creates one graph per DS This is a simple PHP foreach loop The code within the loop references the relevant DS by the $i variable
PNP Templates In Detail - Part 5 This section of the template uses three DS One graph will be generated using three DS $opt[1] and $def[1] refer to the first graph being generated
PNP Templates In Detail - Part 6 Number formatting Our modified template and the relevant code • The relevant information: • %3.4lf
PNP Templates In Detail - Part 7 The three DS template and the relevant code • The relevant information: • %4.0lf
PNP Templates In Detail - Part 8 Numbers are displayed with four decimal places • %3.4lf • Numbers are displayed as whole numbers • %4.0lf
PNP Templates In Detail - Part 9 PNP documentation defines the number formatting using the printf standard described here: http://en.wikipedia.org/wiki/Printf Be careful, the number one (1) and the lower case letter "L" look alike %3.4lf contains a lower case "L" The syntax is %[parameter][flags][width][.precision][length]type
PNP Templates In Detail - Part 10 width When the number is drawn on the graph it is given at least this many characters, which helps you align numbers in a column style precision Determines whether the number displayed is a whole number, or a number with a specific number of digits following the decimal point
PNP Templates In Detail - Part 11 %3.4lf width = 3 precision = .4 hence the displayed number is 25.3800 %4.0lf width = 4 precision = .0 hence the displayed number is 14 Because the precision is 0, NO decimal place is used
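The same width and precision rules can be tried with the shell's own printf. This is an illustration rather than the template code: the shell version is written without the "l" length modifier that the template format uses, and the square brackets are added only to make the padding visible.

```shell
#!/bin/bash
# Use the C locale so the decimal separator is always a dot.
export LC_ALL=C

# %3.4f: minimum width 3, four digits after the decimal point.
printf '[%3.4f]\n' 25.38

# %4.0f: minimum width 4, no decimal places, padded with leading spaces.
printf '[%4.0f]\n' 14
```

The first line prints [25.3800]: the number is wider than 3 characters, so the width has no visible effect. The second prints [  14], with two spaces of padding to reach a width of 4.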
MRTG - Part 1 MRTG = Multi Router Traffic Grapher Nagios addon that is useful for monitoring network switch and router bandwidth using SNMP The configuration can be complicated to understand
MRTG - Part 2 Nagios XI Wizard called “Network Switch / Router” automates the configuration of MRTG MRTG configuration file /etc/mrtg/mrtg.cfg MRTG runs as a cron job every five minutes cron comes from the Greek word for time, χρόνος [chronos] Hence cron is a software utility on Linux which is a time-based job scheduler In the Windows world it's the Task Scheduler
MRTG - Part 3 When MRTG runs, it gathers data from the devices defined in the mrtg.cfg file It dumps this data into the folder /var/lib/mrtg For every port monitored, an .rrd file is created (no .xml file created at this point) Another background process will then take the data in /var/lib/mrtg and put it into the correct location /usr/local/nagios/share/perfdata/<host>
MRTG Gotchya - Part 1 When the Wizard populates the mrtg.cfg file it will add ALL ports on the switch to the config file Even if you only selected 10 ports on the switch to monitor The Nagios XI Service Configuration will only have 10 ports defined as service definitions Every time the MRTG cron job runs, it will collect data from all ports on the switch (as defined in the mrtg.cfg file) Extra CPU cycles, extra disk space
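A quick way to see how many ports MRTG is really polling is to count the Target entries in mrtg.cfg and compare that with the number of services you defined. A sketch, assuming each monitored port has one line starting with Target[ in the file; count_mrtg_targets is a name made up here.

```shell
#!/bin/bash
# Count the ports MRTG polls by counting Target[...] lines in its config.
count_mrtg_targets() {
    local cfg="${1:-/etc/mrtg/mrtg.cfg}"
    # The pattern is anchored to the start of the line, so targets that
    # have been commented out with "#" are not counted.
    grep -c '^Target\[' "$cfg"
}

# Only report where the MRTG configuration file is present and readable.
if [ -r /etc/mrtg/mrtg.cfg ]; then
    echo "Ports polled by MRTG: $(count_mrtg_targets)"
fi
```

If the number printed is much larger than the number of port services in Nagios XI, the extra entries are the ones costing CPU cycles and disk space.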