My latest install of Nagios Core monitors several of our servers, but the Nagios configuration itself is kept very clean: only ONE main service is monitored on most of my servers. The command it runs is check_main, a custom script I wrote that runs several checks (defined on each server in a settings file). The results of those checks are all returned to Nagios in one line.

As an example, here is the settings file for the first server in the screenshot, EMR1BYR:

checks[0]='cpu|check_load -w 15,10,5 -c 30,25,20|1'
checks[1]='disk|check_disk -w 5 -c 2|1'
checks[2]='memory|check_memory -w 98% -c 99%|1'
checks[3]='activemq|check_procs -w 1:3 -c 0:8 -a /opt/apache-activemq/bin/run.jar|1'
checks[4]='lock|check_dblock_nagios.sh|1'
checks[5]='ping|check_ping -H 172.16.90.254 -w 500,50% -c 1000,80%|1'
checks[6]='rsyslog|check_procs -w 1:5 -c 0:10 -a /sbin/rsyslogd|1'
checks[7]='postfix|check_procs -w 1:5 -c 1:10 -a "/usr/libexec/postfix/master"|1'
checks[8]='apache&tomcat_time|check_tomcat_time|1'

Each element of the "checks" array has 3 arguments, separated by pipes (|):

The first argument is the short name of the check, which is reported alongside its OK, WARNING, or CRITICAL status.

The second argument is the script (with its arguments) that gets called for that check.

The third argument is whether the check is enabled. That way, if we need to disable a check for a particular server without deleting it, we can simply put a 0 in the file on THAT server with no further changes needed. It will magically disappear from Nagios because it won't get returned in the output :)
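
To make the three-field format concrete, here is a minimal sketch of how an entry could be split and how disabled checks could be skipped. This is not the actual check_main; the helper names parse_check and enabled_checks are made up for illustration, and the disk check is set to 0 here only to show the effect. It also assumes no command contains a literal pipe character.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: split "name|command|enabled" entries and list the enabled ones.

checks[0]='cpu|check_load -w 15,10,5 -c 30,25,20|1'
checks[1]='disk|check_disk -w 5 -c 2|0'   # set to 0 here for illustration only
checks[2]='apache&tomcat_time|check_tomcat_time|1'

# parse_check (hypothetical helper) sets the globals name, cmd, and enabled.
parse_check() {
    IFS='|' read -r name cmd enabled <<< "$1"
}

# enabled_checks (hypothetical helper) prints the short name of every enabled check.
enabled_checks() {
    local entry
    for entry in "${checks[@]}"; do
        parse_check "$entry"
        [ "$enabled" = "1" ] && printf '%s\n' "$name"
    done
    return 0
}

enabled_checks
```

With the disk entry set to 0, only cpu and apache&tomcat_time are listed, which is the same reason a disabled check never shows up in Nagios.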

Another important piece of check_main is that if any check returns a WARNING or CRITICAL, check_main needs to pass that status back to Nagios, so it returns the greatest error level reported by any of the checks. So if one returns WARNING but another returns CRITICAL, CRITICAL gets returned to Nagios. The long output also contains the full output of any script that had an error. And of course, just looking at the short output, if we saw "CPU: WARNING, Memory: CRITICAL, Disk: OK" we can quickly tell that CPU and Memory have a problem!
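
The "return the greatest error level" logic can be sketched roughly as below. Again, this is an illustration and not the real check_main: status_name and summarize are hypothetical helpers, and the "name:exit_code" input format is an assumption made to keep the example short. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the aggregation step; not the author's actual check_main.

# Map a Nagios plugin exit code to its status name.
status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# summarize takes "name:exit_code" pairs (an assumed format), prints a
# one-line summary, and returns the highest exit code seen. Note that a
# plain numeric maximum ranks UNKNOWN (3) above CRITICAL (2).
summarize() {
    local worst=0 line="" pair name rc
    for pair in "$@"; do
        name=${pair%%:*}
        rc=${pair##*:}
        [ "$rc" -gt "$worst" ] && worst=$rc
        line="${line:+$line, }$name: $(status_name "$rc")"
    done
    echo "$line"
    return "$worst"
}

rc=0
summarize "CPU:1" "Memory:2" "Disk:0" || rc=$?
echo "exit code: $rc"
```

Here the summary line reads "CPU: WARNING, Memory: CRITICAL, Disk: OK" and the exit code is 2, so Nagios would show the service as CRITICAL.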

The next item was to set up service dependencies, which was easy once I understood the format. I have Nagios monitoring the gateway. The servers all have the gateway as a dependency, so if we can't ping the gateway, we probably can't ping the servers on the other side of that router. I have proxy servers at the end of a VPN, so each proxy server has its VPN as a dependency. The VPNs also have the gateway as a dependency. This way we won't get a billion pages when the internet goes down; the only page we will get is that the gateway is down.
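
For reference, a dependency chain like that might look something like the following in the Nagios object configuration. The host names (gateway, vpn1, proxy1) and the service descriptions are placeholders, not my real objects, and the failure criteria are just one reasonable choice: keep running the dependent check, but suppress its notifications while the parent service is CRITICAL or UNKNOWN.

```cfg
# Hypothetical names throughout: gateway, vpn1, proxy1, PING, check_main.

# The VPN depends on the gateway.
define servicedependency {
    host_name                       gateway
    service_description             PING
    dependent_host_name             vpn1
    dependent_service_description   PING
    execution_failure_criteria      n
    notification_failure_criteria   c,u
}

# The proxy server depends on its VPN.
define servicedependency {
    host_name                       vpn1
    service_description             PING
    dependent_host_name             proxy1
    dependent_service_description   check_main
    execution_failure_criteria      n
    notification_failure_criteria   c,u
}
```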

NewNagiosScreenShot.jpg