Ever since we set up our new Nagios cluster (based on Linux-HA and DRBD), I have had a slight discomfort looking at the network and disk I/O rates of the system. We had about 10 Mbit/s of network traffic between the nodes. I was almost sure it was caused by DRBD, since Nagios generates many files.
Yesterday I had some time for tuning our systems. After a bit of looking around I moved some files and the network and disk I/O usage dropped by roughly 70%!
Here are some pictures to illustrate:
So, what did I do:
- I moved the spool path for the checkresults to a RAM disk (/dev/shm):
# rm -rf checkresults
# ln -s /dev/shm checkresults
This was executed first, at about 17:00 (5:00 pm).
Moving the temporary checkresults to a path outside DRBD and into memory reduced block I/O by ~0.8 MB/s and network I/O by ~3 Mbit/s.
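As a variation on the two commands above, here is a minimal sketch that links the spool to a dedicated subdirectory of the RAM disk instead of to /dev/shm itself, so the check result files do not mix with other data in shared memory. The function name `relink_spool` and the paths in the usage comment are made up for illustration. It assumes Nagios is stopped while the directory is swapped, and the target directory has to be writable by the user Nagios runs as.

```shell
#!/bin/sh
# Sketch: replace a spool directory with a symlink into a RAM-backed directory.
# Both paths are supplied by the caller; nothing here is specific to one install.
relink_spool() {
    spool="$1"    # existing spool directory (or an older symlink)
    ramdir="$2"   # target directory on the RAM disk (hypothetical location)

    mkdir -p "$ramdir" || return 1

    if [ -L "$spool" ]; then
        # An earlier symlink: remove only the link, not its target.
        rm -f "$spool" || return 1
    elif [ -d "$spool" ]; then
        # A real directory: its contents are temporary, so drop them.
        rm -rf "$spool" || return 1
    fi

    ln -s "$ramdir" "$spool"
}

# Usage (hypothetical paths):
#   relink_spool /usr/local/nagios/var/spool/checkresults /dev/shm/nagios-checkresults
```

Because /dev/shm is emptied on reboot, the subdirectory has to be recreated (for example from an init script) before Nagios starts; linking straight to /dev/shm, as I did, avoids that step.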
- The second thing I did was move the status file (this is where Nagios holds its runtime state, like a small database) to the same RAM disk.
This change was executed shortly before 18:00 (6:00 pm).
Moving this status file resulted in an additional drop of ~1.2 MB/s block I/O and ~5.5 Mbit/s network I/O.
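I did not write down the exact commands for this step. One way to do it is to point the `status_file` directive in nagios.cfg at the RAM disk. The sketch below assumes that approach; the helper name `set_status_file` and the paths in the usage comment are hypothetical, and a symlink like the one used for the checkresults would work as well.

```shell
#!/bin/sh
# Sketch: rewrite the status_file= line of a Nagios main config file.
# Assumes the new path contains neither '|' nor '&' (both are special to sed here).
set_status_file() {
    cfg="$1"   # path to nagios.cfg
    new="$2"   # new location of the status file

    # Refuse to continue if the directive is not present at all.
    grep -q '^status_file=' "$cfg" || return 1

    tmp="$cfg.tmp.$$"
    sed "s|^status_file=.*|status_file=$new|" "$cfg" > "$tmp" || return 1

    # Write back through the original file so owner and mode are preserved.
    cat "$tmp" > "$cfg" && rm -f "$tmp"
}

# Usage (hypothetical paths):
#   set_status_file /usr/local/nagios/etc/nagios.cfg /dev/shm/status.dat
```

Nagios has to be restarted to pick up the new location; running `nagios -v` against the config file first is a cheap sanity check.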
The data I put on the RAM disk is not critical, so after a cluster switchover or reboot the lost data can always be restored (and will be, automatically). The checkresults Nagios puts on this RAM disk are basically temporary files that store the output of the check commands. These files are typically deleted after a few seconds. The status file is a representation of the current state of all checks; it is regenerated every few seconds. The persistent state is saved in the state retention file, and as long as the retention file resides on the DRBD device (or whatever other cluster storage you use), everything will be OK.
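Since everything hinges on the retention file staying on persistent storage, a small check can guard against moving it by accident. This is only a sketch: the function name `retention_outside_ramdisk` is made up, and it compares path strings as written in nagios.cfg (directive `state_retention_file`), so it will not notice a retention file that reaches the RAM disk through a symlink.

```shell
#!/bin/sh
# Sketch: verify that state_retention_file does not point into the RAM disk.
# Return codes: 0 = outside the RAM disk, 1 = inside it, 2 = directive missing.
retention_outside_ramdisk() {
    cfg="$1"                  # path to nagios.cfg
    ramdisk="${2:-/dev/shm}"  # RAM disk mount point, defaults to /dev/shm

    # Take the last occurrence of the directive in the file.
    file=$(sed -n 's/^state_retention_file=//p' "$cfg" | tail -n 1)
    [ -n "$file" ] || return 2

    case "$file" in
        "$ramdisk"/*) return 1 ;;  # would be lost on reboot or switchover
        *)            return 0 ;;
    esac
}

# Usage (hypothetical path):
#   retention_outside_ramdisk /usr/local/nagios/etc/nagios.cfg || echo "check retention file!"
```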
The remaining I/O on the system is generated by Cacti (running on the other cluster node) and by Nagios's remaining operations on the cluster storage. And, of course, by some users.
I also hope this solution fixes a problem that emerged recently. On our Nagios node the kernel CPU time keeps going up until it consumes one full CPU. After that, access to Nagios keeps getting slower until we restart Nagios. This all happens over a time range of several weeks, so it has not been a big problem. My guess is that it has something to do with the filesystem (ext3), perhaps the many files Nagios creates to store the temporary check results (>2000 files per minute). Or it is something else; I hope we will see this effect gone in a few weeks.
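To see whether the kernel CPU time really stops creeping up, it can be sampled from /proc/stat. This is a rough sketch and not what our graphs are built from; the function names are made up. The fourth field of the aggregate `cpu` line is the time spent in kernel mode, counted in clock ticks (typically 1/100 s) since boot.

```shell
#!/bin/sh
# Sketch: read the cumulative kernel-mode ("system") CPU time in clock ticks.
# The optional argument allows reading a saved copy instead of /proc/stat.
system_jiffies() {
    awk '$1 == "cpu" { print $4 }' "${1:-/proc/stat}"
}

# Print how many kernel-mode ticks were spent during an interval (seconds).
system_delta() {
    interval="${1:-10}"
    before=$(system_jiffies)
    sleep "$interval"
    after=$(system_jiffies)
    echo $((after - before))
}

# Usage:
#   system_delta 60    # kernel ticks consumed over one minute, all CPUs combined
```

Logging this value regularly (for example from cron) over a few weeks should show whether the slow climb is gone.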