Rebuilding nagios at SGI

At work (SGI) I'm currently building a new monitoring infrastructure using nagios.

On our new nagios server I set up a tool called nagvis, which I highly recommend. Basically, it's a nifty little visualization tool that makes it a snap to set up "dashboards" for the stuff you monitor with nagios. Briefly, you set up a "broker" module in nagios (see the docs for more specifics) which saves all events that occur in nagios to a backend that nagvis supports; nagvis can then get the status of any host or service (and see changes to them) by querying/polling that backend.
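For reference, wiring in a broker module happens in nagios.cfg. With ndo2db it looks roughly like this (the paths here are just illustrative of a typical source install; yours will depend on where you put the ndoutils bits):

```
# nagios.cfg -- load the ndomod broker module (paths are examples)
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

# pass all event types through to broker modules
event_broker_options=-1
```

The ndomod module then ships events to the ndo2db daemon, which writes them into MySQL.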

It's got a pretty nice PHP/JavaScript tool for managing these dashboards. Creating a dashboard is simple: you upload an image to use as a background (optional), then you drop icons (or custom widgets written in PHP) onto the background that show the status of hosts, services, or other dashboards (known as "maps"). Simple. For instance, I took a fancy diagram of our Web cluster at SGI (showing SAN switches, storage arrays, heartbeat ethernet switches, and the cluster node servers themselves) and within about 10 minutes had a functional dashboard that management can use to see at a glance whether all is well with the web cluster. It's a great tool for taking a complex set of pieces that all have to be working for some larger service to function and displaying them on one neat, tidy web page.

Now that I've got multiple dashboards (one for our ERP systems, covering DB servers, app servers, the applications and web services on those servers, SAN switches 'n storage, and the DB snapshots we use for backups, reporting and testing; one for DNS services across SGI; another for Clarify, our customer support/knowledgebase service; and a variety of others), I can build a single dashboard that lists these larger services with an icon next to each showing the status of that service's dashboard: one place managers can go to see the state of our IT infrastructure.

And while I initially set this tool up as mere eye-candy for pointy-haired boss types, I'm finding myself using it more 'n more myself. I used to always hit the nagios tactical overview page as a "start of the day" activity; now the day usually starts with the nagvis overview page and then the tactical overview.

Having said all that, however, there was a catch. We were using a broker module called ndo2db, which uses MySQL as a backend. I chose it simply because it was the default and worked right out of the box with no problems. But I have to admit I was a bit concerned about scalability, since I know SQL databases of any stripe can use a lot of resources, particularly if there are lots of writes or updates. Once I got everything migrated onto the new nagios server (about 3200 services being checked on about 600 servers, and that'll be getting bigger soon), I found that ndo2db was doing around 120 inserts per second! Even worse, I noticed that every now 'n then the check latencies in nagios (both service and host) would suddenly jump from less than a second to a few minutes, which has some distressing side effects on nagios (acknowledgements, enabling/disabling notifications, or forcing a re-check of a service could take a very long time). Also, a large proportion of the CPU time on the nagios server was I/O wait time (almost entirely writes, almost entirely by mysqld).

Although stopping nagios (and waiting for all the child processes to finally finish, which could take a few minutes) and then restarting it generally caused the check latencies to drop back to a second or three, things like this that I can't explain bug me. So I cracked open a tool called mysqltop and started looking to see what could be tweaked in MySQL (if anything) to improve the situation. I found that ndo2db was (not surprisingly) occasionally deleting events older than a certain age. But the SQL statement that was cleaning up old service check events was sometimes taking upwards of 3 minutes to run! And worse, it locks the table first. It turns out that the column referenced in the WHERE clause of that DELETE wasn't indexed by the default schema set up by ndo2db. I added an index for that column and it helped somewhat (the really long-running queries went away and the check latencies settled down), but the high I/O wait was still there.
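The fix itself amounts to a one-liner along these lines. (The table and column names here are my recollection of the stock ndoutils schema — the cleanup DELETE trims old service check rows by timestamp — so double-check them against your own database before running anything.)

```sql
-- Index the timestamp column the cleanup DELETE filters on, so the
-- periodic trim doesn't have to scan (and lock) the whole table.
-- Table/column names per my memory of the ndoutils schema; verify locally.
CREATE INDEX servicechecks_start_time
    ON nagios_servicechecks (start_time);
```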

So one weekend when I had a few spare cycles I checked out one of the other backends that nagvis supports: mklivestatus. This broker module just provides a socket interface that can be used to make SQL-like queries for status info, which is then read directly from nagios' internal data structures and/or from the historical data in nagios' log files. Slick! It does, however, require that your PHP server support the sockets API (which might not have been enabled in your PHP binaries). In my case, I'm building apache and PHP from source using the fancy build infrastructure our Webmaster uses (which also means the nagios webserver is built using the same standards as all of our production, externally facing webservers). So tweaking it a bit to enable sockets support was a piece of cake.
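To give a flavor of those SQL-like queries: a client writes a short "Livestatus Query Language" request to the module's unix socket and reads back delimited text. Something like this (the exact column names available are in the mklivestatus docs):

```
GET services
Columns: host_name description state
Filter: state != 0
```

That returns one line per non-OK service, fields separated by semicolons by default. (And this socket conversation is exactly why PHP needs the sockets extension; if you build PHP from source as we do, that's the --enable-sockets configure flag.)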

Once I switched our nagvis config to use the mklivestatus backend, changed the nagios.cfg file to load the mklivestatus broker module instead of ndo2db, and stopped 'n restarted nagios, everything just worked. The nagvis dashboards are a bit snappier, but more importantly the check latencies are now always less than a second and the I/O wait CPU time dropped to almost zero. So now I've got a nagios server that'll scale properly.
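For anyone following along, the nagios.cfg side of the switch amounts to swapping one broker_module line for another (again, the paths are examples from a typical source install; the argument after livestatus.o is the unix socket it will create):

```
# nagios.cfg -- before: the ndo2db broker
#broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

# after: mklivestatus, plus the socket path clients will query
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live
event_broker_options=-1
```

Then point the nagvis backend config at that same socket path and restart nagios.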

Here's what the CPU usage on the nagios server looked like before and after the switch to mklivestatus:

Nice, eh? Gotta love how all the I/O wait time dropped out around Week 25. (In case anyone was wondering, the CPU usage goes up to 400% because there are 4 CPU cores.)

I noticed an increase in memory usage (you can also see that we increased the physical memory so I could give more to MySQL for its indexes, another attempt to get the I/O wait time down that I forgot to mention):
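For the curious, "more memory for indexes" in MySQL terms usually means bumping the key buffer in my.cnf; something like the following (the value is purely illustrative, and key_buffer_size only covers MyISAM tables — if your ndo2db tables are InnoDB, the equivalent knob is innodb_buffer_pool_size):

```
# my.cnf -- give MySQL more memory for caching index blocks
# (value illustrative; applies to MyISAM tables)
[mysqld]
key_buffer_size = 1G
```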

But I also noticed that when another program needed more memory, the amount currently in use dropped, as if it was quickly relinquished rather than swap having to be allocated. To confirm this I wrote a quick C hack that just malloc'd 2 gigs of memory and wrote zeroes across all of it to ensure it really did get allocated: different OSes implement memory allocation differently, and I've found the only way to guarantee that the memory your C program has malloc'd actually gets used (possibly incurring paging and/or swapping) is to use it. Anyway, the memory usage has since stabilized, so it isn't a memory leak or anything like that. (whew)