Rebuilding nagios at SGI
At work (SGI) I'm currently building a new
monitoring infrastructure using nagios.
On our new nagios server I set up a tool called
nagvis which I highly recommend. Basically,
it's a nifty little visualization tool that makes it a snap to set up
"dashboards" for stuff you monitor with nagios. Briefly, you set up a
"broker" module in nagios (see the docs for more specifics) which
saves all events that occur in nagios to a backend that nagvis supports;
nagvis can then get the status of any host or service (and see changes to
them) by querying/polling that backend.
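In case you're curious what that looks like on the nagios side, wiring in a
broker module is just a couple of directives in nagios.cfg. The path and
arguments below are placeholders - they depend entirely on which broker
module you're loading:

    # nagios.cfg
    # pass all event data to any loaded broker modules
    event_broker_options=-1
    # load the broker module itself (path and args are module-specific)
    broker_module=/path/to/your_broker_module.o <module-specific args>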
It's got a pretty nice PHP/Javascript tool for managing these dashboards.
Creating a dashboard is simple - you upload an image to use as a background
(optional), then drop icons (or custom widgets written in PHP) onto it that
show the status of hosts, services, or other dashboards (known as "maps").
Simple. For instance, I took a fancy diagram of our web cluster at SGI,
showing the SAN switches, storage arrays, heartbeat ethernet switches and
the cluster node servers themselves, and within about 10 minutes had a
functional dashboard that management can use to see at a glance whether all
is well with the web cluster. It's a great tool for taking a complex set of
stuff that all has to be working for some larger service to function and
displaying it on one neat, tidy web page.
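Under the hood each dashboard is just a small map config file that the web
UI maintains for you. A stripped-down, hand-written one might look roughly
like this (the host name, map name, image and coordinates here are made up
purely for illustration):

    define global {
        map_image=webcluster.png
    }
    # a host icon placed on the background at pixel coordinates x,y
    define host {
        host_name=websrv01
        x=120
        y=260
    }
    # an icon whose state rolls up an entire other dashboard ("map")
    define map {
        map_name=san_overview
        x=320
        y=260
    }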
Now that I've got multiple dashboards - one for our ERP systems (including
the DB servers, app servers, the applications and web services on those
servers, SAN switches 'n storage, and the DB snapshots used for backups,
reporting and testing, etc.), one for DNS services across SGI, another for
Clarify (our customer support/knowledgebase service), and a variety of
others - I can build one dashboard that shows a list of these larger
services with an icon next to each showing the status of that service's
dashboard: one place managers can go to see the state of our IT
infrastructure.
And while I initially set this tool up as mere eye-candy for pointy-haired
boss types, I'm finding that I use it more 'n more myself. I used to always
hit the nagios tactical overview page as a "start of the day" activity;
now the day usually starts with the nagvis overview page and then the
tactical overview.
Having said all that, however, there was a catch. We were using a broker
module called ndo2db which uses mysql as a backend. I chose it simply
because it was the default and worked right out of the box with no problem.
But I have to admit I was a bit concerned about scalability, since I know
SQL databases of any stripe can use a lot of resources, particularly when
there are lots of writes or updates. Once I got everything migrated onto the new
nagios server (about 3200 services being checked on about 600 servers - and
that'll be getting bigger soon), I found that ndo2db was doing around 120
inserts per second! Even worse, I noticed that every now 'n then the check
latencies in nagios (both service and host) would suddenly jump from less
than a second to a few minutes, which had some distressing side effects
on nagios (acknowledgements, enabling/disabling notifications, or forcing
a re-check of a service could take a very long time). Also, a large
proportion of the CPU time on the nagios server was I/O Wait time (almost
entirely writes, almost entirely by mysqld).
Although stopping nagios (and waiting for all the child processes to finally
finish, which could take a few minutes) and then restarting it generally
caused the check latencies to drop back to a second or three, things
like this that I can't explain bug me. So I cracked open a tool called
mysqltop and started looking to see what (if anything) could be tweaked in
mysql to improve the situation. I found that ndo2db was (not surprisingly)
periodically deleting events older than a certain age. But the SQL statement
that was cleaning up old service check events was sometimes taking upwards
of 3 minutes to run! And worse, it locks the table first. It turned out that
the column referred to in the DELETE's WHERE clause wasn't indexed in the
default schema set up by ndo2db. I added an index for it and that helped
(the really long-running queries went away and the check latencies settled
down somewhat), but the high I/O Wait was still there.
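For anyone hitting the same thing, the fix really is a one-liner in mysql.
The table and column names below are the usual ndo2db defaults as best I
recall them, so double-check them against the slow DELETE you actually see
before copying this:

    -- add an index on the column the cleanup DELETE filters on
    CREATE INDEX servicechecks_start_time_idx
        ON nagios_servicechecks (start_time);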
So one weekend when I had a few spare cycles I checked out one of the other
backends that nagvis supports:
mklivestatus.
This broker module just provides a socket interface that can be used to
make SQL-like queries for status info, which is then read directly from
nagios' internal data structures and/or from the historical data in nagios'
log files. Slick! It does, however, require that your PHP server support
the sockets API (which might not have been enabled in your PHP binaries). In
my case, I'm building apache and PHP from source using the fancy build
infrastructure our Webmaster uses (which also means the nagios webserver is
built using the same standards as all of our production, externally facing
webservers). So tweaking it a bit to enable sockets support was a piece of
cake.
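To give a flavor of those "SQL-like" queries: you just write a short text
query to the unix socket the module creates and read the result back. For
example, using the unixcat utility that ships with mklivestatus (the socket
path here is only an example; use whatever you pass to the broker module):

    # list every service that isn't currently OK
    echo -e 'GET services\nColumns: host_name description state\nFilter: state > 0\n' \
        | unixcat /usr/local/nagios/var/rw/live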
Once I switched our nagvis config over to the mklivestatus backend, changed
nagios.cfg to load the mklivestatus broker module instead of ndo2db, and
stopped 'n restarted nagios, everything just worked. The nagvis dashboards
are a bit snappier, but more importantly the check latencies now stay under
a second and the I/O Wait CPU time has dropped to almost zero. So now I've
got a nagios server that'll scale properly.
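In case it helps anyone else, the nagios.cfg side of that swap was literally
one broker_module line. The paths below are just examples from a typical
source build; yours will differ:

    # before: ndo2db's event broker module
    #broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
    # after: mklivestatus, with the unix socket it should answer queries on
    broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live

The nagvis backend config then just gets pointed at that same socket.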
Here's what the CPU usage on the nagios server looked like before and after
the switch to mklivestatus:

Nice, eh? Gotta love how all the I/O Wait time dropped out around Week
25. The CPU usage goes up to 400% because there are 4 CPU cores (in case
anyone was wondering).
I noticed an increase in memory usage (you can also see that we increased
the physical memory so I could give more to mysql for its indexes - another
attempt to get the I/O Wait time down that I forgot to mention earlier):

But I also noticed that when another program needed more memory, the amount
currently in use dropped as if it were quickly relinquished rather than swap
having to be allocated. To confirm this I wrote a quick C hack that just
malloc'd 2 gigs of memory and wrote zeroes across all of it to make sure it
really did get allocated; different OSes implement memory allocation
differently, and I've found the only way to guarantee that the memory your
C program has malloc'd actually gets backed by real pages (possibly
incurring paging and/or swapping) is to write to it. Anyway, the memory
usage has since stabilized, so it isn't a memory leak or anything like
that. (whew)
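For the curious, the "quick C hack" was nothing fancier than the sketch
below (reconstructed from memory, so details are approximate): grab 2 GiB,
zero every byte so the kernel has to back the allocation with real pages,
then pause so you can watch free/top while the memory is held:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t size = (size_t)2 * 1024 * 1024 * 1024;   /* 2 GiB */
        char *p = malloc(size);

        if (p == NULL) {
            perror("malloc");
            return 1;
        }

        /* malloc alone may only reserve address space; writing to every
         * byte forces the kernel to actually allocate it (and possibly
         * start paging or swapping). */
        memset(p, 0, size);

        printf("allocated and zeroed %zu bytes; press Enter to free\n", size);
        getchar();

        free(p);
        return 0;
    }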