Rebuilding nagios at SGI
At work (SGI) I'm currently building a new
monitoring infrastructure using nagios.
It's coming along nicely. While working on it I sorted out a problem
trying to correctly check an LDAPS server and emailed someone who I'd
seen asking about the same problem (but not getting a reply). I let him
know what I'd found and in turn he mentioned that they were using a tool
called PNP4Nagios instead of Cacti. Well, I'm not about to abandon Cacti
yet - it's just too darn good. But I did add PNP4Nagios to our new nagios
server.
Briefly, it allows you to tell nagios to take the data a host/service
check returns and to hand it to some scripts which then log it into an
RRD file. There are "action_url" settings in nagios which you can use to
display the graphs generated on the fly from these RRD files. So things
like ping times, latencies on service checks like http, smtp, and ldap, can
all be tracked and graphed. Some service checks, obviously, don't lend
themselves to this, such as the checks for RAID volumes (there's no numeric
data returned from the service check to graph!) but it's a nice addition
to nagios. It scales nicely too.
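To make the "numeric data" point concrete, here is a rough Python sketch of pulling the performance data out of a plugin's output line. It assumes the usual plugin convention of status text, a "|" separator, then items shaped like label=value[UOM];warn;crit;min;max. The function name parse_perfdata and the sample output strings are mine, for illustration only. This is not how PNP4Nagios itself is implemented.

```python
import re

# One perfdata item: label=value[UOM] followed by optional ;warn;crit;min;max
PERF_ITEM = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<bare>[^\s=']+))="      # label, optionally quoted
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[a-zA-Z%]*)"    # number plus unit of measure
    r"(?:;[^;\s]*)*"                                    # thresholds, skipped here
)


def parse_perfdata(plugin_output):
    """Return {label: (value, uom)} for the text after the '|' separator."""
    if "|" not in plugin_output:
        return {}  # status text only, so there is nothing to graph
    perfdata = plugin_output.split("|", 1)[1]
    metrics = {}
    for match in PERF_ITEM.finditer(perfdata):
        label = match.group("quoted") or match.group("bare")
        metrics[label] = (float(match.group("value")), match.group("uom"))
    return metrics


if __name__ == "__main__":
    # Hypothetical output lines, in the style of a ping check and a RAID check.
    ping = "PING OK - Packet loss = 0%, RTA = 0.80 ms|rta=0.800000ms;100;500;0 pl=0%;20;60;0"
    raid = "RAID OK - all volumes optimal"
    print(parse_perfdata(ping))
    print(parse_perfdata(raid))
```

A check like the RAID one comes back with an empty result, which is why there is nothing for an RRD file to store.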
On that note, it looks like ndomod and nagvis, even though I like them,
don't scale quite as nicely. I'm beginning to see a slight CPU load on the
nagios server now that I've hit about 2800 service checks. I'll probably do
a bit of tuning on the MySQL server to try to improve that some, but on a
four-core system I'm not too worried about it yet.
I decided to ignore the handful of web admin
interfaces I found for nagios. The only ones I found that looked like they had
an active developer community and also had decent flexibility were every
bit as complex as just editing the config files manually. So why bother?
The latest version of nagios is great! I've simplified our nagios
setup quite a lot and by making use of things like "multiple inheritance"
I've been able to make our nagios configs on the new server much
simpler and easier to manage.
Best of all, I think I'll be able to, eventually, auto-generate at least
part of our nagios config based on a simple LDAP query. We use a tool
internally known as DCSi (for historical reasons) to track assets - what
servers we have, where they are, what apps they run, who owns the hardware
and who owns the apps, who the emergency contacts are, any useful notes
about the server/device/app that someone on-call might need, etc. Basically,
it's really nothing more than a custom schema and an OpenLDAP server with
Apache Directory Studio. But it works great. And if I can have nagios
automagically start monitoring whatever servers/devices we document in
LDAP, that'd be pretty cool.
Anyway, that's what I'm shooting for in the long run, provided I can make
it simple enough that anyone in the group can feel comfortable maintaining
what I set up. What I'm envisioning is having most of the meat of the nagios
config set up in regular text config files, but then having a script/program
that would query LDAP for servers/devices and generate a config file for
each one using the appropriate templates (which would be specified along
with hostgroup membership in the DCSi LDAP record). One-off service checks
could be defined as child objects in LDAP in a similar fashion (defining
what templates to use and minimal nagios config info).
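A minimal sketch of the generator half of that idea might look something like the following. It skips the LDAP query itself and starts from entries already fetched as plain dicts. The attribute names nagiosTemplate and nagiosHostgroup are hypothetical, and using cn and ipHostNumber for the host name and address is also just an assumption; the real DCSi schema may look nothing like this. The function names render_host and render_all are made up for the example.

```python
def render_host(entry):
    """Render one nagios host definition from a dict of LDAP-style attributes."""
    lines = [
        "define host {",
        # Several templates in one "use" line is nagios multiple inheritance.
        "    use         " + ",".join(entry["nagiosTemplate"]),
        "    host_name   " + entry["cn"],
        "    address     " + entry["ipHostNumber"],
    ]
    if entry.get("nagiosHostgroup"):
        lines.append("    hostgroups  " + ",".join(entry["nagiosHostgroup"]))
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_all(entries):
    """Map a config file name to its contents, one file per host."""
    return {entry["cn"] + ".cfg": render_host(entry) for entry in entries}


if __name__ == "__main__":
    # Hypothetical record, shaped the way an LDAP search result might be.
    entries = [
        {
            "cn": "web01",
            "ipHostNumber": "192.0.2.10",
            "nagiosTemplate": ["linux-server", "web-server"],
            "nagiosHostgroup": ["web"],
        }
    ]
    for filename, text in render_all(entries).items():
        print("# " + filename)
        print(text)
```

Keeping the rendering separate from the LDAP lookup means the templates and service definitions stay in hand-edited text files, and only the per-host stubs get regenerated.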