Rebuilding nagios at SGI


At work (SGI) I'm currently building a new monitoring infrastructure using nagios.

It's coming along nicely. While working on it I sorted out a problem with correctly checking an LDAPS server, and emailed someone I'd seen asking about the same problem (without getting a reply) to let him know what I'd found. In turn, he mentioned that they were using a tool called PNP4Nagios instead of Cacti. Well, I'm not about to abandon Cacti yet - it's just too darn good. But I did add PNP4Nagios to our new nagios server.

Briefly, it lets you tell nagios to take the performance data a host/service check returns and hand it to some scripts, which then log it into an RRD file. There are "action_url" settings in nagios which you can use to link to graphs generated on the fly from these RRD files. So things like ping times and latencies on service checks like http, smtp, and ldap can all be tracked and graphed. Some service checks obviously don't lend themselves to this, such as the checks for RAID volumes (there's no numeric data returned from the service check to graph!), but it's a nice addition to nagios. It scales nicely too.
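To make the "data a check returns" part concrete, here's a rough Python sketch of pulling the numeric values out of a check's output. It assumes the usual plugin convention of putting performance data after a "|" as label=value[unit];warn;crit;min;max. The names parse_perfdata and Metric are made up for illustration, and this is not how PNP4Nagios itself does it - it only shows the kind of data that ends up in the RRD files.

```python
import re
from typing import List, NamedTuple


class Metric(NamedTuple):
    label: str
    value: float
    unit: str


# Matches label=value[unit]; labels may be single-quoted if they contain spaces.
# The trailing ;warn;crit;min;max fields are ignored in this sketch.
_PERF_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([a-zA-Z%]*)")


def parse_perfdata(plugin_output: str) -> List[Metric]:
    """Return the metrics found after the '|' separator, or [] if there is none."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        # No performance data at all, e.g. a plain "RAID OK" status line.
        return []
    return [
        Metric(quoted or bare, float(value), unit)
        for quoted, bare, value, unit in _PERF_RE.findall(perf)
    ]
```

A ping-style check line such as "PING OK - rta 0.80 ms|rta=0.80ms;100.00;500.00;0; pl=0%;20;60;;" yields two metrics (rta and pl), while a status line with no "|" section yields an empty list, which is why checks like the RAID one have nothing to graph.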

On that note, it looks like ndomod and nagvis, even though I like them, don't scale quite as nicely. I'm beginning to see a slight CPU load on the nagios server now that I've hit about 2800 service checks. I'll probably do a bit of tuning on the MySQL server to try to improve that some, but on a four-core system I'm not too worried about it yet.

I decided to ignore the handful of web admin interfaces I found for nagios. The only ones I found that looked like they had an active developer community and also had decent flexibility were every bit as complex as just editing the config files manually. So why bother?

The latest version of nagios is great! I've simplified our nagios setup quite a lot and by making use of things like "multiple inheritance" I've been able to make our nagios configs on the new server much simpler and easier to manage.

Best of all, I think I'll eventually be able to auto-generate at least part of our nagios config based on a simple LDAP query. We use a tool internally known as DCSi (for historical reasons) to track assets - what servers we have, where they are, what apps they run, who owns the hardware and who owns the apps, who the emergency contacts are, any useful notes about the server/device/app that someone on-call might need, etc. Basically, it's really nothing more than a custom schema and an OpenLDAP server with Apache Directory Studio. But it works great. And if I can have nagios automagically start monitoring whatever servers/devices we document in LDAP, that'd be pretty cool.

Anyway, that's what I'm shooting for in the long run, provided I can make it simple enough that anyone in the group can feel comfortable maintaining what I set up. What I'm envisioning is having most of the meat of the nagios config set up in regular text config files, but then having a script/program that would query LDAP for servers/devices and generate a config file for each one using the appropriate templates (which would be specified along with hostgroup membership in the DCSi LDAP record). One-off service checks could be defined as child objects in LDAP in a similar fashion (defining what templates to use and minimal nagios config info).
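The generator half of that idea could look something like the Python sketch below. It's only a sketch under some loud assumptions: the LDAP query itself is left out, each record is assumed to arrive as a dict mapping attribute names to lists of values, and the attribute names (cn, ipHostNumber, nagiosTemplate, nagiosHostgroup) are placeholders rather than the real DCSi schema. The function names render_host and write_configs are hypothetical too.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# One LDAP entry: attribute name -> list of values (attribute names are assumed).
Record = Dict[str, List[str]]


def render_host(record: Record) -> str:
    """Render one nagios host definition from a single LDAP-style record."""
    lines = [
        "define host {",
        # Templates are listed in the record; the hand-written configs define them.
        f"    use         {','.join(record['nagiosTemplate'])}",
        f"    host_name   {record['cn'][0]}",
        f"    address     {record['ipHostNumber'][0]}",
    ]
    groups = record.get("nagiosHostgroup", [])
    if groups:
        lines.append(f"    hostgroups  {','.join(groups)}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def write_configs(records: Iterable[Record], out_dir: str) -> List[Path]:
    """Write one <host_name>.cfg file per record and return the paths written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        path = target / f"{record['cn'][0]}.cfg"
        path.write_text(render_host(record))
        written.append(path)
    return written
```

Keeping the templates in the regular text config files and generating only these small per-host stubs means the generated output stays trivial, so anyone in the group could read it and see what the script did.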