rebuilding nagios at SGI

At work (SGI) I'm currently building a new monitoring infrastructure using nagios. The prototype server I've got set up is also using nagvis. I've been using nagios for close to eight years now at a variety of places, but I'm new to nagvis. It's a really great tool!

Basically, there's an add-on called ndomod that will hook into nagios and log all events to a MySQL database. The nagvis tool, then, will hook into this database and display selected "maps". A "map" in nagvis (from what I've seen so far) is basically a background image, on top of which are dropped icons, gadgets, text annotations, etc., which are tied to the nagios events logged in MySQL. For instance, some people have taken a picture of several racks of equipment, then dropped little status icons on the picture in nagvis, each icon being a live link back to the nagios host for the device. The user can get all sorts of useful info on the host/service in question by hovering their mouse cursor over the icon, and its color is determined by the status of the nagios host/service. Slick.

Even better, you can drop "gadgets" on a map. A gadget is basically just a web link (typically to a PHP script) which is passed info about the host/service it's supposed to represent - things like the current value, and the warning/critical thresholds for the host/service check in question. The example gadget that comes with nagvis is a speedometer-type gauge but one could easily make other custom gadgets. Pretty cool.
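To make the gadget idea a bit more concrete, here's a minimal sketch (in Python rather than PHP, purely for illustration) of the kind of logic a gauge-style gadget needs: take a value plus its warning/critical thresholds and decide what state to draw. It assumes the value arrives in the standard nagios plugin performance-data form, `label=value[UOM];warn;crit;min;max`. The names `parse_perfdata` and `gauge_state` are made up for this example and are not part of nagvis, and range-style thresholds such as `10:20` are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class PerfValue(NamedTuple):
    label: str
    value: float
    uom: str
    warn: Optional[float]
    crit: Optional[float]


# label=value[UOM];warn;crit (trailing min/max fields are ignored here)
_PERF_RE = re.compile(
    r"^(?P<label>[^=]+)=(?P<value>-?[\d.]+)(?P<uom>[^;\d]*)"
    r"(?:;(?P<warn>[^;]*))?(?:;(?P<crit>[^;]*))?"
)


def _to_number(text: Optional[str]) -> Optional[float]:
    """Return a float for simple numeric thresholds, None otherwise."""
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None  # e.g. a range threshold like "10:20", not handled in this sketch


def parse_perfdata(item: str) -> PerfValue:
    """Parse a single performance-data item (hypothetical helper)."""
    m = _PERF_RE.match(item.strip())
    if m is None:
        raise ValueError(f"unrecognized perfdata item: {item!r}")
    return PerfValue(
        label=m["label"],
        value=float(m["value"]),
        uom=m["uom"],
        warn=_to_number(m["warn"]),
        crit=_to_number(m["crit"]),
    )


def gauge_state(perf: PerfValue) -> str:
    """Pick the state a gauge would show, treating thresholds as upper limits."""
    if perf.crit is not None and perf.value >= perf.crit:
        return "critical"
    if perf.warn is not None and perf.value >= perf.warn:
        return "warning"
    return "ok"
```

A real gadget would go on to draw something (a needle, a bar, whatever) from these numbers; the parsing and threshold comparison are the only parts sketched here.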

Basically, nagvis is a nice way to make user-friendly or management-friendly "dashboards" for visualizing the status of complex services. For instance, I'll have a nagvis map for our Clarify server (where support calls and tickets are logged) containing one icon for the overall status of Clarify, plus icons for the Clarify webservers, appservers, database servers, the SAN switches the DB servers use, perhaps the StorageArrays on the SAN as well, the web services on the webservers and appservers, the Oracle database service on the DB servers, etc. All the stuff that has to be up and running in order for someone to say "Clarify is working fine" will be summarized on this map. And we'll have maps for top-level views of all the critical things at SGI.

I'm also trying out a variety of web-based admin tools for nagios. One is called lilac. It's based on the old "fruity" web interface, which appears to be a stale project now. At first glance it's functional, but its templates don't generate nagios templates; they're used strictly for templating within lilac itself. Also, I haven't been able to coerce it into letting me set hostinfo icons for hosts. I can set the external-info URLs but not the icons. Go figure.

I'm in the middle of trying to test another one called nagiosQL but I don't have it running just yet.

I may simply stick to the text-file configs for now. They offer a lot of flexibility (especially since the latest nagios allows for multiple inheritance), but I'm also considering integrating our nagios infrastructure with DCSi. At SGI, I set up a custom LDAP schema with object classes and attributes for tracking servers/devices and applications. With it we can quickly see who owns what servers, where they're physically located (what building, room, and footprint/rack they're in), what apps are run, who owns those apps, who the emergency contacts for servers/apps are, notes about servers and apps that might be useful to whoever is on-call, and how the server's console is connected (DRAC, IRIXconsole, cyclade, etc.). Our nagios server uses the external info URLs (hostextinfo stuff) so that when nagios sends us a page we can simply click on a link next to the host in nagios and get a quick view of all this info. Anyway, I'm considering banging out some quick perl, php, or java code to query the LDAP server and generate nagios configs from it. It would be nice to be able to drop a record into LDAP about a new server/service and have nagios automagically start doing the right set of default host/service checks for it...
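As a rough sketch of what that generator could look like (written in Python here just to keep it short), the function below turns one LDAP-style entry into a nagios `define host` block. Everything about the entry layout is an assumption: the entry is a plain dict of attribute name to list of values, which is roughly what an LDAP client hands back, and the attribute names (`cn`, `description`, `ipHostNumber`, `labeledURI`, and especially the made-up `nagiosTemplate`) are placeholders rather than the actual schema. The LDAP query itself is left out. Listing several templates on the `use` line is where multiple inheritance comes in.

```python
from typing import Dict, List

# An LDAP-style entry: attribute name -> list of values (assumed shape).
Entry = Dict[str, List[str]]


def render_host(entry: Entry) -> str:
    """Render one nagios host definition from an LDAP-style entry (sketch)."""

    def first(attr: str, default: str = "") -> str:
        # Use the first value of a multi-valued attribute, or the default.
        values = entry.get(attr) or [default]
        return values[0]

    name = first("cn")
    # "nagiosTemplate" is a hypothetical attribute; several values become
    # a comma-separated "use" line, i.e. multiple inheritance.
    templates = ",".join(entry.get("nagiosTemplate") or ["generic-host"])

    directives = [
        ("use", templates),
        ("host_name", name),
        ("alias", first("description", name)),
        ("address", first("ipHostNumber", name)),
    ]

    # Optional link back to the server's info page, if the entry has one.
    notes_url = first("labeledURI")
    if notes_url:
        directives.append(("notes_url", notes_url))

    body = "\n".join(f"    {key:<12}{value}" for key, value in directives)
    return "define host {\n" + body + "\n}\n"


if __name__ == "__main__":
    example = {
        "cn": ["db01"],
        "ipHostNumber": ["10.0.0.5"],
        "nagiosTemplate": ["linux-server", "oracle-db"],
    }
    print(render_host(example))
```

Writing the output of `render_host` for every entry into a file under the nagios config directory and reloading nagios would cover the "drop a record into LDAP" case for hosts; the default service checks would then come from whatever the named templates and hostgroups define.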