Stop Using Email Notifications In Cron Jobs


The more I work with different IT infrastructures - as a new employee taking over for previous admins, after acquiring another company and adopting some of their infrastructure, or after being acquired and having to migrate or retire our own (yeah, I've been involved in all of those) - the more I've come to truly detest email notifications for cron/scheduler jobs, and email notifications in general. Why? Because I find they fall into several categories of bad.

The first is the "All is well" email notification - you know, the ones you get at least once a day, that everyone has set up an email filter for, that either land in a separate, giant folder nobody looks at or just get deleted. What's the point of an email notification that nobody looks at, for cryin' out loud! And then every now and then, mixed in with thousands of ignored "All is well" emails, there will be one "All is NOT well" email that nobody sees because it's been filtered off to a folder the recipient never opens. Oops.

Then there are the email notifications sent to an undeliverable alias, or a mailing list, or a person who hasn't been with the company in a decade. How could this be? Picture IT geek Fred. He's been with the company since its founding a decade earlier, when the entire IT department was just him. Now the company is bigger and has a half dozen IT geeks. Hardware has come 'n gone, and Fred leaves the company. Any bets as to how many devices have email notifications pointing to Fred@company.com? Any bets as to how many will be found and fixed? These only send an email when something is NOT well, so it's not like one can just look in the email logs to see which devices are trying to send email to Fred. Someone has to look at every single device. When people in IT are operating at a dead run already, that's just not gonna happen. Typically, nobody notices an old piece of hardware desperately calling Fred for help until an email or security admin happens to notice hundreds of bounced emails to Fred. Or more often, the device finally has enough failures that no redundancy is redundant enough, it fails spectacularly, and management starts asking "why did nobody know that this device had two bad fans, one bad power supply and three failed hard drives?" Oops.

Another category is the cron/scheduler jobs that Fred set up over the years. Inevitably, these are critical pieces of automation in the IT infrastructure that nobody even knows exist, since they mostly just quietly do their job... until one breaks for some reason and sends an email to Fred saying "Help! Fix me!". Then someone in IT has to do some IT archaeology, figuring out why some database transaction log is filling up now, or why support tickets aren't magically getting sync'd to the support FTP server, or whatever. Not fun.

So now that I've gotten that off my chest... Yeah, email notifications suck. They either go unseen, or go to the wrong place, or otherwise wind up being more of an annoyance than a solution. So what are we to do?

At my last job, I inherited a metric crap-ton of systems set up by people who'd long since left the company, running apps with no support contract that had been set up by other people who'd been gone a decade or more, each with critical cron jobs that nobody in the IT or Apps departments knew anything about. Bummer! So any time I found one of these, I dug in my heels, found an owner for whatever process the job was supposed to accomplish, and set up a passive nagios service for that cron job. Then I edited the cron job's script, searching for any exit conditions. Any time the script exited, it would run a command to notify nagios of the success or failure of the job.
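
To make that concrete, here's a minimal sketch of the pattern in bash. The notify_nagios helper and the nightly_export command are hypothetical stand-ins - the real send_nsca call I used is shown later in this post:

#!/bin/bash
# hypothetical stand-in for the real send_nsca call shown later in this post
# args: numeric status (0=OK, 1=WARNING, 2=CRITICAL), message text
notify_nagios() {
    echo "report to nagios: status=$1 msg=$2"
}

# hypothetical nightly job - the point is that every exit path reports something
if ! /usr/local/bin/nightly_export; then
    notify_nagios 2 "CRITICAL: nightly export failed"
    exit 1
fi

notify_nagios 0 "OK: nightly export completed"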

Why is this better? Well, because now the script can run as often as needed and report "all is well" to nagios, which will dutifully note it without filling up anyone's mailbox. Plus, if it ever does exit with some fault, it notifies nagios of that fault - it doesn't go unnoticed. Now, it's also critical that nagios (or whatever monitoring server you use) be kept up to date with correct email addresses. But that's MUCH more likely to be kept up to date, and much easier to keep up to date, than having to go edit half a jillion cron jobs and hardware configs all over the world.

The more I did this, the more I fell in love with passive service checks in nagios. Any time we set up a new job, the first thing after making sure the cron job worked was to set up a passive check for it to send notifications to. For instance, here are the nagios checks for a database server:

[Screenshot: the nagios service list for this database server, ending with two passive (PASV) service checks]

Yeah, I'm a big believer in monitoring. (grin) If it can be monitored, we probably monitor it. Many of these service checks were just there to gather performance data to render on a dashboard - things like ethernet or SAN switch traffic stats or CPU usage. But notice the last two service checks? They're the ones with an icon that says PASV with two arrows pointing down. One of them is for a cron job that quiesces the database, takes a snapshot of the filesystem the data is on, then releases the database and takes a snapshot of the filesystem the logs are on. The other is for a cron job that uses those snapshots to create volumes we can then back up.

So if either job fails, it exits with a status telling nagios that all is not well, along with some sort of useful error about what happened. But what if the cron job just doesn't run at all, so no error condition is ever reported? Nagios (and any decent monitoring system) can take care of this too. In nagios you just enable freshness checking, set a freshness_threshold, and set up the check so that most of the time nagios doesn't actively check the service - it just sits passively listening for an update. If it doesn't get one within the freshness threshold, it runs whatever you've specified for the active check, and you specify a command that always returns a critical or unknown status (whichever you prefer). So if nagios gets an update from the cron job, great. If nothing is heard within the freshness threshold (say, 25 hours for a job that runs once a day and takes around 30 minutes), the service reverts to critical, or whatever you've specified for the active check.

Here's what my nagios config looks like for this service check:

# my-server create snapshot check
# Go critical if no notification received within 25 hours
# the stalking_options are so that nagios will log all checks, not
# just transitions from one state to a different one
define service {
	use			local-service
	host_name		my-server
	service_description	create snapshot
	check_command		check_dummy_crit
	flap_detection_enabled	0
	max_check_attempts	1
	active_checks_enabled	0
	passive_checks_enabled	1
	check_freshness		1
	freshness_threshold	90000
	stalking_options	o,w,c
	}

See the check_dummy_crit command on the check_command line? That runs a script that just always returns a CRITICAL status to nagios. And you can see that passive checks are enabled and active checks are disabled. I also turned flap detection off because I don't want nagios to squelch alerts if the cron job fails X days in a row. If memory serves, I set the stalking options because I wanted nagios to log all the successes (which include the names of the created snapshots) as well as any failures. In other words, even if the last notification was a success and the most recent notification was also a success, I wanted nagios to log it, not just the changes in status. That made it more convenient for the on-call if there was a problem and he or she needed to roll back a filesystem to a previous snapshot - they were all logged in nagios.
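
For reference, check_dummy_crit isn't a stock nagios command, so here's a minimal sketch of one way it might be defined - this version leans on the standard check_dummy plugin (which just returns whatever state and text you hand it) and assumes it lives in the usual $USER1$ plugin directory:

# always return CRITICAL - only runs when a passive service has gone stale
define service... er, define command {
	command_name	check_dummy_crit
	command_line	$USER1$/check_dummy 2 "No passive check result received within the freshness threshold"
	}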

One thing to be careful with (at least in nagios) is not to get in the habit of just disabling and re-enabling all checks for all services on a host. Get in the habit of acknowledging alerts or disabling notifications instead of disabling service checks. If you use the link to disable all checks for a host, fix a problem, and then re-enable all checks, you'll wind up re-enabling active checks for your passive services too - and then nagios will run the check_dummy_crit command rather than waiting for the next passive update from the script.
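
On a related note, if you prefer the command line to the web UI, acknowledgements and notification toggles are just nagios external commands written to the command file. Here's a rough sketch, assuming the default command file path for a source install (check command_file in your nagios.cfg):

#!/bin/bash
# assumed default command file location; adjust to match your nagios.cfg
cmdfile=/usr/local/nagios/var/rw/nagios.cmd
now=$(date +%s)

# acknowledge the problem but leave the passive service check alone
printf "[%s] ACKNOWLEDGE_SVC_PROBLEM;my-server;create snapshot;2;1;1;admin;working on it\n" "$now" >> "$cmdfile"

# or just silence notifications for that service while you work on it
printf "[%s] DISABLE_SVC_NOTIFICATIONS;my-server;create snapshot\n" "$now" >> "$cmdfile"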

So what does it look like from the script's perspective? It turns out to be very easy. I'm using nsca and send_nsca, but I encourage you to look for newer tools - I think there's a newer tool that allows multi-line passive updates, for instance. But here's how it works with send_nsca and nsca. You just configure the nsca daemon on your nagios server (consult the man pages - it's easy, and I've sketched the relevant config lines just after the script below), then in your script you run send_nsca. In my case, my cron job is a perl script, so I added a function to run send_nsca which I could call from any point within the script:

#
# This is a snippet from a cron job that creates snapshots of our database

# my notify_nagios perl function:
# args: nagios host name, service description, numeric status, status message
sub notify_nagios ($$$$) {
   my $naghost = shift;
   my $nagservice = shift;
   my $nagstatus = shift;
   my $nagmsg = shift;

   # send_nsca reads "host<tab>service<tab>status<tab>message" lines on stdin
   my $NAGIOS = "/usr/local/nagios/nrpe/bin/send_nsca -H my-nagios-server -c /usr/local/nagios/nrpe/etc/send_nsca.cfg > /dev/null 2>&1";

   open(NAGIOS,"| $NAGIOS") or warn "couldn't run send_nsca: $!";
   print NAGIOS "$naghost\t$nagservice\t$nagstatus\t$nagmsg\n";
   close(NAGIOS);
}

# ... a bunch of geeky DB and storage stuff ...

# At this point, $result is a numeric result code like
# 0 = OK
# 1 = WARNING
# 2 = CRITICAL

# And $resstr is a text string containing a textual status like
# "CRITICAL: Failed to create a snopshot for $volname"   or...
# "OK: Created snapshot for $volume on controller $ctrlr"

# Tell nagios the status
notify_nagios ($hostname,"create snapshot",$result,$resstr);
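
As for the nsca daemon mentioned above, the server side really is just a shared password and encryption method that match what send_nsca uses on the clients. A rough sketch of the relevant lines, assuming stock file locations (your paths may differ):

# /usr/local/nagios/etc/nsca.cfg (on the nagios server)
password=SomeSharedSecret
decryption_method=1
command_file=/usr/local/nagios/var/rw/nagios.cmd

# /usr/local/nagios/nrpe/etc/send_nsca.cfg (on each client, matching the script above)
password=SomeSharedSecret
encryption_method=1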

And from a bash script it can be as easy as running echo with some tab-separated parameters and piping that to send_nsca (see the sketch below). Easy. So let nagios log successes and failures for you, and let it sort out who should be notified, when, and under which conditions. You'll save yourself no end of headaches in the future as people in your IT department come and go.
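
Here's a minimal sketch of that, assuming the same host name, service description, and send_nsca paths as the perl example above:

#!/bin/bash
# report an OK result for the "create snapshot" passive service on my-server
printf "my-server\tcreate snapshot\t0\tOK: snapshot created\n" | \
    /usr/local/nagios/nrpe/bin/send_nsca -H my-nagios-server \
    -c /usr/local/nagios/nrpe/etc/send_nsca.cfg > /dev/null 2>&1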