Stop Using Email Notifications In Cron Jobs
The more I work with different IT infrastructures (as a new employee taking
over for previous admins, after acquiring another company and adopting some of
their infrastructure, or after being acquired and having to migrate or retire
our own; yeah, I've been involved in all of those), the more I've come to
truly detest email notifications for cron/scheduler jobs, and email
notifications in general. Why? I find they can be grouped into several
categories of bad.
The first is the "All is well" email notification: you know, the ones you
get at least once a day, that everyone has set up an email filter for, that
nobody looks at, and that either land in a separate, giant folder or just
get deleted. What's the point of an email notification that nobody looks at,
for cryin' out loud! And then every now and then, mixed in with thousands of
ignored "All is well" emails, there will be one "All is NOT well" email that
nobody sees because it's been filtered off to a folder the recipient never
looks at. Oops.
Then there are the email notifications sent to an undeliverable alias or
mailing list, or to a person who hasn't been with the company in a decade. How
could this be? Picture IT geek Fred. He's been with the company since its
founding a decade earlier, when the entire IT department was just him. Now
the company is bigger and has a half dozen IT geeks. Hardware has come 'n gone
and Fred leaves the company. Any bets as to how many devices have email
notifications pointing to Fred@company.com? Any bets as to how many will be
found and fixed? These only send an email when something is NOT well, so it's
not like one can just look in the email logs to see what devices are trying to
send email to Fred. Someone has to look at every single device. When people in
IT are operating at a dead-run already, that's just not gonna happen. Typically,
nobody notices an old piece of hardware desperately calling Fred for help
until an email or security admin happens to notice hundreds of bounced
emails to Fred. Or, more often, the device will finally have enough failures
that no redundancy is redundant enough, it fails spectacularly, and
management starts asking "why did nobody know that this device had two bad fans,
one bad power supply and three failed hard drives?" Oops.
Another category is the cron/scheduler jobs that Fred set up over the years.
Inevitably, these are critical pieces of automation in the IT infrastructure
that nobody even knows exist since they mostly just quietly do their job...
until one breaks for some reason and sends an email to Fred saying "Help! Fix
me!". Then someone in IT has to do IT archaeology, figuring out why some
database transaction log is filling up now or why support tickets aren't
magically getting sync'd to the support FTP server or whatever. Not fun.
So now that I've gotten that off my chest... Yeah, email notifications suck.
They either go unseen, go to the wrong place, or otherwise wind up being
more of an annoyance than a solution. So what are we to do?
At my last job, I inherited a metric crap-ton of systems that had been set up
by people who'd long since left the company, running apps with no support
contract that had been set up by still other people who'd been gone a decade
or more, each with critical cron jobs that nobody in the IT or Apps
departments knew anything about. Bummer! So any time I found one of these, I
dug in my heels, found an owner for whatever process the job was supposed to
accomplish, and set up a passive nagios service for that cron job. Then I
edited the script, looking for every exit condition. Any time the script
exited, it would run a command to notify nagios of the success or failure of
the job.
Why is this better? Well, because now the script can run as often as needed
and report "all is well" to nagios, which will dutifully note it without
filling up anyone's mailbox. Plus, if the script ever does exit with some
fault, it notifies nagios of that fault, so it doesn't go unnoticed. Now, it's
also critical that nagios (or whatever monitoring server you use) be kept up
to date with correct email addresses. But that's MUCH more likely to be kept
up to date, and much easier to keep up to date, than having to go edit half
a jillion cron jobs and hardware configs all over the world.
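As an aside, this is what "one place to update" looks like in practice: a
contact (or contactgroup) definition in nagios that every notification
references. This is just a sketch; the name, address and group here are made
up for illustration, and the timeperiod and notification commands are the ones
shipped in the sample nagios config:

define contact {
    contact_name                    jane.admin              ; hypothetical admin account
    alias                           Jane Admin
    email                           jane.admin@company.com
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}

define contactgroup {
    contactgroup_name   it-oncall
    alias               IT on-call
    members             jane.admin
}

When Fred leaves, you update one contact definition instead of hunting down
every device and cron job that knows his address.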
The more I did this, the more I fell in love with passive service checks in
nagios. Any time we set up a new job, the first thing we did after making sure
the cron job worked was to set up a passive check for it to report to.
For instance, here are the nagios service checks for a database server:
[Screenshot: the nagios service-check list for the database server, including the two passive (PASV) checks described below.]
Yeah, I'm a big believer in monitoring. (grin) If it can be monitored, we
probably monitor it. Many of these service checks were just there to gather
performance data to render on a dashboard, like Ethernet or SAN switch
traffic stats or CPU usage. But notice the last two service checks? They're
the ones that have an icon that says PASV with two arrows pointing down. One
of those is for a cron job that quiesces the database, then takes a snapshot
of the filesystem the data is on, then releases the database and takes a
snapshot of the filesystem the logs are on. The other check is for a cron job
that uses these snapshots to create volumes we can then back up.
So if either job fails, it exits with a status telling nagios that all is not
well, along with some sort of useful error message about what happened. But
what if the cron job just doesn't run at all, so no error conditions are
reported? Nagios (and any decent monitoring system) can take care of this too.
In nagios you just set a freshness_threshold and set up the check so that most
of the time nagios doesn't actively check this service; it just sits passively
listening for an update. If it doesn't get one within the freshness threshold,
it then runs whatever you've specified for the active check, and you specify
a script that always returns a critical or unknown status (whatever you'd
prefer). So if nagios gets an update from the cron job, great. If nothing is
heard within your freshness threshold (say, 25 hours for a job that runs once
a day and takes around 30 minutes to run), the service check will go critical,
or whatever you've specified for the active check.
Here's what my nagios config looks like for this service check:
# my-server create snapshot check
# Go critical if no notification received within 25 hours
# the stalking_options are so that nagios will log all checks, not
# just transitions from one state to a different one
define service {
    use                     local-service
    host_name               my-server
    service_description     create snapshot
    check_command           check_dummy_crit
    flap_detection_enabled  0
    max_check_attempts      1
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     90000
    stalking_options        o,w,c
}
See the check_dummy_crit command on the check_command line? That runs a
script that just always returns a CRITICAL status to nagios. And you can see
that passive checks are enabled and active checks are disabled. I also turned
flap_detection off because I don't want nagios to squelch alerts if the cron
job fails X days in a row. If memory serves, I set the stalking options
because I wanted nagios to log all the successes (which includes the names
of the created snapshots) as well as any failures. In other words, even if
the last notification was a success and the most recent notification was also
a success, I wanted nagios to log it, not just the changes in status. It just
made it more convenient for the on-call if there was a problem and he/she
needed to roll back a filesystem to a previous snapshot - they were all logged
in nagios.
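By the way, check_dummy_crit isn't a stock command name. If you don't already
have something like it, a thin wrapper around the standard check_dummy plugin
does the job; a definition would look something like this (with $USER1$ being
the plugin directory from your resource.cfg, as usual):

define command {
    command_name    check_dummy_crit
    command_line    $USER1$/check_dummy 2 "No passive result received within the freshness threshold"
}

check_dummy simply exits with whatever status you hand it (2 = CRITICAL here),
which is exactly what you want when a passive check goes stale.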
One thing to be careful with (at least in nagios) is to not get in the habit of
just disabling and re-enabling all checks for all services on a host. Get in
the habit of acknowledging alerts or disabling notifications instead of
disabling service checks. This is because if you use the link to disable all
checks for a host (for instance), fix a problem, and then enable all checks
again, you will wind up re-enabling active checks for your passive services
too, and then nagios will run the check_dummy_crit command rather than waiting
for the next passive update from the script.
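If you live on the command line rather than the web UI, acknowledging a
service problem is just a line written to nagios's external command file. A
quick sketch (the command-file path is the usual default for a source install,
and the author name is made up; adjust both for your setup):

#!/bin/bash
# Acknowledge a service problem instead of disabling its checks.
# External command format:
#   [timestamp] ACKNOWLEDGE_SVC_PROBLEM;host;service;sticky;notify;persistent;author;comment
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd   # default external command file
NOW=$(date +%s)

printf "[%s] ACKNOWLEDGE_SVC_PROBLEM;my-server;create snapshot;2;1;1;jane.admin;working the failed snapshot job\n" \
    "$NOW" > "$CMDFILE"

The checks keep running, the alert stops nagging, and nothing gets left
disabled by accident.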
So what does it look like from the script's perspective? It turns out to be
very easy. I'm using nsca and send_nsca, but I encourage you to look for newer
tools; I think there's a newer tool that allows for multi-line passive updates,
for instance. But here's how it works with send_nsca and nsca. You just
configure the nsca daemon on your nagios server (consult the man pages, it's easy), then
in your script you run send_nsca. In my case, my cron job is a perl script so
I added a function to run send_nsca which I could call from any point within
the script:
#
# This is a snippet from a cron job that creates snapshots of our database
# my notify_nagios perl function:
sub notify_nagios ($$$$) {
    my $naghost    = shift;   # host name exactly as nagios knows it
    my $nagservice = shift;   # service_description of the passive check
    my $nagstatus  = shift;   # numeric status: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    my $nagmsg     = shift;   # human-readable status text
    my $NAGIOS = "/usr/local/nagios/nrpe/bin/send_nsca -H my-nagios-server -c /usr/local/nagios/nrpe/etc/send_nsca.cfg > /dev/null 2>&1";
    # send_nsca expects one tab-separated line: host, service, status, message
    open(NAGIOS, "| $NAGIOS");
    print NAGIOS "$naghost\t$nagservice\t$nagstatus\t$nagmsg\n";
    close(NAGIOS);
}
# ... a bunch of geeky DB and storage stuff ...
# At this point, $result is a numeric result code like
# 0 = OK
# 1 = WARNING
# 2 = CRITICAL
# And $resstr is a text string containing a textual status like
# "CRITICAL: Failed to create a snapshot for $volname" or...
# "OK: Created snapshot for $volume on controller $ctrlr"
# Tell nagios the status
notify_nagios ($hostname,"create snapshot",$result,$resstr);
And from a bash script it can be as easy as running echo with some
tab-separated parameters and piping that to send_nsca (there's a quick sketch
below). Easy. So let nagios log successes and failures for you, and let it
sort out who should be notified when and under which conditions. You'll save
yourself no end of headaches in the future as people in your IT department
come and go.
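For the curious, here's roughly what that bash version looks like. It's a
minimal sketch: the host name and service description mirror the nagios config
above, the send_nsca paths mirror the perl example, and create_snapshot.sh
stands in for whatever your cron job actually does:

#!/bin/bash
# Report this cron job's result to nagios as a passive check via send_nsca.
# send_nsca reads one tab-separated line: host, service, status (0-3), message.
HOST="my-server"
SERVICE="create snapshot"
SEND_NSCA="/usr/local/nagios/nrpe/bin/send_nsca"
NSCA_CFG="/usr/local/nagios/nrpe/etc/send_nsca.cfg"

/usr/local/bin/create_snapshot.sh   # hypothetical: the real work of the cron job
rc=$?

if [ "$rc" -eq 0 ]; then
    STATUS=0
    MSG="OK: snapshot created"
else
    STATUS=2
    MSG="CRITICAL: snapshot job failed with exit code $rc"
fi

echo -e "${HOST}\t${SERVICE}\t${STATUS}\t${MSG}" | \
    "$SEND_NSCA" -H my-nagios-server -c "$NSCA_CFG" > /dev/null 2>&1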