Champagne logging on a beer budget

Older ramblings

Holy cow, I can't believe it's been a year since I started building my own distributed log server. And kibana has turned out to be very useful even with some of it's minimal visualizations. Not long after I started moving more and more production logging to the new servers, kibana and some rudimentary dashboards made a compromised system stick out like a sore thumb and it also made figuring out what happened to the system much easier.

I even made a youtube video demonstrating it. We've discovered misconfigurd systems making odd DNS queries or sending broken email, and now given some network indicators from analyzing some malware from some phish, we can usually determine very quickly if anyone in the company fell for the phish. Or if we find some indicator that's repeating (say, a bunch of different DNS queries that always resolve to a small set of IPs or networks), we can add it to the list of things we search for and display on one of the dashboards. For instance, on my "start of the day" dashboard I show DNS queries for known hostile hostnames or queries we blocked because they'd otherwise resolve to known hostile IPs/nets, or specific snort signature IDs, or rejected traffic to/from certain IPs/nets. You get the idea. We're always discovering interesting new things to watch for.

It's been working fantastic, but I'm finally thinking about re-designing it a bit and making a next version. The current version is coping with between 3 and 4 million events per hour, running on two SGI UV 20s (most events still go to only one of the two servers for reasons I won't bore you with). And on the busier of the two servers, my single node.js process is only consuming between 30 and 60 percent of a single CPU core. Not too bad!

I've built it to be fairly modular. So making a new parser to ingest some new log data is relatively easy. I turned loose a summer intern who'd never written node or javascript, making a new parser for some windows event logs. The idea (now), is that one can configure syslog to pair up a parser module (a single .js file) with a "handler". I've got two handlers at the moment - one that writes parsed data to a log file and one that inserts it into the elasticsearch DB. The parser modules just have a few functions - one to return a boolean if a given message is one the parser wants to handle, and a function that takes an event and returns a JSON object - the parsed object. Easy.

But I'm now thinking along the lines of wanting to parallelize things a bit, just to guarantee that everything continues to scale nicely even if my logging load increases, and I'd also like to start to add some log analysis modules. Some obvious examples would be a module to look at usernames, login times of day and days of week and servers being logged in to. Then after some configured time after first seeing a new username, it would start to figure what servers that user typically logs into, and on what times of days and days of the week. Then if it sees anything that is NOT normal, it could log it's own event. Basically it would (I hope) log events (maybe even send a nagios alert) if a user account logs into a server it rarely (or never) logs into, or if that account is suddenly being used during odd hours, etc.

Or another module might do similar things looking at file server usages, share usages, printing volumes/times, maybe even looking at logs from a badging system. Trying to find what might be odd user behavior worthy of having a human look into it.

Or another module might look for a sequence of events. Some of the indicators I've seen which I currently can't easily search for in kibana are things like, Creates a group, grants admin privs to that group, then sometime later deletes that group, or Creates a new user, then sometime not too later deletes that user. Or maybe it looks for X login failures from a given source or for a given username followed by success (did the account just get compromised?) All sorts of possibilities for "interesting" stuff to look for here. It would also be a nice way to bump up the priority on certain events. For instance, we ingest data from snort, cyphort, PaloAlto firewalls, ASA firewalls, and an Attivo Botsink. Alerts for a host on any one of these might be interesting but might merely be suspicious. But if I see alerts from multiple different things all related to the same host, server, or username, then it's more likely to be something "interesting" a human should look at.

So... My current thoughts are that for this next version, I'll break the syslog daemon into multiple pieces that run as independent processes - probably managed with pm2 so if one process dies for some reason it gets restarted. One process will be the syslog daemon itself and what I'm calling the dispatcher. It binds to the ports accepts conections and log data and does the syslog parsing, figuring out the syslog facility and severity, the host/ip sending the data and possibly application and process ID. This process would then either call each parser to see if it'd return a parsed object or each parser might be made a separate process too. Still debating this. If I make each parser a separate process that makes it possible to spread the parsing out across multiple CPU cores and also makes it easier to figure out which parser is consuming the most CPU time, etc.

For now, I'll assume each parser will be a separate process listening on localhost for messages from the dispatcher and returning parsed json objects. The dispatcher will be able to locally call each parser's isMine() function so it'll only send messages to the appropriate parser. Then, given a parsed Object returned by one of the parsers, it'd use a supplier/subscriber message queue to give that parsed object to any other modules that wanted to look at them. Each handler would be a subscriber, for instance. But the log analysis modules could also be subscribers.

Hmm, as I write this, I just realized that there needs to be a way for the new events some log analysis module wants to log would need to be sent to all of the handlers too. So instead of using a supplier/subscriber mechanism for parsed objects, perhaps I'd use a send/reply message queue. So the dispatcher would send a message with a parsed object to each handler AND to any post-processing modules, but then also listen for any optional returned objects. So a handler writing to a file or the DB wouldn't send any replies but a log analysis module might and those parsed objects would be handled the same way those from any parser would be handled - they'd be passed along to the handlers and log analysis modules.

Anyway, in an effort to start along this path, I played some "what if" games with nodejs and zeromq. I found that on my pokey old laptop, I was able to send/reply JSON objects between two processes between 4000 and 5000 times per second. At my current log event rate of around 800/sec, that should at least function. It's at least fast enough to investigate and then test with some realistic logging loads.

And right now I'm running elasticsearch on the same physical server as the syslog daemon. I could split the syslog daemon off to a separate server as it doesn't need anywhere near as much memory or cpu as elasticsearch does and putting syslog on a system by itself would allow it to scale even bigger if needed. Not to mention I could take my single-node elasticsearch "cluster" and split it up into multiple elasticsearch servers - a real cluster for each syslog daemon in each major geography. Love it when I can scale both horizontally and vertically. :-)


The old C hacker in me bugged me all day today about how IPC over local sockets ought to be far faster than any message passing library instead of ramming everything through the IP stack and maybe I ought to blow the dust off of my old posix/unix C programming books and make an IPC module for nodejs and then tonight... Whaddya know, somone else already did it! Nice.

So tonight I did the same thing I did the other night with the zeromq library and benchmarked the absolute throughput I could expect. Instead of maxing out my pokey old laptop at 4000 to 5000 messages per second (including converting from JSON objects to a string back to JSON objects), I'm getting more like 12,000 messages per second. Now that's more like it!

So I'll be looking into using this module instead. It's a little more fiddly to setup and I'll have to test doing it over a TCP session (for the IP to hostname to username mapping service) but it oughta scale far better.

I made some significant progress on the next rev of my syslog daemon last weekend. Firstly, all the parsers are now separate processes! I have one program called runParser.js which takes care of all the IPC comms and can then fire up any named parser. This is great because we can split off all the parsing loads onto multiple CPU cores but also, the parser code remains uber-simple. Anyone with minimal scripting experience can figure out how to make simple parsing code.

In addtion, I ported over the destination handlers now as similar modules, but which get run by a runDestHandler program. These may return their own, fresh parsedObjects back to the syslog dispatcher as new events to be parsed and/or handled. I think my log analysis modules will likely be written as Destination Handlers, so this is important so they can create new events after finding some sequence of events all related to some bigger security-related event (for instance).

I also stripped all the SGI-specific stuff out of the dispatcher code and placed it in it's own PostProcessor module, and made another program to start named PostProcessor modules. These are for things like taking a parsed object for a DHCP event and mapping an IP address with a DHCP hostname or taking a VPN event and mapping a username to an IP. Or taking some other event with only an IP and using previous mappings to add a username and/or hostname. So now, anyone can glue in whatever mappings make sense in their environment with their parsers. Slick. I'm also going to finally put in my geoIP mapping code using this mechanism. Using it, one can specify lat/lon/city/geo stuff to your own subnets and have it fall back to the free GeoIP databases to figure it out for other IPs/nets. These postProcessor modules modify parsed objects, rather than creating new ones.

Lastly, I finally RTFMd on the pm2 module for nodejs. It's really pretty slick. And best of all, now that I'm breaking the syslog daemon into all these separate bits of code that all run as separate processes, I can have pm2 manage them all. For instance, I setup a JSON file that says start the ipcache service (for IP/username/hostname mappings for all syslog daemons), then start all the different parsers, then start all the different postProcessors, then finally start the syslog daemon itself. And if any one process chokes, pm2 will log it and restart it. I can start them all with one pm2 command. I can stop them all with one pm2 command. I can monitor them (at least their memory usage - not sure about cpu) with one command. If a given process/program doesn't maintain an internal state in memory, if it's carefully written one could even tell pm2 to start X copies of it to run in parallel. This would be a problem for the log analysis modules as they're going to need to keep some sort of state about what events they've seen in what order, unless they're written to use a DB or memcached or something to preserve some shared state data that all the separate processes can refer to.

But, yeah! It's looking very good indeed. The only thing missing now is I need to re-implement the command channel. When someone sends a command to a syslog daemon, it'll need to pass it along to at least all of the destination handlers, though I'll likely pass those commands along to ALL of the separate modules and let them decide what commands (like status, or rotate or reset) matter to each one.

Getting close to sticking it on some development servers at work to stress-test it and start beating it up for a week or two. Oh yeah, and I've got to shut off a bunch of debugging statements and the debugging flag on the ipc module (which makes it quite chatty). So close...