Learning Node.js and elasticsearch
At my DAY JOB I'm logging an increasingly
large amount of stuff, and I want to be able to quickly search through
it for specific things and also have the tools doing the logging watch
for specific things themselves. I've been doing this with a combination of a wide
variety of tools (some of which I've talked about here like oak), but most
recently we had a small start-up write a prototype for us that used
elasticsearch and logstash. The bit they wrote was a web interface for
searching the DB and making jobs that run periodically that search for
whatever and report what they find.
Well, that business folded. So I'm back in the market for a log DB
and don't have any budget at the moment. I might just use elasticsearch and
kibana and RTFM enough on logstash to figure out how to make it index my logs.
I've been meaning to teach myself Node.js anyway, so I spent the recent
holidays doing just that. When learning a new language, I always start by
finding a project I want that the language is suited for, and start writing
it as I learn. So, I started making my own syslog server, just for grins.
Node should be a good fit for it, I think, since its whole claim to fame is
that it's event driven: less time is wasted waiting on disk/network I/O,
and more is spent handling the next event. Interesting. So, my
(possibly dumb) idea was to have it accept syslog messages, pick out the
facility/severity and then pass it on to various handlers, which would
then write it to a file or add it to a database or pass it to a log
message analyzer or whatever. I also found out that these could be run as
separate processes. Neat - a multi-threaded syslog daemon? One where, if a
handler (say, one inserting messages into a DB) starts to bog down, it
doesn't cause any dropped messages, maybe? (I still need to do some
benchmarking and play "what if" games with it.)
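A minimal sketch of that idea, with the caveat that the message shape and handler names here are my own assumptions, not the daemon's actual design:

```javascript
// Sketch of the dispatch idea: parse the syslog PRI header into
// facility/severity, then hand the parsed event to every registered handler.
// In the real daemon a dgram UDP socket on port 514 would feed this; the
// socket wiring is only hinted at in comments below.

// RFC 3164 messages start with "<PRI>" where PRI = facility * 8 + severity.
function parsePri(raw) {
  const m = /^<(\d{1,3})>/.exec(raw);
  if (!m) return null;
  const pri = parseInt(m[1], 10);
  return {
    facility: Math.floor(pri / 8),
    severity: pri % 8,
    msg: raw.slice(m[0].length),
  };
}

// Handlers are just functions; each one sees every parsed event.
const handlers = [];
function addHandler(fn) { handlers.push(fn); }

function dispatch(raw) {
  const event = parsePri(raw);
  if (event) handlers.forEach((h) => h(event));
  return event;
}

// Wiring it to the network would look roughly like:
//   const dgram = require('dgram');
//   const server = dgram.createSocket('udp4');
//   server.on('message', (buf) => dispatch(buf.toString()));
//   server.bind(514);

// Example: auth.info is facility 4, severity 6, so PRI is 4*8+6 = 38.
addHandler((e) => console.log(`facility=${e.facility} severity=${e.severity}`));
dispatch('<38>Jan  5 10:00:00 host sshd[123]: Accepted password for bob');
```

The nice part is that a handler writing to a file and a handler inserting into a database don't need to know about each other at all.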
Anyway, I've got a first cut at one now that logs to a file and logs to
elasticsearch. And the handler that writes stuff to elasticsearch also
picks interesting things out of the log messages to index by like "user" or
"command" or "tty" (all stuff seen in auth.* messages). So far, so good.
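Roughly what I mean by picking interesting things out - the regexes below are my own approximations for typical sudo/sshd auth lines, not the handler's actual rules:

```javascript
// A sketch of the field extraction: pull "user", "tty", "command", etc.
// out of common auth.* messages so they can be indexed as separate fields.
function extractAuthFields(msg) {
  const fields = {};
  // sudo lines look like:
  //   bob : TTY=pts/0 ; PWD=/home/bob ; USER=root ; COMMAND=/bin/ls
  const sudo = /^\s*(\S+) : TTY=(\S+) ; PWD=(\S+) ; USER=(\S+) ; COMMAND=(.*)$/.exec(msg);
  if (sudo) {
    fields.user = sudo[1];
    fields.tty = sudo[2];
    fields.runas = sudo[4];
    fields.command = sudo[5];
    return fields;
  }
  // sshd lines look like:
  //   Accepted password for bob from 10.0.0.5 port 22 ssh2
  const ssh = /^Accepted (\S+) for (\S+) from (\S+)/.exec(msg);
  if (ssh) {
    fields.method = ssh[1];
    fields.user = ssh[2];
    fields.src = ssh[3];
  }
  return fields;
}
```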
I still need to spend a lot more quality time with kibana figuring it
out. I think it'll be a powerful tool (probably more so than the one made
by the start-up), though it's not exactly intuitive. But it's a start and
gave me a good excuse to learn a bit about Node.js. :-)
If I wind up with a program that I think is useful, I'll probably stick it
in github for the world to laugh at... er, I mean to use. Hopefully it'll
scale well. At SGI it's not unheard of for someone to mistakenly cause over
8000 DNS queries per second which all get logged (ugh) and I don't even wanna
think about the number of log messages per second we get from all the
firewalls in all the different offices. It's obscene.
But my intention is to make it so I can have all these devices/servers logging
to a LOCAL server (one in each geography - less WAN traffic) which logs it
to one or more LOCAL
elasticsearch servers which are all working as a cluster so kibana can
search (in parallel?) any or all of them for whatever I need to find. A
distributed logging system that's centrally searchable.
Yeah, there's probably lots of off-the-shelf tools like splunk or
logrhythm that do this but as I said I have no budget at the moment. And
in the meantime I'll add Node.js to my toolbox. :-)
Ok color me officially impressed. Once I got my syslog daemon logging things
into elasticsearch, I started playing with the latest version of kibana.
And it's a very nice tool. I've still got lots to learn about it, but if
you're curious, search youtube for videos demonstrating it. It's very nice.
I also have been spending evenings and weekends working on my little syslog
daemon and it's coming along nicely too. I tweaked it to start tracking how
many log events per second it's handling, how long (on average) each
destination handler is taking to parse it, and how long the inserts into
elasticsearch are taking. With those stats in place, I found out that doing
hundreds of individual inserts per second didn't scale - after 8 or 9 hours of
running, I'd see my per-message handling times go from .1 or .2 ms (roughly) to 20-30 ms.
However, last night I tweaked it so the elasticsearch-based handlers now
just queue up each object to insert into the database and then every X
seconds (once a second right now) I do a bulk insert. MUCH better
performance. After running it about 12 hours, my average handling
times are still .01 to .02 ms per event, and the insert times are still
around 0.17 ms per event.
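The batching change boils down to something like this sketch - `bulkFn` stands in for the elasticsearch client's bulk call (e.g. `client.bulk({ body })` in the `elasticsearch` npm module), and the class name and shape are my own:

```javascript
// Instead of one insert per event, queue documents and flush them with a
// single bulk request every interval. The bulk body is the usual pairs of
// action metadata and document.
class BulkQueue {
  constructor(bulkFn, indexName, intervalMs = 1000) {
    this.bulkFn = bulkFn;       // e.g. (req) => client.bulk(req)
    this.indexName = indexName;
    this.intervalMs = intervalMs;
    this.pending = [];
    this.timer = null;
  }

  add(doc) {
    this.pending.push(doc);     // O(1) per event; no network I/O here
  }

  // Build the action/document pairs the bulk API expects and send them.
  // Returns how many documents were flushed.
  flush() {
    if (this.pending.length === 0) return 0;
    const body = [];
    for (const doc of this.pending) {
      body.push({ index: { _index: this.indexName } });
      body.push(doc);
    }
    const count = this.pending.length;
    this.pending = [];
    this.bulkFn({ body });
    return count;
  }

  start() {
    // unref() so a pending flush timer doesn't keep the process alive.
    this.timer = setInterval(() => this.flush(), this.intervalMs);
    this.timer.unref();
  }
}
```

One bulk request amortizes the HTTP round trip and elasticsearch's per-request overhead across hundreds of events, which is where the big win comes from.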
And I haven't even parallelized anything yet. I'll either bang together some
new parsers for other types of data or I'll dive into the pm2 module and
figure out how to parallelize it next. Or maybe I'll tidy up the code and
modularize it a bit more so specifying what "handlers" to use and what
files to import them from is in a config file instead of just hardwired
in the code.
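Something like this is the shape I have in mind for that config file - purely hypothetical at this point, nothing about it is settled:

```json
{
  "handlers": [
    { "module": "./handlers/file.js",
      "options": { "path": "/var/log/all.log" } },
    { "module": "./handlers/elasticsearch.js",
      "options": { "host": "localhost:9200",
                   "indexPrefix": "logstash-",
                   "flushSeconds": 1 } }
  ]
}
```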
And at some point, I'll have the handlers take their performance stats and
stick them into elasticsearch too. That way kibana can graph them for me.
I'll be able to make pretty dashboards with pie charts showing what types
of log data we're ingesting and line charts showing average parsing and
inserting times, how many events per second each handler is doing, etc.
Since kibana is so good at that, I figure, 'might as well'.
Anyway... It's coming along nicely. I'm currently handling around 20k to 25k
events per minute, and it's only consuming (for now) around 20 percent of one
cpu core (for both elasticsearch and my syslog)...
Another breakthrough... Last night I figured out how to bend elasticsearch
and kibana to my will so that I could keep my data local to a specific
geography but still have it be centrally searchable. This turned out to be
a challenge at first because an elasticsearch cluster really wants to
spread the data out across all the nodes in the cluster. Whereas, I wanted
to keep log data in CA in the elasticsearch node/s in CA, log data in
WI in the nodes in WI, log data in London in the nodes in London, etc.
As it turns out, the way to do this is to have each geography have its own
elasticsearch cluster (with its own cluster name) and its own indexes
(this is important). Then, on the system you're gonna run kibana, you
set up elasticsearch again, but in THIS config, you set it up to use
"tribes". Each tribe you configure causes elasticsearch to connect to
a separate cluster as a client. Then you point kibana at this local
elasticsearch node. It does all the queries across all the different
tribes - across the different clusters - aggregates the results, and
feeds them back to kibana.
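The tribe setup in `elasticsearch.yml` on that local node looks roughly like this (the cluster names and hosts are my examples, not real ones):

```yaml
# elasticsearch.yml on the node kibana talks to. Each "tribe" entry joins
# a remote cluster as a client; searches fan out across all of them.
tribe:
  ca:
    cluster.name: logs-ca
    discovery.zen.ping.unicast.hosts: ["es-ca.example.com"]
  wi:
    cluster.name: logs-wi
    discovery.zen.ping.unicast.hosts: ["es-wi.example.com"]
  london:
    cluster.name: logs-london
    discovery.zen.ping.unicast.hosts: ["es-lon.example.com"]
```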
So, I had to tweak my syslog.js script so the index name used still
started with logstash- (which is what kibana looks for by default), but
was unique for each geography. I also had to connect kibana to one of the
clusters directly, once, so it could create its .kibana index, then
switch it back to its local tribe-based elasticsearch server.
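The index-name tweak amounts to something like this (the geography tag and date layout are my assumptions; the point is that kibana's default pattern just needs the `logstash-` prefix):

```javascript
// Build a per-geography index name that still matches kibana's default
// "logstash-*" index pattern, while keeping each geography's indexes unique.
function indexName(geo, date) {
  const d = date || new Date();
  const pad = (n) => String(n).padStart(2, '0');
  return `logstash-${geo}-${d.getUTCFullYear()}.${pad(d.getUTCMonth() + 1)}.${pad(d.getUTCDate())}`;
}

// e.g. the CA syslog daemon writes to indexName('ca'),
// the London one to indexName('london'), and so on.
```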
This will work out great because it means if things ever start to not
scale well enough, I can just build up that elasticsearch cluster - add
a new node. Or if I decide I want to retire/replace a node in one of the
clusters, I can set it to be master: false, data: false, then spin up
a new node in the cluster; all the new data will go only to the new node,
and eventually I can tear down the old one once all of its data has migrated off.
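In `elasticsearch.yml` terms, retiring a node comes down to a couple of settings (this is the 1.x/2.x style; exact names vary by version):

```yaml
# On the node being retired: stop it from holding data or being eligible
# as master, which causes its shards to relocate to the remaining nodes.
node.master: false
node.data: false
```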
At the moment, I'm handling well over 1.5 million log events per hour and
the elasticsearch nodes (my "clusters" are only one node each at the moment)
are ticking along fine at 20% of one CPU core. :-) I'm sure the CPU usage
and memory needs will increase as the amount of data I'm keeping increases,
but for now that's just dandy. These are just prototypes after all. At
some point I'll spin up production nodes with significant amounts of
memory and a BIG chunk of fast storage. Then I'll really have somethin'!