New tool for Visualizing Data


I recently attended SecureWorld in Denver, and one of the speakers reminded me that I've been meaning to look at visualization tools, such as those mentioned at www.secviz.org, for ages now. Then it occurred to me that SGI recently announced a really nifty tool called mineset. So I took it for a test drive.

On my first go with it, I intended to feed it DNS query log and filtering data, possibly to have it show me which queries were leading up to queries that we blocked (i.e., what website was some user surfing to seconds before they landed on some hostname that we're already blocking in our DNS filters?). Well, that yielded some interesting graphs but not what I was after (more on that below).
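
For what it's worth, the "what did they look up right before the blocked query" question can be roughed out without any graphing at all. This is only a minimal sketch under my own assumptions, not what mineset does: the event tuple layout (timestamp, client IP, query name, blocked flag), the function name and the 10 second window are all made up for illustration, and the events are assumed to be sorted by time already.

```python
from collections import defaultdict, deque


def queries_before_blocks(events, window_seconds=10):
    """Map each blocked hostname to the allowed queries that preceded it.

    events: iterable of (timestamp, client_ip, qname, blocked) tuples,
    sorted by timestamp. The tuple layout is a hypothetical one.
    """
    recent = defaultdict(deque)   # client_ip -> deque of (timestamp, qname)
    leads = defaultdict(list)     # blocked qname -> preceding qnames

    for ts, client, qname, blocked in events:
        history = recent[client]
        # Forget anything this client asked for too long ago.
        while history and ts - history[0][0] > window_seconds:
            history.popleft()
        if blocked:
            leads[qname].extend(name for _, name in history)
        else:
            history.append((ts, qname))
    return dict(leads)


events = [
    (100, "10.0.0.5", "news.example", False),
    (103, "10.0.0.5", "ads.example", False),
    (104, "10.0.0.9", "other.example", False),
    (105, "10.0.0.5", "bad.example", True),
    (200, "10.0.0.5", "bad.example", True),
]
print(queries_before_blocks(events))
```

Only queries from the same client inside the window count, so the second hit on bad.example (long after anything else from that client) adds nothing to the list.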

So next, I figured "Start Simple" (tm). I grabbed a portion of our firewall logs (just part of one day), kept ONLY the outbound packets that got blocked, and of those kept ONLY the ones coming from an SGI IP address and going to a non-SGI address (some of our filters block outbound traffic to private IP space, for instance, which I wasn't interested in looking at). I fed it to mineset and it immediately noticed one "outlier" in the data: someone apparently tried to connect to an X11 server outside of SGI. Hmm, I'll look into that. :-)
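
The filtering step looks roughly like this in Python. Treat it as a sketch under stated assumptions: the CSV column names (action, direction, proto, src, dst, dport) and the "internal" network range are placeholders I picked for the example, not our real log format or address space.

```python
import csv
import io
import ipaddress

# Placeholder for "our" address space (a documentation range, not a real one).
INTERNAL_NETS = [ipaddress.ip_network("198.51.100.0/24")]

SAMPLE = """\
action,direction,proto,src,dst,dport
block,out,udp,198.51.100.10,8.8.8.8,53
block,out,udp,198.51.100.11,10.1.2.3,161
allow,out,tcp,198.51.100.12,8.8.8.8,443
block,in,tcp,8.8.4.4,198.51.100.13,22
block,out,tcp,198.51.100.14,1.1.1.1,6000
"""


def is_internal(addr):
    """True if addr falls inside one of the placeholder internal networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL_NETS)


def interesting(row):
    """Keep blocked outbound packets from an internal source to an
    external, non-private destination."""
    if row["action"] != "block" or row["direction"] != "out":
        return False
    if not is_internal(row["src"]):
        return False
    dst = ipaddress.ip_address(row["dst"])
    return not is_internal(row["dst"]) and not dst.is_private


def filter_log(lines):
    """lines: any iterable of CSV lines, header first."""
    return [row for row in csv.DictReader(lines) if interesting(row)]


for row in filter_log(io.StringIO(SAMPLE)):
    print(row["proto"], row["src"], row["dst"], row["dport"])
```

Of the five sample rows, only the DNS packet and the X11 (port 6000) packet survive. The rest are allowed, inbound, or headed for private IP space.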

Then I told mineset to draw a "parallel coordinates" graph of this data and got:



Interesting, eh? So, the first thing that leaps out is the UDP traffic. Note that almost all of it is going to one port (53), and that it's coming from a wide variety of IPs and going to a wide variety of IPs. This is something I knew I'd find, but it's cool that it's so readily apparent on the diagram. Basically, this is a jillion or so Linux systems now (stupidly) running dnsmasq as their DNS client instead of something sensible like nscd. Why is this dumb? Because dnsmasq is a full-blown proxy. It ignores that the DHCP server said "thou shalt consult these two IPs for DNS" and tries to do its own root DNS queries, which we block, and only then does it finally use the IPs DHCP told it to use. LAME! I'd love to punch the nitwit who thought this was a neat idea. Once upon a time, blocked outbound DNS queries were a fine way to detect a compromised system.
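
The same "one port, lots of sources, lots of destinations" pattern also falls out of a simple tally, if you'd rather see numbers than lines. A rough sketch, assuming each blocked packet is a dict with hypothetical keys proto, src, dst and dport (none of this comes from mineset):

```python
from collections import defaultdict


def summarize_by_port(rows):
    """Return {(proto, dport): (packets, distinct sources, distinct dests)}."""
    stats = defaultdict(lambda: {"packets": 0, "sources": set(), "dests": set()})
    for row in rows:
        entry = stats[(row["proto"], int(row["dport"]))]
        entry["packets"] += 1
        entry["sources"].add(row["src"])
        entry["dests"].add(row["dst"])
    return {
        key: (e["packets"], len(e["sources"]), len(e["dests"]))
        for key, e in stats.items()
    }


rows = [
    {"proto": "udp", "src": "198.51.100.10", "dst": "192.0.2.1", "dport": "53"},
    {"proto": "udp", "src": "198.51.100.11", "dst": "192.0.2.2", "dport": "53"},
    {"proto": "udp", "src": "198.51.100.12", "dst": "192.0.2.1", "dport": "53"},
    {"proto": "tcp", "src": "198.51.100.20", "dst": "192.0.2.9", "dport": "25"},
]
summary = summarize_by_port(rows)
# Busiest (proto, port) pairs first.
for key, counts in sorted(summary.items(), key=lambda item: -item[1][0]):
    print(key, counts)
```

A port with a high packet count and high counts in both of the other columns is the dnsmasq-style noise. A port with exactly one source and one destination is more likely a single misconfigured box.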

Why do we block outbound DNS queries? Cuz I don't want my users sending their DNS queries to some malicious server in... Well, I won't name the country but you get the idea. I also go to a lot of work to make our DNS servers filter out malicious hostnames and I don't want someone bypassing those filters by using Google's DNS or something stupid. I also rewrite the responses for stuff like pool.ntp.org so they resolve to our NTP servers instead. No sense in people using an NTP server run by who knows who out on the interwobble when we have our own!

Anyway, it's a slick diagram. I can also see a few other ports (e.g., UDP to port 161, which is SNMP stuff to a network that's no longer active at SGI), some Windows traffic (ports 135-139) to a few IPs that are no longer in service, etc. There was also some TCP traffic: someone's VPN client trying to do TCP to port 53 on their ISP's DNS server (rolls eyes), an old server from an acquisition that's (???) trying to send email for some reason (time to remind folks to de-comm that system), and a VPN client trying to connect to HTTPS on an IP that's in our banned list. Hmph, I'll have to look into that on Monday too. :-)

These graphs are interactive too. You can re-arrange the poles in the parallel coordinates graph. You can mouse over any line/dot and get the details from the data (in this case, IPs and dest port). And it can do a bunch of different visualizations.

For instance, my first go with it, I fed it a bunch of DNS query data and this was the parallel coords graph I made:



This one was built using a CSV file with about 4.8 million rows. Here's another parallel coordinates graph, but using a smaller dataset (just one client IP) and this time including the date/time as one of the dimensions to graph:



Another interesting (for this dataset) type of visualization is the bubble chart:



This one shows some of the different queries that were made and the "BlockedVia" shows which RPZ rule caused this to be blocked. The ? value is for queries we didn't block.
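
If you want to sanity-check what a chart like that is showing, the underlying numbers are easy to rough out. A minimal sketch, assuming each row is a dict with a hypothetical "qname" column plus the "BlockedVia" column (empty when the query wasn't blocked), and assuming a bubble's size is just a row count. I haven't checked how mineset actually sizes its bubbles.

```python
from collections import Counter


def bubble_sizes(rows):
    """Count rows per (query name, rule), using "?" for unblocked queries."""
    counts = Counter()
    for row in rows:
        rule = row.get("BlockedVia") or "?"
        counts[(row["qname"], rule)] += 1
    return counts


rows = [
    {"qname": "bad.example", "BlockedVia": "rpz-malware"},
    {"qname": "bad.example", "BlockedVia": "rpz-malware"},
    {"qname": "news.example", "BlockedVia": ""},
    {"qname": "ads.example"},
]
for (qname, rule), size in bubble_sizes(rows).most_common():
    print(qname, rule, size)
```

The rule name "rpz-malware" is invented for the example. Real RPZ rule names would come from whatever your DNS server logs.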

Anyway, check out mineset! The demo version is only limited in the size of the dataset you feed it, and you can play around with a pile of different visualizations of the data. Plus it has a whole bunch of other features for analyzing the dataset (for instance, determining which columns "predict" a value in another column, or looking for correlations) that I haven't even started to play with. I'll have to fool around with it with some other data and see what other cool things I can use it for.