New tool for Visualizing Data
I recently attended SecureWorld in Denver, and one of the speakers reminded
me that I'd been meaning to look at visualization tools, such as those
mentioned at www.secviz.org, for ages now. Then it occurred to me that SGI
recently announced a really nifty tool called mineset. So I took it for
a test drive.
My first go with it, I was intending to feed it DNS query logs and filtering
data, possibly to have it show me what queries were leading to queries that
we blocked (i.e., what website was some user surfing to seconds before
they landed on some hostname that we're already blocking in our DNS
filters). Well, that yielded some interesting graphs but not what I was
after (more on that below).
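For what it's worth, the "what were they looking up just before the block"
question can also be roughed out in a few lines of Python. This is only a
sketch: the row layout (timestamp, client IP, hostname, blocked flag), the
function name and the 5 second window are all assumptions of mine, not
anything mineset or our DNS logs actually use.

```python
from collections import defaultdict, deque


def queries_before_blocks(rows, window_seconds=5.0):
    """Pair each blocked query with the same client's recent allowed queries.

    rows: iterable of (timestamp, client_ip, hostname, blocked) tuples,
    already sorted by timestamp (a hypothetical layout).
    Returns a list of (client_ip, blocked_hostname, [earlier hostnames]).
    """
    recent = defaultdict(deque)  # client_ip -> deque of (timestamp, hostname)
    results = []
    for ts, client, host, blocked in rows:
        history = recent[client]
        # Forget anything this client asked for longer ago than the window.
        while history and ts - history[0][0] > window_seconds:
            history.popleft()
        if blocked:
            results.append((client, host, [h for _, h in history]))
        else:
            history.append((ts, host))
    return results
```

Each result names the client, the blocked hostname, and whatever that client
looked up in the seconds before it, which is the list you'd eyeball for the
site doing the referring.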
So next, I figured "Start Simple" (tm). I grabbed a portion of our
firewall logs (just a part of one day), then kept ONLY the outbound
packets that got blocked, and of those, ONLY the ones coming from an SGI
IP address and going to a non-SGI address (some of our filters block
outbound traffic to private IP space, for instance, which I wasn't
interested in looking at). I fed it to mineset and it immediately noticed
one "outlier" in the data: someone apparently tried to connect to an X11
server outside of SGI. Hmm, I'll look into that. :-)
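The pre-filtering described above might look something like this in Python.
Treat it as a sketch under assumptions: the field names (action, direction,
src, dst) are made up, 192.0.2.0/24 is just a documentation range standing in
for the real company networks, and I'm reading the private-IP remark as
"drop those rows too".

```python
import ipaddress

# Placeholder only: a documentation range standing in for the company's nets.
INTERNAL_NETS = [ipaddress.ip_network("192.0.2.0/24")]


def is_internal(addr):
    """True if addr falls inside one of the (assumed) internal networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL_NETS)


def keep_row(row):
    """Keep blocked outbound packets from an internal source to a public,
    non-internal destination. row is a dict with hypothetical keys
    'action', 'direction', 'src' and 'dst'."""
    if row["action"] != "block" or row["direction"] != "out":
        return False
    if not is_internal(row["src"]) or is_internal(row["dst"]):
        return False
    # Skip destinations in private/reserved space; those blocks are expected.
    return not ipaddress.ip_address(row["dst"]).is_private
```

Run every parsed log line through keep_row and write the survivors out as a
csv, and you have something small enough to feed to a visualization tool.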
Then I told mineset to draw a "parallel coordinates" graph of this
data and got:

Interesting, eh? So, the first thing that leaps out is the UDP traffic.
Note how almost all of it is going to one port (53), and it's coming from
a wide variety of IPs and going to a wide variety of IPs? This is something
I knew I'd find but it's cool that it's so readily apparent on the diagram.
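If you want numbers to go with the picture, a quick tally by protocol and
destination port tells the same story. A minimal sketch, assuming each row
has already been parsed into a (proto, src, dst, port) tuple; that layout and
the function name are mine, not mineset's.

```python
from collections import defaultdict


def summarize(rows):
    """Count packets and distinct endpoints per (protocol, destination port).

    rows: iterable of (proto, src_ip, dst_ip, dst_port) tuples (assumed layout).
    Returns [((proto, port), packets, n_sources, n_dests), ...], busiest first.
    """
    stats = defaultdict(lambda: {"packets": 0, "sources": set(), "dests": set()})
    for proto, src, dst, port in rows:
        entry = stats[(proto, port)]
        entry["packets"] += 1
        entry["sources"].add(src)
        entry["dests"].add(dst)
    table = [
        (key, e["packets"], len(e["sources"]), len(e["dests"]))
        for key, e in stats.items()
    ]
    # Busiest protocol/port pairs first.
    return sorted(table, key=lambda item: item[1], reverse=True)
```

A row like (udp, 53) with a big packet count and lots of distinct sources and
destinations is the same fan of lines you see in the graph.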
Basically, this is a jillion or so Linux systems now (stupidly) running
dnsmasq as their DNS client instead of something sensible like nscd. Why
is this dumb? Because dnsmasq is a full blown proxy - it does its own
root server lookups and ignores that the DHCP server said "thou shalt
consult these two IPs for DNS" and instead it tries to do its own root DNS
queries, which we block, and only then does it finally use the IPs DHCP
told it to use. LAME! I'd love to punch the nitwit who thought
this was a neat idea. Once upon a time, blocked outbound DNS queries
were a fine way to detect a compromised system.
Why do we block outbound DNS queries? Cuz I don't want my users sending
their DNS queries to some malicious server in... Well, I won't name the
country but you get the idea. I also go to a lot of work to make our DNS
servers filter out malicious hostnames and I don't want someone bypassing
those filters by using Google's DNS or something stupid. I also rewrite
the responses for stuff like pool.ntp.org so they resolve to our NTP
servers instead. No sense in people using an NTP server run by who knows
who out on the interwobble when we have our own!
Anyway, it's a slick diagram. I can also see a few other ports (e.g., UDP to
port 161 - SNMP stuff to a network that's no longer active at SGI), some
Windows traffic (ports 135-139) to a few IPs that are no longer in service,
etc. There was also some TCP traffic: someone's VPN client trying to do TCP
to port 53 on their ISP's DNS server (rolls eyes), an old server from
an acquisition that's (???) trying to send email for some reason - time
to remind folks to de-comm that system, and a VPN client trying to connect
to HTTPS on an IP that's in our banned list. Hmph, I'll have to look into
that on Monday too. :-)
These graphs are interactive too. You can re-arrange the poles in the
parallel coordinates graph. You can mouse over any line/dot and get the
details from the data (in this case, IPs and dest port). And it can do a
bunch of different visualizations.
For instance, my first go with it, I fed it a bunch of DNS query data and
this was the parallel coords graph I made:

This one was built using a csv file with about 4.8 million rows.
Here's another parallel coordinates graph but using a smaller dataset
(just one client IP) and this time including the date/time as one of the
dimensions to graph:

Another interesting (for this dataset) type of visualization is the bubble
chart:

This one shows some of the different queries that were made and the
"BlockedVia" shows which RPZ rule caused this to be blocked. The ? value
is for queries we didn't block.
Anyway, check out mineset! The demo version is only limited in the size
of the dataset you feed it and you can play around with a pile of different
visualizations of the data. Plus it has a whole bunch of other features for
analyzing the dataset (for instance, determining what columns "predict" a
value in another column, or looking for correlations) that I haven't even
started to play with. I'll have to fool around with it with some other data
and see what other cool things I can use it for.