I took a closer look at scan-NG and at the scan.bro that shipped with 1.5 to understand how the detection could be better than what we have now. 1.5 wasn't fundamentally better, but compared to what we are doing now it has an unfair advantage :-)
I found that it used tables like this:
    global distinct_ports: table[addr] of set[port]
        &read_expire = 15 mins &expire_func=port_summary &redef;
Not only is it using a default timeout of 15 minutes instead of 5 minutes, it is using &read_expire, so the timer restarts every time the entry is accessed. This means that an attacker can send one packet every 14 minutes, 25 times, and still be tracked.
In other words, scan.bro as shipped with 1.5 can pick up slow scans spread over nearly 6 hours.
The sumstats-based scan.bro can only detect scans that fit in the fixed time window. It is effectively using create_expire but, as Aashish points out, it is limited even further, since the 'creation time' falls on a fixed interval regardless of when the attacker is first seen.
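To make the difference between the two expiry styles concrete, here is a small Python model (not Bro code) of a single table entry for one attacker. The function name and the "read"/"create" mode strings are made up for illustration, and the model ignores the fixed-interval alignment Aashish mentioned, which would only make the "create" case worse.

```python
def max_tracked(times, timeout, mode):
    """Return the largest number of attempts held in one table entry.

    times   -- sorted attempt timestamps in minutes for a single attacker
    timeout -- expiry interval in minutes
    mode    -- "read" (timer restarts on every access) or
               "create" (timer runs from entry creation)
    """
    best = count = 0
    created = last = None
    for t in times:
        anchor = last if mode == "read" else created
        if anchor is None or t - anchor >= timeout:
            # Entry expired (or never existed): start a fresh one.
            created, count = t, 0
        count += 1
        last = t
        best = max(best, count)
    return best


# The slow scan from the example: one attempt every 14 minutes, 25 times.
slow_scan = [14 * i for i in range(25)]

print(max_tracked(slow_scan, 15, "read"))    # 15-minute read-style expiry
print(max_tracked(slow_scan, 5, "create"))   # 5-minute create-style expiry
print(slow_scan[-1] / 60)                    # hours between first and last attempt
```

With the read-style timer all 25 attempts land in the same entry, while with the create-style timer the entry never holds more than one.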
The tracking in the 1.5 scan.bro isn't doing anything inherently better than what we have now; it's just doing it over a much longer period of time. The actual detection it uses has the same limitation as the current sumstats-based scan.bro: it does not detect fully randomized port scans. It would benefit from the same "unification" changes.
Since fixing sumstats and adding new functionality to solve this problem in a generic way is a huge undertaking, I tried instead to just have scan.bro do everything itself. We may not be able to easily fix sumstats, but I think we can easily fix scan.bro by making it not use sumstats.
To see if this was even viable or a waste of time I wrote the script: it works. It sends new scan attempts to the manager and stores them in a similar '&read_expire = 15 mins' table. This should detect everything that the 1.5 based version did, plus all the fully random scans that were previously missed. And with the simpler unified data structure and capped set sizes it will use almost zero resources.
Attached is the code I just threw on our dev cluster. It's the implementation of "What is the absolute simplest thing that could possibly work". It uses 1 event and 2 tables, one for the workers and one for the manager.
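For readers without the attachment, the shape of that design can be modelled in a few lines of Python. This is only a sketch of the idea described above (one event, one worker-side table, one manager-side table with a read-style expiry and a capped set), not the attached script: every class, method and variable name here is hypothetical, the addresses are documentation examples, and the worker table in this model never expires entries the way a real Bro table would.

```python
THRESHOLD = 25     # attempts before a notice; 25 is the value mentioned in this thread
EXPIRE = 15 * 60   # seconds, mirroring '&read_expire = 15 mins'


class Manager:
    def __init__(self):
        self.attempts = {}    # attacker -> set of (victim, port)
        self.last_seen = {}   # attacker -> time of last access
        self.notices = []

    def scan_attempt(self, now, attacker, victim, port):
        # Emulate a read-style expiry: drop the entry if it sat idle too long.
        if attacker in self.attempts and now - self.last_seen[attacker] >= EXPIRE:
            del self.attempts[attacker]
        seen = self.attempts.setdefault(attacker, set())
        self.last_seen[attacker] = now
        if len(seen) >= THRESHOLD:
            return  # capped: already reported, do not let the set grow
        seen.add((victim, port))
        if len(seen) == THRESHOLD:
            self.notices.append(attacker)


class Worker:
    def __init__(self, manager):
        self.manager = manager
        self.recent = {}  # (attacker, victim, port) -> time last forwarded

    def failed_connection(self, now, attacker, victim, port):
        key = (attacker, victim, port)
        last = self.recent.get(key)
        if last is not None and now - last < EXPIRE:
            return  # duplicate: this attempt was forwarded recently
        self.recent[key] = now
        # Stand-in for the single event sent from worker to manager.
        self.manager.scan_attempt(now, attacker, victim, port)


# A fully random slow scan: a different host and port every 14 minutes.
manager = Manager()
worker = Worker(manager)
for i in range(25):
    worker.failed_connection(i * 14 * 60, "198.51.100.7", f"10.0.0.{i}", 1000 + i)
print(manager.notices)
```

Because the manager counts distinct (victim, port) pairs per attacker rather than ports per host or hosts per port, the randomized scan in the usage example still reaches the threshold.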
What does this look like from a CPU standpoint?
This graph shows a number of experiments.
* The first block around 70% is the unified sumstats-based scan.bro plus a hacked-up sumstats/cluster.bro that does data transfer more efficiently.
* The next block at 40% is the unified scan.bro hacked up to make the manager do all the sumstats processing (this worked, but had issues).
* The small spike back up to 70% is a return to the unified scan.bro that is in git, with the threshold changed back to 25.
* The spike up to 170-200% is a return to the stock sumstats/cluster.bro. This is what 2.5 would be with the sumstats-based scan.bro.
* The drop back down to 40% is the switch to the attached scan.bro, which does not use sumstats at all.
The 'duration' is still a TODO in the notices, but otherwise everything works. I want to get the start time directly from the time information in the table, but I'm not sure if Bro exposes it or even stores it in a usable way. If there's no way to get it out of the table, I just need to track when an attacker is first seen separately, which is easy enough to do.
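The fallback of tracking the first-seen time separately could look roughly like this, again as a Python model rather than Bro code. The side table and the three function names are hypothetical, and the model assumes the side entry is removed whenever the main tracking entry expires.

```python
first_seen = {}  # attacker -> time of first attempt (hypothetical side table)


def record_attempt(attacker, now):
    # Only the first attempt sets the start time; later ones leave it alone.
    first_seen.setdefault(attacker, now)


def duration(attacker, now):
    # Time from the first recorded attempt to 'now', for use in the notice.
    return now - first_seen[attacker]


def expire(attacker):
    # Call this when the main tracking entry expires so both stay in sync.
    first_seen.pop(attacker, None)


record_attempt("198.51.100.7", 100)
record_attempt("198.51.100.7", 940)
print(duration("198.51.100.7", 940))
```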
- Justin Azoff