Robin Sommer wrote:
> if I understand you correctly, there are actually two problems here:
> - Bro is dropping many packets even when running at rather low CPU.
Yes, that is the way it seemed when I didn't have restrict filters
turned on. When the cluster started, the CPU for the Bro process would
be high, but would drop down to 20-40% even though many packets were
being dropped after filtering.
> - after a few days, Bro hangs with 99% CPU and stalls.
Partially correct. Bro appears to be hanging, but the CPU is at 0%, and
the DroppedPackets/received ratio was banging against 99% just before it
started to hang. I haven't restarted the cluster yet, so here are the
relevant lines from top:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
51061 XXXXXX 1 -20 0 1207M 843M swread 1 606:53 0.00%
51082 XXXXXX 1 44 5 31556K 228K select 0 19:29 0.00%
I tried attaching to the process with the large TIME value and running
bt. Is that the right way to get you a backtrace?

$ gdb `which bro-1.4-robin` 51061
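(gdb) bt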
#0 0x081d0e96 in free (mem=0xd724e28) at malloc.c:4229
#1 0x285cfc01 in operator delete () from /usr/lib/libstdc++.so.6
#2 0x080a8f0a in ~Dictionary (this=0x99cd4a0) at Dict.cc:101
#3 0x081c7348 in ~TableEntryValPDict (this=0x99cd4a0) at Val.h:49
#4 0x081c42ac in ~TableVal (this=0x99cd408) at Val.cc:1697
#5 0x081c0e28 in TableVal::DoExpire (this=0x8669d60, t=1244434191.756459)
#6 0x081a9be2 in PQ_TimerMgr::DoAdvance (this=0x82f2a18,
new_t=1244434191.756459, max_expire=300) at Timer.cc:164
#7 0x0813ff09 in expire_timers (src_ps=0x90495a0) at Net.cc:392
#8 0x0813ffbd in net_packet_dispatch (t=1244434191.756459, hdr=0x90495d8,
pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0, pkt_elem=0x0)
#9 0x08140549 in net_packet_arrival (t=1244434191.756459, hdr=0x90495d8,
pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0) at Net.cc:496
#10 0x0814ef1f in PktSrc::Process (this=0x90495a0) at PktSrc.cc:199
#11 0x081402b5 in net_run () at Net.cc:526
#12 0x080501be in main (argc=454545480, argv=0xbfbfeb28) at main.cc:1056
Here is the bt from the other process just in case it helps.
$ gdb `which bro-1.4-robin` 51082
#0 0x286f8da3 in select () from /lib/libc.so.7
#1 0x081617fa in SocketComm::Run (this=0xbfbfe770) at
#2 0x0816629a in RemoteSerializer::Fork (this=0x82fa580)
#3 0x081664aa in RemoteSerializer::Init (this=0x82fa580)
#4 0x0804fbab in main (argc=-2147483647, argv=0xbfbfeb28) at main.cc:956
Is that correct?
> Regarding the former: generally, at 20-30% CPU Bro shouldn't drop any
> significant amount of packets; there's no throttling mechanism or
> anything like that. One guess here would be the operating system. What
> kind of system are you running on? Have you tried the tuning described
> on the tu-berlin.de page?
I'm running FreeBSD 7.1 on i386. I had tried tuning based on the Bro
Wiki, but that page showed sysctl debug.bpf_bufsize and sysctl
debug.bpf_maxbufsize, and those didn't work in FreeBSD 7.1.
The above tu-berlin.de link shows the following:
sysctl -w net.bpf.bufsize=10485760 (10M)
sysctl -w net.bpf.maxbufsize=10485760 (10M)
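(10485760 bytes = 10 x 1024 x 1024, i.e. 10 MB.)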
The Bro-Workshop-July07-tierney.ppt showed settings that should be added
to /etc/sysctl.conf.
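Presumably those were the same two knobs in sysctl.conf form, i.e.
something like:

net.bpf.bufsize=10485760
net.bpf.maxbufsize=10485760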
Based on these two examples, I am guessing net.bpf.bufsize is the
default size a BPF buffer starts at, and net.bpf.maxbufsize is the
largest size an application is allowed to grow it to.
Here are my default values:

$ sysctl -a | grep net.bpf
According to the FreeBSD 7.1 manpage for sysctl, "The -w option has been
deprecated and is silently ignored". I'll try setting both to 10M, like
in the link you sent.
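In other words, the -w can just be dropped:

$ sysctl net.bpf.bufsize=10485760
$ sysctl net.bpf.maxbufsize=10485760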
I also added those values to the /etc/sysctl.conf so they get set on reboot.
I just restarted the cluster, and the bro-1.4-robin process is sitting
at 11-13% CPU. The DroppedPackets/received ratio is fluctuating between
3% and 25%. Shouldn't the CPU be maxing out before packets get dropped?
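(At 25%, that works out to one packet dropped for every four received.)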
> Is there any regularity in the timestamps of when the drops occur?
> Like in regular intervals? (But longer intervals than 10s, as that's
> just the reporting interval.)
In the previous email, it looks like the intervals were 10s, but there
was a gap of over a minute at epoch 1244425261.942659, which is right
before the cluster froze. I'll keep an eye out for that if it happens
again.
> I wouldn't be totally surprised if the state checkpointing is the
> culprit. To test that, can you remove the line "@load checkpoint"?
I haven't tried this yet. I'll see if the bpf buffer increase helps;
if not, I'll try unloading checkpoint.bro.