Recent Zeek Performance Improvements

Over the course of the past year, we’ve made a number of changes to Zeek that have positively affected its performance, sometimes considerably. This blog post highlights a few places and changes that are known to have improved performance between Zeek 5.2 and Zeek 6.2.

Selected Pull Requests

Improvements for non-selectable Packet Sources (6.0)

Zeek’s handling of packet sources without selectable file descriptors used significant amounts of CPU time on non-productive busy polling. This applies to certain high-performance capture cards like Napatech or older Myricom NICs. At least one bug in the IO loop caused polling to happen 10x more often than expected, and the PktSrc::GetNextTimeout() method requested a packet just to decide the next timeout for the IO loop. Further, the poll frequency and packet source timeouts were hard-coded, requiring recompilation to tweak and experiment with them.

The bug was squashed. The GetNextTimeout() implementation now uses a last-value heuristic: if Zeek just saw a packet, it expects another one and returns no timeout; otherwise it returns a short timeout to the IO loop to relax busy polling. Zeek’s main loop poll behavior and the timeout have also become configurable, enabling NIC-specific tuning and experimentation without recompilation.

References

Increased PCAP File Read Buffer (6.2)

On most Linux systems, the default buffer size for a FILE opened via fopen() is 4096 bytes. When reading large PCAP files, the overhead of the individual read() system calls needed to refill such a small buffer becomes noticeable. Zeek now uses a default buffer size of 128 KB, significantly reducing the number of read() system calls required to process a large PCAP file. You can configure the buffer size via the option PCAP::bufsize_offline_bytes. Run zeek -r file.pcap PCAP::bufsize_offline_bytes=4096 to determine the impact this makes on your own system. This change only applies to PCAP file reading.

References

Packet Analysis: Removal of shared_ptr copying (6.2)

The packet analysis framework’s Lookup() method used to return a std::shared_ptr to a packet analyzer for every network layer found in a packet (eth, vlan, IP, UDP/TCP, tunnel analyzers). Creating and destroying std::shared_ptr instances proves expensive due to thread-safety guarantees and the use of atomic instructions. Doing so multiple times per packet only exacerbates the problem. Worse, Zeek did not need to do this work in the first place, since the packet analyzer instance pointed to exists for the lifetime of the Zeek process. We switched to returning a const reference to the std::shared_ptr instead. (We could possibly have used raw pointers directly, too.)

References

Removal of Segment Profiling (6.2)

This one is embarrassing. Until version 6.2, Zeek instantiated a SegmentProfiler object (a historic feature to record user and system time) in certain very hot execution paths. Even with segment profiling disabled, creating the SegmentProfiler instance on the stack would zero-initialize 144 bytes of the contained struct rusage instance, resulting in significant and non-productive overhead for packet and event processing.

References

Removal of event manager’s flare (6.2)

Zeek’s main loop wakes up every time a packet becomes available, an IO source needs serving, or a timeout of an IO source expires. Zeek’s event manager operates as an IO source, too. Previously, when Zeek queued the first event during a loop iteration, a write() system call to a pipe marked the event manager ready. Subsequently, a read() from the pipe cleared readiness.

The way the event manager is integrated, however, removes the need to use a pipe. IO sources support indicating readiness via a timeout to the main loop. The event manager now communicates a timeout of 0.0 to the loop whenever events are available, indicating immediate readiness. No need for system calls here.

References

Remove snprintf() for hex printing (6.2)

To format digest hashes or MAC address values, Zeek used snprintf("%02x"). While not overly hot, these places were visible in perf top output on production systems. For example, hashing X509 certificate files or formatting a connection’s endpoint MAC address involves hex formatting, and calling snprintf() to convert a single byte to its hex representation is not cheap. These places now use zeek::util::bytetohex(), which internally uses a 16-byte hex lookup table directly and requires no parsing of a format string.

References

String built-in function improvements (6.2)

Functions for string handling like ends_with() or starts_with() often copied the input string or called CheckString() repeatedly. We improved these by leveraging std::string_view where possible.

References

Spicy Performance Improvements (6.2)

Version 1.10 of Spicy (bundled with the upcoming Zeek 6.2 release) comes with numerous improvements to the throughput of generated parsers. We revisited both the code typically generated for parsers and the runtime libraries it uses. Coarsely summarized, this work centered around

  • reduction of allocations during parsing
  • reduction of data copies during parsing
  • use of dedicated, hand-checked implementations for automatically generated code to avoid overhead from safety checks in the runtime libraries

With these changes, we see throughput improvements for some parsers in the range of 20-30%.

References

Tools and Methods

perf and Firefox Profiler

On Linux, we find many of these issues using the perf tooling and visualizing results in Firefox Profiler.

This works particularly well when code is compiled with frame pointers enabled throughout, including external dependencies like libpcap or jemalloc. We record a profile of a running Zeek process or one-off command using perf record -g. A subsequent invocation of perf script converts the recording into a format compatible with Firefox Profiler:

$ perf record -g zeek -C -b -r 200k-syn-packets.pcap
$ perf script -F+pid > /tmp/200k-syn-packets.profile

Loading 200k-syn-packets.profile into an instance of Firefox Profiler then lets us dig into the profile:

perf record --call-graph dwarf

With normal frame pointers, functions defined in C++ class headers are not visible in the flamegraph as they are inlined. For these cases, we find it useful to compile Zeek with dwarf debugging information and record a profile using perf record --call-graph dwarf. This also helps visibility of functions from std::map, std::string, std::shared_ptr, etc.

To demonstrate the difference, we focus on the Analyzer::Analyzer() constructor below. Without DWARF, the result looks suspicious but does not point to anything specific:

When using --call-graph dwarf, there’s a lot more detail: roughly 75% of the samples in Analyzer’s constructor fall into the ComponentManager::Lookup() method and its subsequent std::map lookups. There’s now a clearer picture: we normalize an analyzer’s name to all upper-case and look up the analyzer’s tag in a std::map. This is not productive: an analyzer’s tag is constant, so we should be able to cache the lookup result. The conversion from const char* to std::string for the Lookup() call is additionally visible when hovering over one of the bars further to the left.

perf annotate

In addition to flamegraphs, the output of perf annotate can be useful, although it is rather low-level, showing individual instructions within methods. At the time, it helped find clues about the std::shared_ptr overhead: drilling into the Dispatcher::Lookup() method showed ~72% of its samples in a lock addl instruction.

$ perf annotate zeek::packet_analysis::Dispatcher::Lookup

 0.17 : 913d8f: mov %rcx,%rax
 2.50 : 913d92: ret
 0.00 : 913d93: nopl 0x0(%rax,%rax,1)
      : 72 __gnu_cxx::__atomic_add(int volatile*, int):
      : 71 { __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
71.56 : 913d98: lock addl $0x1,0x8(%rdx)
 0.00 : 913d9d: jmp 913d8f <zeek::packet_analysis::Dispatcher::Lookup(unsigned int) const+0x5f>

hyperfine

Lastly, to compare the PCAP processing runtime of two different Zeek versions, hyperfine is a great benchmarking tool. Primarily, it avoids custom shell scripting for timing command execution and computing statistics. To compare Zeek 6.0.3 with a recent master build, one example invocation is:

$ hyperfine -L zeek /opt/zeek-6.0.3/bin/zeek,/opt/zeek-dev-prod/bin/zeek 'taskset -c 1 {zeek} -C -b -r 200k-syn-packets.pcap'

Benchmark 1: taskset -c 1 /opt/zeek-6.0.3/bin/zeek -C -b -r 200k-syn-packets.pcap
  Time (mean ± σ): 2.908 s ± 0.015 s [User: 2.303 s, System: 0.603 s]
  Range (min … max): 2.886 s … 2.942 s 10 runs

Benchmark 2: taskset -c 1 /opt/zeek-dev-prod/bin/zeek -C -b -r 200k-syn-packets.pcap
  Time (mean ± σ): 2.575 s ± 0.015 s [User: 1.994 s, System: 0.579 s]
  Range (min … max): 2.557 s … 2.607 s 10 runs

Summary
  'taskset -c 1 /opt/zeek-dev-prod/bin/zeek -C -b -r 200k-syn-packets.pcap' ran
    1.13 ± 0.01 times faster than 'taskset -c 1 /opt/zeek-6.0.3/bin/zeek -C -b -r 200k-syn-packets.pcap'

This indicates that processing a PCAP file of just SYN packets in bare mode is 1.13x faster with the latest version compared to Zeek 6.0.

Summary

Compared to prior versions, Zeek 6.2 and later should provide you with a nice boost in performance, possibly allowing you to reduce required infrastructure.

If such changes excite you and you are technical enough to profile and hack on Zeek, we would love to see or hear where else in Zeek we can squeeze out more cycles. Pull requests would be amazing, of course.

Further, if you are interested in doing these kinds of things as your day job, we are currently looking to grow the open source team at Corelight.

Please reach out if you have questions or input!


These seem like excellent reasons for upgrading to 6.2, which is now available.

Sincerely,

Richard