I'm in the process of adding flow cutoff functionality to VAST, in the
same vein as what we have in the TM. Several questions arose during the
implementation, and the TM paper did not help me answer them:
- Scope: what exactly counts toward the flow size? Because TM keys on
  the connection 5-tuple, I'd go for the cumulative transport-layer
  bytes, i.e., for TCP/UDP the full payload size, and for ICMP the
  variable-size data portion. However, one could also imagine
  counting bytes starting at the network layer.
- Directionality: is there a single cutoff value for both directions
  of the flow? Or does the cutoff mean N bytes per direction, so
  that the trace can end up with up to 2N bytes per flow?
- Eviction: how does TM evict old/stale entries? Naturally it makes
  sense to age out entries upon connection termination: for TCP
  after seeing a FIN or RST, and for ICMP perhaps after seeing the
  response to a corresponding request. For UDP, a timer-based
  approach seems necessary.
  Moreover, the connection state table must not grow arbitrarily and
  should be capped at some size. What is a typical number of concurrent
  flows at, say, UCB? In the case of scanning, adversarial activity,
  or, more generally, any sudden increase in the flow count due to
  numerous (small) connections, we need to evict existing flow
  counters. Evicting a random entry would most likely hit one of the
  many small flows, which seems like a robust, stateless design.
  Alternatively, one could come up with a scheme that walks the flow
  table periodically (say, every 100k packets) once it has reached its
  maximum size and evicts all entries below a given threshold (see the
  sketch below).
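To make the cutoff and eviction questions concrete, here is a minimal
C++ sketch of a flow table with per-direction byte counters keyed by the
5-tuple, a cutoff check, and random eviction when the table is full. All
names (FlowKey, FlowTable, cutoff, max_flows) are hypothetical and not
taken from VAST or the TM code:

#include <cstdint>
#include <random>
#include <unordered_map>

// Hypothetical sketch only; not VAST or TM code.
struct FlowKey {
    uint32_t src_ip, dst_ip;          // IPv4 only, for brevity
    uint16_t src_port, dst_port;
    uint8_t  proto;
    bool operator==(const FlowKey& o) const {
        return src_ip == o.src_ip && dst_ip == o.dst_ip &&
               src_port == o.src_port && dst_port == o.dst_port &&
               proto == o.proto;
    }
};

struct FlowKeyHash {
    size_t operator()(const FlowKey& k) const {
        size_t h = k.src_ip;          // simple mix, for illustration only
        h = h * 131 + k.dst_ip;
        h = h * 131 + ((size_t(k.src_port) << 16) | k.dst_port);
        return h * 131 + k.proto;
    }
};

struct FlowState {
    uint64_t bytes_orig = 0;   // originator -> responder payload bytes
    uint64_t bytes_resp = 0;   // responder -> originator payload bytes
    double   last_seen  = 0;   // for timer-based aging (UDP/ICMP)
};

class FlowTable {
public:
    FlowTable(uint64_t cutoff, size_t max_flows)
        : cutoff_(cutoff), max_flows_(max_flows) {}

    // Returns true while the per-direction cutoff has not been reached,
    // i.e., the packet's payload should still be recorded.
    bool record(const FlowKey& key, bool is_orig, uint64_t payload, double ts) {
        if (flows_.size() >= max_flows_ && !flows_.count(key))
            evict_random();    // most flows are small, so this likely hits one
        FlowState& fs = flows_[key];
        fs.last_seen = ts;
        uint64_t& ctr = is_orig ? fs.bytes_orig : fs.bytes_resp;
        if (ctr >= cutoff_)
            return false;
        ctr += payload;
        return true;
    }

    // Call on TCP FIN/RST, an ICMP reply, or a UDP/ICMP timeout.
    void remove(const FlowKey& key) { flows_.erase(key); }

private:
    void evict_random() {
        // Pick a random non-empty bucket and drop its first entry.
        std::uniform_int_distribution<size_t> dist(0, flows_.bucket_count() - 1);
        for (;;) {
            size_t b = dist(rng_);
            if (flows_.bucket_size(b) > 0) {
                flows_.erase(flows_.begin(b)->first);
                return;
            }
        }
    }

    uint64_t cutoff_;
    size_t max_flows_;
    std::unordered_map<FlowKey, FlowState, FlowKeyHash> flows_;
    std::mt19937 rng_{std::random_device{}()};
};

The periodic threshold walk would instead iterate over flows_ every N
packets once the table is full and erase all entries whose counters fall
below the chosen threshold.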
Any thoughts on the above points would be much appreciated.
Matthias
Repository : ssh://git@bro-ids.icir.org/time-machine
On branch : topic/naokieto/ipv6
>---------------------------------------------------------------
commit f984098ab8aa5cd1f6a84c9cbf9ce56621f45efa
Author: NaokiEto <neto(a)caltech.edu>
Date: Thu Aug 28 21:39:07 2014 -0400
This is my final commit before leaving lab for this year.
I fixed the issues Coverity found in the hash table size changes. Basically,
I was not always guaranteed to create a new hash table, so that has been
fixed. Also, the number 1.82 in the code has some explanation, but it
may be refined after more experiments.
>---------------------------------------------------------------
f984098ab8aa5cd1f6a84c9cbf9ce56621f45efa
src/Index.cc | 20 +++++++++++++++++---
src/IndexHash.cc | 5 ++++-
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/src/Index.cc b/src/Index.cc
index c354706..9b94b16 100644
--- a/src/Index.cc
+++ b/src/Index.cc
@@ -310,8 +310,11 @@ void Index<T>::addEntry(IndexField *iqe) {
/* Balance number of hash buckets */
/* Hash has twice as many buckets as entries. shrink.
* yes, we want to compare the size of cur with the # entries of old (formerly new hash table) */
-
- if (hash_size > 2.05*old->getNumEntries()) {
+ /* Why 2.24? The biggest ratio between consecutive numbers in the hash table size list was 29/13, which is 2.23. So, if the hash table
+ * size is more than 2.24 times bigger than the number of entries, we can safely shrink the hash table size, to the element before in the
+ * hash table size array.
+ */
+ if (hash_size > 2.24*old->getNumEntries()) {
//if (hash_size < old->getNumEntries()) {
// Note that we delete cur - this means we delete the formerly old hash table, which has been written to disk
//tmlog(TM_LOG_NOTE, "Index.cc", "we are about to delete the current (formerly old) hash table");
@@ -321,10 +324,18 @@ void Index<T>::addEntry(IndexField *iqe) {
//tmlog(TM_LOG_ERROR, "Index.cc:addEntry", "we are decreasing hash table size to %d and the number of entries in old hash is %d with %d buckets", hash_size_index - 1, old->getNumEntries(), old->getNumBuckets());
cur = new IndexHash(hash_size_index - 1);
}
+ else
+ cur = new IndexHash(0);
}
/* Hash has half as many buckets than entries. enlarge */
- else if (1.95 * hash_size < old->getNumEntries()) {
+ /* UPDATED COMMENT: If 1.82 * hash_size is less than the number of entries, then enlarge the hash table (1.82 was chosen instead of 3/2 = 1.5 because
+ * 1.82 is for the 53/29, which is the smallest and more likely to happen than 3/2. We enlarge it by two (approx factor of 4) via hash table
+ * size array. We do this because sometimes, the number of entries were observed to increase by a factor of 4 (tested via the numerous tmlogs you see
+ * littering this area of the code). Also, 1.82 worked better than simply 1 or 1.95 in terms of packet drops, it appears. More testing may be needed
+ * to determine the optimal number.
+ */
+ else if (1.82 * hash_size < old->getNumEntries()) {
//else if (1.9*hash_size < old->getNumEntries()) {
//else if (hash_size > old->getNumEntries()) {
// Note that we delete cur - this means we delete the formerly old hash table, which has been written to disk
@@ -335,6 +346,9 @@ void Index<T>::addEntry(IndexField *iqe) {
//tmlog(TM_LOG_ERROR, "Index.cc:addEntry", "we are increasing hash table size to %d and the number of entries is %d with %d buckets ", hash_size_index + 2, old->getNumEntries(), old->getNumBuckets());
cur = new IndexHash(hash_size_index + 2);
}
+ // Based on experimentation, it should never go here. Hash table sizes range at around 10,000->100,000 only.
+ else
+ cur = new IndexHash(old->getNumEntries() + (old->getNumEntries() % 2 + 1));
/*
else
{
diff --git a/src/IndexHash.cc b/src/IndexHash.cc
index e37e6a5..66773f4 100644
--- a/src/IndexHash.cc
+++ b/src/IndexHash.cc
@@ -73,7 +73,10 @@ IndexHash::IndexHash(int size_index) {
*/
numEntries = 0;
numBucketsIndex = size_index;
- numBuckets = Primes[size_index];
+ if (size_index < 43)
+ numBuckets = Primes[size_index];
+ else
+ numBuckets = size_index;
htable = new hash_t[numBuckets];
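The shrink/grow heuristic in the Index.cc hunk above boils down to
comparing the bucket count against the entry count of the table that was
just rotated out. A standalone sketch follows; the prime list and all
names are hypothetical stand-ins, and only the 2.24/1.82 thresholds come
from the patch comments:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical prime-sized bucket list. Per the patch comments, the largest
// ratio between consecutive sizes in the real table is 29/13 ~ 2.23 and the
// smallest is 53/29 ~ 1.83, which is where 2.24 and 1.82 come from.
static const std::vector<size_t> kPrimes = {13, 29, 53, 97, 193, 389, 769, 1543};

// Decide the size index of the new hash table, given the current index and
// the number of entries in the table that was just written out.
size_t next_size_index(size_t size_index, size_t num_entries) {
    size_t buckets = kPrimes[size_index];
    if (buckets > 2.24 * num_entries && size_index > 0)
        return size_index - 1;                               // shrink one step
    if (1.82 * buckets < num_entries)
        return std::min(size_index + 2, kPrimes.size() - 1); // grow two steps
    return size_index;                                       // keep current size
}

Growing by two index steps roughly quadruples the bucket count, matching
the patch comment that the number of entries was observed to jump by a
factor of about four.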
Repository : ssh://git@bro-ids.icir.org/time-machine
On branch : topic/naokieto/ipv6
>---------------------------------------------------------------
commit 97e862bab618326dab53ba2668036629848d7b18
Author: NaokiEto <neto(a)caltech.edu>
Date: Thu Aug 28 18:05:21 2014 -0400
Updated some comments about prime number hash sizes and the number used in the
counter when deciding when to write indexes to disk.
Implemented Matthias' CMakeLists code for optimizations (-O3)
>---------------------------------------------------------------
97e862bab618326dab53ba2668036629848d7b18
CHANGES | 19 +++++++++++++++----
CMakeLists.txt | 4 +++-
configure | 2 +-
src/Index.hh | 3 +++
src/IndexHash.cc | 2 ++
5 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/CHANGES b/CHANGES
index 3879688..de3d341 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,21 +1,31 @@
0.1-5 | 2014-08-26 15:35:00 -0800
+ * Implemented the creation of index and query directories by default if the user did not create the index and query directories. Also,
+ if indexes are not enabled, index directory is not created. If querying is not occuring, query directory is not created.
+
+ * Implemented a counter instead of the many calls to gettimeofday when determining when to write to disk, which costed a lot of CPU.
+
* Changed the hash table sizes to always be prime number, to help avoid clustering in the collisions lists. (Naoki Eto)
- * Added gperftools CPU profiler, which can be enabled (Naoki Eto)
+ * Added gperftools CPU profiler, which can be enabled by using --enable-gperftools-cpu in the ./configure option and
+ adding a name to profilepath in the configuration file (Naoki Eto)
* Changed the method for reading the configuration classes so that it is first ordered by precedence and then the highest precedence match is found (Naoki Eto)
* Implemented querying for IPv4 and IPv6 ip, conn2, conn3, and conn4 (Naoki Eto)
- * Implemented class directories that can be specified in the configuration file (Naoki Eto)
+ * Implemented class directories that can be specified in the configuration file. Example:
+ ...
+ filesize 2000m;
+ mem 100m;
+ classdir "/home/neto/data_http";
+ }
+ (Naoki Eto)
0.1-4 | 2014-07-18 16:53:50 -0800
* Implemented IPv6 support for the classes. (Naoki Eto)
- * Some querying for IPv6 addresses is enabled. (Naoki Eto)
-
* VLAN tags are taken into account w/o MPLS labels (Naoki Eto)
0.1-4 | 2013-02-07 14:37:50 -0800
@@ -25,3 +35,4 @@
0.1-3 | 2013-02-07 14:33:20 -0800
* Starting CHANGES.
+
diff --git a/CMakeLists.txt b/CMakeLists.txt
index e718977..83a4457 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -78,7 +78,9 @@ if (ENABLE_PERFTOOLS_CPU OR ENABLE_PERFTOOLS_DEBUG OR ENABLE_PERFTOOLS)
# perftools weren't found
endif ()
-set(CMAKE_CXX_FLAGS "-g -Wall")
+set(CMAKE_CXX_FLAGS "-g -Wall -O3")
+set(CMAKE_C_FLAGS "-g -Wall -O3")
+add_definitions("-O3")
#detect 32 or 64 bit compiler
# Also in config.h.in
diff --git a/configure b/configure
index 1b60a02..377beb1 100755
--- a/configure
+++ b/configure
@@ -25,7 +25,7 @@ Usage: $0 [OPTION]... [VAR=VALUE]...
Optional Features:
--enable-debug compile in debugging mode
- --enable-gperftools-cpu enable the use of gperftools' cpu profiler
+ --enable-perftools-cpu enable the use of gperftools' cpu profiler
Required Packages in Non-Standard Locations:
--with-broccoli=PATH path to libbroccoli install root
diff --git a/src/Index.hh b/src/Index.hh
index 1befe08..3cf0397 100644
--- a/src/Index.hh
+++ b/src/Index.hh
@@ -398,6 +398,9 @@ public:
*/
// returns 0 if lock was successfully achieved
+ // 500000 came from testing. In the above commented out code, you can see that previously, gettimeofday was used.
+ // Basically, I did a counter while it was doing gettimeofday, and found that it did around 500000 entries before trying
+ // disk write lock. It was pretty consistent on two different machines.
if (num_of_entries < 500000)
{
num_of_entries++;
diff --git a/src/IndexHash.cc b/src/IndexHash.cc
index 613fcff..e37e6a5 100644
--- a/src/IndexHash.cc
+++ b/src/IndexHash.cc
@@ -7,6 +7,8 @@
*
* The indexes in the hash table are calculated by doing key mod (hash table
* size).
+ * The statement to prove is: number of buckets used = hash_table_size / GCF(hash_table_size, factor).
+ *
* We can note that there exists y / GCF(y, n) distinct instances of m * n (mod y) for all m.
* To see why this is true, we can let a = GCF(y, n).
* Let n be a particular key, and y = a * x and n = a * b
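The "statement to prove" added to the IndexHash.cc comment above --
number of buckets used = hash_table_size / GCF(hash_table_size, factor) --
can be sanity-checked by brute force. A throwaway sketch (assuming C++17
for std::gcd), not part of the TM sources:

#include <cstdio>
#include <numeric>   // std::gcd (C++17)
#include <set>

int main() {
    for (unsigned y = 2; y <= 60; ++y) {          // y: hash table size
        for (unsigned n = 1; n <= 60; ++n) {      // n: the factor/key
            std::set<unsigned> residues;
            for (unsigned m = 0; m < y; ++m)      // all multiples mod y
                residues.insert((m * n) % y);
            if (residues.size() != y / std::gcd(y, n)) {
                std::printf("counterexample: y=%u n=%u\n", y, n);
                return 1;
            }
        }
    }
    std::printf("claim holds for all y, n <= 60\n");
    return 0;
}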
Repository : ssh://git@bro-ids.icir.org/time-machine
On branch : topic/naokieto/ipv6
>---------------------------------------------------------------
commit f5f4c9d405c192659c21e48c5dcff18ba1798083
Author: NaokiEto <neto(a)caltech.edu>
Date: Thu Aug 28 18:31:12 2014 -0400
Fixed a memory illegal-access error found by Coverity. This error occurred in
Index.cc and was already present in the original tm-master code (the IPv4-only
implementation). Basically, iqe could be deleted and then accessed afterwards.
To find where this occurred, grep for "delete iqe".
>---------------------------------------------------------------
f5f4c9d405c192659c21e48c5dcff18ba1798083
src/Index.cc | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/Index.cc b/src/Index.cc
index aeb5717..c354706 100644
--- a/src/Index.cc
+++ b/src/Index.cc
@@ -376,6 +376,9 @@ void Index<T>::addEntry(IndexField *iqe) {
iqe->ts-IDX_PKT_SECURITY_MARGIN*idx_thread_iat, iqe->ts);
cur->add(iqe, ie_n);
+ // update last_updated time
+ last_updated = iqe->ts;
+
//last_updated = iqe->ts;
//ProfilerStop();
@@ -389,7 +392,7 @@ void Index<T>::addEntry(IndexField *iqe) {
delete iqe;
}
// update last_updated time
- last_updated = iqe->ts;
+ //last_updated = iqe->ts;
// Note that old hash table is now the formerly current hash table. So, it is in the memory, and we can do look up on it
// This must be the table that Aashish says that indexes do not persist, part of of index persistence (other part is
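The bug described above is a textbook use-after-free: iqe->ts was read
on a code path where iqe may already have been deleted. A minimal,
hypothetical reconstruction (not the actual Index.cc code):

// Hypothetical reconstruction; names and structure are illustrative only.
struct Entry {
    double ts;         // packet timestamp carried by the entry
    bool   duplicate;  // whether an equivalent entry already exists
};

double last_updated = 0;

void add_entry_buggy(Entry* iqe) {
    if (iqe->duplicate)
        delete iqe;             // drop duplicates; otherwise ownership would
                                // pass to the index (omitted here)
    last_updated = iqe->ts;     // BUG: iqe may have just been deleted
}

void add_entry_fixed(Entry* iqe) {
    last_updated = iqe->ts;     // read the timestamp before any delete path
    if (iqe->duplicate)
        delete iqe;
}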
Repository : ssh://git@bro-ids.icir.org/time-machine
On branch : topic/naokieto/ipv6
>---------------------------------------------------------------
commit 3f08ec896fb1a5b3b38f99faed2082666af0e633
Author: NaokiEto <neto(a)caltech.edu>
Date: Thu Aug 28 18:05:21 2014 -0400
Updated some comments about prime number hash sizes and the number used in the
counter when deciding when to write indexes to disk.
Implemented Matthias' CMakeLists code for gperftools
>---------------------------------------------------------------
Hi Matthias,
I am not currently working on re-using the indexes across restarts. TM
with indexes enabled uses noticeably more CPU time than TM with indexes
disabled (about 40-50% more CPU on diag3). I don't think the indexes are
implemented very efficiently; they should probably be revamped to use
another method that doesn't rely on AVL trees.
Also, the 16-byte char array used to store IPv6 and IPv4 addresses is
not very efficient for comparisons, and since there are many lookup
calls to the hash tables, this results in high CPU usage (see the
sketch below).
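As a purely hypothetical illustration of a cheaper comparison -- not
code from the TM tree -- the 16 raw bytes could be held as two 64-bit
words, so that equality is two integer compares instead of a byte-wise
loop:

#include <cstdint>
#include <cstring>

// Hypothetical sketch: a 128-bit address stored as two 64-bit words.
struct Addr128 {
    uint64_t hi, lo;

    static Addr128 from_bytes(const unsigned char raw[16]) {
        Addr128 a;
        // memcpy sidesteps alignment/aliasing issues; byte order is
        // irrelevant for pure equality tests.
        std::memcpy(&a.hi, raw, 8);
        std::memcpy(&a.lo, raw + 8, 8);
        return a;
    }

    bool operator==(const Addr128& o) const {
        return hi == o.hi && lo == o.lo;   // two word compares
    }
};

// Matching hash functor for use with a hash-table index.
struct Addr128Hash {
    size_t operator()(const Addr128& a) const {
        return static_cast<size_t>(a.hi ^ (a.lo * 0x9e3779b97f4a7c15ULL));
    }
};

An IPv4 address would simply occupy one word with the other zeroed, so
the same comparison works for both families.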
I think it would be better to first revamp the indexing/querying method
before working out a way to reuse indexes and allow querying of data
from before a restart, if that turns out to be necessary.
Best,
Naoki
> Good to know. I recall that a major pain point of the current TM was
> that it's impossible to reuse the indexes across restarts. Is that
> something you're working on as well?
>
> Matthias