Open Source Intrusion Detection Appliances
MetaFlows' intrusion detection appliances are based on robust open standards that integrate quickly into any existing infrastructure. They are custom-built with the best hardware components available today to provide reliable and cost-effective packet processing from 50 Mbps to 10 Gbps.
MSS-1/4/8C
(100Mbps-1Gbps)
MSS-24C
(1-3Gbps)
MSS-64C
(3-7Gbps)
MSS-UTM-1C
(50Mbps)
Unlike our competitors, we detail how open standards (both hardware and software) are used to build extremely cost-effective network security appliances. Below you will find all the information you need to review what we do and reproduce our results on your own hardware.
You can either follow the steps below or register at nsm.metaflows.com to download an automated installation script for CentOS or RHEL 7. Once the script finishes, all the system components will be in place and you can decide whether to go the free, open-source route or continue with our complete product free for two weeks through a SaaS subscription.
Robust, Open Source Multiprocessing With PF_RING
MetaFlows supports open source.
MetaFlows' contributions to the PF_RING project have produced open source technology capable of scaling network monitoring from 10 Mbps to 10 Gbps. Below, we summarize the results of our peer-reviewed testing, which shows that it is possible to build extremely effective network monitoring appliances on inexpensive commodity hardware.
PF_RING at 1 Gbps on CentOS 6.x
We have modified PF_RING to work with inline Snort (while still supporting the existing passive multiprocessing functionality). PF_RING load-balances the traffic to be analyzed by hashing the IP headers into multiple buckets. This allows PF_RING to spawn multiple instances of Snort, each processing a single bucket, achieving higher throughput through multiprocessing. To take full advantage of this, you need a multi-core processor (such as an Intel i7 with 8 processing threads). This should also work well with dual- or quad-processor boards to increase parallelism even further.
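As a rough sketch of this bucket-per-process model (the interface, configuration path, and core count below are placeholders; the full inline and 10 Gbps recipes follow later on this page), each Snort worker joins the same PF_RING cluster ID and the kernel hashes every flow to exactly one of them:
N=8   # one Snort instance per core (assumed core count)
for i in `seq 1 $N`; do
  snort -c snort.conf -A console -i eth0 \
    --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 &
done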
Our test system had the following setup:
- Processor: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
- Ethernet: Dual Intel e1000e
- RAM: 4 GB
- Driver: PF_RING e1000e driver, transparent_mode=1
- Snort: 2.9.0.x
As the graphs illustrate, running inline with a single core can only sustain 100 Mbps or less (that is what is already available today). With PF_RING inline, we parallelize the inline processing on up to 8 cores, achieving almost 700 Mbps sustained throughput. Performance numbers are greatly affected by the type and number of Snort rules used, as well as by the type of traffic being processed. However, it appears that no matter what your setup is, PF_RING inline with 8 cores should achieve 700-800 Mbps sustained throughput with approximately 200 µs of latency. That is impressive performance!
[Graphs: sustained inline throughput vs. number of cores, measured with 6,765 Emerging Threats Pro rules and with 5,267 VRT rules.]
Reaching 5 Gbps Throughput on Commodity Hardware with PF_RING NAPI or DNA
To reach 5 Gbps sustained throughput, we needed better hardware. In this experiment, we are running Snort on a dual processor board with a total of 24 hyper-threads (using the Intel X5670). Besides measuring Snort processing throughput while varying the number of rules, we also:
- Changed the compiler used to compile Snort (GCC vs. ICC) and
- Compared PF_RING in NAPI mode (running 24 Snort processes in parallel) with PF_RING Direct NIC Access (DNA) technology (running 16 Snort processes in parallel)
PF_RING NAPI performs the hashing of the packets in software and has a traditional architecture where the packets are copied to user space by the driver. Snort is parallelized using 24 processes that are allowed to float on the 24 hardware threads while the interrupts are parallelized on 16 of the 24 hardware threads.
PF_RING DNA performs the hashing of the packets in hardware (using the RSS functionality of the Intel 82599) and relies on 16 hardware queues. The DNA driver allows 16 instances of Snort to read packets directly from the hardware queues, thereby virtually eliminating system-level processing overhead. There are limitations, though. PF_RING DNA:
- Supports a maximum of 16x parallelism per 10G interface,
- Only allows 1 process to attach to each hardware queue, and
- Costs a bit of money or requires Silicom cards (well worth it)
Number 2 in the list above is a significant limitation, because it does not allow multiple processes to receive the same data. For example, if you run tcpdump -i dna0, you could not also run snort -i dna0 -c config.snort -A console at the same time; the second invocation would return an error.
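Written out as commands, the example looks like this (the exact error text depends on the PF_RING release):
tcpdump -i dna0 &                         # first consumer owns the dna0 queue
snort -i dna0 -c config.snort -A console  # a second attach to the same DNA queue fails with an error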
GCC is the standard open source compiler that comes with CentOS 6 and virtually all other Unix systems. It is the foundation of open source and without it we would still be in the stone age (computationally).
ICC is a proprietary Intel compiler that goes much further in extracting instruction-level and data-level parallelism from modern multi-core processors such as the Intel i7 and Xeon.
Results
Our results below are excellent and show that you can build a 5-7 Gbps IDS using standard off-the-shelf machines and PF_RING. The system we used to perform these experiments had the following setup:
- Processor: Dual Intel(R) Xeon(R) X5670 CPU
- Ethernet: Intel 10 Gbps 82599 (ixgbe)
- RAM: 24 GB
- Driver: ixgbe-3.1.15-FlowDirector
- Snort: 2.9.0.x
The graph above shows the sustained Snort performance for 4 different configurations using a varying number of Emerging Threats Pro rules. As expected, the number of rules has a dramatic effect on performance for all configurations (the more rules, the lower the performance). In all cases, memory access contention is likely to be the main limiting factor.
Given our experience, we think that our setup is fairly representative of an academic institution. We have to admit that measuring Snort performance in absolute terms is difficult: no two networks are the same, and rule configurations vary even more widely. Nevertheless, the relative performance variations are important and of general interest. You can draw your own conclusions from the above graph; however, here are some interesting observations:
- At the high end (6900 rules) ICC makes a big difference by increasing the throughput by ~1 Gbps (25%)
- GCC is just as good at maintaining throughput around 5 Gbps
- PF_RING DNA is always better than PF_RING NAPI
Configuring PF_RING for 700 Mbps Processing
Instructions
- Install these packages:
yum install kernel-devel libtool subversion automake make autoconf pcre-devel \
  libpcap-devel flex bison byacc gcc gcc-c++ zlib-devel
- Download and install libdnet-1.12.
- Download our modified version of PF_RING and build the PF_RING inline libraries and kernel module.
tar xvfz PF_RING.tgz
cd PF_RING; make clean
cd kernel; make clean; make; make install
cd ../userland/lib; export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib; export LIBS='-L/usr/local/lib'; ./configure; make clean; make; make install
cd ../libpcap; export LIBS='-L/usr/local/lib -lpfring -lpthread'; ./configure; make clean; make; make install; make clean; make; make install-shared
ln -s /usr/local/lib/libpfring.so /usr/lib/libpfring.so
- Download the daq-0.6.2 libraries and build:
tar xvfz daq-0.6.2.tgz
cd daq-0.6.2; chmod 755 configure
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS="-L/usr/local/lib -lpcap -lpthread"
./configure --disable-nfq-module --disable-ipq-module \
  --with-libpcap-includes=/usr/local/include \
  --with-libpcap-libraries=/usr/local/lib \
  --with-libpfring-includes=/usr/local/include/ \
  --with-libpfring-libraries=/usr/local/lib
make clean; make; make install
- Go back to the PF_RING directory and build the daq interface module:
cd PF_RING/userland/snort/pfring-daq-module; autoreconf -ivf
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS='-L/usr/local/lib -lpcap -lpfring -lpthread'
./configure; make; make install
- Build Snort 2.9.x:
cd snort-2.9.x; make clean
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS='-L/usr/local/lib -lpfring -lpthread'
./configure --with-libpcap-includes=/usr/local/include \
  --with-libpcap-libraries=/usr/local/lib \
  --with-libpfring-includes=/usr/local/include/ \
  --with-libpfring-libraries=/usr/local/lib \
  --enable-zlib --enable-perfprofiling
make
make install
- Load the PF_RING module.
Important! The OS will try to load the PF_RING kernel module with default parameters whenever any application that uses PF_RING runs. The default parameters are wrong when running inline. Never run inline with tx_capture enabled! To prevent this, it is always a good idea to remove pf_ring.ko and reload it with the correct parameter before running inline.
rmmod pf_ring.ko
insmod pf_ring.ko enable_tx_capture=0
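A quick sanity check that the module came back up the way you intended (the /sys entries below only exist if this PF_RING build exports its parameters, so treat them as an optional check):
lsmod | grep pf_ring                                     # confirm the module is loaded
cat /proc/net/pf_ring/info                               # PF_RING status exposed under /proc
grep -H . /sys/module/pf_ring/parameters/* 2>/dev/null   # parameter values, if exported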
- Run Snort with as many instances as your system can handle, limited only by the value of CLUSTER_LEN in PF_RING/kernel/linux/pf_ring.h at compile time (and by your memory). Remember to replace the interfaces with the values for your instance.
ifconfig eth0 up
ifconfig eth1 up
snort -c snort.serv.conf -A console -y -i eth0:eth1 \
  --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 \
  --daq-mode inline -Q
- If you want even faster performance (about 20% more) and you have one of the Ethernet interfaces supported in PF_RING/drivers, you can run in transparent mode 1. We have only extensively tested the e1000e driver and we know it is very reliable. To use transparent mode 1 with an e1000e interface:
cd PF_RING/drivers/PF_RING_aware/intel/e1000e/e1000e-1.3.10a/src; make clean; make; make install
Now you need to replace the e1000e module in /lib/modules/`uname -r`/kernel/drivers/net/e1000e/ by either rebooting or removing the old one and loading the new driver. You also need to reload the pf_ring.ko module to enable transparent mode 1, this time also increasing the buffer size for extra oomph:
rmmod pf_ring.ko
insmod pf_ring.ko enable_tx_capture=0 transparent_mode=1 min_num_slots=16384
If you have any issues, go to the MetaFlows Google Group for support.
Configuring PF_RING for 5-7 Gbps Multiprocessing
Building and Running PF_RING NAPI
- Load the ixgbe driver. We found that setting the InterruptThrottleRate to 4000 was optimal for our traffic.
modprobe ixgbe InterruptThrottleRate=4000
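If you want to confirm that the driver build you loaded actually exposes this parameter (the stock in-kernel ixgbe typically does not, while Intel's out-of-tree driver does), a quick check along these lines should work:
modinfo -p ixgbe | grep -i interruptthrottlerate                     # is the parameter present in this build?
cat /sys/module/ixgbe/parameters/InterruptThrottleRate 2>/dev/null   # current value, if exported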
- Load PF_RING in transparent mode 2 and set a reasonable buffer size.
modprobe pf_ring transparent_mode=2 min_num_slots=16384
- Bring up the 10 GbE interface (in our case this was eth3).
ifconfig eth3 up
- Optimize the Ethernet device. We mostly turned off options which hinder throughput. Substitute eth3 with the interface appropriate to your instance.
ethtool -C eth3 rx-usecs 1000
ethtool -C eth3 adaptive-rx off
ethtool -K eth3 tso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 gso off
ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off
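To verify that the offloads were actually disabled (some NICs silently ignore settings they do not support), you can dump the current state; the grep pattern is just a convenience:
ethtool -k eth3 | egrep 'offload|segmentation|scatter'   # lower-case -k lists current offload settings
ethtool -c eth3                                          # current interrupt coalescing settings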
- Set up CPU affinity for interrupts based on the number of RX queues on the NIC, balanced across both processors. This may vary from system to system. Check /proc/cpuinfo to see which processor IDs are associated with each physical CPU.
printf "%s" 1 > /proc/irq/73/smp_affinity #cpu0 node0
printf "%s" 2 > /proc/irq/74/smp_affinity #cpu1 node0
printf "%s" 4 > /proc/irq/75/smp_affinity #cpu2 node0
printf "%s" 8 > /proc/irq/76/smp_affinity #cpu3 node0
printf "%s" 10 > /proc/irq/77/smp_affinity #cpu4 node0
printf "%s" 20 > /proc/irq/78/smp_affinity #cpu5 node0
printf "%s" 40 > /proc/irq/79/smp_affinity #cpu6 node1
printf "%s" 80 > /proc/irq/80/smp_affinity #cpu7 node1
printf "%s" 100 > /proc/irq/81/smp_affinity #cpu8 node1
printf "%s" 200 > /proc/irq/82/smp_affinity #cpu9 node1
printf "%s" 400 > /proc/irq/83/smp_affinity #cpu10 node1
printf "%s" 800 > /proc/irq/84/smp_affinity #cpu11 node1
printf "%s" 1000 > /proc/irq/85/smp_affinity #cpu12 node0
printf "%s" 2000 > /proc/irq/86/smp_affinity #cpu13 node0
printf "%s" 40000 > /proc/irq/87/smp_affinity #cpu18 node1
printf "%s" 80000 > /proc/irq/88/smp_affinity #cpu19 node1
- Launch the Snort instances in a PF_RING cluster. In our test, we spawned 24 instances with the following command:
for i in `seq 0 1 23`; do
  snort -c snort.serv.conf -N -A none -i eth3 --daq-dir /usr/local/lib/daq \
    --daq pfring --daq-var clusterid=10 &
done
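A simple way to confirm that all of the workers started and stayed up (the expected count is whatever you launched, 24 in this example):
pgrep -c snort   # should print 24 once all instances have initialized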
Building and Running PF_RING DNA
- Download PF_RING 5.1.
- Configure and make from the top-level directory.
cd PF_RING-5.1.0
./configure
make
- Load the DNA driver.
insmod /root/PF_RING-5.1.0/drivers/DNA/ixgbe-3.3.9-DNA/src/ixgbe.ko
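After the DNA driver is loaded, the 10 GbE ports it manages show up as dnaX interfaces rather than ethX; a quick check:
ifconfig -a | grep dna   # the DNA-managed ports should be listed as dna0, dna1, ...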
- Load PF_RING in transparent mode 2 and set a reasonable buffer size.
insmod /root/PF_RING-5.1.0/kernel/pf_ring.ko min_num_slots=8192 transparent_mode=2
- Bring the DNA interface up.
ifconfig dna0 up
# Optimize the Ethernet device, mostly turning off options
# which hinder throughput.
# Substitute eth3 with the interface appropriate to your instance.
ethtool -C eth3 rx-usecs 1000
ethtool -C eth3 adaptive-rx off
ethtool -K eth3 tso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 gso off
ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off
- Set up CPU affinity for interrupts based on the number of RX queues on the NIC, balanced across both processors. This may vary from system to system. Check /proc/cpuinfo to see which processor IDs are associated with each physical CPU.
printf "%s" 1 > /proc/irq/73/smp_affinity #cpu0 node0
printf "%s" 2 > /proc/irq/74/smp_affinity #cpu1 node0
printf "%s" 4 > /proc/irq/75/smp_affinity #cpu2 node0
printf "%s" 8 > /proc/irq/76/smp_affinity #cpu3 node0
printf "%s" 10 > /proc/irq/77/smp_affinity #cpu4 node0
printf "%s" 20 > /proc/irq/78/smp_affinity #cpu5 node0
printf "%s" 40 > /proc/irq/79/smp_affinity #cpu6 node1
printf "%s" 80 > /proc/irq/80/smp_affinity #cpu7 node1
printf "%s" 100 > /proc/irq/81/smp_affinity #cpu8 node1
printf "%s" 200 > /proc/irq/82/smp_affinity #cpu9 node1
printf "%s" 400 > /proc/irq/83/smp_affinity #cpu10 node1
printf "%s" 800 > /proc/irq/84/smp_affinity #cpu11 node1
printf "%s" 1000 > /proc/irq/85/smp_affinity #cpu12 node0
printf "%s" 2000 > /proc/irq/86/smp_affinity #cpu13 node0
printf "%s" 40000 > /proc/irq/87/smp_affinity #cpu18 node1
printf "%s" 80000 > /proc/irq/88/smp_affinity #cpu19 node1
- This loop spawns 16 Snort processes. Each is bound to an RX queue of the NIC interface, specified as dnaX@Y, where X is the DNA device ID and Y is the RX queue.
for i in `seq 0 1 15`; do
  /nsm/bin/snort-2.9.0/src/snort -c /nsm/etc/snort.serv.conf \
    -A none -N -y -i dna0@$i &
done
Compiling Snort With ICC
- We found that ICC gives the best performance when using its profiling capability with -march=corei7 -fomit-frame-pointer -no-prec-div -fimf-precision:low -fno-alias -fno-fnalias. Note: the Intel compiler is free to use for research purposes only; using it in production requires a paid license.
- Set the environment variables for Intel's compiler.
source /opt/intel/bin/compilervars.sh intel64
export CC=/opt/intel/composer_xe_2011_sp1.6.233/bin/intel64/icc
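A quick check that the compiler is actually on your PATH and that configure will pick it up (the version string will differ by install):
which icc        # should point into the Composer XE install
icc --version    # prints the ICC version banner
echo $CC         # confirm configure will use ICC rather than the system GCC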
- Compile Snort first using -prof-gen.
export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div \
  -fimf-precision:low -fno-alias -fno-fnalias -prof-gen \
  -prof-dir=/root/nsm_intel/bin/snort-2.9.x'
cd snort-2.9.x/
make clean; ./configure; make
- Run Snort on as many hardware threads as you would like, using PF_RING NAPI:
for i in `seq 0 1 23`; do
  snort -c /nsm/etc/snort.serv.conf -N -A none -i eth3 \
    --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 &
done
Or PF_RING DNA:
for i in `seq 0 1 15`; do
  snort -c /nsm/etc/snort.serv.conf -N -A none -i dna0@$i &
done
- Make sure to send some traffic for a few minutes. You might notice very high CPU utilization while Snort is running in prof-gen (profiling) mode:
# Kill Snort to output the profile data
killall snort
# Now recompile it using "-prof-use" instead
export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div \
  -fimf-precision:low -fno-alias -fno-fnalias -prof-use \
  -prof-dir=/root/nsm_intel/bin/snort-2.9.x'
cd snort-2.9.x/
make clean; ./configure; make
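Before rebuilding with -prof-use, it is worth confirming that the instrumented run actually wrote profile data; ICC's profile-guided optimization stores it as .dyn files in the directory passed via -prof-dir:
ls -lh /root/nsm_intel/bin/snort-2.9.x/*.dyn   # one or more .dyn files should appear after Snort exits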
If you have any issues, go to the MetaFlows Google Group for support.