MetaFlows Appliances

MetaFlows' security appliances are based on robust open standards that integrate quickly into any existing infrastructure. They are custom-built with the best hardware components available today to provide reliable and cost-effective packet processing from 50 Mbps to 10 Gbps.

20 Mbps UTM Appliance
5 Gbps Behavioral Malware Detection Appliance
10 Gbps Behavioral Malware Detection Appliance

Unlike our competitors, we detail how open standards (both hardware and software) are used to build extremely cost-effective network security appliances. Below, you will find all the information you need to review what we do and reproduce our results on your own hardware.

Robust, Open Source IDS Multiprocessing With PF_RING

MetaFlows supports open source.

MetaFlows' contribution to the PF_RING project has produced open source technology capable of scaling network monitoring from 10 Mbps to 10 Gbps. Below, we summarize the results of our peer-reviewed testing, showing that it is possible to build extremely effective network monitoring appliances on inexpensive commodity hardware.

PF_RING 1 Gbps on CentOS 6.x

We have modified PF_RING to work with inline Snort (while still supporting the existing passive multiprocessing functionality). PF_RING load-balances the traffic to be analyzed by hashing the IP headers into multiple buckets. This allows PF_RING to spawn multiple instances of Snort, each processing a single bucket and achieving higher throughput through multiprocessing. To take full advantage of this, you need a multi-core processor (such as an Intel i7 with 8 processing threads). This should also work well with dual- or quad-processor boards to increase parallelism even further.
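
As a concrete sketch of what this multiprocessing looks like (the interface name, config path, cluster id, and instance count are placeholders, not our exact setup), the dry-run loop below prints one Snort invocation per bucket:

```shell
# Dry-run sketch of PF_RING multiprocessing: one Snort instance per
# hardware thread, all joining the same PF_RING cluster (id 10).
# eth0, snort.conf, and the instance count are placeholders.
for i in $(seq 1 8); do
  echo snort -c snort.conf -i eth0 -A console \
    --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10
done
```

In a real run, each command would be launched without the leading `echo` and backgrounded with `&` so the instances process their buckets in parallel; the configuration sections on this page show the exact invocations we used.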

Our test system had the following setup:

  • Processor: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
  • Ethernet: Dual Intel e1000e
  • RAM: 4 GB
  • Driver: PF_RING e1000e driver, transparent_mode=1
  • Snort: 2.9.0.x

As the graphs illustrate, running inline with one core can only sustain 100 Mbps or less (that is what is already available today). With PF_RING inline, we parallelize the inline processing on up to 8 cores, achieving almost 700 Mbps sustained throughput. Performance numbers are greatly affected by the type and number of Snort rules used, as well as the type of traffic being processed. However, it appears that no matter what your setup is, PF_RING inline with 8 cores should achieve 700-800 Mbps sustained throughput with approximately 200 µs latency. That is impressive performance!

6,765 Emerging Threats Pro Rules

Results from running PF_RING inline with 6,765 Emerging Threats Pro rules

5,267 VRT Rules

Results from running PF_RING inline with 5,267 VRT rules

Reaching 5 Gbps Throughput on Commodity Hardware with PF_RING NAPI or DNA

To reach 5 Gbps sustained throughput, we needed better hardware. In this experiment, we ran Snort on a dual-processor board with a total of 24 hyper-threads (using the Intel Xeon X5670). Besides measuring Snort processing throughput while varying the number of rules, we also:

  1. Changed the compiler used to compile Snort (GCC vs. ICC) and
  2. Compared PF_RING in NAPI mode (running 24 Snort processes in parallel) and PF_RING Direct NIC Access technology (DNA) (running 16 Snort processes in parallel)

PF_RING NAPI performs the hashing of the packets in software and has a traditional architecture where the packets are copied to user space by the driver. Snort is parallelized using 24 processes that are allowed to float on the 24 hardware threads while the interrupts are parallelized on 16 of the 24 hardware threads.

PF_RING DNA performs the hashing of the packets in hardware (using the Intel 82599 RSS functionality) and relies on 16 hardware queues. The DNA driver allows 16 instances of Snort to read packets directly from the hardware queues, thereby virtually eliminating system-level processing overhead. There are limitations, though. PF_RING DNA:

  1. Supports a maximum of 16x parallelism per 10G interface,
  2. Only allows one process to attach to each hardware queue, and
  3. Costs a bit of money or requires Silicom cards (well worth it).

Number 2 in the list above is a significant limitation, because it does not allow multiple processes to receive the same data. For example, if you run tcpdump -i dna0, you could not also run snort -i dna0 -c config.snort -A console at the same time. The second invocation would return an error.
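
This exclusivity behaves much like an exclusive file lock. The following is only an analogy (it does not touch DNA at all) sketched with flock, which we use here purely to illustrate the one-consumer-per-queue behavior:

```shell
# Analogy only: a DNA hardware queue admits exactly one consumer,
# like a file held under an exclusive flock. The lock file stands
# in for dna0 queue 0; the comments name the would-be consumers.
lock=$(mktemp)
exec 9>"$lock"
flock -n 9 && echo "first consumer attached"    # e.g. tcpdump -i dna0
(
  exec 8>"$lock"
  flock -n 8 || echo "second consumer refused"  # e.g. snort -i dna0
)
rm -f "$lock"
```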

GCC is the standard open source compiler that comes with CentOS 6 and virtually all other Unix systems. It is the foundation of open source and without it we would still be in the stone age (computationally).

ICC is an Intel proprietary compiler that goes much further in extracting instruction-level and data-level parallelism from modern multi-core processors such as the Intel i7 and Xeons.

Results

Our results below are excellent and show that you can build a 5-7 Gbps IDS using standard off-the-shelf machines and PF_RING.

Our test system used the following setup:

  • Processor: Dual Intel(R) Xeon(R) X5670 CPU
  • Ethernet: Intel 10 Gbps 82599 (ixgbe)
  • RAM: 24 GB
  • Driver: ixgbe-3.1.15-FlowDirector
  • Snort: 2.9.0.x

Results from our PF_RING testing to reach 5-7 Gbps throughput.

The graph above shows the sustained Snort performance for 4 different configurations using a varying number of Emerging Threats Pro rules. As expected, the number of rules has a dramatic effect on performance for all configurations (the more rules, the lower the performance). In all cases, memory access contention is likely to be the main limiting factor.

Given our experience, we think that our setup is fairly representative of an academic institution. We have to admit that measuring Snort performance in absolute terms is difficult. No two networks are the same and rule configurations vary even more widely. Nevertheless, the relative performance variations are important and of general interest. You can draw your own conclusions from the above graph; however here are some interesting observations:

  • At the high end (6,900 rules), ICC makes a big difference, increasing throughput by ~1 Gbps (25%)
  • With fewer rules, GCC is just as good, maintaining throughput around 5 Gbps
  • PF_RING DNA is always better than PF_RING NAPI

Configuring PF_RING for 700 Mbps Processing

File Downloads

Instructions

  1. Install these packages:
    yum install -y kernel-devel libtool subversion automake make autoconf \
      pcre-devel libpcap-devel flex bison byacc gcc gcc-c++ zlib-devel

  2. Download and install libdnet-1.12.
  3. Download our modified version of PF_RING and build the PF_RING inline libraries and kernel module.
    tar xvfz PF_RING.tgz
    cd PF_RING; make clean
    cd kernel;
    make clean; make; make install
    cd ../userland/lib;
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib;
    export LIBS='-L/usr/local/lib';
    ./configure;
    make clean; make; make install
    cd ../libpcap;
    export LIBS='-L/usr/local/lib -lpfring -lpthread';
    ./configure;
    make clean; make; make install;
    make clean; make; make install-shared
    ln -s /usr/local/lib/libpfring.so /usr/lib/libpfring.so
  4. Download the daq-0.6.2 libraries and build:
    tar xvfz daq-0.6.2.tgz
    cd daq-0.6.2;
    chmod 755 configure;
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib;
    export LIBS="-L/usr/local/lib -lpcap -lpthread"
    ./configure --disable-nfq-module --disable-ipq-module \
    --with-libpcap-includes=/usr/local/include \
    --with-libpcap-libraries=/usr/local/lib \
    --with-libpfring-includes=/usr/local/include/ \
    --with-libpfring-libraries=/usr/local/lib
    make clean; make; make install
  5. Go back to the PF_RING directory and build the daq interface module:
    cd PF_RING/userland/snort/pfring-daq-module;
    autoreconf -ivf;
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
    export LIBS='-L/usr/local/lib -lpcap -lpfring -lpthread';
    ./configure; make; make install
  6. Build Snort 2.9.x:
    cd snort-2.9.x;
    make clean ;
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib;
    export LIBS='-L/usr/local/lib -lpfring -lpthread'
    ./configure --with-libpcap-includes=/usr/local/include \
    --with-libpcap-libraries=/usr/local/lib \
    --with-libpfring-includes=/usr/local/include/ \
    --with-libpfring-libraries=/usr/local/lib \
    --enable-zlib --enable-perfprofiling
    make
    make install
  7. Load the PF_RING module.
    Important!

    The OS will try to load the PF_RING kernel module with default parameters whenever any application that uses PF_RING runs. The default parameters are wrong for running inline. Never run inline with tx_capture enabled! To prevent this, it is always a good idea to remove pf_ring.ko and reload it with the correct parameters before running inline.

    rmmod pf_ring.ko
    insmod pf_ring.ko enable_tx_capture=0
  8. Run Snort with as many instances as your system can handle, limited only by the value of CLUSTER_LEN in PF_RING/kernel/linux/pf_ring.h at compile time (and by your memory). Remember to replace the interfaces with values for your instance.
    ifconfig eth0 up
    ifconfig eth1 up
    snort -c snort.serv.conf -A console -y -i eth0:eth1 \
    --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 \
    --daq-mode inline -Q
  9. If you want even faster performance (about 20% more) and you have one of the Ethernet interfaces in PF_RING/drivers, you can run in transparent mode 1. We have only extensively tested the e1000e driver and we know it is very reliable. To use transparent mode 1 with an e1000e interface:
    cd PF_RING/drivers/PF_RING_aware/intel/e1000e/e1000e-1.3.10a/src;
    make clean;
    make;
    make install

    Now you need to replace the e1000e module, either by rebooting or by removing the old one and loading the new driver from /lib/modules/`uname -r`/kernel/drivers/net/e1000e/. You also need to reload the pf_ring.ko module to enable transparent mode 1, increasing the buffer size for extra headroom:
    rmmod pf_ring.ko
    insmod pf_ring.ko enable_tx_capture=0 transparent_mode=1 min_num_slots=16384
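
Because a stray reload can silently re-enable tx capture (see step 7), it is worth verifying the live parameter before going inline. A minimal check, assuming the kernel exposes module parameters under the standard /sys/module layout (verify the path on your system):

```shell
# Check that the loaded pf_ring module has tx capture disabled
# before running inline. The sysfs path assumes the standard
# module-parameter layout; pass an alternate path as $1 if needed.
check_tx_capture() {
  p=${1:-/sys/module/pf_ring/parameters/enable_tx_capture}
  if [ -r "$p" ] && [ "$(cat "$p")" = "0" ]; then
    echo "enable_tx_capture=0: safe to run inline"
  else
    echo "WARNING: enable_tx_capture is not 0; reload pf_ring.ko"
  fi
}
check_tx_capture
```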

If you have any issues, go to the MetaFlows Google Group for support.

Configuring PF_RING for 5-7 Gbps Multiprocessing

Building and Running PF_RING NAPI

  1. Load the ixgbe driver. We found that setting InterruptThrottleRate to 4000 was optimal for our traffic.
    modprobe ixgbe InterruptThrottleRate=4000
  2. Load PF_RING in transparent mode 2 and set a reasonable buffer size.
    modprobe pf_ring.ko transparent_mode=2 min_num_slots=16384
  3. Bring up the 10 GbE interface (in our case, this was eth3).
    ifconfig eth3 up
  4. Optimize the Ethernet device. We mostly turned off options which hinder throughput. Substitute eth3 with the interface appropriate to your instance.
    ethtool -C eth3 rx-usecs 1000
    ethtool -C eth3 adaptive-rx off
    ethtool -K eth3 tso off
    ethtool -K eth3 gro off
    ethtool -K eth3 lro off
    ethtool -K eth3 gso off
    ethtool -K eth3 rx off
    ethtool -K eth3 tx off
    ethtool -K eth3 sg off
  5. Set up CPU affinity for interrupts based on the number of RX queues on the NIC, balanced across both processors. This may vary from system to system. Check /proc/cpuinfo to see which processor IDs are associated with each physical CPU.
    printf "%s" 1 > /proc/irq/73/smp_affinity #cpu0 node0
    printf "%s" 2 > /proc/irq/74/smp_affinity #cpu1 node0
    printf "%s" 4 > /proc/irq/75/smp_affinity #cpu2 node0
    printf "%s" 8 > /proc/irq/76/smp_affinity #cpu3 node0
    printf "%s" 10 > /proc/irq/77/smp_affinity #cpu4 node0
    printf "%s" 20 > /proc/irq/78/smp_affinity #cpu5 node0
    printf "%s" 40 > /proc/irq/79/smp_affinity #cpu6 node1
    printf "%s" 80 > /proc/irq/80/smp_affinity #cpu7 node1
    printf "%s" 100 > /proc/irq/81/smp_affinity #cpu8 node1
    printf "%s" 200 > /proc/irq/82/smp_affinity #cpu9 node1
    printf "%s" 400 > /proc/irq/83/smp_affinity #cpu10 node1
    printf "%s" 800 > /proc/irq/84/smp_affinity #cpu11 node1
    printf "%s" 1000 > /proc/irq/85/smp_affinity #cpu12 node0
    printf "%s" 2000 > /proc/irq/86/smp_affinity #cpu13 node0
    printf "%s" 40000 > /proc/irq/87/smp_affinity #cpu18 node1
    printf "%s" 80000 > /proc/irq/88/smp_affinity #cpu19 node1
  6. Launch Snort instances in a PF_RING cluster. In our test, we spawned 24 instances with the following command:
    for i in `seq 0 1 23`; do
    snort -c snort.serv.conf -N -A none -i eth3 --daq-dir /usr/local/lib/daq \
    --daq pfring --daq-var clusterid=10 &
    done
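
The hand-written masks in step 5 follow a simple pattern: the smp_affinity mask for CPU N is 1<<N written in hex. A small helper can print the commands instead of typing them by hand; the starting IRQ (73) and CPU list below are just the values from our example box, so check /proc/interrupts for yours:

```shell
# Generate the smp_affinity commands for a list of CPUs: the mask
# for CPU N is (1 << N) in hex. IRQ numbers and the CPU list are
# system-specific; this is a dry run that only prints the commands.
mask_for_cpu() { printf '%x' $((1 << $1)); }

irq=73
for cpu in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 18 19; do
  echo "printf '%s' $(mask_for_cpu "$cpu") > /proc/irq/$irq/smp_affinity"
  irq=$((irq + 1))
done
```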

Building and Running PF_RING DNA

  1. Download PF_RING 5.1.
  2. Configure and make from the top-level directory.
    cd PF_RING-5.1.0
    ./configure
    make
  3. Load the DNA driver.
    insmod /root/PF_RING-5.1.0/drivers/DNA/ixgbe-3.3.9-DNA/src/ixgbe.ko
  4. Load PF_RING in transparent mode 2 and set a reasonable buffer size.
    insmod /root/PF_RING-5.1.0/kernel/pf_ring.ko min_num_slots=8192 transparent_mode=2
  5. Bring the DNA interface up.
    ifconfig dna0 up
    #Optimize the Ethernet device, mostly turning off options
    #which hinder throughput.
    #Substitute eth3 with the interface appropriate to your instance.
    ethtool -C eth3 rx-usecs 1000
    ethtool -C eth3 adaptive-rx off
    ethtool -K eth3 tso off
    ethtool -K eth3 gro off
    ethtool -K eth3 lro off
    ethtool -K eth3 gso off
    ethtool -K eth3 rx off
    ethtool -K eth3 tx off
    ethtool -K eth3 sg off
  6. Set up CPU affinity for interrupts based on the number of RX queues on the NIC, balanced across both processors. This may vary from system to system. Check /proc/cpuinfo to see which processor IDs are associated with each physical CPU.
    printf "%s" 1 > /proc/irq/73/smp_affinity #cpu0 node0
    printf "%s" 2 > /proc/irq/74/smp_affinity #cpu1 node0
    printf "%s" 4 > /proc/irq/75/smp_affinity #cpu2 node0
    printf "%s" 8 > /proc/irq/76/smp_affinity #cpu3 node0
    printf "%s" 10 > /proc/irq/77/smp_affinity #cpu4 node0
    printf "%s" 20 > /proc/irq/78/smp_affinity #cpu5 node0
    printf "%s" 40 > /proc/irq/79/smp_affinity #cpu6 node1
    printf "%s" 80 > /proc/irq/80/smp_affinity #cpu7 node1
    printf "%s" 100 > /proc/irq/81/smp_affinity #cpu8 node1
    printf "%s" 200 > /proc/irq/82/smp_affinity #cpu9 node1
    printf "%s" 400 > /proc/irq/83/smp_affinity #cpu10 node1
    printf "%s" 800 > /proc/irq/84/smp_affinity #cpu11 node1
    printf "%s" 1000 > /proc/irq/85/smp_affinity #cpu12 node0
    printf "%s" 2000 > /proc/irq/86/smp_affinity #cpu13 node0
    printf "%s" 40000 > /proc/irq/87/smp_affinity #cpu18 node1
    printf "%s" 80000 > /proc/irq/88/smp_affinity #cpu19 node1
  7. This loop spawns 16 Snort processes. Each is bound to an RX queue of the NIC; the interface is specified as dnaX@Y, where X is the DNA device ID and Y is the RX queue.
    for i in `seq 0 1 15`; do
    /nsm/bin/snort-2.9.0/src/snort -c /nsm/etc/snort.serv.conf \
    -A none -N -y -i dna0@$i &
    done

Compiling Snort With ICC

  1. We found that ICC gives the best performance using its profiling capability with -march=corei7 -fomit-frame-pointer -no-prec-div -fimf-precision:low -fno-alias -fno-fnalias. Note: the Intel compiler is free to use for research purposes only; using this in production requires a paid license.
  2. Set environment variables for Intel's compiler.
    source /opt/intel/bin/compilervars.sh intel64
    export CC=/opt/intel/composer_xe_2011_sp1.6.233/bin/intel64/icc
  3. Compile Snort first using -prof-gen.
    export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div \
    -fimf-precision:low -fno-alias -fno-fnalias -prof-gen \
    -prof-dir=/root/nsm_intel/bin/snort-2.9.x'
    cd snort-2.9.x/
    make clean;
    ./configure;
    make
  4. Run Snort on as many hardware threads as you would like using PF_RING_NAPI:
    for i in `seq 0 1 23`; do
    snort -c /nsm/etc/snort.serv.conf -N -A none -i eth3 \
    --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 &
    done

    Or PF_RING DNA:
    for i in `seq 0 1 15`; do
    snort -c /nsm/etc/snort.serv.conf -N -A none -i dna0@$i &
    done
  5. Make sure to send some traffic for a few minutes. Expect very high CPU utilization, because the -prof-gen build is collecting profile data:
    # Kill Snort to output the profile data
    killall snort
    # Now recompile it using "-prof-use" instead
    export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div \
    -fimf-precision:low -fno-alias -fno-fnalias -prof-use \
    -prof-dir=/root/nsm_intel/bin/snort-2.9.x'
    cd snort-2.9.x/
    make clean; ./configure; make

If you have any issues, go to the MetaFlows Google Group for support.

Use our appliances or run a trial on your own hardware. Try it today.
