PF_RING based 10 Gbps Snort multiprocessing

Tested on CentOS 6 64bit using our custom PF_RING source

PF_RING load balances network traffic coming in on an Ethernet interface by hashing the IP headers into N buckets. This allows it to spawn N instances of Snort, each processing a single bucket, and thereby achieve higher throughput through multiprocessing. To take full advantage of this, you need a multicore processor (like an i7 with 8 processing threads) or a dual- or quad-processor board that increases parallelism even further across multiple chips.
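
As a minimal sketch of the idea (the configuration path and cluster ID below are placeholders, and this assumes Snort was built with the PF_RING DAQ module), every instance simply joins the same PF_RING cluster and the kernel delivers each bucket to exactly one of them:

#Spawn one Snort per hardware thread; PF_RING hashes each flow to exactly
#one of the N processes sharing cluster ID 10 (an arbitrary ID)
N=$(grep -c ^processor /proc/cpuinfo)
for i in $(seq 1 $N); do
snort -c /path/to/snort.conf -i eth3 -N -A none \
--daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 &
done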

In a related article we measured the performance of PF_RING with Snort inline at 1 Gbps on an i7 950. The results were impressive.

The big deal is that now you can build low-cost IDPS systems using standard off-the-shelf hardware.

You can purchase our purpose-built hardware with MetaFlows PF_RING pre-installed, giving you a low-cost, high-performance platform on which to run your custom PF_RING applications. If you are interested in learning more, please contact us.

In this article we report on our experiment running Snort on a dual processor board with a total of 24 hyperthreads (two Intel Xeon X5670 CPUs). Besides measuring Snort processing throughput while varying the number of rules, we also (1) changed the compiler used to build Snort (GCC vs. ICC) and (2) compared PF_RING in NAPI mode (running 24 Snort processes in parallel) against PF_RING Direct NIC Access (DNA) technology (running 16 Snort processes in parallel).

PF_RING NAPI performs the hashing of the packets in software and has a traditional architecture where the packets are copied to user space by the driver. Snort is parallelized using 24 processes that are allowed to float on the 24 hardware threads while the interrupts are parallelized on 16 of the 24 hardware threads.

PF_RING DNA performs the hashing of the packets in hardware (using the Intel 82599 RSS functionality) and relies on 16 hardware queues. The DNA driver allows 16 instances of Snort to read packets directly from the hardware queues, virtually eliminating system-level processing overhead. The limitations of DNA are that (1) it supports a maximum of 16x parallelism per 10G interface, (2) it only allows 1 process to attach to each hardware queue, and (3) it costs a bit of money or requires Silicom cards (well worth it). Limitation (2) is significant because it does not allow multiple processes to receive the same data. For example, if you run "tcpdump -i dna0", you could not also run "snort -i dna0 -c config.snort -A console" at the same time; the second invocation would return an error.
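
If you want to see limitation (2) in action, PF_RING ships with a small test utility, pfcount, under userland/examples (the path and build details may vary between releases); the second reader on the same queue is rejected:

#First reader claims hardware queue 0 of the DNA interface
./pfcount -i dna0@0 &
#A second reader on the same queue fails, since DNA allows exactly one
#consumer per hardware queue
./pfcount -i dna0@0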

GCC is the standard open source compiler that comes with CentOS 6 and virtually all other Unix systems. It is the foundation of open source and without it we would still be in the stone age (computationally).

ICC is an Intel proprietary compiler that goes much further in extracting instruction- and data-level parallelism from modern multicore processors such as the i7 and Xeon.

All results are excellent and show that you can build a 5-7 Gbps IDS using standard off-the-shelf machines and PF_RING. The system we used to perform these experiments is below:

Dual Intel(R) Xeon(R) X5670 CPUs, Intel 82599 10 Gbps NIC (ixgbe), 24 GB RAM

[Graph: sustained Snort throughput (Gbps) of the four configurations as the number of Emerging Threats Pro rules varies]

The graph above shows the sustained Snort performance of 4 different configurations using a varying number of Emerging Threats Pro rules. As expected, the number of rules has a dramatic effect on performance for all configurations (the more rules, the lower the performance). In all cases, memory access contention is likely to be the main limiting factor.

We have to admit that measuring Snort performance in absolute terms is hard: no two networks are the same, and rule configurations vary even more widely. That said, given our experience we think our setup is fairly representative of an academic institution, and the relative performance variations are important and of general interest. You can draw your own conclusions from the graph above; however, here are some interesting observations:

  • At the high end (6900 rules), ICC makes a big difference, increasing throughput by ~1 Gbps (25%).
  • GCC holds its own, maintaining throughput of around 5 Gbps.
  • PF_RING DNA is always better than PF_RING NAPI.

We describe below how to reproduce these numbers on Linux CentOS 6. If you do not want to go through these steps, we also provide this functionality, pre-packaged and ready to go, through our security system (MSS). It would help us if you tried it and let us know what you think. If you want to go at it on your own, see below:

Building and Running PF_RING NAPI

See our other article, but use the Intel ixgbe-3.1.15-FlowDirector driver instead of the e1000e driver.
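
If you are building that driver from source, the usual out-of-tree Intel driver build is sketched below (the directory name comes from the driver tarball; adjust the path to wherever you unpacked it):

#Build and install the ixgbe-3.1.15-FlowDirector driver from its src/ directory
cd ixgbe-3.1.15-FlowDirector/src
make
make install

Then execute: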

#Load ixgbe driver, we found that setting the InterruptThrottleRate to
#4000 was optimal for our traffic
modprobe ixgbe InterruptThrottleRate=4000
#Load pf_ring in transparent mode 2 and set a reasonable buffer size
modprobe pf_ring transparent_mode=2 min_num_slots=16384
#bring up the 10gbe interface, in our case eth3
ifconfig eth3 up
#optimise the Ethernet device, mostly turning off options which hinder throughput.
#substitute eth3 with the interface appropriate to your instance
ethtool -C eth3 rx-usecs 1000
ethtool -C eth3 adaptive-rx off
ethtool -K eth3 tso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 gso off
ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off
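
#To double-check that the settings above took effect (output format varies a
#bit between ethtool versions), you can list the offload and coalescing state:
ethtool -k eth3
ethtool -c eth3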

#Setup CPU affinity for interrupts based on the number of rx queues
#on the nic, balanced across both processors.
#This may vary from system to system, check /proc/cpuinfo to see which
#processor ids go to which physical cpu.

printf "%s" 1 > /proc/irq/73/smp_affinity #cpu0 node0
printf "%s" 2 > /proc/irq/74/smp_affinity #cpu1 node0
printf "%s" 4 > /proc/irq/75/smp_affinity #cpu2 node0
printf "%s" 8 > /proc/irq/76/smp_affinity #cpu3 node0
printf "%s" 10 > /proc/irq/77/smp_affinity #cpu4 node0
printf "%s" 20 > /proc/irq/78/smp_affinity #cpu5 node0
printf "%s" 40 > /proc/irq/79/smp_affinity #cpu6 node1
printf "%s" 80 > /proc/irq/80/smp_affinity #cpu7 node1
printf "%s" 100 > /proc/irq/81/smp_affinity #cpu8 node1
printf "%s" 200 > /proc/irq/82/smp_affinity #cpu9 node1
printf "%s" 400 > /proc/irq/83/smp_affinity #cpu10 node1
printf "%s" 800 > /proc/irq/84/smp_affinity #cpu11 node1
printf "%s" 1000 > /proc/irq/85/smp_affinity #cpu12 node0
printf "%s" 2000 > /proc/irq/86/smp_affinity #cpu13 node0
printf "%s" 40000 > /proc/irq/78/smp_affinity #cpu18 node1
printf "%s" 80000 > /proc/irq/88/smp_affinity #cpu19 node1

#launch snorts in a pf_ring cluster, in our test we spawned the following command 24 times.

for i in `seq 0 1 23`; do
snort -c snort.serv.conf -N -A none -i eth3 --daq-dir /usr/local/lib/daq \
--daq pfring --daq-var clusterid=10 &
done
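
#To verify that all 24 Snort processes attached to the ring, you can inspect
#the pf_ring proc interface (one file per open ring plus a global info file;
#exact file names vary between PF_RING releases):
cat /proc/net/pf_ring/info
ls /proc/net/pf_ring/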

Building and Running PF_RING DNA

#Download PF_RING 5.1 (from ntop.org)
#configure and make from the top level directory

cd PF_RING-5.1.0
./configure
make
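
#A quick sanity check that the build produced the pieces used below (paths
#are relative to the PF_RING-5.1.0 top level and may shift slightly between
#releases):
ls kernel/pf_ring.ko
ls userland/lib/libpfring.a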

#load the DNA driver

insmod /root/PF_RING-5.1.0/drivers/DNA/ixgbe-3.3.9-DNA/src/ixgbe.ko
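
#The driver typically creates one RX queue per core by default (the 82599
#supports up to 16 RSS queues). If you need to force the queue count, the
#stock Intel ixgbe driver exposes an RSS module parameter (one value per
#port); run modinfo on the .ko first to confirm this DNA build still has it:
modinfo /root/PF_RING-5.1.0/drivers/DNA/ixgbe-3.3.9-DNA/src/ixgbe.ko | grep -i parm
#e.g.: insmod /root/PF_RING-5.1.0/drivers/DNA/ixgbe-3.3.9-DNA/src/ixgbe.ko RSS=16,16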

#Load pf_ring in transparent mode 2 and set a reasonable buffer size
insmod /root/PF_RING-5.1.0/kernel/pf_ring.ko min_num_slots=8192 transparent_mode=2

#bring the DNA interface up

ifconfig dna0 up
#optimise the Ethernet device, mostly turning off options
#which hinder throughput.
#substitute eth3 with the interface appropriate to your instance
ethtool -C eth3 rx-usecs 1000
ethtool -C eth3 adaptive-rx off
ethtool -K eth3 tso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 gso off
ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off

#Setup CPU affinity for interrupts based on the number of rx queues on the nic, balanced across both processors.
#This may vary from system to system, check /proc/cpuinfo to see which processor ids go to which physical cpu.

printf "%s" 1 > /proc/irq/73/smp_affinity #cpu0 node0
printf "%s" 2 > /proc/irq/74/smp_affinity #cpu1 node0
printf "%s" 4 > /proc/irq/75/smp_affinity #cpu2 node0
printf "%s" 8 > /proc/irq/76/smp_affinity #cpu3 node0
printf "%s" 10 > /proc/irq/77/smp_affinity #cpu4 node0
printf "%s" 20 > /proc/irq/78/smp_affinity #cpu5 node0
printf "%s" 40 > /proc/irq/79/smp_affinity #cpu6 node1
printf "%s" 80 > /proc/irq/80/smp_affinity #cpu7 node1
printf "%s" 100 > /proc/irq/81/smp_affinity #cpu8 node1
printf "%s" 200 > /proc/irq/82/smp_affinity #cpu9 node1
printf "%s" 400 > /proc/irq/83/smp_affinity #cpu10 node1
printf "%s" 800 > /proc/irq/84/smp_affinity #cpu11 node1
printf "%s" 1000 > /proc/irq/85/smp_affinity #cpu12 node0
printf "%s" 2000 > /proc/irq/86/smp_affinity #cpu13 node0
printf "%s" 40000 > /proc/irq/78/smp_affinity #cpu18 node1
printf "%s" 80000 > /proc/irq/88/smp_affinity #cpu19 node1

#This loop spawns 16 snort processes, each bound to an RX queue of the nic.
#The interface is specified as dnaX@Y, where X is the dna device id and Y is the RX queue

for i in `seq 0 1 15`; do
/nsm/bin/snort-2.9.0/src/snort -c /nsm/etc/snort.serv.conf \
-A none -N -y -i dna0@$i &
done
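
#Each hardware queue accepts exactly one reader, so all 16 processes should
#still be alive a few seconds after launch; if one exited immediately, its
#queue was probably already claimed by another process:
pgrep -cf "dna0@"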

Compiling Snort with ICC

#We found that ICC gives the best performance using its profiling capability
#together with -march=corei7 -fomit-frame-pointer -no-prec-div
#-fimf-precision:low -fno-alias -fno-fnalias.
#Note: the Intel compiler is free to use for research purposes only;
#it requires a paid license for production use.

#Set environment variables for Intel's compiler
source /opt/intel/bin/compilervars.sh intel64
export CC=/opt/intel/composer_xe_2011_sp1.6.233/bin/intel64/icc
#Compile snort first using "-prof-gen"
export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div \
-fimf-precision:low -fno-alias -fno-fnalias -prof-gen \
-prof-dir=/root/nsm_intel/bin/snort-2.9.x'
cd snort-2.9.x/
make clean; ./configure; make

#Run snort on as many hardware threads as you would like, using PF_RING NAPI or PF_RING DNA

for i in `seq 0 1 23`; do
snort -c /nsm/etc/snort.serv.conf -N -A none -i eth3 \
--daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 &
done
for i in `seq 0 1 15`; do
snort -c /nsm/etc/snort.serv.conf -N -A none -i dna0@$i &
done

#Make sure you send some traffic for a few minutes.
#You might notice very high CPU utilization, because the -prof-gen build runs in profiling mode.
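
#Any representative traffic source will do for the profiling run; if you have
#a capture file handy, tcpreplay is one convenient option (the interface and
#pcap below are placeholders for your own):
tcpreplay --intf1=eth2 --loop=10 --topspeed sample.pcap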

#kill snort to output the profile data
killall snort
#Now recompile it using "-prof-use" instead
export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div \
-fimf-precision:low -fno-alias -fno-fnalias -prof-use \
-prof-dir=/root/nsm_intel/bin/snort-2.9.x'
cd snort-2.9.x/
make clean; ./configure; make

If you have any issues, go to the MetaFlows Google group for support.