-------------------------------------------------------------------
Inline Snort multiprocessing with PF_RING – Tested on CentOS 6

We have modified PF_RING to work with inline Snort (while still supporting the current passive multiprocessing functionality). PF_RING load-balances the traffic to analyze by hashing the TCP/UDP headers into multiple buckets. This allows you to spawn multiple instances of Snort, each processing a single bucket, and achieve higher throughput through multiprocessing. To take full advantage of this you need a multicore processor (such as an i7 with 8 hardware threads). This should also work well with dual- or quad-processor boards to increase parallelism even further. The big deal is that you can now build really cheap IPS systems using standard off-the-shelf machines.

Here is the system we have ported PF_RING inline to:

Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
PF_RING e1000e driver, transparent_mode=1
Snort 2.9.0.x using the SRI BotHunter Ruleset, in addition to the Emerging Threats Pro Ruleset
Throughput: ~800Mbps
Latency: ~200us

Please install the following packages first.
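Since throughput here comes from running one Snort instance per hardware thread, it can be worth confirming how many threads the host actually exposes before sizing the deployment. A minimal check with standard Linux tools (not part of the original procedure, just a sanity check):

```shell
# Count the hardware threads the kernel exposes; this bounds how many
# Snort instances are worth spawning in one PF_RING cluster.
grep -c ^processor /proc/cpuinfo
```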
Most of these can be installed as:

yum install kernel-devel libtool subversion automake make autoconf pcre-devel libpcap-devel flex bison byacc gcc zlib-devel gcc-c++

#Download and install libdnet:
#https://www.metaflows.com/assets/downloads/pf_ring/libdnet-1.12.tgz

#Build the PF_RING inline libraries and kernel module.
#Download our modified PF_RING source here:
#https://www.metaflows.com/assets/downloads/pf_ring/PF_RING.tgz

tar xvfz PF_RING.tgz
cd PF_RING; make clean
cd kernel; make clean; make; make install
cd ../userland/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS='-L/usr/local/lib'
./configure; make clean; make; make install
cd ../libpcap
export LIBS='-L/usr/local/lib -lpfring -lpthread'
./configure; make clean; make; make install; make install-shared
ln -s /usr/local/lib/libpfring.so /usr/lib/libpfring.so

#Build the daq-0.6.2 libraries.
#Download daq-0.6.2 here:
#https://www.metaflows.com/assets/downloads/pf_ring/daq-0.6.2.tgz

tar xvfz daq-0.6.2.tgz
cd daq-0.6.2; chmod 755 configure
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS="-L/usr/local/lib -lpcap -lpthread"
./configure --disable-nfq-module --disable-ipq-module \
  --with-libpcap-includes=/usr/local/include \
  --with-libpcap-libraries=/usr/local/lib \
  --with-libpfring-includes=/usr/local/include \
  --with-libpfring-libraries=/usr/local/lib
make clean; make; make install

#Go back to the PF_RING directory and build the DAQ interface module.
cd PF_RING/userland/snort/pfring-daq-module
autoreconf -ivf
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS='-L/usr/local/lib -lpcap -lpfring -lpthread'
./configure; make; make install

# Build Snort 2.9.x #
cd snort-2.9.x; make clean
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBS='-L/usr/local/lib -lpfring -lpthread'
./configure --with-libpcap-includes=/usr/local/include \
  --with-libpcap-libraries=/usr/local/lib \
  --with-libpfring-includes=/usr/local/include \
  --with-libpfring-libraries=/usr/local/lib \
  --enable-zlib --enable-perfprofiling
make
make install

# Load the PF_RING module #
# Never run inline with tx_capture!
insmod pf_ring.ko enable_tx_capture=0

# Run Snort #
# Run as many instances as your system can handle, limited only by the value of
# CLUSTER_LEN in PF_RING/kernel/linux/pf_ring.h at compile time (and your memory).
# Remember to replace the interfaces with ones appropriate for your instance.
ifconfig eth0 up
ifconfig eth1 up
snort -c snort.serv.conf -A console -y -i eth0:eth1 --daq-dir /usr/local/lib/daq \
  --daq pfring --daq-var clusterid=10 --daq-mode inline -Q

----------------------------------------------------------
Configuring PF_RING for 5-7 Gbps Multiprocessing

Building and Running PF_RING NAPI

Load the ixgbe driver. We found that setting the InterruptThrottleRate to 4000 was optimal for our traffic.

modprobe ixgbe InterruptThrottleRate=4000

Load PF_RING in transparent mode 2 and set a reasonable buffer size.

modprobe pf_ring transparent_mode=2 min_num_slots=16384

Bring up the 10gbe interface (in our case this was eth3).

ifconfig eth3 up

Optimise the Ethernet device. We mostly turned off options which hinder throughput. Substitute eth3 with the interface appropriate to your instance.

ethtool -C eth3 rx-usecs 1000
ethtool -C eth3 adaptive-rx off
ethtool -K eth3 tso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 gso off
ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off

Set up CPU affinity for interrupts based on the number of RX queues on the NIC, balanced across both processors. This may vary from system to system. Check /proc/cpuinfo to see which processor IDs are associated with each physical CPU.
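The value written to smp_affinity is a hexadecimal CPU bitmask (bit N selects CPU N), which is where the 1, 2, 4, 8, 10, 20, ... values below come from. A small helper (hypothetical name cpu_mask, not part of the original procedure) makes the mapping explicit:

```shell
# cpu_mask: print the smp_affinity hex bitmask that pins an IRQ to one CPU.
cpu_mask() { printf '%x\n' $((1 << $1)); }

cpu_mask 0    # -> 1
cpu_mask 4    # -> 10
cpu_mask 19   # -> 80000
```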
printf "%s" 1 > /proc/irq/73/smp_affinity      #cpu0  node0
printf "%s" 2 > /proc/irq/74/smp_affinity      #cpu1  node0
printf "%s" 4 > /proc/irq/75/smp_affinity      #cpu2  node0
printf "%s" 8 > /proc/irq/76/smp_affinity      #cpu3  node0
printf "%s" 10 > /proc/irq/77/smp_affinity     #cpu4  node0
printf "%s" 20 > /proc/irq/78/smp_affinity     #cpu5  node0
printf "%s" 40 > /proc/irq/79/smp_affinity     #cpu6  node1
printf "%s" 80 > /proc/irq/80/smp_affinity     #cpu7  node1
printf "%s" 100 > /proc/irq/81/smp_affinity    #cpu8  node1
printf "%s" 200 > /proc/irq/82/smp_affinity    #cpu9  node1
printf "%s" 400 > /proc/irq/83/smp_affinity    #cpu10 node1
printf "%s" 800 > /proc/irq/84/smp_affinity    #cpu11 node1
printf "%s" 1000 > /proc/irq/85/smp_affinity   #cpu12 node0
printf "%s" 2000 > /proc/irq/86/smp_affinity   #cpu13 node0
printf "%s" 40000 > /proc/irq/87/smp_affinity  #cpu18 node1
printf "%s" 80000 > /proc/irq/88/smp_affinity  #cpu19 node1

Launch Snort instances in a PF_RING cluster. In our test, we spawned 24 instances with the following command:

for i in `seq 0 1 23`; do
  snort -c snort.serv.conf -N -A none -i eth3 --daq-dir /usr/local/lib/daq \
    --daq pfring --daq-var clusterid=10 &
done

-----------------------------------------------------------------------------
Building and Running PF_RING DNA

Download PF_RING 5.1. Configure and make from the top-level directory.

cd PF_RING-5.1.0
./configure
make

Load the DNA driver.

insmod /root/PF_RING-5.1.0/drivers/DNA/ixgbe-3.3.9-DNA/src/ixgbe.ko

Load PF_RING in transparent mode 2 and set a reasonable buffer size.

insmod /root/PF_RING-5.1.0/kernel/pf_ring.ko min_num_slots=8192 transparent_mode=2

Bring the DNA interface up.

ifconfig dna0 up

#Optimise the Ethernet device, mostly turning off options which hinder throughput.
#Substitute eth3 with the interface appropriate to your instance.
ethtool -C eth3 rx-usecs 1000
ethtool -C eth3 adaptive-rx off
ethtool -K eth3 tso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 gso off
ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off

Set up CPU affinity for interrupts based on the number of RX queues on the NIC, balanced across both processors. This may vary from system to system. Check /proc/cpuinfo to see which processor IDs are associated with each physical CPU.

printf "%s" 1 > /proc/irq/73/smp_affinity      #cpu0  node0
printf "%s" 2 > /proc/irq/74/smp_affinity      #cpu1  node0
printf "%s" 4 > /proc/irq/75/smp_affinity      #cpu2  node0
printf "%s" 8 > /proc/irq/76/smp_affinity      #cpu3  node0
printf "%s" 10 > /proc/irq/77/smp_affinity     #cpu4  node0
printf "%s" 20 > /proc/irq/78/smp_affinity     #cpu5  node0
printf "%s" 40 > /proc/irq/79/smp_affinity     #cpu6  node1
printf "%s" 80 > /proc/irq/80/smp_affinity     #cpu7  node1
printf "%s" 100 > /proc/irq/81/smp_affinity    #cpu8  node1
printf "%s" 200 > /proc/irq/82/smp_affinity    #cpu9  node1
printf "%s" 400 > /proc/irq/83/smp_affinity    #cpu10 node1
printf "%s" 800 > /proc/irq/84/smp_affinity    #cpu11 node1
printf "%s" 1000 > /proc/irq/85/smp_affinity   #cpu12 node0
printf "%s" 2000 > /proc/irq/86/smp_affinity   #cpu13 node0
printf "%s" 40000 > /proc/irq/87/smp_affinity  #cpu18 node1
printf "%s" 80000 > /proc/irq/88/smp_affinity  #cpu19 node1

This loop spawns 16 Snort processes. Each is bound to an RX queue of the NIC interface, specified as dnaX@Y, where X is the DNA device ID and Y is the RX queue.

for i in `seq 0 1 15`; do
  /nsm/bin/snort-2.9.0/src/snort -c /nsm/etc/snort.serv.conf \
    -A none -N -y -i dna0@$i &
done

--------------------------------------------------------------------
Compiling Snort With ICC

We found that ICC gives the best performance when using its profiling capability together with -march=corei7 -fomit-frame-pointer -no-prec-div -fimf-precision:low -fno-alias -fno-fnalias.
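The -march=corei7 flag assumes a Nehalem-class (or newer) CPU; on x86 Linux, SSE4.2 support is a reasonable proxy check before committing to that target (this probe is our addition, not part of the original procedure):

```shell
# Check whether the CPU advertises SSE4.2, which the corei7 target requires.
grep -m1 -o 'sse4_2' /proc/cpuinfo || echo "no SSE4.2 reported"
```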
Note: the Intel compiler is free to use for research purposes only; using it in production requires a paid license.

Set environment variables for Intel's compiler.

source /opt/intel/bin/compilervars.sh intel64
export CC=/opt/intel/composer_xe_2011_sp1.6.233/bin/intel64/icc

Compile Snort first using -prof-gen.

export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div -fimf-precision:low -fno-alias -fno-fnalias -prof-gen -prof-dir=/root/nsm_intel/bin/snort-2.9.x'
cd snort-2.9.x/
make clean; ./configure; make

Run Snort on as many hardware threads as you would like, using PF_RING NAPI:

for i in `seq 0 1 23`; do
  snort -c /nsm/etc/snort.serv.conf -N -A none -i eth3 \
    --daq-dir /usr/local/lib/daq --daq pfring --daq-var clusterid=10 &
done

or PF_RING DNA:

for i in `seq 0 1 15`; do
  snort -c /nsm/etc/snort.serv.conf -N -A none -i dna0@$i &
done

Make sure to send it some traffic for a few minutes. You might notice very high CPU utilization because the application is running in profiling mode.

# Kill Snort to output the profile data
killall snort

# Now recompile using "-prof-use" instead
export CFLAGS='-march=corei7 -fomit-frame-pointer -no-prec-div -fimf-precision:low -fno-alias -fno-fnalias -prof-use -prof-dir=/root/nsm_intel/bin/snort-2.9.x'
cd snort-2.9.x/
make clean; ./configure; make
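As a sanity check on this workflow: the -prof-gen run dumps .dyn profile files into the prof-dir, and the -prof-use rebuild has nothing to work with if they are missing. A quick count (the path is the one from the example above; substitute your own):

```shell
# Count ICC .dyn profile dumps; zero means no profile data was recorded
# and the -prof-use build will not be profile-optimized.
PROF_DIR=/root/nsm_intel/bin/snort-2.9.x
ls "$PROF_DIR"/*.dyn 2>/dev/null | wc -l
```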