Scaling
Introduction
Our maximum throughput on a single server (24-core Xeon, 10Gbit NIC) is around 60000 calls without RTP analysis (SIP only) and 10000 concurrent calls with full RTP analysis and packet capture to disk (around 1.6Gbit of VoIP traffic). VoIPmonitor can also work in distributed mode, where remote sniffers write to one central database and a single GUI accesses the data from all sensors.
VoIPmonitor is able to use all available CPU cores, but there are several bottlenecks which you should consider before deploying and configuring it. Do not hesitate to write to support@voipmonitor.org if you need more information or help with deployment.
There are three types of bottlenecks: CPU, disk I/O throughput (if writing pcap files) and storing CDRs to MySQL (which is both I/O and CPU bound). The sniffer is a multithreaded application, but certain tasks cannot be split across more threads. The main thread reads packets from the kernel; it is the most CPU-consuming thread and its load depends on the CPU type, the kernel version and the number of packets per second. Below 500Mbit of traffic you do not need to worry about CPU on a usual CPU (Xeon, i5).
The I/O bottleneck is the most common problem for VoIPmonitor, and it depends on whether you store to a local MySQL database alongside the pcap files on the same storage. See the I/O bottleneck chapter below.
CPU bound
Reading packets
The main thread (called t0CPU), which reads packets from the kernel, cannot be split into more threads, and this limits the number of concurrent calls for the whole server. You can check how much CPU the T0 thread consumes in the syslog, where voipmonitor writes CPU and memory information every 10 seconds. If t0CPU is >95% you are at the limit of capturing packets from the interface. Your options are:
- better CPU (faster core)
- kernel >= 3.2 (our latest static binaries support the TPACKET_V3 feature, which speeds up capturing and reduces t0CPU)
- special network card (Napatech for example reduces t0CPU to 3% for 1.6Gbit voip traffic)
- commercial ntop.org drivers for Intel cards, which reduce CPU load from 90% to 20-40% for 1.5Gbit (tested)
Our recent sniffer versions with kernel >= 3.2 are able to sniff up to 2Gbit of VoIP traffic on a 10Gbit Intel card with native Intel drivers. The tested CPU is an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
For high throughput (>500Mbit) there is another important thing to check, especially if you use kernel < 3.X, which does not balance IRQs from NIC RX interrupts by default. If the syslog shows CPU0 at 100%, check /proc/interrupts to see whether your RX queues are bound only to CPU0. If you are not sure, just upgrade your kernel to 3.X, where IRQs are spread across more CPUs automatically by default.
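The /proc/interrupts check described above can be scripted. The following Python sketch sums the interrupt counts per CPU for one interface and reports the share handled by CPU0. The function names, the sample text and the queue naming (eth3-TxRx-0) are assumptions for illustration only; IRQ names differ between drivers.

```python
def irq_counts_per_cpu(interrupts_text, ifname):
    """Sum the per-CPU interrupt counts of all IRQ lines that mention ifname."""
    lines = interrupts_text.strip().splitlines()
    ncpu = len(lines[0].split())              # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1:1 + ncpu]
        description = " ".join(fields[1 + ncpu:])
        if ifname not in description:
            continue
        if len(counts) != ncpu or not all(c.isdigit() for c in counts):
            continue                          # skip rows without per-CPU counters
        for cpu, count in enumerate(counts):
            totals[cpu] += int(count)
    return totals


def cpu0_share(totals):
    """Fraction of the interface's interrupts handled by CPU0 (0.0 if none)."""
    total = sum(totals)
    return totals[0] / total if total else 0.0


# Hypothetical sample in the layout of /proc/interrupts on a 4-CPU host.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 65:    1234567          0          0          0   PCI-MSI-edge      eth3-TxRx-0
 66:         10     998877          0          0   PCI-MSI-edge      eth3-TxRx-1
 67:        500          0          0          0   PCI-MSI-edge      eth0
"""

if __name__ == "__main__":
    totals = irq_counts_per_cpu(SAMPLE, "eth3")
    print(totals, round(cpu0_share(totals), 2))
```

On a live system the text would come from open("/proc/interrupts").read(). A CPU0 share close to 1.0 together with CPU0 at 100% in the syslog suggests that the RX queues are bound only to CPU0.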
Another consideration is to limit the number of RX/TX queues on your NIC. By default it equals the number of cores in the system, which adds a lot of overhead and costs more CPU cycles. Two cores are sufficient for up to 2Gbit of traffic on a 10Gbit Intel card. This is how you can limit it:
modprobe ixgbe DCA=2,2 RSS=2,2
The next important thing is to set the hardware RX ring buffer to its maximum value. You can find the maximum by running:
ethtool -g eth3
For example, for an Intel 10Gbit card the default value is 4092, which can be extended to 16384:
ethtool -G eth3 rx 16384
This prevents missed packets when interrupt spikes occur.
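The two ethtool steps above can be combined. This Python sketch parses the text printed by ethtool -g and builds the matching ethtool -G command. The function names are hypothetical, and the sample output only mirrors the numbers used in the example above; real values depend on the card and driver.

```python
def rx_ring_sizes(ethtool_g_output):
    """Return (maximum, current) RX ring sizes from `ethtool -g` text output."""
    section = None
    sizes = {}
    for line in ethtool_g_output.splitlines():
        text = line.strip()
        if text.startswith("Pre-set maximums"):
            section = "max"
        elif text.startswith("Current hardware settings"):
            section = "current"
        elif text.startswith("RX:") and section is not None:
            sizes[section] = int(text.split(":", 1)[1])
    return sizes["max"], sizes["current"]


def resize_command(ifname, ethtool_g_output):
    """Build the ethtool -G command, or None if RX is already at its maximum."""
    maximum, current = rx_ring_sizes(ethtool_g_output)
    if current >= maximum:
        return None
    return "ethtool -G %s rx %d" % (ifname, maximum)


# Sample output; the numbers are taken from the example in the text.
SAMPLE = """\
Ring parameters for eth3:
Pre-set maximums:
RX:             16384
RX Mini:        0
RX Jumbo:       0
TX:             16384
Current hardware settings:
RX:             4092
RX Mini:        0
RX Jumbo:       0
TX:             4092
"""

if __name__ == "__main__":
    print(resize_command("eth3", SAMPLE))
```

The sketch only builds the command string; running it (for example via subprocess) needs root privileges and is left out on purpose.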
The following picture shows how packets travel from the ethernet card to kernel space and the ethernet driver, which queues packets in the ring buffer. The ring buffer (available since kernel 2.6.32 and libpcap > 1.0) is a packet queue waiting to be fetched by the VoIPmonitor sniffer. It prevents packet loss when the VoIPmonitor process does not get enough CPU cycles from the kernel scheduler. Once the ring buffer is full, a message is logged to syslog that the sniffer lost some packets from the ring buffer. In this case you can increase the ring buffer from the default 50MB to its maximum value of 2000MB.
From the kernel ring buffer, voipmonitor stores packets in its internal heap memory cache, whose size you can control with the packetbuffer_total_maxheap parameter. The default value is 2000MB. This cache is also compressed with the very fast lz4 method, which allows more packets to be stored in the cache (about 50% ratio for G711 calls). The heap cache is important when the sniffer is not able to write packets to pcap files due to slow I/O, which usually occurs when another process wants to read from or write to the disk and the sniffer has to wait for that I/O to finish. If the heap cache is full (which you can see in the syslog as heap[100|0|0]), voipmonitor stops reading packets from the kernel ring buffer. You can enable packetbuffer_file_totalmaxsize, which stores the heap cache to a file that is processed later, once the I/O unblocks and is able to handle all writes. This will not help if the file cache path is on the same I/O storage - you need dedicated storage for it.
Packets from the heap cache are sent to the processing threads, which analyze SIP and RTP. From there packets are either destroyed or continue to the write stage, which has several problems and options:
- by default there is another cache mechanism (whose size you can control with pcap_dump_asyncwrite_maxsize = 100) which does not store raw packets in memory but only the data that is supposed to be written to disk. The advantage is that the sniffer can independently process all packets in real time (from the kernel ring buffer and from the sniffer's internal heap cache) and send processed packets to another in-memory queue. This is especially important when the operating system blocks writes. Without the write cache queue the sniffer has to stop and cannot read packets from the heap cache, so the heap cache grows until I/O allows processing to continue. That introduces a new CPU problem: you can now have 10GB in the heap cache waiting to be processed at full speed, which hits the CPU limit, so the queue shrinks only slowly, and if I/O blocks again the cache fills up quickly. A standalone cache only for packets destined for pcap files overcomes this problem. The default maximum size of this cache is pcap_dump_asyncwrite_maxsize = 100 and we recommend making it bigger. The other advantage is that if you enable zip compression you need less RAM while I/O is blocking writes (keeping packets in the heap buffer, which uses lz4 compression, is slightly worse than compressing pcap files with gzip).
- the default configuration stores each call in a standalone pcap file (which you can open with wireshark and other libpcap-compatible software), split into RTP/call.pcap and SIP/call.pcap. You can disable RTP or SIP storing completely. By default pcap files are compressed with gzip at compression level 6 (recent wireshark can open gzip-compressed pcap files natively). The default pcap compression takes a lot of CPU, and if you have heavy traffic (>1000 simultaneous calls) the compression threads will grow on demand (in syslog they appear as tacCPU[%]). If you do not have enough CPU cores to handle the default compression level 6, you can set pcap_dump_ziplevel to a value between 1 and 9 (1 is the fastest, 6 is the default, 9 is the slowest) or you can turn the compression off completely.
Writing each call to its own file generates a random write workload for the storage.
- an alternative writing strategy is to store SIP and RTP pcap files in one tar file (tar=yes). The tar file can be uncompressed or compressed with gzip, lzma or lz4. The main advantage is a reduced number of IOPS for the disk, because calls are concatenated into single tar files, which reduces random writes and file system overhead. For 2000 concurrent calls, IOPS are around 400 for single pcap files and only 40 when storing to tar files with gzip compression. The next advantage is that cleaning old files is very fast because there are only 120 (SIP+RTP) files per hour. Cleaning and writing thousands of files per second is a challenge for any I/O storage hardware (even for SSD). The main disadvantage is that in this configuration all packets have to stay in memory until the call ends, which means that you need a lot of memory. This memory is compressed with lz4 before it goes to tar, which can reduce it by 50% for the G711 codec. The call has to stay in memory until it ends because calls have to be written to the tar file without fragmentation. The next disadvantage is that to access a pcap file you have to uncompress the tar file and find the file inside it. For a 300MB RTP tar file, getting one file takes around 2 seconds with lz4 compression and 16 seconds with lzma.
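To get a feeling for the memory cost of tar=yes, the figures above can be turned into a rough estimate. This Python sketch is only back-of-the-envelope arithmetic under stated assumptions: the per-call bandwidth of 160 kbit/s is derived from the introduction (10000 calls at about 1.6Gbit), the 50% lz4 ratio is the G711 figure quoted above, and avg_elapsed_seconds (how long the currently active calls have been running on average) is a hypothetical input you have to supply from your own traffic.

```python
def tar_files_per_hour(tars_per_minute=2):
    """One SIP and one RTP tar file per minute gives 120 files per hour."""
    return 60 * tars_per_minute


def tar_mode_memory_bytes(concurrent_calls, avg_elapsed_seconds,
                          bits_per_second_per_call=160_000, lz4_ratio=0.5):
    """Rough RAM needed to keep active calls in memory until they end."""
    bytes_per_second = bits_per_second_per_call / 8
    raw_bytes = concurrent_calls * bytes_per_second * avg_elapsed_seconds
    return raw_bytes * lz4_ratio     # packets are lz4-compressed in memory


if __name__ == "__main__":
    print(tar_files_per_hour(), "tar files per hour")
    estimate = tar_mode_memory_bytes(concurrent_calls=2000, avg_elapsed_seconds=90)
    print(round(estimate / 1e9, 2), "GB")
```

The estimate ignores per-packet bookkeeping overhead and codecs that compress worse than G711, so treat the result as a lower bound rather than a sizing rule.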
If the voipmonitor sniffer is running with at least "-v 1" you can watch several metrics in the syslog:
tail -f /var/log/syslog (on debian/ubuntu)
tail -f /var/log/messages (on redhat/centos)
voipmonitor[15567]: calls[315][355] PS[C:4 S:29/29 R:6354 A:6484] SQLq[C:0 M:0 Cl:0] heap[0|0|0] comp[54] [12.6Mb/s] t0CPU[5.2%] t1CPU[1.2%] t2CPU[0.9%] tacCPU[4.6|3.0|3.7|4.5%] RSS/VSZ[323|752]MB
- voipmonitor[15567] - 15567 is PID of the process
- calls - [X][Y] - X is the number of active calls in voipmonitor memory. Y is the total number of calls in voipmonitor memory (active + queue buffer), including SIP registers
- PS - call/packet counters per second. C: number of calls per second. S: X/Y - X is the number of valid SIP packets per second on SIP ports, Y is the number of all packets on SIP ports. R: number of RTP packets per second belonging to calls registered by voipmonitor. A: all packets per second
- SQLq (SQL queue) - number of SQL statements (INSERTs) waiting to be written to MySQL. If this number is growing, MySQL is not able to keep up. See the innodb_flush_log_at_trx_commit section below
- heap[A|B|C] - A: % of used heap memory. If it reaches 100, voipmonitor is not able to process packets in real time due to CPU or I/O. B: % of used memory in the packetbuffer. C: % used by the async write buffers (at 100% I/O is blocking, the heap will grow, then the ring buffer will fill up and packet loss will occur)
- hoverruns - if this number grows, the heap buffer was completely filled. In this case the primary thread stops reading packets from the ring buffer, and if the ring buffer is full packets will be lost - this occurrence is logged to syslog.
- comp - compression buffer ratio (if enabled)
- [12.6Mb/s] - total network throughput
- t0CPU - %CPU utilization of thread 0. Thread 0 reads packets from the kernel ring buffer. Once it is over 90%, the current setup is hitting the limit of processing packets from the network card. Please write to support@voipmonitor.org if you hit this limit.
- t1CPU - %CPU utilization of thread 1. Thread 1 reads packets from thread 0, adds them to the buffer and compresses them (if enabled).
- t2CPU - %CPU utilization of thread 2. Thread 2 parses all SIP packets. If it is >90%, the sensor is hitting its limit - please contact support@voipmonitor.org.
- tacCPU[N0|N1|N...] - %CPU utilization of the threads compressing pcap files. The number of threads grows dynamically.
- RSS/VSZ[323|752]MB - RSS stands for the resident size, which is an accurate representation of how much physical memory the sniffer is actually consuming. VSZ stands for the virtual size of the process (shown as VIRT in top), which is the sum of the memory it is actually using, memory it has mapped into itself (for instance the video card’s RAM for the X server), files on disk that have been mapped into it (most notably shared libraries), and memory shared with other processes. It represents how much memory the program is able to access at the present moment.
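The status line can also be checked automatically. The following Python sketch extracts a few of the fields described above from one syslog line and flags the conditions mentioned in the list (t0CPU or t2CPU above 90%, heap at 100%). The function names and the returned dictionary layout are assumptions for illustration; the regular expressions are based only on the sample line shown above, and other sniffer versions may log additional or differently formatted fields.

```python
import re


def parse_status(line):
    """Extract calls, heap and CPU fields from one voipmonitor status line."""
    status = {}
    match = re.search(r"calls\[(\d+)\]\[(\d+)\]", line)
    if match:
        status["calls"] = (int(match.group(1)), int(match.group(2)))
    match = re.search(r"heap\[([\d.]+)\|([\d.]+)\|([\d.]+)\]", line)
    if match:
        status["heap"] = tuple(float(value) for value in match.groups())
    for name in ("t0CPU", "t1CPU", "t2CPU"):
        match = re.search(name + r"\[([\d.]+)%\]", line)
        if match:
            status[name] = float(match.group(1))
    match = re.search(r"tacCPU\[([\d.|]+)%\]", line)
    if match:
        status["tacCPU"] = [float(value) for value in match.group(1).split("|")]
    return status


def warnings(status, cpu_limit=90.0):
    """Return human-readable warnings for the limits described in the text."""
    found = []
    for name in ("t0CPU", "t2CPU"):
        if status.get(name, 0.0) > cpu_limit:
            found.append("%s above %.0f%%" % (name, cpu_limit))
    if status.get("heap", (0, 0, 0))[0] >= 100:
        found.append("heap is full")
    return found


LINE = ("voipmonitor[15567]: calls[315][355] PS[C:4 S:29/29 R:6354 A:6484] "
        "SQLq[C:0 M:0 Cl:0] heap[0|0|0] comp[54] [12.6Mb/s] t0CPU[5.2%] "
        "t1CPU[1.2%] t2CPU[0.9%] tacCPU[4.6|3.0|3.7|4.5%] RSS/VSZ[323|752]MB")

if __name__ == "__main__":
    parsed = parse_status(LINE)
    print(parsed)
    print(warnings(parsed))
```

Fed with lines from tail -f, a loop around parse_status and warnings could serve as a simple alerting hook; fields that are missing from a line are simply skipped.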
A good tool for measuring CPU usage is htop: http://htop.sourceforge.net/
Software driver alternatives
- TPACKET_V3 - libpcap 1.5.3 and kernel >= 3.2 support TPACKET_V3, which means that libpcap must be compiled against a recent linux kernel. In our tests we were able to sniff 2Gbit of traffic on a 10Gbit Intel card without special drivers - just using the latest libpcap and kernel. Our latest statically compiled binaries (in the download section) already include TPACKET_V3, so it is used automatically if you are running kernel >= 3.2.
- Direct NIC Access http://www.ntop.org/products/pf_ring/dna/ - We have tried the DNA driver with a stock 1Gbit Intel card, which reduced CPU load from 100% to 20%.
Hardware NIC cards
We have successfully tested 1Gbit and 10Gbit cards from Napatech, which deliver packets to VoIPmonitor at <3% CPU.
I/O bottleneck
For storing up to 200 simultaneous calls (saving all SIP and RTP packets) you do not need to worry much about I/O performance. For storing up to 500 calls your disk must have write cache enabled (some RAID controllers are not set up well for random write scenarios or have write cache disabled altogether). In particular, the default settings of KVM virtual machines do not use write-back cache. For up to 1000 calls you can use ordinary SATA 7.2kRPM disks with NCQ enabled, such as the Western Digital RE4 edition (RE4 matters because it implements NCQ well); we use it in installations that save full SIP+RTP for up to 1000 simultaneous calls. If you have more than 1000 simultaneous calls you can still use an ordinary SATA disk together with the cachedir feature (see below), or you need to look for enterprise hardware RAID with write-back cache.
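The thresholds in the paragraph above can be summarized as a small lookup. This Python sketch merely restates that sizing guidance as code; the function name and the wording of the returned strings are made up for illustration and are not part of VoIPmonitor.

```python
def storage_advice(concurrent_calls):
    """Map a concurrent-call count (full SIP+RTP saving) to the storage guidance."""
    if concurrent_calls <= 200:
        return "no special I/O tuning needed"
    if concurrent_calls <= 500:
        return "make sure the disk or RAID write cache is enabled"
    if concurrent_calls <= 1000:
        return "SATA 7.2kRPM disk with good NCQ is sufficient"
    return "use the cachedir feature or hardware RAID with write-back cache"


if __name__ == "__main__":
    for calls in (150, 400, 900, 3000):
        print(calls, "->", storage_advice(calls))
```

The boundaries are approximate guidance taken from the text, not hard limits; actual results depend on the disks, the controller and the file system settings.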
Since version 10 the sniffer compresses pcap files by default and uses an asynchronous write queue, which can be set to a huge size (GBs of RAM) to help overcome I/O bottlenecks and peaks. Since version 11 the sniffer can be set to concatenate files into single tar files every minute, reducing IOPS from 400 to 40 for 2000 simultaneous calls.
SSD disks are not recommended for pcap storing because of their low durability, but if this is not a concern they are the best option for a random write workload (no special HW RAID is needed).
The VoIPmonitor sniffer produces the worst-case scenario for spinning disks - random writes. The situation gets worse with the ext3/ext4 file systems, which by default use a journal and write metadata, thus adding more I/O writes. ext4 can be tweaked for maximum performance by disabling the journal and applying some other tweaks, at the cost of reliability in case of a system crash. We recommend using a dedicated disk formatted with special ext4 switches. If you cannot use a dedicated disk for storing pcap files, use a dedicated partition formatted with the special tweaks (see below). These tweaks are not needed if you turn on tar=yes and use some compression, which reduces the write workload by a factor of 10.
The fastest filesystem for the voipmonitor spool directory is ext4 with the following tweaks. Assuming your partition is /dev/sda2:
export mydisk=/dev/sda2
mke2fs -t ext4 -O ^has_journal $mydisk
tune2fs -O ^has_journal $mydisk
tune2fs -o journal_data_writeback $mydisk
#add following line to /etc/fstab
/dev/sda2 /var/spool/voipmonitor ext4 errors=remount-ro,noatime,nodiratime,data=writeback,barrier=0 0 0
If your disk is still not able to handle the traffic you can enable tar=yes or the cachedir feature (voipmonitor.conf:cachedir), which stores all files on fast storage that can handle random writes - for example the RAM disk located at /dev/shm (every linux distribution enables it with up to 50% of memory). After a file is closed (the call ends) voipmonitor automatically moves it from this storage to the spooldir directory, located on slower storage, in guaranteed serial order, which eliminates the random write problem. This also makes it possible to use network shares, which are usually too slow for the sniffer to write to directly.
LSI write back cache policy
On many installations the RAID controller is not optimally configured. To check your cache policy run:
megacli -LDGetProp -Cache -L0 -a0
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, No Write Cache if bad BBU
The write-through cache policy has very bad random write performance, so you probably want to change it to the write-back cache policy:
megacli -LDSetProp -WB -L0 -a0
Battery needs replacement
So policy Change to WB will not come into effect immediately
Set Write Policy to WriteBack on Adapter 0, VD 0 (target id: 0) success
Recheck whether the cache was really set to write back. If not, you need to force the write cache even though the battery is bad or missing with this command:
megacli -LDSetProp CachedBadBBU -Lall -aAll
Set Write Cache OK if bad BBU on Adapter 0, VD 0 (target id: 0) success
Set Write Cache OK if bad BBU on Adapter 0, VD 1 (target id: 1) success
And then set the write-back cache again:
megacli -LDSetProp -WB -L0 -a0
Please note that this example assumes you have one logical drive. If you have more, you need to repeat it for all of your virtual disks.
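When there are several virtual disks, the check can be automated by parsing the text printed by megacli -LDGetProp -Cache. The Python sketch below assumes that every virtual disk is reported on one line in the same format as the output shown above; the function names are hypothetical, and the second sample line (VD 1) is invented for illustration.

```python
def write_policies(ldgetprop_output):
    """Return the write policy of every virtual disk listed in the output."""
    policies = []
    for line in ldgetprop_output.splitlines():
        if "Cache Policy:" not in line:
            continue
        settings = line.split("Cache Policy:", 1)[1]
        policies.append(settings.split(",")[0].strip())
    return policies


def disks_needing_write_back(ldgetprop_output):
    """Positions (in output order) of virtual disks still in WriteThrough mode."""
    return [index
            for index, policy in enumerate(write_policies(ldgetprop_output))
            if policy == "WriteThrough"]


# First line copied from the example above, second line is a made-up VD 1.
SAMPLE = (
    "Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, "
    "ReadAheadNone, Direct, No Write Cache if bad BBU\n"
    "Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, "
    "ReadAheadNone, Direct, No Write Cache if bad BBU\n"
)

if __name__ == "__main__":
    print(write_policies(SAMPLE))
    print(disks_needing_write_back(SAMPLE))
```

Each position returned by disks_needing_write_back corresponds to a virtual disk for which the -LDSetProp -WB command has to be repeated.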
MySQL performance
Write performance
Write performance depends a lot on whether the storage is also used for storing pcap files (thus sharing I/O with voipmonitor) and on how MySQL handles writes (the innodb_flush_log_at_trx_commit parameter - see below). Since sniffer version 6, MySQL tables use compression, which doubles write and read performance with almost no CPU cost (although it depends on the CPU type and the amount of traffic).
innodb_flush_log_at_trx_commit
The default value of 1 means that each update transaction commit (or each statement outside of a transaction) needs to flush the log to disk, which is rather expensive, especially if you do not have a battery-backed cache. Many applications are OK with value 2, which means the log is not flushed to disk on commit but only to the OS cache. The log is still flushed to disk every second, so you would normally not lose more than 1-2 seconds' worth of updates. Value 0 is a bit faster but a bit less safe, as you can lose transactions even if only the MySQL server crashes. Value 2 only causes data loss on a full OS crash. If you are importing into or altering the cdr table, it is strongly recommended to temporarily set innodb_flush_log_at_trx_commit = 0, and to turn off the binlog if you are importing CDRs via inserts.
innodb_flush_log_at_trx_commit = 2
compression
MySQL 5.1
set this value in the [mysqld] section of my.cnf:
innodb_file_per_table = 1
MySQL > 5.1
MySQL> set global innodb_file_per_table = 1;
MySQL> set global innodb_file_format = barracuda;
Tune KEY_BLOCK_SIZE
If you choose KEY_BLOCK_SIZE=2 instead of 8, the compression ratio will be about twice as good, but with a CPU penalty on reads. We tested the differences between no compression, 8kb and 2kb block size compression on 700 000 CDRs with the following results (on a single-core system - we do not know how it behaves on multi-core systems). The test query is a select with group by.
No compression - 1.6 seconds
8kb - 1.7 seconds
4kb - 8 seconds
Read performance
Read performance depends on how big the database is, how fast the disk operates and how much memory is allocated for the InnoDB cache. Since sniffer version 7 all large tables are partitioned by day, which reduces the need to allocate a very large cache to get good performance in the GUI. Partitioning is available since MySQL 5.1 and is highly recommended. It also allows old data to be removed instantly by dropping a partition instead of DELETEing rows, which can take hours on very large tables (millions of rows).
innodb_buffer_pool_size
This is a very important variable to tune if you are using InnoDB tables. InnoDB tables are much more sensitive to buffer size than MyISAM. MyISAM may work kind of OK with the default key_buffer_size even with a large data set, but InnoDB will crawl with the default innodb_buffer_pool_size. Also, the InnoDB buffer pool caches both data and index pages, so you do not need to leave space for the OS cache, and values up to 70-80% of memory often make sense for InnoDB-only installations.
We recommend setting this value to 50% of your available RAM: 2GB at least, 8GB is optimal. It all depends on how many CDRs you have per day.
Put this into the [mysqld] section of /etc/mysql/my.cnf (or /etc/my.cnf on redhat/centos):
innodb_buffer_pool_size = 8G
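The 50% rule can be computed from /proc/meminfo. The following Python sketch derives a my.cnf line from the MemTotal value, applying the 2GB minimum mentioned above. The function names and the sample text are assumptions for illustration; on a database-only host a higher fraction may be appropriate.

```python
def mem_total_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo text, converted from kB to MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")


def buffer_pool_setting(meminfo_text, fraction=0.5, minimum_mb=2048):
    """Build the my.cnf line: fraction of RAM, but never below minimum_mb."""
    size_mb = max(int(mem_total_mb(meminfo_text) * fraction), minimum_mb)
    return "innodb_buffer_pool_size = %dM" % size_mb


# Hypothetical excerpt of /proc/meminfo on a host with 16GB of RAM.
SAMPLE = """\
MemTotal:       16777216 kB
MemFree:         1234567 kB
"""

if __name__ == "__main__":
    print(buffer_pool_setting(SAMPLE))
```

On a live system the input would be open("/proc/meminfo").read(). On hosts with less than 4GB of RAM the 2GB minimum exceeds half of the memory, which is a sign that the machine is undersized for the database rather than a value to apply blindly.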
Partitioning
Partitioning is enabled by default since version 7. If you want to benefit from it (which we strongly recommend) you need to start with a clean database - there is no conversion procedure from an old database to a partitioned one. Just create a new database and start voipmonitor with it, and the partitions will be created. You can turn off partitioning by setting cdr_partition = no in voipmonitor.conf.