Performance of Connext 5.3.1 on BeagleBone Black (Perftest Benchmark)


Hello,

I am running RTI Connext 5.3.1 on a cluster of BeagleBone Black (BBB) devices for a project. To benchmark the baseline maximum throughput for different data sample sizes, we used RTI DDS perftest 2.3.2. I have some questions about the observed performance results and would be very grateful if you could help me understand the reasons behind them.

Test Setup: We run the perftest publisher on one BBB device and the perftest subscriber on another BBB device. For a given data sample size, the publisher sends data to the subscriber as fast as it can (we used neither sleep nor spin) for 5 minutes. Each test was repeated 3 times, and the plotted results are the averages across these 3 runs (error bars denote the standard deviation). The test was performed under both the default reliable QoS settings and the best-effort QoS settings (with the -bestEffort command-line parameter).
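The aggregation step (mean and standard deviation over the 3 runs of each data size) can be sketched as below. This is a minimal sketch, not the script used for the plots: the helper name `summarize` and the numbers in `runs_by_data_len` are placeholders, and the sample standard deviation is assumed because the post does not say which variant was used.

```python
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) of the runs for one data size."""
    return statistics.mean(runs), statistics.stdev(runs)


# Placeholder values only, NOT measured results: data length in bytes mapped to
# the throughput reported by each of the 3 runs.
runs_by_data_len = {
    64: [100.0, 110.0, 120.0],
    1024: [50.0, 50.0, 50.0],
}

if __name__ == "__main__":
    for data_len, runs in sorted(runs_by_data_len.items()):
        mean, std = summarize(runs)
        print(f"dataLen={data_len}: mean={mean:.1f} std={std:.1f}")
```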

Reliable Test Configuration: 

Publisher command line parameters: ./perftest_cpp -pub -cpu -noPrintIntervals -nic eth0 -transport UDPv4 -dataLen <dataLength> -batchSize 0 -executionTime 300

Subscriber command line parameters: ./perftest_cpp -sub -cpu -noPrintIntervals -nic eth0 -transport UDPv4 

Best Effort Test Configuration: 

Publisher command line parameters: ./perftest_cpp -pub -cpu -noPrintIntervals -nic eth0 -transport UDPv4 -dataLen <dataLength> -batchSize 0 -executionTime 300 -bestEffort

Subscriber command line parameters: ./perftest_cpp -sub -cpu -noPrintIntervals -nic eth0 -transport UDPv4 -bestEffort
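A sweep over data sizes with the command lines above could be generated roughly as follows. This is only a sketch: the helper name `build_commands` and the sizes in `DATA_LENGTHS` are hypothetical (the post does not list the sizes tested), and only the perftest flags already shown above are used.

```python
import shlex

# Flags shared by publisher and subscriber, copied from the command lines above.
SHARED_FLAGS = ["-cpu", "-noPrintIntervals", "-nic", "eth0", "-transport", "UDPv4"]

# Hypothetical sweep values in bytes; the sizes actually tested are not given.
DATA_LENGTHS = [64, 1024, 8192]


def build_commands(data_len, best_effort=False, execution_time=300):
    """Return (publisher_cmd, subscriber_cmd) as argument lists for one run."""
    pub = ["./perftest_cpp", "-pub"] + SHARED_FLAGS + [
        "-dataLen", str(data_len),
        "-batchSize", "0",
        "-executionTime", str(execution_time),
    ]
    sub = ["./perftest_cpp", "-sub"] + SHARED_FLAGS
    if best_effort:
        # Both sides get the flag, as in the best-effort configuration above.
        pub.append("-bestEffort")
        sub.append("-bestEffort")
    return pub, sub


if __name__ == "__main__":
    for data_len in DATA_LENGTHS:
        for best_effort in (False, True):
            pub, sub = build_commands(data_len, best_effort)
            print(shlex.join(pub))
            print(shlex.join(sub))
```

The script only prints the command lines; starting the two processes on the two boards is left out on purpose.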

Questions about observed results:

1. The perftest publisher does not use sleep/spin and sends data as fast as it can, yet we are not able to saturate the CPU. The attached graph cpu_pub.png shows the CPU utilization of perftest on the publisher side, and cpu_sub.png shows the CPU utilization on the subscriber side. I would like to understand which resource is the bottleneck that throttles the publisher and thereby limits the maximum observed throughput (graph: throughput_pks). This behavior is observed even in the bestEffort configuration.

2. Why does the CPU utilization of both publisher and subscriber decrease as dataLength increases? I understand that the throughput in packets/second decreases because larger messages take longer to send, which may affect the CPU utilization, but are there other reasons behind the observed trend?

 

Thank you for your time and help. 

Shweta 

Hi Shweta,
 
Thank you for all the information that you have provided.
Addressing your questions:
 
    1. Possible reasons why RTI Perftest is not using the whole CPU:
            1a. The write loop runs in a single thread. This thread will use 100% of one core of the machine, but not the other cores. This would explain why you don't see the whole CPU being used.
            Could you check whether one core of your CPU is at 100%?

            1b. The bottleneck is the network card (I suspect the boards you are running your test on do not have a very fast NIC). When write() is called, the sample is sent through a socket, and that send is a blocking operation.
    
    2.  As I just mentioned, the write() operation in the writer copies the sample you are about to send and then tries to send() it through a socket. That socket operation is blocking.

        For small sizes, write() takes less time (both the copy and the send()), so the loop that sends the samples iterates more often and more CPU is used.
        For large sizes, write() and the data copies take more time, so the thread spends more of its time blocked and CPU usage is lower.

        In conclusion, when the middleware writes large data, the write() call blocks for longer, so CPU usage is lower.
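
The reasoning above can be illustrated with a toy model: each write() costs some CPU time (a fixed part plus a per-byte copy) and then blocks while the payload goes out on the wire. This is only a sketch under stated assumptions. The link rate and both CPU cost constants are made-up illustrative values, not measurements from the BBB, and protocol headers, fragmentation and reliability traffic are ignored.

```python
# All constants are illustrative assumptions, not measured values.
LINK_BPS = 100e6         # assumed usable link rate, bits per second
FIXED_CPU_S = 20e-6      # assumed fixed CPU cost of one write() call, seconds
COPY_S_PER_BYTE = 2e-9   # assumed CPU cost of copying one byte, seconds


def model(data_len_bytes, link_bps=LINK_BPS, fixed_cpu_s=FIXED_CPU_S,
          copy_s_per_byte=COPY_S_PER_BYTE):
    """Return (samples_per_second, cpu_utilization) for one payload size."""
    # Time the writer thread is actually running on the CPU per sample.
    cpu_s = fixed_cpu_s + copy_s_per_byte * data_len_bytes
    # Time the thread is assumed to sit blocked in send() per sample.
    blocked_s = data_len_bytes * 8 / link_bps
    period_s = cpu_s + blocked_s
    return 1.0 / period_s, cpu_s / period_s


if __name__ == "__main__":
    for n in (64, 1024, 8192, 63000):
        rate, util = model(n)
        print(f"dataLen={n:6d}  samples/s={rate:10.0f}  cpu={util:6.1%}")
```

In this model both the sample rate and the CPU utilization fall as the payload grows, because the blocked time grows faster than the CPU time.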
 
Does this match what you see?
Please let me know if you have any other questions.
 
Best,
Antonio