RTI Connext + DDSI Maximum Throughput

Joined: 09/05/2019
Posts: 4

Hello,

I would like to use DDS to exchange data with small effective payload (< 30 bytes) at a high rate (more than 2000 samples produced every 10 ms). Interoperability is mandatory in my case so I cannot use the batching feature of RTI Connext.

I have used the rtiperftest framework to evaluate the maximum throughput I could achieve, and the limit seems to be around 100,000 samples per second. My tests were performed using two high-end workstations with a Gigabit Ethernet connection. I configured the publishing application to be asynchronous and the subscribing application to use a separate thread (no Listeners).

Could you confirm that 100,000 samples per second is the maximum throughput attainable with RTI Connext without batching? If not, is there any standard way to increase this number?

Regards,

Thibault Brezillon

Joined: 02/15/2013
Posts: 17

Hi,

I'm sure there is no fixed limit. But you are using a lot of resources (CPU, RAM, network interface, cables, switches, etc.), and all of them have limits. Your limit is most likely the number of packets your network stack can handle per second. You did not specify the operating system, but sending a huge amount of small packets is the best way to overload any communication link. Also, your 30 bytes do not sound like much, but the overhead of RTPS2, UDP and IP adds up to about 250 bytes on the wire.

You also did not specify whether you are using 'best effort' or 'reliable' (with this number of packets, only 'best effort' would make sense to me).

What do you mean by 'interoperability is mandatory', and why can't you use batching?

There are some things you can test. First, I would recommend using 'iperf' to test the throughput of your network using UDP packets with a payload size of about 200 bytes. This will tell you the limit of your machine/network/cables/switches/etc. I doubt the result will be significantly above 100,000 packets/second.
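
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It takes the rate from the original question (2000 samples every 10 ms) and the approximate 250 bytes per packet estimated above. Both the 250-byte figure and the nominal 1 Gb/s link rate are assumptions, and Ethernet preamble and inter-frame gap are ignored.

```python
# Rough rate arithmetic for one sample per packet (no batching).
SAMPLES_PER_BURST = 2000          # from the question: 2000 samples ...
BURSTS_PER_SECOND = 100           # ... every 10 ms
WIRE_BYTES_PER_SAMPLE = 250       # assumed on-the-wire size per packet (rough estimate)
LINK_BITS_PER_SECOND = 1_000_000_000  # nominal Gigabit Ethernet (assumption)

# Packets (= samples) the publisher must emit per second.
required_samples_per_s = SAMPLES_PER_BURST * BURSTS_PER_SECOND

# Bandwidth those packets would occupy on the wire.
required_bits_per_s = required_samples_per_s * WIRE_BYTES_PER_SAMPLE * 8

# Upper bound on packets per second if only link bandwidth mattered.
link_limit_packets_per_s = LINK_BITS_PER_SECOND // (WIRE_BYTES_PER_SAMPLE * 8)

print(f"required rate      : {required_samples_per_s} samples/s")
print(f"required bandwidth : {required_bits_per_s / 1e6:.0f} Mb/s")
print(f"link-only ceiling  : {link_limit_packets_per_s} packets/s")
```

Under these assumptions the link itself would not be saturated, which is why the per-packet processing cost in the network stack is the more likely limit.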

Regards

Josef

jmorales
Joined: 08/28/2013
Posts: 35

Hi Thibault,

I agree with Josef's answer. Allow me to add a few more comments:

With the latest version of RTI Perftest (3.0) you can use the -rawTransport option, which sends using raw sockets instead of the RTI Connext DDS middleware (it still uses the same serialization we use). With that, and making sure you use "-batchSize 0", you should be able to determine the maximum throughput achievable for the data size of your samples.

Now, regarding RTI Perftest using Connext DDS (its default behavior): I see you mentioned you used asynchronous publishing. I think I understand why you want to use it, but in general we recommend synchronous publishing for small data samples. With asynchronous publishing you need an extra copy of the sample and a different thread does the sending, which slows the system down a bit. I would also suggest using listeners in perftest (its default behavior), as that gives slightly better performance as well.

It is also important to know (as Josef mentioned) whether you really need strict reliability or whether Best-Effort is an option, since Best-Effort will give you an extra boost.

I would try something like:

For Raw Transport:

Pub:

perftest_cpp -pub -datalen 28 -nic <publisher_ip> -peer <subscriber_ip> -raw -batchSize 0 -exec 15 -noPrint

Sub:

perftest_cpp -sub -nic <subscriber_ip> -peer <publisher_ip> -raw

For RTI Connext DDS

Pub:

perftest_cpp -pub -datalen 28 -nic <publisher_ip> -best -batchSize 0 -exec 15 -noPrint

Sub:

perftest_cpp -sub -nic <subscriber_ip> -best
Joined: 05/27/2014
Posts: 3

Throughput will always be platform specific. If you get comparable throughput using raw UDP, it is a good bet that you have accurate results.

In your case the performance bottleneck will be the receive processing required by the operating system for all those very small packets on the DataReader side of the interface. This is why batching/aggregation can have such a large effect on throughput and can even improve average latency under load in such cases.

If you are not using keyed data, it is fairly simple to batch data in application space: use a bounded sequence of your small data type (enclosed in a struct) as the actual message type and "batch" the data yourself. This solution would maintain interoperability.

Keys may not be used in this pattern because keyed elements are not allowed to be inside a variable or optional data element.  This is the primary reason why RTI added a batching capability (enabling use of keys while aggregating small data types), in addition to ease of use.

 

Joined: 09/05/2019
Posts: 4

Hi

First thanks a lot for your quick answers.

The tests were performed with Reliable. Strict reliability is necessary for my use case.

Regarding my test environment: I can reach the maximum throughput of my Gigabit Ethernet network (>950 Mb/s measured with iperf). The network is apparently not the bottleneck in my environment: no more than 200 Mb/s of Ethernet traffic, no packet loss, and large ingress and egress buffers that are never full. The bottleneck in my tests is the CPU and the processing of a lot of small samples. For the OS, I am using Linux with the RT patch. I am using RTI Connext 6.0.0.

"Interoperability is mandatory for me" means that I cannot use non-standard means, as I need to be able to communicate with other implementations of DDS (namely OpenSplice). From what I understand, using batching would render my application non-interoperable. By the way, I ran my performance tests with batching, which gave me results that would be suitable for my use case (>600,000 samples per second).

Here are some results from the perftest framework:

- Raw, batchSize=0, datalen=28, synchronous: ~620,000 samples/sec (0.1% loss)

- Cpp03, RTI Connext DDS, batchSize=0, datalen=28, best effort, synchronous: ~220,000 samples/sec (~10% loss)

- Cpp, RTI Connext DDS, batchSize=0, datalen=28, best effort, synchronous: ~240,000 samples/sec  (~5% loss)

- Cpp03 and Cpp, RTI Connext DDS, batchSize=0, datalen=28, reliable, synchronous, readThread and listener: ~120,000 samples/sec (0% loss)

- Cpp03 and Cpp, RTI Connext DDS, batchSize=0, datalen=28, reliable, asynchronous, readThread and listener: ~110,000 samples/sec (0% loss)
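
For comparison, these figures can be turned into a per-sample time budget. The short Python sketch below uses the approximate rates listed above and the 200,000 samples/s target implied by 2000 samples every 10 ms. The labels are shorthand for this example, not perftest output.

```python
# Per-sample time budget implied by each measured rate (approximate values).
REQUIRED_SAMPLES_PER_S = 200_000  # 2000 samples every 10 ms

measured_samples_per_s = {
    "raw, synchronous": 620_000,
    "Connext best effort (Cpp)": 240_000,
    "Connext reliable, synchronous": 120_000,
    "Connext reliable, asynchronous": 110_000,
}


def budget_us(samples_per_s: int) -> float:
    """Average time available per sample, in microseconds."""
    return 1_000_000 / samples_per_s


for name, rate in measured_samples_per_s.items():
    print(f"{name:32s} {budget_us(rate):5.2f} us/sample, "
          f"{rate / REQUIRED_SAMPLES_PER_S:.2f}x of the target rate")
```

In other words, the target leaves about 5 microseconds per sample, while the reliable configurations currently spend roughly 8 to 9 microseconds per sample.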

 

I hope that gives you a better understanding.

Thibault Brezillon

jmorales
Joined: 08/28/2013
Posts: 35

Hi Thibault,

I am glad to see that with batching, at least, you achieve the performance you want, although I understand this feature is RTI's and not part of the DDS standard. I ran some tests myself in our lab and found results similar to yours. I also agree that the bottleneck is the CPU on both sides, trying to build, send and process samples as fast as possible.

What Mark suggests is basically batching at the application level. That adds a bit of extra logic to your send and receive paths, but I believe it is a good way to achieve your performance goals. The main issue you will need to take care of is the maximum time your samples can wait before being sent: you would send batches dynamically based on the number of samples, the time elapsed since the first sample arrived, and the maximum size of the resulting packet.
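
As a minimal sketch of that flush policy (not Connext API code), the Python class below accumulates serialized samples and flushes on whichever limit is hit first: sample count, total size, or the age of the oldest pending sample. The class name, the `send` callback and the default limits are all hypothetical. In a real application `send` would write one message containing the whole batch.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SampleBatcher:
    """Application-level batcher; all names and defaults are illustrative."""
    send: Callable[[List[bytes]], None]   # hypothetical hook that publishes one batch
    max_samples: int = 100                # flush when this many samples are pending
    max_bytes: int = 1400                 # flush before the batch exceeds this size
    max_delay_s: float = 0.001            # flush when the oldest sample is this old
    clock: Callable[[], float] = time.monotonic
    _pending: List[bytes] = field(default_factory=list)
    _pending_bytes: int = 0
    _first_at: Optional[float] = None

    def add(self, sample: bytes) -> None:
        # Flush first if this sample would push the batch past the size limit.
        if self._pending and self._pending_bytes + len(sample) > self.max_bytes:
            self.flush()
        if not self._pending:
            self._first_at = self.clock()  # remember when the batch started
        self._pending.append(sample)
        self._pending_bytes += len(sample)
        if len(self._pending) >= self.max_samples:
            self.flush()

    def poll(self) -> None:
        # Call periodically: flushes if the oldest sample has waited too long.
        if self._pending and self.clock() - self._first_at >= self.max_delay_s:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._pending_bytes = 0
            self._first_at = None
            self.send(batch)
```

The time-based flush only happens when `poll()` is called, so the worst-case wait is `max_delay_s` plus the polling interval. A timer thread would work too, at the cost of extra locking.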

Hope this helps!

PS: I will test on my side using Asynchronous Publishing and a custom Flow Controller. Is this something you tried already?

Joined: 09/05/2019
Posts: 4

Hi,

Thanks. This is always good to get some confirmation.

I tried several things with Flow Controllers (DEFAULT, ON_DEMAND and different flow control parameters) but couldn't manage to get better performance. Basically, I had in mind that asynchronous publication would give RTI more opportunities to batch things and better optimize the writes, but it doesn't seem to work that way. I also thought that I could delegate most of the work to the writer thread and reduce the load on the main thread (which performs the writes), but again I may be missing something, because I never seem to use more than 100% of one CPU (overall).

Joined: 05/27/2014
Posts: 3

Hey, a few more comments from the "peanut gallery"....

Async pub without batching does not generally aggregate data (and adds a thread/context switch to the write() path).

In addition to batching data in the application, there is another "old school" method of aggregating data: set the "push_on_write" QoS setting to false for the data writer and use reliability heartbeats to effectively "pull" the data from the writer to the reader. While this is technically interoperable, I have doubts whether other DDS implementations give you the level of control over heartbeats needed to accomplish it. Unlike batching, which uses one RTPS header per batch, this alternative includes an RTPS header per message (it is essentially a reliable retransmission), which is another reason we implemented batching.

In the application batching/aggregation method, your message would look something like the one below (please excuse typos), and you would add the logic to fill and unpack the buffer, as Javi mentioned. Remember this only works if there are no keys in My_Small_Data_Type.

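const long MAX_MESSAGE_ELEMENTS = 100;

// Small payload type; it must not contain any key members.
@nested struct My_Small_Data_Type {
  long test_element;
};

// One published sample carries up to MAX_MESSAGE_ELEMENTS small elements.
struct My_Message_Type {
  // "elements" is an illustrative member name.
  sequence<My_Small_Data_Type, MAX_MESSAGE_ELEMENTS> elements;
};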

Joined: 09/05/2019
Posts: 4

Hi,

I'll have a look into this "push_on_write" setting. It seems interesting.

Thanks for the small snippet. I had that in mind as well, but unfortunately our system relies on keyed topics and on having different queues per instance. Furthermore, we have a lot of instances and not "that many" samples per instance. Using "user-batching" would mean that we need to define our own queue dispatching, which is something we are trying to avoid. (I say "define" because I am part of a consortium whose goal is to define a standard on top of DDS/DDSI for Avionics/Automotive Test Bench interconnection.)

Do you know if there is any plan at RTI to improve sample throughput performance using standard DDSI?

sara
Joined: 01/16/2013
Posts: 122

Hi Thibault,

I'd like to better understand your use case and look at alternatives to help you design your system for interoperability while optimizing performance. By your name, I'm guessing you're located in Europe :) Feel free to send me a note at sara AT rti DOT com and we can set up a call to review this together.

Thanks,
Sara