Achieving low-jitter performance with Connext Pro

General guidance

Achieving low jitter, deterministic write times, and real-time performance requires in-depth tuning at both the OS layer and the DDS layer. It typically involves applying a long list of recommendations; similar checklists can be found in advice published online.

This article summarizes best practices, and we recommend exploring and testing all of the recommendations listed here. Note that any specific numbers shown for QoS settings are purely examples and may not be correct for your system. As always, if you need advice on tuning specific values, please contact RTI Support.

Connext Version

For reducing jitter, we recommend the latest Connext release, which will contain the latest improvements and bug fixes related to real-time performance.

Sleeping vs Spinning

To avoid unnecessary context switching, avoid calling sleep() in a time-critical process. Instead, spin (busy-wait), which avoids yielding the CPU.
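For example, a minimal sketch of a spin-wait in C++ (spin_until is a hypothetical helper, not a Connext API):

#include <chrono>

// Busy-wait (spin) until the deadline instead of calling sleep().
// The thread never yields the CPU, so there is no context switch when
// the deadline arrives, at the cost of burning cycles on its core.
inline void spin_until(std::chrono::steady_clock::time_point deadline)
{
    while (std::chrono::steady_clock::now() < deadline) {
        // Intentionally empty: do not sleep or yield here.
    }
}

// Usage: wait ~1 ms before the next write without yielding:
// spin_until(std::chrono::steady_clock::now() + std::chrono::milliseconds(1));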

One publisher per writer

To reduce mutex contention in the framework, we recommend creating one DDS Publisher entity per DataWriter entity, as shown below:

// Each DataWriter gets its own Publisher, so each writer has its own
// Exclusive Area (mutex) and does not contend with other writers.
dds::pub::Publisher publisher1(participant);
dds::pub::DataWriter<test> writer1(publisher1, topic);

dds::pub::Publisher publisher2(participant);
dds::pub::DataWriter<test> writer2(publisher2, topic);

Each publisher has its own mutex (called an Exclusive Area in Connext) that all DataWriters created within that Publisher share. 

One subscriber per reader

To reduce mutex contention in the framework, we recommend creating one DDS Subscriber entity per DataReader entity, as shown below:

// Each DataReader gets its own Subscriber, so each reader has its own
// Exclusive Area (mutex) and does not contend with other readers.
dds::sub::Subscriber subscriber1(participant);
dds::sub::DataReader<test> reader1(subscriber1, topic);

dds::sub::Subscriber subscriber2(participant);
dds::sub::DataReader<test> reader2(subscriber2, topic);

Each subscriber has its own mutex (called an Exclusive Area in Connext) that all DataReaders created within that Subscriber share. 

Socket buffer sizes

If UDP is being used: for Linux-like OSes, you must first set wmem_max and rmem_max in your kernel to accommodate the maximum socket buffer sizes you want Connext to be capable of using. See this article (or for QNX, this article), in particular the "Size of kernel receive and send socket buffers" section. As a starting point, set both rmem_max and wmem_max to accommodate roughly 1 second's worth of data sent/received by the participant. This is overkill, but it gives a large margin of safety, and you can back it off from there. For example, if you're sending 12 MB/sec, you can set wmem_max and rmem_max to 12 MB.
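For example, on Linux this could look like the following (a sketch using the 12 MB/sec figure above; persist the settings in /etc/sysctl.conf so they survive reboots):

# Raise the kernel's maximum receive and send socket buffer sizes
sysctl -w net.core.rmem_max=12000000
sysctl -w net.core.wmem_max=12000000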

For Windows, there is no need to configure the kernel.

Next, configure Connext to set the socket buffer sizes:

<participant_qos>
    <transport_builtin>
        <udpv4>
            <send_socket_buffer_size>12000000</send_socket_buffer_size>
            <recv_socket_buffer_size>12000000</recv_socket_buffer_size>
        </udpv4>
    </transport_builtin>
</participant_qos>

Shared memory transport configuration

If Shared Memory is being used, configure the following transport level settings:

<participant_qos>
    <transport_builtin>
        <shmem>
            <message_size_max>X</message_size_max>
            <received_message_count_max>Y</received_message_count_max>
            <receive_buffer_size>Z</receive_buffer_size>
        </shmem>
    </transport_builtin>
</participant_qos>
message_size_max: X should be an integer number of bytes representing the maximum size of an RTPS message you can send or receive on the transport. The default, 65536, is usually fine, but if any of your types send samples larger than 65536 bytes, you will need to increase X to accommodate the larger samples.

received_message_count_max: Y should be set to an integer number of messages that can be buffered in the shared memory receive queue. The default is 64, but if you're sending data at a high rate, e.g. if you have multiple writers sending samples every 1 ms, then 64 may not be sufficient. You can experiment with values such as 128, 256, 512, or 1024 to see whether they have a positive or negative impact.

receive_buffer_size: Z is the total number of bytes that can be buffered in the receive queue. The default is 1 MB, but if you change X or Y from their defaults, you should calculate Z by multiplying X and Y together.
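For example, a sketch that keeps the default X and raises Y to 256 (the numbers are illustrative; size them for your own data rates):

<participant_qos>
    <transport_builtin>
        <shmem>
            <!-- X: maximum RTPS message size (default) -->
            <message_size_max>65536</message_size_max>
            <!-- Y: buffer up to 256 messages -->
            <received_message_count_max>256</received_message_count_max>
            <!-- Z = X * Y = 65536 * 256 -->
            <receive_buffer_size>16777216</receive_buffer_size>
        </shmem>
    </transport_builtin>
</participant_qos>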

CPU pinning and isolation

This is a platform dependent configuration that must be done at the OS level. See this article for more details and read the bullet points related to “Avoiding CPU migrations” and “Isolating the CPU core being used.”

Our recommendation is to completely isolate each time-critical application on its own core, with no other processes running on that core. The advice of running just one process per core is mirrored in online resources, e.g. here.
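For example, on Linux a time-critical thread can be pinned to a specific core using the POSIX affinity API (a minimal sketch; the core number is arbitrary and error handling is elided):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given CPU core so the scheduler cannot
// migrate it. Combine this with OS-level core isolation (e.g. the
// isolcpus kernel parameter) so nothing else runs on that core.
void pin_current_thread_to_core(int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Usage (early in the time-critical thread):
// pin_current_thread_to_core(3);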

Thread priority

Configure the internal Connext threads to all use the same priority:

<participant_qos>
    <receiver_pool>
        <thread>
            <priority>X</priority>
            <mask>DDS_THREAD_SETTINGS_REALTIME_PRIORITY</mask>
        </thread>
    </receiver_pool>

    <database>
        <thread>
            <priority>X</priority>
            <mask>DDS_THREAD_SETTINGS_REALTIME_PRIORITY</mask>
        </thread>
    </database>

    <event>
        <thread>
            <priority>X</priority>
            <mask>DDS_THREAD_SETTINGS_REALTIME_PRIORITY</mask>
        </thread>
    </event>
</participant_qos>

The value of X in the snippet above will depend on your system; the key point is to give all internal Connext threads the same priority. Keeping all internal threads at the same priority ensures round-robin scheduling and prevents any one thread from hogging the CPU.

If the network stack of the OS (e.g. io-pkt on QNX) has a configurable priority, it should be higher than that of the internal Connext threads, because the internal Connext threads rely on the network stack for sending and receiving data. In turn, the internal Connext threads should have higher priority than application threads making Connext API calls, because those application threads rely on the internal threads for data to be ready at the application level.

TRANSPORT_PRIORITY

Out of the box, all DataWriters (both discovery and user writers) within a participant share a send socket, but you can use the TRANSPORT_PRIORITY QoS to separate time-critical data onto its own send socket.

To use this QoS, you can set it per-DataWriter as follows:

<datawriter_qos>
    <transport_priority>
        <value>42</value>
    </transport_priority>
</datawriter_qos>

Note that the specific value you pick (e.g. 42 in this example) doesn't really matter. What is important is that you use a unique value for each DataWriter that you want to have its own socket. (Values do not need to be unique across DomainParticipants – just within a participant.)

TRANSPORT_UNICAST

To reduce thread and socket level contention in UDP, the TRANSPORT_UNICAST QoS can be applied. Similarly, in Shared Memory, the TRANSPORT_UNICAST policy allows each reader to have its own shared memory segment and shared memory semaphore. There are a few ways to consider using this QoS on the DataReader side:

  1. You can configure each individual DataReader to set a different receive_port in the TRANSPORT_UNICAST QoS policy in order to give each reader its own thread and socket (though this is expensive); see the reader-side example after this list.

  2. Alternatively, you can configure receive_port for certain readers (e.g. readers on large-data topics) to give those readers special treatment, and leave the other readers at the default (i.e. don't set TRANSPORT_UNICAST for them). This configuration helps separate out a large-data topic so it doesn't hog the thread and socket used by readers on other topics.

  3. Alternatively, you can create groups of DataReaders that share the same receive_port in order to give groups of readers their own dedicated thread and socket.
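For example, on the DataReader side (the port number 7301 is arbitrary; see the port guidance at the end of this section). Readers that share this value will share a thread and socket:

<datareader_qos>
    <unicast>
        <value>
            <element>
                <receive_port>7301</receive_port>
            </element>
        </value>
    </unicast>
</datareader_qos>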

You can also set TRANSPORT_UNICAST on the DataWriter in order to create an additional receive thread to process reliability traffic. 

To configure this on the DataWriter in XML:

<datawriter_qos>
    <unicast>
        <value>
            <element>
                <receive_port>7300</receive_port>
            </element>
        </value>
    </unicast>
</datawriter_qos>

It is important to choose a unique port number each time you want a new thread/socket, and to reuse a port number when you want to share a thread/socket.

We recommend picking values outside of the ports used by your domain ID; e.g. port numbers below 7400 are not used by Connext by default. For more details on which port numbers Connext uses, see this article.

Eliminate unnecessary transports and threads

If the IP address of a DomainParticipant is not expected to change at runtime, then you can likely disable the IP mobility interface tracking thread:

<participant_qos>
    <transport_builtin>
        <udpv4>
            <disable_interface_tracking>true</disable_interface_tracking>
        </udpv4>
    </transport_builtin>
</participant_qos>

If you need both SHMEM and UDP, you can move on to the next step; but if you can operate with just one of these two transports, you can turn the unneeded transport off, which prevents the creation of unnecessary receive threads. For example, to disable SHMEM and leave only UDP enabled:

<participant_qos>
    <transport_builtin>
        <mask>UDPv4</mask>
    </transport_builtin>
</participant_qos>

Eliminate avoidable memory allocations

You can increase initial_samples to "preallocate" a larger pool of samples. The value you pick depends on how much memory you want to preallocate, but if you already have max_samples/max_instances set, you can simply set initial_samples = max_samples and initial_instances = max_instances. This reduces unnecessary runtime allocations. These settings should be applied on both the reader and the writer:

<datareader_qos>
    <resource_limits>
        <initial_samples>2048</initial_samples>
        <initial_instances>2048</initial_instances>
    </resource_limits>
</datareader_qos>

<datawriter_qos>
    <resource_limits>
        <initial_samples>2048</initial_samples>
        <initial_instances>2048</initial_instances>
    </resource_limits>
</datawriter_qos>

Note that 2048 is just an example. You should choose a value that aligns with the maximum number of samples/instances you want the reader/writer to manage at any given time.

You can also experiment with preallocating memory for discovery-related allocations:

  • DomainParticipantResourceLimits:
    • remote_writer_allocation, remote_reader_allocation, remote_participant_allocation, matching_writer_reader_pair_allocation, matching_reader_writer_pair_allocation, serialized_type_object_dynamic_allocation_threshold
  • Database QoS policy:
    • initial_records
  • Discovery Config builtin reader resource limits:
    • initial_samples

For example:

<participant_qos>
    <resource_limits>
        <remote_participant_allocation>
            <initial_count>32</initial_count>
        </remote_participant_allocation>

        <remote_reader_allocation>
            <initial_count>128</initial_count>
        </remote_reader_allocation>

        <remote_writer_allocation>
            <initial_count>128</initial_count>
        </remote_writer_allocation>

        <matching_writer_reader_pair_allocation>
            <initial_count>64</initial_count>
        </matching_writer_reader_pair_allocation>

        <matching_reader_writer_pair_allocation>
            <initial_count>64</initial_count>
        </matching_reader_writer_pair_allocation>

        <serialized_type_object_dynamic_allocation_threshold>65535</serialized_type_object_dynamic_allocation_threshold>
    </resource_limits>

    <database>
        <initial_records>2048</initial_records>
    </database>

    <discovery_config>
        <participant_reader_resource_limits>
            <initial_samples>128</initial_samples>
        </participant_reader_resource_limits>

        <publication_reader_resource_limits>
            <initial_samples>128</initial_samples>
        </publication_reader_resource_limits>

        <subscription_reader_resource_limits>
            <initial_samples>128</initial_samples>
        </subscription_reader_resource_limits>
    </discovery_config>

</participant_qos>

Note that these values are examples, so you should pick values that correspond to your real system. For example, remote_participant_allocation should correspond to the number of remote participants you expect to discover.

Eliminate unnecessary sequences/strings/optionals in the type

Sequences, strings (not including char arrays), and optional members in types can cause dynamic allocations, and are not recommended for real-time performance.
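For example, a hypothetical IDL type that avoids these constructs by using fixed-size members (the names and bounds are illustrative):

// Fixed-size members avoid dynamic allocation at runtime.
struct SensorSample {
    long sensor_id;
    char name[32];      // fixed-size char array instead of a string
    double values[16];  // fixed-size array instead of a sequence<double>
};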

Reliability protocol tuning

You can set min_nack_response_delay and max_nack_response_delay to 0 in order to ensure that repairs for reliable traffic are sent directly from the receive thread. This reduces the delay in receiving repair data. If the delay is greater than 0, the event thread must send the repairs, which requires a context switch.

<datawriter_qos>
    <protocol>
        <rtps_reliable_writer>
            <min_nack_response_delay>
                <sec>DURATION_ZERO_SEC</sec>
                <nanosec>DURATION_ZERO_NSEC</nanosec>
            </min_nack_response_delay>

            <max_nack_response_delay>
                <sec>DURATION_ZERO_SEC</sec>
                <nanosec>DURATION_ZERO_NSEC</nanosec>
            </max_nack_response_delay>
        </rtps_reliable_writer>
    </protocol>
</datawriter_qos>

Non-blocking sockets

You can experiment with using non-blocking sockets. Note that non-blocking sockets are not common in RTI’s customer base (the vast majority of customers use the default blocking sockets). The behavior of non-blocking sockets may be desirable for use cases where timing is critical, but it should be tested carefully to ensure that it performs well. The potential disadvantage of non-blocking sockets is that if a large number of packets can’t be sent due to an OS-level issue (e.g. a socket buffer being full), Connext is no longer “notified” or “throttled” by the OS blocking, which can cause a repair storm for reliable data. This performance degradation is not guaranteed to happen, but it is a possibility you should be aware of.

<participant_qos>
    <transport_builtin>
        <udpv4>
            <send_blocking>TRANSPORT_BLOCKING_NEVER</send_blocking>
        </udpv4>
    </transport_builtin>
</participant_qos>

Routing Service: Sessions

Routing Service uses Sessions for managing route events (see here). To increase concurrency in Routing Service event processing, we recommend splitting routes across multiple Sessions and configuring each Session's thread pool with an appropriate number of threads, as sketched below.
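For example, a sketch of a Session with a small thread pool (the <thread_pool> element and surrounding structure are assumptions based on recent Connext versions; check the Routing Service documentation for your release):

<domain_route name="DomainRoute">
    <session name="Session1">
        <!-- Assumed element: number of threads serving this Session -->
        <thread_pool>
            <size>4</size>
        </thread_pool>
        <topic_route name="RouteA">
            <!-- input/output definitions elided -->
        </topic_route>
    </session>
</domain_route>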

For more information about configuring Sessions, see here.

Replay Service Playback: Indexing

Playback of large topics can take a while to start if recordings are not indexed, which may result in startup jitter. We recommend using the builtin Indexing Application to mitigate this.