Reliable Messaging

Packets sent by middleware may be lost by the physical network or dropped by routers, switches and even the operating system of the subscribing applications when buffers become full. In reliable messaging, the middleware keeps track of whether or not the data it sent has been received by subscribing applications, and resends data that was lost in transmission.

Like most reliable protocols (including TCP), the reliability protocol used by RTI uses additional packets on the network, called metadata, to know when user data packets are lost and need to be resent. RTI offers the user a comprehensive set of tunable parameters that control how many and how often metadata packets are sent, how much memory is used for internal buffers that help overcome intermittent data losses, and how to detect and respond to a reliable subscriber that either falls behind or otherwise disconnects.

When applications exchange messages reliably, there is always a trade-off between performance and memory requirements. When strictly reliable communication is enabled, Connext DDS keeps every written DDS sample in an internal buffer until all known reliable subscribers acknowledge receiving it.

Note: Connext DDS also supports reliability based only on negative acknowledgements ("NACK-only reliability"). This feature is described in detail in the User's Manual (Section 6.5.2.3) but is beyond the scope of this document.

If the publisher writes DDS samples faster than subscribers can acknowledge receiving them, this internal buffer will eventually fill completely. In that case, further writes by the publishing application will block. Similarly, on the subscriber side, a received DDS sample is stored in an internal receive buffer until the application takes the data for processing. If the subscribing application doesn't take the received DDS samples fast enough, the internal receive buffer may fill up. In that case, newly received data will be discarded and will need to be repaired by the reliable protocol.

The size of those buffers can be controlled through QoS. You can also use QoS to control what Connext DDS will do when the space available in one of those buffers is exhausted. There are two possible behaviors on each side, publisher and subscriber:

Publishing side: If write() is called and there is no more room in the DataWriter’s buffer, Connext DDS can:
1. Temporarily block the write operation until there is room in the buffer (for example, when one or more DDS samples are acknowledged by all the subscribers).
2. Drop the oldest DDS sample from the queue to make room for the new one.
Subscribing side: If a DDS sample is received (from a publisher) and there is no more room in the DataReader’s buffer, Connext DDS can:
1. Drop the DDS sample as if it had never been received. The subscribing application will send a negative acknowledgement requesting that the DDS sample be resent.
2. Drop the oldest DDS sample from the queue to make room for the new one.

Implementation

There are many variables to consider. Finding the optimum queue size and the right policy for the buffers depends on the type of data being exchanged, the rate at which the data is written, the nature of the communication between nodes, and various other factors.

The RTI Connext DDS Core Libraries User's Manual dedicates an entire section to the reliability protocol, providing details on choosing the correct values for the QoS based on the system configuration. For more information, refer to Chapter 10 in the User’s Manual.

The following sections highlight the key QoS settings needed to achieve strict reliability. In the reliable.xml QoS profile file, you will find many other settings besides the ones described here. A detailed description of these QoS settings is outside the scope of this document. For further information, refer to the comments in the QoS profile and to the User’s Manual.

Enable Reliable Communication

The QoS policy that controls the kind of communication is the Reliability QoS of the DataWriter and DataReader:

<datawriter_qos>
	...
	<reliability>
		<kind>RELIABLE_RELIABILITY_QOS</kind>
		<max_blocking_time>
			<sec>5</sec>
			<nanosec>0</nanosec>
		</max_blocking_time>
	</reliability>
	...
</datawriter_qos>
...
<datareader_qos>
	<reliability>
		<kind>RELIABLE_RELIABILITY_QOS</kind>
	</reliability>
</datareader_qos>

This section of the QoS file enables reliability on the DataReader and DataWriter, and tells the middleware that a call to write() may block up to 5 seconds if the DataWriter’s cache is full of unacknowledged DDS samples. If no space opens up in 5 seconds, write() will return with a timeout indicating that the write operation failed and that the data was not sent.

Set History To KEEP_ALL

The History QoS determines the behavior of a DataWriter or DataReader when its internal buffer fills up. There are two kinds:

KEEP_ALL: The middleware will attempt to keep all the DDS samples until they are acknowledged (when the DataWriter’s History is KEEP_ALL), or taken by the application (when the DataReader’s History is KEEP_ALL).
KEEP_LAST: The middleware will discard the oldest DDS samples to make room for new DDS samples. When the DataWriter’s History is KEEP_LAST, DDS samples are discarded when a new call to write() is performed. When the DataReader’s History is KEEP_LAST, DDS samples in the receive buffer are discarded when new DDS samples are received. This kind of history is associated with a depth that indicates how many historical DDS samples to retain.

<datawriter_qos>
	<history>
		<kind>KEEP_ALL_HISTORY_QOS</kind>
	</history>
	...
</datawriter_qos>
...
<datareader_qos>
	<history>
		<kind>KEEP_ALL_HISTORY_QOS</kind>
	</history>
	...
</datareader_qos>

The above section of the QoS profile tells RTI to use the policy KEEP_ALL for both DataReader and DataWriter.
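
For comparison, a KEEP_LAST configuration might look like the following sketch. The depth of 10 is an arbitrary value chosen for illustration and is not taken from reliable.xml. With this setting the oldest DDS samples are discarded rather than retained, so it does not provide the strict reliability this example aims for.

<datawriter_qos>
	<history>
		<!-- Illustrative alternative: retain only the most recent DDS samples -->
		<kind>KEEP_LAST_HISTORY_QOS</kind>
		<!-- Assumed value for illustration: how many DDS samples to retain -->
		<depth>10</depth>
	</history>
	...
</datawriter_qos>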

Controlling Middleware Resources

With the ResourceLimits QosPolicy, you have full control over the amount of memory used by the middleware. In the example below, we specify that the writer will store up to 10 DDS samples and the reader up to 2 (if you use a History kind of KEEP_LAST, the values specified here must be consistent with the value specified in the History’s depth).

<datawriter_qos>
   <resource_limits>
       <max_samples>10</max_samples>
   </resource_limits>
   ...
</datawriter_qos>
...
<datareader_qos>
   <resource_limits>
       <max_samples>2</max_samples>
   </resource_limits>
   ...
</datareader_qos>

The above section tells RTI to allocate a buffer of 10 DDS samples for the DataWriter and 2 for the DataReader. If you do not specify any value for max_samples, the default behavior is for the middleware to allocate as much space as it needs.

One important function of the Resource Limits policy, when used in conjunction with the Reliability and History policies, is to govern how far "ahead" of its DataReaders a DataWriter may get before it will block, waiting for them to catch up. In many systems, consuming applications cannot acknowledge data as fast as producing applications can put new data on the network. In such cases, the Resource Limits policy provides a throttling mechanism that governs how many sent-but-not-yet-acknowledged DDS samples a DataWriter will maintain. If a DataWriter is configured for reliable KEEP_ALL operation and its cache already holds max_samples unacknowledged DDS samples, calls to write() will block until the writer receives acknowledgements that allow it to reclaim that memory.
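
Putting the three policies together, the throttling behavior described above corresponds to a DataWriter configuration along the following lines. This is a sketch assembled from the fragments shown earlier, not a copy of reliable.xml, and other settings are elided.

<datawriter_qos>
	<reliability>
		<kind>RELIABLE_RELIABILITY_QOS</kind>
		<!-- write() blocks for at most 5 seconds when the cache is full -->
		<max_blocking_time>
			<sec>5</sec>
			<nanosec>0</nanosec>
		</max_blocking_time>
	</reliability>
	<history>
		<!-- Keep every DDS sample until all reliable readers acknowledge it -->
		<kind>KEEP_ALL_HISTORY_QOS</kind>
	</history>
	<resource_limits>
		<!-- The writer may hold at most 10 unacknowledged DDS samples -->
		<max_samples>10</max_samples>
	</resource_limits>
	...
</datawriter_qos>

With these values, an eleventh write() issued before any acknowledgement arrives would block, and would time out after 5 seconds if no space opens up.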

If you see that your reliable publishing application is using an unacceptable amount of memory, you can specify a finite value for max_samples. By doing this, you restrain the size of the DataWriter's cache, causing it to use less memory; however, a smaller cache will fill more quickly, potentially causing the writer to block for a time when sending, decreasing throughput. If decreased throughput proves to be an issue, you can tune the reliability protocol to process acknowledgements and repairs more aggressively, allowing the writer to clear its cache more effectively. A full discussion of the relevant reliability protocol parameters is beyond the scope of this example. However, you can find a useful example in high_throughput.xml. Also see the documentation for the DataReaderProtocol and DataWriterProtocol QoS policies in the on-line API documentation.
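
As a rough illustration of what more aggressive tuning can look like, the sketch below adjusts two fields of the DataWriterProtocol policy. The values are assumptions chosen for illustration and are not copied from high_throughput.xml. Check the field names, valid ranges and consistency rules against the API documentation before using them.

<datawriter_qos>
	<protocol>
		<rtps_reliable_writer>
			<!-- Assumed value: with max_samples set to 10, this requests a
			     heartbeat with every DDS sample so readers acknowledge sooner -->
			<heartbeats_per_max_samples>10</heartbeats_per_max_samples>
			<!-- Assumed value: respond to negative acknowledgements without delay -->
			<max_nack_response_delay>
				<sec>0</sec>
				<nanosec>0</nanosec>
			</max_nack_response_delay>
		</rtps_reliable_writer>
	</protocol>
	...
</datawriter_qos>

More frequent heartbeats and faster repairs add metadata traffic, so settings like these trade network bandwidth for a writer cache that drains sooner.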