Systems face unique challenges when sending data over lossy networks that also have high-latency and low-bandwidth constraints—for example, satellite and long-range radio links. While sending data over such a connection, a middleware tuned for a high-speed, dedicated Gigabit connection would throttle unexpectedly, cause unwanted timeouts and retransmissions, and ultimately suffer severe performance degradation.
For example, the transmission delay in satellite connections can be as much as 500 milliseconds to 1 second, which makes such a connection unsuitable for applications or middleware tuned for low-latency, real-time behavior. In addition, satellite links typically have lower bandwidth, with near-symmetric connection throughput of around 250–500 Kb/s and an advertised loss of approximately 3% of network packets. (Of course, the throughput numbers will vary based on the modem and the satellite service.) In light of these facts, a distributed application needs to tune the middleware differently when sending data over such networks.
Connext DDS is capable of maintaining liveliness and application-level QoS even in the presence of sporadic connectivity and packet loss at the transport level, an important benefit in mobile, or otherwise unreliable networks. It accomplishes this by implementing a reliable protocol that not only sequences and acknowledges application-level messages, but also monitors the liveliness of the link. Perhaps most importantly, it allows your application to fine-tune the behavior of this protocol to match the characteristics of your network. Without this latter capability, communication parameters optimized for more performant networks could cause communication to break down or experience unacceptable blocking times, a common problem in TCP-based solutions.
When designing a system that must deliver data reliably over a lossy, high-latency, low-throughput network, it is critical to consider the following factors.
It is also important to be aware of whether your network supports multicast communication; if it does not, you may want to explicitly disable it in your middleware configuration (e.g., by using the NDDS_DISCOVERY_PEERS environment variable or setting the initial_peers and multicast_receive_address in your Discovery QoS policy; see the API Reference HTML documentation).
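As a sketch of such a configuration (element names follow the Connext DDS XML QoS profile schema; the peer addresses shown are placeholders you would replace with your own), multicast can be disabled by listing unicast peers explicitly and leaving the multicast receive address list empty:

```xml
<participant_qos>
    <discovery>
        <!-- Explicit unicast peers (placeholder addresses) -->
        <initial_peers>
            <element>192.168.1.2</element>
            <element>192.168.1.3</element>
        </initial_peers>
        <!-- An empty list disables multicast discovery -->
        <multicast_receive_addresses/>
    </discovery>
</participant_qos>
```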
Pay attention to your packet sizes to minimize or avoid IP-level fragmentation. Fragmentation can lead to additional repair meta-traffic that competes with the user traffic for bandwidth. Ethernet-like networks typically have a frame size of 1500 bytes; on such networks, sample sizes (or sample fragment sizes, if you've configured Connext DDS to fragment your DDS samples) should be kept to approximately 1400 bytes or less. Other network types will have different fragmentation thresholds.
The exact size of the DDS sample on the wire will depend not only on the size of your data fields, but also on the amount of padding introduced to ensure alignment while serializing the data.
Figure 7 shows how an application's effective throughput (as a percentage of the theoretical capacity of the link) increases as the amount of data in each network packet increases. Put another way: when transmitting a packet is expensive, it pays to put as much data into each one as possible. The trend reverses, however, once the packet size exceeds the maximum transmission unit (MTU) of the physical network.
Figure 7 Example Throughput Results over VSat Connection
Correlation between sample size and bandwidth usage for a satellite connection with 3% packet loss ratio.
To understand why this occurs, remember that data is sent and received at the granularity of application DDS samples but dropped at the granularity of transport packets. For example, a 10 KB IP datagram must be fragmented into seven (1500-byte) Ethernet frames and reassembled on the receiving end; the loss of any one of those frames makes reassembly impossible, leading to an effective loss, not of 1500 bytes, but of over 10,000 bytes.
On an enterprise-class network, or even over the Internet, loss rates are very low, and these losses are manageable. However, when loss rates reach several percent, the risk of losing at least one fragment of a large IP datagram becomes very high. (Suppose that a physical network delivers a 1 KB frame successfully 97% of the time, and that an application sends a 64 KB datagram. The likelihood that all fragments arrive at their destination is 97% to the 64th power: less than 15%.) Over an unreliable protocol like UDP, such losses will eventually lead to near-total data loss as data size increases. Over a protocol like TCP, which provides reliability at the level of whole IP datagrams (not fragments), mounting losses will eventually fill the network with repairs, which will themselves be lost; the result can once again be near-total data loss.
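The arithmetic above generalizes: if each fragment is delivered with probability $1 - p$ and a datagram spans $n$ fragments, the probability that the whole datagram can be reassembled is

```latex
P(\text{datagram delivered}) = (1 - p)^n,
\qquad \text{e.g.} \quad (0.97)^{64} \approx 0.14
```

so at a 3% frame-loss rate, a 64-fragment datagram arrives intact less than 15% of the time, and the delivery probability decays exponentially as datagrams grow.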
To solve this problem, you need to repair data at the granularity at which it was lost: not message-level reliability, but fragment-level reliability. This is an important feature of Connext DDS. When sending samples larger than the MTU of your underlying link, use RTI's data fragmentation and asynchronous publishing features to perform the fragmentation at the middleware level, thereby relieving the IP layer of that responsibility.
<datawriter_qos>
    ...
    <publish_mode>
        <kind>ASYNCHRONOUS_PUBLISH_MODE_QOS</kind>
        <flow_controller_name>
            DDS_DEFAULT_FLOW_CONTROLLER_NAME
        </flow_controller_name>
    </publish_mode>
    ...
</datawriter_qos>
<participant_qos>
    ...
    <property>
        <value>
            <element>
                <name>
                    dds.transport.UDPv4.builtin.parent.message_size_max
                </name>
                <value>1500</value>
            </element>
        </value>
    </property>
    ...
</participant_qos>
The DomainParticipant's dds.transport.UDPv4.builtin.parent.message_size_max property sets the maximum size of a datagram that will be sent by the UDP/IPv4 transport. (If your application interfaces to your network over a transport other than UDP/IPv4, the name of this property will be different.) In this case, it is limiting all datagrams to the MTU of the link (assumed, for the sake of this example, to be equal to the MTU of Ethernet).
At the same time, the DataWriter is configured to send its DDS samples on the network, not synchronously when write() is called, but from a middleware thread. This thread "flows" datagrams onto the network at a rate determined by the FlowController identified by flow_controller_name. (FlowControllers are not supported when using Ada Language Support.) In this case, the FlowController is a built-in instance that allows all data to be sent immediately. In a real-world application, you may want to use a custom FlowController that you create and configure in your application code. Further details are beyond the scope of this example. For more information on asynchronous publishing, see Section 6.4.1 in the RTI Connext DDS Core Libraries User's Manual. You can also find code examples demonstrating these capabilities online in the Solutions area of the RTI Support Portal, accessible from https://support.rti.com/: navigate to Code Examples and search for Asynchronous Publication.
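Custom FlowControllers can also be created declaratively through DomainParticipant properties. The following sketch assumes the token-bucket property names documented in the RTI property reference (verify them against your Connext DDS version); "SlowLink" is a hypothetical FlowController name chosen for this example. It releases 20 tokens of 1,400 bytes each per second, about 224 Kb/s, which fits under a 256 Kb/s link:

```xml
<participant_qos>
    <property>
        <value>
            <!-- Token bucket refills once per second -->
            <element>
                <name>dds.flow_controller.token_bucket.SlowLink.token_bucket.period.sec</name>
                <value>1</value>
            </element>
            <!-- 20 tokens per refill -->
            <element>
                <name>dds.flow_controller.token_bucket.SlowLink.token_bucket.tokens_added_per_period</name>
                <value>20</value>
            </element>
            <!-- Each token permits one 1,400-byte send -->
            <element>
                <name>dds.flow_controller.token_bucket.SlowLink.token_bucket.bytes_per_token</name>
                <value>1400</value>
            </element>
        </value>
    </property>
</participant_qos>
```

A DataWriter would then reference this FlowController by setting its flow_controller_name to dds.flow_controller.token_bucket.SlowLink instead of DDS_DEFAULT_FLOW_CONTROLLER_NAME.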
Piggyback heartbeat with each DDS sample. A DataWriter sends "heartbeats"—meta-data messages announcing available data and requesting acknowledgement—in two ways: periodically and "piggybacked" into application data packets. Piggybacking heartbeats aggressively ensures that the middleware will detect packet losses early, while allowing you to limit the number of extraneous network sends related to periodic heartbeats.
<datawriter_qos>
    ...
    <resource_limits>
        <!-- Used to configure piggybacks w/o batching -->
        <max_samples>
            20 <!-- An arbitrary finite size -->
        </max_samples>
    </resource_limits>
    <writer_resource_limits>
        <!-- Used to configure piggybacks w/ batching; see below -->
        <max_batches>
            20 <!-- An arbitrary finite size -->
        </max_batches>
    </writer_resource_limits>
    <protocol>
        <rtps_reliable_writer>
            <heartbeats_per_max_samples>
                20 <!-- Set same as max_samples -->
            </heartbeats_per_max_samples>
        </rtps_reliable_writer>
    </protocol>
    ...
</datawriter_qos>
The heartbeats_per_max_samples parameter controls how often the middleware piggybacks a heartbeat onto a data message: if the middleware is configured to cache 10 samples, for example, and heartbeats_per_max_samples is set to 5, a heartbeat will be piggybacked onto every other DDS sample. If heartbeats_per_max_samples is set equal to max_samples, a heartbeat will be sent with every DDS sample.
Applications can configure the maximum amount of data that a DataWriter will resend at a time using the max_bytes_per_nack_response parameter. For example, if a DataReader sends a negative acknowledgement (NACK) indicating that it missed 20 samples, each 10 KB in size, and max_bytes_per_nack_response is set to 100 KB, the DataWriter will only send the first 10 samples. The DataReader will have to NACK again to receive the remaining 10 samples.
In the following example, we limit the number of bytes in each NACK response so that we never send more repair data than a 256 Kb/s, 1-ms latency link can carry in one second (256 Kb/s is 32,000 bytes/s; 28,000 bytes leaves headroom for protocol overhead):
<datawriter_qos>
    ...
    <protocol>
        <rtps_reliable_writer>
            <max_bytes_per_nack_response>
                28000
            </max_bytes_per_nack_response>
        </rtps_reliable_writer>
    </protocol>
    ...
</datawriter_qos>
If your application is sending data continuously, consider batching small samples to decrease the per-sample overhead. Be careful not to set your batch size larger than your link’s MTU; see Managing Your Sample Size.
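A minimal sketch of such a batching configuration (field names per the BatchQosPolicy XML schema; the size and delay values are illustrative, chosen to keep batched packets well under a 1,500-byte MTU):

```xml
<datawriter_qos>
    <batch>
        <enable>true</enable>
        <!-- Flush a batch once it holds ~1 KB of data,
             keeping the resulting packet under the MTU -->
        <max_data_bytes>1024</max_data_bytes>
        <!-- ...or after 100 ms, whichever comes first,
             to bound the added latency -->
        <max_flush_delay>
            <sec>0</sec>
            <nanosec>100000000</nanosec>
        </max_flush_delay>
    </batch>
</datawriter_qos>
```

Batching trades a small amount of latency (bounded here by max_flush_delay) for a large reduction in per-sample header and heartbeat overhead.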
For more information on how to configure throughput for small samples, see High Throughput for Streaming Data.
© 2015 RTI