Why DataWriter::write() execution time can grow with local multicast subscribers

Summary

On Linux, Windows, and potentially other operating systems, when a DataWriter sends data samples to a multicast address, the execution time of the underlying UDP sendmsg() call (or its equivalent) varies depending on whether multicast DataReaders are located on the same host.

If there are no local multicast DataReaders, the UDP code transmits a single multicast packet over the network. Consequently, the execution time of DataWriter::write() remains largely unaffected by the number of remote multicast DataReaders.

However, if local multicast DataReaders are present, the UDP code copies the packet into the receive buffers of the subscribing processes' sockets during the sendmsg() call. In this scenario, the execution time of both sendmsg() and, by extension, DataWriter::write() increases in proportion to the number of local multicast subscribers.
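
One way to observe this effect is to time individual write() calls while varying the number of multicast DataReaders running on the same host. Below is a minimal sketch assuming an already-created DataWriter for a generated type and the traditional Connext C++ API; MyType, MyTypeDataWriter, writer, and sample are hypothetical placeholder names. Comparing the average time with zero, one, and several local multicast DataReaders makes the per-subscriber cost visible.

    #include <chrono>
    #include <iostream>

    // Hypothetical helper: times a single write() call on an existing Connext
    // DataWriter. MyType / MyTypeDataWriter stand in for a generated user type.
    void timed_write(MyTypeDataWriter &writer, const MyType &sample)
    {
        const auto start = std::chrono::steady_clock::now();
        DDS_ReturnCode_t retcode = writer.write(sample, DDS_HANDLE_NIL);
        const auto stop = std::chrono::steady_clock::now();

        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            stop - start).count();
        std::cout << "write() returned " << retcode << " after " << us << " us\n";
    }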

This behavior mirrors that of Shared Memory when used as a transport mechanism between processes on the same host. Shared Memory may offer superior performance compared to UDP multicast for sending data to multiple DataReaders on the same host. 

Background: how Linux handles local multicast

  • IP multicast loopback. By default, when a UDP packet is sent to a multicast group (address/port) and there are UDP sockets on the same host that are members of that group, the IP layer will deliver a copy of the datagram to matching sockets. This behavior is controlled by the IP_MULTICAST_LOOP (IPv4) / IPV6_MULTICAST_LOOP (IPv6) socket option.
  • Per-socket fan-out. The copying of the multicast datagram to local member sockets is done in the send path, i.e., in the context of the UDP sendmsg() (or equivalent) call. As a result, the execution time of sendmsg() grows with the number of same-host receivers (see the sketch after this list).
  • Remote replication. For multicast receivers on other hosts, the sender generally transmits one packet; replication happens on the network and on receiving hosts, not in the sender’s sendmsg() path. This is consistent with IP multicast host behavior defined in RFC 1112.
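
The per-socket fan-out can be observed without any DDS code. The following is a minimal Linux-only sketch (plain BSD sockets, not Connext, error handling omitted): it joins a configurable number of local receiver sockets to a multicast group and then times the sender's sendto() calls. The average per-call time is expected to grow as local receivers are added; the group address, port, payload size, and iteration count are arbitrary.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main(int argc, char **argv)
    {
        // Number of local group members to create, e.g. ./a.out 0 / 5 / 50.
        const int num_local_receivers = (argc > 1) ? std::atoi(argv[1]) : 0;
        const char *group = "239.255.0.1";   // arbitrary multicast group
        const unsigned short port = 7400;

        sockaddr_in group_addr {};
        group_addr.sin_family = AF_INET;
        group_addr.sin_port = htons(port);
        inet_pton(AF_INET, group, &group_addr.sin_addr);

        // Local sockets that are members of the group. During the sender's
        // sendto()/sendmsg(), the kernel attempts to deliver a copy of each
        // datagram to every one of these sockets (multicast loopback).
        std::vector<int> receivers;
        for (int i = 0; i < num_local_receivers; ++i) {
            int r = socket(AF_INET, SOCK_DGRAM, 0);
            int one = 1;
            setsockopt(r, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

            sockaddr_in bind_addr {};
            bind_addr.sin_family = AF_INET;
            bind_addr.sin_port = htons(port);
            bind_addr.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(r, (sockaddr *)&bind_addr, sizeof(bind_addr));

            ip_mreq mreq {};
            inet_pton(AF_INET, group, &mreq.imr_multiaddr);
            mreq.imr_interface.s_addr = htonl(INADDR_ANY);
            setsockopt(r, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
            receivers.push_back(r);
        }

        int sender = socket(AF_INET, SOCK_DGRAM, 0);
        // IP_MULTICAST_LOOP defaults to 1, so local members get a copy.
        // Setting it to 0 would suppress local delivery (and the fan-out cost).

        char payload[1024] = {};
        const int iterations = 10000;
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            sendto(sender, payload, sizeof(payload), 0,
                   (sockaddr *)&group_addr, sizeof(group_addr));
        }
        const auto stop = std::chrono::steady_clock::now();

        // The receivers are never drained, so their buffers eventually fill and
        // later copies are dropped; drain them in another thread for cleaner numbers.
        const double us =
            std::chrono::duration<double, std::micro>(stop - start).count();
        std::printf("%d local receivers: %.2f us per sendto()\n",
                    num_local_receivers, us / iterations);

        close(sender);
        for (int r : receivers) close(r);
        return 0;
    }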

What this means for Connext users

  • Using UDP multicast for intra-host distribution is not always faster than using Shared Memory (SHMEM). SHMEM avoids UDP header processing and can use zero-copy fan-out to local consumers.
  • The observation applies to both Discovery traffic and user data: if many local participants join the same groups, the sender’s thread can spend more time in the kernel on each write due to local multicast loopback.
  • When sending data to many subscribers on the same host, test multicast against Shared Memory to determine which performs better. In some scenarios, the Shared Memory transport may win because of lower per-packet processing. Connext also supports zero-copy delivery between processes on the same machine, which is especially beneficial for large samples (megabytes).
  • When configuring DDS Discovery, using multicast to discover participants on the same host can be simpler and more efficient than using Shared Memory. With Shared Memory, you must anticipate the maximum number of participants, N, that might run on the same host and set DiscoveryQosPolicy.initial_peers to <N-1>@shmem:// (a configuration sketch follows this list). Furthermore, if fewer participants are actually running on the host, CPU time is wasted sending discovery packets to participants that do not exist.
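
For reference, a minimal sketch of the shared-memory peer configuration mentioned above, using the traditional Connext C++ API. The domain ID and peer count are illustrative; the same peer string can also be supplied through XML QoS profiles or the NDDS_DISCOVERY_PEERS environment variable.

    // Illustrative values: domain 0 and at most N = 5 participants per host,
    // so the peer descriptor is "<N-1>@shmem://", i.e. "4@shmem://".
    DDS_DomainParticipantQos participant_qos;
    DDSTheParticipantFactory->get_default_participant_qos(participant_qos);

    participant_qos.discovery.initial_peers.ensure_length(1, 1);
    participant_qos.discovery.initial_peers[0] = DDS_String_dup("4@shmem://");

    DDSDomainParticipant *participant =
        DDSTheParticipantFactory->create_participant(
            0,                       // domain ID (illustrative)
            participant_qos,
            NULL,                    // no listener
            DDS_STATUS_MASK_NONE);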