Typical Reasons for Connext DDS Discovery Failing and Suggested Solutions
One of the most powerful features of an RTI Connext DDS system is its highly configurable, decentralized discovery process. DDS automatically and quickly discovers new DomainParticipants and communication endpoints (DataWriters and DataReaders), which makes distributed systems easier to design. When discovery fails, Connext DDS offers many tools and features to help diagnose the problem. This article goes through the reasons that discovery might fail and explains how to debug and correct each one. The reasons are listed below in order of how commonly RTI support encounters them.
Reasons for Discovery failing:
Firewall or Network Configuration Issues: Probably the most common reason for discovery failing is a firewall that blocks DDS traffic. For instance, a firewall rule may be configured to drop multicast packets or to not route UDP, or (more likely) the firewall blocks unsolicited traffic by default, in which case you will need to manually open the DDS discovery ports for the domain in use. These firewall issues are more common in complex network topologies or when operating across subnets.
Diagnose: As a test, temporarily disable the firewalls on the systems being used. Alternatively, run the DDS applications on the same system to take the firewall out of the equation. You can use rtiddsping for this testing if it is easier to run than your complete application.
Solution(s): Configure your firewalls to allow DDS traffic. Keep in mind that firewalls can be everywhere: you may need to open ports or disable firewalls on the publisher’s machine, on the subscriber’s machine, AND on any intervening routers or switches. See this article for instructions: “Statically configure a Firewall to let OMG DDS Traffic through”.
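To decide which UDP ports to open for a given domain, it can help to compute them. The sketch below applies the standard RTPS well-known port formula with its default parameters (PB=7400, DG=250, PG=2, d0=0, d1=10, d2=1, d3=11). It assumes the rtps_well_known_ports settings have not been changed in your QoS; the function name rtps_ports is only illustrative.

```python
# Sketch: RTPS well-known ports for a domain and participant ID.
# Assumes the default port-mapping parameters; adjust the constants
# if your application overrides them.
PB, DG, PG = 7400, 250, 2        # port base, domain gain, participant gain
D0, D1, D2, D3 = 0, 10, 1, 11    # offsets for the four traffic classes


def rtps_ports(domain_id: int, participant_id: int) -> dict:
    """Return the UDP ports used by one participant in one domain."""
    base = PB + DG * domain_id
    return {
        "discovery_multicast": base + D0,
        "discovery_unicast": base + D1 + PG * participant_id,
        "user_multicast": base + D2,
        "user_unicast": base + D3 + PG * participant_id,
    }


if __name__ == "__main__":
    # Ports for the first two participants in domain 0.
    for pid in range(2):
        print(pid, rtps_ports(0, pid))
```

For domain 0 and participant ID 0, this gives ports 7400 and 7410 for discovery and 7401 and 7411 for user data. Each additional participant on the same host shifts the unicast ports up by 2.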
Incompatible QoS Settings: Discovery can fail if the QoS settings between two endpoints are incompatible. For example, if a DataReader is set to reliable communication and a DataWriter is set to best effort, they will not match.
Diagnose: Run the Admin Console and look for flagged topics. These topics will be listed in the Admin Console logical view with a red mark. Then use the match graph and the match analysis window for that topic to see the problem (see diagram below).
Solution: Change one or the other of your QoS settings so that they are compatible. In the Admin Console you can click on the offending QoS value to link to the manual page of that issue. There is an excellent introduction to DDS QoS settings in the Getting Started Guide: https://community.rti.com/static/documentation/connext-dds/current/doc/manuals/connext_dds_professional/getting_started_guide/cpp11/intro_qos.html
Also, there is a QoS cheat sheet that briefly describes all QoS values and indicates whether they must be compatible during discovery (those are marked “RxO”). https://community.rti.com/static/documentation/connext-dds/current/doc/manuals/connext_dds_professional/qos_reference/qos_reference/qos_guide_all_in_one.htm
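The reliability example above follows the "requested versus offered" (RxO) rule: the DataWriter must offer at least what the DataReader requests. The following is a minimal sketch of that rule for the Reliability policy only. The enum and function names are hypothetical and are not part of the Connext API.

```python
# Sketch of the RxO rule for Reliability (illustrative names only).
from enum import IntEnum


class Reliability(IntEnum):
    BEST_EFFORT = 0   # weaker guarantee
    RELIABLE = 1      # stronger guarantee


def reliability_compatible(offered: Reliability, requested: Reliability) -> bool:
    """A writer matches a reader only if it offers at least what is requested."""
    return offered >= requested


if __name__ == "__main__":
    # Best-effort writer, reliable reader: incompatible.
    print(reliability_compatible(Reliability.BEST_EFFORT, Reliability.RELIABLE))
    # Reliable writer, best-effort reader: compatible.
    print(reliability_compatible(Reliability.RELIABLE, Reliability.BEST_EFFORT))
```

Other RxO policies follow the same pattern with their own ordering, so check the reference linked above for each one.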
Multicast not working: By default, DDS uses UDP multicast for participant discovery. If multicast is not working and the participants are on different machines, discovery will fail. (Participants on the same machine can still discover each other over loopback or SHMEM, which are also part of the default initial peers.) There are several reasons multicast might not be working:
Not supported/enabled in the underlying networking software
Not supported/enabled by switches or routers being used.
TTL (Time To Live) not set correctly
No IGMP querier on the network or group timeout too low
Diagnose: Use the rtiddsping utility with default settings to see if communication works. If communications fail then use rtiddsping with unicast peers specified so that multicast is not needed:
rtiddsping -peer 10.10.1.192 -peer mars -peer 4@pluto
In this example:
10.10.1.192 is an IP address of a peer.
mars is a hostname of a peer.
4@pluto specifies a peer with a maximum participant index of 4 at the hostname pluto.
See the article “HOW TO Do Basic Debugging for System-Level DDS” for more information.
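The peer strings in the rtiddsping example can be read mechanically. The sketch below splits the simple forms shown above into a maximum participant index and an address. It assumes the default maximum index of 4 when no "N@" prefix is given, and it does not handle the transport prefixes that the full peer descriptor format also allows. The function name parse_peer is hypothetical.

```python
# Sketch: interpret simple peer descriptors such as "4@pluto" or "mars".
# Only the "[index@]address" forms are handled here.
from typing import Tuple


def parse_peer(descriptor: str, default_max_index: int = 4) -> Tuple[int, str]:
    """Return (maximum participant index, address) for a peer descriptor."""
    index_part, separator, address = descriptor.partition("@")
    if not separator:
        # No explicit index: the whole string is the address.
        return default_max_index, descriptor
    return int(index_part), address


if __name__ == "__main__":
    for peer in ("10.10.1.192", "mars", "4@pluto"):
        print(peer, "->", parse_peer(peer))
```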
Solution: Either fix multicast or use unicast addresses through the initial_peers list. To fix multicast, check the firewall settings to make sure multicast is allowed, and make sure that any NICs and routers in use have multicast enabled. Lastly, make sure that the TTL (time to live) is configured correctly. The default TTL is typically 1, which means multicast packets are not forwarded beyond the local subnet. To allow multicast traffic to traverse multiple subnets, increase the TTL by setting the multicast_ttl property of the transport in the Connext DDS configuration (for the builtin UDPv4 transport, dds.transport.UDPv4.builtin.multicast_ttl).
The article “Configure RTI Connext DDS to not use Multicast” shows you how to disable multicast for discovery and set up your initial peers list.
This article has a good discussion on multicast discovery as well: “Does RTI Connext use IGMP messages?”
Misconfigured Topic Names or Types: If the Topic names are not equal or the data types are not compatible between DataWriters and DataReaders, discovery will not succeed. First, the Topic names have to match exactly for the Topics to match and for communication to proceed. Second, if data type information is shared as part of discovery, then the data types must be compatible between a DataReader and a DataWriter.
Type checking might not be done as part of discovery if:
Type information is turned off explicitly by the user.
Size allowed for type information is too small for data type definition
Starting with Connext 6.x, type information is compressed by default so when communicating with pre-6.x Connext versions, only topic names and type names are compared.
Diagnose: Use the Admin Console for this issue. It will show all the topics defined in your system and you can see if there is a name mismatch or, looking at the match analysis window, a type mismatch. You can then look at the type definitions for the reader and the writer and see exactly what doesn’t match.
Solution: Change the topic names and data types to be compatible.
Here is a good article on typecodes and typeobjects (how DDS represents data types during discovery): https://community.rti.com/kb/when-do-i-need-send-typecode-or-typeobject
Mismatched Domain ID, domain tag, or partition: If the DomainParticipants are configured with different domain IDs or domain tags, or with no matching partitions, they will not be able to discover each other since they are essentially operating in different DDS "universes".
Diagnose: Use the Admin Console or rtiddsspy to see what domains, domain tags and partitions are used.
Solution: Change the partitioning of your system so that DDS entities that are meant to communicate have the same domain ID and domain tag and have matching partitions.
Tip: use the RTI chatbot (https://chatbot.rti.com/) to show how to partition your system with the query “Show me the ways to partition a DDS system and example code for each method”.
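The three conditions above can be summarized in a few lines. This is a simplified sketch, not Connext code: the Scope class and can_communicate function are hypothetical, partition names are compared as exact strings (wildcard matching is not modeled), and an empty partition list is treated as the single default partition "".

```python
# Sketch: do two entities share a domain ID, a domain tag, and a partition?
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scope:
    domain_id: int
    domain_tag: str = ""
    partitions: Tuple[str, ...] = ()   # empty means the default partition ""


def effective_partitions(scope: Scope) -> set:
    """An empty partition list behaves like the single partition ""."""
    return set(scope.partitions) or {""}


def can_communicate(a: Scope, b: Scope) -> bool:
    """All three isolation mechanisms must agree for a match."""
    return (
        a.domain_id == b.domain_id
        and a.domain_tag == b.domain_tag
        and bool(effective_partitions(a) & effective_partitions(b))
    )


if __name__ == "__main__":
    writer = Scope(domain_id=0, partitions=("sensors", "logs"))
    reader = Scope(domain_id=0, partitions=("sensors",))
    print(can_communicate(writer, reader))               # shared partition
    print(can_communicate(writer, Scope(domain_id=1)))   # different domain
```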
NAT and Port Forwarding Issues: DomainParticipants running in a LAN that is behind a NAT-enabled router have a local address, for communicating across their LAN, and also a public address, for communicating across the WAN. For communication across the WAN, a participant needs to present a public address for discovery to complete, but participants usually do not know their public address, so discovery can fail. Connext DDS provides ways to work around this issue, but they can be limited by the type of NAT being used.
Diagnose:
Determine if you are behind a NAT-enabled router. Normally, if your local network is behind a NAT, your router WAN IP will be different from the external IP address you are accessing on the internet.
Identify NAT Type: Use a tool such as the "Symmetric NAT Test" page (https://tomchen.github.io/symmetric-nat-test/) to see whether you are behind a symmetric NAT or a normal NAT (full-cone NAT or another non-symmetric type). Communication is only possible if the participants are behind cone NATs.
Check Firewall Rules: Verify that the firewall rules allow traffic on the ports used for DDS discovery and communication.
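As a rough first check for the NAT question above, you can compare an address configured on your side with the address that an external service reports for you. The sketch below is a heuristic only. It assumes you have already obtained both addresses by other means, and the function name likely_behind_nat is hypothetical.

```python
# Heuristic sketch: does an address pair suggest address translation?
import ipaddress


def likely_behind_nat(local_addr: str, observed_public_addr: str) -> bool:
    """True if the local address is private or differs from the address
    that the outside world observes."""
    local = ipaddress.ip_address(local_addr)
    observed = ipaddress.ip_address(observed_public_addr)
    return local.is_private or local != observed


if __name__ == "__main__":
    # A private LAN address seen externally as a different address.
    print(likely_behind_nat("192.168.1.20", "203.0.113.7"))
    # The same public address on both sides: no translation detected.
    print(likely_behind_nat("8.8.8.8", "8.8.8.8"))
```

This does not tell you which type of NAT is in use, so the NAT type test above is still needed.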
Solution: Use RTI Cloud Discovery Service and the Real-Time WAN Transport to assist with NAT traversal and public address resolution. The Real-Time WAN Transport allows the discovery process to use the public address of a DDS participant behind a NAT to reach it. If both the sending and receiving DDS participants are behind NATs, then RTI Cloud Discovery Service is also needed (in addition to the Real-Time WAN Transport) as a repository for the public addresses of these endpoints.
RTI has a complete detailed example on how to use the Cloud Discovery Service and Real Time WAN Transport here: https://www.rti.com/developers/case-code/real-time-lan-over-wan
The Connext User’s Manual has a more detailed explanation of how NATs can affect DDS communication: https://community.rti.com/static/documentation/connext-dds/current/doc/manuals/connext_dds_professional/users_manual/users_manual/P2P_Deploy_WAN.htm?Highlight=cone%20nat
You can read more about the RTI Cloud Discovery service here: https://community.rti.com/static/documentation/connext-dds/7.3.0/doc/api/cloud_discovery_service/api_c/index.html
MTU and fragmentation issues: A Connext DDS transport has an MTU (maximum transmission unit) associated with it, and the underlying Internet Protocol has its own MTU. If a DDS sample is larger than the DDS transport MTU, then DDS will fragment that sample before sending it to the network. DDS fragments data safely and efficiently and can resend individual fragments that are dropped. But if the Connext transport MTU (message_size_max) is larger than the network MTU, then the network may need to perform IP fragmentation. IP fragmentation can be problematic, especially in WAN environments, because dropping a single IP fragment results in dropping the whole UDP datagram. Discovery messages, such as DATA(p), DATA(r), and DATA(w), can exceed the typical IP MTU of 1500 bytes, and these messages may be fragmented at the IP level as well.
Diagnose:
Compare your RTI MTU (the QoS message_size_max value) with the network MTU (which you can check with ifconfig) and see whether the RTI MTU is larger. If it is, fragmentation can occur at the IP level.
Monitor Network Traffic: Use network monitoring tools like Wireshark to capture and analyze discovery traffic. Look for fragmented packets and reassembly issues. You can use the filter ip.fragment in Wireshark to look for fragmentation.
If you have not configured Connext to handle fragmentation (by enabling asynchronous writer) you could see the message: ERROR COMMENDFacade_canSampleBeSent:NOT SUPPORTED | Reliable fragmented data requires asynchronous writer
Solution: Have DDS do the fragmentation (avoiding fragmentation at the IP level) by setting message_size_max to be smaller than the network MTU value. Also, for Connext to handle fragmentation for reliable communications, an asynchronous writer is needed. If you want to enable asynchronous writing for user data, you have to set DataWriterQos.publish_mode to ASYNCHRONOUS_PUBLISH_MODE_QOS.
See the Connext User’s Guide section “Avoiding IP-Level Fragmentation” for more detailed information on this topic.
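The comparison described in the Diagnose step is simple arithmetic. The sketch below assumes UDP over IPv4 with a 20-byte IP header (no options) and an 8-byte UDP header, and it treats message_size_max as the UDP payload size. The function names are hypothetical.

```python
# Sketch: will a given message_size_max force IP-level fragmentation?
IPV4_HEADER_BYTES = 20   # assumes no IP options
UDP_HEADER_BYTES = 8


def max_unfragmented_payload(network_mtu: int) -> int:
    """Largest UDP payload that fits in a single IPv4 packet."""
    return network_mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def causes_ip_fragmentation(message_size_max: int, network_mtu: int = 1500) -> bool:
    """True if a full-size message cannot fit in one IP packet."""
    return message_size_max > max_unfragmented_payload(network_mtu)


if __name__ == "__main__":
    print(max_unfragmented_payload(1500))     # payload limit on Ethernet
    print(causes_ip_fragmentation(65507))     # maximum IPv4 UDP payload
    print(causes_ip_fragmentation(1400))      # conservative setting
```

With a 1500-byte network MTU the payload limit is 1472 bytes, so any message_size_max above that can trigger IP fragmentation. VPNs and tunnels reduce the usable MTU further, which is why a lower value leaves some margin.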
For a discussion on how this fragmentation can affect the RTI Real-Time WAN transport see:
Exceeding Domain Participant ID limit: If you have more than 5 DomainParticipants on the same machine, you may notice that your applications cannot communicate with the sixth participant and beyond. This is because the default participant ID limit for a peer descriptor in RTI Connext is 4, allowing communication with DomainParticipants having participant IDs 0, 1, 2, 3, and 4. This limit applies to unicast locators and is ignored for multicast locators.
Additionally, any RTI Connext tool or service, such as Admin Console or Recording Service, will use a participant ID.
Diagnose:
- Check whether running tools like Admin Console or Recording Service changes the number of participants that are discovered.
- Check whether multicast is disabled and whether you have more than 5 participants on the same machine.
Solution: Use multicast for discovery, or increase the number of participants allowed by raising the maximum participant index in the peer descriptors of your initial peers list (a maximum index of 14 allows 15 participants per host).
See this article for more information: https://community.rti.com/kb/why-cant-more-5-domainparticipants-communicate-within-same-machine
More details about how to format initial peers: https://community.rti.com/static/documentation/connext-dds/7.3.0/doc/manuals/connext_dds_professional/users_manual/users_manual/Peer_Descriptor_Format.htm
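To see why the sixth participant on a host goes undiscovered over unicast, it helps to list the discovery ports that a peer descriptor covers. The sketch below assumes the default RTPS port mapping (port base 7400, domain gain 250, participant gain 2, unicast discovery offset 10). The function names are hypothetical.

```python
# Sketch: unicast discovery ports covered by a peer descriptor.
# Assumes the default RTPS well-known port parameters.
from typing import List


def probed_discovery_ports(domain_id: int, max_participant_index: int = 4) -> List[int]:
    """Ports contacted on a peer host for participant IDs 0..max index."""
    base = 7400 + 250 * domain_id + 10
    return [base + 2 * pid for pid in range(max_participant_index + 1)]


def peer_descriptor(participants_per_host: int, address: str) -> str:
    """Build an "N@address" descriptor covering the given participant count."""
    return f"{participants_per_host - 1}@{address}"


if __name__ == "__main__":
    print(probed_discovery_ports(0))        # default limit: five ports
    print(peer_descriptor(15, "pluto"))     # covers 15 participants
```

With the default index of 4 in domain 0, only ports 7410 through 7418 are contacted. A sixth participant would listen on 7420 and is never reached unless the index is raised or multicast is used.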
Incompatible Shared Memory Segments: Connext DDS will use the shared memory transport by default to communicate between two participants on the same system. If there are incompatible shared memory segments, the discovery process can fail. This can happen if different applications or instances of the same application are using different shared memory transport settings. The error messages might indicate that the shared memory segment found is incompatible with the expected settings.
In addition, a change to the shared memory transport in Connext 6.0 means that shared memory does not work with earlier versions.
Diagnose: Shared memory issues are often difficult to diagnose since shared memory is an OS-configured resource. You may get an "incompatible shared memory segment found" error message. Also, disabling SHMEM as a transport, so that local traffic goes over the loopback interface instead, can indicate whether SHMEM is the problem.
Solution: This error can be resolved by making the property “dds.transport.shmem.builtin.parent.message_size_max” consistent across all your applications on the same host.
Here is a good article that discusses incompatible shared memory segments: https://community.rti.com/kb/what-causes-error-nddstransportshmemattachwriterincompatible-shared-memory-segment-found
If you need to communicate over shared memory with older versions of DDS, the migration guide section “Default shared memory locator has changed” explains how to get that working: https://community.rti.com/static/documentation/connext-dds/current/doc/manuals/migration_guide/600/general600.html#section-general-transport