Hi,
We are having problems when trying to create 7 or more processes/participants in the same domain on the same host. All of them have 1-2 of the same writers/topics and 1-3 other readers/writers. Everything works flawlessly and discovery is quick when only using 3-4 participants/processes, but after we create more, we have problems with 1-2 participants not being found by the others. They are all using the same QoS file. It is a bit random which participants are not discovered. On some computers it all works fine, while on others it always fails in the same way as described.
We have set up StatusConditions for requested_incompatible_qos, subscription_matched and publication_matched, and there are no hits on the participants that fail, so we think it is discovery related. We have tried different transports for discovery, "BuiltinQosSnippetLib::Optimization.Discovery.Common", and various other settings like initial_peers on shmem, multicast and unicast.
We have also tried enabling all logging and used the log parser, but we cannot see any errors; we are, however, no experts at following discovery traffic.
Do you have any suggestions for things we can try?
We are using rti_connext_dds-6.0.1 (modern C++) and rticonnextdds-connector-py 1.0.0 on Ubuntu 18.04 (both in a Docker container and without).
/Alex
It sounds like you're on the right track, and I also expect it is related to discovery.
When a DDS domain participant is started, it is given a participant ID.
This participant ID is unique among domain participants on the same host, so if you start two domain participants on different machines they will likely both use participant ID 0. But if you start them up on the same host, the first one will use participant ID 0 and the second one will use participant ID 1.
By default, during discovery each domain participant will only attempt to contact 5 participant IDs at each unicast peer (IDs 0-4); each participant ID maps to its own set of ports, so a peer like "4@127.0.0.1" means "contact participant IDs 0 through 4 on localhost". The result can be confusing because, depending on the startup order, different domain participants may fail to be discovered at different times (for example, if you start up a Connext debugging tool, it will also consume a participant ID).
The first thing I would suggest is modifying your initial peers list to increase the number of participant IDs that discovery will attempt to contact (the number before the @ sign).
More information on how to do this, including QoS configuration can be found here: https://community.rti.com/kb/why-cant-more-5-domainparticipants-communicate-within-same-machine
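As a rough sketch (the library and profile names here are just placeholders; adjust them to your own QoS file), bumping that number in the initial peers would look something like this:

<qos_library name="MyLibrary">
  <qos_profile name="ManyParticipantsProfile" is_default_qos="true">
    <participant_qos>
      <discovery>
        <initial_peers>
          <!-- "10@" means: attempt to contact participant IDs 0 through 10 at this location -->
          <element>10@shmem://</element>
          <element>10@udpv4://127.0.0.1</element>
          <!-- the multicast peer does not need a participant ID prefix -->
          <element>udpv4://239.255.0.1</element>
        </initial_peers>
      </discovery>
    </participant_qos>
  </qos_profile>
</qos_library>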
So, as Ross indicated, this is likely an issue with discovery...and specifically the discovery peers.
When using NON-multicast discovery, e.g., unicast UDP directly to IP addresses via the initial peers list, or using shared memory, the problem is exactly as Ross described, and you should use the link he provided to learn more.
However, by default, Connext DDS also uses multicast for discovery, which does not have this participant ID issue...multicast should allow the discovery of any number of participants on a host...and that should also work between applications on the same host.
NOW, having said that, when I say "host", I mean an environment where every process shares the same network interfaces through the same OS. 90% of the time, this is exactly what users are running...an OS like Linux/Windows on a machine, and then all processes on that machine should be able to discover each other using multicast without having to consider the participant ID issues.
HOWEVER, increasingly, the "host" environment gets more and more complicated. Like running multiple VMs on a host. And like using Docker. Processes running in different VMs or containers may NOT be able to send multicast packets to each other...and thus may not be able to discover each other via multicast. That entirely depends on how users have configured the networking support of their VM or Docker container: bridged versus shared versus ... etc.
So fundamentally, if you are having to set a peer list with unicast addresses, then you will have to modify the peer list using a "<n>@w.x.y.z" format for each peer.
Or, you can explore how to configure your processes to be able to send multicast to other docker containers on the same machine.
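If you do end up going the unicast peer list route, for example (just a sketch; the IP addresses below are hypothetical container/host addresses, and the number before the @ is the highest participant ID you expect at each of them), the relevant part of the participant QoS would look something like:

<participant_qos>
  <discovery>
    <initial_peers>
      <!-- hypothetical unicast peers: one entry per container/host IP address -->
      <element>5@udpv4://172.17.0.2</element>
      <element>5@udpv4://172.17.0.3</element>
    </initial_peers>
  </discovery>
</participant_qos>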
Thanks for the fast replies!
After setting 10@ on the shmem and localhost peers, it is now working in the environment which did not work previously.
Sorry for the clumsy usage of "host"; our processes are created side by side with the same network interfaces, with no virtualization layers between them.
But I am still confused; it seems like multicast isn't always used on all hosts even when using the same QoS. I've tested the same compiled setup on 3 different computers, 1 Windows (A) and 2 Ubuntu 18.04 (B & C), and it only works on A and B. On C I need to configure the peers with the number of participants (the value before the @). I started off by not configuring initial_peers or any transport setting at all, since I thought the default communication would be multicast.
What we want is:
- Fast data transfer between processes on the same host
- The fastest possible discovery between them
- Discovery that is reliable (participants are always found)
- No specific configuration required on each host
Do you have any recommendations on how to configure this? I had an idea of using multicast but with a TTL of 0 to limit it to the host only when that is required.
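What I had in mind is something like this (I'm not sure I have the property name exactly right, so treat it as a sketch):

<participant_qos>
  <property>
    <value>
      <element>
        <!-- assumed property name for the builtin UDPv4 transport's multicast TTL;
             a TTL of 0 should keep multicast packets on the local host only -->
        <name>dds.transport.UDPv4.builtin.multicast_ttl</name>
        <value>0</value>
      </element>
    </value>
  </property>
</participant_qos>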
Best regards
So...
By default, Discovery will have the following initial peers: "239.255.0.1" (multicast), "4@127.0.0.1" (localhost unicast), and "4@shmem://" (local shared memory). If multicast loopback is working, then theoretically all processes using DDS in the same domain on the same host should be able to discover each other without any additional configuration.
Given that you indicate this worked on Windows and on one of your Ubuntu hosts, this confirms that it can and will work as described. Since it is not working on your other Ubuntu host, and you have to configure the max participant ID (<n>@) in the initial peers list there, this shows that THAT Ubuntu host is not configured the same way as your other one (and must be uniquely configured, since this should work with an out-of-the-box Linux configuration).
So, you'll have to examine the Linux host that the problem occurs on and see how it's NOT the same as the other hosts. Start with an "ifconfig -a" (or "ip link"), and see if there is an UP and RUNNING interface that also supports MULTICAST.
By default, Connext will use shmem for transferring data whenever 2 processes are on the same host. You may want to look into the Zero-Copy feature (does require extra effort to use, but you get superfast data transfer...constant latency no matter what the data size).
Discovery is discovery. There is no choice between different discovery methods...there is only one way. While there are QoS parameters that can configure discovery to be faster (you should search for "discovery" in the community.rti.com website for articles on that), you should only look into changing them when you have found that discovery is too slow. Are you in a situation where discovery is slower than your requirement?
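For example, the snippet you already mentioned can be composed into your own profile roughly like this (the profile name is a placeholder, and I'm writing this from memory, so double-check the exact syntax against the documentation):

<qos_profile name="MyProfile">
  <base_name>
    <element>BuiltinQosSnippetLib::Optimization.Discovery.Common</element>
  </base_name>
  <!-- your participant/datawriter/datareader QoS goes here -->
</qos_profile>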
"Reliable" and doesn't discover are two different things. Assuming that two participants discover each other, the discovery information transfered will be done reliably. Whether or not 2 participant are configured to discover each other depends on configuration...I would not mislabel the situation in which 2 participants discover each other sometimes but does not discover each other at other times as being "not reliable". Not configured as needed...yes.
Not sure what you mean here. Specific configuration of the host? DDS doesn't require you to configure any host that supports multicast loopback. And most "ordinary" operating systems (e.g., Linux/Windows/MacOS) support that "out-of-the-box" with no additional configuration...assuming that the host machine has UP and RUNNING interfaces with MULTICAST enabled.
You can of course configure the OS/host to NOT support multicast at all...which then prevents the default mechanism of DDS discovery that uses multicast to work as designed.
There are a variety of ways to do that, including: firewall rules that drop multicast packets, bringing network interfaces up without the MULTICAST flag, or not having a multicast route configured on the host.
Finally, you may want to engage RTI's Professional Services group who can work with you directly to provide design and architecture support for your specific requirements.
Thanks a lot, things are much clearer now thanks to your comments and a bit of trial and error. For now, we have chosen to go with shmem:// and udpv4://127.0.0.1 as initial peers and have disabled multicast until we have talked to our IT department. Now it is fast enough in all our environments. We need a discovery process that does not take longer than ~1 second due to user experience. With the multicast problems it was neither reliable nor deterministic, and since we are very new to DDS it's sometimes hard to understand why things do not work as expected.
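In case it helps someone else, the discovery part of our profile now looks roughly like this (simplified; the empty multicast_receive_addresses is what we use to turn off multicast for discovery, if I've understood that tag correctly):

<participant_qos>
  <discovery>
    <initial_peers>
      <element>10@shmem://</element>
      <element>10@udpv4://127.0.0.1</element>
    </initial_peers>
    <!-- empty list: do not listen for discovery traffic on any multicast address -->
    <multicast_receive_addresses></multicast_receive_addresses>
  </discovery>
</participant_qos>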
We think the hosts which have problems with the default settings have a firewall that drops the multicast packets. We can see that sometimes DDS is fast, but in 2 out of 3 attempts it either takes a long time (10s+) to discover some participants or they are not discovered at all, even after increasing the participant count on the other transports.
A final question: let's say that multicast, unicast and shmem are all enabled for discovery. Which will be used? Is there some logic here? From what I can see, it seems like unicast is preferred over shmem, at least?
Yes, firewalls are generally an issue with DDS, multicast AND unicast. So, unless you configure a firewall to allow DDS packets through (usually have to open UDP ports...and allow multicast to the ports), DDS generally won't work. Sometimes the configuration is actually such that the discovery completes, but the firewall is blocking the data. So firewalls have to be dealt with for any IP-based communications.
When discovery is enabled for multicast/unicast/shmem...meaning that initial_peers has a multicast address, a unicast address and a shared memory address...then Connext DDS will try to use all of them. It doesn't actually distinguish between them and treats them equally in the discovery phase. It will try to send out discovery information to every locator (aka address) in the initial_peers list.
So, it's possible that an application will receive the same discovery information 3 different ways (multicast loopback, unicast loopback via localhost, and shared memory) from other applications on the same host.
A participant discovery packet contains information about how to contact that participant...its receive addresses. By default, this is a list of unicast, multicast and shared memory addresses (not the same as its peer list...since these are its own addresses, not the addresses of the participants it should try to discover).
NOW, after receiving a remote participant's discovery information, if that remote participant is on the same host AND that remote participant advertises a shared memory receive address, the local participant WILL use shared memory instead of unicast to send packets to that remote participant. So, shared memory is "preferred" when 2 participants both have shared memory enabled (default) and are on the same host.