A DIY experiment replicating my problem
After having spent way too much time tracking down a really simple problem (Spoiler Alert: I hadn't opened the proper ports in my firewall), I thought I'd document it here in case somebody else stumbles over this, especially as it is easy to recreate my setup with the standard RTI tools: the shapes demo, the analyzer, and the monitor.
Setting up the problem
- Turn your firewall on and plug all the holes you punched in for RTI Connext.
- Start eight (8) instances of rtishapesdemo.
- Have each instance publish a Square topic, each one using a different color.
- Have each instance subscribe to the Square topic.
What I am seeing after this is that not all shapes demo instances track all the squares. In my case two instances manage to receive all eight colors; the other six instances only track themselves and two other colors (which are the same among all of those six).
Trying to see what is going on
Using rtianalyzer
- Start the analyzer and configure a Spy Agent: increase the Participant ID Limit on builtin.udpv4://127.0.0.1 to at least ten. (My rationale for that was that I would need eight for the shapes demos and one each for the analyzer and the monitor.) Start the agent.
- If you want, you can filter your entity tree to exclude "rti/*" Topic Names; that eliminates a lot of the monitoring stuff.
- Perform a complete match analysis on all writers and readers.
- Look at the Topic: Square. Mine shows "[8 Readers, 8 Writers, 0 mismatch, 64 matches, 0 potential missmatch]"
Using rtimonitor
- Start the monitor and connect it to domain 0 (the default of the shape demo)
- Go to the "DDS Logical View" tab and expand the Square Topic. Mine only shows two readers and writers where I ignorantly would have expected eight.
- Open a Systems Overview Panel and look at the Matches. Mine shows mismatches for every reader or writer I select in the DDS Logical View.
My wrong conclusions
What threw me off completely is that the analyzer shows the complete set of 8x8=64 reader/writer combinations on the topic as matched. For my original problem, I erroneously concluded from this that RTI was performing as it should and that the issue was somewhere in how I interact with the RTI API. With 20/20 hindsight, this obviously led me on a wild goose chase into code that I hadn't touched in years, so a lot of time was spent relearning it.
Initially I didn't really trust the monitor reports: I hadn't used that tool before, all my readers and writers had little question marks overlaid on their icons (at that point I hadn't enabled the monitoring library), and generally I felt I didn't get what was going on in that tool.
During my debugging efforts, I enabled the monitoring library (and hence got rid of the question marks) and managed to get rtimonitor to show all my participants. (Originally, just like in the shapes example, the monitor only showed a subset of my participants. Having the monitor open before instantiating my participants seems to make them show up there, too.) Using the "Terse" view, I managed to confirm that the monitor showed mismatches where I also didn't have communications in my application (the analyzer showed everything as matched, effectively contradicting the monitor).
At some point while staring at the Descriptions Panel I noticed the unicast locators, which showed my local IP and the appended RTPS ports. That made me go "D'Oh", turn off my firewall and reach instant bliss as stuff started to work immediately, just as it should...
Lessons learned...
- If stuff doesn't work, it actually most likely is user error ;)
- Remember your firewall settings, even if you are only using your local machine
... and follow up questions
So after having admitted to this rather stupid error of mine, I am really curious as to where I could have made a better attempt at using the RTI tools to debug this properly:
- Is there any way to detect firewall incompatibilities using the analyzer and the monitor?
- Why do/did/would the analyzer and the monitor show different results for the "matching" state?
- Why did/does stuff partially work, i.e. why did/do some entities match even though no (RTPS) port is open at all? (If you redo the shapes example with only two instances, all is working fine.)
- Is there some "magic" wrt. the use of shared memory as opposed to using the NIC?
Firewalls are a pain.
There are behaviours in Analyzer that clearly show (if you know what you are looking for [which is the problem, really :)]) that Discovery is not completing. The first one is that the participants are seen, but none of the endpoints, when the connection is via a NIC.
The problem is that from the standpoint of the tools, the Firewall is seen as a passive thing (ie, we don't actively test for one). It might be possible to implement methods that do look for the telltale discovery-isn't-completing problem, but even that would be susceptible to missing the problem (if the firewall is blocking a complete set, we won't see the participant and so we can't know that it is there).
Some things will show some things, others won't show anything, others will show everything... and which one demonstrates which behavior also depends on the startup order you use and on which machine(s) the Firewall is running. The thing is that each participant uses a different set of ports, which are determined at start-up. If you start it up fourth, it will have a different set of ports to use than if you start it up fifth. And some QoS settings tell a participant to look only for the first N+1 port ranges and then stop (the default is 4, i.e. the last one looked for is participant ID #4).
There's no SHMEM magic, except that there usually isn't a firewall built into an OS's shared memory setup beyond the level of "yes you can do shared memory"/"no you can't". The exception is Mac OS X's really, really stingy configuration values, which only let two DDS participants use shmem; the third will fall over. (There's a how-to elsewhere in the community that explains what's happening and how to reconfigure the OS so that more participants are supported.)
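To make the "different set of ports per participant" point concrete, here is a minimal Python sketch of the well-known port mapping from the RTPS specification. It assumes the default port parameters (port base 7400, domain gain 250, participant gain 2) have not been changed via QoS; the function names are made up for illustration and are not part of any RTI API.

```python
# Default RTPS well-known port parameters (assumed unchanged via QoS).
PORT_BASE = 7400
DOMAIN_GAIN = 250
PARTICIPANT_GAIN = 2
OFFSET_DISCOVERY_MULTICAST = 0
OFFSET_USER_MULTICAST = 1
OFFSET_DISCOVERY_UNICAST = 10
OFFSET_USER_UNICAST = 11


def rtps_ports(domain_id, participant_id):
    """Return the UDP ports one participant uses under the default mapping."""
    base = PORT_BASE + DOMAIN_GAIN * domain_id
    per_participant = PARTICIPANT_GAIN * participant_id
    return {
        # Multicast ports are shared by every participant in the domain.
        "discovery_multicast": base + OFFSET_DISCOVERY_MULTICAST,
        "user_multicast": base + OFFSET_USER_MULTICAST,
        # Unicast ports shift with the participant ID, i.e. with startup order.
        "discovery_unicast": base + OFFSET_DISCOVERY_UNICAST + per_participant,
        "user_unicast": base + OFFSET_USER_UNICAST + per_participant,
    }


def unicast_ports_to_open(domain_id, participant_count):
    """List the unicast ports for participant IDs 0..participant_count-1."""
    ports = []
    for participant_id in range(participant_count):
        p = rtps_ports(domain_id, participant_id)
        ports.extend([p["discovery_unicast"], p["user_unicast"]])
    return ports


if __name__ == "__main__":
    # Eight shapes demos plus analyzer and monitor on domain 0.
    print(unicast_ports_to_open(0, 10))
```

Under these assumptions, a firewall rule that only covers the first few unicast ports would let early-started participants through and block the later ones, which would fit the partial matching described in this thread.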
My response to "I'm getting weird behavior!" is "shut off the firewalls". If that solves the problem, then it's a question of correctly configuring the Firewalls rather than a question of DDS.
Thanks for the response, and yes, I can clearly second that firewalls are a (self inflicted, in my case) massive pain....
You state that
If you have the time, could you maybe sketch out such a use case? I.e. say I have a blue square shapes publisher and a corresponding subscriber...
How could I use Analyzer to determine whether a NIC is or isn't involved in the discovery process?
Thanks for your time, I really do appreciate it -- especially as it could teach me what to look for the next time I am stuck.
Firewall:
If a writer and a reader are on the same host, and you haven't specifically changed the QoS from the default behavior, they will discover each other via Shared Memory. They will not discover each other via Loopback IP addressing (ie, participants do not enable loopback unicast discovery if shmem is enabled).
It is possible to turn off shared memory for an arbitrary participant, using QoS. In this case, loopback unicast discovery is enabled automatically for that participant -- but if one participant is using loopback and the other is using shmem, they will not discover each other (since the one using shared memory is not watching for loopback traffic). If both have shmem disabled, then they will discover each other via loopback unicast.
Likewise, it is possible to statically *enable* loopback, even on participants using shared memory. So for participants on the same host, they should both be set for shmem://, or both have loopback discovery enabled -- either automatically through disabling shmem://, or by configuration in the QoS.
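The same-host rules above can be written down as a small truth table. The following Python sketch only models the behavior as described in this post; the class and field names are hypothetical and do not correspond to actual QoS settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalTransportConfig:
    """Hypothetical model of one participant's same-host transport setup."""
    shmem_enabled: bool = True      # default behavior described above
    loopback_forced: bool = False   # loopback explicitly enabled via QoS

    @property
    def loopback_enabled(self):
        # Loopback is on if forced, or automatically once shmem is disabled.
        return self.loopback_forced or not self.shmem_enabled


def common_local_transports(a, b):
    """Return the transports over which two same-host participants can meet."""
    shared = []
    if a.shmem_enabled and b.shmem_enabled:
        shared.append("shmem")
    if a.loopback_enabled and b.loopback_enabled:
        shared.append("loopback")
    return shared


if __name__ == "__main__":
    default = LocalTransportConfig()
    no_shmem = LocalTransportConfig(shmem_enabled=False)
    # An empty list means the two participants will not discover each other.
    print(common_local_transports(default, no_shmem))
```

An empty result corresponds to the mixed case described above, where one side uses shared memory and the other only loopback.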
If the two participants are on different devices, then they will discover each other via normal, UDP/IP multicast discovery.
It is also possible to enable static IP address peers (NDDS_DISCOVERY_PEERS environment variable), in which case participants will discover each other using unicast discovery.
Any time two participants would use either unicast or multicast discovery, you now have the possibility that one or more Firewalls will get in the way. This is the normal network stack Firewall. It is also possible to get "firewall-like" behavior in the OS for shared memory (for example, SE-Linux policies may prevent access to shared memory from an application run by a user without the necessary trust level).
In Analyzer, in the tree view, simply click on the "Expand All" button. If you see a bunch of entries that only show the Participant entry (but none of the Publisher/DataWriter, Subscriber/DataReader entries) then Discovery isn't completing. This is almost always going to be because a Firewall has interfered.
Also:
Analyzer and Admin Console are both participants. I used to say that, as passive discovery-based tools, these were not a disturbance to the DDS system, but this isn't exactly true. While Analyzer/Admin Console is only passively scanning discovery information and does not ordinarily disturb a running DDS system, enabling one or the other can take up a spot in the Participant ID list and push another application up into the area beyond the number of participants being looked for by a new participant, or up into the port range that a Firewall is blocking connections to/through.
If you aren't even seeing an expected Participant, and you have correctly configured for automatic (simple) discovery (which is the default), then it can be a different issue -- probably the Discovery default of "4@..." isn't high enough (see the documentation for how to configure the expected number of participants to be more or less than five at a given address, using NDDS_DISCOVERY_PEERS). If you see only the first five participants on an IP address with more than five participants, then that's a 4@ problem rather than a Firewall (although it may be hiding a Firewall problem also).
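As a rough illustration of raising the "4@" limit, the Python sketch below builds a value for NDDS_DISCOVERY_PEERS with a larger maximum participant ID. The helper names are hypothetical, and the peer strings in the example (builtin.udpv4://127.0.0.1 and builtin.shmem://) are assumptions based on the locators mentioned in this thread; check the documentation for the exact peer descriptor syntax before relying on it.

```python
def widen_peer(peer, max_participant_id):
    """Prefix a peer descriptor with 'N@', replacing any existing prefix."""
    if max_participant_id < 0:
        raise ValueError("max_participant_id must be >= 0")
    # Drop an existing 'N@' prefix, if present, and keep the locator part.
    locator = peer.split("@", 1)[-1]
    return f"{max_participant_id}@{locator}"


def discovery_peers(peers, participant_count):
    """Build a comma-separated peer list covering IDs 0..participant_count-1."""
    if participant_count < 1:
        raise ValueError("participant_count must be >= 1")
    return ",".join(widen_peer(p, participant_count - 1) for p in peers)


if __name__ == "__main__":
    # Ten participants on one host: eight shapes demos, analyzer, monitor.
    value = discovery_peers(
        ["builtin.udpv4://127.0.0.1", "builtin.shmem://"], 10
    )
    print(f"NDDS_DISCOVERY_PEERS={value}")
```

The resulting string would be set in the environment before starting each application; note that "9@" means participant IDs 0 through 9, in line with "4@" covering the first five.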
Both of these issues can be "seen" via inspection of the Tree lists in Analyzer and Admin Console. But they only jump out at the viewer when the viewer knows what he or she *should* be seeing.
You might also look at this post, which covers initial debug of why a DDS application may not be working in a system.