I have three computers communicating through a VPX switch in a VPX chassis, all running CentOS 7. On one of them, I have configured the default route for multicast addresses as follows: route add 224.0.0.0 netmask 240.0.0.0 eth0. On computer 1 I am running Wireshark, on computer 2 Application A, and on computer 3 Application B. Application A runs on the computer with the multicast default route. Applications A and B are modern C++ applications built against RTI Connext 5.3.1 for the target.
I am using the Generic.StrictReliable QoS profile, with no modifications. When Application B comes up, I see its IGMPv3 Membership Report in Wireshark; when Application A comes up, I do not see a corresponding report. Regardless, I see an indication that Application B has received the initialization message from Application A, and that A has started sending its periodic status message. After 10 seconds, Application A fails with a Timeout Error while waiting for acknowledgement of the first status message.
If I re-run with my own instrumentation, I see the inter-message time for a different heartbeat message bouncing around from its nominal 200 ms to several seconds. I understand that under StrictReliable, the write call blocks until the sample is acknowledged. On a separate, non-production target with the same operating system, the system runs nominally. If I run Wireshark on the board running Application A, I see the IGMPv3 Membership Report for that board as well as the report from the board running Application B. If I delete the default route for multicast addresses, my application runs nominally, but it can't communicate with the other multicast-enabled components of the system. I understand that I can instead add routes for my specific multicast groups. It seems that the outbound messages are fine, but the acknowledgements of the reliable protocol are being lost somehow.
I saw this article, https://community.rti.com/kb/why-are-my-reader-and-writer-applications-unable-communicate , which mentions the existence of asymmetric discovery. I would like to know whether and how I can instrument the software, or use RTI tools, to diagnose this problem. I try to avoid unverified failure analyses, and would like to reach a positive conclusion about why I am seeing these effects.
UPDATE: Any multicast route I add to the routing table breaks RELIABLE communication and causes the timeout errors, even a single-host route such as route add 224.1.1.10 netmask 255.255.255.255 eth0.