Loss of DDS Communication when using Admin Console

Hello,

My applications that use DDS work completely fine on their own: the participants are created, pub/sub starts normally, DTOs are published and received, and liveliness data is what I'd expect. Recently, however, the second I start either DDS Spy or the Admin Console, most traffic drops and liveliness starts flickering on and off every few seconds.

I've checked my logs and whenever I start either of these programs, my log is flooded with

NDDS_Transport_UDP_send:send message size count
NDDS_Transport_UDP_sendToMultipleSockets:OS sendmsg() failure, error 0X65: Network is unreachable

with periodic entries (every 80 or so of the above) of

DISCEndpointDiscoveryPlugin_unregisterParticipantRemoteEndpoints:remote endpoint not previously asserted by plugin: 0XA61F5422,0X6E30,0X3,0

I've confirmed the network is reachable during this time: the machines ping each other with 0% packet loss while DDS is having these issues.
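For triage, the hex code in the sendmsg() line can be decoded to its errno name. The following is a minimal sketch, assuming Linux errno numbering (where 0x65 is 101, ENETUNREACH); the function name and the small lookup table are illustrative and not part of any RTI tool. ENETUNREACH is typically reported per destination address when the sending host finds no usable route, so a successful ping to one address of a machine does not rule it out for another address of the same machine.

```python
import re

# Assumption: Linux errno numbering. Other platforms use different values.
LINUX_ERRNO_NAMES = {
    101: "ENETUNREACH",   # Network is unreachable
    113: "EHOSTUNREACH",  # No route to host
}

# Matches the trailing "error 0X65: Network is unreachable" part of the log line.
_ERROR_RE = re.compile(r"error (0[xX][0-9a-fA-F]+): (.+)$")


def decode_send_failure(line):
    """Return (errno_value, symbolic_name, message), or None if the line has no error code."""
    match = _ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)  # int() accepts the 0X prefix when base is 16
    name = LINUX_ERRNO_NAMES.get(code, "UNKNOWN")
    return code, name, match.group(2).strip()


if __name__ == "__main__":
    sample = ("NDDS_Transport_UDP_sendToMultipleSockets:OS sendmsg() failure, "
              "error 0X65: Network is unreachable")
    print(decode_send_failure(sample))
```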

This happened a few weeks ago and a clean install of RTI DDS fixed it. However, it came back today, and a fresh install didn't help this time. It makes it nearly impossible to debug issues with our applications.

Thanks for any help

Hello,

Here's an imagined scenario that could lead to the kind of issue you are experiencing. Let's say the host running Admin Console has more than one network interface. What if the targets (the nodes running the applications using DDS) believe they have a valid route to one of that host's addresses, when in reality the IP address assigned on the other network is not reachable from them? I've encountered scenarios of this general nature that lead to ARP requests clogging up the network stack. When the ARP requests time out, some packets flow, but then more ARP requests are made and the stall happens again. Rinse, repeat. I'd recommend looking at packet captures for evidence of this, checking which IP addresses the targets are looking up, and looking at any statistics the kernel may provide about networking-related errors. But first, you may want to see the list of locators advertised in the Discovery announcements from the Admin Console node, and see what happens if you try to ping each address.
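As a rough aid for that check, the sketch below takes the locator addresses seen in the discovery announcements and reports which local interface subnet, if any, each one falls into. An address that falls inside a local subnet is treated as on-link, so the host will ARP for it directly; if the peer is not really on that segment, that is where the stall would come from. The interface name and all addresses here are made up for illustration, and the function is not part of any RTI API.

```python
import ipaddress


def classify_locators(local_interfaces, locators):
    """Map each locator address to the local interfaces whose subnet contains it.

    local_interfaces: dict of interface name -> "address/prefix" string.
    locators: iterable of IPv4 address strings advertised by the remote node.
    An empty list means the address is off-link and needs a gateway route.
    """
    networks = {
        name: ipaddress.ip_interface(cidr).network
        for name, cidr in local_interfaces.items()
    }
    result = {}
    for locator in locators:
        address = ipaddress.ip_address(locator)
        result[locator] = [
            name for name, network in networks.items() if address in network
        ]
    return result


if __name__ == "__main__":
    # Hypothetical target node with a wide /16 netmask on its only interface.
    interfaces = {"eth0": "10.0.1.5/16"}
    advertised = ["10.0.1.20", "10.0.2.20", "192.168.50.7"]
    for locator, matches in classify_locators(interfaces, advertised).items():
        print(locator, "on-link via" if matches else "off-link", ", ".join(matches))
```

In this made-up example both 10.0.1.20 and 10.0.2.20 are considered on-link through eth0, even if only the first is physically on that segment.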

If this proves fruitful, you might be able to mitigate this by using the allow/deny lists in your QoS, or by changing addresses or routes in your network.

Regards,
Tom

Tom, 

That was exactly it. I forgot that I added a new network interface on a different subnet to my Admin Console node a few days ago; the other nodes in the system were configured with a netmask wide enough to cover both IP addresses. Narrowing the netmask so that it only covers the old IP address resolved the issue.
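To illustrate the netmask fix, here is a small sketch that lists the prefix lengths for which a node's subnet still contains the old address but no longer contains the new one. It assumes the netmask in question is the interface netmask on the other nodes, and all addresses are invented for the example.

```python
import ipaddress


def prefixes_keeping_only(own, keep, exclude):
    """Return prefix lengths where own's subnet contains `keep` but not `exclude`."""
    keep_addr = ipaddress.ip_address(keep)
    exclude_addr = ipaddress.ip_address(exclude)
    valid = []
    for prefix in range(0, 33):
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{own}/{prefix}", strict=False)
        if keep_addr in network and exclude_addr not in network:
            valid.append(prefix)
    return valid


if __name__ == "__main__":
    # Hypothetical addresses: node 10.0.1.5, old Admin Console IP 10.0.1.20,
    # newly added Admin Console IP 10.0.2.20.
    print(prefixes_keeping_only("10.0.1.5", "10.0.1.20", "10.0.2.20"))
```

With these example addresses, a /16 netmask covers both Admin Console addresses, while any prefix from /23 through /27 covers only the old one.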

Thanks for your help!