HOWTO Do Basic Debugging for System-Level DDS

There is a series of steps you can perform to do basic debugging of a DDS system when communication between your applications is absent or sub-optimal.

1.  Are they actually on the same network/subnetwork?

Two computers that are on a network may be on different subnets. For example, one may be using a WiFi connection and the other a wired one, and these may be on different subnets even though a local switch routes unicast between them without routing multicast, etc, etc, etc. (Be wary of Networking and Network Engineers, for they are crafty and prone to practical jokes.)

Linux

$ ifconfig | grep inet

Windows

c:\>ipconfig

Look for the inet addresses and determine if they have anything on the same subnet. 
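If eyeballing the addresses gets tedious, the comparison can be sketched in a few lines of Python using the standard ipaddress module. This is only an illustration: the addresses below are hypothetical, and you would paste in the address/prefix pairs you copied from each machine.

```python
import ipaddress

def shared_subnets(addrs_a, addrs_b):
    """Return the subnets that appear on both machines.

    Each argument is a list of "address/prefix" (or "address/netmask")
    strings copied by hand from the ifconfig/ipconfig output of one machine.
    """
    nets_a = {ipaddress.ip_interface(a).network for a in addrs_a}
    nets_b = {ipaddress.ip_interface(b).network for b in addrs_b}
    # Sort by string form so mixed IPv4/IPv6 results do not raise.
    return sorted(nets_a & nets_b, key=str)

# Hypothetical addresses, for illustration only.
machine_a = ["192.168.0.100/24", "10.8.0.5/16"]
machine_b = ["192.168.0.111/24", "172.16.4.20/24"]

for net in shared_subnets(machine_a, machine_b):
    print(net)
```

An empty result means the two machines have no subnet in common, which is worth fixing before looking at DDS at all.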

Warning! Machines that have more than four IP addresses (because of VMs, VPNs, etc) may result in pain and suffering!  RTI Connext DDS chooses only the first four network interfaces it sees.  This is particularly bad on Windows, because Windows will arbitrarily re-order the interface list!  This means that sometimes it works, and sometimes it doesn't. 

2.  You've determined that they have a mutual network subnet.  Now, will DDS work, at all?

Use rtiddsping. The rtiddsping command-line tool is a pristine, known-working DDS application.  On one machine, run rtiddsping[.exe] -publisher.  On the other machine, run rtiddsping[.exe] -subscriber.  If they do not see each other (look for "Found 1 additional ping subscriber(s)" on the publisher side, or "publisher(s)" on the subscriber side), then the problem is the network infrastructure, not RTI Connext DDS.

If rtiddsping instances do not see each other, the possible causes are:

  • Multicast not enabled on either or both machines, or in an intervening router or switch.
  • A firewall rule is configured to stop multicast packets or to not route UDP, or (more likely) the firewall is defaulting to block unexpected port accesses, in which case you will need to manually enable the DDS discovery ports for the domain in use.  The default domain for rtiddsping is 0, which corresponds to ports 7400, 7401, 7410 and 7411.  Keep in mind that firewalls can be everywhere: you may need to enable ports or disable firewalls on the publisher side, on the subscriber side, AND on any intervening routers or switches.
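If you are using a domain other than 0, the ports to open can be worked out from the default RTPS port-mapping formula. The sketch below assumes the default constants (port base 7400, domain gain 250, participant gain 2, offsets 0/10/1/11) have not been overridden in QoS; the function and key names are my own, chosen for illustration.

```python
# Default RTPS port-mapping constants. These can be overridden in QoS,
# which this sketch ignores.
PORT_BASE = 7400
DOMAIN_ID_GAIN = 250
PARTICIPANT_ID_GAIN = 2
# Offsets: discovery multicast, discovery unicast, user multicast, user unicast.
D0, D1, D2, D3 = 0, 10, 1, 11

def rtps_ports(domain_id, participant_id=0):
    """Return the default ports for one participant in one domain."""
    base = PORT_BASE + DOMAIN_ID_GAIN * domain_id
    unicast = base + PARTICIPANT_ID_GAIN * participant_id
    return {
        "discovery_multicast": base + D0,
        "user_multicast": base + D2,
        "discovery_unicast": unicast + D1,
        "user_unicast": unicast + D3,
    }

print(rtps_ports(0))
```

For domain 0 and the first participant on a host this gives 7400, 7401, 7410 and 7411, matching the ports listed above. Each additional participant on the same host moves the two unicast ports up by 2.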

To debug the network infrastructure you may follow these steps:

  • Check basic network (IP) reachability. Run the "ping" command-line tool to ensure you have basic network communications between the computers. This tool comes with most operating systems, including Windows and Unix variants. Failure to reach an IP address via ping may indicate a bad network connection or the presence of a firewall (commonly a host-based firewall) that is preventing incoming network communications.
  • Validate basic multicast reachability. Run ping 224.0.0.1.  The address 224.0.0.1 is the "all-hosts" group. If you ping that group, all multicast-capable hosts on your LAN should answer (every multicast-capable host must join that group at start-up on all its multicast-capable interfaces). If one of the hosts on your LAN where you are planning to run Connext DDS does not respond, it may indicate a network connectivity failure, a firewall (potentially on that host itself), or an operating system configuration that does not support multicast. The out-of-the-box discovery settings require multicast, so these hosts will not communicate over DDS with the out-of-the-box configuration.

Pings to 224.0.0.1 are never forwarded by multicast routers, so you will only get replies from the computers on the same LAN. If you are trying to have DDS perform discovery between computers on different LANs (i.e. separated by a router), the intermediate routers need to be configured to relay multicast packets. You can check this by running ping 224.0.0.2. The address 224.0.0.2 is the "all-routers" group, and all multicast-enabled routers should respond. Since pings to 224.0.0.2 are never forwarded by routers, you should expect to see replies only from the routers connected to the same LAN as the machine from which you ping.

Note that running rtiddsping allows testing multicast communications in situations where the two machines are separated by a router. If the router is multicast-enabled (and configured to forward multicast packets), then it will relay the DDS multicast discovery traffic and the two rtiddsping applications should discover each other. This is because the multicast address used by DDS discovery is 239.255.0.1, which should be forwarded by all multicast-enabled routers, allowing DDS discovery to proceed successfully.
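The difference between the three addresses mentioned above comes down to which multicast block each one falls in. A small sketch using Python's ipaddress module makes the distinction explicit; the function name and the wording of the returned labels are mine.

```python
import ipaddress

# 224.0.0.0/24 is the local network control block: routers never forward it.
LOCAL_CONTROL = ipaddress.ip_network("224.0.0.0/24")
# 239.0.0.0/8 is administratively scoped: routers may forward it if configured to.
ADMIN_SCOPED = ipaddress.ip_network("239.0.0.0/8")

def multicast_scope(addr):
    """Classify an IPv4 address by how routers treat it."""
    ip = ipaddress.ip_address(addr)
    if not ip.is_multicast:
        return "not multicast"
    if ip in LOCAL_CONTROL:
        return "link-local (never forwarded by routers)"
    if ip in ADMIN_SCOPED:
        return "administratively scoped (forwarded if routers are configured to)"
    return "other multicast"

for a in ("224.0.0.1", "224.0.0.2", "239.255.0.1"):
    print(a, "->", multicast_scope(a))
```

This is why a successful ping to 224.0.0.1 says nothing about whether DDS discovery traffic on 239.255.0.1 will cross a router.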

If basic network IP connectivity was successful but running rtiddsping is not successful and the two computers are on different LANs, then the most likely scenario is that the routers in between are not configured to route and forward multicast packets.  If the two computers are on the same LAN, then there should be no situations where basic IP connectivity (including multicast) is working and the example/CPP/Hello_simple is not working.
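If you want to test the DDS discovery group itself without involving DDS, a bare UDP multicast probe is one option. The following is a minimal sketch, not an RTI tool: the port number 17400 and the function names are arbitrary choices of mine, and the port must be allowed through any firewall for the test to mean anything. Call listen() on one machine and send() on the other.

```python
import socket
import struct

GROUP = "239.255.0.1"   # multicast group used by DDS discovery
PORT = 17400            # hypothetical test port, not a DDS port

def membership_request(group, iface="0.0.0.0"):
    """Build the ip_mreq structure: group address plus local interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))

def listen(group=GROUP, port=PORT, timeout=10.0):
    """Join the group and wait for one datagram; return (sender_ip, data) or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        data, sender = sock.recvfrom(1024)
        return sender[0], data
    except socket.timeout:
        return None
    finally:
        sock.close()

def send(group=GROUP, port=PORT, ttl=1, payload=b"mcast-probe"):
    """Send one datagram to the group. A TTL of 1 keeps it on the local LAN."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

With ttl=1 the probe only checks the local LAN. To test across a router, raise the TTL above the number of router hops between the two machines. If listen() returns None while ordinary ping works, multicast is being dropped somewhere along the path.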

3.  rtiddsping works, does my application work?

After running rtiddsgen with -example <arch> on your IDL for MyFooType, you get a default publisher and a default subscriber.  Compile these without changing anything, then run them.

If they see each other, great!  This means that they are on the same Domain (because you didn't change anything), the Type is assignable (obviously, as they are using the same IDL), the Topic is the same (since you haven't changed it from the generated "Example MyFooType" topic) and QoS settings are compatible (again, because you haven't changed them from the defaults).  DDS is working, so now go forth, and Build Great Things!

If they do NOT see each other... hm.  In this case, run rtiddsspy.  The default domain in the example code is 0, so just run rtiddsspy without the -domainId <int> flag.  I recommend running it on both the publisher and the subscriber machine (one at a time).  In a working distributed system, you should see on either machine a "W +N" line (this is the publisher), and a "R +N" (the subscriber) line:

    1408440707.896907  W +N  C0A80064    Example MyFooType           MyFooType          
    1408440706.509978  R +N  C0A8006F    Example MyFooType           MyFooType

If you don't get those two lines, you are probably getting only one: "W +N" if Spy is running on the publisher side, or "R +N" if it is running on the subscriber side.  If this is what you see, post that here, because that's pretty interesting.
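When there are many lines to read, the Spy output shown above can be picked apart with a short script. This sketch assumes the column layout of the two sample lines (timestamp, W or R, event, host, topic, type) and assumes the 8-digit host column is an IPv4 address in hexadecimal, which is consistent with the sample values but may not hold for every Connext version or configuration.

```python
import ipaddress
import re

# Columns: timestamp, W/R, event, host (8 hex digits), topic, type name.
# Topic names may contain single spaces, so columns are split on 2+ spaces.
SPY_LINE = re.compile(
    r"^(\d+\.\d+)\s+([WR])\s+(\S+)\s+([0-9A-Fa-f]{8})\s{2,}(.+?)\s{2,}(\S+)\s*$"
)

def parse_spy_line(line):
    """Return a dict for one rtiddsspy discovery line, or None if it does not match."""
    m = SPY_LINE.match(line.strip())
    if m is None:
        return None
    timestamp, kind, event, host_hex, topic, type_name = m.groups()
    return {
        "timestamp": float(timestamp),
        "kind": "writer" if kind == "W" else "reader",
        "event": event,
        # Assumption: the host column encodes an IPv4 address in hex.
        "host": str(ipaddress.ip_address(int(host_hex, 16))),
        "topic": topic,
        "type": type_name,
    }

sample = "1408440707.896907  W +N  C0A80064    Example MyFooType           MyFooType"
print(parse_spy_line(sample))
```

Under that assumption, C0A80064 decodes to 192.168.0.100 and C0A8006F to 192.168.0.111, which tells you which machine each endpoint was discovered on.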

4.  You're well into developing your system, but now there's a publisher not seeing a subscriber (in an isolated case)

Check the following, using the tool described:

  • Can Spy see the publisher (from a machine that the publisher is not running on)?
  • Can Spy see the subscriber (from a machine that the subscriber is not running on)?

If Spy can see both, my guess is that the Topic is different (one may be "Example MyFooType" because the engineer forgot to change it to the one in use), or the Type is different and not assignable (because an engineer updated the IDL but failed to tell you, so the new publisher is using the new type and your old subscriber is using the old type).  If the same Topic is given by both the publisher and the subscriber, but the Topic QoS or Type differs sufficiently, Spy will self-configure to agree with the first one it sees (which is arbitrary in a running system).  This may result in an incompatible DataReader QoS.

  • Does Admin Console show red flags for the Topic, the Type, the QoS or anything else?  If it is not self-evident what the problem is at this point, post that here.
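The Topic and Type checks above can also be done mechanically once you have a list of the endpoints Spy reported. The sketch below is a hypothetical helper of mine, not part of any RTI tool. It compares type names only, so it will not catch a type whose name is unchanged but whose definition has changed in a non-assignable way.

```python
from collections import defaultdict

def diagnose(endpoints):
    """Flag topics with a one-sided or mismatched set of endpoints.

    endpoints: iterable of (kind, topic, type_name), where kind is
    "W" for a DataWriter or "R" for a DataReader.
    """
    by_topic = defaultdict(lambda: {"W": set(), "R": set()})
    for kind, topic, type_name in endpoints:
        by_topic[topic][kind].add(type_name)

    problems = []
    for topic, kinds in sorted(by_topic.items()):
        if not kinds["W"]:
            problems.append(f"{topic}: reader(s) but no writer")
        elif not kinds["R"]:
            problems.append(f"{topic}: writer(s) but no reader")
        elif kinds["W"] != kinds["R"]:
            problems.append(f"{topic}: type names differ between writers and readers")
    return problems

# Hypothetical endpoints: the subscriber was left on the generated topic name.
seen = [
    ("W", "MyFooTopic", "MyFooType"),
    ("R", "Example MyFooType", "MyFooType"),
]
for problem in diagnose(seen):
    print(problem)
```

A topic that shows up with only writers or only readers is the usual sign that one side is still using a different Topic name.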

5. Admin Console (or Analyzer) shows DomainParticipants, but does not show any of the sub-entities

Discovery is a process with a handshake, that starts in Multicast but completes in Unicast.  When Discovery can be shown to start (the DomainParticipants are visible to each other), but not complete (there are no Publishers, Subscribers, DataReaders, DataWriters), the problem is again probably a firewall. 

6.  Your system works, but appears to be working sub-optimally, or the QoS behavior appears incorrect

Are you sure that the QoS you think you are using is the QoS profile that the application is actually using?  To quickly determine if the application is picking up an incorrect USER_QOS_PROFILES.xml file, cd to the directory that has the file you want to be using, and insert some bad (malformed) XML.  You can simply stick some text on the first line of the file, outside any XML tags.  Save the file and launch your application.  If the DomainParticipants do get created and the application starts running, then the application is not sourcing the USER_QOS_PROFILES.xml file that you broke.

  1. The DomainParticipants use the USER_QOS_PROFILES.xml file in the working directory.  Launch the application from the directory where the USER_QOS_PROFILES.xml file is (./objs/<arch>/<application>); do not change to the application directory first (cd objs/<arch>; ./<application>).
  2. If your application is started from a script, verify that the script is not changing to a new working directory (no 'cd' commands).
  3. Verify that the NDDS_QOS_PROFILES environment variable is not set, or if it is set, that it does include the relevant file.
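The three checks above can be gathered into one small report. This is a sketch with names of my own choosing: it only looks at the given directory and at the NDDS_QOS_PROFILES environment variable, and it does not try to interpret the variable's contents or look at any other location the middleware might load profiles from.

```python
import os
import xml.etree.ElementTree as ET

def check_qos_profile(directory="."):
    """Report which USER_QOS_PROFILES.xml a process started in `directory` would find."""
    path = os.path.join(directory, "USER_QOS_PROFILES.xml")
    report = {
        "path": os.path.abspath(path),
        "exists": os.path.isfile(path),
        "well_formed": None,  # stays None when the file does not exist
        "NDDS_QOS_PROFILES": os.environ.get("NDDS_QOS_PROFILES"),
    }
    if report["exists"]:
        try:
            ET.parse(path)
            report["well_formed"] = True
        except ET.ParseError:
            report["well_formed"] = False
    return report

print(check_qos_profile(os.getcwd()))
```

Run it from the same directory, and in the same environment, that your launch script ends up in. If "exists" is False there, the application is not reading the file you think it is.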

If at any time your steps are forced to diverge from the steps above, post to the forums, and I'll fix the above to agree with what you are seeing.