Linux and Windows interop problem

Hello there,

I have a strange symptom when I connect a PC application compiled against version 5.1 with a Raspberry Pi application compiled against version 5.0 (I have no access to the Raspberry 5.1 libs yet).

I use the same QoS file on both sides; message_size_max and the receiver buffer pool size are both set to 9216 (backwards compatible). The only enabled transport is UDPv4.

I have a reader and a writer app on a single topic which can both be compiled on the PC and Raspberry.

Now the fun:

Writer and reader app both started on the PC work, and both started on the Raspberry work also.

Writer started on Raspberry and reader started on PC works also.

Writer started on PC and reader started on Raspberry do not communicate! There is a strange error message "MIP(something)!RTPS" (sorry, I don't remember it exactly) shown on the Raspberry console, which seems to point to an RTPS incompatibility between 5.1 and 5.0.

Any ideas what could be wrong?

Best regards

Josef

 


Hello Josef,

Raspberry Pi (armv6vfphLinux3.xgcc4.7.2) is now part of the official RTI Connext DDS release. You should be able to download the 5.1.0 libraries for Raspberry Pi from the RTI Support portal. If you cannot access the libraries, ask your distributor or, if you are part of the University Program, send an email to up@rti.com so they can enable that architecture for you.

I have been trying to reproduce your issue running a simple Hello World example on a Linux machine (running 5.1.0 libraries) and a Raspberry Pi (running 5.0.0 libraries). However, they did communicate with the out of the box settings. Could you share your USER_QOS_PROFILES.xml and maybe your IDL file? That would help us debug your problem.

In the meantime, you could try to run RTI Admin Console on your PC and look for possible QoS mismatches. Also, you can try to load the built-in QoS profile BuiltinQosLib::Baseline.5.0.0 in your 5.1.0 application, as suggested in this solution. That will make your 5.1.0 QoS settings and your 5.0.0 applications exactly the same.

<qos_profile name="profiles_Profile" base_name="BuiltinQosLib::Baseline.5.0.0" is_default_qos="true">
  <participant_qos>
    ... 
  </participant_qos> 
  ... 
</qos_profile>  

Best,
Fernando. 


Hello Fernando,

I did some more tests and it seems to be an incompatibility of RTI 5.1.0 between Linux and Windows.

I have now three platforms: Raspberry with 5.1, Ubuntu 12.04 with 5.1 (RTI Live CD) and Windows 7 with RTI 5.1
(the Ubuntu is in a VirtualBox machine in Windows 7 and has full network access)

I use the builtin QoS "BuiltinQosLibExp::Generic.KeepLastReliable.TransientLocal" from the library "BuiltinQosLib" on all platforms.

NDDS_DISCOVERY_PEERS is NOT set on any of the platforms!

It is the same source code compiled under the three platforms (with different makefiles)

Now I start the writer app on the Raspberry and the reader app on the Linux and Windows.

When the Raspberry publishes on the topic, the Linux machine gets the data and the Windows machine does not!

When the Ubuntu virtual machine (x86) publishes on the topic, the Windows machine does not get the data either!

So this definitely seems to be an issue between Linux and Windows, not between ARM and x86.

The funny thing is that the reverse works as expected: the reader on the Raspberry gets data from writers on Linux and Windows!

The discovery seems to work, when I dump the participant QoS I find that the Windows participant finds two other participants (Raspberry and Ubuntu) so this cannot be the issue.

Strange is also that the Linux is in a VirtualBox machine under Windows and it works whereas the Windows app itself does not work!

In version 4.5 this definitely worked...

I have no idea what could be wrong on my side, as I'm using only the builtin QoS and compile the same applications on the three platforms.

Edit 14-03-01: In the case where the writer on Linux is not communicating with the reader on Windows:
The reader on Windows gets the Subscription matched & Liveliness changed listener callbacks but the writer on Linux does NOT get the Publication matched & Reader activity changed callbacks. If I start the writer on Windows it gets those callbacks.

Edit 14-03-02: Found the reason, the writer on the Linux side is confused (and won't connect) when there is more than one reader on the Windows side. This can be reproduced by adding a second (dummy) topic and reader to the HelloSubscriber.cpp in the Hello_simple example. This is so easy (just copy&paste) that I don't attach the updated example here. It then no longer prints the messages!
This is a serious bug in RTI NDDS 5.x and should have been covered by a regression test!

Best regards

Josef


Hi Josef,

Can you attach your full example? Before we can confirm a bug, we need to be able to reproduce.  I followed the steps you described and had no problems running any combination of Linux and Windows.  Can you also try running rtiddsspy on both the Linux and Windows nodes? Use the following command:

rtiddsspy -printSample -qosProfile "BuiltinQosLibExp::Generic.KeepLastReliable.TransientLocal"

You should see the data samples published by any DataWriter.  Again, I verified that I was able to see this when running spy on both Windows and Linux (and Mac), and no matter where the DataWriter was running.

-sumeet


Hello Sumeet,

Steps to reproduce the bug:

1) Install Live-CD 5.1.0 (32-bit) either in x86 hardware or in virtual machine (needs full network access)

2) Compile Hello_simple in Live CD native using 'Makefile.i86Linux3.gcc4.6'

3) Compile Hello_simple in Windows (I used W7 64-bit) using NDDS 5.1.0 and Visual Studio 2010 (32-bit)

4) Start HelloSubscriber in Windows first

5) Start HelloPublisher in Linux

6) Enter string in HelloPublisher in Linux -> no output in Windows

If you start the Publisher before the Subscriber then it works sometimes: if there is a short delay (100 ms is sufficient) between the creation of the readers on the Windows side, they are discovered/matched independently and then it works. On my machine Analyzer showed the matched writer and reader on the Windows side, but on the Linux side the writer had no reader...

This bug seems to show when more than one reader is 'discovered' at the same time on the writer side. Using the default QoS (as in the Hello example) or the KeepLast profile as in my application does not make a difference. I could reproduce the bug on different Windows machines in combination with the Raspberry and x86 Linux. This could be a race condition during transmission of the reader discovery. In shared memory (writer and reader on the same machine) it works every time.

Regards

Josef


Hi Josef,

I followed your exact steps, but saw no problems with the communication.  

  1. I installed the RTI 5.1.0 Live CD on a virtual machine (VirtualBox).  I gave the machine "bridged network access" so that multicast discovery can operate successfully.
  2. I compiled the Hello_simple that you provided on the Linux machine, for the architecture (i86Linux3gcc4.6.3)
  3. I compiled the Hello_simple that you provided on the Windows machine, for the architecture (i86Win32VS2010). Note that I also run Windows in a separate VM (Parallels for Mac), also with bridged network access.
  4. I started the HelloSubscriber on the Windows machine
  5. I started the HelloPublisher on Linux
  6. I entered a string ... and saw the output on Windows

I altered the startup order, and saw things work as expected. I also switched the roles (Subscriber on Linux, Publisher on Windows) and observed the (correct) expected behavior. Lastly, I altered the subscriber so it creates 10 DataReaders for 10 different topics, and still had everything work correctly (i.e. I received all messages from the publisher).  I also verified via Wireshark that packets were actually making it through as expected, all using the default configurations.

I can assure you that our unit testing does include the creation of multiple DataReaders in the same process, so I'm surprised you are seeing this issue.  Do you have access to Wireshark? You can use the 'rtps2' dissector to observe the packets on the wire. When you create two DataReaders in the same process, you should see that a single discovery packet contains both DataReader announcements.

You'll also see that the packet containing the discovery announcements for the two readers is of size 1656 UDP bytes, while all of the other discovery packets are well below 1500 bytes.  This means that the discovery announcement packet for the DataReaders will be fragmented by your IP stack.  Can you verify (using Wireshark) that the fragments are making it through the network?
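To make the arithmetic above concrete, here is a minimal sketch of how an IPv4 stack would split such a datagram. It assumes a 20-byte IPv4 header with no options, and the function name ipv4_fragments is made up for illustration; it is not part of any RTI or Wireshark API.

```python
IPV4_HEADER = 20  # bytes; assumption: no IP options present


def ipv4_fragments(udp_length, mtu=1500):
    """Return the IP payload size of each fragment for one UDP datagram.

    udp_length is the UDP length field (UDP header + data), which is
    the number Wireshark reports for the datagram.
    """
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    remaining = udp_length
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


print(ipv4_fragments(1656))        # [1480, 176] -> two fragments at MTU 1500
print(ipv4_fragments(1656, 9000))  # [1656] -> sent whole if the sender's MTU is larger
```

With a 1500-byte MTU the 1656-byte datagram goes out as two fragments, and both must arrive for the receiver to reassemble it. If the sending interface is configured with a larger MTU, the datagram leaves in one piece and any hop limited to 1500 bytes has to fragment or drop it.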

See this article (http://community.rti.com/kb/why-dont-i-receive-data-when-using-large-data-type-over-1400-bytes-windows-system) for something to verify with your Windows network libraries.  Also, take a look at the recommended registry settings for Windows, in our Platform Notes (http://community.rti.com/rti-doc/510/ndds.5.1.0/doc/pdf/RTI_CoreLibrariesAndUtilities_PlatformNotes.pdf). On page 72, we recommend you increase the value for the FastSendDatagramThreshold.

-sumeet


Hi Sumeet,

the documentation states that setting the 'FastSendDatagramThreshold' is only recommended 'On all Windows systems prior to Windows Vista'.

As I'm using Windows 7 should I still set that? And does it have an influence on Windows 7?

From your explanation fragmentation is the most plausible cause.

Edit: 14-03-04: It is indeed a fragmentation issue. The command 'ping -l 1500 <raspi>' exposed it: at first there was no response. On my machine I found that jumbo frames were enabled on the network interface. After disabling jumbo frames it works flawlessly and the ping gets a response. Now I have to find out why the other combinations don't work, but the ping command was the most helpful tool for diagnosing the problem.
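For anyone who wants to script the same check, the ping test could be wrapped roughly like this. It is only a sketch: the helper names are made up, "raspi" is a placeholder host name, and it assumes the stock Windows ping (-n count, -l payload size) or a Linux-style ping (-c count, -s payload size).

```python
import platform
import subprocess


def large_ping_command(host, payload=1500, count=2, system=None):
    """Build a ping command that sends `payload` bytes of ICMP data."""
    system = system or platform.system()
    if system == "Windows":
        # Windows ping: -n = number of echo requests, -l = payload size
        return ["ping", "-n", str(count), "-l", str(payload), host]
    # Linux-style ping: -c = number of echo requests, -s = payload size
    return ["ping", "-c", str(count), "-s", str(payload), host]


def large_ping_ok(host, payload=1500):
    """Rough check: True if ping exits with status 0 for the large payload."""
    result = subprocess.run(large_ping_command(host, payload),
                            capture_output=True)
    return result.returncode == 0


print(large_ping_command("raspi", system="Windows"))
# ['ping', '-n', '2', '-l', '1500', 'raspi']
```

The exit status is only a rough signal, so it is worth reading the ping output as well. Comparing a small payload against a 1500-byte one is what separates a general connectivity problem from a large-packet problem.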

Regards

Josef


Hi Josef,

That makes a bit more sense. Jumbo frames can be tricky because every component on the end-to-end path must support jumbo frames, or you must rely on network components (e.g. switches/routers) to properly manage the transition from jumbo to normal. 

Regarding Windows 7... I haven't been able to determine whether the FastSendDatagramThreshold needs to be configured correctly on Windows 7 and Server 2008.  I always advise setting the registry value, just in case.

Glad we figured out your issue before you went on vacation.

---edit (07:04pm PST, 4Mar2014)

I ran a test to compare the packet size differences between 4.5f and 5.1.0.  For the same application, and creating two DataReaders:

  • 4.5f: packet size is 784 bytes (UDP length)
  • 5.1.0: packet size is 1656 bytes (UDP length)

So things worked fine with 4.5f because the discovery packet was well below the 1500 MTU threshold.  5.1.0 has a bigger discovery packet because the discovery data includes a new field called "TypeObject", which is a new way of describing the data type associated with the topic. This new field comes from the XTypes specification, and it appears in 5.1.0 because we added more support for XTypes (i.e., mutability and optionality).

-sumeet


Hi Sumeet,

I'm back from vacation. I had an idea for a new 'feature' which would help diagnose fragmentation problems.

After discovery of a new domain participant, the local participant could do the equivalent of 'ping -l 1500' to check whether there is a fragmentation problem. If the ping does not succeed, it is known that the discovered participant won't work; a warning or an error can be written to the log and the discovered participant can be marked 'unreachable', because multiple readers/writers will not match. The ping would of course only be needed for network transports.
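To illustrate the idea (this is not an existing RTI feature or API, just a sketch with made-up names), such a probe could send one large UDP datagram to a small echo responder on the peer and see whether it comes back intact. The demo below runs over loopback only. The 1648-byte default is the 1656-byte UDP length mentioned earlier in the thread minus the 8-byte UDP header.

```python
import socket
import threading


def echo_once(sock):
    """Hypothetical responder: send one received datagram back to its sender."""
    data, peer = sock.recvfrom(65535)
    sock.sendto(data, peer)


def probe_large_datagram(addr, size=1648, timeout=1.0):
    """Return True if a `size`-byte UDP payload makes the round trip to addr."""
    payload = bytes(size)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, addr)
            reply, _ = sock.recvfrom(65535)
        except OSError:  # covers timeouts and ICMP errors reported by the OS
            return False
    return reply == payload


def demo():
    """Loopback-only demonstration; a real check would target the discovered peer."""
    responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    responder.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    thread = threading.Thread(target=echo_once, args=(responder,), daemon=True)
    thread.start()
    ok = probe_large_datagram(responder.getsockname())
    thread.join(timeout=2.0)
    responder.close()
    return ok


if __name__ == "__main__":
    print("large datagram round trip ok:", demo())
```

A participant that fails the probe would then be logged and flagged as unreachable instead of silently failing to match.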

The feature would be an RTI extension but very helpful in my opinion.

Regards

Josef