Intermittent Data Loss / How to Diagnose DataWriter Problems?

Joined: 05/31/2019
Posts: 4

I am having problems with data loss. Samples are given to DataWriter::write(), but they are not received anywhere (multiple readers/listeners in the network). Also, there is no error message. I would give a minimal reproducible example, but the problem is intermittent and difficult to reproduce. The system works fine for maybe 5 minutes. Then a single participant stops sending data for a few seconds to a minute. Then it "recovers" and everything is back to normal.


I have checked some obvious potential causes. Maybe the participant is not actually sending? Or the network is having problems and dropping packets? To rule that out, every time DataWriter::write() is called, the same message is also sent as a raw UDP packet. Most of the time both the DDS and UDP messages are received. When the DDS samples are missing, the UDP messages are still received. So the network seems to be fine.
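For anyone who wants to repeat this cross-check, here is a minimal sketch of such a UDP side channel using plain POSIX sockets. It is not the code from my project: the helper name send_udp_copy and its parameters are made up for illustration, and it assumes an IPv4 destination. The idea is to call it right next to DataWriter::write() with the same payload and count arrivals on the receiving machine.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (name and signature are assumptions): sends one
// datagram to ipv4_addr:port. A true return only means the whole payload
// was handed to the kernel; it says nothing about delivery, so the
// receiver still has to count what actually arrives.
bool send_udp_copy(const std::string& ipv4_addr, std::uint16_t port,
                   const void* payload, std::size_t size)
{
    const int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return false;
    }

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(port);
    if (::inet_pton(AF_INET, ipv4_addr.c_str(), &dest.sin_addr) != 1) {
        ::close(fd);  // not a valid dotted-quad IPv4 address
        return false;
    }

    const ssize_t sent =
        ::sendto(fd, payload, size, 0,
                 reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
    ::close(fd);
    return sent == static_cast<ssize_t>(size);
}
```

Opening a socket per call keeps the sketch short; at 10 Hz that is harmless, but a long-lived socket would be the more usual choice.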



Overview of the system:
* One desktop PC, Ubuntu 18.04, rti_connext_dds-5.3.1, using Modern C++ API, connected via Ethernet
* Several (1-4) Raspberry Pi Zero W, Raspbian GNU/Linux 9 (stretch), cross-compiled raspbian-toolchain-gcc-4.7.2-linux64, rti_connext_dds-5.3.1, libraries armv6vfphLinux3.xgcc4.7.2, using Modern C++ API, connected via WiFi


The reliability setting seems to make no difference: messages from both "reliable" and "best effort" topics are lost.

The problem seems to be more frequent with more participants (Raspberries) in the network.

Any tips on how to check what the DataWriter is doing (or not doing)? Thanks!

EDIT, with some additional info:

When looking at the Admin Console Tool, under "Physical View > Process : (ID) > DDS Entities", when the problem occurs ALL of the publishers, subscribers, data writers, data readers disappear. Only the participant remains.

Restarting the problematic process / program fixes the problem.

Joined: 11/08/2017
Posts: 7

Hi Janis,

Are you doing a lot of processing on the subscribers? Are you using callbacks to process the data (listeners in the DataReaders) or a waitset? If you are doing a lot of processing and the subscribers cannot keep up with the publisher, samples may be lost, depending on the DDS configuration. A minimal reproducer may help.

It would also help to increase the DDS log level, at least to WARNING, to see what is going on. This page explains how to do it: https://community.rti.com/howto/useful-tools-debug-dds-issues . I recommend using Logparser to make the log more readable (the link has the required information).

You can also implement the on_sample_lost method in the DataReader listeners, so you can check whether a sample is really lost and the reason for it. This link shows how to use listeners in your DataReaders with C++11: https://community.rti.com/static/documentation/connext-dds/6.0.0/doc/api/connext_dds/api_cpp2/group__DDSReaderExampleModule.html

I recommend that you perform all these tests using reliable reliability instead of best effort.

I hope this helps,

 Irene

Joined: 05/31/2019
Posts: 4

New Setup

As previously noted, the problem seems to be more frequent with more participants (Raspberries) in the network. Since we eventually want to operate the system with 20 Raspberries, I decided to increase the number of Raspberries. In the tests I used 12 to 17 Raspberries; the exact number has no notable effect. This made reproducing the problem much more reliable.

I was able to simplify the setup and still produce the problem. Now there is only one program running. And it only does one thing: Send a DDS sample at a regular interval (10 Hz). Each Raspberry runs one instance of this program. None of our other RTI DDS programs are running. The problem is diagnosed using the RTI Admin Console, which provides the only subscriber / reader in the network.
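To give an idea of the shape of that test program without the RTI parts, here is a minimal sketch of a fixed-rate send loop. This is not the actual test code (that goes out by email): run_periodic and the publish callback are names invented for this post, and in the real program the callback would log "Calling DataWriter::write" and then call DataWriter::write(). A period of 100 ms gives the 10 Hz rate.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical helper: calls publish(i) for i = 0 .. count-1, one call per
// period. Deadlines are computed from the start time rather than from
// "now", so a slow publish() call does not make the schedule drift.
void run_periodic(std::chrono::milliseconds period, std::size_t count,
                  const std::function<void(std::size_t)>& publish)
{
    auto next = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        publish(i);                           // stand-in for DataWriter::write()
        next += period;                       // absolute deadline of the next tick
        std::this_thread::sleep_until(next);  // returns at once if already late
    }
}
```

Logging inside the callback on every tick is what makes it possible to compare the application's send attempts with what the middleware log shows.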

As for your question about subscribers: we use a mix of AsyncWaitSet and polling on DataReader::take(), no listeners. However, since the simplified setup uses no readers / subscribers, this is probably irrelevant.

The logging is set to STATUS_ALL. I will send you the simplified test code and logs via email.


Test and Observations

The test program is started on the Raspberries automatically. The Admin Console is started manually. The setup phase takes around 30 seconds. Everything works as expected for the first few minutes. All Raspberries show frequent new data in the Admin Console.

The problem occurs consistently after 4 to 5 minutes. For 2 to 4 Raspberries, data continues to be received normally in the Admin Console. Which and how many Raspberries continue to operate normally appears to be random. For the other Raspberries (7 to 15) there is no new data received in the Admin Console.

I'm not sure how to interpret the log, but I noticed one thing. In the first phase, "NDDS_Transport_UDP_send" appears frequently. After the problem starts, it appears rarely, and much more rarely than my own log entry "Calling DataWriter::write".
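That comparison can be made less hand-wavy by counting the two markers in consecutive chunks of the log. The sketch below assumes nothing about the log format beyond each entry being on its own line and containing one of the two strings quoted above; the struct and function names are invented for this post. It chunks by line count instead of by timestamp precisely because I am not sure of the timestamp format.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Marker counts for one chunk of consecutive log lines.
struct MarkerCounts {
    std::size_t write_calls = 0;  // lines containing "Calling DataWriter::write"
    std::size_t udp_sends = 0;    // lines containing "NDDS_Transport_UDP_send"
};

// Reads the log line by line and returns one MarkerCounts per chunk of
// lines_per_chunk lines (the last chunk may be shorter). Returns an empty
// vector if lines_per_chunk is 0.
std::vector<MarkerCounts> count_markers(std::istream& log,
                                        std::size_t lines_per_chunk)
{
    std::vector<MarkerCounts> chunks;
    if (lines_per_chunk == 0) {
        return chunks;
    }
    std::string line;
    std::size_t line_no = 0;
    while (std::getline(log, line)) {
        if (line_no % lines_per_chunk == 0) {
            chunks.emplace_back();  // start a new chunk
        }
        ++line_no;
        if (line.find("Calling DataWriter::write") != std::string::npos) {
            ++chunks.back().write_calls;
        }
        if (line.find("NDDS_Transport_UDP_send") != std::string::npos) {
            ++chunks.back().udp_sends;
        }
    }
    return chunks;
}

// Convenience overload for log text that is already in memory.
std::vector<MarkerCounts> count_markers(const std::string& text,
                                        std::size_t lines_per_chunk)
{
    std::istringstream in(text);
    return count_markers(in, lines_per_chunk);
}
```

A chunk where write_calls stays steady while udp_sends drops toward zero would mark the point where the writer is still being called but little is handed to the transport.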

Second Test

To get a better sense of the timeline of the problem, I repeated the test, but running the RTI Recording Console instead of the Admin Console. A Matlab script processes the recording and shows the number of samples per 1 second interval, per Raspberry. See the attached picture.
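For readers without Matlab, the binning step is simple enough to sketch in C++. This is a stand-in, not the script I used: the SampleRecord layout (a source name plus a reception time in seconds) is an assumption about how the recording would be exported, and all names are invented for this post.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record layout for one received sample.
struct SampleRecord {
    std::string source;  // which Raspberry published the sample
    double timestamp_s;  // reception time in seconds
};

// Counts samples per source in 1-second bins covering [t0, t0 + duration_s).
// Every source that appears in `records` gets a row, even if all of its
// samples fall outside the window, so a silent publisher shows up as zeros.
std::map<std::string, std::vector<std::size_t>>
samples_per_second(const std::vector<SampleRecord>& records,
                   double t0, double duration_s)
{
    std::map<std::string, std::vector<std::size_t>> bins;
    if (!(duration_s > 0.0)) {
        return bins;
    }
    const auto n_bins = static_cast<std::size_t>(std::ceil(duration_s));
    for (const auto& r : records) {
        auto it = bins.find(r.source);
        if (it == bins.end()) {
            it = bins.emplace(r.source,
                              std::vector<std::size_t>(n_bins, 0)).first;
        }
        const double offset = r.timestamp_s - t0;
        if (!(offset >= 0.0 && offset < duration_s)) {
            continue;  // outside the window (also rejects NaN)
        }
        ++it->second[static_cast<std::size_t>(offset)];  // truncation = floor here
    }
    return bins;
}
```

At 10 Hz a healthy publisher should show roughly 10 in every bin, and an outage appears as a run of zeros in that publisher's row.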

Joined: 11/08/2017
Posts: 7

Hi Janis,

If you are using Admin Console to read the samples, it will be helpful if you send me the Admin Console logs when they are failing to receive the samples.

To do so, you should configure Admin Console to use reliable reliability:

And update the log configuration:

So the log messages will appear in the Console Log tab (you can export it to a file).

We also need to see what is going on in the subscribers (the ones created by the Admin Console instances, in this case).

It would be great if you could also send me a new log capture from the publisher side. Since we are changing the subscriber reliability to reliable, I expect that log to be more useful.

 

Thanks,

 Irene

 

Joined: 05/31/2019
Posts: 4

Hi Irene,

I changed the two settings, subscriber to reliable and console logging level to trace. The logging preferences window looks a little different to me (see screenshot).

Both new logs are attached.

Thanks,
Janis

Joined: 05/31/2019
Posts: 4

I went in a new direction with testing, swapping out hardware components to see what has an effect on this fault.

                         Publisher: Raspberry/ARM   Publisher: PC/x64
Router A, via WiFi       Connection Lost            Connection Lost
Router A, via Ethernet   N/A                        Ok
Router B, via WiFi       Ok                         Ok

 

Looking at these results, router A's WiFi seems to be the common factor for the fault.

The confusing part is that it only seems to affect DDS connections. Other traffic (ssh, ping, http) works fine at all times.

Joined: 11/08/2017
Posts: 7

Hi Janis,

Looking at the logs, I've confirmed that, as we suspected, samples are being lost on the subscriber side. This can happen for many reasons, usually related to network issues. It is also interesting that it happens with one specific router and not with the other one you are using.

At this point, two Wireshark captures, one on the subscriber side (taken while samples are being lost) and one on the publisher side, would really help to see what is going on on the wire. Could you attach these Wireshark captures so I can review them?

Thanks,

 Irene