Intermittent Data Loss / How to Diagnose DataWriter Problems?


I am having problems with data loss. Samples are given to DataWriter::write(), but they are not received anywhere (multiple readers/listeners in the network). Also, there is no error message. I would give a minimal reproducible example, but the problem is intermittent and difficult to reproduce. The system works fine for maybe 5 minutes. Then a single participant stops sending data for a few seconds to a minute. Then it "recovers" and everything is back to normal.


I have checked some obvious potential causes. Maybe the participant is not actually sending? Or is the network having problems and dropping packets? To rule that out, every time DataWriter::write() is called, the same message is also sent as a raw UDP packet (see the sketch below). Most of the time, both the DDS and UDP messages are received. When the DDS samples are missing, the UDP messages are still received. So the network seems to be fine.
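A minimal sketch of this UDP side channel; the destination is the desktop PC, while the port (50000) and the function name are made up for illustration:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Sends the same payload that was just given to DataWriter::write()
// as a raw UDP datagram, so both paths can be compared at the receiver.
void send_raw_udp(const std::string &payload)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) return;

    sockaddr_in dest {};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(50000);
    inet_pton(AF_INET, "192.168.1.249", &dest.sin_addr);

    sendto(sock, payload.data(), payload.size(), 0,
           reinterpret_cast<sockaddr *>(&dest), sizeof(dest));
    close(sock);
}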



Overview of the system:
* One desktop PC, Ubuntu 18.04, rti_connext_dds-5.3.1, using Modern C++ API, connected via Ethernet
* Several (1-4) Raspberry Pi Zero W, Raspbian GNU/Linux 9 (stretch), cross-compiled raspbian-toolchain-gcc-4.7.2-linux64, rti_connext_dds-5.3.1, libraries armv6vfphLinux3.xgcc4.7.2, using Modern C++ API, connected via WiFi


The reliability setting seems to make no difference: samples from "reliable" and "best effort" topics are both lost.

The problem seems to be more frequent with more participants (Raspberries) in the network.

Any tips on how to check what the DataWriter is doing (or not doing)? Thanks!

EDIT, with some additional info:

When looking at the Admin Console, under "Physical View > Process : (ID) > DDS Entities", ALL of the publishers, subscribers, data writers, and data readers disappear when the problem occurs. Only the participant remains.

Restarting the problematic process / program fixes the problem.


Hi Janis,

Are you doing a lot of processing on the subscribers? Are you using callbacks to process the data (listeners on the DataReaders) or a WaitSet? If you are doing a lot of processing and the subscribers cannot keep up with the publisher, samples may be lost depending on the DDS configuration. A minimal reproducer would help.

Also, it would be helpful to increase the DDS log level, at least to WARNING, to see what is going on. Here is how to do it: https://community.rti.com/howto/useful-tools-debug-dds-issues . I recommend using Logparser to improve the readability of the log (the link above covers that as well).
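For example, the verbosity can also be raised programmatically from the Modern C++ API (a minimal sketch; the linked article covers the other options):

#include <rti/config/Logger.hpp>

// Raise the Connext log verbosity before creating any DDS entities.
rti::config::Logger::instance().verbosity(rti::config::Verbosity::WARNING);
// ...or rti::config::Verbosity::STATUS_ALL to log everything.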

You can also implement the on_sample_lost method in your DataReader listeners, so you can check whether samples are really being lost and the reason behind it. This link shows how to use listeners in your DataReaders with C++11: https://community.rti.com/static/documentation/connext-dds/6.0.0/doc/api/connext_dds/api_cpp2/group__DDSReaderExampleModule.html
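A minimal sketch of such a listener, assuming an IDL-generated type called VehicleData (a stand-in for your real type):

#include <iostream>
#include <dds/dds.hpp>
#include "VehicleData.hpp"  // hypothetical IDL-generated type

class LossReportingListener
    : public dds::sub::NoOpDataReaderListener<VehicleData> {
    void on_sample_lost(
            dds::sub::DataReader<VehicleData> &reader,
            const dds::core::status::SampleLostStatus &status) override
    {
        std::cout << "Lost " << status.total_count_change()
                  << " sample(s), " << status.total_count()
                  << " in total" << std::endl;
    }
};

// Attach it to an existing reader, enabling only the sample_lost status:
// LossReportingListener listener;
// reader.listener(&listener, dds::core::status::StatusMask::sample_lost());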

I recommend that you perform all these tests using reliable reliability instead of best effort.
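For reference, a sketch of requesting reliable reliability on a DataReader with the Modern C++ API:

#include <dds/dds.hpp>

dds::sub::qos::DataReaderQos reader_qos;
reader_qos << dds::core::policy::Reliability::Reliable();
// dds::sub::DataReader<VehicleData> reader(subscriber, topic, reader_qos);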

I hope this helps,

 Irene


New Setup

As previously noted, the problem seems to be more frequent with more participants (Raspberries) in the network. Since we eventually want to operate the system with 20 Raspberries, I decided to increase their number. In the tests I used 12 to 17 Raspberries; the exact number had no notable effect. This made reproducing the problem much more reliable.

I was able to simplify the setup and still reproduce the problem. Now there is only one program running, and it does only one thing: send a DDS sample at a regular interval (10 Hz); a rough sketch is shown below. Each Raspberry runs one instance of this program. None of our other RTI DDS programs are running. The problem is diagnosed using the RTI Admin Console, which provides the only subscriber/reader in the network.
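Roughly, the test program looks like this (a sketch; VehicleData and the topic name are stand-ins for the real ones):

#include <chrono>
#include <thread>
#include <dds/dds.hpp>
#include "VehicleData.hpp"  // hypothetical IDL-generated type

int main()
{
    dds::domain::DomainParticipant participant(0);
    dds::topic::Topic<VehicleData> topic(participant, "VehicleData");
    dds::pub::DataWriter<VehicleData> writer(
            dds::pub::Publisher(participant), topic);

    VehicleData sample;
    while (true) {
        writer.write(sample);  // one sample every 100 ms = 10 Hz
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}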

As for your question about subscribers: we use a mix of AsyncWaitSet and polling on DataReader::take(), no listeners. However, since the simplified setup uses no readers/subscribers, this is probably irrelevant.

The logging is set to STATUS_ALL. I will send you the simplified test code and logs via email.


Test and Observations

The test program is started on the Raspberries automatically. The Admin Console is started manually. The setup phase takes around 30 seconds. Everything works as expected for the first few minutes. All Raspberries show frequent new data in the Admin Console.

The problem occurs consistently after 4 to 5 minutes. For 2 to 4 Raspberries, data continues to be received normally in the Admin Console; which and how many of them continue to operate normally appears to be random. For the other Raspberries (7 to 15), no new data is received in the Admin Console.

I'm not sure how to interpret the log, but I noticed one thing. In the first phase, "NDDS_Transport_UDP_send" appears frequently. After the problem starts, it appears rarely, and much more rarely than my own log entry "Calling DataWriter::write".

Second Test

To get a better sense of the timeline of the problem, I repeated the test running the RTI Recording Service instead of the Admin Console. A Matlab script processes the recording and shows the number of samples per 1-second interval, per Raspberry. See the attached picture.

 


Hi Janis,

If you are using Admin Console to read the samples, it would be helpful if you send me the Admin Console logs from when it fails to receive them.

To do so, you should configure Admin Console to use reliable reliability and update the log configuration, so that the log messages appear in the Console Log tab (you can export it to a file).

We also need to see what is going on in the subscribers (the ones created by the Admin Console instances, in this case).

It would also be great if you send me a new log capture from the publisher side; since we are changing the subscriber reliability to reliable, I expect a more useful log.

 

Thanks,

 Irene

 


Hi Irene,

I changed the two settings, subscriber to reliable and console logging level to trace. The logging preferences window looks a little different to me (see screenshot).

Both new logs are attached.

Thanks,
Janis


I took the testing in a new direction: swapping out hardware components and seeing what has an effect on this fault.

                          Publisher: Raspberry/ARM   Publisher: PC/x64
Router A, via WiFi        Connection Lost            Connection Lost
Router A, via Ethernet    N/A                        Ok
Router B, via WiFi        Ok                         Ok

 

Looking at these results, router A's WiFi seems to be the common factor in the fault.

The confusing part is that it only seems to affect DDS connections. Other traffic (ssh, ping, http) works fine at all times.


Hi Janis,

Looking at the logs, I've confirmed that, as we suspected, samples are being lost on the subscriber side. This can happen for many reasons, usually related to network issues. It is also interesting that this happens with one specific router and not with the other one you are using.

At this point, two Wireshark captures, one on the subscriber side (while the samples are being lost) and another on the publisher side, would be really helpful to see what is going on on the wire. Could you attach these captures so I can review them?

Thanks,

 Irene


Hi Irene,

I recorded with tshark instead of Wireshark, since it can be used from the command line and the Raspberry has no monitor.


Recording on Raspberry (only one of them, 192.168.1.114):

tshark -i wlan0 -w capture_raspberry.pcap

Recording on PC:

tshark -i enp0s31f6 -w capture_PC.pcap


Post processing:

tshark -r capture_raspberry.pcap -T fields -e frame.number -e ip.checksum -e frame.time_epoch -e ip.src -e udp.srcport -e ip.dst -e udp.dstport -e _ws.col.Protocol -e _ws.col.Info -e icmp.type -e rtps.domain_id >capture_raspberry_complete.log


tshark -r capture_PC.pcap -T fields -e frame.number -e ip.checksum -e frame.time_epoch -e ip.src -e udp.srcport -e ip.dst -e udp.dstport -e _ws.col.Protocol -e _ws.col.Info -e icmp.type -e rtps.domain_id >capture_PC_complete.log


grep RTPS capture_PC_complete.log >capture_PC_RTPS.log

grep RTPS capture_raspberry_complete.log >capture_raspberry_RTPS.log

grep 192.168.1.114 capture_PC_RTPS.log >capture_PC_filter_114.log

The results are attached.

Thanks,
Janis


Hi Janis,

Were these captures taken in the scenario where the data loss occurs?

I do not see any lost data in them.

For some reason router A is causing problems, so captures taken while the loss is happening will help us understand what is going on.

Thanks!


Hi Irene,

I remember that data was lost in the Admin Console. But looking at the logs in detail, I agree: there seems to be no data loss at the IP level.

Looking at capture_raspberry_RTPS.log, it seems like the publisher simply stopped sending after epoch 1560866202.3758.

2769    0x0000e5e3    1560866202.173551967    192.168.1.114    45318    192.168.1.249    17913    RTPS    INFO_TS, DATA        42
2770    0x0000e5db    1560866202.274683436    192.168.1.114    45318    192.168.1.249    17913    RTPS    INFO_TS, DATA        42
2771    0x0000e5da    1560866202.375819905    192.168.1.114    45318    192.168.1.249    17913    RTPS    INFO_TS, DATA        42
2772    0x0000a3db    1560866202.431326613    192.168.1.114    38625    239.255.0.1    17900    RTPS    INFO_TS, DATA(p)        42,42
2773    0x0000a3da    1560866202.444471544    192.168.1.114    38625    239.255.0.1    7400    RTPS    INFO_TS, DATA(p)        0,42
2774    0x0000a3d9    1560866202.455887484    192.168.1.114    38625    239.255.0.1    7400    RTPS    INFO_TS, DATA(p)        0,42
2789    0x0000a32c    1560866205.632965767    192.168.1.114    38625    239.255.0.1    7400    RTPS    INFO_TS, DATA(p)        0,42
2904    0x00009ee9    1560866232.431646675    192.168.1.114    38625    239.255.0.1    17900    RTPS    INFO_TS, DATA(p)        42,42
2905    0x00009ee8    1560866232.436874647    192.168.1.114    38625    239.255.0.1    7400    RTPS    INFO_TS, DATA(p)        0,42

This is probably difficult to analyze with just the tshark captures. I will run another experiment in which I collect all four logs (publisher DDS, publisher Wireshark, subscriber Wireshark, subscriber DDS).

Thanks,
Janis


Hi Irene,

I ran another experiment and collected all four logs.

The log archive is too large for the forum, so I'm sharing it via Dropbox: https://www.dropbox.com/s/rl3gztfm5ox4vhs/dds_test4.zip?dl=0


Looking at the DDS log of the publisher (stdout_14.txt, Line 21641):   


Calling DataWriter::write, stamp: 1561633823355920132, vehicle id: 14
NDDS_Transport_UDP_send:U00000000b606c170 sent 132 bytes to 0XF901A8C0|17913


This is the last UDP packet sent to the PC (0XF901A8C0 == 192.168.1.249).
This packet and the packets in the preceding seconds were received at the PC (see capture_PC_filter_114.log).
This does not look like data loss to me: DDS just stops sending UDP packets, even though DataWriter::write() is still being called.

Any ideas?

Thanks,
Janis


I finally found a fix, though it is not an ideal solution. I suspected the problem was related to using UDP multicast over WiFi. That would make sense: DDS uses multicast for discovery, while the other, unaffected traffic does not use multicast.

I set up the RTI Cloud Discovery Service and disabled multicast in all of our applications. This solves the data loss problem, but brings back the inconvenience of having a central server for discovery.

For those looking to do the same, here are a few code snippets I'm using:

On the server:

rticlouddiscoveryservice -transport 25598

 

Bash profile:

export NDDS_DISCOVERY_PEERS=rtps@udpv4://192.168.1.249:25598

 

Setup code, Modern C++ API:

#include <string>
#include <utility>
#include <vector>

#include <dds/dds.hpp>

// Disable the built-in UDPv4 multicast transport, so no multicast is sent.
dds::domain::qos::DomainParticipantQos domainParticipantQos;

auto &property = domainParticipantQos.policy<rti::core::policy::Property>();
property.set(std::make_pair<std::string, std::string>(
    "dds.transport.UDPv4.builtin.multicast_enabled", "0"));

// Always keep the loopback peer; append the Cloud Discovery Service peer
// (rtps@udpv4://...) if one was configured.
std::vector<std::string> initial_peer_list { "builtin.udpv4://127.0.0.1" };

const std::string initial_peer_cfg = get_dds_initial_peer();
if (!initial_peer_cfg.empty())
{
    initial_peer_list.push_back(initial_peer_cfg);
}

// Replace the initial peers and clear the multicast receive addresses,
// so the participant does not listen for multicast discovery either.
auto &discovery = domainParticipantQos.policy<rti::core::policy::Discovery>();
discovery.initial_peers(initial_peer_list);
discovery.multicast_receive_addresses(std::vector<std::string>{});

dds::domain::DomainParticipant participant(0, domainParticipantQos);

 

The function get_dds_initial_peer() returns the value of the NDDS_DISCOVERY_PEERS variable, which is passed to the program as a command-line argument.
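A hypothetical sketch of get_dds_initial_peer(); in this variant it simply reads the environment variable directly rather than taking it from the command line:

#include <cstdlib>
#include <string>

std::string get_dds_initial_peer()
{
    // Reads e.g. "rtps@udpv4://192.168.1.249:25598" set in the bash profile.
    const char *peers = std::getenv("NDDS_DISCOVERY_PEERS");
    return peers != nullptr ? std::string(peers) : std::string();
}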


Hello Janis,

I have taken over this case since Irene is currently unavailable.

After analyzing the last logs you sent, I noticed that the publisher (Pi) stops writing samples because the reader at 192.168.1.249 (PC) is unregistered due to liveliness expiry. In stdout_14.txt, line 21481:

PRESPsService_onReaderLivelinessEvent:check readers liveliness event
DISCSimpleParticipantDiscoveryPluginReaderListener_onDataAvailable:discovered unregistered participant: host=0xC0A801F9, app=0x0000371F, instance=0x00000004
DISCSimpleParticipantDiscoveryPluginReaderListener_onDataAvailable:at {5d14a41f,591DBCA5}
DISCParticipantDiscoveryPlugin_unregisterRemoteParticipantI:plugin unregistered remote participant: 0XC0A801F9,0X371F,0X4,0X1C1

This loss of liveliness is a discovery issue, and since discovery happens through multicast by default, the root cause could be the multicast configuration on your router(s). This KB article seems to describe the issue you are experiencing:

https://community.rti.com/kb/why-does-dds-communication-stop-after-some-time-while-using-switches

I suggest checking whether the offending router (router A) has IGMP snooping enabled and which IGMP version it supports. You can also check for parameters like "Query Interval" or "IGMP Snooping Group Timeout". If these are available on router A, it might be possible to keep multicast discovery.
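To confirm this from the publisher side, you could also watch the writer's publication-matched status, which should drop when the remote reader is unregistered. A sketch, reusing the hypothetical VehicleData type from earlier:

#include <iostream>
#include <dds/dds.hpp>
#include "VehicleData.hpp"  // hypothetical IDL-generated type

// Call periodically from the publishing application; current_count()
// should fall to 0 when the Admin Console reader loses liveliness.
void log_matched_readers(dds::pub::DataWriter<VehicleData> &writer)
{
    dds::core::status::PublicationMatchedStatus status =
            writer.publication_matched_status();
    std::cout << "Matched readers: " << status.current_count()
              << " (total ever matched: " << status.total_count() << ")"
              << std::endl;
}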

Mihai

 

Hi Mihai,

Thanks for the info. Router A's available settings for IGMP and multicast are very limited.
Router B is also not an option: while it does not have this problem, its performance is otherwise poor (high latency, high packet loss rate).

Does RTI have a list of WiFi routers that are known to work well with DDS discovery?

Thanks,
Janis


Hello Janis,

Unfortunately, we don't have a list of recommended WiFi routers.

Mihai