I am having problems with data loss. Samples are given to DataWriter::write(), but they are not received anywhere (multiple readers/listeners in the network). Also, there is no error message. I would give a minimal reproducible example, but the problem is intermittent and difficult to reproduce. The system works fine for maybe 5 minutes. Then a single participant stops sending data for a few seconds to a minute. Then it "recovers" and everything is back to normal.
I have checked some obvious potential causes. Maybe the participant is not actually sending? Or the network is having problems and droping packets? To eliminate that, every time DataWriter::write() is called, the same message is also sent as a raw UDP packet. Most of the time both the DDS and UDP messages are received. When the DDS samples are missing, the UDP messages are still received. So the network seems to be fine.
Overview of the system:
* One desktop PC, Ubuntu 18.04, rti_connext_dds-5.3.1, using Modern C++ API, connected via Ethernet
* Several (1-4) Raspberry Pi Zero W, Raspbian GNU/Linux 9 (stretch), cross-compiled raspbian-toolchain-gcc-4.7.2-linux64, rti_connext_dds-5.3.1, libraries armv6vfphLinux3.xgcc4.7.2, using Modern C++ API, connected via WiFi
The reliability setting seems to make no difference, messages from "reliable" and "best effort" topics are both lost.
The problem seems to be more frequent with more participants (Raspberries) in the network.
Any tips on how to check what the DataWriter is doing (or not doing)? Thanks!
EDIT, with some additional info:
When looking at the Admin Console Tool, under "Physical View > Process : (ID) > DDS Entities", when the problem occurs ALL of the publishers, subscribers, data writers, data readers disappear. Only the participant remains.
Restarting the problematic process / program fixes the problem.
Hi Janis,
Are you doing a lot of processing on the subscribers? Are you using callbacks to process the data (using listeners in the DataReaders) or waitset? if you are doing a lot of processing and the subscribers cannot catch up with the publisher, the samples may be lost depending on the DDS configuration. A minimal reproducer may help.
Also, it will be helpful if you increase the DDS log level, at least to WARNING to see what is going on. Here you have how to do it: https://community.rti.com/howto/useful-tools-debug-dds-issues . I recommend using Logparser to improve the readability of the log (you have the required information in the link)
you can also implement the on_sample_lost method in the DataReader listeners so you can check that the sample is really lost and the reason behind that. In this link, you can find how you to use listeners in your DataReaders using C++11: https://community.rti.com/static/documentation/connext-dds/6.0.0/doc/api/connext_dds/api_cpp2/group__DDSReaderExampleModule.html
I recommend that you perform all these tests using reliable reliability instead of best effort.
I hope this helps,
Irene
New Setup
As previously noted, the problem seems to be more frequent with more participants (Raspberries) in the network. Since we want to eventually operate the system with 20 Raspberries, I decided to increase the number of Raspberries. In the tests I used 12 to 17 Raspberries, the exact number has no noteable effect. This made reproducing the problem much more reliable.
I was able to simplify the setup and still produce the problem. Now there is only one program running. And it only does one thing: Send a DDS sample at a regular interval (10 Hz). Each Raspberry runs one instance of this program. None of our other RTI DDS programs are running. The problem is diagnosed using the RTI Admin Console, which provides the only subscriber / reader in the network.
As for your question about subscribers. We use a mix of AsyncWaitSet and polling on DataReader::take(), no listeners. However, since the simplified setup uses no readers / subscribers this is probably irrelevant.
The logging is set to STATUS_ALL. I will send you the simplified test code and logs via email.
Test and Observations
The test program is started on the Raspberries automatically. The Admin Console is started manually. The setup phase takes around 30 seconds. Everything works as expected for the first few minutes. All Raspberries show frequent new data in the Admin Console.
The problem occurs consistently after 4 to 5 minutes. For 2 to 4 Raspberries, data continues to be received normally in the Admin Console. Which and how many Raspberries continue to operate normally appears to be random. For the other Raspberries (7 to 15) there is no new data received in the Admin Console.
I'm not sure how to interpret the log, but I noted one thing. In the first phase "NDDS_Transport_UDP_send" appears frequently. After the problems starts it appears rarely, and is much rarer than my own log entry "Calling DataWriter::write".
Second Test
To get a better sense of the timeline of the problem, I repeated the test, but running the RTI Recording Console instead of the Admin Console. A Matlab script processes the recording and shows the number of samples per 1 second interval, per Raspberry. See the attached picture.
Hi Janis,
If you are using Admin Console to read the samples, it will be helpful if you send me the Admin Console logs when they are failing to receive the samples.
To do so, you should configure Admin Console to use reliable reliability:
And update the log configuration:
So the log messages will appear in the Console Log tab (you can export it to a file).
We need to see what is going on also in the subscribers (the ones generated by Admin Console instances, in this case).
It will be awesome if you also send me a new log capture from the publisher side, due to we are changing the subscriber reliability to reliable, I expect to have a more useful log.
Thanks,
Irene
Hi Irene,
I changed the two settings, subscriber to reliable and console logging level to trace. The logging preferences window looks a little different to me (see screenshot).
Both new logs are attached.
Thanks,
Janis
I went in a new direction with testing, by swapping out hardware components, and seeing what has an effect on this fault.
Looking at these results, router A's WiFi seems to be the common factor for the fault.
But the confusing part is, that it only seems to affect DDS connections. Other traffic (ssh, ping, http) works fine at all times.
Hi Janis,
I've confirmed looking at the logs that, as we suspected, samples are being lost in the subscriber side. It could happen for a lot of reasons, usually related to network issues. It is also interesting that this is happening with a specific router and not with the other you are using.
At this point, two Wireshark captures, one in the subscriber side (when the samples are being lost) and another in the publisher side can be really helpful to see what is going on in the wire. Could you attach these Wireshark captures so I can review it?
Thanks,
Irene
Hi Irene,
I recorded with tshark instead of wireshark, since it can be used on the command line, and the Raspberry has no monitor.
Recording on Raspberry (only one of them, 192.168.1.114):
tshark -i wlan0 -w capture_raspberry.pcap
Recording on PC:
tshark -i enp0s31f6 -w capture_PC.pcap
Post processing:
tshark -r capture_raspberry.pcap -T fields -e frame.number -e ip.checksum -e frame.time_epoch -e ip.src -e udp.srcport -e ip.dst -e udp.dstport -e _ws.col.Protocol -e _ws.col.Info -e icmp.type -e rtps.domain_id >capture_raspberry_complete.log
tshark -r capture_PC.pcap -T fields -e frame.number -e ip.checksum -e frame.time_epoch -e ip.src -e udp.srcport -e ip.dst -e udp.dstport -e _ws.col.Protocol -e _ws.col.Info -e icmp.type -e rtps.domain_id >capture_PC_complete.log
grep RTPS capture_PC_complete.log >capture_PC_RTPS.log
grep RTPS capture_raspberry_complete.log >capture_raspberry_RTPS.log
grep 192.168.1.114 capture_PC_RTPS.log >capture_PC_filter_114.log
The results are attached.
Thanks,
Janis
Hi Janis,
Are these captures performed in the scenario where data loss occurs?
I do not see any data lost in them.
For some reason, the router A is causing problems, so these captures will help to understand what is happening.
Thanks!
Hi Irene,
I remember that data was lost in the Admin Console. But looking at the logs in detail, I agree, there seems to be no data loss on the IP level.
Looking at capture_raspberry_RTPS.log, it seems like the publisher just stopped sending after epoch 1560866202.3758.
2769 0x0000e5e3 1560866202.173551967 192.168.1.114 45318 192.168.1.249 17913 RTPS INFO_TS, DATA 42
2770 0x0000e5db 1560866202.274683436 192.168.1.114 45318 192.168.1.249 17913 RTPS INFO_TS, DATA 42
2771 0x0000e5da 1560866202.375819905 192.168.1.114 45318 192.168.1.249 17913 RTPS INFO_TS, DATA 42
2772 0x0000a3db 1560866202.431326613 192.168.1.114 38625 239.255.0.1 17900 RTPS INFO_TS, DATA(p) 42,42
2773 0x0000a3da 1560866202.444471544 192.168.1.114 38625 239.255.0.1 7400 RTPS INFO_TS, DATA(p) 0,42
2774 0x0000a3d9 1560866202.455887484 192.168.1.114 38625 239.255.0.1 7400 RTPS INFO_TS, DATA(p) 0,42
2789 0x0000a32c 1560866205.632965767 192.168.1.114 38625 239.255.0.1 7400 RTPS INFO_TS, DATA(p) 0,42
2904 0x00009ee9 1560866232.431646675 192.168.1.114 38625 239.255.0.1 17900 RTPS INFO_TS, DATA(p) 42,42
2905 0x00009ee8 1560866232.436874647 192.168.1.114 38625 239.255.0.1 7400 RTPS INFO_TS, DATA(p) 0,42
This is probably difficult to analyze with just the tshark capture. I will run another experiment where I collect all four logs (publisher DDS, publisher wireshark, subscriber wireshark, subscriber DDS).
Thanks,
Janis
Hi Irene,
I ran another experiment and collected all four logs.
The log archive is too large for the forum, so I'm sharing it via Dropbox: https://www.dropbox.com/s/rl3gztfm5ox4vhs/dds_test4.zip?dl=0
Looking at the DDS log of the publisher (stdout_14.txt, Line 21641):
Calling DataWriter::write, stamp: 1561633823355920132, vehicle id: 14
NDDS_Transport_UDP_send:U00000000b606c170 sent 132 bytes to 0XF901A8C0|17913
This is the last UDP packet sent to the PC (0XF901A8C0 == 192.168.1.249)
This packet and the packets in the preceding seconds were received at the PC (see capture_PC_filter_114.log).
This does not look like data loss to me. DDS just stops sending UDP packets, even though DataWriter::write() is called.
Any ideas?
Thanks,
Janis
I finally found a fix, but it is not an ideal solution. I suspected the problem to be related to using UDP multicast via WiFi. This would make sense: DDS uses multicast for discovery, but the other unaffected traffic does not.
I set up the RTI Cloud Discovery Service and disabled multicast in all of our applications. This solves the data loss problems, but brings back the inconvenience of having a central server for discovery.
For those looking to do the same, here are a few code snippets I'm using:
On the server:
rticlouddiscoveryservice -transport 25598
Bash profile:
export NDDS_DISCOVERY_PEERS=rtps@udpv4://192.168.1.249:25598
Setup code, modern C++ API
dds::domain::qos::DomainParticipantQos domainParticipantQos;
auto &property = domainParticipantQos.policy<rti::core::policy::Property>();
property.set(std::make_pair<std::string, std::string>
("dds.transport.UDPv4.builtin.multicast_enabled", "0"));
std::vector<std::string> initial_peer_list { "builtin.udpv4://127.0.0.1" };
const std::string initial_peer_cfg = get_dds_initial_peer();
if(!initial_peer_cfg.empty())
{
initial_peer_list.push_back(initial_peer_cfg);
}
auto &discovery = domainParticipantQos.policy<rti::core::policy::Discovery>();
discovery.initial_peers(initial_peer_list);
discovery.multicast_receive_addresses(std::vector<std::string>{});
dds::domain::DomainParticipant participant(0, domainParticipantQos);
The function get_dds_initial_peer() returns the NDDS_DISCOVERY_PEERS variable via a command line argument.
Hello Janis,
I have taken over this case since Irene is currently unavailable.
After analyzing the last logs you sent, I noticed the publisher (PI) stops writing samples because the reader at 192.168.1.249 (PC) is unregistered due to liveliness expiry. In stdout_14.txt, line 21481:
This loss of liveliness is a discovery issue and since discovery happens through multicast by default, the root could be multicast configuration on your router(s). This KB article seems to describe the issue you are experiencing:
https://community.rti.com/kb/why-does-dds-communication-stop-after-some-time-while-using-switches
I suggest checking if the offending router (router A) has IGMP snooping enabled and what IGMP version it supports. You can also check the availability of parameters like “Query Interval” or “IGMP snooping Group Timeout”. If these are available on Router A, it might be possible to keep multicast discovery.
Mihai
Hi Mihai,
thanks for the info. Router A's available settings for ICMP and multicast are very limited.
Router B is also not an option. While it does not have this problem, its performance is poor otherwise (high latency, high packet loss rate).
Does RTI have a list of WiFi routers that are known to work well with DDS discovery?
Thanks,
Janis
Hello Janis,
Unfortunately, we don't have a list of recommended WiFi routers.
Mihai