Hi,
In the past month we have been experiencing problems when trying to send data from one machine to another.
When running a simple producer on one machine (one data writer) and a simple consumer on another machine (one data reader), we see that data from the producer is not received by the consumer, though it is received by another consumer on the same machine as the producer.
We ran Monitor on both machines to see the system topology, and found that sometimes the Monitor on one machine manages to discover both the producer and the consumer, while the Monitor on the other machine manages to discover both participants but not the writer/reader on the remote machine.
This problem is not consistent; it appears on and off on all the machines in our development network.
I've recorded the rtps2 data on both machines, and it seems that the discovery messages for both the participants and the readers/writers travel from one machine to the other correctly (via multicast).
I've attached the recording files from both machines.
The relevant machines are: 190.20.11.247 (producer) and 190.20.11.233 (consumer).
Can someone understand from the recordings what is going on there?
Thanks,
Meir
Attachment | Size |
---|---|
recordings.zip | 1.28 MB |
Hello Meir,
The first thing to double-check is that you don't have any firewalls on - those could be blocking a subset of the traffic and causing incomplete discovery.
Also, from the packet traces, I notice that you have increased the transport sizes above the default of 9K. I can tell because some of the discovery packets being sent are closer to 20K. In one of the packet traces they look like they are sent correctly, but they are never seen in the other packet capture.
This could indicate that the transport settings on 190.20.11.233 are different from the transport settings on 190.20.11.247. Is it possible that the 190.20.11.233 machine still has the default sizes? This could cause the problem you are seeing.
Thank you!
Rose
Hi Rose,
There should be no firewall between the two machines.
Can you tell me which packets you saw that were sent but not recorded on the other computer?
Are you talking about the transport setting in the QoS file?
Both machines are using the same QoS file. And if this were a problem with the QoS definitions, we should still see all the packets sent from one machine in the packet capture of the other. Or do you mean that some OS settings are different between the two computers?
Thanks for the help!
Hello Meir,
Packet #789 in subscriber.pcap shows a discovery announcement of Monitor DataWriters from .247 to .233. I've found the matching packet in publisher.pcap - it is packet #1727 - but it is not reassembled correctly in that trace. (I am not sure what is causing the reassembly problem in Wireshark, and I do not know whether it is an indication of the actual problem or unrelated.)
However, even if this packet is delivered correctly, I never see a discovery announcement of your user data DataWriter - I only see the announcements of Monitor Library DataWriters.
One more question: have you seen any error messages from RTI? Are you logging RTI messages somewhere? My next theory is that the publisher application is failing to send certain packets (for an unknown reason, perhaps having to do with the size of those packets?). If there is a failure that the middleware can detect, it should log it. It's probably also a good idea to increase verbosity to warning level. Here is an example of how to do this in C++:
#include "ndds/ndds_cpp.h"
NDDSConfigLogger::get_instance()->set_verbosity(NDDS_CONFIG_LOG_VERBOSITY_WARNING);
Lastly, if there is no obvious error message or warning, can you turn off the monitoring library and take another packet capture? This might make the problem easier to find.
Thank you!
Rose
Also, to answer your other question: I am talking about the transport settings in the QoS file. If they do not match correctly, one side might send packets that are too large for the other side to receive. If they're both using the same file, this isn't the problem - but I have created several projects where I was accidentally using the wrong file somewhere, so it's always a good idea to check. :)
Thank you,
Rose
Hi,
Looking at the packet captures and Rose's observations, it seems like there could be a bug in the product. Perhaps the problem is exhibited by the interaction of a specific transport size with the discovery messages we are sending. I recall that some of the 4.4 and 4.5 versions of the software had a bug where, if the packet size was close to the maximum size configured on the transport, the math we used to compute the packet to send on the wire was a bit off, resulting in the packet never being sent. This could explain the observed behavior.
I would recommend you try reducing the maximum message size using the QoS Profile XML file, as suggested by Rose. You can find an example of how to do this here: http://community.rti.com/content/forum-topic/transport-file-size-message. I would try changing the maximum message size on the transport to a smaller value to see if this makes the problem go away. This is not a long-term solution, but it would help determine whether this is indeed the problem you are encountering.
The problem that can be seen in the packet captures you sent is a situation where the discovery DataWriter (ENTITYID_SEDP_BUILTIN_PUBLICATIONS_WRITER) is sending HEARTBEATs announcing the availability of certain sequence numbers, the DataReader sends an ACKNACK containing a negative acknowledgment requesting some of them, but the repairs never go out on the wire. The process keeps repeating, with the DataWriter sending HEARTBEATs, getting ACKNACKs, and not sending the repairs.
So not only are the initial discovery messages missing from the traces, but the repairs are as well. I suspect this happens because the packet sizes of these messages are close to the maximum transport size, triggering the bug I mentioned. But this is just a suspicion at this point.
I used this filter to isolate the interesting traffic:
ip.addr == 190.20.11.247 && (ip.addr == 190.20.11.233 || ip.addr == 239.255.0.1)
I ended up not filtering for rtps2 packets so that I could see the packet fragments. Also, somehow on my Mac, Wireshark displays some strange checksum errors that were preventing all packets from being correctly classified. Oddly, the packets only show a checksum error when they originate on the computer that is capturing them; the same packet captured on the other computer does not show a checksum error. So I suspect this is a Wireshark issue and not indicative of a real problem. In any case, I disabled IP checksum verification in Wireshark so that the packets could be reassembled.
The sequence that exhibits the problem is as follows:
Looking at publisher.pcap, packet #1623, sent by the ENTITYID_SEDP_BUILTIN_PUBLICATIONS_WRITER, contains the announcement for the topic "TESTS-TOPICS". The PID_ENDPOINT_GUID for this writer ends in 80000002. This packet has sequence number 14. It seems the previous sequence numbers correspond to the monitoring Topics. Presumably the publisher was already running when the other applications appeared, which is why the previous sequence numbers were not pushed. This is not a problem, because the reliability protocol should kick in to send the missing packets. However, it is a bit odd that message #1623 shows up without a preceding heartbeat; it could be caused by the timing of the application creation and should not be a problem.
Following that, there is a series of HEARTBEAT messages from 190.20.11.247 to 190.20.11.233 announcing sequence numbers 1-14.
These are, for example, packets publisher.pcap #1719, #1832, #2729, #2836, #2843 (count 15)... They happen at 3-second intervals, indicating the DataWriter is not getting positive acknowledgments.
In fact, the DataWriter is getting a negative acknowledgment for each one of the HEARTBEATs. These are the ACKNACK packets publisher.pcap #1724, #1838, #2735, #2842. The first one (#1724) has a readersSNState of 1/14:1111111111111, indicating a negative acknowledgment asking the writer to re-send packets 1-14.
It seems the first NACK causes some repair messages to be sent from the DataWriter. These are packets publisher.pcap #1728-1746. [Note: as I mentioned, I had to disable the validation of the IP checksum in Wireshark, as it seems to confuse Wireshark on the Mac.]
The repair packets also show up in subscriber.pcap. These are packets subscriber.pcap #770-789.
The repair (publisher.pcap #1746, which shows up as subscriber.pcap #789) only contains sequence numbers 7-14. It seems these are properly received, because the next ACKNACK (publisher.pcap #1838) has a readersSNState of 1/14:111111, indicating a negative acknowledgment asking the writer to re-send only packets 1-6.
However, the problem seen here is that the repair for sequence numbers 1-6 is never seen in either packet capture, so it is never sent on the wire.
The rest follows a similar pattern. Subsequent HEARTBEATs still announce sequence numbers 1-14; these are packets publisher.pcap #1839, #2736, #2843. The corresponding ACKNACKs (publisher.pcap #2735, #2842) keep asking for packets 1-6, but no repairs are ever sent in response to those NACKs. This keeps happening every 3 seconds.
As mentioned at the beginning, it would seem that this is some transport-size issue. Some of the 4.4 and 4.5 versions of the software had a bug where, if the packet size was close to the maximum size configured on the transport, the math that computed the actual size was wrong, resulting in the packet never being sent. That would be consistent with the behavior seen here, so perhaps this is precisely what is happening.
So my suggestion would be to try reducing the maximum message size you configured on the transport to see if this causes the problem to go away...
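In case it helps, here is a minimal sketch of a QoS Profile that lowers the maximum message size on the built-in UDPv4 transport. The library and profile names are placeholders, and 9216 (the usual UDPv4 default) is only an example value to experiment with:

```xml
<!-- Hypothetical sketch: library/profile names are placeholders,
     and 9216 (the usual UDPv4 default) is only an example value. -->
<qos_library name="MyLibrary">
  <qos_profile name="SmallTransportSize" is_default_qos="true">
    <participant_qos>
      <property>
        <value>
          <element>
            <name>dds.transport.UDPv4.builtin.parent.message_size_max</name>
            <value>9216</value>
          </element>
        </value>
      </property>
    </participant_qos>
  </qos_profile>
</qos_library>
```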
Gerardo
Hello Meir,
It does seem quite likely that you are encountering the bug I mentioned. If this is the case there is a workaround. You can edit the QoS Profiles to modify the default setting for the "dds.transport.UDPv4.builtin.protocol_overhead_max" property as shown in the snippet below:
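A minimal sketch of such a profile follows (the library and profile names are placeholders; the property name and the value of 64 are the ones discussed here):

```xml
<!-- Hypothetical sketch: library/profile names are placeholders.
     The property tells the core to assume 64 bytes of UDP/IP overhead. -->
<qos_library name="MyLibrary">
  <qos_profile name="ProtocolOverheadWorkaround" is_default_qos="true">
    <participant_qos>
      <property>
        <value>
          <element>
            <name>dds.transport.UDPv4.builtin.protocol_overhead_max</name>
            <value>64</value>
          </element>
        </value>
      </property>
    </participant_qos>
  </qos_profile>
</qos_library>
```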
With this setting, the RTI DDS core computes the maximum number of bytes it can place into a single UDP datagram assuming that the UDP/IP headers will require more bytes than they do (64 versus the 20 it actually needs). Since this is more than is really required, it will never attempt to put as many bytes as before into a single UDP datagram, and thus avoids the problem that was occurring when the message size was close to the limit.
Gerardo