Hi, all,
I'm trying to understand why we are seeing on_sample_lost (reason=LOST_BY_WRITER) errors in our domain.
My understanding is that this error occurs when the receiver detects a skip in the sequence number associated with a data writer.
I would like to examine this problem using Wireshark. Could somebody tell me exactly which rtps.* field contains this incrementing sequence number? Ideally, I would love to be able to see these fields in a packet dump:
- Writer identity - some way to tie the sender of the packet to a specific writer in the domain participant.
- Receiver identity - again, tie the reader to a specific data reader in a participant.
- Sequence #
I can't figure out how to do that. There are SO MANY RTPS fields that I can't sift through all of them.
Any ideas?
Thanks!
Hi Jason,
How samples reported as LOST_BY_WRITER show up in Wireshark depends on the scenario: RELIABLE or BEST_EFFORT. I am attaching 2 Wireshark captures that will help me explain the information in both scenarios. To see the packets in Wireshark the same way I show them in the image below, I recommend that you download the latest version of Wireshark, follow this link to apply the colors to the RTPS packets, and enable the Topic Information feature. For this, you will need to:
BEST_EFFORT scenario
For this scenario, please take a look at best_effort.pcapng attached. This is what you should see in Wireshark:
In blue, you will see packets for discovery. In red, you will see user data packets. These are the kinds of packets in this image:
Let's take a closer look at packet #4, a sample packet:
An entity in DDS is identified by a GUID (Globally Unique Identifier). This GUID is a 16-byte identifier which is made up of:
One way to see the GUID in your application is to use the snippet below, which uses the Traditional C++ API:
In this BEST_EFFORT scenario, the capture shows that some samples are missing, based on their writerSeqNumber: 8-12. This capture was taken on the DataReader side. LOST_BY_WRITER will be triggered because those samples are missing.
As you can see in any of the samples, you can identify the DataWriter that sent them based on the guidPrefix + (submessageId: DATA).writerEntityId.
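To make the guidPrefix + entityId split concrete, here is a minimal Python sketch. It assumes the standard RTPS layout, where the first 12 bytes of the GUID are the guidPrefix and the last 4 bytes are the entityId. The function names and the example GUID bytes are hypothetical and are not taken from the capture.

```python
def split_guid(guid: bytes):
    """Split a 16-byte RTPS GUID into (guid_prefix, entity_id).

    Assumes the standard layout: 12-byte guidPrefix, then 4-byte entityId.
    """
    if len(guid) != 16:
        raise ValueError(f"expected 16 bytes, got {len(guid)}")
    return guid[:12], guid[12:]


def format_guid(guid: bytes) -> str:
    """Render a GUID as '<prefix hex>:<entityId hex>' for easy comparison
    with the guidPrefix and writerEntityId values shown by Wireshark."""
    prefix, entity_id = split_guid(guid)
    return f"{prefix.hex()}:{entity_id.hex()}"


# Hypothetical GUID, for illustration only.
example_guid = bytes.fromhex("0101aabbccddeeff00000001" + "80000002")
print(format_guid(example_guid))
```

Printing the GUID of each DataWriter in this form should make it easier to match an application entity against the prefix and entity ID fields in a dissected packet.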
The DataWriter sends a sample for every DomainParticipant that has a DataReader subscribing to the topic. That is why readerEntityId is UNKNOWN. In order to get the DataReader's GUID, you would need to identify the DATA(r) packet. For this, you can filter by the topic you are interested in and by DATA(r) packets. For instance, for topic SensorData, you can apply the following filter:
(rtps.param.topicName == "SensorData") && (rtps.sm.wrEntityId == 0x000004c2)
With this filter, you will only see the DATA(r) packet. The complete GUID of the DataReader will be found in (submessageId: DATA).serializedData.serializedData.PID_ENDPOINT_GUID.(Endpoint GUID).
Let's now extract the information you need from the capture:
RELIABLE scenario
In the works.
Dear Fran,
Thank you so much! That is an overwhelmingly-helpful response! I'm going to immediately add the entity guid dump code so I can see what's going on. And the "gap" packet information is completely new to me. Ooh, I can't wait to try this out!
SPECTACULAR! I'll let you know what hilarity ensues!
---Jason
Hi Jason,
My answer from yesterday was not correct. I have updated my first post so that you can check it, instead. It only has the BEST_EFFORT scenario part. The RELIABLE scenario is trickier and I am currently working on elaborating it. I will let you know as soon as it is ready.
Best,
Fran
Hi, Fran,
Thank you for your diligence and the update. Are you angling for a job at RTI? :) You certainly seem to know your stuff!
Currently we only ever see LOST_BY_WRITER on_sample_lost on BEST_EFFORT readers. Please, if the RELIABLE scenario is terribly tricky, don't kill yourself. Maybe you should write up one of those community explainer pages on this? It would be easier for future users to find.
Thanks again for kindly donating your time to help me!
---Jason
Hi, again, Fran,
One of the issues we have is that we never capture the discovery phase in Wireshark. Our system is fairly static, and the "action" in the system usually happens many minutes after discovery occurs. So, I don't have access to the really fancy Wireshark dissections that you showed me. I REALLY wish there were a way I could manually trigger discovery at the beginning of a capture and then observe the sample lost behavior after that. Then I could take advantage of the full power of Wireshark.
Is this even possible? Would bringing up the RTIAdminConsole trigger discovery so that I could capture all that traffic? Ooh, that would be nice.
---Jason
Hi Jason,
I am glad you find the explanation of the BEST_EFFORT scenario useful. About triggering the discovery, RTI Admin Console would definitely do the trick. When you open Admin Console and it finds the DomainParticipants, the DATA(w) and DATA(r) packets will be sent to Admin Console, giving Wireshark the chance to have the Topic Information feature available.
About angling for a job at RTI, I feel like I need to reveal my secret. I work in the Professional Services Group at RTI, so that is why it seems I know my stuff xD
Once my answer is complete, I will do as you suggested and post a Community article. Thanks for the feedback!
--Fran
Hi, Fran!
Well, armed with your excellent information I've made some captures, done some analysis, and come back... confused!
I performed these steps:
The capture was about 2.5 minutes long. I then analyzed the capture as well as the netstat results. Here's what I came up with:
Reported loss from my little app
Our debug prints like so:
[CONSOLE] [RTI] on_sample_lost: Asr::Msg::Common::ProtobufWithEnum, count = 1, reason = 1
occurred 118 times. The first count was 1 - the last count was 2329.
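One way to read these numbers, assuming the printed count is a cumulative total (the thread does not confirm which status field the print uses): the number of callbacks and the number of lost samples are different quantities, and the loss per callback is the difference between consecutive counts. The Python sketch below shows that arithmetic on hypothetical values, not on the real log.

```python
def per_callback_losses(cumulative_counts):
    """Turn a series of cumulative loss counts (one per callback) into the
    number of samples newly reported lost at each callback."""
    deltas = []
    previous = 0
    for total in cumulative_counts:
        deltas.append(total - previous)
        previous = total
    return deltas


# Hypothetical counts from four callbacks; not taken from the real log.
counts = [1, 4, 9, 10]
losses = per_callback_losses(counts)
print(f"{len(counts)} callbacks, {sum(losses)} samples lost, per callback: {losses}")
```

Under that assumption, 118 callbacks ending at a count of 2329 would mean roughly 20 samples reported lost per callback on average.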
Wireshark Analysis
There were 1,377,659 RTPS packets that were not meta-traffic (discovery, ACKNACK, etc.). There were ~3000 discovery packets.
There were no sequence number gaps that I could detect in the data.
I organized the data by this key: (srcGuidPrefix,writerEntityId,dstGuidPrefix). I used rtps.sm.seqNumber for the sequence number.
Using this key, I found that a single sender/object id sent over 83% of the samples via multicast - this is what we expect.
But, to repeat, there were no sequence number gaps in the Wireshark capture.
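For anyone wanting to repeat this kind of check, here is a minimal Python sketch of per-writer gap detection using the same key, (srcGuidPrefix, writerEntityId, dstGuidPrefix). It assumes the capture has already been exported to rows of (source prefix, writer entity ID, destination prefix, sequence number). The export step, the function name, and the sample values are all hypothetical.

```python
from collections import defaultdict


def find_gaps(rows):
    """Group sequence numbers by (src_prefix, writer_id, dst_prefix) and
    return, per key, the list of (first_missing, last_missing) ranges."""
    seen = defaultdict(set)
    for src_prefix, writer_id, dst_prefix, seq in rows:
        seen[(src_prefix, writer_id, dst_prefix)].add(seq)

    gaps = {}
    for key, seqs in seen.items():
        ordered = sorted(seqs)  # sorting tolerates out-of-order arrival
        missing = [(a + 1, b - 1)
                   for a, b in zip(ordered, ordered[1:])
                   if b - a > 1]
        if missing:
            gaps[key] = missing
    return gaps


# Hypothetical rows: one writer skips 8-12, the other has no gaps.
rows = [("aa", "0x80000002", "bb", n) for n in list(range(1, 8)) + [13, 14]]
rows += [("cc", "0x80000102", "bb", n) for n in range(1, 6)]
print(find_gaps(rows))
```

Note that this only finds holes between sequence numbers that were actually captured. It cannot see samples missing before the first or after the last captured number, and a clean result only shows that every sequence number was present at the capture point.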
Netstat analysis
The netstat results didn't yield as much useful information as I'd hoped, but here's what I concluded:
Summary
I'm afraid I still don't understand what's happening. I'm not seeing any loss on the wire, but our debug prints are reporting sample loss.
I'm also confused as to how to correlate the lost sample count reported in our debug print with the sequence numbers that I see in Wireshark. For reference, here's the code we use to report samples lost in our base class listener:
Is there anything that jumps out at you, Francisco? I have some more testing to do, but I'm not sure if my efforts are going to yield anything.
Thanks!
---Jason
Hi Jason,
Sorry for taking so long to answer. Could you provide me with the Wireshark capture you took and a description of the different IP addresses involved? That is, a diagram showing which applications (and DDS entities) are on each host, along with the IP address. That way, I can check whether anything stands out.
--Fran