on_sample_lost LOST_BY_WRITER debugging: Wireshark


Hi, all,

I'm trying to understand why we are seeing on_sample_lost (reason=LOST_BY_WRITER) errors in our domain.

My understanding is that this error occurs when the receiver detects a skip in the sequence number associated with a data writer.

I would like to examine this problem using Wireshark. Could somebody tell me exactly which rtps.* field contains this incrementing sequence number? Ideally, I would love to be able to see these fields in a packet dump:

  1. Writer identity - some way to tie the sender of the packet to a specific writer in the domain participant.
  2. Receiver identity - again, tie the reader to a specific data reader in a participant.
  3. Sequence #

I can't figure out how to do that. There are SO MANY RTPS fields that I can't sift through all of them.

Any ideas?

Thanks!

Francisco Porcel

Hi Jason,

Whether samples reported as LOST_BY_WRITER can be observed in Wireshark depends on the scenario: RELIABLE or BEST_EFFORT. I am attaching 2 Wireshark captures that will help me explain both scenarios. To see the packets in Wireshark the same way I show them in the images below, I recommend that you download the latest version of Wireshark, follow this link to apply the colors to the RTPS packets, and enable the Topic Information feature. For this, you will need to:

  • Right-click on any of the RTPS packets.
  • Click on Protocol Preferences.
  • Click on Enable Topic Information feature.
  • After this, you should see the DATA -> SensorData topic name in red packets.

BEST_EFFORT scenario 

For this scenario, please take a look at best_effort.pcapng attached. This is what you should see in Wireshark:

In blue, you will see packets for discovery. In red, you will see user data packets. These are the kinds of packets in this image:

  • Packet 1: DATA(w) packet. This is an endpoint discovery packet. With this, the DataWriter lets the DataReader know about the information that is needed for matching.
  • Packet 2: DATA(r) packet. This is also an endpoint discovery packet. With this, the DataReader lets the DataWriter know about the information that is needed for matching.
  • Packet 3: GAP packet. With this, the DataWriter informs the DataReader that it has sent some samples before they matched. This will not trigger LOST_BY_WRITER.
  • Packets 4-14: user data packets (samples). These are samples 2-16 for the topic SensorData.

Let's take a closer look at packet #4, a sample packet:

An entity in DDS is identified by a GUID (Globally Unique Identifier). This GUID is a 16-byte identifier made up of:

  • guidPrefix: 12 bytes. Shared by all the entities in the same DomainParticipant.
  • rtps_object_id: 4 bytes. Incremental counter of the number of DataWriters / DataReaders in the DomainParticipant.

One way to see the GUID in your application would be following the snippet you see below. For this, I used the Traditional C++ API:

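// Print the DataWriter's GUID as four 4-byte groups separated by "::".
// The first three groups are the guidPrefix; the last one is the object ID.
DDS_InstanceHandle_t writer_instance_handle = writer->get_instance_handle();
for (unsigned int i = 0; i < writer_instance_handle.keyHash.length; i++) {
    printf("%02X", writer_instance_handle.keyHash.value[i]);
    if (i == 3 || i == 7 || i == 11) {
        printf("::");
    }
}
printf("\n");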
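
For readers without the Connext headers at hand, the grouping that this loop prints can be reproduced on a plain 16-byte array. This is only a sketch: Guid, format_guid, and entity_id are hypothetical names introduced here, not part of any RTI API, and it assumes the 16 bytes are laid out as a 12-byte guidPrefix followed by the 4-byte object ID.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// A 16-byte GUID: assumed to be a 12-byte guidPrefix plus a 4-byte object ID.
using Guid = std::array<std::uint8_t, 16>;

// Hypothetical helper: formats the GUID as four hex groups separated by "::",
// the same grouping the loop above prints.
std::string format_guid(const Guid& guid) {
    std::string out;
    char buf[3];
    for (std::size_t i = 0; i < guid.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02X", guid[i]);
        out += buf;
        if (i == 3 || i == 7 || i == 11) {
            out += "::";
        }
    }
    return out;
}

// Hypothetical helper: returns the last 4 bytes as one big-endian integer,
// which is how an entity ID such as 0x000004c2 is written in a display filter.
std::uint32_t entity_id(const Guid& guid) {
    return (std::uint32_t{guid[12]} << 24) | (std::uint32_t{guid[13]} << 16) |
           (std::uint32_t{guid[14]} << 8) | std::uint32_t{guid[15]};
}
```

For a GUID whose last four bytes are 80 00 00 02, entity_id returns 0x80000002, which is the value to compare against the writerEntityId shown in a DATA packet.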

In this BEST_EFFORT scenario, the capture shows that some samples are missing, based on their writerSeqNumber: 8-12. This capture was taken on the DataReader side. LOST_BY_WRITER will be triggered because those samples are missing.

As you can see in any of the samples, you can identify the DataWriter that sent them based on the guidPrefix + (submessageId: DATA).writerEntityId.

The DataWriter sends a sample for every DomainParticipant that has a DataReader subscribing to the topic, not for a specific DataReader. That is why readerEntityId is UNKNOWN. To get the DataReader's GUID, you need to find its DATA(r) packet by filtering on the topic you are interested in and on DATA(r) packets. For instance, for the topic SensorData, you can apply the following filter:

(rtps.param.topicName == "SensorData") && (rtps.sm.wrEntityId == 0x000004c2)

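With this filter, you will only see the DATA(r) packet, because 0x000004c2 is the entity ID of the built-in DataWriter that announces DataReaders during endpoint discovery. The complete GUID of the DataReader will be found in (submessageId: DATA).serializedData.serializedData.PID_ENDPOINT_GUID.(Endpoint GUID).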

Let's extract now the information you need from the capture:

  1. Writer identity. It is the combination of the first guidPrefix + (submessageId: DATA).writerEntityId.
  2. Reader identity. You will need to get it from the DATA(r) packet.
  3. Sequence number. It is the writerSeqNumber of each DATA packet. It increases monotonically for a given DataWriter, so missing samples show up as jumps in this number.
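
Once the writer identity and sequence number of each DATA packet have been exported from the capture, the check for missing samples can be automated. The following is a minimal sketch, not an RTI or Wireshark tool: Sample, Gap, and find_gaps are hypothetical names, the writer GUID is assumed to be available as text, and the packets are assumed to be listed in capture order. It does not account for GAP packets, so the first sample seen from each writer simply sets the starting point.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One observed DATA packet. Field names are illustrative, not Wireshark's.
struct Sample {
    std::string writer_guid;   // guidPrefix + writerEntityId as text
    std::uint64_t seq_number;  // writerSeqNumber
};

// A closed range [first, last] of sequence numbers that never appeared.
struct Gap {
    std::string writer_guid;
    std::uint64_t first;
    std::uint64_t last;
};

// Hypothetical helper: walks the samples in order and reports, per writer,
// every jump of more than 1 in the sequence number. A sample that arrives
// late (out of order) is still reported as missing.
std::vector<Gap> find_gaps(const std::vector<Sample>& samples) {
    std::map<std::string, std::uint64_t> last_seen;
    std::vector<Gap> gaps;
    for (const Sample& s : samples) {
        auto it = last_seen.find(s.writer_guid);
        if (it == last_seen.end()) {
            // First sample from this writer: nothing to compare against yet.
            last_seen.emplace(s.writer_guid, s.seq_number);
            continue;
        }
        if (s.seq_number > it->second + 1) {
            gaps.push_back({s.writer_guid, it->second + 1, s.seq_number - 1});
        }
        if (s.seq_number > it->second) {
            it->second = s.seq_number;
        }
    }
    return gaps;
}
```

Fed with samples 2-7 and 13-16 from one writer, as in the capture described here, it reports a single gap covering 8-12.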

 

RELIABLE scenario

In the works.

 


Dear Fran,

Thank you so much! That is an overwhelmingly-helpful response! I'm going to immediately add the entity guid dump code so I can see what's going on. And the "gap" packet information is completely new to me. Ooh, I can't wait to try this out!

SPECTACULAR! I'll let you know what hilarity ensues!

---Jason

Francisco Porcel

Hi Jason,

My answer from yesterday was not correct. I have updated my first post so that you can check it, instead. It only has the BEST_EFFORT scenario part. The RELIABLE scenario is trickier and I am currently working on elaborating it. I will let you know as soon as it is ready.

Best,

Fran


Hi, Fran,

Thank you for your diligence and the update. Are you angling for a job at RTI? :)  You certainly seem to know your stuff!

Currently we only ever see LOST_BY_WRITER on_sample_lost on BEST_EFFORT readers. Please, if the RELIABLE scenario is terribly tricky, don't kill yourself. Maybe you should write up one of those community explainer pages on this? It would be easier for future users to find.

Thanks again for kindly donating your time to help me!

---Jason


Hi, again, Fran,

One of the issues we have is that we never capture the discovery phase in Wireshark. Our system is fairly static, and the "action" in the system usually happens many minutes after discovery occurs. So, I don't have access to the really fancy Wireshark dissections that you showed me. I REALLY wish there were a way I could manually trigger discovery at the beginning of a capture and then observe the sample lost behavior after that. Then I could take advantage of the full power of Wireshark.

Is this even possible? Would bringing up RTI Admin Console trigger discovery so that I could capture all that traffic? Ooh, that would be nice.

---Jason

Francisco Porcel

Hi Jason,

I am glad you find the explanation of the BEST_EFFORT scenario useful. About triggering the discovery, RTI Admin Console would definitely do the trick. When you open Admin Console and it finds the DomainParticipants, the DATA(w) and DATA(r) packets will be sent to Admin Console, giving Wireshark the chance to have the Topic Information feature available.

About angling for a job at RTI, I feel like I need to reveal my secret. I work in the Professional Services Group at RTI, so that is why it seems I know my stuff xD

Once my answer is complete, I will do as you suggested and post a Community article. Thanks for the feedback!

--Fran


Hi, Fran!

Well, armed with your excellent information I've made some captures, done some analysis, and come back... confused!

I performed these steps:

  1. Started a Wireshark capture
  2. Started RTI Admin Console
  3. Waited for all topics to be enumerated
  4. Closed Admin Console
  5. Ran my little app that just subscribes to a high-rate multicast message and does nothing in the callback
  6. Ran netstat -su on my laptop a number of times
  7. ssh'ed to our QNX box
  8. Ran netstat -s a number of times there
  9. Quit my little app
  10. Stopped the Wireshark capture

The capture was about 2.5mins. I then analyzed the capture as well as the netstat results. Here's what I came up with:

Reported loss from my little app

Our debug prints like so:

[CONSOLE] [RTI] on_sample_lost: Asr::Msg::Common::ProtobufWithEnum, count = 1, reason = 1

occurred 118 times. The first count was 1 - the last count was 2329.

Wireshark Analysis

There were 1,377,659 RTPS packets that were not meta-traffic (discovery, ACKNACK, etc.), plus ~3000 discovery packets.

There were no sequence number gaps that I could detect in the data.

I organized the data by this key: (srcGuidPrefix,writerEntityId,dstGuidPrefix). I used rtps.sm.seqNumber for the sequence number.

Using this key, I found that a single sender/object id sent over 83% of the samples via multicast - this is what we expect.

But, to repeat, there were no sequence number gaps in the Wireshark capture.

Netstat analysis

The netstat results didn't yield as much useful information as I'd hoped, but here's what I concluded:

  • Linux: Packet receive errors only changed once and increased by about 400 (from 261936 to 262304, a difference of 368)
  • Linux: Receive buffer errors tracked "packet receive errors" exactly
  • QNX: No errors during test

Summary

I'm afraid I still don't understand what's happening. I'm not seeing any loss on the wire, but our debug prints are reporting sample loss.

I'm also confused about how to correlate the lost sample count reported in our debug print with the sequence numbers that I see in Wireshark. For reference, here's the code we use to report samples lost in our base class listener:

    virtual void on_sample_lost(DDSDataReader* reader,
        const DDS_SampleLostStatus& status)
    {
        if (!reader) {
            ASR_LOG_ERROR("[RTI] on_sample_lost must have a reader");
            return;
        }

        DDSTopicDescription* topicDesc = reader->get_topicdescription();
        DDS_SampleLostStatusKind reason = status.last_reason;
        ASR_LOG_WARNING("[RTI] on_sample_lost: %s, count = %d, reason = %d",
            topicDesc->get_name(), status.total_count, reason);

    }
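
A note on reading the count in this log line: total_count in DDS_SampleLostStatus is a cumulative total, so the number printed by each callback is a running sum rather than the loss reported by that callback alone. The status also has a total_count_change field holding the increment since the listener was last called. The sketch below, using the hypothetical helper lost_per_callback and made-up counts, shows how per-callback losses could be recovered from logged cumulative values, assuming the count starts at zero.

```cpp
#include <vector>

// Hypothetical helper: given the cumulative total_count values logged by
// successive on_sample_lost callbacks, returns how many samples each
// callback newly reported. The first callback is measured from zero.
std::vector<int> lost_per_callback(const std::vector<int>& total_counts) {
    std::vector<int> deltas;
    int previous = 0;
    for (int total : total_counts) {
        deltas.push_back(total - previous);
        previous = total;
    }
    return deltas;
}
```

With 118 callbacks ending at a total of 2329, each callback covered about 20 lost samples on average.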

Is there anything that jumps out at you, Francisco? I have some more testing to do, but I'm not sure if my efforts are going to yield anything.

Thanks!

---Jason

Francisco Porcel

Hi Jason,

Sorry for taking so long to answer. Could you provide me with the Wireshark capture you took and a description of the different IP addresses involved? That is, a diagram showing which applications (and DDS entities) are on each host, along with each host's IP address. That way, I can check whether anything stands out.

--Fran