Timeout on wait_for_acknowledgements


Hi,

I encountered some strange behavior; at least in theory it seemed strange to me.

I have a topic configured with 'ASYNCHRONOUS_PUBLISH_MODE_QOS' and 'StrictReliable.LargeData'. Due to an application requirement I have to wait for an acknowledgment after every write, so effectively for every sample I write I call "wait_for_acknowledgments" with a timeout of 2 seconds. I am running a pub-sub case with multiple passes. A few times (at least twice) I have seen the wait call return with TIMEOUT even though the subscriber received the sample; the logs confirm that.
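
For reference, the per-write pattern is roughly the following. This is a minimal sketch assuming a type MyData generated by rtiddsgen; the names are illustrative, not my exact code:

    #include "ndds/ndds_c.h"
    #include "MyDataSupport.h"  /* rtiddsgen-generated type support */

    /* Write one sample, then block until it is acknowledged or 2 s elapse. */
    DDS_ReturnCode_t write_and_wait(MyDataDataWriter *writer,
                                    const MyData *sample)
    {
        struct DDS_Duration_t max_wait = {2, 0};  /* 2 seconds */
        DDS_ReturnCode_t retcode;

        retcode = MyDataDataWriter_write(writer, sample, &DDS_HANDLE_NIL);
        if (retcode != DDS_RETCODE_OK) {
            return retcode;
        }

        /* Can return DDS_RETCODE_TIMEOUT even when the DataReader has in
           fact received the sample, which is the behavior described above. */
        return DDS_DataWriter_wait_for_acknowledgments(
                MyDataDataWriter_as_datawriter(writer), &max_wait);
    }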

I assume this is possible, since I have seen it, unless I have misunderstood something. If I am right, what could be the reason for such behavior?

One more piece of information, in case it helps: the implementation has a common subscriber but multiple publishers. Each publisher writes its own samples, but the subscriber applies common logic to each of those samples.

Thanks.

Uday

Reply from Gerardo Pardo:

Hello Uday,

Even if the DataReader receives a sample within the timeout window, it can happen that the DataWriter does not get an ACK for that sample within the window and therefore returns the timeout. How big are the samples? What flow controller are you using? Do you have a good estimate of how long it takes from starting to write a sample until it is fully received by the DataReader? Is that close to the 2-second timeout?

If that time is close to the timeout, there are scenarios where this can happen sporadically, depending on whether some protocol messages are randomly lost and need to be retransmitted.

As background, note that the DataReader sends ACKs whenever it receives a Heartbeat from the DataWriter. The Generic.StrictReliable.LargeData profile is configured to piggyback a Heartbeat with every sample, so if the DataReader got the sample it also got the Heartbeat, and therefore it sent the ACK. However, ACKs can be lost because they are sent best-effort. If the ACK for the last sample/fragment was lost, the DataReader will not send another ACK until it receives another Heartbeat from the DataWriter. Since the DataWriter is waiting for ACKs and not writing anything more, that can only happen via the periodic heartbeat mechanism, which has a 0.2-second period for the Generic.StrictReliable.LargeData profile with one sample outstanding. All of this can be configured via QoS, but I am quoting the values set by the Generic.StrictReliable.LargeData profile.
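
To make this concrete, here is a sketch in C of where those settings live in the DataWriter QoS. Treat it as illustrative: the values shown mirror what the profile already sets (assuming the 0.2-second figure above corresponds to the reliable writer's heartbeat_period), so you would only touch them to tune the behavior:

    #include "ndds/ndds_c.h"

    /* Illustrative only: the reliability-protocol knobs discussed above. */
    void show_heartbeat_settings(struct DDS_DataWriterQos *writer_qos)
    {
        /* Periodic heartbeat: how long the DataWriter waits before
           re-announcing unacknowledged data when it is not writing
           anything new. If the last ACK is lost, this period (plus a
           round trip) bounds how long the DataWriter stays unaware
           that the sample actually arrived. */
        writer_qos->protocol.rtps_reliable_writer.heartbeat_period.sec = 0;
        writer_qos->protocol.rtps_reliable_writer.heartbeat_period.nanosec =
                200000000;  /* 0.2 s */

        /* The profile also piggybacks a Heartbeat on outgoing samples
           (controlled by heartbeats_per_max_samples), which is why the
           DataReader normally ACKs as soon as it receives the sample. */
    }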

So if the ACK for the last fragment of the sample was lost, it would take an extra 0.2 seconds plus a round trip for the DataWriter to learn that the sample was received, which could cause the sporadic timeout.

Is this possible in your scenario, or does sending all the fragments take significantly less than 0.2 seconds?

Gerardo

Reply from Uday:

Hi Gerardo,

Thanks for the quick response.

Here are the details you asked for:

1. How big are the samples?

The TypeObject size is 32,660 bytes.

The maximum sample size is 75,362,248 bytes. Here are two concrete cases of actual sizes; the sizes vary because the topic is made up of unions and sequences:

Case 1: sample size = 2,635,277 bytes
Case 2: sample size = 276,661 bytes

NOTE: all values are in bytes, as returned by:
- DDS_TypeObject_get_serialized_size
- MyDataPlugin_get_serialized_sample_max_size
- MyDataPlugin_get_serialized_sample_size

2. What flow controller are you using?

I realized that my QoS has settings for various flow controllers, but I am not choosing any particular one. As you can see from the attached QoS file, my profile (MyDataProfile) is based on "Generic.StrictReliable.LargeData", not on any of its flow-controller extensions such as "Generic.StrictReliable.LargeData.xxx_flow". (A sketch of how a controller would be selected explicitly follows this list.)

3. Do you have a good estimate of how long it takes from starting to write the sample until it is fully received by the DataReader?

I manually captured the timing from the start of writing (captured just before the write() call) to the end of receiving (captured just before processing the data in the subscribing application); it is on the order of milliseconds.

4. Is that close to the 2-second timeout?

No.
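
To be explicit about what I mean in point 2: my understanding is that choosing a specific controller would look something like the sketch below (written against the Connext C API; the built-in controller name is just an example, and none of this is in my current code):

    #include "ndds/ndds_c.h"

    /* Sketch only: explicitly selecting a built-in flow controller.
       My profile does not do this; it inherits whatever
       Generic.StrictReliable.LargeData uses. */
    void select_flow_controller(struct DDS_DataWriterQos *writer_qos)
    {
        writer_qos->publish_mode.kind = DDS_ASYNCHRONOUS_PUBLISH_MODE_QOS;

        /* Example: the built-in fixed-rate controller. */
        DDS_String_replace(&writer_qos->publish_mode.flow_controller_name,
                           DDS_FIXED_RATE_FLOW_CONTROLLER_NAME);
    }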

Some additional information after today's testing:

- For some reason I thought that increasing max_samples (under the DataReader's resource limits) to 1024 would resolve the issue. It did reduce the frequency of the issue (observed once in the last 2 hours), but it seems to have an adverse effect: it delays my publisher application, where I wait for an acknowledgment (wait_for_acknowledgments()) after each write() call. Is this the right direction? (A sketch of this change follows this list.)

- All the cases I have observed so far are ones where the number of writes is minimal (1 to 2 data samples). Does this hint at why it happens only in those cases?
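
For clarity, the change from the first bullet above is just this (a sketch; 1024 is the value I tried):

    #include "ndds/ndds_c.h"

    /* Raise the DataReader's resource limits, as described above. */
    void raise_reader_resource_limits(struct DDS_DataReaderQos *reader_qos)
    {
        reader_qos->resource_limits.max_samples = 1024;
    }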

Uday
