Timeout for "Reliable reader activity changed -> inactive"


Hi,

I have a Reliable, TransientLocal keyed topic with one DataWriter and one DataReader. History is KeepAll on the DataWriter and KeepLast(2) on the DataReader. All other QoS settings are at their respective defaults. To test liveliness behavior, I ran the following procedure (using v5.2.0 on Linux):

  1. Configure discovery and transports so that everything goes via udpv4://localhost.
  2. Write some samples, let them be received by the DataReader
  3. Using the command tc qdisc change dev lo root netem loss 100%, simulate 100% packet loss on the local loopback.
  4. Write more data, then dispose and unregister the instance (all while the link is actually “dead”).
  5. Eventually deactivate the packet loss on the local loopback to see how the transport recovers and receives outstanding samples.

But in the process, I noticed something interesting: 30 seconds after I start dropping all packets on the loopback, the DataWriter reports

DDSLog: Reliable reader activity changed! Topic: keyed_pairInfo: Reliable Reader Activity Changed Status: Active count: 0 Inactive count: 1

It seems that if no ACK is received for more than 30 seconds for any of the samples the DataWriter is trying to send in step 4 (when the network is “down”), the DataReader is treated as inactive. This potentially affects whether the DataWriter can release resources for the already disposed and unregistered instance.

My question: Which QoS setting, specifically, controls the timeout duration and the resulting behavior by the DataWriter? I would like to configure a larger timeout. I already tried playing with the following:

  • Liveliness: only affects the DataReader's interpretation of whether the DataWriter is there, not the other way around
  • Lifespan: has a similar effect (but not the same) when I activate it, in that it effectively drops outstanding samples on the DataWriter side. But in my original testcase the Lifespan is already infinite, so this is not responsible for the behavior.
  • autopurge_unregistered_instances_delay: Is not responsible for the “inactive” detection, but of course once the DataReader is considered inactive, and the delay has passed, the instance's resources are reclaimed.
  • Setting the Reliability AcknowledgmentKind to APPLICATION_AUTO instead of the default PROTOCOL: makes no difference regarding the 30-second timeout.

Thanks very much and best regards,

Jan

Gerardo Pardo

Hi Jan,

Before getting to the answer, I would like to explain some of the features that come into play here.

(a) When a DataWriter unregisters an instance, it is saying two things: first, that it will not be writing the instance anymore, and second, that it wants to reclaim the memory resources that the writer is using to handle the state of the instance and its associated samples.

(b) Even if the instance is unregistered, the DataWriter cannot remove these resources (instance information, last sample data) immediately. It first needs to send that information (including the "unregister" message) to all the matched readers and then wait until all the matched, active, reliable DataReaders acknowledge it.

(c) There are two ways a DataWriter can consider that a reliable DataReader is no longer "matched and active": the DataReader can stop being "matched", or it can stop being "active" :)

c.1) A DataReader can stop being matched if it goes away via discovery. This can happen in multiple ways: the DataReader can be explicitly deleted, it can change its QoS to an incompatible setting (e.g. Partition), or the liveliness/heartbeating mechanism used by discovery can time out the whole DomainParticipant to which the DataReader belongs.

c.2) A DataReader can stop being "active" if it is reliable and it stops responding to heartbeats or stops making progress in its responses (i.e. it does not advance the sequence number it ACKs).

In your particular case, where the network is effectively disconnected and no traffic flows, both (c.1) and (c.2) are at play. With the default QoS settings, (c.2) will be hit before (c.1).
 
Note that the timing for (c.1) is configured via the DiscoveryConfigQosPolicy, specifically participant_liveliness_lease_duration, which is set to 100 seconds by default.
 
The configuration of (c.2) is done via the DataWriterProtocolQosPolicy, specifically the fields found in the rtps_reliable_writer attribute, which is of type RtpsReliableWriterProtocol_t. There you can find the field max_heartbeat_retries, which controls the maximum number of periodic heartbeats that a writer sends before marking an unresponsive remote reader as inactive. More precisely, as stated in the online documentation:

When a remote reader has not acked all the samples the reliable writer has in its queue, and max_heartbeat_retries number of periodic heartbeats has been sent without receiving any ack/nack back, the remote reader will be marked as inactive (not alive) and be ignored until it resumes sending ack/nack.

Note that piggyback heartbeats do NOT count towards this value.
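
The quoted rule can be illustrated with a small toy model. This is only a sketch of the counting logic as described in the documentation excerpt, not RTI's implementation: the class name ReaderActivityTracker and its methods are hypothetical, it assumes the reader always has unacknowledged samples outstanding, and it assumes the reader becomes inactive on exactly the max_heartbeat_retries-th unanswered periodic heartbeat.

```python
from dataclasses import dataclass


@dataclass
class ReaderActivityTracker:
    """Toy model of the documented rule (hypothetical, not an RTI API)."""

    max_heartbeat_retries: int = 10  # default value cited in this thread
    unanswered_heartbeats: int = 0
    active: bool = True

    def on_heartbeat_sent(self, piggyback: bool = False) -> None:
        # Piggyback heartbeats do not count towards the retry limit,
        # and an already inactive reader is ignored.
        if piggyback or not self.active:
            return
        self.unanswered_heartbeats += 1
        if self.unanswered_heartbeats >= self.max_heartbeat_retries:
            self.active = False

    def on_acknack_received(self) -> None:
        # Any ACK/NACK resets the counter and makes the reader active again.
        self.unanswered_heartbeats = 0
        self.active = True


tracker = ReaderActivityTracker()
for _ in range(9):
    tracker.on_heartbeat_sent()
still_active_after_9 = tracker.active

tracker.on_heartbeat_sent(piggyback=True)  # ignored by the counter
tracker.on_heartbeat_sent()                # 10th periodic heartbeat
inactive_after_10 = not tracker.active

tracker.on_acknack_received()              # link recovers
active_again = tracker.active
```

In this model, the "Inactive count: 1" status corresponds to the moment active flips to False, and the reader is counted as active again as soon as an ACK/NACK arrives.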

 
By default max_heartbeat_retries is set to 10 and the heartbeat period is 3 seconds. So multiplying 10 times 3 you get the 30 seconds delay to inactivate a non-responsive DataReader that you observed.
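
As a rough worked example of that arithmetic, the following sketch compares the two timeouts using the default values mentioned in this thread. The function name is hypothetical, and the simple product ignores any jitter or scheduling delay in the real protocol.

```python
def reader_inactivity_timeout_s(max_heartbeat_retries: int,
                                heartbeat_period_s: float) -> float:
    """Approximate time until an unresponsive reliable reader is marked inactive."""
    return max_heartbeat_retries * heartbeat_period_s


# Defaults quoted in this thread.
reliability_timeout_s = reader_inactivity_timeout_s(10, 3.0)  # case (c.2)
participant_lease_s = 100.0                                   # case (c.1)

# Whichever timeout is shorter is hit first when the link goes down.
first_to_fire = "c.2" if reliability_timeout_s < participant_lease_s else "c.1"
print(reliability_timeout_s, participant_lease_s, first_to_fire)
```

With these defaults the reliability timeout (30 seconds) is shorter than the participant lease (100 seconds), which matches the observation that (c.2) is hit before (c.1).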
 
If you want to increase that, the best thing would be to increase max_heartbeat_retries. You could also increase the heartbeat period; in fact there are two values to change for that: heartbeat_period and fast_heartbeat_period. However, this will impact the performance of the reliable protocol when you are indeed connected.
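
To pick a value, one can invert the same relation. The sketch below assumes the heartbeat period is left at the 3 second default and that the timeout is simply retries times period; the helper name and the target values are only illustrative.

```python
import math


def retries_for_timeout(target_timeout_s: float,
                        heartbeat_period_s: float = 3.0) -> int:
    """Smallest max_heartbeat_retries whose timeout is at least the target."""
    return math.ceil(target_timeout_s / heartbeat_period_s)


# Example targets: tolerate a 5 minute outage, or match the 100 s participant lease.
retries_5_min = retries_for_timeout(300.0)
retries_lease = retries_for_timeout(100.0)
print(retries_5_min, retries_lease)
```

Raising only max_heartbeat_retries leaves the heartbeat period, and therefore the behavior of the reliable protocol on a healthy link, unchanged.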
 
I am curious: why are you "unregistering" the instance? Do you want to reclaim resources on the DataWriter? If it is not about resources, then maybe you could just delete it but not unregister it. If it is not unregistered, the instance information will be retained in the DataWriter and sent to the DataReader once it becomes active again.
 
Gerardo
 

Gerardo,

thanks very much for the detailed answer! Yes, the reason for unregistering is that I want to reclaim resources. More precisely, in my use case I get about one new instance per minute. The instance data is typically updated with new information about every 1–2 seconds, goes through a set of states and eventually gets disposed again after a minute or two. Since some participants are on WLAN connections I need to be reasonably robust, hence the KeepAll policy on the DataWriter side. If my program runs for long enough (and it is expected to run for weeks to months), then with the default settings I eventually run out of memory since all samples are kept on the DataWriter side.

It may make sense to be more generous with max_heartbeat_retries, so thank you for that tip. For now I have increased my autopurge_unregistered_instances_delay to 5 minutes, and I am testing my use case to see how it goes.

Thanks again,

Jan