Hi,
I have a Reliable, TransientLocal keyed topic with one DataWriter and one DataReader. History is KeepAll on the DataWriter and KeepLast(2) on the DataReader. All other QoS are at their respective defaults. To test liveliness behavior, I ran the following protocol (using v5.2.0 on Linux):
- Configure discovery and transports so that everything goes via udpv4://localhost.
- Write some samples, let them be received by the DataReader
- Using the command tc qdisc change dev lo root netem loss 100%, simulate 100% packet loss on the local loopback.
- Write more data, then dispose and unregister the instance (all while the link is actually “dead”).
- Eventually deactivate the packet loss on the local loopback to see how the transport recovers and receives outstanding samples.
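For concreteness, the QoS setup described at the top could be expressed roughly as follows (a sketch in RTI Connext-style XML QoS profile syntax; the profile name is made up and element names may differ slightly by version):

```xml
<!-- Sketch of the QoS described above; profile name is hypothetical -->
<qos_profile name="KeyedPairProfile">
  <datawriter_qos>
    <reliability><kind>RELIABLE_RELIABILITY_QOS</kind></reliability>
    <durability><kind>TRANSIENT_LOCAL_DURABILITY_QOS</kind></durability>
    <history><kind>KEEP_ALL_HISTORY_QOS</kind></history>
  </datawriter_qos>
  <datareader_qos>
    <reliability><kind>RELIABLE_RELIABILITY_QOS</kind></reliability>
    <durability><kind>TRANSIENT_LOCAL_DURABILITY_QOS</kind></durability>
    <history>
      <kind>KEEP_LAST_HISTORY_QOS</kind>
      <depth>2</depth>
    </history>
  </datareader_qos>
</qos_profile>
```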
But in the process, I noticed something interesting: 30 seconds after I start dropping all packets on the loopback, the DataWriter reports
DDSLog: Reliable reader activity changed! Topic: keyed_pairInfo: Reliable Reader Activity Changed Status: Active count: 0 Inactive count: 1
It seems that if the DataWriter receives no ACK for any of the samples it is trying to send in step 4 (while the network is “down”) for more than 30 seconds, the DataReader is treated as inactive. This potentially has consequences for whether the DataWriter can release resources for the already disposed and unregistered instance.
My question: Which QoS setting, specifically, controls the timeout duration and the resulting behavior by the DataWriter? I would like to configure a larger timeout. I already tried playing with the following:
- Liveliness: only affects the DataReader's interpretation of whether the DataWriter is there, not the other way around
- Lifespan: has a similar (but not identical) effect when I activate it, in that it effectively drops outstanding samples on the DataWriter side. But in my original test case the Lifespan is already infinite, so it is not responsible for the behavior.
- autopurge_unregistered_instances_delay: is not responsible for the “inactive” detection, but of course once the DataReader is considered inactive and the delay has passed, the instance's resources are reclaimed.
- Setting the Reliability AcknowledgmentKind to APPLICATION_AUTO instead of the default PROTOCOL: makes no difference regarding the 30-second timeout.
Thanks very much and best regards,
Jan
Hi Jan,
Before getting to the answer I would like to explain some of the features that come into play here.
(a) When a DataWriter unregisters an instance it is saying two things: first, that it will not be writing the instance anymore, and second, that it wants to reclaim the memory resources the writer is using to handle the state of the instance and its associated samples.
(b) Even if the instance is unregistered, the DataWriter cannot remove these resources (instance information, last sample data) immediately. It first needs to send that information (including the "unregister" message) to all the matched readers and also wait until all the matched active reliable DataReaders acknowledge it.
(c) There are two ways a DataWriter can consider that a reliable DataReader is no longer "matched and active": the DataReader can stop being "matched", or it can stop being "active" :)
c.1) A DataReader can stop being matched if it goes away via discovery. This can happen in multiple ways: the DataReader can be explicitly deleted, it can change its QoS to an incompatible setting (e.g. Partition), or the liveliness/heartbeating mechanism used by discovery can "time out" the whole DomainParticipant to which the DataReader belongs.
c.2) A DataReader can stop being "active" if it is reliable and it stops responding to heartbeats or making progress in its responses (i.e. it does not advance the sequence number it ACKs).
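If the 30-second inactivity observed in the question comes from case c.2 (the reliable protocol giving up on an unresponsive reader), the relevant knobs in RTI Connext should be the writer's RTPS reliable protocol settings, notably heartbeat_period and max_heartbeat_retries; with defaults of roughly 3 seconds and 10 retries, 10 × 3 s ≈ 30 s matches the observed timeout. A sketch, assuming RTI Connext-style XML QoS (element names may vary by version; the chosen values are illustrative):

```xml
<!-- Sketch: raising the reliable-reader inactivity timeout on the DataWriter -->
<datawriter_qos>
  <protocol>
    <rtps_reliable_writer>
      <!-- heartbeat_period * max_heartbeat_retries bounds how long an
           unresponsive reliable reader is still considered "active" -->
      <heartbeat_period>
        <sec>3</sec>
        <nanosec>0</nanosec>
      </heartbeat_period>
      <!-- illustrative: 100 retries * 3 s = 5 minutes -->
      <max_heartbeat_retries>100</max_heartbeat_retries>
    </rtps_reliable_writer>
  </protocol>
</datawriter_qos>
```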
Gerardo,
thanks very much for the detailed answer! Yes, the reason for unregistering is that I want to reclaim resources. More precisely, in my use case I get about one new instance per minute. The instance data is typically updated with new information about every 1–2 seconds, goes through a set of states and eventually gets disposed again after a minute or two. Since some participants are on WLAN connections I need to be reasonably robust, hence the KeepAll policy on the DataWriter side. If my program runs for long enough (and it is expected to run for weeks to months), then with the default settings I eventually run out of memory since all samples are kept on the DataWriter side.
It may make sense to be more generous with max_heartbeat_retries, so thank you for that tip. For now I have increased my autopurge_unregistered_instances_delay to 5 minutes and am testing my use case to see how it goes.
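For anyone reading along, the 5-minute delay mentioned above could be configured with something like the following (a sketch assuming RTI Connext-style XML QoS; element names may vary by version):

```xml
<!-- Sketch: purge unregistered instances 5 minutes after they are fully acknowledged -->
<datawriter_qos>
  <writer_data_lifecycle>
    <autopurge_unregistered_instances_delay>
      <sec>300</sec>
      <nanosec>0</nanosec>
    </autopurge_unregistered_instances_delay>
  </writer_data_lifecycle>
</datawriter_qos>
```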
Thanks again,
Jan