Communication stops and data loss after one of the subscribers disconnects

Joined: 03/25/2023
Posts: 3

Hello,

I'm exploring RTI DDS 6.0.1 for university research and unfortunately I bumped into a problem.

I have tested a lot of configurations and I'm pretty sure that I either did something completely wrong or the implementation is faulty.

There's one specific configuration where this is most obvious. It uses the x64 Linux 4 gcc 7.3.0 target package and the C language.
One publisher and two subscribers are connected; when one of the subscribers is disconnected from the network, the publisher always drops 1 or 2 data publications. In addition, the communication freezes for several seconds (between 7 and 12).
This only happens when these QoS parameters are set:

<history>
<kind>KEEP_ALL_HISTORY_QOS</kind>
</history>
<reliability>
<kind>RELIABLE_RELIABILITY_QOS</kind>
</reliability>
<durability>
<kind>TRANSIENT_LOCAL_DURABILITY_QOS</kind>
</durability>
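For context, how long a reliable write() may block when the send cache fills up is governed by reliability.max_blocking_time, which sits alongside the kind in the same policy. A sketch of how that could be set next to the profile above (the 5-second value is purely illustrative):

```xml
<!-- Illustrative only: caps how long write() blocks when the
     reliable send cache is full (the value is an example). -->
<reliability>
    <kind>RELIABLE_RELIABILITY_QOS</kind>
    <max_blocking_time>
        <sec>5</sec>
        <nanosec>0</nanosec>
    </max_blocking_time>
</reliability>
```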

Am I doing something wrong or could it be an implementation error?
Thanks for your help in advance.

Howard
Joined: 11/29/2012
Posts: 567

I'm not sure what you mean by "drops 1 or 2 data publication out".   And what does it mean to disconnect a subscriber? 

Do you mean that the subscribing application is killed?  Or do you mean unplugged from the network?  If unplugged, it is permanently unplugged?  Or do you plug it back in?

Who drops 1 or 2 data samples?  The other subscribing application that is still running?  How do you detect that a data sample is dropped?

How fast is the publishing application sending data?  #samples per second

 

Generally, if the DataWriter's send cache is configured to have a fixed size (resource_limits.max_samples), then when the DataWriter cache is full, the next write() call is blocked.

If a DataReader dies unexpectedly or is otherwise disconnected, then the DataWriter's cache will start filling up with data that has been sent but not yet acknowledged by the dead DataReader. It will do so until the cache hits the max_samples limit.

How fast this happens depends on how fast you're sending data.

When the send queue is full, the write() call will block, which makes the application look "frozen". The write() call will either time out (depending on the reliability.max_blocking_time setting) or unblock if DDS decides that the DataReader is dead and allows the write() to continue.

What you're describing is expected behavior that is configurable via QoS settings.

The question is what do you want DDS to do in such a scenario...

Joined: 03/25/2023
Posts: 3

Hi Howard,

Thank you for your reply.

Do you mean that the subscribing application is killed?  Or do you mean unplugged from the network?  If unplugged, it is permanently unplugged?  Or do you plug it back in?

Initially we only unplugged one of the subscribers from the network and later reconnected it. In a later test we also killed one of the subscriber apps. In both cases we observed the same behaviour. (We had one publisher, and two subscriber applications in this test setting)

Who drops 1 or 2 data samples?  The other subscribing application that is still running?  How do you detect that a data sample is dropped?

By dropping I mean that the write method returns exit code 10. The publisher increments an integer over time and publishes the current value 10 times a second. The subscriber writes the received values to its output. When we got exit code 10 as the return value, neither of the subscribers wrote those values out.

How fast is the publishing application sending data?

10 samples per second, data structure contains one single integer.

All resource settings are inherited from "BuiltinQosLib::Generic.StrictReliable". (The application runs on desktop computers, resources are not restricted.)

As for the cache filling up: it is strange that after the two failed write calls, the subsequent ones are successful, even though the network is not yet reconnected.

The question is what do you want DDS to do in such a scenario...

As part of a university semester project I'm studying the effects of different QoS settings on read and write operations, and now I'm having difficulty interpreting my observations.

Howard
Joined: 11/29/2012
Posts: 567

So, you should look at the documentation to understand how the API works.  e.g., a return code of 10 is DDS_RETCODE_TIMEOUT, https://community.rti.com/static/documentation/connext-dds/7.0.0/doc/api/connext_dds/api_cpp/structFooDataWriter.html#abb3770f202340bc819368463987eb055.

If the return code is DDS_RETCODE_TIMEOUT (enumeration value is 10), then the write() call has failed and the data was not accepted for sending.

A DataWriter::write() call will only block if the connection between the DataWriter and DataReader is set to be reliable.

You can/should learn about the Connext DDS Reliability protocol in chapter 34 of the Users Manual here: https://community.rti.com/static/documentation/connext-dds/7.0.0/doc/manuals/connext_dds_professional/users_manual/users_manual/reliable.htm#reliable_1394042328_873873.


A DataWriter::write() call can succeed even if there are no DataReaders.  All that means is that the data that was passed to the write() method was accepted/store in the DataWriter's send cache.  It doesn't imply that the data was sent, nor that any DataReaders actually received the data.

If DataReaders stop acking data sent by a DataWriter and the DataWriter continues to send, the writer's cache will fill, causing future write() calls to block. However, after a configurable timeout (different from the writer's max_blocking_time), the DataReader that stopped sending ACKs will be timed out, and any blocked or new write() calls will be allowed to succeed, since the send queue is cleared of the data pending on that DataReader and now has space to accept new samples.

You should definitely read the docs about this if you want to fully understand how things work in DDS.

Joined: 03/25/2023
Posts: 3

I think I have finally found out the problem.

The data loss was caused by "max_send_window_size" not being DDS_LENGTH_UNLIMITED.

Every piece of documentation I read said the "max_send_window_size" value is unlimited, so I didn't bother with it. It turned out that in "BuiltinQosLib::Generic.StrictReliable" this value isn't unlimited, so I had to set it explicitly.
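For anyone hitting the same thing: the send-window limit lives under the DataWriter protocol QoS. A sketch of overriding it while still inheriting from the strict-reliable builtin profile (the profile name here is made up for illustration):

```xml
<!-- Illustrative profile: inherit StrictReliable but lift its
     send-window cap so write() is bounded only by max_samples. -->
<qos_profile name="StrictReliableUnlimitedWindow"
             base_name="BuiltinQosLib::Generic.StrictReliable">
    <datawriter_qos>
        <protocol>
            <rtps_reliable_writer>
                <max_send_window_size>LENGTH_UNLIMITED</max_send_window_size>
            </rtps_reliable_writer>
        </protocol>
    </datawriter_qos>
</qos_profile>
```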

Thanks for your help. The documentation you linked was really useful.