Deadlock risk for RTI Connext v. 5.2

Hi,

I'm having a problem with sending samples. My use case is the following:

 

A binary that contains a DataReader is launched. It receives images, does some processing on them, creates a DataWriter, and sends the images on another topic (QoS profile used: BuiltinQosLibExp::Generic.StrictReliable.LargeData.FastFlow).

Another binary contains a DataReader that takes the images sent by the DataWriter. 

 

My problem is that, after a while, the images sent by the DataWriter are no longer received by the DataReader. The write() method returns RETCODE_OK, so I assume the sample was sent correctly. The DataReader receives nothing (it uses a WaitSet, which is never woken up).

The only error I have is the following:

 

REDAWorker_enterExclusiveArea:worker rDsp deadlock risk: cannot enter 0x7fc98c006510 of level 30 from level 30
REDACursor_modifyReadWriteArea:!enter worker (rDsp)'s exclusive area
COMMENDSrWriterService_agentFunction:!modify srw writer
REDAWorker_enterExclusiveArea:worker rDsp deadlock risk: cannot enter 0x7fc8f0493a00 of level 0 from level 0
RTIEventJobDispatcherThread_spawnedFnc:!entering eaBeforeAgentFncs EA

 

I have hundreds of them, but I don't have timestamps, so I can't tell whether these errors occurred while the images were being sent or during the binary's shutdown.
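
Since the messages carry no timestamps, one option is to pipe the binary's output through a small filter that stamps each line as it is read, which would show whether the errors appear during sending or at shutdown. Below is a minimal Python sketch, not part of Connext: `stamp_lines` and `parse_deadlock_risk` are hypothetical helper names, and the regular expression assumes the messages look exactly like the excerpt above. The timestamps record when a line was read by the filter, so output buffering in the pipe can delay them slightly.

```python
import re
from datetime import datetime

# Matches the "deadlock risk" line format shown in the log excerpt above
# (assumption: other Connext versions may word it differently).
DEADLOCK_RE = re.compile(
    r"worker (?P<worker>\S+) deadlock risk: cannot enter "
    r"(?P<ea>0x[0-9a-fA-F]+) of level (?P<target>\d+) "
    r"from level (?P<current>\d+)"
)

def stamp_lines(lines, now=datetime.now):
    """Yield each line prefixed with the wall-clock time at which it was read."""
    for line in lines:
        yield f"{now().isoformat(timespec='milliseconds')} {line.rstrip()}"

def parse_deadlock_risk(line):
    """Return (worker, target_level, current_level), or None if no match."""
    m = DEADLOCK_RE.search(line)
    if m is None:
        return None
    return m.group("worker"), int(m.group("target")), int(m.group("current"))

# Demonstration on two lines copied from the excerpt above.
sample_log = [
    "REDAWorker_enterExclusiveArea:worker rDsp deadlock risk: "
    "cannot enter 0x7fc98c006510 of level 30 from level 30",
    "COMMENDSrWriterService_agentFunction:!modify srw writer",
]
for stamped in stamp_lines(sample_log):
    flag = " <-- deadlock risk" if parse_deadlock_risk(stamped) else ""
    print(stamped + flag)
```

To use it on a live process, the same `stamp_lines` generator can be fed `sys.stdin` instead of `sample_log`, with the binary's stderr redirected into the script.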

I read this: https://community.rti.com/kb/what-does-deadlock-error-message-mean concerning the deadlock error message, but it says that the error occurs in versions 4.0 and below, and I'm using 5.2.

 

My question is: do you think that these errors could be related to the fact that the Reader did not receive any samples? If not, what could cause these errors?

 

Thanks :-)

Lucie


Lucie,

I presume that you are getting this error from the publishing application. You mention that you are using BuiltinQosLibExp::Generic.StrictReliable.LargeData.FastFlow. That is an aggressive flow controller, and your system needs to be fast enough to handle it. Its token period is 10 ms. I have a few questions and suggestions about your use case.

1- If your sample size is less than 65K, use synchronous publication.

2- What are your data rate requirements? The fast_token is 100 MB/sec. Adjust accordingly.

3- On your publisher, wait for the on_publication_matched event, then start writing. After each write, call wait_for_acknowledgments().
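
To put the figures in this reply together: a token-bucket flow controller releases a fixed budget of bytes per period, so the rate and the period determine how much can leave in each period and roughly how long one large sample takes. The Python sketch below is only that arithmetic. It assumes 1 MB = 10^6 bytes, takes the 100 MB/sec and 10 ms figures above at face value, uses an illustrative 100 MB sample, and ignores any initial burst allowance, protocol overhead, and repair traffic. The function names are hypothetical.

```python
def bytes_per_period(rate_bytes_per_s, period_ms):
    """Byte budget released in each token period."""
    return rate_bytes_per_s * period_ms // 1000

def approx_send_time_ms(sample_bytes, rate_bytes_per_s, period_ms):
    """Approximate steady-state time to release one sample, in milliseconds."""
    budget = bytes_per_period(rate_bytes_per_s, period_ms)
    periods = -(-sample_bytes // budget)  # ceiling division
    return periods * period_ms

MB = 10**6            # assumption: decimal megabytes
RATE = 100 * MB       # 100 MB/sec, as quoted above
PERIOD_MS = 10        # token period, as quoted above

budget = bytes_per_period(RATE, PERIOD_MS)
send_ms = approx_send_time_ms(100 * MB, RATE, PERIOD_MS)  # illustrative size
print(f"budget per period: {budget} bytes")
print(f"approx. time for a 100 MB sample: {send_ms} ms")
```

Under these assumptions, a sample much larger than the per-period budget is spread over many periods, which is why the rate has to be matched to the required data rate.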

 

                      Irwin


Hello Irwin,

I'm a colleague of Lucie. I'd like to add more information that I hope could help you understand our use case.

The samples we are working with are over 100Mb in size, and we have tuned the FlowController so that each one is delivered within 30 ms. The publisher and subscriber are on the same host. They are created on the fly for a short time period (~30 seconds) before being deleted. That is because we also want to avoid using a single topic.
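
For a sense of scale, the delivery target stated here implies a sustained rate that can be worked out directly. The sketch below is plain arithmetic in Python, computed under both readings of "100Mb" (megabytes or megabits), since the post does not say which is meant. The 100 MB/sec comparison figure is the one quoted earlier in the thread, 1 MB is taken as 10^6 bytes, and the function name is hypothetical.

```python
def required_rate_bytes_per_s(sample_bytes, deadline_ms):
    """Sustained rate needed to move one sample within the deadline."""
    return sample_bytes * 1000 / deadline_ms

MB = 10**6                   # assumption: decimal megabytes
QUOTED_RATE = 100 * MB       # rate quoted earlier in the thread
DEADLINE_MS = 30             # delivery target stated in this post

readings = (
    ("100 megabytes", 100 * MB),
    ("100 megabits", 100 * MB // 8),
)
for label, size in readings:
    rate = required_rate_bytes_per_s(size, DEADLINE_MS)
    print(f"{label}: {rate / MB:.0f} MB/s needed, "
          f"{rate / QUOTED_RATE:.1f}x the quoted rate")
```

Either way, the target sits well above the quoted default, which is consistent with the flow controller having been tuned as described.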

We do know a little about exclusive areas, and we made sure that the publisher and subscriber are never created in the same thread as the one woken up by a WaitSet. This worked for months until this one time, when the samples were not processed by the subscriber and the errors mentioned by Lucie were logged:

REDAWorker_enterExclusiveArea:worker rDsp deadlock risk: cannot enter 0x7fc98c006510 of level 30 from level 30
REDACursor_modifyReadWriteArea:!enter worker (rDsp)'s exclusive area
COMMENDSrWriterService_agentFunction:!modify srw writer
REDAWorker_enterExclusiveArea:worker rDsp deadlock risk

You are right that those logs belong to the publisher application. We have certainly overlooked something about this exclusive area. Maybe the publisher failed to be created, but in that case we would have expected the call to write() to fail. It returned DDS::RETCODE_OK.

I didn't know about the on_publication_matched event. However, we are setting durability to TRANSIENT_LOCAL. To my understanding, this setting allows late joiners to read samples that were published before they were created. We also didn't notice any odd behavior or logged errors in the subscriber application, so we assume it was created successfully.
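
The late-joiner behavior described here can be pictured with a toy model. This is a plain-Python illustration, not Connext code: `ToyWriter` and `ToyReader` are hypothetical classes, and the model assumes a KEEP_LAST history of a given depth on the writer. It also shows one consequence that matters for short-lived writers: with TRANSIENT_LOCAL the stored samples live in the DataWriter itself, so they are gone once the writer is deleted.

```python
from collections import deque

class ToyReader:
    """Toy reader that just records what it is given."""
    def __init__(self):
        self.received = []

class ToyWriter:
    """Toy stand-in for a TRANSIENT_LOCAL writer with KEEP_LAST history."""
    def __init__(self, depth):
        self.history = deque(maxlen=depth)  # oldest samples fall out
        self.readers = []
        self.alive = True

    def write(self, sample):
        self.history.append(sample)
        for reader in self.readers:
            reader.received.append(sample)

    def match(self, reader):
        """A late joiner first gets whatever is still in the history."""
        if not self.alive:
            return  # a deleted writer has nothing left to deliver
        reader.received.extend(self.history)
        self.readers.append(reader)

    def delete(self):
        """Deleting the writer discards its stored samples."""
        self.history.clear()
        self.readers.clear()
        self.alive = False

writer = ToyWriter(depth=2)
for image in ("img1", "img2", "img3"):
    writer.write(image)

late = ToyReader()
writer.match(late)
print(late.received)      # ['img2', 'img3']

writer.delete()
too_late = ToyReader()
writer.match(too_late)
print(too_late.received)  # []
```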

My assumptions may be wrong, and my understanding may be flawed. That is why I'm kindly asking for your advice. Any help here would be greatly appreciated. Thank you very much for your time.

Thanh-Ha


Hi Thanh-Ha,

I have sent an email directly to Lucie to discuss her question. Your organization has access to RTI support, and I would like to add you to that support email chain. Could you please email me at bobby@rti.com with your contact info?

Thank you,

Bobby