DDSTheParticipantFactory->delete_participant hangs

gandriotakis

On exiting an application (using DDS version 4.5f) I am cleaning up using:

            DDS_ReturnCode_t rc = m_participant->delete_contained_entities();

            rc = DDSTheParticipantFactory->delete_participant(m_participant);

            rc = DDSDomainParticipantFactory::finalize_instance();

The return code for delete_contained_entities is successful but delete_participant never returns.

rose

Hello gandriotakis, 

One possible culprit for this:

Our receive threads are blocked on sockets, and they are woken up when the delete_participant() call sends a message over loopback to each socket our receive threads are blocking on.  This allows the thread to wake up, realize that it's shutting down, and clean itself up.  

In some cases, this message doesn't get received – especially when a firewall is running – and one or more receive threads fail to wake up and shut down.
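The wake-up mechanism described above can be illustrated with a small POSIX sketch (not RTI's implementation, and not Winsock): a socket that a receiver would block on gets a one-byte datagram sent to it over loopback. The function name `loopback_wakeup_arrives` and the receive timeout are assumptions added for illustration; the timeout turns a lost wake-up into a `false` return instead of a hang.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper: returns true if a datagram sent to our own loopback
// socket is received before the timeout expires, false otherwise.
bool loopback_wakeup_arrives(int timeout_ms) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return false;

    // Bind to 127.0.0.1 and let the OS choose a free port.
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    socklen_t len = sizeof addr;
    bool ok = bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 0
           && getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) == 0;

    // Bound the wait so a lost wake-up shows up as a timeout, not a hang.
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    ok = ok && setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) == 0;

    // "Shutdown" side: send one byte to the socket the receiver blocks on.
    const char wake = 'w';
    ok = ok && sendto(fd, &wake, 1, 0,
                      reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 1;

    // "Receive thread" side: blocks until the datagram or the timeout.
    char buf[1];
    ok = ok && recv(fd, buf, sizeof buf, 0) == 1 && buf[0] == 'w';

    close(fd);
    return ok;
}
```

If the send fails or the datagram is filtered, the receive side in this sketch gives up after the timeout; a receiver with no timeout would stay blocked, which is the failure mode being described.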

So, can you tell me a bit more about your system?  Do you have a firewall running?

Thank you!

Rose

gandriotakis

I get the same behavior with the firewall off.  We are currently in the process of a redesign moving towards more portable code.  The same pattern works in the old code.

I should note that delete_contained_entities returns DDS_RETCODE_OK.  If I do not call delete_contained_entities, then delete_participant returns DDS_RETCODE_ERROR and finalize_instance returns DDS_RETCODE_PRECONDITION_NOT_MET, which seems to make sense.  Thus I suspect DDS does not get cleaned up, but at least the application exits.

rose

Hello gandriotakis, 

I checked, and we have a known issue on certain versions of Linux where the delete_participant call may hang forever.  I am checking whether there is a workaround.  

Edit: I just checked the details of the bug report that I mentioned, and it looks like it is on systems where that last shutdown packet doesn't get through to the DataReader (such as being blocked by a firewall).  I doubt that this problem is due to your code, but I am not sure why it is happening in 4.5f but not in the previous version.  What was the previous version you were using? What OS are you running on?

Thank you!

Rose

gandriotakis

Hi Rose,

We are running 4.5f on Windows 7.  The point of the exercise is to move to Linux.  It worked with 4.5f with our previous code, but not with the current code, which has been redesigned to not use the Windows message pump.

rose

Hello gandriotakis,

We have only seen this problem on Linux.  But two more questions for you:  Do you have multiple DomainParticipants in your application?  Is multicast enabled in your application?

We have reproduced this problem on Linux, but only if there are two DomainParticipants in an application, and if you delete them in a different order than they were created. 

One other thing worth exploring:  It would be helpful to see which threads are still running and preventing the call from completing.  Do you have Process Explorer?  If you do, it will show you more details about which threads are running.

Thank you!

Rose

gandriotakis

Hello Rose,

We have only one domain participant.  Since multicast is not explicitly disabled, I am assuming that it is enabled.  Process Explorer shows only one thread associated with the task (WinMainCRTStartup).

rose

Hello gandriotakis, 

It would be very strange if the other threads shut down correctly, but the call is still hanging.  Can you increase the verbosity of DDS?  That will slow down your execution time, but it might give me a hint where it's hanging.

FILE *logFile = fopen("dds_shutdown.log", "w");  // example path only; requires <cstdio>
NDDSConfigLogger::get_instance()->set_output_file(logFile);
NDDSConfigLogger::get_instance()->set_verbosity(NDDS_CONFIG_LOG_VERBOSITY_STATUS_ALL);

Also, can you attach the QoS configuration you are using?  One other place that can have problems is if you set the DomainParticipant's shutdown_cleanup_period to 0.  This usually only causes problems on RTOSes, though, so it is a long shot. 

Thank you!

Rose

gandriotakis

I have attached my QoS and a verbose log.

rose

Hello gandriotakis, 

As a test, can you comment out this property?

           <element>
                   <name>dds.transport.UDPv4.builtin.ignore_loopback_interface</name>
                   <value>0</value>
           </element>

Thank you!

Rose

gandriotakis

That did not help.

rose

There is nothing else in the QoS that looks like it could be responsible.

(There is a mismatch in some of the profiles between the UDP sizes that you  might want to change: 65530 vs. 65535.  That is unlikely to be related to this, though.)

Can you build a debug version, open the application in Visual Studio, and double-check which threads are running when it hangs?  I believe you might be able to get our thread stack traces that way.  (I am still reading through the log file, but in retrospect the stack trace may be faster).

Thank you!

Rose

gandriotakis

Sorry for the delay but I got diverted.  When the app hangs the call stack is:

ntdll.dll!_NtDelayExecution@8()  + 0x15 bytes 
  ntdll.dll!_NtDelayExecution@8()  + 0x15 bytes 
  KernelBase.dll!_Sleep@4()  + 0xf bytes 
  nddscore.dll!00baf405()  
  [Frames below may be incorrect and/or missing, no symbols loaded for nddscore.dll] 
  nddscore.dll!00b1c806()  
  nddscore.dll!00a131cb()  
  nddsc.dll!0083100a()  
  nddsc.dll!0080e422()  

rose

The delay isn't a problem – glad to know this isn't holding you up too badly.

Can you reproduce the same problem when building against our debug libraries?  I'm hoping that gives a more detailed stack trace.

Thank you!

Rose

gandriotakis

  ntdll.dll!_NtDelayExecution@8()  + 0x15 bytes 
  ntdll.dll!_NtDelayExecution@8()  + 0x15 bytes 
  KernelBase.dll!_Sleep@4()  + 0xf bytes 
  nddscored.dll!021c46eb()  
  [Frames below may be incorrect and/or missing, no symbols loaded for nddscored.dll] 
  nddscored.dll!02165715()  
  nddscored.dll!0216590a()  
  nddscored.dll!01fac103()  
  nddscored.dll!021b0481()  
  nddscored.dll!021b062c()  
  nddscored.dll!021b0213()  
  nddscored.dll!021b3b0e()  
  nddscored.dll!01e71e0c()  
  nddscd.dll!018b7e05()  
  nddscppd.dll!0158b242()  
  nddscppd.dll!015c2d6e()  
  nddscppd.dll!01586c6b()  
  nddscppd.dll!01554af0()  
  nddscppd.dll!0155d5b5()  
> TcmIpc.dll!DDSInterface::cleanup()  Line 320 + 0x1e bytes C++
  TcmIpc.dll!DDSInterface::~DDSInterface()  Line 260 C++
  TcmIpc.dll!DDSInterface::`vector deleting destructor'()  + 0x57 bytes C++
  TcmIpc.dll!Poco::SingletonHolder<DDSInterface>::~SingletonHolder<DDSInterface>()  Line 69 + 0x24 bytes C++
  TcmIpc.dll!`DDSInterface::Instance'::`2'::`dynamic atexit destructor for 'sh''()  + 0xd bytes C++
  TcmIpc.dll!_CRT_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved)  Line 415 C
  TcmIpc.dll!__DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved)  Line 526 + 0x11 bytes C
  TcmIpc.dll!_DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved)  Line 476 + 0x11 bytes C
  ntdll.dll!_LdrpCallInitRoutine@16()  + 0x14 bytes 
  ntdll.dll!_LdrShutdownProcess@0()  + 0x141 bytes 
  ntdll.dll!_RtlExitUserProcess@4()  + 0x74 bytes 
  kernel32.dll!754a79c5()  
  msvcr100d.dll!___crtExitProcess()  + 0x1b bytes 
  msvcr100d.dll!___freeCrtMemory()  + 0x317 bytes 
  msvcr100d.dll!_exit()  + 0x12 bytes 
  TCMHostAdapterApp.exe!__tmainCRTStartup()  Line 568 C
  TCMHostAdapterApp.exe!wmainCRTStartup()  Line 371 C
  kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes 
  ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes 
  ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes 

Gerardo Pardo

MODIFIED: I had not noticed you had already attached the verbosity output to a previous posting. Please ignore the request below...

Hi,

We are still shooting in the dark. It seems that, for some reason, when the operating system dumps the stack it is not able to find the symbols for nddscppd.dll.

Maybe you can call the operation:

NDDSConfigLogger::get_instance()->set_verbosity(NDDS_CONFIG_LOG_VERBOSITY_STATUS_ALL);

to enable verbose output before calling participant->delete_contained_entities() or DDSTheParticipantFactory->delete_participant(). That may shed some light on what the RTIDDS core is doing when it hangs.

Gerardo

gandriotakis

Here is the stack dump after I hooked in the symbols:

  ntdll.dll!_NtDelayExecution@8()  + 0x15 bytes 
  ntdll.dll!_NtDelayExecution@8()  + 0x15 bytes 
  KernelBase.dll!_Sleep@4()  + 0xf bytes 
  nddscored.dll!RTIOsapiThread_sleep(const RTINtpTime * timeIn)  Line 696 + 0xc bytes C
  nddscored.dll!COMMENDActiveFacade_preShutdownWakeup(COMMENDActiveFacade * me, REDAWorker * worker)  Line 929 + 0xe bytes C
  nddscored.dll!PRESParticipant_preShutdownWakeup(PRESParticipant * me, int * failReason, REDAWorker * worker)  Line 3259 + 0x1f bytes C
  nddscd.dll!DDS_DomainParticipantPresentation_wakeup(DDS_DomainParticipantPresentation * self, REDAWorker * worker)  Line 324 + 0x11 bytes C
  nddscd.dll!DDS_DomainParticipant_destroyI(DDS_DomainParticipantImpl * ddsParticipant)  Line 9129 + 0x10 bytes C
  nddscd.dll!DDS_DomainParticipantFactory_delete_participant(DDS_DomainParticipantFactoryImpl * self, DDS_DomainParticipantImpl * a_participant)  Line 1690 + 0x9 bytes C
  nddscppd.dll!DDSDomainParticipant_impl::destroyI(DDSDomainParticipant_impl * ddsParticipant)  Line 1080 + 0x49 bytes C++
  nddscppd.dll!DDSDomainParticipantFactory_impl::delete_participant(DDSDomainParticipant * a_participant)  Line 772 + 0x9 bytes C++
> TcmIpc.dll!DDSInterface::cleanup()  Line 318 + 0x1e bytes C++
  TcmIpc.dll!DDSInterface::~DDSInterface()  Line 258 C++
  TcmIpc.dll!DDSInterface::`vector deleting destructor'()  + 0x57 bytes C++
  TcmIpc.dll!Poco::SingletonHolder<DDSInterface>::~SingletonHolder<DDSInterface>()  Line 69 + 0x24 bytes C++
  TcmIpc.dll!`DDSInterface::Instance'::`2'::`dynamic atexit destructor for 'sh''()  + 0xd bytes C++
  TcmIpc.dll!_CRT_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved)  Line 415 C
  TcmIpc.dll!__DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved)  Line 526 + 0x11 bytes C
  TcmIpc.dll!_DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved)  Line 476 + 0x11 bytes C
  ntdll.dll!_LdrpCallInitRoutine@16()  + 0x14 bytes 
  ntdll.dll!_LdrShutdownProcess@0()  + 0x141 bytes 
  ntdll.dll!_RtlExitUserProcess@4()  + 0x74 bytes 
  kernel32.dll!754a79c5()  
  msvcr100d.dll!___crtExitProcess()  + 0x1b bytes 
  msvcr100d.dll!___freeCrtMemory()  + 0x317 bytes 
  msvcr100d.dll!_exit()  + 0x12 bytes 
  TCMHostAdapterApp.exe!__tmainCRTStartup()  Line 568 C
  TCMHostAdapterApp.exe!wmainCRTStartup()  Line 371 C
  kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes 
  ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes 
  ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes 

rose

Thank you for the additional details!

That stack trace says that we are definitely waiting for a thread to shut down.  It's still not 100% certain that it's a receive thread, but in combination with the error messages in the log file you sent, I am willing to make a strong guess that that is the reason why the main thread never shuts down.

Here is what I see:  At about the time the application is calling delete_contained_entities(), error messages in your log file indicate that the WSASendTo call fails with error 10093 (0x276D), WSANOTINITIALISED.

[D0101|DELETE_CONTAINED]NDDS_Transport_UDPv4_send:OS WSASendTo() failure, error 0X276D

[D0101|DELETE_CONTAINED]NDDS_Transport_UDPv4_send:U0000136c sent 88 bytes to 0X100007F:7660 

This error is being printed as DDS is sending the final messages at shutdown.  We send these final messages for two reasons:

  1. To notify other applications about reader/writer deletions.  
  2. To wake up our own receive threads for a clean shutdown

Given the error messages, there is a very good chance that the wake-up data is not being received, causing the main thread to wait forever for the receive threads to unblock and shut down cleanly.  

Looking closely at this WSANOTINITIALISED error, it looks like this error could happen in two cases:

  1. if we didn't call WSAStartup (in which case we would see this error at the beginning of the file, too), or
  2. if somebody called WSACleanup() before these final messages are sent.  

I just took a look through our code, and if you are using the default UDP transport we should not be calling WSACleanup() at all in 4.5f.  Is it possible that your application is calling WSACleanup() somewhere, or using another library that calls WSACleanup()?  (Also, if there is anything else you can think of that might cause a network disconnection during the shutdown process, let me know.)
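For anyone reading the log lines quoted above, the hex values can be decoded with a few lines of C++. This is only a reading aid: the helper names `parse_hex` and `dotted_quad` are made up for the sketch, and the address decoding assumes the log prints the raw network-order IPv4 value as read on a little-endian host, which this thread does not confirm.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: parses a hex token as printed in the log,
// with or without a 0x/0X prefix.
unsigned long parse_hex(const std::string& token) {
    return std::stoul(token, nullptr, 16);
}

// Hypothetical helper: renders a 32-bit address whose lowest byte is the
// first octet (assumed layout, see the lead-in).
std::string dotted_quad(std::uint32_t word) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%u.%u.%u.%u",
                  static_cast<unsigned>(word & 0xFF),
                  static_cast<unsigned>((word >> 8) & 0xFF),
                  static_cast<unsigned>((word >> 16) & 0xFF),
                  static_cast<unsigned>((word >> 24) & 0xFF));
    return std::string(buf);
}
```

Under those assumptions, `parse_hex("0X276D")` gives 10093, the numeric value of WSANOTINITIALISED, and `dotted_quad(0x100007F)` gives 127.0.0.1, which would make the failing send a message to the loopback address.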

Thank you!

Rose

gandriotakis

Is there anything I can wait for that will let me know when you are done with the connection?

rose

After all DomainParticipants are deleted in your application, we don't need to use the network anymore.  So really, just waiting for the delete_participant() call (or calls) to complete should be enough.  You don't have to finalize the DomainParticipantFactory before making a WSACleanup() call.

If for some reason you call WSACleanup() and then need to create/enable a new DomainParticipant, it calls the WSAStartup() API again, which should be safe.

Thank you!

Rose

gandriotakis

We are using a third-party library that I suspect calls WSACleanup when cleaning up a static variable on exit.  This appears to be happening on exit before DDS is done cleaning up.  Ideally there would be some indication that DDS cleanup was complete before attempting to finish exiting (e.g. some event I could wait for).

rose

The only indication that we have successfully shut down is when all delete_participant() calls complete and return a return code saying the calls were successful.  It should be possible to build logic around this to signal the other code that DDS has finished cleaning up, but there is no callback mechanism for that.
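One way such signalling logic could look is sketched below. It is not part of the DDS API: `ShutdownGate`, `run` and `wait` are hypothetical names, and the `teardown` callable stands in for application code that calls delete_participant() and returns true when the result is DDS_RETCODE_OK.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical helper: lets other code wait until DDS teardown has finished.
class ShutdownGate {
public:
    // Runs the teardown callable, then publishes its result to any waiters.
    void run(const std::function<bool()>& teardown) {
        const bool ok = teardown();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            succeeded_ = ok;
            finished_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until run() has completed; returns whether teardown succeeded.
    bool wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return finished_; });
        return succeeded_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool finished_ = false;
    bool succeeded_ = false;
};
```

Code that must not touch the network stack until DDS is done would call `wait()` first. Note the limitation: if the teardown callable itself never returns, `wait()` blocks as well, so this only orders the steps and does not cure a hang inside delete_participant().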

Thank you!

Rose

gandriotakis

That will not help because the call to delete_participant never returns, which gets us back to the original problem.  I am not sure WSACleanup is the problem. 

As I understand it, every WSAStartup call must be matched with a WSACleanup call, and the actual cleanup takes place only once every startup has been matched.

I added an extraneous/unmatched call to WSAStartup that should have incremented the reference count so that WSACleanup never actually cleaned up but delete_participant is still hanging.

gandriotakis

I have moved the cleanup to be explicitly called outside of the destructor before exiting.  This seems to have worked around the issue.  Thank you for your help.
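For readers hitting the same hang: the symbolized stack trace earlier in this thread shows cleanup() being reached from a static destructor while the process was already exiting (under _LdrShutdownProcess and the DLL's atexit handlers). On Windows, ExitProcess terminates every thread except the caller before DLLs are detached, so one plausible reading is that the receive threads were already gone and could never respond to the wake-up. The thread does not confirm that diagnosis. The shape of the workaround can be sketched as follows, with hypothetical names (`DdsInterfaceSketch`, `teardown_log`) and the DDS calls replaced by log entries so the sketch compiles without the RTI headers:

```cpp
#include <string>
#include <vector>

// Records teardown steps so the sketch can be checked without DDS.
std::vector<std::string>& teardown_log() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-in for the DDSInterface singleton seen in the stack trace.
class DdsInterfaceSketch {
public:
    static DdsInterfaceSketch& instance() {
        static DdsInterfaceSketch self;
        return self;
    }

    // Explicit, idempotent teardown. Call it from main() before returning,
    // while the process is still fully alive. Returns false if it already ran.
    bool cleanup() {
        if (cleaned_up_) return false;
        // The real application would make the three DDS calls here.
        teardown_log().push_back("delete_contained_entities");
        teardown_log().push_back("delete_participant");
        teardown_log().push_back("finalize_instance");
        cleaned_up_ = true;
        return true;
    }

    // Deliberately empty: no blocking DDS calls during static destruction.
    ~DdsInterfaceSketch() {}

private:
    DdsInterfaceSketch() : cleaned_up_(false) {}
    bool cleaned_up_;
};
```

In this arrangement main() would call `DdsInterfaceSketch::instance().cleanup()` as its last step before returning, and the destructor that runs later at process exit has nothing left to wait for.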

rose

That's great that it solved the problem!  

Thank you,

Rose