On exiting an application (using DDS version 4.5f) I am cleaning up using:
DDS_ReturnCode_t rc = m_participant->delete_contained_entities();
rc = DDSTheParticipantFactory->delete_participant(m_participant);
rc = DDSDomainParticipantFactory::finalize_instance();
The return code for delete_contained_entities is successful but delete_participant never returns.
Hello gandriotakis,
One possible culprit for this:
Our receive threads are blocked on sockets, and they are woken up when the delete_participant() call sends a message over loopback to each socket our receive threads are blocking on. This allows the thread to wake up, realize that it's shutting down, and clean itself up.
In some cases, this message doesn't get received – especially when a firewall is running – and one or more receive threads fail to wake up and shut down.
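To make the mechanism concrete, here is a minimal sketch of the wake-up pattern described above. It is not RTI's implementation: it uses plain POSIX UDP sockets, the name `wake_and_join_receiver` is made up for illustration, and the one-second receive timeout is an extra safety net that the description above does not mention.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#include <atomic>
#include <cassert>
#include <thread>

// Illustrative only (hypothetical name, POSIX sockets): a receive thread
// blocks in recvfrom() and exits once it is woken and sees the shutdown flag.
// Returns true if the wake-up datagram was sent and the thread was joined.
bool wake_and_join_receiver() {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0) return false;

    // Safety net for this sketch: recvfrom() times out after 1 s, so a lost
    // wake-up delays shutdown instead of blocking it forever.
    timeval tv{};
    tv.tv_sec = 1;
    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the OS pick a free port
    socklen_t len = sizeof addr;
    if (bind(rx, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0 ||
        getsockname(rx, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
        close(rx);
        return false;
    }
    assert(len == sizeof addr);

    std::atomic<bool> shutting_down{false};
    std::thread receiver([&] {
        char buf[16];
        while (!shutting_down.load()) {
            // Blocks until a datagram arrives (or the timeout expires).
            (void)recvfrom(rx, buf, sizeof buf, 0, nullptr, nullptr);
        }
    });

    // Shutdown path: set the flag, then send one byte over loopback to
    // unblock the receiver so it can notice the flag and exit.
    shutting_down.store(true);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    const char wake = 'w';
    bool sent = tx >= 0 &&
        sendto(tx, &wake, 1, 0, reinterpret_cast<sockaddr*>(&addr),
               sizeof addr) == 1;

    receiver.join();  // prompt if the wake-up arrived, else after the timeout
    if (tx >= 0) close(tx);
    close(rx);
    return sent;
}
```

If the `sendto()` in this sketch is dropped and there is no timeout, `join()` never returns, which is the kind of hang described above.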
So, can you tell me a bit more about your system? Do you have a firewall running?
Thank you!
Rose
I get the same behavior with the firewall off. We are currently in the process of a redesign moving towards more portable code. The same pattern works in the old code.
I should note that delete_contained_entities returns DDS_RETCODE_OK. If I do not call delete_contained_entities, then delete_participant returns DDS_RETCODE_ERROR and finalize_instance returns DDS_RETCODE_PRECONDITION_NOT_MET, which seems to make sense. Thus I suspect DDS does not get cleaned up, but at least the application exits.
Hello gandriotakis,
I checked, and we have a known issue on certain versions of Linux where the delete_participant call may hang forever. I am checking whether there is a workaround.
Edit: I just checked the details of the bug report that I mentioned, and it looks like it is on systems where that last shutdown packet doesn't get through to the DataReader (such as being blocked by a firewall). I doubt that this problem is due to your code, but I am not sure why it is happening in 4.5f but not in the previous version. What was the previous version you were using? What OS are you running on?
Thank you!
Rose
Hi Rose,
We are running 4.5f on Windows 7. The point of the exercise is to move to Linux. It worked with 4.5f with our previous code but not with the current code, which has been redesigned to not use the Windows message pump.
Hello gandriotakis,
We have only seen this problem on Linux. But two more questions for you: Do you have multiple DomainParticipants in your application? Is multicast enabled in your application?
We have reproduced this problem on Linux, but only if there are two DomainParticipants in an application, and if you delete them in a different order than they were created.
One other thing worth exploring: It would be helpful to see which threads are still running and preventing the call from completing. Do you have Process Explorer? If you do, it will show you more details about which threads are running.
Thank you!
Rose
Hello Rose,
We have only one domain participant. Since multicast is not explicitly disabled, I am assuming that it is enabled. Process Explorer shows only one thread associated with the task (WinMainCRTStartup).
Hello gandriotakis,
It would be very strange if the other threads shut down correctly, but the call is still hanging. Can you increase the verbosity of DDS? That will slow down your execution time, but it might give me a hint where it's hanging.
Also, can you attach the QoS configuration you are using? One other place that can have problems is if you set the DomainParticipant's shutdown_cleanup_period to 0. This usually only causes problems on RTOSes, though, so it is a long shot.
Thank you!
Rose
I have attached my QoS and a verbose log.
Hello gandriotakis,
As a test, can you comment out this property?
Thank you!
Rose
That did not help.
There is nothing else in the QoS that looks like it could be responsible.
(There is a mismatch in some of the profiles between the UDP sizes that you might want to change: 65530 vs. 65535. That is unlikely to be related to this, though.)
Can you build a debug version, open the application in Visual Studio, and double-check which threads are running when it hangs? I believe you might be able to get our thread stack traces that way. (I am still reading through the log file, but in retrospect the stack trace may be faster).
Thank you!
Rose
Sorry for the delay but I got diverted. When the app hangs the call stack is:
> ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
KernelBase.dll!_Sleep@4() + 0xf bytes
nddscore.dll!00baf405()
[Frames below may be incorrect and/or missing, no symbols loaded for nddscore.dll]
nddscore.dll!00b1c806()
nddscore.dll!00a131cb()
nddsc.dll!0083100a()
nddsc.dll!0080e422()
The delay isn't a problem – glad to know this isn't holding you up too badly.
Can you reproduce the same problem when building against our debug libraries? I'm hoping that gives a more detailed stack trace.
Thank you!
Rose
ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
KernelBase.dll!_Sleep@4() + 0xf bytes
nddscored.dll!021c46eb()
[Frames below may be incorrect and/or missing, no symbols loaded for nddscored.dll]
nddscored.dll!02165715()
nddscored.dll!0216590a()
nddscored.dll!01fac103()
nddscored.dll!021b0481()
nddscored.dll!021b062c()
nddscored.dll!021b0213()
nddscored.dll!021b3b0e()
nddscored.dll!01e71e0c()
nddscd.dll!018b7e05()
nddscppd.dll!0158b242()
nddscppd.dll!015c2d6e()
nddscppd.dll!01586c6b()
nddscppd.dll!01554af0()
nddscppd.dll!0155d5b5()
> TcmIpc.dll!DDSInterface::cleanup() Line 320 + 0x1e bytes C++
TcmIpc.dll!DDSInterface::~DDSInterface() Line 260 C++
TcmIpc.dll!DDSInterface::`vector deleting destructor'() + 0x57 bytes C++
TcmIpc.dll!Poco::SingletonHolder<DDSInterface>::~SingletonHolder<DDSInterface>() Line 69 + 0x24 bytes C++
TcmIpc.dll!`DDSInterface::Instance'::`2'::`dynamic atexit destructor for 'sh''() + 0xd bytes C++
TcmIpc.dll!_CRT_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 415 C
TcmIpc.dll!__DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 526 + 0x11 bytes C
TcmIpc.dll!_DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 476 + 0x11 bytes C
ntdll.dll!_LdrpCallInitRoutine@16() + 0x14 bytes
ntdll.dll!_LdrShutdownProcess@0() + 0x141 bytes
ntdll.dll!_RtlExitUserProcess@4() + 0x74 bytes
kernel32.dll!754a79c5()
msvcr100d.dll!___crtExitProcess() + 0x1b bytes
msvcr100d.dll!___freeCrtMemory() + 0x317 bytes
msvcr100d.dll!_exit() + 0x12 bytes
TCMHostAdapterApp.exe!__tmainCRTStartup() Line 568 C
TCMHostAdapterApp.exe!wmainCRTStartup() Line 371 C
kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes
MODIFIED: I had not noticed you had already attached the verbosity output to a previous posting. Please ignore the request below...
Hi,
We are still shooting in the dark. It seems that, for some reason, when the operating system dumps the stack it is not able to find the symbols for nddscppd.dll.
Maybe you can call the operation:
to enable verbose output before calling participant->delete_contained_entities() or DDSTheParticipantFactory->delete_participant(). That may shed some light on what the RTI DDS core is doing when it hangs.
Gerardo
Here is the stack dump after I hooked in the symbols
ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
KernelBase.dll!_Sleep@4() + 0xf bytes
nddscored.dll!RTIOsapiThread_sleep(const RTINtpTime * timeIn) Line 696 + 0xc bytes C
nddscored.dll!COMMENDActiveFacade_preShutdownWakeup(COMMENDActiveFacade * me, REDAWorker * worker) Line 929 + 0xe bytes C
nddscored.dll!PRESParticipant_preShutdownWakeup(PRESParticipant * me, int * failReason, REDAWorker * worker) Line 3259 + 0x1f bytes C
nddscd.dll!DDS_DomainParticipantPresentation_wakeup(DDS_DomainParticipantPresentation * self, REDAWorker * worker) Line 324 + 0x11 bytes C
nddscd.dll!DDS_DomainParticipant_destroyI(DDS_DomainParticipantImpl * ddsParticipant) Line 9129 + 0x10 bytes C
nddscd.dll!DDS_DomainParticipantFactory_delete_participant(DDS_DomainParticipantFactoryImpl * self, DDS_DomainParticipantImpl * a_participant) Line 1690 + 0x9 bytes C
nddscppd.dll!DDSDomainParticipant_impl::destroyI(DDSDomainParticipant_impl * ddsParticipant) Line 1080 + 0x49 bytes C++
nddscppd.dll!DDSDomainParticipantFactory_impl::delete_participant(DDSDomainParticipant * a_participant) Line 772 + 0x9 bytes C++
> TcmIpc.dll!DDSInterface::cleanup() Line 318 + 0x1e bytes C++
TcmIpc.dll!DDSInterface::~DDSInterface() Line 258 C++
TcmIpc.dll!DDSInterface::`vector deleting destructor'() + 0x57 bytes C++
TcmIpc.dll!Poco::SingletonHolder<DDSInterface>::~SingletonHolder<DDSInterface>() Line 69 + 0x24 bytes C++
TcmIpc.dll!`DDSInterface::Instance'::`2'::`dynamic atexit destructor for 'sh''() + 0xd bytes C++
TcmIpc.dll!_CRT_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 415 C
TcmIpc.dll!__DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 526 + 0x11 bytes C
TcmIpc.dll!_DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 476 + 0x11 bytes C
ntdll.dll!_LdrpCallInitRoutine@16() + 0x14 bytes
ntdll.dll!_LdrShutdownProcess@0() + 0x141 bytes
ntdll.dll!_RtlExitUserProcess@4() + 0x74 bytes
kernel32.dll!754a79c5()
msvcr100d.dll!___crtExitProcess() + 0x1b bytes
msvcr100d.dll!___freeCrtMemory() + 0x317 bytes
msvcr100d.dll!_exit() + 0x12 bytes
TCMHostAdapterApp.exe!__tmainCRTStartup() Line 568 C
TCMHostAdapterApp.exe!wmainCRTStartup() Line 371 C
kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes
Thank you for the additional details!
That stack trace says that we are definitely waiting for a thread to shut down. It's still not 100% certain that it's a receive thread, but in combination with the error messages in the log file you sent, I am willing to make a strong guess that that is the reason why the main thread never shuts down.
Here is what I see: At about the time the application is calling delete_contained_entities(), error messages in your log file indicate that the WSASendTo call gets error 10093, WSANOTINITIALISED.
[D0101|DELETE_CONTAINED]NDDS_Transport_UDPv4_send:OS WSASendTo() failure, error 0X276D
[D0101|DELETE_CONTAINED]NDDS_Transport_UDPv4_send:U0000136c sent 88 bytes to 0X100007F:7660
This error is being printed as DDS is sending the final messages at shutdown. We send these final messages for two reasons:
Given the error messages, there is a very good chance that the wake-up data is not being received, causing the main thread to wait forever for the receive threads to unblock and shut down cleanly.
Looking closely at this WSANOTINITIALISED error, it looks like this error could happen in two cases:
I just took a look through our code, and if you are using the default UDP transport we should not be calling WSACleanup() at all in 4.5f. Is it possible that your application is calling WSACleanup() somewhere, or using another library that calls WSACleanup()? (Also, if there is anything else you can think of that might cause a network disconnection during the shutdown process, let me know.)
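As a cross-check, the two values in those log lines can be decoded by hand. The sketch below is only an illustration: `ipv4_from_log_value` and `check_log_values` are hypothetical helpers, and it assumes the log prints the IPv4 address as the raw network-order value read on a little-endian machine.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Documented numeric value of the Winsock error WSANOTINITIALISED.
constexpr int kWsaNotInitialised = 10093;

// Formats a 32-bit value as a dotted quad, taking the first octet from the
// lowest byte (assumption: network-order bytes read as a little-endian int).
std::string ipv4_from_log_value(std::uint32_t v) {
    return std::to_string(v & 0xFF) + "." +
           std::to_string((v >> 8) & 0xFF) + "." +
           std::to_string((v >> 16) & 0xFF) + "." +
           std::to_string((v >> 24) & 0xFF);
}

// Sanity checks for the two values quoted from the log.
void check_log_values() {
    assert(0x276D == kWsaNotInitialised);
    assert(ipv4_from_log_value(0x100007F) == "127.0.0.1");
}
```

So 0X276D is 10093 (WSANOTINITIALISED), and under that byte-order assumption 0X100007F:7660 is the loopback address 127.0.0.1, port 7660.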
Thank you!
Rose
Is there anything I can wait for that will let me know when you are done with the connection?
After all DomainParticipants are deleted in your application, we don't need to use the network anymore. So really, just waiting for the delete_participant() call (or calls) to complete should be enough. You don't have to finalize the DomainParticipantFactory before making a WSACleanup() call.
If for some reason you call WSACleanup() and then need to create/enable a new DomainParticipant, it calls the WSAStartup() API again, which should be safe.
Thank you!
Rose
We are using a third-party library that I suspect calls WSACleanup when cleaning up a static variable on exit. This appears to be happening on exit before DDS is done cleaning up. Ideally there would be some indication that DDS cleanup was complete before attempting to finish exiting (e.g. some event I could wait for).
The only indication that we have successfully shut down is when all delete_participant() calls complete and return a return code saying the calls were successful. It should be possible to build logic around this to signal the other code that DDS has finished cleaning up, but there is no callback mechanism for that.
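A rough sketch of what such logic could look like follows. Everything in it is hypothetical: `ReturnCode` stands in for DDS_ReturnCode_t, `ShutdownNotifier` is not part of the DDS API, and the delete step is passed in as a callable so the sketch does not depend on the DDS headers.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Stand-in for DDS_ReturnCode_t; only the success/failure split matters here.
enum class ReturnCode { Ok, Error };

// Hypothetical helper: lets other code register interest in "DDS is done".
class ShutdownNotifier {
public:
    // Registers a callback to run once the delete step has succeeded.
    void on_complete(std::function<void()> cb) {
        assert(cb && "callback must not be empty");
        callbacks_.push_back(std::move(cb));
    }

    // Runs the supplied delete step (in real code, a wrapper around
    // delete_participant) and notifies listeners only if it reports success.
    bool run(const std::function<ReturnCode()>& delete_step) {
        if (delete_step() != ReturnCode::Ok) return false;
        for (auto& cb : callbacks_) cb();
        return true;
    }

private:
    std::vector<std::function<void()>> callbacks_;
};
```

Note that this only helps when the delete call actually returns; it cannot signal anything while the call is still blocked.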
Thank you!
Rose
That will not help because the call to delete_participant never returns, which gets us back to the original problem. I am not sure WSACleanup is the problem.
As I understand it, the expectation is that every WSAStartup must be matched with a WSACleanup, and only when all startups have been cleaned up will the actual cleanup take place.
I added an extraneous/unmatched call to WSAStartup that should have incremented the reference count so that WSACleanup never actually cleaned up but delete_participant is still hanging.
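The reference-counting behavior described here can be modeled with a toy class. It does not call Winsock at all; `WinsockModel` and the function below are invented purely to illustrate the rule that each WSAStartup must be balanced by a WSACleanup and that only the last WSACleanup releases the library.

```cpp
#include <cassert>

// Toy model of the reference count; not the real Winsock API.
class WinsockModel {
public:
    void startup() { ++refs_; }

    // Returns false when there is nothing left to clean up (the real call
    // would fail in that situation).
    bool cleanup() {
        if (refs_ == 0) return false;
        --refs_;
        return true;
    }

    bool usable() const { return refs_ > 0; }

private:
    int refs_ = 0;
};

// Models the experiment: DDS starts Winsock, one extra unmatched startup is
// added, and a third-party library then calls cleanup once at exit.
bool winsock_survives_third_party_cleanup() {
    WinsockModel ws;
    ws.startup();            // DDS transport
    ws.startup();            // extra, unmatched call added as an experiment
    bool ok = ws.cleanup();  // third-party library at exit
    assert(ok);
    return ws.usable();      // true: one reference is still outstanding
}
```

In this model the extra startup keeps the library usable after one stray cleanup, which is consistent with the hang having some other cause.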
I have moved the cleanup to be explicitly called outside of the destructor before exiting. This seems to have worked around the issue. Thank you for your help.
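The workaround can be sketched as follows. In the stack traces above, cleanup() ran from a static destructor invoked while the DLL was being shut down during process exit; the sketch moves that work into an explicit call made before the application returns. `DdsInterfaceSketch` is a hypothetical stand-in for the real DDSInterface class, and the two logged strings only mark where the real delete_contained_entities() and delete_participant() calls would go.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the DDSInterface singleton in the stack trace.
class DdsInterfaceSketch {
public:
    explicit DdsInterfaceSketch(std::vector<std::string>& log) : log_(log) {}

    // Explicit teardown, meant to be called before the process starts exiting.
    void cleanup() {
        if (cleaned_) return;  // safe to call more than once
        log_.push_back("delete_contained_entities");
        log_.push_back("delete_participant");
        cleaned_ = true;
    }

    // The destructor no longer performs DDS teardown; in debug builds it
    // only checks that cleanup() was already called.
    ~DdsInterfaceSketch() { assert(cleaned_ && "call cleanup() before exit"); }

private:
    std::vector<std::string>& log_;
    bool cleaned_ = false;
};

// Usage: do the application work, then clean up explicitly before returning.
std::vector<std::string> run_application() {
    std::vector<std::string> log;
    DdsInterfaceSketch dds(log);
    log.push_back("application work");
    dds.cleanup();
    return log;
}
```

One plausible explanation, not confirmed in this thread, is that Windows terminates the other threads of a process before DLL detach handlers and their static destructors run, so the threads that delete_participant() was waiting on no longer existed by then.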
That's great that it solved the problem!
Thank you,
Rose