DDSTheParticipantFactory->delete_participant hangs

Mon, 02/17/2014 - 11:00

#2

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

Hello gandriotakis,

One possible culprit for this:

Our receive threads are blocked on sockets, and they are woken up when the delete_participant() call sends a message over loopback to each socket our receive threads are blocking on. This allows the thread to wake up, realize that it's shutting down, and clean itself up.

In some cases, this message doesn't get received – especially when a firewall is running – and one or more receive threads fail to wake up to be shut down.

So, can you tell me a bit more about your system? Do you have a firewall running?

Thank you!

Rose

Tue, 02/18/2014 - 05:45

#3

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

I get the same behavior with the firewall off. We are currently in the process of a redesign moving towards more portable code. The same pattern works in the old code.

I should note that delete_contained_entities returns DDS_RETCODE_OK. If I do not call delete_contained_entities, then delete_participant returns DDS_RETCODE_ERROR and finalize_instance returns DDS_RETCODE_PRECONDITION_NOT_MET, which seems to make sense. Thus I suspect DDS does not get cleaned up, but at least the application exits.

Wed, 02/19/2014 - 11:33

#4

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

Hello gandriotakis,

I checked, and we have a known issue on certain versions of Linux where the delete_participant call may hang forever. I am checking whether there is a workaround.

Edit: I just checked the details of the bug report that I mentioned, and it looks like it is on systems where that last shutdown packet doesn't get through to the DataReader (such as being blocked by a firewall). I doubt that this problem is due to your code, but I am not sure why it is happening in 4.5f but not in the previous version. What was the previous version you were using? What OS are you running on?

Thank you!

Rose

Thu, 02/20/2014 - 12:39

#5

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

Hi Rose,

We are running 4.5f on Windows 7. The point of the exercise is to move to Linux. It worked with 4.5f with our previous code, but not with the current code, which has been redesigned to not use the Windows message pump.

Mon, 02/24/2014 - 14:57

#6

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

Hello gandriotakis,

We have only seen this problem on Linux. But two more questions for you: Do you have multiple DomainParticipants in your application? Is multicast enabled in your application?

We have reproduced this problem on Linux, but only if there are two DomainParticipants in an application, and if you delete them in a different order than they were created.

One other thing worth exploring: It would be helpful to see which threads are still running and preventing the call from completing. Do you have Process Explorer? If you do, it will show you more details about which threads are running.

Thank you!

Rose

Tue, 02/25/2014 - 14:52

#7

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

Hello Rose,

We have only one domain participant. Since multicast is not explicitly disabled, I am assuming that it is enabled. Process Explorer shows only one thread associated with the task (WinMainCRTStartup).

Wed, 02/26/2014 - 09:53

#8

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

Hello gandriotakis,

It would be very strange if the other threads shut down correctly, but the call is still hanging. Can you increase the verbosity of DDS? That will slow down your execution time, but it might give me a hint where it's hanging.

NDDSConfigLogger::get_instance()->set_output_file( FILE *yourfile );
NDDSConfigLogger::get_instance()->set_verbosity( NDDS_CONFIG_LOG_VERBOSITY_STATUS_ALL );

Also, can you attach the QoS configuration you are using? One other place that can have problems is if you set the DomainParticipant's shutdown_cleanup_period to 0. This usually only causes problems on RTOSes, though, so it is a long shot.

Thank you!

Rose

Fri, 02/28/2014 - 13:29

#9

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

I have attached my QoS and a verbose log.

File Attachments:

user_qos_profiles.xml

verbose.txt

Sun, 03/02/2014 - 15:17

#10

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

Hello gandriotakis,

As a test, can you comment out this property?

           <element>
                   <name>dds.transport.UDPv4.builtin.ignore_loopback_interface</name> 
                   <value>0</value> 
           </element>

Thank you!

Rose

Mon, 03/03/2014 - 08:53

#11

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

That did not help.

Wed, 03/05/2014 - 07:00

#12

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

There is nothing else in the QoS that looks like it could be responsible.

(There is a mismatch in some of the profiles between the UDP sizes that you might want to change: 65530 vs. 65535. That is unlikely to be related to this, though.)

Can you build a debug version, open the application in Visual Studio, and double-check which threads are running when it hangs? I believe you might be able to get our thread stack traces that way. (I am still reading through the log file, but in retrospect the stack trace may be faster).

Thank you!

Rose

Mon, 03/17/2014 - 05:04

#13

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

Sorry for the delay but I got diverted. When the app hangs the call stack is:

> ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
  ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
  KernelBase.dll!_Sleep@4() + 0xf bytes
  nddscore.dll!00baf405()
  [Frames below may be incorrect and/or missing, no symbols loaded for nddscore.dll]
  nddscore.dll!00b1c806()
  nddscore.dll!00a131cb()
  nddsc.dll!0083100a()
  nddsc.dll!0080e422()

Mon, 03/17/2014 - 09:55

#14

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

The delay isn't a problem – glad to know this isn't holding you up too badly.

Can you reproduce the same problem when building against our debug libraries? I'm hoping that gives a more detailed stack trace.

Thank you!

Rose

Tue, 03/18/2014 - 06:35

#15

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

  ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
  ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
  KernelBase.dll!_Sleep@4() + 0xf bytes
  nddscored.dll!021c46eb()
  [Frames below may be incorrect and/or missing, no symbols loaded for nddscored.dll]
  nddscored.dll!02165715()
  nddscored.dll!0216590a()
  nddscored.dll!01fac103()
  nddscored.dll!021b0481()
  nddscored.dll!021b062c()
  nddscored.dll!021b0213()
  nddscored.dll!021b3b0e()
  nddscored.dll!01e71e0c()
  nddscd.dll!018b7e05()
  nddscppd.dll!0158b242()
  nddscppd.dll!015c2d6e()
  nddscppd.dll!01586c6b()
  nddscppd.dll!01554af0()
  nddscppd.dll!0155d5b5()
> TcmIpc.dll!DDSInterface::cleanup() Line 320 + 0x1e bytes C++
  TcmIpc.dll!DDSInterface::~DDSInterface() Line 260 C++
  TcmIpc.dll!DDSInterface::`vector deleting destructor'() + 0x57 bytes C++
  TcmIpc.dll!Poco::SingletonHolder<DDSInterface>::~SingletonHolder<DDSInterface>() Line 69 + 0x24 bytes C++
  TcmIpc.dll!`DDSInterface::Instance'::`2'::`dynamic atexit destructor for 'sh''() + 0xd bytes C++
  TcmIpc.dll!_CRT_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 415 C
  TcmIpc.dll!__DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 526 + 0x11 bytes C
  TcmIpc.dll!_DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 476 + 0x11 bytes C
  ntdll.dll!_LdrpCallInitRoutine@16() + 0x14 bytes
  ntdll.dll!_LdrShutdownProcess@0() + 0x141 bytes
  ntdll.dll!_RtlExitUserProcess@4() + 0x74 bytes
  kernel32.dll!754a79c5()
  msvcr100d.dll!___crtExitProcess() + 0x1b bytes
  msvcr100d.dll!___freeCrtMemory() + 0x317 bytes
  msvcr100d.dll!_exit() + 0x12 bytes
  TCMHostAdapterApp.exe!__tmainCRTStartup() Line 568 C
  TCMHostAdapterApp.exe!wmainCRTStartup() Line 371 C
  kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
  ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
  ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes

Tue, 03/18/2014 - 23:42

#16

Gerardo Pardo

Offline

Last seen: 7 months 1 week ago

Joined: 06/02/2010

Posts: 603

MODIFIED: I had not noticed you had already attached the verbosity output to a previous posting. Please ignore the request below...

Hi,

We are still shooting in the dark. It seems that, for some reason, when the operating system dumps the stack it is not able to find the symbols for nddscppd.dll.

Maybe you can call the operation:

 NDDSConfigLogger::get_instance()->set_verbosity(NDDS_CONFIG_LOG_VERBOSITY_STATUS_ALL);

to enable verbose output before calling participant->delete_contained_entities() or DDSTheParticipantFactory->delete_participant(). That may shed some light on what the RTIDDS core is doing when it hangs.

Gerardo

Wed, 03/19/2014 - 05:17

#17

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

Here is the stack dump after I hooked in the symbols

  ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
  ntdll.dll!_NtDelayExecution@8() + 0x15 bytes
  KernelBase.dll!_Sleep@4() + 0xf bytes
  nddscored.dll!RTIOsapiThread_sleep(const RTINtpTime * timeIn) Line 696 + 0xc bytes C
  nddscored.dll!COMMENDActiveFacade_preShutdownWakeup(COMMENDActiveFacade * me, REDAWorker * worker) Line 929 + 0xe bytes C
  nddscored.dll!PRESParticipant_preShutdownWakeup(PRESParticipant * me, int * failReason, REDAWorker * worker) Line 3259 + 0x1f bytes C
  nddscd.dll!DDS_DomainParticipantPresentation_wakeup(DDS_DomainParticipantPresentation * self, REDAWorker * worker) Line 324 + 0x11 bytes C
  nddscd.dll!DDS_DomainParticipant_destroyI(DDS_DomainParticipantImpl * ddsParticipant) Line 9129 + 0x10 bytes C
  nddscd.dll!DDS_DomainParticipantFactory_delete_participant(DDS_DomainParticipantFactoryImpl * self, DDS_DomainParticipantImpl * a_participant) Line 1690 + 0x9 bytes C
  nddscppd.dll!DDSDomainParticipant_impl::destroyI(DDSDomainParticipant_impl * ddsParticipant) Line 1080 + 0x49 bytes C++
  nddscppd.dll!DDSDomainParticipantFactory_impl::delete_participant(DDSDomainParticipant * a_participant) Line 772 + 0x9 bytes C++
> TcmIpc.dll!DDSInterface::cleanup() Line 318 + 0x1e bytes C++
  TcmIpc.dll!DDSInterface::~DDSInterface() Line 258 C++
  TcmIpc.dll!DDSInterface::`vector deleting destructor'() + 0x57 bytes C++
  TcmIpc.dll!Poco::SingletonHolder<DDSInterface>::~SingletonHolder<DDSInterface>() Line 69 + 0x24 bytes C++
  TcmIpc.dll!`DDSInterface::Instance'::`2'::`dynamic atexit destructor for 'sh''() + 0xd bytes C++
  TcmIpc.dll!_CRT_INIT(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 415 C
  TcmIpc.dll!__DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 526 + 0x11 bytes C
  TcmIpc.dll!_DllMainCRTStartup(void * hDllHandle, unsigned long dwReason, void * lpreserved) Line 476 + 0x11 bytes C
  ntdll.dll!_LdrpCallInitRoutine@16() + 0x14 bytes
  ntdll.dll!_LdrShutdownProcess@0() + 0x141 bytes
  ntdll.dll!_RtlExitUserProcess@4() + 0x74 bytes
  kernel32.dll!754a79c5()
  msvcr100d.dll!___crtExitProcess() + 0x1b bytes
  msvcr100d.dll!___freeCrtMemory() + 0x317 bytes
  msvcr100d.dll!_exit() + 0x12 bytes
  TCMHostAdapterApp.exe!__tmainCRTStartup() Line 568 C
  TCMHostAdapterApp.exe!wmainCRTStartup() Line 371 C
  kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
  ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
  ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes

Wed, 03/19/2014 - 15:25

#18

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

Thank you for the additional details!

That stack trace says that we are definitely waiting for a thread to shut down. It's still not 100% certain that it's a receive thread, but in combination with the error messages in the log file you sent, I am willing to make a strong guess that that is the reason why the main thread never shuts down.

Here is what I see: At about the time the application is calling delete_contained_entities(), error messages in your log file indicate that the WSASendTo call gets error 10093, WSANOTINITIALISED.

[D0101|DELETE_CONTAINED]NDDS_Transport_UDPv4_send:OS WSASendTo() failure, error 0X276D

[D0101|DELETE_CONTAINED]NDDS_Transport_UDPv4_send:U0000136c sent 88 bytes to 0X100007F:7660

This error is being printed as DDS is sending the final messages at shutdown. We send these final messages for two reasons:

To notify other applications about reader/writer deletions.
To wake up our own receive threads for a clean shutdown

Given the error messages, there is a very good chance that the wake-up data is not being received, causing the main thread to wait forever for the receive threads to unblock and shut down cleanly.

Looking closely at this WSANOTINITIALISED error, it looks like this error could happen in two cases:

if we didn't call WSAStartup (in which case we would see this error at the beginning of the file, too), or
if somebody called WSACleanup() before these final messages are sent.

I just took a look through our code, and if you are using the default UDP transport we should not be calling WSACleanup() at all in 4.5f. Is it possible that your application is calling WSACleanup() somewhere, or using another library that calls WSACleanup()? (Also, if there is anything else you can think of that might cause a network disconnection during the shutdown process, let me know.)

Thank you!

Rose

Thu, 03/20/2014 - 04:41

#19

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

Is there anything I can wait for that will let me know when you are done with the connection?

Thu, 03/20/2014 - 09:33

#20

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

After all DomainParticipants are deleted in your application, we don't need to use the network anymore. So really, just waiting for the delete_participant() call (or calls) to complete should be enough. You don't have to finalize the DomainParticipantFactory before making a WSACleanup() call.

If for some reason you call WSACleanup() and then need to create/enable a new DomainParticipant, it calls the WSAStartup() API again, which should be safe.

Thank you!

Rose

Thu, 03/20/2014 - 10:29

#21

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

We are using a third-party library that I suspect calls WSACleanup when cleaning up a static variable on exit. This appears to be happening on exit before DDS is done cleaning up. Ideally there would be some indication that DDS cleanup was complete before attempting to finish exiting (e.g. some event I could wait for).

Thu, 03/20/2014 - 11:26

#22

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

The only indication that we have successfully shut down is when all delete_participant() calls complete and return a return code saying the calls were successful. It should be possible to build logic around this to signal the other code that DDS has finished cleaning up, but there is no callback mechanism for that.

Thank you!

Rose

Thu, 03/20/2014 - 11:54

#23

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

That will not help because the call to delete_participant never returns which gets us back to the original problem. I am not sure WSACleanup is the problem.

As I understand it the expectation is that every WSAStartup must be matched with a WSACleanup and only when all start ups have been cleaned will the actual cleanup take place.

I added an extraneous/unmatched call to WSAStartup that should have incremented the reference count so that WSACleanup never actually cleaned up but delete_participant is still hanging.

Thu, 03/20/2014 - 12:44

#24

gandriotakis

Offline

Last seen: 6 years 7 months ago

Joined: 02/14/2014

Posts: 41

I have moved the cleanup to be explicitly called outside of the destructor before exiting. This seems to have worked around the issue. Thank you for your help.

Thu, 03/20/2014 - 16:28

#25

rose

Offline

Last seen: 4 years 5 months ago

Joined: 08/22/2011

Posts: 148

That's great that it solved the problem!

Thank you,

Rose
