Creating a Domain Participant fails in worker thread.

Offline
Last seen: 4 years 3 weeks ago
Joined: 02/17/2016
Posts: 21

Hi. I am using the RTI Connext C++ API, version 5.1.0, with x64Linux2.6gcc4.1.1.

I have an app that dynamically creates Domain Participants, Topics, Publishers, and Subscribers as needed and stores the pointers to these objects in maps.

If I create my domain participants up front in the Main Management Object constructor at startup, everything works fine.

If I create my domain participants dynamically (that is, wait until the first item for a domain needs to be published), I get a segfault inside the creation of the participant. Currently I am not configuring QoS and am using the defaults. I don't think this is the problem, since it works fine if I move the creation calls. I have also used the News app example with no QoS, and things work fine. The only difference I can think of is that I am moving the call to a worker thread. It is the only active thread in my app, though.

Here is the stack trace.

Thread [5] 1296 (Suspended : Signal : SIGSEGV:Segmentation fault)
do_lookup_x() at 0x38c7e09891
_dl_lookup_symbol_x() at 0x38c7e09f0a
_dl_fixup() at 0x38c7e0e0c0
_dl_runtime_resolve() at 0x38c7e148f5
RTIOsapiSemaphore_new() at 0xddac45
RTIEventSmartTimer_new() at 0xd8f4d1
DDS_DomainParticipantDatabase_initialize() at 0xb6cfab
DDS_DomainParticipant_createI() at 0x98efb6
DDS_DomainParticipantFactory_create_participant_disabledI() at 0x96d7e3
DDSDomainParticipant_impl::create_disabledI() at 0x8d49f9
DDSDomainParticipant_impl::createI() at 0x8d4ca7
DDSDomainParticipantFactory_impl::create_participant() at 0x8ce706


Here is the std out attached.  It wouldn't let me paste it in here.  

Gerardo Pardo's picture
Offline
Last seen: 19 hours 36 min ago
Joined: 06/02/2010
Posts: 601

Hello,

This would appear to be an error in the dynamic linking. It happens before DDS has a chance to do much. The segfault occurs during the creation of a semaphore, when the dynamic linker is trying to look up that function from glibc.

Searching on Google, I found some people speculating that it can be caused by heap corruption elsewhere in the program or by a misconfigured glibc: http://stackoverflow.com/questions/10578315/segmentation-fault-in-dl-runtime-resolve

How are you configuring your compiler and load library path? Any chance that you may be getting a wrong version of glibc?

This other link provides some options for forcing a pre-loading and linking of the dynamically-loaded symbols: http://stackoverflow.com/questions/3049445/dl-runtime-resolve-when-do-the-shared-objects-get-loaded-in-to-memory

Gerardo

 


I am using the static DDS libraries. I currently do not have LD_LIBRARY_PATH set. I am using the -static-libgcc flag at compile time as well. I tried the LD_LOAD_NOW variable, but that didn't seem to help. I would also figure that if my issue were happening at link time, I wouldn't be able to just move a block of code somewhere else and get it to work.

This example I am running is just creating one participant as well.

Gerardo Pardo's picture

Hello Davey,

Yes, I do not have a good explanation for why the problem happens when you do it from a different thread. I am shooting in the dark a bit here...

Can you provide here your full link command so that we can see all the parameters to the linker as well as the Connext DDS and system libraries being linked?

Thanks!

Gerardo


The stack is a little different if I switch to the dynamic RTI libraries:

 

#0 0x00000038c7e09290 in check_match.12445 () from /lib64/ld-linux-x86-64.so.2
#1 0x00000038c7e09c82 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#2 0x00000038c7e09f0a in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#3 0x00000038c7e0e0c0 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#4 0x00000038c7e148f5 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#5 0x00007f095a2d0cb0 in DDS_DomainParticipantConcurrency_initialize (self=0x7f095abf52d0,
workerFactory=0x7f095006ec20, qos=0x7f095abf7668) at DomainParticipantConcurrency.c:248
#6 0x00007f095a307fe7 in DDS_DomainParticipant_createI (factory=0xb253f0, domain_id=1, qos=0x7f095abf6c60,
listener=0x0, mask=0, sharedState=0x7f095a874b60, threadListener=0x0, db_thread_factory=0x0,
recv_thread_factory=0x0, event_thread_factory=0x0, asynch_pub_thread_factory=0x0) at DomainParticipant.c:8767
#7 0x00007f095a2df075 in DDS_DomainParticipantFactory_create_participant_disabledI (self=0xb253f0,
need_enable=0x7f095abf78c7 "", domainId=1, qos=0x7f095abf6c60, listener=0x0, mask=0, db_thread_factory=0x0,
recv_thread_factory=0x0, event_thread_factory=0x0, asynch_pub_thread_factory=0x0, register_builtin_types=0 '\000',
register_sql_filter=1 '\001') at DomainParticipantFactory.c:1950
#8 0x00007f095a95a1a3 in DDSDomainParticipant_impl::create_disabledI (owner=0xb334a0, needEnable=0x7f095abf78c7 "",
registerBuiltinTypes=1 '\001', domainId=1, qos=..., listener=0x0, mask=0) at DomainParticipant.cxx:921
#9 0x00007f095a95a5bd in DDSDomainParticipant_impl::createI (owner=0xb334a0, domainId=1, qos=..., listener=0x0,
mask=0) at DomainParticipant.cxx:1021
#10 0x00007f095a951903 in DDSDomainParticipantFactory_impl::create_participant (this=0xb334a0, domainId=1, qos=...,
listener=0x0, mask=0) at DomainParticipantFactory.cxx:710


Here is my link command with the new dynamic library config.

/usr/bin/c++   -g   -Wall -pedantic -std=c++98 -Wextra -Wmissing-declarations -Wcast-align -Wcast-qual -Wconversion    -m64 CMakeFiles/hoplite.dir/src/hoplite.cpp.o  -o ../../bin/mybinary  -L/opt/share/RTI/ndds.5.1.0/lib/x64Linux2.6gcc4.1.1 -rdynamic -lpthread ../../lib/libBinext.a ../../lib/libBincore.a ../../lib/libBintarget.a ../../lib/libapc.a -L/opt/share/RTI/ndds.5.1.0/lib/x64Linux2.6gcc4.1.1 -lnddscppd -lnddscd -lnddscored -ldl -lnsl -lm -lpthread -lrt -lrt -Wl,-rpath,/opt/share/RTI/ndds.5.1.0/lib/x64Linux2.6gcc4.1.1 

Thank you for looking into this with me. I have been using DDS for years and love the product. I am pulling my hair out on this one, though. This is supposed to be the easy stuff. The .a files referenced here are part of my software suite.

Running ldd shows correct paths to everything as well.

 


The problem sounds very similar to the following thread...

http://stackoverflow.com/questions/10430624/replace-a-dynamic-shared-library-in-run-time

The only problem with that theory, though, is that these problems would be happening deep within the DDS library. Also, regardless of whether I use static or dynamic libraries, I cannot link without the -ldl flag.

 

Gerardo Pardo's picture

Hi,

This may have nothing to do with it, but I notice that you are linking -lpthread twice. I do not immediately see how this would cause a problem given that you are not modifying the load-path in between, but just in case, I would try removing the first one.

Also, I do not see the -static-libgcc in your linker line. AFAIK this is a linker flag, not a compiler flag.

Did you paste the correct link to the stack overflow thread? I am not seeing the similarity... 

Gerardo 



I thought -static-libgcc was only required when building with the static RTI libraries. The Stack Overflow thread is about using the dynamic loader to load and unload libraries at run time, and apparently that is something RTI uses in some of its packages, because -ldl is required at link time. The thread also describes a similar segfault behaviour, rather than a bad malloc or some other type of issue.


So I just downloaded the latest copy of Connext 5.2 and compiled against that instead. My stack trace was a little different.

#0 0x0000000000cfb73b in RTILMUtil_des_expand_keyI (userKey=0x0, ks=0x0) at Des.c:317
#1 0x0000000000c77f96 in NDDS_LM_validate (p_outParams=0x7f84b42b3380, p_inOutParams=0x7f84b42b33d0,
p_inParams=0x7f84b42b3220) at Util.c:726
#2 0x00000000009ce0cb in DDS_DomainParticipantFactory_create_participant_disabledI (self=0x2463590,
need_enable=0x7f84b42b58a7 "", domainId=1, qos=0x7f84b42b4c10, listener=0x0, mask=0, db_thread_factory=0x0,
recv_thread_factory=0x0, event_thread_factory=0x0, asynch_pub_thread_factory=0x0, register_builtin_types=0 '\000',
register_sql_filter=1 '\001') at DomainParticipantFactory.c:2164
#3 0x00000000008e17fe in DDSDomainParticipant_impl::create_disabledI (owner=0x2470f70, needEnable=0x7f84b42b58a7 "",
registerBuiltinTypes=1 '\001', domainId=1, qos=..., listener=0x0, mask=0) at DomainParticipant.cxx:952
#4 0x00000000008e1bd6 in DDSDomainParticipant_impl::createI (owner=0x2470f70, domainId=1, qos=..., listener=0x0,
mask=0) at DomainParticipant.cxx:1052
#5 0x00000000008df5e9 in DDSDomainParticipantFactory_impl::create_participant (this=0x2470f70, domainId=1, qos=...,
listener=0x0, mask=0) at DomainParticipantFactory.cxx:718

 

Gerardo Pardo's picture

Hi,

-static-libgcc is a linker flag that only affects how the libgcc library is linked. It is independent of whether you use the static or dynamic DDS libraries. In fact, the standard makefile that rtiddsgen creates uses the -static-libgcc flag despite the fact that it links the DDS libraries dynamically, as shown in this compile line, which uses the makefile that was generated by rtiddsgen:

g++ -m32 -static-libgcc   -o objs/i86Linux3.xgcc4.6.3/NewType_publisher \
    objs/i86Linux3.xgcc4.6.3/NewType_publisher.o objs/i86Linux3.xgcc4.6.3/NewTypeSupport.o \
    objs/i86Linux3.xgcc4.6.3/NewTypePlugin.o \
    objs/i86Linux3.xgcc4.6.3/NewType.o -L/home/gerardo/rti_connext_dds-5.2.0/lib/i86Linux3.xgcc4.6.3 \
    -lnddscppz -lnddscz -lnddscorez -ldl -lnsl -lm -lpthread -lrt

I am not saying that you need to use -static-libgcc; I was just responding to your earlier comment that stated that you were linking with it.

I understand now what you meant with the Stack Overflow thread. We do dynamic loading with dlopen(), but only for additional libraries that are loaded on demand, like the non-builtin transports (e.g. TCP), which are specified in the XML QoS... But it does not appear you are doing any of this, are you?

Did you try to re-link after removing the duplicate -lpthread? I do not have a full theory on how that duplicate could cause the problem you are seeing... But it is very suspicious, because your objects (and the main) appear in the link line ahead of the first -lpthread, whereas the DDS libraries are linked after it and may pull additional code from the second -lpthread. If this somehow causes two different libraries to be loaded or placed in different areas of memory, it would cause problems.

Gerardo

 


I did try relinking without the second -lpthread and it didn't change anything.

I am currently not specifying any QOS and only using defaults.

Does that new stack trace from Connext 5.2 change anything? The reference to NDDS_LM_validate looks like a license call. I just got an evaluation license to quickly update and test with the new version. Does DDS need a special license in order to create Domain Participants outside of the main thread? I know a lot of license managers use the PID in license checking. If I am using worker threads, the PID is going to change. I believe we are using an IR&D license right now. We are using Connext 5.1 for x64Linux2.6gcc4.1.1. I thought it was worth updating since our gcc version is actually 4.4.7. That's why I installed 5.2 for x64Linux2.6gcc4.4.5, which was the default download for Red Hat 6.5.

 

As far as the linking order, it's all cobbled together from CMake. So I can try changing that around if there is something you recommend.

Fernando Garcia's picture
Offline
Last seen: 4 months 6 days ago
Joined: 05/18/2011
Posts: 199

Hi Davey,

Do you have a simple reproducer that we can build to help you debug the problem? We have experience with CMake, so we could help you tweak the CMakeLists.txt files.

The new stack trace does not change much. It may fail at different points, but only because the creation of the DomainParticipant may be following different code paths in the different libraries you have built against. I can test your reproducer against unlicensed libraries to check whether the code path followed in the unlicensed version makes a difference.

Thanks,
Fernando.


There is nothing that I could easily give you that would compile.  I tried modifying the News app to make this work but didn't have any luck.  Do you have a lookup script for CMake that you use?

I am using the one off this site, https://github.com/gbiggs/rtcpcl/blob/master/cmake/Modules/FindDDS.cmake

However I modified the order of the libraries a bit because that seemed to matter for the static libraries, and we made some mods to the CMakeLists.txt to not build in the DDS based code if the library wasn't found.

Please keep in mind I want to be able to compile with the static DDS libraries.


Here is another stack trace where I created one participant on domain 0 in the startup thread.

Later on my worker thread creates a participant on domain 1 and crashes.

 

#0 0x00000038c867a4cb in _int_malloc () from /lib64/libc.so.6
#1 0x00000038c867a7ed in calloc () from /lib64/libc.so.6
#2 0x0000000000fd5a9b in RTIOsapiHeap_reallocateMemoryInternal (voidPtrPtr=0x7f68d4130278, size=112, alignment=16,
reallocFlag=0, forceHeapHeader=0, METHOD_NAME=0x12ef838 "RTIOsapiHeap_allocateStructure",
signature=RTI_OSAPI_STRUCT_ALLOC) at heap.c:404
#3 0x0000000000fdb7ef in RTIOsapiSemaphore_new (kindIn=RTI_OSAPI_SEMAPHORE_KIND_MUTEX, pIn=0x0) at Semaphore.c:1180
#4 0x0000000000fbcf9e in REDAWorkerFactory_createExclusiveArea (m=0x1e8e1f0, level=10) at Worker.c:451
#5 0x0000000000c5b1f0 in DDS_DomainParticipantConcurrency_initialize (self=0x7f68d41332d0, workerFactory=0x1e8e1f0,
qos=0x7f68d4135668) at DomainParticipantConcurrency.c:248
#6 0x00000000009dafc3 in DDS_DomainParticipant_createI (factory=0x1e343f0, domain_id=1, qos=0x7f68d4134c60,
listener=0x0, mask=0, sharedState=0x165e3c0, threadListener=0x0, db_thread_factory=0x0, recv_thread_factory=0x0,
event_thread_factory=0x0, asynch_pub_thread_factory=0x0) at DomainParticipant.c:8767
#7 0x00000000009b5f15 in DDS_DomainParticipantFactory_create_participant_disabledI (self=0x1e343f0,
need_enable=0x7f68d41358c7 "", domainId=1, qos=0x7f68d4134c60, listener=0x0, mask=0, db_thread_factory=0x0,
recv_thread_factory=0x0, event_thread_factory=0x0, asynch_pub_thread_factory=0x0, register_builtin_types=0 '\000',
register_sql_filter=1 '\001') at DomainParticipantFactory.c:1950
#8 0x00000000008d9073 in DDSDomainParticipant_impl::create_disabledI (owner=0x1e424a0, needEnable=0x7f68d41358c7 "",
registerBuiltinTypes=1 '\001', domainId=1, qos=..., listener=0x0, mask=0) at DomainParticipant.cxx:921
#9 0x00000000008d948d in DDSDomainParticipant_impl::createI (owner=0x1e424a0, domainId=1, qos=..., listener=0x0,
mask=0) at DomainParticipant.cxx:1021
#10 0x00000000008d212b in DDSDomainParticipantFactory_impl::create_participant (this=0x1e424a0, domainId=1, qos=...,
listener=0x0, mask=0) at DomainParticipantFactory.cxx:710

Do you have any insight into what those RTI calls are trying to do?


I figured it out. Apparently we set the stack size on new worker threads we create, and it was causing the memory issue. I commented out that line of code, and it works now. Do you have a recommended minimum stack size that is required by NDDS?
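
For anyone who needs to keep an explicit stack size rather than delete the call, here is a minimal sketch using plain POSIX threads. It clamps the requested size to a floor before creating the worker and checks every return code. The names start_worker, worker_main and kAssumedMinStackBytes are hypothetical, and the 1 MiB floor is only an illustrative assumption, not a value recommended by RTI.

```cpp
#include <pthread.h>
#include <limits.h>   // PTHREAD_STACK_MIN
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical floor for illustration only; NOT an RTI-recommended value.
static const size_t kAssumedMinStackBytes = 1024 * 1024;

// Stand-in for the worker that would call create_participant().
// It touches a moderately large stack buffer and reports success via arg.
static void* worker_main(void* arg) {
    assert(arg != NULL);
    char scratch[64 * 1024];
    std::memset(scratch, 0, sizeof(scratch));
    *static_cast<int*>(arg) = (scratch[0] == 0) ? 1 : 0;
    return NULL;
}

// Starts a joinable thread whose stack is at least kAssumedMinStackBytes
// (and at least PTHREAD_STACK_MIN). Returns 0 or a pthread error code.
static int start_worker(pthread_t* tid, size_t requested_bytes, void* arg) {
    assert(tid != NULL);
    size_t stack_bytes = requested_bytes;
    if (stack_bytes < kAssumedMinStackBytes) {
        stack_bytes = kAssumedMinStackBytes;
    }
    if (stack_bytes < static_cast<size_t>(PTHREAD_STACK_MIN)) {
        stack_bytes = static_cast<size_t>(PTHREAD_STACK_MIN);
    }

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        rc = pthread_create(tid, &attr, worker_main, arg);
    }
    pthread_attr_destroy(&attr);
    return rc;
}
```

A caller would pass the size the application currently requests, for example start_worker(&tid, 16 * 1024, &flag), and then pthread_join the thread. The right floor for a given Connext version and QoS has to come from RTI's documentation or from measurement.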

 

Gerardo Pardo's picture

Wow, yes that would cause the problem, of course.  The thought of the stack size actually crossed my mind but I discarded it since Linux by default has a very large size... But in hindsight I should have mentioned that possibility as well, sorry that I led you the wrong way and you had to figure it out yourself... 
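
To see what "very large" means on a given machine, the defaults can be queried directly. This is a small sketch assuming Linux with glibc, where the default thread stack size normally follows the soft RLIMIT_STACK limit (often 8 MiB). The helper names are hypothetical.

```cpp
#include <pthread.h>
#include <sys/resource.h>
#include <cassert>
#include <cstddef>

// Stores in *out the stack size pthread_create() uses with default
// attributes. Returns 0 on success or a pthread error code.
static int default_thread_stack_bytes(size_t* out) {
    assert(out != NULL);
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_attr_getstacksize(&attr, out);
    pthread_attr_destroy(&attr);
    return rc;
}

// Stores in *out the soft RLIMIT_STACK limit in bytes, or 0 if the limit
// is unlimited. Returns 0 on success, -1 if getrlimit() fails.
static int soft_stack_limit_bytes(size_t* out) {
    assert(out != NULL);
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0) {
        return -1;
    }
    *out = (lim.rlim_cur == RLIM_INFINITY) ? 0 : static_cast<size_t>(lim.rlim_cur);
    return 0;
}
```

Comparing these values against whatever the application passes to pthread_attr_setstacksize() shows quickly whether a worker has been given far less stack than a default thread would get.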

Glad you were able to get it all running!

Gerardo


Not a problem. I should have checked how the worker threads were getting created from the get-go. I guess I would have A) expected the POSIX thread to somehow throw an error that its stack was full, and B) not expected our application to set that value. I typically don't configure those types of settings unless a problem occurs. This is a new application I am working with. Thanks, though, for looking into it with me.
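
On point A: POSIX threads do not return an error when a stack fills up. On Linux, running past the end of a thread's stack normally hits a guard area and shows up as SIGSEGV, or as corruption if a large frame skips over the guard. What a thread can do is report how much stack it was actually given. The sketch below assumes glibc's pthread_getattr_np() extension (g++ on Linux enables it by default). StackInfo, report_stack and measure_thread_stack are hypothetical names.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>

struct StackInfo {
    size_t stack_bytes;  // stack size of the measured thread
    size_t guard_bytes;  // size of the guard area next to the stack
    int    error;        // 0 on success, otherwise a pthread error code
};

// Thread entry point: fills in the StackInfo passed through arg
// with the calling thread's own stack and guard sizes.
static void* report_stack(void* arg) {
    StackInfo* info = static_cast<StackInfo*>(arg);
    assert(info != NULL);
    pthread_attr_t attr;
    info->error = pthread_getattr_np(pthread_self(), &attr);
    if (info->error != 0) {
        return NULL;
    }
    void* base = NULL;
    info->error = pthread_attr_getstack(&attr, &base, &info->stack_bytes);
    if (info->error == 0) {
        info->error = pthread_attr_getguardsize(&attr, &info->guard_bytes);
    }
    pthread_attr_destroy(&attr);
    return NULL;
}

// Runs report_stack on a new thread created with the requested stack size
// and returns what that thread observed.
static StackInfo measure_thread_stack(size_t requested_bytes) {
    StackInfo info = {0, 0, 0};
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        info.error = rc;
        return info;
    }
    rc = pthread_attr_setstacksize(&attr, requested_bytes);
    pthread_t tid;
    if (rc == 0) {
        rc = pthread_create(&tid, &attr, report_stack, &info);
    }
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        info.error = rc;   // the thread never started
        return info;
    }
    pthread_join(tid, NULL);  // info is only read after the thread finishes
    return info;
}
```

Logging these numbers once at worker startup would have made an undersized stack visible before the first participant was created.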