Dear All,
I am getting a segmentation fault in one of RTI's background threads when using a ContentFilteredTopic with DynamicData (see the concrete, simplified example below). In my example I create one Topic using DynamicData as its type and one DataWriter. Then I create two DataReaders, each with its own ContentFilteredTopic attached to the first topic. When the writer sends two samples and the two readers then try to get their individually filtered data, some background thread (probably within the DomainParticipant) causes a segmentation fault and my program crashes. Unfortunately, I don't have the source code of RTI's internal implementation, as I am using the RTI UP license, but when I run the example with a debugger, I at least get this stack trace before the crash:
Thread 9 "FilterBug" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe67fc700 (LWP 15429)]
0x00005555559690ca in DDS_DynamicData2TypePlugin_return_sample (endpoint_data=0x555556d8e130, sample=0x1, handle=0x555556d9bc40) at DynamicData2TypePlugin.c:1195
1195 DynamicData2TypePlugin.c: No such file or directory.
(gdb) bt
#0 0x00005555559690ca in DDS_DynamicData2TypePlugin_return_sample (endpoint_data=0x555556d8e130, sample=0x1, handle=0x555556d9bc40) at DynamicData2TypePlugin.c:1195
#1 0x000055555605ab44 in PRESPsReaderQueue_returnQueueSample (me=0x555556da1140, entry=0x555556da55d0, entrySample=0x555556da5668) at PsReaderQueue.c:3156
#2 0x000055555605f4ac in PRESPsReaderQueue_addQueueEntryToPolled (me=0x555556da1140, lostCount=0x7fffe67fb3b4, lostReason=0x555556da1190, rejectedCount=0x7fffe67fb3b8, rejectedReason=0x555556da1198,
entry=0x555556da55d0, receptionTsIn=0x7fffe67fb8a0, now=0x7fffe67fb8a0, remoteWriterQueue=0x555556db73c0, readConditionState=0x555556da11b0, queryConditionState=0x555556da11b4)
at PsReaderQueue.c:4266
#3 0x0000555556068437 in PRESPsReaderQueue_newData (me=0x555556da1140, dataAvailable=0x555556da1188, lostCount=0x555556da118c, lostReason=0x555556da1190, rejectedCount=0x555556da1194,
rejectedReason=0x555556da1198, receivedInlineQosBitmap=0x7fffe67fb5bc, remoteWriterQueue=0x555556db73c0, firstRelevantSn=0x0, nextRelevantRangeStartSn=0x0, isMatching=1, data=0x7fffe67fba00,
localData=0x0, decodingKeyHandle=0x0, strength=0, reservedCount=-1, timestamp=0x7fffe67fb8a0, now=0x7fffe67fb8a0, readConditionState=0x555556da11b0, queryConditionState=0x555556da11b4,
worker=0x555556a3a630) at PsReaderQueue.c:6596
#4 0x0000555555cd0f97 in PRESPsService_readerSampleListenerOnNewData (listener=0x55555685d310, firstRelevantSn=0x0, nextRelevantRangeStartSn=0x0, data=0x7fffe67fba00, reservedCount=-1,
timestamp=0x7fffe67fb8a0, storage=0x555556a9abb8, worker=0x555556a3a630) at PsServiceImpl.c:2756
#5 0x0000555555e2bf8c in COMMENDBeReaderService_onSubmessage (listener=0x555556a96b30, context=0x7fffcc000be0, timestamp=0x7fffe67fb8a0, storage=0x5555568d2db0, worker=0x555556a3a630)
at BeReaderService.c:1350
#6 0x0000555555edef4d in MIGInterpreter_parse (me=0x5555568d2830, context=0x7fffcc000be0, msg=0x7fffe67fbd40, worker=0x555556a3a630) at Interpreter.c:680
#7 0x0000555555e22776 in COMMENDActiveFacadeReceiver_loop (param=0x555556a3a600) at ActiveFacade.c:605
#8 0x000055555601209c in RTIOsapiThreadChild_onSpawned (param=0x555556716b00) at Thread.c:1435
#9 0x00007ffff79b96db in start_thread (arg=0x7fffe67fc700) at pthread_create.c:463
#10 0x00007ffff6f3988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The crash happens after the first reader has received its data and when the second reader tries to read its samples. The error message does not make much sense to me, and I cannot debug any deeper than I already have, but maybe somebody who has the source code has a clue whether I am using RTI in some unintended way (I can't see an obvious misuse on my side) or whether this is a real bug in RTI. Attached is the example code that triggers the above error (see the main.cpp file).
It took me a while to simplify the example to this state, where it still triggers the error. The error happens somewhere after reader 1 has successfully received its sample(s) and when the second reader starts taking its samples. Is there an obvious error in the code that I am missing? Does anybody have an idea how I can avoid this error? Honestly, I have run out of ideas, because my use case seems rather basic and I don't know how to implement it differently so that it does the same thing.
Btw, I am using Connext DDS 6.0.0 with a UP license, and my OS is Ubuntu 18.04.2 and my compiler is gcc 7.4.0. I appreciate any help!
Cheers,
Alex
Hi Alex,
I wanted to let you know that I'm looking into this and will update you as soon as I figure out the issue. Thank you very much for the reproducer; I was able to see the crash myself and will take a look.
Hi Erin,
I am relieved that you could reproduce the error as it could also have been the case that the error only happens with my PC configuration.
Here are some further observations that may or may not be helpful. I could only reproduce the error when I used filtering. My first guess was that it is somehow related to the filter itself, so I implemented a CustomFilter. However, I have been able to reproduce the same error with a CustomFilter as well. So it does not seem to be the filter itself, but maybe some side effect or some strange coincidence (possibly even unrelated to filtering). But for me it is hard to tell without debugging the real code.
I hope you will be able to find the cause of the problem. If so, a temporary workaround would also help me a lot to continue with my implementation (other than not using filters at all, as I would then need to reimplement quite a bit of code).
Really appreciate your efforts. Thanks!
Alex
Hi Alex,
I'm sorry, unfortunately my attention got pulled elsewhere last week, but I wanted to let you know this is my top priority this week and I will update you as soon as I know what the problem is (and hopefully be able to provide you with a workaround at that time). I expect that I'll have something for you by the end of the day tomorrow (Monday). I really appreciate your patience.
Thank you,
Erin
Hi Alex,
I have confirmed bug CORE-9653. The problem you are running into has to do with a combination of having multiple readers at the same locator, and using an unkeyed type, DynamicData, and a content filter with the content filtering happening on the writer-side.
As a workaround until the next General Access Maintenance Release later in the year, you can set DataWriterResourceLimits.max_remote_reader_filters = 0. This will effectively disable writer-side filtering and work around the issue.
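For anyone who sets QoS through an XML profile rather than in code, a minimal sketch of this workaround might look like the following. The library and profile names are placeholders chosen for illustration (they are not from the attached example), and the DataWriter would need to be created from this profile for the setting to take effect.

```xml
<dds>
  <!-- "WorkaroundLibrary" and "DisableWriterSideFiltering" are hypothetical names -->
  <qos_library name="WorkaroundLibrary">
    <qos_profile name="DisableWriterSideFiltering">
      <datawriter_qos>
        <writer_resource_limits>
          <!-- 0 means the writer keeps no remote reader filters,
               so filtering is left to the reader side -->
          <max_remote_reader_filters>0</max_remote_reader_filters>
        </writer_resource_limits>
      </datawriter_qos>
    </qos_profile>
  </qos_library>
</dds>
```

With writer-side filtering disabled, every sample is sent to all matching readers and each reader applies its own content filter, so the filtered results should be the same at the cost of extra network traffic.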
Thanks again for your patience and for reporting this issue.
- Erin
Dear all,
if anybody has a similar problem, I can report that the workaround proposed by Erin works. Attached is the fixed example with the proposed workaround, which no longer causes the segmentation fault.
The relevant code-lines are:
This effectively disables the writer-side filtering; the readers now do the filtering instead, so the behavior is similar. The only difference might be in terms of efficiency (or rather inefficiency), but this should be fixed when the next update with the patch comes out.
@Erin: Thanks again for the help!
Cheers,
Alex
The new RTI Connext DDS version 6.0.1 has fixed this bug and the workaround is not required anymore.
Cheers,
Alex