RTI Recording service problems

8 posts / 0 new
Last post
Offline
Last seen: 6 years 11 months ago
Joined: 04/04/2014
Posts: 27
RTI Recording service problems

Hello,

I'm trying to get RTI Recording service running but have some problems.

CASE 1: Following setup works:

  • 3 DDS applications on two hosts (one QNX, one Linux)
  • RTI Recording service running on Linux node
  • 27 topics and 44 publications (5 topics publish data at 1kHz, 1 topic publishes data at ~50 Hz)

In this setup it's possible to record data with rtirecord in RTIDDS_DESERIALIZEMODE_NEVER and RTIDDS_DESERIALIZEMODE_ALWAYS modesl. The only problem is that rtirecconv crashes trying to deserialize serialized sqlite3 database:

Program received signal SIGSEGV, Segmentation fault.
0xb7e1c106 in memcpy () from /lib/libc.so.6
(gdb) bt
#0 0xb7e1c106 in memcpy () from /lib/libc.so.6
#1 0xbff96e40 in ?? ()
#2 0x0809c691 in DRT_DynamicTypeMemberInfo_copy ()
#3 0x0811f54e in RTIConverterModel_DRTCallback_getMembers ()
#4 0x0809a0f8 in DRT_DynamicType_expand_sequence ()
#5 0x0809b729 in DRT_DynamicType_get_matching_fields_internal ()
#6 0x0809c1bd in DRT_DynamicType_get_matching_fields_internal ()
#7 0x0809c4a0 in DRT_DynamicType_get_matching_fields ()
#8 0x08122870 in RTIConverterModel_deserializeTable ()
#9 0x08122ed5 in RTIConverterModel_serializedToDeserialized ()
#10 0x0812c85c in RTIConverterSql_convert ()
#11 0x0808ae1b in main ()

At the same moment it's possible to convert serialized file to XML without any errors.

CASE 2

If I'm trying to add more topics (61 topic, 79 publications - 10 topics publish data at 1kHz, 1 topic publishes data at ~50 Hz) rtirecord crashes after some time. In RTIDDS_DESERIALIZEMODE_ALWAYS mode memory usage ("Resident size" in top) of rtirecord continiously grows from the start and when it reaches 1.5 - 2GB following error messages are displayed:

REDAFastBufferPool_growEmptyPoolEA: !allocate buffer of 134348800 bytes
PRESPsReader_takeI:!queue take

After some time following is added:

REDAFastBufferPool_growEmptyPoolEA: !allocate buffer of 8388608 bytes
PRESPsReaderQueue_getQueueEntry:!create entry
PRESPsReaderQueue_storeSampleToEntry:!get entry
PRESPsReaderQueue_newData:!get entries

Same error occurs if recording in RTIDDS_DESERIALIZEMODE_NEVER and enabling rollover. Shortly after the first file is complete rtirecord consumes much memory and prints same error messages as above. This happens even if max_file_size is set to very small value (e.g. 10MB)

RTIDDS_DESERIALIZEMODE_NEVER with disabled rollover produces a valid 2GB file and stops.

I've tried to get some backtraces of the behavior when rtirecors starts eating memory. Not sure that it shows the problem location but could give some hints.

(gdb) info threads
  Id   Target Id         Frame
  17   Thread 0xaecd9b70 (LWP 4466) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  16   Thread 0xaf4dab70 (LWP 4465) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  15   Thread 0xafdecb70 (LWP 4464) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  14   Thread 0xb05edb70 (LWP 4463) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  13   Thread 0xb0effb70 (LWP 4462) "rtirecord" 0x087d45a5 in DDS_DynamicDataTypePlugin_parametrized_cdr_to_cdr ()
  12   Thread 0xb2cd9b70 (LWP 4461) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  11   Thread 0xb34dab70 (LWP 4460) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  10   Thread 0xb3decb70 (LWP 4459) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  9    Thread 0xb45edb70 (LWP 4458) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  8    Thread 0xb4effb70 (LWP 4457) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  7    Thread 0xb59ffb70 (LWP 4456) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  6    Thread 0xb635bb70 (LWP 4455) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  5    Thread 0xb6c7db70 (LWP 4454) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  4    Thread 0xb759fb70 (LWP 4453) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
  2    Thread 0xb7da0b70 (LWP 4451) "rtirecord" 0xffffe424 in __kernel_vsyscall ()
* 1    Thread 0xb7da18d0 (LWP 4447) "rtirecord" 0xb7f17779 in clock_nanosleep () from /lib/librt.so.1
(gdb) bt
#0  0xb7f17779 in clock_nanosleep () from /lib/librt.so.1
#1  0x08a4190b in RTIOsapiThread_sleep ()
#2  0x0867f441 in NDDS_Utility_sleep ()
#3  0x080db57c in RTIDRTManagerDb_busyHandler ()
#4  0x081251e5 in btreeInvokeBusyHandler ()
#5  0x0811f3b9 in pager_wait_on_lock ()
#6  0x0811f460 in sqlite3PagerExclusiveLock ()
#7  0x08157a72 in sqlite3VdbeHalt ()
#8  0x0815c2de in sqlite3VdbeExec ()
#9  0x08165c6c in sqlite3_step ()
#10 0x0816610f in sqlite3_exec ()
#11 0x080d8378 in RTIDRTManagerDb_copyDiscoveryTables ()
#12 0x080da472 in RTIDRTManagerDb_checkSegmentSizeAndSwapIfFull ()
#13 0x080db84d in RTIDRTManagerEvent_processDB ()
#14 0x0818299b in RTIRecorder_event_process ()
#15 0x08186b48 in RTIRecorder_main ()
#16 0x08088cda in main ()

The main problem is that it's impossible to log data for the longer period. Any hints what could be causing these problems? I've no special settings in QoS, just several profiles for reliable and besteffort communication with history set in some profiles up to 25. All limits are not modified.

Thanks a lot in advance!

Organization:
Keywords:
Offline
Last seen: 6 years 11 months ago
Joined: 04/04/2014
Posts: 27

I've tried to get some information with valgrind and it gives 2 huge possible memory leaks

==27515== 330,318,976 bytes in 1,056 blocks are possibly lost in loss record 5,172 of 5,173
==27515==    at 0x40286FF: calloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==27515==    by 0x8A3DFE5: RTIOsapiHeap_reallocateMemoryInternal (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x87DACD0: DDS_DynamicDataPluginSupport_create_data (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x87DAE1C: DDS_DynamicDataTypePlugin_create_sampleI (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8A24274: REDAFastBufferPool_growEmptyPoolEA (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8A24DB0: REDAFastBufferPool_newWithNotification (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x882F74A: PRESTypePluginDefaultEndpointData_newInternal (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x882F7BA: PRESTypePluginDefaultEndpointData_newWithNotification (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x87D9EA9: DDS_DynamicDataTypePlugin_on_endpoint_attached (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x88A15DB: PRESPsService_enableLocalEndpointWithCursor (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x88D9618: PRESPsService_enableLocalEndpoint (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x85F6E24: DDS_DataReader_enableI (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515== 
==27515== 610,802,216 bytes in 77 blocks are possibly lost in loss record 5,173 of 5,173
==27515==    at 0x40286FF: calloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==27515==    by 0x8A3DFE5: RTIOsapiHeap_reallocateMemoryInternal (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8A2411E: REDAFastBufferPool_growEmptyPoolEA (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8A244C2: REDAFastBufferPool_getBuffer (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8A60D28: PRESPsReaderQueue_readOrTake (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8A63944: PRESPsReaderQueue_take (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x887AE61: PRESPsReader_takeI (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x887B1B4: PRESPsReader_take (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x85F4CF9: DDS_DataReader_read_or_take_untypedI (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x86C4E1D: DDS_DynamicDataReader_read_or_takeI (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x86C4EE8: DDS_DynamicDataReader_take (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515==    by 0x8105A1C: RTIDRTUserDataListener_on_data_available (in /rticonnext_5.1.0/RTI_Recording_Service_5.1.0/bin/i86Linux2.6gcc4.4.5/rtirecord)
==27515== 
==27515== LEAK SUMMARY:
==27515==    definitely lost: 0 bytes in 0 blocks
==27515==    indirectly lost: 0 bytes in 0 blocks
==27515==      possibly lost: 1,168,764,899 bytes in 295,830 blocks
==27515==    still reachable: 1,351,487 bytes in 8,233 blocks

 

rip
rip's picture
Offline
Last seen: 1 day 1 min ago
Joined: 04/06/2012
Posts: 324

Hi Sergey,

What version of the software are you using?  There's a known issue that I think you describe in the case 1 description. 

For the case 2, our implementation mallocs (or, as you see, callocs) memory and then uses its own heap management.  So long as the application is running, that space taken will never be returned to the system heap. 

So it looks to valgrind as if we are leaking memory, because we never free it.

You state "some limits up to 25", that means the other QoS profiles in use are the default, which are keep all.  History is one of the "non-RxO" fields, so you can set the default QoS setting for readers in the QoS file used by the recorder to keep-last, and some value, and this does not affect any of the writers on the system.  This will cap the amount of memory needed by the recorder to some value (determined by the number of readers needed due to the number of Topics found).  The use of deeper keep-last is necessary maybe for the 1KHz topics, so that there is space available to hold the data while it is being written to disk.

Note that writing to disk is /slow/.  Slower than what the production rate is, apparently.  This means that if you cap the readers' history depths, you will start losing samples.  That's probably why the memory starts ballooning, the write-to-disk can't keep up with the production rate.

rip

 

Offline
Last seen: 6 years 11 months ago
Joined: 04/04/2014
Posts: 27

Hello rip,

thank you for a quick answer!

I'm using 5.1.0 version. I've tried modifying DataReader History QoS of the recorder in different ways, but rtirecord still consumes up two 2GB memory and prints endless error messages. Am I correctly applying QoS for recording service?

via recorder config file:

<topic_group name="All">
  <topics>
    <topic_expr> * </topic_expr>
  </topics>
  <field_expr> * </field_expr>
  <datareader_qos>
    <history>
      <depth>1</depth>
      <kind>KEEP_LAST_HISTORY_QOS</kind></history></datareader_qos>
</topic_group>

or via profiles.xml:

<qos_profile name="default_qos" is_default_qos="true">
  <datawriter_qos>
    <durability>
      <kind>VOLATILE_DURABILITY_QOS</kind>
    </durability>
    <reliability>
      <kind>BEST_EFFORT_RELIABILITY_QOS</kind>
    </reliability>
    <history>
      <depth>1</depth>
      <kind>KEEP_LAST_HISTORY_QOS</kind>
    </history>
  </datawriter_qos>
  <datareader_qos>
    <durability>
      <kind>VOLATILE_DURABILITY_QOS</kind>
    </durability>
    <reliability>
      <kind>BEST_EFFORT_RELIABILITY_QOS</kind>
    </reliability>
    <history>
      <depth>1</depth>
      <kind>KEEP_LAST_HISTORY_QOS</kind>
    </history>
  </datareader_qos>
</qos_profile>

 

I've tried to sum up the raw data sent and it shouldn't exceed several megabytes per second, e.g. in CASE 2 it is around 3 MB of data in 10800 messages per second. Together with all sample information and service data it shouldn't exceed 6MB. Saving to disk should be able to manage up to 50 MB of data per second, so I'm not sure that the problem is about not being able to save all data in time. This can be confirmed by the size of recorded serialized data log - 135 MB for 30 seconds.


I've made some more experiments and set TimeBaseFilter QoS for rtirecord to 1ms and it can log data without high memory usage. Short summary (topics from case 2):

  • recording serialized data works well: resulting database size is 135 MB for ~30s
  • recording deserialized data, timebasedfilter set to 1ms, works well: resulting database size is 23 MB for ~30s
  • recording deserialized data, timebasedfilter set to 0, eats memory: resulting database size is 23 MB for ~30s

 

The issue is likely related to the message count, but it cannot be resolved by setting History QoS and saving to the HDD shouldn't be the bottleneck. Can it be something related to the listener callbacks called too often? There is also another error message generated:

exception:[RTIDRTUserDataListener_on_data_available@106]:Failed to allocate new DRT event

rip
rip's picture
Offline
Last seen: 1 day 1 min ago
Joined: 04/06/2012
Posts: 324

The QoS looks reasonable.

I'm going to ask Engineering, I don't know enough about the internals of the recorder to be able to supply a reasonably valid answer about what's happening.

Do you think you could create some Prototyper configuration files that recreate the system you have?

r

 

Offline
Last seen: 6 years 11 months ago
Joined: 04/04/2014
Posts: 27

I've manage to limit test to 3 topics publishing data from 3 computers at 1 Hz to reproduce the problem. It takes longer to consume all available memory but behavior is still reproducible. rtirecord generates 1.75GB of data for ~50 minutes less than 1MB of data per second. I think it should be easy to create Prototyper configuration files to recreate this case. However it may require some time, since I've not used prototyper before.

I've attached two plot for on of the recorded topics:

  1. reception timestamp (only short part after memory is exhausted)
  2. plot of reception timestamp difference to previous value.

After every 65534 samples data of 1-2s is lost. This happens on two of three recorded topics. On the third topic max 300ms are lost several times but there is no fix period.

 

You've mentioned a known issue, which could explain rtirecconv crashes. Does it exist in 5.1.0?

rip
rip's picture
Offline
Last seen: 1 day 1 min ago
Joined: 04/06/2012
Posts: 324

Thank you.  I'm not able to set this up at the moment (ask me for anything but time...).  I may be able to later this week.

for your original case 1, the sigsegv bug I remembered was in rtirecord, not in rtirecconv, so that appears to be something new.  I'll follow up on that.

for the case 2, there are two open bugs that seem to be related.  "limiting the memory usage with better defaults", and "should not segfault if we run out of memory". We implemented partial improvements but I won't say the bugs were fixed, more like we now just have a thinner edge case horizon.  There is at least one inefficient use of sqlite3 commands, so the problem may be in how we are using sqlite, rather than how we are doing DDS.

The default flush period (how much time to wait between db inserts) is 1 second, which is also the minimum period allowed.  There is another enhancement request to support sub-1 sec resolution.

There is a known user case where they manually set resource limits to prevent recorder from trying to allocate more memory over and over.  In the Recorder's QoS can you set the default profile's datareader qos, to include manually configured resource limits on the max_total_instances or max_instances policies.

One other question that came up in internal discussions:  Are you accessing the sqlite db file at the same time as you are writing too it?  That would also cause hte behavior you are seeing (I don't think this is the case, though, as that seems like something that would be worth remarking on).

Offline
Last seen: 6 years 11 months ago
Joined: 04/04/2014
Posts: 27

If you need the example for the case 1 I  can provide a small file causing problems with rtirecconv. But since I never managed to create a serialized file which can be deserialized, it should be easy to reproduce. I try to perform deserialization with a -deserialize switch, e.g.:

rtirecconv -deserialize raw.dat_0_0

I've tried more tests with different QoS settings and found following:

  • max_instances doesn't have any effect, memory is still consumed
  • max_samples prevent from consuming too much memory.
    • If limiting to 1000 then every 1000 samples data are recorded continuosly and the next data of ~500ms is lost
    • Warning is printed often: warning:[RTIDRTUserDataListener_on_sample_rejected@196]:sample rejected:DDS_SampleLostStatus_from_presentation_status:ERROR: Failed to get SampleLostStatus (unknown kind)
            <datareader_qos>
                <history>
                    <depth>1</depth>
                    <kind>KEEP_LAST_HISTORY_QOS</kind>
                </history>
                <time_based_filter>
                    <minimum_separation>
                        <sec>0</sec>
                        <nanosec>0</nanosec>
                    </minimum_separation>
                </time_based_filter>
                <resource_limits>
                    <max_samples>1000</max_samples>
                    <initial_samples>1</initial_samples>
                    <max_instances>DDS_LENGTH_UNLIMITED</max_instances>
                    <initial_instances>1</initial_instances>
                </resource_limits>
            </datareader_qos>

 

I've done some more experiments with gdb and noticed that DDS_DynamicDataReader_return_loan is called more rarely than DDS_DynamicDataReader_take. I don't know if it's expected behaviour, but could it be the reason? That DataReader is forced to allocate new samples because it doesn't get them back? Gdb output in the case with consuming memory:

5 breakpoint keep y 0x086c3f73 <DDS_DynamicDataReader_return_loan+6>

breakpoint already hit 29335 times
ignore next 70665 hits
6 breakpoint keep y 0x086c4eb4 <DDS_DynamicDataReader_take+6>
breakpoint already hit 100001 times

 I'm not accessing the sqlite db file during writing.