Performance of RTI's internal memory pool vs tcmalloc?

Offline
Last seen: 2 years 9 months ago
Joined: 06/15/2021
Posts: 3
We are using unbounded sequences and hence need to set the QoS property:
memory_manager.fast_pool.pool_buffer_max_size

Then the question came up: what happens if the fast_pool is dropped entirely and memory management is handed over to an external memory manager?

Has anybody done any performance measurements comparing fast_pool memory allocation with, for instance, tcmalloc?

We are trying to reduce the number of QoS profiles that we need to maintain, and still get reasonable performance. We run a mix of frequent small samples, and less frequent very big samples.

Best regards / Bjerker

Howard
Offline
Last seen: 1 day 13 hours ago
Joined: 11/29/2012
Posts: 565

I don't think that any performance comparison between RTI's internal fast memory management implementation and other memory management implementations has been made. A major constraint is that whatever we use has to be available across the 10 or so operating systems, and even more compilers, that we support for our customers.

Offline
Last seen: 2 years 9 months ago
Joined: 06/15/2021
Posts: 3

Thank you for your answer. I do understand why you have to provide an internal memory manager.

We are using Linux on our systems, and tcmalloc is, as far as I know, a link-time replacement for the heap manager. I'm curious to explore what will happen to the performance if we set the QoS to always allocate dynamic memory in combination with tcmalloc.

If we manage to do such a comparison, I will come back with the result.

Howard
Offline
Last seen: 1 day 13 hours ago
Joined: 11/29/2012
Posts: 565

So, during operation, Connext DDS needs to allocate and free fixed-size buffers for different data types. This can happen a lot over days, weeks, or even years of operation without a reboot. In addition, RTI has many customers working on systems that need to run safely; some of them require formal safety certification of the software. In ALL of those cases, memory leaks and dynamic allocation from system memory are an area of concern, and in the most critical cases dynamic allocation is forbidden after initialization.

So, in order to provide both

1) a much more performant memory allocation/deallocation scheme than the system heap

2) a way to avoid allocating from the system heap after system initialization

RTI implements its own memory management scheme, which can be quite fast since each fast_pool only handles memory of a fixed size. But note, Connext DDS does not use it exclusively. Connext DDS will also call system heap memory management routines during initialization, as well as to add memory to the fast_pool if allowed to do so.

I will say that our approach is not unique to RTI.  I've seen many products/software packages that do exactly the same type of thing...for basically the same reason.
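
To illustrate why a fixed-size pool can be cheap, here is a minimal Python sketch of the general free-list technique. It is not RTI's implementation: the class name FixedSizePool, its parameters, and the heap_allocations counter are all hypothetical and exist only for this example. The point is that acquire/release are constant-time list operations, and that the "system heap" is only touched at initialization or when the pool is allowed to grow.

```python
class FixedSizePool:
    """Toy fixed-size buffer pool backed by a free list (illustrative only)."""

    def __init__(self, buffer_size, initial_buffers, allow_growth=True):
        self.buffer_size = buffer_size
        self.allow_growth = allow_growth
        self.heap_allocations = 0  # counts trips to the "system heap"
        self._free = []
        for _ in range(initial_buffers):
            self._free.append(self._new_buffer())

    def _new_buffer(self):
        # Stands in for a call to the system allocator (e.g. malloc).
        self.heap_allocations += 1
        return bytearray(self.buffer_size)

    def acquire(self):
        # Fast path: pop a pre-allocated buffer off the free list.
        if self._free:
            return self._free.pop()
        # Slow path: grow the pool from the heap, if that is permitted.
        if not self.allow_growth:
            raise MemoryError("pool exhausted and growth is disabled")
        return self._new_buffer()

    def release(self, buf):
        # Every buffer has the same size, so any freed buffer can serve
        # any later request; there is nothing to search or coalesce.
        if len(buf) != self.buffer_size:
            raise ValueError("buffer does not belong to this pool")
        self._free.append(buf)


pool = FixedSizePool(buffer_size=256, initial_buffers=4)
for _ in range(1000):
    pool.release(pool.acquire())

# Only the four initial buffers ever came from the "heap".
steady_state_heap_calls = pool.heap_allocations
```

With allow_growth=False the sketch mimics the "no heap allocation after initialization" case: once the pre-allocated buffers are in use, acquire() fails instead of going back to the heap.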

Offline
Last seen: 2 years 9 months ago
Joined: 06/15/2021
Posts: 3

I understand and agree with all the reasons for the existence of fast_pool; it is just that we have broken the constraints by introducing unbounded sequences, which require heap allocation. To gain speed for those cases, it might be worth evaluating an alternative heap manager, such as tcmalloc. Under such circumstances, it might be interesting to see what happens if the limit for heap allocation is set to 0. Perhaps the performance worsens and the heap gets fragmented, or it will just continue working (in our case, which is a non-critical support system during product development).

Howard
Offline
Last seen: 1 day 13 hours ago
Joined: 11/29/2012
Posts: 565

So, for unbounded sequences, or for Topics with large variability in sample size (most samples are about the same size, but a few are much larger, 10x or 100x the average), RTI introduced a QoS setting that allows Connext to take advantage of the performance of the fast_pool in managing fixed-size buffers while using the system heap to deal with the larger data samples.

The boundary between "large" dynamically allocated data and the "fixed" size memory blocks is set by these DataWriter and DataReader properties:

dds.data_writer.history.memory_manager.fast_pool.pool_buffer_max_size

dds.data_reader.history.memory_manager.fast_pool.pool_buffer_max_size

So, users can trade off the performance of using fixed-sized blocks to hold the data of any size up to the fixed size (pool_buffer_max_size) versus the memory wasted when using a fixed-sized block larger than the actual size of the data.

From a performance standpoint, if you have sufficient memory and know how big your biggest data is, then setting pool_buffer_max_size to the size of the largest data sample that will ever exist will give you much better performance than using malloc (and possibly even tcmalloc), at the expense of potentially wasting a lot of memory.

If you set pool_buffer_max_size to 0, then Connext DDS won't use fixed buffer pools at all for those DataWriters and DataReaders when dealing with data samples, and will use only malloc()/free(). So, now no memory is wasted by allocating more memory than the data size, but at the expense of lower performance and possible heap fragmentation.
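
As a rough illustration of this tradeoff, the Python sketch below models the rule as described in this thread: samples up to pool_buffer_max_size get a pool buffer of exactly that size, anything larger goes to the heap, and 0 means everything goes to the heap. The function memory_tradeoff and the sample sizes are made up for the example, and whether the real boundary is inclusive is an assumption here; check the Connext documentation before relying on it.

```python
def memory_tradeoff(sample_sizes, pool_buffer_max_size):
    """Count pooled vs heap-allocated samples and the bytes over-allocated.

    Assumed model (from the discussion, not from RTI source code):
    size <= pool_buffer_max_size -> fixed buffer of pool_buffer_max_size bytes
    otherwise                    -> exact-size allocation from the heap
    """
    pooled = 0
    heap = 0
    wasted = 0
    for size in sample_sizes:
        if pool_buffer_max_size > 0 and size <= pool_buffer_max_size:
            pooled += 1
            # The fixed buffer is larger than the data it holds.
            wasted += pool_buffer_max_size - size
        else:
            heap += 1
    return {"pooled": pooled, "heap": heap, "wasted_bytes": wasted}


# Hypothetical workload: many small samples, two very large ones.
sizes = [100] * 98 + [10_000, 100_000]

all_heap = memory_tradeoff(sizes, 0)          # no pools at all
mixed = memory_tradeoff(sizes, 128)           # pool the small samples only
all_pooled = memory_tradeoff(sizes, 100_000)  # pool sized for the largest sample
```

In this made-up workload, a threshold of 128 keeps 98 of the 100 samples in the pool while over-allocating only 28 bytes per small sample, whereas sizing the pool for the 100,000-byte sample over-allocates almost 100 KB for every small one.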

However, as you say, in a real system, setting pool_buffer_max_size to 0 may have no adverse effect on meeting the overall system requirements, and the system may perform just fine.

But the performance is not going to be faster than if you enable Connext DDS to use a fixed size pool of memory via the fast_pool.

I think that your initial premise was using an alternative memory manager like tcmalloc may be better than malloc and sufficient for some use cases instead of using fast_pool.  Sure, that might be true.  I would think that fast_pool > tcmalloc > malloc in terms of performance. 

(I don't think that a variable-sized memory allocation scheme can be faster than one that only operates on fixed-size memory blocks, and there is no risk of "heap" fragmentation with a fixed-size scheme.)

And for many systems, malloc() might be fast enough.  For some systems, malloc is not good enough, but tcmalloc may be.  And then for other systems, they will need to use fast_pools to get the performance needed.

I guess, I'm not sure what your thesis is...what are you suggesting?  In your first post, you wrote:

Has anybody done any performance measurements comparing fast_pool memory allocation with for instance tcmalloc?

We are trying to reduce the number of QoS profiles that we need to maintain, and still get reasonable performance. We run a mix of frequent small samples, and less frequent very big samples.

What do you expect to gain by not using fast_pool and only tcmalloc for dynamic memory allocation? For a Topic with many small samples and only a few big samples, setting pool_buffer_max_size to the size of the largest of the "small" samples would give you the best tradeoff of performance and memory usage.
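
One possible way to find "the largest of the small samples" from measured data is a simple percentile cut, sketched below in Python. This is only a heuristic for illustration, not an RTI recommendation: the function suggest_pool_buffer_max_size, the coverage parameter, and the observed sizes are all hypothetical.

```python
import math


def suggest_pool_buffer_max_size(sample_sizes, coverage=0.99):
    """Return the smallest observed size that covers `coverage` of the samples.

    Samples larger than the returned value would fall back to the heap.
    """
    if not sample_sizes:
        raise ValueError("need at least one observed sample size")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(sample_sizes)
    # Index of the last sample that must still fit in a pool buffer.
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[index]


# Hypothetical measurements: frequent small samples, one very big one.
observed = [64] * 90 + [128] * 9 + [1_000_000]

threshold = suggest_pool_buffer_max_size(observed, coverage=0.99)
```

With these made-up measurements the rare 1,000,000-byte sample is left to the heap and the threshold lands on the largest of the frequent sizes. The resulting value would then be used for the pool_buffer_max_size properties on the DataWriter and DataReader.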

Sure, you could switch from malloc to tcmalloc for the "very big samples", but since they are less frequent, any performance gain that tcmalloc has over malloc will have less impact on overall system performance. However, as stated previously, I doubt that you would see a performance gain for the "frequent small samples" with tcmalloc's variable-sized memory allocation versus fast_pool's fixed-size scheme.

Fundamentally, if you are using unbounded sequences, then you have to configure Connext DDS to use system memory allocation for data larger than a fixed size. That size could be 0, so that for unbounded sequences Connext only uses system heap memory allocation.

I'm not entirely familiar with tcmalloc, but from what I skimmed, it seems to be something that the end user can decide to use instead of the malloc() provided by the compiler/system libs, and it does not require RTI to make any modifications to our software.

If that's the case, then if it is useful and works for your system, go for it!