Message Loss over Shared Memory

Offline
Last seen: 1 year 11 months ago
Joined: 02/20/2014
Posts: 10

Earlier this week we started noticing some odd issues with high-rate messaging over DDS.  We're running a single-participant application on a multi-core system, with multiple threads hosting a variety of publishers and subscribers that talk to each other over shared memory.

What we're noticing is that data writers talking over shared memory to data readers are dropping messages.  An analysis of the threads at the time of a drop shows a period where three messages are sent BEST EFFORT.  Pretty much at the moment the second message is sent, the shared memory receive thread picks up and starts processing.  While it is processing, the third message is sent, and the subscriber never receives it.  The data reader's lost message callback is called and reports that a sample was LOST_BY_WRITER.  The return code from the write_w_params() call is always RETCODE_OK.

After switching the QoS to RELIABLE, the loss still appears to happen.  Instead of a lost message, though, the subscriber sees a gap of ~40ms before it receives anything, which I assume is the time it takes the repair traffic to identify and resend the missing message.  In this case the write_w_params() call also always returns RETCODE_OK, which is confusing because I had initially expected it to block and time out.
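To make the assumption about repair traffic concrete, here is a toy estimate of how long a reader might wait for a lost sample under a heartbeat-driven repair scheme. This is only a sketch: the function name, the idea that heartbeats fire at exact multiples of a fixed period, and all parameters are hypothetical and are not taken from RTI's implementation or defaults.

```python
import math


def repair_delay_ms(loss_time_ms, heartbeat_period_ms, round_trip_ms):
    """Toy estimate of the wait before a lost sample is repaired.

    Assumptions (hypothetical, not RTI's actual behavior):
    - the writer sends a heartbeat at every multiple of heartbeat_period_ms
    - the reader only notices the loss when the next heartbeat arrives
    - the NACK plus the resend cost one round trip
    """
    # Time of the first heartbeat at or after the moment the sample was lost.
    next_heartbeat_ms = (
        math.ceil(loss_time_ms / heartbeat_period_ms) * heartbeat_period_ms
    )
    # Wait for that heartbeat, then one round trip for NACK and resend.
    return (next_heartbeat_ms - loss_time_ms) + round_trip_ms
```

With made-up numbers, a sample lost 10 ms into a 50 ms heartbeat period with a 1 ms round trip would be repaired after 41 ms in this model, so a gap of tens of milliseconds is at least consistent with a periodic repair mechanism. Note also that the writer never blocks in this model, which would fit with write_w_params() returning RETCODE_OK.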

Switching on DDS's verbosity to WARNING, no messages are printed out at the time of the message drop.

So we're kind of confused here.  Is message loss really expected to occur when the shared memory receive thread is busy?  What mitigations are available?  I figure it might be adding more receive threads, or maybe switching entirely to UDPv4 over localhost.

Offline
Last seen: 3 months 6 days ago
Joined: 02/11/2016
Posts: 144

Whether you use UDPv4 or shared memory, your QoS and your system settings could be causing this issue.

This looks like a classic case of a buffer being too small to hold all the messages that are in flight at the same time.
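As a rough illustration of that failure mode, here is a minimal sketch of a fixed-size receive buffer that silently drops samples when the reader is too busy to drain it. It is a toy model, not RTI's shared memory transport: the class, the function, and the capacity value are all hypothetical.

```python
from collections import deque


class BoundedReceiveQueue:
    """Toy model of a fixed-size receive buffer (not RTI's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()
        self.dropped = []

    def write(self, sample):
        # Best-effort sender: it gets no feedback, so the call always "succeeds".
        if len(self.slots) >= self.capacity:
            self.dropped.append(sample)  # buffer full, sample is lost
        else:
            self.slots.append(sample)
        return "OK"

    def take(self):
        # Return the oldest buffered sample, or None if the buffer is empty.
        return self.slots.popleft() if self.slots else None


def run_burst(capacity, burst):
    """Send a whole burst while the reader is busy, then drain the buffer."""
    queue = BoundedReceiveQueue(capacity)
    results = [queue.write(sample) for sample in burst]
    received = []
    while (sample := queue.take()) is not None:
        received.append(sample)
    return results, received, queue.dropped
```

Calling run_burst(2, [1, 2, 3]) returns "OK" for all three writes, yet only samples 1 and 2 are received and sample 3 is dropped. With a capacity of 3 nothing is lost, which is why the buffer-size settings are the first thing to check.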

 

Adding to this, I've heard (although this is not an official RTI statement) that there are relatively more issues with shared memory.

I would recommend looking into the system properties and QoS settings that have to do with buffer sizes.  You may also want to look into the performance tips recommended by RTI: https://community.rti.com/best-practices/tune-your-os-performance

 

Good luck,

Roy.