DataReader::read blocking on internal mutex for 10+ milliseconds

9 posts / 0 new
Last post
Offline
Last seen: 3 years 1 month ago
Joined: 02/08/2021
Posts: 21
DataReader::read blocking on internal mutex for 10+ milliseconds

Hello,

We have a jitter-sensitive application which has DataReaders and DataWriters. Some of them fech local data via SHMEM, others remote. The process has minimal own processing. It is communication dominated.

The problem is that we sometimes notice calls blocking on internal mutexes for long times (10+ milliseconds).

On the stack trace below, The blocked DataReader is configured with History::KeepLast = 1 and "Reliability::Reliable". A stack trace on the profiler looks like this:

Call time: 14,091ms

libpthread-2.27.so: __pthread_mutex_lock
libnddscore.so !RTIOsapiSemaphore_take
libnddscore.so !REDAWorker_enterExclusiveArea
libnddscore.so !REDACursor_modifyReadWriteArea
libnddscore.so !PRESPsReader_readOrTakel
libnddscore.so !PRESPsReader_read
libnddsc.so !DDS_DataReader_read_or_take_untypedl
libndsscpp2.so: !rti::sub::UntypedDataReader::read_or_take_untyped(...)

This is a minimal test case, on a low CPU usage scenario with very low DDS traffic. The process is running on Linux with "CPUSchedulingPolicy": "rr" and "CPUSchedulingPriority": "49", so I guess the likelihood for this beeing preempted that long by the OS for doing low priority stuff is low. Seems like a contention scenario inside the library.

Is this normal/expected? Can it be mitigated by using another configuration?

Howard's picture
Offline
Last seen: 1 week 2 days ago
Joined: 11/29/2012
Posts: 618

So, what is the full call stack?  What is the actual call to a DataReader object that is blocking?  What thread is making that call?  And what version of RTI Connext DDS are you using?

Offline
Last seen: 3 years 1 month ago
Joined: 02/08/2021
Posts: 21

You are right! I missed two RTI-related symbols on our stack trace.

libpthread-2.27.so: __pthread_mutex_lock
libnddscore.so !RTIOsapiSemaphore_take
libnddscore.so !REDAWorker_enterExclusiveArea
libnddscore.so !REDACursor_modifyReadWriteArea
libnddscore.so !PRESPsReader_readOrTakel
libnddscore.so !PRESPsReader_read
libnddsc.so !DDS_DataReader_read_or_take_untypedl
libndsscpp2.so: !rti::sub::UntypedDataReader::read_or_take_untyped(...)
[our-binary]: !rti::sub::DataReaderImpl<...>::read_or_take(...)
[our-binary]: !dds::sub::DataReader<[our-type], rti::sub::DataReaderImpl>::ManipulatorSelector::operator>>(...)

We are using 6.0.1. Migrating to 6.1.0 soon.

This program uses heartbeats sent/received by a DDS service to trigger an event loop that does the processing, other DDS sending and receiving, etc. all on the same thread. This same thread is expected to use non-blocking functions (hence this forum thread).

On this special case there is a local thread emulating the heartbeats at a fixed rate by sleeping some milliseconds and then sending (trying to compensate sleep syscall jitter, but not relevant).

So this progam is just two threads, this low frequency heartbeat and the main thread touching DDS.

Howard's picture
Offline
Last seen: 1 week 2 days ago
Joined: 11/29/2012
Posts: 618

Are you using listeners on your DDS Entities?  Specifically, the DataReader?

What do you mean by "DDS service"?

How many DataWriters and/or DataReaders does the application have?

If you have multiple DataReaders, how is the data taken or processed for each DataReader?  In the same thread? Different threads?  Listeners?

By default API calls to DataReader objects created by the same Subscriber object will contend for the same mutex.  When in a DataReaderListener, the DDS internal thread executing the callback code will have taken mutex...and won't release it until the user processing code returns.

 

Offline
Last seen: 3 years 1 month ago
Joined: 02/08/2021
Posts: 21

We don't create subscribers or listeners explicitly. Just DataReaders.

By "DDS service" I wanted to mean a service sendind DDS. In that case the heartbeat thread.

The application uses no threads except the one generating heartbeats. It is an aggregation service that mostly only does DDS, everything is read/written sequentially by using the "read/take" and "write" functions. This is not my group's code, but by looking at the sources there are 27 readers and 47 writers. Some of the readers and writers have the same topic.

As there is little processing, this was thought to be a good service to measure if there are hiccups with DDS.

Just so you can understand the background or why I might have a hard time to answer some questions; we have a base platform, where I work, and other groups that do services for that platform. We started to providing DDS on the platform, no one did really set realtime requirements on the platform, thought about the amount of traffic we generate, etc. Now we are at the stage where it doesn't meet performance requirements and we have to tune it. When people with knowledge of the whole system does profile they often see stack traces like the one above blocking on DDS and points to us.

So for what you are saying on our configuratuion it shouldn't be taking a mutex, readers and writers run sequentially, so they should always find the mutex unclaimed and take it without a syscall. Then I assume the Subscriber communicating with the DomainParticipant doesn't take this lock.

Kind of unrelated: But if pinning a process using Connext to a single core, I guess that there are some configuration tweaks to do, at least if you internally use thread pools or whatever.

Howard's picture
Offline
Last seen: 1 week 2 days ago
Joined: 11/29/2012
Posts: 618

Ok, getting a better picture.  Before we go further, we should clarify the vocabulary being used...  There are "subscribers", "Subscribers", and "DDSSubscribers".  The first two can have various meanings...but the last is the name of a DDS Entity/class...at the same level as "DDSDataReader" aka "DataReader".   The problem is sometimes "subscriber"/"Subscriber" is used to refer to a DataReader object, and other times, it refers to an application that subscribes to data and finally, they could be used to actually refer to a "DDSSubscriber".

So, from what you describe your application only has 2 threads, the main thread and the heartbeat thread.  All the HB thread does is call "DataWriter::write()" on a single DDS Topic periodically.  the main thread will call user installed code that calls DataReader::read()/take() and/or DataWriter::write() calls.  So in the main thread, everything is done sequentially.  (correct me if there's a problem with the description)

Internally, the DDS Participant does start multiple threads.  A "database" thread, an "event" thread, multiple "receive" threads, and possibly publisher threads depending on if features like batching or asynchronous publish_mode are enabled.

One receive thread is used for each socket created.  There are different sockets used for discovery versus user data.  For multicast data versus unicast data.  For shared memory versus UDP.  Data from a single "socket" are processed by a single receive thread.

These receive threads need to take data out of the transport (UDP or sharedmem), deserialize the network back into a data structure, and then store the data in a DataReader's cache...in that process it will take the mutex of the DataReader.

As previously mentioned, all DataReaders created by the same DDSSubscriber object will share a mutex.

Do you know how big the data being received for the different DataReaders?  What's the biggest data?  How fast is it being received?  How much data is being received generally?

Does the occurrence of the delay correlate with high throughput?

If you have pinned the process using Connext DDS to a single core, what is the utilization of that core...does the delay correlate to when the utilization is at or near 100%?

And finally, if you are able to get a stack trace of the thread calling the DataReader()::read() when read() is taking a long time, can you get the stack traces of all of the other threads at that same time?  Maybe we can figure out what other thread is holding the mutex and what it's doing at that time?

 

Offline
Last seen: 3 years 1 month ago
Joined: 02/08/2021
Posts: 21

You understood correctly the program.

I don't have control or track of the amounts of the data requested. This is a test system where people does R&D so it constantly changes. I'm looking into instrumenting our code to be able to track what data rates are being requested.

It is not me doing the measurements, I'll ask that, but if I recall correctly it seems to me that the only profiler available now is NVTX (from Nvidia). I don't know the tool but on the screenshot I got I don't see RTI's threads.

So my takeaway is that on that program, the only thing able to cause the stack trace we got is the Reader colliding with the thread handling the Socket (Subscriber I guess).

Thanks for your help!

PS: Maybe on a future version the Reader could use queue(s) with no mutexes (SPSC) for Reader-Subscriber communication, but DDS has to support so many features that maybe more effective designs are undoable.

Howard's picture
Offline
Last seen: 1 week 2 days ago
Joined: 11/29/2012
Posts: 618

Well, the problem is certainly that the underlying DDS engine is multithreaded...and there's no getting around having to have mutexes if you wanna share resources (e.g., data structures) between threads. 

The DDSSubscriber object actually isn't an active/dynamic element in DDS.  The receive threads are owned by the Domain Participant...and will service all incoming network packets across all DataReaders of all Subscribers.

If you are on Linux, you should be able to a variety of opensource tools for profiling (although, blocking on a mutex shouldn't show up as taking much CPU time) and probing, like gprof or perf...

https://en.wikipedia.org/wiki/List_of_performance_analysis_tools

BTW, issues like this can be submitted to RTI's Support team by customers with a support contract.  I think that you fall into that category.  However, they'll be asking the same questions and needed the same info to help identify the root cause of the problem...

Offline
Last seen: 3 years 1 month ago
Joined: 02/08/2021
Posts: 21

I guess that my colleagues are aware of the Linux profiling tools, but are not using them for some reason. Long term we want to use LTTNG but it requires some integration.

The reasons you mention ares exactly why I didn't contact support. We need to do some homework on our side to be able to profile this system better. Otherwise it would be a waste to call support/consultants.

PS: On my previous message I was meaning non-blocking algorithms, of course multithreaded. But anyways I assume your colleagues know what they are doing.