Threads in different participants try to lock the same mutex when take() samples

5 posts / 0 new
Last post
Offline
Last seen: 6 months 2 weeks ago
Joined: 10/09/2022
Posts: 13
Threads in different participants try to lock the same mutex when take() samples

Hi, we've got the following program using rti_connext_dds-6.1.1.3 on QNX 7.1 x86 system.

The program has 7 threads and each thread creates a participant. Each participant has one subscriber and two publishers. Each subscribers has some DataReaders to subscribe data, and each publisher has some DataWriters to publish data. All the topics subscribed by these DataReaders are on the same host, so they should use shared memory. And most topics published by the DataWriters are also on the same host, other topics are published by UDP to other hosts. Some of the Qos are BuiltinQosLib::Pattern.Streaming and others are BuiltinQosLib::Pattern.Status.

These threads are looping in 10ms. In each loop, they will call  DataReader->take() for each subscribed topics and then do some calculation and publish some data. From the QNX traceevent, it seems that all these threads will try to lock a same mutex when take samples.

If I don't misunderstand the trace event. From the above pic:

1. Thread PatientSafety_4 got the mutex failed, as the mutex owner 0xf028000d was thread ConsoleSafety_Left.

2. Thread ConsoleSafety_Left released the mutex, and thread PatientSafety_3 got the mutex, owner is 0xf0280035

3. Thread PatientSafety_4/SafetyModule/ConsoleSafety_Left acquired mutex failed!

So do DataReaders in different participants share same mutex?

Keywords:
Howard's picture
Offline
Last seen: 1 week 1 day ago
Joined: 11/29/2012
Posts: 567

No, the DataReader::take() API only accesses mutexes local to its own participant (and specifically to its DDSSubscriber object).

The only possible shared MUTEX between participants would be the one used for the system shared memory transport.  But that's only taken by an internal DDS receive thread, allocated by each participant, to take data out of the shared memory transport and store into the DataReader receive queue.  When user calls take(), either the data is already in the receive queue or it isn't.  There's no getting the shared memory mutex to access the shared memory in the context of the take() call.

Can you print the stack trace of those threads when they are trying to take the mutex?

Why do you think that these threads are calling take() when this mutex is being taken?

NOTE:  different DataReaders created by the same DDSSubscriber (and of course in the same Participant) do share a mutex.  So, if one thread calls take() for 1 DataReader, other threads calling take on DataReader that shares a mutex, will block until the other DataReader's take returns.

Are you using DDSDataReaderListener's to take() and process the data?  Or are you using your own threads?  If using a DataReaderListener, then the Subscriber mutex is taken until the on_data_available() call back that you implemented returns.

Offline
Last seen: 6 months 2 weeks ago
Joined: 10/09/2022
Posts: 13

Can you print the stack trace of those threads when they are trying to take the mutex?

We can't print the stack trace as these threads don't take the mutex manually.

Why do you think that these threads are calling take() when this mutex is being taken?

We added some QNX User Events in the loop code and found the QNX  SyncMutexLock() event  when the threads were calling  take() function. See bellow:

Event, Time, Cycle, CPU, Owner, Class, Type, Data

# User event 1: begin call take() for each DataReader
0025, 8us, 0x1953FA0E9, 8, safety Conso#y_RIGHTTh, User, User Event 1, string begin input d0 0x69676562 d1 0x6e69206e d2 0x747570 d3 0x0

0049, 20us, 0x1954034D7, 8, safety Conso#y_RIGHTTh, Kernel Calls, SchedYield Enter,
0050, 21us, 0x195403DDD, 8, safety Conso#y_RIGHTTh, Kernel Calls, SchedYield Exit, ret_val 0x0
0079, 34us, 0x19540DF37, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Enter, id 0 timeout_flags 0x2000 ntime(sec) 0 ntime(nsec) 0 event->sigev_notify 0 event->sigev_notify_ptr 0x0 event->sigev_value 0x0 event->sigev_notify_attributes_ptr 0x0
0080, 34us, 0x19540E267, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Exit, prev_timeout_flags 0x0 otime(sec) 0 otime(nsec) 0

# SyncMutexLock() failed, it seems that call pthread_mutex_trylock() or use QNX kernel timeout to lock mutex with ZERO(0) timeout
0089, 37us, 0x195410932, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock (64) Enter, sync_ptr 0x52a4c6bca0 count 2147483648 owner 0xc02c002b
0090, 38us, 0x19541109C, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock Exit, rc -1 errno 260 (ETIMEDOUT - Connection timed out) ret_val 4294967295
0127, 57us, 0x19541FD76, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Enter, id 0 timeout_flags 0x2000 ntime(sec) 0 ntime(nsec) 0 event->sigev_notify 0 event->sigev_notify_ptr 0x0 event->sigev_value 0x0 event->sigev_notify_attributes_ptr 0x0
0128, 57us, 0x195420090, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Exit, prev_timeout_flags 0x0 otime(sec) 0 otime(nsec) 0

# SyncMutexLock() failed again
0137, 60us, 0x195422A1C, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock (64) Enter, sync_ptr 0x52a4c6bca0 count 2147483648 owner 0xc02c002b
0138, 61us, 0x195422F56, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock Exit, rc -1 errno 260 (ETIMEDOUT - Connection timed out) ret_val 4294967295
0162, 73us, 0x19542C7C0, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Enter, id 0 timeout_flags 0x2000 ntime(sec) 0 ntime(nsec) 0 event->sigev_notify 0 event->sigev_notify_ptr 0x0 event->sigev_value 0x0 event->sigev_notify_attributes_ptr 0x0

# And there are many SyncMutexLock() failed, omitted...
...

# Now SyncMutexLock() success, we found there are 12 SyncMutexLock() success, but this thread only has 4 DataReaders
0423, 185us, 0x195483E7A, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock (64) Enter, sync_ptr 0x52a4c6bca0 count 2147483648 owner 0xc02c000d
0424, 185us, 0x19548437E, 8, safety Conso#y_RIGHTTh, Process and Thread, Mutex, policy_name FIFO pid 4243500 tid 23 priority 240 policy 1 partition 7 sched_flags 0
0883, 402us, 0x19552CB0C, 7, safety Conso#y_RIGHTTh, Process and Thread, Ready, policy_name FIFO pid 4243500 tid 23 priority 240 policy 1 partition 7 sched_flags 0
0912, 411us, 0x19553409A, 8, safety Conso#y_RIGHTTh, Process and Thread, Running, policy_name FIFO pid 4243500 tid 23 priority 240 policy 1 partition 7 sched_flags 0
0913, 411us, 0x195534425, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock Exit, ret_val 0


# Now SyncMutexUnlock() success
0922, 414us, 0x195536B55, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexUnlock (64) Enter, sync_ptr 0x52a4c6bca0 count 2147483648 owner 0xc02c0017
0924, 415us, 0x1955375AF, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexUnlock Exit, ret_val 0
0938, 420us, 0x19553B39E, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Enter, id 0 timeout_flags 0x2000 ntime(sec) 0 ntime(nsec) 0 event->sigev_notify 0 event->sigev_notify_ptr 0x0 event->sigev_value 0x0 event->sigev_notify_attributes_ptr 0x0
0939, 421us, 0x19553B8A7, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Exit, prev_timeout_flags 0x0 otime(sec) 0 otime(nsec) 0


# Next DataReader? SyncMutexLock() failed again...
0948, 424us, 0x19553E2C2, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock (64) Enter, sync_ptr 0x52a4c6bca0 count 2147483648 owner 0xc02c002b
0949, 425us, 0x19553E927, 8, safety Conso#y_RIGHTTh, Kernel Calls, SyncMutexLock Exit, rc -1 errno 260 (ETIMEDOUT - Connection timed out) ret_val 4294967295
0958, 427us, 0x195540AF8, 8, safety Conso#y_RIGHTTh, Kernel Calls, TimerTimeout Enter, id 0 timeout_flags 0x2000 ntime(sec) 0 ntime(nsec) 0 event->sigev_notify 0 event->sigev_notify_ptr 0x0 event->sigev_value 0x0 event->sigev_notify_attributes_ptr 0x0

# There are many other SyncMutexLock() failure and another 11 SyncMutexLock() succss like above.

...


# Now all DataReader take() return.
9975, 4ms 654us, 0x19621E856, 8, safety Conso#y_RIGHTTh, User, User Event 2, string after input d0 0x65746661 d1 0x6e692072 d2 0x747570 d3 0x0

Are you using DDSDataReaderListener's to take() and process the data?  Or are you using your own threads?

We have one DDSDataReaderListener for each DataReader, which do nothing there. And our own threads call take() to receive samples.

 

Howard's picture
Offline
Last seen: 1 week 1 day ago
Joined: 11/29/2012
Posts: 567

I don't quite understand. 

1) why do you have DataReaderListener if you are not doing anything in the listener?  What is the Listener MASK that you are using on the installed listener?  In any case, you should not be using a Listener callback on for the ON_DATA_AVAILABLE status if you're not doing anything in the on_data_available() callback of the function

2) how many threads are calling DataReader::take()?   Is each thread calling take() on a different DataReader object?

3)  How are you instrumenting the take() call?

QNX_user_event(1);

reader->take();

QNX_user_event(2);

like the above?

3) How many threads are calling DataReader::take()?

Are all of the calls to take() being instrumented?

Are all of the take() being instrumented the same way, user event 1, take, user event 2?

Do you see multiple user event 1's printed before you see any user event 2?

 

4) what is the value 0x19621E856 in the following?   What is the thread id?

9975, 4ms 654us, 0x19621E856, 8, safety Conso#y_RIGHTTh, User, User Event 2, string after input d0 0x65746661 d1 0x6e692072 d2 0x747570 d3 0x0

 

Finally, you may find that RTI's support group is better suited to answer your question then this forum...as the support team has access to QNX hosts.

Offline
Last seen: 6 months 2 weeks ago
Joined: 10/09/2022
Posts: 13

Thanks for your kind reply.

1.0) why do you have DataReaderListener if you are not doing anything in the listener?
Because some DataReader use on_data_availble callback to take sample. But these threads don't.

1.1) What is the Listener MASK that you are using on the installed listener?
We inherit from dds::sub::NoOpDataReaderListener and do some logs in all the callback function.

2) how many threads are calling DataReader::take()? Is each thread calling take() on a different DataReader object?
Only one thread calls DataReader::take() in ech participant. And the thread will call take() for all the DataReader objects.

3)  How are you instrumenting the take() call?
Each thread loops like this:

while (1) {
    QNX_user_event(1);
    for_each(DataReader objects) {
        reader->take();
    }

    QNX_user_event(2);

    // do calculation

    // publish some samples

    // wait for next timeout

}

3.1) How many threads are calling DataReader::take()?

There are 7 threads in seperate participants call DataReader::take().

3.2) Are all of the calls to take() being instrumented?

Yes.

3.3) Are all of the take() being instrumented the same way, user event 1, take, user event 2?

Yes.

3.4) Do you see multiple user event 1's printed before you see any user event 2?

Yes. see below.

15237, 7ms 133us, 0x1969AABF7, 7, safety Patie#afety_1Th, User, User Event 5, string after publish d0 0x65746661 d1 0x75702072 d2 0x73696c62 d3 0x68
15272, 7ms 148us, 0x1969B6839, 7, safety Patie#afety_1Th, User, User Event 1, string begin input d0 0x69676562 d1 0x6e69206e d2 0x747570 d3 0x0
15334, 7ms 176us, 0x1969CC3A8, 7, safety Conso#ty_LEFTTh, User, User Event 5, string after publish d0 0x65746661 d1 0x75702072 d2 0x73696c62 d3 0x68
15366, 7ms 191us, 0x1969D7C64, 7, safety Conso#ty_LEFTTh, User, User Event 1, string begin input d0 0x69676562 d1 0x6e69206e d2 0x747570 d3 0x0
15429, 7ms 219us, 0x1969ED44E, 7, safety Patie#afety_2Th, User, User Event 3, string after handler d0 0x65746661 d1 0x61682072 d2 0x656c646e d3 0x72
15462, 7ms 234us, 0x1969F91CE, 7, safety Patie#afety_2Th, User, User Event 4, string after calculate d0 0x65746661 d1 0x61632072 d2 0x6c75636c d3 0x657461
16087, 7ms 528us, 0x196ADE5DB, 7, safety Patie#afety_2Th, User, User Event 5, string after publish d0 0x65746661 d1 0x75702072 d2 0x73696c62 d3 0x68
16094, 7ms 531us, 0x196AE1230, 7, safety Patie#afety_2Th, User, User Event 1, string begin input d0 0x69676562 d1 0x6e69206e d2 0x747570 d3 0x0
19726, 9ms 230us, 0x19700D209, 8, safety Conso#y_RIGHTTh, User, User Event 3, string after handler d0 0x65746661 d1 0x61682072 d2 0x656c646e d3 0x72
19756, 9ms 246us, 0x1970193FA, 8, safety Conso#y_RIGHTTh, User, User Event 4, string after calculate d0 0x65746661 d1 0x61632072 d2 0x6c75636c d3 0x657461
19863, 9ms 304us, 0x197046D8A, 8, safety Conso#y_RIGHTTh, User, User Event 5, string after publish d0 0x65746661 d1 0x75702072 d2 0x73696c62 d3 0x68
21311, 10ms 5us, 0x19726874A, 8, safety Conso#y_RIGHTTh, User, User Event 1, string begin input d0 0x69676562 d1 0x6e69206e d2 0x747570 d3 0x0
25286, 11ms 881us, 0x19781EB3A, 7, safety Patie#afety_1Th, User, User Event 2, string after input d0 0x65746661 d1 0x6e692072 d2 0x747570 d3 0x0
26035, 12ms 237us, 0x1979344CE, 7, safety Conso#ty_LEFTTh, User, User Event 2, string after input d0 0x65746661 d1 0x6e692072 d2 0x747570 d3 0x0
26504, 12ms 460us, 0x1979E23C6, 7, safety Patie#afety_2Th, User, User Event 2, string after input d0 0x65746661 d1 0x6e692072 d2 0x747570 d3 0x0

4) what is the value 0x19621E856 in the following?   What is the thread id?
0x19621E856 is QNX Cycle.
thread id(name) is Conso#y_RIGHTTh