Why is the CPU usage of the rCo**Rcv thread abnormally high?

Offline
Last seen: 2 years 10 months ago
Joined: 11/29/2022
Posts: 8

- I created 2 participants, with domain IDs 1 and 21. I found that the CPU usage of the DDS internal threads (rCo**Rcv) was abnormally high, even though nothing was being received or sent at that moment.

- After the program is restarted, CPU usage returns to normal.

- As shown in the figure (flame graph), the RTINetioReceiver_receiveFast function took a lot of the time.
Howard
Offline
Last seen: 11 hours 23 min ago
Joined: 11/29/2012
Posts: 673

How long are the receive threads (rC0xxxRcv) using the CPU?  Are there other participants on other machines that are in the same domain?

In any case, on startup, DDS must send out packets to complete discovery.  So even if you're not sending user data, DDS itself will be using the network to discover other DDS applications.  The total amount of traffic, as well as the CPU usage for generating/processing the network packets (which include packets going through the shmem transport), will depend on the number of DomainParticipants and the number of DataWriters/DataReaders created by each participant.

But after the discovery phase is over, CPU usage should drop to whatever is used by the application itself.  Discovery normally takes only a few seconds in a small system, but can take longer in larger systems.

Offline
Last seen: 2 years 10 months ago
Joined: 11/29/2022
Posts: 8

Thanks for your reply

1. The whole environment:

- It is a simple case: there is only one machine, and on that machine there are three processes in the same domain. All processes use the default participant QoS.

- Under normal conditions, the CPU usage is always less than 4%.

- Under abnormal conditions, the CPU usage is as shown in the figure above. It does not drop back to the normal level even after a few hours. It only returns to 4% after the process is restarted; otherwise it stays high.

The abnormal condition occurs only occasionally. I want to know under what circumstances this function (RTINetioReceiver_receiveFast), which belongs to the rCoxxxRcv thread, would take up such a high proportion of the CPU time.

Howard
Offline
Last seen: 11 hours 23 min ago
Joined: 11/29/2012
Posts: 673

The receive thread is usually blocked on a socket (UDP) or shared memory queue and processes packets received from other DomainParticipants.

On a single system, by default, DDS will use shared memory to communicate between applications on the same host.

The receive thread will also call and execute the on_data_available() callback on the DataReaderListener.  So if user code is taking a long time to run, or is in an infinite loop, that will show up as the receive thread taking up all of the CPU.

I would check to see what data is being sent and processed during abnormal conditions.

I would modify the DomainParticipant QoS (for all participants) to use only the builtin UDPv4 transport, and then use Wireshark on the loopback interface to capture the packets during an abnormal condition.  That should tell you what kind of packets are being sent/received during this time, and even for which topics, if your Wireshark capture includes the discovery traffic from when the participant was started.

Offline
Last seen: 2 years 10 months ago
Joined: 11/29/2022
Posts: 8

- I can confirm that the on_data_available() callback is not being entered, because this can be clearly seen from the call stack, as shown in the figure.

- I'll try your advice: "modify the DomainParticipant QOS (for all participants) to use only the builtin UDPV4 transport, and then use wireshark on the loopback interface to capture the packets during an abnormal condition". After that, I can confirm whether the high CPU consumption is caused by receiving and sending.

- It will take a while for the abnormal situation to recur; I will post the results promptly when it does.

Offline
Last seen: 2 years 10 months ago
Joined: 11/29/2022
Posts: 8

Hi Howard,

We found the reason for the high CPU usage: the semaphore used by DDS had been deleted.

- Have you ever encountered a similar situation? Do you have any suggestions?

Howard
Offline
Last seen: 11 hours 23 min ago
Joined: 11/29/2012
Posts: 673

Can you provide more details of what you found? 

The Connext Rcv thread is usually blocked on a system UDP socket call to receive messages.  When there are no UDP packets to be processed, it should be blocked and sleeping, waiting for the OS to wake it up when a packet arrives.  All other semaphores called by the Rcv thread are mutexes designed to protect access to shared resources.

The Rcv thread doesn't directly depend on a semaphore to not run in an infinite loop.  It depends on the recvfrom() call (or similar) provided by the OS to block it when there are no messages to process.
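
To illustrate what "blocked in recvfrom()" means, here is a minimal sketch in plain POSIX C. It is not Connext code: the function names open_loopback_udp and receive_packets are made up for this example, and the real receive thread is certainly more involved. It only shows that a receive loop on a blocking UDP socket uses no CPU while idle, and that the loop should stop (rather than retry immediately) if the blocking call starts failing.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a blocking UDP socket bound to 127.0.0.1 on a
 * port chosen by the OS. Returns the socket fd (or -1 on error) and writes
 * the bound address, including the chosen port, to *addr. */
int open_loopback_udp(struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }

    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0; /* 0 = let the OS pick a free port */

    socklen_t len = sizeof(*addr);
    if (bind(fd, (struct sockaddr *)addr, len) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical receive loop: handle up to max_packets datagrams and return
 * how many were handled. On a blocking socket, recvfrom() puts the thread to
 * sleep until a datagram arrives, so an idle loop costs no CPU. */
int receive_packets(int fd, int max_packets)
{
    char buf[1500];
    int handled = 0;

    while (handled < max_packets) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) {
            /* If the blocking call fails and the loop simply retried, the
             * thread would spin at full CPU. Stop instead. */
            perror("recvfrom");
            break;
        }
        handled++; /* a real receiver would process buf[0..n) here */
    }
    return handled;
}
```

Calling receive_packets(fd, 1) on an idle socket just sleeps; it returns only after something sends a datagram to the address filled in by open_loopback_udp().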

So,

1) how did you determine that a semaphore was deleted? 

2) was the semaphore actually deleted or is it just corrupted or somehow not blocking threads?

3) how do you know that the Connext Rcv thread uses that semaphore?  Do you have a callstack that shows Connext calling the semaphore that was "deleted"?

4) on which operating system (version) and which version of Connext DDS are you finding this problem?

5) when did this problem start occurring?  Is it occurring in an application that used Connext DDS successfully in earlier versions of the application?  If so, what changed in your application?  Can you revert to a version of your application that works and then slowly apply changes until it doesn't work?
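
For questions 1) and 2), it may help to see how a deleted semaphore behaves at the OS level. The sketch below assumes the semaphore in question is a System V semaphore on Linux, which has not been confirmed in this thread; the function names sem_wait_once and wait_on_removed_semaphore are made up for the example, and none of this is Connext code. It shows that a wait on a removed System V semaphore set does not block: semop() fails immediately, so any loop that keeps retrying the wait would spin.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Hypothetical helper: try to decrement ("wait on") semaphore 0 of the set
 * identified by semid. Returns 0 on success, or the errno value on failure. */
int sem_wait_once(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };

    if (semop(semid, &op, 1) == 0) {
        return 0;
    }
    return errno;
}

/* Hypothetical demo: create a private set with one semaphore, remove it,
 * then wait on the stale id. Returns the errno from the wait, or -1 if the
 * set could not be created or removed. */
int wait_on_removed_semaphore(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) {
        return -1;
    }
    if (semctl(semid, 0, IPC_RMID) < 0) {
        return -1;
    }
    /* The set no longer exists, so this returns at once with an error
     * (EINVAL or EIDRM) instead of putting the caller to sleep. */
    return sem_wait_once(semid);
}
```

On a live system, ipcs -s lists the System V semaphore sets that currently exist, which is one way to check whether a set has actually disappeared.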

I have seen objects allocated by Connext DDS (like semaphores) get corrupted by application code: memory overwrites, etc.  In one case, a customer insisted that there was a bug in Connext DDS until our analysis of their application code showed that they were overwriting a string that they had allocated, but the memory actually being overwritten held an object allocated by Connext DDS.