Why is the CPU usage of the rCo**Rcv thread abnormally high?
- I created 2 participants with domain IDs 1 and 21. I found that the CPU usage of the DDS internal threads (rCo**Rcv) was abnormally high, even though nothing was being sent or received at the time.
- After the program is restarted, CPU usage returns to normal.
- As shown in the figure
How long are the receive threads (rCoxxxRcv) using the CPU? Are there other participants on other machines that are in the same domain?
In any case, on startup, DDS must send out packets to complete discovery. So even if you're not sending user data, DDS itself will be using the network to discover other DDS applications. The total amount of traffic, as well as the CPU usage for generating and processing the network packets (including packets going through the shmem transport), will depend on the number of DomainParticipants and the number of DataWriters/DataReaders created by each participant.
But after the discovery phase is over, CPU usage should drop to whatever is used by the application itself. Discovery normally takes only a few seconds in a small system, but can take longer in larger systems.
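One way to answer the question of how long each receive thread has been using the CPU is to read the per-thread counters that Linux exposes under /proc. The sketch below assumes a Linux host (the thread does not say which OS is in use). The helper names are made up for illustration and are not part of Connext DDS.

```python
import os


def parse_task_stat(stat_line):
    """Parse one /proc/<pid>/task/<tid>/stat line.

    Returns (thread_name, utime_ticks, stime_ticks).
    """
    # The thread name sits between the first '(' and the last ')'; it may
    # itself contain spaces or parentheses, so do not split on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    name = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 2:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return name, int(fields[11]), int(fields[12])


def thread_cpu_seconds(pid):
    """Return {tid: (thread_name, cpu_seconds)} for every thread of a process."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = "/proc/{}/task".format(pid)
    result = {}
    for tid in os.listdir(task_dir):
        try:
            with open("{}/{}/stat".format(task_dir, tid)) as stat_file:
                name, utime, stime = parse_task_stat(stat_file.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were iterating
        result[int(tid)] = (name, (utime + stime) / ticks_per_second)
    return result
```

Calling thread_cpu_seconds() twice a few seconds apart and subtracting the values shows which threads are consuming CPU right now, as opposed to threads that were only busy during startup.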
Thanks for your reply
1. The whole environment:
- It is a simple case: there is only one machine, and on this machine there are three processes in the same domain. All processes use the default participant QoS.
- Under normal conditions, the CPU usage is always less than 4%.
- Under abnormal conditions, the CPU usage is as shown in the figure above. It does not drop back to the normal level even after a few hours. The CPU usage only returns to 4% after the process is restarted; otherwise it stays high.
The abnormal condition occurs only occasionally. I want to know under what circumstances this function (RTINetioReceiver_receiveFast), which belongs to the rCoxxxRcv thread, would take up such a high proportion of the CPU.
The receive thread is usually blocked on a socket (UDP) or shared memory queue and processes packets received from other DomainParticipants.
On a single system, by default, DDS will use shared memory to communicate between applications on the same host.
The receive thread also calls and executes the on_data_available() callback on the DataReaderListener. So if user code is taking a long time to run, or is in an infinite loop, that will show up as the receive thread taking up all of the CPU.
I would check to see what data is being sent and processed during abnormal conditions.
I would modify the DomainParticipant QoS (for all participants) to use only the builtin UDPv4 transport, and then use Wireshark on the loopback interface to capture the packets during an abnormal condition. That should tell you what kind of packets are being sent and received during this time, and even for which topics if your Wireshark capture includes the discovery traffic from when the participant was started.
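As a rough sketch of the QoS change described above, an XML QoS profile along these lines restricts the builtin transports to UDPv4. It assumes the applications load their QoS from an XML profile file (for example USER_QOS_PROFILES.xml). The library and profile names are made up, and tag names should be checked against the Connext version in use.

```xml
<dds>
  <!-- "DebugLibrary" and "UdpOnlyProfile" are hypothetical names -->
  <qos_library name="DebugLibrary">
    <qos_profile name="UdpOnlyProfile" is_default_qos="true">
      <participant_qos>
        <transport_builtin>
          <!-- Enable only UDPv4; this leaves out the shared memory transport -->
          <mask>UDPv4</mask>
        </transport_builtin>
      </participant_qos>
    </qos_profile>
  </qos_library>
</dds>
```

With shared memory disabled, traffic between participants on the same host should go over UDP, where a capture on the loopback interface can see it.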
- It can be determined that the on_data_available() callback is not entered, because this can be clearly seen from the call stack, as shown in the figure.
- I'll try your advice: "modify the DomainParticipant QoS (for all participants) to use only the builtin UDPv4 transport, and then use Wireshark on the loopback interface to capture the packets during an abnormal condition". After that, I can confirm whether the high CPU consumption is caused by receiving and sending.
- It will take a while for this abnormal situation to recur, and I will post the results as soon as I have them.
Hi Howard:
We found that the reason for the high CPU usage is that a semaphore used by DDS was deleted.
- Have you ever encountered a similar situation? Do you have any suggestions?
Can you provide more details of what you found?
The Connext Rcv thread is usually blocked on a system UDP socket call to receive messages. When there are no UDP packets to be processed, it should be blocked and sleeping, waiting for the OS to wake it up when a packet arrives. All other semaphores used by the Rcv thread are mutexes designed to protect access to shared resources.
The Rcv thread doesn't directly depend on a semaphore to keep it from running in an infinite loop. It depends on the recvfrom() call (or similar) provided by the OS to block it when there are no messages to process.
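To make the difference concrete, here is a small sketch that compares the CPU cost of a receive call that really blocks with one that returns immediately and gets retried in a loop. It uses a plain UDP socket from the Python standard library and is only an illustration of the general behavior, not of Connext internals. The function name is made up for this example.

```python
import socket
import time


def measure_receive(blocking, duration_s=0.2):
    """Wait on an idle UDP socket; return (wall_seconds, cpu_seconds)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))  # ephemeral port that nobody sends to
    if blocking:
        sock.settimeout(duration_s)  # the OS puts the caller to sleep
    else:
        sock.setblocking(False)      # recvfrom() returns immediately
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    deadline = wall_start + duration_s
    try:
        while time.monotonic() < deadline:
            try:
                sock.recvfrom(65536)
            except socket.timeout:
                break      # blocking case: slept until the timeout expired
            except BlockingIOError:
                continue   # non-blocking case: nothing to read, retry at once
    finally:
        sock.close()
    return time.monotonic() - wall_start, time.process_time() - cpu_start


if __name__ == "__main__":
    print("blocking (wall, cpu):", measure_receive(True))
    print("spinning (wall, cpu):", measure_receive(False))
```

In the blocking case the CPU time should stay close to zero while the wall-clock time is about duration_s. In the spinning case the CPU time should approach the wall-clock time, which is the pattern a thread shows when the call that is supposed to block it keeps returning right away.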
So, a few questions:
1) How did you determine that a semaphore was deleted?
2) Was the semaphore actually deleted, or is it just corrupted or somehow not blocking threads?
3) How do you know that the Connext Rcv thread uses that semaphore? Do you have a call stack that shows Connext calling the semaphore that was "deleted"?
4) On which operating system (and version) and with which version of Connext DDS are you seeing this problem?
5) When did this problem start occurring? Is it occurring in an application that used Connext DDS successfully in earlier versions of the application? If so, what changed in your application? Can you revert to a version of your application that works and then slowly apply changes until it doesn't work?
I have seen objects allocated by Connext DDS (like semaphores) get corrupted by application code through memory overwrites and similar bugs. In one case, a customer insisted that there was a bug in Connext DDS until our analysis of their application code showed that they were overwriting a string that they had allocated, and the memory being overwritten actually held an object that Connext DDS had allocated.