High CPU usage after SHMEM resources are removed by a third party

The shared memory built-in transport in RTI Connext uses IPC segments and semaphores under the hood to perform DDS communication between DomainParticipants. These IPC resources must not be destroyed while RTI Connext applications are using them; if they are, the applications may exhibit unexpected behavior, such as elevated CPU usage in RTI Connext receive threads.
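As a reference for the examples below, the System V IPC resources present on a Linux system can be listed with the standard ipcs utility (part of util-linux); the exact output format varies by distribution:

ipcs -m   # shared memory segments
ipcs -s   # semaphore arrays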

There are, however, situations where the IPC resources being used by RTI Connext applications can be inadvertently removed by the system or by a third-party application.

On Linux, one common scenario in which the system removes IPC resources is a user logging out, whether intentionally or not. Out of the box, the OS reclaims some of the system resources held by that user, such as IPC segments and semaphores, which can leave still-running RTI Connext applications in an inconsistent state. This is more likely to occur, for example, when RTI Connext applications are running as a systemd service.
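The removal described here is triggered when the last of the user's sessions ends. On systemd-based distributions you can inspect active sessions with loginctl, and also check whether "lingering" is enabled for the user; a lingering user keeps a user manager running after logout and is typically not considered fully logged out:

loginctl list-sessions                          # all active login sessions
loginctl show-user "$USER" --property=Linger    # whether the user lingers after logout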

This cleanup behavior is controlled by the systemd-logind configuration file, usually located at /etc/systemd/logind.conf. Within this configuration we find the RemoveIPC setting, whose behavior is documented as follows:

Controls whether System V and POSIX IPC objects belonging to the user shall be removed when the user fully logs out. Takes a boolean argument. If enabled, the user may not consume IPC resources after the last of the user's sessions terminated (...). Defaults to "yes".

With this default configuration, when the user logs out of the system (e.g., from a connected SSH session), the IPC resources owned by that user are deleted. Any RTI Connext applications still running as that user (e.g., as a systemd service) will then run into inconsistencies such as high CPU usage.

If you determine this is your case (see the “Identifying the problem” section below), we recommend setting RemoveIPC=no.
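As a minimal sketch on a typical systemd-based distribution, the change consists of editing the logind configuration and restarting systemd-logind (or rebooting) for it to take effect:

# In /etc/systemd/logind.conf, set (requires root):
#   RemoveIPC=no
# Then apply the change:
sudo systemctl restart systemd-logind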

Note: Currently, RTI Connext does not detect the unexpected removal of in-use IPC segments/semaphores at runtime. RTI has an RFE to prevent CPU spinning when shared memory resources are cleaned up by third parties (ID CORE-13852). In any case, this is a non-recoverable situation affecting the shared memory transport.

Identifying the problem

It is possible to confirm whether the high CPU usage observed in RTI Connext threads is caused by the deletion of the IPC resources that your RTI Connext applications are using. To illustrate this, the following example forces the deletion of the IPC resources used by an RTI Connext application, such as rtiddsping, while it is still running.

#1 Run rtiddsping. We will use the "-verbosity 5" command-line parameter to generate high-verbosity logs:

$NDDSHOME/bin/rtiddsping -verbosity 5 > rtiddsping.log

#2 Identify the keys of the IPC semaphores used by the DomainParticipant. These can be obtained from the logs:

RTIOsapiSharedMemoryBinarySemaphore_attach:attached key 0X801CF4
(...)
RTIOsapiSharedMemoryBinarySemaphore_attach:attached key 0X801CF5
(...)
RTIOsapiSharedMemoryMutex_attach:attached key 0XB01CF4
(...)
RTIOsapiSharedMemoryMutex_attach:attached key 0XB01CF5       

Note: The keys may differ depending on the participant_id and domain_id of your DomainParticipants.
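Assuming the log format shown above, these lines can be pulled directly out of the generated log file:

grep "attached key" rtiddsping.log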

#3 Simulate an unexpected scenario by manually deleting the in-use semaphores. This can be done with the “ipcrm” Unix command:

for semkey in 0X801CF4 0X801CF5 0XB01CF4 0XB01CF5; do ipcrm -S "$semkey"; done
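Note that “ipcrm -S” removes a semaphore set by its key. Alternatively, a semaphore set can be removed by its identifier, as listed by “ipcs -s”, using the lowercase option (<semid> below is a placeholder):

ipcrm -s <semid>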

#4 Using the Linux “top” command, notice the increase in CPU usage:

  PID COMMAND      %CPU
28955 rtiddsping  162.8
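To monitor only the affected process, top can be restricted to a single PID (28955 in the output above):

top -p 28955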

#5 Check with “ipcs -a” that the IPC semaphores for the DomainParticipant (0X801CF4, 0X801CF5, 0XB01CF4, and 0XB01CF5) have been removed from the system:

Semaphores:
T   ID        KEY         MODE         OWNER
s   2686985   0x610d5ec4  --ra-ra-ra-  <user>
s   131085    0x00b01cf8  --ra-ra-ra-  <user>
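Equivalently, you can filter the ipcs output for the keys obtained in step #2 (these key values are specific to this example); no output confirms that the semaphores are gone:

ipcs -s | grep -iE "801cf4|801cf5|b01cf4|b01cf5"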

#6 Confirm from the generated logs that RTI Connext receive threads for shared memory are spinning on the deleted semaphores without receiving any bytes:

NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv blocking on 0X1CF4
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv received 0 bytes
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv received 0 bytes
COMMENDActiveFacadeReceiver_loop:rCoRTIng##02Rcv disowning receive resource
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv blocking on 0X1CF4
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv received 0 bytes
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv received 0 bytes
COMMENDActiveFacadeReceiver_loop:rCoRTIng##02Rcv disowning receive resource
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv blocking on 0X1CF4
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv received 0 bytes
NDDS_Transport_Shmem_receive_rEA:rCoRTIng##02Rcv received 0 bytes
COMMENDActiveFacadeReceiver_loop:rCoRTIng##02Rcv disowning receive resource
(...)
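A quick way to quantify the spinning, assuming the log messages shown above, is to count the zero-byte receives in the log; this number grows rapidly while the receive threads spin:

grep -c "received 0 bytes" rtiddsping.log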