Discovery with Many DomainParticipants

Hello,

We are attempting to build a system with a (relatively) large number of processes and, hence, a relatively large number of DomainParticipants.

At the moment, we are looking at about 200 processes running on a single box, but we expect it to expand to about 5000 processes running over 20 or more boxes.  Each box is a reasonably beefy 24-core 2.67GHz Intel-based Linux server.

Starting 200 processes on a box like this should not present any problems.  Indeed, when we disable DDS, they start fine and take only a second or two to complete their initialisation.

However, when each of these processes attempts to create a DomainParticipant, we have problems.

We have already discovered the 120-participants-per-node limit and have adjusted the domain_id_gain QoS setting to accommodate the number we need to run.

When we attempt to create all 200 participants at almost the same time, the whole node seems to become completely unresponsive for many tens of seconds.

We have tried to stagger the participant creation by adding small delays (anything up to 200 msec) prior to participant creation in each application.  This certainly improves performance noticeably, but the whole process still takes much longer than we expect and is still quite unreliable.

The improvement that we notice makes me think that there is some sort of discovery-induced message storm occurring.  Is this a possibility, or is something else going on?

More importantly, is there a better solution than inserting these delays to stagger the participant creation?

Thanks in advance

Dallas

PS: We are running RTI V5.1.0 on Red Hat Enterprise Linux V6.  Applications use the C API: gcc version 4.4.7

Hello Dallas,

I am guessing a bit here but one thing does come to mind.  See below.

When you start a DomainParticipant, the system tries to find an available IP port to listen on. Since the port numbers are specified by the RTPS protocol, I am guessing the DDS DomainParticipant does an incremental search starting from the first port assigned to that domain ID. If the port is busy it tries the next port, and so on until it finds a free one.
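To make the port assignment concrete, here is a minimal sketch of the well-known unicast port mapping from the RTPS specification (section 9.6.1), using the specification's default parameters (port base 7400, domain ID gain 250, participant ID gain 2, offsets 10 and 11). The function and constant names are made up for this illustration and are not part of any DDS API; a real deployment may also override these defaults in QoS.

```c
#include <assert.h>

/* Default RTPS well-known port parameters (RTPS spec, section 9.6.1).
 * The names below are hypothetical, chosen only for this sketch. */
enum {
    RTPS_PORT_BASE           = 7400, /* PB */
    RTPS_DOMAIN_ID_GAIN      = 250,  /* DG */
    RTPS_PARTICIPANT_ID_GAIN = 2,    /* PG */
    RTPS_OFFSET_D1           = 10,   /* metatraffic unicast offset */
    RTPS_OFFSET_D3           = 11    /* user unicast offset */
};

/* Unicast port used for discovery (metatraffic):
 * PB + DG * domainId + d1 + PG * participantId */
int rtps_metatraffic_unicast_port(int domain_id, int participant_id)
{
    assert(domain_id >= 0 && participant_id >= 0);
    return RTPS_PORT_BASE
         + RTPS_DOMAIN_ID_GAIN * domain_id
         + RTPS_OFFSET_D1
         + RTPS_PARTICIPANT_ID_GAIN * participant_id;
}

/* Unicast port used for user data:
 * PB + DG * domainId + d3 + PG * participantId */
int rtps_user_unicast_port(int domain_id, int participant_id)
{
    assert(domain_id >= 0 && participant_id >= 0);
    return RTPS_PORT_BASE
         + RTPS_DOMAIN_ID_GAIN * domain_id
         + RTPS_OFFSET_D3
         + RTPS_PARTICIPANT_ID_GAIN * participant_id;
}
```

With these defaults, participant 0 in domain 0 maps to ports 7410 and 7411, participant 1 to 7412 and 7413, and so on. Participant 120 in domain 0 would map to 7650, which is already the start of the port range for domain 1 (7400 + 250), which is consistent with the per-node limit mentioned in the question.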

The test of whether the port is free uses the "bind()" system call to try to bind to that port. This system call could be slow. To make matters worse, as more and more ports are busy, the "linear" search causes the total number of "binds" to increase quadratically. For example, after you have started 50 processes the 51st will try to bind unsuccessfully 50 times before it finds a free port. The 100th process will fail binding 99 times before it finds a port, and so on.
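The quadratic growth can be checked with a small simulation. This is only a model of the linear search I am guessing at, not RTI's actual implementation: it assumes processes start one at a time, each one needs a single port, and each scans upward from the same first port. The function name and the array size are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SIMULATED_PORTS 8192

/* Simulates num_processes processes starting one after another.
 * Each process scans the ports from index 0 upward and takes the
 * first free one. Returns the total number of failed bind attempts
 * across all processes. */
long simulate_failed_binds(int num_processes)
{
    bool port_busy[MAX_SIMULATED_PORTS] = { false };
    long failed = 0;

    assert(num_processes >= 0 && num_processes <= MAX_SIMULATED_PORTS);

    for (int p = 0; p < num_processes; ++p) {
        int port = 0;
        while (port_busy[port]) {   /* models a bind() that fails */
            ++failed;
            ++port;
        }
        port_busy[port] = true;     /* models the bind() that succeeds */
    }
    return failed;
}
```

Under these assumptions the k-th process fails k - 1 times, so N processes fail N(N - 1)/2 times in total: 1275 for 51 processes and 19900 for 200. In practice a participant opens more than one port, so the real count would be higher than this model suggests.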

There is a configuration parameter that would avoid this search. This is the participant_id which can be configured in the WireProtocolQosPolicy of the DomainParticipant.

To take advantage of this feature, you would need to assign each DomainParticipant a different participant_id. That way the system would only try to bind to that port, and it would find it free (as long as each DomainParticipant really does have a different participant_id).

Gerardo

Thank you Gerardo for your prompt reply.

I will try setting explicit participant_id values and will post the results here.

Dallas

Hmmm...

Unfortunately, setting explicit participant_id values did not seem to help.

Interestingly, similar machine lockup symptoms are observed when all of the applications are stopped at much the same time (using pkill).  They all have exit handler routines that attempt to shut everything down and exit gracefully.

When the machine locks up, we often see a message like this in a terminal session:

Message from syslog@server003 at Sep 11 05:54:33 ...
 kernel:BUG: soft lockup - CPU#4 stuck for 72s! [rs:main Q:Reg:1962]

The date, the CPU#, and the length of time it is stuck change each time.

Any other ideas?