Discovery not completing for large numbers of participants publishing and subscribing same topic

Joined: 01/27/2014
Posts: 2
I'm relatively new to DDS, so I apologize ahead of time for any incorrect terminology, but I would appreciate any thoughts on an issue I am having.

I have a set of 73 participants that all publish and subscribe to the same topic. At the very start of execution, all 73 attempt to publish a sample on this topic, but a few of the participants hang while waiting for the readers to pair up with the writer (as reported by get_publication_matched_status). I have logic in the code that waits for the total number of readers to come online, but for a handful of participants this never seems to happen. I have used the RTIAnalyzer to perform match analysis, and it indicates that I have 73 readers and 73 writers, but from the participants' point of view this doesn't seem to be the case. In some cases I have observed a group of, say, 5 participants that are hung: 1 is waiting on 4 more to come online, and the other 4 are each waiting for 1 to come online. So it looks like some sort of race condition where the 4 are waiting on the 1 and the 1 is waiting on the 4. I've tried adding various sleeps and waits, but to no avail. Any suggestions or ideas would be much appreciated.

Thanks,

Tony
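
The waiting logic described in the post above can be sketched with a bounded wait, so that a participant that never sees all of its readers reports how far it got instead of hanging. This is only a minimal sketch, not the poster's code: `get_matched_count` is a hypothetical callable standing in for reading `current_count` from the status returned by get_publication_matched_status, and the timeout and polling interval are assumed values.

```python
import time
from typing import Callable


def wait_for_matched_readers(
    get_matched_count: Callable[[], int],  # hypothetical: returns the writer's current matched-reader count
    expected: int,
    timeout_s: float = 30.0,               # assumed value, tune for your system
    poll_interval_s: float = 0.1,          # assumed value
) -> bool:
    """Poll until `expected` readers are matched or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        matched = get_matched_count()
        if matched >= expected:
            return True
        if time.monotonic() >= deadline:
            # Report the shortfall so the stuck participants can be identified.
            print(f"timed out: matched {matched} of {expected} readers")
            return False
        time.sleep(poll_interval_s)
```

Logging the shortfall on each stuck participant makes it easier to see which peers were never discovered, which is the kind of asymmetry described above.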

 

rip
Joined: 04/06/2012
Posts: 321

Hi Tony,

Can you describe the topology of the graph?  How many systems, how many applications, how many participants?

Regards,

Rip

Joined: 01/27/2014
Posts: 2

The 73 applications are spread across 12 dual-quad-core machines. If I understand the terminology, each application is a separate participant. Each application both subscribes to and publishes the topic I mentioned. We are also running with DDS_RELIABLE_RELIABILITY_QOS, and I have tried adjusting some of the heartbeat settings, but that didn't seem to help.

Thanks,

Tony

Joined: 05/23/2013
Posts: 49

Hi Tony,

Can you please update the initial peers list in your XML file as in the following example? Of course, you need to replace the IP addresses with those of your own machines. The "10@" prefix raises the maximum participant index contacted on each machine to 10, because I guess you may be running more participants on the same machine than the default limit (maximum index 4) covers.

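<participant_qos>
   <discovery>
      <initial_peers>
         <element>10@builtin.udpv4://127.0.0.1</element>
         <element>10@builtin.udpv4://192.168.1.10</element>
         <element>10@builtin.udpv4://192.168.1.11</element>
         <element>10@builtin.udpv4://192.168.1.12</element>
      </initial_peers>
   </discovery>
</participant_qos>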
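
To see why the `N@` prefix in the peer descriptors above matters, here is a rough sketch of the arithmetic. It uses the default RTPS port-mapping constants (port base 7400, domain gain 250, participant gain 2, unicast discovery offset 10) and assumes domain ID 0 and an even spread of the 73 applications over the 12 machines; neither assumption comes from the thread, and the helper names are purely illustrative.

```python
import math
from typing import List

# Default RTPS port-mapping constants.
PORT_BASE = 7400
DOMAIN_ID_GAIN = 250
PARTICIPANT_ID_GAIN = 2
UNICAST_DISCOVERY_OFFSET = 10  # "d1" in the RTPS port formula


def discovery_unicast_port(domain_id: int, participant_index: int) -> int:
    """Unicast discovery port used by one participant index in a domain."""
    return (PORT_BASE + DOMAIN_ID_GAIN * domain_id
            + UNICAST_DISCOVERY_OFFSET + PARTICIPANT_ID_GAIN * participant_index)


def probed_ports(max_participant_index: int, domain_id: int = 0) -> List[int]:
    """Ports contacted on a peer host for indices 0..max_participant_index."""
    return [discovery_unicast_port(domain_id, i)
            for i in range(max_participant_index + 1)]


def apps_per_host(total_apps: int, hosts: int) -> int:
    """Worst-case apps on one host, assuming an even spread (an assumption)."""
    return math.ceil(total_apps / hosts)


if __name__ == "__main__":
    needed = apps_per_host(73, 12)
    for max_index in (4, 10):  # 4 is the default limit, 10 is the suggested one
        reachable = len(probed_ports(max_index))
        print(f"max index {max_index}: {reachable} participants reachable per host, "
              f"{needed} needed, enough={reachable >= needed}")
```

If the applications are not spread evenly, the busiest machine determines the limit you need, so it may be worth counting participants per host before picking the number.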