Hi
I have a system with 15 participants spread across 10 workstations. Every participant subscribes to either 1000 or 500 topics. I am not using keyed topics and I cannot switch to them. One of the participants also publishes the same 1000 topics.
I have a test where I push a button, the publisher updates each of the 1000 topics twice, and I measure how long it takes for all the subscribers to get all the samples.
I am using RELIABLE_RELIABILITY_QOS and KEEP_LAST_HISTORY_QOS with a depth of 2.
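As a rough mental model of the history setting above, the sketch below keeps only the last N samples of one unkeyed topic. `KeepLastHistory` is a hypothetical class written for illustration and is not how Connext DDS implements its queues. In this toy model, a test that writes exactly two updates per topic fits in a depth of 2 without dropping anything, and a third update would push out the oldest one.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model of a KEEP_LAST history for a single unkeyed topic.
// Hypothetical class for illustration only, not the RTI implementation.
class KeepLastHistory {
public:
    explicit KeepLastHistory(std::size_t depth) : depth_(depth) {
        assert(depth > 0);
    }

    // Stores a sample. Returns true if an older sample had to be dropped
    // to make room, false otherwise.
    bool push(const std::string& sample) {
        bool dropped = false;
        if (samples_.size() == depth_) {
            samples_.pop_front();
            dropped = true;
        }
        samples_.push_back(sample);
        return dropped;
    }

    std::size_t size() const { return samples_.size(); }
    const std::string& oldest() const { return samples_.front(); }

private:
    std::size_t depth_;
    std::deque<std::string> samples_;
};
```

Constructing `KeepLastHistory h(2)` and calling `push` three times keeps the second and third samples; only the third call reports a drop.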
The issue that I am seeing is that sometimes the datareader / subscriber in the same application as the datawriter / publisher is very slow to get all the samples. All other participants, which run on other workstations, receive the samples within about 1 second, but the local reader can take up to 15 seconds to get all the updates. The local reader and writer both use the same participant instance.
It will not happen every time I execute the test but only about 1 out of 5 times.
When it happens, I can see that most of the updates arrive quickly as I would expect and then 1 to 40 of the updates get delayed. They will then drop in slowly every few seconds.
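One way to pin down which updates are the late ones is to record, per topic, how long after the button press the last sample arrived, and then list the topics above a threshold. The sketch below is a minimal helper under that assumption. `ArrivalTracker` and its methods are hypothetical names and not part of the Connext DDS API, and the caller is assumed to supply the elapsed time itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: remembers the latest arrival time seen for each topic.
class ArrivalTracker {
public:
    using Millis = std::chrono::milliseconds;

    // Call once per received sample with the topic name and the time
    // elapsed since the test was started.
    void record(const std::string& topic, Millis elapsed) {
        assert(elapsed.count() >= 0);
        Millis& latest = latest_[topic];  // value-initialized to 0 ms on first use
        if (elapsed > latest) {
            latest = elapsed;
        }
        ++count_;
    }

    // Topics whose last sample arrived later than the threshold,
    // in alphabetical order.
    std::vector<std::string> delayed(Millis threshold) const {
        std::vector<std::string> result;
        for (const auto& entry : latest_) {
            if (entry.second > threshold) {
                result.push_back(entry.first);
            }
        }
        return result;
    }

    std::size_t samples() const { return count_; }

private:
    std::map<std::string, Millis> latest_;
    std::size_t count_ = 0;
};
```

Calling `delayed(ArrivalTracker::Millis(1000))` after a run would return only the topics that took longer than one second, which makes it easier to see whether the same topics are affected each time.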
The CPU load will spike when I push the button but quickly drop and stay low until the samples are received.
To try and narrow down what is happening I have tried to reduce the number of participants. It looks like this reduces the likelihood but I still see the issue with just 2 participants. However, the delay that I see is also smaller.
I have added checks in the code that verify that my application never spends more than 100 ms in on_data_available. I did this to make sure that it is not my code that is creating the delay.
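A check of this kind can be sketched as a small scope guard that is created at the top of the callback and counts every invocation that runs past a limit. This is only an illustration of the idea, not the code from the attached application. `CallbackGuard` and the counter passed to it are hypothetical names.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical RAII guard: increments a counter when the enclosing scope
// (for example the body of on_data_available) runs longer than the limit.
class CallbackGuard {
public:
    using Clock = std::chrono::steady_clock;

    CallbackGuard(std::chrono::milliseconds limit, int& slow_count)
        : limit_(limit), slow_count_(slow_count), start_(Clock::now()) {
        assert(limit.count() > 0);
    }

    // The elapsed time is evaluated when the guard goes out of scope.
    ~CallbackGuard() {
        if (Clock::now() - start_ > limit_) {
            ++slow_count_;
        }
    }

    CallbackGuard(const CallbackGuard&) = delete;
    CallbackGuard& operator=(const CallbackGuard&) = delete;

private:
    std::chrono::milliseconds limit_;
    int& slow_count_;
    Clock::time_point start_;
};
```

Placing `CallbackGuard guard(std::chrono::milliseconds(100), slow_callbacks);` as the first statement of the callback would leave `slow_callbacks` at zero as long as no invocation exceeds 100 ms. If the callback can run on several threads at once, the counter would need to be atomic.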
1) But I would like to know if there is something else I can check?
I have found that if I configure the writer so that a heartbeat is sent with every sample and I delay the heartbeat response, the issue does not appear. This is shown in the profile named DDLQoS_V2 in the attached USER_QOS_PROFILES.xml.
2) What I would like to understand is why this might happen?
3) Is there any reason why the heartbeat and acknack should be affecting local communication within a participant?
4) How does the sending of a sample from datawriter to datareader in the same participant work?
5) Is there something else I can try in order to understand the issue?
Thank you very much for any input
UPDATE:
I have created a small application, SpeedTest, based on the HelloWorld example, that demonstrates the issue. It is attached. I tested it with 10 subscribers on 10 workstations, and that showed times from 500 ms to 10000 ms.
/Kennet
Attachment | Size |
---|---|
user_qos_profiles.xml | 1.47 KB |
speedtest.zip | 23.12 KB |
Hi Kennet,
I will try to reproduce the behavior you are seeing with the attachment that you included in the post. In the meantime, I would like to share some information that could be relevant in your case.
When you have more than one DomainParticipant attached to the same domain on the same machine, Connext DDS will attempt to use shared memory to communicate. Connext DDS allocates a limited shared memory segment (which you can configure) that is shared by all the DataReaders and DataWriters in the same domain (that belong to the same DomainParticipant).
If you are sharing a lot of information, as you do by sending two samples on each of 1000 topics, these resources may be exhausted. When this happens in a reliable scenario, the behavior could be what you describe: very slow at the beginning because there is no space to store new sample information. Once the system manages to process all the accumulated samples, it returns to normal.
You could check whether that is the cause by increasing the log level and saving the log to a file (https://community.rti.com/howto/useful-tools-debug-dds-issues), then looking for something like this in the publisher log:
I hope this helps,
Irene
Hi Irene
Thanks for the feedback.
Just to be clear, I do not have two DomainParticipants on the same machine. The publisher and the subscriber use the same DomainParticipant.
If I enable logging at the warning level, I do not see any warnings when the delays occur.
If I enable logging at the all level, there is so much logging that the application is slowed down all the time, so the "slow" issue is hard to spot.
This is the output I get (I added a bit more information on which topics are delayed than in the code I attached):
HelloWorld publisher sending (28)
Received number 2000 on Hello World! (28) after 764 ms
HelloWorld publisher sending (29)
Received number 1935 on SpeedTest.0309 after 6536 ms Thread 5852
Received number 2000 on Hello World! (29) after 7020 ms