Hello,
We recently had an issue with a DDS sample that was published by a Writer but never received by the Reader.
We have two processes: ExamManager (PID: 55055) and PlatformClient (PID: 35435). ExamManager is the writer for two topics: closeExam and ReconRequest. PlatformClient reads those topics. PlatformClient has a single DomainParticipant, with one subscriber to which all the readers are attached and one publisher to which all the writers are attached. ExamManager has two DomainParticipants (one for large data and one for all the other topics, such as ours). For each DomainParticipant, ExamManager has one subscriber and one publisher. The QoS profiles and the Domain ID match between those two components for all the topics that they share.
The issue is that the ReconRequest sample was published by ExamManager but never received by PlatformClient. Both sides use the same QoS profile, named "SwApps.LastValueCache" (see the attached QoS file).
At first we suspected a discovery issue between those two components, but about 4 minutes after the sample was lost (see the timing summary below), ExamManager published a sample on the topic "closeExam" and that message was successfully received by PlatformClient.
Please find the RTI log file attached (sorry, it is only at Warning level). The lost message is the following:
requestKey:
: "49aa840d-7ea9-4efe-91c3-46093839dac3"
params:
appType:
ApplicationType: 2
algo:
AlgoID: 1
reconType:
ReconType: 1
projDir:
: "/export/home1/sdc_image_pool/images/p82/e84/s1512"
Here is a summary of the timing:
- At 10:48:37.381, ExamManager posted a message on the "ReconRequest" topic. Nothing arrived at PlatformClient.
- At 10:52:56.288, ExamManager posted a message on the "closeExam" topic and PlatformClient received it at 10:52:56.288.
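For reference, the gap between the two publications in the summary above can be computed directly from the timestamps. This is only a small sketch; it assumes both timestamps were taken on the same day and from the same clock, and the variable names are ours for illustration.

```python
from datetime import datetime

# Log timestamps copied from the timing summary (hours:minutes:seconds.millis).
FMT = "%H:%M:%S.%f"

recon_sent = datetime.strptime("10:48:37.381", FMT)       # ReconRequest published
close_exam_sent = datetime.strptime("10:52:56.288", FMT)  # closeExam published

# Assumption: same day, same clock, so a plain subtraction is meaningful.
gap = close_exam_sent - recon_sent
gap_seconds = gap.total_seconds()

print(f"Gap between the two publications: {gap_seconds:.3f} s")
```

Under those assumptions the gap comes out to a little over four minutes.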
We also observed a strange behavior that more generally affects our components that have readers. The take method is triggered and returns DDS_RETCODE_NO_DATA. From what I saw in your documentation, this is not necessarily an error, but I could not understand why the take is triggered in that case. The mask we use for our Readers is ANY_SAMPLE_STATE. Oddly, it happened in PlatformClient at 10:48:37.402, which is about 20 milliseconds after the ReconRequest that should have been read was sent. At first we thought this would give us a hint about our issue, but as it happens frequently, we think it may just be a coincidence. Can you confirm this?
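To make the question about DDS_RETCODE_NO_DATA more concrete, here is a toy model of one scenario we are wondering about: one notification is queued per arriving sample, but the first take drains everything, so a later notification finds the queue empty. This is purely an assumption on our side and not confirmed for our case. `ToyReader`, `RetCode` and `run_callbacks` are hypothetical names invented for this sketch; none of this is the RTI API or our actual code.

```python
from collections import deque
from enum import Enum


class RetCode(Enum):
    OK = 0
    NO_DATA = 1  # stands in for DDS_RETCODE_NO_DATA


class ToyReader:
    """Hypothetical stand-in for a DataReader; NOT the RTI API."""

    def __init__(self):
        self._queue = deque()
        self.pending_notifications = 0

    def deliver(self, sample):
        # Assumption: each arriving sample queues one "data available" notification.
        self._queue.append(sample)
        self.pending_notifications += 1

    def take(self):
        # Models a take with no sample limit: it drains everything available.
        if not self._queue:
            return RetCode.NO_DATA, []
        samples = list(self._queue)
        self._queue.clear()
        return RetCode.OK, samples


def run_callbacks(reader):
    """Run one take per pending notification and record each result."""
    results = []
    while reader.pending_notifications > 0:
        reader.pending_notifications -= 1
        results.append(reader.take())
    return results


reader = ToyReader()
reader.deliver("sample-1")
reader.deliver("sample-2")  # arrives before the first callback has run

results = run_callbacks(reader)
for code, samples in results:
    # NO_DATA is treated as "nothing to do", not as an error.
    print(code.name, samples)
```

In this model the first take returns both samples and the second returns NO_DATA without any sample having been lost. If something like this is what happens in the middleware, the NO_DATA at 10:48:37.402 would be unrelated to the missing ReconRequest.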
This is the first time we have lost a sample for this topic. Moreover, it happened only once and is not reproducible, or at least we have not yet understood the scenario that triggers it.
Do you have any idea what may have happened? If you need further information, do not hesitate to contact us.
Thank you,
Edouard Squillaci
Attachment | Size |
---|---|
RTI traces | 675.62 KB |
QoS file | 8.02 KB |