Hi,
I have come across an odd problem that perhaps someone could shed some light on.
I have a subscriber program written with the traditional API. It implements the on_sample_lost method, which simply prints status.total_count to the terminal, and it uses a listener to retrieve samples. I have my publisher written in both the traditional API and the modern API. When I send, for example, 10000 messages with the traditional API publisher, on_sample_lost never fires.
However, when I do the same task with the modern API publisher, my reader fires on_sample_lost after the writer program completes, reporting anything from 2000 to 7000 unacknowledged samples. The counts of messages sent and messages received still match, however (i.e. the reader received 10000 messages with both writer programs).
The two publishers are never active simultaneously, and each publisher/subscriber uses default QoS with the same sleep duration between writes. The publisher and subscriber each run on one of two identical, networked machines, both running the same Linux image.
I have tried wait_for_acknowledgments on the writer side and acknowledge_sample on the reader side, but neither seems to alleviate the problem. I have also waited a few seconds on the writer side with the modern C++ publisher, after it has finished writing, to see if the acks arrive.
What am I missing? Could someone explain why this may be happening and where I could start to look to resolve it, or is this expected behaviour?
Thanks.
edit: sorry, I'm using 5.3.0 by the way.
To expand on this a little, what we are seeing is a seemingly random number (so far, between 2000 and 7000) of "ghost" messages being "lost" when a publishing application using the Modern C++ API terminates communication to a Traditional C++ API subscriber. The most confusing part of this issue is that the subscriber has received and processed the messages fine.
The publishing application is set to send a total of 10000 messages. The subscribing application receives these 10000 messages as expected, and accesses samples using the on_data_available callback. The number of received messages is tracked by the subscribing application and we can verify that it has received the 10000 messages.
After sending the 10000 messages, the publishing application terminates. We believe it is terminating "gracefully", though we are not sure of this: we have followed the pattern of the generated examples, and the publisher returns from publisher_main() and then from main(), though nothing is explicitly torn down at application exit. When the publishing application terminates in this manner, the subscribing application fires a number of calls to on_sample_lost, with the total number of lost samples ranging between 2000 and 7000; the reason for each of the losses is "lost_by_writer".
These on_sample_lost calls are fired after the subscribing application has successfully received and processed the 10000 messages the publisher sent. What is confusing us is that we are left with cases such as: Publisher sent 10000 messages, subscriber received 10000 messages and also lost 2000 messages - implying that 12000+ messages in total were sent.
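One way to double-check the "received 10000" figure is to count unique samples rather than callbacks. The following is only a minimal sketch, assuming each sample carries an application-level sequence number running from 0 to 9999; that field and the SequenceTracker class are hypothetical and not part of the DDS API. record() would be called once per valid sample from on_data_available.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical subscriber-side bookkeeping; none of these names come from DDS.
class SequenceTracker {
public:
    // expected: number of samples the publisher is known to send (e.g. 10000).
    explicit SequenceTracker(std::size_t expected) : seen_(expected, false) {}

    // Record one received sample by its application-level sequence number.
    void record(std::uint32_t seq) {
        if (seq >= seen_.size()) { ++out_of_range_; return; }  // unexpected number
        if (seen_[seq])          { ++duplicates_;   return; }  // already received
        seen_[seq] = true;
        ++unique_;
    }

    std::size_t unique() const       { return unique_; }
    std::size_t duplicates() const   { return duplicates_; }
    std::size_t out_of_range() const { return out_of_range_; }
    // Sequence numbers in [0, expected) that were never received.
    std::size_t missing() const      { return seen_.size() - unique_; }

private:
    std::vector<bool> seen_;
    std::size_t unique_ = 0;
    std::size_t duplicates_ = 0;
    std::size_t out_of_range_ = 0;
};
```

If unique() reports 10000 and missing() reports 0 while on_sample_lost still fires, the "lost" samples cannot be any of the 10000 that the application wrote.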
We have been investigating this further today and are now under the impression that a call to "clock_gettime(CLOCK_REALTIME, &ts);" in the publishing application is causing the problem. If we remove this line and rebuild the application, everything behaves as expected. Does this behaviour sound familiar to anyone? Any ideas of where to look to see why we are seeing these "ghost" messages?
So, to sum up:
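To help narrow down whether the clock_gettime call itself is the trigger, one experiment would be to isolate the timestamp read behind a small helper and swap in a std::chrono equivalent. This is a sketch only: both helper names are made up for illustration, and it assumes a Linux/POSIX system where std::chrono::system_clock counts from the Unix epoch.

```cpp
#include <chrono>
#include <cstdint>
#include <time.h>  // clock_gettime, CLOCK_REALTIME (POSIX)

// Hypothetical helper: wall-clock time in nanoseconds since the epoch,
// read via POSIX clock_gettime. Returns -1 if the call fails.
std::int64_t realtime_ns_posix() {
    timespec ts{};
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
        return -1;
    }
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Hypothetical alternative: the same quantity read via std::chrono.
std::int64_t realtime_ns_chrono() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               system_clock::now().time_since_epoch()).count();
}
```

If the lost-sample reports persist with realtime_ns_chrono() in place of the direct call, that would suggest looking at what the timestamp is used for (or at the timing change it introduces) rather than at clock_gettime itself.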
1 subscriber - traditional API
1 publisher - traditional API
1 publisher - modern API
NOTES: