Hello, all,
We have recently begun to see a significant number of samples being lost in our system, with on_sample_lost indicating that the sample was dropped by the writer (LOST_BY_WRITER). Most of our topics use best-effort reliability.
Does anyone know how this particular error is detected and communicated?
My initial assumption is that the publisher is trying to send samples faster than they can be sent over the network. Is that assumption valid?
For some background, our system consists of:
2 nodes (Linux & QNX)
12 hosts (2 on QNX, rest on Linux)
30 topics
~10,000 samples/sec
Shared memory & UDP transports
Any suggestions? Thanks!
You may need to experiment with some of the parameters in your Resource Limits QoS to get a better idea of whether the loss is due to history life-span, queue sizes, etc.
I've found this community article useful for problems such as this:
Tuning Queue Sizes and Other Resource Limits:
https://community.rti.com/static/documentation/connext-dds/5.3.1/doc/manuals/connext_dds/html_files/RTI_ConnextDDS_CoreLibraries_UsersManual/Content/UsersManual/Tuning_Queue_Sizes_and_Other_Resource_Li.htm
I would start with the history kind and depth to see if that has an effect.
Thank you for your reply, Gary. That is indeed an instructive (and dense) article. I haven't fully digested it yet, but one thing jumped out at me:
Aren't those queue sizes and tuning strategies mostly geared towards samples sent with RELIABLE delivery? Almost all of our samples are best effort, and almost all of the on_sample_lost errors are reported on topics with best-effort delivery.
Am I misunderstanding the contents of the article?
Hi Jason,
You are correct: most of the tuning strategies are for reliable delivery, so there is no need to go down that route. Sorry for the detour.
As for the cause of the LOST_BY_WRITER reason reported through on_sample_lost, I discussed this with a few colleagues, and it is a general indication. It is triggered when a reader is waiting for a sample with a specific sequence number but receives one with a different sequence number. The reader assumes that the expected sample was dropped by the writer. However, it could have been lost in transit for any number of reasons.
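To make the sequence-number check concrete, here is a minimal sketch of the general idea. It is only an illustration, not RTI's implementation: the GapDetector class and its fields are hypothetical names, and it assumes a single writer whose sequence numbers start at 1 and increase by one per sample.

```python
from dataclasses import dataclass


@dataclass
class GapDetector:
    """Hypothetical reader-side tracker for one writer (illustration only)."""
    next_expected: int = 1   # assumption: sequence numbers start at 1
    lost_total: int = 0

    def on_receive(self, seq: int) -> int:
        """Return how many samples this arrival implies were lost."""
        if seq < self.next_expected:
            # Older than what we already moved past; this sketch ignores it.
            return 0
        # Every sequence number skipped between the expected one and the
        # one that actually arrived is counted as lost.
        lost = seq - self.next_expected
        self.lost_total += lost
        self.next_expected = seq + 1
        return lost


detector = GapDetector()
# Samples 3, 4, 7 and 8 never arrive in order; 4 shows up late and is ignored.
losses = [detector.on_receive(seq) for seq in (1, 2, 5, 6, 4, 9)]
print(losses, detector.lost_total)
```

Note that the reader in this sketch cannot tell why the gap occurred. It only sees that the expected sequence number did not arrive, which is why the reported reason can cover drops anywhere between writer and reader.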
Since you started seeing a lot more of these messages recently, it does raise the question of what has changed in general: network changes or congestion, faster/slower machines, etc.
You could probably get a better idea of the cause by enabling logging and/or instrumenting your code.
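As one possible way to instrument the code, a small tally of loss reports per topic can show whether the losses are concentrated on a few topics or spread evenly. This is a sketch under assumptions: LossTally, record, and worst are hypothetical names, the topic names are made up, and in a real application the values passed to record would come from the status handed to your on_sample_lost callback.

```python
from collections import Counter


class LossTally:
    """Hypothetical helper that accumulates lost-sample counts (illustration only)."""

    def __init__(self):
        # Keyed by (topic name, reported reason).
        self.by_topic_reason = Counter()

    def record(self, topic, reason, count_change):
        """Add the number of newly lost samples reported for a topic."""
        self.by_topic_reason[(topic, reason)] += count_change

    def worst(self, n=3):
        """Return the n (topic, reason) pairs with the most lost samples."""
        return self.by_topic_reason.most_common(n)


tally = LossTally()
# Made-up reports, standing in for calls made from a listener callback.
tally.record("TopicA", "LOST_BY_WRITER", 3)
tally.record("TopicB", "LOST_BY_WRITER", 1)
tally.record("TopicA", "LOST_BY_WRITER", 2)
worst = tally.worst(1)
print(worst)
```

Logging the tally periodically with a timestamp would also make it easier to line up bursts of loss with whatever changed in the network or on the hosts.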