We have occasional delays (DDS v5.2.3). Samples are delayed perhaps 15 seconds.
Curiously, samples of a single key are delayed together, whereas other samples we know were sent at the same time are not delayed.
I suspect the network is flaky and this is DDS doing its best to mop up packet loss, with the delays introduced as recovery takes place.
QoS is history KEEP_ALL, reliable, with multicast delivery; there are no deadline or other timing constraints, so samples almost always get there eventually.
How do I point the finger conclusively at the network? How can I measure how hard the DDS-level reliability/recovery is working?
Hey David,
I believe RTI Monitoring Service lets you see when samples are being resent (or when NACKs are received?).
Another way to find out whether you are missing a lot of samples is to use Wireshark to look for NACKs that list multiple lost samples.
If you can do it, the best way is to capture the network traffic on both the writer side and the reader side (filtering for the data flowing between the writer and the reader in both directions) and compare whether the two captures match.
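As a rough sketch of that comparison, assuming you've already exported the RTPS sequence numbers per writer from each capture (e.g. with tshark) into plain lists of integers — the helper name here is made up:

```python
# Sketch: compare the sequence numbers seen in a writer-side capture with
# those seen in a reader-side capture to find samples lost in transit.
# Assumes the sequence numbers were already extracted from each pcap.

def missing_on_reader(writer_seqs, reader_seqs):
    """Return sequence numbers the writer sent that never appeared in the
    reader-side capture, in order."""
    return sorted(set(writer_seqs) - set(reader_seqs))

writer_side = [1, 2, 3, 4, 5, 6, 7, 8]
reader_side = [1, 2, 4, 5, 6, 8]          # 3 and 7 dropped on the wire

print(missing_on_reader(writer_side, reader_side))  # -> [3, 7]
```

If the two captures match, the loss is happening above the network (and you can stop blaming it); if they don't, you have your culprit.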
Good luck,
Roy.
I was going to ask how to distinguish the ACKs from the NACKs, since the Wireshark example for RTI DDS has many of both... but now that I look at my local test system, my multicast delivery mechanism has neither in normal operation, so I'm guessing it's a NACK-based mechanism and they'll be easy to spot.
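For my own notes: per the RTPS spec, ACKNACK submessages have ID 0x06 and HEARTBEATs 0x07, so a Wireshark display filter along the lines of `rtps.sm.id == 0x06` should isolate the NACK traffic (field name as I understand the RTPS dissector — worth double-checking on your Wireshark version). A quick sketch of tallying submessage kinds once they've been exported from a capture:

```python
from collections import Counter

# RTPS submessage IDs (from the OMG RTPS 2.x spec)
ACKNACK   = 0x06   # reader -> writer: acks received, nacks missing samples
HEARTBEAT = 0x07   # writer -> reader: advertises available sequence numbers
GAP       = 0x08   # writer -> reader: declares sequence numbers irrelevant
DATA      = 0x15   # writer -> reader: a user sample

def tally(submessage_ids):
    """Count submessages by kind; many ACKNACKs relative to DATA suggests
    the reliability protocol is working hard."""
    names = {ACKNACK: "ACKNACK", HEARTBEAT: "HEARTBEAT",
             GAP: "GAP", DATA: "DATA"}
    return Counter(names.get(i, hex(i)) for i in submessage_ids)

# e.g. IDs exported with: tshark -T fields -e rtps.sm.id -r capture.pcap
print(tally([0x15, 0x15, 0x07, 0x06, 0x15, 0x06, 0x06]))
```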
We'll see what turns up from the customer in the way of logs.
Thanks Roy
More examples today of data delayed by key.
This suggests DDS having to recover message 1 and holding back 2, 3 and 4, so the whole lot arrives late and together as 1, 2, 3, 4, while traffic on a different key carries on around it...
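That would be classic head-of-line blocking from in-order reliable delivery. A toy model of what I think is happening (an illustration, not RTI's actual implementation):

```python
# Toy model of in-order reliable delivery: samples arriving after a gap
# are held until the missing sample is repaired, then released together.

class InOrderReader:
    def __init__(self):
        self.next_seq = 1
        self.held = {}          # out-of-order samples awaiting the gap fill

    def receive(self, seq):
        """Accept sample `seq`; return the samples delivered to the app."""
        self.held[seq] = True
        delivered = []
        while self.next_seq in self.held:
            del self.held[self.next_seq]
            delivered.append(self.next_seq)
            self.next_seq += 1
        return delivered

r = InOrderReader()
print(r.receive(2))   # -> []              sample 1 lost, 2 is held
print(r.receive(3))   # -> []              still blocked behind 1
print(r.receive(4))   # -> []
print(r.receive(1))   # -> [1, 2, 3, 4]    repair arrives; all released late, together
```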
Still no Wireshark captures to work with...
Hey David,
Firstly: why no Wireshark?
Secondly: let's try to understand your architecture better.
How many applications do you have? How many participants do they have? How many publishers? Subscribers? Topics? Readers? Writers?
If what you're seeing is one application with a single participant, topic, publisher and writer sending multiple instances to one other application with a single participant, topic, subscriber and reader:
1. Do you see the problem for the same instance (that is, when data is delayed, is it always the same instance that is delayed)?
2. Are you using the Java API?
3. What is the pattern for sending messages of different instances?
4. Is the update rate of all instances similar?
I would avoid rushing to conclusions with as little data as you've provided us.
Good luck,
Roy.
Further investigation, having got some Wireshark captures, showed:
So... we optimised to use multicast heartbeats at a rate appropriate to the topic, made it a no-ACK system (NACKs only) with multicast resends, and adjusted the history kind/depth from KEEP_ALL to a sensible KEEP_LAST depth to better constrain sender resources.
These optimisations seem appropriate for our application; initial tests are positive and further analysis is required.
I was surprised how much unicast traffic there was for our 40 hosts/4 topics.
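For the record, a sketch of roughly what that tuning looks like in an RTI XML QoS profile — element names as I understand RTI's documented protocol QoS, values purely illustrative, so check them against your Connext version:

```xml
<!-- Illustrative QoS fragment; depths and periods are made-up examples -->
<datawriter_qos>
  <history>
    <kind>KEEP_LAST_HISTORY_QOS</kind>   <!-- was KEEP_ALL -->
    <depth>32</depth>                    <!-- bound the writer's resend cache -->
  </history>
  <protocol>
    <disable_positive_acks>true</disable_positive_acks> <!-- NACK-only reliability -->
    <rtps_reliable_writer>
      <heartbeat_period>
        <sec>1</sec><nanosec>0</nanosec> <!-- rate appropriate to the topic -->
      </heartbeat_period>
      <enable_multicast_periodic_heartbeat>true</enable_multicast_periodic_heartbeat>
    </rtps_reliable_writer>
  </protocol>
</datawriter_qos>
<datareader_qos>
  <protocol>
    <disable_positive_acks>true</disable_positive_acks>
  </protocol>
</datareader_qos>
```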