Liveliness reproducibly drops after a certain time


Hello guys,

I've been stuck on a problem for a while now and can't figure out what's going wrong, which is why I'm seeking help here.
I have an application consisting of a master and multiple worker nodes, using the Connext Modern C++ API. Both types of nodes have to detect when the other one is no longer there, e.g. due to a network failure or a crash.

For this purpose, I use a QoS profile with Liveliness configured (more precisely, one that inherits from BuiltinQosLibExp::Pattern.Event). I have set RELIABLE_RELIABILITY_QOS, KEEP_ALL_HISTORY_QOS, AUTOMATIC_LIVELINESS_QOS, and a lease_duration of 50 ms. In my application, I have listeners attached to the on_liveliness_changed() event as well as to on_data_available().
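For reference, this is roughly how I assemble that QoS with the Modern C++ API. This is a minimal sketch rather than my production code: the domain id 0, the "Status" topic name, and the IDL-generated StatusType are placeholders.

    #include <iostream>
    #include <dds/dds.hpp>

    // StatusType stands in for my real IDL-generated type.
    // Reader side: a no-op listener overriding only the callbacks I use.
    class StatusListener : public dds::sub::NoOpDataReaderListener<StatusType> {
        void on_liveliness_changed(
            dds::sub::DataReader<StatusType>&,
            const dds::core::status::LivelinessChangedStatus& status) override {
            std::cout << "alive writers: " << status.alive_count()
                      << " (change: " << status.alive_count_change() << ")\n";
        }
        void on_data_available(dds::sub::DataReader<StatusType>&) override {
            // take() and process the samples here
        }
    };

    void create_entities(dds::domain::DomainParticipant& participant) {
        dds::topic::Topic<StatusType> topic(participant, "Status");

        // Writer side: start from the built-in Event pattern and set
        // reliability, history, and liveliness as described above.
        dds::pub::qos::DataWriterQos writer_qos =
            dds::core::QosProvider::Default().datawriter_qos(
                "BuiltinQosLibExp::Pattern.Event");
        writer_qos << dds::core::policy::Reliability::Reliable()
                   << dds::core::policy::History::KeepAll()
                   << dds::core::policy::Liveliness::Automatic().lease_duration(
                          dds::core::Duration::from_millisecs(50));
        dds::pub::DataWriter<StatusType> writer(
            dds::pub::Publisher(participant), topic, writer_qos);
        // The matching DataReader is created with the same profile and a
        // StatusListener attached.
    }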

My issue is that, while this generally works quite well, my application reproducibly reports a loss of liveliness of several nodes after about 50 seconds. Up to that point everything works fine and liveliness is asserted without any issues. The application keeps repeating similar work, so the same code runs around 50 times before this happens. Since it happens so reproducibly, I suspect that some kind of buffer or queue is filling up by that time, but I could not figure out what exactly it could be.

The logs do not show anything (no warnings, no errors) and there is no failure in the environment (the network is up, the nodes are connected to the same switch), so I do not really know where to continue investigating.

So far, I have tried the following things:

  • switching to a MANUAL liveliness kind and making sure I always assert liveliness in time (see the sketch after this list)
  • sending almost no data (so the application basically only sends liveliness assertions, without the actual messages it should send), to rule out a network overload
  • playing around with assertions_per_lease_duration (setting it to very high and to very low values), which rather made things worse
  • pinning the DDS event thread to one CPU core and raising its priority (to make sure events are processed and liveliness is asserted in time)
  • raising the OS-level priority of the applications
  • increasing the lease_duration from the rather low 50 ms to 1 s. This also led to the application failing at some point, although it ran a bit longer than 50 s. With 3 s, the application no longer fails, but my goal is to detect node failures in under 1 s, which should be feasible in my opinion. Increasing the lease_duration generally seems to delay the point of failure; going from 50 ms to 100 ms, for example, the application fails 10-20 s later.
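For completeness, the MANUAL variant from the first bullet looked roughly like this (again a sketch: ManualByTopic is one of the two manual kinds, and the 20 ms period is just an example that stays well inside the 50 ms lease):

    #include <atomic>
    #include <chrono>
    #include <thread>
    #include <dds/dds.hpp>

    // Continues the sketch above: same participant/topic, manual kind instead.
    void run_manual_liveliness(dds::domain::DomainParticipant& participant,
                               dds::topic::Topic<StatusType>& topic,
                               dds::pub::qos::DataWriterQos writer_qos,
                               std::atomic<bool>& running) {
        writer_qos << dds::core::policy::Liveliness::ManualByTopic(
                          dds::core::Duration::from_millisecs(50));
        dds::pub::DataWriter<StatusType> writer(
            dds::pub::Publisher(participant), topic, writer_qos);

        // Assert liveliness at least once per lease_duration; writing a
        // sample would also count as an assertion.
        while (running) {
            writer.assert_liveliness();
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
        }
    }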

I have attached the logs of a worker node and a master node; these were the only two nodes running while the logs were recorded. At the end of the worker's log, one can see that heartbeats are sent but not received.

I would really appreciate some help, maybe another idea on what could be causing this. 

Best regards,
Oliver

Attachments:
  master.txt (67.53 KB)
  worker.txt (68.04 KB)
Howard:

Sorry to say that the logs don't really have the information needed to understand the problem.

When using AUTOMATIC liveliness, the DomainParticipant (or, more specifically, a thread associated with the participant) will periodically send out assertions of liveliness at a period of "lease_duration/assertions_per_lease_duration". With AUTOMATIC liveliness these are NOT HBs (reliability heartbeats) but a special packet that shows up as DATA(m) in Wireshark.
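For example, with the original 50 ms lease_duration and the default assertions_per_lease_duration of 3 (assuming that default wasn't changed), a DATA(m) should go out roughly every 16-17 ms, so a reader would have to miss about three consecutive assertions before it declares the writer not alive.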

Which brings us to Wireshark, which is the best way to understand what's happening. I would use Wireshark to capture traffic on both the sending machine and the receiving machine, and then check whether DATA(m)'s are being sent by the participant that owns the DataWriter (the one with the Liveliness QoS) to the participant that owns the DataReader, and whether the network is actually delivering those DATA(m)'s to the receiving machine.

So the Wireshark traces should show whether Connext DDS is trying to assert liveliness as configured...and whether those assertions are being received by the application that is detecting the liveliness loss.
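If it helps, a display filter along the following lines should isolate that traffic; 0x000200c2 is the RTPS entity id of the builtin participant-message writer that carries the DATA(m) assertions (double-check the field name against your Wireshark version):

    rtps.sm.wrEntityId == 0x000200c2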

Maxx:

Hi Oliver,

Can you please provide the specific QoS settings you are using? And, if possible, a bit more information about the data payloads and rates between these applications? Have you implemented behavior to simulate a loss of liveliness, and if so, how?

As Howard mentioned, the attached logs do not include much information relevant to debugging this liveliness problem; they mostly show data delivery and the reliability-protocol heartbeats/ACKs. You should be able to get more information by enabling the Monitoring libraries and using RTI Monitor, which is accessible from the Launcher.
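For reference, enabling the Monitoring Library amounts to adding two properties to the DomainParticipantQos. A minimal sketch, assuming the rtimonitoring library that ships with Connext is on your library path:

    #include <dds/dds.hpp>

    dds::domain::DomainParticipant make_monitored_participant(int domain_id) {
        dds::domain::qos::DomainParticipantQos qos =
            dds::core::QosProvider::Default().participant_qos();
        // Documented property names for the RTI Monitoring Library.
        qos.policy<rti::core::policy::Property>().set(
            {"rti.monitor.library", "rtimonitoring"});
        qos.policy<rti::core::policy::Property>().set(
            {"rti.monitor.create_function", "RTIDefaultMonitor_create"});
        return dds::domain::DomainParticipant(domain_id, qos);
    }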

Thanks,

Maxx