Losing Messages

JC

I have two Java applications using DDS messaging: one spams messages on 6 different topics and the other reads them. Each message carries a sequence number, which is used to verify whether messages are lost (the sequence is broken). When I run these applications locally no problem occurs; they run at a high rate and run indefinitely. However, when I move one of the applications to a second machine, the problems start to appear. I start seeing an alarming number of messages being lost even though the spamming rate is reduced significantly, waitsets are used on the topic readers, and the QoS is set to best effort. I have attached the QoS profile that both applications are using. Please note that I have ruled out the implementation, as no messages are lost when running locally at a high rate.

Any ideas why the middleware would be losing messages at what seems to be a reasonable rate of sending? 
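For readers who want to reproduce the check described above, here is a minimal sketch of per-topic gap detection. It is not the poster's code and uses no DDS API; the class name `SequenceGapDetector` and its methods are hypothetical, and it assumes each topic carries its own monotonically increasing sequence number.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of any DDS API): remembers the highest
// sequence number seen per topic and counts the samples that were skipped.
final class SequenceGapDetector {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private long lost = 0;

    // Call once per received sample. Returns how many samples were skipped
    // immediately before this one (0 if the sequence is unbroken).
    long onSample(String topic, long seq) {
        Long previous = lastSeen.get(topic);
        if (previous != null && seq <= previous) {
            return 0; // duplicate or out-of-order sample: not counted as a loss
        }
        lastSeen.put(topic, seq);
        long gap = (previous == null) ? 0 : seq - previous - 1;
        lost += gap;
        return gap;
    }

    long totalLost() {
        return lost;
    }

    public static void main(String[] args) {
        SequenceGapDetector detector = new SequenceGapDetector();
        long[] received = {1, 2, 3, 6, 7, 10}; // 4, 5, 8 and 9 never arrived
        for (long seq : received) {
            detector.onSample("TopicA", seq);
        }
        System.out.println("lost on TopicA: " + detector.totalLost());
    }
}
```

Note that a detector like this treats a redelivered sample as a duplicate rather than a loss, which matters for the "old messages" symptom discussed later in the thread.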

Attachment: qos_profiles.xml (1.03 KB)

Hi,

If you set your reliability QoS to "best effort", the middleware does not guarantee reliable transmission over the network. Your applications would not miss any messages when they are running on the same machine because they can use shared memory to exchange messages. However, if your applications are running on different machines, the middleware uses the UDP transport by default, so messages may be lost regardless of your sending rate. The "reliable" reliability QoS retransmits messages that are lost in the network, while the "best effort" reliability QoS does nothing about lost messages. So if you don't want your applications to miss any samples, you should set your reliability QoS to "reliable".

Thanks,
Kyoungho

Gerardo Pardo

As Kyoungho said, this is not unexpected for BEST_EFFORT.

If running on the same kind of CPU/computer, a DataWriter can typically write faster than a DataReader can read. This is because the DataReader incurs the extra context-switch time needed to interrupt the CPU and notify it of the network packet arrival. The implication is that, all other things being equal, a DataWriter that writes "as fast as it can" can overwhelm the CPU on the receiver so that it cannot keep up.

When you use a RELIABLE DataWriter and DataReader you will normally not see this packet loss. This is because the DataWriter is notified via ACKs/NACKs that the DataReader cannot keep up, and RTI Connext DDS will try to throttle back/flow-control the DataWriter so that the DataReader is given a chance to keep up. You could still see messages not received if the QoS is configured to favor receiving later updates over old repairs (e.g. HISTORY kind=KEEP_LAST) or if you are using time-based or content-based filters.

That said, if your network is not overloaded and you do not write faster than the reader can handle, you typically do not see packet loss even with BEST_EFFORT. But if you want to ensure reliability, or if you want the writer to send "as fast as possible" and regulate its rate based on what the readers can handle, then you should use RELIABLE.
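To illustrate the KEEP_LAST remark above: the sketch below is a toy model only, not the RTI Connext API. It treats a KEEP_LAST history as a bounded queue for a single instance, so that when samples arrive faster than they are taken, the oldest ones are replaced and never reach the application. The class name `KeepLastModel` and its methods are hypothetical; how and where the real middleware replaces samples depends on the full QoS configuration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: a bounded queue standing in for a KEEP_LAST history
// of a single instance. It is not how the middleware is implemented.
final class KeepLastModel {
    private final int depth;
    private final Deque<Integer> history = new ArrayDeque<>();
    private int replaced = 0;

    KeepLastModel(int depth) {
        this.depth = depth;
    }

    // A new sample arrives; if the history is full, the oldest one is replaced.
    void write(int sample) {
        if (history.size() == depth) {
            history.removeFirst();
            replaced++;
        }
        history.addLast(sample);
    }

    // The application drains whatever is still held in the history.
    List<Integer> takeAll() {
        List<Integer> out = new ArrayList<>(history);
        history.clear();
        return out;
    }

    int replacedCount() {
        return replaced;
    }

    public static void main(String[] args) {
        KeepLastModel model = new KeepLastModel(3);
        for (int sample = 1; sample <= 5; sample++) {
            model.write(sample); // five samples arrive before the reader takes any
        }
        System.out.println("taken: " + model.takeAll());
        System.out.println("replaced before being read: " + model.replacedCount());
    }
}
```

In this model a depth of 3 with five back-to-back writes leaves only the three newest samples for the reader, which is the "favor later updates over old repairs" trade-off in miniature.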

Gerardo

JC

After the reliability setting was updated to RELIABLE, reliability across the network increased considerably. However, it introduced another oddity. With the attached QoS profile set to RELIABLE, readers are now picking up messages that were already read. If I send a message to the application and it reads it, and I then close the application and relaunch it soon after, the topic reader finds the old message. This is also true if I read the message, unregister the reader from the topic, and then register it again: it finds the same, already-read message. I thought VOLATILE_DURABILITY_QOS prevented this behavior? It seems to work as expected when the reliability is set to BEST_EFFORT.

Gerardo Pardo

This should definitely not be happening. Are you sure you did not also change the DURABILITY QoS when you switched to RELIABLE? Or maybe it is picking up the QoS from a file in the same directory? To be 100% sure, you could double-check this using Admin Console and see what the running application is announcing.

Gerardo

 

JC

I start by loading the above QOS profile and apply it using set_qos on the participant factory. I then do the following:

- Create the participant using the PARTICIPANT_QOS_DEFAULT

- Create a publisher using the PUBLISHER_QOS_DEFAULT 

- Create a subscriber using SUBSCRIBER_QOS_DEFAULT

- Create all topics using TOPIC_QOS_DEFAULT and for the datawriters and datareaders I create them with DATAWRITER_QOS_DEFAULT and DATAREADER_QOS_DEFAULT.

Does this sound right? I just need it to operate reliably over a network, not receive old messages when an application first creates the topic or topic reader, and keep a history of about 20 on every topic.


Hi,

If possible, would you upload your source code here? I could run your code on my machine and then help you resolve your issue.

Kyoungho

JC

I believe I solved the issue. I think it's because if I terminate the application (so it doesn't get a chance to delete the participant and call finalize_instance) and a new instance is launched very quickly (within 5 or so seconds), the new instance starts picking up old messages that the previous application already read.

thanks for the help


Hi,

That still sounds like unexpected behavior to me.
When your application (I guess you are referring to a subscriber application?) is terminated and restarted, it spawns a new process, and the new process should not receive old messages if you set the durability QoS to "volatile".

Thanks,
Kyoungho

Gerardo Pardo

@JC. You are right. I stand corrected. There is in fact a race condition. Samples that are not fully acknowledged by the "existing" DataWriters sit in a special state, and they will be sent to a late-joiner DataReader even if the DataReader is VOLATILE... So it seems that when your first application was terminated it had received the messages but had not yet acknowledged them (as far as the DataWriter could see). So they were sitting in that "special state" and were sent to the late-joiner reader.
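The race described above can be pictured with a toy model. This is only a sketch of the behavior as described in this post, not the middleware's implementation or API: the class `UnackedSampleModel` and its methods are hypothetical, and the real protocol state is more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: the writer keeps every sample it has not yet seen
// acknowledged, and hands those samples to any reader that matches later.
final class UnackedSampleModel {
    // Samples written but not yet acknowledged, as far as the writer can see.
    private final Set<Integer> unacked = new LinkedHashSet<>();

    void write(int sample) {
        unacked.add(sample);
    }

    // An acknowledgment from the original reader reaches the writer.
    void acknowledge(int sample) {
        unacked.remove(sample);
    }

    // What a newly matched (late-joiner) reader is sent in this model.
    List<Integer> onReaderMatched() {
        return new ArrayList<>(unacked);
    }

    public static void main(String[] args) {
        UnackedSampleModel writer = new UnackedSampleModel();
        writer.write(1);
        writer.write(2);
        writer.write(3);

        // The first reader received all three samples, but it was killed
        // before its acknowledgment for sample 3 reached the writer.
        writer.acknowledge(1);
        writer.acknowledge(2);

        // A replacement reader starts a few seconds later and still gets sample 3.
        System.out.println("sent to late joiner: " + writer.onReaderMatched());
    }
}
```

In this model, shutting the first application down cleanly so that its acknowledgments go out first would leave nothing pending, which is consistent with JC only seeing the problem after an abrupt termination followed by a quick restart.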

@Kyoungho. You are right, this is what one "would expect". However, I did some checking and there is in fact a bug report/feature request already entered to address this edge case, so in the future we may change it to provide an experience closer to what would seem the more correct behavior.

Gerardo