Strict Reliability And Destination Order QoS


I have a test scenario where I start my Connext application, publish some messages between a handful of readers and writers (within the same participant, using UDPv4) and verify the data was properly processed.  This test is driven by a script that restarts the test over and over until a failure is detected.  These restarts are run on 5 different machines in a lab environment.  On four of the machines the tests will run for thousands of restarts and all the DDS messages are properly received by all readers, but on one machine I am running into an issue where, after 20-30 restarts, one of the readers never receives the sample it was expecting, even though the write was successful.  The missing sample never showed up in a listener callback as lost or rejected; it just never made it.  I messed around with the QoS settings and was able to narrow down the problem, although I'm still a bit confused about why it's happening.

The original QoS values (file attached) for the readers / writers were set for strict reliability with a destination_order of BY_SOURCE_TIMESTAMP.  The performance settings were set aggressively as well, because the test only takes about 3-4 seconds to complete.  When I switched the destination_order to BY_RECEPTION_TIMESTAMP, the problem went away and all samples were always received by all readers.

Is it possible that, because the test executes so quickly, in rare cases a couple of samples are received by a reader with timestamps that are out of order, and the reader therefore drops one of the samples when using BY_SOURCE_TIMESTAMP?  Does the strict reliability QoS also depend on using BY_RECEPTION_TIMESTAMP?

 

 

Attachment: USER_QOS_PROFILES.xml (8.67 KB)
Gerardo Pardo

Hi,

The Destination Order BY_SOURCE_TIMESTAMP should not conflict with reliable delivery when data originates on a single DataWriter.  The samples written by each DataWriter are guaranteed to have timestamps that are monotonically increasing (each equal to or greater than the previous) and thus the DataReader will not reject them based on source timestamp.  The fact that a sample may be dropped on the wire and the reliable protocol may repair it does not affect this. The out-of-order samples will be staged in the "reliability queue" of the DataReader until the repair comes, and by the time they are pushed to the DataReader cache they will be pushed in the correct order, so the source timestamps will not cause a problem.
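
To illustrate the staging behaviour described above, here is a minimal sketch of a per-DataWriter queue that holds back out-of-order samples until the gap is repaired. It is not the Connext implementation: the class name `ReliabilityQueue`, the string payloads, and the assumption that sequence numbers start at 1 are all hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one DataWriter's stream inside a DataReader.
// Samples that arrive ahead of a gap are staged until the gap is repaired.
class ReliabilityQueue {
public:
    // Takes a sample and its sequence number; returns the samples that can
    // now be pushed to the reader cache, in sequence-number order.
    std::vector<std::string> receive(std::uint64_t seq, const std::string& payload) {
        std::vector<std::string> deliverable;
        if (seq < next_expected_) {
            return deliverable;  // duplicate of a sample already delivered
        }
        staged_.emplace(seq, payload);  // no effect if seq is already staged
        // Drain everything that is now contiguous with what was delivered.
        auto it = staged_.begin();
        while (it != staged_.end() && it->first == next_expected_) {
            deliverable.push_back(it->second);
            it = staged_.erase(it);
            ++next_expected_;
        }
        return deliverable;
    }

    // Number of samples still waiting for a repair.
    std::size_t staged_count() const { return staged_.size(); }

private:
    std::uint64_t next_expected_ = 1;              // assumed first sequence number
    std::map<std::uint64_t, std::string> staged_;  // out-of-order samples
};
```

If sample 3 arrives before sample 2, `receive(3, ...)` returns nothing and the later `receive(2, ...)` returns both samples in order, which is why a single DataWriter's timestamps never reach the cache out of order.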

Thank you for sending the QoS profiles. It helps a lot in seeing how you are configuring your system. I looked at it and it all seems fine.

I do not have an explanation for the behavior you are seeing. I can think of some scenarios that can cause what you are seeing, but I am not sure if they match your situation... See questions below:

  • Are all the applications (writers and readers) re-started on each scenario, or are you keeping the reading applications running?
  • How many writing applications do you have on each test run?
  • You mentioned 5 different computers being used. Are all 5 being used in a single scenario, meaning that the receiving applications are running on different computers? Or do you mean that you use different computers in different runs, but each run is executing on a single computer?
  • Related to the previous question: are the sending and receiving applications on different computers?

If the reading application remains running and the writing application re-starts and somehow the "source" timestamp has rolled back, then it would be possible for the reader to miss samples, because it would reject anything with a timestamp earlier than what it had already received. This could happen if you started the DataWriter on a different computer (where the clock is not exactly synchronized).

Gerardo

 


Thanks for the response, Gerardo!

My test setup is as follows:

  • Each machine that is running the restart tests is isolated to localhost by using a discovery peers file containing only the localhost address.  There are 5 machines running independently just to cover different machine configurations and operating systems.
  • All DDS communications are within a single application with a single participant.  The tests are actually for verifying my application level logic when processing DDS messages, not to test the middleware itself which is why I kept it all local.
  • The application is completely restarted on each test run.

In the case of the reader that doesn't get the sample, there are two writers publishing to the same topic and they write within about 250ms of each other.  The reader doesn't get the sample from the writer that was the last to publish.

Hope that helps explain things a bit.  I tried to create an SSCCE to demonstrate the behavior, but haven't been successful yet.  To be honest, it's hard enough to make it happen with my full application.  It only occurs on a single machine, and even then, as that machine runs throughout the day, it happens less and less frequently until it's rebooted.

Nate

Gerardo Pardo

Hello Nate,

I assume you have a single thread calling "write" on both DataWriters. Otherwise the concept of "before" and "after" may not match what you expect due to context switches... If you are using two different threads you would either need to synchronize them to guarantee order, or else call DataWriter::write_w_timestamp (and synchronize getting the timestamps) to make sure the timestamps are going in the order you expect.

Even if you are writing with a single thread it may make sense to use write_w_timestamp and pass the source timestamp explicitly just so you can make sure that the Operating System is not doing anything funny and returning an earlier timestamp on the last call. I do not expect this, but it is worth ruling it out...
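
One way to "synchronize getting the timestamps" is sketched below. This is only an illustration under stated assumptions: `SourceTime` is a stand-in with seconds and nanoseconds fields loosely modeled on the DDS time type, and `TimestampSource` is a hypothetical helper, not part of Connext. The value it returns would still have to be converted to the middleware's own time type before being passed to write_w_timestamp.

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>

// Stand-in for the middleware's time type (assumption: seconds + nanoseconds).
struct SourceTime {
    std::int64_t sec;
    std::uint32_t nanosec;
};

// Hypothetical helper that hands out strictly increasing timestamps, even
// when several threads ask at once or the system clock repeats a value.
class TimestampSource {
public:
    SourceTime next() {
        using namespace std::chrono;
        std::lock_guard<std::mutex> lock(mutex_);
        std::int64_t now = duration_cast<nanoseconds>(
            system_clock::now().time_since_epoch()).count();
        // If the clock returned the same or an earlier value, move 1 ns past
        // the previous timestamp so that ordering is preserved.
        if (now <= last_ns_) {
            now = last_ns_ + 1;
        }
        last_ns_ = now;
        return SourceTime{now / 1000000000,
                          static_cast<std::uint32_t>(now % 1000000000)};
    }

private:
    std::mutex mutex_;
    std::int64_t last_ns_ = 0;  // last timestamp handed out, in nanoseconds
};
```

Sharing one `TimestampSource` between both DataWriters means no two writes can carry the same source timestamp, at the cost of timestamps that may drift a few nanoseconds ahead of the real clock.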

I think I have a couple of guesses of what could be happening and it is "normal" given the scenario you describe.

If you are using two different DataWriters to write the data, then each will send its data independently of the other. This means there are logically two separate streams, each with its own sequence numbers, and therefore the DDS Reliability Protocol will not force any particular order in the delivery of the samples from both streams.

GUESS#1

Say DataWriter1 writes a sample (S1) at time T1. Then 250 msec later DataWriter2 writes another sample S2 at time T2 where T2 > T1.

Now assume that for some reason S1 is dropped because some buffer was full, and the DataReader receives S2 before it receives S1. Because they are independent streams (S2 comes from a different DataWriter), the DataReader has no problem accepting S2 and giving it to the application. At some later point in time the DataReader detects it has missed S1 from DataWriter1, asks for a repair and gets it. When it gets S1 it realizes that its timestamp T1 < T2 and, given the "BY_SOURCE_TIMESTAMP" order, it is not allowed to pass it to the application and drops it.
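
The scenario in this guess can be modeled with a small sketch. `SourceOrderFilter` is a hypothetical class, not a Connext type; it assumes the reader keeps only the newest source timestamp delivered per instance and drops anything strictly older. Equal timestamps are deliberately accepted here, since ties are a separate question.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of a DataReader applying BY_SOURCE_TIMESTAMP ordering
// per instance, regardless of which DataWriter a sample came from.
class SourceOrderFilter {
public:
    // Returns true if the sample would be handed to the application.
    bool accept(const std::string& instance, std::int64_t source_ts_ns) {
        auto it = newest_.find(instance);
        if (it != newest_.end() && source_ts_ns < it->second) {
            return false;  // older than a sample already delivered: dropped
        }
        newest_[instance] = source_ts_ns;
        return true;
    }

private:
    // Newest source timestamp delivered so far, keyed by instance.
    std::map<std::string, std::int64_t> newest_;
};
```

Feeding it S2 (timestamp T2) and then the repaired S1 (timestamp T1 < T2) for the same instance yields `true` followed by `false`: the late S1 is the sample that goes missing.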

This, however, does not explain why the reader does not get the sample from the DataWriter that "was last to publish." In fact it should be exactly the opposite... So I think the correct guess is #2 below...

GUESS#2

Say DataWriter1 writes a sample (S1) at time T1. Then later DataWriter2 writes another sample S2 at time T2. However the computer returns an identical value of the timestamp; that is T1=T2. This could happen if the Operating System clock does not have enough resolution to distinguish both events.

The main goal of the BY_SOURCE_TIMESTAMP destination order is to guarantee "consistency", in that all DataReaders agree on what the 'last value' for the Instance is (NOTE: I say Instance because you are leaving the PRESENTATION access_scope at its default value, which is DDS_INSTANCE_PRESENTATION_QOS).  For this reason the DDS implementation must deterministically choose one of the two samples and consider it the "latest". This decision has to be made in a manner that is consistent across all DataReaders, so the criterion used is to compare the DataWriter GUIDs (a globally unique identifier each DataWriter has) and make the determination based on their values.  Since all the DataWriters within a DomainParticipant are given GUIDs in consecutive order based on when they were created, the result must be (my guess; I did not check the implementation) that if the two timestamps line up then the DataWriter that was created first always wins...
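
The tie-break guessed at above could look roughly like the sketch below. It encodes the guess, not verified middleware behaviour: `SampleKey` and `supersedes` are hypothetical names, the GUID is simplified to a single integer (real GUIDs are larger), and "lower GUID wins a tie" is exactly the unchecked assumption from the paragraph above.

```cpp
#include <cstdint>

// Simplified identity of a sample for ordering purposes.
struct SampleKey {
    std::int64_t source_ts_ns;  // source timestamp in nanoseconds
    std::uint64_t writer_guid;  // simplified GUID, assumed to grow with creation order
};

// Returns true if `incoming` may replace `current` as the latest value of
// the instance. Assumption: on equal timestamps the lower GUID wins.
bool supersedes(const SampleKey& incoming, const SampleKey& current) {
    if (incoming.source_ts_ns != current.source_ts_ns) {
        return incoming.source_ts_ns > current.source_ts_ns;
    }
    // Equal timestamps: the same writer (equal GUID) keeps its own order,
    // otherwise only a writer with a lower GUID may take over.
    return incoming.writer_guid <= current.writer_guid;
}
```

Under this rule every DataReader ends up with the same latest value whatever the arrival order, and a sample from the later-created DataWriter is the one discarded when T1=T2.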

If GUESS#2 is correct you could verify it by using the DataWriter::write_w_params operation, which allows you to retrieve the actual parameters that were used by the middleware, including the source timestamp. To do this, fill the WriteParams_t structure with the automatic values and set the replace_auto field to TRUE, so that the write operation fills the WriteParams_t structure with the actual values upon return. If you see that the failure corresponds to identical values of WriteParams_t::source_timestamp, then it is likely that guess #2 is what is happening.
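
Once the source timestamps have been captured this way, checking for collisions is straightforward. The sketch below covers only that checking step and does not call the middleware: `WriteRecord` and `find_timestamp_collisions` are hypothetical names, and the seconds and nanoseconds fields are assumed to have been copied from the source_timestamp reported back after each write.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One entry per write call. The time fields are assumed to be copied from
// the source timestamp that the write operation filled in on return.
struct WriteRecord {
    std::string writer;     // label identifying the DataWriter (hypothetical)
    std::int64_t sec;
    std::uint32_t nanosec;
};

// Returns every pair of different writers that produced an identical
// source timestamp, in the order the writes were logged.
std::vector<std::pair<std::string, std::string>> find_timestamp_collisions(
        const std::vector<WriteRecord>& log) {
    std::map<std::pair<std::int64_t, std::uint32_t>, std::vector<std::string>> seen;
    std::vector<std::pair<std::string, std::string>> collisions;
    for (const WriteRecord& rec : log) {
        std::vector<std::string>& writers = seen[{rec.sec, rec.nanosec}];
        for (const std::string& earlier : writers) {
            if (earlier != rec.writer) {
                collisions.emplace_back(earlier, rec.writer);
            }
        }
        writers.push_back(rec.writer);
    }
    return collisions;
}
```

Running this over the log of a failing test run and finding a pair that includes the two DataWriters on the affected topic would support guess #2.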

Gerardo

 


Gerardo,

Your Guess #2 is correct.  The timestamps from the two writers are the same when the sample goes missing.

Thanks for the detailed explanation!  It helps tremendously in gaining insight into the inner workings of DDS.

Nate