Regulate sample traffic with datareader QoS


Hello,

I have a writer publishing a lot of topic instances (> 100 000), one sample per instance.

I want a late-joining reader (or several) to get the last sample of each instance from the writer history.

I don't want the writer to block when writing samples.

I set the QoS for both datareader and datawriter:

KEEP_LAST_HISTORY_QOS
depth = 1
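
For reference, the history settings above might be written in an XML QoS profile roughly as follows. This is only a sketch: it assumes the fragments sit inside a qos_profile element of a Connext XML QoS file, and it shows nothing beyond the two values quoted in the post.

```xml
<!-- Sketch only: assumed to sit inside a <qos_profile> element -->
<datawriter_qos>
  <history>
    <!-- Keep only the most recent sample of each instance -->
    <kind>KEEP_LAST_HISTORY_QOS</kind>
    <depth>1</depth>
  </history>
</datawriter_qos>

<datareader_qos>
  <history>
    <kind>KEEP_LAST_HISTORY_QOS</kind>
    <depth>1</depth>
  </history>
</datareader_qos>
```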

When the reader starts, all the samples are sent on the wire from the writer to the reader, leading to a huge peak in the flow of UDP packets.

When there is a large number of datawriters and datareaders of this kind in the whole system, the network may collapse.

However, I don't need all the samples to be sent on the wire when the reader starts.

I would prefer the reader to request new samples gradually as it processes them, by using QoS.

To be more precise, my reader processes samples by calling take() on the datareader queue.

I am looking for a way to have DDS internally request new history samples (one for each instance) gradually, as space frees up in the datareader queue.

So I tried to limit the datareader queue size by adding:

max_samples = 100
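
In XML form, this limit would be a resource_limits entry on the datareader. Again a sketch, showing only the value mentioned here next to the reader's history setting from earlier in the post:

```xml
<!-- Sketch only: assumed to sit inside a <qos_profile> element -->
<datareader_qos>
  <history>
    <kind>KEEP_LAST_HISTORY_QOS</kind>
    <depth>1</depth>
  </history>
  <resource_limits>
    <!-- Caps the total number of samples the reader queue can hold -->
    <max_samples>100</max_samples>
  </resource_limits>
</datareader_qos>
```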

However, with this sizing, the datareader is losing samples (it does not get all the instances) and I don't understand why.

Do you have an explanation for this behavior?

And perhaps a solution for my need?

Boris.

Howard

Hi Boris,

I assume that you are using the Durability QOS and a non-VOLATILE durability setting.  This method will automatically send the historic data upon discovery...and there is no way for the user to start/stop this process via an API.
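
As an illustration of that assumption, a non-VOLATILE setup typically looks like the fragment below. TRANSIENT_LOCAL durability and RELIABLE reliability are assumed here for the sake of the example; the original post does not say which durability kind is actually in use.

```xml
<!-- Sketch only: durability kind is an assumption, not taken from the post -->
<datawriter_qos>
  <reliability>
    <kind>RELIABLE_RELIABILITY_QOS</kind>
  </reliability>
  <durability>
    <!-- The writer keeps its history for late-joining readers -->
    <kind>TRANSIENT_LOCAL_DURABILITY_QOS</kind>
  </durability>
</datawriter_qos>

<datareader_qos>
  <reliability>
    <kind>RELIABLE_RELIABILITY_QOS</kind>
  </reliability>
  <durability>
    <!-- The reader asks for historical data on discovery -->
    <kind>TRANSIENT_LOCAL_DURABILITY_QOS</kind>
  </durability>
</datareader_qos>
```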

However, applications can use the Topic Query API to explicitly request data that's stored in a DataWriter's History Queue...

See

https://community.rti.com/glossary/topic-query

https://community.rti.com/static/documentation/connext-dds/6.0.1/doc/manuals/connext_dds/html_files/RTI_ConnextDDS_CoreLibraries_UsersManual/index.htm#UsersManual/TopicQueries.htm#Topic_Queries%3FTocPath%3DPart%25203%253A%2520Advanced%2520Concepts%7CTopic%2520Queries%7C_____0

Using Topic Query, you will have to write more code to request and process historical data in a separate channel...it's not as seamless as using Durability.

If you want to continue to use the Durability QOS, then you'll have to understand the underlying mechanism used by DDS to send historic data to new DataReaders...

So, when using KEEP_LAST history, DDS is allowed to overwrite the receive cache/queue if the application hasn't take()n the data and the queue is full (you're getting data faster than you can process it).  With a large max_samples, this will happen later; with a small max_samples, it can happen sooner.  Of course this can result in lost data.  (From DDS's viewpoint, DDS has reliably gotten the data from the sending app to the receiving app, but the receiving app didn't process the data and has configured DDS to basically use the receive cache as a circular queue.)

You can change this behavior by using KEEP_ALL history.  In that way, space will only open up in the receive cache when the user application has take()n the data received...allowing more data to be accepted into the cache.  HOWEVER, if the sending app is sending faster than the receiving app is processing, the receive cache will eventually fill up and any data that arrives when the cache is full is rejected and dropped by DDS.  This does not mean that the rejected data is lost...DDS will NACK the dropped data and ask the sending app to resend. 

Unfortunately, in your scenario, this will only exacerbate the problem: with a small receive cache, only a small part of the burst of historical data can be received on the first try, and the rejected data will need to be sent again...which takes more network bandwidth. Depending on the amount of historical data and the size of the receive queue, subsequent bursts of data may be rejected over and over again.  All of this leads to additional network load from sending the same data repeatedly in repair packets.

So, there are some ways that you can use to control the bandwidth used for sending the historical data.  DDS uses the reliable protocol for sending data that's stored in the DataWriter's send cache (aka history queue) that a reader hasn't yet received.  When using non-VOLATILE Durability, when a new DataReader is discovered, all of the data in the writer's history queue is considered data that hasn't yet been received by the reader. 

The writer app will send HB (heartbeats) periodically to inform the reader that there is data that it hasn't received.  The reader will send NACKs (negative acknowledgements) to the writer asking it to resend the data.  In the HB and NACKs are sequence numbers corresponding to the data that hasn't been received or the data that should be repaired.

Connext DDS gives you a bunch of QOS parameters that fundamentally will allow you to control the bandwidth used by this repair process.

1) DataWriterQos.protocol.rtps_reliable_writer.late_joiner_heartbeat_period - This is the period at which HBs are sent to newly discovered, non-VOLATILE DataReaders

2) DataWriterQos.protocol.rtps_reliable_writer.max_bytes_per_nack_response - This is the maximum number of bytes (for repairing missing data) that the DataWriter will send in response to a single NACK

3) DataReaderQos.protocol.rtps_reliable_reader.nack_period - This is the period that a DataReader will preemptively send NACKs for historical data when it discovers a new DataWriter

The default values of the late_joiner_heartbeat_period and the nack_period are actually quite large...3 and 5 seconds respectively.  They're actually not the main mechanism that is working to get the DataWriter to send the historical data to the DataReader.
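
For orientation, here is where those three parameters would live in an XML QoS profile. This is a sketch that simply writes out the default values quoted in this post (3 s, 5 s, and 128 KB, taken here as 131072 bytes); it is not a recommended tuning.

```xml
<!-- Sketch only: values are the defaults as quoted in this thread -->
<datawriter_qos>
  <protocol>
    <rtps_reliable_writer>
      <!-- 1) HB period towards newly discovered non-VOLATILE readers -->
      <late_joiner_heartbeat_period><sec>3</sec><nanosec>0</nanosec></late_joiner_heartbeat_period>
      <!-- 2) Upper bound on repair bytes sent per NACK (128 KB) -->
      <max_bytes_per_nack_response>131072</max_bytes_per_nack_response>
    </rtps_reliable_writer>
  </protocol>
</datawriter_qos>

<datareader_qos>
  <protocol>
    <rtps_reliable_reader>
      <!-- 3) Period of the reader's preemptive NACKs for historical data -->
      <nack_period><sec>5</sec><nanosec>0</nanosec></nack_period>
    </rtps_reliable_reader>
  </protocol>
</datareader_qos>
```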

Once the DataWriter receives a NACK for "missing" data, it will immediately send the "missing" data.  A NACK won't ever NACK for more data samples than the DataReader is able to store in its receive cache, or 256 samples, whichever is less.  And the DataWriter will only send up to max_bytes_per_nack_response bytes (default 128 KB).

If there are data samples that were NACKed but were not sent because of the byte limit, that data will only be sent when a new NACK is received.  HOWEVER, with each NACK response is a HB at the end, so this process is self-triggering...i.e., does not depend on the periodic HB to continue.  Repair data will be sent out in bursts of max_bytes_per_nack_response. 

There is a set of QOS parameters that you can use to somewhat control the rate at which these bursts are sent out. However, changing these values will also slow down the nominal reliability repair process if, during steady-state operation, the network loses a packet that needs to be repaired.

4) DataWriterQos.protocol.rtps_reliable_writer.{min/max}_nack_response_delay - which basically controls when DataWriter responds to a NACK.

For the fastest repair of missing data, you'd want the DataWriter to immediately send a repair packet (again, historical data are basically repair packets).  In steady state that's what you would typically want, and thus you would set these delays to 0.

But if you don't want to flood the network with repairs of historical data to new DataReaders, then you would want to set these delays to a larger value...you can experiment to see what value provides what bandwidth usage...this along with the bytes per NACK response.  However, note that this value sets the behavior for both historical data and live data.

This may or may not be sufficient for what you want to do...  So you can change max_bytes_per_nack_response to 64 KB or 32 KB or something (it should be at least as big as the largest data sample that needs to be repaired) and fiddle with the response delays to see if the resulting behavior meets your requirements both at startup and in steady state.
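
As a starting point for that kind of experiment, a throttled writer configuration might look like the sketch below. The 32 KB figure comes from the suggestion above; the 50 ms and 100 ms delays are purely hypothetical values chosen for illustration and would need to be tuned by measurement.

```xml
<!-- Sketch only: delay values are hypothetical, to be tuned by experiment -->
<datawriter_qos>
  <protocol>
    <rtps_reliable_writer>
      <!-- At most 32 KB of repair data per NACK; must be at least
           as large as the largest sample to repair -->
      <max_bytes_per_nack_response>32768</max_bytes_per_nack_response>
      <!-- Wait between 50 ms and 100 ms before answering a NACK -->
      <min_nack_response_delay><sec>0</sec><nanosec>50000000</nanosec></min_nack_response_delay>
      <max_nack_response_delay><sec>0</sec><nanosec>100000000</nanosec></max_nack_response_delay>
    </rtps_reliable_writer>
  </protocol>
</datawriter_qos>
```

As a rough estimate, 32 KB per response with 50 to 100 ms between responses works out to somewhere around 320 to 640 KB/s of repair traffic per late-joining reader. This ignores round-trip time and the reader's own heartbeat response delay, which would lower the figure further, and the same delays would also apply to repairs of live data.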


Boris

Thanks a lot for your detailed explanation.

Indeed, I realized that one can set KEEP_LAST for the datawriter and KEEP_ALL for the datareader.

With this config, the datawriter always keeps the last sample and the datareader never misses the creation of an instance, even with a finite queue size.

Moreover, the datareader "regulates" the traffic by not asking for samples when they are not needed.

That's perfect!
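
For readers following along, the combination described above could be sketched in XML as follows. The max_samples value of 100 is carried over from the first post and is assumed to still apply; the fragments are assumed to sit inside a qos_profile element.

```xml
<!-- Sketch only: max_samples value assumed from the first post -->
<datawriter_qos>
  <history>
    <!-- Writer keeps only the latest sample per instance -->
    <kind>KEEP_LAST_HISTORY_QOS</kind>
    <depth>1</depth>
  </history>
</datawriter_qos>

<datareader_qos>
  <history>
    <!-- Reader never overwrites samples that have not been taken -->
    <kind>KEEP_ALL_HISTORY_QOS</kind>
  </history>
  <resource_limits>
    <!-- Finite queue: samples beyond this are rejected and repaired later -->
    <max_samples>100</max_samples>
  </resource_limits>
</datareader_qos>
```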

 

I now have to play with the QoS you suggested in order to avoid, as much as possible, peaks in the UDP flow due to the DDS protocol.

If I may, what do you think about this QoS:

      <datawriter_qos>
        <protocol>
          <push_on_write>false</push_on_write>
          <rtps_reliable_writer>          
            <heartbeats_per_max_samples>100000</heartbeats_per_max_samples>
            <high_watermark>1</high_watermark>
            <low_watermark>0</low_watermark>
            <min_nack_response_delay><sec>0</sec><nanosec>0</nanosec></min_nack_response_delay>
            <max_nack_response_delay><sec>0</sec><nanosec>0</nanosec></max_nack_response_delay>
            <fast_heartbeat_period><sec>0</sec><nanosec>100000000</nanosec></fast_heartbeat_period>
            <late_joiner_heartbeat_period><sec>0</sec><nanosec>100000000</nanosec></late_joiner_heartbeat_period>
            <max_bytes_per_nack_response>10000</max_bytes_per_nack_response>
            <max_heartbeat_retries>DDS_LENGTH_UNLIMITED</max_heartbeat_retries>          
            <heartbeat_period><sec>3600</sec><nanosec>0</nanosec></heartbeat_period>            
          </rtps_reliable_writer>          
        </protocol>
      </datawriter_qos>
      
      <datareader_qos>
        <protocol>
          <rtps_reliable_reader>
            <min_heartbeat_response_delay><sec>0</sec><nanosec>0</nanosec></min_heartbeat_response_delay>
            <max_heartbeat_response_delay><sec>0</sec><nanosec>200000000</nanosec></max_heartbeat_response_delay>
          </rtps_reliable_reader>
        </protocol>
      </datareader_qos>

Note that the data carried by the instances is quite small (<< 1kb) and I have no hard real-time constraints on the writer/reader data exchange.

Boris.

Howard

Hi Boris,

So, it's difficult to comment on the QOS that you provided without any context, i.e., what is the behavior you're trying to configure or performance requirements that you're trying to meet with the QOS changes.  Also, the QOSes that you posted interact with other QOSes that you didn't show.

I can say that your changes look strange:  Why a heartbeat period of 1 hr and fast/late joiner HB periods of 100 ms?  Why 100,000 HB per max samples?  What is the value of max samples?  What are you hoping to achieve with push_on_write = false and a max_bytes_per_nack_response of 10,000?  Or the heartbeat_response_delay of 200 ms?