Hi, I'm testing the Java version of Connext DDS on Windows XP.
I measured the RTT (round-trip time) of reliable messages and found that during the initial ~9 s the RTT reaches 1~2 s (the overall average is about 2 ms).
The Reliability QoS is set to RELIABLE for the DataWriters and DataReaders, and asynchronous publish mode with the default flow controller is also set.
The sender sends 1600-byte reliable data every 50 ms and 8000-byte unreliable data every 10 ms.
My guess is that the data-sending path occasionally takes a long time at some interval (see the attached file).
This happens only in the Java program (I tested the same program in C++ and C#, and the problem does not appear there).
I want to know why this happens and what other configuration settings I need to change.
Thanks in advance.
| Attachment | Size |
|---|---|
| RTT_reliable data_rti.xlsx | 43.36 KB |
Hello,
The Excel spreadsheet that you attached appears to be corrupted. I tried to open it using Excel on both Windows and Mac and I get an error. Could you perhaps place it inside a ZIP file and attach it again?
Does the "initial" RTT refer to the very first sample you sent? What QoS are you using for the DataWriters and DataReaders? If you are configuring the QoS via an XML file, can you also attach that file so we can take a look?
Absent further information, my guesses would be:
1) If it is the first sample whose RTT takes a long time, then what you are really measuring may be the discovery time. The first sample (or its reply) cannot be delivered until discovery completes on both sides. This uses a separate asynchronous thread, and the timing might depend on the Java VM scheduler.
You can eliminate this scenario by either removing the first sample from your RTT computation, or else waiting until discovery completes (notified via the listener on the DataWriter/DataReader or the discovery builtin topics) before starting to write; a small sketch of the waiting approach follows.
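For illustration, here is a minimal sketch of that waiting approach using the Connext Java API. It polls the publication-matched status for brevity; a DataWriterListener with on_publication_matched() would avoid the sleep loop. The class and method names are just placeholders.

```java
import com.rti.dds.publication.DataWriter;
import com.rti.dds.publication.PublicationMatchedStatus;

// Placeholder helper: block until the DataWriter has discovered at least
// one matching DataReader, so the first RTT sample is not measuring discovery.
public final class DiscoveryWait {
    public static void waitForReader(DataWriter writer) throws InterruptedException {
        PublicationMatchedStatus status = new PublicationMatchedStatus();
        writer.get_publication_matched_status(status); // fills 'status' in place
        while (status.current_count == 0) {
            Thread.sleep(100); // poll every 100 ms
            writer.get_publication_matched_status(status);
        }
    }
}
```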
2) If it is some samples other than the first that have a large RTT, then my guess is that the QoS settings of the reliable protocol parameters are not aggressive enough. The out-of-the-box settings are oriented more towards low CPU and network usage, not towards low latency or high throughput.
If you do not lose any samples, you will not see this. But if for some reason a sample is lost, then the time to repair that sample with the out-of-the-box settings can be quite large (up to 3 seconds). This can explain what you are seeing.
The key parameters here are in the DataWriterQos, specifically the attributes under DataWriterQos.protocol.rtps_reliable_writer. This attribute is of type RtpsReliableWriterProtocol_t, and the critical fields are: max_samples, min_send_window_size, max_send_window_size, heartbeats_per_max_samples, max_nack_response_delay, min_nack_response_delay, fast_heartbeat_period, and late_joiner_heartbeat_period.
The following are typically reasonable values to minimize latency (a Java sketch applying them follows the list):
max_samples = com.rti.dds.infrastructure.ResourceLimitsQosPolicy.LENGTH_UNLIMITED
min_send_window_size = 20
max_send_window_size = min_send_window_size
heartbeats_per_max_samples = max_send_window_size (or max_samples, if max_samples != LENGTH_UNLIMITED)
max_nack_response_delay = {0, 0}
min_nack_response_delay = max_nack_response_delay
fast_heartbeat_period = {0, 1000000}
late_joiner_heartbeat_period = fast_heartbeat_period
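Applied programmatically, this would look roughly like the sketch below (assuming a DataWriterQos obtained from get_default_datawriter_qos(); note that max_samples is set through resource_limits in the Java API):

```java
import com.rti.dds.infrastructure.ResourceLimitsQosPolicy;
import com.rti.dds.publication.DataWriterQos;

// Rough sketch of the low-latency reliability settings listed above.
public final class LowLatencyQos {
    static void tune(DataWriterQos writerQos) {
        // max_samples lives under resource_limits, not rtps_reliable_writer
        writerQos.resource_limits.max_samples = ResourceLimitsQosPolicy.LENGTH_UNLIMITED;

        writerQos.protocol.rtps_reliable_writer.min_send_window_size = 20;
        writerQos.protocol.rtps_reliable_writer.max_send_window_size = 20; // = min_send_window_size

        // max_samples is unlimited here, so tie the heartbeat count to the send window
        writerQos.protocol.rtps_reliable_writer.heartbeats_per_max_samples = 20;

        // respond to NACKs immediately: {0 s, 0 ns}
        writerQos.protocol.rtps_reliable_writer.min_nack_response_delay.sec = 0;
        writerQos.protocol.rtps_reliable_writer.min_nack_response_delay.nanosec = 0;
        writerQos.protocol.rtps_reliable_writer.max_nack_response_delay.sec = 0;
        writerQos.protocol.rtps_reliable_writer.max_nack_response_delay.nanosec = 0;

        // heartbeat aggressively while samples are unacknowledged: 1 ms = 1,000,000 ns
        writerQos.protocol.rtps_reliable_writer.fast_heartbeat_period.sec = 0;
        writerQos.protocol.rtps_reliable_writer.fast_heartbeat_period.nanosec = 1000000;
        writerQos.protocol.rtps_reliable_writer.late_joiner_heartbeat_period.sec = 0;
        writerQos.protocol.rtps_reliable_writer.late_joiner_heartbeat_period.nanosec = 1000000;
    }
}
```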
Gerardo
Thank you, and sorry about the attachment.
#1
The original QoS settings for my DataWriters and DataReaders are listed below.
[DataWriter QoS]
writerQos.deadline.period.sec = Duration_t.DURATION_INFINITE_SEC;
writerQos.deadline.period.nanosec = Duration_t.DURATION_INFINITE_NSEC;
writerQos.history.kind = HistoryQosPolicyKind.KEEP_ALL_HISTORY_QOS;
writerQos.reliability.kind = ReliabilityQosPolicyKind.RELIABLE_RELIABILITY_QOS;
writerQos.reliability.max_blocking_time.sec = 5;
writerQos.reliability.max_blocking_time.nanosec = 1000;
writerQos.destination_order.kind = DestinationOrderQosPolicyKind.BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS;
writerQos.history.depth = 12;
writerQos.publish_mode.kind = PublishModeQosPolicyKind.ASYNCHRONOUS_PUBLISH_MODE_QOS;
writerQos.publish_mode.flow_controller_name = FlowController.DEFAULT_FLOW_CONTROLLER_NAME;
[DataReader QoS]
readerQos.deadline.period.sec = Duration_t.DURATION_INFINITE_SEC;
readerQos.deadline.period.nanosec = Duration_t.DURATION_INFINITE_NSEC;
readerQos.destination_order.kind = DestinationOrderQosPolicyKind.BY_RECEPTION_TIMESTAMP_DESTINATIONORDER_QOS;
readerQos.history.kind = HistoryQosPolicyKind.KEEP_ALL_HISTORY_QOS;
readerQos.reliability.kind = ReliabilityQosPolicyKind.RELIABLE_RELIABILITY_QOS;
readerQos.reliability.max_blocking_time.sec = 5;
readerQos.reliability.max_blocking_time.nanosec = 1000;
The test results for this QoS are in the attachment "1.txt".
In the attachment, I compared the results between C# and Java; the QoS settings were the same for both.
The C# test results look perfectly fine.
#2
Today, following your advice, I added the QoS settings you mentioned on top of the original QoS listed above, and the original problem, the sudden increase in RTT, seemed to be solved.
But a new problem arose: I lost some data, even though I enabled the RELIABLE QoS!
I attached the result of today's test ("2.txt"; the problem occurred around the 1667th message).
#3
I changed the parameters again (the corresponding settings are sketched after the list).
The changes were:
- Publish mode: from asynchronous to synchronous
- fast_heartbeat_period: from 1 ms to 100 ms
- late_joiner_heartbeat_period: from 1 ms to 100 ms
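For reference, those changes correspond roughly to the following (same programmatic setup as the earlier sketches; 100 ms = 100,000,000 ns):

```java
// Relaxing the heartbeat periods from 1 ms to 100 ms ('writerQos' as above)
writerQos.protocol.rtps_reliable_writer.fast_heartbeat_period.sec = 0;
writerQos.protocol.rtps_reliable_writer.fast_heartbeat_period.nanosec = 100000000;
writerQos.protocol.rtps_reliable_writer.late_joiner_heartbeat_period.sec = 0;
writerQos.protocol.rtps_reliable_writer.late_joiner_heartbeat_period.nanosec = 100000000;
// and switching the publish mode back to synchronous:
writerQos.publish_mode.kind = PublishModeQosPolicyKind.SYNCHRONOUS_PUBLISH_MODE_QOS;
```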
In this case, all published samples arrived at the receiving node without any loss, but the RTT took much longer, about 300 ms.
The result is attached as "3.txt", and the problem can be found around the 1667th message.
Would you please help me figure out why these problems happened?
I really appreciate your help.
Thank you.
Hi,
The 2nd and 3rd attachments seem to be swapped with respect to your description. The 3rd attachment (http://community.rti.com/sites/default/files/3_0.txt) shows some data loss around sample 1667 but no delays, whereas the 2nd attachment (http://community.rti.com/sites/default/files/2_0.txt) shows no loss but higher delays around sample 1667. Can you confirm which is which?
It is very interesting that the issue appears in both cases at sample 1667. Is this repeatable over various tests? I wonder if this is caused by some Java VM behavior, where it decides to do garbage collection or something like that and in doing so introduces a significant delay in the critical path. Are you doing Java object allocations and frees in the critical path of message processing?
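To illustrate what allocation-free message processing looks like with the Connext Java API, a reader can reuse the same loaned sequences on every take() call. In this sketch, Foo, FooSeq, and FooDataReader are placeholders for your IDL-generated type:

```java
import com.rti.dds.infrastructure.RETCODE_NO_DATA;
import com.rti.dds.infrastructure.ResourceLimitsQosPolicy;
import com.rti.dds.subscription.InstanceStateKind;
import com.rti.dds.subscription.SampleInfoSeq;
import com.rti.dds.subscription.SampleStateKind;
import com.rti.dds.subscription.ViewStateKind;

public final class NoAllocProcessing {
    // Allocated once and reused on every call, so the hot path creates no garbage.
    private final FooSeq dataSeq = new FooSeq();
    private final SampleInfoSeq infoSeq = new SampleInfoSeq();

    void onDataAvailable(FooDataReader reader) {
        try {
            reader.take(dataSeq, infoSeq,
                        ResourceLimitsQosPolicy.LENGTH_UNLIMITED,
                        SampleStateKind.ANY_SAMPLE_STATE,
                        ViewStateKind.ANY_VIEW_STATE,
                        InstanceStateKind.ANY_INSTANCE_STATE);
            for (int i = 0; i < dataSeq.size(); ++i) {
                // process dataSeq.get(i) here without creating new objects
            }
        } catch (RETCODE_NO_DATA noData) {
            // nothing to read this time
        } finally {
            reader.return_loan(dataSeq, infoSeq); // hand the buffers back
        }
    }
}
```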
It seems that something is causing a delay in processing that sample (1667), and depending on the reliability settings that delay is causing samples to be lost. I think this could be explained by the settings of the HISTORY QoS. Can you confirm which settings you are using for Example 2 and Example 3?
Data loss when you configure the RELIABILITY QoS as RELIABLE is typically due to the HISTORY QoS being configured as kind KEEP_LAST with a small history depth.
The out-of-the-box settings are HISTORY kind=KEEP_LAST, depth=1.
The same applies to the DataReader.
This setting indicates that the DataWriter only needs to keep in its cache the last sample for every instance (key). Thus, if you are writing fast, it can happen that a sample is lost and, by the time the reliability protocol detects it, sends a NACK, and the DataWriter is ready to repair it, a new sample has been written for the same key and has replaced the previous one. Note that this does not violate the RELIABILITY contract, because RELIABILITY just requires the DataWriter to reliably communicate the samples in the DataWriter cache to the DataReader, while HISTORY KEEP_LAST says that keeping all the history is not needed: it is better to keep only the last few samples (per key) and not burden readers that might have lost something with what is now old data.
If you do not want to lose those 'intermediate' samples, then you can set the HISTORY 'depth' large enough that, at the rate you write each key, the system has time to repair missing samples.
Or, alternatively, use HISTORY kind=KEEP_ALL, in which case the DataWriter cannot remove a sample until it is fully acknowledged by all active reliable DataReaders.
I would make that setting in both the DataWriter and DataReader QoS, because a DataReader with HISTORY kind=KEEP_LAST, depth=1 could also miss a sample if for some reason the scheduler did not wake up the DataReader, or the application decided not to read/take the sample, and in the meantime a new sample for the same key arrived. A minimal sketch of the KEEP_ALL configuration follows.
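In the programmatic style of your earlier snippets, that configuration is simply (a sketch, reusing the QoS objects you already have):

```java
// KEEP_ALL history on both ends, so reliable repair is never defeated by
// samples being replaced in the writer or reader cache.
writerQos.history.kind = HistoryQosPolicyKind.KEEP_ALL_HISTORY_QOS;
readerQos.history.kind = HistoryQosPolicyKind.KEEP_ALL_HISTORY_QOS;
// With KEEP_ALL the writer may block when its cache fills; that blocking is
// bounded by reliability.max_blocking_time (5 s in your settings).
```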
Gerardo
Hi,
Thank you for your concern and support.
In my case, I confirmed that the data loss at sample 1667 is repeatable and that the garbage collector was not running at that time. I had also set HISTORY kind = KEEP_ALL.
Actually, I resolved my problem by tuning the JVM.
I first changed the JVM to JRockit and found that the data loss did not occur, the RTT delay decreased, and the delay appeared only in the initial phase.
So I then tuned my HotSpot JVM to increase the heap size, page size, and so on (example flags are sketched below).
After that, there is no data loss and no high delays.
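The exact flags I used are not listed here; purely as an illustration, a HotSpot tuning along those lines might look like this (all values hypothetical):

```
java -Xms512m -Xmx512m -XX:+UseLargePages MyDdsTestApp
```

-Xms/-Xmx fix the initial and maximum heap size (avoiding resizing pauses), and -XX:+UseLargePages enables large page support where the OS allows it.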
Is this solution right?
If you have any other suggestions, please let me know.
Once again, thanks for your help.