Losing Messages

Offline
Last seen: 10 years 11 months ago
Joined: 12/12/2013
Posts: 4
I have two programs -- I'll call them REQUESTER and RESPONDER. The REQUESTER sends a request DDS message to the RESPONDER and, in the DDS callback, the RESPONDER sends back 88 small messages in a tight loop. Once the RESPONDER sends the messages it exits. When the REQUESTER receives all 88 messages it exits. However, the REQUESTER only gets the first 32 messages and waits for the others. The RESPONDER exits as expected after sending all 88.

If I put a 10ms sleep between each response message all 88 are received.

I've tried enabling reliable comms by setting RELIABLE_RELIABILITY_QOS and KEEP_ALL_HISTORY_QOS in the DataReader and DataWriter. No change.

I've tried sleeping in the RESPONDER after sending the messages (giving queued messages time to leave before exit). No change.

Any ideas?

Thanks,

rway

Details:
- ndds.5.0.0
- Using Java API on Windows Vista.
- Both apps on same machine.
- Only listening for DATA_AVAILABLE_STATUS
- Not using request/reply, just basic pub/sub with topics

Offline
Last seen: 9 months 3 weeks ago
Joined: 06/13/2013
Posts: 17

Richard,

Are you using reliable or best effort? Assuming you are using reliable reliability, my guess is that the messages are still in the writer's send queue when the RESPONDER exits, so they get deleted. You can use wait_for_acknowledgments to wait until all the responses have been sent and acknowledged. After that it is safe to exit the RESPONDER.
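To illustrate the idea of waiting for acknowledgments before exiting, here is a minimal plain-Java analogy. It is not RTI code: the class AckWaitSketch, the method sendAndWait, and the CountDownLatch standing in for the writer's count of unacknowledged samples are all assumptions made for the sketch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-Java analogy (hypothetical names, no DDS involved): the latch plays
// the role of the samples that have been written but not yet acknowledged.
class AckWaitSketch {
    // Returns true if every sample was acknowledged before the timeout expired.
    static boolean sendAndWait(int samples, long ackDelayMillis, long timeoutMillis) {
        CountDownLatch unacked = new CountDownLatch(samples);
        ExecutorService reader = Executors.newSingleThreadExecutor();
        try {
            for (int i = 0; i < samples; i++) {
                // "Write" a sample; the stand-in reader acknowledges it later.
                reader.submit(() -> {
                    try {
                        Thread.sleep(ackDelayMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    unacked.countDown();
                });
            }
            // Block until all acknowledgments arrive or the timeout expires.
            return unacked.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Only now is it safe to tear down and exit.
            reader.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAndWait(88, 0, 5000));
        System.out.println(sendAndWait(88, 50, 100));
    }
}
```

The second call in main uses a timeout far shorter than the time the stand-in reader needs, so the wait gives up early. In a real application you would check the result before deciding to exit.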

Let me know if this works for you.

Andre

Offline
Last seen: 3 weeks 6 days ago
Joined: 06/02/2010
Posts: 602

Hello,

I think what you are observing is caused by having your RESPONDER send all 88 messages within the on_data_available callback.

To minimize message reception latency, the on_data_available callback is executed within the context of the internal RTI DDS thread that receives the message. This means that while you are in the callback, RTI DDS cannot process any other messages. This includes internal reliability traffic such as negative acknowledgments (NACKs). Assume that some of the 88 messages sent by the RESPONDER to the REQUESTER get lost (actually dropped due to a resource limit, as I explain at the end). The REQUESTER will send back a NACK asking for the lost messages to be re-sent. However, the DDS thread within the RESPONDER that is supposed to process the NACK is busy inside your on_data_available callback, so it cannot send the repair. The fact that you "sleep" inside the RESPONDER to wait for the messages to be delivered does not help, because you are sleeping inside on_data_available, which is still preventing the processing of the NACKs. In fact, this sleep just makes matters worse.

The best thing to do would be to change the RESPONDER so it does not send the 88 messages from the on_data_available callback. You can do this using a DDS WaitSet to have some other thread (maybe the main thread) wait until data is received and then send the messages. If you do that and sleep a little after you have sent the 88 messages (or call wait_for_acknowledgments as Andre suggested), then all your messages should be delivered.
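The hand-off pattern can be sketched in plain Java. This is only an analogy under stated assumptions, not the WaitSet API: the class CallbackHandoffSketch and its methods onRequest and serveOne are hypothetical names, and a BlockingQueue stands in for the mechanism that wakes up the application thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Plain-Java analogy (hypothetical names): the callback only records the
// request and returns, so the receive thread is free again immediately.
class CallbackHandoffSketch {
    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    final List<String> sent = new ArrayList<>();

    // Stand-in for on_data_available: no sending and no sleeping in here.
    void onRequest(String request) {
        requests.add(request);
    }

    // Runs on an application thread: waits for one request, then sends replies.
    void serveOne(int replies) {
        try {
            String request = requests.take();
            for (int i = 0; i < replies; i++) {
                sent.add(request + "#" + i); // stand-in for writing one response
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CallbackHandoffSketch responder = new CallbackHandoffSketch();
        // A separate thread plays the role of the middleware receive thread.
        new Thread(() -> responder.onRequest("req")).start();
        responder.serveOne(88);
        System.out.println(responder.sent.size());
    }
}
```

The point of the design is that the long-running work (the 88 writes) happens on a thread you own, so whatever thread delivered the request can go back to handling other traffic.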

You may wonder why you seem to only get the first 32 messages... This is due to the fact that your RESPONDER is essentially trying to write all 88 messages instantaneously; due to processor/operating system interrupt latency, the REQUESTER probably does not even wake up before you have finished writing. In this situation, if there is any resource limit that prevents all 88 messages from being stored, then the later messages will be lost. Since you mention both applications are on the same machine, the limit is probably due to the default setting for the SharedMemory transport, which has a parameter called received_message_count_max that specifies the maximum number of messages that can be buffered on the shared memory reader queue. The default setting for this is 32. While this configuration can be changed (see the RTI Connext DDS User's Manual, section 15.6, Setting Builtin Transport Properties with the PropertyQosPolicy), I would not recommend changing it because it can cause incompatibilities with other RTI DDS applications running on the same computer. So the approach of not sending the messages from on_data_available is the best one, I think.
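The effect of a 32-slot receive queue on an 88-message burst can be reproduced with a bounded queue in plain Java. This is a rough model, not the shared memory transport itself: SharedMemoryQueueSketch, burstAccepted, and pacedAccepted are hypothetical names, and a non-blocking offer stands in for a message being dropped when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Plain-Java model (hypothetical names): a bounded queue that rejects new
// entries when full, like a receive buffer with a fixed message count.
class SharedMemoryQueueSketch {
    // The writer sends the whole burst before the reader wakes up.
    static int burstAccepted(int total, int capacity) {
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < total; i++) {
            if (queue.offer(i)) { // offer returns false when the queue is full
                accepted++;
            }
        }
        return accepted;
    }

    // The reader gets a chance to take each message before the next write.
    static int pacedAccepted(int total, int capacity) {
        ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < total; i++) {
            if (queue.offer(i)) {
                accepted++;
            }
            queue.poll(); // stand-in for the reader draining one message
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(burstAccepted(88, 32)); // prints 32
        System.out.println(pacedAccepted(88, 32)); // prints 88
    }
}
```

In this model the burst case keeps exactly the first 32 messages and drops the other 56, while the paced case keeps all 88.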

This also explains why you do not see the problem if you pause after sending each message. In this situation the REQUESTER has a chance to process the messages before the limit of 32 in its shared memory queue is reached. No message is lost, and therefore the fact that the RESPONDER is not seeing any NACKs does not cause problems. But this is a lucky scenario. As you can deduce from the previous explanation about the threads, it is really a bad idea to sleep inside the on_data_available callback.

Gerardo

Offline
Last seen: 10 years 11 months ago
Joined: 12/12/2013
Posts: 4

Thank you for the detailed responses. You have a professionally-developed, professionally-documented and professionally-supported product. It's a rarity and I really appreciate it.

I tried sending the 88 messages from the main thread instead of the callback. No change.

I tried calling wait_for_acknowledgments() with a 2 second timeout after sending the messages. The call times out, HOWEVER after that the remaining messages arrive at the REQUESTER.

I tried calling wait_for_acknowledgments() with a 10 second timeout. The call does not time out and all messages arrive, but there is a 4-5 second gap between the first 32 messages and the remaining 56.

I tried to create a small example to illustrate the problem, but it works as expected. All 88 messages are immediately received. So I must be doing something unusual in my larger program. I'll try to track that down, but if you have any more DDS-internals insights I'd really appreciate it. I'm out of ideas.

Offline
Last seen: 3 weeks 6 days ago
Joined: 06/02/2010
Posts: 602

Hello,

Were you able to create a simple reproducer? This would certainly make it easier to troubleshoot what specific QoS or coding pattern is causing this...

What you are seeing is consistent with a situation where one of the last messages the RESPONDER sends before it stops writing is lost. Say it is the last message it sends (the 88th in your case). In this situation, given that there are no more messages being sent from the RESPONDER, the only way for the REQUESTER to detect that it has lost something is the periodic heartbeat that the RESPONDER sends. However, in Connext 5.0.x the default period for this 'periodic heartbeat' (the so-called fast_heartbeat_period) is 3 seconds. This can explain the 3+ second gap you are seeing.
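As a back-of-the-envelope illustration of why the heartbeat period bounds the gap, here is a small sketch. It assumes a simplified model in which heartbeats fire at fixed multiples of the period and the repair happens right after the next heartbeat; HeartbeatGapSketch and repairDelayMillis are hypothetical names, and the real protocol timing is more involved.

```java
// Simplified timing model (an assumption, not the actual protocol):
// heartbeats fire at fixed multiples of periodMillis, and a lost final
// sample is only detected at the first heartbeat after it was written.
class HeartbeatGapSketch {
    static long repairDelayMillis(long lastWriteMillis, long periodMillis) {
        long nextHeartbeat = (lastWriteMillis / periodMillis + 1) * periodMillis;
        return nextHeartbeat - lastWriteMillis;
    }

    public static void main(String[] args) {
        // Last sample written 100 ms in, 3 second heartbeat period.
        System.out.println(repairDelayMillis(100, 3000)); // prints 2900
        // Same write time with a hypothetical 250 ms period.
        System.out.println(repairDelayMillis(100, 250));  // prints 150
    }
}
```

Under this model the wait before a repair can start is at most one heartbeat period, which is why a shorter period shrinks the gap.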

To avoid this you could try adjusting the DataWriterQoS and specifically the fast_heartbeat_period. The Knowledge Base article titled "Which QoS parameters are important to tune for throughput testing?" describes the settings that impact performance most.

Gerardo

Offline
Last seen: 10 years 11 months ago
Joined: 12/12/2013
Posts: 4

(1) By tweaking the QoS parameters mentioned in the "Which QoS parameters..." link I could change the number of messages successfully received, but could not consistently receive them all. I don't know how to set or determine 'receive_buffer_size' or how to determine 'message_size_max', so I didn't know how to set the parameters exactly.
(2) The attached "simple reproducer" Java program demonstrates the problem. I had to up the number of messages (from my larger program) to see message loss. If you remove the comment on line 70 of the TestPublisher it slows message sending and all messages are received.

I don't mind if the writer blocks, but setting 'max_blocking_time' in the DataWriter had no effect.

Note: My two programs run on the same machine (Windows Vista), so I assume they are using the shared memory transport.

Any additional insight would be appreciated.

Offline
Last seen: 10 years 11 months ago
Joined: 12/12/2013
Posts: 4

Now that I understand the problem a little better, I think my question boils down to this:

When a publisher and subscriber are on the same machine and the publisher is sending messages quickly, the receiver loses messages even when the connection is RELIABLE. (See example code.) What do I need to do to ensure all messages get received? Performance isn't an issue. I don't care if the publisher blocks to allow the subscriber to catch up.


Thanks.