DDS Discovery very slow

Jim wrote:

During discovery, one of our services, which contains about 20 message topics, takes at least 15 seconds to complete discovery. Is this normal? Are there known ways to speed up discovery (bandwidth improvements, etc.)? Can we get DDS to use more network bandwidth during discovery to speed up the process? It currently seems to use fairly little bandwidth during discovery.

Thanks,

Jim

Gerardo Pardo wrote:

What do you mean by "complete discovery"? Do you mean:

  1. Your service (i.e., the writers and readers it has) is discovered by the applications that are already running,
  2. Your service discovers the applications that are already running, or
  3. Both?

If your service starts while the remaining applications are already running, then the time it takes the service to discover those applications is going to depend mostly on the number of Participants, DataWriters, and DataReaders in all the existing applications, and not so much on the number that the service contains. This is because each application only needs to discover the things the service has, which is done quickly via multicast, while the service has to discover everything that is already there in order to match its local writers. This is potentially a much larger set and dominates the total time.
 
It is not easy to say whether 15 seconds is normal without understanding more of the system: the number of participants, the number of DataWriters and DataReaders already present, the bandwidth, and the round-trip time. I would not be too surprised if, for some of these parameters, the network is not heavily loaded by discovery traffic when the new service appears, because the protocol used to send the information to the service is being throttled by the service itself.

In terms of things that could speed up the discovery protocol, I would look at modifying the DiscoveryConfigQosPolicy, specifically the attributes publication_reader, subscription_reader, publication_writer, and subscription_writer.

The publication_reader and subscription_reader are of type DDS_RtpsReliableReaderProtocol_t. In these I would try modifying the attributes heartbeat_suppression_duration (to 0 seconds, 0 nanoseconds) and nack_period (to 0 seconds, 500000000 nanoseconds).

The publication_writer and subscription_writer are of type DDS_RtpsReliableWriterProtocol_t. In these I would modify late_joiner_heartbeat_period to (0 seconds, 500000000 nanoseconds).

If you see significant differences after making these kinds of changes, it would indicate that some tuning could be a good way to improve performance.
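
A minimal sketch of how these settings might look as an XML QoS profile, assuming they are applied through the participant QoS in a USER_QOS_PROFILES.xml file. The library and profile names are placeholders, and the values are only the starting points suggested above, not tuned recommendations.

```xml
<dds>
  <!-- Placeholder library and profile names -->
  <qos_library name="ExampleDiscoveryLibrary">
    <qos_profile name="FasterDiscovery">
      <participant_qos>
        <discovery_config>
          <!-- Builtin readers: answer heartbeats promptly, NACK every 0.5 s -->
          <publication_reader>
            <heartbeat_suppression_duration>
              <sec>0</sec>
              <nanosec>0</nanosec>
            </heartbeat_suppression_duration>
            <nack_period>
              <sec>0</sec>
              <nanosec>500000000</nanosec>
            </nack_period>
          </publication_reader>
          <subscription_reader>
            <heartbeat_suppression_duration>
              <sec>0</sec>
              <nanosec>0</nanosec>
            </heartbeat_suppression_duration>
            <nack_period>
              <sec>0</sec>
              <nanosec>500000000</nanosec>
            </nack_period>
          </subscription_reader>
          <!-- Builtin writers: heartbeat late joiners every 0.5 s -->
          <publication_writer>
            <late_joiner_heartbeat_period>
              <sec>0</sec>
              <nanosec>500000000</nanosec>
            </late_joiner_heartbeat_period>
          </publication_writer>
          <subscription_writer>
            <late_joiner_heartbeat_period>
              <sec>0</sec>
              <nanosec>500000000</nanosec>
            </late_joiner_heartbeat_period>
          </subscription_writer>
        </discovery_config>
      </participant_qos>
    </qos_profile>
  </qos_library>
</dds>
```

Comparing discovery time with and without a profile like this is a quick way to tell whether the protocol timing, rather than the network, is the bottleneck.
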
Gerardo

Jim wrote:

For this case, I mean the service completing discovery with all other services currently running - from startup to the point that messages can be exchanged.  In this case, there are 6 other services and the time, from startup of the service in question to it being able to send/receive all 45 message types, is about 15 seconds.

Also, if there is a message type that will never find a match (because the service that would match it is not running), will that slow down discovery?

Gerardo Pardo wrote:

Hi,

There are 6 other services. But how many DataWriters and DataReaders does each of the other services have?

In other words, what is the total number of DataWriters and DataReaders in the system? This number is what will dominate the discovery time. The new service will have to discover all the DataWriters and DataReaders in the system. It will start sending and receiving data as soon as it discovers each matching DataReader/Writer. But if it happens that the one it needs is the last one to be created in some service, it will have to discover everything before it gets to that one.

Did you try modifying some of the discovery parameters I mentioned? If so, did you see any differences in discovery time?

Gerardo

Jim wrote:

This comment does help me understand a little better: "It will start sending and receiving data as soon as it discovers each matching DataReader/Writer. But if it happens that the one it needs is the last one to be created in some service, it will have to discover everything before it gets to that one."

Question is: is there a way to query a Reader/Writer to know when it has completed discovery?  We are not running reliable protocol so I'm not sure that the changes would have an effect, is that correct?  I guess we're looking to know when discovery is complete so we know when we can successfully send a message (without using reliable protocol).  Is there a way to know this before sending? 

Gerardo Pardo wrote:

I will start with a disclaimer and then point you to some mechanisms you can use...

Disclaimer: Strictly speaking, discovery is never complete, as it is an ongoing process that keeps discovering new Participants as they appear and new DataWriters and DataReaders as they are created or destroyed, and also detects changes to their QoS, content filters, etc.

From the point of view of an application that joins an existing system, it makes sense to ask "when did I finish discovering everything that was already there when I joined?" However, it is hard to answer this question deterministically in a distributed peer-to-peer environment. Without a central service, how do you really know that there isn't some application that was already started but for some reason or configuration hasn't sent an announcement yet, or whose initial announcements were lost to a network error or resource overflow?

Possible approaches: That said, the middleware offers APIs that can be used to answer questions like the one above:

First option: If you know what you are looking for, meaning a DataWriter knows it expects a certain number of DataReaders or a DataReader expects a certain number of DataWriters, you can detect whether they have already been discovered by checking their "matched status", which tells you how many matched entities they have. You do this by calling get_subscription_matched_status on the DataReader and/or the corresponding get_publication_matched_status on the DataWriter. Note that there are other functions you can call to get details on the matched entities, and there is even an operation in the corresponding DataReaderListener (and DataWriterListener) that notifies you each time a match occurs.
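
The polling variant of this first option can be sketched in Java as below. To keep the sketch self-contained it does not call the DDS API directly: MatchWaiter is a hypothetical helper, and the IntSupplier stands in for code that calls get_publication_matched_status (or get_subscription_matched_status) and returns the status's current_count. The timeout and poll interval are arbitrary.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper, not part of any DDS API. */
final class MatchWaiter {
    private MatchWaiter() {}

    /**
     * Polls currentCount until it reports at least 'expected' matches.
     *
     * @return true if the expected count was seen before the timeout,
     *         false on timeout or if the thread was interrupted
     */
    static boolean waitForMatches(IntSupplier currentCount, int expected,
                                  long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // In a real application this would read current_count from the
            // matched status of the DataWriter or DataReader (assumption).
            if (currentCount.getAsInt() >= expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: not all expected entities matched
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt flag
                return false;
            }
        }
    }
}
```

In an application you would pass a lambda that reads the matched status of the DataWriter in question, for example waiting for one match before the first write. A listener or a WaitSet on the matched status avoids the polling loop altogether.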

A second option is to take advantage of the DataReader operation wait_for_historical_data(). This operation applies to a DataReader that has a DURABILITY QoS with a kind other than VOLATILE. It takes a timeout and blocks until the DataReader has received all the data that was previously published by DataWriters of the corresponding Topic. This operation, combined with the fact that discovery data is sent via DDS "builtin" DataWriters and DataReaders, which you can access, gives you an opportunity to wait for "all discovery data" known to the system at the time you make the call.

To use this second option, first access the builtin discovery DataReaders using the APIs described in the HOWTO titled "Detect the presence of DomainParticipants, DataWriters and DataReaders in the DDS Domain". Once you have a reference to the PublicationBuiltinTopicDataDataReader and the SubscriptionBuiltinTopicDataDataReader, call wait_for_historical_data() on these DataReaders; when the function returns, it is a good (but not perfect) indication that discovery has completed.

Hope this helps,

Gerardo

Jim wrote:

Thank you Gerardo - that helps tremendously. You do a great job on this forum.

Jim wrote:

Gerardo -

I am trying the second option you listed and find that wait_for_historical_data() always returns immediately (it doesn't block until discovery is 'complete'). I am assuming that the discovery DataReaders always have a DURABILITY kind other than VOLATILE. Is that a good assumption, or do I need to explicitly set it to something other than VOLATILE? If so, what would be the best way to do that?

Jim wrote:

Maybe it'd help if I explain my end goal more succinctly. We have a service, Service A, that kicks off another service, Service B. After Service A has kicked off Service B, Service A needs to send a series of messages to Service B to ensure that Service B has all the proper state info. Is there a way in DDS to know that Service B is in a state where it will successfully receive a given subset of messages that Service A sends to it? Right now we have to confirm receipt of the messages at the application level on a service-by-service basis, but we would prefer a generic DDS solution for knowing when entities can safely communicate over DDS.

Gerardo Pardo wrote:

Hello Jim,

I think with this additional description I might have an explanation and solution for what you are seeing.

The root cause may be that the out-of-the-box configuration is not optimal for this case. Specifically, it does not favor fast discovery of newly created DDS Entities (DomainParticipants, DataReaders and DataWriters); rather, it is oriented towards minimizing overall network traffic.

I set up a simple scenario to illustrate the situation and show the effect of some QoS changes that hopefully you could apply to your situation as well.

The scenario contains two applications 'ServiceTypePublisher' and 'ServiceTypeSubscriber'. There is also one 'CommandTopic' and a configurable number of 'ServiceTopics'.

The 'ServiceTypeSubscriber' reads the 'CommandTopic'. When it gets the command to 'start the services', it creates DataReaders for the indicated number of services ('numServices') and then prints the data it receives on those services.

In addition, the 'ServiceTypePublisher' also times how long it takes for the whole process to complete: From the moment it sends the command to start the services until it stops the services and receives confirmation that they have been stopped.  

When I tried this on a local computer using 50 services with the default discovery settings, the total time was 8 to 10 seconds. With the QoS changes I suggest here, the total time was reduced to 0.1 to 0.2 seconds.

In addition to setting the QoS so that 'discovery' completes reasonably fast, it was also important to include logic that ensures none of the commands sent to each of the services are lost. The idea behind the approach is to ensure that the services are already discovered by the DataWriters in the 'ServiceTypePublisher' before sending data to them. This is achieved as follows:

The two applications 'ServiceTypePublisher' and 'ServiceTypeSubscriber' may be started in any order.

'ServiceTypePublisher' Logic:

  • The 'ServiceTypePublisher' first creates the DataWriter for 'CommandTopic' and waits until it detects that the 'ServiceTypeSubscriber' has created a corresponding DataReader. Only then does it send the command to 'start all services' (numServices = 50 in this case).
  • The 'ServiceTypePublisher' then creates the DataWriter for each of the (numServices = 50) services and waits until it detects that the corresponding DataReaders have been created.
  • After detecting that the service DataWriters have discovered the DataReaders, it sends a message to each service. It then waits until each message has been acknowledged, and only then does it send the command to stop all services.

'ServiceTypeSubscriber' logic:

  • The 'ServiceTypeSubscriber' initially creates the DataReader for 'CommandTopic'. It then listens and processes commands.
  • When it gets a command to 'start services' it reads the number of services and topic names to use and creates DataReaders for each one of the services (numServices = 50 in this case).
  • Whenever data is received on the service data readers it prints it.
  • When the 'ServiceTypeSubscriber' receives a command to stop services it deletes everything and exits.

The flow is also illustrated in the diagram below:

'ServiceTypePublisher'                                          'ServiceTypeSubscriber'

create CommandDataWriter
wait for a reader to CommandDataWriter
                                                                create CommandDataReader
                                       discovery protocol
                                    <--------------------
wakeup from wait
START TIMER
CommandDataWriter->write()           "start services"
                                     -------------------->   create 50x ServiceTypeDataReader
create 50x ServiceTypeDataWriter                                                  
wait for a reader to each ServiceTypeDataWriter
                                    <--------------------   
                                  50 x discovery protocol                                                
                                    <--------------------                                                    
wakeup from 50x WaitSet::wait()
write 50x messages one on each ServiceTypeDataWriter
                                                           
                                     -------------------->  print received messages
                                     50x   "message" 
                                     -------------------->    
wait for acknowledgments to all messages                                                        
                                    <--------------------
                                     reliability protocol
                                    <--------------------
wakeup from 50x DataWriter::wait_for_acknowledgments()

CommandDataWriter->write()           "stop services"
                                     --------------------> delete all entities & EXIT
wait for acknowledgments to stop command                              
                                     reliability protocol
                                    <--------------------
wakeup
STOP TIMER                                                                                  
EXIT

Two things to note here:

Note 1. The DataWriter uses a WaitSet in combination with a StatusCondition to detect the PUBLICATION_MATCHED_STATUS. This indicates it has detected a DataReader.

In general, to be robust we should be waiting for a specific DataReader. Otherwise, in a publish-subscribe scenario our application may not work as expected if, for example, someone starts a different or additional DataReader (e.g. rtiddsspy, the RTI Recording Service, a visualization tool, or any other DataReader we did not expect). There are several ways to do this. The one shown here uses the EntityName QoS to mark the DataReader with a special 'role_name' that the DataWriter uses to identify it.

Note 2. The DataWriter never writes a command or message until it has discovered the DataReader. In this case the use of the reliable protocol is sufficient to ensure that the message will be delivered. Note that it is possible that, due to network timing, the message arrives at the DataReader before the DataReader itself has discovered the DataWriter. If that were the case, the middleware layer would drop the message (because it arrives from an unrecognized DataWriter). However, the reliability protocol would kick in and ensure that it is delivered as soon as the DataWriter is discovered by the DataReader.

The reason why this works is that the DataWriter takes responsibility for reliable delivery to the DataReader of all the messages that were written after the DataReader was discovered. As the DataReader was known to the DataWriter prior to writing the message, the DataWriter will keep insisting until the DataReader acknowledges the message.

For these reasons we only need to set the RELIABILITY QoS and not the DURABILITY QoS.
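
As a rough sketch of the QoS described in the two notes, a profile in USER_QOS_PROFILES.xml might set reliability on both sides and tag the expected DataReader with a role name. The library, profile, and role names below are made up for illustration, and the subscription_name and role_name tags assume the XML mirrors the EntityName field of the DataReader QoS; check them against the attached USER_QOS_PROFILES.xml before relying on them.

```xml
<dds>
  <!-- Placeholder library and profile names -->
  <qos_library name="ExampleServiceLibrary">
    <qos_profile name="ReliableService">
      <datawriter_qos>
        <reliability>
          <kind>RELIABLE_RELIABILITY_QOS</kind>
        </reliability>
      </datawriter_qos>
      <datareader_qos>
        <reliability>
          <kind>RELIABLE_RELIABILITY_QOS</kind>
        </reliability>
        <!-- Assumed tag names for the EntityName QoS on a DataReader;
             the role name itself is a made-up example -->
        <subscription_name>
          <role_name>ServiceTypeSubscriber</role_name>
        </subscription_name>
      </datareader_qos>
    </qos_profile>
  </qos_library>
</dds>
```

No durability element appears because, as noted, the DataWriter only writes after it has discovered the DataReader.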

I have attached all the Java files and the makefiles you can use for Linux and MacOSX. The example is prepared to run with RTI Connext DDS 5.0.0. There is a README.txt with instructions on how to run on a different platform.

The steps below assume you have the proper makefile for your platform and the environment variable NDDSHOME set correctly. You must also be in the directory that contains the USER_QOS_PROFILES.xml file.

To build and run on Linux do as follows:

On one terminal window:

make -f makefile_discovery_completion_i86Linux2.6gcc4.1.1jdk ServiceTypePublisher

On the other terminal window:

make -f makefile_discovery_completion_i86Linux2.6gcc4.1.1jdk ServiceTypeSubscriber

To run on MacOSX you can follow the same process but substitute the makefile used after the "-f" option with makefile_discovery_completion_x64Darwin10gcc4.2.1jdk

The programs will prompt you to select the QoS profile you want to use.

To see the different results, open two terminal windows, run these programs (one in each window), and type "1" at the prompt to select the default QoS. On the ServiceTypePublisher you should see how everything progresses, and at the end it prints the total time it took to run the scenario. On my local MacBook Pro it takes 8 to 10 seconds.

Next, run again and type "2" at the prompt to select the 'fast discovery' QoS profile that appears in USER_QOS_PROFILES.xml. You will see similar output, but now the programs should complete the scenario a lot faster. On my MacBook Pro it took less than 0.2 seconds.

Gerardo

Jim wrote:

Thanks Gerardo.  Now that I am back from being out for the holidays I will try this.  I very much appreciate the detailed responses you provide.

Uday wrote:

Hi Gerardo,

This is precisely what I have been looking for, for some time.

This is my problem statement:

- I have a publisher, say p1, publishing 2 topics.
- I have 2 subscribers, say s1 and s2, each listening to one of the 2 published topics.
- Is there a way s1 can be guaranteed that s2 has received its topic data by the time s1 receives its own?
- In other words, can the publishing order guarantee the receiving order when there is one subscriber per topic? I ask because s1 triggers p1's destruction. I want to be sure that if s1 has received its samples, then s2 has received its samples too (p1 publishes the samples in a fixed order: s1's sample followed by s2's sample).

Guaranteeing delivery, I think, can be handled by RELIABILITY as well as DURABILITY (durability to cover the scenario where my data is published but the reader has yet to discover the writer). To give an idea, p1 and s1 are transient entities whereas s2 is a permanent entity. I am also seeing high discovery times (ranging from 0.01 sec to 2.9 sec). I am using DDS 5.2. Can you help by providing revised QoS for 5.2?

Now, even after discovery happens, can I be sure that s2 receives all the samples, so that s1 can go ahead with triggering p1's destruction?

Thanks.

Uday