During discovery, one of our services, which contains about 20 message topics, takes at least 15 seconds to complete discovery. Is this normal? Also, are there known ways to speed up this discovery time (bandwidth tuning, etc.)? Can we get DDS to consume more network bandwidth during discovery to speed up this process? It seems that it currently uses fairly little bandwidth during discovery.
Thanks,
Jim
What do you mean by "complete discovery"? Do you mean:
For this case, I mean the service completing discovery with all other services currently running - from startup to the point that messages can be exchanged. In this case, there are 6 other services and the time, from startup of the service in question to it being able to send/receive all 45 message types, is about 15 seconds.
Also, if there is a message type that will never find a match (because the service that would match it is not running), will that slow down discovery?
Hi,
There are 6 other services. But how many DataWriters and DataReaders does each of the other services have?
In other words, what is the total number of DataWriters and DataReaders in the system? This number is what will dominate the discovery time. The new service will have to discover all the DataWriters and DataReaders in the system. It will start sending and receiving data as soon as it discovers each matching DataReader/DataWriter. But if the one it needs happens to be the last one created in some service, it will have to discover everything before it gets to that one.
Did you try modifying some of the discovery parameters I mentioned? If so, did you see any differences in discovery time?
Gerardo
This comment does help me understand a little better... "It will start sending and receiving data as soon as it discovers each matching DataReader/DataWriter. But if the one it needs happens to be the last one created in some service, it will have to discover everything before it gets to that one."
Question is: is there a way to query a Reader/Writer to know when it has completed discovery? We are not running the reliable protocol, so I'm not sure that the changes would have an effect — is that correct? I guess we're looking to know when discovery is complete so that we know when we can successfully send a message (without using the reliable protocol). Is there a way to know this before sending?
I will start with a disclaimer and then point you to some mechanisms you can use...
Disclaimer: Strictly speaking, discovery is never complete; it is an ongoing process that will keep discovering new Participants as they appear and new DataWriters and DataReaders as they are created/destroyed, and will also detect changes to their QoS, content filters, etc.
From the point of view of an application that joins an existing system, it makes sense to ask "when did I finish discovering everything that was already there when I joined?" However, it is hard to answer this question deterministically in a distributed peer-to-peer environment: without a central service, how do you really know that there isn't some application that was already started but, for some reason or configuration, hasn't sent an announcement yet, or whose initial announcements were lost to a network error or resource overflow?
Possible approaches: That said, there are APIs the middleware offers that can be used to answer questions like the one above:
First option: If you know what you are looking for. That is, if a DataWriter expects a certain number of DataReaders, or a DataReader expects a certain number of DataWriters, you can detect whether they have already been discovered by checking their "matched status", which tells you how many matched entities they have. You do this by calling get_subscription_matched_status on the DataReader and/or the corresponding get_publication_matched_status on the DataWriter. Note that there are other functions you can call to get details on the matched entities, and there is even an operation on the corresponding DataReaderListener (and DataWriterListener) that will notify you each time a match occurs.
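A rough sketch of this first option using the RTI Connext Java API might look like the following. The `writer` reference and the `EXPECTED_READERS` count are assumptions for this sketch (exception handling omitted):

```java
import com.rti.dds.publication.PublicationMatchedStatus;

// Poll the DataWriter's matched status until it has discovered the
// number of DataReaders the application expects to exist.
// 'writer' and EXPECTED_READERS are assumptions for this sketch.
final int EXPECTED_READERS = 6;
PublicationMatchedStatus status = new PublicationMatchedStatus();
while (true) {
    writer.get_publication_matched_status(status);
    if (status.current_count >= EXPECTED_READERS) {
        break; // all expected DataReaders have been discovered
    }
    Thread.sleep(100); // avoid busy-waiting between checks
}
```

A listener-based variant (implementing on_publication_matched on a DataWriterListener) avoids the polling loop, at the cost of handling the notification asynchronously.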
A second option is to take advantage of the DataReader operation wait_for_historical_data(). This operation applies to a DataReader whose DURABILITY QoS has a kind other than VOLATILE. It takes a timeout and blocks until the DataReader has received all the data previously published by DataWriters on the corresponding Topic. Combined with the fact that discovery data is sent via DDS "builtin" DataWriters and DataReaders, which you can access, this gives you an opportunity to wait for all the discovery data known to the system at the time you make the call.
To use this second option, you first access the builtin discovery DataReaders using the APIs described in the HOWTO titled Detect the presence of DomainParticipants, DataWriters and DataReaders in the DDS Domain. Once you have a reference to the PublicationBuiltinTopicDataDataReader and the SubscriptionBuiltinTopicDataDataReader, you call wait_for_historical_data() on these DataReaders; when the function returns, it is a good (but not perfect) indication that discovery has completed.
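A minimal sketch of this second option in the RTI Connext Java API could look like this. Here `participant` is assumed to be an already-created DomainParticipant, and "DCPSPublication"/"DCPSSubscription" are the standard builtin topic names (exception handling omitted):

```java
import com.rti.dds.infrastructure.Duration_t;
import com.rti.dds.subscription.DataReader;
import com.rti.dds.subscription.Subscriber;

// Access the builtin Subscriber and look up the discovery DataReaders.
Subscriber builtin = participant.get_builtin_subscriber();
DataReader pubReader = builtin.lookup_datareader("DCPSPublication");
DataReader subReader = builtin.lookup_datareader("DCPSSubscription");

// Block (up to 10 seconds each) until the historical discovery data
// known at the time of the call has been received. A timeout raises
// a RETCODE_TIMEOUT exception (handling omitted in this sketch).
Duration_t maxWait = new Duration_t(10, 0);
pubReader.wait_for_historical_data(maxWait);
subReader.wait_for_historical_data(maxWait);
```

Note this waits for the discovery data the participant already knows about; announcements still in flight at call time may arrive afterwards, which is why this is a good but not perfect indication.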
Hope this helps,
Gerardo
Thank you Gerardo - that helps tremendously. You do a great job on this forum.
Gerardo -
I am trying the second option you listed and find that wait_for_historical_data() always returns immediately (it doesn't block until discovery is 'complete'). I am assuming that the discovery DataReaders always have a DURABILITY kind other than VOLATILE. Is that a good assumption, or do I need to explicitly set it to something other than VOLATILE? If so, what would be the best way to do that?
Maybe it'd help if I explain my end goal more succinctly... We have a service, Service A, that kicks off another service, Service B. After Service A has kicked off Service B, Service A needs to send a series of messages to Service B to ensure that Service B has all the proper state info. Is there a way in DDS to know that Service B is in a state where it will successfully receive a given subset of the messages Service A sends to it? Right now we have to confirm receipt of the messages at the application level on a service-by-service basis, but we would prefer a generic DDS solution for knowing when entities can safely communicate over DDS.
Hello Jim,
I think with this additional description I might have an explanation and solution for what you are seeing.
The root cause may be that the out-of-the-box configuration is not optimal for this case; specifically, it does not favor fast discovery of newly created DDS Entities (DomainParticipants, DataReaders, and DataWriters). Rather, it is oriented towards minimizing overall network traffic.
I set up a simple scenario to illustrate the situation and show the effect of some QoS changes that hopefully you could apply to your situation as well.
The scenario contains two applications 'ServiceTypePublisher' and 'ServiceTypeSubscriber'. There is also one 'CommandTopic' and a configurable number of 'ServiceTopics'.
The 'ServiceTypeSubscriber' reads the 'CommandTopic'; when it gets the command to 'start the services', it creates DataReaders for the indicated number of services ('numServices') and then prints the data it receives on the services.
In addition, the 'ServiceTypePublisher' times how long the whole process takes to complete: from the moment it sends the command to start the services until it stops the services and receives confirmation that they have been stopped.
When I tried this on a local computer using 50 services with the default discovery settings, the total time was 8 to 10 seconds. With the QoS changes I suggest here, the total time was reduced to 0.1 to 0.2 seconds.
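For illustration only (the values below are assumptions, not taken from the attached profile), a 'fast discovery' participant profile in USER_QOS_PROFILES.xml typically tunes the DISCOVERY_CONFIG QoS so that a new participant announces itself more aggressively at startup:

```xml
<qos_library name="FastDiscoveryLibrary">
  <qos_profile name="FastDiscoveryProfile">
    <participant_qos>
      <discovery_config>
        <!-- Send more initial announcements, closer together, so that
             new participants are noticed quickly. Illustrative values. -->
        <initial_participant_announcements>10</initial_participant_announcements>
        <min_initial_participant_announcement_period>
          <sec>0</sec>
          <nanosec>10000000</nanosec>   <!-- 10 ms -->
        </min_initial_participant_announcement_period>
        <max_initial_participant_announcement_period>
          <sec>0</sec>
          <nanosec>100000000</nanosec>  <!-- 100 ms -->
        </max_initial_participant_announcement_period>
      </discovery_config>
    </participant_qos>
  </qos_profile>
</qos_library>
```

The trade-off is more discovery traffic at startup in exchange for faster completion, which is usually acceptable for systems of this size.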
In addition to setting the QoS so that 'discovery' completes reasonably fast, it was also important to include logic that ensures none of the commands sent to each of the services are lost. The idea behind the approach is to ensure that the services have already been discovered by the DataWriters in the 'ServiceTypePublisher' before sending data to them. This is achieved as follows:
The two applications 'ServiceTypePublisher' and 'ServiceTypeSubscriber' may be started in any order.
'ServiceTypePublisher' Logic:
'ServiceTypeSubscriber' logic:
Two things to note here:
Note 1. The DataWriter uses a WaitSet in combination with a StatusCondition to detect the PUBLICATION_MATCHED_STATUS. This indicates it has detected a DataReader.
In general, to be robust, we should be waiting for a specific DataReader. Otherwise, in a publish-subscribe scenario, our application may not work as expected if, for example, someone starts a different/additional DataReader (e.g. rtiddsspy, the RTI Recording Service, a visualization tool, or any other DataReader we did not expect). There are several ways to do this. The one shown here uses the EntityName QoS to mark the DataReader with a special 'role_name' that the DataWriter uses to identify it.
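The WaitSet/StatusCondition part of Note 1 can be sketched in the RTI Connext Java API as follows. The `writer` reference is assumed, and exception handling is omitted:

```java
import com.rti.dds.infrastructure.ConditionSeq;
import com.rti.dds.infrastructure.Duration_t;
import com.rti.dds.infrastructure.StatusCondition;
import com.rti.dds.infrastructure.StatusKind;
import com.rti.dds.infrastructure.WaitSet;
import com.rti.dds.publication.PublicationMatchedStatus;

// Enable only the PUBLICATION_MATCHED status on the writer's condition.
StatusCondition cond = writer.get_statuscondition();
cond.set_enabled_statuses(StatusKind.PUBLICATION_MATCHED_STATUS);

WaitSet waitSet = new WaitSet();
waitSet.attach_condition(cond);

// Block until at least one DataReader has been matched.
PublicationMatchedStatus status = new PublicationMatchedStatus();
ConditionSeq active = new ConditionSeq();
do {
    waitSet.wait(active, Duration_t.DURATION_INFINITE);
    // Reading the status also resets its 'changed' flag.
    writer.get_publication_matched_status(status);
} while (status.current_count < 1);
```

Checking that the matched DataReader is the specific one expected (via its 'role_name') would additionally require inspecting the matched subscription data; that step is omitted from this sketch.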
Note 2. The DataWriter never writes a command or message until it has discovered the DataReader. In this case, use of the reliable protocol is sufficient to ensure that the message will be delivered. Note that, due to network timing, the message may arrive at the DataReader before the DataReader itself has discovered the DataWriter. If that were the case, the middleware would drop the message (because it arrives from an unrecognized DataWriter). However, the reliability protocol would kick in and ensure the message is delivered as soon as the DataWriter is discovered by the DataReader.
The reason this works is that the DataWriter takes responsibility for reliable delivery to the DataReader of all the messages written after the DataReader was discovered. Since the DataReader was known to the DataWriter prior to writing the message, the DataWriter will keep retrying until the DataReader acknowledges the message.
For these reasons, we only need to set the RELIABILITY QoS, not the DURABILITY QoS.
I have attached all the Java files and the makefiles you can use for Linux and MacOSX. The example is prepared to run with RTI Connext DDS 5.0.0. There is a README.txt with instructions on how to run on a different platform.
Make sure you have the proper makefile for your platform, that the environment variable NDDSHOME is properly set, and that you are in the directory that contains the USER_QOS_PROFILES.xml file.
To build and run on Linux do as follows:
On one terminal window:
On the other terminal window:
To run on MacOSX you can follow the same process but substitute the makefile used after the "-f" option with makefile_discovery_completion_x64Darwin10gcc4.2.1jdk
The programs will prompt you to select the QoS profile you want to use.
To see the different results, open two terminal windows, run these programs (one in each terminal window), and type "1" at the prompt to select the default QoS. You should see on the ServiceTypePublisher how everything progresses, and at the end it prints the total time it took to run the scenario. On my local MacBook Pro it takes 8 to 10 seconds.
Next, run again and type "2" at the prompt to select the 'fast discovery' profile QoS that appears in USER_QOS_PROFILES.xml. You will see similar output, but now the programs should complete the scenario a lot faster. On my MacBook Pro it took less than 0.2 seconds.
Gerardo
Thanks Gerardo. Now that I am back from being out for the holidays I will try this. I very much appreciate the detailed responses you provide.
Hi Gerardo,
This is precisely what I have been looking for, for some time.
This is my problem statement:
- I have a publisher, say p1, publishing 2 topics
- I have 2 subscribers listening to each of the 2 topics that are published, say s1 and s2
- Is there a way s1 can be guaranteed that s2 has received the topic data by the time s1 receives its own?
- In other words, is there a way the publishing order can guarantee the subscribing order when there is one subscriber for each topic? The reason for asking is that s1 triggers p1's destruction. I want to be sure that if s1 received its samples, then s2 has received its samples (p1 publishes the samples in the same order: s1's sample followed by s2's sample).
Guaranteeing delivery, I think, can be handled by RELIABILITY as well as DURABILITY (durability to cover the scenario where my data is published before the reader has discovered the writer). To give an idea, p1 and s1 are transient entities whereas s2 is a permanent entity. I am also facing high discovery times (ranging from 0.01 sec to 2.9 sec). I am using DDS 5.2. Can you help provide revised QoS for 5.2?
Now, even after discovery happens, can I be sure that s2 receives all the samples before s1 goes ahead with triggering p1's destruction?
Thanks.
Uday