What causes discovery to fail?

Offline
Last seen: 5 years 3 weeks ago
Joined: 05/30/2016
Posts: 16
I have a not very demanding application: one server and usually two or three clients, all running locally, with two topics per user and small messages. It worked fine for months, but a few days ago discovery began to fail. It is not necessarily the first messages that fail; it happens regardless of the moment or the message topic. Suddenly, when the application tries to send a message, on_publication_matched is never triggered. When discovery does succeed, it takes several seconds. A couple of days ago I could eventually get things working just by restarting everything, but today it got much worse and that no longer helps. I guessed it was a network problem, because the code has not changed, but I don't know how to check the source of the problem or how to solve it. I am not downloading anything in the background, and the internet connection works as well as usual. Any suggestions? Is there something that can be done from the code side when this happens?

Offline
Last seen: 11 months 3 weeks ago
Joined: 02/11/2016
Posts: 144

Hey,

Have you tried checking the log produced by RTI? It could be that you are running out of RAM or having issues with newly configured NICs.

Also, you can try using RTI Admin Console (and/or RTI Monitoring Service) to debug your connectivity issues.

Good luck,

Roy.

Offline
Last seen: 5 years 3 weeks ago
Joined: 05/30/2016
Posts: 16

Thanks for answering. I haven't mastered the Admin Console, but it behaves strangely too. It takes a long time to refresh, and sometimes it shows topics that were created in previous runs and are still listed. But when it shows what it should, I usually (I'll explain this 'usually' later) see no errors, just some expected warnings because the readers or writers of some topics have not been created yet.

It doesn't always fail at the same point, and the problem is not always in the same kind of client, but I find the following failure strange:

1) The server publishes using topicA.

2) The client reads the message and publishes using topicB.

3) The server reads from topicB and sends another message via topicA. But this time the message never reaches the client, although there was no problem a few seconds earlier. I see no errors in the Admin Console at that point; the message simply does not appear. Other times, the message sent in 2) never reaches the server.

This morning everything seemed to work fine again for a few hours. Then, during the afternoon, it began to fail a little again, and in the evening it failed badly. I rebooted after lunch and restarted the router, but nothing changed.

I would rule out a RAM problem (43% of 16 GB in use while running), and nothing has changed on my computer since last week, when the failures began. I think this has happened to me before, and one day it suddenly worked again after days and days of pulling my hair out. But I can't just sit and wait.

As for the 'usually' above: the Console Log of the Administration Console now suddenly shows a couple of errors (in red): NDDS_Transport_UPD_send_unblock_message: unblock receive resource message failed. The errors are not the same in each run. There are also many warnings; I'll paste one of each type:

2018-07-12 18:54:55,234 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - DISCEndpointDiscoveryPlugin_unregisterParticipantRemoteEndpoints:remote endpoint not previously asserted by plugin: 0XC0A80164,0X393C,0X1,0
2018-07-12 19:04:52,023 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDPv4_Socket_bindWithIp:0X1CF2 in use
2018-07-12 19:04:52,024 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDPv4_SocketFactory_create_receive_socket:invalid port 7410
2018-07-12 19:04:52,024 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDP_create_recvresource_rrEA:!create socket

2018-07-12 19:04:52,078 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - DISCSimpleEndpointDiscoveryPlugin_subscriptionOnSampleLost: 4c7; total 1, delta 1
2018-07-12 19:04:52,078 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - DISCSimpleEndpointDiscoveryPlugin_subscriptionOnSampleLost: 4c7; total 2, delta 1
2018-07-12 19:09:00,356 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - DISCEndpointDiscoveryPlugin_unregisterParticipantRemoteEndpoints:remote endpoint not previously asserted by plugin: 0XC0A80164,0X3B3C,0X1,0

2018-07-12 20:48:17,478 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - NDDS_Transport_UDP_sendToMultipleSockets:OS WSASendTo() failure, error 0X2751: A socket operation was attempted to an unreachable host.
2018-07-12 20:48:17,478 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - NDDS_Transport_UDP_send:send message size count
2018-07-12 20:48:17,997 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - NDDS_Transport_UDP_setOption: multicast set option x8 failed with code x2741
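
The hexadecimal values in these warnings can be decoded, which may help narrow things down. The sketch below is only an illustration: `LogDecoder` and its methods are made-up names, not part of any RTI API. It converts the Winsock codes to their .NET `SocketError` names and reads the first GUID field as an IPv4 address, on the assumption that this field was derived from the machine's IP address.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Net.Sockets;

// Hypothetical helper for reading the log above; not part of any RTI API.
static class LogDecoder
{
    // Parses "0X2751", "x2741" or "2751" as a hexadecimal number.
    public static uint ParseHex(string text)
    {
        int prefix = text.IndexOfAny(new[] { 'x', 'X' });
        string digits = prefix >= 0 ? text.Substring(prefix + 1) : text;
        return uint.Parse(digits, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Winsock error codes share their numeric values with SocketError.
    public static SocketError ToSocketError(string hexCode)
    {
        return (SocketError)(int)ParseHex(hexCode);
    }

    // Reads a 32-bit word as a dotted IPv4 address, most significant byte first.
    // Assumption: the first GUID field was derived from the host's IP address.
    public static IPAddress ToIPv4(string hexWord)
    {
        uint value = ParseHex(hexWord);
        return new IPAddress(new[]
        {
            (byte)(value >> 24), (byte)(value >> 16), (byte)(value >> 8), (byte)value
        });
    }
}
```

With the values from the log, `ToSocketError("0X2751")` gives `HostUnreachable` (which matches the message text), `ToSocketError("x2741")` gives `AddressNotAvailable`, and `ToIPv4("0XC0A80164")` gives 192.168.1.100. Both errors would be consistent with a network interface whose address changed or went away, but that is a guess and not something the log proves.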

Another attempt worked a little better and just printed three of these:

[... ]$ReaderListener.on_subscription_matched(DynamicDataSubscription.java:1083) :  - unable to find DataWriterModel corresponding to : [...]

and one of these:

[...] DISCEndpointDiscoveryPlugin_unregisterParticipantRemoteEndpoints:remote endpoint not previously asserted by plugin: [...]

Tomorrow I'll try connecting directly to the network, in case it is a problem with my office router.

Offline
Last seen: 11 months 3 weeks ago
Joined: 02/11/2016
Posts: 144

Hey,

You previously stated that it's all running locally; now there's a router?

Anyway, if you could post your QoS file and code, it may help.

Also, looking into your DDS log could help.

If things stop working after a while, it could be related to some resource limit in your QoS (or, depending on your code and setup, some resource limit in your setup).

Offline
Last seen: 5 years 3 weeks ago
Joined: 05/30/2016
Posts: 16

Yesterday I wrote a long reply explaining the latest progress, and now I see that something went wrong and it was never published. I'll summarize:

- Sorry, I was not very clear. I'm in an office with two machines that share the same router, which connects to the university LAN. I ran the latest tests with the clients and the server on the same machine, but sometimes I use the other machine to run an extra client. If I remove the router and connect directly to the LAN, the problem remains. I'm just trying things.

- I don't use a QoS file. I use the default QoS constant for every entity (like DDS.Subscriber.DATAREADER_QOS_DEFAULT). (Should I follow the QoS file in this example? https://github.com/rticommunity/rticonnextdds-examples/tree/master/examples/using_qos_profiles/cs)

- This is the schema of the whole thing: https://imgur.com/a/FRQMdSI. Two types, n clients (just 4 clients in the tests). All clients send to the same topic (one for each of the two types), but the server writes to a custom topic for each client.

- The failure does not come after a while; it always happens with the first message, whatever the type. Specifically, it happens the first time the client sends a message to the server: the client can't complete discovery. Yesterday it worked a bit better during the morning and much better during the afternoon, although not as well as 10 days ago. When it failed, restarting everything would work. Today it works terribly; I haven't been able to connect all the clients even once. Here is an example of the RTI Administration Console log from one of the failed runs:

8-07-18 17:01:31,817 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDPv4_Socket_bindWithIp:0X1CF2 in use
2018-07-18 17:01:31,995 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDPv4_SocketFactory_create_receive_socket:invalid port 7410
2018-07-18 17:01:31,995 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDP_create_recvresource_rrEA:!create socket
2018-07-18 17:01:31,996 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDPv4_Socket_bindWithIp:0X1CF4 in use
2018-07-18 17:01:31,996 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDPv4_SocketFactory_create_receive_socket:invalid port 7412
2018-07-18 17:01:31,996 : WARN : com.rti.tools.console.entitymodel.util.Log4jLoggerDevice.write(Log4jLoggerDevice.java:77) : - [D0000|ENABLE]NDDS_Transport_UDP_create_recvresource_rrEA:!create socket

None of this makes any sense to me. Just to point out, I use a simulator that sends and receives data over a UDP socket on ports 8000 and 9000. However, discovery also fails when the simulator is not running yet. I am sure of this because the client needs to send the data used to load the simulator, and that is one of the messages that usually fails.
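
For what it is worth, the port numbers in these warnings follow the default RTPS port mapping: 0X1CF2 is 7410 and 0X1CF4 is 7412. The sketch below assumes the default values from the RTPS specification (port base 7400, domain gain 250, participant gain 2, offsets 0, 10, 1 and 11) and maps a port back to a domain ID and participant index. `RtpsPorts` and its members are illustrative names, not an RTI API, and the result is only meaningful if the port mapping has not been customized.

```csharp
using System;

// Hypothetical helper; assumes the default RTPS well-known port parameters.
static class RtpsPorts
{
    const int PortBase = 7400, DomainGain = 250, ParticipantGain = 2;
    const int D0 = 0, D1 = 10, D2 = 1, D3 = 11;

    // Discovery (metatraffic) unicast port of one participant.
    public static int DiscoveryUnicast(int domainId, int participantId)
    {
        return PortBase + DomainGain * domainId + D1 + ParticipantGain * participantId;
    }

    // User-data unicast port of one participant.
    public static int UserUnicast(int domainId, int participantId)
    {
        return PortBase + DomainGain * domainId + D3 + ParticipantGain * participantId;
    }

    // Best-effort reverse mapping from a port number to its role.
    public static string Describe(int port)
    {
        int relative = port - PortBase;
        if (relative < 0) return "not a default RTPS port";

        int domainId = relative / DomainGain;
        int offset = relative % DomainGain;

        if (offset == D0) return $"domain {domainId}: discovery multicast";
        if (offset == D2) return $"domain {domainId}: user multicast";
        if (offset < D1) return "not a default RTPS port";

        // Even offsets from D1 are discovery unicast, odd ones are user unicast.
        int participantId = (offset - D1) / ParticipantGain;
        bool isDiscovery = (offset - D1) % ParticipantGain == 0;
        return isDiscovery
            ? $"domain {domainId}: discovery unicast, participant {participantId}"
            : $"domain {domainId}: user unicast, participant {participantId}";
    }
}
```

Under these assumptions, `Describe(7410)` and `Describe(7412)` return the discovery unicast ports of participant indexes 0 and 1 in domain 0. "In use" would then most likely mean that two other participants on the same machine (for example the server and one client) already hold those ports, so the Administration Console moves on to the next index. If that reading is right, these particular warnings are expected when several participants share a host and are probably not the cause of the discovery failure.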

This is how I create the reader that the client could not find (just the main lines, no try/catch blocks). It appears in the Administration Console as ALIVE. The reader is always created before the client message is sent, for sure:

CommandMsgTypeSupport.register_type(_participant, CommandMsgTypeSupport.get_type_name());

_commandsReaderTopic = _participant.create_topic("NtoM client-to-server Commands", CommandMsgTypeSupport.get_type_name(), DDS.DomainParticipant.TOPIC_QOS_DEFAULT, null, DDS.StatusMask.STATUS_MASK_ALL);

_subscriber = _participant.create_subscriber(DDS.DomainParticipant.SUBSCRIBER_QOS_DEFAULT, null, DDS.StatusMask.STATUS_MASK_NONE);

_reader = (CommandMsgDataReader)_subscriber.create_datareader(_commandsReaderTopic, DDS.Subscriber.DATAREADER_QOS_DEFAULT, this, DDS.StatusMask.STATUS_MASK_ALL);

 

And this is how I create the writer in the client:

CommandMsgTypeSupport.register_type(participant, CommandMsgTypeSupport.get_type_name());

DDS.Topic topic = participant.create_topic("NtoM client-to-server Commands", CommandMsgTypeSupport.get_type_name(), DDS.DomainParticipant.TOPIC_QOS_DEFAULT, null, DDS.StatusMask.STATUS_MASK_NONE);

DDS.Publisher publisher = this.participant.create_publisher(DDS.DomainParticipant.PUBLISHER_QOS_DEFAULT, null, DDS.StatusMask.STATUS_MASK_NONE);

commandsDataWriter = (CommandMsgDataWriter)publisher.create_datawriter(topic, DDS.Publisher.DATAWRITER_QOS_DEFAULT, this, DDS.StatusMask.STATUS_MASK_ALL);

The same goes for the other type.

 

Offline
Last seen: 11 months 3 weeks ago
Joined: 02/11/2016
Posts: 144

Hey,

Something is unclear to me, but let me explain how RTI DDS works (well, in short) and see if I can understand what doesn't work for you.
At time X you create a writer for a topic, and at time X + 1 you create a reader for the same topic.

You create these entities using different participants but they use multicast to discover each other.

When your writer detects the existence of the reader it will notify you with on_publication_matched.

If at some point in time you create a reader and on_publication_matched doesn't happen we'll call this scenario A (so you can say in your reply that this is the problem).

Assuming new readers are all detected by your writer, and you then send a sample which somehow doesn't arrive at one (or more) readers we will call this scenario B.

If your problem is that a message is sent before on_publication_matched happens and that message doesn't reach a reader we will call this scenario C.

Regarding scenario A, let's break it into two options:

Your writer gets matched with some readers and then some reader is created and the writer isn't matched with it (A1)

OR

You run your tests a few times and the writer matches all readers.

Every now and then when you run your tests your writer doesn't match any or some readers (A2)

 

If you can help clear up which scenario you're encountering I can try to help you further.

 

Good luck,

Roy.

Offline
Last seen: 5 years 3 weeks ago
Joined: 05/30/2016
Posts: 16

I do just the opposite. At time X I create the reader in the server. At time X + 1 I create the writer in the client. Then the client tries to send the first message (the login).

What DDS::PublisherListener::on_publication_matched does is the following: the callback is called when the DDS::DataWriter has found a DDS::DataReader that matches the DDS::Topic, has a common partition and compatible QoS, or has ceased to be matched with a DDS::DataReader that was previously considered to be matched.

I try to send the login message like this:

private const int MAX_CONSECUTIVE_WRITE_ERROR = 5;

int consecutiveErrors = 0;

try {
    while (!this.service_discovered)
        Thread.Sleep(1);

    commandsDataWriter.write(msg, ref DDS.InstanceHandle_t.HANDLE_NIL);
    consecutiveErrors = 0; /* Always clear the error count in case of successful write */
} catch (DDS.Retcode_Error e) {
    System.Diagnostics.Debug.WriteLine("! Write error " + e.GetType() + ": " + e.Message);
    if (++consecutiveErrors > MAX_CONSECUTIVE_WRITE_ERROR) {
        System.Diagnostics.Debug.WriteLine("! Reached maximum number of failure, stopping writer...");
        return;
    }
}

….

public override void on_publication_matched(DDS.DataWriter writer, ref DDS.PublicationMatchedStatus status) {
    this.service_discovered = true;
}

 

But it never exits the while loop.
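
One way to keep a client from hanging forever in a loop like this is to wait on an event with a timeout, so that a failed discovery can be logged and handled. This is a minimal sketch that uses only .NET types: `DiscoveryGate` is a hypothetical helper, and it assumes that `Update` is called from `on_publication_matched` with `status.current_count`, since the callback also fires when a reader stops being matched.

```csharp
using System;
using System.Threading;

// Hypothetical helper; replaces a busy-wait on a bool flag.
sealed class DiscoveryGate : IDisposable
{
    private readonly ManualResetEventSlim _matched = new ManualResetEventSlim(false);

    // Call from on_publication_matched. Assumption: the caller passes
    // status.current_count, so losing the last matched reader closes the gate.
    public void Update(int currentCount)
    {
        if (currentCount > 0)
            _matched.Set();
        else
            _matched.Reset();
    }

    // Returns true once at least one reader is matched,
    // or false if the timeout expires first.
    public bool WaitForMatch(TimeSpan timeout)
    {
        return _matched.Wait(timeout);
    }

    public void Dispose()
    {
        _matched.Dispose();
    }
}
```

The sending code would call something like `gate.WaitForMatch(TimeSpan.FromSeconds(10))` before `write` and, on `false`, report that no reader was discovered. This does not fix discovery itself; it only turns an endless wait into a visible, recoverable error.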

 

Offline
Last seen: 11 months 3 weeks ago
Joined: 02/11/2016
Posts: 144

Hey,

Are you using Docker, or some unusual operating system?

Offline
Last seen: 5 years 3 weeks ago
Joined: 05/30/2016
Posts: 16

No Docker. And running on Windows 10.

Offline
Last seen: 1 month 2 days ago
Joined: 01/16/2013
Posts: 128

Hi Miquel,

Do you still have the problem? What do you see in Wireshark, are Data(p) getting to the other side? I want to make sure the router is not dropping the multicast messages for any reason.

Thanks,
Sara
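
If Wireshark is not at hand, a rough first check is whether the machine can send and receive multicast at all. The sketch below joins a multicast group, sends one datagram to it and waits for the datagram to come back. 239.255.0.1 is the default RTPS discovery multicast group; the port 17400, the class name and the payload are arbitrary choices made for this example, picked so that the probe does not collide with running participants.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

// Hypothetical probe; not part of any RTI tool.
static class MulticastProbe
{
    // Returns true if a datagram sent to the group is received back locally.
    public static bool LoopbackWorks(string group = "239.255.0.1",
                                     int port = 17400,
                                     int timeoutMs = 1000)
    {
        IPAddress groupAddress = IPAddress.Parse(group);
        try
        {
            using (var receiver = new UdpClient(AddressFamily.InterNetwork))
            using (var sender = new UdpClient(AddressFamily.InterNetwork))
            {
                // Listen on the chosen port and subscribe to the group.
                receiver.Client.Bind(new IPEndPoint(IPAddress.Any, port));
                receiver.JoinMulticastGroup(groupAddress);
                receiver.Client.ReceiveTimeout = timeoutMs;

                byte[] payload = Encoding.ASCII.GetBytes("probe");
                sender.Send(payload, payload.Length, new IPEndPoint(groupAddress, port));

                // Receive throws a SocketException if the timeout expires.
                var from = new IPEndPoint(IPAddress.Any, 0);
                byte[] received = receiver.Receive(ref from);
                return Encoding.ASCII.GetString(received) == "probe";
            }
        }
        catch (SocketException)
        {
            // Bind, join, send or receive failed, or nothing arrived in time.
            return false;
        }
    }
}
```

A `false` result on the machine that runs both the server and the clients would point at the local network stack or interface configuration. A `true` result only shows that multicast loops back on that host; it says nothing about what a router or switch forwards between machines, which is where a Wireshark capture is still needed.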

Offline
Last seen: 5 years 3 weeks ago
Joined: 05/30/2016
Posts: 16

Hi Sara,

I stopped looking for the cause because I can't spend more time on it at the moment, but I was no longer using the router, and both the clients and the server were running on the same computer. Anyway, I just ran it again to confirm whether the problem remains, and it works perfectly well now. This is not the first time this has happened: discovery fails terribly for some days, and then one day the problem just disappears, with no changes to the code, the hardware or the installation that could be involved. Luckily this doesn't happen often, maybe two or three times a year, and I have already accepted that it is just a matter of time, but I was wondering if there is a way to bypass or at least identify the problem in case it appears at a delicate moment. I don't have the background to analyse network performance, but I have noted your recommendation and will give Wireshark a try next time. Thanks!