We are using the Routing Service to connect multiple geographically separated sites across a WAN using TCP. The sites exchange status information amounting to several (maybe a dozen) small messages (typically < 5K bytes each) every 6 seconds. The status messages originate on a DDS topic. For some reason, not all of the statuses we see on the topic on the LAN are written to the Routing Service TCP socket. The only filter we have in the router QoS is on the topic. Any ideas why messages would be lost?
Experimentation has shown that we are apparently publishing status batches too fast. When we insert a small sleep between publications, the remote view of the status stabilizes. Is there any way to improve performance so that the delay is not needed?
You can try configuring both Routing Service and your applications with the built-in QoS profile for high-throughput, strict-reliable communication:
Create your DataReaders and DataWriters with the library name "BuiltinQosLibExp" and the profile name "Generic.StrictReliable.HighThroughput". In your Routing Service XML file, set the datareader_qos and datawriter_qos tags to inherit from base_name="BuiltinQosLibExp::Generic.StrictReliable.HighThroughput".
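On the application side, one way to pick up that profile is to define your own QoS profile that inherits from the built-in one. A minimal sketch (the library and profile names "MyQosLibrary" and "MyWanProfile" are placeholders; only the base_name is the built-in profile):

```xml
<qos_library name="MyQosLibrary">
    <!-- Inherit the built-in high-throughput strict-reliable settings;
         is_default_qos makes it apply to entities created without an
         explicit profile. -->
    <qos_profile name="MyWanProfile"
                 base_name="BuiltinQosLibExp::Generic.StrictReliable.HighThroughput"
                 is_default_qos="true">
        <!-- Application-specific overrides, if any, go here. -->
    </qos_profile>
</qos_library>
```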
Hope this helps,
Alex
Can you give me an example of what the XML would look like for the routing service?
Something like this:
<routing_service name="MyRoutingService">
    <domain_route name="MyDomainRoute">
        ...
        <session name="MySession">
            ...
            <topic_route name="MyTopicRoute">
                <input participant="1">
                    ...
                    <datareader_qos base_name="BuiltinQosLibExp::Generic.StrictReliable.HighThroughput"/>
                </input>
                <output>
                    ...
                    <datawriter_qos base_name="BuiltinQosLibExp::Generic.StrictReliable.HighThroughput"/>
                </output>
            </topic_route>
        </session>
    </domain_route>
</routing_service>
That helped some, but I am still seeing lost messages. Our application sends 11 status messages in a burst every 6 seconds (each ~3K) to two other remote sites, and only somewhere from 8 to 11 get sent over TCP according to Wireshark. The receive sides seem to get everything that makes it onto the wire. If we insert an artificial delay, things work better.
In addition to the above, I have also tried:
dds.transport.TCPv4.tcp1.socket_monitoring_kind = WINDOWS_IOCP
dds.transport.TCPv4.tcp1.force_asynchronous_send = 1
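For reference, these transport properties go into the participant QoS of the domain route as name/value pairs. A sketch assuming the 5.x domain route schema (the transport alias "tcp1" matches the property names above; the plugin-loading entries are the standard ones for the RTI TCP transport):

```xml
<participant_1>
    <participant_qos>
        <!-- Disable the built-in transports so only the TCP plugin is used. -->
        <transport_builtin>
            <mask>MASK_NONE</mask>
        </transport_builtin>
        <property>
            <value>
                <element>
                    <name>dds.transport.load_plugins</name>
                    <value>dds.transport.TCPv4.tcp1</value>
                </element>
                <element>
                    <name>dds.transport.TCPv4.tcp1.library</name>
                    <value>nddstransporttcp</value>
                </element>
                <element>
                    <name>dds.transport.TCPv4.tcp1.create_function</name>
                    <value>NDDS_Transport_TCPv4_create</value>
                </element>
                <element>
                    <name>dds.transport.TCPv4.tcp1.socket_monitoring_kind</name>
                    <value>WINDOWS_IOCP</value>
                </element>
                <element>
                    <name>dds.transport.TCPv4.tcp1.force_asynchronous_send</name>
                    <value>1</value>
                </element>
            </value>
        </property>
    </participant_qos>
</participant_1>
```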
We have upgraded to version 5.2.3 and still no joy. To reiterate: we have 4 geographically remote sites connected via VPN (the intention is to expand to more). Each site publishes status to all of the other sites. The status packets are of two sizes (~0.6K and ~1.5K bytes) with a total required throughput of < 10K every 6 seconds. Three of the sites reliably display the status of the other sites, but the fourth site only displays status for two sites reliably. The status between, let's say, site 1 and site 2, as seen at site 2, is very spotty. Looking at Wireshark, one type of status report (the smaller one) is inconsistently being sent from site 1 to site 2. Our spy program shows the packets being published at the expected rate on the status topic. Any idea why we would be dropping packets?
Also, in order to get to this point we had to meter the packets, separating them by ~20 ms at the point where they are published, to reach even this level of reliability. If this isn't resolved we are going to need to roll our own equivalent of the Routing Service to distribute status across the WAN.
What are the settings for the send and receive socket buffers (i.e., how big is the TCP/IP socket buffering: SO_SNDBUF, SO_RCVBUF)? Look at each of the endpoints (the site 1 packet sender and the Routing Service receiver/sender) and see if anything there looks out of whack.
https://community.rti.com/forum-topic/routing-service-windows-service has a note at the bottom for how to set the values using the QOS file.
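The note there boils down to adding property entries for the socket buffer sizes to the participant QoS in the Routing Service configuration. A sketch (the 1 MB values are illustrative, and the "tcp1" alias must match your transport's property prefix):

```xml
<property>
    <value>
        <!-- SO_SNDBUF for the TCP transport instance, in bytes. -->
        <element>
            <name>dds.transport.TCPv4.tcp1.send_socket_buffer_size</name>
            <value>1048576</value>
        </element>
        <!-- SO_RCVBUF for the TCP transport instance, in bytes. -->
        <element>
            <name>dds.transport.TCPv4.tcp1.recv_socket_buffer_size</name>
            <value>1048576</value>
        </element>
    </value>
</property>
```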
I currently have set the following properties:
dds.transport.TCPv4.tcp1.disable_nagle = 1
dds.transport.TCPv4.tcp1.recv_socket_buffer_size = 1048576
dds.transport.TCPv4.tcp1.send_socket_buffer_size = 1048576
dds.transport.TCPv4.tcp1.socket_monitoring_kind = WINDOWS_IOCP
dds.transport.TCPv4.tcp1.force_asynchronous_send = 1
I have tried various combinations with no improvement.
Any ideas as to what the issue is here or should I move on to a different solution?