RTI Routing Service asymmetric latency

Offline
Last seen: 8 years 8 months ago
Joined: 11/16/2015
Posts: 7

Hi all,

I'm working on setting up a LAN-to-WAN streaming service based on RTI DDS, and if possible I'd like to use the RTI Routing Service for the private network-to-public cloud bridging.

I've set up the Routing Service example so that I have an instance of the Shapes Demo running on my private LAN (domain 2), connected through a local Routing Service that forwards to a cloud server (over domain 1).  On the cloud server is my remote Routing Service, which forwards onto domain 0, where the remote instance of the Shapes Demo is running.

That all works, but I've encountered an interesting issue: the latency of updates shuttled across the WAN differs greatly depending on where they originate (LAN or WAN).  Objects that I publish on the LAN (domain 2) instance of the Shapes Demo are rendered smoothly on the cloud server instance (domain 0), while objects I publish remotely (domain 0) are updated only about twice per second on the local end (domain 2).

Why would this be?  The application we're implementing is highly sensitive to network latency, and I can't see any good reason why updates in one direction would be much slower than updates in the other.

Thanks

/Franck

 

Offline
Last seen: 3 months 2 weeks ago
Joined: 01/15/2011
Posts: 30

Hi Franck,

I am wondering if for some reason the samples published on domain 0 are lost and they have to be repaired. How fast samples are repaired is determined by the HB/NACK exchange between DataWriters and DataReaders.

I have a few questions:

1) What transports are you using between the Routing Service instances (UDP, TCP, other?) and between the applications (Shapes Demo) and the Routing Services?

2) Are you using UDP multicast to send data? In some cases multicast traffic may be filtered out by firewalls and NATs and that will force your DDS application to repair data using unicast.

3) Can you send the configuration files?

Regards,

- Fernando

Offline
Last seen: 8 years 8 months ago
Joined: 11/16/2015
Posts: 7

Hi Fernando,

Thanks for getting back to me.

1)  So far, each Shapes Demo runs on the same system as one of the Routing Services, so I'm using shared memory to publish shapes from the Shapes Demo to the Routing Service.  The two Routing Service instances are connected over the tcpv4_wan transport in an asymmetric configuration: the domain 0 one, running on the cloud server, has a public IP and an empty NDDS_DISCOVERY_PEERS, while the domain 2 one is behind a NAT with NDDS_DISCOVERY_PEERS set to the IP of the cloud server.

2)  There shouldn't be any multicast involved.  Locally, I'm using shmem, and across the WAN, I'm using tcpv4_wan.  Update:  Even though I've set NDDS_DISCOVERY_PEERS to include only shmem://, Wireshark still shows quite a lot of multicast traffic on the server system.

3)  I'm attaching the QOS files for both endpoints here.  They're virtually identical to the example files found in C:\Program Files (x86)\RTI\RTI_Routing_Service_5.1.0\example\shapes\tcp_transport.xml.  The only differences are that I've entered the public IP on the cloud server side, and for the other side, I've enabled asymmetric mode, AND:

Under the TCP_2 configuration, session->auto_topic_route entries, I had to change the ON_DOMAIN_AND_ROUTE_MATCH directives to ON_DOMAIN_OR_ROUTE_MATCH for traffic to successfully be routed through.

 

I hope this is enough for you to work with.

 

Update: I've been playing around with the Routing Service in different configurations, and it seems very much like the Routing Service itself is kinda selective in the routing with the attached QOSs.  Even when I move the "cloud server" component onto a computer on the same LAN as the other system, I see this issue.

In fact, I can start publishing shapes on the server system (domain 0) and then start two consumers on the other system (one on domain 0, one on domain 2, so that one gets the multicast traffic while the other goes through the Routing Services), and see one run perfectly smoothly (the multicast one) while the other stutters at only 2-3 updates per second.

 

Thanks for the help,

/Franck

 

Offline
Last seen: 3 months 2 weeks ago
Joined: 01/15/2011
Posts: 30

Hi Franck,

I have been looking through the XML configuration files and have not found anything unusual. Is it possible that:

1) You are using a time-based filter to subscribe to data in the Shapes Demo application running on domain 2?

I have attached a screenshot showing how to configure the time-based filter. The value should be 0.

2) You are using the command-line parameter -subInterval when executing the Shapes Demo application running on domain 2?

Can you also run rtiddsspy in domain 2 and let me know if that application also receives only 1-2 updates per second?

Regards,

- Fernando

 

 

Offline
Last seen: 8 years 8 months ago
Joined: 11/16/2015
Posts: 7

Hi Fernando,

1)  I can confirm that the Time based filter is set to 0 ms on the Shapes Demo on domain 2.

2)  I'm running rtishapesdemo.exe without any command-line parameters on either end.  The only thing I do before launching it is set the NDDS_DISCOVERY_PEERS environment variable.

3) Running rtiddsspy, it seems that the Routing Service is indeed shuttling more than 3-4 updates per second through (see attached screenshot), but they're just not getting through to (or rendered in) Shapes Demo.

This is rtiddsspy on domain 2:

All the probing here shows that 8-10 samples arrive per second.

 

This is rtiddsspy on domain 0:

- It consistently captures 16 samples per second before relaying to the Routing Service.
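To make the gap between the two captures explicit, here is a quick back-of-the-envelope sketch in Python. The rates are the ones quoted in this thread (16 samples per second on domain 0, 8-10 on domain 2, roughly 2-3 rendered by Shapes Demo); the function name and constants are made up for illustration.

```python
def delivered_fraction(received_per_s, sent_per_s):
    """Fraction of published samples that show up at a later stage."""
    return received_per_s / sent_per_s

SENT = 16                 # rtiddsspy on domain 0, samples per second
SPY_DOMAIN_2 = (8, 10)    # rtiddsspy on domain 2 (low, high)
RENDERED = (2, 3)         # updates seen in Shapes Demo on domain 2 (low, high)

stages = (
    ("rtiddsspy, domain 2", SPY_DOMAIN_2),
    ("Shapes Demo, domain 2", RENDERED),
)
for label, (low, high) in stages:
    lo_frac = delivered_fraction(low, SENT)
    hi_frac = delivered_fraction(high, SENT)
    print(f"{label}: {lo_frac:.0%} to {hi_frac:.0%} of published samples")
```

If these numbers hold, samples go missing in two places: some between domain 0 and domain 2, and more between what rtiddsspy sees on domain 2 and what Shapes Demo renders there.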

 

It seems like something fishy's going on with my setup :o/

 

/Franck

Offline
Last seen: 8 years 8 months ago
Joined: 11/16/2015
Posts: 7

 

Okay, I'm still working on the asymmetric performance issues, but I've come across something else that puzzles me as well.

I've set up my DDS service in the following way:

P (publisher):  LAN (TCP), 1 Gbit/s, publishes large samples at a given interval

S1 (subscriber):  LAN (TCP), 1 Gbit/s, receives P's samples

S2 (subscriber):  WAN (TCP), 250 Mbit/s, receives P's samples

 

Now, I've captured the following network load graph on S1:

S1 and S2 receive the samples correctly at 5 Hz, but beyond that, whenever S2 is receiving, the transmit rate seems to cap at ~100 Mbit/s (7-8 Hz).  As you can see in the graph, network load on S1 does not increase when I increase the sample rate.  However, if I disconnect S2 (at "WAN Disconnect"), the sample rate suddenly jumps from the previous saturation limit to the full 15 Hz (~200 Mbit/s).  I changed no QoS parameters or anything else other than the sample rate during this graphing.

On S2, the sample rate won't go beyond 7-8 Hz at all, whether or not S1 is connected.  I've verified through other applications that I can achieve a bandwidth in excess of 250 Mbit/s using a standard TCP link from P to S2, and the application graphed above can also successfully stream samples at 60 Hz (800+ Mbit/s) from S1 to S2.
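As a sanity check on these figures, the sample size implied by each measurement can be worked out from the load and the sample rate. This is a rough sketch using the approximate numbers quoted in this post; the function names are made up for illustration.

```python
def sample_size_mbit(bandwidth_mbit_s, rate_hz):
    """Approximate size of one sample in Mbit, given link load and sample rate."""
    return bandwidth_mbit_s / rate_hz

def rate_at_cap_hz(cap_mbit_s, size_mbit):
    """Sample rate at which a bandwidth cap is reached."""
    return cap_mbit_s / size_mbit

size_lan = sample_size_mbit(200, 15)     # LAN only: ~200 Mbit/s at 15 Hz
size_s1_s2 = sample_size_mbit(800, 60)   # S1 to S2: 800+ Mbit/s at 60 Hz

print(f"sample size: {size_lan:.1f} Mbit (LAN), {size_s1_s2:.1f} Mbit (S1 to S2)")
print(f"rate at a 100 Mbit/s cap: {rate_at_cap_hz(100, size_lan):.1f} Hz")
print(f"load at 5 Hz: {5 * size_lan:.0f} Mbit/s")
```

Both measurements imply the same sample size of about 13 Mbit (roughly 1.7 MB), and a 100 Mbit/s ceiling then lands at 7.5 Hz, which matches the 7-8 Hz observed. So the limit looks like a bandwidth cap rather than a cap on the number of samples per second.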

So, it seems that routing across a WAN throttles the publisher-side Routing Service down to 100 Mbit/s.  I'm thinking this may be because some of the reliability/ordering features of Connext are very sensitive to the increase in latency as we move from LAN to WAN (especially since both the LAN and WAN connections are throttled once WAN delivery is enabled), but I might be wrong.
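A rough way to reason about this hunch: any scheme that waits for acknowledgements can keep only a bounded amount of data in flight, so throughput is capped at (data in flight) / (round-trip time). The sketch below is generic arithmetic, not a model of Connext internals, and the round-trip times in it are hypothetical since none were measured in this thread.

```python
def max_throughput_mbit_s(bytes_in_flight, rtt_s):
    """Upper bound on throughput when at most bytes_in_flight are unacknowledged."""
    return bytes_in_flight * 8 / rtt_s / 1e6

def bytes_in_flight_for(target_mbit_s, rtt_s):
    """Unacknowledged data needed to sustain target_mbit_s at a given round-trip time."""
    return target_mbit_s * 1e6 * rtt_s / 8

# Hypothetical round-trip times, chosen only to show the scaling.
for label, rtt_s in (("LAN", 0.0005), ("WAN", 0.020)):
    needed = bytes_in_flight_for(250, rtt_s)
    print(f"{label}: RTT {rtt_s * 1000:.1f} ms needs "
          f"{needed / 1024:.0f} KiB in flight for 250 Mbit/s")
```

Under these assumptions the WAN path needs 40 times more unacknowledged data than the LAN path to reach the same rate, so a buffer or window sized comfortably for the LAN could become the bottleneck on the WAN.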

Is there anything I can tweak to be able to utilize the full bandwidth of my WAN link?

 

Thanks

/Franck

Offline
Last seen: 3 months 2 weeks ago
Joined: 01/15/2011
Posts: 30

Hi Franck,

In the example that you describe, what is your configuration? How many Routing Service instances are you running? Three? How are you configuring these instances?

I am wondering if there are some issues with the configuration parameters of the RTI Connext DDS TCP transport. Let's try to adjust the configuration as follows:

Since you are running on Windows, switch to IOCP socket monitoring by setting the property:

dds.transport.TCPv4.tcp1.socket_monitoring_kind = WINDOWS_IOCP

By default, sockets are monitored using SELECT. That method is less scalable than IOCP.

Also, you may want to try setting the property:

dds.transport.TCPv4.tcp1.force_asynchronous_send = 1

By default, the send operation on the TCP transport blocks. With this property set, the transport writes asynchronously and does not block.
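For reference, here is a small Python sketch that renders the two properties above as a `<property>` XML fragment. The element layout is assumed to match the property entries already present in tcp_transport.xml, and the helper name is made up for illustration. Where the fragment belongs (presumably the participant QoS of the side that uses the tcp1 transport) should be checked against your own file.

```python
from xml.sax.saxutils import escape

# Property names and values exactly as given in this post.
TCP_PROPERTIES = {
    "dds.transport.TCPv4.tcp1.socket_monitoring_kind": "WINDOWS_IOCP",
    "dds.transport.TCPv4.tcp1.force_asynchronous_send": "1",
}

def render_property_block(properties):
    """Render name/value pairs as a <property> XML fragment (assumed layout)."""
    lines = ["<property>", "    <value>"]
    for name, value in properties.items():
        lines += [
            "        <element>",
            f"            <name>{escape(name)}</name>",
            f"            <value>{escape(value)}</value>",
            "        </element>",
        ]
    lines += ["    </value>", "</property>"]
    return "\n".join(lines)

print(render_property_block(TCP_PROPERTIES))
```

The printed `<element>` entries can be merged into the existing `<property>` block next to the other dds.transport.TCPv4.tcp1 settings, rather than adding a second `<property>` block.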

Regards,

- Fernando