Why is the discovery phase taking so long on my Windows system?

Description

The discovery phase in DDS applications can be considered a high throughput scenario under some circumstances. Some examples are:

  • The applications have a lot of DDS entities.
  • More than 2 hosts are involved.
  • The type definitions are sent within the discovery packets (i.e., TypeCode and TypeObject).

Some high-throughput scenarios may lead to IP fragmentation issues. IP fragmentation happens when the payload handed down by the upper layer (typically UDP or TCP) exceeds the maximum payload that fits in a single IP packet, as defined by the MTU. When the receiving host gets IP fragments, it stores them in a buffer until all the fragments have arrived and can be reassembled into UDP datagrams or TCP segments. Once the last fragment is received, reassembly is performed and the message is delivered to the application layer.
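To make the fragmentation rule concrete, the following minimal sketch counts how many IPv4 packets one UDP datagram needs. It assumes a 1500-byte Ethernet MTU and a 20-byte IPv4 header without options; the function name ipv4_fragment_count is hypothetical and not part of any DDS or operating system API.

```python
import math

IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
UDP_HEADER_BYTES = 8


def ipv4_fragment_count(udp_payload_bytes: int, mtu: int = 1500) -> int:
    """Return how many IPv4 packets are needed to carry one UDP datagram."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    data_per_fragment = (mtu - IPV4_HEADER_BYTES) // 8 * 8
    # The UDP header travels inside the first fragment's data area.
    datagram_bytes = udp_payload_bytes + UDP_HEADER_BYTES
    return max(1, math.ceil(datagram_bytes / data_per_fragment))


# Payload sizes in bytes: a small control packet, the largest payload that
# still fits in one packet at this MTU, one byte more, and a 10 KB packet.
for size in (100, 1472, 1473, 10 * 1024):
    print(f"{size:>6} bytes -> {ipv4_fragment_count(size)} IP packet(s)")
```

Under these assumptions, a packet of about 100 bytes travels unfragmented, while a 10 KB packet is split into several fragments, all of which must arrive before the datagram can be delivered.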

Two relevant parameters involved in this algorithm are:

  1. The resources allocated to temporarily store fragments (number of fragments or the size of the buffer that holds fragments).
  2. The timeout after which fragments that cannot be reassembled are discarded, for instance because a UDP packet is missing one of its IP fragments.

In Windows, the resources allocated to temporarily hold IP fragments are specified as a maximum number of fragments. If this value is not large enough and the cleanup timeout is too long, the system may run out of free resources to hold new incoming IP fragments. This can happen when IP fragments are lost in transit. When it does, UDP packets or TCP segments greater than the MTU are silently dropped.

Due to the nature of the DDS Discovery traffic (for more information, see Chapter 14 in the User’s Manual), some packets are small and some of them are larger. In a retransmission:

  1. The DataWriter sends a Heartbeat. This packet is around 100 bytes.
  2. The DataReader answers with an ACKNACK. This packet is around 100 bytes.
  3. The DataWriter sends back the nack’ed packets. This packet can range from hundreds of bytes to tens of kilobytes (by default, up to 128 KB). See DDS_RtpsReliableWriterProtocol_t::max_bytes_per_nack_response.

As can be seen, steps (1) and (2) do not need IP fragmentation, but step (3) does.

Experiments

Experiments have been performed with DDS Discovery traffic sending packets of 10 KB. These packets represent more than 98% of the bandwidth used in the communication (10 KB / (10 KB + 200 bytes)). In a scenario where several applications starting up use around 500 Mbps, this means more than 33,000 IP fragments per second.
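The arithmetic behind these figures can be reproduced with a short sketch. It assumes a 1500-byte MTU and treats every fragment as a full MTU-sized packet, which gives a rough lower bound on the fragment rate; the function names are hypothetical.

```python
def large_packet_share(large_bytes: int, small_bytes: int) -> float:
    """Fraction of the exchanged bytes that belongs to the large packet."""
    return large_bytes / (large_bytes + small_bytes)


def min_fragments_per_second(rate_bps: float, share: float,
                             mtu_bytes: int = 1500) -> float:
    """Rough lower bound on the number of IP fragments sent per second."""
    # Assumption: every fragment fills a whole MTU-sized packet, so the
    # real fragment count can only be equal or higher.
    return rate_bps / 8 * share / mtu_bytes


# 10 KB retransmission versus a Heartbeat plus an ACKNACK (about 200 bytes).
share = large_packet_share(10 * 1024, 200)
rate = min_fragments_per_second(500e6, share)

print(f"share of bandwidth used by large packets: {share:.1%}")
print(f"IP fragments per second: at least {rate:,.0f}")
```

With these inputs the share comes out above 98% and the fragment rate above 33,000 per second, in line with the figures quoted above. The exact rate depends on the MTU and on how the 500 Mbps is measured.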

In systems like Windows 7, the system can hold up to 65535 fragments and the IP fragmentation timeout is 60 seconds. This timeout cannot be changed (see Conclusions below). In this setup, 3 machines (each one with 7 DomainParticipants) could not discover each other in a reasonable time. Discovery could take more than 1 hour.
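A simplified back-of-the-envelope model shows why a 60-second timeout combined with a 65535-fragment limit can be exhausted by a modest loss rate. This is an illustrative sketch, not a description of the actual Windows implementation: it assumes that each incomplete datagram has lost exactly one fragment, that the remaining fragments stay in the buffer for the full timeout, and that traffic is steady. The datagram rate and the loss fractions below are hypothetical inputs.

```python
MAX_FRAGMENTS = 65535      # fragment limit quoted in the text
REASSEMBLY_TIMEOUT_S = 60  # timeout quoted in the text


def held_fragments(datagrams_per_second: float,
                   fragments_per_datagram: int,
                   loss_fraction: float,
                   timeout_s: float = REASSEMBLY_TIMEOUT_S) -> float:
    """Steady-state number of orphaned fragments waiting for the timeout."""
    # Assumption: an incomplete datagram strands all fragments but one.
    stranded_per_datagram = fragments_per_datagram - 1
    incomplete_per_second = datagrams_per_second * loss_fraction
    return incomplete_per_second * stranded_per_datagram * timeout_s


# Hypothetical inputs: 5,000 large datagrams per second, 7 fragments each.
for loss in (0.001, 0.01, 0.05):
    held = held_fragments(5000, 7, loss)
    status = "exceeds" if held > MAX_FRAGMENTS else "within"
    print(f"loss {loss:.1%}: about {held:,.0f} fragments held ({status} limit)")
```

In this model the number of held fragments grows linearly with both the loss rate and the timeout, which matches the observation that a long, fixed timeout makes the buffer easy to fill.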

After seeing those results, a couple of experiments were performed using raw UDP traffic instead of DDS. In this case, the experiments were the following:

  • Environment: 3 x Windows 7 machines: host A1, host A2 and host B (host A1 and host A2 slower than host B).
  • Traffic: 50% small samples (< 100 bytes) and 50% large samples (> 10 KB).
  • Expected results: The receiving application processes 50% small samples and 50% large samples.

Test A:

  • Fast host (host type B) as a receiver.
  • Traffic put in the network: around 600 Mbps. Raw UDP traffic.
  • Result: Around 85% of the samples received by Host B are small samples, as opposed to the expected 50%.
  • This scenario using the DDS setup described above completed discovery in a few minutes (around 5).

Test B:

  • Slow host (host type A) as a receiver.
  • Traffic put in the network: around 600 Mbps. Raw UDP traffic.
  • Result: Around 95% of the samples received by the host type A are small samples, as opposed to the expected 50%.
  • This scenario using the DDS setup described above completed discovery in one hour and a half.

When this IP fragmentation behavior occurs, it is advisable to debug using tools like netstat. However, netstat shows:

    Reassembly Required                 = X
    Reassembly Successful               = Y
    Reassembly Failures                 = 0

with Y << X. Netstat does not report any reassembly failures, yet in this kind of situation the number of Reassembly Successful events is much smaller than Reassembly Required. This is because when fragments are dropped from the buffer, the OS counts the reassembly as neither a failure nor a success.
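The comparison between the counters can be automated with a small parser. The sketch below works on text in the format shown above, which could be captured from a command such as netstat -s -p ip. The sample values are made up for illustration, and the function names are hypothetical. How to interpret the ratio depends on whether the OS counts fragments or datagrams under Reassembly Required, which this sketch does not assume, so it is most useful for comparing a healthy run against a problematic one.

```python
import re

COUNTER_NAMES = (
    "Reassembly Required",
    "Reassembly Successful",
    "Reassembly Failures",
)

# Hypothetical sample in the format printed by netstat; values are made up.
SAMPLE_OUTPUT = """
IPv4 Statistics

  Packets Received                   = 1200000
  Reassembly Required                = 70000
  Reassembly Successful              = 1500
  Reassembly Failures                = 0
"""


def parse_reassembly_counters(netstat_text: str) -> dict:
    """Extract the reassembly counters from netstat-style statistics text."""
    counters = {}
    for name in COUNTER_NAMES:
        pattern = rf"^\s*{re.escape(name)}\s*=\s*(\d+)"
        # Only the first match is used, i.e. the first section in the text.
        match = re.search(pattern, netstat_text, re.MULTILINE)
        if match:
            counters[name] = int(match.group(1))
    return counters


def success_ratio(counters: dict):
    """Return Successful / Required, or None if nothing needed reassembly."""
    required = counters.get("Reassembly Required", 0)
    if required == 0:
        return None
    return counters.get("Reassembly Successful", 0) / required


counters = parse_reassembly_counters(SAMPLE_OUTPUT)
print(counters)
print(f"success ratio: {success_ratio(counters):.3f}")
```

A ratio that drops sharply while Reassembly Failures stays at 0 would be consistent with the silent drops described above.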

Once the issue was occurring, all the applications were restarted, waiting a time T before starting them again:

  • When T < 60 seconds, the newly started applications suffered the same effect as the old applications. This is because the buffer had not been released yet (timeout = 60 seconds).
  • When T > 60 seconds, everything worked fine until the buffer became full again.

Workaround at the DDS level

Although this is not a DDS issue, there are some workarounds that can be applied in the QoS to overcome these issues:

  1. Limit max_bytes_per_nack_response to a smaller value. This decreases the throughput and thus avoids the congestion of IP fragments. This alternative has been verified in this research.
  2. Use a flow controller in Discovery. This has been proven in customer scenarios.

Conclusions

Users can run into a problem in Windows which is related to IP fragmentation when booting a system formed by more than 2 hosts. The resources designated for IP reassembly are exhausted and packets are dropped, resulting in a long discovery phase.

Microsoft has been informed about this issue (support case reference code: [REG 115052212763130]) and they claim that this is an unsupported scenario. Their suggestion is to use TCP instead of UDP. The timeout causing the error is not user configurable and cannot be changed.

For more information, please email support@rti.com.

References

  1. RFC 4963, IPv4 Reassembly Errors at High Data Rates (https://tools.ietf.org/html/rfc4963)
  2. Blog post about the reassembly timeout (http://blogs.technet.com/b/nettracer/archive/2010/06/03/why-doesn-t-ipreassemblytimeout-registry-key-take-effect-on-windows-2000-or-later-systems.aspx)
  3. Microsoft KB article regarding the maximum number of fragments (https://support.microsoft.com/en-us/kb/811003)

Comments

http://blogs.technet.com/b/nettracer/archive/2010/06/03/why-doesn-t-ipreassemblytimeout-registry-key-take-effect-on-windows-2000-or-later-systems.aspx is 404, suggest replace with

 
https://support.microsoft.com/en-us/kb/811003 is also 404