Further help reading parsed log

Offline
Last seen: 9 months 1 week ago
Joined: 05/08/2020
Posts: 8
Further help reading parsed log

I have parsed the log and I still am not quite sure how to interpret the results. Are there any more resources that describe how to read the message?

 

I am testing a server that intermittently fails to respond to a request. My test has 3 request/response pairs, and the failure can occur on any of the 3. The last time I ran 100 samples I got 1 failure. I parsed the logs from a passing run and from the failing run. Comparing the two, it looks like the server attempts to publish the response, but only 1 DATA is received instead of 2 as in the passing run. Below are the logs, but I would like to know what the numbers in brackets mean. One seems to have [1, 2] and the other has [2, 2]. There are no errors or warnings. I am also including the packet statistics because it looks like the failing run has a missing packet.

Here is the log of the passing

2020-05-07T22:50:52.987027 | | | | Created publisher
2020-05-07T22:50:52.994037 | | | | Created topic, name: 'topic', type: 'type'
2020-05-07T22:50:52.994646 | | | | Created writer for topic 'topic'
2020-05-07T22:50:52.997087 | H2.A1.P1 | | | Announcing new writer W-K_800000L
2020-05-07T22:50:52.997941 | H2.A1.P1 | | | Discovered new writer W-K_800000L
2020-05-07T22:50:52.998157 | H2.A1.P1 | | R-K_800001L | Discovered local reliable writer W-K_800000L
2020-05-07T22:50:53.001084 | | <--- | W-K_800000L | Sent preemptive HB to let know about samples in [1, 0]
2020-05-07T22:50:53.001137 | H2.A1.P2 | | W-K_800000L | Discovered remote reliable reader R-K_800001L
2020-05-07T22:50:53.001184 | H2.A1.P1 | ---> | R-K_800001L | Received HB [0] from writer W-K_800000L for samples in [1, 0]
2020-05-07T22:50:53.002470 | H2.A1.P1 | ---> | R-K_800001L | Received DATA [1] from writer W-K_800000L (reliable)
2020-05-07T22:50:54.292754 | H2.A1.P2 | ---> | R-K_800001L | Received DATA [2] from writer W-K_800000L (reliable)
2020-05-07T22:50:54.374320 | H2.A1.P1 | ---> | R-K_800001L | Received DATA [2] from writer W-K_800000L (reliable)
2020-05-07T22:50:54.378839 | | <--- | W-K_800000L | Sent periodic HB [2] for samples in [1, 2]
2020-05-07T22:50:54.815384 | H2.A1.P2 | ---> | R-K_800001L | Received HB [2] from writer W-K_800000L for samples in [1, 2]
2020-05-07T22:50:54.815426 | | <--- | W-K_800000L | Sent periodic HB [2] for samples in [1, 2]
2020-05-07T22:50:54.997207 | H2.A1.P2 | <--- | R-K_800001L | Sent ACK [2] to writer W-K_800000L for 3 count 0
2020-05-07T22:50:54.997271 | H2.A1.P1 | ---> | R-K_800001L | Received HB [2] from writer W-K_800000L for samples in [1, 2]
2020-05-07T22:50:54.997341 | H2.A1.P1 | ---> | W-K_800000L | Received ACKNACK [2] from reader R-K_800001L for 3 +0
2020-05-07T22:50:54.997406 | H2.A1.P1 | <--- | R-K_800001L | Sent ACK [2] to writer W-K_800000L for 3 count 0
2020-05-07T22:50:54.997518 | H2.A1.P2 | ---> | W-K_800000L | Received ACKNACK [2] from reader R-K_800001L for 3 +0

### Packet statistics:
* GUID: BLVD_W+K_000200
  * send: 2 packets
    * PIGGYBACK HB: 2 (100.0%)
* GUID: SED_PUB_WRITER
  * send: 2 packets
    * PIGGYBACK HB: 2 (100.0%)
* GUID: SED_SUB_WRITER
  * send: 3 packets
    * PIGGYBACK HB: 3 (100.0%)
* GUID: H2.A1
  * receive: 200 packets
    * ACK: 25 (12.5%)
    * INFO_DST: 50 (25.0%)
    * INFO_TS: 50 (25.0%)
    * HEARTBEAT: 25 (12.5%)
    * DATA: 50 (25.0%)

Here is the failing log

2020-05-07T22:49:19.197979 | | | | Created publisher
2020-05-07T22:49:19.199623 | | | | Created topic, name: 'topic', type: 'type'
2020-05-07T22:49:19.204212 | | | | Created writer for topic 'topic'
2020-05-07T22:49:19.208722 | H2.A1.P1 | | | Announcing new writer W-K_800000L
2020-05-07T22:49:19.209417 | H2.A1.P1 | | | Discovered new writer W-K_800000L
2020-05-07T22:49:19.209550 | H2.A1.P1 | | R-K_800001L | Discovered local reliable writer W-K_800000L
2020-05-07T22:49:19.209759 | | <--- | W-K_800000L | Sent preemptive HB to let know about samples in [1, 0]
2020-05-07T22:49:19.209797 | H2.A1.P2 | | W-K_800000L | Discovered remote reliable reader R-K_800001L
2020-05-07T22:49:19.209867 | H2.A1.P1 | ---> | R-K_800001L | Received HB [0] from writer W-K_800000L for samples in [1, 0]
2020-05-07T22:49:19.212468 | H2.A1.P1 | ---> | R-K_800001L | Received DATA [1] from writer W-K_800000L (reliable)
2020-05-07T22:49:19.221226 | | <--- | W-K_800000L | Sent periodic HB [2] for samples in [2, 2]
2020-05-07T22:49:19.701790 | H2.A1.P2 | ---> | R-K_800001L | Received HB [2] from writer W-K_800000L for samples in [2, 2]
2020-05-07T22:49:19.701826 | H2.A1.P2 | <--- | R-K_800001L | Sent ACK [2] to writer W-K_800000L for 3 count 0
2020-05-07T22:49:19.750976 | H2.A1.P1 | ---> | W-K_800000L | Received ACKNACK [2] from reader R-K_800001L for 3 +0

### Packet statistics:
* GUID: BLVD_W+K_000200
  * send: 2 packets
    * PIGGYBACK HB: 2 (100.0%)
* GUID: SED_PUB_WRITER
  * send: 1 packets
    * PIGGYBACK HB: 1 (100.0%)
* GUID: SED_SUB_WRITER
  * send: 2 packets
    * PIGGYBACK HB: 2 (100.0%)
* GUID: H2.A1
  * receive: 194 packets
    * ACK: 24 (12.4%)
    * INFO_DST: 48 (24.7%)
    * INFO_TS: 49 (25.3%)
    * HEARTBEAT: 24 (12.4%)
    * DATA: 49 (25.3%)

Offline
Last seen: 3 years 5 months ago
Joined: 08/20/2012
Posts: 25

The numbers in brackets are sequence number ranges. Samples are assigned a monotonically increasing sequence number as they are written. Your application can access those through SampleInfo. With respect to QoS, are you using >=Transient Local Durability? Keep All or Keep Last History? Do your Writer and Reader have enough history depth?
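If it helps to correlate the log with what your application actually receives, here is a rough sketch of reading the sequence number from SampleInfo with the classic Connext C++ API. The type "Foo" and the `reader` variable are placeholders for your own generated type and DataReader, and `publication_sequence_number` is an RTI extension field, so check the API reference for your version:

```cpp
#include <stdio.h>
#include "ndds/ndds_cpp.h"
// "Foo", "FooSeq" and "FooDataReader" come from rtiddsgen for your own type;
// "reader" is assumed to be an existing FooDataReader*.

FooSeq samples;
DDS_SampleInfoSeq infos;

DDS_ReturnCode_t retcode = reader->take(
        samples, infos, DDS_LENGTH_UNLIMITED,
        DDS_ANY_SAMPLE_STATE, DDS_ANY_VIEW_STATE, DDS_ANY_INSTANCE_STATE);

if (retcode == DDS_RETCODE_OK) {
    for (int i = 0; i < infos.length(); ++i) {
        if (infos[i].valid_data) {
            // publication_sequence_number is an RTI extension field; it should
            // line up with the bracketed numbers on the "Received DATA [n]" log lines.
            printf("sample sequence number: (%d, %u)\n",
                   infos[i].publication_sequence_number.high,
                   infos[i].publication_sequence_number.low);
        }
    }
    reader->return_loan(samples, infos);
}
```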

Offline
Last seen: 9 months 1 week ago
Joined: 05/08/2020
Posts: 8

Using DDS_RELIABLE_RELIABILITY_QOS, DDS_KEEP_LAST_HISTORY_QOS with a depth of 10, and VOLATILE_DURABILITY_QOS.

 

I thought the 49 vs. 50 DATA packet difference was a sign of a problem, but I have other runs that complete the request/response with 48, 50, or 51 DATA packets, though most (86/100) have 50.

 

The only line that seems different is:

2020-05-07T22:50:54.374320 | H2.A1.P1 | ---> | R-K_800001L | Received DATA [2] from writer W-K_800000L (reliable)

This line appears in all 99 good runs but is absent from the 1 failure.

 

I am also trying to set up Wireshark to run while I am testing to see if that provides any insight, but I am running CentOS 6 and only seem to have easy access to Wireshark 1.8.10, and packets going over the RTPS protocol do not seem to be showing up. Also, when I try to follow the knowledge base how-to on changing the colors, I get that "rtitcp" is neither a field nor a protocol name.

Offline
Last seen: 3 years 5 months ago
Joined: 08/20/2012
Posts: 25

Regarding Wireshark: in older Wireshark versions the current protocol is dissected as "RTPS2", while "RTPS" there refers to an older version of the protocol that pre-dates standardization. In newer versions of Wireshark, the protocol is just "RTPS".

With that combination of QoS, I would say you have two possible opportunities for data "loss".

With a keep last history of depth 10, the middleware is being told that only the most recent 10 samples for an instance need to be retained. On the Reader side, if an 11th sample is delivered before your application has seen the ten in the history, the oldest sample is replaced with the new one. There is no error because this is expected behavior that the application is requesting. It's similar on the Writer side, except that if a sample was lost and NACKed by a Reader, the Writer will only be able to repair the most recent ten samples. If you must receive all samples, you might consider KEEP ALL history (minding resource limits in light of the fact that history depth doesn't apply). Some refer to this as "strict reliability".

The other potential issue is a race condition with VOLATILE durability. If your test begins writing as soon as it detects the Reader, the initial samples may not be accepted by the Reader. Discovery is mutual, and the Writer may determine that it is matched with the Reader before the Reader determines that it matches with the Writer. Samples from an unmatched Writer will not be accepted. A common approach for this use case is to use TRANSIENT_LOCAL Durability. That way, the Writer retains samples up to its history depth to be delivered to a late joining Reader, so samples written before the Discovery match will be "repaired".
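If you want to try those two changes, a rough sketch with the classic C++ API might look like the following; "publisher", "subscriber", and "topic" are placeholders for your existing entities, and the same settings can equally go in your XML QoS profile:

```cpp
#include "ndds/ndds_cpp.h"
// Sketch only: "publisher" (DDSPublisher*), "subscriber" (DDSSubscriber*) and
// "topic" (DDSTopic*) are assumed to have been created already.

DDS_DataWriterQos writer_qos;
publisher->get_default_datawriter_qos(writer_qos);
writer_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
writer_qos.durability.kind  = DDS_TRANSIENT_LOCAL_DURABILITY_QOS;  // cover the discovery race
writer_qos.history.kind     = DDS_KEEP_ALL_HISTORY_QOS;            // "strict reliability"; mind resource limits
DDSDataWriter *writer = publisher->create_datawriter(
        topic, writer_qos, NULL /* listener */, DDS_STATUS_MASK_NONE);

DDS_DataReaderQos reader_qos;
subscriber->get_default_datareader_qos(reader_qos);
reader_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
reader_qos.durability.kind  = DDS_TRANSIENT_LOCAL_DURABILITY_QOS;
reader_qos.history.kind     = DDS_KEEP_ALL_HISTORY_QOS;
DDSDataReader *reader = subscriber->create_datareader(
        topic, reader_qos, NULL /* listener */, DDS_STATUS_MASK_NONE);
```

Both the Writer and the Reader need the matching Durability kind for the late-joiner repair to happen.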

Offline
Last seen: 9 months 1 week ago
Joined: 05/08/2020
Posts: 8

These QoS profile settings have been around for a long time, so either they haven't caused an issue anyone has noticed or something in my test setup is violating the conditions under which they normally work. The history-depth issue doesn't sound likely since I am only sending 6 messages, unless the heartbeats and ACKs add to the number of samples. I could believe a race condition. Would that be indicated in the log? Also, it seems that transient local will hold a sample for a newly discovered reader, but is the race condition caused by the writer thinking the reader is already discovered when discovery is not quite done? I will try adjusting the settings to see if they affect anything, but it took another 300 or so runs before I got another failure, below, that seems a little different:

2020-05-09T12:55:35.110189 | | | | Created publisher
2020-05-09T12:55:35.115520 | | | | Created topic, name: 'topic', type: 'type'
2020-05-09T12:55:35.133898 | | | | Created writer for topic 'topic'
2020-05-09T12:55:35.143621 | H2.A1.P2 | | | Announcing new writer W-K_800000L
2020-05-09T12:55:35.143891 | | <--- | W-K_800000L | Sent preemptive HB to let know about samples in [2, 1]
2020-05-09T12:55:35.145395 | H2.A1.P1 | | W-K_800000L | Discovered remote reliable reader R-K_800001L
2020-05-09T12:55:35.145662 | H2.A1.P2 | | | Discovered new writer W-K_800000L
2020-05-09T12:55:35.145971 | H2.A1.P2 | | R-K_800001L | Discovered local reliable writer W-K_800000L
2020-05-09T12:55:35.146236 | H2.A1.P1 | ---> | W-K_800000L | Received ACKNACK [0] from reader R-K_800001L for 0 +0
2020-05-09T12:55:35.146236 | H2.A1.P1 | <--- | W-K_800000L | Sent preemptive GAP to volatile reader R-K_800001L
2020-05-09T12:55:35.146236 | | <--- | W-K_800000L | Sent HB [2] to verify GAP for samples in [1, 1]
2020-05-09T12:55:35.146384 | H2.A1.P2 | ---> | R-K_800001L | Received GAP from writer W-K_800000L for [0, 2] (+0)
2020-05-09T12:55:35.146422 | H2.A1.P2 | ---> | R-K_800001L | Received HB [2] from writer W-K_800000L for samples in [1, 1]
2020-05-09T12:55:35.146435 | | <--- | W-K_800000L | Sent piggyback HB [3] from synchronous reparation to acknowledge samples in [1, 1]
2020-05-09T12:55:35.146513 | H2.A1.P2 | ---> | R-K_800001L | Received HB [3] from writer W-K_800000L for samples in [1, 1]
2020-05-09T12:55:35.146527 | H2.A1.P2 | <--- | R-K_800001L | Sent ACK [2] to writer W-K_800000L for 2 count 0
2020-05-09T12:55:35.146564 | H2.A1.P1 | ---> | W-K_800000L | Received ACKNACK [2] from reader R-K_800001L for 2 +0
2020-05-09T12:55:35.146582 | | | | Ignored ACK
2020-05-09T12:55:46.526963 | H2.A1.P2 | ---> | R-K_800001L | Received DATA [2] from writer W-K_800000L (reliable)

 

Also, in Wireshark I saw RTPS2 in the protocol settings, but watching the live packet feed nothing shows up while the DDS messages are being sent back and forth, at least nothing with any sort of RT protocol. All I see is 4 UDP discovery messages.

Offline
Last seen: 3 years 5 months ago
Joined: 08/20/2012
Posts: 25

I think it is indicated in the log: "Sent preemptive GAP to volatile reader R-K_800001L" and "Sent HB [2] to verify GAP for samples in [1, 1]". Sequence numbers are being announced as unavailable to deliver to the Reader, and it even mentions Volatile Durability.

The race condition would be resolved by switching to Transient Local. Even if the first attempt to deliver a sample doesn't succeed, the reliability protocol will retransmit. There is also a lurking potential problem with Keep Last and the limited history depth wrt the sample count you are expecting, but you may find that this resolves just by switching the Writer and Reader to Transient Local Durability.