RTI Connext

Core Libraries and Utilities

User’s Manual

Part 3 — Advanced Concepts

Chapters 10-21

Version 5.0

© 2012 Real-Time Innovations, Inc.

All rights reserved.

Printed in U.S.A. First printing.

August 2012.

Trademarks

Real-Time Innovations, RTI, DataBus, and Connext are trademarks or registered trademarks of Real-Time Innovations, Inc. All other trademarks used in this document are the property of their respective owners.

Copy and Use Restrictions

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form (including electronic, mechanical, photocopy, and facsimile) without the prior written permission of Real-Time Innovations, Inc. The software described in this document is furnished under and subject to the RTI software license agreement. The software may be used or copied only under the terms of the license agreement.

Third-Party Copyright Notices

Note: In this section, "the Software" refers to third-party software, portions of which are used in Connext; "the Software" does not refer to Connext.

This product implements the DCPS layer of the Data Distribution Service (DDS) specification version 1.2 and the DDS Interoperability Wire Protocol specification version 2.1, both of which are owned by the Object Management Group, Inc. Copyright 1997-2007 Object Management Group, Inc. The publication of these specifications can be found at the Catalog of OMG Data Distribution Service (DDS) Specifications. This documentation uses material from the OMG specification for the Data Distribution Service, section 7. Reprinted with permission. Object Management Group, Inc. © OMG 2005.

Portions of this product were developed using ANTLR (www.ANTLR.org). This product includes software developed by the University of California, Berkeley and its contributors.

Portions of this product were developed using AspectJ, which is distributed per the CPL license. AspectJ source code may be obtained from Eclipse. This product includes software developed by the University of California, Berkeley and its contributors.

Portions of this product were developed using MD5 from Aladdin Enterprises.

Portions of this product include software derived from Fnmatch, (c) 1989, 1993, 1994 The Regents of the University of California. All rights reserved. The Regents and contributors provide this software "as is" without warranty.

Portions of this product were developed using EXPAT from Thai Open Source Software Center Ltd and Clark Cooper Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark Cooper Copyright (c) 2001, 2002 Expat maintainers. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Technical Support

Real-Time Innovations, Inc.

232 E. Java Drive

Sunnyvale, CA 94089

Phone:

(408) 990-7444

Email:

support@rti.com

Website:

https://support.rti.com/

Contents, Part 3

10 Reliable Communications  10-1
   10.1 Sending Data Reliably  10-1
      10.1.1 Best-effort Delivery Model  10-1
      10.1.2 Reliable Delivery Model  10-2
   10.2 Overview of the Reliable Protocol  10-3
   10.3 Using QosPolicies to Tune the Reliable Protocol  10-6
      10.3.1 Enabling Reliability  10-7
         10.3.1.1 Blocking until the Send Queue Has Space Available  10-7
      10.3.2 Tuning Queue Sizes and Other Resource Limits  10-7
         10.3.2.1 Understanding the Send Queue and Setting its Size  10-8
         10.3.2.2 Understanding the Receive Queue and Setting Its Size  10-10
      10.3.3 Controlling Queue Depth with the History QosPolicy  10-13
      10.3.4 Controlling Heartbeats and Retries with DataWriterProtocol QosPolicy  10-13
         10.3.4.1 How Often Heartbeats are Resent (heartbeat_period)  10-13
         10.3.4.2 How Often Piggyback Heartbeats are Sent (heartbeats_per_max_samples)  10-14
         10.3.4.3 Controlling Packet Size for Resent Samples (max_bytes_per_nack_response)  10-15
         10.3.4.4 Controlling How Many Times Heartbeats are Resent (max_heartbeat_retries)  10-16
         10.3.4.5 Treating Non-Progressing Readers as Inactive Readers (inactivate_nonprogressing_readers)  10-17
         10.3.4.6 Coping with Redundant Requests for Missing Samples (max_nack_response_delay)  10-17
         10.3.4.7 Disabling Positive Acknowledgements (disable_positive_acks_min_sample_keep_duration)  10-18
      10.3.5 Avoiding Message Storms with DataReaderProtocol QosPolicy  10-19
      10.3.6 Resending Samples to Late-Joiners with the Durability QosPolicy  10-19
      10.3.7 Use Cases  10-19
         10.3.7.1 Importance of Relative Thread Priorities  10-19
         10.3.7.2 Aperiodic Use Case: One-at-a-Time  10-20
         10.3.7.3 Aperiodic, Bursty  10-23
         10.3.7.4 Periodic  10-26

11 Collaborative DataWriters  11-1
   11.1 Collaborative DataWriters Use Cases  11-2
   11.2 Sample Combination (Synchronization) Process in a DataReader  11-3
   11.3 Configuring Collaborative DataWriters  11-3
      11.3.1 Associating Virtual GUIDs with Data Samples  11-3
      11.3.2 Associating Virtual Sequence Numbers with Data Samples  11-3
      11.3.3 Specifying which DataWriters will Deliver Samples to the DataReader from a Logical Data Source  11-3
      11.3.4 Specifying How Long to Wait for a Missing Sample  11-4
   11.4 Collaborative DataWriters and Persistence Service  11-4

12 Mechanisms for Achieving Information Durability and Persistence  12-1
   12.1 Introduction  12-1
      12.1.1 Scenario 1: DataReader Joins after DataWriter Restarts (Durable Writer History)  12-2
      12.1.2 Scenario 2: DataReader Restarts While DataWriter Stays Up (Durable Reader State)  12-2
      12.1.3 Scenario 3: DataReader Joins after DataWriter Leaves Domain (Durable Data)  12-3
   12.2 Durability and Persistence Based on Virtual GUIDs  12-4
   12.3 Durable Writer History  12-5
      12.3.1 Durable Writer History Use Case  12-5
      12.3.2 How To Configure Durable Writer History  12-6
   12.4 Durable Reader State  12-8
      12.4.1 Durable Reader State With Protocol Acknowledgment  12-8
         12.4.1.1 Bandwidth Utilization  12-9
      12.4.2 Durable Reader State with Application Acknowledgment  12-10
         12.4.2.1 Bandwidth Utilization  12-10
      12.4.3 Durable Reader State Use Case  12-10
      12.4.4 How To Configure a DataReader for Durable Reader State  12-11
   12.5 Data Durability  12-12
      12.5.1 RTI Persistence Service  12-12

13 Guaranteed Delivery of Data  13-1
   13.1 Introduction  13-1
      13.1.1 Identifying the Required Consumers of Information  13-2
      13.1.2 Ensuring Consumer Applications Process the Data Successfully  13-3
      13.1.3 Ensuring Information is Available to Late-Joining Applications  13-4
   13.2 Scenarios  13-5
      13.2.1 Scenario 1: Guaranteed Delivery to a priori Known Subscribers  13-5
      13.2.2 Scenario 2: Surviving a Writer Restart when Delivering Samples to a priori Known Subscribers  13-7
      13.2.3 Scenario 3: Delivery Guaranteed by Persistence Service (Store and Forward) to a priori Known Subscribers  13-7
         13.2.3.1 Variation: Using Redundant Persistence Services  13-9
         13.2.3.2 Variation: Using Load-Balanced Persistent Services  13-10

14 Discovery  14-1
   14.1 What is Discovery?  14-1
      14.1.1 Simple Participant Discovery  14-2
      14.1.2 Simple Endpoint Discovery  14-2
   14.2 Configuring the Peers List Used in Discovery  14-3
      14.2.1 Peer Descriptor Format  14-4
         14.2.1.1 Locator Format  14-5
         14.2.1.2 Address Format  14-6
      14.2.2 NDDS_DISCOVERY_PEERS Environment Variable Format  14-6
      14.2.3 NDDS_DISCOVERY_PEERS File Format  14-7
   14.3 Discovery Implementation  14-8
      14.3.1 Participant Discovery  14-8
         14.3.1.1 Refresh Mechanism  14-9
         14.3.1.2 Maintaining DataWriter Liveliness for kinds AUTOMATIC and MANUAL_BY_PARTICIPANT  14-14
      14.3.2 Endpoint Discovery  14-14
      14.3.3 Discovery Traffic Summary  14-20
      14.3.4 Discovery-Related QoS  14-20
   14.4 Debugging Discovery  14-21
   14.5 Ports Used for Discovery  14-23
      14.5.1 Inbound Ports for Meta-Traffic  14-24
      14.5.2 Inbound Ports for User Traffic  14-25
      14.5.3 Automatic Selection of participant_id and Port Reservation  14-25
      14.5.4 Tuning domain_id_gain and participant_id_gain  14-25

15 Transport Plugins  15-1
   15.1 Builtin Transport Plugins  15-2
   15.2 Extension Transport Plugins  15-2
   15.3 The NDDSTransportSupport Class  15-3
   15.4 Explicitly Creating Builtin Transport Plugin Instances  15-3
   15.5 Setting Builtin Transport Properties of the Default Transport Instance—get/set_builtin_transport_properties()  15-4
   15.6 Setting Builtin Transport Properties with the PropertyQosPolicy  15-5
      15.6.1 Notes Regarding Loopback and Shared Memory  15-17
      15.6.2 Setting the Maximum Gather-Send Buffer Count for UDPv4 and UDPv6  15-17
      15.6.3 Formatting Rules for IPv6 ‘Allow’ and ‘Deny’ Address Lists  15-18
   15.7 Installing Additional Builtin Transport Plugins with register_transport()  15-18
      15.7.1 Transport Lifecycles  15-19
      15.7.2 Transport Aliases  15-19
      15.7.3 Transport Network Addresses  15-20
   15.8 Installing Additional Builtin Transport Plugins with PropertyQosPolicy  15-20
   15.9 Other Transport Support Operations  15-21
      15.9.1 Adding a Send Route  15-21
      15.9.2 Adding a Receive Route  15-22
      15.9.3 Looking Up a Transport Plugin  15-23

16 Built-In Topics  16-1
   16.1 Listeners for Built-in Entities  16-1
   16.2 Built-in DataReaders  16-2
      16.2.1 LOCATOR_FILTER QoS Policy (DDS Extension)  16-7
   16.3 Accessing the Built-in Subscriber  16-8
   16.4 Restricting Communication—Ignoring Entities  16-8
      16.4.1 Ignoring Specific Remote DomainParticipants  16-9
      16.4.2 Ignoring Publications and Subscriptions  16-9
      16.4.3 Ignoring Topics  16-10

17 Configuring QoS with XML  17-1
   17.1 Example XML File  17-2
   17.2 How to Load XML-Specified QoS Settings  17-2
      17.2.1 Loading, Reloading and Unloading Profiles  17-3
   17.3 How to Use XML-Specified QoS Settings  17-4
   17.4 XML File Syntax  17-5
   17.5 Using Environment Variables in XML  17-6
   17.6 XML String Syntax  17-7
   17.7 How the XML is Validated  17-7
      17.7.1 Validation at Run-Time  17-7
      17.7.2 XML File Validation During Editing  17-8
   17.8 Configuring QoS with XML  17-8
      17.8.1 QosPolicies  17-9
      17.8.2 Sequences  17-9
      17.8.3 Arrays  17-12
      17.8.4 Enumeration Values  17-12
      17.8.5 Time Values (Durations)  17-12
      17.8.6 Transport Properties  17-13
      17.8.7 Thread Settings  17-13
   17.9 QoS Profiles  17-14
      17.9.1 QoS Profiles with a Single QoS  17-15
      17.9.2 QoS Profile Inheritance  17-15
      17.9.3 Topic Filters  17-17
      17.9.4 Overwriting Default QoS Values  17-19
      17.9.5 Get Qos Profiles  17-20
   17.10 QoS Libraries  17-20
      17.10.1 Get Qos Profile Libraries  17-21
   17.11 URL Groups  17-21
   17.12 Configuring Logging Via XML  17-22

18 Multi-channel DataWriters  18-1
   18.1 What is a Multi-channel DataWriter?  18-2
   18.2 How to Configure a Multi-channel DataWriter  18-4
      18.2.1 Limitations  18-5
   18.3 Multi-channel Configuration on the Reader Side  18-5
   18.4 Where Does the Filtering Occur?  18-6
      18.4.1 Filtering at the DataWriter  18-6
      18.4.2 Filtering at the DataReader  18-7
      18.4.3 Filtering on the Network Hardware  18-7
   18.5 Fault Tolerance and Redundancy  18-7
   18.6 Reliability with Multi-Channel DataWriters  18-8
      18.6.1 Reliable Delivery  18-8
      18.6.2 Reliable Protocol Considerations  18-9
   18.7 Performance Considerations  18-9
      18.7.1 Network-Switch Filtering  18-9
      18.7.2 DataWriter and DataReader Filtering  18-10

19 Connext Threading Model  19-1
   19.1 Database Thread  19-1
   19.2 Event Thread  19-2
   19.3 Receive Threads  19-3
   19.4 Exclusive Areas, Connext Threads and User Listeners  19-4
   19.5 Controlling CPU Core Affinity for RTI Threads  19-5

20 Sample-Data Memory Management  20-1
   20.1 Sample-Data Memory Management for DataWriters  20-1
      20.1.1 Memory Management without Batching  20-2
      20.1.2 Memory Management with Batching  20-2
      20.1.3 Writer-Side Memory Management when Using Java  20-3
      20.1.4 Writer-Side Memory Management when Working with Large Data  20-4
   20.2 Sample-Data Memory Management for DataReaders  20-6
      20.2.1 Memory Management for DataReaders Using Generated Type-Plugins  20-6
      20.2.2 Reader-Side Memory Management when Using Java  20-7
      20.2.3 Memory Management for DynamicData DataReaders  20-8
      20.2.5 Memory Management for Fragmented Samples  20-10
      20.2.6 Reader-Side Memory Management when Working with Large Data  20-10

21 Troubleshooting  21-1
   21.1 What Version am I Running?  21-1
      21.1.1 Finding Version Information in Revision Files  21-1
      21.1.2 Finding Version Information Programmatically  21-1
   21.2 Controlling Messages from Connext  21-2
      21.2.1 Format of Logged Messages  21-4
         21.2.1.1 Timestamps  21-4
         21.2.1.2 Thread Identification  21-5
         21.2.1.3 Hierarchical Context  21-5
         21.2.1.4 Explanation of Context Strings  21-5
      21.2.2 Configuring Logging via XML  21-6
      21.2.3 Customizing the Handling of Generated Log Messages  21-7

Chapter 10 Reliable Communications

Connext uses best-effort delivery by default. The other type of delivery that Connext supports is called reliable. This chapter provides instructions on how to set up and use reliable communication.

This chapter includes the following sections:

Sending Data Reliably (Section 10.1)

Overview of the Reliable Protocol (Section 10.2)

Using QosPolicies to Tune the Reliable Protocol (Section 10.3)

10.1 Sending Data Reliably

The DCPS reliability model recognizes that the optimal balance between time-determinism and data-delivery reliability varies widely among applications and can vary among different publications within the same application. For example, individual samples of signal data can often be dropped because their value disappears when the next sample is sent. However, each sample of command data must be received and it must be received in the order sent.

The QosPolicies provide a way to customize the determinism/reliability trade-off on a per Topic basis, or even on a per DataWriter/DataReader basis.

There are two delivery models:

Best-effort delivery model: “I’m not concerned about missed or unordered samples.”

Reliable delivery model: “Make sure all samples get there, in order.”

10.1.1 Best-effort Delivery Model

By default, Connext uses the best-effort delivery model: there is no effort spent ensuring in-order delivery or resending lost samples. Best-effort DataReaders ignore lost samples in favor of the latest sample. Your application is only notified if it does not receive a new sample within a certain time period (set in the DEADLINE QosPolicy (Section 6.5.5)).

The best-effort delivery model is best for time-critical information that is sent continuously. For instance, consider a DataWriter for the value of a sensor device (such as the pressure inside a tank), and assume the DataWriter sends samples continuously. In this situation, a DataReader for this Topic is only interested in having the latest pressure reading available—older samples are obsolete.


10.1.2 Reliable Delivery Model

Reliable delivery means the samples are guaranteed to arrive, in the order published.

The DataWriter maintains a send queue with space to hold the last X number of samples sent. Similarly, a DataReader maintains a receive queue with space for consecutive X expected samples.

The send and receive queues are used to temporarily cache samples until Connext is sure the samples have been delivered and are not needed anymore. Connext removes samples from a publication’s send queue after the sample has been acknowledged by all reliable subscriptions. When positive acknowledgements are disabled (see DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3) and DATA_READER_PROTOCOL QosPolicy (DDS Extension) (Section 7.6.1)), samples are removed from the send queue after the corresponding keep-duration has elapsed (see Table 6.36, “DDS_RtpsReliableWriterProtocol_t,” on page 6-81).

If an out-of-order sample arrives, Connext speculatively caches it in the DataReader’s receive queue (provided there is space in the queue). Only consecutive samples are passed on to the DataReader.

DataWriters can be set up to wait for available queue space when sending samples. This will cause the sending thread to block until there is space in the send queue. (Or, you can decide to sacrifice sending samples reliably so that the sending rate is not compromised.) If the DataWriter is set up to ignore the full queue and sends anyway, then older cached samples will be pushed out of the queue before all DataReaders have received them. In this case, the DataReader (or its Subscriber) is notified of the missing samples through its Listener and/or Conditions.

Connext automatically sends acknowledgments (ACKNACKs) as necessary to maintain reliable communications. The DataWriter may choose to block for a specified duration to wait for these acknowledgments (see Waiting for Acknowledgments in a DataWriter (Section 6.3.11)).
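For illustration, a minimal sketch of blocking until outstanding samples are acknowledged, using the C++ API (the 10-second timeout is an arbitrary choice):

DDS_Duration_t timeout = {10, 0};  // wait at most 10 seconds
DDS_ReturnCode_t retcode = writer->wait_for_acknowledgments(timeout);
if (retcode == DDS_RETCODE_TIMEOUT) {
    // some samples were still unacknowledged when the timeout expired
}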

Connext establishes a virtual reliable channel between the matching DataWriter and all DataReaders. This mechanism isolates DataReaders from each other, allows the application to control memory usage, and provides mechanisms for the DataWriter to balance reliability and determinism. Moreover, the use of send and receive queues allows Connext to be implemented efficiently without introducing unnecessary delays in the stream.

Note that a successful return code (DDS_RETCODE_OK) from write() does not necessarily mean that all DataReaders have received the data. It only means that the sample has been added to the DataWriter’s queue. To see if all DataReaders have received the data, look at the RELIABLE_WRITER_CACHE_CHANGED Status (DDS Extension) (Section 6.3.6.7) to see if any samples are unacknowledged.
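As a sketch (C++ API; error handling omitted), the status can be polled like this:

DDS_ReliableWriterCacheChangedStatus status;
writer->get_reliable_writer_cache_changed_status(status);
if (status.unacknowledged_sample_count > 0) {
    // at least one reliable DataReader has not acknowledged every sample
}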

Suppose DataWriter A reliably publishes a Topic to which DataReaders B and C reliably subscribe. B has space in its queue, but C does not. Will DataWriter A be notified? Will DataReader C receive any error messages or callbacks? The exact behavior depends on the QoS settings:

If HISTORY_KEEP_ALL is specified for C, C will reject samples that cannot be put into the queue and request A to resend missing samples. The Listener is notified with the on_sample_rejected() callback (see SAMPLE_REJECTED Status (Section 7.3.7.8)). If A has a queue large enough, or A is no longer writing new samples, A won’t notice unless it checks the RELIABLE_WRITER_CACHE_CHANGED Status (DDS Extension) (Section 6.3.6.7).

If HISTORY_KEEP_LAST is specified for C, C will drop old samples and accept new ones. The Listener is notified with the on_sample_lost() callback (see SAMPLE_LOST Status (Section 7.3.7.7)). To A, it is as if all samples have been received by C (that is, they have all been acknowledged).


10.2 Overview of the Reliable Protocol

An important advantage of Connext is that it can offer the reliability and other QoS guarantees mandated by DDS on top of a very wide variety of transports, including packet-based transports, unreliable networks, multicast-capable transports, bursty or high-latency transports, etc. Connext is also capable of maintaining liveliness and application-level QoS even in the presence of sporadic connectivity loss at the transport level, an important benefit in mobile networks. Connext accomplishes this by implementing a reliable protocol that sequences and acknowledges application-level messages and monitors the liveliness of the link. This is called the Real-Time Publish-Subscribe (RTPS) protocol; it is an open, international standard.¹

In order to work in this wide range of environments, the reliable protocol defined by RTPS is highly configurable with a set of parameters that let the application fine-tune its behavior to trade-off latency, responsiveness, liveliness, throughput, and resource utilization. This section describes the most important features to the extent needed to understand how the configuration parameters affect its operation.

The most important features of the RTPS protocol are:

Support for both push and pull operating modes

Support for both positive and negative acknowledgments

Support for high data-rate DataWriters

Support for multicast DataReaders

Support for high-latency environments

In order to support these features, RTPS uses several types of messages: Data messages (DATA), acknowledgments (ACKNACKs), and heartbeats (HBs).

DATA messages contain snapshots of the value of data-objects and associate the snapshot with a sequence number that Connext uses to identify them within the DataWriter’s history. These snapshots are stored in the history as a direct result of the application calling write() on the DataWriter. Incremental sequence numbers are automatically assigned by the DataWriter each time write() is called. In Figure 10.1 through Figure 10.7, these messages are represented using the notation DATA(<value>, <sequenceNum>). For example, DATA(A,1) represents a message that communicates the value ‘A’ and associates the sequence number ‘1’ with this message. A DATA is used for both keyed and non-keyed data types.

HB messages announce to the DataReader that it should have received all snapshots up to the one tagged with a range of sequence numbers and can also request the DataReader to send an acknowledgement back. For example, HB(1-3) indicates to the DataReader that it should have received snapshots tagged with sequence numbers 1, 2, and 3 and asks the DataReader to confirm this.

ACKNACK messages communicate to the DataWriter that particular snapshots have been successfully stored in the DataReader’s history. ACKNACKs also tell the DataWriter which snapshots are missing on the DataReader side. The ACKNACK message includes a set of sequence numbers represented as a bit map. The sequence numbers indicate which ones the DataReader is missing. (The bit map contains the base sequence number that has not been received, followed by the number of bits in the bit map and the optional bit map. The maximum size of the bit map is 256.) All numbers up to (not including) those in the set are considered positively acknowledged. They are represented in Figure 10.1 through Figure 10.7 as ACKNACK(<first-missing>) or ACKNACK(<first-missing>-<last-missing>). For example, ACKNACK(4) indicates that the snapshots with sequence numbers 1, 2, and 3 have been successfully stored in the DataReader history, and that 4 has not been received.

1. For a link to the RTPS specification, see the RTI website, www.rti.com.

It is important to note that Connext can bundle multiple of the above messages within a single network packet. This ‘submessage bundling’ provides for higher performance communications.

Figure 10.1 Basic RTPS Reliable Protocol

[Figure: a timing diagram of a DataWriter and a DataReader. The application calls write(A); the DataWriter caches (A,1) and sends DATA(A,1) bundled with HB(1). The DataReader caches (A,1), checks on reception of HB(1) that nothing is missing, and replies with ACKNACK(2); the DataWriter then marks sample 1 as acknowledged. Tables beside each timeline show the assigned sequence number, the history of sent data values, and whether the sample has been delivered to the reader history (DataWriter side) or is available for the application to read/take (DataReader side).]

Figure 10.1 illustrates the basic behavior of the protocol when an application calls the write() operation on a DataWriter that is associated with a DataReader. As mentioned, the RTPS protocol can bundle multiple submessages into a single network packet. In Figure 10.1 this feature is used to piggyback a HB message to the DATA message. Note that before the message is sent, the data is given a sequence number (1 in this case) which is stored in the DataWriter’s send queue. As soon as the message is received by the DataReader, it places it into the DataReader’s receive queue. From the sequence number the DataReader can tell that it has not missed any messages and therefore it can make the data available immediately to the user (and call the DataReaderListener). This is indicated in the figure by the check mark. The reception of the HB(1) causes the DataReader to check that it has indeed received all updates up to and including the one with sequenceNumber=1. Since this is true, it replies with an ACKNACK(2) to positively acknowledge all messages up to (but not including) sequence number 2. The DataWriter notes that the update has been acknowledged, so it no longer needs to be retained in its send queue. This is likewise indicated by a check mark.

Figure 10.2 illustrates the behavior of the protocol in the presence of lost messages. Assume that the message containing DATA(A,1) is dropped by the network. When the DataReader receives


Figure 10.2 RTPS Reliable Protocol in the Presence of Message Loss

[Figure: the same exchange when DATA(A,1) is lost in transit. The DataWriter caches and sends (A,1), (B,2), and (C,3), each bundled with a heartbeat. On receiving DATA(B,2); HB(1-2), the DataReader caches B but marks it undeliverable and replies ACKNACK(1); the DataWriter resends DATA(A,1). After DATA(C,3); HB(1-3) arrives, the DataReader acknowledges everything with ACKNACK(4) and samples 1, 2, and 3 become available. See Figure 10.1 for the meaning of the table columns.]

 

the next message (DATA(B,2); HB(1-2)) the DataReader will notice that the data associated with sequence number 1 was never received. It realizes this because the heartbeat HB(1-2) tells the DataReader that it should have received all messages up to and including the one with sequence number 2. This realization has two consequences:

1. The data associated with sequence number 2 (B) is tagged with ‘X’ to indicate that it is not deliverable to the application (that is, it should not be made available to the application, because the application needs to receive the data associated with sample 1 (A) first).

2. An ACKNACK(1) is sent to the DataWriter to request that the data tagged with sequence number 1 be resent.

Reception of the ACKNACK(1) causes the DataWriter to resend DATA(A,1). Once the DataReader receives it, it can ‘commit’ both A and B so that the application can now access both (indicated by the check marks in the figure) and call the DataReaderListener. From there on, the protocol proceeds as before for the next data message (C) and so forth.

A subtle but important feature of the RTPS protocol is that ACKNACK messages are only sent as a direct response to HB messages. This allows the DataWriter to better control the overhead of these ‘administrative’ messages. For example, if the DataWriter knows that it is about to send a chain of DATA messages, it can bundle them all and include a single HB at the end, which minimizes ACKNACK traffic.

10.3 Using QosPolicies to Tune the Reliable Protocol

Reliability is controlled by the QosPolicies in Table 10.1. To enable reliable delivery, read the following sections to learn how to change the QoS for the DataWriter and DataReader:

Enabling Reliability (Section 10.3.1)

Tuning Queue Sizes and Other Resource Limits (Section 10.3.2)

Controlling Heartbeats and Retries with DataWriterProtocol QosPolicy (Section 10.3.4)

Avoiding Message Storms with DataReaderProtocol QosPolicy (Section 10.3.5)

Resending Samples to Late-Joiners with the Durability QosPolicy (Section 10.3.6)

Then see this section to explore example use cases:

Use Cases (Section 10.3.7)

Table 10.1 QosPolicies for Reliable Communications

Reliability (related entities: DW, DR; see Section 10.3.1, Section 6.5.19)
To establish reliable communication, this QoS must be set to DDS_RELIABLE_RELIABILITY_QOS for the DataWriter and its DataReaders.

ResourceLimits (related entities: DW, DR; see Section 10.3.2, Section 6.5.20)
This QoS determines the amount of resources each side can use to manage instances and samples of instances. It therefore controls the size of the DataWriter’s send queue and the DataReader’s receive queue. The send queue stores samples until they have been ACKed by all DataReaders. The DataReader’s receive queue stores samples for the user’s application to access.

History (related entities: DW, DR; see Section 10.3.3, Section 6.5.10)
This QoS affects how a DataWriter/DataReader behaves when its send/receive queue fills up.

DataWriterProtocol (related entities: DW; see Section 10.3.4, Section 6.5.3)
This QoS configures DataWriter-specific protocol behavior. It can disable positive ACKs for its DataReaders.

DataReaderProtocol (related entities: DR; see Section 10.3.5, Section 7.6.1)
When a reliable DataReader receives a heartbeat from a DataWriter and needs to return an ACKNACK, the DataReader can choose to delay a while. This QoS sets the minimum and maximum delay. It can also disable positive ACKs for the DataReader.

DataReaderResourceLimits (related entities: DR; see Section 10.3.2, Section 7.6.2)
This QoS determines additional amounts of resources that the DataReader can use to manage samples (namely, the size of the DataReader’s internal queues, which cache samples until they are ordered for reliability and can be moved to the DataReader’s receive queue for access by the user’s application).

Durability (related entities: DW, DR; see Section 10.3.6, Section 6.5.7)
This QoS affects whether late-joining DataReaders will receive all previously-sent data or not.

(DW = DataWriter, DR = DataReader)

10.3.1 Enabling Reliability

You must modify the RELIABILITY QosPolicy (Section 6.5.19) of the DataWriter and each of its reliable DataReaders. Set the kind field to DDS_RELIABLE_RELIABILITY_QOS:

DataWriter

writer_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;

DataReader

reader_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;

10.3.1.1 Blocking until the Send Queue Has Space Available

The max_blocking_time property in the RELIABILITY QosPolicy (Section 6.5.19) indicates how long a DataWriter can be blocked during a write().

If max_blocking_time is non-zero and the reliability send queue is full, the write is blocked (the sample is not sent). If max_blocking_time has passed and the sample is still not sent, write() returns DDS_RETCODE_TIMEOUT and the sample is not sent.

If the number of unacknowledged samples in the reliability send queue drops below max_samples (set in the RESOURCE_LIMITS QosPolicy (Section 6.5.20)) before max_blocking_time, the sample is sent and write() returns DDS_RETCODE_OK.

If max_blocking_time is zero and the reliability send queue is full, write() returns DDS_RETCODE_TIMEOUT and the sample is not sent.
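For example, a sketch that gives write() a two-second blocking budget, using the C++ API (the value is arbitrary):

writer_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
writer_qos.reliability.max_blocking_time.sec = 2;      // block for up to 2 seconds
writer_qos.reliability.max_blocking_time.nanosec = 0;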

10.3.2 Tuning Queue Sizes and Other Resource Limits

Set the HISTORY QosPolicy (Section 6.5.10) appropriately to accommodate however many samples should be saved in the DataWriter’s send queue or the DataReader’s receive queue. The defaults may suit your needs; if so, you do not have to modify this QosPolicy.

Set the DDS_RtpsReliableWriterProtocol_t in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3) appropriately to accommodate the number of unacknowledged samples that can be in-flight at a time from a DataWriter.

For more information, see the following sections:

Understanding the Send Queue and Setting its Size (Section 10.3.2.1)

Understanding the Receive Queue and Setting Its Size (Section 10.3.2.2)

Note: The HistoryQosPolicy’s depth must be less than or equal to the ResourceLimitsQosPolicy’s max_samples_per_instance; max_samples_per_instance must be less than or equal to the ResourceLimitsQosPolicy’s max_samples (see RESOURCE_LIMITS QosPolicy (Section 6.5.20)); and max_samples_per_remote_writer (see DATA_READER_RESOURCE_LIMITS QosPolicy (DDS Extension) (Section 7.6.2)) must be less than or equal to max_samples.

depth <= max_samples_per_instance <= max_samples

max_samples_per_remote_writer <= max_samples

Examples:

DataWriter

writer_qos.resource_limits.initial_instances = 10;
writer_qos.resource_limits.initial_samples = 200;
writer_qos.resource_limits.max_instances = 100;
writer_qos.resource_limits.max_samples = 2000;
writer_qos.resource_limits.max_samples_per_instance = 20;
writer_qos.history.depth = 20;

DataReader

reader_qos.resource_limits.initial_instances = 10;
reader_qos.resource_limits.initial_samples = 200;
reader_qos.resource_limits.max_instances = 100;
reader_qos.resource_limits.max_samples = 2000;
reader_qos.resource_limits.max_samples_per_instance = 20;
reader_qos.history.depth = 20;
reader_qos.reader_resource_limits.max_samples_per_remote_writer = 20;

10.3.2.1 Understanding the Send Queue and Setting its Size

A DataWriter’s send queue is used to store each sample it writes. A sample will be removed from the send queue after it has been acknowledged (through an ACKNACK) by all the reliable DataReaders. A DataReader can request that the DataWriter resend a missing sample (through an ACKNACK). If that sample is still available in the send queue, it will be resent. To elicit timely ACKNACKs, the DataWriter will regularly send heartbeats to its reliable DataReaders.

A DataWriter’s send queue size is determined by its RESOURCE_LIMITS QosPolicy (Section 6.5.20), specifically the max_samples field. The appropriate value depends on application parameters such as how fast the publication calls write().

A DataWriter has a "send window" that is the maximum number of unacknowledged samples allowed in the send queue at a time. The send window enables configuration of the number of samples queued for reliability to be done independently from the number of samples queued for history. This is of great benefit when the size of the history queue is much different than the size of the reliability queue. For example, you may want to resend a large history to late-joining DataReaders, so the send queue size is large. However, you do not want performance to suffer due to a large send queue; this can happen when the send rate is greater than the read rate, and the DataWriter has to resend many samples from its large historical send queue. If the send queue size was both the historical and reliability queue size, then both these goals could not be met. Now, with the send window, having a large history with good live reliability performance is possible.

The send window is determined by the DataWriterProtocolQosPolicy, specifically the fields min_send_window_size and max_send_window_size within the rtps_reliable_writer field of type DDS_RtpsReliableWriterProtocol_t. Other fields control a dynamic send window, where the send window size changes in response to network congestion to maximize the effective send rate. Like for max_samples, the appropriate values depend on application parameters.
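For illustration, a sketch that bounds the send window (C++ API; the values are arbitrary—with a dynamic send window enabled, the window size then adjusts between these limits):

writer_qos.protocol.rtps_reliable_writer.min_send_window_size = 20;
writer_qos.protocol.rtps_reliable_writer.max_send_window_size = 40;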

Strict reliability: If a DataWriter does not receive ACKNACKs from one or more reliable DataReaders, it is possible for the reliability send queue—either its finite send window, or max_samples if its send window is infinite—to fill up. If you want to achieve strict reliability, the kind field in the HISTORY QosPolicy (Section 6.5.10) for both the DataReader and DataWriter must be set to KEEP_ALL, positive acknowledgments must be enabled for both the DataReader and DataWriter, and your publishing application should wait until space is available in the reliability queue before writing any more samples. Connext provides two mechanisms to do this:

Allow the write() operation to block until there is space in the reliability queue again to store the sample. The maximum time this call blocks is determined by the max_blocking_time field in the RELIABILITY QosPolicy (Section 6.5.19) (also discussed in Section 10.3.1.1).

Use the DataWriter’s Listener to be notified when the reliability queue fills up or empties again.

When the HISTORY QosPolicy (Section 6.5.10) on the DataWriter is set to KEEP_LAST, strict reliability is not guaranteed. When there are depth number of samples in the queue (set in the HISTORY QosPolicy (Section 6.5.10), see Section 10.3.3) the oldest sample will be dropped from the queue when a new sample is written. Note that in such a reliable mode, when the send window is larger than max_samples, the DataWriter will never block, but strict reliability is no longer guaranteed.

If there is a request for the purged sample from any DataReaders, the DataWriter will send a heartbeat that no longer contains the sequence number of the dropped sample (it will not be able to send the sample).

Alternatively, a DataWriter with KEEP_LAST may block on write() when its send window is smaller than its send queue. The DataWriter blocks when its send window is full. Only after the blocking time has elapsed will the DataWriter purge a sample, at which point strict reliability is no longer guaranteed.

The send queue size is set in the max_samples field of the RESOURCE_LIMITS QosPolicy (Section 6.5.20). The appropriate size for the send queue depends on application parameters (such as the send rate), channel parameters (such as end-to-end delay and probability of packet loss), and quality of service requirements (such as maximum acceptable probability of sample loss).

The DataReader’s receive queue size should generally be larger than the DataWriter’s send queue size. Receive queue size is discussed in Section 10.3.2.2.

A good rule of thumb, based on a simple model that assumes individual packet drops are not correlated and time-independent, is that the size of the reliability send queue, N, is as shown in Figure 10.3.

Figure 10.3 Calculating Minimum Send Queue Size for a Desired Level of Reliability

N = 2RT × log(1 − Q) / log(p)

Simple formula for determining the minimum size of the send queue required for strict reliability.

In the above equation, R is the rate of sending samples, T is the round-trip transmission time, p is the probability of a packet loss in a round trip, and Q is the required probability that a sample is eventually successfully delivered. Of course, network-transport dropouts must also be taken into account and may influence or dominate this calculation.
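As a worked example of the formula (using the first row of Table 10.2): with R = 100 samples/sec, T = 0.001 sec, p = 0.01, and Q = 0.99, N = 2 × 100 × 0.001 × log(0.01)/log(0.01) = 0.2, so a send queue of a single sample already meets the reliability target.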

Table 10.2 gives the required size of the send queue for several common scenarios.

Table 10.2 Required Size of the Send Queue for Different Network Parameters

Q (a)     p (b)    T (c)           R (d)      N (e)
99%       1%       0.001 sec (f)   100 Hz     1
99%       1%       0.001 sec       2000 Hz    2
99%       5%       0.001 sec       100 Hz     1
99%       5%       0.001 sec       2000 Hz    4
99.99%    1%       0.001 sec       100 Hz     1
99.99%    1%       0.001 sec       2000 Hz    6
99.99%    5%       0.001 sec       100 Hz     1
99.99%    5%       0.001 sec       2000 Hz    8

a. "Q" is the desired level of reliability, measured as the probability that any data update will eventually be delivered successfully; in other words, the percentage of samples that will be successfully delivered.

b. "p" is the probability that any single packet gets lost in the network.

c. "T" is the round-trip transport delay in the network.

d. "R" is the rate at which the publisher is sending updates.

e. "N" is the minimum required size of the send queue to accomplish the desired level of reliability "Q".

f. The typical round-trip delay for a dedicated 100 Mbit/second Ethernet is about 0.001 seconds.

Note: Packet loss on a network frequently happens in bursts, and the packet loss events are correlated. This means that the probability of a packet being lost is much higher if the previous packet was lost because it indicates a congested network or busy receiver. For this situation, it may be better to use a queue size that can accommodate the longest period of network congestion, as illustrated in Figure 10.4.

Figure 10.4 Calculating Minimum Send Queue Size for Networks with Dropouts

N = R × D(Q)

Send queue size as a function of send rate "R" and maximum dropout time D.

In the above equation R is the rate of sending samples, D(Q) is a time such that Q percent of the dropouts are of equal or lesser length, and Q is the required probability that a sample is eventually successfully delivered. The problem with the above formula is that it is hard to determine the value of D(Q) for different values of Q.

For example, if we want to ensure that 99.9% of the samples are eventually delivered successfully, and we know that the 99.9% of the network dropouts are shorter than 0.1 seconds, then we would use N = 0.1*R. So for a rate of 100Hz, we would use a send queue of N = 10; for a rate of 2000Hz, we would use N = 200.

10.3.2.2 Understanding the Receive Queue and Setting Its Size

Samples are stored in the DataReader’s receive queue, which is accessible to the user’s application.

A sample is removed from the receive queue after it has been accessed by take(), as described in Accessing Data Samples with Read or Take (Section 7.4.3). Note that read() does not remove samples from the queue.

A DataReader's receive queue size is limited by its RESOURCE_LIMITS QosPolicy (Section 6.5.20), specifically the max_samples field. The storage of out-of-order samples for each DataWriter is also allocated from the DataReader’s receive queue; this sample resource is shared among all reliable DataWriters. That is, max_samples includes both ordered and out-of-order samples.

A DataReader can maintain reliable communications with multiple DataWriters (e.g., in the case of the OWNERSHIP_STRENGTH QosPolicy (Section 6.5.16) setting of SHARED). The maximum number of out-of-order samples from any one DataWriter that can occupy the receive queue is set in the max_samples_per_remote_writer field of the DATA_READER_RESOURCE_LIMITS QosPolicy (DDS Extension) (Section 7.6.2); this value can be used to prevent a single DataWriter from using all the space in the receive queue. max_samples_per_remote_writer must be set to be <= max_samples.


The DataReader will cache samples that arrive out of order while waiting for missing samples to be resent. (Up to 256 samples can be resent; this limitation is imposed by the wire protocol.) If there is no room, the DataReader has to reject out-of-order samples and request them again later after the missing samples have arrived.

The appropriate size of the receive queue depends on application parameters, such as the DataWriter’s sending rate and the probability of a dropped sample. However, the receive queue size should generally be larger than the send queue size. Send queue size is discussed in Section 10.3.2.1.

Figure 10.5 and Figure 10.6 compare two hypothetical DataReaders, both interacting with the same DataWriter. The queue on the left represents an ordering cache, allocated from the receive queue—samples are held here if they arrive out of order. The DataReader in Figure 10.5 on page 10-11 has a sufficiently large receive queue (max_samples) for the given send rate of the DataWriter and other operational parameters. In both cases, we assume that all samples are taken from the DataReader in the Listener callback. (See Accessing Data Samples with Read or Take (Section 7.4.3) for information on take() and related operations.)

In Figure 10.6 on page 10-12, max_samples is too small to cache out-of-order samples for the same operational parameters. In both cases, the DataReaders eventually receive all the samples in order. However, the DataReader with the larger max_samples will get the samples earlier and with fewer transactions. In particular, sample “4” is never resent for the DataReader with the larger queue size.

Figure 10.5 Effect of Receive-Queue Size on Performance: Large Queue Size

[Figure: the DataWriter sends samples 1 through 5; sample 2 is lost, and a heartbeat prompts an ACKNACK. With max_samples = 4 (which also limits how many unordered samples can be cached), sample 1 is taken immediately, samples 3 and 4 are cached with space reserved for the missing sample 2, and once sample 2 is resent, samples 2 through 4 and then sample 5 are taken. Sample 4 never has to be resent.]

Figure 10.6 Effect of Receive Queue Size on Performance: Small Queue Size

[Figure: the same exchange with max_samples = 2, which also limits how many unordered samples can be cached. Sample 1 moves to the receive queue; space is reserved for the missing sample 2, so sample 3 is cached but sample 4 must be dropped because it does not fit in the queue. After sample 2 is resent, samples 2 and 3 move to the receive queue; sample 4 must then be resent, and finally samples 4 and 5 move to the receive queue.]

10.3.3 Controlling Queue Depth with the History QosPolicy

If you want to achieve strict reliability, set the kind field in the HISTORY QosPolicy (Section 6.5.10) for both the DataReader and DataWriter to KEEP_ALL; in this case, the depth does not matter.

Or, for non-strict reliability, you can leave the kind set to KEEP_LAST (the default). This will provide non-strict reliability; some samples may not be delivered if the resource limit is reached.
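As a sketch, the two settings look like this in the C++ API (apply the same kind on the matching DataReader; see also the depth examples below):

writer_qos.history.kind = DDS_KEEP_ALL_HISTORY_QOS;   // strict reliability
// or (the default):
writer_qos.history.kind = DDS_KEEP_LAST_HISTORY_QOS;  // non-strict reliability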

The depth field in the HISTORY QosPolicy (Section 6.5.10) controls how many samples Connext will attempt to keep on the DataWriter’s send queue or the DataReader’s receive queue. For reliable communications, depth must be at least 1; it can be any value up to, but not more than, max_samples_per_instance in the RESOURCE_LIMITS QosPolicy (Section 6.5.20).

Example:

DataWriter

writer_qos.history.depth = <number of samples to keep in send queue>;

DataReader

reader_qos.history.depth = <number of samples to keep in receive queue>;

10.3.4 Controlling Heartbeats and Retries with DataWriterProtocol QosPolicy

In the Connext reliability model, the DataWriter sends data samples and heartbeats to reliable DataReaders. A DataReader responds to a heartbeat by sending an ACKNACK, which tells the DataWriter what the DataReader has received so far.

In addition, the DataReader can request missing samples (by sending an ACKNACK) and the DataWriter will respond by resending the missing samples. This section describes some advanced timing parameters that control the behavior of this mechanism. Many applications do not need to change these settings. These parameters are contained in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).

The protocol described in Overview of the Reliable Protocol (Section 10.2) uses very simple rules, such as piggybacking HB messages on each DATA message and responding immediately to ACKNACKs with the requested repair messages. While correct, this simple protocol would not achieve optimum performance in more advanced use cases.

This section describes some of the parameters configurable by means of the rtps_reliable_writer structure in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3) and how they affect the behavior of the RTPS protocol.

10.3.4.1 How Often Heartbeats are Resent (heartbeat_period)

If a DataReader does not acknowledge a sample that has been sent, the DataWriter resends the heartbeat. These heartbeats are resent at the rate set in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3), specifically its heartbeat_period field.

For example, a heartbeat_period of 3 seconds means that if a DataReader does not receive the latest sample (for example, it gets dropped by the network), it might take up to 3 seconds before the DataReader realizes it is missing data. The application can lower this value when it is important that recovery from packet loss is very fast.
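For example, a sketch that shortens the heartbeat period to 250 ms for faster loss recovery, using the C++ API (the value is arbitrary):

writer_qos.protocol.rtps_reliable_writer.heartbeat_period.sec = 0;
writer_qos.protocol.rtps_reliable_writer.heartbeat_period.nanosec = 250000000;  // 250 ms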

The basic approach of sending HB messages as a piggyback to DATA messages has the advantage of minimizing network traffic. However, there is a situation where this approach, by itself, may result in large latencies. Suppose there is a DataWriter that writes bursts of data, separated by relatively long periods of silence. Furthermore assume that the last message in one of the bursts is lost by the network. This is the case shown for message DATA(B, 2) in Figure 10.7. If HBs were only sent piggybacked to DATA messages, the DataReader would not realize it missed the ‘B’ DATA message with sequence number ‘2’ until the DataWriter wrote the next message. This may be a long time if data is written sporadically. To avoid this situation, Connext can be configured so that HBs are sent periodically as long as there are samples that have not been acknowledged even if no data is being sent. The period at which these HBs are sent is configurable by setting the rtps_reliable_writer.heartbeat_period field in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).

Note that a small value for the heartbeat_period will result in a small worst-case latency if the last message in a burst is lost. This comes at the expense of the higher overhead introduced by more frequent HB messages.

Also note that the heartbeat_period should not be less than the rtps_reliable_reader.heartbeat_suppression_duration in the DATA_READER_PROTOCOL QosPolicy (DDS Extension) (Section 7.6.1); otherwise those HBs will be lost.

Figure 10.7 Use of heartbeat_period

[Figure: the application writes A and B. DATA(A,1) arrives, but DATA(B,2) is lost and no further data is written. After heartbeat_period elapses, the DataWriter sends a periodic HB(1-2); the DataReader replies with ACKNACK(2), acknowledging sample 1 and requesting sample 2. The DataWriter resends DATA(B,2) with HB(1-2); the DataReader caches it and replies ACKNACK(3), after which the DataWriter marks samples 1 and 2 as acknowledged. See Figure 10.1 for the meaning of the table columns.]

10.3.4.2 How Often Piggyback Heartbeats are Sent (heartbeats_per_max_samples)

A DataWriter will automatically send heartbeats with new samples to request regular ACKNACKs from the DataReader. These are called “piggyback” heartbeats.


If batching is disabled¹: one piggyback heartbeat will be sent every [max_samples²/heartbeats_per_max_samples] number of samples.

If batching is enabled: one piggyback heartbeat will be sent every [max_batches³/heartbeats_per_max_samples] number of samples.

Furthermore, one piggyback heartbeat will be sent per send window. If the above calculation is greater than the send window size, then the DataWriter will send a piggyback heartbeat for every [send window size] number of samples.

The heartbeats_per_max_samples field is part of the rtps_reliable_writer structure in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3). If heartbeats_per_max_samples is set equal to max_samples, a heartbeat will be sent with each sample. A value of 8 means that a heartbeat will be sent with every [max_samples/8] samples; if max_samples is set to 1024, a heartbeat will be sent once every 128 samples. If you set this to zero, samples are sent without any piggyback heartbeat. The max_samples field is part of the RESOURCE_LIMITS QosPolicy (Section 6.5.20).
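For instance, the 1024/8 example above could be expressed as follows (a sketch; writer_qos is an assumed, already-initialized DDS_DataWriterQos):

writer_qos.resource_limits.max_samples = 1024;
// One piggyback HB every 1024/8 = 128 samples
writer_qos.protocol.rtps_reliable_writer.heartbeats_per_max_samples = 8;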

Figure 10.1 on page 10-4 and Figure 10.2 on page 10-5 seem to imply that a HB is sent as a piggyback to each DATA message. However, in situations where data is sent continuously at high rates, piggybacking a HB to each message may result in too much overhead; not so much on the HB itself, but on the ACKNACKs that would be sent back as replies by the DataReader.

There are two reasons to send a HB:

To request that a DataReader confirm the receipt of data via an ACKNACK, so that the DataWriter can remove it from its send queue and therefore prevent the DataWriter’s history from filling up (which could cause the write() operation to temporarily block[4]).

To inform the DataReader of what data it should have received, so that the DataReader can send a request for missing data via an ACKNACK.

The DataWriter’s send queue can buffer many data-samples while it waits for ACKNACKs, and the DataReader’s receive queue can store out-of-order samples while it waits for missing ones. So it is possible to send HB messages much less frequently than DATA messages. The ratio of piggyback HB messages to DATA messages is controlled by the rtps_reliable_writer.heartbeats_per_max_samples field in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).

A HB is used to get confirmation from DataReaders so that the DataWriter can remove acknowledged samples from the queue to make space for new samples. Therefore, if the queue size is large, or new samples are added slowly, HBs can be sent less frequently.

In Figure 10.8 on page 10-16, the DataWriter sets heartbeats_per_max_samples to a value such that a piggyback HB will be sent for every three samples. The DataWriter first writes samples A and B. The DataReader receives both, but since no HB has been received, it won’t send back an ACKNACK, and the DataWriter keeps all the samples in its queue. When the DataWriter sends sample C, it sends a piggyback HB along with the sample. Once the DataReader receives the HB, it sends back an ACKNACK for samples up to sequence number 3, so that the DataWriter can remove all three samples from its queue.

10.3.4.3 Controlling Packet Size for Resent Samples (max_bytes_per_nack_response)

A ‘repair’ packet contains the samples that a DataWriter resends in response to an ACKNACK; max_bytes_per_nack_response limits how much data the DataWriter will resend at a time. For example, if the DataReader requests 20 samples of 10K each and max_bytes_per_nack_response is set to 100K, the DataWriter will only send the first 10 samples. The DataReader will have to send another ACKNACK to receive the next 10 samples.

[1] Batching is enabled with the BATCH QosPolicy (DDS Extension) (Section 6.5.2).

[2] max_samples is set in the RESOURCE_LIMITS QosPolicy (Section 6.5.20).

[3] max_batches is set in the DATA_WRITER_RESOURCE_LIMITS QosPolicy (DDS Extension) (Section 6.5.4).

[4] Note that data could also be removed from the DataWriter’s send queue if it is no longer relevant due to some other QoS, such as HISTORY KEEP_LAST (Section 6.5.10) or LIFESPAN (Section 6.5.12).


Figure 10.8 Use of heartbeats_per_max_samples

[Sequence diagram between a DataWriter and a DataReader: write(A) and write(B) are cached and DATA (A,1) and DATA (B,2) are delivered, but the DataReader sends no ACKNACK because no HB has arrived, so the DataWriter keeps all samples queued. With write(C), the DataWriter sends DATA(C,3);HB(1-3); the DataReader runs check(1-3) and replies with ACKNACK(4), after which the DataWriter marks the samples acked(1-3) and removes them from its queue. See Figure 10.1 for the meaning of the table columns.]

A DataWriter may resend multiple missed samples in the same packet. The max_bytes_per_nack_response field in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3) limits the size of this ‘repair’ packet.
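Continuing the 20-samples-of-10K example above, the limit could be set like this (a sketch with illustrative values; writer_qos as before):

// At most ~100K of repair data per NACK response; a request for
// twenty 10K samples is then served in two rounds of 10 samples.
writer_qos.protocol.rtps_reliable_writer.max_bytes_per_nack_response = 102400;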

10.3.4.4 Controlling How Many Times Heartbeats are Resent (max_heartbeat_retries)

If a DataReader does not respond within max_heartbeat_retries number of heartbeats, it will be dropped by the DataWriter, and the reliable DataWriter’s Listener will be called with a RELIABLE_READER_ACTIVITY_CHANGED Status (DDS Extension) (Section 6.3.6.8).

If the dropped DataReader becomes available again (perhaps its network connection was down temporarily), it will be added back to the DataWriter the next time the DataWriter receives some message (ACKNACK) from the DataReader.

When a DataReader is ‘dropped’ by a DataWriter, the DataWriter will not wait for the DataReader to send an ACKNACK before any samples are removed. However, the DataWriter will still send data and HBs to this DataReader as normal.


The max_heartbeat_retries field is part of the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).
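For example (a sketch; the value is illustrative), a writer using a 100 ms heartbeat period would declare a silent reader inactive after roughly one second with:

// Reader considered inactive after 10 unanswered heartbeats
writer_qos.protocol.rtps_reliable_writer.max_heartbeat_retries = 10;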

10.3.4.5 Treating Non-Progressing Readers as Inactive Readers (inactivate_nonprogressing_readers)

In addition to max_heartbeat_retries, if inactivate_nonprogressing_readers is set, then not only are non-responsive DataReaders considered inactive, but DataReaders sending non-progressing NACKs can also be considered inactive. A non-progressing NACK is one which requests the same oldest sample as the previously received NACK. In this case, the DataWriter will not consider a non-progressing NACK as coming from an active reader, and hence will inactivate the DataReader if no new NACKs are received before max_heartbeat_retries number of heartbeat periods has passed.

One example for which it could be useful to turn on inactivate_nonprogressing_readers is when a DataReader’s (keep-all) queue is full of untaken historical samples. Each subsequent heartbeat would trigger the same NACK, and nominally the DataReader would not be inactivated. A user not requiring strict reliability could consider setting inactivate_nonprogressing_readers to allow the DataWriter to progress rather than being held up by this non-progressing DataReader.

10.3.4.6 Coping with Redundant Requests for Missing Samples (max_nack_response_delay)

When a DataWriter receives a request for missing samples from a DataReader and responds by resending the requested samples, it will ignore additional requests for the same samples during the time period max_nack_response_delay.

The rtps_reliable_writer.max_nack_response_delay field is part of the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).
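A sketch of both response-delay bounds (illustrative values; writer_qos as above):

writer_qos.protocol.rtps_reliable_writer.min_nack_response_delay.sec = 0;
writer_qos.protocol.rtps_reliable_writer.min_nack_response_delay.nanosec = 0;
// Additional requests for the same samples are ignored for up to 200 ms
writer_qos.protocol.rtps_reliable_writer.max_nack_response_delay.sec = 0;
writer_qos.protocol.rtps_reliable_writer.max_nack_response_delay.nanosec = 200000000;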

If your send period is smaller than the round-trip delay of a message, this can cause unnecessary sample retransmissions due to redundant ACKNACKs. In this situation, an ACKNACK triggered by an out-of-order sample is not received before the next sample is sent. When a DataReader receives the next message, it will send another ACKNACK for the missing sample. As illustrated in Figure 10.9 on page 10-18, duplicate ACKNACK messages cause another resending of missing sample “2” and lead to wasted CPU usage on both the publication and the subscription sides.

While these redundant messages provide an extra cushion for the level of reliability desired, you can conserve the CPU and network bandwidth usage by limiting how often the same ACKNACK messages are sent; this is controlled by min_nack_response_delay.

Reliable subscriptions are prevented from resending an ACKNACK within min_nack_response_delay seconds from the last time an ACKNACK was sent for the same sample. Our testing shows that the default min_nack_response_delay of 0 seconds achieves an optimal balance for most applications on typical Ethernet LANs.

However, if your system has very slow computers and/or a slow network, you may want to consider increasing min_nack_response_delay. Sending an ACKNACK and resending a missing sample inherently takes a long time in this system. So you should allow a longer time for recovery of the lost sample before sending another ACKNACK. In this situation, you should increase min_nack_response_delay.

If your system consists of a fast network or computers, and the receive queue size is very small, then you should keep min_nack_response_delay very small (such as the default value of 0). If the queue size is small, recovering a missing sample is more important than conserving CPU and network bandwidth (new samples that are too far ahead of the missing sample are thrown away). A fast system can cope with a smaller min_nack_response_delay value, and the reliable sample stream can normalize more quickly.


Figure 10.9 Resending Missing Samples due to Duplicate ACKNACKs

[Sequence diagram between a DataWriter and a DataReader: the writer sends samples “1” through “5”; sample “2” is lost. The reader sends ACKNACK(2) when sample “3” arrives, and sends ACKNACK(2) again when sample “4” arrives before the first repair is received; the duplicate ACKNACKs cause the writer to resend sample “2” twice. Space must be reserved for missing sample “2”; samples “3” and “4” are cached while waiting for it. The second copy of sample “2” is dropped, since it is older than the last sample that has been handed to the application.]

10.3.4.7 Disabling Positive Acknowledgements (disable_positive_acks_min_sample_keep_duration)

When ACKNACK storms are a primary concern in a system, an alternative to tuning heartbeat and ACKNACK response delays is to disable positive acknowledgments (ACKs) and rely just on NACKs to maintain reliability. Systems with non-strict reliability requirements can disable ACKs to reduce network traffic and directly solve the problem of ACK storms. ACKs can be disabled for the DataWriter and the DataReader; when disabled for the DataWriter, none of its DataReaders will send ACKs, whereas disabling it at the DataReader allows per-DataReader configuration.

Normally when ACKs are enabled, strict reliability is maintained by the DataWriter, guaranteeing that a sample stays in its send queue until all DataReaders have positively acknowledged it (aside from relevant DURABILITY, HISTORY, and LIFESPAN QoS policies). When ACKs are disabled, strict reliability is no longer guaranteed, but the DataWriter should still keep the sample for a sufficient duration for ACK-disabled DataReaders to have a chance to NACK it. Thus, a configurable “keep-duration” (disable_positive_acks_min_sample_keep_duration) applies for samples written for ACK-disabled DataReaders, where samples are kept in the queue for at least that keep-duration. After the keep-duration has elapsed for a sample, the sample is considered to be “acknowledged” by its ACK-disabled DataReaders.
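The corresponding settings might look like the following sketch (field locations per Sections 6.5.3 and 7.6.1; the 1-second keep-duration is illustrative):

// Writer side: do not expect ACKs; keep each sample queued for at least 1 s
writer_qos.protocol.disable_positive_acks = DDS_BOOLEAN_TRUE;
writer_qos.protocol.rtps_reliable_writer.
    disable_positive_acks_min_sample_keep_duration.sec = 1;
writer_qos.protocol.rtps_reliable_writer.
    disable_positive_acks_min_sample_keep_duration.nanosec = 0;

// Reader side: this DataReader NACKs missing samples but never sends ACKs
reader_qos.protocol.disable_positive_acks = DDS_BOOLEAN_TRUE;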

The keep duration should be configured for the expected worst-case from when the sample is written to when a NACK for the sample could be received. If set too short, the sample may no longer be queued when a NACK requests it, which is the cost of not enforcing strict reliability.

If the peak send rate is known and writer resources are available, the writer queue can be sized so that writes will not block. For this case, the queue size must be greater than the send rate multiplied by the keep duration.
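For example (illustrative numbers): at a peak send rate of 1000 samples/s and a keep-duration of 0.5 seconds, the queue must hold more than 1000 * 0.5 = 500 samples for write() not to block.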


10.3.5 Avoiding Message Storms with DataReaderProtocol QosPolicy

DataWriters send data samples and heartbeats to DataReaders. A DataReader responds to a heartbeat by sending an acknowledgement that tells the DataWriter what the DataReader has received so far and what it is missing. If there are many DataReaders, all sending ACKNACKs to the same DataWriter at the same time, a message storm can result. To prevent this, you can set a delay for each DataReader, so they don’t all send ACKNACKs at the same time. This delay is set in the DATA_READER_PROTOCOL QosPolicy (DDS Extension) (Section 7.6.1).

If you have several DataReaders per DataWriter, varying this delay for each one can avoid ACKNACK message storms to the DataWriter. If you are not concerned about message storms, you do not need to change this QosPolicy.

Example:

reader_qos.protocol.rtps_reliable_reader.min_heartbeat_response_delay.sec = 0;
reader_qos.protocol.rtps_reliable_reader.min_heartbeat_response_delay.nanosec = 0;
reader_qos.protocol.rtps_reliable_reader.max_heartbeat_response_delay.sec = 0;
reader_qos.protocol.rtps_reliable_reader.max_heartbeat_response_delay.nanosec =
    0.5 * 1000000000UL; // 0.5 sec

As their names suggest, the minimum and maximum response delays bound the random wait time before the response. Setting both to zero forces an immediate response, which may be necessary for the fastest recovery in case of lost samples.

10.3.6 Resending Samples to Late-Joiners with the Durability QosPolicy

The DURABILITY QosPolicy (Section 6.5.7) is also somewhat related to Reliability. Connext requires a finite time to "discover" or match DataReaders to DataWriters. If an application attempts to send data before the DataReader and DataWriter "discover" one another, then the sample will not actually get sent. Whether or not samples are resent when the DataReader and DataWriter eventually "discover" one another depends on how the DURABILITY and HISTORY QoS are set. The default setting for the Durability QosPolicy is VOLATILE, which means that the DataWriter will not store samples for redelivery to late-joining DataReaders.

Connext also supports the TRANSIENT_LOCAL setting for Durability, which means that samples will be stored for redelivery to late-joining DataReaders, as long as the DataWriter is alive and the RESOURCE_LIMITS QosPolicy (Section 6.5.20) allows. The samples are not stored beyond the lifecycle of the DataWriter.

See also: Waiting for Historical Data (Section 7.3.6).

10.3.7 Use Cases

This section contains advanced material that discusses practical applications of the reliability-related QoS.

10.3.7.1 Importance of Relative Thread Priorities

For high throughput, the Connext Event thread’s priority must be sufficiently high on the sending application. Unlike an unreliable writer, a reliable writer relies on internal Connext threads: the Receive thread processes ACKNACKs from the DataReaders, and the Event thread schedules the events necessary to maintain reliable data flow.

When samples are sent to the same or another application on the same host, the Receive thread priority should be higher than the writing thread priority (priority of the thread calling write() on the DataWriter). This will allow the Receive thread to process the messages as they are sent by the writing thread. A sustained reliable flow requires the reader to be able to process the samples from the writer at a speed equal to or faster than the writer emits.


The default Event thread priority is low. This is adequate if your reliable transfer is not sustained; queued-up events will eventually be processed when the writing thread yields the CPU. Connext can automatically grow the event queue to store all pending events. But if the reliable communication is sustained, reliable events will continue to be scheduled, and the event queue will eventually reach its limit. The default Event thread priority is unsuitable for maintaining fast and sustained reliable communication, and should be increased through participant_qos.event.thread.priority. This value maps directly to the OS thread priority; see EVENT QosPolicy (DDS Extension) (Section 8.5.5).
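A sketch of raising the priority (participant_qos is an assumed DDS_DomainParticipantQos; the numeric value is purely illustrative, since it maps directly to an OS-specific priority):

// Raise the Event thread priority for sustained reliable flows
participant_qos.event.thread.priority = 50;  // platform-specific value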

The Event thread priority should also be increased to minimize the reliable latency: if events are processed at a higher priority, dropped packets will be resent sooner.

Now we consider some practical applications of the reliability related QoS:

Aperiodic Use Case: One-at-a-Time (Section 10.3.7.2)

Aperiodic, Bursty (Section 10.3.7.3)

Periodic (Section 10.3.7.4)

10.3.7.2 Aperiodic Use Case: One-at-a-Time

Suppose you have aperiodically generated data that needs to be delivered reliably, with minimum latency, such as a series of commands (“Ready,” “Aim,” “Fire”). If the writing thread may block between samples to guarantee reception of the just-sent sample on the reader’s middleware end, a smaller queue will provide a smaller upper bound on the sample delivery time. Adequate writer QoS for this use case are presented in Figure 10.10.

Figure 10.10 QoS for an Aperiodic, One-at-a-time Reliable Writer

 1  qos->reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
 2  qos->history.kind = DDS_KEEP_ALL_HISTORY_QOS;
 3  qos->protocol.push_on_write = DDS_BOOLEAN_TRUE;
 4
 5  // use these hard-coded values unless you use a key
 6  qos->resource_limits.initial_samples = qos->resource_limits.max_samples = 1;
 7  qos->resource_limits.max_samples_per_instance =
 8      qos->resource_limits.max_samples;
 9  qos->resource_limits.initial_instances =
10      qos->resource_limits.max_instances = 1;
11
12  // want to piggyback HB w/ every sample.
13  qos->protocol.rtps_reliable_writer.heartbeats_per_max_samples =
14      qos->resource_limits.max_samples;
15
16  qos->protocol.rtps_reliable_writer.high_watermark = 1;
17  qos->protocol.rtps_reliable_writer.low_watermark = 0;
18  qos->protocol.rtps_reliable_writer.min_nack_response_delay.sec = 0;
19  qos->protocol.rtps_reliable_writer.min_nack_response_delay.nanosec = 0;
20  // consider making non-zero for reliable multicast
21  qos->protocol.rtps_reliable_writer.max_nack_response_delay.sec = 0;
22  qos->protocol.rtps_reliable_writer.max_nack_response_delay.nanosec = 0;
23
24  // should be faster than the send rate, but be mindful of OS resolution
25  qos->protocol.rtps_reliable_writer.fast_heartbeat_period.sec = 0;
26  qos->protocol.rtps_reliable_writer.fast_heartbeat_period.nanosec =
27      alertReaderWithinThisMs * 1000000;
28
29  qos->reliability.max_blocking_time = blockingTime;
30  qos->protocol.rtps_reliable_writer.max_heartbeat_retries = 7;
31
32  // essentially turn off slow HB period
33  qos->protocol.rtps_reliable_writer.heartbeat_period.sec = 3600 * 24 * 7;


Line 1 (Figure 10.10): This is the default setting for a writer, shown here strictly for clarity.

Line 2 (Figure 10.10): Setting the History kind to KEEP_ALL guarantees that no sample is ever lost.

Line 3 (Figure 10.10): This is the default setting for a writer, shown here strictly for clarity. ‘Push’ mode reliability will yield lower latency than ‘pull’ mode reliability in normal situations where there is no sample loss. (See DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).) Furthermore, it does not matter that each packet sent in response to a command will be small, because our data sent with each command is likely to be small, so that maximizing throughput for this data is not a concern.

Line 5 - Line 10 (Figure 10.10): For this example, we assume a single writer is writing samples one at a time. If we are not using keys (see Section 2.2.2), there is no reason to use a queue with room for more than one sample, because we want to resolve a sample completely before moving on to the next. While this negatively impacts throughput, it minimizes memory usage. In this example, a written sample will remain in the queue until it is acknowledged by all active readers (only 1 for this example).

Line 12 - Line 14 (Figure 10.10): The fastest way for a writer to ensure that a reader is up-to-date is to force an acknowledgement with every sample. We do this by appending a Heartbeat with every sample. This is akin to certified mail; the writer learns, as soon as the system will allow, whether a reader has received the letter, and can take corrective action if the reader has not. As with certified mail, this model has significant overhead compared to the unreliable case, trading off lower packet efficiency in favor of latency and fast recovery.

Line 16-Line 17 (Figure 10.10): Since the writer takes responsibility for pushing the samples out to the reader, a writer will go into a “heightened alert” mode as soon as the high water mark is reached (which is when any sample is written for this writer) and only come out of this mode when the low water mark is reached (when all samples have been acknowledged for this writer). Note that the selected high and low watermarks are actually the default values.

Line 18-Line 22 (Figure 10.10): When a reader requests a lost sample, we respond to the reader immediately in the interest of faster recovery. If the readers receive packets on unicast, there is no reason to wait, since the writer will eventually have to feed individual readers separately anyway. In the case of multicast readers, it makes sense to consider further. If the writer delayed its response enough so that all or most of the readers have had a chance to NACK a sample, the writer may coalesce the requests and send just one packet to all the multicast readers. Suppose that all multicast readers do indeed NACK within approximately 100 μs. Setting the minimum and maximum delays at 100 μs will allow the writer to collect all these NACKs and send a single response over multicast. (See DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3) for information on setting min_nack_response_delay and max_nack_response_delay.) Note that Connext relies on the OS to wait for this 100 μs. Unfortunately, not all operating systems can sleep for such a fine duration. On Windows systems, for example, the minimum achievable sleep time is somewhere between 1 and 20 milliseconds, depending on the version. On VxWorks systems, the minimum resolution of the wait time is based on the tick resolution, which is 1/(system clock rate); thus, if the system clock rate is 100 Hz, the tick resolution is 10 milliseconds. On such systems, the achievable minimum wait is actually far larger than the desired wait time. This could have an unintended consequence due to the delay caused by the OS: at a minimum, the time to repair a packet may be longer than you specified.

Line 24-Line 27 (Figure 10.10): If a reader drops a sample, the writer recovers by notifying the reader of what it has sent, so that the reader may request resending of the lost sample. Therefore, the recovery time depends primarily on how quickly the writer pings the reader that has fallen behind. If commands will not be generated faster than one every few seconds, it may be acceptable for the writer to ping the reader several hundred milliseconds after the sample is sent.


Suppose that the round-trip time of fairly small packets between the writer and the reader application is 50 microseconds, and that the reader does not delay response to a Heartbeat from the writer (see DATA_READER_PROTOCOL QosPolicy (DDS Extension) (Section 7.6.1) for how to change this). If a sample is dropped, the writer will ping the reader after a maximum of the OS delay resolution discussed above and alertReaderWithinThisMs (let’s say 10 ms for this example). The reader will request the missing sample immediately, and with the code set as above, the writer will feed the missing sample immediately. Neglecting the processing time on the writer or the reader end, and assuming that this retry succeeds, the time to recover the sample from the original publication time is: alertReaderWithinThisMs + 50 μs + 25 μs.

If the OS is capable of micro-sleep, the recovery time can be within 100 μs, barely noticeable to a human operator. If the OS minimum wait resolution is much larger, the recovery time is dominated by the wait resolution of the OS. Since ergonomic studies suggest that delays in excess of 0.25 seconds start hampering operations that require low-latency data, even a 10 ms limitation seems acceptable.

What if two packets are dropped in a row? Then the recovery time would be 2 * alertReaderWithinThisMs + 2 * 50 μs + 25 μs. If alertReaderWithinThisMs is 100 ms, the recovery time now exceeds 200 ms, and can perhaps degrade the user experience.

Line 29-Line 30 (Figure 10.10): What if another command (like another button press) is issued before the recovery? Since we must not drop this new sample, we block the writer until the recovery completes. If alertReaderWithinThisMs is 10 ms, and we assume no more than 7 consecutive drops, the longest time for recovery will be just above (alertReaderWithinThisMs * max_heartbeat_retries), or 70 ms.

So if we set blockingTime to about 80 ms, we will have given enough chance for recovery. Of course, in a dynamic system, a reader may drop out at any time, in which case max_heartbeat_retries will be exceeded, and the unresponsive reader will be dropped by the writer. In either case, the writer can continue writing. Inappropriate values will cause a writer to prematurely drop a temporarily unresponsive (but otherwise healthy) reader, or be stuck trying unsuccessfully to feed a crashed reader. In the unfortunate case where a reader becomes temporarily unresponsive for a duration exceeding (alertReaderWithinThisMs * max_heartbeat_retries), the writer may issue gaps to that reader when it becomes active again; the dropped samples are irrecoverable. So estimating the worst case unresponsive time of all potential readers is critical if sample drop is unacceptable.

Line 32-Line 33 (Figure 10.10): Since the command may not be issued for hours or even days on end, there is no reason to keep announcing the writer’s state to the readers.

Figure 10.11 shows how to set the QoS for the reader side, followed by a line-by-line explanation.

Figure 10.11 QoS for an Aperiodic, One-at-a-time Reliable Reader

 1  qos->reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
 2  qos->history.kind = DDS_KEEP_ALL_HISTORY_QOS;
 3
 4  // 1 is ok for normal use. 2 allows fast infinite loop
 5  qos->reader_resource_limits.max_samples_per_remote_writer = 2;
 6  qos->resource_limits.initial_samples = 2;
 7  qos->resource_limits.initial_instances = 1;
 8
 9  qos->protocol.rtps_reliable_reader.max_heartbeat_response_delay.sec = 0;
10  qos->protocol.rtps_reliable_reader.max_heartbeat_response_delay.nanosec = 0;
11  qos->protocol.rtps_reliable_reader.min_heartbeat_response_delay.sec = 0;
12  qos->protocol.rtps_reliable_reader.min_heartbeat_response_delay.nanosec = 0;

Line 1-Line 2 (Figure 10.11): Unlike a writer, the reader’s default reliability setting is best-effort, so reliability must be turned on. Since we don’t want to drop anything, we choose KEEP_ALL history.


Line 4-Line 6 (Figure 10.11): Since we enforce reliability on each sample, it would be sufficient to keep the queue size at 1, except in the following case: suppose that the reader takes some action in response to the command received, which in turn causes the writer to issue another command right away. Because Connext passes the user data up to the application even before acknowledging the sample to the writer (for minimum latency), the first sample is still pending acknowledgement in the writer’s queue when the writer attempts to write the second sample; this will cause the writing thread to block until the reader completes processing the first sample and acknowledges it to the writer, all as it should be. But if you want to run this infinite loop at full throttle, the reader should buffer one more sample. Let’s follow the packet flow under normal circumstances:

1. The sender application writes sample 1 to the reader. The receiver application processes it and sends a user-level response 1 to the sender application, but has not yet ACK’d sample 1.

2. The sender application writes sample 2 to the receiving application in response to response 1. Because the reader’s queue depth is 2, it can accept sample 2 even though it may not yet have acknowledged sample 1. Otherwise, the reader might drop sample 2 and have to recover it later.

3. At the same time, the receiver application acknowledges sample 1 and frees up one slot in the queue, so that it can accept sample 3, which is on its way.

The above steps can be repeated ad infinitum under continuous traffic.

Line 7 (Figure 10.11): Since we are not using keys, there is just one instance.

Line 9-Line 12 (Figure 10.11): We choose immediate response in the interest of fastest recovery. In a high-throughput multicast scenario, delaying the response (with the Event thread priority set high, of course) may decrease the likelihood of a NACK storm causing a writer to drop some NACKs. The random delay reduces this chance by staggering the NACK responses. But the minimum delay achievable once again depends on the OS.

10.3.7.3 Aperiodic, Bursty

Suppose you have aperiodically generated bursts of data, as in the case of a new aircraft approaching an airport. The data may be the same or different, but if they are written by a single writer, the challenge to this writer is to feed all readers as quickly and efficiently as possible when this burst of hundreds or thousands of samples hits the system.

If you use an unreliable writer to push this burst of data, some samples may be dropped over an unreliable transport such as UDP.

If you try to shape the burst according to however much the slowest reader can process, the system throughput may suffer, and an additional burden of queueing the samples is placed on the sender application.

If you push the data reliably as fast as it is generated, this may cost dearly in repair packets, especially for the slowest reader, which is already burdened with application chores.

Connext pull mode reliability offers an alternative in this case by letting each reader pace its own data stream. It works by notifying the reader what it is missing, then waiting for it to request only as much as it can handle. As in the aperiodic one-at-a-time case (Section 10.3.7.2), multicast is supported, but its performance depends on the resolution of the minimum delay supported by the OS. At the cost of greater latency, this model can deliver reliability while using far fewer packets than in the push mode. The writer QoS is given in Figure 10.12, with a line-by-line explanation below.

Figure 10.12 QoS for an Aperiodic, Bursty Writer

 1  qos->reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
 2  qos->history.kind = DDS_KEEP_ALL_HISTORY_QOS;
 3  qos->protocol.push_on_write = DDS_BOOLEAN_FALSE;
 4
 5  // use these hard-coded values until you use keys
 6  qos->resource_limits.initial_instances =
 7      qos->resource_limits.max_instances = 1;
 8  qos->resource_limits.initial_samples = qos->resource_limits.max_samples
 9      = worstBurstInSample;
10  qos->resource_limits.max_samples_per_instance =
11      qos->resource_limits.max_samples;
12
13  // piggyback HB not used
14  qos->protocol.rtps_reliable_writer.heartbeats_per_max_samples = 0;
15
16  qos->protocol.rtps_reliable_writer.high_watermark = 1;
17  qos->protocol.rtps_reliable_writer.low_watermark = 0;
18
19  qos->protocol.rtps_reliable_writer.min_nack_response_delay.sec = 0;
20  qos->protocol.rtps_reliable_writer.min_nack_response_delay.nanosec = 0;
21  qos->protocol.rtps_reliable_writer.max_nack_response_delay.sec = 0;
22  qos->protocol.rtps_reliable_writer.max_nack_response_delay.nanosec = 0;
23  qos->reliability.max_blocking_time = blockingTime;
24
25  // should be faster than the send rate, but be mindful of OS resolution
26  qos->protocol.rtps_reliable_writer.fast_heartbeat_period.sec = 0;
27  qos->protocol.rtps_reliable_writer.fast_heartbeat_period.nanosec =
28      alertReaderWithinThisMs * 1000000;
29  qos->protocol.rtps_reliable_writer.max_heartbeat_retries = 5;
30
31  // essentially turn off slow HB period
32  qos->protocol.rtps_reliable_writer.heartbeat_period.sec = 3600 * 24 * 7;

Line 1 (Figure 10.12): This is the default setting for a writer, shown here strictly for clarity.

Line 2 (Figure 10.12): Since we do not want any data lost, we want the History kind set to KEEP_ALL.

Line 3 (Figure 10.12): The default Connext reliable writer will push, but we want the reader to pull instead.

Line 5-Line 11 (Figure 10.12): We assume a single instance, in which case the maximum sample count will be the same as the maximum sample count per instance. In contrast to the one-at-a-time case discussed in Section 10.3.7.2, the writer’s queue is large; as big as the burst size, in fact, but no more, because this model tries to resolve a burst within a reasonable period, to be computed shortly. Of course, we could block the writing thread in the middle of the burst, but that might complicate the design of the sending application.

Line 13-Line 14 (Figure 10.12): By a ‘piggyback’ Heartbeat, we mean only a Heartbeat that is appended to data being pushed from the writer. Strictly speaking, the writer will also append a Heartbeat with each reply to a reader’s lost sample request, but we call that a ‘framing’ Heartbeat. Since data is pulled, heartbeats_per_max_samples is ignored.

Line 16-Line 17 (Figure 10.12): Similar to the previous aperiodic writer, this writer spends most of its time idle. But as the name suggests, even a single new sample implies more samples to follow in a burst. Putting the writer into a fast mode quickly will allow readers to be notified soon. Only when all samples have been delivered can the writer rest.

Line 19-Line 23 (Figure 10.12): Similar to the one-at-a-time case, there is no reason to delay response with only one reader. In this case, we can estimate the time to resolve a burst with only a few parameters. Let’s say that the reader figures it can safely receive and process 20 samples at a time without being overwhelmed, and that the time it takes a writer to fetch these 20 samples and send a single packet containing them, plus the time it takes a reader to receive and process these samples and send another request back to the writer for the next 20 samples, is 11 ms. Even on the same hardware, if the reader’s processing time can be reduced, this time will decrease; other factors, such as the traversal time through Connext and the transport, are typically in the microseconds range (depending on the machines, of course).

For example, let’s also say that the worst case burst is 1000 samples. The writing thread will of course not block because it is merely copying each of the 1000 samples to the Connext queue on the writer side; on a typical modern machine, the act of writing these 1000 samples will probably take no more than a few ms. But it would take at least 1000/20 = 50 resend packets for the reader to catch up to the writer, or 50 times 11 ms = 550 ms. Since the burst model deals with one burst at a time, we would expect that another burst would not come within this time, and that we are allowed to block for at least this period. Including a safety margin, it would appear that we can comfortably handle a burst of 1000 every second or so.

But what if there are multiple readers? The writer would then take more time to feed multiple readers, but with a fast transport, a few more readers may only increase the 11 ms to 12 ms or so. Eventually, however, the number of readers will justify the use of multicast. Even in pull mode, Connext supports multicast by measuring how many multicast readers have requested sample repair. If the writer does not delay its response to a NACK, repairs will be sent in unicast. But a suitable NACK delay allows the writer to collect NACKs from potentially multiple readers and send a single multicast packet. As discussed in Section 10.3.7.2, however, by delaying the reply to coalesce responses, we may end up waiting much longer than desired. On a Windows system with a 10 ms minimum achievable sleep, the delay would add at least 10 ms to the 11 ms, so that the time to push 1000 samples now increases to 50 times 21 ms = 1.05 seconds. It would appear that we would not be able to keep up with the incoming bursts if they came roughly every second, although we would put fewer packets on the wire by taking advantage of multicast.

Line 25-Line 28 (Figure 10.12): We now understand how the writer feeds the reader in response to NACKs. But how does the reader realize that it is behind? The writer notifies the reader with a Heartbeat to kick-start the exchange. Therefore, the latency is bounded from below by the writer’s fast heartbeat period. If the application is not particularly sensitive to latency, the minimum wait time supported by the OS (10 ms on Windows systems, for example) might be a reasonable value.

Line 29 (Figure 10.12): With a fast heartbeat period of 50 ms, a writer will take 500 ms (50 ms times the default max_heartbeat_retries of 10) to write off an unresponsive reader. If a reader crashes while we are writing a lot of samples per second, the writer queue may completely fill up before the writer has a chance to drop the crashed reader. Lowering max_heartbeat_retries will prevent that scenario.

Line 31-Line 32 (Figure 10.12): For an aperiodic writer, turning off slow periodic Heartbeats will remove unwanted traffic from the network.


Figure 10.13 shows example code for a corresponding aperiodic, bursty reader.

Figure 10.13 QoS for an Aperiodic, Bursty Reader

 1  qos->reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
 2  qos->history.kind = DDS_KEEP_ALL_HISTORY_QOS;
 3  qos->resource_limits.initial_samples =
 4      qos->resource_limits.max_samples =
 5      qos->reader_resource_limits.max_samples_per_remote_writer = 32;
 6
 7  // use these hard-coded values until you use keys
 8  qos->resource_limits.max_samples_per_instance =
 9      qos->resource_limits.max_samples;
10  qos->resource_limits.initial_instances =
11      qos->resource_limits.max_instances = 1;
12
13  // the writer probably has more for the reader; ask right away
14  qos->protocol.rtps_reliable_reader.min_heartbeat_response_delay.sec = 0;
15  qos->protocol.rtps_reliable_reader.min_heartbeat_response_delay.nanosec = 0;
16  qos->protocol.rtps_reliable_reader.max_heartbeat_response_delay.sec = 0;
17  qos->protocol.rtps_reliable_reader.max_heartbeat_response_delay.nanosec = 0;

Line 1-Line 2 (Figure 10.13): Unlike a writer, the reader’s default reliability setting is best-effort, so reliability must be turned on. Since we don’t want to drop anything, we choose KEEP_ALL for the History QoS kind.

Line 3-Line 5 (Figure 10.13): Unlike the writer’s, the reader’s queue can be kept small, since the reader is free to send ACKs for as much as it wants anyway. In general, the larger the queue, the larger the repair packets can be, and the higher the throughput will be. When the reader NACKs for lost samples, it will only ask for this much.

Line 7-Line 11 (Figure 10.13): We do not use keys in this example.

Line 13-Line 17 (Figure 10.13): We respond immediately to catch up as soon as possible. When there are many readers, this may cause a NACK storm, as discussed for the one-at-a-time reliable reader (Section 10.3.7.2).

10.3.7.4 Periodic

In a periodic reliable model, we can use the writer and the reader queue to keep the data flowing at a smooth rate. The data flows from the sending application to the writer queue, then to the transport, then to the reader queue, and finally to the receiving application. Unless the sending application or any one of the receiving applications becomes unresponsive (including a crash) for a noticeable duration, this flow should continue uninterrupted.

The latency will be low in most cases, but will be several times higher for the recovered and many subsequent samples. In the event of a disruption (e.g., loss in transport, or one of the readers becoming temporarily unresponsive), the writer’s queue level will rise, and may even block in the worst case. If the writing thread must not block, the writer’s queue must be sized sufficiently large to deal with any fluctuation in the system. Figure 10.14 shows an example, with line-by-line analysis below.

Figure 10.14 QoS for a Periodic Reliable Writer

 1  qos->reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
 2  qos->history.kind = DDS_KEEP_ALL_HISTORY_QOS;
 3  qos->protocol.push_on_write = DDS_BOOLEAN_TRUE;
 4
 5  // use these hard-coded values until you use keys
 6  qos->resource_limits.initial_instances =
 7      qos->resource_limits.max_instances = 1;
 8
 9  int unresolvedSamplePerRemoteWriterMax =
10      worstCaseApplicationDelayTimeInMs * dataRateInHz / 1000;
11  qos->resource_limits.max_samples = unresolvedSamplePerRemoteWriterMax;
12  qos->resource_limits.initial_samples = qos->resource_limits.max_samples/2;
13  qos->resource_limits.max_samples_per_instance =
14      qos->resource_limits.max_samples;
15
16  int piggybackEvery = 8;
17  qos->protocol.rtps_reliable_writer.heartbeats_per_max_samples =
18      qos->resource_limits.max_samples / piggybackEvery;
19
20  qos->protocol.rtps_reliable_writer.high_watermark = piggybackEvery * 4;
21  qos->protocol.rtps_reliable_writer.low_watermark = piggybackEvery * 2;
22  qos->reliability.max_blocking_time = blockingTime;
23
24  qos->protocol.rtps_reliable_writer.min_nack_response_delay.sec = 0;
25  qos->protocol.rtps_reliable_writer.min_nack_response_delay.nanosec = 0;
26
27  qos->protocol.rtps_reliable_writer.max_nack_response_delay.sec = 0;
28  qos->protocol.rtps_reliable_writer.max_nack_response_delay.nanosec = 0;
29
30  qos->protocol.rtps_reliable_writer.fast_heartbeat_period.sec = 0;
31  qos->protocol.rtps_reliable_writer.fast_heartbeat_period.nanosec =
32      alertReaderWithinThisMs * 1000000;
33  qos->protocol.rtps_reliable_writer.max_heartbeat_retries = 7;
34
35  // essentially turn off slow HB period
36  qos->protocol.rtps_reliable_writer.heartbeat_period.sec = 3600 * 24 * 7;

Line 1 (Figure 10.14): This is the default setting for a writer, shown here strictly for clarity.

Line 2 (Figure 10.14): Since we do not want any data lost, we set the History kind to KEEP_ALL.

Line 3 (Figure 10.14): This is the default setting for a writer, shown here strictly for clarity. Pushing will yield lower latency than pulling.

Line 5-Line 7 (Figure 10.14): We do not use keys in this example, so there is only one instance.

Line 9-Line 11 (Figure 10.14): Though this is a simplistic queue model, it is consistent with the idea that the queue size should be proportional to the data rate and the worst-case jitter in communication.

Line 12 (Figure 10.14): Even though we have sized the queue according to the worst case, there is a possibility of saving some memory in the normal case. Here, we initially size the queue to be only half of the worst case, hoping that the worst case will not occur. When it does, Connext will keep increasing the queue size as necessary to accommodate new samples, until the maximum is reached. So when our optimistic initial queue size is breached, we will incur the penalty of dynamic memory allocation. Furthermore, you will wind up using more memory, as the initially allocated memory will be orphaned (note: this does not mean a memory leak or dangling pointer); if the initial queue size is M_i and the maximal queue size is M_m, where M_m = M_i * 2^n, the memory wasted in the worst case will be (M_m - M_i) * sizeof(sample) bytes. Note that the memory allocation can be avoided by setting the initial queue size equal to its max value.
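As a worked example with illustrative numbers: with M_i = 128 and M_m = 1024 (n = 3), the successive doublings orphan 128 + 256 + 512 = 896 = M_m - M_i sample slots in the worst case.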

Line 13-Line 14 (Figure 10.14): If there is only one instance, maximum samples per instance is the same as maximum samples allowed.

Line 16-Line 18 (Figure 10.14): Since we are pushing out the data at a potentially rapid rate, the piggyback heartbeat will be useful in letting the reader know about any missing samples. The piggybackEvery can be increased if the writer is writing at a fast rate, with the cost that more samples will need to queue up for possible resend. That is, you can consider the piggyback heartbeat to be taking over one of the roles of the periodic heartbeat in the case of a push. So sending fewer samples between piggyback heartbeats is akin to decreasing the fast heartbeat period seen in previous sections. Please note that piggybackEvery cannot be expressed directly as its own QoS, but only indirectly through the maximum samples.

Line 20-Line 22 (Figure 10.14): If piggybackEvery were exactly identical to the fast heartbeat, there would be no need for the fast heartbeat or the high watermark. But one of the important roles for the fast heartbeat period is to allow a writer to abandon inactive readers before the queue fills. If the high watermark is set equal to the queue size, the writer would not doubt the status of an unresponsive reader until the queue completely fills, blocking on the next write (up to blockingTime). By lowering the high watermark, you can control how vigilant a writer is about checking the status of unresponsive readers. By scaling the high watermark to piggybackEvery, the writer is expressing confidence that an alive reader will respond promptly within the time it would take a writer to send 4 times piggybackEvery samples. If the reader does not delay the response too long, this is a good assumption. Even if the writer estimated on the low side and does go into fast mode (suspecting that the reader has crashed) when a reader is temporarily unresponsive (e.g., when it is performing heavy computation for a few milliseconds), a response from the reader in question will resolve any doubt, and data delivery can continue uninterrupted. As the reader catches up to the writer and the queue level falls below the low watermark, the writer will pop back to the normal, relaxed mode.

Line 24-Line 28 (Figure 10.14): When a reader is behind (including a reader whose Durability QoS is non-VOLATILE and therefore needs to catch up to the writer as soon as it is created), how quickly the writer responds to the reader’s request will determine the catch-up rate. A multicast writer (that is, a writer with multicast readers), on the other hand, may consider delaying its response for some time to take advantage of coalesced multicast packets. Keep in mind the OS delay-resolution issue discussed in the previous section.

Line 30-Line 33 (Figure 10.14): The fast heartbeat mechanism allows a writer to detect a crashed reader and move along with the remaining readers when a reader does not respond to any of the max_heartbeat_retries number of heartbeats sent at the fast_heartbeat_period rate. So if you want a more cautious writer, decrease either number; conversely, increasing either number will result in a writer that is more reluctant to write off an unresponsive reader.

Line 35-Line 36 (Figure 10.14): Since this is a periodic model, a separate periodic heartbeat to announce the writer’s status would seem unwarranted; the piggyback heartbeat sent with samples takes over that role.

Figure 10.15 shows how to set the QoS for a matching reader, followed by a line-by-line explanation.

Figure 10.15 QoS for a Periodic Reliable Reader

 1  qos->reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
 2  qos->history.kind = DDS_KEEP_ALL_HISTORY_QOS;
 3  qos->resource_limits.initial_samples =
 4      qos->resource_limits.max_samples =
 5      qos->reader_resource_limits.max_samples_per_remote_writer =
 6      ((2*piggybackEvery - 1) + dataRateInHz * delayInMs / 1000);
 7
 8  // use these hard-coded values until you use keys
 9  qos->resource_limits.max_samples_per_instance =
10      qos->resource_limits.max_samples;
11  qos->resource_limits.initial_instances =
12      qos->resource_limits.max_instances = 1;
13
14  qos->protocol.rtps_reliable_reader.min_heartbeat_response_delay.sec = 0;
15  qos->protocol.rtps_reliable_reader.min_heartbeat_response_delay.nanosec = 0;
16  qos->protocol.rtps_reliable_reader.max_heartbeat_response_delay.sec = 0;
17  qos->protocol.rtps_reliable_reader.max_heartbeat_response_delay.nanosec = 0;

Line 1-Line 2 (Figure 10.15): Unlike a writer, the reader’s default reliability setting is best-effort, so reliability must be turned on. Since we don’t want to drop anything, we choose KEEP_ALL for the History QoS.

Line 3-Line 6 (Figure 10.15): Unlike the writer’s, the reader’s queue is sized not according to the jitter of the reader, but rather according to how many samples you want to cache speculatively in case of a gap in the sequence of samples that the reader must recover. Remember that a reader will stop handing out a sequence of samples as soon as an unintended gap appears, because the definition of strict reliability includes in-order delivery. If the queue size were 1, the reader would have no choice but to drop all subsequent samples received until the one being sought is recovered. Connext uses speculative caching, which minimizes the disruption caused by a few dropped samples. Even for the same duration of disruption, the demand on the reader queue size is greater if the writer sends more rapidly. In sizing the reader queue, we consider two factors that comprise the lost-sample recovery time:

How long it takes a reader to request a resend from the writer.

The piggyback heartbeat tells a reader about the writer’s state. If only samples between two piggybacked samples are dropped, the reader must cache piggybackEvery samples before asking the writer for a resend. But if a piggybacked sample is also lost, the reader will not get around to asking the writer until the next piggybacked sample is received. Note that in this worst-case calculation, we are ignoring stand-alone heartbeats (i.e., heartbeats from the writer that are not piggybacked). Of course, the reader may drop any number of heartbeats, including the stand-alone ones; in this sense, there is no such thing as the absolute worst case, just a reasonable worst case, where the probability of consecutive drops is acceptably low. For the majority of applications, even two consecutive drops are unlikely, in which case we need to cache at most (2*piggybackEvery - 1) samples before the reader will ask the writer to resend, assuming no delay (Line 14-Line 17).

How long it takes for the writer to respond to the request.

Even ignoring the flight time of the resend request through the transport, the writer takes a finite time to respond to the repair request, mostly when the writer delays its reply for multicast readers. In the case of an immediate response, the processing time on the writer end, as well as the flight time of the messages to and from the writer, does not matter unless the data rate is very large; it is the product of data rate and delay that matters. If the delay for multicast is random (that is, the minimum and the maximum delays are not equal), one would have to use the maximum delay to be conservative.
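As a worked example with illustrative numbers: with piggybackEvery = 8, dataRateInHz = 1000, and delayInMs = 5, the sizing expression in Figure 10.15 yields (2*8 - 1) + 1000*5/1000 = 15 + 5 = 20 samples.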

Line 8-Line 12 (Figure 10.15): Since we are not using keys, there is just one instance.

Line 14-Line 17 (Figure 10.15): If we are not using multicast, or if the number of readers being fed by the writer is small, there is no reason to delay.


Chapter 11 Collaborative DataWriters

The Collaborative DataWriters feature allows you to have multiple DataWriters publishing samples from a common logical data source. The DataReaders will combine the samples coming from these DataWriters in order to reconstruct the correct order in which they were produced at the source. This combination process for the DataReaders can be configured using the AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1). It requires the middleware to provide a way to uniquely identify every sample published in a domain independently of the actual DataWriter that published the sample.

In Connext, every modification (sample) to the global dataspace made by a DataWriter within a domain is identified by a pair (virtual GUID, sequence number).

The virtual GUID (Global Unique Identifier) is a 16-byte character identifier associated with the logical data source. DataWriters can be assigned a virtual GUID using virtual_guid in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).

The virtual sequence number is a 64-bit integer that identifies changes within the logical data source.

Several DataWriters can be configured with the same virtual GUID. If each of these DataWriters publishes a sample with sequence number '0', the sample will only be received once by the DataReaders subscribing to the content published by the DataWriters (see Figure 11.1).

Figure 11.1 Global Dataspace Changes


11.1 Collaborative DataWriters Use Cases

Ordered delivery of samples in high availability scenarios

One example of this is RTI Persistence Service[1]. When a late-joining DataReader configured with DURABILITY QosPolicy (Section 6.5.7) set to PERSISTENT or TRANSIENT joins a DDS domain, it will start receiving samples from multiple DataWriters. For example, if the original DataWriter is still alive, the newly created DataReader will receive samples from the original DataWriter and from one or more RTI Persistence Service DataWriters (PRSTDataWriters).

Ordered delivery of samples in load-balanced scenarios

Multiple instances of the same application can work together to process and deliver samples. When the samples arrive through different data-paths out of order, the DataReader will be able to reconstruct the order at the source. An example of this is when multiple instances of RTI Persistence Service are used to persist the data. Persisting data to a database on disk can impact performance. By dividing the workload (e.g., samples larger than 10 are persisted by Persistence Service 1, samples smaller than or equal to 10 are persisted by Persistence Service 2) across different instances of RTI Persistence Service using different databases, the user can improve scalability and performance.

Ordered delivery of samples with Group Ordered Access

The Collaborative DataWriters feature can also be used to configure the sample ordering process when the Subscriber is configured with PRESENTATION QosPolicy (Section 6.4.6) access_scope set to GROUP. In this case, the Subscriber must deliver in order the samples published by a group of DataWriters that belong to the same Publisher and have access_scope set to GROUP.

Figure 11.2 Load-Balancing with Persistence Service

[1] For more information on Persistence Service, see Part 6: RTI Persistence Service.


11.2 Sample Combination (Synchronization) Process in a DataReader

A DataReader will deliver a sample (VGUIDn, VSNm) to the application only when one of the following conditions is satisfied:

(VGUIDn, VSNm-1) has already been delivered to the application.

All the known DataWriters publishing VGUIDn have announced that they do not have (VGUIDn, VSNm-1).

None of the known DataWriters publishing VGUIDn have announced potential availability of (VGUIDn, VSNm-1), and a configurable timeout (max_data_availability_waiting_time) expires.

For additional details on how the reconstruction process works see the AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1).

11.3 Configuring Collaborative DataWriters

11.3.1 Associating Virtual GUIDs with Data Samples

There are two ways to associate a virtual GUID with the samples published by a DataWriter.

Per DataWriter: Using virtual_guid in DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).

Per Sample: By setting the writer_guid in the identity field of the WriteParams_t structure provided to the write_w_params operation (see Writing Data (Section 6.3.8)). Since the writer_guid can be set per sample, the same DataWriter can potentially write samples from independent logical data sources. One example of this is RTI Persistence Service where a single persistence service DataWriter can write samples on behalf of multiple original DataWriters.

11.3.2 Associating Virtual Sequence Numbers with Data Samples

You can associate a virtual sequence number with a sample published by a DataWriter by setting the sequence_number in the identity field of the WriteParams_t structure provided to the write_w_params operation (see Writing Data (Section 6.3.8)). Virtual sequence numbers for a given virtual GUID must be strictly monotonically increasing. If you try to write a sample with a sequence number less than or equal to the last sequence number, the write operation will fail.
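A sketch of per-sample identity using the C++ API (Foo/FooDataWriter stand for a hypothetical user type; my_virtual_guid, next_vsn, and sample are assumed variables; error handling omitted):

DDS_WriteParams_t params = DDS_WRITEPARAMS_DEFAULT;
params.identity.writer_guid = my_virtual_guid;     // virtual GUID of the logical source
params.identity.sequence_number.high = 0;
params.identity.sequence_number.low = next_vsn++;  // strictly increasing per virtual GUID
DDS_ReturnCode_t retcode = foo_writer->write_w_params(sample, params);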

11.3.3 Specifying which DataWriters will Deliver Samples to the DataReader from a Logical Data Source

The required_matched_endpoint_groups field in the AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1) can be used to specify the set of DataWriter groups that are expected to provide samples for the same data source (virtual GUID). The quorum count in a group represents the number of DataWriters that must be discovered for that group before the DataReader is allowed to provide non-consecutive samples to the application.

A DataWriter becomes a member of an endpoint group by configuring the role_name in ENTITY_NAME QosPolicy (DDS Extension) (Section 6.5.9).
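As a sketch of both sides of this configuration (the role name "BACKUP_WRITER" and the quorum are illustrative; the field names follow the AVAILABILITY and ENTITY_NAME policies referenced above and should be checked against the API Reference):

/* Sketch: the DataReader requires a quorum of 2 DataWriters with role
   "BACKUP_WRITER" to be discovered before it may deliver
   non-consecutive samples. Names and values are illustrative. */
DDS_DataReaderQos reader_qos;
retcode = subscriber->get_default_datareader_qos(reader_qos);
if (retcode != DDS_RETCODE_OK) { /* Report error */ }

DDS_EndpointGroupSeq& groups =
        reader_qos.availability.required_matched_endpoint_groups;
groups.ensure_length(1, 1);
groups[0].role_name = DDS_String_dup("BACKUP_WRITER");
groups[0].quorum_count = 2;

/* Each participating DataWriter declares the matching role: */
DDS_DataWriterQos writer_qos;
retcode = publisher->get_default_datawriter_qos(writer_qos);
if (retcode != DDS_RETCODE_OK) { /* Report error */ }
writer_qos.publication_name.role_name = DDS_String_dup("BACKUP_WRITER");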


11.3.4 Specifying How Long to Wait for a Missing Sample

A DataReader’s AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1) specifies how long to wait for a missing sample. For example, this is important when the first sample is received: how long do you wait to determine the lowest sequence number available in the system?

The max_data_availability_waiting_time defines how much time to wait before delivering a sample to the application without having received some of the previous samples.

The max_endpoint_availability_waiting_time defines how much time to wait to discover DataWriters providing samples for the same data source (virtual GUID).
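Continuing the reader_qos sketch from Section 11.3.3 above (the durations are illustrative values, not recommendations):

/* Sketch: wait at most 2 seconds for missing data and 5 seconds for
   expected DataWriters to be discovered. Values are illustrative. */
reader_qos.availability.max_data_availability_waiting_time.sec = 2;
reader_qos.availability.max_data_availability_waiting_time.nanosec = 0;
reader_qos.availability.max_endpoint_availability_waiting_time.sec = 5;
reader_qos.availability.max_endpoint_availability_waiting_time.nanosec = 0;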

11.4 Collaborative DataWriters and Persistence Service

The DataWriters created by Persistence Service are automatically configured to collaborate:

Every sample published by the Persistence Service DataWriter keeps its original identity.

Persistence Service associates the role name PERSISTENCE_SERVICE with all the DataWriters that it creates. You can override that setting by changing the DataWriter QoS configuration in Persistence Service.

For more information, see Part 6: RTI Persistence Service.


Chapter 12 Mechanisms for Achieving Information Durability and Persistence

12.1 Introduction

Connext offers the following mechanisms for achieving durability and persistence:

Durable Writer History This feature allows a DataWriter to persist its historical cache, perhaps locally, so that it can survive shutdowns, crashes, and restarts. When an application restarts, each DataWriter that has been configured to have durable writer history automatically loads all of the data in this cache from disk and can carry on sending data as if it had never stopped executing. To the rest of the system, it will appear as if the DataWriter had been temporarily disconnected from the network and then reappeared.

Durable Reader State This feature allows a DataReader to persist its state and remember which data it has already received. When an application restarts, each DataReader that has been configured to have durable reader state automatically loads its state from disk and can carry on receiving data as if it had never stopped executing. Data that had already been received by the DataReader before the restart will be suppressed so that it is not even sent over the network.

Data Durability This feature is a full implementation of the OMG DDS Persistence Profile. The DURABILITY QosPolicy (Section 6.5.7) allows an application to configure a DataWriter so that the information written by the DataWriter survives beyond the lifetime of the DataWriter. In this manner, a late-joining DataReader can subscribe to and receive the information even after the DataWriter application is no longer executing. To use this feature, you need Persistence Service, a separate application described in Chapter 26: Introduction to RTI Persistence Service.

These features can be configured separately or in combination. To use Durable Writer History and Durable Reader State, you need a relational database, which is not included with Connext. Supported databases are listed in the Release Notes. Persistence Service does not require a database when used in TRANSIENT mode (see Section 12.5.1) or in PERSISTENT mode with file-system storage (see Section 12.5.1 and Section 27.5).

To understand how these features interact we will examine the behavior of the system using the following scenarios:

Scenario 1. DataReader Joins after DataWriter Restarts (Durable Writer History) (Section 12.1.1)

Scenario 2: DataReader Restarts While DataWriter Stays Up (Durable Reader State) (Section 12.1.2)


Scenario 3. DataReader Joins after DataWriter Leaves Domain (Durable Data) (Section 12.1.3)

12.1.1 Scenario 1. DataReader Joins after DataWriter Restarts (Durable Writer History)

In this scenario, a DomainParticipant joins the domain, creates a DataWriter and writes some data; then the DataWriter shuts down (gracefully or due to a fault). The DataWriter restarts and a DataReader joins the domain. Depending on whether the DataWriter is configured with durable history, the late-joining DataReader may or may not receive the data already published by the DataWriter before it restarted. This is illustrated in Figure 12.1. For more information, see Durable Writer History (Section 12.3).

Figure 12.1 Durable Writer History

Without Durable Writer History: the late-joining DataReader will not receive data (a and b) that was published before the DataWriter's restart.

With Durable Writer History: the restarted DataWriter will recover its history and deliver its data to the late-joining DataReader.

12.1.2 Scenario 2: DataReader Restarts While DataWriter Stays Up (Durable Reader State)

In this scenario, two DomainParticipants join a domain; one creates a DataWriter and the other a DataReader on the same Topic. The DataWriter publishes some data ("a" and "b") that is received by the DataReader. After this, the DataReader shuts down (gracefully or due to a fault) and then restarts—all while the DataWriter remains present in the domain.

Depending on whether the DataReader is configured with Durable Reader State, the DataReader may or may not receive a duplicate copy of the data it received before it restarted. This is illustrated in Figure 12.2. For more information, see Durable Reader State (Section 12.4).


Figure 12.2 Durable Reader State

Without Durable Reader State: the DataReader will receive again the data that it already received before the restart.

With Durable Reader State: the DataReader remembers that it already received the data and does not request it again.

12.1.3 Scenario 3. DataReader Joins after DataWriter Leaves Domain (Durable Data)

In this scenario, a DomainParticipant joins a domain, creates a DataWriter, publishes some data on a Topic and then shuts down (gracefully or due to a fault). Later, a DataReader joins the domain and subscribes to the data. Persistence Service is running.

Depending on whether Durable Data is enabled for the Topic, the DataReader may or may not receive the data previously published by the DataWriter. This is illustrated in Figure 12.3. For more information, see Data Durability (Section 12.5).

Figure 12.3 Durable Data

Without Durable Data: the late-joining DataReader will not receive data (a and b) that was published before the DataWriter quit.

With Durable Data: Persistence Service remembers what data was published and delivers it to the late-joining DataReader.

This third scenario is similar to Scenario 1. DataReader Joins after DataWriter Restarts (Durable Writer History) (Section 12.1.1) except that in this case the DataWriter does not need to restart for the DataReader to get the data previously written by the DataWriter. This is because Persistence Service acts as an intermediary that stores the data so it can be given to late-joining DataReaders.


12.2 Durability and Persistence Based on Virtual GUIDs

Every modification to the global dataspace made by a DataWriter is identified by a pair (virtual GUID, sequence number).

The virtual GUID (Global Unique Identifier) is a 16-byte character identifier associated with a DataWriter or DataReader; it is used to uniquely identify this entity in the global data space.

The sequence number is a 64-bit identifier that identifies changes published by a specific DataWriter.

Several DataWriters can be configured with the same virtual GUID. If each of these DataWriters publishes a sample with sequence number '0', the sample will only be received once by the DataReaders subscribing to the content published by the DataWriters (see Figure 12.4).

Figure 12.4 Global Dataspace Changes

Additionally, Connext uses the virtual GUID to associate a persisted state (state in permanent storage) to the corresponding Entity.

For example, the history of a DataWriter will be persisted in a database table with a name generated from the virtual GUID of the DataWriter. If the DataWriter is restarted, it must be associated with the same virtual GUID to restore its previous history.

Likewise, the state of a DataReader will be persisted in a database table whose name is generated from the DataReader virtual GUID (see Figure 12.5).

Figure 12.5 History/State Persistence Based on the Virtual GUID

A DataWriter’s virtual GUID can be configured using the member virtual_guid in the DATA_WRITER_PROTOCOL QosPolicy (DDS Extension) (Section 6.5.3).

A DataReader’s virtual GUID can be configured using the member virtual_guid in the DATA_READER_PROTOCOL QosPolicy (DDS Extension) (Section 7.6.1).
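For example, a minimal sketch of these settings (the GUID bytes are hypothetical; the virtual_guid member is the one named in the policies above, exposed through the protocol field of the QoS structures):

/* Sketch: assign a well-known virtual GUID so that persisted
   history/state can be re-associated after a restart.
   The GUID bytes are hypothetical. */
DDS_DataWriterQos writer_qos;
retcode = publisher->get_default_datawriter_qos(writer_qos);
if (retcode != DDS_RETCODE_OK) { /* Report error */ }
for (int i = 0; i < 16; ++i) {
    writer_qos.protocol.virtual_guid.value[i] = (DDS_Octet)(i + 1);
}

/* The same pattern applies to a DataReader: */
DDS_DataReaderQos reader_qos;
retcode = subscriber->get_default_datareader_qos(reader_qos);
if (retcode != DDS_RETCODE_OK) { /* Report error */ }
for (int i = 0; i < 16; ++i) {
    reader_qos.protocol.virtual_guid.value[i] = (DDS_Octet)(i + 1);
}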


The DDS_PublicationBuiltinTopicData and DDS_SubscriptionBuiltinTopicData structures include the virtual GUID associated with the discovered publication or subscription (see Built-in DataReaders (Section 16.2)).

12.3 Durable Writer History

The DURABILITY QosPolicy (Section 6.5.7) controls whether or not, and how, published samples are stored by the DataWriter application for DataReaders that are found after the samples were initially written. The samples stored by the DataWriter constitute the DataWriter’s history.

Connext provides the capability to make the DataWriter history durable, by persisting its content in a relational database. This makes it possible for the history to be restored when the DataWriter restarts. See the Release Notes for the list of supported relational databases.

The association between the history stored in the database and the DataWriter is done using the virtual GUID.

12.3.1 Durable Writer History Use Case

The following use case describes the durable writer history functionality:

1. A DataReader receives two samples with sequence numbers 1 and 2, published by a DataWriter with virtual GUID 1.

2. The process running the DataWriter is stopped and a new late-joining DataReader is created.

The new DataReader with virtual GUID 2 does not receive samples 1 and 2 because the original DataWriter has been destroyed. If the samples must be available to late-joining DataReaders after the DataWriter deletion, you can use Persistence Service, described in Chapter 26: Introduction to RTI Persistence Service.

3. The DataWriter is restarted using the same virtual GUID.

After being restarted, the DataWriter restores its history. The late-joining DataReader will receive samples 1 and 2 because they were not received previously. The DataReader with virtual GUID 1 will not receive samples 1 and 2 because it already received them.

4. The DataWriter publishes two new samples.

 

 

 

The two new samples with sequence numbers 3 and 4 will be received by both DataReaders.

12.3.2 How To Configure Durable Writer History

Connext allows a DataWriter’s history to be stored in a relational database that provides an ODBC driver.

For each DataWriter history that is configured to be durable, Connext will create a maximum of two tables:

The first table is used to store the samples associated with the writer history. The name of that table is WS<32 uuencoding of the writer virtual GUID>.

The second table is only created for keyed topics; it is used to store the instances associated with the writer history. The name of the second table is WI<32 uuencoding of the writer virtual GUID>.

To configure durable writer history, use the PROPERTY QosPolicy (DDS Extension) (Section 6.5.17) associated with DataWriters and DomainParticipants.

A ‘durable writer history’ property defined in the DomainParticipant will be applicable to all the DataWriters belonging to the DomainParticipant unless it is overwritten by the DataWriter. Table 12.1 lists the supported ‘durable writer history’ properties.

Table 12.1 Durable Writer History Properties

dds.data_writer.history.plugin_name
Required. Must be set to "dds.data_writer.history.odbc_plugin.builtin" to enable durable writer history in the DataWriter.

dds.data_writer.history.odbc_plugin.dsn
Required. The ODBC DSN (Data Source Name) associated with the database where the writer history must be persisted.

dds.data_writer.history.odbc_plugin.driver
Tells Connext which ODBC driver to load. If the property is not specified, Connext will try to use the standard ODBC driver manager library (UnixOdbc on UNIX/Linux systems, the Windows ODBC driver manager on Windows systems).

dds.data_writer.history.odbc_plugin.username
dds.data_writer.history.odbc_plugin.password
Configure the username/password used to connect to the database. Default: no username or password.

dds.data_writer.history.odbc_plugin.shared
When set to 1, Connext will create a single connection per DSN that will be shared across DataWriters within the same Publisher. A DataWriter can be configured to create its own database connection by setting this property to 0 (the default).

dds.data_writer.history.odbc_plugin.instance_cache_max_size
dds.data_writer.history.odbc_plugin.instance_cache_init_size
dds.data_writer.history.odbc_plugin.sample_cache_max_size
dds.data_writer.history.odbc_plugin.sample_cache_init_size
These properties configure the resource limits associated with the ODBC writer history caches. To minimize the number of accesses to the database, Connext uses two caches, one for samples and one for instances. The initial and maximum sizes of these caches are configured using these properties. The resource limits initial_instances, max_instances, initial_samples, max_samples, and max_samples_per_instance defined in RESOURCE_LIMITS QosPolicy (Section 6.5.20) are used to configure the maximum number of samples and instances that can be stored in the relational database.
Defaults:
instance_cache_max_size: max_instances in RESOURCE_LIMITS QosPolicy (Section 6.5.20)
instance_cache_init_size: initial_instances in RESOURCE_LIMITS QosPolicy (Section 6.5.20)
sample_cache_max_size: 32
sample_cache_init_size: 32
Note: If the property in_memory_state (see below in this table) is 1, then instance_cache_max_size is always equal to max_instances in RESOURCE_LIMITS QosPolicy (Section 6.5.20); it cannot be changed.

dds.data_writer.history.odbc_plugin.restore
Indicates whether or not the persisted writer history must be restored once the DataWriter is restarted. If this property is 0, the content of the database associated with the DataWriter being restarted will be deleted. If it is 1, the DataWriter will restore its previous state from the database content. Default: 1.

dds.data_writer.history.odbc_plugin.in_memory_state
Determines how much state will be kept in memory by the ODBC writer history in order to avoid accessing the database. If this property is 1, then the property instance_cache_max_size (see above in this table) is always equal to max_instances in RESOURCE_LIMITS QosPolicy (Section 6.5.20); it cannot be changed. In addition, the ODBC writer history will keep in memory a fixed state overhead of 24 bytes per sample. This mode provides the best ODBC writer history performance. However, the restore operation will be slower, and the maximum number of samples that the writer history can manage is limited by the available physical memory. If it is 0, all the state will be kept in the underlying database. In this mode, the maximum number of samples in the writer history is not limited by the available physical memory. Default: 1.

Note: Durable Writer History is not supported for Multi-channel DataWriters (see Chapter 18) or when Batching is enabled (see Section 6.5.2); an error is reported if this type of DataWriter tries to configure Durable Writer History.

See also: Durable Reader State (Section 12.4).

Example C++ Code

/* Get default QoS */
...
retcode = DDSPropertyQosPolicyHelper::add_property(
        writerQos.property,
        "dds.data_writer.history.plugin_name",
        "dds.data_writer.history.odbc_plugin.builtin",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

retcode = DDSPropertyQosPolicyHelper::add_property(
        writerQos.property,
        "dds.data_writer.history.odbc_plugin.dsn",
        "<user DSN>",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

retcode = DDSPropertyQosPolicyHelper::add_property(
        writerQos.property,
        "dds.data_writer.history.odbc_plugin.driver",
        "<ODBC library>",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

retcode = DDSPropertyQosPolicyHelper::add_property(
        writerQos.property,
        "dds.data_writer.history.odbc_plugin.shared",
        "<0|1>",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

/* Create DataWriter */
...

12.4 Durable Reader State

Durable reader state allows a DataReader to locally store its state on disk and remember the data that has already been processed by the application.1 When an application restarts, each DataReader configured to have durable reader state automatically reads its state from disk. Data that has already been processed by the application before the restart will not be provided to the application again.

Important: The DataReader does not persist the full contents of the data in its historical cache; it only persists an identification (e.g., sequence numbers) of the data the application has processed. This distinction is not meaningful if your application always uses the 'take' methods to access your data, since these methods remove the data from the cache at the same time they deliver it to your application (see Read vs. Take (Section 7.4.3.1)). However, if your application uses the 'read' methods, leaving the data in the DataReader's cache after you've accessed it for the first time, those previously viewed samples will not be restored to the DataReader's cache in the event of a restart.

Connext requires a relational database to persist the state of a DataReader. This database is accessed using ODBC. See the Release Notes for the list of supported relational databases.

12.4.1 Durable Reader State With Protocol Acknowledgment

For each DataReader configured to have durable state, Connext will create one database table with the following naming convention: RS<32 uuencoding of the reader virtual GUID>. This table will store the last sequence number processed from each virtual GUID. For DataReaders on keyed topics requesting instance-ordering (see PRESENTATION QosPolicy (Section 6.4.6)), this state will be stored per instance per virtual GUID.

1. The circumstances under which a data sample is considered "processed by the application" are described in the sections that follow.

Criteria to consider a sample “processed by the application”

For the read/take methods that require calling return_loan(), a sample 's1' with sequence number 's1_seq_num' and virtual GUID ‘vg1’ is considered processed by the application when the DataReader’s return_loan() operation is called for sample 's1' or any other sample with the same virtual GUID and a sequence number greater than 's1_seq_num'. For example:

retcode = Foo_reader->take(data_seq, info_seq, DDS_LENGTH_UNLIMITED,
                           DDS_ANY_SAMPLE_STATE, DDS_ANY_VIEW_STATE,
                           DDS_ANY_INSTANCE_STATE);
if (retcode == DDS_RETCODE_NO_DATA) {
    return;
} else if (retcode != DDS_RETCODE_OK) {
    /* Report error */
    return;
}

for (i = 0; i < data_seq.length(); ++i) {
    /* Operate on the data */
}

/* Return the loan */
retcode = Foo_reader->return_loan(data_seq, info_seq);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

/* At this point the samples contained in data_seq are considered
   received. If the DataReader restarts, the samples will not be
   received again. */

For the read/take methods that do not require calling return_loan(), a sample 's1' with sequence number 's1_seq_num' and virtual GUID ‘vg1’ will be considered processed after the application reads or takes the sample 's1' or any other sample with the same virtual GUID and with a sequence number greater than 's1_seq_num'. For example:

retcode = Foo_reader->take_next_sample(data, info);

/* At this point the sample contained in data is considered received,
   as are all samples with a smaller sequence number. If the DataReader
   restarts, these samples will not be received again. */

Important: If you access the samples in the DataReader cache out of order (for example, via a QueryCondition, by specifying an instance state, or by reading by instance when the PRESENTATION QoS is not set to INSTANCE_PRESENTATION_QOS), then samples that have not yet been taken or read by the application may still be considered "processed by the application".

12.4.1.1 Bandwidth Utilization

To optimize network usage, if a DataReader configured with durable reader state is restarted and it discovers a DataWriter with a virtual GUID 'vg', the DataReader will ACK all the samples with a sequence number smaller than 'sn', where 'sn' is the first sequence number that has not been processed by the application for 'vg'.

Notice that the previous algorithm can significantly reduce the number of duplicates on the wire. However, it does not suppress them completely in the case of keyed DataReaders, where the durable state is kept per (instance, virtual GUID). In this case, and assuming that the application has read samples out of order (e.g., by reading different instances), the ACK is sent for the lowest sequence number processed across all instances and may cause samples already processed to flow on the network again. These redundant samples waste bandwidth, but they will be dropped by the DataReader and not delivered to the application.

12.4.2 Durable Reader State with Application Acknowledgment

This section assumes you are familiar with the concept of Application Acknowledgment as described in Section 6.3.12.

For each DataReader configured to be durable and that uses application acknowledgment (see Section 6.3.12), Connext will create one database table with the following naming convention: RS<32 uuencoding of the reader virtual GUID>. This table will store the list of sequence number intervals that have been acknowledged for each virtual GUID. The size of the column that stores the sequence number intervals is limited to 32767 bytes. If this size is exceeded for a given virtual GUID, the operation that persists the DataReader state into the database will fail.

12.4.2.1 Bandwidth Utilization

To optimize network usage, if a DataReader configured with durable reader state is restarted and it discovers a DataWriter with a virtual GUID ‘vg’, the DataReader will send an APP_ACK message with all the samples that were auto-acknowledged or explicitly acknowledged in previous executions.

Notice that this algorithm can significantly reduce the number of duplicates on the wire. However, it does not suppress them completely since the DataReader may send a NACK and receive some samples from the DataWriter before the DataWriter receives the APP_ACK message.

12.4.3 Durable Reader State Use Case

The following use case describes the durable reader state functionality:

1. A DataReader receives two samples with sequence numbers 1 and 2, published by a DataWriter with virtual GUID 1. The application takes those samples.

2. After the application returns the loan on samples 1 and 2, the DataReader considers them processed and persists the state change (dw vg: 1, last sn: 2).

3. The process running the DataReader is stopped.


4. The DataReader is restarted.

Because all the samples with sequence numbers smaller than or equal to 2 were considered received, the DataReader will not request these samples from the DataWriter.

12.4.4 How To Configure a DataReader for Durable Reader State

To configure a DataReader with durable reader state, use the PROPERTY QosPolicy (DDS Extension) (Section 6.5.17) associated with DataReaders and DomainParticipants.

A property defined in the DomainParticipant will be applicable to all the DataReaders contained in the participant unless it is overwritten by the DataReaders. Table 12.2 lists the supported properties.

Table 12.2 Durable Reader State Properties

dds.data_reader.state.odbc.dsn
Required. The ODBC DSN (Data Source Name) associated with the database where the DataReader state must be persisted.

dds.data_reader.state.filter_redundant_samples
To enable durable reader state, this property must be set to 1. When set to 0, the reader state is not maintained and Connext does not filter duplicate samples that may be coming from the same virtual writer. Default: 1.

dds.data_reader.state.odbc.driver
Indicates which ODBC driver to load. If the property is not specified, Connext will try to use the standard ODBC driver manager library (UnixOdbc on UNIX/Linux systems, the Windows ODBC driver manager on Windows systems).

dds.data_reader.state.odbc.username
dds.data_reader.state.odbc.password
These two properties configure the username and password used to connect to the database. Default: no username or password.

dds.data_reader.state.restore
Indicates whether or not the persisted DataReader state must be restored once the DataReader is restarted. If this property is 0, the previous state will be deleted from the database. If it is 1, the DataReader will restore its previous state from the database content. Default: 1.

dds.data_reader.state.checkpoint_frequency
Controls how often the reader state is stored in the database. A value of N means to store the state once every N samples. A higher value will provide better performance; however, if the reader is restarted it may receive some duplicate samples. These samples will be filtered by Connext and will not be propagated to the application. Default: 1.

dds.data_reader.state.persistence_service.request_depth
Indicates how many of the most recent historical samples the persisted DataReader wants to receive upon start-up. Default: 0.


Example (C++ code):

/* Get default QoS */
...
retcode = DDSPropertyQosPolicyHelper::add_property(
        readerQos.property,
        "dds.data_reader.state.odbc.dsn",
        "<user DSN>",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

retcode = DDSPropertyQosPolicyHelper::add_property(
        readerQos.property,
        "dds.data_reader.state.odbc.driver",
        "<ODBC library>",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

retcode = DDSPropertyQosPolicyHelper::add_property(
        readerQos.property,
        "dds.data_reader.state.restore",
        "<0|1>",
        DDS_BOOLEAN_FALSE);
if (retcode != DDS_RETCODE_OK) {
    /* Report error */
}

/* Create DataReader */
...

12.5 Data Durability

The data durability feature is an implementation of the OMG DDS Persistence Profile. The DURABILITY QosPolicy (Section 6.5.7) allows an application to configure a DataWriter so that the information written by the DataWriter survives beyond the lifetime of the DataWriter.

Connext implements TRANSIENT and PERSISTENT durability using an external service called Persistence Service, available for purchase as a separate RTI product.

Persistence Service receives information from DataWriters configured with TRANSIENT or PERSISTENT durability and makes that information available to late-joining DataReaders—even if the original DataWriter is not running.

The samples published by a DataWriter can be made durable by setting the kind field of the DURABILITY QosPolicy (Section 6.5.7) to one of the following values:

DDS_TRANSIENT_DURABILITY_QOS: Connext will store previously published samples in memory using Persistence Service, which will send the stored data to newly discovered DataReaders.

DDS_PERSISTENT_DURABILITY_QOS: Connext will store previously published samples in permanent storage, like a disk, using Persistence Service, which will send the stored data to newly discovered DataReaders.

A DataReader can request TRANSIENT or PERSISTENT data by setting the kind field of the corresponding DURABILITY QosPolicy (Section 6.5.7). A DataReader requesting PERSISTENT data will not receive data from DataWriters or Persistence Service applications that are configured with TRANSIENT durability.
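As a minimal sketch of the QoS settings involved (writer_qos and reader_qos are QoS structures obtained from the default QoS, as in the other examples in this chapter):

/* Sketch: samples stored durably by Persistence Service */
writer_qos.durability.kind = DDS_PERSISTENT_DURABILITY_QOS;

/* A DataReader requesting persistent data */
reader_qos.durability.kind = DDS_PERSISTENT_DURABILITY_QOS;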


12.5.1 RTI Persistence Service

Persistence Service is a Connext application that is configured to persist topic data. Persistence Service is included with Connext Messaging. For each topic that must be persisted for a specific domain, the service will create a DataWriter (known as PRSTDataWriter) and a DataReader (known as PRSTDataReader). The samples received by the PRSTDataReaders will be published by the corresponding PRSTDataWriters so that they are available to late-joining DataReaders.

For more information on Persistence Service, please see:

Chapter 26: Introduction to RTI Persistence Service

Chapter 27: Configuring Persistence Service

Chapter 28: Running RTI Persistence Service

Persistence Service can be configured to operate in PERSISTENT or TRANSIENT mode:

TRANSIENT mode The PRSTDataReaders and PRSTDataWriters will be created with TRANSIENT durability and Persistence Service will keep the received samples in memory. Samples published by a TRANSIENT DataWriter will survive the DataWriter lifecycle but will not survive the lifecycle of Persistence Service (unless you are running multiple copies).

PERSISTENT mode The PRSTDataWriters and PRSTDataReaders will be created with PERSISTENT durability and Persistence Service will store the received samples in files or in an external relational database. Samples published by a PERSISTENT DataWriter will survive the DataWriter lifecycle as well as any restarts of Persistence Service.

Peer-to-Peer Communication:

By default, a PERSISTENT/TRANSIENT DataReader will receive samples directly from the original DataWriter if it is still alive. In this scenario, the DataReader may also receive the same samples from Persistence Service. Duplicates will be discarded at the middleware level. This peer-to-peer communication pattern is illustrated in Figure 12.6. To use this peer-to-peer communication pattern, set the direct_communication field in the DURABILITY QosPolicy (Section 6.5.7) to TRUE. A PERSISTENT/TRANSIENT DataReader will receive information directly from PERSISTENT/TRANSIENT DataWriters.

Figure 12.6 Peer-to-Peer Communication

(The application only receives one sample.)

Relay Communication:

A PERSISTENT/TRANSIENT DataReader may also be configured to not receive samples from the original DataWriter. In this case the traffic is relayed by Persistence Service. This ‘relay communication’ pattern is illustrated in Figure 12.7. To use relay communication, set the direct_communication field in the DURABILITY QosPolicy (Section 6.5.7) to FALSE. A PERSISTENT/TRANSIENT DataReader will receive all the information from Persistence Service.
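A one-line sketch of the relay configuration (reader_qos as in the previous examples):

/* Sketch: relay communication; durable samples are received only
   through Persistence Service, not directly from the DataWriter */
reader_qos.durability.direct_communication = DDS_BOOLEAN_FALSE;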


Figure 12.7 Relay Communication

Chapter 13 Guaranteed Delivery of Data

13.1 Introduction

Some application scenarios need to ensure that the information produced by certain producers is delivered to all the intended consumers. This chapter describes the mechanisms available in Connext to guarantee the delivery of information from producers to consumers such that the delivery is robust to many kinds of failures in the infrastructure, deployment, and even the producing/consuming applications themselves.

Guaranteed information delivery is not the same as protocol-level reliability (described in Chapter 10: Reliable Communications) or information durability (described in Chapter 12: Mechanisms for Achieving Information Durability and Persistence). Guaranteed information delivery is an end-to-end application-level QoS, whereas the others are middleware-level QoS. There are significant differences between these two:

With protocol-level reliability alone, the producing application knows that the information is received by the protocol layer on the consuming side. However the producing application cannot be certain that the consuming application read that information or was able to successfully understand and process it. The information could arrive in the consumer’s protocol stack and be placed in the DataReader cache but the consuming application could either crash before it reads it from the cache, not read its cache, or read the cache using queries or conditions that prevent that particular data sample from being accessed. Furthermore, the consuming application could access the sample, but not be able to interpret its meaning or process it in the intended way.

With information durability alone, there is no way to specify or characterize the intended consumers of the information. Therefore the infrastructure has no way to know when the information has been consumed by all the intended recipients. The information may be persisted such that it is not lost and is available to future applications, but the infrastructure and producing applications have no way to know that all the intended consumers have joined the system, received the information, and processed it successfully.

The guaranteed data-delivery mechanism provided in Connext overcomes the limitations described above by providing the following features:

Required subscriptions. This feature provides a way to configure, identify and detect the applications that are intended to consume the information. See Required Subscriptions (Section 6.3.13).

Application-level acknowledgments. This feature provides the means to ensure that the information was successfully processed by the application layer in a consumer application. See Application Acknowledgment (Section 6.3.12).


Durable subscriptions. This feature leverages the RTI Persistence Service to persist samples intended for the required subscriptions such that they are delivered even if the originating application is not available. See Configuring Durable Subscriptions in Persistence Service (Section 27.9).

These features used in combination with the mechanisms provided for Information Durability and Persistence (see Chapter 12: Mechanisms for Achieving Information Durability and Persistence) enable the creation of applications where the information delivery is guaranteed despite application and infrastructure failures. Scenarios (Section 13.2) describes various guaranteed-delivery scenarios and how to configure the applications to achieve them.

When implementing an application that needs guaranteed data delivery, we have to consider three key aspects:

Identifying the required consumers of information:
• Required subscriptions
• Durable subscriptions
• EntityName QoS policy
• Availability QoS policy

Ensuring the intended consumer applications process the data successfully:
• Application-level acknowledgment
• Acknowledgment by a quorum of required and durable subscriptions
• Reliability QoS policy (acknowledgment mode)
• Availability QoS policy

Ensuring information is available to late-joining applications:
• Persistence Service
• Durable subscriptions
• Durability QoS policy
• Durable Writer History

13.1.1 Identifying the Required Consumers of Information

The first step towards ensuring that information is processed by the intended consumers is the ability to specify and recognize those intended consumers. This is done using the required subscriptions feature (Required Subscriptions (Section 6.3.13)), configured via the ENTITY_NAME QosPolicy (DDS Extension) (Section 6.5.9) and the AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1).

Connext DDS DataReader entities (as well as DataWriter and DomainParticipant entities) can have a name and a role_name. These names are configured using the ENTITY_NAME QosPolicy (DDS Extension) (Section 6.5.9), which is propagated via DDS discovery and is available as part of the builtin-topic data for the Entity (see Chapter 16: Built-In Topics).

The DDS DomainParticipant, DataReader and DataWriter entities created by RTI-provided applications and services, specifically services such as RTI Persistence Service, automatically configure the ENTITY_NAME QoS policy according to their function. For example, the DataReaders created by RTI Persistence Service have their role_name set to “PERSISTENCE_SERVICE”.

Unless explicitly set by the user, the DomainParticipant, DataReader and DataWriter entities created by end-user applications have their name and role_name set to NULL. However, applications may modify this using the ENTITY_NAME QosPolicy (DDS Extension) (Section 6.5.9).

Connext uses the role_name of DataReaders to identify the consumer’s logical function. For this reason, Connext’s required subscriptions feature relies on the role_name to identify intended consumers of information. The use of the DataReader’s role_name instead of the name is intentional. From the point of view of the information producer, the important thing is not the concrete DataReader (identified by its name, for example, “Logger123”) but rather its logical function in the system (identified by its role_name, for example, “LoggingService”).

A DataWriter that needs to ensure its information is delivered to all the intended consumers uses the AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1) to configure the role names of the consumers that must receive the information.

The AVAILABILITY QoS Policy set on a DataWriter lets an application configure the required consumers of the data produced by the DataWriter. The required consumers are specified in the required_matched_endpoint_groups attribute within the AVAILABILITY QoS Policy. This attribute is a sequence of DDS EndpointGroup structures. Each EndpointGroup represents a required information consumer characterized by the consumer’s role_name and quorum. The role_name identifies a logical consumer; the quorum specifies the minimum number of consumers with that role_name that must acknowledge the sample before the DataWriter can consider it delivered to that required consumer.

For example, an application that wants to ensure data written by a DataWriter is delivered to at least two Logging Services and one Display Service would configure the DataWriter’s AVAILABILITY QoS Policy with a required_matched_endpoint_groups consisting of two elements. The first element would specify a required consumer with the role_name “LoggingService” and a quorum of 2. The second element would specify a required consumer with the role_name “DisplayService” and a quorum of 1. Furthermore, the application would set the logging service DataReader ENTITY_NAME policy to have a role_name of “LoggingService” and similarly the display service DataReader ENTITY_NAME policy to have the role_name of “DisplayService.”
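A sketch of this example configuration follows (the field names follow the AVAILABILITY and ENTITY_NAME policies referenced above and should be checked against the API Reference; error handling is as in earlier examples):

/* Sketch: the DataWriter requires acknowledgment from a quorum of
   2 "LoggingService" readers and 1 "DisplayService" reader. */
DDS_DataWriterQos writer_qos;
retcode = publisher->get_default_datawriter_qos(writer_qos);
if (retcode != DDS_RETCODE_OK) { /* Report error */ }

DDS_EndpointGroupSeq& groups =
        writer_qos.availability.required_matched_endpoint_groups;
groups.ensure_length(2, 2);
groups[0].role_name = DDS_String_dup("LoggingService");
groups[0].quorum_count = 2;
groups[1].role_name = DDS_String_dup("DisplayService");
groups[1].quorum_count = 1;

/* Each logging-service DataReader declares its role: */
DDS_DataReaderQos reader_qos;
retcode = subscriber->get_default_datareader_qos(reader_qos);
if (retcode != DDS_RETCODE_OK) { /* Report error */ }
reader_qos.subscription_name.role_name = DDS_String_dup("LoggingService");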

A DataWriter that has been configured with an AVAILABILITY QoS policy will not remove samples from the DataWriter cache until they have been “delivered” to both the already-discovered DataReaders and the minimum number (quorum) of DataReaders specified for each role. In particular, samples will be retained by the DataWriter if the quorum of matched DataReaders with a particular role_name have not been discovered yet.

We used the word “delivered” in quotes above because the level of assurance a DataWriter has that a particular sample has been delivered depends on the setting of the RELIABILITY QosPolicy (Section 6.5.19). We discuss this next in Section 13.1.2.

13.1.2 Ensuring Consumer Applications Process the Data Successfully

Section 13.1.1 described mechanisms by which an application could configure who the required consumers of information are. This section is about the criteria, mechanisms, and assurance provided by Connext to ensure consumers have the information delivered to them and process it in a successful manner.

RTI provides four levels of information delivery guarantee. You can set your desired level using the RELIABILITY QosPolicy (Section 6.5.19). The levels are:

Best-effort, relying only on the underlying transport The DataWriter considers the sample delivered/acknowledged as soon as it is given to the transport to send to the DataReader’s destination. Therefore, the only guarantee is the one provided by the underlying transport itself. Note that even if the underlying transport is reliable (e.g., shared memory or TCP), the reliability is limited to the transport-level buffers. There is no guarantee that the sample will arrive in the DataReader cache: after the transport delivers the sample to the DataReader’s transport buffers, it could still be dropped because it exceeds a resource limit or fails to deserialize properly, or the receiving application could crash, etc.

Reliable with protocol acknowledgment The DDS-RTPS reliability protocol used by Connext provides acknowledgment at the RTPS protocol level: a DataReader will acknowledge that it has deserialized the sample correctly and stored it in the DataReader’s cache. However, there is no guarantee the application actually processed the sample. The application might crash before processing the sample, or it might simply fail to read it from the cache.

Reliable with Application Acknowledgment (Auto) Application Acknowledgment in Auto mode causes Connext to send an additional application-level acknowledgment (above and beyond the RTPS protocol-level acknowledgment) after the consuming application has read the sample from the DataReader cache and has subsequently called the DataReader’s return_loan() operation (see Section 7.4.2) for that sample. This mode guarantees that the application has fully read the sample, up to the point where it indicates it is done with it. However, it does not provide a guarantee that the application was able to successfully interpret or process the sample. For example, the sample could be a command to execute a certain action, and the application may read the sample and not understand the command or may not be able to execute the action.

Reliable with Application Acknowledgment (Explicit) Application Acknowledgment in Explicit mode causes Connext to send an application-level acknowledgment only after the consuming application has read the sample from the DataReader cache and subsequently called the DataReader’s acknowledge_sample() operation (see Section 7.4.4) for that sample. This mode guarantees that the application has fully read the sample and completed operating on it, as indicated by explicitly calling acknowledge_sample(). In contrast with the Auto mode described above, the application can delay the acknowledgment of the sample beyond the time it holds onto the data buffers, allowing the sample to be processed in a more flexible manner. Like the Auto mode, it does not provide a guarantee that the application was able to successfully interpret or process the sample. For example, the sample could be a command to execute a certain action, and the application may read the sample and not understand the command or may not be able to execute the action. Applications that need guarantees that the data was successfully processed and interpreted should use a request-reply interaction, which is available as part of RTI Connext Messaging (see Part 4: Request-Reply Communication Pattern).
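As a sketch of the Explicit mode described above (the type Foo and the reader foo_reader are placeholders; the QoS field and constant names follow the RELIABILITY policy referenced in this section):

/* Sketch: enable explicit application acknowledgment on the reader
   (the matching DataWriter uses the same acknowledgment_kind). */
reader_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
reader_qos.reliability.acknowledgment_kind =
        DDS_APPLICATION_EXPLICIT_ACKNOWLEDGMENT_MODE;

/* Later, after the application has finished processing a sample: */
Foo data;
DDS_SampleInfo info;
retcode = foo_reader->take_next_sample(data, info);
if (retcode == DDS_RETCODE_OK && info.valid_data) {
    /* ... process the sample, then explicitly acknowledge it ... */
    retcode = foo_reader->acknowledge_sample(info);
    if (retcode != DDS_RETCODE_OK) { /* Report error */ }
}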

13.1.3 Ensuring Information is Available to Late-Joining Applications

The third aspect of guaranteed data delivery addresses situations where the application needs to ensure that the information produced by a particular DataWriter is available to DataReaders that join the system after the data was produced. The need for data delivery may even extend beyond the lifetime of the producing application; that is, it may be required that the information is delivered to applications that join the system after the producing application has left the system.

Connext provides four mechanisms to handle these scenarios:

The DDS Durability QoS Policy. The DURABILITY QosPolicy (Section 6.5.7) specifies whether samples should be available to late joiners. The policy is set on the DataWriter and the DataReader and supports four kinds: VOLATILE, TRANSIENT_LOCAL, TRANSIENT, or PERSISTENT. If the DataWriter’s Durability QoS policy is set to VOLATILE kind, the DataWriter’s samples will not be made available to any late joiners. If the DataWriter’s policy kind is set to TRANSIENT_LOCAL, TRANSIENT, or PERSISTENT, the samples will be made available for late-joining DataReaders who also set their DURABILITY QoS policy kind to something other than VOLATILE.

Durable Writer History. A DataWriter configured with a DURABILITY QoS policy kind other than VOLATILE keeps its data in a local cache so that it is available when the late-joining application appears. The data is maintained in the DataWriter’s cache until it is considered to be no longer needed. The precise criteria depend on the configuration of additional QoS policies such as LIFESPAN QoS Policy (Section 6.5.12), HISTORY QosPolicy (Section 6.5.10), RESOURCE_LIMITS QosPolicy (Section 6.5.20), etc. For the purposes of guaranteeing information delivery, it is important to note that the DataWriter’s cache can be configured to be a memory cache or a durable (disk-based) cache. A memory cache will not survive an application restart. However, a durable (disk-based) cache can survive the restart of the producing application. The use of a durable writer history, including the use of an external ODBC database as a cache, is described in Durable Writer History (Section 12.3).

RTI Persistence Service. This service allows the information produced by a DataWriter to survive beyond the lifetime of the producing application. Persistence Service is a stand-alone application that runs on many supported platforms. This service complies with the Persistent Profile of the OMG DDS specification. The service uses DDS to subscribe to the DataWriters that specify a DURABILITY QosPolicy (Section 6.5.7) kind of TRANSIENT or PERSISTENT. Persistence Service receives the data from those DataWriters, stores the data in its internal caches, and makes the data available via DataWriters (which are automatically created by Persistence Service) to late-joining DataReaders that specify a Durability kind of TRANSIENT or PERSISTENT. Persistence Service can operate as a relay for the information from the original writer, preserving the source_timestamp of the data, as well as the original sample virtual writer GUID (see RTI Persistence Service (Section 12.5.1)). In addition, you can configure Persistence Service itself to use a memory-based cache or a durable (disk-based or database-based) cache. See Configuring Persistent Storage (Section 27.6). Configuration of redundant and load-balanced persistence services is also supported.

Durable Subscriptions. This is a Persistence Service configuration setting that allows configuration of the required subscriptions (Identifying the Required Consumers of Information (Section 13.1.1)) for the data stored by Persistence Service (Managing Data Instances (Working with Keyed Data Types) (Section 6.3.14)). Configuring required subscriptions for Persistence Service ensures that the service will store the samples until they have been delivered to the configured number (quorum) of DataReaders that have each of the specified roles.

13.2 Scenarios

In each of the scenarios below, we assume both the DataWriter and DataReader are configured for strict reliability (RELIABLE ReliabilityQosPolicyKind and KEEP_ALL HistoryQosPolicyKind, see Section 10.3.3). As a result, when the DataWriter’s cache is full of unacknowledged samples, the write() operation will block until samples are acknowledged by all the intended consumers.
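A sketch of the strict-reliability settings assumed in these scenarios (writer_qos and reader_qos obtained from the default QoS, as in earlier examples):

/* Sketch: strict reliability on both endpoints (see Section 10.3.3) */
writer_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
writer_qos.history.kind = DDS_KEEP_ALL_HISTORY_QOS;
reader_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
reader_qos.history.kind = DDS_KEEP_ALL_HISTORY_QOS;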

13.2.1 Scenario 1: Guaranteed Delivery to A Priori Known Subscribers

A common use case is to guarantee delivery to a set of known subscribers. These subscribers may be already running and have been discovered, they may be temporarily non-responsive, or it could be that some of those subscribers are still not present in the system. See Figure 13.1 on page 13-6.

To guarantee delivery, the list of required subscribers should be configured using the AVAILABILITY QosPolicy (DDS Extension) (Section 6.5.1) on the DataWriters to specify the role_name and quorum for each required subscription. Similarly, the ENTITY_NAME QosPolicy (DDS Extension) (Section 6.5.9) should be used on the DataReaders to specify their role_name. In addition, we use Application Acknowledgment (Section 6.3.12) to guarantee the sample was delivered and processed by the DataReader.

Figure 13.1 Guaranteed Delivery Scenario 1

The DataWriter and DataReader RELIABILITY QoS Policy can be configured for either AUTO or EXPLICIT application acknowledgment kind. As the DataWriter publishes the sample, it will await acknowledgment from the DataReader (through the protocol-level acknowledgment) and from the subscriber application (through the additional application-level acknowledgment). The DataWriter will only consider the sample acknowledged when it has been acknowledged by all discovered active DataReaders and also by the quorum of each required subscription.

In this specific scenario, DataReader #1 is configured for EXPLICIT application acknowledgment. After reading and processing the sample, the subscribing application calls acknowledge_sample() or acknowledge_all() (see Section 7.4.4). As a result, Connext will send an application-level acknowledgment to the DataWriter, which will in turn confirm the acknowledgment.

If the sample was lost in transit, the reliability protocol will repair the sample. Since it has not been acknowledged, it remains available in the writer’s queue to be automatically resent by Connext. The sample will remain available until acknowledged by the application. If the subscribing application crashes while processing the sample and restarts, Connext will repair the unacknowledged sample. Samples that have already been processed and acknowledged will not be resent.

In this scenario, DataReader #2 may be a late joiner. When it starts up, because it is configured with TRANSIENT_LOCAL Durability, the reliability protocol will re-send the samples previously sent by the writer. These samples were considered unacknowledged by the DataWriter because they had not yet been confirmed by the required subscription (identified by its role_name: ‘logger’).

DataReader #2 does not explicitly acknowledge the samples it reads. It is configured to use AUTO application acknowledgment, which will automatically acknowledge samples that have been read or taken after the application calls the DataReader return_loan operation.

This configuration works well for situations where the DataReader may not be immediately available or may restart. However, this configuration does not provide any guarantee if the DataWriter restarts. When the DataWriter restarts, samples previously unacknowledged are lost and will no longer be available to any late joining DataReaders.

13.2.2 Scenario 2: Surviving a Writer Restart When Delivering Samples to A Priori Known Subscribers

Scenario 1 describes a use case where samples are delivered to a list of a priori known subscribers. In that scenario, Connext will deliver samples to the late-joining or restarting subscriber. However, if the producer is restarted, the samples it had written will no longer be available to future subscribers.

To handle a situation where the producing application is restarted, we will use the Durable Writer History (Section 12.3) feature. See Figure 13.2 on page 13-8.

A DataWriter can be configured to maintain its data and state in durable storage. This configuration is done using the PROPERTY QoS policy, as described in Section 12.3.2. With this configuration, the data samples written by the DataWriter, and any necessary internal state, are persisted by the DataWriter into durable storage. As a result, when the DataWriter restarts, samples that had not been acknowledged by the set of required subscriptions will be resent, and late-joining DataReaders specifying a DURABILITY kind different from VOLATILE will receive the previously-written samples.

13.2.3 Scenario 3: Delivery Guaranteed by Persistence Service (Store and Forward) to A Priori Known Subscribers

Previous scenarios illustrated that using the DURABILITY, RELIABILITY, and AVAILABILITY QoS policies we can ensure that as long as the DataWriter is present in the system, samples written by a DataWriter will be delivered to the intended consumers. The use of the durable writer history in the previous scenario extended this guarantee even in the presence of a restart of the application writing the data.

This scenario addresses the situation where the originating application that produced the data is no longer available. For example, the network could have become partitioned, the application could have been terminated, it could have crashed and not have been restarted, etc.

In order to deliver data to applications that appear after the producing application is no longer available on the network it is necessary to have another service that stores those samples and delivers them. This is the purpose of the RTI Persistence Service.

The RTI Persistence Service can be configured to automatically discover DataWriters that specify a DURABILITY QoS with kind TRANSIENT or PERSISTENT and automatically create pairs (DataReader, DataWriter) that receive and store that information (see Chapter 26: Introduction to RTI Persistence Service). All the DataReaders created by the RTI Persistence Service have the ENTITY_NAME QoS policy set with the role_name of “PERSISTENCE_SERVICE”. This allows an application to specify Persistence Service as one of the required subscriptions for its DataWriters.

In this third scenario, we take advantage of this capability to configure the DataWriter to have the RTI Persistence Service as a required subscription. See Figure 13.3 on page 13-8.

The RTI Persistence Service can also have its DataWriters configured with required subscriptions. This feature is known as Persistence Service "durable subscriptions." DataReader #1 is preconfigured in Persistence Service as a durable subscription. (Alternatively, DataReader #1 could have registered itself dynamically as a durable subscription using the DomainParticipant's register_durable_subscription() operation.)

Figure 13.2 Guaranteed Delivery Scenario 2

Figure 13.3 Guaranteed Delivery Scenario 3

We also configure the RELIABILITY QoS policy, setting its acknowledgment kind to APPLICATION_AUTO_ACKNOWLEDGMENT_MODE, in order to ensure samples are stored in the Persistence Service and properly processed by the consuming application before they are removed from the DataWriter cache.
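A sketch of this writer-side configuration (classic C++ API; we assume the required subscriptions are expressed through the AVAILABILITY QoS policy's enable_required_subscriptions and required_matched_endpoint_groups fields):

DDS_DataWriterQos writer_qos;
publisher->get_default_datawriter_qos(writer_qos);

// Reliable delivery with application-level acknowledgment.
writer_qos.reliability.kind = DDS_RELIABLE_RELIABILITY_QOS;
writer_qos.reliability.acknowledgment_kind =
        DDS_APPLICATION_AUTO_ACKNOWLEDGMENT_MODE;
writer_qos.durability.kind = DDS_TRANSIENT_DURABILITY_QOS;

// Required subscriptions, identified by role_name (assumed field names).
writer_qos.availability.enable_required_subscriptions = DDS_BOOLEAN_TRUE;
writer_qos.availability.required_matched_endpoint_groups.ensure_length(2, 2);
writer_qos.availability.required_matched_endpoint_groups[0].role_name =
        DDS_String_dup("logger");
writer_qos.availability.required_matched_endpoint_groups[0].quorum_count = 1;
writer_qos.availability.required_matched_endpoint_groups[1].role_name =
        DDS_String_dup("PERSISTENCE_SERVICE");
writer_qos.availability.required_matched_endpoint_groups[1].quorum_count = 1;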

With this configuration in place, the DataWriter will deliver samples to the DataReader and to the Persistence Service reliably and wait for the application acknowledgment from both. Delivery of samples to DataReader #1 and the Persistence Service occurs concurrently. The Persistence Service in turn takes responsibility for delivering the samples to the configured "logger" durable subscription. If the original publisher is no longer available, samples can still be delivered by the Persistence Service to DataReader #1 and any other late-joining DataReaders.

When DataReader #1 acknowledges the sample through an application-acknowledgment message, both the original DataWriter and the Persistence Service will receive the application acknowledgment. RTI Connext takes advantage of this to reduce or eliminate delivery of duplicate samples; that is, the Persistence Service can notice that DataReader #1 has acknowledged a sample and refrain from separately sending the same sample to DataReader #1.

13.2.3.1 Variation: Using Redundant Persistence Services

Using a single Persistence Service to guarantee delivery can still raise concerns about the Persistence Service being a single point of failure. To provide added redundancy, the publisher may be configured to await acknowledgment from a quorum of multiple Persistence Services (the role_name remains "PERSISTENCE_SERVICE"). Using this configuration, we can achieve higher levels of redundancy. See Figure 13.4.
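Under the same assumptions as the previous sketch, redundancy changes only the quorum for the Persistence Service endpoint group:

// Require application acknowledgment from 2 Persistence Services.
writer_qos.availability.required_matched_endpoint_groups[1].quorum_count = 2;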

Figure 13.4 Guaranteed Delivery Scenario 3 with Redundant Persistence Service

The RTI Persistence Services will automatically share information to keep each other synchronized. This includes both the data and the information about the durable subscriptions. That is, when a Persistence Service discovers a durable subscription, the information about that durable subscription is automatically replicated and synchronized among the Persistence Services.


13.2.3.2 Variation: Using Load-Balanced Persistence Services

The Persistence Service will store samples on behalf of many DataWriters and, depending on its configuration, it might write those samples to a database or to disk. For this reason, the Persistence Service may become a bottleneck in systems with high durable-sample throughput.

It is possible to run multiple instances of the Persistence Service such that each is responsible for the guaranteed delivery of only a certain subset of the durable data being published. These Persistence Services can also run on different computers, achieving much higher aggregate throughput. For example, depending on the hardware, a single Persistence Service using typical hard drives may be able to store only 30,000 samples per second. By running 10 Persistence Services on 10 different computers, we would be able to store 10 times that system-wide, that is, 300,000 samples per second.

The data to be persisted can be partitioned among the Persistence Services by specifying different Topics to be persisted by each Persistence Service. If a single Topic has more data than can be handled by a single Persistence Service, it is also possible to specify a content filter so that only the data within that Topic that matches the filter will be stored by the Persistence Service. For example, assume the Topic being persisted has a member named "x" of type float. It is possible to configure two Persistence Services, one with the filter "x > 10" and the other with "x <= 10", such that each stores only a subset of the data published on the Topic. See also: Configuring Durable Subscriptions in Persistence Service (Section 27.9).


Chapter 14 Discovery

This chapter discusses how Connext objects on different nodes find out about each other using the default Simple Discovery Protocol (SDP). It describes the sequence of messages that are passed between Connext applications on the sending and receiving sides.

This chapter includes the following sections:

What is Discovery? (Section 14.1)

Configuring the Peers List Used in Discovery (Section 14.2)

Discovery Implementation (Section 14.3)

Debugging Discovery (Section 14.4)

Ports Used for Discovery (Section 14.5)

The discovery process occurs automatically, so you do not have to implement any special code. We recommend that all users read What is Discovery? (Section 14.1) and Configuring the Peers List Used in Discovery (Section 14.2). The remaining sections contain advanced material for those who have a particular need to understand what is happening ‘under the hood.’ This information can help you debug a system in which objects are not communicating.

You may also be interested in reading Chapter 15: Transport Plugins, as well as learning about these QosPolicies:

TRANSPORT_SELECTION QosPolicy (DDS Extension) (Section 6.5.22)

TRANSPORT_BUILTIN QosPolicy (DDS Extension) (Section 8.5.7)

TRANSPORT_UNICAST QosPolicy (DDS Extension) (Section 6.5.23)

TRANSPORT_MULTICAST QosPolicy (DDS Extension) (Section 7.6.5)

14.1 What is Discovery?

Discovery is the behind-the-scenes way in which Connext objects (DomainParticipants, DataWriters, and DataReaders) on different nodes find out about each other. Each DomainParticipant maintains a database of information about all the active DataReaders and DataWriters that are in the same domain. This database is what makes it possible for DataWriters and DataReaders to communicate. To create and refresh the database, each application follows a common discovery process.

This chapter describes the default discovery mechanism known as the Simple Discovery Protocol, which includes two phases: Simple Participant Discovery (Section 14.1.1) and Simple Endpoint Discovery (Section 14.1.2). (Discovery can also be performed using the Enterprise Discovery Protocol; this requires a separately purchased package, RTI Enterprise Discovery Service.)

The goal of these two phases is to build, for each DomainParticipant, a complete picture of all the entities that belong to the remote participants that are in its peers list. The peers list is the list of nodes with which a participant may communicate. It starts out the same as the initial_peers list that you configure in the DISCOVERY QosPolicy (DDS Extension) (Section 8.5.2). If the accept_unknown_peers flag in that same QosPolicy is TRUE, then other nodes may also be added as they are discovered; if it is FALSE, then the peers list will match the initial_peers list, plus any peers added using the DomainParticipant’s add_peer() operation.

14.1.1 Simple Participant Discovery

This phase of the Simple Discovery Protocol is performed by the Simple Participant Discovery Protocol (SPDP).

During the Participant Discovery phase, DomainParticipants learn about each other. The DomainParticipant’s details are communicated to all other DomainParticipants in the same domain by sending participant declaration messages, also known as participant DATA submessages. The details include the DomainParticipant’s unique identifying key (GUID or Globally Unique ID described below), transport locators (addresses and port numbers), and QoS. These messages are sent on a periodic basis using best-effort communication.

Participant DATAs are sent periodically to maintain the liveliness of the DomainParticipant. They are also used to communicate changes in the DomainParticipant’s QoS. Only changes to QosPolicies that are part of the DomainParticipant’s built-in data (namely, the USER_DATA QosPolicy (Section 6.5.25)) need to be propagated.

When a DomainParticipant is deleted, a participant DATA (delete) submessage with the DomainParticipant's identifying GUID is sent.

The GUID is a unique reference to an entity. It is composed of a GUID prefix and an Entity ID. By default, the GUID prefix is calculated from the IP address and the process ID. (For more on how the GUID is calculated, see Controlling How the GUID is Set (rtps_auto_id_kind) (Section 8.5.9.4).) The IP address and process ID are stored in the DomainParticipant’s WIRE_PROTOCOL QosPolicy (DDS Extension) (Section 8.5.9). The entityID is set by Connext (you may be able to change it in a future version).
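For example, a sketch of selecting the GUID-prefix derivation in the WIRE_PROTOCOL QoS policy (classic C++ API; deriving the prefix from the IP address and process ID is the default behavior):

DDS_DomainParticipantQos participant_qos;
DDSTheParticipantFactory->get_default_participant_qos(participant_qos);

// Derive the GUID prefix from the host's IP address and the process ID.
participant_qos.wire_protocol.rtps_auto_id_kind = DDS_RTPS_AUTO_ID_FROM_IP;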

Once a pair of remote participants have discovered each other, they can move on to the Endpoint Discovery phase, which is how DataWriters and DataReaders find each other.

14.1.2 Simple Endpoint Discovery

This phase of the Simple Discovery Protocol is performed by the Simple Endpoint Discovery Protocol (SEDP).

During the Endpoint Discovery phase, Connext matches DataWriters and DataReaders. Information (GUID, QoS, etc.) about your application’s DataReaders and DataWriters is exchanged by sending publication/subscription declarations in DATA messages that we will refer to as publication DATAs and subscription DATAs. The Endpoint Discovery phase uses reliable communication.

As described in Section 14.3, these declaration or DATA messages are exchanged until each DomainParticipant has a complete database of information about the participants in its peers list and their entities. Then the discovery process is complete and the system switches to a steady state. During steady state, participant DATAs are still sent periodically to maintain the liveliness status of participants. They may also be sent to communicate QoS changes or the deletion of a DomainParticipant.

When a remote DataWriter/DataReader is discovered, Connext determines if the local application has a matching DataReader/DataWriter. A ‘match’ between the local and remote entities occurs only if the DataReader and DataWriter have the same Topic, same data type, and compatible QosPolicies (which includes having the same partition name string, see Section 6.4.5). Furthermore, if the DomainParticipant has been set up to ignore certain DataWriters/DataReaders, those entities will not be considered during the matching process. See Section 16.4.2 for more on ignoring specific publications and subscriptions.

This ‘matching’ process occurs as soon as a remote entity is discovered, even if the entire database is not yet complete: that is, the application may still be discovering other remote entities.

A DataReader and DataWriter can only communicate with each other if each one’s application has hooked up its local entity with the matching remote entity. That is, both sides must agree to the connection.

Section 14.3 describes the details about the discovery process.

14.2 Configuring the Peers List Used in Discovery

The Connext discovery process will try to contact all possible participants on each remote node in the ‘initial peers list,’ which comes from the initial_peers field of the DomainParticipant’s DISCOVERY QosPolicy.

The ‘initial peers list’ is just that: an initial list of peers to contact. Furthermore, the peers list merely contains potential peers—there is no requirement that there actually be Connext applications on the hosts in the list.

After startup, you can add to the ‘peers list’ with the add_peer() operation (see Adding and Removing Peers List Entries (Section 8.5.2.3)). The ‘peer list’ may also grow as peers are automatically discovered (if accept_unknown_peers is TRUE, see Controlling Acceptance of Unknown Peers (Section 8.5.2.6)).

When you call get_default_participant_qos() for a DomainParticipantFactory, the values used for the DiscoveryQosPolicy’s initial_peers and multicast_receive_addresses may come from the following:

A file named NDDS_DISCOVERY_PEERS, which is formatted as described in NDDS_DISCOVERY_PEERS File Format (Section 14.2.3). The file must be in the same directory as your application’s executable.

An environment variable named NDDS_DISCOVERY_PEERS, defined as a comma- separated list of peer descriptors (see NDDS_DISCOVERY_PEERS Environment Variable Format (Section 14.2.2)).

The value specified in the default XML QoS profile (see Overwriting Default QoS Values (Section 17.9.4)).

If NDDS_DISCOVERY_PEERS (file or environment variable) does not contain a multicast address, then multicast_receive_addresses is cleared and the RTI discovery process will not listen for discovery messages via multicast.

If NDDS_DISCOVERY_PEERS (file or environment variable) contains one or more multicast addresses, the addresses are stored in multicast_receive_addresses, starting at element 0. They will be stored in the order in which they appear in NDDS_DISCOVERY_PEERS.

Note: Setting initial_peers in the default XML QoS Profile does not modify the value of multicast_receive_addresses.

If both the file and environment variable are found, the file takes precedence and the environment variable will be ignored. (This is true even if the file is empty.) The settings in the default XML QoS Profile take precedence over the file and environment variable. In the absence of a file, environment variable, or default XML QoS profile values, Connext will use a default value. See the API Reference HTML documentation for details (in the section on the DISCOVERY QosPolicy).

If initial peers are specified in both the currently loaded QoS XML profile and in the NDDS_DISCOVERY_PEERS file, the values in the profile take precedence.

The file, environment variable, and default XML QoS Profile make it easy to reconfigure which nodes will take part in the discovery process—without recompiling your application.

The file, environment variable, and default XML QoS Profile are the possible sources for the default initial peers list. You can, of course, explicitly set the initial list by changing the values in the QoS provided to the DomainParticipantFactory's create_participant() operation, or by adding to the list after startup with the DomainParticipant’s add_peer() operation (see Section 8.5.2.3).
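For example, a sketch of both approaches (classic C++ API; the peer descriptors and domain ID shown are placeholders):

// Explicitly set the initial peers list before creating the participant.
DDS_DomainParticipantQos participant_qos;
DDSTheParticipantFactory->get_default_participant_qos(participant_qos);

participant_qos.discovery.initial_peers.ensure_length(2, 2);
participant_qos.discovery.initial_peers[0] = DDS_String_dup("udpv4://himalaya");
participant_qos.discovery.initial_peers[1] = DDS_String_dup("shmem://");
participant_qos.discovery.accept_unknown_peers = DDS_BOOLEAN_FALSE;

DDSDomainParticipant *participant =
        DDSTheParticipantFactory->create_participant(
                0, participant_qos, NULL, DDS_STATUS_MASK_NONE);

// Later, add another peer at runtime.
participant->add_peer("udpv4://gangotri");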

If you set NDDS_DISCOVERY_PEERS and You Want to Communicate over Shared Memory:

Suppose you want to communicate with other Connext applications on the same host and you are explicitly setting NDDS_DISCOVERY_PEERS (generally in order to use unicast discovery with applications on other hosts).

If the local host platform does not support the shared memory transport, then you can include the name of the local host in the NDDS_DISCOVERY_PEERS list. (To check if your platform supports shared memory, see the Platform Notes document.)

If the local host platform supports the shared memory transport, then you must do one of the following:

Include "shmem://" in the NDDS_DISCOVERY_PEERS list. This will cause shared memory to be used for discovery and data traffic for applications on the same host.

or:

Include the name of the local host in the NDDS_DISCOVERY_PEERS list, and disable the shared memory transport in the TRANSPORT_BUILTIN QosPolicy (DDS Extension) (Section 8.5.7) of the DomainParticipant. This will cause UDP loopback to be used for discovery and data traffic for applications on the same host.
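For example, with the first option, an NDDS_DISCOVERY_PEERS value of "shmem://,udpv4://himalaya" uses shared memory to reach applications on the same host while still using unicast UDPv4 discovery to reach the host himalaya.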

14.2.1 Peer Descriptor Format

A peer descriptor string specifies a range of participants at a given locator. Peer descriptor strings are used in the DISCOVERY QosPolicy (DDS Extension) (Section 8.5.2) initial_peers field (see Section 8.5.2.2) and the DomainParticipant’s add_peer() and remove_peer() operations (see Section 8.5.2.3).

The anatomy of a peer descriptor is illustrated in Figure 14.1 using a special "StarFabric" transport example.

A peer descriptor consists of:

[optional] A participant ID. If a simple integer is specified, it indicates the maximum participant ID to be contacted by the Connext discovery mechanism at the given locator. If that integer is enclosed in square brackets (e.g., [2]), then only that Participant ID will be used. You can also specify a range in the form of [a,b]: in this case only the Participant IDs in that specific range are contacted. If omitted, a default value of 4 is implied.

A locator, as described in Section 14.2.1.1.

These are separated by the '@' character. The separator may be omitted if a participant ID limit is not explicitly specified.
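For example, "3@udpv4://himalaya" contacts participant IDs 0 through 3 on the host himalaya, "[2]@udpv4://himalaya" contacts only participant ID 2, and "udpv4://himalaya" alone implies the default maximum participant ID of 4.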



Figure 14.1 Peer Descriptor Address String

The "participant ID limit" only applies to unicast locators; it is ignored for multicast locators (and therefore should be omitted for multicast peer descriptors).

14.2.1.1 Locator Format

A locator string specifies a transport and an address in string format. Locators are used to form peer descriptors. A locator is equivalent to a peer descriptor with the default participant ID limit (4).

A locator consists of:

[optional] Transport name (alias or class). This identifies the set of transport plug-ins (transport aliases) that may be used to parse the address portion of the locator. Note that a transport class name is an implicit alias used to refer to all the transport plug-in instances of that class.

[optional] An address, as described in Section 14.2.1.2.

These are separated by the "://" string. The separator is specified if and only if a transport name is specified.

If a transport name is specified, the address may be omitted; in that case all the unicast addresses (across all transport plug-in instances) associated with the transport class are implied. Thus, a locator string may specify several addresses.

If an address is specified, the transport name and the separator string may be omitted; in that case all the available transport plug-ins for the Entity may be used to parse the address string.

The transport names for the built-in transport plug-ins are:

shmem - Shared Memory Transport

udpv4 - UDPv4 Transport

udpv6 - UDPv6 Transport


14.2.1.2 Address Format

An address string specifies a transport-independent network address that qualifies a transport-dependent address string. Addresses are used to form locators. Addresses are also used in the DISCOVERY QosPolicy (DDS Extension) (Section 8.5.2) multicast_receive_addresses and the DDS_TransportMulticastSettings_t::receive_address fields. An address is equivalent to a locator in which the transport name and separator are omitted.

An address consists of:

[optional] A network address in IPv4 or IPv6 string notation. If omitted, the network address of the transport is implied.

[optional] A transport address, which is a string that is passed to the transport for processing. The transport maps this string into NDDS_Transport_Property_t::address_bit_count bits. If omitted, the network address is used as the fully qualified address.

These are separated by the '#' character. If a separator is specified, it must be followed by a non-empty string which is passed to the transport plug-in.

The bits resulting from the transport address string are prepended with the network address. The least significant NDDS_Transport_Property_t::address_bit_count bits of the network address are ignored.

If you omit the ‘#’ separator and the string is not a valid IPv4 or IPv6 address, it is treated as a transport address with an implicit network address (of the transport plug-in).

14.2.2 NDDS_DISCOVERY_PEERS Environment Variable Format

You can set the default value for the initial peers list in an environment variable named NDDS_DISCOVERY_PEERS. Multiple peer descriptor entries must be separated by commas. Table 14.1 shows some examples. The examples use an implied maximum participant ID of 4 unless otherwise noted. (If you need instructions on how to set environment variables, see the Getting Started Guide).

Table 14.1 NDDS_DISCOVERY_PEERS Environment Variable Examples

NDDS_DISCOVERY_PEERS               Description of Host(s)
-------------------------------    ----------------------------------------------------------------
239.255.0.1                        multicast
localhost                          localhost
192.168.1.1                        192.168.1.1 (IPv4)
FAA0::1                            FAA0::1 (IPv6)
himalaya,gangotri                  himalaya and gangotri
1@himalaya,1@gangotri              himalaya and gangotri (with a maximum participant ID of 1 on
                                   each host)
FAA0::0#localhost                  localhost (could be a UDPv4 transport plug-in registered at
                                   network address FAA0::0) (IPv6)
udpv4://himalaya                   himalaya accessed using the "udpv4" transport plug-ins (IPv4)
udpv4://FAA0::0#localhost          localhost using the "udpv4" transport plug-ins registered at
                                   network address FAA0::0
udpv4://                           all unicast addresses accessed via the "udpv4" (UDPv4)
                                   transport plug-ins
0/0/R, #0/0/R                      0/0/R (StarFabric)
starfabric://0/0/R,                0/0/R (StarFabric) using the "starfabric" (StarFabric)
starfabric://#0/0/R                transport plug-ins
starfabric://FBB0::0#0/0/R         0/0/R (StarFabric) using the "starfabric" (StarFabric)
                                   transport plug-ins registered at network address FBB0::0
starfabric://                      all unicast addresses accessed via the "starfabric"
                                   (StarFabric) transport plug-ins
shmem://                           all unicast addresses accessed via the "shmem" (shared memory)
                                   transport plug-ins
shmem://FCC0::0                    all unicast addresses accessed via the "shmem" (shared memory)
                                   transport plug-ins registered at network address FCC0::0


14.2.3 NDDS_DISCOVERY_PEERS File Format

You can set the default value for the initial peers list in a file named NDDS_DISCOVERY_PEERS. The file must be in your application's current working directory.

The file is optional. If it is found, it supersedes the values in any environment variable of the same name.

Entries in the file must contain a sequence of peer descriptors separated by whitespace or the comma (',') character. The file may also contain comments, starting with a semicolon (';') character and running to the end of the line.

Example file contents:

;; NDDS_DISCOVERY_PEERS - Default Discovery Configuration File

;; Multicast
builtin.udpv4://239.255.0.1    ; default discovery multicast addr

;; Unicast
localhost,192.168.1.1          ; A comma can be used as a separator
FAA0::1 FAA0::0#localhost      ; Whitespace can be used as a separator
1@himalaya                     ; Max participant ID of 1 on 'himalaya'
1@gangotri

;; UDPv4
udpv4://himalaya               ; 'himalaya' via 'udpv4' transport plugin(s)
udpv4://FAA0::0#localhost      ; 'localhost' via 'udpv4' transport plugin
                               ;   registered at network address FAA0::0

;; Shared Memory
shmem://                       ; All 'shmem' transport plugin(s)
builtin.shmem://               ; The builtin 'shmem' transport plugin
shmem://FCC0::0                ; Shared memory transport plugin registered
                               ;   at network address FCC0::0

;; StarFabric
0/0/R                          ; StarFabric node 0/0/R
starfabric://0/0/R             ; 0/0/R accessed via 'starfabric'
                               ;   transport plugin(s)
starfabric://FBB0::0#0/0/R     ; StarFabric transport plugin registered
                               ;   at network address FBB0::0
starfabric://                  ; All 'starfabric' transport plugin(s)


14.3 Discovery Implementation

Note: this section contains advanced material not required by most users.

Discovery is implemented using built-in DataWriters and DataReaders. These are the same class of entities your application uses to send and receive data; that is, they are also of type DDSDataWriter/DDSDataReader. For each DomainParticipant, three built-in DataWriters and three built-in DataReaders are automatically created for discovery purposes. Figure 14.2 shows how these objects are used. (For more on built-in DataReaders and DataWriters, see Chapter 16: Built-In Topics.)

Figure 14.2 Built-in Writers and Readers for Discovery

[Figure: within a DomainParticipant, a built-in Participant DataWriter advertises this participant and a built-in Participant DataReader discovers other participants by exchanging participant DATA messages on the "DCPSParticipant" builtin topic (Participant Discovery phase). Built-in Publication/Subscription DataWriters advertise this participant's DataWriters and DataReaders, and built-in Publication/Subscription DataReaders discover other participants' DataWriters and DataReaders, by exchanging publication DATA and subscription DATA messages on the "DCPSPublication" and "DCPSSubscription" builtin topics over the network (Endpoint Discovery phase).]

For each DomainParticipant, there are six objects automatically created for discovery purposes. The top two objects are used to send/receive participant DATA messages, which are used in the Participant Discovery phase to find remote DomainParticipants. This phase uses best-effort communications. Once the participants are aware of each other, they move on to the Endpoint Discovery Phase to learn about each other’s DataWriters and DataReaders. This phase uses reliable communications.

The implementation is split into two separate protocols:

  Simple Participant Discovery Protocol (SPDP)
+ Simple Endpoint Discovery Protocol (SEDP)
= Simple Discovery Protocol (SDP)

14.3.1 Participant Discovery

When a DomainParticipant is created, a DataWriter and a DataReader are automatically created to exchange participant DATA messages in the network. These DataWriters and DataReaders are "special" because the DataWriter can send to a given list of destinations, regardless of whether there is a Connext application at the destination, and the DataReader can receive data from any source, whether the source is previously known or not. In other words, these special readers and writers do not need to discover the remote entity and perform a match before they can communicate with each other.

When a DomainParticipant joins or leaves the network, it needs to notify its peer participants. The list of remote participants to use during discovery comes from the peer list described in the DISCOVERY QosPolicy (DDS Extension) (Section 8.5.2). The remote participants are notified via participant DATA messages. In addition, if a participant’s QoS is modified in such a way that other participants need to know about the change (that is, changes to the USER_DATA QosPolicy (Section 6.5.25)), a new participant DATA will be sent immediately.

Participant DATAs are also used to maintain a participant’s liveliness status. These are sent at the rate set in the participant_liveliness_assert_period in the DISCOVERY_CONFIG QosPolicy (DDS Extension) (Section 8.5.3).

Let’s examine what happens when a new remote participant is discovered. If the new remote participant is in the local participant's peer list, the local participant will add that remote participant into its database. If the new remote participant is not in the local application's peer list, it may still be added, if the accept_unknown_peers field in the DISCOVERY QosPolicy (DDS Extension) (Section 8.5.2) is set to TRUE.

Once a remote participant has been added to the Connext database, Connext keeps track of that remote participant’s participant_liveliness_lease_duration. If a participant DATA for that participant (identified by the GUID) is not received at least once within the participant_liveliness_lease_duration, the remote participant is considered stale, and the remote participant, together with all its entities, will be removed from the database of the local participant.

To keep from being purged by other participants, each participant needs to periodically send a participant DATA to refresh its liveliness. The rate at which the participant DATA is sent is controlled by the participant_liveliness_assert_period in the participant’s DISCOVERY_CONFIG QosPolicy (DDS Extension) (Section 8.5.3). This exchange, which keeps Participant A from appearing ‘stale,’ is illustrated in Figure 14.3. Figure 14.4 shows what happens when Participant A terminates ungracefully and therefore needs to be seen as ‘stale.’
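A sketch of these two settings (classic C++ API; the 5-second and 30-second values are illustrative only):

DDS_DomainParticipantQos participant_qos;
DDSTheParticipantFactory->get_default_participant_qos(participant_qos);

// Assert liveliness every 5 seconds; peers purge this participant if
// no participant DATA arrives within 30 seconds.
participant_qos.discovery_config.participant_liveliness_assert_period.sec = 5;
participant_qos.discovery_config.participant_liveliness_assert_period.nanosec = 0;
participant_qos.discovery_config.participant_liveliness_lease_duration.sec = 30;
participant_qos.discovery_config.participant_liveliness_lease_duration.nanosec = 0;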

14.3.1.1 Refresh Mechanism

To ensure that a late-joining participant does not need to wait until the next refresh of the remote participant DATA to discover the remote participant, there is a resend mechanism. If the received participant DATA is from a never-before-seen remote participant, and it is in the local participant's peers list, the application will resend its own participant DATA to all its peers. This resend can potentially be done multiple times, with a random sleep time in between. Figure 14.5 illustrates this scenario.

The number of retries and the random amount of sleep between them are controlled by each participant's DISCOVERY_CONFIG QosPolicy (DDS Extension) (Section 8.5.3); see Figure 14.5.
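A sketch of the relevant DISCOVERY_CONFIG fields (classic C++ API; the count and periods are illustrative only):

DDS_DomainParticipantQos participant_qos;
DDSTheParticipantFactory->get_default_participant_qos(participant_qos);

// Send 5 initial announcements, each after a random sleep of
// between 10 ms and 100 ms.
participant_qos.discovery_config.initial_participant_announcements = 5;
participant_qos.discovery_config.min_initial_participant_announcement_period.sec = 0;
participant_qos.discovery_config.min_initial_participant_announcement_period.nanosec =
        10 * 1000 * 1000;   // 10 ms
participant_qos.discovery_config.max_initial_participant_announcement_period.sec = 0;
participant_qos.discovery_config.max_initial_participant_announcement_period.nanosec =
        100 * 1000 * 1000;  // 100 ms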

Figure 14.6 provides a summary of the messages sent during the participant discovery phase.


Figure 14.3 Periodic ‘participant DATAs’

[Sequence diagram: the participant on Node A sends a 'participant A DATA' to Node B when it is created, repeats it at random intervals between min_initial_participant_announcement_period and max_initial_participant_announcement_period shortly after creation, then periodically at the participant_liveliness_assert_period; it re-sends it when its UserDataQosPolicy is modified, and sends a final 'participant A DATA (delete)' when the participant is destroyed.]

The DomainParticipant on Node A sends a ‘participant DATA’ to Node B, which is in Node A’s peers list. This occurs regardless of whether or not there is a Connext application on Node B.

The green short dashed lines are periodic participant DATAs. The time between these messages is controlled by the participant_liveliness_assert_period in the DiscoveryConfig QosPolicy.

In addition to the periodic participant DATAs, ‘initial repeat messages’ (shown in blue, with longer dashes) are sent from A to B. These messages are sent at a random time between min_initial_participant_announcement_period and max_initial_participant_announcement_period (in A’s DiscoveryConfig QosPolicy). The number of these initial repeat messages is set in initial_participant_announcements.


Figure 14.4 Ungraceful Termination of a Participant

[Sequence diagram: participants are created on Node A and Node B; on receiving a 'participant A DATA', B adds the new remote participant A to its database. Participant A then terminates ungracefully. When no further 'participant A DATA' arrives within participant A's participant_liveliness_lease_duration, B considers remote participant A 'stale' and removes it from its database.]

Participant A is removed from participant B’s database if it is not refreshed within the liveliness lease duration. Dashed lines are periodic participant DATA messages.

(Periodic resends of ‘participant B DATA’ from B to A are omitted from this diagram for simplicity. Initial repeat messages from A to B are also omitted from this diagram—these messages are sent at a random time between min_initial_participant_announcement_period and max_initial_participant_announcement_period, see Figure 14.3.)


Figure 14.5 Resending ‘participant DATA’ to a Late-Joiner

[Sequence diagram: the participant on Node A is created and sends a 'participant A DATA'. The participant on Node B is created later and sends a 'participant B DATA'; since B is a never-before-seen participant, A adds B to its database and resends its own 'participant A DATA' (possibly several times, with a random sleep in between). When a subsequent 'participant B DATA' arrives, participant B is already in A's database, so no action is taken.]