17.1.2 Strings and Wide Strings

Connext supports both strings consisting of single-byte characters (the IDL string type) and strings consisting of wide characters (the IDL wstring type). The wide characters supported by Connext are large enough to store two-byte Unicode (UTF-16) code units.

Like sequences, strings may be declared with or without an explicit bound. A string's "bound" is its maximum length (not counting the trailing NULL character in C and C++).

In the Modern C++ API, strings map to std::string or to a type with a similar interface, depending on the options. See Table 17.7 Specifying Data Types in IDL for Modern C++ in 17.3.4 Translations for IDL Types.

In C and Traditional C++, strings are mapped to char*. Optionally, the mapping in Traditional C++ can be changed to std::string by generating code with the option -useStdString.

By default, any string declared in an IDL file without an explicit bound is given a bound of 255 elements. You can override this default with the RTI Code Generator's -stringSize command-line argument (see the RTI Code Generator User's Manual).
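As a rough illustration of what a bound means in practice, the following sketch checks whether a value fits the default bound of 255. It is not Connext API: the helper names fits_bound and c_buffer_size are hypothetical, and the sketch assumes the bound is counted in encoded char elements (bytes) rather than in user-perceived characters.

```python
DEFAULT_STRING_BOUND = 255  # default bound quoted in the text above


def fits_bound(value: str, bound: int = DEFAULT_STRING_BOUND,
               encoding: str = "utf-8") -> bool:
    """Return True if `value` fits in a string with the given bound.

    Assumption: the bound counts encoded char elements (bytes),
    not user-perceived characters.
    """
    return len(value.encode(encoding)) <= bound


def c_buffer_size(bound: int = DEFAULT_STRING_BOUND) -> int:
    """Bytes a C/C++ char buffer needs: the bound plus the trailing NULL."""
    return bound + 1
```

Under that assumption, a string of 128 "é" characters needs 256 bytes in UTF-8 and exceeds the default bound, while the same string in ISO_8859_1 needs only 128 bytes.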

You can change the default behavior and use unbounded strings (subject to the limits described in 17.10 Data Sample Serialization Limits) by using Code Generator's -unboundedSupport command-line argument. When using this option, the generated code will deserialize incoming samples as follows:

First, it will release previous memory associated with the unbounded strings. The memory associated with an unbounded member is not released until the sample containing the member is reused.
Second, it will allocate new memory to accommodate the actual size of the unbounded strings.
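The two deserialization steps above can be pictured with a toy model. This is only a sketch of the described behavior, not generated code: the Sample class, its text_buffer attribute, and deserialize_into are hypothetical names.

```python
class Sample:
    """Toy stand-in for a sample with one unbounded string member."""

    def __init__(self):
        self.text_buffer = None  # memory backing the unbounded member


def deserialize_into(sample: Sample, wire_bytes: bytes) -> Sample:
    # Step 1: release the memory left over from the previous use of
    # this sample. Until this point the old buffer stays allocated.
    sample.text_buffer = None

    # Step 2: allocate exactly what the incoming string needs
    # (its length plus one element for the trailing NULL).
    buffer = bytearray(len(wire_bytes) + 1)
    buffer[:len(wire_bytes)] = wire_bytes
    sample.text_buffer = buffer
    return sample
```

In this model, a sample that once held a 1000-byte string keeps that 1001-byte buffer until the same sample object is passed to deserialize_into again.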

To configure unbounded support for code generated with rtiddsgen -unboundedSupport or for DynamicDataWriters/DynamicDataReaders for Topics of types that contain unbounded strings or wide strings:

- Use these threshold QoS properties:
  - dds.data_writer.history.memory_manager.fast_pool.pool_buffer_max_size on the DataWriter (see 20.1 Sample Memory Management for DataWriters)
  - dds.data_reader.history.memory_manager.fast_pool.pool_buffer_max_size on the DataReader (see 20.4 Instance Memory Management for DataReaders)
- Set the QoS value reader_resource_limits.dynamically_allocate_fragmented_samples on the DataReader to true.
- For the Java API, also set these properties accordingly for the Java serialization buffer:
  - dds.data_writer.history.memory_manager.java_stream.min_size
  - dds.data_writer.history.memory_manager.java_stream.trim_to_size
  - dds.data_reader.history.memory_manager.java_stream.min_size
  - dds.data_reader.history.memory_manager.java_stream.trim_to_size
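The role of a threshold property such as pool_buffer_max_size can be sketched as follows. This sketch assumes that samples whose serialized size is at or below the threshold use pre-allocated pool buffers and larger samples are allocated dynamically; confirm the exact semantics in the memory management sections referenced above. The value 4096 and the function buffer_source are hypothetical.

```python
# Property names are taken from the text above; the value is a
# hypothetical example, not a recommended setting.
THRESHOLD_PROPERTIES = {
    "dds.data_writer.history.memory_manager.fast_pool.pool_buffer_max_size": "4096",
    "dds.data_reader.history.memory_manager.fast_pool.pool_buffer_max_size": "4096",
}


def buffer_source(serialized_size: int, pool_buffer_max_size: int) -> str:
    """Model of the assumed threshold rule for a sample's buffer."""
    if serialized_size <= pool_buffer_max_size:
        return "pool"  # fits in a pre-allocated pool buffer
    return "heap"      # too large: allocated dynamically
```

With a threshold of 4096, a 512-byte sample would come from the pool and a 1 MB sample would be allocated on demand.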

17.1.2.1 IDL String Encoding

The “Extensible and Dynamic Topic Types for DDS” specification (https://www.omg.org/spec/DDS-XTypes/) standardizes the default encoding for strings to UTF-8. This encoding shall be used as the wire format. Language bindings may use the representation that is most natural in that particular language. If this representation is different from UTF-8, the language binding shall manage the transformation to/from the UTF-8 wire representation.

For example, in Java, IDL strings are mapped to Java String, which represents a string in the UTF-16 format. Connext handles the conversion to/from UTF-8 when serializing/deserializing strings in Java.

As an extension, Connext offers ISO_8859_1 as an alternative string wire encoding.
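To make the difference between these encodings concrete, the sketch below encodes the same text three ways using only the Python standard library. It does not use any Connext API; it simply shows the byte sequences each encoding produces and the kind of UTF-16 to UTF-8 conversion a binding such as Java's has to perform.

```python
text = "caf\u00e9"  # "café"

utf8_bytes = text.encode("utf-8")         # default wire encoding
latin1_bytes = text.encode("iso-8859-1")  # alternative wire encoding
utf16_bytes = text.encode("utf-16-le")    # in-memory form in Java or C#

# The accented character takes two bytes in UTF-8, one in ISO-8859-1.
utf8_length = len(utf8_bytes)      # 5
latin1_length = len(latin1_bytes)  # 4
utf16_length = len(utf16_bytes)    # 8 (four 2-byte code units)

# Converting the UTF-16 in-memory form to the UTF-8 wire form.
wire_from_utf16 = utf16_bytes.decode("utf-16-le").encode("utf-8")
```

The two wire encodings agree byte for byte only for ASCII text, which is why both endpoints need to agree on the encoding in use.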

This section describes the encoding for IDL strings across different languages in Connext and how to configure that encoding.

C, Traditional C++

IDL strings are mapped to a NULL-terminated array of DDS_Char (char*). Users are responsible for using the right character encoding (UTF-8 or ISO_8859_1) when populating the string values. This applies to all generated code, DynamicData, and Built-in data types. The middleware does not transform from the language binding encoding to the wire encoding.
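Because the middleware passes the bytes through untouched in these languages, a mismatch between the encoding used by the writing application and the one assumed by the reading application shows up as corrupted text. The sketch below simulates that with plain byte strings; read_as is a hypothetical helper, not a Connext function.

```python
def read_as(wire_bytes: bytes, encoding: str) -> str:
    """Interpret raw string bytes the way a reading application would."""
    return wire_bytes.decode(encoding, errors="replace")


# Writer populates the char array using ISO-8859-1, reader assumes UTF-8:
# the lone 0xE9 byte is not valid UTF-8 and becomes U+FFFD.
latin1_read_as_utf8 = read_as("caf\u00e9".encode("iso-8859-1"), "utf-8")

# Writer uses UTF-8, reader assumes ISO-8859-1: the two-byte sequence
# for "é" is shown as two separate characters.
utf8_read_as_latin1 = read_as("caf\u00e9".encode("utf-8"), "iso-8859-1")
```

In both directions the ASCII prefix survives and only the non-ASCII character is damaged, which can make this kind of mismatch easy to miss in testing.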

Modern C++

IDL strings are mapped to std::string, which can contain any sequence of bytes. Users are responsible for using the right character encoding (UTF-8 or ISO_8859_1) when populating the string values. The middleware does not transform from the language binding encoding to the wire encoding. This applies to all generated code, DynamicData, and Built-in types.

Ada

IDL strings are mapped to DDS.String, which is equivalent to a NULL-terminated array of DDS_Char (char*). Users are responsible for using the right character encoding (UTF-8 or ISO_8859_1) when populating the string values. The middleware does not transform from the language binding encoding to the wire encoding. This applies to all generated code and Built-in types.

Java

IDL strings are mapped to Java String, which represents a string in the UTF-16 format. Connext handles the conversion to/from UTF-8/ISO_8859_1 when serializing/deserializing strings. For generated code and Built-in data types, you can configure the IDL wire string encoding on a per-endpoint basis using the following properties:

dds.data_reader.type_support.cdr_string_encoding_kind
dds.data_writer.type_support.cdr_string_encoding_kind

These properties can be set at the endpoint level or the participant level. The only values currently supported are UTF-8 and ISO-8859-1. By default, the wire character encoding is assumed to be UTF-8.

For DynamicData, you can configure the IDL wire string encoding by setting the value of string_character_encoding in DynamicDataProperty_t. The following values are supported:

StandardCharsets.ISO_8859_1
StandardCharsets.UTF_8 (default)
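One practical consequence of choosing ISO_8859_1 as the wire encoding is that a UTF-16 string in the application can hold characters that ISO_8859_1 cannot represent. The sketch below checks for that case before writing. The helper name representable is hypothetical, and how Connext itself treats such characters is not stated in this section.

```python
def representable(value: str, wire_encoding: str) -> bool:
    """Return True if every character of `value` exists in `wire_encoding`."""
    try:
        value.encode(wire_encoding)
    except UnicodeEncodeError:
        return False
    return True
```

For example, "café" is representable in both encodings, while a string containing the euro sign (U+20AC) is representable in UTF-8 but not in ISO_8859_1.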

.NET

IDL strings are mapped to string in C#. The conversion to/from UTF-8/ISO_8859_1 when serializing/deserializing strings is automatically handled by Connext. For generated code and built-in data types, you can configure the IDL wire string encoding on a per-endpoint basis using the following properties:

dds.data_reader.type_support.cdr_string_encoding_kind
dds.data_writer.type_support.cdr_string_encoding_kind

For DynamicData, you can configure the IDL wire string encoding by setting the value of string_character_encoding in DynamicDataProperty_t. The following values are supported:

StringEncodingKind::UTF_8 (default)
StringEncodingKind::ISO_8859_1

17.1.2.1.1 Unicode Normalization when Using UTF-8 Encoding

Connext does not normalize the content of the IDL string fields when they are serialized and sent on the wire. It is the responsibility of the application to do that when needed.

Because the content of string fields is not guaranteed to be normalized, Connext by default normalizes both the UTF-8 IDL string values and the literals they are compared against (in the filter expression and/or filter parameters) before evaluating a filter. This normalization affects the following features:

ContentFilteredTopics (see 18.3 ContentFilteredTopics)
Query conditions (see 15.9.7 ReadConditions and QueryConditions)
TopicQueries (see Chapter 61 Topic Queries)
MultiChannel DataWriters (see Chapter 36 Multi-Channel DataWriters for High-Performance Filtering)

You can turn off filtering normalization by using the DomainParticipant's Property Qos property dds.domain_participant.filtering_unicode_normalization (see 35.8 Unicode Normalization).
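The following sketch shows why normalization matters for comparisons. It uses Python's standard unicodedata module and assumes the NFC normalization form purely for illustration; this section does not say which form Connext applies. The helper normalized_equal is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by combining acute accent (U+0301)

# The two strings render identically but differ as code points and bytes.
same_without_normalization = composed == decomposed  # False
composed_bytes = composed.encode("utf-8")            # b"caf\xc3\xa9"
decomposed_bytes = decomposed.encode("utf-8")        # b"cafe\xcc\x81"


def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

A filter that compares raw values would treat these two spellings as different strings, whereas a comparison made after normalization treats them as equal.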

17.1.2.1.2 Filtering Character Encoding

The following filtering features use UTF-8 character encoding by default for IDL strings:

ContentFilteredTopics (see 18.3 ContentFilteredTopics)
Query conditions (see 15.9.7 ReadConditions and QueryConditions)
TopicQueries (see Chapter 61 Topic Queries)
MultiChannel DataWriters (see Chapter 36 Multi-Channel DataWriters for High-Performance Filtering)

If the encoding of the IDL strings is ISO 8859-1, change the default filtering behavior by setting the DomainParticipant's Property Qos property dds.domain_participant.filtering_character_encoding to ISO-8859-1. For additional information about this property, see 35.7 Character Encoding.
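A toy model of a string filter shows why the filtering encoding has to match the encoding of the data. This is a sketch only: field_matches is a hypothetical function that decodes the field bytes with the configured encoding before comparing them with the filter literal, which is a simplification of whatever the real filter does internally.

```python
def field_matches(field_bytes: bytes, literal: str,
                  filtering_encoding: str) -> bool:
    """Decode a string field with the filtering encoding, then compare."""
    decoded = field_bytes.decode(filtering_encoding, errors="replace")
    return decoded == literal


# A string field populated by an application that uses ISO-8859-1.
field = "caf\u00e9".encode("iso-8859-1")

matches_with_default = field_matches(field, "caf\u00e9", "utf-8")
matches_when_configured = field_matches(field, "caf\u00e9", "iso-8859-1")
```

In this model the filter misses the sample under the default UTF-8 assumption and matches it once the filtering encoding agrees with the data.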

17.1.2.2 IDL Wide Strings Encoding

The “Extensible and Dynamic Topic Types for DDS” specification (https://www.omg.org/spec/DDS-XTypes/) standardizes the default encoding for wide strings to UTF-16. This encoding shall be used as the wire format.

When the data representation is Extended CDR version 1, wide-string characters have a size of 4 bytes on the wire with UTF-16 encoding. When the data representation is Extended CDR version 2, wide-string characters have a size of 2 bytes on the wire with UTF-16 encoding.
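The size difference between the two data representations can be worked out with a short calculation. The sketch below counts only the character payload of a wide string and ignores any length header or alignment padding; the names BYTES_PER_WCHAR, utf16_code_units, and wstring_payload_bytes are hypothetical.

```python
# Bytes used on the wire per wide character, as stated in the text above.
BYTES_PER_WCHAR = {"XCDR1": 4, "XCDR2": 2}


def utf16_code_units(value: str) -> int:
    """Number of UTF-16 code units needed to encode `value`."""
    return len(value.encode("utf-16-le")) // 2


def wstring_payload_bytes(value: str, representation: str) -> int:
    """Character payload size, excluding headers and padding."""
    return utf16_code_units(value) * BYTES_PER_WCHAR[representation]
```

For a 10-character ASCII wide string this gives 40 bytes of character data in Extended CDR version 1 and 20 bytes in version 2. A character outside the Basic Multilingual Plane counts as two code units in both cases.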

Language bindings may use the representation that is most natural in that particular language. If this representation is different from UTF-16, the language binding shall manage the transformation to/from the UTF-16 wire representation.

C, Traditional C++

IDL wide strings are mapped to a NULL-terminated array of DDS_Wchar (DDS_Wchar*). DDS_Wchar is an unsigned 2-byte integer. Users are responsible for using the right character encoding (UTF-16) when populating the wide-string values. This applies to all generated code, DynamicData, and Built-in data types. Connext does not transform from the language binding encoding to the wire encoding.

Modern C++

IDL wide strings are mapped to std::wstring, which contains a sequence of wchar_t. This applies to all generated code, DynamicData, and Built-in data types. When serializing/deserializing, Connext assumes that a wchar_t contains a code unit in UTF-16 encoding, even if the size of wchar_t is 4 bytes.
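The practical effect of this assumption is easiest to see with a character outside the Basic Multilingual Plane. On a platform with a 4-byte wchar_t, such a character could fit in one element as a single code point, but as UTF-16 it is a surrogate pair and therefore occupies two elements. The sketch below lists the UTF-16 code units for a string; utf16_units is a hypothetical helper, written in Python only to show the values involved.

```python
import struct


def utf16_units(value: str) -> list:
    """Return the UTF-16 code units of `value` as a list of integers."""
    raw = value.encode("utf-16-le")
    # Each code unit is one little-endian unsigned 16-bit integer.
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))
```

For U+1F600 this yields the pair 0xD83D, 0xDE00. Under the assumption described above, a std::wstring should hold those two values in consecutive elements rather than the single value 0x1F600.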

Ada

IDL wide strings are mapped to Standard.DDS.Wide_String, which is a NULL-terminated array of Standard.Wide_Character with UTF-16 encoding. This applies to all generated code and Built-in data types.

Java

IDL wide strings are mapped to Java String, which represents a string in the UTF-16 format. This applies to all generated code, DynamicData, and Built-in data types.

.NET

IDL wide strings are mapped to string in C#. This type uses the UTF-16 character encoding form. This applies to all generated code, DynamicData, and Built-in data types.

17.1.2.2.1 Unicode Normalization when Using UTF-16 Encoding

Connext does not normalize the content of the IDL wstring fields when they are serialized and sent on the wire. It is the responsibility of the application to do that when needed.

Unlike with IDL strings, Connext does not normalize the UTF-16 strings used by the filtering operations, either.