7.3. Metrics

This section details the metrics you can collect from Connext observable resources. Each metric has a unique name and specifies a general feature of a Connext observable resource. For example, a DataWriter is an observable resource; the metric dds_data_writer_protocol_sent_heartbeats_total specifies the total number of heartbeats sent by a DataWriter. There are two metric types:

  • Counters. A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

  • Gauges. A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Observability Framework uses a Prometheus time-series database to store collected metrics. A time series is an instantiation of a metric and represents a stream of timestamped values (measurements) belonging to the same resource as the metric. For example, we could have a time series for the metric dds_data_writer_protocol_sent_heartbeats_total corresponding to a DataWriter DW1 identified by a resource GUID GUID1.

Labels (in Prometheus) or attributes (in OpenTelemetry) identify each metric instantiation, or time series. A label is a key/value pair associated with a metric. Any given combination of labels for the same metric name identifies a specific instantiation of that metric. For example, the metric dds_data_writer_protocol_sent_heartbeats_total for the DataWriter DW1 will have the label {guid=GUID1}. All metrics have at least one label, called guid, that uniquely identifies a resource in a Connext system.
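As a minimal sketch of how a metric name plus a label set identifies one time series, the following Python models a store keyed by that pair. The TimeSeriesStore class, its method names, and the sample values are hypothetical illustrations; they are not part of Observability Framework or Prometheus.

```python
from collections import defaultdict


class TimeSeriesStore:
    """Toy store: one time series per (metric name, label set) pair."""

    def __init__(self):
        self._series = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # Labels are order-independent, so freeze them into a sorted tuple.
        return (name, tuple(sorted(labels.items())))

    def append(self, name, labels, timestamp, value):
        # Add one timestamped measurement to the matching time series.
        self._series[self._key(name, labels)].append((timestamp, value))

    def series(self, name, labels):
        # Return the measurements recorded for this name and label set.
        return list(self._series.get(self._key(name, labels), []))

    def __len__(self):
        # Number of distinct time series held in the store.
        return len(self._series)


store = TimeSeriesStore()
metric = "dds_data_writer_protocol_sent_heartbeats_total"
store.append(metric, {"guid": "GUID1"}, 0, 5)
store.append(metric, {"guid": "GUID1"}, 10, 9)
store.append(metric, {"guid": "GUID2"}, 10, 3)
```

Here the two measurements labeled GUID1 belong to one time series, and the measurement labeled GUID2 starts a second time series for the same metric name.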

In Observability Framework there is a special kind of metric called a presence metric. Presence metrics indicate the existence of a resource in a Connext system. For example, the dds_domain_participant_presence metric indicates the presence of a DomainParticipant in a Connext system. There will be a time series for each DomainParticipant ever created in the system. The labels associated with a presence metric describe the resource, and they depend on the type of resource. For example, a DomainParticipant resource has labels such as domain_id and name.

For metrics that are not presence metrics, the only label is the guid label identifying the resource to which the metrics apply. You can use the guid label to query the description labels of a resource by looking at the presence metric for the resource class.
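The lookup described above can be sketched as a join on the guid label. This Python example works on in-memory dictionaries only; the describe function and the sample label values (topic name, hostname) are hypothetical, and a real deployment would perform the equivalent lookup through queries against the metrics database.

```python
def describe(samples, presence):
    """Attach presence labels to each sample by matching on guid."""
    labels_by_guid = {p["guid"]: p for p in presence}
    # Samples whose guid has no presence entry are returned unchanged.
    return [{**labels_by_guid.get(s["guid"], {}), **s} for s in samples]


# Hypothetical data: one DataWriter presence series and one ordinary sample.
presence = [
    {"guid": "GUID1", "topic_name": "Square", "hostname": "host-a"},
]
samples = [
    {"guid": "GUID1", "value": 42},
    {"guid": "GUID9", "value": 7},
]
described = describe(samples, presence)
```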

Observability Framework provides the ability to create an initial configuration for the collection and forwarding of metrics on each observable resource, as well as the ability to dynamically change this configuration at run time. The initial configuration for the collection of metrics is set in the Monitoring Library 2.0, as explained in Monitoring Library 2.0. Dynamic metric collection configuration changes are done using the REST API as detailed in Collector Service REST API Reference. For an example of how to dynamically change the metric collection configuration using the Observability Dashboards see Change the Metric Configuration.

7.3.1. Metric Pattern Definitions

Observability Framework enables you to select the set of metrics collected and forwarded for a resource both before and during run time. To select metrics, you use metric selector strings. When specifying metric selector strings, POSIX® fnmatch pattern matching should be used as described in Table 7.2. The most common use case is an asterisk (*) to match 0 or more non-special characters. Some example metric selectors using POSIX® fnmatch are shown below.

Table 7.6 POSIX® fnmatch Metric Selector Examples

Metric Selector

Description

dds_application_process_memory_usage_resident_memory_bytes

refers to the metric “dds_application_process_memory_usage_resident_memory_bytes”

dds_application_process_*

refers to all metrics that begin with “dds_application_process_”

dds_*_bytes

refers to metrics that start with “dds_” and end with “_bytes”
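To experiment with selectors like those in the table, the following sketch uses Python's standard fnmatch module. This is an approximation: Python's fnmatchcase is a shell-style matcher that behaves like POSIX fnmatch for simple asterisk patterns, but it is not the matcher used by Observability Framework and may differ for other special characters. The METRICS list and the select function are illustrative only.

```python
from fnmatch import fnmatchcase

METRICS = [
    "dds_application_process_memory_usage_resident_memory_bytes",
    "dds_application_process_memory_usage_virtual_memory_bytes",
    "dds_application_presence",
    "dds_data_writer_protocol_pushed_sample_bytes_total",
]


def select(selector, names=METRICS):
    """Return the metric names matched by a case-sensitive selector."""
    return [n for n in names if fnmatchcase(n, selector)]


exact = select("dds_application_process_memory_usage_resident_memory_bytes")
process = select("dds_application_process_*")
byte_metrics = select("dds_*_bytes")
```

Note that dds_*_bytes does not select dds_data_writer_protocol_pushed_sample_bytes_total, because that name ends with _total rather than _bytes.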

7.3.2. Application Metrics

The following tables describe the metrics and labels generated for Connext applications. Only the dds_application_presence metric has all of the application labels listed in the table below. All other application metrics have the guid label only.

Table 7.7 Application Labels

Label or Attribute Name

Description

controllability_url

The URL and port for the control server on the Collector Service that forwards data for the application. This URL is used when sending remote commands to the Collector Service to configure the telemetry data for the application. The remote commands use the Collector Service REST API. See Collector Service REST API Reference for details on the Collector Service REST API.

guid

Application resource GUID

hostname

Name of the host computer for the application

process_id

Process ID for the application

name

Fully qualified resource name (/applications/<AppName>)

Table 7.8 Application Metrics

Metric Name

Description

Type

dds_application_presence

Indicates the presence of the application and provides all label values for an application instance

Gauge

dds_application_process_memory_usage_resident_memory_bytes

The application resident memory utilization

Gauge

dds_application_process_memory_usage_virtual_memory_bytes

The application virtual memory utilization

Gauge

dds_application_logging_collection_middleware_level

The middleware collection syslog logging level. See Logs for valid values.

Gauge

dds_application_logging_forwarding_middleware_level

The middleware forwarding syslog logging level. See Logs for valid values.

Gauge

7.3.3. Participant Metrics

The following tables describe the metrics and labels generated for Connext DomainParticipants. Only the dds_domain_participant_presence metric has all of the DomainParticipant labels listed in the table below. All other DomainParticipant metrics have the guid label only.

The DomainParticipant resource contains statistic variable metrics such as dds_domain_participant_udpv4_usage_in_net_pkts_count, dds_domain_participant_udpv4_usage_in_net_pkts_mean, dds_domain_participant_udpv4_usage_in_net_pkts_min, and dds_domain_participant_udpv4_usage_in_net_pkts_max.

These variables are interpreted as follows:

  • The metrics with suffix _count represent the total number of packets or bytes over the last Prometheus scraping period.

  • The metrics with suffix _min represent the minimum mean over the last Prometheus scraping period. For example, dds_domain_participant_udpv4_usage_in_net_pkts_min contains the minimum packets/sec over the last scraping period. The min mean is calculated by choosing the minimum of individual mean values reported by Monitoring Library 2.0 every participant_factory_qos.monitoring.distribution_settings.periodic_settings.polling_period.

  • The metrics with suffix _max represent the maximum mean over the last Prometheus scraping period. For example, dds_domain_participant_udpv4_usage_in_net_pkts_max contains the maximum packets/sec over the last scraping period. The max mean is calculated by choosing the maximum of individual mean values reported by Monitoring Library 2.0 every participant_factory_qos.monitoring.distribution_settings.periodic_settings.polling_period.

  • The metrics with suffix _mean represent the mean over the last Prometheus scraping period. For example, dds_domain_participant_udpv4_usage_in_net_pkts_mean contains the packets/sec over the last scraping period. If the scraping period is 30 seconds, the metric contains the packets/sec generated within the last 30 seconds. The dds_domain_participant_udpv4_usage_in_net_pkts_mean is calculated by averaging all individual mean metrics sent by Monitoring Library 2.0 to Observability Collector Service over the last scraping period.
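The interpretation above can be sketched in Python as follows. Each report stands for the values sent by Monitoring Library 2.0 for one polling period. The aggregate function, the (count, mean) report shape, and the numbers are hypothetical; in particular, summing the per-period counts to obtain the _count value is an assumption made for illustration.

```python
from statistics import fmean


def aggregate(reports):
    """Combine per-polling-period reports into one scrape-period summary.

    Each report is a (count, mean) pair: the packets seen during one polling
    period and the packets/sec mean for that period.
    """
    counts = [count for count, _ in reports]
    means = [mean for _, mean in reports]
    return {
        "count": sum(counts),  # assumption: counts add up across polling periods
        "mean": fmean(means),  # average of the individual means
        "min": min(means),     # smallest individual mean
        "max": max(means),     # largest individual mean
    }


# Hypothetical: three polling periods fall within one scraping period.
summary = aggregate([(50, 10.0), (100, 20.0), (150, 30.0)])
```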

Table 7.9 Participant Labels

Label or Attribute Name

Description

guid

DomainParticipant resource GUID

owner_guid

Resource GUID of the owner entity (application)

dds_guid

DomainParticipant DDS GUID

hostname

Name of the host computer for the DomainParticipant

process_id

Process ID for the DomainParticipant

domain_id

DDS domain ID for the DomainParticipant

platform

Connext architecture as described in the RTI Architecture Abbreviation column in the Platform Notes.

product_version

Connext product version

name

Fully qualified resource name (/applications/<AppName>/domain_participants/<ParticipantName>)

Table 7.10 Participant Metrics

Metric Name

Description

Type

dds_domain_participant_presence

Indicates the presence of the DomainParticipant and provides all label values for a DomainParticipant instance

Gauge

dds_domain_participant_udpv4_usage_in_net_pkts_count

The UDPv4 transport in packets count over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_pkts_mean

The UDPv4 transport in packets mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_pkts_min

The UDPv4 transport in packets min mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_pkts_max

The UDPv4 transport in packets max mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_bytes_count

The UDPv4 transport in bytes count over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_bytes_mean

The UDPv4 transport in bytes mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_bytes_min

The UDPv4 transport in bytes min mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_in_net_bytes_max

The UDPv4 transport in bytes max mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_pkts_count

The UDPv4 transport out packets count over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_pkts_mean

The UDPv4 transport out packets mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_pkts_min

The UDPv4 transport out packets min mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_pkts_max

The UDPv4 transport out packets max mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_bytes_count

The UDPv4 transport out bytes count over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_bytes_mean

The UDPv4 transport out bytes mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_bytes_min

The UDPv4 transport out bytes min mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv4_usage_out_net_bytes_max

The UDPv4 transport out bytes max mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_pkts_count

The UDPv6 transport in packets count over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_pkts_mean

The UDPv6 transport in packets mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_pkts_min

The UDPv6 transport in packets min mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_pkts_max

The UDPv6 transport in packets max mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_bytes_count

The UDPv6 transport in bytes count over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_bytes_mean

The UDPv6 transport in bytes mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_bytes_min

The UDPv6 transport in bytes min mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_in_net_bytes_max

The UDPv6 transport in bytes max mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_pkts_count

The UDPv6 transport out packets count over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_pkts_mean

The UDPv6 transport out packets mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_pkts_min

The UDPv6 transport out packets min mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_pkts_max

The UDPv6 transport out packets max mean (packets/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_bytes_count

The UDPv6 transport out bytes count over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_bytes_mean

The UDPv6 transport out bytes mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_bytes_min

The UDPv6 transport out bytes min mean (bytes/sec) over the last scraping period

Gauge

dds_domain_participant_udpv6_usage_out_net_bytes_max

The UDPv6 transport out bytes max mean (bytes/sec) over the last scraping period

Gauge

7.3.4. Topic Metrics

The following tables describe the metrics and labels generated for Connext Topics. Only the dds_topic_presence metric has all of the Topic labels listed in the table below. All other Topic metrics have the guid label only.

Table 7.11 Topic Labels

Label or Attribute Name

Description

guid

Topic resource GUID

owner_guid

Resource GUID of the owner entity (DomainParticipant)

dds_guid

Topic DDS GUID

hostname

Name of the host computer for the DomainParticipant this Topic is registered with

domain_id

DDS domain ID for the DomainParticipant this Topic is registered with

topic_name

The Topic name

registered_type_name

The registered type name for this Topic

name

Fully qualified resource name (/applications/<AppName>/domain_participants/<ParticipantName>/topics/<TopicName>)

Table 7.12 Topic Metrics

Metric Name

Description

Type

dds_topic_presence

Indicates the presence of the Topic and provides all label values for a Topic instance

Gauge

dds_topic_inconsistent_total

See total_count field in the INCONSISTENT_TOPIC Status

Counter

7.3.5. DataWriter Metrics

The following tables describe the metrics and labels generated for Connext DataWriters. Only the dds_data_writer_presence metric has all of the DataWriter labels listed in the table below. All other DataWriter metrics have the guid label only.

Table 7.13 DataWriter Labels

Label or Attribute Name

Description

guid

DataWriter resource GUID

owner_guid

Resource GUID of the owner entity (publisher)

dds_guid

DataWriter DDS GUID

hostname

Name of the host computer for the DomainParticipant this DataWriter is registered with

domain_id

DDS domain ID for the DomainParticipant this DataWriter is registered with

topic_name

The Topic name for this DataWriter

registered_type_name

The registered type name for this DataWriter

name

Fully qualified resource name (/applications/<AppName>/domain_participants/<ParticipantName>/publishers/<PublisherName>/data_writers/<DataWriterName>)

participant_guid

Resource GUID of the DomainParticipant this DataWriter is registered with

Table 7.14 DataWriter Metrics

Metric Name

Description

Type

dds_data_writer_presence

Indicates the presence of the DataWriter and provides all label values for a DataWriter instance

Gauge

dds_data_writer_liveliness_lost_total

See total_count field in the LIVELINESS_LOST Status

Counter

dds_data_writer_deadline_missed_total

See total_count field in the OFFERED_DEADLINE_MISSED Status

Counter

dds_data_writer_incompatible_qos_total

See total_count field in the OFFERED_INCOMPATIBLE_QOS Status

Counter

dds_data_writer_reliable_cache_full_total

See full_reliable_writer_cache field in the RELIABLE_WRITER_CACHE_CHANGED Status

Counter

dds_data_writer_reliable_cache_high_watermark_total

See high_watermark_reliable_writer_cache field in the RELIABLE_WRITER_CACHE_CHANGED Status

Counter

dds_data_writer_reliable_cache_unack_samples

See unacknowledged_sample_count field in the RELIABLE_WRITER_CACHE_CHANGED Status

Gauge

dds_data_writer_reliable_cache_unack_samples_peak

See unacknowledged_sample_count_peak field in the RELIABLE_WRITER_CACHE_CHANGED Status

Gauge

dds_data_writer_reliable_cache_replaced_unack_samples_total

See replaced_unacknowledged_sample_count field in the RELIABLE_WRITER_CACHE_CHANGED Status

Counter

dds_data_writer_reliable_reader_activity_inactive_count

See inactive_count field in the RELIABLE_READER_ACTIVITY_CHANGED Status

Gauge

dds_data_writer_cache_samples_peak

See sample_count_peak field in the DATA_WRITER_CACHE_STATUS

Gauge

dds_data_writer_cache_samples

See sample_count field in the DATA_WRITER_CACHE_STATUS

Gauge

dds_data_writer_cache_alive_instances

See alive_instance_count field in the DATA_WRITER_CACHE_STATUS

Gauge

dds_data_writer_cache_alive_instances_peak

See alive_instance_count_peak field in the DATA_WRITER_CACHE_STATUS

Gauge

dds_data_writer_protocol_pushed_samples_total

See pushed_sample_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_pushed_sample_bytes_total

See pushed_sample_bytes field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_sent_heartbeats_total

See sent_heartbeat_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_pulled_samples_total

See pulled_sample_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_pulled_sample_bytes_total

See pulled_sample_bytes field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_received_nacks_total

See received_nack_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_received_nack_bytes_total

See received_nack_bytes field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_send_window_size

See send_window_size field in the DATA_WRITER_PROTOCOL_STATUS

Gauge

dds_data_writer_protocol_pushed_fragments_total

See pushed_fragment_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_pushed_fragment_bytes_total

See pushed_fragment_bytes field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_pulled_fragments_total

See pulled_fragment_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_pulled_fragment_bytes_total

See pulled_fragment_bytes field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_received_nack_fragments_total

See received_nack_fragment_count field in the DATA_WRITER_PROTOCOL_STATUS

Counter

dds_data_writer_protocol_received_nack_fragment_bytes_total

See received_nack_fragment_bytes field in the DATA_WRITER_PROTOCOL_STATUS

Counter

7.3.6. DataReader Metrics

The following tables describe the metrics and labels generated for Connext DataReaders. Only the dds_data_reader_presence metric has all of the DataReader labels listed in the table below. All other DataReader metrics have the guid label only.

Table 7.15 DataReader Labels

Label or Attribute Name

Description

guid

DataReader resource GUID

owner_guid

Resource GUID of the owner entity (subscriber)

dds_guid

DataReader DDS GUID

hostname

Name of the host computer for the DomainParticipant this DataReader is registered with

domain_id

DDS domain ID for the DomainParticipant this DataReader is registered with

topic_name

The Topic name for this DataReader

registered_type_name

The registered type name for this DataReader

name

Fully qualified resource name (/applications/<AppName>/domain_participants/<ParticipantName>/subscribers/<SubscriberName>/data_readers/<DataReaderName>)

participant_guid

Resource GUID of the DomainParticipant this DataReader is registered with

Table 7.16 DataReader Metrics

Metric Name

Description

Type

dds_data_reader_presence

Indicates the presence of the DataReader and provides all label values for a DataReader instance

Gauge

dds_data_reader_sample_rejected_total

See total_count field in the SAMPLE_REJECTED Status

Counter

dds_data_reader_liveliness_not_alive_count

See not_alive_count field in the LIVELINESS_CHANGED Status

Gauge

dds_data_reader_deadline_missed_total

See total_count field in the REQUESTED_DEADLINE_MISSED Status

Counter

dds_data_reader_incompatible_qos_total

See total_count field in the REQUESTED_INCOMPATIBLE_QOS Status

Counter

dds_data_reader_sample_lost_total

See total_count field in the SAMPLE_LOST Status

Counter

dds_data_reader_cache_samples_peak

See sample_count_peak field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_samples

See sample_count field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_old_source_ts_dropped_samples_total

See old_source_timestamp_dropped_sample_count field in the DATA_READER_CACHE_STATUS

Counter

dds_data_reader_cache_tolerance_source_ts_dropped_samples_total

See tolerance_source_timestamp_dropped_sample_count field in the DATA_READER_CACHE_STATUS

Counter

dds_data_reader_cache_content_filter_dropped_samples_total

See content_filter_dropped_sample_count field in the DATA_READER_CACHE_STATUS

Counter

dds_data_reader_cache_replaced_dropped_samples_total

See replaced_dropped_sample_count field in the DATA_READER_CACHE_STATUS

Counter

dds_data_reader_cache_samples_dropped_by_instance_replaced_total

See total_samples_dropped_by_instance_replacement field in the DATA_READER_CACHE_STATUS

Counter

dds_data_reader_cache_alive_instances

See alive_instance_count field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_alive_instances_peak

See alive_instance_count_peak field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_no_writers_instances

See no_writers_instance_count field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_no_writers_instances_peak

See no_writers_instance_count_peak field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_disposed_instances

See disposed_instance_count field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_disposed_instances_peak

See disposed_instance_count_peak field in the DATA_READER_CACHE_STATUS

Gauge

dds_data_reader_cache_compressed_samples_total

See compressed_sample_count field in the DATA_READER_CACHE_STATUS

Counter

dds_data_reader_protocol_received_samples_total

See received_sample_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_received_sample_bytes_total

See received_sample_bytes field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_duplicate_samples_total

See duplicate_sample_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_duplicate_sample_bytes_total

See duplicate_sample_bytes field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_received_heartbeats_total

See received_heartbeat_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_sent_nacks_total

See sent_nack_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_sent_nack_bytes_total

See sent_nack_bytes field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_rejected_samples_total

See rejected_sample_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_out_of_range_rejected_samples_total

See out_of_range_rejected_sample_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_received_fragments_total

See received_fragment_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_dropped_fragments_total

See dropped_fragment_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_reassembled_samples_total

See reassembled_sample_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_sent_nack_fragments_total

See sent_nack_fragment_count field in the DATA_READER_PROTOCOL_STATUS

Counter

dds_data_reader_protocol_sent_nack_fragment_bytes_total

See sent_nack_fragment_bytes field in the DATA_READER_PROTOCOL_STATUS

Counter

7.3.7. Derived Metrics Generated by Prometheus Recording Rules

Prometheus provides a capability called Recording Rules. The following text is an excerpt from the Prometheus documentation.

Recording rules allow you to precompute frequently needed or computationally
expensive expressions and save their result as a new set of time series.
Querying the precomputed result will then often be much faster than executing
the original expression every time it is needed. This is especially useful for
dashboards, which need to query the same expression repeatedly every time they
refresh.

A Prometheus recording rule generates a new metric time series with new values calculated at the frequency at which the rule is run. The recording rules in Observability Framework are run every 10 seconds, meaning there is an evaluation and update to the associated derived metric every 10 seconds. Observability Framework uses Prometheus recording rules to generate three types of derived metrics:

  • DDS entity proxy metrics

  • raw error metrics

  • aggregated error metrics

Each of these derived metric types is discussed in detail below.

The Grafana dashboards provided with Observability Framework make use of the error metrics generated by Prometheus recording rules. The aggregated error metrics are used on the Alert Home dashboard, while the raw error metrics are used on other dashboards.

7.3.7.1. DDS Entity Proxy Metrics

The DDS entity proxy metrics are used in the recording rules for the raw error metrics and are always 0. The proxy metrics are used to make sure the rules evaluate to known good values in cases where the underlying metrics are not available.
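The fallback role of a proxy metric follows from the PromQL or operator, which keeps every element of its left-hand vector and adds only those right-hand elements whose label sets are not already present. A minimal Python sketch of that behavior is shown below. It assumes that both sides are keyed by the guid label alone; the or_fallback function and the GUID values are hypothetical.

```python
def or_fallback(rule_result, proxy):
    """Mimic PromQL `or` for vectors keyed by guid.

    Keep every element of rule_result and add the proxy elements whose guid
    is not already present.
    """
    merged = dict(proxy)
    merged.update(rule_result)  # left-hand values win where both sides match
    return merged


# Hypothetical: the rule produced a value for GUID1 only; the proxy covers both.
rule_result = {"GUID1": 1}
proxy = {"GUID1": 0, "GUID2": 0}
combined = or_fallback(rule_result, proxy)
```

In this sketch, GUID2 has no underlying metric value, so the derived metric falls back to the proxy value of 0 instead of being absent.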

Table 7.17 DDS Entity Proxy Metrics

Metric Name

Description

dds_application_empty_metric

A proxy for application metrics that always provides a value of zero.

dds_domain_participant_empty_metric

A proxy for DomainParticipant metrics that always provides a value of zero.

dds_topic_empty_metric

A proxy for Topic metrics that always provides a value of zero.

dds_data_writer_empty_metric

A proxy for DataWriter metrics that always provides a value of zero.

dds_data_reader_empty_metric

A proxy for DataReader metrics that always provides a value of zero.

7.3.7.2. Raw Error Metrics

Raw error metrics are derived for select metrics by doing a boolean comparison to a predefined limit. The raw error metrics are created by converting the monotonically increasing value of a counter metric into a rate, comparing that rate to a limit, and returning a boolean value. The returned boolean value is 1 if the limit is exceeded, otherwise 0. In the Grafana dashboards, a value of 0 indicates a healthy condition for the error metric, while a value of 1 indicates a fail condition.
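The conversion from counter to rate to boolean can be sketched in Python as follows. This is a simplification: the simple_rate function only divides the increase between the first and last samples by the elapsed time, whereas the Prometheus rate() function also handles counter resets and extrapolates to the window boundaries. The function names and sample values are hypothetical, and at least two samples with distinct timestamps are required.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def raw_error(samples, limit=0.0):
    """Return 1 if the rate exceeds the limit, otherwise 0."""
    return 1 if simple_rate(samples) > limit else 0


# Hypothetical counter samples taken 10 seconds apart.
healthy = [(0, 7), (10, 7), (20, 7)]  # counter did not move
failing = [(0, 7), (10, 8), (20, 9)]  # counter increased by 2 in 20 seconds
```

With these samples, raw_error(healthy) yields 0 and raw_error(failing) yields 1, matching the healthy and fail conditions shown in the dashboards.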

Recording rules have been created to generate a derived raw error metric for all of the metrics listed in Table 7.18 and Table 7.19.

7.3.7.2.1. Enabled Raw Error Metrics

A set of recording rules has been created to detect failures that are relevant in all systems. These rules detect conditions that are not expected to occur in a system that is operating correctly. The rules for these “enabled” metrics test if the underlying metric has exceeded a limit of 0. Note the >bool 0 comparison operator in each of the recording rules. A value greater than 0 in any of these metrics will result in an alert indication in the dashboards. This set of metrics is “enabled” because any increase in the underlying metric indicates an unexpected condition in DDS. Table 7.18 lists derived raw error metrics that are “enabled”.

Table 7.18 Raw Error Metrics (enabled)

Metric Name

Recording Rule

dds_data_reader_cache_content_filter_dropped_samples_errors

rate(dds_data_reader_cache_content_filter_dropped_samples_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_replaced_dropped_samples_errors

rate(dds_data_reader_cache_replaced_dropped_samples_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_samples_dropped_by_instance_replaced_errors

rate(dds_data_reader_cache_samples_dropped_by_instance_replaced_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_rejected_samples_errors

rate(dds_data_reader_protocol_rejected_samples_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_out_of_range_rejected_samples_errors

rate(dds_data_reader_protocol_out_of_range_rejected_samples_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_dropped_fragments_errors

rate(dds_data_reader_protocol_dropped_fragments_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_topic_inconsistent_errors

rate(dds_topic_inconsistent_total[1m]) >bool 0 or dds_topic_empty_metric

dds_data_writer_incompatible_qos_errors

rate(dds_data_writer_incompatible_qos_total[1m]) >bool 0 or dds_data_writer_empty_metric

dds_data_reader_incompatible_qos_errors

rate(dds_data_reader_incompatible_qos_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_writer_liveliness_lost_errors

rate(dds_data_writer_liveliness_lost_total[1m]) >bool 0 or dds_data_writer_empty_metric

dds_data_writer_reliable_reader_activity_inactive_count_errors

rate(dds_data_writer_reliable_reader_activity_inactive_count[1m]) >bool 0 or dds_data_writer_empty_metric

dds_data_reader_liveliness_not_alive_count_errors

rate(dds_data_reader_liveliness_not_alive_count[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_tolerance_source_ts_dropped_samples_errors

rate(dds_data_reader_cache_tolerance_source_ts_dropped_samples_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_writer_deadline_missed_errors

rate(dds_data_writer_deadline_missed_total[1m]) >bool 0 or dds_data_writer_empty_metric

dds_data_reader_deadline_missed_errors

rate(dds_data_reader_deadline_missed_total[1m]) >bool 0 or dds_data_reader_empty_metric

dds_data_writer_reliable_cache_replaced_unack_samples_errors

rate(dds_data_writer_reliable_cache_replaced_unack_samples_total[1m]) >bool 0 or dds_data_writer_empty_metric

dds_data_reader_sample_lost_errors

rate(dds_data_reader_sample_lost_total[1m]) >bool 0 or dds_data_reader_empty_metric

7.3.7.2.2. Disabled Raw Error Metrics

Additional recording rules have been created that, by default, do not detect failures, because a meaningful limit for them depends on the requirements of the actual system. The rules for the “disabled” metrics test whether the underlying metric is less than a limit of 0. Since this comparison is never true, the derived raw error metric never indicates a failure, hence “disabled”. Note the <bool 0 comparison operator in each of the recording rules. A limit that would indicate a fail condition cannot be determined without additional knowledge of the system.

Users may modify a “disabled” rule to compare against a value that is meaningful to their system. Because rate() returns a per-second average, the limit is expressed in samples per second. For example, if users want to be notified when the rate of repaired samples, averaged over the last minute, exceeds 10 samples per second, then they would modify the rule

rate(dds_data_writer_protocol_pulled_samples_total[1m]) <bool 0 or dds_data_writer_empty_metric

to

rate(dds_data_writer_protocol_pulled_samples_total[1m]) >bool 10 or dds_data_writer_empty_metric
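When choosing a limit, keep in mind that the Prometheus rate() function returns a per-second average over the window, so the limit in the rule is a per-second value. The following Python sketch works through the arithmetic for a one-minute window. The exceeds function is hypothetical, it assumes a counter with no resets, and it ignores the extrapolation that rate() performs.

```python
WINDOW_SECONDS = 60


def exceeds(increase_in_window, limit_per_second):
    """Mimic `rate(...[1m]) >bool limit` for a counter with no resets."""
    rate = increase_in_window / WINDOW_SECONDS
    return 1 if rate > limit_per_second else 0


# With a limit of 10, more than 600 repaired samples per minute are needed.
at_limit = exceeds(600, 10)     # 10.0 samples/sec is not greater than 10
over_limit = exceeds(601, 10)   # just over 10 samples/sec

# To alert on more than 10 repaired samples per minute, scale the limit.
per_minute = exceeds(11, 10 / WINDOW_SECONDS)
```

Under these assumptions, a limit of 10 corresponds to more than 600 repaired samples in one minute, and a limit of 10/60 corresponds to more than 10 repaired samples in one minute.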

For complete instructions on how to enable these metrics and display them in the dashboards, see Enable a Raw Error Metric.

The “disabled” rules have been created as a convenience for the user. However, only a few of these rules may be useful for any specific system. Table 7.19 lists derived raw error metrics that are “disabled”.

Table 7.19 Raw Error Metrics (disabled)

Metric Name

Recording Rule

dds_data_writer_protocol_sent_heartbeats_errors

rate(dds_data_writer_protocol_sent_heartbeats_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_received_nacks_errors

rate(dds_data_writer_protocol_received_nacks_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_received_nack_bytes_errors

rate(dds_data_writer_protocol_received_nack_bytes_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_received_nack_fragments_errors

rate(dds_data_writer_protocol_received_nack_fragments_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_received_nack_fragment_bytes_errors

rate(dds_data_writer_protocol_received_nack_fragment_bytes_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_reader_protocol_received_heartbeats_errors

rate(dds_data_reader_protocol_received_heartbeats_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_sent_nacks_errors

rate(dds_data_reader_protocol_sent_nacks_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_sent_nack_bytes_errors

rate(dds_data_reader_protocol_sent_nack_bytes_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_sent_nack_fragments_errors

rate(dds_data_reader_protocol_sent_nack_fragments_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_sent_nack_fragment_bytes_errors

rate(dds_data_reader_protocol_sent_nack_fragment_bytes_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_writer_protocol_pulled_samples_errors

rate(dds_data_writer_protocol_pulled_samples_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pulled_sample_bytes_errors

rate(dds_data_writer_protocol_pulled_sample_bytes_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pulled_fragments_errors

rate(dds_data_writer_protocol_pulled_fragments_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pulled_fragment_bytes_errors

rate(dds_data_writer_protocol_pulled_fragment_bytes_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pushed_samples_errors

rate(dds_data_writer_protocol_pushed_samples_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pushed_sample_bytes_errors

rate(dds_data_writer_protocol_pushed_sample_bytes_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pushed_fragments_errors

rate(dds_data_writer_protocol_pushed_fragments_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_pushed_fragment_bytes_errors

rate(dds_data_writer_protocol_pushed_fragment_bytes_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_reader_cache_compressed_samples_errors

rate(dds_data_reader_cache_compressed_samples_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_duplicate_samples_errors

rate(dds_data_reader_protocol_duplicate_samples_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_duplicate_sample_bytes_errors

rate(dds_data_reader_protocol_duplicate_sample_bytes_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_received_samples_errors

rate(dds_data_reader_protocol_received_samples_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_received_sample_bytes_errors

rate(dds_data_reader_protocol_received_sample_bytes_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_received_fragments_errors

rate(dds_data_reader_protocol_received_fragments_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_protocol_reassembled_samples_errors

rate(dds_data_reader_protocol_reassembled_samples_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_application_process_memory_usage_resident_memory_bytes_errors

rate(dds_application_process_memory_usage_resident_memory_bytes[1m]) <bool 0 or dds_application_empty_metric

dds_application_process_memory_usage_virtual_memory_bytes_errors

rate(dds_application_process_memory_usage_virtual_memory_bytes[1m]) <bool 0 or dds_application_empty_metric

dds_domain_participant_udpv4_usage_in_net_pkts_errors

rate(dds_domain_participant_udpv4_usage_in_net_pkts_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv4_usage_in_net_bytes_errors

rate(dds_domain_participant_udpv4_usage_in_net_bytes_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv4_usage_out_net_pkts_errors

rate(dds_domain_participant_udpv4_usage_out_net_pkts_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv4_usage_out_net_bytes_errors

rate(dds_domain_participant_udpv4_usage_out_net_bytes_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv6_usage_in_net_pkts_errors

rate(dds_domain_participant_udpv6_usage_in_net_pkts_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv6_usage_in_net_bytes_errors

rate(dds_domain_participant_udpv6_usage_in_net_bytes_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv6_usage_out_net_pkts_errors

rate(dds_domain_participant_udpv6_usage_out_net_pkts_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_domain_participant_udpv6_usage_out_net_bytes_errors

rate(dds_domain_participant_udpv6_usage_out_net_bytes_mean[1m]) <bool 0 or dds_domain_participant_empty_metric

dds_data_writer_reliable_cache_full_errors

rate(dds_data_writer_reliable_cache_full_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_reliable_cache_high_watermark_errors

rate(dds_data_writer_reliable_cache_high_watermark_total[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_reliable_cache_unack_samples_errors

rate(dds_data_writer_reliable_cache_unack_samples[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_reliable_cache_unack_samples_peak_errors

rate(dds_data_writer_reliable_cache_unack_samples_peak[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_protocol_send_window_size_errors

rate(dds_data_writer_protocol_send_window_size[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_cache_samples_errors

rate(dds_data_writer_cache_samples[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_cache_samples_peak_errors

rate(dds_data_writer_cache_samples_peak[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_cache_alive_instances_errors

rate(dds_data_writer_cache_alive_instances[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_writer_cache_alive_instances_peak_errors

rate(dds_data_writer_cache_alive_instances_peak[1m]) <bool 0 or dds_data_writer_empty_metric

dds_data_reader_sample_rejected_errors

rate(dds_data_reader_sample_rejected_total[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_samples_errors

rate(dds_data_reader_cache_samples[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_samples_peak_errors

rate(dds_data_reader_cache_samples_peak[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_alive_instances_errors

rate(dds_data_reader_cache_alive_instances[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_alive_instances_peak_errors

rate(dds_data_reader_cache_alive_instances_peak[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_no_writers_instances_errors

rate(dds_data_reader_cache_no_writers_instances[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_no_writers_instances_peak_errors

rate(dds_data_reader_cache_no_writers_instances_peak[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_disposed_instances_errors

rate(dds_data_reader_cache_disposed_instances[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_disposed_instances_peak_errors

rate(dds_data_reader_cache_disposed_instances_peak[1m]) <bool 0 or dds_data_reader_empty_metric

dds_data_reader_cache_old_source_ts_dropped_samples_errors

rate(dds_data_reader_cache_old_source_ts_dropped_samples_total[1m]) <bool 0 or dds_data_reader_empty_metric

7.3.7.3. Aggregated Error Metrics

The aggregated error metrics create a status roll-up for a group of metrics in a particular category. These aggregated error metrics are used in the Alert Home dashboard to provide a high-level view of alerts grouped by category. The categories are Bandwidth, Saturation, Data Loss, System Errors, and Delays. The aggregated error metrics are created by adding together all of the raw error metrics assigned to a category and clamping the values at 1, the value that indicates a failed condition. Table 7.20 shows all of the aggregated error metrics and the rule used to generate them. Note the use of the raw error metrics in the rules.

Table 7.20 Aggregate Error Metrics

Metric Name

Recording Rule

dds_excessive_bandwidth_errors

clamp_max ((sum (dds_custom_excessive_bandwidth_errors) + sum (dds_data_writer_protocol_sent_heartbeats_errors) + sum (dds_data_writer_protocol_received_nacks_errors) + sum (dds_data_writer_protocol_received_nack_bytes_errors) + sum (dds_data_writer_protocol_received_nack_fragments_errors) + sum (dds_data_writer_protocol_received_nack_fragment_bytes_errors) + sum (dds_data_reader_protocol_received_heartbeats_errors) + sum (dds_data_reader_protocol_sent_nacks_errors) + sum (dds_data_reader_protocol_sent_nack_bytes_errors) + sum (dds_data_reader_protocol_sent_nack_fragments_errors) + sum (dds_data_reader_protocol_sent_nack_fragment_bytes_errors) + sum (dds_data_writer_protocol_pulled_samples_errors) + sum (dds_data_writer_protocol_pulled_sample_bytes_errors) + sum (dds_data_writer_protocol_pulled_fragments_errors) + sum (dds_data_writer_protocol_pulled_fragment_bytes_errors) + sum (dds_data_writer_protocol_pushed_samples_errors) + sum (dds_data_writer_protocol_pushed_sample_bytes_errors) + sum (dds_data_writer_protocol_pushed_fragments_errors) + sum (dds_data_writer_protocol_pushed_fragment_bytes_errors) + sum (dds_data_reader_cache_content_filter_dropped_samples_errors) + sum (dds_data_reader_cache_compressed_samples_errors) + sum (dds_data_reader_protocol_duplicate_samples_errors) + sum (dds_data_reader_protocol_duplicate_sample_bytes_errors) + sum (dds_data_reader_protocol_received_samples_errors) + sum (dds_data_reader_protocol_received_sample_bytes_errors) + sum (dds_data_reader_protocol_received_fragments_errors) + sum (dds_data_reader_protocol_reassembled_samples_errors)), 1)

dds_saturation_errors

clamp_max ((sum (dds_custom_saturation_errors) + sum (dds_application_process_memory_usage_resident_memory_bytes_errors) + sum (dds_application_process_memory_usage_virtual_memory_bytes_errors) + sum (dds_domain_participant_udpv4_usage_in_net_pkts_errors) + sum (dds_domain_participant_udpv4_usage_in_net_bytes_errors) + sum (dds_domain_participant_udpv4_usage_out_net_pkts_errors) + sum (dds_domain_participant_udpv4_usage_out_net_bytes_errors) + sum (dds_domain_participant_udpv6_usage_in_net_pkts_errors) + sum (dds_domain_participant_udpv6_usage_in_net_bytes_errors) + sum (dds_domain_participant_udpv6_usage_out_net_pkts_errors) + sum (dds_domain_participant_udpv6_usage_out_net_bytes_errors) + sum (dds_data_writer_reliable_cache_full_errors) + sum (dds_data_writer_reliable_cache_high_watermark_errors) + sum (dds_data_writer_reliable_cache_unack_samples_errors) + sum (dds_data_writer_reliable_cache_unack_samples_peak_errors) + sum (dds_data_writer_protocol_send_window_size_errors) + sum (dds_data_writer_cache_samples_errors) + sum (dds_data_writer_cache_samples_peak_errors) + sum (dds_data_writer_cache_alive_instances_errors) + sum (dds_data_writer_cache_alive_instances_peak_errors) + sum (dds_data_reader_sample_rejected_errors) + sum (dds_data_reader_cache_samples_errors) + sum (dds_data_reader_cache_samples_peak_errors) + sum (dds_data_reader_cache_replaced_dropped_samples_errors) + sum (dds_data_reader_cache_samples_dropped_by_instance_replaced_errors) + sum (dds_data_reader_cache_alive_instances_errors) + sum (dds_data_reader_cache_alive_instances_peak_errors) + sum (dds_data_reader_cache_no_writers_instances_errors) + sum (dds_data_reader_cache_no_writers_instances_peak_errors) + sum (dds_data_reader_cache_disposed_instances_errors) + sum (dds_data_reader_cache_disposed_instances_peak_errors) + sum (dds_data_reader_protocol_rejected_samples_errors) + sum (dds_data_reader_protocol_out_of_range_rejected_samples_errors) + sum (dds_data_reader_protocol_dropped_fragments_errors)), 1)

dds_errors

clamp_max ((sum (dds_custom_errors) + sum (dds_topic_inconsistent_errors) + sum (dds_data_writer_incompatible_qos_errors) + sum (dds_data_reader_incompatible_qos_errors) + sum (dds_data_writer_liveliness_lost_errors) + sum (dds_data_writer_reliable_reader_activity_inactive_count_errors) + sum (dds_data_reader_liveliness_not_alive_count_errors) + sum (dds_data_reader_cache_old_source_ts_dropped_samples_errors) + sum (dds_data_reader_cache_tolerance_source_ts_dropped_samples_errors)), 1)

dds_delays_errors

clamp_max ((sum (dds_custom_delays_errors) + sum (dds_data_writer_deadline_missed_errors) + sum (dds_data_reader_deadline_missed_errors)), 1)

dds_data_loss_errors

clamp_max ((sum (dds_custom_data_loss_errors) + sum (dds_data_writer_reliable_cache_replaced_unack_samples_errors) + sum (dds_data_reader_sample_lost_errors) + sum (dds_data_reader_cache_replaced_dropped_samples_errors) + sum (dds_data_reader_cache_samples_dropped_by_instance_replaced_errors) + sum (dds_data_reader_cache_tolerance_source_ts_dropped_samples_errors)), 1)
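
To make the roll-up arithmetic concrete, the sketch below restates the dds_delays_errors rule from Table 7.20 in rule-file form and annotates it with a worked example. The values in the comments are hypothetical, the record/expr layout is assumed to match the other entries in the rule file, and the example assumes each raw error metric has at least one time series (in PromQL, sum() over an empty vector returns no result, which would leave the whole addition empty).

```yaml
# Worked example with hypothetical values: two DataReaders currently report
# dds_data_reader_deadline_missed_errors = 1 and every other series is 0.
#   sum(dds_custom_delays_errors)               = 0
#   sum(dds_data_writer_deadline_missed_errors) = 0
#   sum(dds_data_reader_deadline_missed_errors) = 2
#   clamp_max(0 + 0 + 2, 1)                     = 1  (category shows a failure)
- record: dds_delays_errors
  expr: >
    clamp_max((sum(dds_custom_delays_errors)
    + sum(dds_data_writer_deadline_missed_errors)
    + sum(dds_data_reader_deadline_missed_errors)), 1)
```

Clamping keeps the aggregated metric at 0 or 1 regardless of how many resources are failing, so the category reads as a simple OK/failed status.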

7.3.7.4. Enable a Raw Error Metric

Note

The Grafana user must have Admin privileges to make any changes to the Grafana dashboards.

Use the following steps to enable any of the “disabled” metrics in your system:

  1. Update the raw error rule to enable the calculation and provide a limit. See Update the Recording Rule for the Derived Metric below.

  2. Update the Alert “Category” dashboard to update the background color of the OK/ERROR and State panels for the enabled metric. See Update the Alert “Category” Dashboard below.

  3. Update the “Entity” status dashboard to update the query and background color in the State panel. See Update the “Entity” Status Dashboard below.

The example that follows enables the rule for the dds_data_reader_cache_alive_instances_errors metric so that it detects any DataReader that has more than 3 ALIVE instances in its cache.

7.3.7.4.1. Update the Recording Rule for the Derived Metric

Locate the recording rule for the dds_data_reader_cache_alive_instances_errors metric in the monitoring_recording_rules.yml file located in the rti_workspace/<version>/observability/prometheus directory.

 # User Config Required
   - record: dds_data_reader_cache_alive_instances_errors
     expr: >
       rate(dds_data_reader_cache_alive_instances[1m]) <bool 0 or dds_data_reader_empty_metric

The dds_data_reader_cache_alive_instances metric is a gauge, so the limit check should use the current value of the metric rather than its rate. The following updated recording rule makes the error active whenever the value is greater than 3.

 # User Config Required
   - record: dds_data_reader_cache_alive_instances_errors
     expr: >
       dds_data_reader_cache_alive_instances >bool 3 or dds_data_reader_empty_metric
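
By contrast, a rule based on a counter keeps the rate() call and changes only the comparison. As a minimal sketch, suppose you want to flag any DataReader whose rejected-sample rate, averaged over the last minute, exceeds 5 per second. The limit of 5 is a hypothetical value, and the _total suffix suggests (but this sketch only assumes) that the underlying metric is a counter. The disabled dds_data_reader_sample_rejected_errors rule from Table 7.19 might then become:

```yaml
# Sketch only: the limit of 5 is hypothetical; choose a value that suits your system.
# The underlying metric is assumed to be a counter, so the check stays on rate().
- record: dds_data_reader_sample_rejected_errors
  expr: >
    rate(dds_data_reader_sample_rejected_total[1m]) >bool 5
    or dds_data_reader_empty_metric
```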

Important

After updating the monitoring_recording_rules.yml file, you must restart all Docker containers for Observability Framework by running rtiobservability -t followed by rtiobservability -s. The Prometheus server will read the updated file after restarting the containers.

7.3.7.4.2. Update the Alert “Category” Dashboard

Note

The Grafana images in this section were generated with Grafana version 9.2.1. If you are using a different version of Grafana, the interface may be slightly different.

Locate the Alert “Category” dashboard for the metric rule you are enabling. The metric in our example, dds_data_reader_cache_alive_instances_errors, is in the Saturation group (see Table 7.20), so the Alert Saturation dashboard is used in the following steps.

  1. Go to Dashboards > Browse to open the list of dashboards.

    Dashboard Browse Menu
  2. Select the Alert Saturation dashboard from the list.

    Alert Saturation Dashboard Select
  3. Once on the Alert Saturation dashboard, scroll down to the Alive Instances row under the Reader Cache section.

    Alive Instances Row
  4. Select Alive Instances > Edit from the status indicator panel menu.

    Alive Instances Indicator Edit
  5. In the right panel, scroll down until you find the Value mappings section.

    Alive Instances Indicator Base Gray
  6. Click the gray color circle next to the OK mapping to select a new color for the panel “OK” indication.

    Alive Instances Indicator Base Color Select
  7. Select the large green circle in the panel. The updated OK value should change from gray to green.

    Alive Instances Indicator Base Green
  8. Select Apply at the top right to apply the change and return to the Alert Saturation dashboard.

    Save Alive Instances Indicator Change
  9. Select Alive Instances > Edit from the State panel menu.

    Alive Instances State Edit
  10. In the right panel, scroll down to the Thresholds section.

    Alive Instances State Base Gray
  11. Click the gray circle next to Base to select a new base color for the Thresholds panel.

    Alive Instances State Base Color Select
  12. Select the large green circle in the panel. The updated Threshold base value should change to green.

    Alive Instances State Base Green
  13. Select Apply at the top right to apply the changes and return to the Alert Saturation dashboard.

    Save Alive Instances State Change
  14. Select the Save Dashboard icon at the top right.

    Save Alert Saturation Dashboard
  15. When prompted to confirm, select Save.

    Save Alert Saturation Dashboard Confirmation

The Alive Instances row under the Reader Cache section should now be green, indicating it is enabled.

Alive Instances Row Enabled

7.3.7.4.3. Update the “Entity” Status Dashboard

Locate the “Entity” status dashboard for the metric rule you are enabling. For the metric in our example, dds_data_reader_cache_alive_instances_errors, we need to update the Alert DataReader Status dashboard.

  1. Go to Dashboards > Browse to open the list of dashboards.

    Dashboard Browse Menu
  2. Select the Alert DataReader Status dashboard from the list.

    Alert DataReader Status Dashboard Select
  3. Once on the Alert DataReader Status dashboard, scroll down to the Alive Instances row under the Saturation/Reader Cache section.

    Alert DataReader Status Alive Instances Row
  4. Select Alive Instances > Edit from the status indicator panel menu.

    Alive Instances Indicator State Edit

    The query for the panel is shown below.

    DataReader Status State Query
  5. Edit the query to match the rule that was created for the dds_data_reader_cache_alive_instances_errors metric. In the Metrics browser field, remove the irate calculation and set the limit check to >bool 3, as shown below.

    DataReader Status State Query Edited
  6. In the right panel, scroll down to the Thresholds section.

    DataReader Alive Instances State Base Gray
  7. Click the gray circle next to Base to select a new base color for the Thresholds panel.

    DataReader Alive Instances State Base Color Select
  8. Select the large green circle in the panel. The updated Threshold base value should change from gray to green.

    DataReader Alive Instances State Base Green
  9. Select Apply at the top right to apply the change and return to the Alert DataReader Status dashboard.

    Save DataReader Alive Instances State Change
  10. Select the Save Dashboard icon at the top right.

    Save Alert DataReader Status Dashboard
  11. When prompted to confirm, select Save.

    Save Alert DataReader Status Dashboard Confirmation

You have now enabled the rule for dds_data_reader_cache_alive_instances_errors, which detects any DataReader that has more than 3 instances with an instance state of ALIVE in its cache. The indication of this condition will display on all relevant dashboards.

You can test this rule by running the applications as described in section Start the Applications. Start any combination of publishing applications whose -s, --sensor-count command-line arguments total more than 3. Whenever this condition occurs, you will see this error indicated.

7.3.7.5. Custom Error Metrics

Table 7.21 shows metrics that are not fully implemented.

Table 7.21 Custom Error Metrics

Metric Name

Description

dds_custom_excessive_bandwidth_errors

Not fully implemented. Not to be modified or used.

dds_custom_saturation_errors

Not fully implemented. Not to be modified or used.

dds_custom_errors

Not fully implemented. Not to be modified or used.

dds_custom_delays_errors

Not fully implemented. Not to be modified or used.

dds_custom_data_loss_errors

Not fully implemented. Not to be modified or used.