1. What is Connext Observability Framework?

RTI® Connext® Observability Framework™ is a holistic solution that uses telemetry data to provide deep visibility into the current and past states of your Connext applications. This visibility makes it easier to proactively identify and resolve potential system issues, providing a higher level of confidence in the reliable operation of the system.

Observability Framework use cases include:

  • Debugging. Find the cause of an undesired behavior, or determine whether a feature meets performance needs during development.

  • CI/CD monitoring. Assess the performance impact of code or configuration changes.

  • Monitoring deployed applications. Confirm that your systems are running as expected and proactively fix potential performance issues.

Important

Observability Framework is an experimental product that includes example configuration files for use with several third-party components (Prometheus®, Grafana Loki™, and Grafana®). This release is an evaluation distribution; use it to explore the new observability features that support Connext applications.

Do not deploy any Observability Framework components in production.

1.1. Telemetry Data

Telemetry data can be generated at three different levels:

  • Application. Telemetry data generated when you instrument your own applications.

  • Middleware. Telemetry data generated by Connext DDS entities and infrastructure services.

  • System. DevOps telemetry such as CPU, memory, and disk I/O usage.

In this release, Observability Framework supports middleware telemetry. Future releases could support application and system telemetry.

Regardless of the level, telemetry data can be categorized as:

  • Metrics. Collections of application statistics that are analyzed to understand application behavior. There are two types of metrics:

    • Counters count the number of events of a specific type; for example, the number of ACK messages emitted.

    • Gauges describe the state of some part of an application as a numeric value at a given point in time; for example, the number of samples in a queue.

  • Logs. Events captured as text or structured data.

  • Security Events. Events related to securing a distributed system.

  • Traces. A representation of a series of causally related events that encode the end-to-end flow of a piece of information in a software system. The traces in a distributed system are called distributed traces.

In this release, Observability Framework supports only metrics and logs. Future releases could support security events and traces.
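
To make the distinction between the two metric types concrete, the following is a minimal, generic sketch in Python. It does not use any Observability Framework or Connext API; the `Counter` and `Gauge` classes and the metric names `acks_sent` and `queued_samples` are hypothetical and exist only for illustration. A counter only ever increases as events occur, while a gauge is overwritten with the current state and can rise or fall.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    """Hypothetical counter: a count of events that only increases."""
    value: int = 0

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Hypothetical gauge: the current numeric state, which may go up or down."""
    value: float = 0

    def set(self, value: float) -> None:
        self.value = value


# Hypothetical metric names, chosen to mirror the examples in the text.
acks_sent = Counter()
queued_samples = Gauge()

queue = []
for sample in ("a", "b", "c"):
    queue.append(sample)             # a sample enters the queue
    queued_samples.set(len(queue))   # gauge rises: 1, 2, 3

while queue:
    queue.pop(0)                     # a sample is acknowledged and removed
    acks_sent.increment()            # counter rises: 1, 2, 3
    queued_samples.set(len(queue))   # gauge falls: 2, 1, 0

print(acks_sent.value, queued_samples.value)
```

After the run, the counter retains the total number of acknowledgments, whereas the gauge reflects only the final queue depth. This is why counters are typically analyzed as rates over time and gauges as instantaneous values.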

1.2. Distribution of Telemetry Data

Observability Framework enables you to collect and distribute telemetry data at scale from individual Connext applications to third-party telemetry backends such as Prometheus and Grafana Loki.

1.3. Flexible Storage

In this release, Observability Framework supports Prometheus as the time-series database to store Connext metrics and Grafana Loki as the log aggregation system to store Connext logs. Because OpenTelemetry™ is gaining widespread support in the observability industry, future releases will support OpenTelemetry to help integrate with various observability backends.

1.4. Visualization of Telemetry Data

In this release, Observability Framework provides a way to visualize the telemetry data collected from Connext applications using a set of reference Grafana dashboards. You can customize these dashboards or use them as an example to enhance and build dashboards in your preferred platform.

1.5. Control and Selection of Telemetry Data

Your distributed system components can produce a large amount of data, but not all of this data is required for problem detection. Observability Framework enables you to control the amount of telemetry data that is generated, distributed, and stored. You can set these controls in an initial configuration and change them at run time.