11. Troubleshooting Observability Framework
This section provides solutions for issues you may run into while evaluating Observability Framework.
11.1. Docker Container(s) Failed to Start
The Docker containers used by Observability Framework can fail to start for a variety of reasons. Two common causes are port conflicts and incorrect file permissions. To verify the state of these Docker containers, run the Docker command docker ps -a.
The following example shows that all Docker containers used by Observability Framework have started successfully.
CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES
6651d7ed9810 prom/prometheus:v2.37.5 "/bin/prometheus --c…" 5 minutes ago Up 5 minutes prometheus_observability
25050d16b1b5 grafana/grafana-enterprise:9.2.1-ubuntu "/run.sh" 5 minutes ago Up 5 minutes grafana_observability
08611ea9b255 rticom/collector-service:<version> "/rti_connext_dds-7.…" 5 minutes ago Up 5 minutes collector_service_observability
55568de5120f grafana/loki:2.7.0 "/usr/bin/loki --con…" 5 minutes ago Up 5 minutes loki_observability
The following example shows a container that has failed to start. The failure is indicated by the Restarting status in the STATUS column. In this example, the prometheus_observability container failed to start and is repeatedly trying to restart.
CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES
08f75e0fadb2 prom/prometheus:v2.37.5 "/bin/prometheus --c…" 5 minutes ago Restarting (1) 27 seconds ago prometheus_observability
9a3964b561ec grafana/loki:2.7.0 "/usr/bin/loki --con…" 5 minutes ago Up 5 minutes loki_observability
b6a6ffa201f3 rticom/collector-service:<version> "/rti_connext_dds-7.…" 5 minutes ago Up 5 minutes collector_service_observability
26658f76cfdc grafana/grafana-enterprise:9.2.1-ubuntu "/run.sh" 5 minutes ago Up 5 minutes grafana_observability
To determine why a container failed, examine its log. To display the log, run the Docker command docker logs <container_name>, where <container_name> is the container name shown in the NAMES column above.
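For example, if the prometheus_observability container from the listing above is the one that failed, the following commands (an illustrative sketch using the container name from this example) list the containers and then display the most recent log lines for that container:
docker ps -a
docker logs --tail 50 prometheus_observability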
11.1.1. Check for Port Conflicts
Run docker logs <container_name> to display the log for the failed container, then look for a port conflict error. An example of a Prometheus port conflict is shown below.
ts=2023-03-14T13:12:29.275Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-03-14T13:12:29.275Z caller=main.go:786 level=error msg="Unable to start web listener" err="listen tcp 0.0.0.0:9090: bind: address already in use"
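Before reconfiguring, it can help to identify which process is already using the conflicting port. A minimal sketch, assuming the conflict is on the default Prometheus port 9090 and that the Linux host provides the standard ss and lsof tools (sudo may be required to see process names):
ss -ltnp | grep 9090
lsof -i :9090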
If you discover a port conflict, perform the following steps to resolve the issue:
1. Remove the existing Observability Workspace. See Removing the Docker Workspace for Observability Framework for details on how to remove the workspace.
2. Update the JSON configuration files to configure the ports. See Configuring the Docker Workspace for Observability Framework for details on how to update the port configuration for the failed container.
3. Run <installdir>/bin/rtiobservability -c <JSON config> to recreate the Observability Workspace with the new port configuration (see the example following these steps).
4. Run <installdir>/bin/rtiobservability -i to create and run the Docker containers with the new port configuration.
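For example, assuming your updated configuration file is named observability_config.json (a hypothetical name) and <installdir> is your Connext installation directory, steps 3 and 4 would look like this:
<installdir>/bin/rtiobservability -c observability_config.json
<installdir>/bin/rtiobservability -i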
11.1.2. Check that You Have the Correct File Permissions
Run docker logs <container_name> to display the log for the failed container, then look for a file permissions error. An example of a file permissions problem is shown below.
ts=2023-03-14T22:21:47.666Z caller=main.go:450 level=error msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" file=/etc/prometheus/prometheus.yml err="open /etc/prometheus/prometheus.yml: permission denied"
Docker containers for Observability Framework require the "other" permission class to grant read/access for directories and read for files. To resolve a file permission problem, ensure Linux permissions of at least:
755 (rwxr-xr-x) for directories
444 (r--r--r--) for files
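To apply suitable permissions recursively, here is a minimal sketch, assuming the Observability Framework workspace was created under ~/rti_workspace/observability (a hypothetical location). Using 644 for files keeps owner write access while still granting the required read permission to others:
find ~/rti_workspace/observability -type d -exec chmod 755 {} +
find ~/rti_workspace/observability -type f -exec chmod 644 {} +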
11.2. No Data in Dashboards
Before proceeding, make sure all Docker containers for Observability Framework are running properly (see Docker Container(s) Failed to Start) and that you have started your applications with Monitoring Library 2.0 enabled (see Monitoring Library 2.0).
11.2.1. Check that Collector Service has Discovered Your Applications
1. Run one or more applications configured with Monitoring Library 2.0.
2. Open a browser to <servername>:<port>/metrics, where <servername> is the server where Observability Collector Service is installed and <port> is the Observability Collector Service Prometheus Client port (19090 is the default).
3. Verify that you have data for the dds_domain_participant_presence metric for your application(s), as shown below.
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 65289
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 60
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 60
exposer_request_latencies_sum 25681
exposer_request_latencies{quantile="0.5"} 316
exposer_request_latencies{quantile="0.9"} 522
exposer_request_latencies{quantile="0.99"} 728
# TYPE dds_domain_participant_presence gauge
dds_domain_participant_presence{guid="AC462E9B.9BB5237C.DBB61B21.80B55CD8",owner_guid="F8824B73.10EBC319.4ACD1E47.9ECB3033",dds_guid="010130C4.C84EFC6D.973810C6.000001C1",domain_id="57",platform="x64Linux4gcc7.3.0",product_version="<version>",name="/applications/SensorSubscriber/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458392"} 1 1678836129957
dds_domain_participant_presence{guid="291C3B07.34755D99.608E7BF3.1F6546D9",owner_guid="566D1E8D.5D7CBFD4.DD65CC20.C33D56E9",dds_guid="0101416F.425D03B2.8AC75FC8.000001C1",domain_id="57",platform="x64Linux4gcc7.3.0",product_version="<version>",name="/applications/SensorPublisher_2/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458369"} 1 1678836129957
dds_domain_participant_presence{guid="1D5929EC.4FB3CAE4.300F0DB0.C553A54F",owner_guid="D2FD6E87.D8C03AAA.EABFB1F8.E941495B",dds_guid="0101FBDA.551F142B.619EE527.000001C1",domain_id="57",platform="x64Linux4gcc7.3.0",product_version="<version>",name="/applications/SensorPublisher_1/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458346"} 1 1678836129957
If no metric data is available, you will see only the metric documentation (# HELP and # TYPE lines) with no metric values, as shown below.
# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 4017
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 4
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 4
exposer_request_latencies_sum 2510
exposer_request_latencies{quantile="0.5"} 564
exposer_request_latencies{quantile="0.9"} 621
exposer_request_latencies{quantile="0.99"} 621
# TYPE dds_domain_participant_presence gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_pkts_period_ms gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_pkts_count gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_pkts_mean gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_pkts_min gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_pkts_max gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_bytes_period_ms gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_bytes_count gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_bytes_mean gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_bytes_min gauge
# TYPE dds_domain_participant_udpv4_usage_in_net_bytes_max gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_pkts_period_ms gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_pkts_count gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_pkts_mean gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_pkts_min gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_pkts_max gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_bytes_period_ms gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_bytes_count gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_bytes_mean gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_bytes_min gauge
# TYPE dds_domain_participant_udpv4_usage_out_net_bytes_max gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_pkts_period_ms gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_pkts_count gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_pkts_mean gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_pkts_min gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_pkts_max gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_bytes_period_ms gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_bytes_count gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_bytes_mean gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_bytes_min gauge
# TYPE dds_domain_participant_udpv6_usage_in_net_bytes_max gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_pkts_period_ms gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_pkts_count gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_pkts_mean gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_pkts_min gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_pkts_max gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_bytes_period_ms gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_bytes_count gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_bytes_mean gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_bytes_min gauge
# TYPE dds_domain_participant_udpv6_usage_out_net_bytes_max gauge
If you see metric documentation lines only, verify that your applications are configured to use the same Observability domain as Observability Collector Service (2 is the default).
If your applications run on a machine other than the one hosting Observability Collector Service, ensure that collector_initial_peers in each application's Monitoring Library 2.0 configuration is set to the IP address of the machine where Observability Collector Service is running.
For more information on configuring Monitoring Library 2.0 for your application, see Monitoring Library 2.0.
11.2.2. Check that Prometheus can Access Collector Service
Open a browser to <servername>:<port>, where <servername> is the server where Prometheus is installed and <port> is the port number for the Prometheus Server (9090 is the default).
Select the Status > Targets menu to view the configured targets. In a healthy Prometheus Server, all targets show a state of UP. If Collector Service is unhealthy, the dds target shows a state of DOWN.
If Collector Service is shown as DOWN, check the following:
Collector Service is running.
The Endpoint URL for Collector Service is correct (including the port).
Examine the Error column to see if there is another cause being reported.
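As an alternative to the web UI, you can query the Prometheus HTTP API to inspect target health. A minimal sketch, assuming Prometheus runs locally on the default port 9090 and that jq is installed:
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health, lastError}'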
11.2.3. Check that Grafana can Access Prometheus
Note
These steps can only be performed as a Grafana Admin user. The Grafana images in this section were generated with Grafana version 10.1.4. If you are using a different version of Grafana, the details might be slightly different.
In Observability Dashboards, click the hamburger menu and select Connections > Data sources.
Select the “Prometheus” data source.
Scroll down and click Test to ensure that Grafana has connectivity with the Prometheus server.
If the test passes, a success message is displayed.
If the test fails, an error message is displayed.
If the Prometheus Data Source connectivity test fails, check the following:
The Prometheus Server is running.
The HTTP URL matches your Prometheus server URL (including port).
Examine the error response to debug the connection.
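Independently of Grafana, you can confirm that the Prometheus server itself is reachable by querying its built-in health endpoint. A minimal sketch, assuming the default host and port; an HTTP 200 response indicates the server is healthy:
curl -i http://localhost:9090/-/healthy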
11.2.4. Check that Grafana can Access Loki
Note
These steps can only be performed as a Grafana Admin user. The Grafana images in this section were generated with Grafana version 10.1.4. If you are using a different version of Grafana, the details might be slightly different.
In Observability Dashboards, click the hamburger menu and select Connections > Data sources.
Select the Loki data source.
Scroll down and click Test to ensure that Grafana has connectivity with the Loki server.
If the test passes, a success message is displayed.
If the test fails, an error message is displayed.
If the Loki Data Source connectivity test fails, check the following:
The Loki Server is running.
The HTTP URL matches your Loki server URL (including port).
Examine the error response to debug the connection.
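Similarly, you can confirm that the Loki server is reachable outside of Grafana by querying its readiness endpoint. A minimal sketch, assuming Loki runs locally on its default port 3100; a "ready" response indicates the server is up:
curl -i http://localhost:3100/ready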