.. _section-troubleshooting:

Troubleshooting |OBSERVABILITY_HEADING|
****************************************

This section provides solutions for issues you may run into while
evaluating |OBSERVABILITY|.

.. _section-docker-containers-fail-to-start:

Docker Containers Fail to Start
===============================

The Docker containers used by |OBSERVABILITY| can fail to start for a
variety of reasons. Two common causes are port conflicts and
insufficient file permissions.

To verify the state of these Docker containers, run the Docker command
``docker ps -a``. The example below shows all Docker containers used by
|OBSERVABILITY| started successfully.

.. code-block:: console

    CONTAINER ID   IMAGE                                     COMMAND                  CREATED         STATUS         NAMES
    6651d7ed9810   prom/prometheus:v2.37.5                   "/bin/prometheus --c…"   5 minutes ago   Up 5 minutes   prometheus_observability
    25050d16b1b5   grafana/grafana-enterprise:9.2.1-ubuntu   "/run.sh"                5 minutes ago   Up 5 minutes   grafana_observability
    08611ea9b255   rticom/collector-service:                 "/rti_connext_dds-7.…"   5 minutes ago   Up 5 minutes   collector_service_observability
    55568de5120f   grafana/loki:2.7.0                        "/usr/bin/loki --con…"   5 minutes ago   Up 5 minutes   loki_observability

The example below shows a container that failed to start. The failure is
indicated by ``Restarting`` in the STATUS column. In this example, the
``prometheus_observability`` container failed to start and repeatedly
tried to restart.

.. code-block:: console

    CONTAINER ID   IMAGE                                     COMMAND                  CREATED         STATUS                          NAMES
    08f75e0fadb2   prom/prometheus:v2.37.5                   "/bin/prometheus --c…"   5 minutes ago   Restarting (1) 27 seconds ago   prometheus_observability
    9a3964b561ec   grafana/loki:2.7.0                        "/usr/bin/loki --con…"   5 minutes ago   Up 5 minutes                    loki_observability
    b6a6ffa201f3   rticom/collector-service:                 "/rti_connext_dds-7.…"   5 minutes ago   Up 5 minutes                    collector_service_observability
    26658f76cfdc   grafana/grafana-enterprise:9.2.1-ubuntu   "/run.sh"                5 minutes ago   Up 5 minutes                    grafana_observability

To determine why a container failed, examine its log file. To generate
the log, run the Docker command ``docker logs <container name>``, where
``<container name>`` is the name shown in the NAMES column above.

Check for Port Conflicts
------------------------

Run ``docker logs <container name>`` to generate the logs for the failed
container, then look for a port conflict error. An example of a
Prometheus port conflict is shown below.

.. code-block:: console

    ts=2023-03-14T13:12:29.275Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
    ts=2023-03-14T13:12:29.275Z caller=main.go:786 level=error msg="Unable to start web listener" err="listen tcp 0.0.0.0:9090: bind: address already in use"

If you discover a port conflict, perform the following steps to resolve
the issue:

#. Remove the existing Observability Workspace. See
   :ref:`section-remove-observability-workspace` for details on how to
   remove the workspace.
#. Update the JSON configuration files to configure new ports. See
   :ref:`section-configure-docker` for details on how to update the port
   configuration for the failed container.
#. Run ``<installdir>/bin/rtiobservability -c <config file>`` to recreate
   the Observability Workspace with the new port configuration.
#. Run ``<installdir>/bin/rtiobservability -i`` to create and run the
   Docker containers with the new port configuration.
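Before reconfiguring ports, it can help to identify which process
already holds the conflicting port. The following is a minimal sketch,
assuming a Linux host and the default Prometheus port 9090:

.. code-block:: console

    # List the listening socket and the owning process
    $ sudo ss -tulpn | grep 9090

    # Alternative, if lsof is installed
    $ sudo lsof -i :9090

If the port is held by a service you need to keep, move |OBSERVABILITY|
to a free port as described above; otherwise, stop the conflicting
service and restart the containers.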
Check that You Have the Correct File Permissions
------------------------------------------------

Run ``docker logs <container name>`` to generate the logs for the failed
container, then look for a file permissions error. An example of a file
permissions problem is shown below.

.. code-block:: console

    ts=2023-03-14T22:21:47.666Z caller=main.go:450 level=error msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" file=/etc/prometheus/prometheus.yml err="open /etc/prometheus/prometheus.yml: permission denied"

The Docker containers for |OBSERVABILITY| require the ``other``
permission class to have "read/access" for directories and "read" for
files. To resolve a file permission problem, ensure Linux permissions of
at least:

- 755 (rwxr-xr-x) for directories
- 444 (r--r--r--) for files
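The following is a minimal sketch for inspecting and correcting the
permissions from the command line. The workspace path is a hypothetical
example; substitute the directory that holds your Observability
Workspace:

.. code-block:: console

    # Inspect the current permissions recursively
    $ ls -lR ~/rti_workspace/observability

    # Give "other" read on everything, plus access (execute) on directories
    $ chmod -R o+rX ~/rti_workspace/observability

The capital ``X`` adds the execute bit only to directories (and to files
that are already executable), which matches the requirement above.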
No Data in Dashboards
=====================

Before proceeding, make sure all Docker containers for |OBSERVABILITY|
are running properly (see :ref:`section-docker-containers-fail-to-start`)
and that you have started your applications with |MONITORINGLIBRARY2|
enabled (see :ref:`section-monitoring-library-2`).

Check that |COLLECTORSERVICE_HEADING| has Discovered Your Applications
----------------------------------------------------------------------

#. Run one or more applications configured with |MONITORINGLIBRARY2|.
#. Open a browser to ``<servername>:<port>/metrics``, where
   ``<servername>`` is the server where |OCS| is installed and
   ``<port>`` is the port number for the |OCS| Prometheus Client port
   (19090 is the default).
#. Verify that you have data for the ``dds_domain_participant_presence``
   metric for your application(s), as highlighted below.

.. code-block:: console
    :emphasize-lines: 15,16,17

    # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
    # TYPE exposer_transferred_bytes_total counter
    exposer_transferred_bytes_total 65289
    # HELP exposer_scrapes_total Number of times metrics were scraped
    # TYPE exposer_scrapes_total counter
    exposer_scrapes_total 60
    # HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
    # TYPE exposer_request_latencies summary
    exposer_request_latencies_count 60
    exposer_request_latencies_sum 25681
    exposer_request_latencies{quantile="0.5"} 316
    exposer_request_latencies{quantile="0.9"} 522
    exposer_request_latencies{quantile="0.99"} 728
    # TYPE dds_domain_participant_presence gauge
    dds_domain_participant_presence{guid="AC462E9B.9BB5237C.DBB61B21.80B55CD8",owner_guid="F8824B73.10EBC319.4ACD1E47.9ECB3033",dds_guid="010130C4.C84EFC6D.973810C6.000001C1",domain_id="57",platform="x64Linux4gcc8.5.0",product_version="",name="/applications/SensorSubscriber/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458392"} 1 1678836129957
    dds_domain_participant_presence{guid="291C3B07.34755D99.608E7BF3.1F6546D9",owner_guid="566D1E8D.5D7CBFD4.DD65CC20.C33D56E9",dds_guid="0101416F.425D03B2.8AC75FC8.000001C1",domain_id="57",platform="x64Linux4gcc8.5.0",product_version="",name="/applications/SensorPublisher_2/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458369"} 1 1678836129957
    dds_domain_participant_presence{guid="1D5929EC.4FB3CAE4.300F0DB0.C553A54F",owner_guid="D2FD6E87.D8C03AAA.EABFB1F8.E941495B",dds_guid="0101FBDA.551F142B.619EE527.000001C1",domain_id="57",platform="x64Linux4gcc8.5.0",product_version="",name="/applications/SensorPublisher_1/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458346"} 1 1678836129957

If no metric data is available, you will see only the metric
documentation, with no metric data, as shown below.

.. code-block:: console

    # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
    # TYPE exposer_transferred_bytes_total counter
    exposer_transferred_bytes_total 4017
    # HELP exposer_scrapes_total Number of times metrics were scraped
    # TYPE exposer_scrapes_total counter
    exposer_scrapes_total 4
    # HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
    # TYPE exposer_request_latencies summary
    exposer_request_latencies_count 4
    exposer_request_latencies_sum 2510
    exposer_request_latencies{quantile="0.5"} 564
    exposer_request_latencies{quantile="0.9"} 621
    exposer_request_latencies{quantile="0.99"} 621
    # TYPE dds_domain_participant_presence gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_max gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_max gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_max gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_max gauge

If you see metric documentation lines only, verify that your
applications are configured to use the same Observability domain as
|OCS| (2 is the default).

If your applications run on a machine other than the one hosting |OCA|,
ensure that ``collector_initial_peers`` in the |MONITORINGLIBRARY2|
configuration of each application is set to the IP address of the host
where |OCA| is running. For more information on configuring
|MONITORINGLIBRARY2| for your application, see
:ref:`section-monitoring-library-2`.
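You can also perform this check from the command line instead of a
browser. The following is a minimal sketch, assuming |OCS| runs on the
local host with the default Prometheus Client port 19090:

.. code-block:: console

    $ curl -s http://localhost:19090/metrics | grep dds_domain_participant_presence

If the command returns only the
``# TYPE dds_domain_participant_presence gauge`` documentation line,
|OCS| has not discovered any applications yet.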
Check that Prometheus can Access |COLLECTORSERVICE_HEADING|
-----------------------------------------------------------

Open a browser to ``<servername>:<port>``, where ``<servername>`` is the
server where Prometheus is installed and ``<port>`` is the port number
for the Prometheus Server (9090 is the default). Select the
**Status > Targets** menu to view the configured targets, as shown below.

.. figure:: static/prometheus_target_selection.png
    :figwidth: 90 %
    :alt: Prometheus Target Selection
    :name: PrometheusTargetSelection
    :align: center

A Prometheus Server with all healthy targets is shown below.

.. figure:: static/prometheus_targets_healthy.png
    :figwidth: 100 %
    :alt: Prometheus Targets Healthy
    :name: PrometheusTargetsHealthy
    :align: center

A Prometheus Server with an unhealthy |COLLECTORSERVICE| is shown below.
Note the ``DOWN`` indication for the state of the ``dds`` target.

.. figure:: static/prometheus_targets_unhealthy.png
    :figwidth: 100 %
    :alt: Prometheus Targets Unhealthy
    :name: PrometheusTargetsUnhealthy
    :align: center

If |COLLECTORSERVICE| is shown as ``DOWN``, check the following:

- |COLLECTORSERVICE| is running.
- The ``Endpoint`` URL for |COLLECTORSERVICE| is correct (including the
  port).
- Examine the ``Error`` field to see if another cause is being reported.

Check that Grafana can Access Prometheus
----------------------------------------

.. note::

    These steps can only be performed as a Grafana Admin user. The
    Grafana images in this section were generated with Grafana version
    10.1.4. If you are using a different version of Grafana, the details
    might be slightly different.

In |Dashboards|, click the hamburger menu and select
**Connections > Data sources**.

.. figure:: static/grafana_datasources_menu.png
    :figwidth: 100 %
    :alt: Grafana Data Sources Menu
    :name: GrafanaDataSourcesMenuPrometheus
    :align: center

Select the Prometheus data source.

.. figure:: static/grafana_prometheus_datasource_select.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Data Source Select
    :name: GrafanaPrometheusDataSourceSelect
    :align: center

Scroll down and click **Test** to verify that Grafana has connectivity
with the Prometheus server.

.. figure:: static/grafana_prometheus_config.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Config
    :name: GrafanaPrometheusConfig
    :align: center

If the test passes, the following message is displayed.

.. figure:: static/grafana_prometheus_pass.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Pass
    :name: GrafanaPrometheusPass
    :align: center

If the test fails, the following message is displayed.

.. figure:: static/grafana_prometheus_fail.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Fail
    :name: GrafanaPrometheusFail
    :align: center

If the Prometheus Data Source connectivity test fails, check the
following:

- The Prometheus Server is running.
- The HTTP URL matches your Prometheus server URL (including the port).
- Examine the error response to debug the connection.

Check that Grafana can Access Loki
----------------------------------

.. note::

    These steps can only be performed as a Grafana Admin user. The
    Grafana images in this section were generated with Grafana version
    10.1.4. If you are using a different version of Grafana, the details
    might be slightly different.

In |Dashboards|, click the hamburger menu and select
**Connections > Data sources**.

.. figure:: static/grafana_datasources_menu.png
    :figwidth: 100 %
    :alt: Grafana Data Sources Menu
    :name: GrafanaDataSourcesMenuLoki
    :align: center

Select the Loki data source.

.. figure:: static/grafana_loki_datasource_select.png
    :figwidth: 100 %
    :alt: Grafana Loki Data Source Select
    :name: GrafanaLokiDataSourceSelect
    :align: center

Scroll down and click **Test** to verify that Grafana has connectivity
with the Loki server.

.. figure:: static/grafana_loki_config.png
    :figwidth: 100 %
    :alt: Grafana Loki Config
    :name: GrafanaLokiConfig
    :align: center

If the test passes, the following message is displayed.

.. figure:: static/grafana_loki_pass.png
    :figwidth: 100 %
    :alt: Grafana Loki Pass
    :name: GrafanaLokiPass
    :align: center

If the test fails, the following message is displayed.

.. figure:: static/grafana_loki_fail.png
    :figwidth: 100 %
    :alt: Grafana Loki Fail
    :name: GrafanaLokiFail
    :align: center

If the Loki Data Source connectivity test fails, check the following:

- The Loki Server is running.
- The HTTP URL matches your Loki server URL (including the port).
- Examine the error response to debug the connection.
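You can also probe both servers directly from the host running Grafana,
independent of the Grafana UI. The following is a minimal sketch,
assuming the default ports (9090 for Prometheus, 3100 for Loki) on the
local host:

.. code-block:: console

    # Prometheus health endpoint; expect HTTP 200
    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9090/-/healthy
    200

    # Loki readiness endpoint; expect HTTP 200
    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3100/ready
    200

A connection error or a non-200 code here points to a server or port
problem rather than a Grafana configuration problem.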
Can Collector Service Run on Windows or macOS?
==============================================

|OCS| is only provided as a Linux amd64 Docker image. It runs natively
on Linux, and on Windows or macOS through:

* Docker Desktop (recommended), which provides a Linux VM environment
* WSL2 on Windows, with the Docker engine manually installed
* A Linux VM (e.g., Hyper-V, VirtualBox, VMware)

On ARM hosts (such as Raspberry Pi or Apple Silicon Macs), |OCS| can run
using QEMU emulation for amd64 (see the sketch at the end of this
section). Performance may be lower under emulation.

For more information about supported Docker environments, see the
`Collector Service article <https://hub.docker.com/r/rticom/collector-service>`_
on Docker Hub.
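As an illustration, the following sketch runs the image under amd64
emulation on a Linux ARM64 host such as a Raspberry Pi. The
``tonistiigi/binfmt`` helper image and the ``--platform`` flag are
standard Docker mechanisms, not part of |OBSERVABILITY|; the image tag
is a placeholder for the tag that matches your installation.

.. code-block:: console

    # One-time setup: register QEMU binfmt handlers so amd64 images can run on an ARM host
    $ docker run --privileged --rm tonistiigi/binfmt --install amd64

    # Force the amd64 platform when pulling the Collector Service image
    $ docker pull --platform linux/amd64 rticom/collector-service:<tag>

Docker Desktop on Windows and macOS ships with emulation support built
in, so the ``binfmt`` setup step is only needed on a plain Linux ARM
host.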