.. _section-troubleshooting:

Troubleshooting |OBSERVABILITY_HEADING|
****************************************

This section provides solutions for issues you may run into while
evaluating |OBSERVABILITY|.

.. _section-docker-containers-fail-to-start:

Docker Containers Fail to Start
===============================

The Docker containers used by |OBSERVABILITY| can fail to start for a
variety of reasons. Two common causes are port conflicts and
insufficient file permissions.

To verify the state of these Docker containers, run the Docker command
``docker ps -a``. The example below shows all Docker containers used by
|OBSERVABILITY| started successfully.

.. code-block:: console

    CONTAINER ID   IMAGE                                     COMMAND                  CREATED         STATUS         NAMES
    6651d7ed9810   prom/prometheus:v2.37.5                   "/bin/prometheus --c…"   5 minutes ago   Up 5 minutes   prometheus_observability
    25050d16b1b5   grafana/grafana-enterprise:9.2.1-ubuntu   "/run.sh"                5 minutes ago   Up 5 minutes   grafana_observability
    08611ea9b255   rticom/collector-service:                 "/rti_connext_dds-7.…"   5 minutes ago   Up 5 minutes   collector_service_observability
    55568de5120f   grafana/loki:2.7.0                        "/usr/bin/loki --con…"   5 minutes ago   Up 5 minutes   loki_observability

The example below shows a container that failed to start. The failure is
indicated by ``Restarting`` in the STATUS column. In this example, the
``prometheus_observability`` container failed to start and repeatedly
tried to restart.

.. code-block:: console

    CONTAINER ID   IMAGE                                     COMMAND                  CREATED         STATUS                          NAMES
    08f75e0fadb2   prom/prometheus:v2.37.5                   "/bin/prometheus --c…"   5 minutes ago   Restarting (1) 27 seconds ago   prometheus_observability
    9a3964b561ec   grafana/loki:2.7.0                        "/usr/bin/loki --con…"   5 minutes ago   Up 5 minutes                    loki_observability
    b6a6ffa201f3   rticom/collector-service:                 "/rti_connext_dds-7.…"   5 minutes ago   Up 5 minutes                    collector_service_observability
    26658f76cfdc   grafana/grafana-enterprise:9.2.1-ubuntu   "/run.sh"                5 minutes ago   Up 5 minutes                    grafana_observability

To determine why a container failed, examine its log file. To generate
the log, run the Docker command ``docker logs <container name>``, where
``<container name>`` is the name shown in the NAMES column above.

Check for Port Conflicts
------------------------

Run ``docker logs <container name>`` to generate the logs for the failed
container, then look for a port conflict error. An example of a
Prometheus port conflict is shown below.

.. code-block:: console

    ts=2023-03-14T13:12:29.275Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
    ts=2023-03-14T13:12:29.275Z caller=main.go:786 level=error msg="Unable to start web listener" err="listen tcp 0.0.0.0:9090: bind: address already in use"

If you discover a port conflict, perform the following steps to resolve
the issue:

#. Remove the existing Observability Workspace. See
   :ref:`section-remove-observability-workspace` for details on how to
   remove the workspace.
#. Update the JSON configuration files to configure new ports. See
   :ref:`section-configure-docker` for details on how to update the port
   configuration for the failed container.
#. Run ``<installdir>/bin/rtiobservability -c <config file>`` to recreate
   the Observability Workspace with the new port configuration.
#. Run ``<installdir>/bin/rtiobservability -i`` to create and run the
   Docker containers with the new port configuration.
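Before reconfiguring ports, it can help to identify which process
already holds the conflicting port. The following is a minimal sketch,
assuming a Linux host and the default Prometheus port 9090:

.. code-block:: console

    # List the listening socket and the owning process
    $ sudo ss -tulpn | grep 9090

    # Alternative, if lsof is installed
    $ sudo lsof -i :9090

If the port is held by a service you need to keep, move |OBSERVABILITY|
to a free port as described above; otherwise, stop the conflicting
service and restart the containers.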
Check that You Have the Correct File Permissions
------------------------------------------------

Run ``docker logs <container name>`` to generate the logs for the failed
container, then look for a file permissions error. An example of a file
permissions problem is shown below.

.. code-block:: console

    ts=2023-03-14T22:21:47.666Z caller=main.go:450 level=error msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" file=/etc/prometheus/prometheus.yml err="open /etc/prometheus/prometheus.yml: permission denied"

The Docker containers for |OBSERVABILITY| require the ``other``
permission class to have "read/access" for directories and "read" for
files. To resolve a file permission problem, ensure Linux permissions of
at least:

- 755 (rwxr-xr-x) for directories
- 444 (r--r--r--) for files
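The following is a minimal sketch for inspecting and correcting the
permissions from the command line. The workspace path is a hypothetical
example; substitute the directory that holds your Observability
Workspace:

.. code-block:: console

    # Inspect the current permissions recursively
    $ ls -lR ~/rti_workspace/observability

    # Give "other" read on everything, plus access (execute) on directories
    $ chmod -R o+rX ~/rti_workspace/observability

The capital ``X`` adds the execute bit only to directories (and to files
that are already executable), which matches the requirement above.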
No Data in Dashboards
=====================

Before proceeding, make sure all Docker containers for |OBSERVABILITY|
are running properly (see :ref:`section-docker-containers-fail-to-start`)
and that you have started your applications with |MONITORINGLIBRARY2|
enabled (see :ref:`section-monitoring-library-2`).

Check that |COLLECTORSERVICE_HEADING| has Discovered Your Applications
----------------------------------------------------------------------

#. Run one or more applications configured with |MONITORINGLIBRARY2|.
#. Open a browser to ``<servername>:<port>/metrics``, where
   ``<servername>`` is the server where |OCS| is installed and
   ``<port>`` is the port number for the |OCS| Prometheus Client port
   (19090 is the default).
#. Verify that you have data for the ``dds_domain_participant_presence``
   metric for your application(s), as highlighted below.

.. code-block:: console
    :emphasize-lines: 15,16,17

    # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
    # TYPE exposer_transferred_bytes_total counter
    exposer_transferred_bytes_total 65289
    # HELP exposer_scrapes_total Number of times metrics were scraped
    # TYPE exposer_scrapes_total counter
    exposer_scrapes_total 60
    # HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
    # TYPE exposer_request_latencies summary
    exposer_request_latencies_count 60
    exposer_request_latencies_sum 25681
    exposer_request_latencies{quantile="0.5"} 316
    exposer_request_latencies{quantile="0.9"} 522
    exposer_request_latencies{quantile="0.99"} 728
    # TYPE dds_domain_participant_presence gauge
    dds_domain_participant_presence{guid="AC462E9B.9BB5237C.DBB61B21.80B55CD8",owner_guid="F8824B73.10EBC319.4ACD1E47.9ECB3033",dds_guid="010130C4.C84EFC6D.973810C6.000001C1",domain_id="57",platform="x64Linux4gcc8.5.0",product_version="",name="/applications/SensorSubscriber/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458392"} 1 1678836129957
    dds_domain_participant_presence{guid="291C3B07.34755D99.608E7BF3.1F6546D9",owner_guid="566D1E8D.5D7CBFD4.DD65CC20.C33D56E9",dds_guid="0101416F.425D03B2.8AC75FC8.000001C1",domain_id="57",platform="x64Linux4gcc8.5.0",product_version="",name="/applications/SensorPublisher_2/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458369"} 1 1678836129957
    dds_domain_participant_presence{guid="1D5929EC.4FB3CAE4.300F0DB0.C553A54F",owner_guid="D2FD6E87.D8C03AAA.EABFB1F8.E941495B",dds_guid="0101FBDA.551F142B.619EE527.000001C1",domain_id="57",platform="x64Linux4gcc8.5.0",product_version="",name="/applications/SensorPublisher_1/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458346"} 1 1678836129957

If no metric data is available, you will see only the metric
documentation, with no metric data, as shown below.

.. code-block:: console

    # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
    # TYPE exposer_transferred_bytes_total counter
    exposer_transferred_bytes_total 4017
    # HELP exposer_scrapes_total Number of times metrics were scraped
    # TYPE exposer_scrapes_total counter
    exposer_scrapes_total 4
    # HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
    # TYPE exposer_request_latencies summary
    exposer_request_latencies_count 4
    exposer_request_latencies_sum 2510
    exposer_request_latencies{quantile="0.5"} 564
    exposer_request_latencies{quantile="0.9"} 621
    exposer_request_latencies{quantile="0.99"} 621
    # TYPE dds_domain_participant_presence gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv4_usage_in_net_bytes_max gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv4_usage_out_net_bytes_max gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv6_usage_in_net_bytes_max gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_count gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_min gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_pkts_max gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_period_ms gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_count gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_mean gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_min gauge
    # TYPE dds_domain_participant_udpv6_usage_out_net_bytes_max gauge

If you see metric documentation lines only, verify that your
applications are configured to use the same Observability domain as
|OCS| (2 is the default).

If your applications run on a machine other than the one hosting |OCA|,
ensure that ``collector_initial_peers`` in the |MONITORINGLIBRARY2|
configuration of each application is set to the IP address of the host
where |OCA| is running. For more information on configuring
|MONITORINGLIBRARY2| for your application, see
:ref:`section-monitoring-library-2`.
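You can also perform this check from the command line instead of a
browser. The following is a minimal sketch, assuming |OCS| runs on the
local host with the default Prometheus Client port 19090:

.. code-block:: console

    $ curl -s http://localhost:19090/metrics | grep dds_domain_participant_presence

If the command returns only the
``# TYPE dds_domain_participant_presence gauge`` documentation line,
|OCS| has not discovered any applications yet.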
Check that Prometheus can Access |COLLECTORSERVICE_HEADING|
-----------------------------------------------------------

Open a browser to ``<servername>:<port>``, where ``<servername>`` is the
server where Prometheus is installed and ``<port>`` is the port number
for the Prometheus Server (9090 is the default). Select the
**Status > Targets** menu to view the configured targets, as shown below.

.. figure:: static/prometheus_target_selection.png
    :figwidth: 90 %
    :alt: Prometheus Target Selection
    :name: PrometheusTargetSelection
    :align: center

A Prometheus Server with all healthy targets is shown below.

.. figure:: static/prometheus_targets_healthy.png
    :figwidth: 100 %
    :alt: Prometheus Targets Healthy
    :name: PrometheusTargetsHealthy
    :align: center

A Prometheus Server with an unhealthy |COLLECTORSERVICE| is shown below.
Note the ``DOWN`` indication for the state of the ``dds`` target.

.. figure:: static/prometheus_targets_unhealthy.png
    :figwidth: 100 %
    :alt: Prometheus Targets Unhealthy
    :name: PrometheusTargetsUnhealthy
    :align: center

If |COLLECTORSERVICE| is shown as ``DOWN``, check the following:

- |COLLECTORSERVICE| is running.
- The ``Endpoint`` URL for |COLLECTORSERVICE| is correct (including the
  port).
- Examine the ``Error`` field to see if another cause is being reported.

Check that Grafana can Access Prometheus
----------------------------------------

.. note::

    These steps can only be performed as a Grafana Admin user. The
    Grafana images in this section were generated with Grafana version
    10.1.4. If you are using a different version of Grafana, the details
    might be slightly different.

In |Dashboards|, click the hamburger menu and select
**Connections > Data sources**.

.. figure:: static/grafana_datasources_menu.png
    :figwidth: 100 %
    :alt: Grafana Data Sources Menu
    :name: GrafanaDataSourcesMenuPrometheus
    :align: center

Select the Prometheus data source.

.. figure:: static/grafana_prometheus_datasource_select.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Data Source Select
    :name: GrafanaPrometheusDataSourceSelect
    :align: center

Scroll down and click **Test** to verify that Grafana has connectivity
with the Prometheus server.

.. figure:: static/grafana_prometheus_config.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Config
    :name: GrafanaPrometheusConfig
    :align: center

If the test passes, the following message is displayed.

.. figure:: static/grafana_prometheus_pass.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Pass
    :name: GrafanaPrometheusPass
    :align: center

If the test fails, the following message is displayed.

.. figure:: static/grafana_prometheus_fail.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Fail
    :name: GrafanaPrometheusFail
    :align: center

If the Prometheus Data Source connectivity test fails, check the
following:

- The Prometheus Server is running.
- The HTTP URL matches your Prometheus server URL (including the port).
- Examine the error response to debug the connection.

Check that Grafana can Access Loki
----------------------------------

.. note::

    These steps can only be performed as a Grafana Admin user. The
    Grafana images in this section were generated with Grafana version
    10.1.4. If you are using a different version of Grafana, the details
    might be slightly different.

In |Dashboards|, click the hamburger menu and select
**Connections > Data sources**.

.. figure:: static/grafana_datasources_menu.png
    :figwidth: 100 %
    :alt: Grafana Data Sources Menu
    :name: GrafanaDataSourcesMenuLoki
    :align: center

Select the Loki data source.

.. figure:: static/grafana_loki_datasource_select.png
    :figwidth: 100 %
    :alt: Grafana Loki Data Source Select
    :name: GrafanaLokiDataSourceSelect
    :align: center

Scroll down and click **Test** to verify that Grafana has connectivity
with the Loki server.

.. figure:: static/grafana_loki_config.png
    :figwidth: 100 %
    :alt: Grafana Loki Config
    :name: GrafanaLokiConfig
    :align: center

If the test passes, the following message is displayed.

.. figure:: static/grafana_loki_pass.png
    :figwidth: 100 %
    :alt: Grafana Loki Pass
    :name: GrafanaLokiPass
    :align: center

If the test fails, the following message is displayed.

.. figure:: static/grafana_loki_fail.png
    :figwidth: 100 %
    :alt: Grafana Loki Fail
    :name: GrafanaLokiFail
    :align: center

If the Loki Data Source connectivity test fails, check the following:

- The Loki Server is running.
- The HTTP URL matches your Loki server URL (including the port).
- Examine the error response to debug the connection.
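You can also probe both servers directly from the host running Grafana,
independent of the Grafana UI. The following is a minimal sketch,
assuming the default ports (9090 for Prometheus, 3100 for Loki) on the
local host:

.. code-block:: console

    # Prometheus health endpoint; expect HTTP 200
    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9090/-/healthy
    200

    # Loki readiness endpoint; expect HTTP 200
    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3100/ready
    200

A connection error or a non-200 code here points to a server or port
problem rather than a Grafana configuration problem.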
Can Collector Service Run on Windows or macOS?
==============================================

|OCS| is only provided as a Linux amd64 Docker image. It runs natively
on Linux, and on Windows or macOS through:

* Docker Desktop (recommended), which provides a Linux VM environment
* WSL2 on Windows, with the Docker engine manually installed
* A Linux VM (e.g., Hyper-V, VirtualBox, VMware)

On ARM hosts (such as Raspberry Pi or Apple Silicon Macs), |OCS| can run
using QEMU emulation for amd64 (see the sketch at the end of this
section). Performance may be lower under emulation.

For more information about supported Docker environments, see the
`Collector Service article <https://hub.docker.com/r/rticom/collector-service>`_
on Docker Hub.
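As an illustration, the following sketch runs the image under amd64
emulation on a Linux ARM64 host such as a Raspberry Pi. The
``tonistiigi/binfmt`` helper image and the ``--platform`` flag are
standard Docker mechanisms, not part of |OBSERVABILITY|; the image tag
is a placeholder for the tag that matches your installation.

.. code-block:: console

    # One-time setup: register QEMU binfmt handlers so amd64 images can run on an ARM host
    $ docker run --privileged --rm tonistiigi/binfmt --install amd64

    # Force the amd64 platform when pulling the Collector Service image
    $ docker pull --platform linux/amd64 rticom/collector-service:<tag>

Docker Desktop on Windows and macOS ships with emulation support built
in, so the ``binfmt`` setup step is only needed on a plain Linux ARM
host.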