.. include:: /../getting_started/vars.rst

.. _section-troubleshooting:

Troubleshooting Observability Framework
***************************************

This section provides solutions for issues you may run into while evaluating
*Observability Framework*.

.. _section-docker-containers-fail-to-start:

Docker Containers Fail to Start
===============================

The Docker containers used by *Observability Framework* can fail to start for a
variety of reasons. Two common reasons are port conflicts and incorrect file
permissions.

To verify the state of these Docker containers, run the Docker command
``docker ps -a``. An example in which all Docker containers used by
*Observability Framework* have started successfully is shown below.

.. code-block:: console

    CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS         NAMES
    6651d7ed9810   prom/prometheus:v2.37.5                    "/bin/prometheus --c…"   5 minutes ago   Up 5 minutes   prometheus_observability
    25050d16b1b5   grafana/grafana-enterprise:9.2.1-ubuntu    "/run.sh"                5 minutes ago   Up 5 minutes   grafana_observability
    08611ea9b255   rticom/collector-service:                  "/rti_connext_dds-7.…"   5 minutes ago   Up 5 minutes   collector_service_observability
    55568de5120f   grafana/loki:2.7.0                         "/usr/bin/loki --con…"   5 minutes ago   Up 5 minutes   loki_observability

An example that shows a container that has failed to start is shown below. The
failure is indicated by the ``Restarting`` note in the STATUS column. In this
example, the ``prometheus_observability`` container failed to start and
repeatedly tried to restart.

.. code-block:: console

    CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS                          NAMES
    08f75e0fadb2   prom/prometheus:v2.37.5                    "/bin/prometheus --c…"   5 minutes ago   Restarting (1) 27 seconds ago   prometheus_observability
    9a3964b561ec   grafana/loki:2.7.0                         "/usr/bin/loki --con…"   5 minutes ago   Up 5 minutes                    loki_observability
    b6a6ffa201f3   rticom/collector-service:                  "/rti_connext_dds-7.…"   5 minutes ago   Up 5 minutes                    collector_service_observability
    26658f76cfdc   grafana/grafana-enterprise:9.2.1-ubuntu    "/run.sh"                5 minutes ago   Up 5 minutes                    grafana_observability

To determine why a container failed, examine its log. To generate the log, run
the Docker command ``docker logs <container_name>``, where ``<container_name>``
is the name shown in the NAMES column above.

Check for Port Conflicts
------------------------

Run ``docker logs <container_name>`` to generate the logs for the failed
container, then look for a port conflict error. An example of a Prometheus port
conflict is shown below.

.. code-block:: console

    ts=2023-03-14T13:12:29.275Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
    ts=2023-03-14T13:12:29.275Z caller=main.go:786 level=error msg="Unable to start web listener" err="listen tcp 0.0.0.0:9090: bind: address already in use"

If you discover a port conflict, perform the following steps to resolve the
issue. (A command sketch for identifying the process that holds the conflicting
port follows this list.)

#. Remove the existing Observability Workspace. See
   :ref:`section-remove-observability-workspace` for details on how to remove
   the workspace.
#. Update the JSON configuration files to configure the ports. See
   :ref:`section-configure-docker` for details on how to update the port
   configuration for the failed container.
#. Run ``<installdir>/bin/rtiobservability -c <configuration_file>`` to
   recreate the Observability Workspace with the new port configuration.
#. Run ``<installdir>/bin/rtiobservability -i`` to create and run the Docker
   containers with the new port configuration.
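Before updating the port configuration, it can help to confirm which process
currently holds the conflicting port. The sketch below uses the standard Linux
tools ``ss`` and ``lsof``; port 9090 is an assumption taken from the Prometheus
log example above, so substitute the port reported in your container log.

.. code-block:: console

    # Show the process (if any) listening on TCP port 9090.
    # 9090 is only an example; use the port from the container log.
    sudo ss -ltnp | grep ':9090'

    # Alternative, if lsof is installed on your system.
    sudo lsof -i :9090

If the port is held by a process you control, you can either stop that process
or move *Observability Framework* to a free port using the steps above.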
Check that You Have the Correct File Permissions
------------------------------------------------

Run ``docker logs <container_name>`` to generate the logs for the failed
container, then look for a file permissions error. An example of a file
permissions problem is shown below.

.. code-block:: console

    ts=2023-03-14T22:21:47.666Z caller=main.go:450 level=error msg="Error loading config (--config.file=/etc/prometheus/prometheus.yml)" file=/etc/prometheus/prometheus.yml err="open /etc/prometheus/prometheus.yml: permission denied"

The Docker containers used by *Observability Framework* require that the
``other`` permission class has "read/access" permission on directories and
"read" permission on files. To resolve a file permission problem, ensure Linux
permissions of at least:

- 755 (rwxr-xr-x) for directories
- 444 (r--r--r--) for files
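One way to apply these permissions is a recursive ``chmod``. The following is a
minimal sketch; ``<workspace_dir>`` is a placeholder (not defined elsewhere in
this section) for the directory that contains your Observability Workspace
files.

.. code-block:: console

    # Recursively grant the "other" permission class read access to files and
    # read/access to directories under the workspace. The capital X adds
    # execute (access) only to directories and to files that are already
    # executable.
    chmod -R o+rX <workspace_dir>

    # Verify the result; directories should show at least 755 (rwxr-xr-x)
    # and files at least 444 (r--r--r--).
    ls -lR <workspace_dir>

After correcting the permissions, verify with ``docker ps -a`` that the
container reaches the ``Up`` state.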
No Data in Dashboards
=====================

Before proceeding, make sure all Docker containers for *Observability
Framework* are running properly (see
:ref:`section-docker-containers-fail-to-start`) and that you have started your
applications with *Monitoring Library 2.0* enabled (see
:ref:`section-monitoring-library-2`).

Check that Collector Service has Discovered Your Applications
--------------------------------------------------------------

#. Run one or more applications configured with *Monitoring Library 2.0*.
#. Open a browser to ``<servername>:<port>/metrics``, where ``<servername>`` is
   the server where *Observability Collector Service* is installed and
   ``<port>`` is the *Observability Collector Service* Prometheus Client port
   (19090 is the default). A command-line alternative is sketched at the end of
   this subsection.
#. Verify that you have data for the ``dds_participant_presence`` metric for
   your application(s), as highlighted below.

.. code-block:: console
    :emphasize-lines: 15,16,17

    # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
    # TYPE exposer_transferred_bytes_total counter
    exposer_transferred_bytes_total 65289
    # HELP exposer_scrapes_total Number of times metrics were scraped
    # TYPE exposer_scrapes_total counter
    exposer_scrapes_total 60
    # HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
    # TYPE exposer_request_latencies summary
    exposer_request_latencies_count 60
    exposer_request_latencies_sum 25681
    exposer_request_latencies{quantile="0.5"} 316
    exposer_request_latencies{quantile="0.9"} 522
    exposer_request_latencies{quantile="0.99"} 728
    # TYPE dds_participant_presence gauge
    dds_participant_presence{guid="AC462E9B.9BB5237C.DBB61B21.80B55CD8",owner_guid="F8824B73.10EBC319.4ACD1E47.9ECB3033",dds_guid="010130C4.C84EFC6D.973810C6.000001C1",domain_id="57",platform="x64Linux4gcc7.3.0",product_version="",name="/applications/SensorSubscriber/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458392"} 1 1678836129957
    dds_participant_presence{guid="291C3B07.34755D99.608E7BF3.1F6546D9",owner_guid="566D1E8D.5D7CBFD4.DD65CC20.C33D56E9",dds_guid="0101416F.425D03B2.8AC75FC8.000001C1",domain_id="57",platform="x64Linux4gcc7.3.0",product_version="",name="/applications/SensorPublisher_2/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458369"} 1 1678836129957
    dds_participant_presence{guid="1D5929EC.4FB3CAE4.300F0DB0.C553A54F",owner_guid="D2FD6E87.D8C03AAA.EABFB1F8.E941495B",dds_guid="0101FBDA.551F142B.619EE527.000001C1",domain_id="57",platform="x64Linux4gcc7.3.0",product_version="",name="/applications/SensorPublisher_1/domain_participants/Temperature DomainParticipant",hostname="presanella",process_id="458346"} 1 1678836129957

If there is no metric data available, you will see output like the example
below, which contains metric documentation only and no metric data.

.. code-block:: console

    # HELP exposer_transferred_bytes_total Transferred bytes to metrics services
    # TYPE exposer_transferred_bytes_total counter
    exposer_transferred_bytes_total 4017
    # HELP exposer_scrapes_total Number of times metrics were scraped
    # TYPE exposer_scrapes_total counter
    exposer_scrapes_total 4
    # HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
    # TYPE exposer_request_latencies summary
    exposer_request_latencies_count 4
    exposer_request_latencies_sum 2510
    exposer_request_latencies{quantile="0.5"} 564
    exposer_request_latencies{quantile="0.9"} 621
    exposer_request_latencies{quantile="0.99"} 621
    # TYPE dds_participant_presence gauge
    # TYPE dds_participant_udpv4_usage_in_net_pkts_period_ms gauge
    # TYPE dds_participant_udpv4_usage_in_net_pkts_count gauge
    # TYPE dds_participant_udpv4_usage_in_net_pkts_mean gauge
    # TYPE dds_participant_udpv4_usage_in_net_pkts_min gauge
    # TYPE dds_participant_udpv4_usage_in_net_pkts_max gauge
    # TYPE dds_participant_udpv4_usage_in_net_bytes_period_ms gauge
    # TYPE dds_participant_udpv4_usage_in_net_bytes_count gauge
    # TYPE dds_participant_udpv4_usage_in_net_bytes_mean gauge
    # TYPE dds_participant_udpv4_usage_in_net_bytes_min gauge
    # TYPE dds_participant_udpv4_usage_in_net_bytes_max gauge
    # TYPE dds_participant_udpv4_usage_out_net_pkts_period_ms gauge
    # TYPE dds_participant_udpv4_usage_out_net_pkts_count gauge
    # TYPE dds_participant_udpv4_usage_out_net_pkts_mean gauge
    # TYPE dds_participant_udpv4_usage_out_net_pkts_min gauge
    # TYPE dds_participant_udpv4_usage_out_net_pkts_max gauge
    # TYPE dds_participant_udpv4_usage_out_net_bytes_period_ms gauge
    # TYPE dds_participant_udpv4_usage_out_net_bytes_count gauge
    # TYPE dds_participant_udpv4_usage_out_net_bytes_mean gauge
    # TYPE dds_participant_udpv4_usage_out_net_bytes_min gauge
    # TYPE dds_participant_udpv4_usage_out_net_bytes_max gauge
    # TYPE dds_participant_udpv6_usage_in_net_pkts_period_ms gauge
    # TYPE dds_participant_udpv6_usage_in_net_pkts_count gauge
    # TYPE dds_participant_udpv6_usage_in_net_pkts_mean gauge
    # TYPE dds_participant_udpv6_usage_in_net_pkts_min gauge
    # TYPE dds_participant_udpv6_usage_in_net_pkts_max gauge
    # TYPE dds_participant_udpv6_usage_in_net_bytes_period_ms gauge
    # TYPE dds_participant_udpv6_usage_in_net_bytes_count gauge
    # TYPE dds_participant_udpv6_usage_in_net_bytes_mean gauge
    # TYPE dds_participant_udpv6_usage_in_net_bytes_min gauge
    # TYPE dds_participant_udpv6_usage_in_net_bytes_max gauge
    # TYPE dds_participant_udpv6_usage_out_net_pkts_period_ms gauge
    # TYPE dds_participant_udpv6_usage_out_net_pkts_count gauge
    # TYPE dds_participant_udpv6_usage_out_net_pkts_mean gauge
    # TYPE dds_participant_udpv6_usage_out_net_pkts_min gauge
    # TYPE dds_participant_udpv6_usage_out_net_pkts_max gauge
    # TYPE dds_participant_udpv6_usage_out_net_bytes_period_ms gauge
    # TYPE dds_participant_udpv6_usage_out_net_bytes_count gauge
    # TYPE dds_participant_udpv6_usage_out_net_bytes_mean gauge
    # TYPE dds_participant_udpv6_usage_out_net_bytes_min gauge
    # TYPE dds_participant_udpv6_usage_out_net_bytes_max gauge

If you see only metric documentation lines, verify that your applications are
configured to use the same Observability domain as *Observability Collector
Service* (2 is the default). If your applications run on a machine other than
the one hosting *Observability Collector Service*, also ensure that the initial
peers list is configured with the IP address of the machine where
*Observability Collector Service* is running. For more information on
configuring *Monitoring Library 2.0* for your application, see
:ref:`section-monitoring-library-2`.
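As an alternative to a browser, you can query the same metrics endpoint from a
shell. The sketch below assumes the default Prometheus Client port (19090) and
uses ``grep`` to isolate the ``dds_participant_presence`` metric;
``<servername>`` is the host running *Observability Collector Service*.

.. code-block:: console

    # Fetch the Collector Service metrics page and keep only the
    # dds_participant_presence lines. When applications have been discovered,
    # you should see one data line per DomainParticipant; otherwise only the
    # "# TYPE" documentation line is returned.
    curl -s http://<servername>:19090/metrics | grep dds_participant_presence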
Check that Prometheus can Access Collector Service
--------------------------------------------------

Open a browser to ``<servername>:<port>``, where ``<servername>`` is the server
where Prometheus is installed and ``<port>`` is the port number for the
Prometheus Server (9090 is the default). Select the **Status > Targets** menu
to view the configured targets, as shown below.

.. figure:: static/prometheus_target_selection.png
    :figwidth: 90 %
    :alt: Prometheus Target Selection
    :name: PrometheusTargetSelection
    :align: center

A Prometheus Server with all targets healthy is shown below.

.. figure:: static/prometheus_targets_healthy.png
    :figwidth: 100 %
    :alt: Prometheus Targets Healthy
    :name: PrometheusTargetsHealthy
    :align: center

A Prometheus Server with an unhealthy *Collector Service* is shown below. Note
the ``DOWN`` indication for the state of the ``dds`` target.

.. figure:: static/prometheus_targets_unhealthy.png
    :figwidth: 100 %
    :alt: Prometheus Targets Unhealthy
    :name: PrometheusTargetsUnhealthy
    :align: center

If *Collector Service* is shown as ``DOWN``, check the following:

- *Collector Service* is running.
- The ``Endpoint`` URL for *Collector Service* is correct (including the port).
- Examine the ``Error`` column to see whether another cause is reported.

Check that Grafana can Access Prometheus
----------------------------------------

.. note:: These steps can only be performed as a Grafana Admin user.

.. note:: The Grafana images in this section were generated with Grafana
          version 9.2.1. If you are using a different version of Grafana, the
          details might be slightly different.

Open the Grafana Data Sources Configuration page.

.. figure:: static/grafana_datasources_menu.png
    :figwidth: 100 %
    :alt: Grafana Data Sources Menu
    :name: GrafanaDataSourcesMenu
    :align: center

Select the “Prometheus” data source.

.. figure:: static/grafana_prometheus_datasource_select.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Data Source Select
    :name: GrafanaPrometheusDataSourceSelect
    :align: center

Scroll down and click **Test** to ensure that Grafana has connectivity with the
Prometheus server.

.. figure:: static/grafana_prometheus_config.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Config
    :name: GrafanaPrometheusConfig
    :align: center

A Prometheus Data Source connectivity test that passes is shown below.

.. figure:: static/grafana_prometheus_pass.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Pass
    :name: GrafanaPrometheusPass
    :align: center

A Prometheus Data Source connectivity test that fails is shown below.

.. figure:: static/grafana_prometheus_fail.png
    :figwidth: 100 %
    :alt: Grafana Prometheus Fail
    :name: GrafanaPrometheusFail
    :align: center

If the Prometheus Data Source connectivity test fails, check the following:

- The Prometheus Server is running.
- The HTTP URL matches your Prometheus server URL (including the port).
- Examine the error response to debug the connection.
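If the connectivity test fails, it can also help to check reachability from
outside the Grafana UI. The sketch below is an optional command-line check that
assumes the default Prometheus Server port (9090) and uses two standard
Prometheus HTTP endpoints; run it from the machine where Grafana is running to
separate network problems from data source misconfiguration.

.. code-block:: console

    # Basic health check of the Prometheus server.
    curl -s http://<servername>:9090/-/healthy

    # List the scrape targets and their health, including the "dds" target
    # used for Collector Service.
    curl -s http://<servername>:9090/api/v1/targets

If these commands fail from the Grafana host but succeed elsewhere, the problem
is likely network connectivity or the HTTP URL configured in the Grafana data
source.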