Tool for gathering statistics on DDS traffic?


I'm interested in gathering statistics on our DDS traffic so I can determine whether a change to our QoS has any demonstrable effect. Looking through the current Connext toolset, I don't find anything that exactly matches our needs. I'm less interested in inspecting the data within the samples than in the data about the network itself.

For example:

  1. Statistics for given topics:
    1. Sample rate over time (chart)
    2. Samples lost over time (chart)
    3. Sample timing jitter over time (chart)
    4. Average sample rate over collection period
    5. Min/max sample rate over collection period
    6. Standard deviation in sample rate over collection period
  2. The statistics in #1 but by data writer, broken out by topic or aggregated.

Is there anything like that available, or will I have to use recorder and then examine the database on my own?

Thanks!

---Jason


Hey Jason,

1) This topic is (in part) covered in a post I made ( https://community.rti.com/forum-topic/how-obtain-full-view-system ).

2) I'm unsure what "sample timing jitter over time" means.

3) Among other things, RTI has an API for checking the number of samples lost and the number of samples that arrived.

4) RTI Monitoring Library (which I do not recommend, for reasons I explain below) allows you to publish the above-mentioned data (and much more) periodically.

If you choose to use RTI Monitoring Library (by enabling it in QoS or programmatically), you can either use RTI Monitoring Service (and, to some extent, other RTI tools) or write your own monitoring tool to view the monitoring data.

One issue I've encountered with RTI Monitoring is that it seems to create STW (stop-the-world) pauses for all RTI entities: if you have a lot of different RTI entities, all of them are stalled whenever the monitoring data is sent. In one application with a lot of entities, this introduced a 500 ms stall that recurred periodically.

5) For monitoring the data you described I would recommend doing the following:

5.1) Attach listeners to all of your readers (if you already use listeners for other purposes, let all of them extend a shared listener implementation) and use the listener to capture on_data_available (triggered when new samples are received) and on_sample_lost (triggered when RTI identifies that a sample was lost).

Depending on your programming language, there are different libraries I would recommend for capturing the data (I don't recommend implementing this infrastructure on your own).

For Java, I use Dropwizard Metrics to capture metrics (for example, counters) and publish them to a remote database periodically.

5.2) If you can, I would recommend wrapping every read (or take) operation with some code that captures the number of samples read.

5.3) If you use Dropwizard Metrics (or a similar library), you will need to choose where to report the data. Personally I went with InfluxDB, a time-series database that has good support for tags (useful for separating data per topic or per writer) and is well supported by my visualization solution, Grafana.

5.4) You can separate the data per writer and per topic using a naming format. If you use Dropwizard Metrics, there are a few workarounds that allow you to use tags on a per-metric basis.
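As a rough illustration of the listener, read/take wrapper, and naming-format steps above, here is a minimal sketch that uses only the Java standard library in place of a metrics library. Every name in it (`ReaderStats`, `onDataAvailable`, `onSampleLost`, `countedTake`, the `dds.<topic>.<writer>.<measurement>` format) is hypothetical and not part of the RTI or Dropwizard APIs; the two callback methods are assumed to be invoked from your shared listener's on_data_available and on_sample_lost handlers.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical helper: thread-safe counters keyed by a dotted metric name.
class ReaderStats {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Per-topic name, e.g. "dds.Temperature.samples_read".
    static String name(String topic, String measurement) {
        return String.join(".", "dds", clean(topic), measurement);
    }

    // Per-topic, per-writer name, e.g. "dds.Temperature.writerA.samples_read".
    static String name(String topic, String writer, String measurement) {
        return String.join(".", "dds", clean(topic), clean(writer), measurement);
    }

    // Replace dots so a topic or writer name cannot add extra name segments.
    private static String clean(String part) {
        return part.replace('.', '_');
    }

    void add(String metricName, long n) {
        counters.computeIfAbsent(metricName, k -> new LongAdder()).add(n);
    }

    long count(String metricName) {
        LongAdder adder = counters.get(metricName);
        return adder == null ? 0 : adder.sum();
    }

    // Assumed to be called from the shared listener's on_data_available.
    // This counts callbacks, not samples: one callback may cover several samples.
    void onDataAvailable(String topic) {
        add(name(topic, "data_available_callbacks"), 1);
    }

    // Assumed to be called from on_sample_lost with the newly lost sample count.
    void onSampleLost(String topic, long newlyLost) {
        add(name(topic, "samples_lost"), newlyLost);
    }

    // Wraps a read/take call and records how many samples it returned.
    <T> List<T> countedTake(String topic, Supplier<List<T>> takeCall) {
        List<T> samples = takeCall.get();
        add(name(topic, "samples_read"), samples.size());
        return samples;
    }
}
```

With a real metrics library, the map of `LongAdder`s would be replaced by that library's counters, and a reporter would ship them to the database on a schedule.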

 

TL;DR:

Other than the metric I'm unsure about (jitter), the rest can be covered by wrapping read/take code and using listeners.

Capturing (optionally, with tag separation) and reporting metrics can be done using libraries; for Java I would recommend Dropwizard Metrics.

I would recommend reporting to InfluxDB, but there are other good options as well.

Once you've reported the samples lost and samples read per reporting period, you can use a visualization tool such as Grafana (or Kibana) to present this data in various forms (including rate).

 

p.s.

If you are only interested in the rate at which events happened, you may find the Meter metric to be a better fit than the Counter metric (as the Meter will spare you the data manipulation in the visualization tool).
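To show the difference, here is a minimal sketch of a meter-like object that reports a rate directly instead of a raw count. It is not Dropwizard's implementation (Dropwizard's Meter also tracks moving-average rates, which this omits); `MeanRateMeter` and its injectable nanosecond clock are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

// Hypothetical meter: counts events and reports the mean rate since creation.
class MeanRateMeter {
    private final LongAdder count = new LongAdder();
    private final LongSupplier nanoClock;
    private final long startNanos;

    // Pass System::nanoTime in production; a fake clock makes testing easy.
    MeanRateMeter(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
    }

    // Record n events (for example, n samples returned by one take).
    void mark(long n) {
        count.add(n);
    }

    // Events per second since construction; 0 if no time has elapsed yet.
    double meanRatePerSecond() {
        long elapsedNanos = nanoClock.getAsLong() - startNanos;
        return elapsedNanos <= 0 ? 0.0 : count.sum() * 1e9 / elapsedNanos;
    }
}
```

With a plain counter, the same division by elapsed time would have to be done in the visualization tool's query.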

p.s. 2

I would recommend steering clear of averages and standard deviations when examining networking performance. Both are most meaningful for normally distributed data, and (you can look it up) extensive research shows that latency and networking performance measurements have little to do with the normal distribution.

Instead I would recommend looking at histograms.
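As a minimal sketch of what that could look like, the code below turns a series of sample arrival timestamps into inter-arrival gaps, then summarizes them with nearest-rank percentiles and a fixed-width histogram. The class and method names are hypothetical, and using inter-arrival gaps as the input is an assumption made for the example; a metrics library's own histogram type would normally do this for you.

```java
import java.util.Arrays;

// Hypothetical helper for summarizing timing data without mean/stddev.
class InterArrivalSummary {
    // Gaps between consecutive arrival timestamps (same unit as the input).
    static long[] gaps(long[] arrivalTimes) {
        long[] out = new long[Math.max(0, arrivalTimes.length - 1)];
        for (int i = 1; i < arrivalTimes.length; i++) {
            out[i - 1] = arrivalTimes[i] - arrivalTimes[i - 1];
        }
        return out;
    }

    // Nearest-rank percentile for p in (0, 100].
    static long percentile(long[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Counts values per fixed-width bucket; values beyond the last bucket
    // are placed in the last bucket, negative values in the first.
    static int[] histogram(long[] values, long bucketWidth, int buckets) {
        int[] counts = new int[buckets];
        for (long v : values) {
            int idx = (int) Math.min(buckets - 1, Math.max(0, v / bucketWidth));
            counts[idx]++;
        }
        return counts;
    }
}
```

For arrivals at 0, 10, 20, 30, 40 and 140 ms, the gaps are 10, 10, 10, 10 and 100 ms. The mean gap is 28 ms, a value that never actually occurred, whereas the median (10 ms) and the 99th percentile (100 ms) describe both the typical case and the outlier.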

 

I hope this helps,

Roy.