Tool for gathering statistics on DDS traffic?

Fri, 02/02/2018 - 14:50

#1

jasontiller2

I'm interested in gathering statistics on our DDS traffic so I can determine whether a change to our QoS has any demonstrable effect. In examining the current Connext toolset, I don't find anything that is an exact match for our needs. I'm less interested in inspecting the data within the samples than in data about the network itself.

For example:

1. Statistics for given topics:

   Sample rate over time (chart)
   Samples lost over time (chart)
   Sample timing jitter over time (chart)
   Average sample rate over collection period
   Min/max sample rate over collection period
   Standard deviation in sample rate over collection period

2. The statistics in #1 but by data writer, broken out by topic or aggregated.

Is there anything like that available, or will I have to use Recorder and then examine the database on my own?

Thanks!

---Jason

Sat, 02/03/2018 - 03:25

#2

KickR

Hey Jason,

1) This topic is (in part) covered in a post I made ( https://community.rti.com/forum-topic/how-obtain-full-view-system ).

2) I'm unsure what "sample timing jitter over time" means.

3) Among other things, RTI has an API for checking the number of samples lost and the number of samples that arrived.

4) RTI Monitoring Library (which I do not recommend; I explain why below) allows you to periodically publish the above-mentioned data (and much more).

If you choose to use RTI Monitoring Library (by enabling it in QoS or programmatically), you can either use RTI Monitoring Service (and, to some extent, other RTI tools) or write your own monitoring tool to view this monitoring data.

One issue I've encountered with RTI Monitoring is that it seems to create STW (stop-the-world) pauses for all RTI entities (so if you have a lot of different RTI entities, all of them will be stalled whenever the periodic data is sent). In my case this introduced a periodic 500 ms stall in one application that has a lot of entities.

5) For monitoring the data you described I would recommend doing the following:

5.1) Attach listeners to all of your readers (if you are already using listeners for other purposes, have all of them extend a shared listener implementation) and use the listener to capture on_data_available (triggered when new samples are received) and on_sample_lost (triggered when RTI detects that a sample was lost).

Depending on your programming language, there are different libraries I would recommend for capturing the data (I don't recommend implementing this infrastructure on your own).

For Java, I use Dropwizard Metrics to capture metrics (for example, counters) and publish them to a remote database periodically.
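As a rough illustration of the shared-listener idea in 5.1, here is a minimal sketch using only the Java standard library. The class `TopicCounters` and its method names are hypothetical stand-ins, not RTI or Dropwizard types; the assumption is that your shared listener would call `recordReceived` from its on_data_available callback and `recordLost` from its on_sample_lost callback (for example with the change in the lost-sample count reported by the status). In a real setup a metrics library would replace the maps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper shared by all reader listeners; not an RTI type.
class TopicCounters {
    private final Map<String, LongAdder> received = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> lost = new ConcurrentHashMap<>();

    // Intended to be called from on_data_available with the number of samples taken.
    void recordReceived(String topic, long samples) {
        received.computeIfAbsent(topic, t -> new LongAdder()).add(samples);
    }

    // Intended to be called from on_sample_lost with the number of newly lost samples.
    void recordLost(String topic, long samples) {
        lost.computeIfAbsent(topic, t -> new LongAdder()).add(samples);
    }

    // Total samples received so far for a topic (0 if the topic was never seen).
    long received(String topic) {
        LongAdder adder = received.get(topic);
        return adder == null ? 0 : adder.sum();
    }

    // Total samples lost so far for a topic (0 if the topic was never seen).
    long lost(String topic) {
        LongAdder adder = lost.get(topic);
        return adder == null ? 0 : adder.sum();
    }
}
```

`LongAdder` is used here because listener callbacks may run on several threads at once; a reporter thread can read the sums periodically without blocking the callbacks.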

5.2) If you can, I would recommend wrapping every read (or take) operation with some code that captures the number of samples read.
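One possible shape for such a wrapper is sketched below. `CountingReader` is a hypothetical name, and the `Supplier` is only a placeholder for whatever read/take call your application actually makes; the sketch assumes that call returns the batch of samples as a `List`.

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical wrapper; the Supplier stands in for the real read/take call.
class CountingReader<T> {
    private final Supplier<List<T>> takeCall;
    private final LongAdder samplesRead = new LongAdder();

    CountingReader(Supplier<List<T>> takeCall) {
        this.takeCall = takeCall;
    }

    // Performs the underlying take and records how many samples it returned.
    List<T> take() {
        List<T> batch = takeCall.get();
        samplesRead.add(batch.size());
        return batch;
    }

    // Total number of samples returned by all take() calls so far.
    long samplesRead() {
        return samplesRead.sum();
    }
}
```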

5.3) If you use Dropwizard Metrics (or a similar library), you will have to choose where to report the data. Personally I went with InfluxDB, a time-series database that has good support for tags (useful for separation per topic / per writer) and is well supported by my visualization solution, Grafana.

5.4) You can separate the data per writer per topic using a naming format. If you use Dropwizard Metrics, there are a few workarounds that allow you to use tags on a per-metric basis.
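To make the naming-format idea concrete, here is a small sketch that builds a dotted metric name from topic, writer and metric. The `dds.` prefix, the field order and the `MetricNames` class are all assumptions for illustration; any scheme works as long as the reporting side can split the name back into its parts.

```java
// Hypothetical naming helper; the "dds.<topic>.<writer>.<metric>" layout is an assumption.
class MetricNames {
    // Joins the parts with dots after cleaning each one.
    static String of(String topic, String writer, String metric) {
        return String.join(".", "dds", clean(topic), clean(writer), clean(metric));
    }

    // Replaces dots and spaces so a part cannot introduce extra name levels.
    private static String clean(String part) {
        return part.trim().replace('.', '_').replace(' ', '_');
    }
}
```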

TL;DR:

Other than the metric I'm unsure about (jitter), the rest can be covered by wrapping read/take code and using listeners.

Capturing (optionally, with tag separation) and reporting metrics can be done using libraries; for Java I would recommend Dropwizard Metrics.

I would recommend reporting to InfluxDB, but there are other good options as well.

Once you've reported the samples lost and samples read per reporting period, you can use a visualization tool such as Grafana (or Kibana) to present this data in various forms (including rate).

p.s.

If you are only interested in the rate at which events happened, you may find the Meter metric to be a better fit than the Counter metric (as the Meter will spare you the data manipulation in the visualization tool).
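For context, the data manipulation a Counter leaves to you is roughly the following: take two snapshots of the counter and divide the difference by the elapsed time. This is only a sketch with a hypothetical helper name, and it assumes the counter never decreases between snapshots.

```java
// Hypothetical helper showing the rate calculation a plain counter requires.
class RateFromCounter {
    // Events per second between two snapshots of a monotonically increasing counter.
    static double perSecond(long countBefore, long countAfter, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return (countAfter - countBefore) * 1000.0 / elapsedMillis;
    }
}
```

For example, a counter that goes from 100 to 350 over 5000 ms gives 50 samples per second.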

p.s. 2

I would recommend steering clear of averages and standard deviation when examining networking performance. You can look it up, but extensive research shows that latency and other network performance measurements are rarely normally distributed, so the mean and standard deviation can hide the outliers you actually care about.

Instead I would recommend looking at histograms.
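A small sketch of why this matters, using a nearest-rank percentile over a set of measurements (for instance, inter-arrival times in milliseconds). The `Percentiles` class is hypothetical and only meant to show the calculation; a metrics library's histogram would normally do this for you.

```java
import java.util.Arrays;

// Hypothetical helper; computes percentiles from raw measurements.
class Percentiles {
    // Nearest-rank percentile: the smallest value with at least p percent of the data at or below it.
    static long nearestRank(long[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }
}
```

With nine measurements of 10 ms and one of 500 ms, the mean is 59 ms, which describes none of the samples, while the 50th percentile is 10 ms and the 99th percentile is 500 ms.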

I hope this helps,

Roy.
