How to obtain a full view of the system

Hello,

 

Say that we want to develop a large, distributed system, and we have the following prerequisites:

1. We will be using RTI Connext DDS as the messaging solution between the various parts of the system

2. We want to be able to monitor the system very thoroughly. As far as RTI is concerned, that means: who is sending messages to whom, how often, how big the messages are, how many are lost/rejected/etc., what the latencies are, and how many entities of each kind are created, destroyed, enabled, or disabled (plus other states which can be tracked but which we have not listed).

3. We want to be able to easily integrate the above RTI monitoring with our general monitoring (for example, monitoring of business logic, or system related measurements)

 

I will confess that in our workplace we have already done a lot to integrate much of the described monitoring data into our general monitoring.

Using Dropwizard Metrics we've been able to:

1. Count entity creations and destructions caused by our code.

2. Time and record the various API calls we make (copying samples, returning loans, reading, taking, and so on).

3. Count various events using listeners (matches being made, samples being lost/rejected).

4. Measure latencies (up to inaccuracies caused by clock synchronization between different computers) using both the timestamps offered by RTI and other timestamps that we add to our entities.

We are aware these latencies cannot be used to benchmark RTI latency, but they do help with the outliers, which were completely unknown to us before.
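As a rough illustration of items 2 and 3 above, here is a minimal sketch of timing wrapped calls and counting listener events. It deliberately uses only JDK classes rather than the Dropwizard Metrics API, so it is a stand-in for the idea and not what our code actually looks like. The class `ApiMetrics` and its methods are hypothetical names, and the timed lambda stands for any DDS call (a read, a take, returning a loan).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical helper: named event counters and call timers (JDK only). */
final class ApiMetrics {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    /** Increment a named counter, e.g. from inside a listener callback. */
    void count(String name) {
        counts.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    /** Run the call, add its elapsed time to the named timer, return its result. */
    <T> T timed(String name, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            // Recorded even if the call throws, so failures are timed too.
            totalNanos.computeIfAbsent(name, k -> new LongAdder())
                      .add(System.nanoTime() - start);
            count(name + ".calls");
        }
    }

    /** Current value of a counter; 0 if it was never incremented. */
    long getCount(String name) {
        LongAdder adder = counts.get(name);
        return adder == null ? 0 : adder.sum();
    }

    /** Total nanoseconds spent in calls timed under this name. */
    long getTotalNanos(String name) {
        LongAdder adder = totalNanos.get(name);
        return adder == null ? 0 : adder.sum();
    }
}
```

A listener callback would call something like `metrics.count("sample_lost")`, and an API call would be wrapped as `metrics.timed("take", () -> doTake())`, where `doTake()` is a placeholder for the real call. A real metrics library adds rates and histograms on top of these raw totals.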

 

So it seems we have gathered all the data we wanted. Why am I making a post?

Well, there are 2 (or more) reasons:

1. It took us a lot of effort to get all of this data, which I'm sure RTI has no problem sharing with users if they so please (in fact, some of it is accessible via RTI Admin Console / RTI Monitoring Service [apart from latencies, except for in-service latency measurements, for example when sending routing service monitoring data]).

(In other words: Is there a faster way to get this data you'd recommend?)

2. I would like to hear what RTI thinks is a good strategy for "understanding the system" at runtime.

(Specifically: Which things would you recommend monitoring, if any, and how? When issues arise, what path do you recommend for handling them?)

(Notice that I avoided making suggestions, as I am open to hearing about the various options.)

 

Thank you,

Roy.


Hi Roy,

Thanks for your post! These are concerns that I believe many of RTI's customers also have. You mentioned Admin Console and Monitor, which do provide some of the information you've mentioned, though not all of it. I've tried to group your concerns below and will attempt to address each of them in turn.

  • Who's sending who messages / matching
    • I recommend these DataWriter APIs (DataReader has similar APIs):
      • get_publication_matched_status()
      • get_matched_subscriptions()
      • get_matched_subscription_datawriter_protocol_status()
      • get_matched_subscription_locators()
      • get_matched_subscription_datawriter_protocol_status_by_locator()
  • How often are messages being sent
    • DataWriter (but similar on the DataReader)
      • get_matched_subscription_datawriter_protocol_status()
      • get_matched_subscription_datawriter_protocol_status_by_locator()
  • How big are the messages
    • What we have today is the protocol status mentioned above. It uses the DataWriterProtocolStatus object to track summary information but does not include min/max/average-type metrics.
  • How many are lost/rejected/etc
    • You can find this in the DataWriter/DataReader ProtocolStatus APIs.
  • How many entities of each kind are created, destroyed, enabled, disabled
    • I recommend subscribing to the builtin topics if you're trying to track this remotely. I'm not aware of a callback to track this locally. Nor am I aware of a way to track the enabled/disabled state. Of course, the extended states you mention sound as if they are specific to your design so we don't have a mechanism to track those.
  • What are the latencies
    • I believe you're talking about tracking the one-way latency of a sample here. This is a feature which many want but which is tricky to implement. You already named the big problem: clock synchronization. I once worked with a customer that had spent a lot of time, money, and effort on getting synchronized time throughout their system. It was a critical design aspect of the system, as it would fail without tight synchronization. I warned them that absolute measures would likely be useless, but they insisted on having them, so I instrumented them. To my delight, the latency came back negative (meaning the sample arrived at the reader at an earlier time than it was sent from the writer, which is impossible and clearly indicates clock drift). Having said that, I'm sure you still want it (as do many people). What I can say at the moment is that we don't have one.
  • Time and record the various api calls we make (copying samples, returning loans, reading, taking, and so on)
    • These are best done with some local API & clocks. RTI doesn't have a mechanism to measure/distribute this type of information today.
  • We want to be able to easily integrate the above RTI monitoring with our general monitoring (for example, monitoring of business logic, or system related measurements)
    • I'd love to hear more about your general monitoring; what software do you use? We could certainly have a follow-up meeting if you'd rather not get into all of that in a post.
  • Dropwizard metrics
    • This appears to be a very useful Java library. The thing is that RTI supports a lot of languages and the bulk of our middleware logic is written in C. So, this particular library isn't something that we would be able to use.
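
To make the clock-synchronization point under "What are the latencies" concrete, here is a small sketch of how a one-way latency computed from two different clocks can come out negative. The class name and the numbers are made up for illustration and are not measurements from any real system.

```java
/** Hypothetical illustration of one-way latency measured across two clocks. */
final class OneWayLatency {
    /**
     * Latency as the reader computes it: receive time on the reader's clock
     * minus send time on the writer's clock. Any offset between the two
     * clocks ends up inside the result.
     */
    static long measuredNanos(long sendOnWriterClock, long receiveOnReaderClock) {
        return receiveOnReaderClock - sendOnWriterClock;
    }

    public static void main(String[] args) {
        long trueLatency = 200_000L;    // 200 us actually spent in transit
        long readerOffset = -500_000L;  // reader clock is 500 us behind the writer clock
        long send = 1_000_000_000L;     // send timestamp taken on the writer's clock
        // The arrival instant, as read from the reader's (lagging) clock.
        long receive = send + trueLatency + readerOffset;
        System.out.println(measuredNanos(send, receive)); // prints -300000
    }
}
```

The measured value is the true latency plus the clock offset, so an offset larger than the true latency makes the result negative. Differences between measurements remain meaningful only if the offset stays roughly constant, which is why such numbers can help with spotting outliers but not with absolute benchmarking.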

 

Your post is very timely as we are looking into monitoring in general and are trying to collect customer use cases. Would you be interested in having a follow-up meeting to discuss further?

Thanks,
Ken


Hey Ken,

Thank you for the detailed answer.

I was aware of many of the APIs you listed (I am also aware that much of the data is obtainable via the monitoring data that's used by RTI monitoring service).

 

The problems we have with RTI monitoring fall into four areas:

1. It is not flexible in terms of what view you're getting.

2. The view that is available is a bit difficult to navigate.

3. In our environment, which has many apps running in many domains, it's difficult to even start the application.

4. We would like the data to be available alongside business-related monitoring (to allow better tracking of correlations and to identify root causes of problems).

 

While 2 and 3 are a bit less related to the topic, I will expand on 1 and 4.

1. Monitoring service is a complete product which puts together: the way data is collected, the way it is stored, and the way it is presented.

I feel it would be very nice to have a service which collects "monitoring" data (maybe using some YAML to configure which data it collects) and makes it accessible (there are many ways to go about this, but I think supporting multiple options is possible if you design it properly).

Otherwise, it would be nice if RTI-based applications could be configured to publish/expose monitoring data through APIs that are not based on RTI (for example: servlets, Graphite, InfluxDB format).

I should note that while many of the interesting things already exist via monitoring service, things like timing of API calls are not available.

This is, I suppose, related to your comment about Dropwizard Metrics.

It's true that RTI may not utilize it (although it may be utilized in the Java API), but I'm sure C also has a similar solution.

4. What we use is a simple model:

Java apps use Dropwizard Metrics to collect metrics (counters, meters, timers, gauges, and histograms).

We also use an external histogram implementation which is said to be better (HdrHistogram).

We then use another library so that our metrics can be reported periodically to InfluxDB.

At this point it is possible to run all sorts of queries on InfluxDB, if one wants to.

Grafana is what we use to set up dashboards, although Kibana is very similar.

Grafana integrates well with many different data sources and allows for easy creation of graphs, tables, and single-stat panels.

This allows us to create different dashboards for different users, and gives us an easier way to see connections between user experience, application-level performance, JVM-level performance, and OS-level performance.
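
To give an idea of what the reporting step sends, here is a minimal sketch that formats one metric point in InfluxDB line protocol (`measurement,tags fields timestamp`). It is not the reporter library we use. The class `InfluxLine`, the measurement name, and the tag/field names in the usage note are hypothetical, and only integer fields are handled.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical formatter for one metric point in InfluxDB line protocol. */
final class InfluxLine {
    /** Escape commas, equals signs and spaces in tag keys, tag values and field keys. */
    static String escape(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    /** Build "measurement,tag=value field=valuei timestamp" for integer fields. */
    static String format(String measurement, Map<String, String> tags,
                         Map<String, Long> fields, long timestampNanos) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("a point needs at least one field");
        }
        // Measurement names only need commas and spaces escaped.
        StringBuilder line = new StringBuilder(
                measurement.replace(",", "\\,").replace(" ", "\\ "));
        // TreeMap gives a stable, sorted key order.
        new TreeMap<>(tags).forEach((k, v) ->
                line.append(',').append(escape(k)).append('=').append(escape(v)));
        // The trailing "i" marks the value as an integer field.
        String fieldSet = new TreeMap<>(fields).entrySet().stream()
                .map(e -> escape(e.getKey()) + "=" + e.getValue() + "i")
                .collect(Collectors.joining(","));
        return line + " " + fieldSet + " " + timestampNanos;
    }
}
```

For example, a measurement `dds_take` with tag `host=node 1`, fields `count=3` and `max_ns=1500`, and timestamp `1000` is formatted as `dds_take,host=node\ 1 count=3i,max_ns=1500i 1000`. Once points arrive in this shape, Grafana can group and filter them by tag.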

 

That being said, we are also looking into implementing a monitoring view which lets the user more easily understand the connectivity of the system (who's writing what to whom and at what rates).

For this we may use Neo4j, and we may just use statically defined connections (since we don't add new services very often, it's possible for us to map who should be sending what to whom and then monitor that things are functioning correctly).
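
The static-map idea can be sketched as a plain set comparison, before any graph database is involved. This is only an illustration under assumptions: the class `ConnectivityCheck` and the `"writer->reader:topic"` string encoding of a link are hypothetical, and in practice the observed links would have to come from discovery or matching information.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check of observed writer-to-reader links against a static map. */
final class ConnectivityCheck {
    /** Links that should exist but were not observed. */
    static Set<String> missing(Set<String> expected, Set<String> observed) {
        Set<String> result = new TreeSet<>(expected);
        result.removeAll(observed);
        return result;
    }

    /** Links that were observed but are not in the static map. */
    static Set<String> unexpected(Set<String> expected, Set<String> observed) {
        return missing(observed, expected);
    }
}
```

With expected links `A->B:Orders` and `A->C:Orders` and observed links `A->B:Orders` and `D->B:Orders`, `missing` reports `A->C:Orders` and `unexpected` reports `D->B:Orders`. Either result being non-empty is something a dashboard could alert on.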

 

 

On a related note (since logs are also useful for monitoring): RTI log parser is nice and all, but is there a plan to make RTI logs generally easier to understand?

Alternatively, it may be useful to have something like the RTI log parser in the Logstash format (so that people may send RTI logs to Logstash, have them dissected into useful information, and store that information somewhere, possibly Elasticsearch).

 

 

I would be happy to discuss all of these things more thoroughly when I'm back from my vacation (end of October).

Thank you,

Roy.


Hi Roy,

   I'll try to answer as best I can...

"1. Monitoring service is a complete product which puts together: the way data is collected, the way it is stored, and the way it is presented."

   It is, yes. But that's not the way I'd like to approach our future developments in this area.

"I feel it would be very nice to have a service which collects "monitoring" data (maybe using some YAML to configure which data it collects) and makes it accessible (there are many ways to go about this, but I think supporting multiple options is possible if you design it properly)"

   Yes, agreed...this is the way I'm thinking about it as well.

"I should note that while many of the interesting things already exist via monitoring service, things like timing of API calls are not available."

   This is true. I will make a note and see if there's any way to address this sort of thing in new developments.

"It's true that RTI may not utilize it (although it may be utilized in the Java API), but I'm sure C also has a similar solution."

   My comment here was in relation to the fact that most of our DDS code is written in C and other languages are provided as an API binding to that logic. Given that situation, and the fact that we cannot be sure that a JVM will be available on the host machine, there's no way we could use a Java library for the metrics.

"On a related note (since logs are also useful for monitoring): RTI log parser is nice and all, but is there a plan to make RTI logs generally easier to understand?"

   I will pass this feedback along to the core development team.

"Alternatively, it may be useful to have something like the RTI log parser in the Logstash format (so that people may send RTI logs to Logstash, have them dissected into useful information, and store that information somewhere, possibly Elasticsearch)."

   I know of no plans for this right now. Again, I'll pass this along to the appropriate person.

   Thanks for the pointers to the many software projects you're using. Much appreciated. It would be great to have a discussion when you're back. Let me know and I'll get it set up with you.

Thanks again,
Ken


Hey Ken,

It has been a while but I'll reply to your latest post:

1. "It is, yes. But that's not the way I'd like to approach our future developments in this area." 

I'm glad to hear that.

2. "Yes, agreed...this is the way I'm thinking about it as well."

Again, glad to hear.

3. "This is true. I will make a note and see if there's any way to address this sort of thing in new developments."

It's by timing these API calls that we detected that we were spending a lot of time in some of the code. At times it led us to the conclusion that we needed to change something in our QoS (or avoid the API call).

4. "My comment here was in relation to the fact that most of our DDS code is written in C and other languages are provided as an API binding to that logic. Given that situation, and the fact that we cannot be sure that a JVM will be available on the host machine, there's no way we could use a Java library for the metrics."

At the very least I'd recommend instrumenting the core.

Instrumenting the APIs (Java / C# / others) can be useful to detect issues related to the JVM or to the specific API used.

5. "I will pass this feedback along to the core development team."

I did see in the notes for 5.3.0 that the log messages have been improved (a standard format, added topic and type related info to some of the messages, etc.). I hope to find them sufficiently improved when my team makes the switch to 5.3.0!

6. "I know of no plans for this right now. Again, I'll pass this along to the appropriate person."

To make this point clearer:

a. As far as I know, the RTI logging QoS currently does not support sending log messages over a UDP/TCP transport (unlike most popular logging solutions). This makes it difficult to identify when an issue is related to RTI (issues being: overall slowness, crashes, latency spikes, and more).

b. Even if we work around this problem (for example, by running a logstash instance on every machine and setting it to collect data from rtiDds.log), it would still be difficult for us to parse the logs (that is, to interpret what the different parts mean in a key-value format).

As ELK is becoming quite central as a solution for log-based event monitoring in complex systems, I would recommend learning a bit about Logstash (https://www.elastic.co/products/logstash) and supplying users with a basic Logstash configuration for the log messages (examples of Logstash configurations can be found here: https://www.elastic.co/guide/en/logstash/current/config-examples.html).

 

Having said all of the above, I will finish by saying that I will be happy to have that discussion when we are both able to.

 

Thanks,

Roy.