January 12, 2021
Perhaps you think that system monitoring is some sort of dashboard on a TV screen, full of charts and graphs, that only large corporations or research centres use. Or possibly you believe it’s a solution for a nuclear power plant or another industrial facility. Well, in this article, I’m going to prove you wrong. System monitoring can (and, in fact, should!) be used with a variety of tools, no matter how big your company is. It brings a lot of benefits both for the development team and for end users. This raises the question – what is system monitoring all about? You’re about to find out!
In this article, we are going to show you two interesting software solutions – Grafana and Azure Monitor with Application Insights.
System monitoring is vital for observing the state of your systems. What is more, apart from showing the current state, it helps you predict how a system will behave when you take additional actions, for instance scaling or adding other resources.
Apart from providing dashboards, monitoring also plays an alerting role. After all, you do not want to watch tens of charts continually; you want to be notified immediately when a specific action is needed. The key is also to build and configure monitoring systems so that you immediately see what the problem is. There should be no need to dig through tons of logs to find the root cause of a given issue.
Of course, there can be some exceptions, but your main goal is to achieve responsiveness and understanding of the situation that happened in order to take adequate steps.
Let us begin with the users’ point of view. Currently, we have a variety of systems. Many of them are critical, like those supporting production processes or trading. Concerning those systems, any downtime can be considered an irreversible loss. For this reason alone, it is so important to act quickly.
Adequate monitoring should not only help developers but also assist less technical people to understand the big picture of what's happening. Thanks to an appropriate response path that can be worked out along with system monitoring, companies can save time and money alike.
There is a saying, "The less you know, the safer you are." However, it's not necessarily the case here. No one wants to be woken up in the middle of the night by a call with a notice that the production system is down.
Very often, when implementing a new feature of the system or developing some critical core part of it, development teams wonder what the overall impact of this change will be on the IT infrastructure. Will it affect our databases so that new indexes will need to be created, or will it hit the processing of some asynchronous queue? We can implement a significant number of unit and integration tests to spot potential bugs, but we simply can't cover and analyse all possible scenarios. The typical release path includes deployment to some pre-production environment. This can be a good checkpoint for catching deviations on the monitoring dashboard.
In the next part, we will analyse tools that will help you achieve your goals, and we will focus on the metrics that are worth monitoring.
Azure Monitor is a versatile tool that is well suited to monitoring all relevant resources within your company. According to Azure Portal, Monitor is a tool used primarily for:
Azure Monitor – overview
As you can see, Azure Monitor itself provides huge possibilities for monitoring. It exposes numerous metric variables that can be visualised in multiple ways. Together with Application Insights, you can analyse the telemetry from a variety of applications that use .NET, Node.js and Java, hosted on-premises as well as in the cloud. What’s more, you can create your own custom dashboard that combines all the relevant metrics and results of the most useful Application Insights queries.
Grafana, on the other hand, provides more visualisation options than Azure Monitor. It also supports multiple data sources, such as Azure Monitor or Elasticsearch, and can combine their data in a single dashboard, which is extremely useful when you want all resources monitored in a single place.
Grafana is designed for visualising and analysing custom user metrics plus those available in Azure Monitor, like:
As an output, you can create a smart chart with a few metrics where you can put different units on the axis. Why can it be useful? Because, most likely, you want to observe how a change of one metric affects the other. For instance, does high CPU usage influence the processing of some critical message queues? Can the request timeout be correlated with some timeout on the database level?
Of course, you can say that your system already has excellent logging, and you do not need to look at additional fancy charts. And yes, that may be true, but the real question is, how much time do you spend on monitoring your systems?
In order to focus on monitoring, I will skip the part where I describe how to integrate Azure with Grafana. This topic is very well-described in many articles. Take a look at this text for instance. In terms of pricing, it always depends on the data volume, retention and usage. For more details, check the Azure Calculator.
Before we make a deep dive into the Grafana charts, let’s consider the easiest thing you can do: creating your own health check of the critical resources and the App Service itself, as a simple API that pings the resources, for example, every 5 minutes. To run this, we can use the Azure Availability Test from Azure Portal and configure email alerts that let us know when the test fails. Thanks to that solution, you can be quickly notified which resource is down.
Let's review some sample metrics that you can get from Azure and showcase in Grafana. Keep in mind that this is just a starting point. You can build your dashboard gradually, adding new charts step by step depending on your current needs.
Here, we will illustrate the most important metrics, like CPU usage, the number of requests per second and IO bytes/s processed.
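To make the health check idea a bit more concrete, here is a minimal Python sketch of the logic such an endpoint could run. The names `run_health_checks`, `database_ok` and `queue_ok` are hypothetical and not part of Azure or any library; real checks would open a database connection, peek at a queue and so on. The sketch assumes that the availability test treats any non-200 response as a failure, so the endpoint returns 200 only when every resource responds.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run every check and return an HTTP-style status code plus per-resource results."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception as exc:  # a crashing check counts as a failed resource
            results[name] = f"unhealthy: {exc}"
    all_healthy = all(status == "healthy" for status in results.values())
    # 200 lets the availability test pass; 503 makes it fail so an alert can fire.
    return (200 if all_healthy else 503), results


# Hypothetical checks, stubbed for illustration only.
def database_ok() -> bool:
    return True


def queue_ok() -> bool:
    raise TimeoutError("queue did not respond")


status_code, report = run_health_checks(
    {"database": database_ok, "queue": queue_ok}
)
print(status_code, report)
```

Returning the per-resource report in the response body is what lets the alert tell you which resource is down, not just that something failed.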
The most important performance metrics
Then, let’s also have a look at request performance buckets. Very often, the system has to meet a requirement to process a given percentage of requests in under n milliseconds (or seconds). Having such values on the chart, you can quickly verify them.
Requests performance buckets
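If you want to reason about performance buckets outside a chart, the grouping can be sketched in a few lines of Python. The bucket boundaries and labels below are assumptions chosen for illustration (they are not taken from the dashboard), and `bucket_requests` and `share_below` are hypothetical helper names.

```python
from bisect import bisect_right
from typing import Dict, List

# Hypothetical bucket upper bounds in milliseconds; the last bucket is open-ended.
BUCKET_BOUNDS_MS = [250, 500, 1000, 3000]
BUCKET_LABELS = ["<250ms", "250-500ms", "500ms-1s", "1-3s", ">=3s"]


def bucket_requests(durations_ms: List[float]) -> Dict[str, int]:
    """Count how many requests fall into each duration bucket."""
    counts = {label: 0 for label in BUCKET_LABELS}
    for duration in durations_ms:
        # bisect_right places a duration equal to a bound into the next bucket up.
        counts[BUCKET_LABELS[bisect_right(BUCKET_BOUNDS_MS, duration)]] += 1
    return counts


def share_below(durations_ms: List[float], threshold_ms: float) -> float:
    """Percentage of requests that finished strictly below the threshold."""
    if not durations_ms:
        return 0.0
    fast = sum(1 for duration in durations_ms if duration < threshold_ms)
    return 100.0 * fast / len(durations_ms)


# Made-up request durations in milliseconds.
sample = [120, 180, 240, 260, 480, 700, 950, 1200, 2500, 4000]
print(bucket_requests(sample))
print(share_below(sample, 500))
```

A requirement such as "95% of requests below 500 ms" then becomes a single comparison against the result of `share_below`.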
This interesting statistic focuses on a ratio: the total number of requests vs. the failed ones. A sudden deviation in failed requests may signal that some specific action needs to be taken in the system and that your users are currently experiencing serious problems.
Total requests count vs failed
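What counts as a "sudden deviation" depends on your system, but one simple rule can be sketched as follows. This is only an illustration under assumed thresholds: the factor of 3 and the 1% floor are arbitrary, and `failure_rate` and `is_deviation` are hypothetical names, not an Azure or Grafana feature.

```python
from typing import List


def failure_rate(total: int, failed: int) -> float:
    """Fraction of failed requests in one time window."""
    return failed / total if total else 0.0


def is_deviation(
    history: List[float], current: float, factor: float = 3.0, floor: float = 0.01
) -> bool:
    """Flag the current rate when it exceeds both a floor and factor x the recent average."""
    baseline = sum(history) / len(history) if history else 0.0
    return current > floor and current > factor * baseline


# Made-up (total, failed) counts for four consecutive time windows.
windows = [(1000, 5), (1200, 6), (900, 4), (1100, 66)]
rates = [failure_rate(total, failed) for total, failed in windows]
print(is_deviation(rates[:3], rates[3]))
```

The floor keeps a jump from zero failures to a single failure from triggering an alert on a quiet system.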
Going further, let’s also drill down into the specific number of the HTTP client and server errors. Even though 4xx errors are typically client-related, it is often useful to know which error code a user is encountering to determine if the potential issue can be fixed.
Failed requests grouped by HTTP error code
Now we come to the most critical chart: server exceptions vs. HTTP server errors. When a server exception pops up, you should definitely check the application logs to find the root cause.
Server exception and server errors
Now that we have a basic set of measurements to observe, it's essential to think about what is specific to your system. Is there a particular API request that is critical and needs to be continuously monitored?
Above all, you ought to observe queues where some asynchronous processing is taking place. Does your system have any limits on concurrently processed messages? If you observe peaks on the chart, take a moment for further analysis. Check which resources are in use and ensure that a bottleneck hasn’t occurred. If needed, observe those resources as well. Activate an alert when the processing of a message fails.
Sample queue message count status
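A short spike in the queue is often harmless, while a backlog that stays high usually deserves attention. The sketch below shows one way to tell the two apart. The limit of 100 messages and the three-sample window are assumptions for illustration, and `sustained_backlog` is a hypothetical helper, not part of any queue SDK.

```python
from typing import List


def sustained_backlog(depths: List[int], limit: int, min_samples: int = 3) -> bool:
    """Return True if the depth stayed above `limit` for `min_samples` consecutive samples."""
    run = 0
    for depth in depths:
        # Extend the run while the queue is over the limit; reset it otherwise.
        run = run + 1 if depth > limit else 0
        if run >= min_samples:
            return True
    return False


# Made-up queue message counts sampled at a regular interval.
depths = [10, 40, 120, 15, 130, 140, 160, 30]
print(sustained_backlog(depths, limit=100))
```

In this sample, the single peak of 120 is ignored, while the three consecutive readings above 100 are flagged.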
The Grafana charts shown above can be only a starting point in the system observation. If you are looking for something specific, like, for instance, full monitoring of your Elasticsearch clusters but you do not know how to start, go here and check whether there’s a dashboard built by the Grafana community that you can easily download as a JSON file, import it, configure data source and voila – you’re ready to go!
It's fair to conclude that each organisation, regardless of size, can benefit from sound system monitoring software.
With some big names available on the market, system monitoring solutions should always be considered as an essential aid in your everyday work. As organisations develop and grow, system monitoring can act both as a safety cushion and a springboard to the future with its many functions and features.
Premature optimisation may not be the best practice, but system monitoring can just be the perfect solution for your company!
Ewelina is a software engineer with 9+ years of experience in designing and developing applications based on .NET technologies. She enjoys creating useful solutions that help other people. She is focused on understanding customer needs and has a 'can do' attitude. As a Domain-Driven Design enthusiast, she always seeks clean, reusable and testable solutions.
Contact Ewelina: firstname.lastname@example.org