Have you noticed that your company has hundreds of dashboards filled with technical metrics, making it even harder to understand what you should focus on? If you’re a developer, you likely use only a few dashboards and monitor a handful of services you’re responsible for. Developers have their own metrics, DevOps teams have others, and finding the root cause of an issue can be challenging.
Not every company has a dedicated team of site reliability engineers (SREs) and DevOps engineers who have already set up the infrastructure. Even if yours does, it’s still crucial to understand what matters most. But what if you need to build everything from scratch, and your company lacks proper monitoring, perhaps only having basic logging in place? How do you determine what you need and where to start?
Grafana offers 170 pages of dashboard templates; that’s a lot! I believe this article can help engineers understand why metrics are important, which ones are essential, and how to integrate them effectively. I won’t cover logs, traces, or alerts here, as that would make the article too long. I’ll address those separately. For now, I assume you already have some logs and basic alerts in place.
Beyond that, adding the right metrics can significantly enhance your system’s reliability, and I’ll explain why.
What is a Metric?
A metric is something that indicates how your system is performing. It can be represented in various forms, such as a chart, a percentage, or even a simple number. You need to determine whether you want to track trends over time, like a chart displaying the number of requests per second, or a single value, such as service uptime over one week. The challenge is that there is no universal standard for which metrics should be used in specific cases, yet the number of available templates is overwhelming.
At first glance, this makes it difficult to identify what your system truly needs. As a result, many people adopt a familiar approach, implementing what they have seen in previous workplaces. I’m not saying this approach is wrong, but I’d like to highlight the key steps you should consider when defining metrics for your system.
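To make the trend-versus-single-value distinction above concrete, here is a minimal sketch using Python’s prometheus_client library (an assumption about your stack; any metrics SDK exposes similar primitives): a counter feeds a chart of requests over time, while a gauge reports one current value.

```python
# pip install prometheus-client  (assumed stack; any metrics SDK works similarly)
import time
from prometheus_client import Counter, Gauge, start_http_server

# Trend over time: a monotonically increasing counter, later graphed as a rate.
http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests handled by the service"
)

# Point-in-time value: a gauge that can go up and down.
service_up = Gauge("service_up", "1 if the service considers itself healthy, else 0")

if __name__ == "__main__":
    start_http_server(8000)    # expose /metrics for a scraper to collect
    http_requests_total.inc()  # increment once per handled request
    service_up.set(1)          # report current health as a single value
    time.sleep(60)             # keep the process alive so it can be scraped
```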
Google’s SRE Book states that your monitoring system should be as simple as possible; otherwise, it will become difficult to maintain. I agree with this. Having charts that display just the four golden signals for each service is a great start, but I believe it may not be sufficient for a robust observability system.
What Types of Metrics Exist?
Before diving into details, let’s first understand which metrics we need to monitor for our system and how we can define technical metrics. There are various types of metrics a company can have:
- Business metrics
- Technical metrics (Infrastructure level and Application level)
- Project metrics
- Incident metrics
- Other metrics
Choosing the right technical metrics for your system depends on what you consider critical to monitor. These metrics typically apply to infrastructure and application levels:
- The infrastructure level includes everything where your applications are deployed, such as VMs, Kubernetes clusters, storage systems, and network components.
- The application level refers to the services themselves and how they behave at runtime.
Some metrics apply to both infrastructure and application levels, e.g., CPU or memory usage. You might display CPU consumption for VMs where your application is running on one dashboard, while another dashboard shows CPU usage for the service itself. The first dashboard will likely show a higher value. Having both views is useful for detecting anomalies. For example, when VMs are running out of memory, the service may appear fine, but background cron jobs could be consuming all available resources.
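As a rough illustration of those two views, the sketch below assumes psutil and prometheus_client are available and exports memory usage both for the whole VM and for the service process, so each dashboard can chart its own series:

```python
# pip install psutil prometheus-client  (assumed libraries for this sketch)
import time
import psutil
from prometheus_client import Gauge, start_http_server

# Infrastructure-level view: memory used on the whole host/VM.
host_memory_used_percent = Gauge(
    "host_memory_used_percent", "Memory usage of the entire VM, in percent"
)
# Application-level view: resident memory of this service process only.
process_memory_rss_bytes = Gauge(
    "process_memory_rss_bytes", "Resident memory of the service process, in bytes"
)

if __name__ == "__main__":
    start_http_server(8000)
    proc = psutil.Process()
    while True:
        host_memory_used_percent.set(psutil.virtual_memory().percent)
        process_memory_rss_bytes.set(proc.memory_info().rss)
        time.sleep(15)
```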
Sometimes, technical and business metrics overlap. Consider an online marketplace, where you might have two charts showing the number of total and failed orders over a period of time. The product department is most likely already tracking the total number of orders, but that may live on a separate board or even in a separate system.
From the technical side, you could set up an alert if the number of failed orders exceeds 10 per minute. When this alert triggers, you would check the dashboard, review the current values and trends for total and failed orders, and then go to the service-specific boards.
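Here is a minimal sketch of how such counters could be exposed from the order service; the metric and label names are my own assumptions, not a standard:

```python
from prometheus_client import Counter

# One counter, labelled by outcome, feeds both views: total orders (business)
# and failed orders (technical alerting).
orders_total = Counter("orders_total", "Order attempts by outcome", ["status"])

def process(order):
    """Placeholder for the real order-processing logic."""

def place_order(order):
    try:
        process(order)
        orders_total.labels(status="ok").inc()
    except Exception:
        orders_total.labels(status="failed").inc()
        raise

# The "more than 10 failed orders per minute" alert could then be a rule like
#   sum(increase(orders_total{status="failed"}[1m])) > 10
# in whatever alerting tool you use.
```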
How to Understand Whether You Need a Metrics and Monitoring System
You could work in a big tech company and be deeply involved in the monitoring process, where numerous dashboards exist and every step is covered. This might lead to the misconception that metrics are an absolute necessity. However, I believe you should start by identifying your actual needs.
Metrics are valuable, but having basic alerts to ensure your application is functioning properly, and logs to diagnose the root cause of issues, is even more important; they will provide greater benefits in the long run. Monitoring becomes essential as your system grows in complexity, introducing multiple potential failure points. Additionally, most cloud providers offer basic infrastructure metrics out of the box, so you don’t need to set up everything from scratch.
From an engineering perspective, setting up a monitoring environment is a challenging task. It involves selecting the right technology stack, configuring exporters, setting up collectors, ensuring scalability and fault tolerance, creating visual dashboards, connecting them to collectors, and more. Once everything is in place, you need to carefully define what to monitor; otherwise, your dashboards could become cluttered and ineffective rather than useful.
At the same time, if developers spend too much time identifying the root cause of issues, especially in a microservices architecture, a monitoring system with well-defined metrics can be highly beneficial. It can save you hours of investigation, and when your system is down, those hours mean a lot. However, the initiative for such a system must come from the technical side. The business itself does not need technical metrics and might even resist their implementation, so you must clearly communicate the benefits.
What Metrics Do You Need?
Finally! Your application has grown significantly, and you now need metrics to understand what’s happening when the system goes down or experiences issues.
First, cover the infrastructure level. If you use virtual machines, you must monitor CPU, RAM, storage capacity, and network bandwidth. If your application runs on Kubernetes, you can find ready-made templates to collect all necessary information. For example, you may have:
- A separate dashboard for cluster statistics (CPU usage, memory usage, node count, namespace count, instance details, etc.).
- A separate dashboard for K8s node statistics.
There are plenty of templates available online that can help you set this up efficiently. Next, break down your system into technological components, such as databases, message brokers, caches, and your application services. Create a separate dashboard for each component to monitor relevant parameters. For databases, this might include:
- Number of processed requests
- Number of connections
- Insert/update/delete operations
- Number of errors
- Waiting connections
Again, templates can be found online; I will provide links below.
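In practice you would normally deploy a ready-made exporter (postgres_exporter for PostgreSQL, for instance) rather than writing your own, but as a rough sketch of where numbers like connection counts come from, the snippet below assumes PostgreSQL and the psycopg2 driver and reads them from pg_stat_activity:

```python
# pip install psycopg2-binary prometheus-client  (assumed stack for this sketch)
import time
import psycopg2
from prometheus_client import Gauge, start_http_server

db_connections = Gauge("db_connections", "Current connections by state", ["state"])

def collect(conn):
    # pg_stat_activity is a standard PostgreSQL view with one row per connection.
    with conn.cursor() as cur:
        cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
        for state, count in cur.fetchall():
            db_connections.labels(state=state or "unknown").set(count)

if __name__ == "__main__":
    start_http_server(8000)
    conn = psycopg2.connect("dbname=app user=monitor")  # hypothetical DSN
    conn.autocommit = True
    while True:
        collect(conn)
        time.sleep(15)
```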
Once your infrastructure and core technology components are covered, you can create dashboards for your applications. Initially, I recommend having a separate board displaying the Four Golden Signals for each essential service:
- Requests per second (RPS)
- Error rate
- Request duration (latency)
- Saturation (resource usage, including CPU, memory, storage, and network bandwidth)
These metrics will be sufficient for the first implementation of any application. Additionally, I recommend tracking service uptime and log error rate. These will help you assess reliability and detect potential incidents immediately.
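As a starting point, here is a sketch of how the first three signals might be instrumented with prometheus_client; the decorator and metric names are my own assumptions, and saturation usually comes from the infrastructure dashboards rather than from application code:

```python
import time
from prometheus_client import Counter, Histogram

# Traffic: requests per second is derived from this counter on the dashboard.
requests_total = Counter(
    "http_requests_total", "Handled requests", ["endpoint", "status"]
)
# Latency: request duration, bucketed so percentiles can be computed later.
request_duration = Histogram(
    "http_request_duration_seconds", "Request duration", ["endpoint"]
)

def instrumented(endpoint):
    """Decorator adding traffic, error, and latency metrics to a handler."""
    def wrap(handler):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return handler(*args, **kwargs)
            except Exception:
                status = "error"  # errors are counted via the status label
                raise
            finally:
                requests_total.labels(endpoint=endpoint, status=status).inc()
                request_duration.labels(endpoint=endpoint).observe(
                    time.perf_counter() - start
                )
        return wrapper
    return wrap

@instrumented("get_order")
def get_order(order_id):
    ...  # hypothetical handler; saturation (CPU, memory) comes from infra metrics
```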
Afterward, you can introduce metrics to track specific use cases. That’s where technical metrics intersect with business ones, e.g., total and failed orders, average order processing time, average payment time, search time, etc.
Sometimes, you need to provide a certain service level to your clients, known as a Service Level Agreement (SLA). This is especially relevant in B2B environments. SLAs may include commitments such as total RPS (requests per second) or the number of active users.
To monitor SLA compliance, you may need to set up dashboards that cover the most critical business use cases. This allows you to quickly identify anomalies or, during load testing, determine whether the SLA is still achievable or needs adjustments.
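As a rough numeric illustration of SLA compliance (the target and traffic figures are made up), availability is simply the ratio of successful requests to total requests over the agreed window:

```python
def availability(successful: int, total: int) -> float:
    """Fraction of successful requests over the SLA window."""
    return successful / total if total else 1.0

# Example: 1,000,000 requests this month, 700 of them failed.
measured = availability(successful=1_000_000 - 700, total=1_000_000)  # 0.9993
slo_target = 0.999                                                    # assumed 99.9% SLA
print(f"availability={measured:.4%}, meets SLA: {measured >= slo_target}")
```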
If you don’t have documented SLA/SLI/SLO but do have Non-Functional Requirements (NFRs) to ensure a high-quality service for your clients, e.g., number of active users, first-page load time, etc., setting up a dedicated dashboard with relevant metrics is a great approach. Dashboards simply visualize the data you define, so instead of manually checking APIs across different services and analyzing multiple endpoints, a centralized dashboard can aggregate everything you need to track.
On top of that, you can configure alerts for these metrics, allowing you to respond quickly to potential issues.
Monitoring Tech Stack
If this area is new to you, I would recommend using Grafana for visualization and OpenTelemetry for metric collection and processing. Here are a few benefits of OpenTelemetry:
- It’s open-source, so you don’t have to pay for anything.
- It integrates seamlessly with Loki and Tempo for logs and traces.
- It has a large and active community, which means you can easily find support and answers to most questions.
- It includes its own SDK, agents, and collectors, making installation and setup easier.
- It supports metrics in a Prometheus-compatible format, so PromQL queries will work (see the sketch after this list).
- Many other things I don’t know or remember.
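Here is a minimal sketch of that SDK in Python, exposing metrics in a Prometheus-compatible format. It assumes the opentelemetry-sdk, opentelemetry-exporter-prometheus, and prometheus-client packages; exact module paths may differ between versions.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-prometheus prometheus-client
import time
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Wire the OpenTelemetry SDK to a Prometheus-compatible /metrics endpoint.
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
start_http_server(9464)  # scraped by Prometheus/Grafana; the port is an assumption

meter = metrics.get_meter("checkout-service")
orders = meter.create_counter("orders_total", description="Orders by outcome")

orders.add(1, {"status": "ok"})  # a PromQL query like rate(orders_total[5m]) then works
time.sleep(60)                   # keep the process alive so it can be scraped
```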
Additionally, Grafana has excellent documentation, which makes setting up alerts based on Prometheus metrics much easier. Another monitoring solution is Datadog, a SaaS platform where you don’t need to worry about infrastructure setup, configuration, scaling, or performance. Datadog provides everything in one place, including logging, alerts, and traces.
With Grafana, you need to configure these components separately and handle persistence, scalability, and performance yourself. However, the downside of Datadog is its high cost. Unless you’re a large tech enterprise, you likely don’t need such an expensive solution in the early stages.
Other metrics monitoring and storage solutions are either less well-known or less reliable, or it is hard to find engineers with experience in them, which makes them less practical options.
Defining metrics is a complex and challenging task. You need a structured approach to ensure it provides real value. This becomes even more difficult when working with 100+ engineers and starting from scratch; it’s clearly a team effort.
I hope you found something useful in this article and that it helps you build a more reliable system. Take care!