Grafana Labs has released Grafana Mimir 3.0, a major update to its open-source, horizontally scalable time series database. The release introduces a redesigned architecture that separates read and write operations. According to Grafana Labs, the change improves performance, reliability, and cost efficiency for organizations handling metrics at scale.
Grafana Mimir, launched in 2022, is now a top metrics backend for Prometheus and OpenTelemetry. It has gained over 4,700 GitHub stars and has 30 active maintainers. The project’s main goal is to build a highly scalable and efficient open-source time series database. It aims to support 1 billion active series and more.
The main feature of the 3.0 release is a new decoupled architecture that addresses a key limitation of earlier versions. Previously, the ingester component served both the write path and the read path, so heavy query loads could degrade ingestion performance. The new design adds Apache Kafka as an asynchronous buffer between ingestion and query tasks. This allows each path to scale on its own and removes the cross-path dependencies that previously affected system stability.
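The decoupling described above can be illustrated with a toy model. The following is a minimal sketch, not Mimir's implementation: the `Log` and `ReadSideConsumer` classes are hypothetical names, and an in-memory list stands in for a Kafka partition. It shows that writes complete as soon as a record is appended to the log, while the read side catches up independently from its own offset.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only log standing in for a Kafka partition (hypothetical)."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset, max_records):
        return self.records[offset:offset + max_records]


@dataclass
class ReadSideConsumer:
    """Read-path component that builds queryable state from the log."""
    log: Log
    offset: int = 0
    series: dict = field(default_factory=dict)

    def poll(self, max_records=100):
        # Consume the next batch and fold it into in-memory series.
        batch = self.log.read_from(self.offset, max_records)
        for name, ts, value in batch:
            self.series.setdefault(name, []).append((ts, value))
        self.offset += len(batch)
        return len(batch)

    def lag(self):
        # Records written but not yet visible to queries.
        return len(self.log.records) - self.offset


log = Log()
consumer = ReadSideConsumer(log)

# Write path: each write succeeds once the record is in the log,
# regardless of whether the read side is keeping up.
for ts in range(5):
    log.append(("cpu_usage", ts, 0.1 * ts))

lag_before = consumer.lag()   # read side has consumed nothing yet
consumed = consumer.poll()    # read side catches up on its own schedule
lag_after = consumer.lag()
```

In this sketch a slow or stalled consumer only increases `lag()`; it never blocks `append`, which is the property the asynchronous buffer is meant to provide.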
This architectural shift introduces what Grafana Labs calls “ingest storage,” a component that prevents spikes in query volume from slowing data ingestion and vice versa. Internal tests showed substantial gains in reliability: the risk of read path outages caused by random ingester failures decreased significantly, especially during the early failure stages.
Alongside the architectural overhaul, Mimir 3.0 makes the Mimir Query Engine (MQE) the default query engine. First rolled out in Mimir 2.17, MQE represents a departure from the traditional Prometheus PromQL engine’s approach to query processing. The standard PromQL engine processes samples in bulk. This can cause unpredictable memory use. In contrast, MQE uses a streaming approach. It loads only the necessary samples at each query execution step. Grafana Labs reports that this method reduces peak memory usage by up to 92 percent. This leads to faster queries and better reliability during heavy loads, while still being 100 percent PromQL compatible.
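The difference between bulk and streaming evaluation can be sketched with a simple aggregation. This is an illustrative example only and does not reflect MQE's actual code: `load_series`, `sum_bulk`, `sum_streaming`, and the `STEPS` constant are hypothetical. Both functions compute the same per-step sum across series, but the streaming version holds only one input series at a time.

```python
from typing import Iterator, List, Tuple

STEPS = 4  # hypothetical number of evaluation steps per series


def load_series(n_series: int) -> Iterator[List[float]]:
    """Yield one series at a time; stands in for reading from storage."""
    for i in range(n_series):
        yield [float(i + step) for step in range(STEPS)]


def sum_bulk(n_series: int) -> Tuple[List[float], int]:
    """Materialize every series first, then aggregate."""
    all_series = list(load_series(n_series))
    peak_samples = sum(len(s) for s in all_series)  # input samples held at once
    totals = [sum(column) for column in zip(*all_series)]
    return totals, peak_samples


def sum_streaming(n_series: int) -> Tuple[List[float], int]:
    """Fold each series into a running total, then let it go."""
    totals = [0.0] * STEPS
    peak_samples = 0
    for series in load_series(n_series):
        # Only the current series is held, in addition to the accumulator.
        peak_samples = max(peak_samples, len(series))
        for step, value in enumerate(series):
            totals[step] += value
    return totals, peak_samples


bulk_totals, bulk_peak = sum_bulk(1000)
stream_totals, stream_peak = sum_streaming(1000)
```

Here `bulk_peak` grows with the number of series, while `stream_peak` stays at one series' worth of samples. The real savings in any engine depend on the query shape, so the figures in this toy model should not be read as a reproduction of the reported 92 percent.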
The performance improvements extend beyond query execution. Grafana Labs found that large clusters in its own environment use up to 15 percent fewer resources while delivering better performance and greater reliability. These gains come from the decoupled architecture and the more efficient query engine working together.
Mimir 3.0 reflects lessons from running large Mimir clusters. The team learned from customers like CERN, who use Mimir at scale. They focused on three main areas:
- Reliability through the separation of concerns
- Performance through streaming query execution
- Cost optimization through better resource use
Because of the major architectural changes, Grafana Labs advises organizations to plan upgrades carefully. The upgrade involves deploying a second Mimir cluster next to the current one, reconfiguring write clients to send data to both clusters, and finally switching read clients to the new cluster. Organizations need to modify Helm or Jsonnet configurations for both clusters during this transition.
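The order of the three upgrade steps can be sketched as a toy model. Everything here is an assumption made for illustration: the `Cluster` and `Router` classes and the cluster names are hypothetical, and real deployments would make these changes in Helm or Jsonnet configuration and in client settings rather than in application code.

```python
class Cluster:
    """In-memory stand-in for a Mimir cluster (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.samples = []

    def write(self, sample):
        self.samples.append(sample)

    def query(self):
        return list(self.samples)


class Router:
    """Models where write clients and read clients currently point."""

    def __init__(self, write_targets, read_target):
        self.write_targets = write_targets
        self.read_target = read_target

    def write(self, sample):
        # Fan the sample out to every configured write endpoint.
        for cluster in self.write_targets:
            cluster.write(sample)

    def query(self):
        return self.read_target.query()


old = Cluster("existing-cluster")
new = Cluster("mimir-3-cluster")

# Step 1: the second cluster is deployed; traffic still uses the old one.
router = Router(write_targets=[old], read_target=old)
router.write(("up", 1, 1.0))

# Step 2: write clients are reconfigured to send to both clusters.
router.write_targets = [old, new]
router.write(("up", 2, 1.0))

# Step 3: read clients are switched to the new cluster.
router.read_target = new
result = router.query()
```

In this toy model the new cluster only holds samples written after dual-writing began, which suggests why the timing of the read switch deserves planning. How historical data is handled in a real migration is covered by the project's upgrade guides, not by this sketch.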
The updates in Mimir 3.0 are already available in Grafana Cloud Metrics, the fully managed metrics service powered by Mimir. For self-hosted deployments, upgrade guides and release notes are in the project documentation. They help organizations move to the new architecture smoothly.
The release reflects three years of development and underscores the project’s goal of improving metrics storage and retrieval. With its decoupled architecture and improved query engine, Mimir 3.0 is intended to help organizations scale their observability systems while cutting complexity and resource costs.
Several alternatives exist for organizations seeking time series database solutions beyond Mimir. Prometheus is a popular open-source tool that offers a strong query language (PromQL) and integrates well with Kubernetes, although it is mainly designed for single-node setups. InfluxDB is another widely used option. It handles high write and query loads well, features its own query languages (InfluxQL and Flux), and supports IoT and real-time analytics. TimescaleDB, an extension of PostgreSQL, appeals to teams familiar with SQL because it lets them use existing PostgreSQL tools while gaining time series optimizations. For cloud-native needs, Amazon Timestream and Google Cloud Monitoring provide managed services that reduce operational tasks. Thanos enhances Prometheus by adding long-term storage and global query capabilities, addressing some of the same scalability challenges that Mimir tackles. Each choice has trade-offs in scalability, query performance, operational complexity, and ecosystem compatibility.
