Agoda’s engineering team recently shared their custom solution designed to maintain critical Kafka consumer operations across multiple on-premises data centers, ensuring business continuity even during outages. Processing over 3 trillion Kafka records daily, Agoda needed a failover mechanism that could seamlessly shift consumer workloads between distinct Kafka clusters while preserving processing state and avoiding data duplication or loss.
Rather than relying on Kafka’s stretch clusters, which proved impractical due to geographic latency, or MirrorMaker 2, which lacks bidirectional offset synchronization, Agoda engineers developed an enhanced system that extends MirrorMaker 2 to support reliable failover, seamless failback, and persistent offset translation. Their approach involves always-on, two-way synchronization of consumer group offsets and OffsetSync records between clusters. When a consumer group commits an offset in one data center, that offset is translated and updated in the other cluster using a custom synchronization service built around Kafka Connect and OffsetSync mechanisms.
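The core idea of OffsetSync-based translation can be sketched in a few lines. The following is a simplified, hypothetical illustration (not Agoda’s actual code): each OffsetSync record pairs an upstream offset with its downstream replica offset, and a committed offset is translated by finding the nearest sync point at or below it. Real MirrorMaker 2 translation is more conservative to avoid skipping records, but the arithmetic below captures the principle.

```python
from dataclasses import dataclass

@dataclass
class OffsetSync:
    """One MM2-style OffsetSync record: a known pairing of an upstream
    (source-cluster) offset with its downstream replica offset."""
    upstream_offset: int
    downstream_offset: int

def translate_offset(syncs: list[OffsetSync], committed_upstream: int) -> int:
    """Translate a consumer group's committed offset on the source cluster
    into the equivalent offset on the target cluster, using the latest
    OffsetSync at or below the committed offset."""
    # Pick the most recent sync point that does not overshoot the commit.
    candidates = [s for s in syncs if s.upstream_offset <= committed_upstream]
    if not candidates:
        # No sync point yet: fall back to the start of the downstream partition.
        return 0
    best = max(candidates, key=lambda s: s.upstream_offset)
    # Simplifying assumption: offsets advance in lockstep after the sync point,
    # so the remaining distance can be applied as a plain delta.
    return best.downstream_offset + (committed_upstream - best.upstream_offset)

syncs = [OffsetSync(0, 0), OffsetSync(1000, 950), OffsetSync(2000, 1950)]
print(translate_offset(syncs, 1500))  # → 1450
```

Because the sync runs in both directions, the same translation is applied whichever cluster the consumer group last committed on.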
In failover scenarios, the secondary cluster takes over processing from the exact point consumed in the original location, thanks to the translated and replicated offsets. When the primary data center returns, the system supports failback: consumer offsets are synchronized back to the original cluster, ensuring continuity without duplicating messages or losing progress. To avoid cyclic offset updates, the sync service checks for already-in-sync states before applying updates.
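The already-in-sync check is what keeps bidirectional replication from chasing its own tail: a commit mirrored from cluster A to cluster B must not be mirrored straight back to A as if it were new progress. A minimal sketch of such a guard, with hypothetical names and an injected `translate` function, might look like this:

```python
from typing import Callable, Optional

def should_sync(source_committed: int,
                target_committed: Optional[int],
                translate: Callable[[int], int]) -> bool:
    """Decide whether a consumer-group commit on the source cluster should
    be propagated to the target cluster. Skipping groups that are already
    in sync breaks the update cycle bidirectional replication would
    otherwise create."""
    translated = translate(source_committed)
    if target_committed is not None and target_committed >= translated:
        # The target already reflects this commit (or a newer one); writing
        # it again would just be echoed back by the reverse sync direction.
        return False
    return True
```

With an identity translation, `should_sync(100, 100, lambda o: o)` returns `False` (nothing to do), while `should_sync(100, 50, lambda o: o)` returns `True` (the target lags and needs the update).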
The system also includes strong observability components: dedicated Grafana dashboards track metrics such as replication delays, sync failures, and consumer lag to detect anomalies early and intervene before operational impact occurs. This real-time visibility supports reliability across the multi-data-center Kafka deployment.
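The key signal in such a setup is the age of the newest sync record per partition: if it grows, translated offsets on the standby cluster go stale and a failover would land consumers further back than expected. A hypothetical alert rule along those lines (the threshold and function names are illustrative, not Agoda’s):

```python
import time

def replication_delay_seconds(last_sync_ts_ms: int,
                              now_ms: int = None) -> float:
    """Age of the newest OffsetSync observed for a partition, in seconds.
    A growing value means the standby cluster's translated offsets are
    falling behind the live commits."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(0.0, (now_ms - last_sync_ts_ms) / 1000.0)

def needs_attention(delay_s: float, threshold_s: float = 60.0) -> bool:
    # Illustrative alert rule: flag stale sync state before an outage
    # exposes the gap, rather than after.
    return delay_s > threshold_s
```

Exporting this delay per topic-partition is exactly the kind of metric a Grafana dashboard can chart and alert on.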
The custom failover and failback architecture reflects a growing trend: organizations operating at multi-data-center scale cannot rely on default Kafka features alone. According to Agoda, this system provides the necessary resilience for service continuity, precise processing semantics, and disaster recovery capabilities at scale. This approach highlights a strategic commitment to operational rigor and design flexibility, enabling Agoda’s data platform to withstand infrastructure outages without compromising correctness or throughput.
On Instagram, Agoda posted about their Kafka infrastructure handling over three trillion records per day and noted:
“To maintain business continuity during data center outages, we must be able to shift Kafka consumers across clusters.”
Other companies with large-scale streaming platforms have tackled multi-data center Kafka failover challenges in ways that share similarities with Agoda’s custom solution, though their implementations differ depending on operational constraints and priorities:
MirrorMaker 2 (MM2) is Kafka’s built-in tool for cross-cluster replication. By default, it supports unidirectional replication, copying data from a primary cluster to a secondary cluster. While this works for active-passive failover scenarios, it lacks native support for bidirectional offset synchronization and seamless failback. Without custom extensions, MM2 cannot fully translate consumer offsets across mirrored topics, which means consumers would have to reprocess messages or risk missing data after failover.
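For context, a minimal active-passive MM2 setup looks roughly like the configuration below. Cluster aliases and hostnames are placeholders; the checkpoint and group-offset properties are the standard MM2 settings that the custom solutions described here build on top of.

```properties
# Cluster aliases and endpoints (placeholder hostnames)
clusters = primary, backup
primary.bootstrap.servers = kafka-primary:9092
backup.bootstrap.servers = kafka-backup:9092

# Replicate primary -> backup only (active-passive, unidirectional)
primary->backup.enabled = true
primary->backup.topics = .*

# Emit checkpoints so consumer offsets can be translated downstream
primary->backup.emit.checkpoints.enabled = true
# Since Kafka 2.7, MM2 can also write translated offsets directly into
# the downstream cluster's consumer-offsets store:
primary->backup.sync.group.offsets.enabled = true
```

Note that even with these settings enabled, the sync is one-way; making it bidirectional and cycle-safe is precisely the gap Agoda’s extension fills.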
Netflix runs multi-regional active-active systems built on Kafka for events and microservices communication. Their solution utilizes custom tooling on top of MirrorMaker (pre-MM2) to replicate data between regions. For failover, Netflix integrates with its control plane (Zuul, Eureka) to redirect traffic and resume consumers in alternate regions. While Agoda’s solution automates offset translation, Netflix historically prioritized idempotent event handling and replay to handle failback scenarios.
Uber’s streaming stack (uChannel, Apache Kafka) supports global services like Uber Eats and ride dispatch. They implement geo-distributed replication but often rely on asynchronous failover models. Similar to Agoda, Uber avoids cross-region synchronous clusters and uses local offsets with checkpointing to resume consumption in disaster recovery scenarios. Their model emphasizes partitioning workloads by geography and replaying from checkpoints during failback rather than continuous bidirectional sync.
Confluent’s commercial tool extends MM2 with more advanced capabilities, such as better topic auto-creation and schema replication. However, like MM2, it doesn’t inherently solve offset translation for bidirectional failover. Additionally, enterprises may face vendor lock-in and licensing costs when adopting Confluent’s solution.
Agoda’s design requires ongoing offset synchronization and observability tooling (Grafana dashboards for sync lag and failure rates). This adds complexity compared to simpler unidirectional DR setups but provides higher reliability and correctness for critical, high-volume workloads. Other companies that prioritize cost and simplicity might accept some replay or manual failback instead of building custom solutions.