Key Takeaways
- Apache Kafka Stretch cluster setup in an on-premise environment poses a high risk of service unavailability during WAN disruptions, often leading to split-brain or brain-dead scenarios and violating Service Level Agreements (SLAs) due to degraded Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Proactive monitoring of your Kafka environment to understand data skews across brokers is critical; an uneven data load on a single node can cause a stretch cluster failure.
- An ungraceful broker shutdown during upgrades or due to poor security posture can make the Kafka service unavailable, resulting in a loss of data.
- There are three popular and widely used Kafka disaster recovery strategies; each has challenges, complexities, and pitfalls.
- Kafka Mirror Maker 2 replication lag (Δ) in a Disaster Recovery setup can cause message loss and inconsistent data.
Apache Kafka is a well-known publish-subscribe distributed system widely used across industries for use cases such as log analytics, observability, event-driven architectures, and real-time streaming. Its distributed nature allows it to serve as the critical piece, or backbone, of modern streaming architectures, offering many integrations and supporting near-real-time and real-time latency requirements.
Today, Kafka is offered as a service by major vendors, including Confluent Platform and cloud offerings such as Confluent Cloud, Amazon Managed Streaming for Apache Kafka (MSK), Google Cloud Managed Service for Apache Kafka, and Azure Event Hubs for Apache Kafka. Additionally, Kafka is deployed at over one hundred thousand companies, including Fortune 100 organizations, across industry segments such as enterprise software, financial services, healthcare, automotive, startups, independent software vendors, and retail.
The value of data diminishes over time. Many businesses today compete on speed and demand real-time data processing with millisecond responses to power mission-critical applications. Financial services, e-commerce, IoT, and other latency-sensitive industries require highly available and resilient architectures that can withstand infrastructure failures without impacting business continuity.
Companies operating across multiple geographic locations need solutions that provide seamless failover, data consistency, and disaster recovery (DR) capabilities while minimizing operational complexity and cost. One common architectural pattern is a single Kafka cluster spanning multiple data center locations over a WAN (Wide Area Network), known as a stretch cluster. Companies often consider this pattern for disaster recovery (DR) and high availability (HA).
A stretch cluster can theoretically provide data redundancy across regions and minimize data loss in the event of a region-wide failure. However, this approach comes with several challenges and trade-offs due to the inherent nature of Kafka and its dependency on network latency, consistency, and partition tolerance in a multi-region setup. In this article, we focus on the DR and HA implications and considerations of the stretch cluster architectural pattern.
We conducted the following tests and documented the observed cluster behavior.
Environment Details
Regions | London, Frankfurt |
---|---|
Kafka Distribution | Apache Kafka |
Kafka Version | 3.6.2 |
Zookeeper | 3.8.2 |
Operating system | Linux |
High-Level Environment Setup View
Below is the high-level view of a single stretch cluster setup that spans the London and Frankfurt regions, with:
- Four Kafka brokers, where actual data resides.
- Four Zookeeper nodes (a three-node rather than a four-node setup would work as well, but to mimic the exact environment in both regions, we set up four Zookeeper nodes).
- Zookeeper follows the ⌊N/2⌋ + 1 rule for quorum establishment. In this case, the Zookeeper quorum will be ⌊4/2⌋ + 1 = 2 + 1 = 3.
- Network latency between these regions is ~ 15ms.
- Multiple producers (1 to n) and consumers (1 to n) are producing and consuming the data from this cluster from different regions of the world.
Kafka Broker Hardware Configuration
Hard Disk Size | 10 TB |
---|---|
Memory | 120 GB |
CPU | 16 Core |
Disk Setup | RAID 0 |
Kafka Broker Configuration
- Replication factor set to 4
- Data retention (aka log retention) period is 7 days
- Auto-creation of topics allowed
- Minimum in-sync replicas (min.insync.replicas) set to 3
- The number of partitions per topic is 10
- Other configuration details (a sample client-side configuration sketch follows this list)
- acks=all
- max.in.flight.requests.per.connection=1
- compression.type=lz4
- batch.size=16384
- buffer.memory=67108864
- auto.offset.reset=earliest
- send.buffer.bytes=524288
- receive.buffer.bytes=131072
- num.replica.fetchers=4
- num.network.threads=12
- num.io.threads=32
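For reference, below is a minimal sketch of how the client-side settings listed above could be wired into a Java producer. It is illustrative only; the bootstrap address, topic, and serializers are placeholders rather than the exact prod.config used in our tests.

***** Start of sample producer configuration sketch (illustrative) *****
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SampleProducerConfig {
    public static void main(String[] args) {
        // Client-side settings from the list above; the bootstrap address, topic,
        // and serializers are placeholders, not the exact prod.config used in our tests.
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092");
        props.put("acks", "all");                                // wait for all in-sync replicas
        props.put("max.in.flight.requests.per.connection", "1"); // preserve ordering on retries
        props.put("compression.type", "lz4");
        props.put("batch.size", "16384");
        props.put("buffer.memory", "67108864");
        props.put("send.buffer.bytes", "524288");
        props.put("receive.buffer.bytes", "131072");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("Network_logs", "key-1", "sample event payload"));
            producer.flush();
        }
    }
}
***** End of sample producer configuration sketch (illustrative) *****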
Producers and Consumers:
Producers:
In this use case, we used a producer written in Java that consistently produces a defined number (K) of transactions per second (TPS), which can be controlled using a properties file. A sample snippet that wraps the Kafka performance test CLI is shown below.
******** Start of Producer code snippet *******
void performAction(int topicN){
    try {
        // Load run parameters (record count, record size, target throughput) from the external properties file
        Properties prop = new Properties();
        InputStream input = new FileInputStream("config.properties");
        prop.load(input);
        input.close();

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        String startTime = sdf.format(new Date());
        System.out.println("Start time ---> " + startTime + " topic " + topicN);

        // Build the kafka-producer-perf-test command for the given topic
        String command = "/home/opc/kafka/bin/kafka-producer-perf-test --producer-props bootstrap.servers=x.x.x.x:9092" +
                " --topic " + topicN +
                " --num-records " + prop.getProperty("numOfRecords") +
                " --record-size " + prop.getProperty("recordSize") +
                " --throughput " + prop.getProperty("throughputRate") +
                " --producer.config " + "/home/opc/kafka/bin/prod.config";

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.command("bash", "-c", command);
        Process process = processBuilder.start();

        // Read the perf-test output until the final summary line (contains "99.9th.") appears
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains("99.9th.")) {
                break;
            }
        }

        int exitVal = process.waitFor();
        if (exitVal == 0) {
            System.out.println("Success!");
            String endTime = sdf.format(new Date());
            // Send the summary line to OpenSearch for tracking purposes - optional step
            pushToIndex(line, startTime, endTime, prop.getProperty("indexname"));
        } else {
            System.out.println("Something unexpected happened!");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
********** End of producer code snippet **************
****** Start of Producer properties file *****
numOfRecords=70000
recordSize=4000
throughputRate=20
topicstart=1
topicend=100
****** End of Producer properties file *****
Consumers:
In this use case, we used consumers written in Java that consistently consume a defined number (K) of transactions per second (TPS), which can be controlled using a properties file. A sample snippet is shown below.
***** Start of consumer code snippet *******
void performAction(int topicN){
    try {
        // Load run parameters (message count, timeout, offset position) from the external properties file
        Properties prop = new Properties();
        InputStream input = new FileInputStream("consumer.properties");
        prop.load(input);
        input.close();

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        String startTime = sdf.format(new Date());
        System.out.println("Start time ---> " + startTime + " topic " + topicN);

        // Build the kafka-consumer-perf-test command; add --from-latest when configured
        String command;
        if (prop.getProperty("latest").equalsIgnoreCase("no")) {
            command = "/home/opc/kafka/bin/kafka-consumer-perf-test --broker-list=x.x.x.x:9092" +
                    " --topic " + topicN +
                    " --messages " + prop.getProperty("numOfRecords") +
                    " --timeout " + prop.getProperty("timeout") +
                    " --consumer.config " + "/home/opc/kafka/bin/cons.config";
        } else {
            command = "/home/opc/kafka/bin/kafka-consumer-perf-test --broker-list=x.x.x.x:9092" +
                    " --topic " + topicN +
                    " --messages " + prop.getProperty("numOfRecords") +
                    " --timeout " + prop.getProperty("timeout") +
                    " --consumer.config " + "/home/opc/kafka/bin/cons.config" +
                    " --from-latest";
        }
        System.out.println("command is ---> " + command);

        ProcessBuilder processBuilder = new ProcessBuilder();
        processBuilder.command("bash", "-c", command);
        Process process = processBuilder.start();

        // Read the perf-test output until the summary metrics line appears (skip WARNING lines)
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains(":") && line.contains("-") && !line.contains("WARNING:")) {
                break;
            }
        }

        int exitVal = process.waitFor();
        if (exitVal == 0) {
            System.out.println("Success!");
            // Send the summary line to OpenSearch for tracking purposes - optional step
            pushToIndex(line, prop.getProperty("indexname"), topicN);
        } else {
            System.out.println("Something unexpected happened!");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
***** End of consumer code snippet *******
*****Start of Consumer properties snippet ****
numOfRecords=70000
topicstart=1
topicend=100
groupid=cgg1
reportingIntervalms=10000
fetchThreads=1
latest=no
processingthreads=1
timeout=60000
***** End of consumer properties snippet ****
CAP Theorem
Before diving into further discussions and scenario executions, it’s crucial to understand the fundamental principles of distributed systems. This foundational knowledge will help you better correlate WAN disruptions with their impact on system behavior.
CAP Theorem (i.e., Consistency, Availability, and Partition tolerance) applies to all the distributed systems like Kafka, Hadoop, Elasticsearch, and many others, where data is being distributed across nodes to:
- Enable parallel data processing
- Maintain high availability for data and service
- Support horizontal or vertical scaling requirements
CAP property | Definition | Kafka behavior |
---|---|---|
Consistency (C) | Every read gets the latest write or an error. | Kafka follows eventual consistency; data ingested into one node eventually syncs across others. Reads happen from the leader, so replicas might lag behind. |
Availability (A) | Every request gets a response, even if some nodes fail. | Kafka ensures high availability using replication factor, min.insync.replicas, and acks. As long as a leader is available, Kafka keeps running. |
Partition Tolerance (P) | The system keeps working despite network issues. | Even if some nodes disconnect, Kafka elects a new leader and continues processing. |
Kafka is an AP system; it prioritizes Availability and Partition Tolerance, while consistency is eventual and tunable via settings.
Now, let’s get into the scenarios.
Stretch cluster failure scenarios
Before going in-depth into scenarios, let’s understand the key components and terminology of Kafka internals.
- Brokers represent data nodes where all the data is distributed with replicas for data high availability.
- The Controller is a critical component of Kafka that maintains the cluster’s metadata, including information about topics, partitions, replicas, and brokers, and also performs administrative tasks. Think of it as the brain of a Kafka cluster.
- A Zookeeper quorum is a group of nodes (usually deployed on dedicated nodes) that acts as a coordination service, helping to manage controller election, broker registration, broker discovery, cluster membership, partition leader election, topic configuration, Access Control Lists (ACLs), consumer group management, and offset management.
- Topics are logical constructs of data to which all related data is published. Examples include network logs published to the “Network_logs” topic and application logs published to the “Application_logs” topic. (A topic creation sketch follows this list.)
- Partitions are the fundamental units or building blocks supporting scalability and parallelism within a topic. When a topic is created, it can have one or more partitions, to which all the data is published in order.
- Offsets are unique identifiers assigned to each message within a partition of a topic, essentially a sequential number that indicates the position of a message within the partition.
- A producer is a client application that publishes records (messages) to Kafka topics.
- A consumer is a client application that consumes records (messages) from Kafka topics.
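To make the topic and partition terminology concrete, the following is a minimal, illustrative sketch that creates a topic with the partition count and replication factor used in our environment via the Kafka AdminClient; the bootstrap address and topic name are placeholders.

***** Start of topic creation sketch (illustrative) *****
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // 10 partitions and replication factor 4, matching the broker configuration above
            NewTopic topic = new NewTopic("Network_logs", 10, (short) 4);
            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("Created topic Network_logs with 10 partitions");
        }
    }
}
***** End of topic creation sketch (illustrative) *****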
Scenario 1: WAN Disruption – Active Controllers as ‘0’ – AKA Brain-Dead Scenario
Scenario | WAN Disruption – Active controllers as ‘0’. |
---|---|
Scenario Description | As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability. This scenario focuses on a WAN disruption between the two regions, causing Kafka service unavailability, leaving the cluster with zero active controllers and no self-recovery, even after hours of waiting. |
Execution Steps | |
Expected Behavior | |
Actual/Observed Behavior | Kafka is unavailable, does not recover on its own after the WAN service is disrupted and re-established, and shows active controllers as ‘0’. |
Root Cause Analysis (RCA) | It’s a “brain-dead” scenario. Upon WAN disruption, a single 4-node cluster becomes 2 independent clusters, each with 2 nodes. |
References | |
Solution/Remedy | Because there are no controller claims by brokers, all the brokers need to be restarted. |
WAN Disruption:
WAN disruption can be simulated by deleting or modifying route tables.
Warning: Deleting a route table may disrupt all traffic in the associated subnets. Instead, consider modifying or detaching the route table.
Firewall Stop
You may also need to stop the firewall:
Firewall status: systemctl status firewalld
Firewall stop: service firewalld stop
ZK Commands list:
ZK Server start: bin/zookeeper-server-start etc/kafka/zookeeper.properties
ZK Server stop: zookeeper-server-stop
ZK Shell commands: bin/zookeeper-shell.sh <zookeeper host>:2181
Active controllers check: get /controller or ls /controller
Brokers list: /home/opc/kafka/bin/zookeeper-shell kafkabrokeraddress:2181 <<<"ls /brokers/ids"
An alternative way to check controllers and brokers in a cluster, available from Kafka 3.7.x (a programmatic check via the Kafka AdminClient is also sketched below):
bin/kafka-metadata-shell.sh --bootstrap-server <KAFKA_BROKER>:9092 --describe-cluster
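As referenced above, a programmatic check with the Kafka AdminClient can provide the same information on older versions as well. Below is a minimal, illustrative sketch; the bootstrap address is a placeholder.

***** Start of cluster state check sketch (illustrative) *****
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterStateCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();

            // Registered brokers
            for (Node node : cluster.nodes().get()) {
                System.out.println("Broker: " + node.idString() + " @ " + node.host());
            }

            // Current controller; a missing controller during a WAN disruption is a red flag
            Node controller = cluster.controller().get();
            System.out.println("Controller: " + (controller == null ? "none" : controller.idString()));
        }
    }
}
***** End of cluster state check sketch (illustrative) *****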
Scenario 2: WAN Disruption – Active Controllers as ‘2’ – AKA Split-Brain Scenario
Scenario | WAN Disruption – Active controllers as ‘2’ |
---|---|
Scenario Description | As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability. This scenario focuses on a WAN disruption between the two regions, causing Kafka service unavailability with active controllers as ‘2’ and no self-recovery, even after hours of waiting. |
Execution Steps | |
Expected Behavior | |
Actual/Observed Behavior | Kafka is not available, cannot be recovered on its own after the WAN service is disrupted and re-established, and shows active controllers as ‘2’. |
Root Cause Analysis (RCA) | |
References | |
Solution/Remedy | Restart both controllers. |
Scenario 3: Broker disk full, resulting in a broker process crash and no active controllers
Scenario | Broker disk full, resulting in the respective broker process crashing, and no active controllers (‘0’ active controllers). |
---|---|
Scenario Description | On a steady Kafka cluster, meaning all the brokers are up and running and both producers and consumers are producing and consuming events, one of the four brokers’ disks fills up due to data skew, i.e., an uneven or imbalanced distribution of data across brokers. This leads to an abrupt broker process crash that cascades to the remaining brokers, making the Kafka cluster unavailable with active controllers as ‘0’. (A disk-usage monitoring sketch follows this table.) |
Execution Steps | |
Expected Behavior | |
Actual/Observed Behavior | Kafka is not available, could not be recovered on its own, and showed active controllers as ‘0’. |
Root Cause Analysis (RCA) | From the logs, we observed disk-related errors and exceptions. The logs show that the controller moved from one broker to another, but no exit or hang was reported. The ‘0’ active controllers behavior is the same as the brain-dead scenario explained in Scenario 1. |
References | |
Solution/Remedy | |
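To catch the data skew behind this scenario before a disk fills up, per-broker log sizes can be polled regularly. Below is a minimal, illustrative sketch using the AdminClient describeLogDirs API; the broker IDs and bootstrap address are assumptions for our four-node setup.

***** Start of disk-usage monitoring sketch (illustrative) *****
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.LogDirDescription;
import org.apache.kafka.clients.admin.ReplicaInfo;

public class BrokerDiskSkewCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // Broker IDs 0-3 are assumed for the four-node cluster described earlier
            Map<Integer, Map<String, LogDirDescription>> dirs =
                    admin.describeLogDirs(Arrays.asList(0, 1, 2, 3)).allDescriptions().get();

            for (Map.Entry<Integer, Map<String, LogDirDescription>> broker : dirs.entrySet()) {
                long totalBytes = 0L;
                for (LogDirDescription dir : broker.getValue().values()) {
                    for (ReplicaInfo replica : dir.replicaInfos().values()) {
                        totalBytes += replica.size(); // on-disk size of each partition replica
                    }
                }
                // A broker whose total is far above the others indicates data skew
                System.out.println("Broker " + broker.getKey() + " log size (bytes): " + totalBytes);
            }
        }
    }
}
***** End of disk-usage monitoring sketch (illustrative) *****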
Scenario 4: Corrupted indices upon Kafka process kill
Scenario | Corrupted indices upon Kafka process kill |
---|---|
Scenario Description | Killing the broker and ZK processes (an uncontrolled shutdown, comparable to an abrupt machine reboot) and restarting the brokers leads to corrupted indices; rebuilding the indices doesn’t work even after hours of waiting. This scenario is comparable to an ungraceful shutdown during an environment upgrade or to a poor security posture, where an unauthorized person could terminate the service. |
Execution Steps | |
Expected Behavior | |
Actual/Observed Behavior | Kafka is not available and couldn’t be recovered on its own. |
Root Cause Analysis (RCA) | The logs show ‘index corrupted’ messages for partitions, as well as ‘rebuilding indices’ messages. Rebuilding the indices didn’t work and showed no progress-related logs; the cluster remained unavailable even after one hour. It is recommended to perform graceful shutdowns instead of killing processes or shutting down abruptly, so that the data in the Kafka local cache is flushed to disk, metadata is maintained, and offsets are committed. |
References | |
Solution/Remedy | If Kafka fails to rebuild all the indices on its own, manual deletion of all index files and restarting all brokers will fix the issue. |
Conclusion on WAN disruptions failure scenarios
Deploying on-premise stretch clusters requires a deep understanding of failure scenarios, particularly during network disruptions. Unlike cloud environments, on-premises setups may lack highly available network infrastructure, fault tolerance mechanisms, or dedicated fiber lines, making them more susceptible to outages. In such cases, a Kafka cluster may experience downtime if network connectivity is disrupted.
Cloud providers, on the other hand, offer high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber connections between regions and data centers. Their managed Kafka services are designed for high availability, typically recommending deployment across multiple zones or regions to withstand zonal failures. Additionally, these services incorporate zone-awareness features, ensuring Kafka data replication occurs across multiple zones, rather than being confined to a single one.
It is also important to understand and emphasize the latency impact on a stretch cluster deployed in two different geographic locations over a WAN connection. We cannot escape physics: in our case, per WonderNetwork, the average latency between London and Frankfurt is 14.88 milliseconds over a distance of 636 km. So, for mission-critical applications, a stretch cluster setup must be driven by latency testing to determine how data is replicated and consumed across regions.
By understanding these differences, organizations can make informed decisions about whether an on-premise stretch cluster aligns with their availability and resilience requirements or if a cloud-native Kafka deployment is a more suitable alternative.
Disaster Recovery Strategies:
A Business Continuity Plan (BCP) is essential for any business to maintain its brand and reputation. Almost all companies consider High Availability (HA) and Disaster Recovery (DR) strategies to meet their BCP. Below are the popular DR strategies for Kafka:
- Active – Standby (aka Active-passive)
- Active – Active
- Backup and restore
Before we dive deeper into the DR strategies, note that we are using Kafka MirrorMaker 2.0 in the following architectural patterns. There are other components, like Kafka Connect, Replicator, etc., offered by Confluent and other cloud services, but they are out of scope for this article.
MirrorMaker 2 in Apache Kafka is a cross-cluster replication tool that uses the Kafka Connect framework to replicate topics between clusters, supporting active-active and active-passive setups. It provides automatic consumer group synchronization, offset translation, and failover support for disaster recovery and multi-region architectures.
Active Standby
This is a popular DR strategy pattern in which two Kafka clusters of identical size and configuration are deployed in two different geographic regions or data centers; in this case, the London and Frankfurt regions. Usually, the choice of geographic location depends on factors such as Earth’s tectonic plates, company or country compliance and regulatory requirements, and others. Only one cluster is active at a given point in time, receiving and serving the clients.
As shown in the following picture, Kafka cluster one (KC1) is located in the London region and Kafka cluster two (KC2) is located in the Frankfurt region with the same size and configuration.
All the producers are producing data to KC1 (London region cluster), and all consumers are consuming the data from the same cluster. KC2 (Frankfurt region cluster) receives the data from the KC1 cluster through Mirror Maker as shown in Figure 1.
Figure 1: Kafka Active Standby (aka Active Passive) cluster setup with Mirror role
In simple terms, we can treat the mirror maker as a combination of producer and consumer, where it consumes the topics from the source cluster KC1 (London region) and produces the data to the destination cluster KC2 (Frankfurt region).
During a disaster affecting the primary/active cluster KC1 (London region), the standby cluster (Frankfurt region) becomes active, and all producer and consumer traffic is redirected to the standby cluster. This can be achieved through DNS redirection via a load balancer in front of the Kafka clusters. Figure 2 shows this scenario.
Figure 2: Kafka Active Standby setup – Standby cluster becomes active
When the London region KC1 cluster becomes available again, we need to set up a MirrorMaker instance to consume the data from the Frankfurt region cluster and produce it back to the London region KC1 cluster. This scenario is shown in Figure 3.
It’s up to the business to keep the Frankfurt region (KC2) active or to promote the London region cluster (KC1) back to primary once the data is fully synced/hydrated to the current state. In some cases, the standby cluster, i.e., the Frankfurt region cluster (KC2), continues to function as the active cluster, and the London region cluster (KC1) becomes the standby.
It is essential to note that RTO (Recovery Time Objective) and RPO (Recovery Point Objective) values are high compared to the Active-Active setup. Active-standby setup cost is medium to high as we need to maintain the same Kafka clusters in both regions.
Figure 3: Kafka Active Standby setup – Standby cluster data sync happening to the KC1
Concerns with the Active-Standby setup
Kafka Mirror Maker Replication Lag (Δ):
Let’s take a scenario, however rare, where we lose the entire active cluster completely, including nodes and disks. The active Kafka cluster (London region – KC1) has ‘m’ messages, and MirrorMaker is consuming from it and producing to the standby cluster. Due to network latency and the consume-produce-acknowledge cycle, the standby cluster (Frankfurt region – KC2) lags behind and holds only ‘n’ messages. This means the delta (Δ) is m minus n:
Δ = m − n
where:
- m = Messages in the Active Kafka Cluster (London)
- n = Messages in the Standby Kafka Cluster (Frankfurt)
Due to this replication lag (Δ), there are the following impacts on mission-critical applications (a sketch for estimating Δ follows this list):
Potential Message Loss:
If a failover occurs before the standby cluster catches up, up to Δ messages may be missing.
Inconsistent Data:
Consumers relying on the Standby cluster may process incomplete or outdated data.
Delayed Processing:
Applications depending on real-time replication may experience latency in data availability.
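One way to estimate Δ before deciding to fail over is to compare log end offsets for the mirrored topic on both clusters. The sketch below is illustrative only: the bootstrap addresses, topic, partition count, and the MirrorMaker 2 source-alias prefix ("london.") are assumptions, and offsets are only a proxy for message counts when retention or compaction is in play.

***** Start of replication lag estimate sketch (illustrative) *****
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplicationLagEstimate {

    // Sum of log-end offsets across a topic's partitions on one cluster
    static long endOffsetSum(String bootstrap, String topic, int partitions) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> tps = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                tps.add(new TopicPartition(topic, p));
            }
            Map<TopicPartition, Long> ends = consumer.endOffsets(tps);
            return ends.values().stream().mapToLong(Long::longValue).sum();
        }
    }

    public static void main(String[] args) {
        // Placeholder addresses for the active (London) and standby (Frankfurt) clusters
        long m = endOffsetSum("london-broker:9092", "Network_logs", 10);
        // MirrorMaker 2 prefixes mirrored topics with the source cluster alias by default
        long n = endOffsetSum("frankfurt-broker:9092", "london.Network_logs", 10);
        System.out.println("Estimated replication lag Δ = m - n = " + (m - n));
    }
}
***** End of replication lag estimate sketch (illustrative) *****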
Active-Active Setup:
The replication lag concern can be overcome with the Active-Active setup. Just like the Active standby setup, we will have the same Kafka clusters deployed in the London and Frankfurt regions.
The producers of the ingestion pipeline write the data to both clusters at any given point in time, which is also known as dual writes. So, both clusters’ data is in sync, as shown in Figure 4.
Please note that the ingestion pipeline or producers should have the capability to support dual writes. It is also important that these write pipelines are isolated, so that an issue with one cluster does not impact the other pipeline or create back pressure and producer latencies. For example, if your use case is log analytics or observability, using Logstash in your ingestion layer and creating two separate pipelines within one or more Logstash instances provides that isolation. Only one cluster, however, is active for the consumers at any given point in time.
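As a minimal sketch of what isolated dual writes can look like at the producer level, the snippet below uses two independent KafkaProducer instances, one per cluster, so each send succeeds or fails on its own; the cluster addresses are placeholders, and in production you would also bound buffer sizes and max.block.ms so a full buffer for one cluster cannot stall the caller.

***** Start of dual-write producer sketch (illustrative) *****
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DualWriteProducer {

    private final KafkaProducer<String, String> london;
    private final KafkaProducer<String, String> frankfurt;

    DualWriteProducer() {
        // Separate producer instances (and buffers) per cluster keep the pipelines isolated
        london = new KafkaProducer<>(config("london-broker:9092"));
        frankfurt = new KafkaProducer<>(config("frankfurt-broker:9092"));
    }

    private static Properties config(String bootstrap) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap); // placeholder address
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    void send(String topic, String key, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        // Each write is acknowledged or fails independently via its own callback,
        // so an outage of one cluster is reported without failing the other write.
        london.send(record, (metadata, exception) -> {
            if (exception != null) System.err.println("London write failed: " + exception.getMessage());
        });
        frankfurt.send(record, (metadata, exception) -> {
            if (exception != null) System.err.println("Frankfurt write failed: " + exception.getMessage());
        });
    }

    void close() {
        london.close();
        frankfurt.close();
    }
}
***** End of dual-write producer sketch (illustrative) *****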
Figure 4: Active-Active setup or Dual writes.
As shown in Figure 5, in case of any issues with one cluster, the consumers need to be redirected via DNS redirection, and the RTO and RPO (or SLAs) are near zero.
Figure 5: Active-Active setup, one cluster is unavailable
Once the London cluster is available and becomes active, data can sync back to the London cluster from the Frankfurt cluster using Mirror Maker. As shown in Fig. 6, consumers can be switched back to the London cluster using DNS redirect. Also, it is important to configure consumer groups to start where they left off.
In this setup, during a failover it is important to configure the consumers’ offsets to read messages from a particular point in time, for example, replaying messages from the current time minus 15 minutes, or to set up a MirrorMaker between the active clusters to replicate ‘only’ the consumer offset topics from the London region cluster to the Frankfurt region cluster. During the failure, you can use source.auto.offset.reset=latest to read the data from the latest offsets in the Frankfurt region cluster.
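For the “replay from the current time minus 15 minutes” option, the consumer’s offsetsForTimes API can translate a timestamp to an offset per partition. Below is a minimal, illustrative sketch; the bootstrap address, topic, partition count, and group ID are placeholders.

***** Start of replay-from-timestamp sketch (illustrative) *****
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "frankfurt-broker:9092"); // placeholder address
        props.put("group.id", "cgg1");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (int p = 0; p < 10; p++) {
                partitions.add(new TopicPartition("Network_logs", p));
            }
            consumer.assign(partitions);

            // Find the first offset at or after "now minus 15 minutes" for each partition
            long replayFrom = System.currentTimeMillis() - Duration.ofMinutes(15).toMillis();
            Map<TopicPartition, Long> query = new HashMap<>();
            for (TopicPartition tp : partitions) {
                query.put(tp, replayFrom);
            }
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            for (TopicPartition tp : partitions) {
                OffsetAndTimestamp target = offsets.get(tp);
                if (target != null) {
                    consumer.seek(tp, target.offset());                // rewind to that point in time
                } else {
                    consumer.seekToEnd(Collections.singletonList(tp)); // no messages after that timestamp
                }
            }
            // consumer.poll(Duration.ofMillis(500)) would now return records from roughly 15 minutes back
        }
    }
}
***** End of replay-from-timestamp sketch (illustrative) *****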
If the consumer groups are not properly configured, you may see duplicate messages in your downstream system, so it’s essential to design the downstream to handle duplicate records. For example, if it’s an observability use case and the downstream is OpenSearch, it’s good practice to derive a unique ID from the message timestamp and other parameters, like session IDs. That way, a duplicate message overwrites the existing record, and only one record is maintained.
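A minimal sketch of building such a deterministic document ID is shown below; the choice of fields (event timestamp plus session ID) follows the example above and is an assumption, and the hash is only there to keep the resulting OpenSearch _id compact.

***** Start of deduplication ID sketch (illustrative) *****
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class DedupIdBuilder {

    // Derive a deterministic ID so a re-delivered duplicate overwrites the same document
    static String documentId(long eventTimestampMillis, String sessionId) throws Exception {
        String naturalKey = eventTimestampMillis + "|" + sessionId;
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha.digest(naturalKey.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        // The same message consumed twice produces the same ID, so the second write
        // overwrites the first record instead of creating a duplicate downstream.
        String first = documentId(1718000000123L, "session-42");
        String second = documentId(1718000000123L, "session-42");
        System.out.println(first.equals(second)); // true
    }
}
***** End of deduplication ID sketch (illustrative) *****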
Figure 6: Active-Active setup cluster syncing data with MirrorMaker
Challenges and Complexities in Active-Active Kafka Setups
Message ordering challenges occur when writing to two clusters simultaneously: messages may arrive in different orders due to network delays between regions. This can be particularly problematic for applications that require strict message ordering, so it needs to be handled at data processing time or in downstream systems. For example, defining a primary key and a timestamp field within the message body allows the messages to be ordered by timestamp.
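One simple way to enable that is to attach an explicit event timestamp (and a stable key) to every record at produce time, so downstream systems can reorder by event time regardless of the arrival order in each cluster. The sketch below is illustrative; the topic, key, and payload are placeholders.

***** Start of timestamped produce sketch (illustrative) *****
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TimestampedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long eventTime = System.currentTimeMillis();
            // The same event time travels with the record to both clusters (partition is left
            // to the default partitioner by passing null), so consumers can order by eventTime
            // even if the two clusters receive the writes in different orders.
            producer.send(new ProducerRecord<>("Network_logs", null, eventTime, "session-42",
                    "{\"sessionId\":\"session-42\",\"eventTime\":" + eventTime + "}"));
            producer.flush();
        }
    }
}
***** End of timestamped produce sketch (illustrative) *****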
Data consistency issues can occur when establishing dual writes to two different clusters, because data might be written successfully to one cluster but fail in the other, resulting in data inconsistency. This requires robust monitoring and reconciliation processes to ensure data integrity across clusters.
Producer complexity is an issue in this active-active setup. Producers need to handle writing to two clusters and manage failures independently for each cluster, so isolation of each pipeline is important. This setup might increase the complexity of the producer applications and require more resources to manage dual writes effectively. There are also agents and tools on the market, including open-source offerings such as Elastic Logstash, that provide dual-pipeline options.
Consumer group management is complex. Managing consumer offsets across multiple clusters could be challenging, especially during failover scenarios, and can lead to message duplication. Consumer applications or downstream systems need to be designed to handle duplicate messages or implement deduplication logic.
Operational Overhead – Running active-active clusters requires sophisticated monitoring tools and increases operational costs significantly. Teams need to maintain detailed runbooks and conduct regular disaster recovery drills to ensure smooth failover processes.
Network Latency Considerations– Cross-region network latency and bandwidth costs become critical factors in active-active setups. Reliable network connectivity between regions is essential for maintaining data consistency and managing replication effectively.
Backup and Restore
Backup and restore is a cost-effective architectural pattern compared to the Active-Standby (AP, i.e., active-passive) or Active-Active (AA) DR architectural patterns. It is suitable for non-critical applications with higher RTO and RPO values than the AA or AP patterns.
Backup and restore is a cost-effective option because we need only one Kafka cluster. If the Kafka cluster is unavailable, the data from the centralized repository can be replayed/restored to the Kafka cluster once it becomes available. In this architectural pattern, shown in Figure 7, all the data is saved to a persistent location like Amazon S3 or another data lake, from which the data can be restored/replayed to the Kafka cluster when it is available again. It is also common to see a new Kafka cluster environment built with Terraform templates and the data then replayed into it.
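As a sketch of the backup leg of this pattern, the snippet below drains a topic and appends each record with its partition, offset, and timestamp to a local file that could then be shipped to object storage such as Amazon S3. It is illustrative only; the bootstrap address, topic, group, output path, and the bounded poll loop are assumptions.

***** Start of topic backup sketch (illustrative) *****
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TopicBackupJob {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "x.x.x.x:9092"); // placeholder address
        props.put("group.id", "backup-job");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("/backup/Network_logs.tsv"),
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {

            consumer.subscribe(Collections.singletonList("Network_logs"));
            for (int i = 0; i < 10; i++) { // bounded loop for illustration; a real job runs continuously
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Persist partition, offset, and timestamp alongside the payload so ordering
                    // and consumer positions can be reconstructed during a restore/replay.
                    out.write(record.partition() + "\t" + record.offset() + "\t"
                            + record.timestamp() + "\t" + record.value());
                    out.newLine();
                }
            }
        }
    }
}
***** End of topic backup sketch (illustrative) *****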
Figure 7: Backup and restore architectural pattern
Challenges and Complexities in Kafka Backup and Restore Setup
Recovery time impact is a major consideration; restoring (aka replaying) large volumes of data to Kafka can take significant time, resulting in a higher RTO. This makes it suitable only for applications that can tolerate longer downtimes.
Data consistency challenges are also an issue. When restoring data from backup storage like Amazon S3, maintaining the exact message ordering and partitioning as the original cluster can be challenging. Special attention is needed to ensure data consistency during restoration.
Backup performance overhead is a further consideration. Regular backups to external storage can impact the performance of the production cluster. The backup process needs careful scheduling and resource management to minimize impact on production workloads.
Storage is key. Managing large volumes of backup data requires effective storage lifecycle policies and cost controls. Regular cleanup of old backups and efficient storage utilization strategies are essential.
Coordination during recovery is mandatory. The recovery process requires careful coordination between backing up offset information and actual messages. Missing either can lead to message loss or duplication during restoration.
Conclusion on Kafka DR Strategies
As we explored earlier, choosing a DR strategy isn’t one-size-fits-all; it comes with trade-offs. It often boils down to finding the right balance between what the business needs and what it can afford. The decision usually depends on several factors.
SLAs (e.g., 99.99%+ uptime) are pre-defined commitments to uptime and availability. An SLA requiring 99.99% uptime allows only 52.56 minutes of downtime per year.
The RTO metric is used to define the maximum acceptable downtime for a system or application after a failure before it causes unacceptable business harm. Knowledge/content/training systems, for example, can expect one hour or more downtime because they are not mission critical. For banking applications, on the other hand, downtime acceptance could be none or a couple of minutes.
The RPO metric is used to define the maximum acceptable amount of data loss during a disruptive event, measured in time from the most recent backup. For example, an observability/monitoring application may accept losing the past hour of data, whereas a financial application cannot tolerate losing a single transaction record.
Associated costs vary. Higher availability and faster recovery typically come at a higher price. Organizations must weigh their need for uptime and data protection against the financial investment required.
DR Strategy | RTO (Downtime tolerance) | RPO (Data loss tolerance) | Use case | Cost |
---|---|---|---|---|
Active Active | Near-Zero (Seconds) | Near-Zero (Seconds) | Mission-critical apps (e.g., banking, e-commerce, stock trading) | High |
Active Standby | Minutes to Hours | Near-Zero to Minutes | Enterprise apps needing high availability | Medium to High (no dual-write ingestion pipeline cost) |
Backup and Restore | Hours to Days | Hours to Days | Non-critical apps with occasional DR needs or cost-sensitive apps | Low |