Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies

News Room | Published 20 June 2025

Key Takeaways

  • Apache Kafka Stretch cluster setup in an on-premise environment poses a high risk of service unavailability during WAN disruptions, often leading to split-brain or brain-dead scenarios and violating Service Level Agreements (SLAs) due to degraded Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Proactive monitoring of your Kafka environment to understand data skews across brokers is critical; an uneven data load on a single node can cause a stretch cluster failure.
  • An ungraceful broker shutdown during upgrades or due to poor security posture can make the Kafka service unavailable, resulting in a loss of data.
  • There are three popular and widely used Kafka disaster recovery strategies; each has challenges, complexities, and pitfalls.
  • Kafka Mirror Maker 2 replication lag (Δ) in the Disaster Recovery setup can cause message loss and inconsistent data.

 

Apache Kafka is a well-known publish-subscribe distributed system widely used across industries for use cases such as log analytics, observability, event-driven architectures, and real-time streaming. Its distributed nature allows it to serve as the critical backbone of modern streaming architectures, offering many integrations and supporting near-real-time or real-time latency requirements.

Today, Kafka is offered as a service by major vendors, including Confluent Platform and cloud offerings such as Confluent Cloud, Amazon Managed Streaming for Apache Kafka (MSK), Google Cloud Managed Service for Apache Kafka, and Azure Event Hubs for Apache Kafka. Additionally, Kafka has been deployed at over one hundred thousand companies, including Fortune 100 firms, across industry segments such as enterprise, financial services, healthcare, automotive, startups, independent software vendors, and retail.

The value of data diminishes over time. Many businesses today compete on time and demand real-time data processing with millisecond responses to power mission-critical applications. Financial services, e-commerce, IoT, and other latency-sensitive industries require highly available and resilient architectures that can withstand infrastructure failures without impacting business continuity.


Companies operating across multiple geographic locations need solutions that provide seamless failover, data consistency, and disaster recovery (DR) capabilities while minimizing operational complexity and cost. One common architectural pattern that companies consider is a single Kafka cluster spanning multiple data center locations over a WAN (Wide Area Network), called a stretch cluster. This pattern is commonly considered for disaster recovery (DR) and high availability (HA).

A stretch cluster can theoretically provide data redundancy across regions and minimize data loss in the event of a region-wide failure. However, this approach comes with several challenges and trade-offs due to the inherent nature of Kafka and its dependency on network latency, consistency, and partition tolerance in a multi-region setup. In this article, we focus on the stretch cluster architectural pattern and its DR and HA implications and considerations.

The following testing has been conducted, and the cluster behavior has been documented.

Environment Details

Regions: London, Frankfurt
Kafka distribution: Apache Kafka
Kafka version: 3.6.2
Zookeeper version: 3.8.2
Operating system: Linux

High-Level Environment Setup View

Below is the high-level view of a single stretch cluster setup that spans the London and Frankfurt regions, with:

  • Four Kafka brokers, where actual data resides.
  • Four Zookeeper nodes (a three-node rather than a four-node setup would work as well, but to mimic the exact environment in both regions, we have set up four Zookeeper nodes).
  • Zookeeper follows [N/2] + 1 rule for quorum establishment. In this case, the Zookeeper quorum will be [4/2] + 1 = 2 + 1 = 3.
  • Network latency between these regions is ~ 15ms.
  • Multiple producers (1 to n) and consumers (1 to n) are producing and consuming the data from this cluster from different regions of the world.

Kafka Broker Hardware Configuration

Hard disk size: 10 TB
Memory: 120 GB
CPU: 16 cores
Disk setup: RAID 0

Kafka Broker Configuration

  • Replication factor set to 4
  • Data retention (aka log retention) period is 7 days
  • Auto-creation of topics allowed
  • Minimum in-sync replicas (min.insync.replicas) set to 3
  • The number of partitions per topic is 10
  • Other configuration details

    • Acks = all
    • max.in.flight.requests.per.connection=1
    • compression.type=lz4
    • batch.size=16384
    • buffer.memory=67108864
    • auto.offset.reset=earliest
    • send.buffer.bytes=524288
    • receive.buffer.bytes=131072
    • num.replica.fetchers=4
    • num.network.threads=12
    • num.io.threads=32
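For reference, the settings above can be consolidated into configuration files roughly as follows. This is a minimal sketch assuming a plain Apache Kafka installation; the broker-side values would live in server.properties, while the client-side values match the prod.config and cons.config files referenced by the perf-test commands below. The file names and the grouping are illustrative.

Broker (server.properties):

default.replication.factor=4
min.insync.replicas=3
num.partitions=10
auto.create.topics.enable=true
# 7-day data retention
log.retention.hours=168
num.replica.fetchers=4
num.network.threads=12
num.io.threads=32

Producer (prod.config):

acks=all
max.in.flight.requests.per.connection=1
compression.type=lz4
batch.size=16384
buffer.memory=67108864
send.buffer.bytes=524288
receive.buffer.bytes=131072

Consumer (cons.config):

auto.offset.reset=earliest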

Producers and Consumers:

Producers:

In this use case, we have used a producer written in Java that consistently produces a defined number (K) of transactions per second (TPS), which can be controlled using a properties file. A sample snippet using the Kafka performance test API is shown below.


******** Start of Producer code snippet *******

    // Requires: java.io.*, java.text.SimpleDateFormat, java.util.Date, java.util.Properties, java.util.TimeZone
    void performAction(int topicN) {
        try {
            // Load producer run parameters (record count, size, throughput) from a properties file
            Properties prop = new Properties();
            try (InputStream input = new FileInputStream("config.properties")) {
                prop.load(input);
            }

            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
            sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
            String startTime = sdf.format(new Date());
            System.out.println("Start time ---> " + startTime + " topic " + topicN);

            // Build the kafka-producer-perf-test command for the given topic
            String command = "/home/opc/kafka/bin/kafka-producer-perf-test --producer-props bootstrap.servers=x.x.x.x:9092"
                    + " --topic " + topicN
                    + " --num-records " + prop.getProperty("numOfRecords")
                    + " --record-size " + prop.getProperty("recordSize")
                    + " --throughput " + prop.getProperty("throughputRate")
                    + " --producer.config " + "/home/opc/kafka/bin/prod.config";

            ProcessBuilder processBuilder = new ProcessBuilder();
            processBuilder.command("bash", "-c", command);

            Process process = processBuilder.start();
            BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));

            // Read the perf-test output until the summary line (contains "99.9th.") appears
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("99.9th.")) {
                    break;
                }
            }

            int exitVal = process.waitFor();
            if (exitVal == 0) {
                System.out.println("Success!");
                String endTime = sdf.format(new Date());
                // Send the result line to OpenSearch for tracking purposes - optional step
                pushToIndex(line, startTime, endTime, prop.getProperty("indexname"));
            } else {
                System.out.println("Something unexpected happened!");
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

********** End of producer code snippet **************


****** Start of Producer properties file ***** 
numOfRecords=70000
recordSize=4000
throughputRate=20
topicstart=1
topicend=100
****** End of Producer properties file ***** 

Consumers:

In this use case, we have used consumers written in Java that consistently consume a defined number (K) of transactions per second (TPS), which can be controlled using a properties file. A sample snippet is shown below.


***** Start of consumer code snippet ******* 
    // Requires: java.io.*, java.text.SimpleDateFormat, java.util.Date, java.util.Properties, java.util.TimeZone
    void performAction(int topicN) {
        try {
            // Load consumer run parameters (record count, timeout, offset mode) from a properties file
            Properties prop = new Properties();
            try (InputStream input = new FileInputStream("consumer.properties")) {
                prop.load(input);
            }

            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
            sdf.setTimeZone(TimeZone.getTimeZone("America/New_York"));
            String startTime = sdf.format(new Date());
            System.out.println("Start time ---> " + startTime + " topic " + topicN);

            // Build the kafka-consumer-perf-test command; add --from-latest when configured
            String command = "/home/opc/kafka/bin/kafka-consumer-perf-test --broker-list=x.x.x.x:9092"
                    + " --topic " + topicN
                    + " --messages " + prop.getProperty("numOfRecords")
                    + " --timeout " + prop.getProperty("timeout")
                    + " --consumer.config " + "/home/opc/kafka/bin/cons.config";
            if (!prop.getProperty("latest").equalsIgnoreCase("no")) {
                command = command + " --from-latest";
            }
            System.out.println("command is ---> " + command);

            ProcessBuilder processBuilder = new ProcessBuilder();
            processBuilder.command("bash", "-c", command);

            Process process = processBuilder.start();
            BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));

            // Read the perf-test output until the results line appears, skipping warning lines
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(":") && line.contains("-") && !line.contains("WARNING:")) {
                    break;
                }
            }

            int exitVal = process.waitFor();
            if (exitVal == 0) {
                System.out.println("Success!");
                // Optional step - send the result line to OpenSearch
                pushToIndex(line, prop.getProperty("indexname"), topicN);
            } else {
                System.out.println("Something unexpected happened!");
            }

        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

***** End of consumer code snippet *******


*****Start of Consumer properties snippet **** 

numOfRecords=70000
topicstart=1
topicend=100
groupid=cgg1
reportingIntervalms=10000
fetchThreads=1
latest=no
processingthreads=1
timeout=60000

***** End of consumer properties snippet ****

CAP Theorem

Before diving into further discussion and scenario execution, it's crucial to understand the fundamental principles of distributed systems. This foundational knowledge will help you better correlate WAN disruptions with their impact on system behavior.

The CAP theorem (Consistency, Availability, and Partition tolerance) applies to all distributed systems, such as Kafka, Hadoop, Elasticsearch, and many others, where data is distributed across nodes to:

  • Enable parallel data processing
  • Maintain high availability for data and service
  • Support horizontal or vertical scaling requirements

Here is how each CAP property maps to Kafka:

  • Consistency (C): Every read gets the latest write or an error. Kafka follows eventual consistency; data ingested into one node eventually syncs across others. Reads happen from the leader, so replicas might lag behind.
  • Availability (A): Every request gets a response, even if some nodes fail. Kafka ensures high availability using the replication factor, min.insync.replicas, and acks. As long as a leader is available, Kafka keeps running.
  • Partition Tolerance (P): The system keeps working despite network issues. Even if some nodes disconnect, Kafka elects a new leader and continues processing.

Kafka is an AP system; it prioritizes Availability and Partition Tolerance, while consistency is eventual and tunable via settings.

Now, let’s get into the scenarios.

Stretch Cluster Failure Scenarios

Before going in-depth into scenarios, let’s understand the key components and terminology of Kafka internals.

  • Brokers represent data nodes where all the data is distributed with replicas for data high availability.
  • The Controller is a critical component of Kafka that maintains the cluster’s metadata, including information about topics, partitions, replicas, and brokers, and also performs administrative tasks. Think of it as the brain of a Kafka cluster.
  • A Zookeeper quorum is a group of nodes (usually deployed on dedicated nodes) that act as a coordination service helping to manage the Controller election, Broker Registration, Broker Discovery, Cluster membership, Partition leader election, Topic configuration, Access Control List (ACLs), consumer group management, and offset management.
  • Topics are logical constructs of data to which all related data is published. Examples include all the network logs published to the “Network_logs” topic and application logs published to the “Application_logs” topic.
  • Partitions are the fundamental units or building blocks supporting scalability and parallelism within a topic. When a topic is created, it can have one or more partitions where all the data is published in order.
  • Offsets are unique identifiers assigned to each message within a partition of a topic, essentially a sequential number that indicates the position of a message within the partition.
  • A producer is a client application that publishes records (messages) to Kafka topics.
  • A consumer is a client application that consumes records (messages) from Kafka topics.

Scenario 1: WAN Disruption – Active Controllers as ‘0’ – AKA Brain-Dead Scenario
Scenario: WAN Disruption – Active controllers as ‘0’.
Scenario Description

As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability.


This scenario focuses on WAN disruption between two regions, causing Kafka service unavailability, leaving the cluster with zero active controllers and no self-recovery, even after hours of waiting.

Execution Steps

  1. Kafka cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers’ equilibrium state (this step is optional, but simulating a real use case)
  3. Disconnect/disrupt WAN service between regions
  4. Observe the cluster behavior through the broker host CLI and ZK CLI. If you are using Confluent, you can check this in Confluent Control Center; for the AWS-managed service, you can check the same in CloudWatch, and similarly for other managed offerings:

    • a. ZK quorum availability
    • b. Brokers availability as cluster members
    • c. Number of active controllers
    • d. Producers and consumers progress
    • e. Under replicated partitions, etc.

  5. Reestablish WAN service between regions and observe the cluster behavior.

Expected Behavior

  1. On steady state, the Kafka cluster should be available with

    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics


  2. Upon WAN service disruption,

    1. The Kafka cluster should not be available
    2. ZK quorum 3 will break
    3. Active controllers as 1 or zero
    4. Producers and consumers should be interrupted, create back pressure or time-outs.


  3. Reestablish WAN service

    1. ZK quorum 3 should be reestablished
    2. Kafka cluster should be available with 4 brokers
    3. The active controller should be 1
    4. Producers and consumers should continue/resume


Actual/Observed Behavior

Kafka became unavailable upon WAN service disruption, did not recover on its own even after the WAN was reestablished, and showed active controllers as ‘0’.


  • Active brokers are reported as zero
  • Active controller number is zero
  • The ZK quorum of 3 re-formed when WAN service resumed, and the quorum broke during the disruption
  • All brokers showed up in ZK cluster membership upon WAN service resume, while only two brokers showed during the WAN disruption
  • Producers and consumers were completely interrupted and timed out

Root Cause Analysis (RCA)

It’s a “brain-dead” scenario. Upon WAN disruption, a single 4-node cluster becomes 2 independent clusters, each with 2 nodes.


  • The existing single Zookeeper quorum also splits into two quorums, one for each site. Neither quorum elects a new controller, because each assumes a controller already exists, resulting in zero controllers upon WAN reestablishment.
  • Imagine two cooks in a kitchen preparing a dish. Each cook assumes the other has added the salt, so in the end, no salt is added to the dish.
  • From the logs, it was observed that the controller reported "moved to another broker" exceptions, but no process exit or hang was reported in the logs.



Controller exceptions:
controller.log:org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting 
controller.log:java.io.IOException: Client was shutdown before response was read
controller.log:java.lang.InterruptedException


  • From the ZooKeeper (ZK) CLI, it was also observed that no broker claims to be the controller; as a result, the cluster failed to respond properly to state changes, including topic or partition creation and broker failures.



ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ 2 ]

ERROR Exception while listening (org.apache.zookeeper.server.quorum.QuorumCnxManager)

References

  • WAN disruption can be simulated by modifying or deleting route tables.
  • ZK commands: refer to the ZK commands section below.

Solution/Remedy: Because no broker claims the controller role, all the brokers need to be restarted.

WAN Disruption:

WAN disruption can be done by deleting or modifying route tables.

Warning: Deleting a route table may disrupt all traffic in the associated subnets. Instead, consider modifying or detaching the route table.

Firewall Stop

You may need to shut down firewalls:

Firewall status: systemctl status firewalld

Firewall stop: service firewalld stop

ZK Commands list:

ZK Server start: bin/zookeeper-server-start etc/kafka/zookeeper.properties

ZK Server stop: zookeeper-server-stop

ZK Shell commands: bin/zookeeper-shell.sh <zookeeper host>:2181

Active controllers check: get /controller or ls /controller

Brokers list: /home/opc/kafka/bin/zookeeper-shell kafkabrokeraddress:2181 <<<"ls /brokers/ids"

An alternative way to check controllers and brokers in a cluster (available from Kafka 3.7.x):

bin/kafka-metadata-shell.sh --bootstrap-server <KAFKA_BROKER>:9092 --describe-cluster
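Under-replicated partitions (step 4e in the execution steps) can also be checked with the standard topics tool; a minimal sketch, reusing the broker address placeholder from the snippets above:

Under-replicated partitions check: bin/kafka-topics.sh --bootstrap-server x.x.x.x:9092 --describe --under-replicated-partitions

Unavailable partitions check: bin/kafka-topics.sh --bootstrap-server x.x.x.x:9092 --describe --unavailable-partitions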

Scenario 2: WAN Disruption – Active Controllers as ‘2’ – AKA Split-Brain Scenario
Scenario: WAN Disruption – Active controllers as ‘2’
Scenario Description: As discussed in earlier sections, in a distributed network setup across geographic locations, network service interruptions can occur because no service guarantees 100% availability. This scenario focuses on WAN disruption between two regions, causing Kafka service unavailability with active controllers as ‘2’ and no self-recovery, even after hours of waiting.
Execution Steps

  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers’ equilibrium state (this step is optional)
  3. Disconnect/Disrupt WAN service between regions
  4. Observe the cluster behavior through the broker host CLI and ZK CLI. If you are using Confluent, you can check this in Confluent Control Center; for the AWS-managed service, you can check the same in CloudWatch:

    1. ZK quorum availability
    2. Brokers’ availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions etc.

  5. Reestablish WAN service between regions and observe the cluster behavior.

Expected Behavior

  1. In a steady state, the Kafka cluster should be available with

    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics

  2. Upon WAN service disruption,

    1. The Kafka cluster should not be available
    2. ZK quorum 3 will break
    3. Active controllers as 1 or zero
    4. Producers and consumers should be interrupted or create back pressure or time-outs.

  3. Re-establish WAN service

    1. ZK quorum 3 should be reestablished
    2. Kafka cluster should be available with 4 brokers
    3. Active controller should be 1
    4. Producers and consumers should continue/resume

Actual/Observed Behavior

Kafka became unavailable upon WAN service disruption, did not recover on its own even after the WAN was re-established, and showed active controllers as ‘2’.


  • Active brokers are reported as zero
  • Active controller number is two
  • The ZK quorum of 3 re-formed when WAN service resumed, and the quorum broke during the disruption
  • All brokers showed up in ZK cluster membership upon WAN service resume, while only two brokers showed during the WAN disruption
  • Producers and consumers were completely interrupted and timed out with NETWORK_EXCEPTION errors
  • Producers got failed requests with NOT_LEADER_FOR_PARTITION exceptions

Root Cause Analysis (RCA)

  • This is a split-brain scenario. Upon WAN disruption, the single 4-node cluster becomes 2 independent clusters, each consisting of 2 nodes.
  • The existing single Zookeeper quorum also splits into two quorums, one for each site. As a result, each quorum elects a new controller, resulting in two controllers.
  • Upon WAN re-establishment, we see 2 controllers for the 4-node cluster. Think of it like two co-pilots flying one plane from different cockpits after radio loss: both think the other is gone and take over, and the conflicting actions crash the system.
  • From the ZooKeeper (ZK) CLI, it was also observed that two brokers claim to be the controller; as a result, the cluster failed to respond properly to state changes such as topic or partition creation or broker failures.

References

  • WAN disruption can be simulated by deleting route tables
  • ZK commands

Solution/Remedy: Restart both controller brokers.
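A minimal sketch of this remedy, assuming shell access to the brokers, the stock Apache Kafka scripts, and the host/path placeholders used earlier (adapt to your own service manager if Kafka runs under systemd):

# On each site, confirm which broker currently claims the controller role
bin/zookeeper-shell.sh <zookeeper host>:2181 get /controller

# Restart the brokers that claim the controller role, one at a time
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties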

Scenario 3: Broker Disk Full, Resulting in a Broker Process Crash and No Active Controllers
Scenario: Broker disk full, causing the respective broker process to crash and leaving ‘0’ active controllers.
Scenario Description: On a steady Kafka cluster (all brokers up and running, with producers and consumers producing and consuming events), one of the 4 brokers fills its disk due to data skew, i.e., an uneven or imbalanced distribution of data across brokers. This leads to an abrupt broker process crash that cascades to the remaining brokers, making the Kafka cluster unavailable with active controllers as ‘0’.
Execution Steps

  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers’ equilibrium state
  3. One of the 4 brokers’ disks fills up, causing the respective broker process to crash
  4. Observe the cluster behavior through the broker host CLI or ZK CLI. If you are using Confluent, you can check this in Confluent Control Center; for the AWS-managed service, you can check the same in CloudWatch:

    1. ZK quorum availability
    2. Brokers availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions, etc.

Expected Behavior

  1. In a steady state, Kafka cluster should be available with

    1. 4 brokers
    2. ZK quorum as 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics

  2. Upon one broker’s disk full,

    1. Disk error should be logged to logs.
    2. The Kafka cluster should remain available; only the affected broker should be shown as unavailable or out of cluster membership
    3. ZK quorum shouldn’t fail
    4. Active controllers can momentarily be zero (if the controller node is the dead broker), and the controller role should move to another available broker.
    5. The cluster should report under replicated partitions due to one broker unavailability.
    6. Producers and consumers should be interrupted, create back pressure, or time-outs.

Actual/Observed Behavior

Kafka is not available, could not recover on its own, and showed active controllers as ‘0’.


  • Active brokers are reported as zero
  • The active controller number is zero
  • No impact on Zookeeper quorum
  • Producers and consumers were completely interrupted and timed out

Root Cause Analysis (RCA)

From the logs, disk-related errors and exceptions were observed:



logs/state-change.log:102362: ERROR [Broker id=1] Skipped the become-follower state change with correlation id 331 from controller 1 epoch 1 for partition topic1-9 (last update controller epoch 1) with leader 5 since the replica for the partition is offline due to disk error org.apache.kafka.common.errors.KafkaStorageException: Error while creating log for topic1-9 in dir /kafka/kafka-logs (state.change.logger)


The logs also show that the controller reported "moved to another broker" exceptions, but no process exit or hang was reported in the logs:



Controller exceptions:
controller.log:org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting
controller.log:java.io.IOException: Client was shutdown before response was read
controller.log:java.lang.InterruptedException


The ‘0’ active controllers behavior is the same as the brain-dead scenario explained in Scenario 1.

References

  • ZK active controller commands
  • ZK broker membership commands

Solution/Remedy

  • It is essential to monitor your Kafka cluster with proper metrics such as CPU, Disk, Memory, Java Heap, etc. Once the disk reaches 80% or more, you should scale up your cluster.
  • Because no broker claims the controller role, all the brokers need to be restarted.
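As a simple illustration of the first point, a cron-friendly shell check on the Kafka data directory could look like the sketch below; the 80% threshold comes from the guidance above, while the /kafka/kafka-logs path is taken from the disk error in the logs and should be adapted to your layout:

#!/bin/bash
# Warn when the Kafka data disk crosses 80% usage
THRESHOLD=80
USAGE=$(df --output=pcent /kafka/kafka-logs | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARN: Kafka data disk at ${USAGE}% - plan a scale-up or partition rebalance"
fi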

Scenario 4: Corrupted indices upon Kafka process kill
Scenario: Corrupted indices upon Kafka process kill
Scenario Description

Killing the broker and ZK processes (an uncontrolled shutdown, comparable to an abrupt machine reboot) and then restarting the brokers leads to corrupted indices; rebuilding the indices does not complete even after hours of waiting.


This scenario is comparable to an ungraceful shutdown during an environment upgrade or to a poor security posture, where an unauthorized person could terminate the service.

Execution Steps

  1. Kafka Cluster (all brokers and Zookeeper quorum) should be up and running with the configuration stated in the environment section
  2. Produce and consume events with brokers’ equilibrium state
  3. Kill the broker and ZK processes, and restart the processes
  4. Observe the cluster behavior through the broker host CLI or ZK CLI. If you are using Confluent, you can check this in Confluent Control Center; for the AWS-managed service, you can check the same in CloudWatch:

    1. ZK quorum availability
    2. Brokers’ availability as cluster members
    3. Number of active controllers
    4. Producers and consumers progress
    5. Under replicated partitions etc.

Expected Behavior

  1. In a steady state, the Kafka cluster should be available with

    1. 4 brokers
    2. ZK quorum is 3
    3. Active controllers as 1
    4. Producers and consumers should produce and fetch events from topics

  2. Kill Kafka processes (kill -9), i.e., both broker and ZK processes on all 4 nodes and restart processes

    1. ZK quorum should be formed
    2. Active controllers as ‘1’
    3. In-flight messages may be lost if not flushed to the disk
    4. May result in corrupted indices, but the broker rebuilds corrupted indices on its own without any manual work
    5. Upon rebuild of corrupted indices, the Kafka cluster should be available
    6. The cluster might show under replicated partitions due to corrupted indices
    7. Producers and consumers will get exceptions or errors and can’t continue to perform their jobs

Actual/Observed Behavior

Kafka is not available and couldn’t be recovered on its own.


  • Active brokers are ‘4’
  • Active controller number is ‘1’
  • No impact on Zookeeper quorum
  • Producers and consumers were completely interrupted and timed out

Root Cause Analysis (RCA)

The logs show ‘corrupted index’ entries for partitions, along with ‘rebuilding index’ entries. The rebuild did not complete and showed no progress in the logs; the cluster remained unavailable even after one hour.



WARN [Log partition=topic1-13, dir=/kafka/new/kafka-logs] Found a corrupted index file corresponding to log file /kafka/new/kafka-logs/topic1-13/00000000000000000000.log due to Corrupt time index found, time index file (/kafka/new/kafka-logs/topic1-13/00000000000000000000.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1553657637686}, recovering segment and rebuilding index files... (kafka.log.Log)


It is also recommended to perform graceful shutdowns instead of killing processes or shutting down abruptly, so that the data in the Kafka local cache is flushed to the disks, metadata is maintained, and offsets are committed.

Solution/Remedy: If Kafka fails to rebuild all the indices on its own, manually deleting all index files and restarting all brokers will fix the issue.
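A minimal sketch of that manual remedy, assuming the /kafka/new/kafka-logs data directory from the log excerpt above and the stock Apache Kafka scripts; Kafka rebuilds the deleted .index and .timeindex files from the .log segments on startup:

# On each broker: stop the process, delete index files, and restart
bin/kafka-server-stop.sh
find /kafka/new/kafka-logs -name "*.index" -delete
find /kafka/new/kafka-logs -name "*.timeindex" -delete
bin/kafka-server-start.sh -daemon config/server.properties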

Conclusion on WAN disruptions failure scenarios

Deploying on-premise stretch clusters requires a deep understanding of failure scenarios, particularly during network disruptions. Unlike cloud environments, on-premises setups may lack highly available network infrastructure, fault tolerance mechanisms, or dedicated fiber lines, making them more susceptible to outages. In such cases, a Kafka cluster may experience downtime if network connectivity is disrupted.

Cloud providers, on the other hand, offer high-bandwidth, low-latency networking over fully redundant, dedicated metro fiber connections between regions and data centers. Their managed Kafka services are designed for high availability, typically recommending deployment across multiple zones or regions to withstand zonal failures. Additionally, these services incorporate zone-awareness features, ensuring Kafka data replication occurs across multiple zones, rather than being confined to a single one.

It is also important to understand and emphasize the latency impact on a stretch cluster deployed in two different geographic locations over a WAN connection. We cannot escape physics; in our case, as per WonderNetwork, the average latency between London and Frankfurt is 14.88 milliseconds for a distance of 636 km. So, for mission-critical applications, a stretch cluster setup must be validated against these latencies through testing, to determine how the data is replicated and consumed across regions.

By understanding these differences, organizations can make informed decisions about whether an on-premise stretch cluster aligns with their availability and resilience requirements or if a cloud-native Kafka deployment is a more suitable alternative.

Disaster Recovery Strategies:

A Business Continuity Plan (BCP) is essential for any business to maintain its brand and reputation. Almost all companies consider High Availability (HA) and Disaster Recovery (DR) strategies to meet their BCP. Below are the popular DR strategies for Kafka.

  • Active – Standby (aka Active-passive)
  • Active – Active
  • Backup and restore

Before we dive deeper into DR strategies, note that we are using Kafka MirrorMaker 2.0 in the following architectural patterns. There are other components, like Kafka Connect, Replicator, etc., offered by Confluent and other cloud services, but they are out of scope for this article.

MirrorMaker 2 in Apache Kafka is a cross-cluster replication tool that uses the Kafka Connect framework to replicate topics between clusters, supporting active-active and active-passive setups. It provides automatic consumer group synchronization, offset translation, and failover support for disaster recovery and multi-region architectures.
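For illustration, here is a minimal MirrorMaker 2 configuration sketch for the London-to-Frankfurt replication used in the Active-Standby pattern below. The cluster aliases, bootstrap addresses, and topic pattern are assumptions to adapt; the file is launched with bin/connect-mirror-maker.sh mm2.properties:

# mm2.properties - replicate all topics from London (KC1) to Frankfurt (KC2)
clusters = london, frankfurt
london.bootstrap.servers = london-broker1:9092,london-broker2:9092
frankfurt.bootstrap.servers = frankfurt-broker1:9092,frankfurt-broker2:9092

london->frankfurt.enabled = true
london->frankfurt.topics = .*

# Emit checkpoints and sync consumer group offsets so consumers can resume on the standby cluster
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
replication.factor = 4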

Active Standby

This is a popular DR strategy pattern with two different Kafka clusters of identical size and configuration deployed in two different geographic regions or data centers; in this case, the London and Frankfurt regions. Usually, the choice of geographic location depends on the Earth’s tectonic plates, company or country compliance and regulatory requirements, and other factors. Only one cluster is active at a given point in time, receiving and serving client traffic.

As shown in the following picture, Kafka cluster one (KC1) is located in the London region and Kafka cluster two (KC2) is located in the Frankfurt region with the same size and configuration.

All the producers are producing data to KC1 (London region cluster), and all consumers are consuming the data from the same cluster. KC2 (Frankfurt region cluster) receives the data from the KC1 cluster through Mirror Maker as shown in Figure 1.

Figure 1: Kafka Active Standby (aka Active Passive) cluster setup with Mirror role

In simple terms, we can treat the mirror maker as a combination of producer and consumer, where it consumes the topics from the source cluster KC1 (London region) and produces the data to the destination cluster KC2 (Frankfurt region).

During a disaster affecting the primary/active cluster KC1 (London region), the standby cluster (Frankfurt region) becomes active, and all producer and consumer traffic is redirected to the standby cluster. This can be achieved through DNS redirection via a load balancer in front of the Kafka clusters. Fig. 2 shows this scenario.

Figure 2: Kafka Active Standby setup – Standby cluster becomes active

When the London region KC1 cluster becomes available again, we need to set up a MirrorMaker instance to consume the data from the Frankfurt region cluster and produce it back to the London region KC1 cluster. This scenario is shown in Fig. 3.

It’s up to the business to keep the Frankfurt region cluster (KC2) active or to promote the London region cluster (KC1) back to primary once the data is fully synced/hydrated to the current state. In some cases, the standby cluster, i.e., the Frankfurt region cluster (KC2), continues to function as the active cluster and the London region cluster (KC1) becomes the standby instead.

It is essential to note that the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) values are higher compared to the Active-Active setup. The Active-Standby setup cost is medium to high, as we need to maintain identical Kafka clusters in both regions.

Figure 3: Kafka Active Standby setup – Standby cluster data sync happening to the KC1

Concerns with the Active-Standby setup

Kafka Mirror Maker Replication Lag (Δ):

Let’s take a scenario, however rare, where we lose the active cluster completely, including nodes, disks, etc. The active Kafka cluster (London region – KC1) holds ‘m’ messages, and MirrorMaker is consuming from it and producing to the standby cluster. Due to network latency and the consume-produce-acknowledge cycle, the standby cluster (Frankfurt region – KC2) lags behind and holds only ‘n’ messages. The delta (Δ) is therefore m minus n.

Δ = m − n

where:

  • m = Messages in the Active Kafka Cluster (London)
  • n = Messages in the Standby Kafka Cluster (Frankfurt)

Due to this replication lag (Δ), there will be the following impacts on mission-critical applications:

Potential Message Loss:

If a failover occurs before the standby cluster catches up, some messages (Δ) may be missing.

Inconsistent Data:

Consumers relying on the Standby cluster may process incomplete or outdated data.

Delayed Processing:

Applications depending on real-time replication may experience latency in data availability.
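One way to observe Δ in practice is to compare the end offsets of a source topic on the active cluster with its mirrored counterpart on the standby cluster. Below is a minimal Java sketch; it assumes MirrorMaker 2's default replication policy (which prefixes mirrored topics with the source cluster alias, e.g. london.topic1) and placeholder bootstrap addresses.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.*;

public class ReplicationLag {
    // Sum of end offsets across all partitions of a topic on one cluster
    static long totalEndOffset(String bootstrap, String topic) {
        Properties p = new Properties();
        p.put("bootstrap.servers", bootstrap);
        p.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> c = new KafkaConsumer<>(p)) {
            List<TopicPartition> tps = new ArrayList<>();
            c.partitionsFor(topic).forEach(pi -> tps.add(new TopicPartition(topic, pi.partition())));
            return c.endOffsets(tps).values().stream().mapToLong(Long::longValue).sum();
        }
    }

    public static void main(String[] args) {
        long m = totalEndOffset("london-broker1:9092", "topic1");           // active cluster KC1
        long n = totalEndOffset("frankfurt-broker1:9092", "london.topic1"); // mirrored topic on standby KC2
        System.out.println("Replication lag (delta) = " + (m - n) + " messages");
    }
}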

Active-Active Setup:

The replication lag concern can be overcome with the Active-Active setup. Just like the Active standby setup, we will have the same Kafka clusters deployed in the London and Frankfurt regions.

The producers of the ingestion pipeline write the data to both clusters at any given point in time, which is also known as dual writes. As a result, both clusters’ data stays in sync, as shown in Fig. 4.

Please note that the ingestion pipeline or producers must have the capability to support dual writes. It is also important that these write pipelines are isolated, so that an issue with one cluster does not impact the other pipeline or create back pressure and producer latencies. For example, if your use case is log analytics or observability, using Logstash in your ingestion layer and creating two separate pipelines within one or more Logstash instances provides this isolation, as sketched below. Only one cluster is active for the consumers at any given point in time.
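A minimal sketch of such pipeline isolation with Logstash, assuming two pipelines registered in pipelines.yml, each with its own Kafka output; the broker addresses, topic name, and input section are placeholders to adapt:

# pipelines.yml - one isolated pipeline per Kafka cluster
- pipeline.id: to-london
  path.config: "/etc/logstash/conf.d/to-london.conf"
- pipeline.id: to-frankfurt
  path.config: "/etc/logstash/conf.d/to-frankfurt.conf"

# to-london.conf (to-frankfurt.conf is identical apart from bootstrap_servers)
# input { ... }  <- your ingestion source; a distributor pipeline can fan out to both
output {
  kafka {
    bootstrap_servers => "london-broker1:9092"
    topic_id => "application_logs"
    codec => json
  }
}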

Figure 4: Active-Active setup or Dual writes.

As shown in Figure 5, if there are issues with one cluster, the consumers are redirected via DNS, and RTO and RPO (and hence SLAs) remain near zero.

Figure 5: Active-Active setup, one cluster is unavailable

Once the London cluster is available and becomes active, data can sync back to the London cluster from the Frankfurt cluster using Mirror Maker. As shown in Fig. 6, consumers can be switched back to the London cluster using DNS redirect. Also, it is important to configure consumer groups to start where they left off.

In this setup, during failover it is important to configure consumer offsets to read messages from a particular point in time, for example replaying messages from the current time minus 15 minutes, or to set up a MirrorMaker between the active clusters that replicates ‘only’ the consumer offset topics from the London region cluster to the Frankfurt region cluster. During the failure, you can use source.auto.offset.reset=latest to read the data from the latest offsets in the Frankfurt region cluster.
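A minimal Java sketch of the "current time minus 15 minutes" replay approach; the topic name, group id, and bootstrap address are placeholders, and the partitions are assigned explicitly to keep the example self-contained:

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import java.util.*;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "frankfurt-broker1:9092");
        p.put("group.id", "cgg1");
        p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            // Assign all partitions of the topic explicitly (no group rebalance needed for this sketch)
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor("topic1").forEach(pi -> partitions.add(new TopicPartition("topic1", pi.partition())));
            consumer.assign(partitions);

            // Look up the earliest offset at or after "now minus 15 minutes" for every partition
            long replayFrom = System.currentTimeMillis() - 15 * 60 * 1000L;
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, replayFrom));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            // Seek each partition to that offset (or to the end if no record is newer than the timestamp)
            offsets.forEach((tp, oat) -> {
                if (oat != null) consumer.seek(tp, oat.offset());
                else consumer.seekToEnd(Collections.singletonList(tp));
            });
            // consumer.poll(...) from here replays the last 15 minutes of messages
        }
    }
}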

If the consumer groups are not properly configured, you may see duplicate messages in your downstream system, so it’s essential to design the downstream to handle duplicate records. For example, if it’s an observability use case and the downstream is OpenSearch, it’s good practice to derive a unique ID from the message timestamp and other parameters like session IDs, so that a duplicate message overwrites the existing record and only one record is maintained.
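For example, a deterministic document ID can be derived from the message timestamp and session ID so that a replayed duplicate overwrites the same OpenSearch document instead of creating a second one; a minimal Java sketch with hypothetical field names:

    // Build a stable document _id from fields that uniquely identify the event
    static String documentId(long eventTimestampMillis, String sessionId) throws Exception {
        String key = eventTimestampMillis + "|" + sessionId;
        byte[] hash = java.security.MessageDigest.getInstance("SHA-256")
                .digest(key.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : hash) sb.append(String.format("%02x", b));
        return sb.toString(); // indexing a duplicate produces the same _id and overwrites the record
    }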

Figure 6: Active-Active setup cluster syncing data with MirrorMaker

Challenges and Complexities in Active-Active Kafka Setups

Message ordering challenges occur when writing to two clusters simultaneously: messages may arrive in different orders due to network delays between regions. This can be particularly problematic for applications that require strict message ordering, so it needs to be handled at data processing time or in downstream systems. For example, defining a primary key from a timestamp field within the message body allows the messages to be ordered by timestamp.

Data consistency issues can occur when performing dual writes to two different clusters: data might be written successfully to one cluster but fail in the other, resulting in inconsistency. This requires robust monitoring and reconciliation processes to ensure data integrity across clusters.

Producer complexity is an issue in this active-active setup. Producers need to handle writing to two clusters and manage failures independently for each cluster, so isolation of each pipeline is important. This increases the complexity of the producer applications and requires more resources to manage dual writes effectively. There are agents and tools on the market, including open-source offerings such as Elastic Logstash, that provide dual-pipeline options.

Consumer group management is complex. Managing consumer offsets across multiple clusters could be challenging, especially during failover scenarios, and can lead to message duplication. Consumer applications or downstream systems need to be designed to handle duplicate messages or implement deduplication logic.

Operational Overhead – Running active-active clusters requires sophisticated monitoring tools and increases operational costs significantly. Teams need to maintain detailed runbooks and conduct regular disaster recovery drills to ensure smooth failover processes.

Network Latency Considerations– Cross-region network latency and bandwidth costs become critical factors in active-active setups. Reliable network connectivity between regions is essential for maintaining data consistency and managing replication effectively.

Backup and Restore

Backup and restore is a cost-effective architectural pattern compared to the Active-Standby (aka Active-Passive, AP) or Active-Active (AA) DR architectural patterns. It is suitable for non-critical applications that can tolerate higher RTO and RPO values than the AA or AP patterns.

Backup and restore is a cost-effective option because we need only one Kafka cluster. If the Kafka cluster becomes unavailable, the data from the centralized repository can be replayed/restored to the Kafka cluster once it is available again. In this architectural pattern, shown in Figure 7, all the data is saved to a persistent location like Amazon S3 or another data lake, from which it can be restored/replayed to the Kafka cluster. It is also common to build a new Kafka cluster environment from Terraform templates and replay the data into it.
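As one illustration (connector tooling was noted as out of scope earlier, so treat this purely as a sketch), topics can be continuously backed up to S3 with a Kafka Connect S3 sink; the configuration below assumes the Confluent S3 sink connector plugin is installed, and the bucket, region, and topic pattern are placeholders:

name=kafka-backup-s3
connector.class=io.confluent.connect.s3.S3SinkConnector
topics.regex=.*
s3.bucket.name=kafka-dr-backup
s3.region=eu-west-2
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=10000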

Figure 7: Backup and restore architectural pattern

Challenges and Complexities in Kafka Backup and Restore Setup

Recovery time impact is a major consideration: restoring (aka replaying) large volumes of data to Kafka can take significant time, resulting in a higher RTO. This makes it suitable only for applications that can tolerate longer downtimes.

Data consistency challenges are also an issue. When restoring data from backup storage like Amazon S3, maintaining the exact message ordering and partitioning as the original cluster can be challenging. Special attention is needed to ensure data consistency during restoration.

Backup performance overhead is a further consideration. Regular backups to external storage can impact the performance of the production cluster. The backup process needs careful scheduling and resource management to minimize impact on production workloads.

Storage is key. Managing large volumes of backup data requires effective storage lifecycle policies and cost controls. Regular cleanup of old backups and efficient storage utilization strategies are essential.

Coordination during recovery is mandatory. The recovery process requires careful coordination between backing up offset information and actual messages. Missing either can lead to message loss or duplication during restoration.

Conclusion on Kafka DR Strategies

As we explored earlier, choosing a DR strategy isn’t one-size-fits-all; it comes with trade-offs. It often boils down to finding the right balance between what the business needs and what it can afford. The decision usually depends on several factors.

SLAs (e.g., 99.99%+ uptime) are pre-defined commitments to uptime and availability. An SLA requiring 99.99% uptime allows only 52.56 minutes of downtime per year.
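As a quick sanity check on that number: a year has roughly 365 days × 24 hours × 60 minutes = 525,600 minutes, and 99.99% availability allows 0.01% of that as downtime, i.e. 525,600 × 0.0001 ≈ 52.56 minutes per year.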

The RTO metric is used to define the maximum acceptable downtime for a system or application after a failure before it causes unacceptable business harm. Knowledge/content/training systems, for example, can expect one hour or more downtime because they are not mission critical. For banking applications, on the other hand, downtime acceptance could be none or a couple of minutes.

The RPO metric is used to define the maximum acceptable amount of data loss during a disruptive event, measured in time from the most recent backup. For example, an observability/monitoring application may accept losing the past hour of data, whereas a financial application cannot tolerate losing a single transaction record.

Associated costs vary. Higher availability and faster recovery typically come at a higher price. Organizations must weigh their need for uptime and data protection against the financial investment required.

  • Active-Active: RTO near-zero (seconds); RPO near-zero (seconds); use case: mission-critical apps (e.g., banking, e-commerce, stock trading); cost: high
  • Active-Standby: RTO minutes to hours; RPO near-zero to minutes; use case: enterprise apps needing high availability; cost: medium to high (no dual-write ingestion pipeline cost)
  • Backup and Restore: RTO hours to days; RPO hours to days; use case: non-critical apps with occasional DR needs or cost-sensitive apps; cost: low
