Open source Apache Kafka has long been the backbone of real-time data streaming, but it’s traditionally come with a trade-off: keep expanding expensive broker storage or sacrifice historical data retention. With Kafka’s Tiered Storage, that dilemma is finally fading (if you know what you’re doing).
By offloading older data to cheaper cloud object storage while keeping recent data local for speed, Tiered Storage transforms Kafka’s storage economics and unlocks new possibilities for developers. But how does this work in practice, and what challenges should teams realistically expect when implementing it?
I recently spoke with Anil Inamdar from NetApp Instaclustr. Anil’s an expert in using 100% open source data technologies for mission-critical applications and has a lot of experience with Kafka deployments. We covered everything from cost savings to unexpected use cases that Tiered Storage makes possible. Here’s what he had to say.
Apache Kafka’s Tiered Storage has been gaining attention in the dev community. Could you explain the concept and how it changes Kafka’s traditional storage approach?
Traditionally, Kafka deployments have forced a tough choice: either keep expanding your broker storage to hold onto data longer (at higher storage cost) or accept shorter retention periods and lose that historical data. It’s a classic trade-off that’s been part of Kafka since day one.
Kafka’s Tiered Storage completely flips this model. Instead of keeping everything on expensive local disks, Kafka now separates your data into two distinct tiers. Your recent, hot data stays local for optimal performance, while historical data automatically flows to much cheaper cloud object storage like S3. If you’ve ever wrestled with retention of messages in Kafka, this can be a game-changer.
What’s great about this architecture is that it works like a write-through cache system. Data follows a predictable path: it lands on local storage first, then once segments are closed, they’re asynchronously copied to remote storage. The beauty is that consumers don’t even need to know where the data lives; reads are served from local or remote storage transparently.
Tiered Storage also opens up new use cases. Now you can keep months or even years of data accessible without breaking the bank (so particularly for organizations that need to analyze historical patterns or reprocess data from the past, this is a big deal). The cost savings can be dramatic, especially at scale, since cloud object storage is a fraction of the price of high-performance SSDs.
What’s particularly smart about how Tiered Storage was implemented is that it preserves all of Kafka’s core semantics and APIs. Your producers and consumers keep working exactly as they did before. The infrastructure changes, but your applications don’t have to.
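To make that concrete, here’s a minimal sketch (mine, not Anil’s) of opting a topic into Tiered Storage with Kafka’s Java Admin client. It assumes a broker that already has tiered storage switched on (`remote.log.storage.system.enable=true`) and a remote storage plugin configured for your object store; the bootstrap address, topic name, and retention values below are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            // Topic-level settings: keep ~1 day of data on broker disks,
            // retain 90 days overall; older segments live in remote object storage.
            Map<String, String> configs = Map.of(
                    "remote.storage.enable", "true",   // opt this topic into tiered storage
                    "local.retention.ms", "86400000",  // 1 day on local disks
                    "retention.ms", "7776000000"       // 90 days total retention
            );

            NewTopic topic = new NewTopic("clickstream-events", 12, (short) 3) // hypothetical topic
                    .configs(configs);

            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```

The pairing to understand is `local.retention.ms` versus `retention.ms`: the former bounds what stays on broker disks, while the latter bounds the total lifetime of the data across both tiers.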
What are the key technical and business drivers pushing teams toward Kafka’s Tiered Storage?
The biggest driver is simple economics. Organizations need to retain more data without proportionally increasing infrastructure costs. As data volumes explode, the traditional approach of scaling broker storage becomes financially unsustainable. Technical teams are pushing for Tiered Storage to enable longer-term analytics and compliance requirements, while CFOs appreciate decoupled compute and storage costs. The ability to leverage cheap(er) cloud storage while maintaining seamless access to historical data enables new possibilities around reprocessing, machine learning training, and regulatory compliance without sacrificing operational performance for real-time workloads.
When implementing Kafka Tiered Storage, what performance trade-offs should engineering teams be prepared for, and how can they mitigate potential bottlenecks?
First off, they should be prepared for the performance differences when reading from Tiered Storage versus local disks. Our benchmarks show local storage reads can be 2-3x faster than reading from remote storage like S3. The biggest hit comes with small segment sizes. We’ve seen up to 20x degradation there, so resist the temptation to reduce segment size without thorough testing.
To mitigate these challenges, you can increase partition counts for topics that need historical data processing. More partitions mean more concurrent consumers can read the data simultaneously, significantly boosting throughput from remote storage. Also, be strategic about your retention settings, and keep frequently accessed data local while offloading less critical data to remote storage.
Remember that Kafka producers aren’t affected since remote copying happens asynchronously, but you’ll still want to budget for about 10% additional cluster CPU and network resources to handle background tiering operations.
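To illustrate those mitigations (again a sketch with hypothetical values, using the same Admin client), you can pin segment size rather than shrinking it, trim local retention, and add partitions so more consumers can pull historical data in parallel:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.admin.NewPartitions;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TuneTieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clickstream-events");

            List<AlterConfigOp> ops = List.of(
                    // Keep the 1 GiB default segment size; small segments are where
                    // remote-read throughput degrades badly.
                    new AlterConfigOp(new ConfigEntry("segment.bytes",
                            String.valueOf(1024L * 1024 * 1024)), AlterConfigOp.OpType.SET),
                    // Keep one week of hot data local; older data is served from remote storage.
                    new AlterConfigOp(new ConfigEntry("local.retention.ms",
                            String.valueOf(7L * 24 * 60 * 60 * 1000)), AlterConfigOp.OpType.SET)
            );
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();

            // More partitions allow more consumers to read tiered data concurrently.
            // Note that partition counts can only be increased, never reduced.
            admin.createPartitions(Map.of("clickstream-events", NewPartitions.increaseTo(24))).all().get();
        }
    }
}
```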
The ability to “time travel” through data is one of Kafka’s powerful capabilities. How does Tiered Storage expand possibilities for applications that need to reprocess historical data streams?
Kafka’s time travel was always limited by storage economics—keeping months of data on local disks just wasn’t all that viable for high-volume streams. Tiered Storage changes the equation. Now you can retain years of historical data affordably, transforming reprocessing scenarios from theoretical to practical. Training new ML models on complete datasets, migrating to new sink systems, or auditing past transactions for compliance all become realistic options.
Even more powerful is how this impacts development. Found a bug in your processing logic from months ago? Just replay from that point forward. You can experiment more freely, running parallel processing pipelines against the same historical data for A/B testing. I don’t think it’s a stretch to say that Tiered Storage essentially democratizes time travel for Kafka’s users.
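For readers who haven’t done this kind of replay, here’s a hedged sketch of the consumer side. The standard `offsetsForTimes` plus `seek` pattern is all it takes, and it behaves the same whether the target offsets still sit on local disk or have already been tiered out. The topic name, group id, and timestamp are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-job");                // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long replayFrom = Instant.parse("2024-01-01T00:00:00Z").toEpochMilli(); // placeholder point in time

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("clickstream-events").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .toList();
            consumer.assign(partitions);

            // Resolve the earliest offset at or after the chosen timestamp for each partition.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, replayFrom));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            // Seek there and start reprocessing; whether those offsets are served from
            // local segments or remote object storage is invisible to this code.
            offsets.forEach((tp, target) -> {
                if (target != null) consumer.seek(tp, target.offset());
            });

            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(5));
            System.out.printf("Replayed %d records starting from %s%n", batch.count(),
                    Instant.ofEpochMilli(replayFrom));
        }
    }
}
```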
Many teams still wrestle with right-sizing their Kafka clusters. What fundamental principles should guide capacity planning when implementing Tiered Storage?
With Tiered Storage, capacity planning shifts dramatically from “how much disk do I need?” to a, well, more nuanced calculation. Start by separating your workloads. Identify your producer input rate and your consumer patterns, and then determine what portion of data needs to remain local versus remote. As mentioned, budget for some additional CPU and network overhead to handle the tiering operations.
You should focus on right-sizing local retention based on access patterns, not total data volume. The most active data should stay local, with everything else moving to cheaper remote storage. Remember that remote storage read performance depends heavily on partition count, so size your topics appropriately for the level of parallel processing you’ll need for historical data access.
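To show what that “more nuanced calculation” can look like, here’s a deliberately simplified back-of-envelope sizing sketch. The rates and retention figures are hypothetical, and the assumption that the remote tier stores a single copy per segment (relying on the object store’s own durability) should be verified against your remote storage plugin.

```java
public class TieredCapacitySketch {
    public static void main(String[] args) {
        // Hypothetical inputs; substitute your own measured rates.
        double producerMBps = 50.0;        // average ingress across the cluster
        int replicationFactor = 3;         // local copies on broker disks
        double localRetentionHours = 24;   // hot data kept local
        double totalRetentionDays = 90;    // total retention, mostly remote
        double headroom = 1.3;             // ~30% slack for spikes and rebalances

        // Local tier: every byte is replicated across brokers.
        double localGB = producerMBps * 3600 * localRetentionHours * replicationFactor * headroom / 1024;

        // Remote tier: assumed single copy per segment (check your plugin behavior).
        double remoteGB = producerMBps * 3600 * 24 * totalRetentionDays / 1024;

        System.out.printf("Local broker disk (all brokers combined): ~%.0f GB%n", localGB);
        System.out.printf("Remote object storage: ~%.0f GB%n", remoteGB);
    }
}
```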
Beyond use cases like compliance and analytics, what creative applications of Kafka Tiered Storage are you seeing engineers pursue?
Some of the most interesting Tiered Storage applications I’ve seen involve time-shifted operations. One organization created a digital twin architecture where they use Kafka as both the real-time control plane and the historical simulation environment. By keeping years of operational data accessible through Tiered Storage, they can run complex what-if scenarios against actual historical conditions instead of synthetic data.
I’ve also seen novel disaster recovery patterns emerge. Rather than maintaining hot standbys with duplicated infrastructure, companies are using Tiered Storage as a much cheaper recovery mechanism. When needed, they can rapidly spin up new Kafka clusters and reload just the relevant historical data from remote storage.
Another interesting one involves, essentially, business time machines that let teams roll back entire application states to specific points in the past. By combining event sourcing with Tiered Storage, these systems can recreate any past state without the prohibitive costs that previously made such capabilities impractical for all but the most critical systems.
As data streaming technologies continue to evolve, what innovations do you anticipate seeing in Kafka’s architecture over the next few years?
I think we’re headed toward somewhat of an infrastructure disappearance with Kafka. The next evolution of the open source project won’t necessarily be about adding more features, but more about making the infrastructure fade into the background so developers can focus purely on data and business logic.
We’re already seeing this start with Tiered Storage separating storage concerns from processing. The logical next step is more granular, serverless-like compute that dynamically scales with workload demands. Imagine Kafka clusters that automatically expand and contract based on actual throughput needs without manual intervention.
I also expect we’ll see Kafka evolve beyond the traditional producer-consumer model toward something more like a universal data fabric. The boundaries between streaming, database, and analytics platforms are blurring. Upcoming Kafka architectures will likely incorporate more database-like capabilities—transactions, advanced querying, and integration with compute frameworks—while maintaining Kafka’s core streaming identity.
I’d also keep an eye on self-optimizing systems. As streaming data volumes continue to grow exponentially, manual tuning becomes impossible. We’ll need Kafka systems that can automatically determine optimal partition counts, retention policies, and resource allocation based on observed access patterns and workload characteristics. The Kafka of tomorrow won’t just be a better message broker—it’ll be the intelligent backbone of truly data-driven organizations.