KIP-405: Tiered Storage

🍫 in which Kafka goes on a diet to lose a lot of weight (around 10TB+)

If you’re using Kafka to its full extent, you’re storing a lot of historical data in it.

But. This can be a bottleneck.

A key limitation in Kafka’s design is that it couples its storage with its compute.

The Problems

⛔️ Scale Up/Down

If you want to reassign partitions, you have to move all the data associated with them.

A simple 10TB disk on a broker can take 27.7 hours to move its data at a decent 100MB/s replication rate. You cannot react to any workload change in a timely manner with that.
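The 27.7-hour figure is simple arithmetic - here is the back-of-envelope calculation so you can plug in your own disk size and replication rate:

```python
# Back-of-envelope: time to move a full broker's data during
# partition reassignment. Numbers match the example above.
disk_tb = 10                      # broker disk size, decimal TB
replication_mb_per_s = 100        # sustained replication throughput

total_mb = disk_tb * 1_000_000    # 10 TB = 10,000,000 MB
seconds = total_mb / replication_mb_per_s
hours = seconds / 3600
print(f"{hours:.1f} hours")       # prints 27.8 hours (~27.7h above)
```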

⛔️ Disk Loss

If your broker dies and the disk is wiped out, it has to start from scratch with an empty disk and re-replicate all 10TB of data. Same problem.

There is room to improve this by 120x - turn a 4-hour recovery time down to 2 minutes. 👇

⛔️ Hard Restarts

When the broker starts up from an ungraceful shutdown, it has to rebuild all the local log index files associated with its partitions in a process called log recovery. With a 10TB disk, this can take many hours or even days.

⛔️ Competition for IOPS

HDD capacity has improved exponentially since their inception, but other characteristics have not kept pace. IOPS have been stuck at roughly 120 for the last two decades.

When consumers try to read historical data, Kafka is forced to read from the disk (as the data isn’t in page cache) and that uses up the precious IOPS on the HDD.

Tests from KIP-405 showed a 43% producer performance decrease when historical consumers were present.

⛔️ Tail Latency

Similarly, HDD latency has not kept pace with capacity improvements at all.

Capacity for HDDs has increased around 48,000 times faster than latency, which means that:

On a per-byte basis, HDDs are becoming slower.

Due to their nature, HDDs are more susceptible to higher outlier latencies than SSDs. In an increasingly latency-sensitive world, this is unacceptable.

KIP-405: Tiered Storage

Tiered storage is the simple idea of storing most of the broker’s data in an external store - e.g. S3.

KIP-405 introduces this by adding a pluggable external store.

Pluggable is a key word here, as it will enable the open-source community to develop different implementations for different external stores in parallel.
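The actual plug-in point in the KIP is a Java interface (the KIP names it RemoteStorageManager, alongside a RemoteLogMetadataManager for tracking tiered segments). As a rough, Python-flavored sketch of the shape of that contract - method names simplified, not the real API:

```python
from abc import ABC, abstractmethod

class RemoteStorageManager(ABC):
    """Illustrative sketch of KIP-405's pluggable store contract.
    The real interface is Java; names here are simplified."""

    @abstractmethod
    def copy_log_segment(self, segment_metadata, segment_data) -> None:
        """Upload a closed log segment (plus indexes) to the remote tier."""

    @abstractmethod
    def fetch_log_segment(self, segment_metadata, start_pos: int) -> bytes:
        """Read segment bytes back to serve historical consumer fetches."""

    @abstractmethod
    def delete_log_segment(self, segment_metadata) -> None:
        """Remove a segment once remote retention expires."""

# An S3-backed implementation would plug in here; GCS, HDFS, etc.
# can be developed in parallel - which is the point of "pluggable".
class S3RemoteStorageManager(RemoteStorageManager):
    def copy_log_segment(self, segment_metadata, segment_data) -> None: ...
    def fetch_log_segment(self, segment_metadata, start_pos: int) -> bytes:
        return b""  # stub: would stream bytes from the S3 object
    def delete_log_segment(self, segment_metadata) -> None: ...
```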

With this, Kafka will have two tiers of storage:

  • local (hot)

  • remote (cold)

This will be abstracted away seamlessly - clients will not be able to tell where they’re fetching data from.

Tiered Storage visualization

Leader brokers are responsible for tiering data to the remote store (persisting it there).

Both leaders/followers can read from the remote store to serve historical data to consumers.

You will be able to enable tiered storage uniquely per topic, with varying local and remote retention settings.
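Per the KIP, this is driven by topic-level configs: remote.storage.enable turns tiering on, local.retention.ms/local.retention.bytes bound the hot tier, and the usual retention.ms/retention.bytes bound total (local + remote) retention. Roughly, for one topic (exact names and defaults may still shift before release):

```properties
# Topic-level settings (names per KIP-405; subject to change pre-release)
remote.storage.enable=true
retention.ms=604800000        # total retention: 7 days (local + remote)
local.retention.ms=3600000    # keep only the last hour on local disk
```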

Tests showed a minimal latency impact - 21ms → 25ms of p99 produce latency.

When will it release?

It just missed the 3.5 release and is currently slated for Early Access in 3.6 - so around September 2023.


Stan: I am feeling the constraints of the 2-minutes! Almost wish I had named this 3 Minute Streaming…

Here is the link to the KIP if you’d like to learn more:

The testing results were interesting, but we did not have space to cover them.

More Kafka? 🔥

I posted two very interesting things on social media today.

S3 Deep Dive

A 34-page long slide deck with a deep dive into S3. While this isn’t related to Kafka, the content was so interesting and groundbreaking (to me) that I had to share it. It is essentially the story of leveraging a massively multi-tenant cloud native design at extreme scale to offer something that would be impossible otherwise. See it here:

Top 3 Kafka Metrics

If you operate Kafka in any way in production, you will be interested in this. 🔥

I share the top 3 Kafka metrics you need to monitor, plus some gifts 🎁


To keep it on topic with tiered storage, I posted a few words about all the cost-benefits Kafka can enjoy from tiered storage by moving to SSDs:

People are telling me that my meme game is only getting better 😁

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.