KIP-848: The New Kafka Consumer Group Protocol

a 2-min intro to the complete re-design of consumer groups in Kafka

Ah, Consumer Groups…

Many changes are coming to them.
(…and I’m not talking about KIP-932 and its Queue / Share Consumer Groups either…)

First, in case you don’t know the basics 👉

Second, introducing…

✨ KIP-848: The Next Generation of the Consumer Group Protocol ✨

A complete re-design, the newly-proposed protocol fixes many of the problems in the current version of consumer groups:

  • ⛔️ scalability cap

  • 🤯 complexity

  • 🐛 bugs

  • 🐌 slow fixes

  • 🔍 hard to debug

  • ⚙️ too extendable

  • 😵‍💫 interoperability

  • 😢 inconsistent metadata

Curious about more details regarding the problems? See here:

And it fixes them with these four pillars:

  • 1. Logic: Client Side → Broker Side

    • no more group leader, assignment is done in the broker.

  • 2. Rebalance Disruption: Stop the World → Incremental

    • rebalances become incremental and much finer-grained.

      • when a new member joins, it doesn’t need to wait for all N members (the ones the group leader used to pick) to revoke their partitions.

        • it can start receiving partitions as soon as the first (fastest) member revokes.

When you think about it, a rebalance is simply a reassignment of some partitions from some consumers to others. 💡

Why does the whole group need to stop and know about this?

It doesn’t.

  • 3. Protocol: Generic → Consumer

    • make it a consumer-optimized protocol.

  • 4. Model: Pull-based → Push-Based

    • the broker now pushes new partition assignments to the consumers, instead of the consumers pulling them from the broker (which, in turn, used to get them from the group leader).
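
To make pillar 2 concrete, here is a minimal Python sketch (not Kafka code) that treats a rebalance as a diff between two assignments. The function name `partition_moves` and the member names are made up for illustration; the only point is that members whose partitions don’t move have no reason to pause.

```python
def partition_moves(current, target):
    """Return {partition: (old_owner, new_owner)} for partitions that change owner."""
    old_owner = {p: m for m, parts in current.items() for p in parts}
    new_owner = {p: m for m, parts in target.items() for p in parts}
    return {
        p: (old_owner.get(p), new_owner[p])
        for p in sorted(new_owner)
        if old_owner.get(p) != new_owner[p]
    }


# Hypothetical group: D joins, and the new target hands it one partition
# from A and one from B.
current = {"A": {0, 1, 2}, "B": {3, 4, 5}, "C": {6, 7}}
target = {"A": {0, 1}, "B": {3, 4}, "C": {6, 7}, "D": {2, 5}}

moves = partition_moves(current, target)

# Members that appear in at least one move, and members that can keep working.
affected = {m for pair in moves.values() for m in pair if m is not None}
untouched = set(target) - affected

print(moves)      # which partitions change hands
print(untouched)  # members that never need to stop
```

In this toy example only partitions 2 and 5 change hands, so C keeps consuming throughout, and D can pick up partition 2 as soon as A lets go of it without waiting on B.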

🤔 How?

Elegantly.

Everything is streamlined into a single new Heartbeat API (ConsumerGroupHeartbeat).

The protocol maintains three types of epochs:

  • Group Epoch - tracks when a new rebalance should happen

  • Assignment Epoch - tracks the generation of the assignment

  • Member Epoch - tracks each member’s last-synced epoch

💡 A new assignment is computed every time the group epoch changes.

All three epochs should eventually converge to the same number.

The order is the following:

  1. the group-wide epoch is bumped.

  2. the target assignment epoch is bumped.

  3. each consumer individually catches its member epoch up to the assignment epoch via heartbeat requests. (fine-grained)
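
The three steps above can be sketched as a toy state machine. This is a simplified model under loud assumptions, not the real Group Coordinator: `ToyCoordinator`, its round-robin assignor, and the rule that a member revokes on one heartbeat and advances on the next are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoordinator:
    """Toy model of the three epochs; not the real Group Coordinator."""
    partitions: list
    group_epoch: int = 0
    assignment_epoch: int = 0
    target: dict = field(default_factory=dict)        # member -> target partitions
    owned: dict = field(default_factory=dict)         # member -> partitions held now
    member_epoch: dict = field(default_factory=dict)  # member -> last-synced epoch

    def join(self, member):
        self.owned[member] = set()
        self.member_epoch[member] = 0
        self.group_epoch += 1                       # 1. group epoch is bumped
        members = sorted(self.owned)
        self.target = {m: set() for m in members}
        for i, p in enumerate(self.partitions):     # naive round-robin assignor
            self.target[members[i % len(members)]].add(p)
        self.assignment_epoch = self.group_epoch    # 2. assignment epoch is bumped

    def heartbeat(self, member):
        """3. Reconcile one member; returns the partitions it may own now."""
        mine, want = self.owned[member], self.target[member]
        if mine - want:
            # Revoke first; the member stays on its old epoch until its next heartbeat.
            self.owned[member] = mine & want
            return self.owned[member]
        held_by_others = set().union(
            *(o for m, o in self.owned.items() if m != member)
        )
        # Hand out only the target partitions nobody else still holds.
        self.owned[member] = mine | (want - held_by_others)
        if self.owned[member] == want:
            self.member_epoch[member] = self.assignment_epoch
        return self.owned[member]


coord = ToyCoordinator(partitions=[0, 1, 2, 3])
coord.join("A")
coord.heartbeat("A")          # A takes everything and reaches epoch 1
coord.join("B")               # group and assignment epochs move to 2
early = coord.heartbeat("B")  # nothing is free yet, B stays behind
coord.heartbeat("A")          # A revokes the partitions it no longer owns
coord.heartbeat("B")          # B picks them up and reaches epoch 2
coord.heartbeat("A")          # A confirms and reaches epoch 2

print(coord.group_epoch, coord.assignment_epoch, coord.member_epoch)
```

Each member converges at its own pace: B’s progress depends only on A releasing the specific partitions B is waiting for, not on a group-wide barrier.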

In general, what you have is a simple state machine inside the Group Coordinator broker that’s running a constant reconciliation loop. 💥

And for those that learn better through visuals?

Your humble author has a video for you:

Other Details

  • the session timeout is now defined on the server - group.consumer.session.timeout.ms.

  • the heartbeat interval is also now defined on the server - group.consumer.heartbeat.interval.ms.

  • static membership (group.instance.id) is still supported.

  • Extended Protocol - extra APIs are added to extend the protocol dance for power-users like Kafka Streams that still want to have a client-side assignor.
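
As a rough illustration of what this means on the client side, the sketch below builds a plain dictionary of consumer properties for the new protocol. `group.protocol=consumer` is the opt-in setting in clients that implement KIP-848; the helper name, broker address, and group name are hypothetical. Rejecting a client-side session timeout or heartbeat interval is an assumption made here to mirror the "defined on the server" point, so check your client’s documentation for its actual behaviour.

```python
# Timing settings that, under the new protocol, are assumed to live on the broker.
SERVER_SIDE_ONLY = {"session.timeout.ms", "heartbeat.interval.ms"}


def new_protocol_consumer_config(bootstrap_servers, group_id, extra=None):
    """Build a consumer config dict that opts in to the new group protocol.

    Hypothetical helper: it only assembles a dictionary and never talks to Kafka.
    """
    extra = dict(extra or {})
    clashes = SERVER_SIDE_ONLY & set(extra)
    if clashes:
        raise ValueError(
            f"set these on the broker instead of the client: {sorted(clashes)}"
        )
    config = {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "group.protocol": "consumer",  # opt in to the KIP-848 protocol
    }
    config.update(extra)
    return config


# Static membership still works: pass group.instance.id as before.
config = new_protocol_consumer_config(
    "localhost:9092",           # placeholder broker address
    "orders-service",           # placeholder group name
    extra={"group.instance.id": "worker-1"},
)
print(config)
```

The resulting dictionary has the shape that dict-configured clients expect, but nothing here depends on a particular client library.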

A preview of the protocol is targeted to ship with Apache Kafka 3.7.
