How Slack uses Apache Kafka

🐆 a million messages per second ... and I thought I sent a lot of messages

Slack 🤝 Kafka

Slack started their Kafka journey with… Redis.

They initially used Redis as a queue for async processing of any tasks that were too slow for a web request - e.g. unfurling links, notifications, search index updates, etc.

High Level Redis Architecture
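The queue pattern described above maps onto Redis list commands (LPUSH to enqueue, RPOP/BRPOP to dequeue). Here is a minimal in-memory sketch of that pattern - the class below is a stand-in for a Redis client, not Slack's actual code:

```python
from collections import deque

# In-memory stand-in for the two Redis list commands a job queue
# typically uses: LPUSH (producer enqueues) and RPOP (worker dequeues).
# Illustrative only - not Slack's implementation.
class FakeRedisQueue:
    def __init__(self):
        self._items = deque()

    def lpush(self, job: str) -> None:
        # The web app pushes slow work (unfurls, notifications, ...)
        # onto the head of the list instead of doing it in-request.
        self._items.appendleft(job)

    def rpop(self):
        # Workers pop from the tail, so jobs come out in FIFO order.
        return self._items.pop() if self._items else None

queue = FakeRedisQueue()
queue.lpush("unfurl:https://example.com")
queue.lpush("notify:user42")

print(queue.rpop())  # oldest job first: unfurl:https://example.com
```

The failure mode from the 2016 incident follows directly from this design: enqueueing always allocates memory, so once Redis hits its `maxmemory` limit, LPUSH starts failing and new jobs are dropped.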

Then one day in 2016:

They had a big incident.

  • A database slowdown led to a:

    • job execution slowdown which led to:

      • Redis hitting its memory limit.

When Redis exhausts memory, it doesn’t allow you to enqueue new jobs. Slack lost data during this incident (the jobs they couldn’t enqueue).

And ultimately - Redis got completely stuck. It turns out dequeueing something from Redis requires a tiny amount of memory too. 😬

They solved this by migrating to Kafka. Incrementally.

They first placed Kafka in front of Redis to act as a durable store.

kafkagate, an HTTP proxy, was written & placed in front of Kafka for their PHP web apps to interface with.

Kafka + Redis Intermediate Architecture
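kafkagate's internals aren't published, but the general shape of such a proxy is simple: accept an HTTP POST from the PHP web app and forward the payload to a Kafka topic. A hedged sketch, with a stub standing in for a real Kafka producer (all names here are illustrative):

```python
import json

# Stub producer - a real deployment would call a Kafka client's
# produce/send here. It just records what would have been sent.
class StubProducer:
    def __init__(self):
        self.sent = []

    def produce(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))

def handle_post(producer: StubProducer, body: bytes) -> int:
    """Parse a request body like {"topic": ..., "job": ...} and
    forward the job to Kafka. Returns an HTTP status code."""
    try:
        req = json.loads(body)
        producer.produce(req["topic"], json.dumps(req["job"]).encode())
        return 200
    except (ValueError, KeyError):
        return 400  # malformed body or missing fields

producer = StubProducer()
status = handle_post(producer, b'{"topic": "jobs", "job": {"type": "unfurl"}}')
print(status, producer.sent)
```

The value of the proxy is that the PHP apps only need to speak HTTP - no Kafka client library, broker discovery, or batching logic on the web tier.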

Kafka 🤝 Slack’s Data Warehouse

In 2017, Slack shared that Kafka is also used to collect data (logs, jobs, etc.) & push it to their data warehouse in S3.

They used Pinterest’s Secor library as the service that persists Kafka messages to S3. (really a sink connector equivalent)
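A Secor-style sink roughly works like this: buffer consumed messages per topic-partition, then flush each batch as a single S3 object whose key encodes the first offset, so a retried upload overwrites the same object instead of duplicating data. A minimal sketch with a dict standing in for S3 (names and batch policy are assumptions, not Secor's actual implementation):

```python
# Toy Kafka->S3 sink. self.bucket is a dict standing in for S3;
# a real sink would call an S3 client's put_object equivalent.
class S3Sink:
    def __init__(self, bucket: dict, batch_size: int = 3):
        self.bucket = bucket
        self.batch_size = batch_size
        self.buffers = {}  # (topic, partition) -> [(offset, value), ...]

    def consume(self, topic, partition, offset, value):
        buf = self.buffers.setdefault((topic, partition), [])
        buf.append((offset, value))
        if len(buf) >= self.batch_size:
            self.flush(topic, partition)

    def flush(self, topic, partition):
        buf = self.buffers.pop((topic, partition), [])
        if not buf:
            return
        # Key derived from the first offset in the batch -> idempotent
        # on retry: re-uploading the same batch hits the same key.
        key = f"{topic}/{partition}/{buf[0][0]:020d}"
        self.bucket[key] = b"\n".join(v for _, v in buf)

bucket = {}
sink = S3Sink(bucket, batch_size=2)
sink.consume("logs", 0, 100, b"line-a")
sink.consume("logs", 0, 101, b"line-b")  # batch full -> flushed to "S3"
print(sorted(bucket))
```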

Kafka 🤝 Slack’s Distributed Tracing Events

Another use case they have is shepherding distributed tracing events into the appropriate data stores for visualization purposes.

This is at the following scale:

  • 310M traces a day (3587/s)

  • 8.5B spans a day (98.3k/s)
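The daily and per-second figures are easy to sanity-check against each other - at 98.3k spans a second, the daily span count works out to roughly 8.5 billion:

```python
# Sanity check on the tracing figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

traces_per_sec = 310_000_000 / SECONDS_PER_DAY    # 310M traces/day
spans_per_sec = 8_500_000_000 / SECONDS_PER_DAY   # 8.5B spans/day

print(int(traces_per_sec))  # 3587 - matches the quoted 3587/s
print(int(spans_per_sec))   # 98379 - i.e. ~98.3k/s
```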

The Latest Stats ✨ 

Slack continued to grow their Kafka usage across the org - with different teams adopting Kafka in their own setups. This eventually led to a fragmentation of versions & duplicate effort in managing Kafka.

Year by year, Kafka became an increasingly-central nervous system at their company, moving mission-critical data.

In 2022, it powered:

  • logging pipelines

  • trace data

  • billing

  • enterprise analytics

  • security analytics

The latest numbers I could find were the following:

  • 6.5 Gbps

  • 1,000,000s of messages a second

  • 700TB of data (0.7PB)

  • 10 Kafka clusters

  • 100s of nodes

Slack’s Kafka stats

Managing 10 clusters at this scale required some work - they invested in automating many processes:

  • topic/partition creation & reassignment

  • capacity planning

  • adding/removing brokers

  • replacing nodes / upgrading versions

  • observability

  • reducing general ops toil

And Slack’s Kafka usage is only growing.

They formed a new Data Streaming team to handle all current and future Kafka use cases.

Immediate future plans include a new Change Data Capture (CDC) project. It will support the caching needs of Slack's Permission Service (used to authorize actions in Slack) and enable near-real-time updates to their Data Warehouse.
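Slack's CDC design isn't public yet, but the usual pattern is: database change events flow through Kafka, and a consumer evicts (or refreshes) the cache entries for the rows that changed, keeping the cache near real-time without polling. A sketch under assumed event shapes - the `op`/`row_id` fields and key format are illustrative, not Slack's schema:

```python
# Toy cache keyed by permission row. A CDC consumer would read change
# events off a Kafka topic and call apply_change_event for each one.
cache = {"perm:user42": "can_post"}

def apply_change_event(cache: dict, event: dict) -> None:
    key = f"perm:{event['row_id']}"
    if event["op"] == "delete":
        cache.pop(key, None)
    else:
        # insert/update: evict the stale entry; the next permission
        # check misses and repopulates it from the source of truth.
        cache.pop(key, None)

apply_change_event(cache, {"op": "update", "row_id": "user42"})
print("perm:user42" in cache)  # stale entry was evicted
```

Eviction-on-change (rather than writing the new value into the cache) is the simpler variant: it avoids ordering races between the CDC stream and direct reads, at the cost of one extra cache miss per change.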

Have any questions for Slack?

We managed to find that platform team’s manager - Ryan - on Twitter and he agreed to an AMA👇 (thanks Ryan!)

Stan: This is how Kafka seems to naturally proliferate inside companies. Starts with one thing, and then teams just continue to adopt and adopt as its network effect of expertise & experience grows inside the company.

I’m curious to hear how adoption is going in your company. Please reply to this e-mail if you feel like sharing! We will make sure to highlight it in the next issue 🙂 

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

More Kafka? 🔥

What else did we post on socials this week?

Let’s see:

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.