How Slack uses Apache Kafka

🐆 a million messages per second ... and I thought I sent a lot of messages

Slack 🤝 Kafka

Slack started their Kafka journey with… Redis.

They initially used Redis as a queue for async processing of any tasks that were too slow for a web request - e.g. unfurling links, notifications, search index updates, etc.

High Level Redis Architecture
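The queue pattern described above maps onto Redis list commands (LPUSH to enqueue, RPOP/BRPOP to dequeue). Here is a minimal in-memory sketch of that pattern - the class below is a stand-in for a Redis client, not Slack's actual code:

```python
from collections import deque

# In-memory stand-in for the two Redis list commands a job queue
# typically uses: LPUSH (producer enqueues) and RPOP (worker dequeues).
# Illustrative only - not Slack's implementation.
class FakeRedisQueue:
    def __init__(self):
        self._items = deque()

    def lpush(self, job: str) -> None:
        # The web app pushes slow work (unfurls, notifications, ...)
        # onto the head of the list instead of doing it in-request.
        self._items.appendleft(job)

    def rpop(self):
        # Workers pop from the tail, so jobs come out in FIFO order.
        return self._items.pop() if self._items else None

queue = FakeRedisQueue()
queue.lpush("unfurl:https://example.com")
queue.lpush("notify:user42")

print(queue.rpop())  # oldest job first: unfurl:https://example.com
```

The failure mode from the 2016 incident follows directly from this design: enqueueing always allocates memory, so once Redis hits its `maxmemory` limit, LPUSH starts failing and new jobs are dropped.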

Then one day in 2016:

They had a big incident.

  • A database slowdown led to a:

    • job execution slowdown which led to:

      • Redis hitting its memory limit.

When Redis exhausts memory, it doesn’t allow you to enqueue new jobs. Slack lost data during this incident (the jobs they couldn’t enqueue).

And ultimately - Redis got completely stuck. It turns out dequeueing something from Redis requires a tiny amount of memory too. 😬

They solved this by migrating to Kafka. Incrementally.

They first placed Kafka in front of Redis to act as a durable store.

kafkagate, an HTTP proxy, was written & placed in front of Kafka for their PHP web apps to interface with.

Kafka + Redis Intermediate Architecture
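kafkagate's internals aren't published, but the general shape of such a proxy is simple: accept an HTTP POST from the PHP web app and forward the payload to a Kafka topic. A hedged sketch, with a stub standing in for a real Kafka producer (all names here are illustrative):

```python
import json

# Stub producer - a real deployment would call a Kafka client's
# produce/send here. It just records what would have been sent.
class StubProducer:
    def __init__(self):
        self.sent = []

    def produce(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))

def handle_post(producer: StubProducer, body: bytes) -> int:
    """Parse a request body like {"topic": ..., "job": ...} and
    forward the job to Kafka. Returns an HTTP status code."""
    try:
        req = json.loads(body)
        producer.produce(req["topic"], json.dumps(req["job"]).encode())
        return 200
    except (ValueError, KeyError):
        return 400  # malformed body or missing fields

producer = StubProducer()
status = handle_post(producer, b'{"topic": "jobs", "job": {"type": "unfurl"}}')
print(status, producer.sent)
```

The value of the proxy is that the PHP apps only need to speak HTTP - no Kafka client library, broker discovery, or batching logic on the web tier.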

Kafka 🤝 Slack’s Data Warehouse

In 2017, Slack shared that Kafka is also used to collect data (logs, jobs, etc.) & push it to their data warehouse in S3.

They used Pinterest’s Secor library as the service that persists Kafka messages to S3. (really a sink connector equivalent)
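A Secor-style sink roughly works like this: buffer consumed messages per topic-partition, then flush each batch as a single S3 object whose key encodes the first offset, so a retried upload overwrites the same object instead of duplicating data. A minimal sketch with a dict standing in for S3 (names and batch policy are assumptions, not Secor's actual implementation):

```python
# Toy Kafka->S3 sink. self.bucket is a dict standing in for S3;
# a real sink would call an S3 client's put_object equivalent.
class S3Sink:
    def __init__(self, bucket: dict, batch_size: int = 3):
        self.bucket = bucket
        self.batch_size = batch_size
        self.buffers = {}  # (topic, partition) -> [(offset, value), ...]

    def consume(self, topic, partition, offset, value):
        buf = self.buffers.setdefault((topic, partition), [])
        buf.append((offset, value))
        if len(buf) >= self.batch_size:
            self.flush(topic, partition)

    def flush(self, topic, partition):
        buf = self.buffers.pop((topic, partition), [])
        if not buf:
            return
        # Key derived from the first offset in the batch -> idempotent
        # on retry: re-uploading the same batch hits the same key.
        key = f"{topic}/{partition}/{buf[0][0]:020d}"
        self.bucket[key] = b"\n".join(v for _, v in buf)

bucket = {}
sink = S3Sink(bucket, batch_size=2)
sink.consume("logs", 0, 100, b"line-a")
sink.consume("logs", 0, 101, b"line-b")  # batch full -> flushed to "S3"
print(sorted(bucket))
```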

Kafka 🤝 Slack’s Distributed Tracing Events

Another use case they have is shepherding distributed tracing events into the appropriate data stores for visualization purposes.

This is at the following scale:

  • 310M traces a day (3587/s)

  • 8.5B spans a day (98.3k/s)
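The daily and per-second figures are easy to sanity-check against each other - at 98.3k spans a second, the daily span count works out to roughly 8.5 billion:

```python
# Sanity check on the tracing figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

traces_per_sec = 310_000_000 / SECONDS_PER_DAY    # 310M traces/day
spans_per_sec = 8_500_000_000 / SECONDS_PER_DAY   # 8.5B spans/day

print(int(traces_per_sec))  # 3587 - matches the quoted 3587/s
print(int(spans_per_sec))   # 98379 - i.e. ~98.3k/s
```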

The Latest Stats ✨ 

Slack continued to grow their Kafka usage across the org - with different teams adopting Kafka in their own setups. This eventually led to a fragmentation of versions & duplicate effort in managing Kafka.

Year by year, Kafka became an increasingly-central nervous system at their company, moving mission-critical data.

In 2022, it powered:

  • logging pipelines

  • trace data

  • billing

  • enterprise analytics

  • security analytics

The latest numbers I could find were the following:

  • 6.5 Gbps

  • 1,000,000s of messages a second

  • 700TB of data (0.7PB)

  • 10 Kafka clusters

  • 100s of nodes

Slack’s Kafka stats

Managing 10 clusters at this scale required some work - they invested in automating many processes:

  • topic/partition creation & reassignment

  • capacity planning

  • adding/removing brokers

  • replacing nodes / upgrading versions

  • observability

  • reducing general ops toil

And Slack’s Kafka usage is only growing.

They formed a new Data Streaming team to handle all current and future Kafka use cases.

Immediate future plans include a new Change Data Capture (CDC) project. It will support the caching needs of Slack's Permission Service (used to authorize actions in Slack) and enable near-real-time updates to their Data Warehouse.
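Slack's CDC design isn't public yet, but the usual pattern is: database change events flow through Kafka, and a consumer evicts (or refreshes) the cache entries for the rows that changed, keeping the cache near real-time without polling. A sketch under assumed event shapes - the `op`/`row_id` fields and key format are illustrative, not Slack's schema:

```python
# Toy cache keyed by permission row. A CDC consumer would read change
# events off a Kafka topic and call apply_change_event for each one.
cache = {"perm:user42": "can_post"}

def apply_change_event(cache: dict, event: dict) -> None:
    key = f"perm:{event['row_id']}"
    if event["op"] == "delete":
        cache.pop(key, None)
    else:
        # insert/update: evict the stale entry; the next permission
        # check misses and repopulates it from the source of truth.
        cache.pop(key, None)

apply_change_event(cache, {"op": "update", "row_id": "user42"})
print("perm:user42" in cache)  # stale entry was evicted
```

Eviction-on-change (rather than writing the new value into the cache) is the simpler variant: it avoids ordering races between the CDC stream and direct reads, at the cost of one extra cache miss per change.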

Have any questions for Slack?

We managed to find that platform team’s manager - Ryan - on Twitter and he agreed to an AMA👇 (thanks Ryan!)

Stan: This is how Kafka seems to naturally proliferate inside companies. Starts with one thing, and then teams just continue to adopt and adopt as its network effect of expertise & experience grows inside the company.

I’m curious to hear how adoption is going in your company. Please reply to this e-mail if you feel like sharing! We will make sure to highlight it in the next issue 🙂 

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

More Kafka? 🔥

What else did we post on socials this week?

Let’s see:

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.