How Slack uses Apache Kafka

šŸ† a million of messages per second ... and I thought I sent a lot of messages

Slack 🤝 Kafka

Slack started their Kafka journey with… Redis.

They initially used Redis as a queue for async processing of any tasks that were too slow for a web request - e.g. unfurling links, sending notifications, updating search indexes, etc.
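The pattern here is the classic Redis list-as-queue: the web request pushes a serialized job and returns immediately, while background workers pop jobs and execute them. A minimal sketch, using an in-memory deque in place of a real Redis connection (the equivalent redis-py calls are noted in comments; the queue name and job shape are illustrative, not Slack's):

```python
# Sketch of the Redis list-as-queue pattern, with collections.deque
# standing in for Redis so it runs without a server.
# With redis-py this would be r.lpush("jobs", ...) / r.brpop("jobs").
import json
from collections import deque

queue = deque()  # stands in for a Redis list, e.g. key "jobs"

def enqueue_job(job_type, payload):
    # Web-request path: serialize, push, return immediately.
    queue.appendleft(json.dumps({"type": job_type, "payload": payload}))

def work_one():
    # Worker path: in real Redis this is a blocking pop (BRPOP).
    job = json.loads(queue.pop())
    # ... dispatch to the slow task: unfurl a link, send a notification, etc.
    return job

enqueue_job("unfurl", {"url": "https://example.com"})
job = work_one()
```

The appeal is exactly what Slack describes: the web request stays fast, and the slow work happens elsewhere.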

High Level Redis Architecture

Then, one day in 2016…

They had a big incident.

  • A database slowdown led to:

    • a job-execution slowdown, which led to:

      • Redis hitting its memory limit.

When Redis exhausts memory, it doesn't allow you to enqueue new jobs. Slack lost data during this incident (the jobs they couldn't enqueue).

And ultimately - Redis got completely stuck. It turns out dequeueing something from Redis requires a tiny amount of memory too. 😬
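A toy model of that failure mode, with purely illustrative numbers: once the memory cap is hit, new jobs are rejected (that's the data loss), and if even a dequeue needs a little headroom, the queue can no longer drain itself:

```python
# Toy model of the incident: a memory-capped store where enqueue is
# rejected at the limit, and dequeue itself needs a sliver of headroom
# (as Slack found with Redis). Sizes are illustrative only.
class CappedQueue:
    def __init__(self, capacity, dequeue_overhead=1):
        self.capacity = capacity
        self.overhead = dequeue_overhead
        self.items = []
        self.used = 0

    def enqueue(self, item, size):
        if self.used + size > self.capacity:
            raise MemoryError("OOM: job dropped")  # data loss
        self.items.append((item, size))
        self.used += size

    def dequeue(self):
        # Even removing an item briefly allocates a little memory.
        if self.used + self.overhead > self.capacity:
            raise MemoryError("OOM: cannot even dequeue")  # fully stuck
        item, size = self.items.pop(0)
        self.used -= size
        return item

q = CappedQueue(capacity=10)
for i in range(10):
    q.enqueue(f"job-{i}", size=1)  # fill exactly to the limit
# From here, new jobs are dropped AND the queue cannot drain itself.
```

This is the deadlock in miniature: full means both un-fillable and un-drainable.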

They solved this by migrating to Kafka. Incrementally.

They first placed Kafka in front of Redis to act as a durable store.

They wrote kafkagate, an HTTP proxy placed in front of Kafka, so that their PHP web apps could interface with it.
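kafkagate itself isn't open source, but the shape of such a proxy is roughly: accept an HTTP POST, map the request path to a topic, and hand the payload to a Kafka producer. A minimal hypothetical sketch, with a list standing in for the producer so it runs self-contained (route scheme and names are invented):

```python
# Hypothetical sketch of a kafkagate-style HTTP -> Kafka proxy.
# A real version would call a Kafka producer's send(); here a list
# records what would have been published.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PUBLISHED = []  # stand-in for producer.send(topic, value=...) calls

class KafkagateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Invented route: POST /topics/<topic> with a JSON body.
        topic = self.path.rstrip("/").split("/")[-1]
        body = self.rfile.read(int(self.headers["Content-Length"]))
        PUBLISHED.append((topic, json.loads(body)))
        self.send_response(204)  # accepted, no body
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet

def start_proxy(port=0):
    # port=0 lets the OS pick a free port.
    server = HTTPServer(("127.0.0.1", port), KafkagateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The point of the proxy is that a short-lived PHP request can fire one cheap HTTP call instead of managing a Kafka client connection itself.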

Kafka + Redis Intermediate Architecture

Kafka 🤝 Slack's Data Warehouse

In 2017, Slack shared that Kafka is also used to collect data (logs, jobs, etc.) & push it to their data warehouse in S3.

They used Pinterest's Secor library as the service that persists Kafka messages to S3 (effectively a sink-connector equivalent).
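Secor is a Java service, but the sink pattern it implements - consume, buffer into batches, persist each batch as an S3 object keyed by topic/partition/offset - can be sketched in a few lines. Everything here (key layout, names, the dict standing in for S3) is illustrative, not Secor's exact scheme:

```python
# Sketch of the Secor-style sink pattern: buffer consumed messages and
# flush fixed-size batches to S3 objects keyed by topic/partition/offset.
# A dict stands in for the S3 bucket; the key layout is invented.
fake_s3 = {}  # bucket contents: key -> bytes

def flush_batch(topic, partition, start_offset, messages):
    # Offset-based keys make uploads idempotent on consumer restart.
    key = f"warehouse/{topic}/{partition}/{start_offset:020d}.json"
    fake_s3[key] = "\n".join(messages).encode()
    return key

def sink(messages, topic="job-logs", partition=0, batch_size=3):
    # Consume loop: buffer until batch_size, then persist and "commit".
    keys, buf, start = [], [], 0
    for offset, msg in enumerate(messages):
        if not buf:
            start = offset
        buf.append(msg)
        if len(buf) == batch_size:
            keys.append(flush_batch(topic, partition, start, buf))
            buf = []
    if buf:  # final partial batch
        keys.append(flush_batch(topic, partition, start, buf))
    return keys

keys = sink([f'{{"job": {i}}}' for i in range(7)])
```

Downstream, the warehouse just lists and loads these objects - which is why this role is now usually filled by a Kafka Connect S3 sink connector.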

Kafka 🤝 Slack's Distributed Tracing Events

Another use case they have is shepherding distributed tracing events into the appropriate data stores for visualization purposes.

This is at the following scale:

  • 310M traces a day (3587/s)

  • 8.5B spans a day (98.3k/s)
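The per-second figures follow from dividing the daily totals by the 86,400 seconds in a day - and the arithmetic also shows the spans figure only adds up if it's in billions:

```python
# Sanity-checking the per-second rates from the daily totals.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

traces_per_sec = 310_000_000 / SECONDS_PER_DAY    # 310M traces/day
spans_per_sec = 8_500_000_000 / SECONDS_PER_DAY   # 8.5B spans/day

# ~3,588/s (the article truncates to 3587/s) and ~98.4k/s (~98.3k/s).
```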

The Latest Stats ✨

Slack continued to grow their Kafka usage across the org - with different teams adopting Kafka in their own setups. This eventually led to a fragmentation of versions & duplicate effort in managing Kafka.

Year by year, Kafka became an increasingly central nervous system at the company, moving mission-critical data.

In 2022, it powered:

  • logging pipelines

  • trace data

  • billing

  • enterprise analytics

  • security analytics

The latest numbers I could find were the following:


  • 6.5 Gbps of throughput

  • 1,000,000s of messages a second

  • 700 TB of data (0.7 PB)

  • 10 Kafka clusters

  • 100s of nodes

Slack's Kafka stats

Managing 10 clusters at this scale required some work - they invested in automating many processes:

  • topic/partition creation & reassignment

  • capacity planning

  • adding/removing brokers

  • replacing nodes / upgrading versions

  • observability

  • reducing general ops toil

And Slackā€™s Kafka usage is only growing.

They formed a new Data Streaming team to handle all current and future Kafka use cases.

Immediate future plans include a new Change Data Capture (CDC) project. It will support the caching needs of Slack's Permission Service, which is used to authorize actions in Slack, and enable near-real-time updates to their Data Warehouse.
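The cache-maintenance half of that CDC idea can be sketched simply: database change events flow through Kafka, and a consumer applies each one to the cache so reads stay fresh without polling the database. This is a hypothetical illustration - event shape, field names, and cache keys are invented, not Slack's:

```python
# Hypothetical sketch of CDC-driven cache maintenance: apply a stream
# of database change events (insert/update/delete) to a permissions
# cache. All names and shapes are invented for illustration.
cache = {"user:42": {"can_post": True}}  # permissions cache

def apply_change_event(event):
    """Apply one CDC event to the cache."""
    key = f"user:{event['user_id']}"
    if event["op"] == "delete":
        cache.pop(key, None)
    else:  # insert or update: overwrite with the new row image
        cache[key] = event["after"]

# Events as they might arrive, in order, from a Kafka topic:
events = [
    {"op": "update", "user_id": 42, "after": {"can_post": False}},
    {"op": "insert", "user_id": 7, "after": {"can_post": True}},
    {"op": "delete", "user_id": 42},
]
for e in events:
    apply_change_event(e)
```

Because Kafka preserves per-partition ordering, applying events in consumed order keeps the cache consistent with the database's change history.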

Have any questions for Slack?

We managed to find that platform team's manager - Ryan - on Twitter, and he agreed to an AMA 👇 (thanks Ryan!)

Stan: This is how Kafka seems to naturally proliferate inside companies. It starts with one thing, and then teams just continue to adopt it as the network effect of expertise & experience grows inside the company.

I'm curious to hear how adoption is going in your company. Please reply to this e-mail if you feel like sharing! We will make sure to highlight it in the next issue 🙂


Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.