How Slack uses Apache Kafka
A million messages per second… and I thought I sent a lot of messages
Slack 🤝 Kafka
Slack started their Kafka journey with… Redis.
They initially used Redis as a queue for async processing of any tasks that were too slow for a web request - e.g. unfurling links, notifications, search index updates, etc.
High Level Redis Architecture
Then, one day in 2016, they had a big incident.
A database slowdown led to:
a job execution slowdown, which led to:
Redis hitting its memory limit.
When Redis exhausts its memory, it doesn't allow you to enqueue new jobs. Slack lost data during this incident (the jobs they couldn't enqueue).
And ultimately, Redis got completely stuck. It turns out dequeueing something from Redis requires a tiny amount of memory too.
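The failure mode above can be sketched with a toy in-memory queue (this is an illustration of the bounded-queue behavior, not Slack's or Redis's actual code): once the cap is hit, new jobs are rejected - which, for fire-and-forget enqueues from a web request, means data loss.

```python
from collections import deque

class BoundedJobQueue:
    """Toy stand-in for a Redis list used as a job queue under a hard
    memory cap. Illustrates the failure mode: once the queue is full,
    new jobs are rejected and effectively lost."""

    def __init__(self, max_jobs: int):
        self.max_jobs = max_jobs
        self.jobs = deque()
        self.lost = 0

    def enqueue(self, job: str) -> bool:
        if len(self.jobs) >= self.max_jobs:  # analogous to Redis hitting maxmemory
            self.lost += 1                   # the job is dropped, i.e. data loss
            return False
        self.jobs.append(job)
        return True

    def dequeue(self) -> str:
        return self.jobs.popleft()

queue = BoundedJobQueue(max_jobs=2)
queue.enqueue("unfurl-link")
queue.enqueue("send-notification")
accepted = queue.enqueue("update-search-index")  # rejected: queue is full
```

A durable log like Kafka sidesteps exactly this: producers keep appending to disk-backed storage instead of hitting an in-memory ceiling.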
They solved this by migrating to Kafka. Incrementally.
They first placed Kafka in front of Redis to act as a durable store.
kafkagate, an HTTP proxy, was written & placed in front of Kafka for their PHP web apps to interface with.
Kafka + Redis Intermediate Architecture
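To make the proxy idea concrete, here is a minimal sketch of what a thin HTTP-to-Kafka gateway like kafkagate might do: accept a JSON body from the PHP web app, validate it, and produce it to a topic. The request shape and field names are my assumptions, and the producer is a stub - a real proxy would use an actual Kafka client.

```python
import json

class StubProducer:
    """Stand-in for a real Kafka producer; records what would be produced."""
    def __init__(self):
        self.produced = []

    def produce(self, topic: str, key: bytes, value: bytes):
        self.produced.append((topic, key, value))

def handle_post(producer, raw_body: str) -> dict:
    """Validate the request and forward it to Kafka, returning an HTTP-style response."""
    try:
        body = json.loads(raw_body)
        topic, payload = body["topic"], body["payload"]
    except (json.JSONDecodeError, KeyError):
        return {"status": 400, "error": "expected JSON with 'topic' and 'payload'"}
    key = body.get("key", "")  # optional partitioning key
    producer.produce(topic, key.encode(), json.dumps(payload).encode())
    return {"status": 200}

producer = StubProducer()
resp = handle_post(producer, '{"topic": "jobs", "payload": {"type": "unfurl"}}')
```

The appeal of this design is that the PHP web tier only needs to speak HTTP - all Kafka client complexity lives in one service.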
Kafka 🤝 Slack's Data Warehouse
In 2017, Slack shared that Kafka is also used to collect data (logs, jobs, etc.) & push it to their data warehouse in S3.
They used Pinterest's Secor library as the service that persists Kafka messages to S3 (effectively a sink connector).
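The sink pattern Secor implements can be sketched roughly like this (an illustration of the pattern, not Secor's actual code): buffer consumed messages per topic-partition, and flush each full buffer to object storage under a topic/partition/start-offset key, so the upload is deterministic and replayable.

```python
class S3SinkSketch:
    """Illustrative Kafka-to-object-storage sink: batch, then flush."""

    def __init__(self, store: dict, batch_size: int = 3):
        self.store = store            # stands in for an S3 bucket
        self.batch_size = batch_size
        self.buffers = {}             # (topic, partition) -> list of (offset, value)

    def consume(self, topic, partition, offset, value):
        buf = self.buffers.setdefault((topic, partition), [])
        buf.append((offset, value))
        if len(buf) >= self.batch_size:
            self.flush(topic, partition)

    def flush(self, topic, partition):
        buf = self.buffers.pop((topic, partition), [])
        if buf:
            start_offset = buf[0][0]
            # zero-padded offset keeps objects lexicographically sorted
            key = f"{topic}/{partition}/{start_offset:020d}"
            self.store[key] = "\n".join(v for _, v in buf)

bucket = {}
sink = S3SinkSketch(bucket, batch_size=2)
sink.consume("logs", 0, 100, "line-a")
sink.consume("logs", 0, 101, "line-b")   # batch full -> flushed to the "bucket"
```

Keying objects by start offset makes the upload idempotent: re-consuming the same offsets overwrites the same object instead of duplicating data.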
Kafka š¤ Slackās Distributed Tracing Events
Another use case they have is shepherding distributed tracing events into the appropriate data stores for visualization purposes.
This is at the following scale:
310M traces a day (3587/s)
8.5B spans a day (98.3k/s)
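The per-second figures follow directly from the daily totals - and the arithmetic confirms the spans number is in the billions:

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

traces_per_day = 310_000_000          # 310M traces
spans_per_day = 8_500_000_000         # 8.5B spans

traces_per_sec = traces_per_day / SECONDS_PER_DAY   # ~3,587/s
spans_per_sec = spans_per_day / SECONDS_PER_DAY     # ~98.4k/s
```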
The Latest Stats ✨
Slack continued to grow their Kafka usage across the org - with different teams adopting Kafka in their own setups. This eventually led to a fragmentation of versions & duplicate effort in managing Kafka.
Year by year, Kafka became an increasingly central nervous system at the company, moving mission-critical data.
In 2022, it powered:
logging pipelines
trace data
billing
enterprise analytics
security analytics
The latest numbers I could find were the following:
6.5 Gbps
1,000,000s of messages a second
700TB of data (0.7PB)
10 Kafka clusters
100s of nodes
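A quick back-of-envelope check ties these numbers together (assuming the 6.5 Gbps is sustained ingress and the 700TB is total retained data - my assumptions, and neither replication nor compression is accounted for):

```python
SECONDS_PER_DAY = 86_400

gbps = 6.5
bytes_per_sec = gbps * 1e9 / 8                        # ~0.81 GB/s ingress
tb_per_day = bytes_per_sec * SECONDS_PER_DAY / 1e12   # ~70 TB written per day

retained_tb = 700
implied_retention_days = retained_tb / tb_per_day     # ~10 days of retention
```

In other words, the quoted throughput and storage figures are consistent with roughly a week-and-a-half of retention, before accounting for replication.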
Managing 10 clusters at this scale required some work - they invested in automating many processes:
topic/partition creation & reassignment
capacity planning
adding/removing brokers
replacing nodes / upgrading versions
observability
ops toil
And Slack's Kafka usage is only growing.
They formed a new Data Streaming team to handle all current and future Kafka use cases.
Immediate future plans include a new Change Data Capture (CDC) project. It will support the caching needs for Slackās Permission Service used to authorize actions in Slack and enable near real-time updates to their Data Warehouse.
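The CDC-to-cache pattern described here can be sketched as follows (a hypothetical illustration - the event shape and field names are my assumptions, not Slack's schema): a consumer applies database change events to keep a permissions cache fresh in near real time.

```python
def apply_cdc_event(cache: dict, event: dict) -> None:
    """Apply a single change-data-capture event to an in-memory cache."""
    key = (event["user_id"], event["channel_id"])
    if event["op"] in ("insert", "update"):
        cache[key] = event["after"]      # store the new row state
    elif event["op"] == "delete":
        cache.pop(key, None)             # invalidate the cached entry

permissions = {}
events = [
    {"op": "insert", "user_id": "u1", "channel_id": "c1", "after": {"can_post": True}},
    {"op": "update", "user_id": "u1", "channel_id": "c1", "after": {"can_post": False}},
    {"op": "delete", "user_id": "u1", "channel_id": "c1"},
]
for e in events:
    apply_cdc_event(permissions, e)
```

Because the events arrive in commit order from the log, replaying them always converges the cache to the database's current state - which is what makes CDC attractive for both caching and warehouse updates.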
Have any questions for Slack?
We managed to find that platform team's manager - Ryan - on Twitter, and he agreed to an AMA (thanks Ryan!)
Regardless, it still took the team quite some time to get Kafka to a place where we considered it stable and not the thing that keeps us up at night. Software with a lot of knobs tends to have a very small sweet spot.
(Context, I managed the team that ran Kafka at Slack. AMA?)
— ryan katkov (@ryangonnaryan)
9:42 PM • May 25, 2023
This quote by Slack co-founder Stewart Butterfield is amazing:
“It is very difficult to approach Slack with beginner's mind. But we have to, all of us, and we have to do it every day, over and over and polish every rough edge off until this product is as smooth as lacquered… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
9:08 AM • Jul 11, 2023
Stan: This is how Kafka seems to naturally proliferate inside companies. It starts with one thing, and then teams just continue to adopt it as the network effect of expertise & experience grows inside the company.
I'm curious to hear how adoption is going in your company. Please reply to this e-mail if you feel like sharing! We will make sure to highlight it in the next issue.
Liked this edition?
Help support our growth so that we can continue to deliver valuable content!
More Kafka? 🔥
What more did we post on socials this week?
Let's see:
Something amazing is coming to Apache Kafka…
Consumer Groups v2!
If you've ever used consumer groups in production at any non-trivial scale, you probably know all the problems with them:
- The group-wide synchronization barrier acts as a cap on scalability
A single misbehaving… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
5:55 PM • Jul 12, 2023
Here are the top 14 Apache Kafka Consumer configs you should learn about.
For those interested - yesterday's 2 minute streaming issue was about Consumer 101 basics, and had links to more advanced concepts.
— Stanislav Kozlovski (@BdKozlovski)
2:20 PM • Jul 11, 2023
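The tweet's exact list of 14 isn't reproduced here, but for orientation, here is one plausible set of 14 commonly tuned consumer properties (the property names are real Kafka consumer configs; the values are illustrative, not recommendations):

```python
# One plausible set of 14 commonly tuned Kafka consumer configs.
# Property names are real; values are examples only.
consumer_config = {
    "bootstrap.servers": "localhost:9092",        # where to find the cluster
    "group.id": "my-consumer-group",              # consumer group membership
    "client.id": "consumer-1",                    # label for logs/metrics
    "enable.auto.commit": "false",                # commit offsets manually
    "auto.commit.interval.ms": "5000",            # used if auto-commit is on
    "auto.offset.reset": "earliest",              # where to start with no offset
    "max.poll.records": "500",                    # max records per poll()
    "max.poll.interval.ms": "300000",             # max time between polls
    "session.timeout.ms": "45000",                # failure-detection window
    "heartbeat.interval.ms": "3000",              # heartbeat cadence
    "fetch.min.bytes": "1",                       # broker waits for this much data
    "fetch.max.bytes": "52428800",                # cap on one fetch response
    "max.partition.fetch.bytes": "1048576",       # cap per partition per fetch
    "isolation.level": "read_committed",          # skip aborted transactional records
}
```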
Most people think mTLS is hard.
It's not - simply study how other companies do it.
Here's how Zendesk secures their Kafka clusters using mTLS.
I promise it's simple.
But first - why mTLS?
It's simply a very appealing way of both encrypting and authenticating your… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
10:34 AM • Jul 15, 2023
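For a sense of how little client-side configuration mTLS needs, here is a minimal sketch in librdkafka/confluent-kafka property style (this is not Zendesk's actual setup; the file paths are placeholders): the client verifies the broker's certificate against the CA, and the broker verifies the client's certificate back - mutual authentication plus encryption in one mechanism.

```python
# Minimal mTLS client config sketch (librdkafka-style property names;
# paths are placeholders, not a real deployment).
mtls_config = {
    "bootstrap.servers": "broker1:9093",                  # the TLS listener port
    "security.protocol": "SSL",                           # encrypt + authenticate
    "ssl.ca.location": "/etc/kafka/ca.pem",               # CA that signed broker certs
    "ssl.certificate.location": "/etc/kafka/client.pem",  # this client's certificate
    "ssl.key.location": "/etc/kafka/client.key",          # this client's private key
}
```

The operational cost is almost entirely in certificate issuance and rotation, not in the Kafka configuration itself.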
🔴 When to alert on URPs 🥱
An under-replicated partition (URP) isn't a bad thing per se.
It simply means that not all replicas of a partition are alive.
Recall that data in Kafka is split by topic -> partitions -> replicas on brokers.
A partition with a replication factor of 3 would have 3… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
5:51 PM • Jul 14, 2023
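The definition above translates directly into code: a partition is under-replicated when its in-sync replica set (ISR) is smaller than its full replica set. A small sketch (the partition metadata shape here is illustrative, not a real admin-client response):

```python
def count_urps(partitions) -> int:
    """Count under-replicated partitions: ISR smaller than the replica set."""
    return sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"]))

# Three partitions with replication factor 3; broker 3 has fallen out of
# the ISR for the second partition, making it under-replicated.
partitions = [
    {"replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"replicas": [2, 3, 1], "isr": [2, 1]},      # broker 3 fell behind
    {"replicas": [3, 1, 2], "isr": [3, 1, 2]},
]
urps = count_urps(partitions)
```

This is also why a nonzero URP count isn't automatically an emergency: data is still fully readable and writable as long as enough replicas remain in sync - it's sustained or growing URPs that warrant an alert.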
Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.