Apache Kafka Lines of Code Analysis (Java, Scala, Python)
🔍 An analysis of Apache Kafka's codebase from versions 0.7.2 to 3.7.0
Talk is Cheap. Show Me The Code
A distributed streaming platform like Kafka is the infrastructure backbone of many, many companies today.
But how much code does it actually take to create such a platform?
That is… a lot of code when you think about it.
Apache Kafka 3.7 lines of code by module
If we have to give them a rough grouping:
Backend Server - 420k, made up of:
    core - 242k
    metadata - 57k
    group-coordinator - 50k
    storage - 28k
    raft - 27k
    server-common - 16k
clients - 287k
tools - 24k
trogdor - 18k
streams - 329k
connect - 134k
Grouped this way, the numbers start to make more sense: the Kafka repository is split fairly cleanly between a few different projects.
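As a quick sanity check on the grouping, here is a minimal Python sketch that sums the rounded per-module figures from the list above and prints each group's share. The dictionary layout and the names `groups` and `group_totals` are my own, and since the inputs are rounded to the nearest thousand lines, the totals are approximate.

```python
# Rounded per-module line counts (in thousands), copied from the list above.
groups = {
    "backend server": {
        "core": 242,
        "metadata": 57,
        "group-coordinator": 50,
        "storage": 28,
        "raft": 27,
        "server-common": 16,
    },
    "clients": {"clients": 287},
    "tools": {"tools": 24},
    "trogdor": {"trogdor": 18},
    "streams": {"streams": 329},
    "connect": {"connect": 134},
}


def group_totals(groups):
    """Sum the module counts inside each group."""
    return {name: sum(modules.values()) for name, modules in groups.items()}


totals = group_totals(groups)
listed = sum(totals.values())

# Print groups from largest to smallest, with their share of the listed total.
for name, kloc in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {kloc:5d}k  {kloc / listed:6.1%}")
print(f"{'listed total':15s} {listed:5d}k")
```

The listed groups add up to a bit less than the ~1,262k total quoted for early 2024, presumably because some smaller modules are not in the list.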
🐣 Started From the Bottom, Now We’re Here
Can you guess how many lines of code Kafka started with?
Picture a number in your mind.
I will now give you a hint: the repository grew at an average rate of 24% per release. There were 24 releases. (cue mental algebra 🙂 )
Kafka started with 24,400 lines of code!
That’s literally as much as the tools module today!
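If you would rather not do the algebra in your head, here is a small Python sketch of it. It assumes the end point is the ~1,262k figure quoted for early 2024 and that there were 24 growth steps; the function names `compound_rate` and `project` are mine.

```python
def compound_rate(start, end, periods):
    """Constant per-period growth rate that takes `start` to `end`."""
    return (end / start) ** (1 / periods) - 1


def project(start, rate, periods):
    """Size after compounding `rate` for `periods` steps."""
    return start * (1 + rate) ** periods


START = 24_400      # lines of code in the first analysed version
END = 1_262_000     # assumption: the early-2024 figure from this article
RELEASES = 24       # number of releases given in the hint

rate = compound_rate(START, END, RELEASES)
print(f"compound growth per release: {rate:.1%}")
print(f"24% compounded {RELEASES} times: {project(START, 0.24, RELEASES):,.0f} lines")
```

Under those assumptions the compound rate comes out at roughly 18% per release, and compounding a flat 24% would overshoot today's size several times over. So the 24% is most likely a simple average of the per-release percentages, pulled up by the huge early jumps, rather than a compound rate. That reading is my interpretation, not something the data states directly.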
But then. Developers started cracking…
Release over release code growth rate (in percentage)
The very first release tripled the codebase’s size, and each of the next two roughly doubled it.
With such a growth rate, you didn’t need many releases to grow the size substantially.
2012 - 24k
2015 - 138k
2017 - 400k
2020 - 726k
2022 - 994k
(start of) 2024 - 1,262k
One thing is clear - development has NOT slowed at all.
If anything, Kafka is having more code contributed to it than ever.
Talk about a healthy community.
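To put a rough number on "has not slowed", the milestone figures above can be turned into average lines added per year. This is a back-of-the-envelope sketch: the years are coarse (the last point is the start of 2024) and the helper name `yearly_additions` is my own.

```python
# (year, thousands of lines) milestones as listed above.
milestones = [
    (2012, 24),
    (2015, 138),
    (2017, 400),
    (2020, 726),
    (2022, 994),
    (2024, 1262),
]


def yearly_additions(points):
    """Average thousands of lines added per year between consecutive milestones."""
    result = []
    for (y0, k0), (y1, k1) in zip(points, points[1:]):
        result.append((y0, y1, (k1 - k0) / (y1 - y0)))
    return result


additions = yearly_additions(milestones)
for y0, y1, per_year in additions:
    print(f"{y0}-{y1}: about {per_year:.0f}k lines added per year")
```

In absolute terms the two most recent intervals are at least as busy as any before them, even though the percentage growth per release is naturally smaller on a bigger codebase.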
👑 Top Contributors
What’s a newsletter without some shout outs?
While many people have written a lot of code, the top contributors in Apache Kafka have consistently contributed for the last ~7 years. This is an amazing feat.
Thank you to all the open source contributors!
Kafka is mainly Java. But there was a fair amount of Scala code back in the day, and large parts of the server (core) are still in Scala.
The project is slowly migrating away from Scala by simply writing most new code in Java. For example, all the new major Kafka features are written in Java.
The Java codebase growing at hyperspeed
Bonus: Largest Files
As a final present, here are the top 5 Java/Scala files by size. Challenge yourself to go read some of them 😉
📝 Less-Interesting Notes
this data was counted via cloc
we only count the three main languages in Kafka - Java, Scala and Python
we count blank lines, comments and code lines
the data includes test code, which is in all likelihood a large majority of the codebase (that’s a good thing!)
More than half of Kafka’s code is test code!
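For anyone who wants to reproduce a count like this, here is a minimal Python sketch of the methodology in the notes above. It assumes `cloc` is installed and on your PATH, and it uses cloc's `--json` and `--include-lang` options. The exact command used for this analysis is not stated here, so treat the invocation and the helper names `total_lines` and `run_cloc` as my assumptions.

```python
import json
import subprocess

# The three languages counted in this analysis.
LANGUAGES = ("Java", "Scala", "Python")


def total_lines(cloc_report, languages=LANGUAGES):
    """Sum blank + comment + code lines for the given languages.

    `cloc_report` is the dict parsed from `cloc --json` output, where each
    language maps to a dict with "blank", "comment" and "code" counts.
    """
    total = 0
    for lang in languages:
        stats = cloc_report.get(lang, {})
        total += stats.get("blank", 0) + stats.get("comment", 0) + stats.get("code", 0)
    return total


def run_cloc(repo_path):
    """Run cloc on a local checkout and return its JSON report as a dict."""
    result = subprocess.run(
        ["cloc", "--json", "--include-lang=" + ",".join(LANGUAGES), repo_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Example (needs cloc and a local checkout; the path is a placeholder):
# print(total_lines(run_cloc("path/to/kafka")))
```

Keeping the parsing separate from the subprocess call makes it easy to rerun the same sum on saved reports for each release tag.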
We were quiet on the socials this time.
Back to basics, this is a great share with any newbie you know:
Apache Kafka 101 in 1 minute. 🔥
Let’s go! 👇
It’s a distributed commit log.
A log is the simplest data structure - an ordered sequence of records that only supports appends.
🔒 It’s immutable, so you can’t delete or edit the records in place.
Kafka stores its data in topics.… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
Jan 13, 2024
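To make the "log" idea from the thread concrete, here is a toy in-memory sketch in Python. It only illustrates the data structure described above (ordered records, append-only, no in-place edits). It is not how Kafka implements its log, and the class name `AppendOnlyLog` is made up for this example.

```python
class AppendOnlyLog:
    """A toy append-only log: records are ordered and addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Add a record at the end and return its offset."""
        self._records.append(bytes(record))  # store an immutable copy
        return len(self._records) - 1

    def read(self, offset: int) -> bytes:
        """Return the record stored at `offset`."""
        if offset < 0 or offset >= len(self._records):
            raise IndexError(f"offset {offset} is out of range")
        return self._records[offset]

    def __len__(self) -> int:
        return len(self._records)


log = AppendOnlyLog()
first = log.append(b"hello")
second = log.append(b"world")
print(first, second, log.read(first), len(log))
```

There is deliberately no method to update or delete a record: the only ways to interact with the log are appending at the end and reading by offset.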
Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.