Apache Kafka Lines of Code Analysis (Java, Scala, Python)

🔍 An analysis of Apache Kafka's codebase from versions 0.7.2 to 3.7.0

Talk is Cheap. Show Me The Code

A distributed streaming platform like Kafka is the infrastructure backbone of many, many companies today.

But how much code does it actually take to create such a platform?

Around 1.2 million lines.

That is… a lot of code when you think about it.

Apache Kafka 3.7 lines of code by module

If we have to give them a rough grouping:

  • Backend Server (420k)

    • core - 242k

    • metadata - 57k

    • group-coordinator - 50k

    • storage - 28k

    • raft - 27k

    • server-common - 16k

  • Clients (329k)

    • clients - 287k

    • tools - 24k

    • trogdor - 18k

  • Kafka Streams

    • streams - 329k

  • Connect

    • connect - 134k
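To sanity-check the grouping, here is a small sketch that adds up the module sizes listed above. The numbers are copied from the list (in thousands of lines); the variable names are just for illustration.

```python
# Module sizes in thousands of lines, copied from the list above.
groups = {
    "Backend Server": {"core": 242, "metadata": 57, "group-coordinator": 50,
                       "storage": 28, "raft": 27, "server-common": 16},
    "Clients": {"clients": 287, "tools": 24, "trogdor": 18},
    "Kafka Streams": {"streams": 329},
    "Connect": {"connect": 134},
}

# Sum each group, then sum the groups.
group_totals = {name: sum(modules.values()) for name, modules in groups.items()}
grand_total = sum(group_totals.values())

for name, total in group_totals.items():
    share = 100 * total / grand_total
    print(f"{name:<15} {total:>5}k  {share:4.1f}%")
print(f"{'Listed modules':<15} {grand_total:>5}k")
```

The listed modules add up to 1,212k, which is consistent with the "around 1.2 million" headline; any remainder presumably sits in smaller modules that are not listed here.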

Grouped this way, the number starts to make more sense: the Kafka repository is cleanly split between a few different projects.

🐣 Started From the Bottom, Now We’re Here

Can you guess how many lines of code Kafka started with?

Picture a number in your mind.

I will now give you a hint: the repository grew at an average rate of 24% per release. There were 24 releases. (cue mental algebra 🙂 )

Kafka started with 24,400 lines of code!

That’s literally as much as the tools module today!
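If you actually tried the mental algebra, here is a rough sketch of it in code. It uses only figures from this article: the 24,400 starting lines, the 24 releases, and the 1,262k total reported for the start of 2024. One assumption to flag: the 24% figure is presumably a plain average of the per-release percentages, which the huge early jumps pull upward, so it is not the same thing as a constant compound rate.

```python
start = 24_400        # lines in the first counted version (from the article)
end = 1_262_000       # lines at the start of 2024 (from the article)
releases = 24         # number of releases (from the article)

# Constant per-release rate that would turn `start` into `end`.
compound_rate = (end / start) ** (1 / releases) - 1

# What a constant 24% per release would give instead.
projected = start * 1.24 ** releases

print(f"implied compound rate: {compound_rate:.1%}")
print(f"24% compounded {releases} times: {projected:,.0f} lines")
```

The implied compound rate comes out at roughly 18% per release, while compounding a flat 24% over 24 releases would land above 4 million lines. So the hint gets you into the right ballpark for the starting size only if you treat 24% as an average of uneven growth rates rather than a steady one.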

But then developers got cracking…

Release over release code growth rate (in percentage)

The very first release tripled the codebase’s size, and each of the two subsequent releases roughly doubled it.

With such a growth rate, you didn’t need many releases to grow the size substantially.

  • 2012 - 24k

  • 2015 - 138k

  • 2017 - 400k

  • 2020 - 726k

  • 2022 - 994k

  • (start of) 2024 - 1,262k

One thing is clear - development has NOT slowed at all.

If anything, Kafka is having more code contributed to it than ever.

Talk about a healthy community.
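The "has not slowed" claim can be checked against the yearly totals above. This is a back-of-the-envelope sketch: it treats each label as a whole calendar year, which is only an approximation (the last entry is the start of 2024).

```python
# (year, total lines in thousands), copied from the list above.
milestones = [(2012, 24), (2015, 138), (2017, 400),
              (2020, 726), (2022, 994), (2024, 1262)]

# Average number of lines (in thousands) added per year between milestones.
added_per_year = {}
for (y0, n0), (y1, n1) in zip(milestones, milestones[1:]):
    added_per_year[(y0, y1)] = (n1 - n0) / (y1 - y0)

for (y0, y1), rate in added_per_year.items():
    print(f"{y0}-{y1}: about {rate:.0f}k lines added per year")
```

In absolute terms the pace goes from about 38k lines per year in the early period to about 134k per year in the two most recent ones, so the volume of new code has held up even though the percentage growth per release naturally shrinks as the base gets bigger.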

👑 Top Contributors

What’s a newsletter without some shout outs?

While many people have written a lot of code, the top contributors in Apache Kafka have consistently contributed for the last ~7 years. This is an amazing feat.

  1. Ismael Juma 👑

  2. Jason Gustafson

  3. Guozhang Wang

  4. Matthias J. Sax

  5. Colin Patrick McCabe

  6. Rajini Sivaram

  7. (many more, but this is all that fit on my screen)

Thank you to all the open source contributors!

🐍 Languages

Kafka is mainly Java. But there was a fair amount of Scala code back in the day, and large parts of the server (core) are still in Scala.

The project is slowly migrating away from Scala by simply writing most new code in Java. For example - all the new major Kafka features are written in Java.

Bonus: Largest Files

As a parting gift - I present you the top 5 Java/Scala files in terms of size. Challenge yourself to go read some of them 😉

📝 Less-Interesting Notes

  • this data was counted via cloc

  • we only count the three main languages in Kafka - Java, Scala and Python

  • we count blank lines, comments and code lines

  • the data includes test code, which makes up more than half of the codebase (that’s a good thing!)

More than half of Kafka’s code is test code!
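For anyone who wants to reproduce a count like this, here is a minimal sketch. It assumes cloc is installed and on your PATH and that you have a local Kafka checkout; the exact cloc options used for this article are not stated, and the helper names and the path in the usage comment are hypothetical.

```python
import json
import subprocess

LANGUAGES = ("Java", "Scala", "Python")  # the three languages counted in the article

def run_cloc(repo_path: str) -> dict:
    """Run cloc on a checkout and return its JSON report (requires cloc on PATH)."""
    result = subprocess.run(
        ["cloc", "--json", "--include-lang=" + ",".join(LANGUAGES), repo_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def total_lines(report: dict) -> int:
    """Sum blank, comment and code lines for the selected languages."""
    return sum(
        report[lang][kind]
        for lang in LANGUAGES if lang in report
        for kind in ("blank", "comment", "code")
    )

# Hypothetical usage: total_lines(run_cloc("path/to/kafka"))
```

Summing blank, comment and code lines together mirrors the counting rule in the notes above; counting code lines alone would give a noticeably smaller number.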



Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.