Introduction to Apache Kafka

⏳ the Kafka introduction you wish you had before


Originally developed at LinkedIn and open-sourced in 2011, Kafka was donated to the Apache Software Foundation and has since grown tremendously, both as a project and as an ecosystem.

Over 70% of Fortune 500 companies have used Kafka.

At its core, though, it’s a distributed commit log.

A log (a.k.a. write-ahead log, commit log or transaction log) is one of the simplest data structures there is.

It’s an ordered structure that only supports appends. It’s immutable - you can’t edit or delete records in place.
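
To make the idea concrete, here is a minimal in-memory sketch of an append-only log. The class name AppendOnlyLog and its methods are made up for illustration and are not Kafka APIs; the point is only that appends return an increasing offset and existing records are never modified.

```python
class AppendOnlyLog:
    """A toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []  # position in this list == the record's offset

    def append(self, record: bytes) -> int:
        """Add a record at the end and return its offset."""
        self._records.append(bytes(record))
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at offset, in order."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._records[offset:offset + max_records]

    @property
    def end_offset(self) -> int:
        """Offset that the next appended record will receive."""
        return len(self._records)


log = AppendOnlyLog()
first = log.append(b"user signed up")
second = log.append(b"user logged in")
```

Readers keep their own position (an offset) and ask for records from there, which is why many independent readers can share the same log without interfering with each other.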

There are a lot of benefits to it:

[Figure: The Log Data Structure]

Data Hierarchy & Terminology

Kafka stores its data in topics. They’re split into partitions, and replicated across brokers.

The structure is roughly:

  • topics

    • partitions

      • replicas

        • a bunch of files (the log)

Brokers host replicas of partitions.

A partition is replicated according to the replication factor (usually 3).
One of the brokers hosting a replica is the leader for the partition, meaning all writes go to it; the other replicas are followers.
Consumers can read from any replica, although by default they read from the leader.
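
The hierarchy can be sketched with a couple of plain data classes. Everything below (the class names, the round-robin placement, the "first replica is the leader" shortcut) is a simplification assumed for illustration; Kafka's real replica placement is more involved (it can take racks into account, for example).

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    topic: str
    index: int
    replicas: list  # broker ids hosting a copy of this partition
    leader: int     # broker id that currently accepts writes


@dataclass
class Topic:
    name: str
    partitions: list = field(default_factory=list)


def create_topic(name, num_partitions, replication_factor, broker_ids):
    """Spread replicas over brokers round-robin (illustrative only)."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed broker count")
    topic = Topic(name)
    for p in range(num_partitions):
        replicas = [broker_ids[(p + r) % len(broker_ids)]
                    for r in range(replication_factor)]
        topic.partitions.append(Partition(name, p, replicas, leader=replicas[0]))
    return topic


orders = create_topic("orders", num_partitions=4, replication_factor=3,
                      broker_ids=[1, 2, 3, 4])
```

With four brokers and a replication factor of 3, every partition lives on three distinct brokers, and leadership ends up spread across the cluster rather than concentrated on one node.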

Replication helps with durability and availability - if a broker/disk breaks, another broker becomes the leader and usage continues.
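
A rough sketch of that failover, assuming a hypothetical elect_leader helper that picks the first eligible replica. Kafka normally only promotes replicas that are in sync with the old leader (unless unclean leader election is enabled); the in_sync set below stands in for that idea.

```python
def elect_leader(replicas, in_sync, live_brokers):
    """Return the first replica that is both alive and in sync, else None."""
    for broker_id in replicas:
        if broker_id in live_brokers and broker_id in in_sync:
            return broker_id
    return None  # partition stays offline until an eligible replica returns


replicas = [1, 2, 3]  # replication factor 3
in_sync = {1, 2, 3}
leader = elect_leader(replicas, in_sync, live_brokers={1, 2, 3})

# Broker 1 dies: it drops out of the live set and the in-sync set.
in_sync.discard(1)
new_leader = elect_leader(replicas, in_sync, live_brokers={2, 3})
```

If no live replica is in sync, the helper returns None, which mirrors the trade-off between availability and not losing acknowledged writes.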

Message Flow

  1. Clients (producers) send messages (records) to a Kafka node (broker).

  2. Other clients, called consumers, read and process said messages.
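
The two steps can be simulated in memory. This is a toy, not the real client API: ToyBroker, send and poll are hypothetical names, partitioning by CRC32 of the key is an arbitrary choice for the sketch (Kafka's Java client hashes keys with murmur2 by default), and there is no networking involved.

```python
import zlib
from collections import defaultdict


class ToyBroker:
    """Holds one append-only list of records per (topic, partition)."""

    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        self.logs = defaultdict(list)

    def send(self, topic, key: bytes, value: bytes):
        """Producer side: append a record, return (partition, offset)."""
        partition = zlib.crc32(key) % self.num_partitions
        log = self.logs[(topic, partition)]
        log.append((key, value))
        return partition, len(log) - 1

    def poll(self, topic, partition, offset, max_records=100):
        """Consumer side: read records starting at the given offset."""
        return self.logs[(topic, partition)][offset:offset + max_records]


broker = ToyBroker()
p1, o1 = broker.send("clicks", b"user-42", b"/home")
p2, o2 = broker.send("clicks", b"user-42", b"/pricing")

# The consumer tracks its own position and advances it after processing.
position = 0
batch = broker.poll("clicks", p1, position)
position += len(batch)
```

Records with the same key land in the same partition, so a consumer of that partition sees them in the order they were produced.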

Both producers and consumers connect to the broker over TCP and talk to it using a custom binary protocol (the Kafka protocol).

The producer/consumer clients are simply a Java library that implements the Kafka protocol and gives you a nice interface to interact with. Nowadays there’s an implementation in virtually every language out there.
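
To give a feel for what "implementing the Kafka protocol" involves, here is a small sketch of request framing. It assumes the layout described in the public protocol guide: every message is prefixed with a 4-byte big-endian size, and a (non-flexible) request header carries an API key, API version, correlation id and client id. Treat the exact field layout as an assumption to verify against the official protocol documentation; the helper names are hypothetical and no socket is opened.

```python
import struct


def encode_request(api_key, api_version, correlation_id, client_id, body=b""):
    """Frame a request: 4-byte big-endian size, then header, then body."""
    client = client_id.encode("utf-8")
    # int16 api_key, int16 api_version, int32 correlation_id,
    # then client_id as an int16 length followed by its bytes.
    header = struct.pack(">hhih", api_key, api_version,
                         correlation_id, len(client)) + client
    payload = header + body
    return struct.pack(">i", len(payload)) + payload


def decode_frame(data):
    """Split one size-prefixed frame off the front of a byte buffer."""
    (size,) = struct.unpack(">i", data[:4])
    return data[4:4 + size], data[4 + size:]


API_VERSIONS = 18  # api_key of the ApiVersions request
frame = encode_request(API_VERSIONS, 0, correlation_id=7, client_id="demo")
payload, rest = decode_frame(frame)
```

The correlation id is echoed back in the response, which lets a client match responses to requests when several are in flight on the same connection.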

Controllers

A distributed system requires coordination.

Kafka achieves this by having specific brokers called “controllers”.

At any one point, there is only one active controller.

The controller is the main source of truth for metadata within the cluster.
It’s responsible for a bunch of things, one of the main ones being handling broker failures by changing partition leadership.

  • Partition leader election among regular brokers is done through the controller.

  • Leader election among the controllers (picking the active one) is done through a variant of Raft (KRaft).
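
The core rule that Raft-style protocols rely on is majority voting: a candidate becomes the active controller only if more than half of the controller quorum votes for it. The sketch below shows just that rule. It is not KRaft, and it leaves out log comparison, epochs, timeouts and everything else a real implementation needs; all function names are hypothetical.

```python
def majority(quorum_size: int) -> int:
    """Smallest number of votes that is more than half of the quorum."""
    return quorum_size // 2 + 1


def wins_election(votes_granted: int, quorum_size: int) -> bool:
    """A candidate wins only with a strict majority of the quorum."""
    return votes_granted >= majority(quorum_size)


def tolerated_failures(quorum_size: int) -> int:
    """How many controllers can be down while a majority is still reachable."""
    return quorum_size - majority(quorum_size)
```

This is why controller quorums usually have an odd size: going from 3 to 4 controllers does not let the cluster survive any additional failures.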

All the cluster’s metadata is stored inside a simple (but special) Kafka topic, __cluster_metadata, whose leader is the active controller.

All brokers replicate this topic - that’s how they get metadata.
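
Conceptually, each broker builds its view of the cluster by replaying that metadata log in order. A minimal sketch, assuming made-up record shapes (register_broker, unregister_broker, partition_leader) rather than Kafka's actual metadata record types:

```python
def apply(state, record):
    """Apply one metadata record to a broker's local view of the cluster."""
    kind = record["type"]
    if kind == "register_broker":
        state["brokers"].add(record["broker_id"])
    elif kind == "unregister_broker":
        state["brokers"].discard(record["broker_id"])
    elif kind == "partition_leader":
        state["leaders"][(record["topic"], record["partition"])] = record["leader"]
    return state


def replay(metadata_log, state=None, from_offset=0):
    """Replay records from from_offset; return (state, next_offset)."""
    if state is None:
        state = {"brokers": set(), "leaders": {}}
    for record in metadata_log[from_offset:]:
        apply(state, record)
    return state, len(metadata_log)


metadata_log = [
    {"type": "register_broker", "broker_id": 1},
    {"type": "register_broker", "broker_id": 2},
    {"type": "partition_leader", "topic": "orders", "partition": 0, "leader": 1},
]
view, next_offset = replay(metadata_log)

# Broker 1 fails; the active controller appends new records, and every
# broker catches up by replaying only what it has not seen yet.
metadata_log.append({"type": "unregister_broker", "broker_id": 1})
metadata_log.append(
    {"type": "partition_leader", "topic": "orders", "partition": 0, "leader": 2})
view, next_offset = replay(metadata_log, view, next_offset)
```

Because the metadata is itself a log, a broker that was offline for a while only needs to remember its last offset and replay from there.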

[Figure: Sample cluster hierarchy]

Adoption

Kafka is excellent for decoupling your architecture.


It is frequently used as the central nervous system inside a company - a place where data arrives, is processed or transformed, and is then consumed by other systems (data warehouses, search indexes, microservices, etc.).

To support this use case, Kafka ships with two major open-source components - Connect and Streams.

  • Streams - an embeddable lightweight library for stream processing with Kafka.

  • Connect - a framework and runtime for plugins which integrate Kafka with external systems.

Streams helps you process and transform your data.

Connect helps you connect it.
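
Conceptually, the two fit together as: external source, topic, processing, topic, external sink. The sketch below mimics that shape with plain Python lists standing in for topics. It does not use the Kafka Streams or Kafka Connect APIs (both are Java), and every function name in it is hypothetical.

```python
def source_connector(external_rows, topic):
    """Connect-style source: copy rows from an external system into a topic."""
    for row in external_rows:
        topic.append(row)


def stream_processor(input_topic, output_topic):
    """Streams-style step: filter and transform records between topics."""
    for record in input_topic:
        if record["amount"] > 0:
            output_topic.append({"user": record["user"].lower(),
                                 "amount_cents": round(record["amount"] * 100)})


def sink_connector(topic, external_store):
    """Connect-style sink: write records from a topic to an external system."""
    for record in topic:
        external_store[record["user"]] = (
            external_store.get(record["user"], 0) + record["amount_cents"])


database_rows = [{"user": "Ann", "amount": 12.5},
                 {"user": "Bob", "amount": 0},
                 {"user": "ann", "amount": 2.25}]
raw_payments, clean_payments, warehouse = [], [], {}

source_connector(database_rows, raw_payments)
stream_processor(raw_payments, clean_payments)
sink_connector(clean_payments, warehouse)
```

The source and sink never talk to each other directly; each only knows about a topic, which is the decoupling described above.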

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.