2 Minute Streaming

🪵 Kafka Log Segments Explained

topic -> partitions -> replicas -> segments. We dive into the last one.

Stanislav Kozlovski
June 16, 2023

What’s a Kafka Segment?

The hierarchy of data in Kafka is:

topic
- partition(s)
  - replica(s)
    - segment(s)

The basic storage unit in Kafka is a replica.
The basic storage unit in the OS is a file.

A broker stores a replica of a partition and said replica is simply a bunch of files on the filesystem.

A Kafka Log Segment is a file on the filesystem containing data of a Kafka partition.

At any given time, there is only one “active” segment. The active segment is the one where new records are written to.

Combined together in consecutive order, all the files form the full partition.

[Figure: a visualization of the segments]
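To make the layout concrete, here is a minimal Python sketch. It relies on Kafka's on-disk naming, where each segment's `.log` file is named after the offset of its first record, zero-padded to 20 digits. The helper name `order_segments` and the sample file names are made up for illustration.

```python
from pathlib import PurePath


def order_segments(filenames):
    """Return (base_offset, filename) pairs for the .log files, oldest first."""
    segments = []
    for name in filenames:
        path = PurePath(name)
        if path.suffix == ".log":
            # The file stem is the base offset: the offset of the first record in the segment.
            segments.append((int(path.stem), name))
    return sorted(segments)


# Hypothetical contents of one partition directory
files = [
    "00000000000000001500.log",
    "00000000000000000000.log",
    "00000000000000000000.index",
    "00000000000000003200.log",
]

ordered = order_segments(files)
active = ordered[-1]      # the segment with the highest base offset takes new writes
inactive = ordered[:-1]   # everything before it is closed
print("active:", active[1])
```

Reading the inactive segments in this order and then the active one yields the partition's records in offset order.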

How Large is a Segment?

Depends on when we roll them.

A Segment Roll is when we convert an active segment into an inactive segment.

The process simply closes the file and opens it in read-only mode. A new active segment is then created to take over the writes.

When a roll happens is configurable. You can cap a segment by size (bytes) or by age (time).

Static Per-Broker Configs:

log.segment.bytes - the maximum size (in bytes) of a single segment (file). Defaults to 1GB.
log.roll.ms - the maximum time (in milliseconds) that Kafka keeps a segment active before rolling it, in case log.segment.bytes is not reached before that. If unset, log.roll.hours applies, which defaults to 168 hours (7 days).

These configs are the defaults that apply to every topic.
They are dynamic too, i.e., updatable via kafka-configs.sh without a restart.
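A minimal sketch of the roll decision, assuming just the two limits described above. The class and function names are hypothetical and the limit values are example inputs. Kafka's real check has more conditions (a full index file also triggers a roll, for example), so treat this as an illustration only.

```python
from dataclasses import dataclass


@dataclass
class ActiveSegment:
    size_bytes: int   # bytes already written to the file
    age_ms: int       # time since the segment was created


def should_roll(segment, incoming_bytes, max_bytes, max_age_ms):
    """Simplified roll check: roll on whichever limit is hit first."""
    would_overflow = segment.size_bytes + incoming_bytes > max_bytes
    too_old = segment.age_ms >= max_age_ms
    return would_overflow or too_old


ONE_GIB = 1024 ** 3
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

busy = ActiveSegment(size_bytes=ONE_GIB - 100, age_ms=60_000)
quiet = ActiveSegment(size_bytes=4096, age_ms=SEVEN_DAYS_MS)
fresh = ActiveSegment(size_bytes=1000, age_ms=1000)

print(should_roll(busy, 500, ONE_GIB, SEVEN_DAYS_MS))   # True: the write would exceed the size limit
print(should_roll(quiet, 500, ONE_GIB, SEVEN_DAYS_MS))  # True: the time limit is reached
print(should_roll(fresh, 500, ONE_GIB, SEVEN_DAYS_MS))  # False: neither limit is hit
```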

Dynamic Per-Topic Configs:

You can re-configure these settings for any specific topic.

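The convention is that broker-level topic configurations are prefixed with "log.".
To get the topic-level one, simply omit the prefix. The time-based setting is the exception: its topic-level name is segment.ms, while the broker-level one is log.roll.ms.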

segment.bytes
segment.ms
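As a rough sketch, a topic-level override can be applied with kafka-configs.sh. The Python helper below only builds the command line. The helper name, broker address, topic name and values are all assumptions for illustration.

```python
import shlex


def topic_override_command(bootstrap_server, topic, overrides):
    """Build a kafka-configs.sh invocation that sets topic-level overrides."""
    config_list = ",".join(f"{key}={value}" for key, value in overrides.items())
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--entity-type", "topics",
        "--entity-name", topic,
        "--alter",
        "--add-config", config_list,
    ]


cmd = topic_override_command(
    "localhost:9092",   # assumed broker address
    "clickstream",      # hypothetical topic name
    {"segment.bytes": 256 * 1024 * 1024, "segment.ms": 60 * 60 * 1000},
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run` would execute it. Printing it first makes it easy to review what will change.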

🚨Gotcha: Kafka File Descriptors

In Linux, file descriptors (also called file handles) are used to reference open files, sockets, and other I/O resources. Kafka uses a large number of open file descriptors.

This is because Kafka keeps open file handles to inactive segments - every segment in every partition is an open file.

Configure your OS to allow for that. On Linux, the system-wide limit is /proc/sys/fs/file-max (NR_FILES).
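Besides the system-wide value, every process has its own open-file limit (RLIMIT_NOFILE, the number `ulimit -n` reports). A small sketch that reads both on Linux follows. Note that it reports the limits of the Python process running it, not the broker's, and the function name is made up.

```python
import resource
from pathlib import Path


def open_file_limits():
    """Return the per-process and system-wide open-file limits on Linux."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    file_max = Path("/proc/sys/fs/file-max")
    # The /proc entry only exists on Linux; fall back to None elsewhere.
    system_wide = int(file_max.read_text()) if file_max.exists() else None
    return {"process_soft": soft, "process_hard": hard, "system_wide": system_wide}


limits = open_file_limits()
print(limits)
```

For a running broker, the per-process limits that actually apply are listed in /proc/<pid>/limits.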

Too Many File Handles?

In theory, having many open file handles can lead to higher resource usage and worse performance.

In practice, I wasn’t able to find any report on the web of someone actually suffering from that.

An open file descriptor takes around 1KB of extra memory in the Linux kernel, hence having as many as 1 million open files (2^20) would only use about 1GB of RAM.
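The arithmetic, spelled out as a short sketch. The ~1KB per descriptor is the rough figure quoted above, and the replica and segment counts in the second example are made-up numbers.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

BYTES_PER_OPEN_FILE = 1 * KIB   # rough per-descriptor kernel cost


def descriptor_memory_bytes(open_files, bytes_per_file=BYTES_PER_OPEN_FILE):
    """Approximate kernel memory held by a number of open file descriptors."""
    return open_files * bytes_per_file


# 2^20 open files at ~1 KiB each
print(descriptor_memory_bytes(2 ** 20) / GIB)      # 1.0

# Hypothetical broker: 2,000 replicas with 50 segments each
print(descriptor_memory_bytes(2_000 * 50) / MIB)   # 97.65625
```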

Tradeoffs - What Value to Configure?

The fundamental trade-off when choosing a segment size is write latency vs. retention.

small segment → larger write latency (due to having to roll a new segment and direct the write there, instead of simply writing to the file)
large segment → more disk usage (more data in active segments not considered for retention)

NOTE: Only non-active Kafka segments can be considered for deletion.

Therefore, the larger the segment size (relative to the producer throughput for that partition), the longer it will take for the segment to be closed.

It is only after the segment is closed that the retention settings (retention.ms / retention.bytes) start counting down to its deletion.

Note that these settings are different from what we’ve mentioned so far.
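A back-of-the-envelope sketch of that effect, under a deliberately simplified model: the segment closes when it fills up or hits its time limit, and the retention countdown starts at that moment. Function names and the numbers are hypothetical, and things like the broker's periodic retention check are ignored.

```python
def seconds_until_roll(segment_bytes, bytes_per_second, segment_ms):
    """Time for the active segment to close: size or time limit, whichever comes first."""
    fill_seconds = segment_bytes / bytes_per_second
    return min(fill_seconds, segment_ms / 1000)


def max_record_lifetime_seconds(segment_bytes, bytes_per_second, segment_ms, retention_ms):
    """Rough upper bound on how long the first record of a segment stays on disk."""
    roll_seconds = seconds_until_roll(segment_bytes, bytes_per_second, segment_ms)
    return roll_seconds + retention_ms / 1000


GIB = 1024 ** 3
DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical partition: 1 GiB segments, 7-day time limit, 1-day retention
for rate in (1024 ** 2, 1024):   # 1 MiB/s and 1 KiB/s
    roll = seconds_until_roll(GIB, rate, 7 * DAY_MS)
    life = max_record_lifetime_seconds(GIB, rate, 7 * DAY_MS, DAY_MS)
    print(f"{rate} B/s: rolls after {roll / 3600:.1f} h, "
          f"first record kept up to {life / 86400:.1f} days")
```

In this model, the slow partition keeps its first record for about eight days even though retention is set to one day, because the segment stays active for the full seven.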

…

More Kafka? 🔥

Of course, there was more content out this week on socials.

Most notably - we passed 3000 Twitter followers 🥳

I did a celebratory thread rounding up all the best content so far. Help boost it!

We just crossed 3,000 followers!
Lots of new good people have come along.
It would only be right if we do a quick recap thread of the content posted so far.
I wouldn't want you to miss out:
👇
— Stanislav Kozlovski (@BdKozlovski)
8:25 AM • Jun 13, 2023

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.