mTLS 🤝 Kafka (2 minute kafka)

3 simple examples of mTLS Kafka security setups at Uber, Zendesk & Wise

☂️ mTLS

mTLS is a complex way of securing your Kafka deployment. Let’s define the terms:

☂️ TLS - a protocol for encrypting data before it travels over the wire, so that no bad actor can inspect it. The name is often used interchangeably with SSL, although SSL is the older protocol that TLS replaced.

The way it works is that the broker holds a certificate signed by a trusted authority. The client verifies that certificate during the handshake, and the two then set up an encrypted connection.

☂️ mTLS - mutual TLS. 🤝 (TLS + Auth)

Here, the client also has a signed certificate. ⭐️

Both the broker and client verify each other’s certificates, and this allows the broker to authenticate the client since it now knows WHO that client is.
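To make the difference concrete, here is a minimal sketch using Python’s standard ssl module rather than Kafka itself. The function names and file-path parameters are hypothetical; the point is only that mTLS is plain TLS plus a server that demands and verifies a client certificate.

```python
import ssl


def broker_context(require_client_cert, cert_file=None, key_file=None, ca_file=None):
    """Server-side TLS settings, playing the role of the broker (illustrative)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if cert_file:
        # The broker's own signed certificate and private key.
        ctx.load_cert_chain(cert_file, key_file)
    if require_client_cert:
        # mTLS: refuse any client that does not present a valid certificate.
        ctx.verify_mode = ssl.CERT_REQUIRED
        if ca_file:
            # The CA(s) the broker trusts to have signed client certificates.
            ctx.load_verify_locations(ca_file)
    return ctx


def client_context(cert_file=None, key_file=None, ca_file=None):
    """Client-side TLS settings; it always verifies the broker's certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(ca_file)
    if cert_file:
        # Only needed for mTLS: the client's own certificate and key.
        ctx.load_cert_chain(cert_file, key_file)
    return ctx
```

On a Kafka broker, the equivalent switch is the ssl.client.auth=required setting on an SSL listener.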

❌ Problems With It

It’s notoriously hard to manage mTLS.

It requires extra infrastructure around Kafka and your apps to ensure you can rotate certificates properly.

You NEED to be able to revoke certificates efficiently for cases where:

  • 💧 the certificate leaks (the equivalent of your password being leaked)

  • 💩 the certificate is stale enough that it requires a refresh (it’s proper security practice to refresh every now and then)

And, unfortunately, there is no industry-wide consensus on how to revoke.

The two most popular mechanisms - Certificate Revocation Lists (CRLs) and the Online Certificate Status Protocol (OCSP) - each have their own set of problems.

The result?

Not all clients or Certificate Authorities (CAs) implement them consistently. The tooling and library support isn’t exactly there.
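As a rough illustration of why revocation is awkward, here is a toy model of a CRL check. Real CRLs are signed X.509 structures fetched from the CA; RevocationList and check_certificate are hypothetical names made up for this sketch, not any library’s API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RevocationList:
    """Toy stand-in for a CRL: revoked serial numbers plus a freshness deadline."""
    revoked_serials: frozenset
    next_update: datetime  # after this moment the list should be re-fetched


def check_certificate(serial, crl, now):
    """Return 'revoked', 'good', or 'unknown' for a certificate serial number."""
    if serial in crl.revoked_serials:
        return "revoked"
    if now > crl.next_update:
        # The list is stale: a revocation may have been published since.
        return "unknown"
    return "good"
```

Even in this toy version, the verifier has to decide what to do with "unknown", and a certificate revoked after the list was issued still looks "good" until the next list is fetched.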

What Do Companies Use?

Many engineers are against mTLS because of the lackluster support, which results in every company implementing it in its own way.

This risks security gaps and maintenance overhead in the future.

The Manage-It-Yourself Example

Zendesk uses:

  • Vault as the CA that generates X.509 certs.

  • Consul as the source of truth regarding who the CA is.

  • a PKI Auth Manager sidecar to generate certs remotely in Vault & store them locally.

  • a TLS Monitor sidecar to watch for these changes & make sure Kafka reloads the cert.

It is much more complex than it seems, especially certificate rotation.
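To give a feel for what a cert-watching sidecar has to do, here is a minimal polling sketch. It is not Zendesk’s implementation: CertWatcher and its on_change callback are hypothetical, and what "reload" actually means for the broker is left to the caller.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


class CertWatcher:
    """Detects when the contents of a certificate file change (illustrative)."""

    def __init__(self, cert_path: Path, on_change: Callable[[Path], None]) -> None:
        self.cert_path = cert_path
        self.on_change = on_change
        self._last_digest: Optional[str] = self._digest()

    def _digest(self) -> Optional[str]:
        # Hash the file contents; a missing file is treated as "no cert yet".
        try:
            return hashlib.sha256(self.cert_path.read_bytes()).hexdigest()
        except FileNotFoundError:
            return None

    def poll(self) -> bool:
        """Check once; fire the callback and return True if the cert changed."""
        current = self._digest()
        if current is not None and current != self._last_digest:
            self._last_digest = current
            self.on_change(self.cert_path)
            return True
        return False
```

A sidecar would call poll() on a timer and pass in a callback that triggers the reload. The hard parts this sketch skips are exactly the ones that make rotation painful: writing the cert and key atomically, reloading without dropping connections, and alerting when a reload fails.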

See more of the details in this post:

They also very recently posted an update to their setup (which I haven’t had a chance to go through yet)

The Wise Example (pun intended)

In the two other examples I have, SPIFFE and SPIRE are used.

  • 🔹 SPIFFE (Secure Production Identity Framework For Everyone) - a framework and set of standards for secure cross-service communication and identification. It is a graduated Cloud Native Computing Foundation (CNCF) project.

  • 🔹 SPIRE (the SPIFFE Runtime Environment) - a reference implementation of SPIFFE, also a graduated CNCF project.

You can also use Istio, which puts an Envoy proxy next to your apps and implements the SPIFFE spec, handling all this for you (source)

This automates cert rotation and expiry for you.

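SPIFFE relies on frequent cert expiry as the mechanism to handle leaked certificates: a leaked cert is only usable until it expires, so short lifetimes limit the damage without needing revocation.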
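The arithmetic behind that trade-off can be sketched in a few lines. The function names, the one-hour lifetime in the usage note, and the "rotate at half the lifetime" threshold are illustrative assumptions, not values taken from SPIFFE, SPIRE or Istio.

```python
from datetime import datetime, timedelta, timezone


def should_rotate(not_before, not_after, now, rotate_at_fraction=0.5):
    """True once the given fraction of the certificate's lifetime has elapsed."""
    lifetime = not_after - not_before
    elapsed = now - not_before
    return elapsed >= lifetime * rotate_at_fraction


def max_exposure(not_after, leaked_at):
    """How long a leaked cert stays usable if nothing ever revokes it."""
    return max(not_after - leaked_at, timedelta(0))
```

With a certificate issued at 12:00 and valid for one hour, should_rotate turns true at 12:30, and a leak at 12:45 leaves an attacker at most 15 minutes. The same leak on a one-year certificate would stay usable for months unless revocation works.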

Bonus?

See how Uber does it:

Related Post 🔥

With TLS, we lose the zero-copy optimization too! If you’d like to read our 2-minute edition regarding that, see:

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.