
the requirements of: Uber's Exabyte-Scale Data Infrastructure (The Uber Series Part I)

how 9 open source technologies solve for massive scale that continues to grow exponentially amidst complex, competing requirements

Nobody does data like Uber.

With more monthly users (137M) than Mexico’s entire population 🇲🇽 - you know they have data:

  • 1 EXABYTE of data across tens of thousands of servers in EACH of their two regions (that’s 2,000,000 terabytes total)

  • 500,000+ Hive Tables

  • 400,000+ Spark apps run each day, processing HUNDREDS of PETABYTES

  • Each day, they stream 12 TRILLION messages worth 7.7 petabytes of data. 🔥
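A quick back-of-envelope calculation puts that firehose in perspective (the inputs are just the figures above; decimal/SI units assumed):

```python
MESSAGES_PER_DAY = 12e12           # 12 trillion messages/day (figure from above)
BYTES_PER_DAY = 7.7e15             # 7.7 PB/day, assuming decimal (SI) petabytes
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

avg_msg_size = BYTES_PER_DAY / MESSAGES_PER_DAY       # ~642 bytes/message
msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY     # ~139 million msgs/sec
gb_per_sec = BYTES_PER_DAY / SECONDS_PER_DAY / 1e9    # ~89 GB/sec sustained

print(f"~{avg_msg_size:.0f} B/msg, ~{msgs_per_sec / 1e6:.0f}M msgs/sec, ~{gb_per_sec:.0f} GB/sec")
```

That’s roughly 640 bytes per message at ~139 million messages per second - sustained, all day long.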

Let’s go through their stack:

The Stack

Uber’s Data Infra Stack

  • GCS ☁️

  • Apache Hadoop - HDFS 🐘

  • Apache Spark ⚡️

  • Apache Kafka 🏎️

  • Apache Pinot 🍷

  • Apache Flink 🐿️

  • Apache Parquet 🪵

  • Apache Hudi 🔼

  • Apache Hive 🐝

  • Presto 🪄

The common denominator?

Apache. 🪶

It’s all open source. (well… 99% - GCS is the lone exception)
The reasons they cited:

The Requirements

They generally relate to these aspects:

  • consistency / exactly once - mission-critical financial dashboards require consistent data across regions. Data can’t be skewed when it comes to 💵 ! (see the sketch after this list)

  • availability - 99.99%. Loss of availability == significant financial losses (e.g. being unable to price rides)

  • data freshness - most use cases require seconds-level freshness, i.e. data must be available for processing seconds after being produced (critical for reacting to demand-supply skews, etc.)

  • query latency - some use cases require p99 latency < 1 second.

    • e.g. the UberEats Restaurant Manager dashboard executes a few analytical queries on each page load

  • scalability - the ability to scale with the ever-growing data set (petabytes collected a day) without re-architecting is a fundamental requirement

  • cost - Uber is a low-margin business, so they need to keep infrastructure costs low.

    • This influences a lot of decisions - like how much data to keep in memory, tiered storage, pre-materialization vs. runtime computation, etc.

  • flexibility - they need to provide both SQL and programmatic interfaces.
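To make the exactly-once requirement concrete, here’s a minimal sketch of a transactional Kafka producer in Python using the confluent-kafka client. The topic names, keys, values, and transactional id are made up for illustration - Uber’s actual pipelines aren’t public:

```python
from confluent_kafka import Producer

# Hypothetical config - the topic names and transactional.id below are
# illustrative only, not Uber's.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "payments-pipeline-1",  # enables transactional writes
    "enable.idempotence": True,                 # no duplicates on retry
    "acks": "all",                              # wait for all in-sync replicas
})

producer.init_transactions()
producer.begin_transaction()
try:
    # Both writes commit atomically - or neither does.
    producer.produce("trip-payments", key="trip-123", value=b'{"amount_usd": 14.50}')
    producer.produce("driver-payouts", key="trip-123", value=b'{"amount_usd": 11.60}')
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()  # read_committed consumers never see these
    raise
```

Consumers running with isolation.level=read_committed never observe messages from aborted transactions - the building block Kafka (and Flink’s Kafka sink) uses for exactly-once delivery between topics.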

You can’t guarantee all of these for one use case. (the CAP theorem alone forbids some combinations)

So it’s a question of how you provide the right mixture for each particular use case.

More Challenges

As if that wasn’t hard enough! 😮‍💨

Like the axes of a 3D cube, they have requirements growing in 3 orthogonal directions:

the 3 dimensions

  1. Scaling Data - total incoming data volume is growing at an exponential rate

    1. The replication factor & multiple geo regions multiply the data volume further.

    2. Can’t afford to regress on data freshness, e2e latency & availability while growing.

  2. Scaling Use Cases - new use cases arise from various verticals & groups, each with competing requirements.

  3. Scaling Users - their diverse users fall across a wide spectrum of technical skill. (some have none, some a lot)

The Use Cases

just some of the use cases

  • compute Surge Pricing - dynamic trip prices based on real-time supply & demand (toy sketch after this list)

    • needs data freshness, availability

  • fine-tune your ETA prediction ML model with real-world outcomes

    • needs scale: absurd volume & cardinality - it’s the highest-QPS model in their stack

  • power internal dashboards for ad-hoc exploration (e.g. Growth Marketing teams figuring out which incentives grow the business)

    • needs flexibility

  • generate reports for your Ops team to share with local authorities

    • needs completeness

  • push notifications / track ad impressions

    • needs exactly once, scale, low latency

  • UberEats:

    • visualize a restaurant’s numbers in a dashboard

      • needs low latency, data completeness

    • generate daily restaurant statements

      • needs completeness

    • show popular orders near me

      • needs low latency
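And here’s the promised toy sketch of the surge-pricing use case. The formula, function, and numbers are invented for illustration - Uber’s real model is ML-driven and far more sophisticated (per-area, smoothed over time windows):

```python
def surge_multiplier(open_requests: int, available_drivers: int,
                     base: float = 1.0, cap: float = 3.0) -> float:
    """Toy surge model: scale price with the demand/supply ratio.

    Invented for illustration only - not Uber's actual pricing logic.
    """
    if available_drivers == 0:
        return cap
    ratio = open_requests / available_drivers
    # No surge until demand outstrips supply, then grow linearly up to a cap.
    return min(max(base, ratio), cap)

# Freshness matters: these counts must reflect the last few seconds,
# which is why events stream through Kafka/Flink rather than batch jobs.
print(surge_multiplier(open_requests=180, available_drivers=100))  # 1.8x
```

Stale counts here mean mispriced rides - which is exactly why the data freshness & availability requirements above are non-negotiable for this use case.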

The Architecture

That’s all for two minutes!

the missing pieces

Wanna learn more about the details?

Tune in next week.

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

And if you really enjoy the newsletter in general - please forward it to a fellow engineer who would be interested in this topic. It only takes 10 seconds. Writing it takes me 10+ hours.

🗣Recent Social Posts

What more did we post on socials?
Let’s see:

  • 🎥 (Podcast) Top Tips to Run Kafka in Prod… in 120 seconds 🤯