
the requirements of: Uber's Exabyte-Scale Data Infrastructure (The Uber Series Part I)

how 9 open source technologies solve for massive scale that continues to grow exponentially amidst complex, competing requirements

Nobody does data like Uber.

With more monthly users (137M) than Mexico’s entire population 🇲🇽 - you know they have data:

  • 1 EXABYTE of data across tens of thousands of servers in EACH of their two regions (that’s 2,000,000 terabytes total)

  • 500,000+ Hive Tables

  • 400,000+ Spark apps run each day, processing HUNDREDS of PETABYTES

  • Each day, they stream 12 TRILLION messages worth 7.7 petabytes of data. 🔥
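A quick back-of-envelope calculation puts that firehose in perspective (the inputs are just the figures above; decimal/SI units assumed):

```python
MESSAGES_PER_DAY = 12e12           # 12 trillion messages/day (figure from above)
BYTES_PER_DAY = 7.7e15             # 7.7 PB/day, assuming decimal (SI) petabytes
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

avg_msg_size = BYTES_PER_DAY / MESSAGES_PER_DAY       # ~642 bytes/message
msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY     # ~139 million msgs/sec
gb_per_sec = BYTES_PER_DAY / SECONDS_PER_DAY / 1e9    # ~89 GB/sec sustained

print(f"~{avg_msg_size:.0f} B/msg, ~{msgs_per_sec / 1e6:.0f}M msgs/sec, ~{gb_per_sec:.0f} GB/sec")
```

That’s roughly 640 bytes per message at ~139 million messages per second - sustained, all day long.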

Let’s go through their stack:

The Stack

Uber’s Data Infra Stack

  • GCS ☁️

  • Apache Hadoop - HDFS 🐘

  • Apache Spark ⚡️

  • Apache Kafka 🏎️

  • Apache Pinot 🍷

  • Apache Flink 🐿️

  • Apache Parquet 🪵

  • Apache Hudi 🔼

  • Apache Hive 🐝

  • Presto 🪄

The common denominator?

Apache. 🪶

It’s all open source. (well… 99% - GCS is the lone exception)
The reasons they cited:

The Requirements

They generally relate to these aspects:

  • consistency / exactly once - mission-critical financial dashboards require consistent data across regions. Data can’t be skewed when it comes to 💵 ! (see the sketch after this list)

  • availability - 99.99%. Loss of availability == significant financial losses (e.g. being unable to price rides)

  • data freshness - most use cases require seconds-level freshness, i.e. data must be available for processing seconds after being produced (critical for reacting to demand-supply skews, etc.)

  • query latency - some use cases require p99 latency < 1 second.

    • e.g. the UberEats Restaurant Manager dashboard executes a few analytical queries on each page load

  • scalability - the ability to scale with the ever-growing data set (petabytes collected a day) without re-architecting is a fundamental requirement

  • cost - Uber is a low-margin business, so they need to keep infrastructure costs low.

    • This influences a lot of decisions - like how much data to keep in memory, tiered storage, pre-materialization vs. runtime computation, etc.

  • flexibility - they need to provide both SQL and programmatic interfaces.
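To make the exactly-once requirement concrete, here’s a minimal sketch of a transactional Kafka producer in Python using the confluent-kafka client. The topic names, keys, values, and transactional id are made up for illustration - Uber’s actual pipelines aren’t public:

```python
from confluent_kafka import Producer

# Hypothetical config - the topic names and transactional.id below are
# illustrative only, not Uber's.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "payments-pipeline-1",  # enables transactional writes
    "enable.idempotence": True,                 # no duplicates on retry
    "acks": "all",                              # wait for all in-sync replicas
})

producer.init_transactions()
producer.begin_transaction()
try:
    # Both writes commit atomically - or neither does.
    producer.produce("trip-payments", key="trip-123", value=b'{"amount_usd": 14.50}')
    producer.produce("driver-payouts", key="trip-123", value=b'{"amount_usd": 11.60}')
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()  # read_committed consumers never see these
    raise
```

Consumers running with isolation.level=read_committed never observe messages from aborted transactions - the building block Kafka (and Flink’s Kafka sink) uses for exactly-once delivery between topics.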

You can’t guarantee all of these for one use case. (the CAP theorem alone forbids some combinations)

So it’s a question of how you provide the right mixture for each particular use case.

More Challenges

As if that wasn’t hard enough! 😮‍💨

Like the axes of a 3D cube, they have requirements growing in 3 orthogonal directions:

the 3 dimensions

  1. Scaling Data - total incoming data volume is growing at an exponential rate

    1. The replication factor & multiple geo regions multiply the data volume further.

    2. Can’t afford to regress on data freshness, e2e latency & availability while growing.

  2. Scaling Use Cases - new use cases arise from various verticals & groups, each with competing requirements.

  3. Scaling Users - their diverse users fall across a wide spectrum of technical skill. (some have none, some a lot)

The Use Cases

just some of the use cases

  • compute Surge Pricing - dynamic trip prices based on real-time supply & demand (toy sketch after this list)

    • needs data freshness, availability

  • fine-tune your ETA prediction ML model with real-world outcomes

    • needs scale: absurd volume & cardinality - it’s the highest-QPS model in their stack

  • power internal dashboards for ad-hoc exploration (e.g. Growth Marketing teams figuring out which incentives grow the business)

    • needs flexibility

  • generate reports for your Ops team to share with local authorities

    • needs completeness

  • push notifications / track ad impressions

    • needs exactly once, scale, low latency

  • UberEats:

    • visualize a restaurant’s numbers in a dashboard

      • needs low latency, data completeness

    • generate daily restaurant statements

      • needs completeness

    • show popular orders near me

      • needs low latency
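And here’s the promised toy sketch of the surge-pricing use case. The formula, function, and numbers are invented for illustration - Uber’s real model is ML-driven and far more sophisticated (per-area, smoothed over time windows):

```python
def surge_multiplier(open_requests: int, available_drivers: int,
                     base: float = 1.0, cap: float = 3.0) -> float:
    """Toy surge model: scale price with the demand/supply ratio.

    Invented for illustration only - not Uber's actual pricing logic.
    """
    if available_drivers == 0:
        return cap
    ratio = open_requests / available_drivers
    # No surge until demand outstrips supply, then grow linearly up to a cap.
    return min(max(base, ratio), cap)

# Freshness matters: these counts must reflect the last few seconds,
# which is why events stream through Kafka/Flink rather than batch jobs.
print(surge_multiplier(open_requests=180, available_drivers=100))  # 1.8x
```

Stale counts here mean mispriced rides - which is exactly why the data freshness & availability requirements above are non-negotiable for this use case.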

The Architecture

That’s all for two minutes!

the missing pieces

Wanna learn more about the details?

Tune in next week.

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

And if you really enjoy the newsletter in general - please forward it to a fellow engineer who would be interested in this topic. It only takes 10 seconds. Writing it takes me 10+ hours.

🗣Recent Social Posts

What more did we post on socials?
Let’s see:

  • 🎥 (Podcast) Top Tips to Run Kafka in Prod… in 120 seconds 🤯