
The Numbers Behind Uber's Data Infrastructure (The Uber Series Part II)

What it takes to process 100s of GB/s with competing requirements

Last edition we mentioned:

  • the 9 open source Apache products Uber uses

  • the exabyte-scale they operate at

  • the complex requirements their data infra demands

With:

  • 138M messages/s processing 89GB/s in Kafka

  • 170k+ peak QPS, with 1M+ events a second and 6k+ tables over 800 nodes in Pinot

  • 4000 jobs processing 6.5 petabytes a day in Flink

  • 500K Presto queries/day reading over 90PB a day

    • (7k weekly active users, 3 regions, 12k nodes, 20 clusters)

  • 400K Spark apps/day

    • and 10k+ processing hundreds of petabytes every day - Spark accounts for more than 95% of analytics compute resources

  • 2M Hive queries/day over 500k+ tables

  • an HDFS with 10B calls a day and 150k peak RPS storing exabytes of data

This is a master class in what a world-renowned data infrastructure stack can look like - a must-see for any data engineer.


Lambda Architecture

Their architecture maintains two separate pipelines for different use cases:

  • real-time processing: Flink processes the data, and Pinot stores it to serve in real time

  • batch processing: the data is ingested through Spark into an HDFS data lake using Hudi as the ingestion table format and Parquet as the file format.

    • Hive-on-Spark jobs then transform the data to build ML models

  • SQL for everything: Presto lets you query all of these sources, as well as deploy jobs.

All the data comes from Kafka.

Kafka

It all starts with Kafka: data is ingested into it and fanned out to the appropriate downstream systems.

Batch - Data Lake

Storage - HDFS & GCS

Their data lake is largely based on HDFS, but it is slowly moving to Google Cloud Storage (GCS) now.

HDFS is so useful that other platforms use it to manage their own storage too - e.g. Flink and Pinot.

Formats

The data in the lake is in the Apache Hive and Hudi table formats.

The main file format is Apache Parquet. Avro is used for data in motion.

Apache Hudi was originally created at Uber to solve incremental data processing - the traditional table formats don’t support it efficiently, requiring you to rewrite the whole table (terabytes+) rather than editing just the changed records.

Hudi decreased their batch runtime by 82%! 🔥

Hive & Spark

Spark accounts for more than 95% of Uber’s analytics compute resources…

With such scale - what isn’t Spark used for?

They also run Hive as a query engine on top of the Spark execution framework.

Real Time

Flink is used for both customer-facing products (e.g. city-specific market condition calculations, surge pricing) and internal analytics (e.g. global financial estimations).

Pinot

Pinot is used for everything OLAP, getting data from both the real-time system and the batch system.

Especially user-facing OLAP - low-latency, high-QPS cases.

A lot of pre-aggregated tables are used here to make querying faster.

Presto

Presto allows them to construct ad-hoc queries that join tables from different systems together.

As we will cover in a later newsletter, Uber actually integrated Presto very tightly with Pinot (which didn’t support SQL before).

Next edition, we will cover how SQL dictates everything Uber does.

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

Where else are you going to find content like this?

It took me 40+ hours to compile this information - browsing through way too many blog posts, watching too many talks, and directly reaching out to numerous stakeholders at Uber.

It takes 2 seconds to boost this post, whereas this took me 2 weeks to gather!

Why don’t you? 😇

NOTE: This was not an exhaustive list of all the infrastructure in Uber.

A large company like Uber uses a ton of technologies (teams are free to choose after all), but they tend to converge and standardize on a few. These are the main technologies used.

Nevertheless, as a counter-example, there are cases where Uber uses a Kappa architecture as well.


More Content?

Make sure to follow me on all platforms so you don’t miss anything.