2 Minute Streaming
Posts
Stochastic Sublinear Streaming Algorithms

Stochastic Sublinear Streaming Algorithms

✍️ drawing a sketch is always faster than the full painting

Stanislav Kozlovski
July 24, 2023

The Big Data Problem

Imagine you start in a new company as someone in charge of their real-time streaming infrastructure.

You are tasked with the problem of computing the percentile distribution of the company’s terabytes’ stream of data consisting of billions of records in Kafka, each representing a latency metric data point.

Calculating a percentile for a large dataset is very expensive, to do so - you need to:

store all the values
sort them
return the value whose rank matches the percentile (e.g 99th item)

Such big data aggregations are very tricky to solve. There is no way you can do this with terabytes.

The solution?

Streaming Stochastic Algorithms

Also called Sketch algorithms, these are algorithms that trade off a bit of accuracy for massive efficiency gains. They are:

probabilistic (not 100% correct) - they usually have a strict, known error bound.
one-pass - they go over each item in the stream only once.
have sub-linear space growth - input data grows, but the algorithm’s memory requirement does NOT grow linearly with it.
parallelizable & composable - you can split the data into two sets, compute sketches on them and then merge the results while guaranteeing the same accuracy. Parallelization can really scale this to infinity.
data insensitive - Big Data is extremely messy and disorganized. These algorithms handle things like NaN, infinites, nulls and are insensitive to the distribution/order of the data.

DataDog’s Example

In the example above, you were actually placed in DataDog.

They invented their own sketch algorithm named DDSketch.

It offers a 2% relative error bound, which means that if the true p99 is 60s → the sketch would return 58.8-61.2s.

The algorithm is pretty simple:

Create buckets covering ranges of the desired error rate (+-2% in this case)
Each bucket keeps a counter of the amount of data points within that range.
When processing an item (latency metric data point), increment the counter of the appropriate bucket
To count the desired percentile, you sum up the bucket’s values until you get to the desired percentile. Whatever bucket that percentile is in - that’s your value.

With this, you only need:

- 275 buckets
- ~2KB
to cover the range from 1 millisecond to 1 minute.

Another key point? This can be endlessly parallelized.

As we learned from S3, parallelization is key to unlocking great performance at tremendous scale.

Notice - merging the sketch results together is as simple as merging two dictionaries/hashmaps of size 275!

an example of mergeability

Other Uses

Sketch algorithms are used heavily in the industry for other things like:

uniqueness - distinct elements
frequency of items (heavy hitters)
set union/intersection/difference
AI large vector/matrix decomposition
graph analysis - connectivity, weighted matching

For more examples, see the social media links below.

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

More Sketches? 🔥

Before we go into this week’s content, let me iterate through the stuff I’ve posted about Sketch Algorithms. If you found this edition interesting, you’ll surely love reading more about them!

🦩 How LinkedIn reduces their data usage by 88% (1TB → 120GB)

How did LinkedIn use a sketch algorithm in Apache Pinot to achieve:
• an 88% reduction of data (1TB → 120GB) 🔥
• improve data freshness 50%?
🧵 After this short thread, you will have:
• learned a way to compute complex audience intersection sizes
• seen some good memes 😁
— Stanislav Kozlovski (@BdKozlovski)
1:45 PM • Jul 24, 2023

🔮 How Reddit stored 80MB in 12KB via HyperLogLog

You want to solve the cardinality problem with 80 megabytes worth of unique IDs.
How do you do it using just 12 kilobytes?
Simple - read the story of how Reddit did it: 👇
In 2017, Reddit wanted to better communicate the scale of its communities to its users.
The easiest way… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
7:50 AM • Jul 1, 2023

🧩 The Mergeability Property of Sketch Algorithms

The great thing about Sketches?
They’re infinitely parallelizable. 🧬
Sketch Algorithms possess one very valuable quality - and that is their additivity.
Most algorithms allow you to combine their results together to generate a brand new sketch over the sum of all of the… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
5:34 PM • Jul 4, 2023

🗣This Week’s Socials

What more did we post on socials this week?
Let’s see:

🎭 Uber’s Real-Time Analytics Data Infra

This is what an optimal real-time analytics data infrastructure looks like:👇
Uber has paved the way in showing how to both:
• build infrastructure to support massive amounts of data.
• leverage the data in diverse, often conflicting, use cases.
Each day, they process… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
3:06 PM • Jul 19, 2023

🐧 What Kafka’s High Watermark Offset Is

I used to misunderstand Kafka’s concept of the high watermark.
That’s normal - the concept is entangled with advanced distributed systems terms - replication, fault tolerance & durability.
But once I visualized it, it became clear.
Let me explain it first, and then you can… httptwitter.com/i/web/status/1…p
— Stanislav Kozlovski (@BdKozlovski)
3:29 PM • Jul 22, 2023

🏎 10 Kafka Performance Tips to 10x

Ever wish someone had made a performance checklist for Kafka?
Here it is. 🔥
PS: What did I miss? twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
3:46 PM • Jul 21, 2023

📕 Wise’s High Level Streaming Library

😰 Writing your stream processing with low-level APIs
😠 Writing your stream processing with high-level abstractions
What's right?
The right abstraction, of course! 😇
But most companies and projects lack them.
Which is normal - as one size does not fit all.
The only way to… twitter.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
6:25 PM • Jul 20, 2023