Stochastic Sublinear Streaming Algorithms

✍️ drawing a sketch is always faster than the full painting

The Big Data Problem

Imagine you join a new company as the person in charge of its real-time streaming infrastructure.

You are tasked with computing percentile distributions over the company’s multi-terabyte data stream: billions of records in Kafka, each representing a latency metric data point.

Calculating an exact percentile over a large dataset is very expensive. To do so, you need to:

  1. store all the values

  2. sort them

  3. return the value whose rank matches the percentile (e.g. the item at the 99th-percentile rank), as shown in the sketch below
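
For a sense of why that breaks down, here is a minimal Python sketch of the naive, exact approach (the helper name exact_percentile is just illustrative):

```python
def exact_percentile(values, p):
    """Exact percentile: store everything, sort, index by rank.

    Fine for small inputs, but memory grows linearly with the stream,
    which is exactly what breaks down at terabyte scale.
    """
    data = sorted(values)                      # store + sort all values
    rank = round(p / 100 * (len(data) - 1))    # rank of the requested percentile
    return data[rank]

# p99 of a small in-memory list works fine; billions of records do not fit.
print(exact_percentile([120, 80, 95, 300, 60, 250], 99))
```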

Such big data aggregations are very tricky to solve: with terabytes of data, there is simply no practical way to store and sort everything.

The solution?

Streaming Stochastic Algorithms

Also called Sketch algorithms, these are algorithms that trade off a bit of accuracy for massive efficiency gains. They are:

  • probabilistic (not 100% correct) - but they usually come with a strict, known error bound.

  • one-pass - they go over each item in the stream only once.

  • sub-linear in space - the input data grows, but the algorithm’s memory requirement does NOT grow linearly with it.

  • parallelizable & composable - you can split the data in two, compute a sketch on each half, and then merge the results while keeping the same accuracy guarantee. Parallelization is what lets this scale almost without limit.

  • data insensitive - Big Data is extremely messy and disorganized. These algorithms handle things like NaNs, infinities and nulls, and are insensitive to the distribution/order of the data.

DataDog’s Example

In the example above, the company you joined was actually DataDog.

They invented their own sketch algorithm named DDSketch.

It offers a 2% relative error bound, which means that if the true p99 is 60s, the sketch would return a value between 58.8s and 61.2s.

The algorithm is pretty simple:

  1. Create buckets whose ranges grow with the desired error rate (±2% in this case), so that every value in a bucket is within ±2% of the bucket’s representative value

  2. Each bucket keeps a counter of the amount of data points within that range.

  3. When processing an item (latency metric data point), increment the counter of the appropriate bucket

  4. To compute the desired percentile, sum up the bucket counts in order until the running total reaches the percentile’s rank. Whatever bucket that rank falls into - that bucket’s representative value is your answer (see the sketch after this list).
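
To make the four steps concrete, here is a minimal, self-contained Python sketch of the bucketing idea. It illustrates the technique, not Datadog’s actual implementation, and the class name LogBucketSketch is made up for this example:

```python
import math
import random
from collections import defaultdict


class LogBucketSketch:
    """A simplified DDSketch-style quantile sketch with relative-error buckets."""

    def __init__(self, relative_error=0.02):
        self.alpha = relative_error
        # Bucket boundaries grow geometrically by gamma; any value inside a bucket
        # is within ±alpha of that bucket's representative value.
        self.gamma = (1 + self.alpha) / (1 - self.alpha)
        self.buckets = defaultdict(int)  # bucket index -> count of data points
        self.count = 0

    def add(self, value):
        # Steps 2-3: map the (positive) value to a logarithmic bucket, bump its counter.
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q):
        # Step 4: walk the buckets in order, accumulating counts until the rank is reached.
        rank = q * (self.count - 1)
        running = 0
        for index in sorted(self.buckets):
            running += self.buckets[index]
            if running > rank:
                # Midpoint of the bucket keeps the relative error within ±alpha.
                return 2 * self.gamma ** index / (self.gamma + 1)
        return float("nan")


# Usage: feed latency data points one at a time, then query a percentile.
random.seed(0)
latencies = [random.lognormvariate(3, 1) for _ in range(100_000)]  # synthetic latency stream

sketch = LogBucketSketch(relative_error=0.02)
for latency_ms in latencies:
    sketch.add(latency_ms)

exact_p99 = sorted(latencies)[int(0.99 * (len(latencies) - 1))]
print(sketch.quantile(0.99), exact_p99)  # the two agree to within ~2%
```

The only state the sketch keeps is the map of bucket counters, which is what makes the memory footprint tiny and, as we’ll see below, the sketches mergeable.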

With this, you only need about 275 buckets, roughly 2KB of memory, to cover the range from 1 millisecond to 1 minute.
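
As a rough sanity check on those numbers, here is the back-of-the-envelope arithmetic, assuming the ±2% logarithmic buckets described above and an 8-byte counter per bucket:

```python
import math

alpha = 0.02                         # ±2% relative error
gamma = (1 + alpha) / (1 - alpha)    # growth factor between bucket boundaries
value_range = 60_000 / 1             # 1 minute expressed in ms, divided by 1 ms

buckets = math.ceil(math.log(value_range, gamma))
print(buckets)              # 276 -> roughly the ~275 buckets quoted above
print(buckets * 8 / 1024)   # ~2.2 KB if each bucket is an 8-byte counter
```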

Another key point? This can be endlessly parallelized.

As we learned from S3, parallelization is key to unlocking great performance at tremendous scale.

Notice that merging two sketch results is as simple as merging two dictionaries/hashmaps of ~275 entries each - you just sum the counts of matching buckets!
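
Sketched under the same simplified representation as above (a plain map of bucket index to count), a merge could look like this; merge_buckets is a hypothetical helper, not a DDSketch API:

```python
from collections import Counter

def merge_buckets(buckets_a, buckets_b):
    # Both sketches use the same bucket boundaries, so merging is just
    # summing the counters of matching bucket indices.
    return Counter(buckets_a) + Counter(buckets_b)

# Two workers each sketch half of the stream...
worker_1 = {41: 120, 97: 5_300, 157: 88}
worker_2 = {41: 95, 130: 2_400, 157: 11}

# ...and the coordinator merges the ~275-entry maps with no loss of accuracy.
print(merge_buckets(worker_1, worker_2))
# Counter({97: 5300, 130: 2400, 41: 215, 157: 99})
```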

(Figure: an example of mergeability)

Other Uses

Sketch algorithms are used heavily in the industry for other things like:

  • uniqueness - distinct elements

  • frequency of items (heavy hitters)

  • set union/intersection/difference

  • AI - large vector/matrix decomposition

  • graph analysis - connectivity, weighted matching

For more examples, see the social media links below.

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

More Sketches? 🔥

Before we go into this week’s content, let me iterate through the stuff I’ve posted about Sketch Algorithms. If you found this edition interesting, you’ll surely love reading more about them!

🗣 This Week’s Socials

What more did we post on socials this week?
Let’s see: