Bigstrum
Data Engineering·8 min read·Dec 2025

Ingesting 5 Petabytes a Day: Rebuilding Our Data Pipeline in Rust

When our telemetry pipeline began buckling under its own weight, adding more Kafka brokers stopped helping. A deep dive into custom partitioning strategies and ditching JVM overhead to reclaim throughput.


Overview

At 2 petabytes per day, the JVM-based telemetry pipeline was spending 68% of its CPU budget on garbage collection. Throughput had plateaued. Adding Kafka partitions helped, until it didn't. At a projected 5PB/day, back-of-envelope math showed the existing architecture would need a cluster three times its current size just to keep up. We rebuilt the critical path in Rust.

The Problem
  • 01

    GC pauses on the JVM consumer fleet were introducing 200–400ms latency spikes every 30–90 seconds under sustained load

  • 02

    Kafka's default partitioning strategy created hot partitions under skewed telemetry device distributions

  • 03

    Deserialization of a mix of Avro, Protobuf, and legacy JSON schemas added significant per-message overhead

  • 04

    The pipeline had no backpressure mechanism — upstream producers would overwhelm consumers during traffic bursts
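The hot-partition problem (02 above) is easy to reproduce. A minimal sketch, with an invented skewed device distribution and `DefaultHasher` standing in for Kafka's murmur2: hashing the key and taking it modulo the partition count sends every message from a chatty device class to the same partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Roughly what a default keyed partitioner does: hash the key, mod the
// partition count. (Kafka uses murmur2; DefaultHasher stands in here.)
fn default_partition(key: &str, num_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_partitions
}

fn main() {
    // Invented skew for illustration: one chatty device emits 80% of traffic.
    let mut per_partition: HashMap<u64, u64> = HashMap::new();
    for i in 0..10_000u64 {
        let key = if i % 5 != 0 {
            "device-chatty-0001".to_string()
        } else {
            format!("device-{i}")
        };
        *per_partition.entry(default_partition(&key, 12)).or_insert(0) += 1;
    }
    // All 8,000 chatty messages land on a single partition.
    let hottest = per_partition.values().max().copied().unwrap_or(0);
    println!("hottest of 12 partitions received {hottest} of 10000 messages");
    assert!(hottest >= 8_000);
}
```

With a uniform key distribution the same code spreads load evenly; the partitioner is only as good as the key distribution it is fed.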

Our Approach

The consumer layer was rewritten in Rust using Tokio for async I/O and a custom Kafka client that implemented consistent-hash partitioning based on device ID cardinality. A unified schema registry with zero-copy deserialization eliminated per-message allocation. Backpressure was implemented via a token-bucket rate limiter at the ingestion boundary. The result: 5PB/day on 40% fewer nodes with p99 latency under 12ms.
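The consistent-hash partitioner described above can be sketched in a few dozen lines. This is a simplified std-only illustration, not the production client: partition IDs, vnode counts, and `DefaultHasher` are assumptions, and a real implementation would tune virtual-node counts to the observed device-ID distribution.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

/// Minimal consistent-hash ring: each partition owns many virtual
/// points on a u64 ring; a device ID maps to the first point at or
/// clockwise after its own hash, wrapping around at the top.
struct HashRing {
    points: BTreeMap<u64, u32>, // ring position -> partition id
}

impl HashRing {
    fn new(partitions: u32, vnodes_per_partition: u32) -> Self {
        let mut points = BTreeMap::new();
        for p in 0..partitions {
            for v in 0..vnodes_per_partition {
                points.insert(hash_of(&format!("partition-{p}-vnode-{v}")), p);
            }
        }
        HashRing { points }
    }

    fn partition_for(&self, device_id: &str) -> u32 {
        let h = hash_of(device_id);
        match self.points.range(h..).next() {
            Some((_, p)) => *p,
            None => *self.points.values().next().unwrap(), // wrap around
        }
    }
}

fn main() {
    let ring = HashRing::new(12, 64);
    let p = ring.partition_for("device-4711");
    println!("device-4711 -> partition {p}");
    assert!(p < 12);
}
```

The virtual nodes are what smooth out skew: with enough of them, a hot key range is split across many partitions instead of landing on one.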

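The token-bucket rate limiter at the ingestion boundary can be sketched as follows. This is a deterministic std-only illustration under assumed names; the production version would sit inside the Tokio consumer and be driven by a real clock.

```rust
/// Token-bucket limiter at the ingestion boundary: `capacity` bounds
/// burst size, `refill_per_sec` bounds the sustained rate. Refill is
/// driven explicitly by the caller, which keeps the core deterministic
/// and easy to test.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec }
    }

    /// Credit tokens for `elapsed_secs` of wall time, capped at capacity.
    fn refill(&mut self, elapsed_secs: f64) {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec)
            .min(self.capacity);
    }

    /// Admit `n` messages if the budget allows; a `false` return is the
    /// backpressure signal to the upstream producer.
    fn try_acquire(&mut self, n: f64) -> bool {
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(3.0, 1.0); // burst of 3, 1 msg/sec sustained
    assert!(bucket.try_acquire(1.0));
    assert!(bucket.try_acquire(1.0));
    assert!(bucket.try_acquire(1.0));
    assert!(!bucket.try_acquire(1.0)); // empty: stall or shed upstream
    bucket.refill(2.0); // two seconds pass
    assert!(bucket.try_acquire(1.0));
    println!("backpressure engaged and released as expected");
}
```

The design choice worth noting: the `false` path must propagate to producers (stalled fetches, rejected batches) rather than silently dropping, or the limiter becomes a data-loss mechanism instead of a backpressure one.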
Key Takeaways
  • 01

    GC pause budgets compound at petabyte scale — if you are seeing 200ms spikes at 2PB, you will not survive 5PB

  • 02

    Custom partitioning based on your actual key distribution beats Kafka defaults every time at this scale

  • 03

    Zero-copy deserialization is not premature optimisation when you are processing billions of messages per day

  • 04

    Backpressure is an architectural decision, not an afterthought — design it in before you need it
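Takeaway 03 is concrete enough to sketch. The wire format below is invented for illustration (a 2-byte big-endian length prefix); the point is that the parsed record borrows straight from the receive buffer, so handling a message costs no allocation and no copy.

```rust
/// A record view that borrows directly from the receive buffer:
/// parsing produces no per-message allocation or copy.
struct RecordView<'a> {
    payload: &'a [u8],
}

/// Parse one `[2-byte big-endian length][payload]` frame, returning the
/// record view and the unconsumed remainder of the buffer.
fn parse_frame(buf: &[u8]) -> Option<(RecordView<'_>, &[u8])> {
    if buf.len() < 2 {
        return None;
    }
    let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
    if buf.len() < 2 + len {
        return None;
    }
    Some((RecordView { payload: &buf[2..2 + len] }, &buf[2 + len..]))
}

fn main() {
    let wire = b"\x00\x03abc\x00\x02hi";
    let (first, rest) = parse_frame(wire).unwrap();
    assert_eq!(first.payload, &b"abc"[..]);
    let (second, rest) = parse_frame(rest).unwrap();
    assert_eq!(second.payload, &b"hi"[..]);
    assert!(rest.is_empty());
    println!("parsed two frames with zero copies");
}
```

At billions of messages per day, the allocation this avoids is exactly the per-message overhead that problem 03 attributes to the mixed Avro/Protobuf/JSON deserialization path.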

Article Details


Tech Stack

Rust · Tokio · Apache Kafka · Avro · Protobuf · ClickHouse

