Bigstrum
Data Engineering·8 min read·Dec 2025

Ingesting 5 Petabytes a Day: Rebuilding Our Data Pipeline in Rust

When our telemetry pipeline began buckling under its own weight, adding more Kafka brokers stopped helping. A deep dive into custom partitioning strategies and ditching JVM overhead to reclaim throughput.


Overview

At 2 petabytes per day, the JVM-based telemetry pipeline was spending 68% of its CPU budget on garbage collection. Throughput had plateaued. Adding Kafka partitions helped, until it didn't. At a projected 5PB/day, back-of-envelope math showed the existing architecture would need a cluster three times its current size just to keep up. We rebuilt the critical path in Rust.

The Problem
  • 01

    GC pauses on the JVM consumer fleet were introducing 200–400ms latency spikes every 30–90 seconds under sustained load

  • 02

    Kafka's default partitioning strategy created hot partitions under skewed telemetry device distributions

  • 03

    Deserialization of a mix of Avro, Protobuf, and legacy JSON schemas added significant per-message overhead

  • 04

    The pipeline had no backpressure mechanism — upstream producers would overwhelm consumers during traffic bursts
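The hot-partition problem (02 above) is easy to reproduce. A minimal sketch, with an invented skewed device distribution and `DefaultHasher` standing in for Kafka's murmur2: hashing the key and taking it modulo the partition count sends every message from a chatty device class to the same partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Roughly what a default keyed partitioner does: hash the key, mod the
// partition count. (Kafka uses murmur2; DefaultHasher stands in here.)
fn default_partition(key: &str, num_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_partitions
}

fn main() {
    // Invented skew for illustration: one chatty device emits 80% of traffic.
    let mut per_partition: HashMap<u64, u64> = HashMap::new();
    for i in 0..10_000u64 {
        let key = if i % 5 != 0 {
            "device-chatty-0001".to_string()
        } else {
            format!("device-{i}")
        };
        *per_partition.entry(default_partition(&key, 12)).or_insert(0) += 1;
    }
    // All 8,000 chatty messages land on a single partition.
    let hottest = per_partition.values().max().copied().unwrap_or(0);
    println!("hottest of 12 partitions received {hottest} of 10000 messages");
    assert!(hottest >= 8_000);
}
```

With a uniform key distribution the same code spreads load evenly; the partitioner is only as good as the key distribution it is fed.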

Our Approach

The consumer layer was rewritten in Rust using Tokio for async I/O and a custom Kafka client that implemented consistent-hash partitioning based on device ID cardinality. A unified schema registry with zero-copy deserialization eliminated per-message allocation. Backpressure was implemented via a token-bucket rate limiter at the ingestion boundary. The result: 5PB/day on 40% fewer nodes with p99 latency under 12ms.
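The consistent-hash partitioner described above can be sketched in a few dozen lines. This is a simplified std-only illustration, not the production client: partition IDs, vnode counts, and `DefaultHasher` are assumptions, and a real implementation would tune virtual-node counts to the observed device-ID distribution.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

/// Minimal consistent-hash ring: each partition owns many virtual
/// points on a u64 ring; a device ID maps to the first point at or
/// clockwise after its own hash, wrapping around at the top.
struct HashRing {
    points: BTreeMap<u64, u32>, // ring position -> partition id
}

impl HashRing {
    fn new(partitions: u32, vnodes_per_partition: u32) -> Self {
        let mut points = BTreeMap::new();
        for p in 0..partitions {
            for v in 0..vnodes_per_partition {
                points.insert(hash_of(&format!("partition-{p}-vnode-{v}")), p);
            }
        }
        HashRing { points }
    }

    fn partition_for(&self, device_id: &str) -> u32 {
        let h = hash_of(device_id);
        match self.points.range(h..).next() {
            Some((_, p)) => *p,
            None => *self.points.values().next().unwrap(), // wrap around
        }
    }
}

fn main() {
    let ring = HashRing::new(12, 64);
    let p = ring.partition_for("device-4711");
    println!("device-4711 -> partition {p}");
    assert!(p < 12);
}
```

The virtual nodes are what smooth out skew: with enough of them, a hot key range is split across many partitions instead of landing on one.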

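The token-bucket rate limiter at the ingestion boundary can be sketched as follows. This is a deterministic std-only illustration under assumed names; the production version would sit inside the Tokio consumer and be driven by a real clock.

```rust
/// Token-bucket limiter at the ingestion boundary: `capacity` bounds
/// burst size, `refill_per_sec` bounds the sustained rate. Refill is
/// driven explicitly by the caller, which keeps the core deterministic
/// and easy to test.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec }
    }

    /// Credit tokens for `elapsed_secs` of wall time, capped at capacity.
    fn refill(&mut self, elapsed_secs: f64) {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec)
            .min(self.capacity);
    }

    /// Admit `n` messages if the budget allows; a `false` return is the
    /// backpressure signal to the upstream producer.
    fn try_acquire(&mut self, n: f64) -> bool {
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(3.0, 1.0); // burst of 3, 1 msg/sec sustained
    assert!(bucket.try_acquire(1.0));
    assert!(bucket.try_acquire(1.0));
    assert!(bucket.try_acquire(1.0));
    assert!(!bucket.try_acquire(1.0)); // empty: stall or shed upstream
    bucket.refill(2.0); // two seconds pass
    assert!(bucket.try_acquire(1.0));
    println!("backpressure engaged and released as expected");
}
```

The design choice worth noting: the `false` path must propagate to producers (stalled fetches, rejected batches) rather than silently dropping, or the limiter becomes a data-loss mechanism instead of a backpressure one.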
Key Takeaways
  • 01

    GC pause budgets compound at petabyte scale — if you are seeing 200ms spikes at 2PB, you will not survive 5PB

  • 02

    Custom partitioning based on your actual key distribution beats Kafka defaults every time at this scale

  • 03

    Zero-copy deserialization is not premature optimisation when you are processing billions of messages per day

  • 04

    Backpressure is an architectural decision, not an afterthought — design it in before you need it
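Takeaway 03 is concrete enough to sketch. The wire format below is invented for illustration (a 2-byte big-endian length prefix); the point is that the parsed record borrows straight from the receive buffer, so handling a message costs no allocation and no copy.

```rust
/// A record view that borrows directly from the receive buffer:
/// parsing produces no per-message allocation or copy.
struct RecordView<'a> {
    payload: &'a [u8],
}

/// Parse one `[2-byte big-endian length][payload]` frame, returning the
/// record view and the unconsumed remainder of the buffer.
fn parse_frame(buf: &[u8]) -> Option<(RecordView<'_>, &[u8])> {
    if buf.len() < 2 {
        return None;
    }
    let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
    if buf.len() < 2 + len {
        return None;
    }
    Some((RecordView { payload: &buf[2..2 + len] }, &buf[2 + len..]))
}

fn main() {
    let wire = b"\x00\x03abc\x00\x02hi";
    let (first, rest) = parse_frame(wire).unwrap();
    assert_eq!(first.payload, &b"abc"[..]);
    let (second, rest) = parse_frame(rest).unwrap();
    assert_eq!(second.payload, &b"hi"[..]);
    assert!(rest.is_empty());
    println!("parsed two frames with zero copies");
}
```

At billions of messages per day, the allocation this avoids is exactly the per-message overhead that problem 03 attributes to the mixed Avro/Protobuf/JSON deserialization path.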

Article Details


Tech Stack

Rust · Tokio · Apache Kafka · Avro · Protobuf · ClickHouse

