Ingesting 5 Petabytes a Day: Rebuilding Our Data Pipeline in Rust
When our telemetry pipeline began buckling under its own weight, just adding more Kafka nodes stopped working. A deep dive into custom partitioning strategies and ditching JVM overhead to reclaim throughput.
Category
Data Engineering
Read Time
8 min
Published
Dec 2025
At 2 PB/day, the JVM-based telemetry pipeline was spending 68% of its CPU budget just managing garbage-collection pauses. Throughput had plateaued. Adding Kafka partitions helped, until it didn't. At the projected 5 PB/day, back-of-envelope math showed the existing architecture would need a cluster three times its current size just to keep up. We rebuilt the critical path in Rust.
Four problems drove the rewrite:

1. GC pauses on the JVM consumer fleet introduced 200–400 ms latency spikes every 30–90 seconds under sustained load.
2. Kafka's default partitioning strategy created hot partitions under skewed telemetry device distributions.
3. Deserializing a mix of Avro, Protobuf, and legacy JSON schemas added significant per-message overhead.
4. The pipeline had no backpressure mechanism; upstream producers would overwhelm consumers during traffic bursts.
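Hot partitions arise when skewed keys are mapped onto partitions naively; hashing each partition to many virtual points on a ring spreads a skewed device-ID population more evenly. A minimal sketch of such a ring (partition and vnode counts are illustrative assumptions, and `DefaultHasher` stands in for whatever hash the real client would use):

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

/// Consistent-hash ring: each partition owns many virtual points on a
/// u64 ring, so no single partition can capture a hot key range.
struct Ring {
    points: BTreeMap<u64, u32>, // ring position -> partition id
}

impl Ring {
    fn new(partitions: u32, vnodes_per_partition: u32) -> Self {
        let mut points = BTreeMap::new();
        for p in 0..partitions {
            for v in 0..vnodes_per_partition {
                points.insert(hash_of(&(p, v)), p);
            }
        }
        Ring { points }
    }

    /// Hash the key, then walk clockwise to the first virtual node.
    fn partition_for(&self, device_id: &str) -> u32 {
        let h = hash_of(&device_id);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap around the ring
            .map(|(_, &p)| p)
            .expect("ring is non-empty")
    }
}
```

More virtual nodes per partition means a more uniform spread at the cost of a larger ring; the `BTreeMap::range` lookup keeps routing at O(log n) per message.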
The consumer layer was rewritten in Rust using Tokio for async I/O and a custom Kafka client that implemented consistent-hash partitioning based on device ID cardinality. A unified schema registry with zero-copy deserialization eliminated per-message allocation. Backpressure was implemented via a token-bucket rate limiter at the ingestion boundary. The result: 5PB/day on 40% fewer nodes with p99 latency under 12ms.
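The zero-copy idea boils down to borrowing fields directly from the receive buffer instead of allocating owned copies per message. A minimal sketch, assuming a made-up one-byte length-prefixed wire format (not the article's actual schemas):

```rust
/// A parsed record that borrows from the receive buffer: no per-message
/// String/Vec allocation, just slices into bytes the consumer already holds.
struct TelemetryView<'a> {
    device_id: &'a str,
    payload: &'a [u8],
}

/// Hypothetical wire format: [id_len: u8][id bytes][payload bytes...].
fn parse(buf: &[u8]) -> Option<TelemetryView<'_>> {
    let (&id_len, rest) = buf.split_first()?;
    if rest.len() < id_len as usize {
        return None; // truncated message
    }
    let (id, payload) = rest.split_at(id_len as usize);
    Some(TelemetryView {
        device_id: std::str::from_utf8(id).ok()?,
        payload,
    })
}
```

The lifetime ties the view to the buffer, so the compiler guarantees the slices never outlive the bytes they point into; at billions of messages per day, skipping even one small allocation per message adds up.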
Lessons learned:

1. GC pause budgets compound at petabyte scale: if you are seeing 200 ms spikes at 2 PB/day, you will not survive 5 PB/day.
2. Custom partitioning based on your actual key distribution beats Kafka's defaults every time at this scale.
3. Zero-copy deserialization is not premature optimisation when you are processing billions of messages per day.
4. Backpressure is an architectural decision, not an afterthought; design it in before you need it.
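The backpressure lesson is cheap to act on early. A token-bucket limiter at the ingestion boundary can be sketched in a few dozen lines; the capacity and refill numbers here are illustrative, and a production version would sit behind the async accept path rather than a bare struct:

```rust
use std::time::Instant;

/// Token-bucket limiter: tokens refill at a steady rate, bursts spend
/// from the accumulated balance, and producers that exceed both are
/// rejected (or told to retry) instead of piling work onto consumers.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self {
            capacity,
            tokens: capacity, // start full so startup bursts are absorbed
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// Refill proportionally to elapsed time, then try to spend `n` tokens.
    fn try_acquire(&mut self, n: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}
```

Refilling lazily on each acquire (rather than from a timer) keeps the hot path allocation- and syscall-free, which matters when the limiter itself sits in front of millions of requests per second.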