Surviving the Spike: Scaling to 10M Concurrent WebSockets
When a global streaming event caused traffic to surge by 40,000% in minutes, traditional auto-scaling couldn’t react fast enough. We designed a system built for spikes (pre-warmed infrastructure, optimized WebSocket handling, and a simplified routing layer) to support over 10 million concurrent connections without compromising latency or reliability.
Standard cloud auto-scaling has a dirty secret: it takes 3–5 minutes to provision and warm new instances. When a global sporting event drove a 40,000% traffic spike in under three minutes, that gap was lethal. We had 180 seconds to absorb 10 million concurrent WebSocket connections — or drop the stream entirely.
The Problem
- Auto-scaling groups couldn't provision fast enough — new EC2 instances took 4+ minutes from trigger to ready state
- Standard Application Load Balancers impose a 3,000 concurrent connection limit per target by default
- WebSocket state was partially sticky to individual nodes, making seamless failover impossible
- Backpressure from 10M simultaneous connection attempts overwhelmed the TLS handshake queue
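To put the figures above in perspective, here is a rough back-of-the-envelope calculation. The connection count, spike window, and provisioning time come from this article; the per-node handshake rate is a purely hypothetical number chosen for illustration, not a measured value.

```python
import math

# Figures quoted in the article.
TARGET_CONNECTIONS = 10_000_000
SPIKE_WINDOW_S = 180          # "180 seconds to absorb" the spike
PROVISION_TIME_S = 240        # lower bound of "4+ minutes" per new instance

# Hypothetical assumption: TLS handshakes one node can complete per second.
ASSUMED_HANDSHAKES_PER_NODE_PER_S = 2_000


def required_handshake_rate(connections: int, window_s: int) -> float:
    """Average handshakes per second needed to admit everyone in the window."""
    return connections / window_s


def nodes_needed(rate: float, per_node_rate: int) -> int:
    """Nodes required to sustain the given handshake rate."""
    return math.ceil(rate / per_node_rate)


rate = required_handshake_rate(TARGET_CONNECTIONS, SPIKE_WINDOW_S)
nodes = nodes_needed(rate, ASSUMED_HANDSHAKES_PER_NODE_PER_S)

# Reactive capacity is useless if it becomes ready after the spike has passed.
reactive_arrives_too_late = PROVISION_TIME_S > SPIKE_WINDOW_S

print(f"Required rate: {rate:,.0f} handshakes/s")
print(f"Nodes needed at the assumed per-node rate: {nodes}")
print(f"Reactive scaling arrives too late: {reactive_arrives_too_late}")
```

Whatever the real per-node figure is, the last line is the important one: capacity that takes at least 240 seconds to come online cannot help with a spike that is over in 180.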
Our Approach
We replaced reactive auto-scaling with predictive pre-warming triggered by upstream ticketing data. Seventy-two hours before the event, we provisioned and fully warmed a dedicated WebSocket tier. That tier bypassed ALBs entirely in favour of a custom NLB plus an eBPF-based connection router, which distributed load at the kernel level before the TCP handshake. State was externalised to Redis Cluster with sub-millisecond replication lag.
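As a minimal sketch of what predictive pre-warming could look like, the function below turns an upstream ticketing signal into a node count and a warm-up start time. The 72-hour lead time is from this article; `plan_prewarm`, `PrewarmPlan`, the viewers-per-ticket ratio, the per-node connection capacity, and the headroom factor are all hypothetical names and values, not the production logic.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PrewarmPlan:
    start_at: datetime   # when to begin provisioning and warming
    node_count: int      # nodes to have fully warm before the event


def plan_prewarm(
    event_start: datetime,
    tickets_sold: int,
    viewers_per_ticket: float,      # assumed ratio, tuned from past events
    conns_per_node: int,            # assumed per-node connection capacity
    headroom: float = 1.25,         # assumed safety margin over the forecast
    lead_time: timedelta = timedelta(hours=72),
) -> PrewarmPlan:
    """Size the WebSocket tier from an upstream signal instead of live load."""
    expected_connections = tickets_sold * viewers_per_ticket
    node_count = math.ceil(expected_connections * headroom / conns_per_node)
    return PrewarmPlan(start_at=event_start - lead_time, node_count=node_count)


# Illustrative inputs only.
plan = plan_prewarm(
    event_start=datetime(2025, 6, 1, 18, 0, tzinfo=timezone.utc),
    tickets_sold=2_000_000,
    viewers_per_ticket=5.0,
    conns_per_node=250_000,
)
print(plan.start_at.isoformat(), plan.node_count)
```

The point of the design is that the inputs are known days ahead, so the plan can be computed, reviewed, and executed long before the first connection arrives.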
Key Takeaways
1. Reactive auto-scaling is fundamentally incompatible with sudden, predictable spikes; pre-warm based on upstream signals
2. ALBs are the wrong tool for massive WebSocket workloads; NLBs with custom routing give you the control you need
3. eBPF-based load distribution at the kernel level eliminates userspace overhead at scale
4. Externalise all connection state before you think you need to; retrofitting it under load is impossible
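To illustrate the last takeaway, here is a small sketch of externalised connection state, where any node can resume a session that another node started. It uses an in-memory dictionary as a stand-in for an external store such as Redis Cluster; `InMemoryStore`, `SessionRegistry`, the key format, and the record fields are hypothetical and do not reflect the actual schema.

```python
import json
import time
from typing import Dict, Optional, Tuple


class InMemoryStore:
    """Stand-in for an external key-value store with per-key expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_s: float) -> None:
        self._data[key] = (value, time.monotonic() + ttl_s)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazily drop expired sessions
            return None
        return value


class SessionRegistry:
    """Keeps per-connection state outside the node that holds the socket."""

    def __init__(self, store: InMemoryStore, ttl_s: float = 120.0) -> None:
        self._store = store
        self._ttl_s = ttl_s

    @staticmethod
    def _key(conn_id: str) -> str:
        return f"ws:session:{conn_id}"

    def save(self, conn_id: str, node_id: str, last_seq: int) -> None:
        record = {"node_id": node_id, "last_seq": last_seq}
        self._store.set(self._key(conn_id), json.dumps(record), self._ttl_s)

    def resume(self, conn_id: str, new_node_id: str) -> Optional[dict]:
        """Claim an existing session on a different node, if it still exists."""
        raw = self._store.get(self._key(conn_id))
        if raw is None:
            return None
        record = json.loads(raw)
        record["node_id"] = new_node_id
        self._store.set(self._key(conn_id), json.dumps(record), self._ttl_s)
        return record


# Two nodes share one store; node-b picks up where node-a left off.
store = InMemoryStore()
node_a = SessionRegistry(store)
node_b = SessionRegistry(store)
node_a.save("conn-42", "node-a", last_seq=1017)
resumed = node_b.resume("conn-42", "node-b")
print(resumed)
```

Because nothing about the session lives only in node-a's memory, losing that node costs the client a reconnect rather than its place in the stream.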