Insights · 06 articles

From the Engineering Desk

Technical deep-dives, architecture decisions, and lessons learned building software at scale.

Distributed Systems · Featured · Mar 2026 · 01 / 06

Surviving the Spike: Scaling to 10M Concurrent WebSockets

When a global streaming event caused traffic to surge by 40,000% in minutes, traditional auto-scaling couldn't react fast enough. We designed a system built for spikes (pre-warmed infrastructure, optimized WebSocket handling, and a simplified routing layer) to support over 10 million concurrent connections without compromising latency or reliability.

Standard cloud auto-scaling has a dirty secret: it takes 3–5 minutes to provision and warm new instances. When a global sporting event drove a 40,000% traffic spike in under three minutes, that gap was lethal. We had 180 seconds to absorb 10 million concurrent WebSocket connections — or drop the stream entirely.

The Problem

Auto-scaling groups couldn't provision fast enough — new EC2 instances took 4+ minutes from trigger to ready state
Standard Application Load Balancers impose a 3,000 concurrent connection limit per target by default
WebSocket state was partially sticky to individual nodes, making seamless failover impossible
Backpressure from 10M simultaneous connection attempts overwhelmed the TLS handshake queue

Our Approach

We replaced reactive auto-scaling with predictive pre-warming triggered by upstream ticketing data. Seventy-two hours before the event, we provisioned and fully warmed a dedicated WebSocket tier — bypassing ALBs entirely in favour of a custom NLB + eBPF-based connection router that distributed load at the kernel level before TCP handshake. State was externalised to Redis Cluster with sub-millisecond replication lag.
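The sizing arithmetic behind a pre-warmed tier can be sketched as follows. This is a minimal illustration rather than the team's tooling: the function name, the 250,000 connections-per-node figure, and the 30% headroom are assumptions chosen for the example. Only the 10 million peak comes from the article.

```python
def nodes_to_prewarm(peak_connections: int, conns_per_node: int,
                     headroom_pct: int = 30) -> int:
    """Number of nodes to provision and warm before the event.

    headroom_pct is spare capacity (in percent) reserved for node loss and
    uneven load distribution. All inputs are illustrative assumptions.
    """
    if peak_connections < 0 or conns_per_node <= 0 or headroom_pct < 0:
        raise ValueError("inputs must be non-negative and conns_per_node positive")
    required = peak_connections * (100 + headroom_pct)
    per_node = conns_per_node * 100
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-required // per_node)


if __name__ == "__main__":
    # 10M peak is from the article; 250k connections per node is hypothetical.
    print(nodes_to_prewarm(10_000_000, 250_000))
```

In practice the per-node figure would come from load testing the actual WebSocket tier, and the peak estimate from the upstream signal (here, ticketing data) rather than a constant.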

Key Takeaways

1. Reactive auto-scaling is fundamentally incompatible with sudden, predictable spikes; pre-warm based on upstream signals
2. ALBs are the wrong tool for massive WebSocket workloads; NLBs with custom routing give you the control you need
3. eBPF-based load distribution at the kernel level eliminates userspace overhead at scale
4. Externalise all connection state before you think you need to; retrofitting it under load is impossible

Tech Stack

Backend

eBPF, Rust

Database

Redis Cluster

Infrastructure

AWS EC2, NLB, Kubernetes

Read Time·7 min

Core Modernisation · Feb 2026 · 02 / 06

Strangling the Monolith: Zero-Downtime Migration for a Tier-1 Bank

Ripping out a 30-year-old mainframe while processing $5B in daily transactions. We detail the shadow-routing layer and dual-write database strategy that made the switch completely invisible to end users.

A tier-1 bank needed to decommission a 30-year-old COBOL mainframe that processed $5 billion in daily transactions. A hard cutover was out of the question. Any downtime window long enough to migrate cleanly would violate regulatory SLAs and trigger contractual penalties. The migration had to be invisible — to users, to auditors, and to the core banking system itself.

The Problem

The COBOL mainframe had no API surface — all integration was via batch file exchange and proprietary MQ queues
Transaction semantics varied by product line with undocumented edge cases accumulated over 30 years
Dual-write consistency had to be maintained across two fundamentally different transaction models simultaneously
Regulatory requirement: zero transaction loss with a full audit trail across both systems during transition

Our Approach

We implemented the strangler fig pattern with a custom shadow-routing proxy that intercepted all mainframe-bound traffic. Every transaction was dual-written: once to the mainframe (authoritative) and once to the new microservices layer (shadow). Comparison jobs ran continuously, flagging divergences for reconciliation. Traffic was shifted in 5% increments by product line over 14 months, with the mainframe remaining hot until the final product line reached 100% confidence.
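The dual-write-and-compare loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the bank's proxy: `ShadowRouter`, the two handler functions, and the transaction fields are hypothetical names, and the deliberate bug in `modern_handler` exists only to produce a divergence.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Handler = Callable[[dict], dict]


@dataclass
class ShadowRouter:
    authoritative: Handler   # stands in for the mainframe path
    shadow: Handler          # stands in for the new services layer
    handled: int = 0
    divergences: List[Dict] = field(default_factory=list)

    def handle(self, txn: dict) -> dict:
        """Dual-write one transaction and return the authoritative result."""
        primary = self.authoritative(txn)
        try:
            candidate = self.shadow(txn)
        except Exception as exc:  # a shadow failure must never reach the caller
            candidate = {"error": repr(exc)}
        self.handled += 1
        if candidate != primary:
            # Recorded for later reconciliation; the caller is unaffected.
            self.divergences.append(
                {"txn_id": txn["id"], "authoritative": primary, "shadow": candidate}
            )
        return primary

    def match_rate(self) -> float:
        """Fraction of handled transactions where both systems agreed."""
        if self.handled == 0:
            return 1.0
        return 1 - len(self.divergences) / self.handled


def legacy_handler(txn: dict) -> dict:
    return {"status": "ok", "amount_cents": txn["amount_cents"]}


def modern_handler(txn: dict) -> dict:
    # Deliberate bug for demonstration: mishandles the "loans" product line.
    extra = 1 if txn["product_line"] == "loans" else 0
    return {"status": "ok", "amount_cents": txn["amount_cents"] + extra}
```

Tracking the match rate per product line is one way to decide when a line is ready for its next traffic increment. The thresholds and increments themselves are policy decisions the sketch does not model.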

Key Takeaways

1. Shadow routing with continuous reconciliation is the only safe approach when you cannot tolerate a single lost transaction
2. Migrate by product line, not by technical layer; it gives you a blast radius you can reason about
3. Undocumented business logic is the primary risk in legacy migration; instrument everything before you move anything
4. Keep the old system authoritative until you have 99.999% reconciliation confidence, not 99%

Tech Stack

Backend

COBOL, Java Spring Boot

Database

Kafka, PostgreSQL

Infrastructure

Istio, AWS

Read Time·9 min

Edge Computing · Jan 2026 · 03 / 06

Sub-10ms Analytics: Pushing Inference to the Manufacturing Edge

Cloud round-trip latency is too slow for robotic defect detection on a live assembly line. Here is how we deployed highly quantized computer vision models directly onto ruggedized factory-floor hardware.

A high-volume electronics manufacturer was running defect detection through a cloud-hosted vision model. Cloud round-trip latency averaged 180ms — fast enough for dashboards, too slow for the assembly line robots that needed to act on results within 8ms or miss the rejection window entirely. The model had to move to the edge, but the factory floor imposed constraints that made standard edge deployments non-trivial.

The Problem

Factory-floor hardware was ruggedised ARM-based units with no GPU — standard inference frameworks were too slow
Intermittent connectivity meant cloud fallback was not a viable safety net
Model accuracy could not degrade below 99.2% — below that, false negatives reached an unacceptable rate
OTA update cycles on factory hardware are months-long; the deployment model had to be right first time

Our Approach

We applied INT8 post-training quantisation to a MobileNetV3 backbone, reducing model size by 74% with less than 0.3% accuracy loss. Inference ran via ONNX Runtime on the ARM units, hitting consistent 6ms latency. A local model registry handled versioning and rollback without cloud dependency. Drift detection ran on-device and triggered alerts when incoming image distributions deviated beyond a calibrated threshold.
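The arithmetic behind INT8 quantisation can be illustrated with a small, dependency-free sketch. It shows symmetric per-tensor quantisation of a list of weights. That is a simplifying assumption made here for clarity, not a description of the toolchain or calibration scheme the team used, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    # Each weight is stored in 1 byte instead of 4; the error per weight
    # is bounded by half the scale.
    print(q, [round(abs(a - b), 6) for a, b in zip(weights, restored)])
```

Storing one byte per weight instead of four is where the size reduction comes from. The accuracy cost depends on how well the chosen scale fits each tensor's value range, which is why it has to be measured on the real model.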

Key Takeaways

1. INT8 quantisation is usually the right first step for edge deployment; the accuracy trade-off is smaller than engineers expect
2. ONNX Runtime's ARM backend is production-grade; stop assuming you need a GPU for real-time inference
3. On-device drift detection is non-negotiable when the update cycle is months, not hours
4. Design for zero-connectivity from the start; cloud fallback is an illusion on factory floors
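One common way to implement the kind of on-device drift check mentioned in this summary is to compare a histogram of a cheap image statistic against a reference captured at calibration time. The sketch below uses the population stability index (PSI) over normalised brightness values. The choice of PSI, the bin count, and the 0.2 alert threshold are assumptions for illustration, not details taken from the article.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], bins: int = 8,
              lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Smoothed, normalised histogram of values clamped to [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(bins - 1, max(0, idx))] += 1
    eps = 1e-6  # smoothing so empty bins do not produce log(0)
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def psi(reference: Sequence[float], current: Sequence[float]) -> float:
    """Population stability index between two normalised histograms."""
    return sum((c - r) * math.log(c / r) for r, c in zip(reference, current))


def drifted(reference: Sequence[float], current: Sequence[float],
            threshold: float = 0.2) -> bool:
    # The 0.2 threshold is a rule-of-thumb placeholder; a real deployment
    # would calibrate it on held-out production data.
    return psi(reference, current) > threshold
```

The reference histogram would be computed once from calibration images and shipped with the model, so the check needs no connectivity at run time.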

Tech Stack

Backend

ONNX Runtime, MobileNetV3, INT8 Quantisation, Python, C++

Infrastructure

ARM

Read Time·6 min

Data Engineering · Dec 2025 · 04 / 06

Ingesting 5 Petabytes a Day: Rebuilding Our Data Pipeline in Rust

When our telemetry pipeline began buckling under its own weight, just adding more Kafka nodes stopped working. A deep dive into custom partitioning strategies and ditching JVM overhead to reclaim throughput.

At 2 petabytes per day, the JVM-based telemetry pipeline was spending 68% of its CPU budget on garbage collection. Throughput had plateaued. Adding Kafka partitions helped, until it didn't. At the projected 5PB/day, back-of-envelope math showed the existing architecture would need a cluster three times its current size just to keep up. We rebuilt the critical path in Rust.

The Problem

GC pauses on the JVM consumer fleet were introducing 200–400ms latency spikes every 30–90 seconds under sustained load
Kafka's default partitioning strategy created hot partitions under skewed telemetry device distributions
Deserialization of a mix of Avro, Protobuf, and legacy JSON schemas added significant per-message overhead
The pipeline had no backpressure mechanism — upstream producers would overwhelm consumers during traffic bursts

Our Approach

The consumer layer was rewritten in Rust using Tokio for async I/O and a custom Kafka client that implemented consistent-hash partitioning based on device ID cardinality. A unified schema registry with zero-copy deserialization eliminated per-message allocation. Backpressure was implemented via a token-bucket rate limiter at the ingestion boundary. The result: 5PB/day on 40% fewer nodes with p99 latency under 12ms.
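The consistent-hash partitioning idea can be sketched independently of any Kafka client. The example below is written in Python for brevity, although the article's implementation is in Rust. It is a generic hash ring with virtual nodes, and the class name, the vnode count, and the use of SHA-256 are assumptions for illustration.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Maps device IDs to partitions via a ring of virtual nodes."""

    def __init__(self, partitions: Iterable[int], vnodes: int = 64) -> None:
        ring: List[Tuple[int, int]] = sorted(
            (self._hash(f"{p}#{i}"), p) for p in partitions for i in range(vnodes)
        )
        if not ring:
            raise ValueError("at least one partition is required")
        self._ring = ring
        self._keys = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable, well-spread 64-bit value.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def partition_for(self, device_id: str) -> int:
        """Return the partition owning the first vnode clockwise of the key."""
        idx = bisect.bisect_right(self._keys, self._hash(device_id))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters operationally is that adding a partition only moves the keys claimed by the new partition's virtual nodes; every other device keeps its assignment. Handling a few very hot devices typically needs extra logic (such as splitting those keys), which this sketch leaves out.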

Key Takeaways

1. GC pause budgets compound at petabyte scale; if you are seeing 200ms spikes at 2PB, you will not survive 5PB
2. Custom partitioning based on your actual key distribution beats Kafka defaults every time at this scale
3. Zero-copy deserialization is not premature optimisation when you are processing billions of messages per day
4. Backpressure is an architectural decision, not an afterthought; design it in before you need it
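For readers unfamiliar with the token-bucket limiter named in the approach, a minimal version looks like the sketch below. It is written in Python for brevity rather than Rust, and the class name, the parameters, and the injectable clock are assumptions made so the example is easy to test; it is not the pipeline's actual limiter.

```python
import time
from typing import Callable


class TokenBucket:
    """Admits work while tokens remain; tokens refill at a fixed rate."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate_per_sec <= 0 or capacity <= 0:
            raise ValueError("rate_per_sec and capacity must be positive")
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True if the message may be ingested now, False to push back."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A `False` result is the backpressure signal: the ingestion boundary can then reject, delay, or shed the message instead of letting the burst reach the consumers.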

Tech Stack

Backend

Rust, Tokio, Avro, Protobuf

Database

Apache Kafka, ClickHouse

Read Time·8 min

Applied AI · Nov 2025 · 05 / 06

Taming LLM Hallucinations in Clinical Triage Routing

Building an automated clinical routing system where mistakes are catastrophic. We share the architectural blueprint of our deterministic guardrail layer and RAG pipeline that strictly enforces medical ontology.

An NHS trust wanted to automate the first-pass triage routing of clinical referral letters — a task currently performed by specialist nurses. The potential efficiency gain was significant. The risk was also significant: a misrouted referral could delay urgent cancer treatment by weeks. Standard LLM output was non-deterministic and occasionally hallucinated clinical codes that did not exist in the ICD-11 taxonomy. We needed determinism the model couldn't provide on its own.

The Problem

LLMs produced plausible but invalid ICD-11 codes at a rate of approximately 2.3% — catastrophic in a clinical context
Referral letters mixed structured data, clinical shorthand, and free text with no consistent schema
The system needed to escalate ambiguous cases to human review without disrupting routing throughput
NHS data governance required all processing to remain on-premises with no data leaving the trust boundary

Our Approach

We built a two-stage architecture: a RAG pipeline grounded in the full ICD-11 taxonomy and the trust's clinical protocol documents, followed by a deterministic validation layer that rejected any output not resolvable to a verified code. Ambiguity scoring routed uncertain cases to a human review queue with pre-populated suggested codes. The entire stack ran on-premises using a fine-tuned Mistral 7B model, with no external API calls.
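The deterministic validation stage can be sketched as a pure function that accepts a model's candidate code only if it resolves against a known set. Everything named here is hypothetical: the `RoutingDecision` type, the confidence score, the 0.9 threshold, and the placeholder codes (which are deliberately not real ICD-11 codes).

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class RoutingDecision:
    route: str             # "auto" or "human_review"
    code: Optional[str]    # suggested code, if one survived validation
    reason: str


def validate_code(candidate: str, confidence: float, valid_codes: Set[str],
                  min_confidence: float = 0.9) -> RoutingDecision:
    """Reject anything not in the taxonomy; escalate anything uncertain."""
    code = candidate.strip().upper()
    if code not in valid_codes:
        # A hallucinated code is never passed on, not even as a suggestion.
        return RoutingDecision("human_review", None, "code not in taxonomy")
    if confidence < min_confidence:
        # Valid but uncertain: pre-populate the suggestion for the reviewer.
        return RoutingDecision("human_review", code, "low confidence")
    return RoutingDecision("auto", code, "validated")


if __name__ == "__main__":
    taxonomy = {"CODE-A1", "CODE-B2"}  # placeholder codes, not real ICD-11
    print(validate_code("code-a1", 0.97, taxonomy))
    print(validate_code("CODE-Z9", 0.99, taxonomy))
```

Because the check is a set lookup with no model in the loop, the same input always produces the same decision, which is the property the LLM alone could not provide.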

Key Takeaways

1. Never trust LLM output directly in clinical or safety-critical contexts; always enforce downstream validation against authoritative data
2. RAG grounded in your specific domain ontology reduces hallucination rate by an order of magnitude compared to base models
3. Ambiguity escalation to humans is not a failure state; design it as a first-class feature from day one
4. On-premises fine-tuned open-weight models are production-viable and often the only compliant option in regulated environments

Tech Stack

Backend

Mistral 7B, RAG, ICD-11, Python, LangChain

Infrastructure

On-Premises

Read Time·7 min

Infrastructure · Oct 2025 · 06 / 06

Automated Ephemeral Environments for 500+ Engineers

Staging environments were our biggest bottleneck. We built a custom Kubernetes operator that spins up completely isolated, data-anonymized replicas of production for every single pull request in under 90 seconds.

With 500 engineers across 40 teams sharing three staging environments, merge queues were routinely 48 hours long. Environment contention caused more delays than code review. The solution wasn't more staging environments — it was a system that could create an isolated, production-accurate environment for every pull request in under 90 seconds and tear it down automatically when the PR closed.

The Problem

Production database snapshots were 8TB — a full restore per PR was impractical on any timeline
Shared staging environments meant one team's bad deployment could block all other teams
Service-to-service dependencies meant isolated environments needed intelligent traffic routing to avoid calling production
Data anonymisation for GDPR compliance had to run within the 90-second spin-up budget

Our Approach

We built a custom Kubernetes operator that listened to GitHub PR webhooks and provisioned namespaced environments using copy-on-write volume snapshots — eliminating the full-restore problem. Service mesh routing rules isolated all inter-service traffic to the PR namespace by default, with explicit overrides for stable shared services. A streaming anonymisation job processed only the tables touched by changed migrations, completing in under 40 seconds for 95% of PRs.
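Scoping anonymisation to the tables a pull request actually touches could look something like the sketch below. It assumes migrations are plain SQL files and uses a deliberately simple regular expression over `CREATE`/`ALTER`/`DROP TABLE` statements; the function names and the set of PII-bearing tables are hypothetical, and a production version would more likely rely on a real SQL parser or the migration tool's metadata.

```python
import re
from typing import Iterable, Set

# Matches CREATE/ALTER/DROP TABLE, optional IF [NOT] EXISTS / ONLY, then the name.
_TABLE_RE = re.compile(
    r'\b(?:CREATE|ALTER|DROP)\s+TABLE\s+'
    r'(?:IF\s+(?:NOT\s+)?EXISTS\s+)?(?:ONLY\s+)?"?([A-Za-z_][\w.]*)"?',
    re.IGNORECASE,
)


def tables_touched(migrations: Iterable[str]) -> Set[str]:
    """Lower-cased table names referenced by DDL in the given migration texts."""
    return {m.group(1).lower() for sql in migrations for m in _TABLE_RE.finditer(sql)}


def tables_to_anonymise(migrations: Iterable[str], pii_tables: Set[str]) -> Set[str]:
    """Only PII-bearing tables that the changed migrations touch need processing."""
    return tables_touched(migrations) & {t.lower() for t in pii_tables}


if __name__ == "__main__":
    changed = [
        "ALTER TABLE customers ADD COLUMN phone text;",
        'CREATE TABLE IF NOT EXISTS "audit_log" (id int);',
    ]
    print(tables_to_anonymise(changed, {"customers", "payments"}))
```

The smaller the intersection, the less data the anonymisation job has to stream, which is how a fixed spin-up budget stays achievable as the database grows.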

Key Takeaways

1. Copy-on-write volume snapshots are the unlock for fast ephemeral environments at scale; they make 8TB databases practical
2. Namespace isolation with service mesh routing is cleaner than VPC-per-environment and dramatically faster to provision
3. Scope your anonymisation to changed migrations, not the full database; it is the only way to meet aggressive SLAs
4. Build the teardown logic first; ephemeral environments that are not reliably cleaned up become permanent environments

Tech Stack

Database

PostgreSQL

Infrastructure

Kubernetes, Helm, Istio, GitHub Actions

Read Time·5 min

Have a problem worth solving?

We partner with teams to build software that performs in high-stakes environments. Start with a 30-minute discovery call.
