Automated Ephemeral Environments for 500+ Engineers


Staging environments were our biggest bottleneck. We built a custom Kubernetes operator that spins up completely isolated, data-anonymized replicas of production for every single pull request in under 90 seconds.

Category: Infrastructure
Read Time: 5 min
Published: Oct 2025
Stack: 6 technologies

Overview

With more than 500 engineers across 40 teams sharing three staging environments, merge queues were routinely 48 hours long, and environment contention caused more delays than code review. The solution wasn't more staging environments. It was a system that could create an isolated, production-accurate environment for every pull request in under 90 seconds and tear it down automatically when the PR closed.

The Problem

01. Production database snapshots were 8 TB, so a full restore for every PR was impractical
02. Shared staging environments meant one team's bad deployment could block every other team
03. Service-to-service dependencies meant isolated environments needed intelligent traffic routing to avoid calling production
04. Data anonymisation for GDPR compliance had to run within the 90-second spin-up budget
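
A quick back-of-the-envelope calculation shows why a full restore per PR cannot fit the budget. The sketch below uses the 8 TB snapshot size and the 90-second target from the text; the 1 GB/s sustained restore rate is purely an illustrative assumption, not a measured figure.

```python
# Back-of-the-envelope restore arithmetic (decimal units: 1 TB = 10**12 bytes).
SNAPSHOT_BYTES = 8 * 10**12      # 8 TB snapshot, from the text
BUDGET_SECONDS = 90              # spin-up target, from the text
ASSUMED_RESTORE_BPS = 1 * 10**9  # ASSUMPTION: 1 GB/s sustained restore rate


def required_throughput_gbps(size_bytes: int, budget_s: float) -> float:
    """Throughput (GB/s) needed to copy size_bytes within budget_s seconds."""
    return size_bytes / budget_s / 10**9


def restore_hours(size_bytes: int, rate_bps: float) -> float:
    """Hours needed to copy size_bytes at rate_bps bytes per second."""
    return size_bytes / rate_bps / 3600


if __name__ == "__main__":
    needed = required_throughput_gbps(SNAPSHOT_BYTES, BUDGET_SECONDS)
    hours = restore_hours(SNAPSHOT_BYTES, ASSUMED_RESTORE_BPS)
    print(f"needed for 90 s: {needed:.1f} GB/s")     # 88.9 GB/s
    print(f"at the assumed rate: {hours:.1f} h per PR")  # 2.2 h
```

Copying every byte would need roughly 89 GB/s of sustained throughput per environment, which is why any approach that physically restores the data is a non-starter.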

Our Approach

We built a custom Kubernetes operator that listened to GitHub pull request webhooks and provisioned a namespaced environment for each PR from copy-on-write volume snapshots. Because a copy-on-write clone shares unchanged blocks with its source, no full restore was needed. Service mesh routing rules confined all inter-service traffic to the PR namespace by default, with explicit overrides for stable shared services. A streaming anonymisation job processed only the tables touched by changed migrations, completing in under 40 seconds for 95% of PRs.
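
The webhook-to-namespace step can be sketched as follows. This is a minimal illustration, not the operator itself: the `action`, `number`, and `repository.full_name` fields come from GitHub's `pull_request` webhook payload, while the `pr-` naming scheme, the `ACTION_MAP` table, and the `provision` / `refresh` / `teardown` labels are hypothetical names chosen for the example.

```python
import re
from typing import Optional

MAX_LABEL_LEN = 63  # Kubernetes namespace names must be valid DNS-1123 labels

# GitHub `pull_request` webhook actions mapped to hypothetical operator actions.
ACTION_MAP = {
    "opened": "provision",
    "reopened": "provision",
    "synchronize": "refresh",  # new commits pushed to the PR branch
    "closed": "teardown",      # sent for both merged and unmerged PRs
}


def namespace_for(repo_full_name: str, pr_number: int) -> str:
    """Build a DNS-1123-safe namespace name such as 'pr-acme-web-7'."""
    suffix = f"-{pr_number}"
    slug = re.sub(r"[^a-z0-9]+", "-", repo_full_name.lower()).strip("-")
    # Truncate the repo part, never the PR number, to stay within 63 characters.
    room = MAX_LABEL_LEN - len("pr-") - len(suffix)
    slug = slug[:room].strip("-")
    return f"pr-{slug}{suffix}"


def plan(event: dict) -> Optional[tuple[str, str]]:
    """Return (action, namespace) for a pull_request payload, or None to ignore it."""
    action = ACTION_MAP.get(event.get("action", ""))
    if action is None:
        return None
    namespace = namespace_for(event["repository"]["full_name"], event["number"])
    return action, namespace
```

Deriving the namespace deterministically from the repository and PR number means the `closed` event can locate the environment to delete without any extra lookup table.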

Key Takeaways

01. Copy-on-write volume snapshots are what make fast ephemeral environments possible at scale: they make cloning an 8 TB database for every PR practical
02. Namespace isolation with service mesh routing is cleaner than VPC-per-environment and dramatically faster to provision
03. Scope your anonymisation to the tables touched by changed migrations, not the full database; it is the only way we found to meet an aggressive spin-up SLA
04. Build the teardown logic first: ephemeral environments that are not reliably cleaned up become permanent environments
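
One common way to make cleanup reliable is a periodic sweep that backs up the webhook-driven teardown, since a missed `closed` event would otherwise leave an environment running forever. The sketch below is an assumption about how such a sweep could be structured, not a description of the operator in this article; the `Environment` record, the `stale_environments` helper, and the 7-day maximum age are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Environment:
    namespace: str
    pr_number: int
    created_at: datetime


def stale_environments(envs, open_prs, now, max_age=timedelta(days=7)):
    """Return namespaces to delete: PR no longer open, or older than max_age."""
    stale = []
    for env in envs:
        pr_closed = env.pr_number not in open_prs
        expired = now - env.created_at > max_age  # ASSUMPTION: hard 7-day cap
        if pr_closed or expired:
            stale.append(env.namespace)
    return stale


if __name__ == "__main__":
    now = datetime(2025, 10, 15, tzinfo=timezone.utc)
    envs = [
        Environment("pr-acme-web-7", 7, now - timedelta(hours=2)),    # open, fresh
        Environment("pr-acme-web-8", 8, now - timedelta(hours=5)),    # PR closed
        Environment("pr-acme-web-9", 9, now - timedelta(days=10)),    # open, too old
    ]
    print(stale_environments(envs, open_prs={7, 9}, now=now))
```

Keeping the decision in a pure function makes the teardown rule easy to test before any provisioning code exists, which fits the advice to build teardown first.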
