Bigstrum
Infrastructure · 5 min read · Oct 2025

Automated Ephemeral Environments for 500+ Engineers

Staging environments were our biggest bottleneck. We built a custom Kubernetes operator that spins up completely isolated, data-anonymized replicas of production for every single pull request in under 90 seconds.

Overview

With 500 engineers across 40 teams sharing three staging environments, merge queues were routinely 48 hours long. Environment contention caused more delays than code review. The solution wasn't more staging environments — it was a system that could create an isolated, production-accurate environment for every pull request in under 90 seconds and tear it down automatically when the PR closed.

The Problem
  • 01

    Production database snapshots were 8TB — a full restore per PR was impractical on any timeline

  • 02

    Shared staging environments meant one team's bad deployment could block all other teams

  • 03

    Service-to-service dependencies meant isolated environments needed intelligent traffic routing to avoid calling production

  • 04

    Data anonymisation for GDPR compliance had to run within the 90-second spin-up budget

Our Approach

We built a custom Kubernetes operator that listened for GitHub PR webhooks and provisioned a namespaced environment for each PR from copy-on-write volume snapshots. A copy-on-write snapshot shares unchanged blocks with its source and stores only the blocks written afterwards, so no PR needed a full 8TB restore. Service mesh routing rules kept all inter-service traffic inside the PR namespace by default, with explicit overrides for stable shared services. A streaming anonymisation job processed only the tables touched by changed migrations and completed in under 40 seconds for 95% of PRs.

Key Takeaways
  • 01

    Copy-on-write volume snapshots are what make fast ephemeral environments possible at scale: they make 8TB databases practical to clone per PR

  • 02

    Namespace isolation with service mesh routing is cleaner than VPC-per-environment and dramatically faster to provision

  • 03

    Scope your anonymisation to changed migrations, not the full database — it is the only way to meet aggressive SLAs

  • 04

    Build the teardown logic first — ephemeral environments that are not reliably cleaned up become permanent environments

Tech Stack

Kubernetes, Helm, Istio, PostgreSQL, GitHub Actions, Go
