Emergency
260 Production Deployments.
Solo. Zero Downtime.
The control plane died. SSH keys were lost. TLS certs were expiring in three weeks. No documentation. No IaC. No team. One person, one new cluster, one shot.
260 — Production deployments
~3 weeks — Until first cert expired
0 — Extended downtime
1 — Person
The Short Version
The production cluster became unmanageable. I built a new one and moved everything.
Our Kubernetes cluster — running all client production workloads — lost its control plane. No remote access keys existed. TLS certificates were going to start expiring in about three weeks with no way to renew them. No documentation, no IaC, no recovery path.
I built a new EKS cluster from scratch (learning EKS provisioning on the job), and started migrating the most critical pods — earliest cert expiration first — within three weeks of discovery. Over 2.75 months, I migrated or deprecated all 260 deployments, rationalized the portfolio down to ~170, and continued to 139 today.
257 of the 260 migrated with zero client awareness — and I did it while carrying my full existing workload.
The Crisis
A cascade of failures with a hard deadline
This wasn't a planned migration. It was a rescue operation with a deadline set by certificate expiration.
Control plane offline
Kubernetes 1.17 — years past end-of-life — on unmanaged infrastructure with no IaC. When the control plane went down, there was no documented recovery path.
No SSH access
The remote access keys for the underlying EC2 instances had been lost by a previous developer. The cluster was a black box with the lights still on — pods running, but no one could manage them.
No new nodes
Without a functioning control plane, new nodes couldn't attach. The infrastructure couldn't scale, couldn't self-heal, couldn't be patched.
TLS time bomb
The in-cluster Let's Encrypt renewal process needed a functioning control plane, so renewals had stopped. The first certs were set to expire in roughly three weeks. After that — a rolling wave of outages.
No documentation. Anywhere.
No cluster docs. No code docs. No architecture diagrams. No runbooks. No IaC. The original cluster was built from flat YAML files with no version control, no comments, and no consistency.
What I Had
Familiarity with Kubernetes administration from managing the existing cluster
Access to the running pods and their configurations (read-only, effectively)
My own partial documentation from prior efforts to understand the system
AI tools (Claude, OpenAI API, KiloCode) that I could build task-specific agents with
What I Didn't Have
Any experience provisioning an EKS cluster from scratch
SSH access to the underlying infrastructure
IaC, documentation, or architecture records from the previous developer
A team. This was me.
Any slack on other responsibilities — my full client workload continued throughout
Learned During the Project
Architecture Decisions
Not a version upgrade. A rearchitecture.
Every major subsystem had to be redesigned — not because I wanted to, but because the legacy patterns couldn't survive the migration. Each decision eliminated a class of failure.
TLS
From In-Cluster NGINX to ALB
Before
A central NGINX deployment (not even a DaemonSet) terminated TLS via Let's Encrypt, with each pod also running its own NGINX layer. Certificate renewal depended on the control plane. When it died, TLS renewal died with it.
After
TLS termination at the Application Load Balancer. Certificates via AWS Certificate Manager — centralized, auto-renewing, zero in-cluster dependency. Negotiated an ACM quota increase to 100 domains per cert, reducing management to 4 certificates total.
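The quota math behind "4 certificates total" can be sketched as a simple packing step — the domain names and counts below are illustrative, not the production inventory:

```python
# Hypothetical sketch: pack client domains into as few ACM certificates
# as possible, given the negotiated quota of 100 domains per certificate.

def plan_acm_certs(domains, max_domains_per_cert=100):
    """Group domain names into certificate batches of at most max_domains_per_cert."""
    batches = []
    for i in range(0, len(domains), max_domains_per_cert):
        batches.append(domains[i:i + max_domains_per_cert])
    return batches

# ~350 illustrative client domains fit into 4 certificates of <=100 each.
domains = [f"client{n}.example.com" for n in range(350)]
certs = plan_acm_certs(domains)
print(len(certs))  # 4
```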
Networking
NLB/NGINX → ALB/Ingress
Before
NLB with NGINX routing inside the cluster. The subdomain scheme (appname.clientname) made the app name the variable part of the hostname — without restructuring, each new app would need its own ALB. That path led to 10+ ALBs.
After
Inverted the subdomain pattern to clientname.appname — making apps the constant. Wildcard certs per app, two ALBs total. Every routing rule rebuilt from scratch for the new model.
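The inversion itself is mechanical enough to show in a few lines. A minimal sketch — the base domain and hostnames are assumed for illustration:

```python
def invert_hostname(legacy_host, base="apps.site.com"):
    """Rewrite legacy appname.clientname.<base> to clientname.appname.<base>.

    Making the app the stable label means one wildcard cert per app
    covers every client, collapsing routing onto a handful of ALBs.
    """
    suffix = "." + base
    if not legacy_host.endswith(suffix):
        return legacy_host  # not a managed hostname; leave untouched
    app, client = legacy_host[:-len(suffix)].split(".", 1)
    return f"{client}.{app}{suffix}"

print(invert_hostname("dashboard.acme.apps.site.com"))
# acme.dashboard.apps.site.com
```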
Domains
Lambda@Edge Rewrites
Before
Subdomain pattern: appname.clientname.apps.site.com — couldn't redirect without breaking React apps with hardcoded asset paths.
After
Lambda@Edge intercepts at CloudFront edge and rewrites transparently. No redirect, no client awareness. Legacy URLs continue to work.
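A hedged sketch of what such a rewrite looks like as a Lambda@Edge origin-request handler in Python — the app list, base domain, and trigger choice are assumptions for illustration, not the production code:

```python
LEGACY_APPS = {"dashboard", "reports"}  # illustrative; the real list came from the inventory
BASE = ".apps.site.com"                 # illustrative base domain

def handler(event, context):
    """CloudFront origin-request trigger: rewrite legacy hostnames in place."""
    request = event["Records"][0]["cf"]["request"]
    host = request["headers"]["host"][0]["value"]
    if host.endswith(BASE):
        first, rest = host[:-len(BASE)].split(".", 1)
        if first in LEGACY_APPS:
            # Transparent rewrite from appname.clientname to the new
            # clientname.appname scheme -- no redirect, so hardcoded
            # asset paths in React apps still resolve.
            request["headers"]["host"] = [
                {"key": "Host", "value": f"{rest}.{first}{BASE}"}
            ]
    return request
```

Checking the first label against a known app list keeps already-migrated hostnames (client first) from being swapped back.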
Security
Segmentation & Isolation
Before
No internal network controls. Any pod could talk to any other pod — including pods running Python 2.7 with known vulnerabilities.
After
Proper network policies: Python 2 pods isolated to NGINX-only communication. Explicit ingress/egress rules. Least-privilege enforcement.
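The isolation rule can be sketched as a Kubernetes NetworkPolicy, here built as a Python dict — the label names (`runtime: python2`, `app: nginx`) are assumptions for illustration:

```python
# Illustrative sketch: a NetworkPolicy manifest restricting Python 2
# pods to traffic from/to the NGINX layer only.

def python2_isolation_policy(namespace="legacy"):
    nginx = {"podSelector": {"matchLabels": {"app": "nginx"}}}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "isolate-python2", "namespace": namespace},
        "spec": {
            # Applies to every pod labelled as a Python 2 workload.
            "podSelector": {"matchLabels": {"runtime": "python2"}},
            "policyTypes": ["Ingress", "Egress"],
            # Selecting the pod default-denies it; these rules then
            # allow only the NGINX pods in either direction.
            "ingress": [{"from": [nginx]}],
            "egress": [{"to": [nginx]}],
        },
    }
```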
Modernization
Python 2 → 3
Several applications still on Python 2.7 — EOL since January 2020. Not a simple version bump: dependency resolution, syntax migration, testing, and deployment for each app, folded into an already compressed timeline.
Data
107 GB Legacy Filestore Recovery
Legacy file data needed to stay accessible during the migration, so I built VPC bridges between the legacy and new infrastructure — another skill learned mid-project because it was necessary.
Automation
One person can't do this manually. So I built the tooling.
This migration could not have been executed manually by one person in three months. The only reason it was possible is that I built the automation to do it at scale.
YAML Migration Agents
AI agents that parse legacy YAMLs, identify deployment configuration, and generate equivalents for the new ALB/ingress architecture. Hours of manual translation → minutes of review.
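A minimal sketch of the kind of output these agents produced — an ALB Ingress manifest built from a legacy service's details. Names, values, and the annotation set shown are illustrative:

```python
# Illustrative sketch: translate a legacy service's routing details
# into an Ingress manifest for the new ALB-based architecture.

def legacy_to_alb_ingress(name, new_host, service_name, service_port):
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "annotations": {
                # Route through the shared AWS ALB instead of in-cluster NGINX.
                "kubernetes.io/ingress.class": "alb",
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
            },
        },
        "spec": {
            "rules": [{
                "host": new_host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {
                        "name": service_name,
                        "port": {"number": service_port},
                    }},
                }]},
            }],
        },
    }
```

The agents generated drafts like this from the legacy YAML; the human step was review, not translation.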
TLS Validation Workflows
Automated scanning of certificate states — which certs expiring when, which migrated to ACM, which still on legacy Let's Encrypt. The critical prioritization layer.
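The prioritization itself reduces to a sort by expiry date. A toy sketch with hypothetical deployments and dates:

```python
from datetime import date

# Sketch of the prioritization layer: order deployments by certificate
# expiry so the pods that would go dark first are migrated first.

def migration_order(cert_expiries):
    """cert_expiries: {deployment_name: expiry_date} -> names, soonest first."""
    return sorted(cert_expiries, key=cert_expiries.get)

expiries = {  # hypothetical data
    "reports": date(2023, 5, 30),
    "dashboard": date(2023, 5, 12),  # expires first -> migrate first
    "billing": date(2023, 7, 1),
}
print(migration_order(expiries))  # ['dashboard', 'reports', 'billing']
```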
DNS Rule Generation
260 deployments = hundreds of DNS entries, subdomain mappings, and routing rules. AI agents generated configurations, flagged conflicts, produced Lambda@Edge rewrite rules.
Deployment Verification
After each migration batch, automated checks validated correct TLS, correct routing, correct response codes, no broken assets on the new cluster.
Ingress Conflict Detection
Restructuring the subdomain scheme and collapsing routing to two ALBs created complex overlap potential. Automated detection caught conflicting rules and ambiguous routes before they hit production.
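The core of the check is spotting two ingress rules that claim the same host and path. A simplified sketch with illustrative rule names:

```python
from collections import defaultdict

# Sketch of the conflict check: with all routing collapsed onto two ALBs,
# the same host+path claimed by two ingress rules is an ambiguous route.

def find_conflicts(rules):
    """rules: list of (ingress_name, host, path) -> {(host, path): [names]} for duplicates."""
    claims = defaultdict(list)
    for name, host, path in rules:
        claims[(host, path)].append(name)
    return {route: names for route, names in claims.items() if len(names) > 1}

rules = [  # illustrative rules
    ("dashboard-ing", "acme.dashboard.apps.site.com", "/"),
    ("legacy-ing", "acme.dashboard.apps.site.com", "/"),  # duplicate claim
    ("reports-ing", "acme.reports.apps.site.com", "/"),
]
print(find_conflicts(rules))
```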
Documentation Generators
Since no documentation existed, agents generated architecture docs from configurations as they were migrated. Documentation as a byproduct, not an afterthought.
Execution
Eleven weeks. Start to finish.
Discovery & Triage
Assess the full scope. Inventory all 260 deployments. Map certificate expiration dates. Identify the pods that will go dark first. Begin learning EKS provisioning.
New Cluster & First Migrations
New EKS cluster stood up — Fargate for serverless, EC2 autoscaling for everything else. First batch of critical pods migrated. The first deployment moved over with approximately two days before its cert expired.
Bulk Migration
Systematic migration in priority-ordered batches. Generate configs via automation → review → deploy → validate → deprecate legacy. Concurrent with full client workload. 50–65 hour weeks.
Cleanup & Rationalization
Remaining migrations completed. Portfolio rationalized from 260 to ~170 active deployments. Legacy cluster wound down. Documentation finalized.
Post-Migration
Continued rationalization to 139 deployments. Architecture supports scaling. Documentation exists. Security posture fundamentally transformed.
The Outcome
What came out the other side
257 / 260
Seamless migrations
Zero client awareness. Three clients experienced minor issues from surfaced legacy tech debt — pre-existing problems hidden by the old architecture. All resolved within SLA.
260 → 139
Portfolio rationalized
Not just lift-and-shift. Every deployment evaluated — what's needed, what's redundant, what's obsolete. The result is leaner, more secure, and cheaper to operate.
~25%
AWS cost reduction
Consolidating 10+ ALBs to two, deprecating unused deployments, and right-sizing resources cut monthly infrastructure spend by a quarter.
0 → documented
From nothing to full documentation
Architecture records, deployment configurations, runbooks — generated as a byproduct of the migration tooling. Documentation because the process demanded it.
Retrospective
What I'd do differently
Build the automation tooling before the crisis
The AI agents and migration scripts I built under emergency pressure could have been developed incrementally during quieter periods. Having that tooling ready would have turned a three-month sprint into a smoother execution.
Push harder on Python 2 modernization scope
I modernized the applications that had to be modernized. Ideally, every Python 2 app would have migrated to 3. Some remain sandboxed — functional but carrying tech debt. The timeline didn't allow a complete sweep.
Start deprecation before the crisis
Realistically, though, it couldn't have happened. You can't deprecate what you can't identify, and nothing was documented. What was really needed was a strangler-fig approach. In the end, the emergency created the mandate that planning couldn't.
The Design Insight
When everything is on fire, the instinct is to start moving things. That's how you burn out in week two.
The decision that made this project survivable was investing the first two weeks in understanding the full scope and building the automation to execute at scale. The other decision was treating the migration as an architectural opportunity, not just a rescue.
Invest in understanding before acting
Every hour on a YAML migration agent saved dozens of hours of manual config translation.
Fix the system, not just the symptom
If you're going to touch every deployment, fix the networking, fix the security, fix the documentation.
Discipline over urgency
The hardest part wasn't the technology. It was maintaining the discipline to work systematically when the pressure was to work frantically.
Technical Stack
More case studies
See how I approach different kinds of problems — from AI-governed development to enterprise platform architecture.