Emergency
260 Production Deployments.
Solo. Zero Downtime.
The control plane died. SSH keys were lost. TLS certs were expiring in three weeks. No documentation. No IaC. No team. One person, one new cluster, one shot.
260 — Production deployments
~3 weeks — Until first cert expired
0 — Extended downtime
1 — Person
The Short Version
The production cluster became unmanageable. I built a new one and moved everything.
Our Kubernetes cluster — running all client production workloads — lost its control plane. No remote access keys existed. TLS certificates were going to start expiring in about three weeks with no way to renew them. No documentation, no IaC, no recovery path.
I built a new EKS cluster from scratch (learning EKS provisioning on the job), and started migrating the most critical pods — earliest cert expiration first — within three weeks of discovery. Over 2.75 months, I migrated or deprecated all 260 deployments, rationalized the portfolio down to ~170, and continued to 139 today.
257 of the 260 migrated with zero client awareness — and I did it while carrying my full existing workload.
The Crisis
A cascade of failures with a hard deadline
This wasn't a planned migration. It was a rescue operation with a deadline set by certificate expiration.
Control plane offline
Kubernetes 1.17 — years past end-of-life — on unmanaged infrastructure with no IaC. When the control plane went down, there was no documented recovery path.
No SSH access
The remote access keys for the underlying EC2 instances had been lost by a previous developer. The cluster was a black box with the lights still on — pods running, but no one could manage them.
No new nodes
Without a functioning control plane, new nodes couldn't attach. The infrastructure couldn't scale, couldn't self-heal, couldn't be patched.
TLS time bomb
The in-cluster Let's Encrypt renewal process needed a functioning control plane, so renewals had stopped. The first certs were set to expire in roughly three weeks. After that — a rolling wave of outages.
No documentation. Anywhere.
No cluster docs. No code docs. No architecture diagrams. No runbooks. No IaC. The original cluster was built from flat YAML files with no version control, no comments, and no consistency.
What I Had
Familiarity with Kubernetes administration from managing the existing cluster
Access to the running pods and their configurations (read-only, effectively)
My own partial documentation from prior efforts to understand the system
AI tools (Claude, OpenAI API, KiloCode) that I could build task-specific agents with
What I Didn't Have
Any experience provisioning an EKS cluster from scratch
SSH access to the underlying infrastructure
IaC, documentation, or architecture records from the previous developer
A team. This was me.
Any slack on other responsibilities — my full client workload continued throughout
Learned During the Project
Architecture Decisions
Not a version upgrade. A rearchitecture.
Every major subsystem had to be redesigned — not because I wanted to, but because the legacy patterns couldn't survive the migration. Each decision eliminated a class of failure.
TLS
From In-Cluster NGINX to ALB
Before
A central NGINX deployment (not even a DaemonSet) terminated TLS via Let's Encrypt, with each pod also running its own NGINX layer. Certificate renewal depended on the control plane. When it died, TLS renewal died with it.
After
TLS termination at the Application Load Balancer. Certificates via AWS Certificate Manager — centralized, auto-renewing, zero in-cluster dependency. Negotiated an ACM quota increase to 100 domains per cert, reducing management to 4 certificates total.
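The quota math behind "4 certificates total" can be sketched as a simple packing step — the domain names and counts below are illustrative, not the production inventory:

```python
# Hypothetical sketch: pack client domains into as few ACM certificates
# as possible, given the negotiated quota of 100 domains per certificate.

def plan_acm_certs(domains, max_domains_per_cert=100):
    """Group domain names into certificate batches of at most max_domains_per_cert."""
    batches = []
    for i in range(0, len(domains), max_domains_per_cert):
        batches.append(domains[i:i + max_domains_per_cert])
    return batches

# ~350 illustrative client domains fit into 4 certificates of <=100 each.
domains = [f"client{n}.example.com" for n in range(350)]
certs = plan_acm_certs(domains)
print(len(certs))  # 4
```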
Networking
NLB/NGINX → ALB/Ingress
Before
NLB with NGINX routing inside the cluster. The subdomain scheme (appname.clientname) made the app name the variable part of the hostname — without restructuring, each new app would need its own ALB. That path led to 10+ ALBs.
After
Inverted the subdomain pattern to clientname.appname — making apps the constant. Wildcard certs per app, two ALBs total. Every routing rule rebuilt from scratch for the new model.
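The inversion itself is mechanical enough to show in a few lines. A minimal sketch — the base domain and hostnames are assumed for illustration:

```python
def invert_hostname(legacy_host, base="apps.site.com"):
    """Rewrite legacy appname.clientname.<base> to clientname.appname.<base>.

    Making the app the stable label means one wildcard cert per app
    covers every client, collapsing routing onto a handful of ALBs.
    """
    suffix = "." + base
    if not legacy_host.endswith(suffix):
        return legacy_host  # not a managed hostname; leave untouched
    app, client = legacy_host[:-len(suffix)].split(".", 1)
    return f"{client}.{app}{suffix}"

print(invert_hostname("dashboard.acme.apps.site.com"))
# acme.dashboard.apps.site.com
```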
Domains
Lambda@Edge Rewrites
Before
Subdomain pattern: appname.clientname.apps.site.com — couldn't redirect without breaking React apps with hardcoded asset paths.
After
Lambda@Edge intercepts at CloudFront edge and rewrites transparently. No redirect, no client awareness. Legacy URLs continue to work.
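A hedged sketch of what such a rewrite looks like as a Lambda@Edge origin-request handler in Python — the app list, base domain, and trigger choice are assumptions for illustration, not the production code:

```python
LEGACY_APPS = {"dashboard", "reports"}  # illustrative; the real list came from the inventory
BASE = ".apps.site.com"                 # illustrative base domain

def handler(event, context):
    """CloudFront origin-request trigger: rewrite legacy hostnames in place."""
    request = event["Records"][0]["cf"]["request"]
    host = request["headers"]["host"][0]["value"]
    if host.endswith(BASE):
        first, rest = host[:-len(BASE)].split(".", 1)
        if first in LEGACY_APPS:
            # Transparent rewrite from appname.clientname to the new
            # clientname.appname scheme -- no redirect, so hardcoded
            # asset paths in React apps still resolve.
            request["headers"]["host"] = [
                {"key": "Host", "value": f"{rest}.{first}{BASE}"}
            ]
    return request
```

Checking the first label against a known app list keeps already-migrated hostnames (client first) from being swapped back.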
Security
Segmentation & Isolation
Before
No internal network controls. Any pod could talk to any other pod — including pods running Python 2.7 with known vulnerabilities.
After
Proper network policies: Python 2 pods isolated to NGINX-only communication. Explicit ingress/egress rules. Least-privilege enforcement.
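The isolation rule can be sketched as a Kubernetes NetworkPolicy, here built as a Python dict — the label names (`runtime: python2`, `app: nginx`) are assumptions for illustration:

```python
# Illustrative sketch: a NetworkPolicy manifest restricting Python 2
# pods to traffic from/to the NGINX layer only.

def python2_isolation_policy(namespace="legacy"):
    nginx = {"podSelector": {"matchLabels": {"app": "nginx"}}}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "isolate-python2", "namespace": namespace},
        "spec": {
            # Applies to every pod labelled as a Python 2 workload.
            "podSelector": {"matchLabels": {"runtime": "python2"}},
            "policyTypes": ["Ingress", "Egress"],
            # Selecting the pod default-denies it; these rules then
            # allow only the NGINX pods in either direction.
            "ingress": [{"from": [nginx]}],
            "egress": [{"to": [nginx]}],
        },
    }
```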
Modernization
Python 2 → 3
Several applications still on Python 2.7 — EOL since January 2020. Not a simple version bump: dependency resolution, syntax migration, testing, and deployment for each app, folded into an already compressed timeline.
Data
107 GB Legacy Filestore Recovery
Legacy file data needed to stay accessible during the migration, so I built VPC bridges between the legacy and new infrastructure — another skill learned mid-project because it was necessary.
Automation
One person can't do this manually. So I built the tooling.
This migration could not have been executed manually by one person in three months. The only reason it was possible is that I built the automation to do it at scale.
YAML Migration Agents
AI agents that parse legacy YAMLs, identify deployment configuration, and generate equivalents for the new ALB/ingress architecture. Hours of manual translation → minutes of review.
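A minimal sketch of the kind of output these agents produced — an ALB Ingress manifest built from a legacy service's details. Names, values, and the annotation set shown are illustrative:

```python
# Illustrative sketch: translate a legacy service's routing details
# into an Ingress manifest for the new ALB-based architecture.

def legacy_to_alb_ingress(name, new_host, service_name, service_port):
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "annotations": {
                # Route through the shared AWS ALB instead of in-cluster NGINX.
                "kubernetes.io/ingress.class": "alb",
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
            },
        },
        "spec": {
            "rules": [{
                "host": new_host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {
                        "name": service_name,
                        "port": {"number": service_port},
                    }},
                }]},
            }],
        },
    }
```

The agents generated drafts like this from the legacy YAML; the human step was review, not translation.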
TLS Validation Workflows
Automated scanning of certificate states — which certs expiring when, which migrated to ACM, which still on legacy Let's Encrypt. The critical prioritization layer.
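The prioritization itself reduces to a sort by expiry date. A toy sketch with hypothetical deployments and dates:

```python
from datetime import date

# Sketch of the prioritization layer: order deployments by certificate
# expiry so the pods that would go dark first are migrated first.

def migration_order(cert_expiries):
    """cert_expiries: {deployment_name: expiry_date} -> names, soonest first."""
    return sorted(cert_expiries, key=cert_expiries.get)

expiries = {  # hypothetical data
    "reports": date(2023, 5, 30),
    "dashboard": date(2023, 5, 12),  # expires first -> migrate first
    "billing": date(2023, 7, 1),
}
print(migration_order(expiries))  # ['dashboard', 'reports', 'billing']
```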
DNS Rule Generation
260 deployments = hundreds of DNS entries, subdomain mappings, and routing rules. AI agents generated configurations, flagged conflicts, produced Lambda@Edge rewrite rules.
Deployment Verification
After each migration batch, automated checks validated correct TLS, correct routing, correct response codes, no broken assets on the new cluster.
Ingress Conflict Detection
Restructuring the subdomain scheme and collapsing routing to two ALBs created complex overlap potential. Automated detection caught conflicting rules and ambiguous routes before they hit production.
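The core of the check is spotting two ingress rules that claim the same host and path. A simplified sketch with illustrative rule names:

```python
from collections import defaultdict

# Sketch of the conflict check: with all routing collapsed onto two ALBs,
# the same host+path claimed by two ingress rules is an ambiguous route.

def find_conflicts(rules):
    """rules: list of (ingress_name, host, path) -> {(host, path): [names]} for duplicates."""
    claims = defaultdict(list)
    for name, host, path in rules:
        claims[(host, path)].append(name)
    return {route: names for route, names in claims.items() if len(names) > 1}

rules = [  # illustrative rules
    ("dashboard-ing", "acme.dashboard.apps.site.com", "/"),
    ("legacy-ing", "acme.dashboard.apps.site.com", "/"),  # duplicate claim
    ("reports-ing", "acme.reports.apps.site.com", "/"),
]
print(find_conflicts(rules))
```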
Documentation Generators
Since no documentation existed, agents generated architecture docs from configurations as they were migrated. Documentation as a byproduct, not an afterthought.
Execution
Eleven weeks. Start to finish.
Discovery & Triage
Assess the full scope. Inventory all 260 deployments. Map certificate expiration dates. Identify the pods that will go dark first. Begin learning EKS provisioning.
New Cluster & First Migrations
New EKS cluster stood up — Fargate for serverless, EC2 autoscaling for everything else. First batch of critical pods migrated. The first deployment moved over with approximately two days before its cert expired.
Bulk Migration
Systematic migration in priority-ordered batches. Generate configs via automation → review → deploy → validate → deprecate legacy. Concurrent with full client workload. 50–65 hour weeks.
Cleanup & Rationalization
Remaining migrations completed. Portfolio rationalized from 260 to ~170 active deployments. Legacy cluster wound down. Documentation finalized.
Post-Migration
Continued rationalization to 139 deployments. Architecture supports scaling. Documentation exists. Security posture fundamentally transformed.
The Outcome
What came out the other side
257 / 260
Seamless migrations
Zero client awareness. Three clients experienced minor issues from surfaced legacy tech debt — pre-existing problems hidden by the old architecture. All resolved within SLA.
260 → 139
Portfolio rationalized
Not just lift-and-shift. Every deployment evaluated — what's needed, what's redundant, what's obsolete. The result is leaner, more secure, and cheaper to operate.
~25%
AWS cost reduction
Consolidating 10+ ALBs to two, deprecating unused deployments, and right-sizing resources cut monthly infrastructure spend by a quarter.
0 → documented
From nothing to full documentation
Architecture records, deployment configurations, runbooks — generated as a byproduct of the migration tooling. Documentation because the process demanded it.
Retrospective
What I'd do differently
Build the automation tooling before the crisis
The AI agents and migration scripts I built under emergency pressure could have been developed incrementally during quieter periods. Having that tooling ready would have turned a three-month sprint into a smoother execution.
Push harder on Python 2 modernization scope
I modernized the applications that had to be modernized. Ideally, every Python 2 app would have migrated to 3. Some remain sandboxed — functional but carrying tech debt. The timeline didn't allow a complete sweep.
Start deprecation before the crisis
Realistically, though, it couldn't have happened. You can't deprecate what you can't identify, and nothing was documented. What was really needed was a strangler-fig approach. In the end, the emergency created the mandate that planning couldn't.
The Design Insight
When everything is on fire, the instinct is to start moving things. That's how you burn out in week two.
The decision that made this project survivable was investing the first two weeks in understanding the full scope and building the automation to execute at scale. The other decision was treating the migration as an architectural opportunity, not just a rescue.
Invest in understanding before acting
Every hour on a YAML migration agent saved dozens of hours of manual config translation.
Fix the system, not just the symptom
If you're going to touch every deployment, fix the networking, fix the security, fix the documentation.
Discipline over urgency
The hardest part wasn't the technology. It was maintaining the discipline to work systematically when the pressure was to work frantically.
Technical Stack
More case studies
See how I approach different kinds of problems — from AI-governed development to enterprise platform architecture.