Autonomous Agent
Sysadmin Claude.
Production Infrastructure on Autopilot.
An autonomous infrastructure operations agent that SSHes into production servers, runs kubectl against a live EKS cluster, executes fleet-wide operations, and gets faster every session — because the institutional knowledge it accumulates makes each session more informed than the last.
Sequel to Emergency Kubernetes Migration
9 production servers
~70 WordPress sites
150+ K8s deployments
Daily production use
The Short Version
The migration solved the crisis. It didn't solve operations.
After migrating 260 production deployments to EKS solo, I had a new problem: operating the resulting infrastructure. A fleet of nine WordPress servers hosting approximately 70 sites. A Kubernetes cluster running 150+ deployments and 25+ CronJobs. Two infrastructure planes that interact in ways that aren't obvious until something breaks.
Rather than manually operating this indefinitely, I built an autonomous agent to do it. Sysadmin Claude is a production infrastructure operations agent built on Claude Code. It gets faster every session — not because the model improves, but because the institutional knowledge it accumulates in the repo makes each session more informed than the last.
This isn't a chatbot that answers questions about infrastructure. It's an operator that executes real work on production systems.
The Problem
Two infrastructure planes. One person.
The infrastructure spans two distinct planes that aren't independent — they interact in ways that make debugging require understanding both layers simultaneously.
The Server Plane
Nine AWS Lightsail servers managed through GridPane
Approximately 70 WordPress sites
Plugin management, database operations, nginx/PHP-FPM config
WP-CLI operations, Cloudflare integration, security maintenance
The Kubernetes Plane
EKS cluster running 150+ deployments and 25+ CronJobs
React widget platform, observability bot, client-facing services
Deployment patching, ECR image management, CronJob monitoring
OOMKill investigation, service debugging
The cross-layer problem
A WordPress site embeds a widget served from EKS. If that widget's pod gets OOMKilled, the WordPress page shows a broken embed. If PHP-FPM on the server starts timing out, the widget iframe never loads — and the symptom looks like a Kubernetes problem until you check the server layer. Debugging requires understanding both layers and their interactions.
The Evolution
From disposable scripts to governed agent
This didn't start as a designed system. It grew — and each phase expanded the blast radius, which demanded corresponding governance.
Phase 1: Scripts and APIs
During the Kubernetes migration, small scripts hooked into AI APIs to automate repetitive work: YAML generation, TLS validation, DNS record creation. Disposable — built for the migration, used during the migration. But the patterns stuck.
Phase 2: Consolidation
The loose scripts consolidated into a single Claude Code instance. One agent with growing context about the infrastructure. The CLAUDE.md started accumulating operational knowledge — server IPs, known failure modes, configuration gotchas. The shift from scripts to agent required a shift from "just run it" to "what are the boundaries?"
Phase 3: Governed Agent
The current system. A standalone Claude Code instance built around governance that makes it completely portable. Explicit boundaries, layered authority, persistent memory, and a self-maintaining knowledge base. The governance didn't come from theory — it was refined through daily use on live infrastructure.
The Governance Model
Giving an agent SSH access to production is either bold or reckless, depending entirely on the governance.
Not all operations carry the same risk. The governance reflects that with layered authority — the same principle behind Pass@1.
Autonomous
Pre-Approved Operations
Discovery, maintenance, and management within existing script stacks. Checking plugin versions across all servers, auditing cache configurations, verifying site health. Low risk, high value of speed.
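A pre-approved check of this kind can be sketched as a read-only command builder: one `wp plugin list` per server, no mutations. The hostnames and site path below are hypothetical placeholders, not the real fleet.

```python
# Sketch of a pre-approved, read-only fleet check: build the SSH commands
# that list plugin versions on every known server.
SERVERS = ["gp-web-01", "gp-web-02", "gp-web-03"]  # hypothetical hostnames

def plugin_audit_commands(servers, site_path="/var/www/site/htdocs"):
    """Read-only discovery: one `wp plugin list` per server, no mutations."""
    return [
        f"ssh {host} \"wp plugin list --format=json --path={site_path}\""
        for host in servers
    ]

for cmd in plugin_audit_commands(SERVERS):
    print(cmd)
```

Because every command is discovery-only, the whole batch can sit in the autonomous tier without individual approval.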
Contextual
Context-Dependent Approval
Operations that touch production but where context determines risk. Restart PHP-FPM during a known remediation? Permitted. Restart it because the agent independently decides it might help? Requires approval. Same operation, different governance.
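The context-dependent tier can be sketched as a small gate: the same operation is auto-permitted inside a known remediation and escalated otherwise. The runbook names are illustrative, not the agent's real configuration.

```python
# Minimal sketch of context-dependent approval: an operation is only
# auto-permitted when it belongs to the active, pre-approved runbook.
KNOWN_REMEDIATIONS = {
    "php-fpm-timeout": {"restart_php_fpm", "clear_opcache"},
}

def authorize(operation, active_remediation=None):
    """Same operation, different governance: context decides the tier."""
    allowed = KNOWN_REMEDIATIONS.get(active_remediation, set())
    return "permitted" if operation in allowed else "needs_approval"

print(authorize("restart_php_fpm", "php-fpm-timeout"))  # permitted
print(authorize("restart_php_fpm"))                      # needs_approval
```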
Always Approval
Human Judgment Required
Slack posts (including after-action reports), irreversible changes, and novel interventions. These are operations where human judgment matters more than speed.
Enforcement
Strict Mode
Claude Code's built-in permission system provides the enforcement layer. Every operation that touches a remote server requires permission, with carved-out exceptions for pre-approved script categories. This isn't trust-based — it's a structural constraint.
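Claude Code reads permission rules from a `settings.json` file with `allow` and `deny` pattern lists. The patterns below are an illustrative sketch of the carve-out shape, not the agent's actual ruleset:

```json
{
  "permissions": {
    "allow": [
      "Bash(kubectl get:*)",
      "Bash(kubectl describe:*)"
    ],
    "deny": [
      "Bash(kubectl delete:*)"
    ]
  }
}
```

Read-only discovery is carved out as pre-approved; anything destructive is structurally blocked regardless of what the agent decides.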
Security
Credential Isolation
The agent doesn't hold credentials directly. SSH keys, API tokens, and IAM credentials live in system environment variables — the agent invokes tools that use them, but credentials never enter the agent's context or get committed to the repository.
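The isolation pattern can be sketched as a tool wrapper: the secret is resolved from the environment at invocation time and scrubbed from the output before anything returns to the agent's context. The variable name `GP_API_TOKEN` is hypothetical.

```python
import os
import subprocess

def call_tool(cmd):
    """Run a command; the credential never enters the agent's transcript."""
    token = os.environ["GP_API_TOKEN"]  # resolved here, not held by the agent
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    return out.replace(token, "[REDACTED]")  # scrub accidental echoes
```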
Memory Architecture
What makes this different from a chatbot with SSH access is persistence.
Working Directories
Every incident or task produces a structured working directory — summary, verified findings, after-action reports. When a new session starts, the agent reads the working directory and picks up exactly where the last session left off.
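Session resume reduces to a simple convention: read the working directory's summary before doing anything else. The file name `SUMMARY.md` is illustrative.

```python
from pathlib import Path
import tempfile

def resume(workdir: Path) -> str:
    """Read prior context if a previous session left any behind."""
    summary = workdir / "SUMMARY.md"
    return summary.read_text() if summary.exists() else "new incident: no prior context"

# Example: a prior session wrote its findings; the next session starts from them.
wd = Path(tempfile.mkdtemp())
(wd / "SUMMARY.md").write_text("verified: OOMKill on widget pod; fix pending rollout")
print(resume(wd))
```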
Self-Maintaining Knowledge Base
The agent updates its own documentation as it learns. New server IPs from the Lightsail API, configuration gotchas from operations, known failure modes and resolutions, EKS topology changes — all recorded automatically.
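The server-discovery part of this can be sketched as a diff between live state and the documented fleet: anything the API returns that the knowledge base lacks gets recorded. Instance names are hypothetical.

```python
# Sketch of the self-maintaining knowledge base: diff live discovery
# (e.g. the Lightsail API's instance list) against documented servers.
def undocumented_servers(discovered, known):
    """Return live instances the knowledge base doesn't know about yet."""
    return sorted(set(discovered) - set(known))

live = ["gp-web-01", "gp-web-02", "gp-web-09"]  # from the cloud API, say
kb   = ["gp-web-01", "gp-web-02"]               # currently documented
print(undocumented_servers(live, kb))           # ['gp-web-09']
```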
Verify Before Writing
Never write a finding based on assumed facts. Every claim gets verified against live infrastructure via SSH or kubectl before being documented. This was learned the hard way — early incident notes sometimes contained stale information.
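The rule can be sketched as a gate in front of the documentation step: a claim is written only after a live check agrees with it, otherwise the observed state is recorded instead. The `check` callable stands in for an SSH or kubectl probe and is an assumption of this sketch.

```python
def document(claim, expected, check):
    """Only write a finding once a live check confirms it."""
    observed = check()  # e.g. run `kubectl get ...` or an SSH probe live
    if observed == expected:
        return {"finding": claim, "verified": True, "observed": observed}
    return {"finding": f"CORRECTED: {claim}", "verified": False, "observed": observed}
```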
The compound effect
Session 1 might take thirty minutes to investigate a problem because the agent has to discover the server layout. Session 50 takes five minutes because the layout, the common failure modes, and the resolution patterns are all already in the knowledge base. The agent gets faster over time — not through model improvements, but through accumulated institutional knowledge.
A Typical Session
What one session actually looks like
Starting point: raw incident notes from a production issue, with explicit warnings that some facts were unverified. What the agent did, in sequence — with zero human intervention between tasks:
Verified every factual claim via SSH
Connected to relevant servers, checked configurations, confirmed actual state against reported state. Corrected errors in the incident notes before writing anything.
Drafted and posted internal AARs
After-action reports in Slack Block Kit format — structured, readable incident summaries ready for team consumption. Posted after human approval.
Fleet-wide security operation
Removal of a CVE-affected plugin across 24 installations on 9 servers. During this operation, discovered 4 previously undocumented servers via the AWS Lightsail API — added them to the knowledge base.
Fleet-wide cache plugin audit
Audited 17 sites across 6 servers. Checked plugin versions, configurations, and known compatibility issues.
Post-change health verification
Hit origin servers directly (bypassing Cloudflare) to confirm sites were actually serving correctly — not just returning cached responses.
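The origin check in the last step can be sketched with `curl --resolve`, which pins the hostname to the origin IP so the request bypasses Cloudflare while TLS still negotiates against the site's own certificate. The IP below is a documentation address, not a real origin.

```python
def origin_check_cmd(host, origin_ip, port=443):
    """Build a curl command that hits the origin directly, bypassing the CDN."""
    return (
        f"curl -sS -o /dev/null -w '%{{http_code}}' "
        f"--resolve {host}:{port}:{origin_ip} https://{host}/"
    )

print(origin_check_cmd("example.com", "203.0.113.7"))
```

A 200 from the origin proves the server itself is serving correctly, not merely that Cloudflare still has a warm cache.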
The Dual-Plane Problem
Most infrastructure agents operate on one layer. This one operates on two.
Without Dual-Plane Awareness
Three people. Multiple context switches. Hours.
Client reports slow loading. Check WordPress — looks fine. Escalate to Kubernetes team. They investigate the pod — find an OOMKill. Restart the pod. Mark resolved. Three people, multiple context switches, hours of elapsed time.
With Sysadmin Claude
One agent. One session. Continuous context.
Check WordPress (fine) → check the widget's pod on EKS (OOMKilled) → investigate memory pressure → identify root cause → remediate → verify the WordPress page loads correctly end-to-end. One agent, one session, continuous context.
The agent's knowledge base includes the mapping between WordPress sites and their corresponding EKS services. When it investigates a WordPress problem, it automatically checks the Kubernetes layer. When it investigates an EKS problem, it knows which WordPress sites are affected. This cross-layer awareness isn't intelligence — it's documentation. The agent maintains the relationship map and consults it during every investigation.
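The relationship map itself is mundane: a lookup from site to server and linked EKS services, expanded into checks on both planes. Every entry below (site, server, service names, PHP-FPM unit) is illustrative, not the real topology.

```python
# Sketch of the cross-layer map: each WordPress site links to its server
# and to any EKS services it embeds.
SITE_MAP = {
    "clientsite.example": {
        "server": "gp-web-03",
        "eks_services": ["widget-frontend", "widget-api"],
    },
}

def checks_for(site):
    """Both planes, one investigation: server first, then every linked service."""
    entry = SITE_MAP[site]
    server_check = f"ssh {entry['server']} 'systemctl is-active php8.1-fpm nginx'"
    return [server_check] + [
        f"kubectl get deploy {svc} -o wide" for svc in entry["eks_services"]
    ]
```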
Retrospective
What I'd do differently
Build the knowledge base structure from the start
The current knowledge base grew organically — information added as discovered, without a predetermined schema. A designed schema from day one would make retrieval faster and more reliable.
Implement IAM scoping earlier
Running under my credentials works for the current scope, but a dedicated IAM role with least-privilege permissions should have been the starting configuration, not a planned improvement.
Establish dual-plane awareness architecture earlier
The mapping between WordPress sites and EKS services was learned incrementally as problems surfaced. Mapping those relationships proactively — before the first cross-layer incident — would have caught issues faster.
Formalize the after-action report template sooner
The Block Kit format for Slack AARs evolved through trial and error. A consistent, structured template from session one would have made the historical record more uniform and more useful.
The Design Insight
AI can run infrastructure safely — if the governance is layered correctly.
The key insight isn't "AI can run infrastructure." It's that the safety comes from the same governance-as-architecture pattern that makes every other system I build reliable. Pre-approved operations for known patterns. Staged approval for novel decisions. Structural enforcement via permission systems. Persistent memory so institutional knowledge accumulates.
Governance is the safety layer
Not trust. Not promises. Structural constraints that prevent the agent from executing unapproved operations.
Memory compounds over time
Every session leaves the agent more informed. Institutional knowledge accumulates in the repo, not in a context window that compacts.
The operator loop shifts
From "alert → human investigation → human execution" to "alert → agent execution → human review." The human reviews documentation, not logs.
Technical Stack
Read the full story
The complete case study covers the evolution from disposable migration scripts to a governed production agent — including the governance model, memory architecture, and concrete session walkthroughs.