The Short Version
After migrating 260 production deployments to EKS solo, I had a new problem: operating the resulting infrastructure. A fleet of nine WordPress servers hosting approximately 70 sites. A Kubernetes cluster running 150+ deployments and 25+ CronJobs. Two infrastructure planes that interact in ways that aren't obvious until something breaks — a widget OOMKill on EKS breaks a WordPress page; a PHP-FPM misconfiguration surfaces as a widget timeout.
Rather than manually operating this indefinitely, I built an autonomous agent to do it. Sysadmin Claude is a production infrastructure operations agent built on Claude Code. It SSHes into servers, runs kubectl against a live EKS cluster, executes fleet-wide operations, posts after-action reports, and maintains its own knowledge base. It gets faster every session — not because the model improves, but because the institutional knowledge it accumulates in the repo makes each session more informed than the last.
The agent started as loose scripts during the K8s migration in mid-2025. Over eight months of daily production use, it evolved into a governed, portable Claude Code instance with layered authority, persistent memory, and dual-plane awareness. This isn't a chatbot that answers questions about infrastructure. It's an operator that executes real work on production systems.
The Problem
The Kubernetes migration solved the crisis. It didn't solve operations.
The infrastructure I was now responsible for spans two distinct planes:
The server plane: nine AWS Lightsail servers managed through GridPane, hosting approximately 70 WordPress sites. These require plugin management, database operations, nginx and PHP-FPM configuration, WP-CLI operations, Cloudflare integration, and ongoing security maintenance.
The Kubernetes plane: an EKS cluster running 150+ deployments and 25+ CronJobs. This hosts a React widget platform, an observability bot, and a suite of client-facing application services that embed into the WordPress sites. It requires deployment patching, ECR image management, CronJob monitoring, OOMKill investigation, and service debugging.
These planes aren't independent. A WordPress site embeds a widget served from the EKS cluster. If that widget's pod runs out of memory and gets killed, the WordPress page shows a broken embed. If PHP-FPM on the GridPane server starts timing out, the widget iframe never loads — and the symptom looks like a Kubernetes problem until you check the server layer. Debugging requires understanding both layers and their interactions.
I'm one person. I have a full client workload on top of infrastructure operations. The manual overhead of operating two infrastructure planes — SSHing into servers to check configs, running kubectl to investigate pod health, correlating symptoms across layers — was eating time I needed for everything else. And because the work is reactive (something breaks, you investigate, you fix), it's the worst kind of context switch for an ADHD brain: unpredictable, interruptive, and high-urgency.
I needed an agent that could handle the execution. Not a monitoring dashboard — those tell you something is wrong, they don't fix it. An actual operator that could SSH in, diagnose, execute, document, and move to the next task.
The Evolution
This didn't start as a designed system. It grew.
Phase 1: Scripts and APIs (Mid-2025). During the Kubernetes migration, I was using AI tools heavily — Claude, OpenAI API, KiloCode — for specific tasks. I built small scripts that hooked into these APIs to automate repetitive migration work: YAML generation, TLS validation, DNS record creation. These were disposable — built for the migration, used during the migration, forgotten after.
But some of the patterns stuck. The habit of writing a script to automate a task, then keeping that script for next time. The realization that an AI agent with the right context could execute infrastructure tasks faster than I could type the commands myself.
Phase 2: Consolidation (Late 2025). The loose scripts consolidated into a single Claude Code instance. Instead of separate tools for separate tasks, one agent with growing context about the infrastructure. The CLAUDE.md file started accumulating operational knowledge: server IPs, known failure modes, configuration gotchas, cluster topology.
This is where the governance started to matter. A script that runs one specific command has a narrow blast radius. An agent that can SSH into any server and run arbitrary commands has an enormous one. The shift from scripts to agent required a corresponding shift from "just run it" to "what are the boundaries?"
Phase 3: Governed Agent (Early 2026). The current system. A standalone Claude Code instance built around governance that makes it completely portable. The agent has explicit boundaries (it knows what it owns and what requires human approval), layered authority (pre-approved operations run autonomously, novel decisions get staged), persistent memory (every incident produces working artifacts the next session inherits), and a self-maintaining knowledge base (the agent updates its own documentation as it discovers new facts about the infrastructure).
The Governance Model
This is the section that matters most. Giving an AI agent SSH access to production servers is either bold or reckless depending entirely on the governance.
Layered Authority. Not all operations carry the same risk. The governance reflects that:
Autonomous execution (pre-approved): Discovery, maintenance, and management operations that are part of existing script stacks. The agent generates scripts for fleet-wide operations — checking plugin versions across all servers, auditing cache configurations, verifying site health — and executes them within pre-approved parameters. These are the operations where the risk is low and the value of speed is high.
Contextual approval: Operations that touch production but where the context determines the risk level. Restart PHP-FPM as part of a known remediation procedure that the agent is already executing? Permitted — it's part of the script stack. Restart PHP-FPM because the agent independently decides it might fix an unrelated problem? That requires staging and explicit approval. Same operation, different governance, because the risk profile is different.
Always requires approval: Slack posts (including after-action reports), irreversible changes, and novel interventions where the agent is proposing a solution rather than executing a known one. These are the operations where human judgment matters more than speed.
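The three tiers can be sketched as a small classification routine. This is an illustrative sketch, not the agent's actual implementation; the operation names and registries below are invented for the example:

```python
from enum import Enum

class Authority(Enum):
    AUTONOMOUS = "autonomous"        # pre-approved: run without asking
    CONTEXTUAL = "contextual"        # allowed only inside a known procedure
    REQUIRES_APPROVAL = "approval"   # always staged for a human

# Hypothetical registry of operations pre-approved for autonomous execution.
PRE_APPROVED = {"check_plugin_versions", "audit_cache_config", "verify_site_health"}
# Operations permitted only as a step of a known remediation procedure.
PROCEDURE_SCOPED = {"restart_php_fpm", "flush_object_cache"}

def classify(operation: str, in_known_procedure: bool = False) -> Authority:
    """Map an operation plus its context to a governance tier."""
    if operation in PRE_APPROVED:
        return Authority.AUTONOMOUS
    if operation in PROCEDURE_SCOPED:
        # Same operation, different governance: context sets the risk profile.
        return Authority.CONTEXTUAL if in_known_procedure else Authority.REQUIRES_APPROVAL
    # Novel or irreversible work is always staged for human review.
    return Authority.REQUIRES_APPROVAL
```

The point of the sketch is the middle branch: `restart_php_fpm` maps to different tiers depending on whether it arrives as part of a known procedure or as an independent decision.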
Strict Mode as Enforcement. Claude Code's built-in permission system provides the enforcement layer. The agent runs in strict mode — every operation that touches a remote server requires permission, with carved-out exceptions for pre-approved script categories. This isn't a trust-based system where the agent promises to ask before doing risky things. It's a structural constraint where the agent cannot execute unapproved operations.
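As a hedged sketch of what that enforcement can look like: Claude Code reads permission rules from a `settings.json` file with `allow` and `deny` lists of tool patterns. The patterns below are illustrative, not the agent's actual rule set:

```json
{
  "permissions": {
    "allow": [
      "Bash(kubectl get:*)",
      "Bash(kubectl describe:*)",
      "Bash(kubectl logs:*)"
    ],
    "deny": [
      "Bash(kubectl delete:*)",
      "Bash(rm:*)"
    ]
  }
}
```

Anything not matched by an `allow` rule falls through to an interactive permission prompt, which is what makes the constraint structural rather than trust-based.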
The Security Model. The agent doesn't hold credentials directly. SSH keys, API tokens, and IAM credentials live in system environment variables and env files — the agent invokes tools that use them, but the credentials never enter the agent's context or get committed to the repository. Currently, operations run under my IAM credentials. As automation expands into more sensitive areas, the plan is to scaffold a dedicated IAM role with scoped permissions — the agent only gets access to what it needs.
The Memory Architecture
What makes this agent different from a chatbot with SSH access is persistence.
Working Directories. Every incident or task produces a structured working directory: a summary of what happened, verified findings checked against live infrastructure, and after-action reports formatted for Slack posting in Block Kit format. When a session ends, the working directory is archived with a one-line index entry. When a new session starts, the agent reads the working directory and picks up exactly where the last session left off. No human re-briefing required. No context loss between sessions.
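The archive-and-index step can be sketched as follows. The directory layout and file names here are assumptions for illustration, not the agent's exact structure:

```python
from datetime import date
from pathlib import Path

def archive_working_dir(root: Path, slug: str, summary: str) -> Path:
    """Close out an incident working directory and index it.

    Hypothetical layout, following the structure described above:
      <root>/<date>-<slug>/summary.md   -- what happened
      <root>/<date>-<slug>/findings.md  -- claims verified against live infra
      <root>/INDEX.md                   -- one line per archived incident
    """
    workdir = root / f"{date.today().isoformat()}-{slug}"
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "summary.md").write_text(summary)
    # One-line index entry: the next session scans INDEX.md to find prior work.
    with (root / "INDEX.md").open("a") as index:
        index.write(f"- {workdir.name}: {summary.splitlines()[0]}\n")
    return workdir
```

The one-line index is what keeps session startup cheap: a new session reads `INDEX.md` first and only opens the working directories it actually needs.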
Self-Maintaining Knowledge Base. The agent updates its own documentation as it learns. New server IPs discovered via the Lightsail API get added to the server inventory. Configuration gotchas encountered during operations get documented for future sessions. Known failure modes and their resolutions get cataloged. EKS cluster topology changes are recorded.
This is the mechanism that makes the agent faster over time. Session 1 might take thirty minutes to investigate a problem because the agent has to discover the server layout. Session 50 takes five minutes because the layout, the common failure modes, and the resolution patterns are all already in the knowledge base.
Verify Before Writing. A discipline that the agent follows: never write a finding or report based on assumed facts. Every claim gets verified against live infrastructure before it's documented. This was learned the hard way — early in the agent's evolution, incident notes sometimes contained stale or incorrect information. The agent now treats all input as unverified and confirms via SSH or kubectl before committing anything to the record.
The example that demonstrates this: the agent received raw incident notes with an explicit warning that some facts might be wrong. Rather than writing a report based on the notes, it SSHed into the relevant servers, verified every factual claim, corrected the errors, and then drafted the after-action reports from verified data.
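The verify-before-write discipline amounts to a simple gate: nothing reaches the report unless its check passes. A minimal sketch with hypothetical claim objects; in practice each check would shell out over SSH or run kubectl, but here it is any zero-argument callable:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Claim:
    text: str                   # the factual claim from the raw incident notes
    check: Callable[[], bool]   # how to verify it against live infrastructure

def build_report(claims: list[Claim]) -> tuple[list[str], list[str]]:
    """Partition incoming claims into verified facts and rejected ones.

    Only the verified list is ever written into the record; rejected
    claims are flagged for correction rather than silently dropped.
    """
    verified, rejected = [], []
    for claim in claims:
        (verified if claim.check() else rejected).append(claim.text)
    return verified, rejected
```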
A Typical Session
To make this concrete, here's what one session looked like:
Starting point: Raw incident notes from a production issue, with explicit warnings that some facts were unverified.
What the agent did, in sequence:
First, it verified every factual claim via SSH before writing anything. Connected to relevant servers, checked configurations, confirmed actual state against the reported state. Corrected errors in the incident notes.
Then it drafted and posted internal AARs in Slack Block Kit format — structured, readable incident summaries ready for team consumption. (Posted after human approval.)
Next, it executed a fleet-wide security operation: removal of a CVE-affected plugin across 24 installations on 9 servers. During this operation, the agent discovered 4 previously undocumented servers via the AWS Lightsail API — servers that weren't in any existing inventory. It added them to the knowledge base.
It then conducted a fleet-wide cache plugin audit affecting 17 sites across 6 servers. Checked plugin versions, configurations, and known compatibility issues.
Finally, post-change health verification on every touched site. Hit origin servers directly (bypassing Cloudflare) to confirm the sites were actually serving correctly, not just returning cached responses.
All with zero human intervention between tasks — within the pre-approved automation boundaries. The agent moved from incident response through fleet-wide security remediation to post-change health verification in a single session, documenting everything as it went.
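The origin-bypass health check in the final step can be done with curl's `--resolve` flag, which pins the site's hostname to the origin IP so TLS and virtual-host routing still behave normally while Cloudflare is skipped entirely. A sketch that builds the command; the hostname and IP are placeholders:

```python
def origin_health_check_cmd(hostname: str, origin_ip: str) -> list[str]:
    """Build a curl command that hits the origin server directly.

    Illustrative sketch; the flags used are standard curl options.
    """
    return [
        "curl", "--silent", "--show-error",
        "--resolve", f"{hostname}:443:{origin_ip}",  # pin DNS to the origin
        "--output", "/dev/null",
        "--write-out", "%{http_code}",               # report only the HTTP status
        f"https://{hostname}/",
    ]
```

A 200 from the origin confirms the site is actually serving, not merely returning a cached edge response.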
The Dual-Plane Problem
Most infrastructure agents operate on one layer. This one operates on two and understands their interactions. Here's why that matters.
Scenario: A client reports their website is loading slowly. The WordPress admin panel shows the page loading fine. But the page contains an embedded widget served from the EKS cluster.
Without dual-plane awareness, the debugging path is: check WordPress (looks fine) → escalate to someone who knows Kubernetes → they investigate the pod → find an OOMKill → restart the pod → mark resolved. Three people, multiple context switches, hours of elapsed time.
With the sysadmin agent: check WordPress (fine) → check the widget's pod on EKS (OOMKilled) → investigate memory pressure → identify root cause → remediate → verify the WordPress page loads correctly end-to-end. One agent, one session, continuous context.
The agent's knowledge base includes the mapping between WordPress sites and their corresponding EKS services. When it investigates a WordPress problem, it automatically checks the Kubernetes layer for related issues. When it investigates an EKS problem, it knows which WordPress sites are affected. This cross-layer awareness isn't intelligence — it's documentation. The agent maintains the relationship map and consults it during every investigation.
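That relationship map is plain data plus two lookups. A minimal sketch; the site and service names are invented for illustration, and the real map lives in the knowledge base:

```python
# Hypothetical mapping between WordPress sites and the EKS services
# embedded in them.
SITE_TO_SERVICES = {
    "clientsite.example": ["booking-widget", "chat-widget"],
    "othersite.example": ["booking-widget"],
}

def related_services(site: str) -> list[str]:
    """EKS services to check when a WordPress site misbehaves."""
    return SITE_TO_SERVICES.get(site, [])

def affected_sites(service: str) -> list[str]:
    """WordPress sites to re-verify after remediating an EKS service."""
    return sorted(s for s, svcs in SITE_TO_SERVICES.items() if service in svcs)
```

Consulting this map at the start of every investigation is what turns "escalate to whoever knows the other layer" into a single continuous debugging path.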
What I'd Do Differently
I'd build the knowledge base structure from the start instead of letting it emerge. The current knowledge base grew organically — information was added as it was discovered, without a predetermined schema. This works, but it means the organizational structure is inconsistent. Some knowledge lives in CLAUDE.md, some in dedicated files, and some is embedded in working directory archives. A designed schema from day one would make retrieval faster and more reliable.
I'd implement IAM scoping earlier. Running under my credentials works for the current scope, but it's a known compromise. A dedicated IAM role with least-privilege permissions should have been the starting configuration, not a planned improvement. The governance is sound at the application layer (Claude Code's permission system), but the infrastructure layer should match.
I'd establish the dual-plane awareness architecture earlier. The mapping between WordPress sites and EKS services was learned incrementally as problems surfaced. Mapping those relationships proactively — before the first cross-layer incident — would have caught issues faster and reduced initial investigation times.
I'd formalize the after-action report template sooner. The Block Kit format for Slack AARs evolved through trial and error. Having a consistent, structured template from session one would have made the historical record more uniform and more useful for pattern analysis.
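A minimal version of such a template, using the standard Block Kit `header` and `section` block types; the field set and layout here are illustrative, not the exact format the agent converged on:

```python
def aar_blocks(title: str, impact: str, root_cause: str, remediation: str) -> list[dict]:
    """Build a Slack Block Kit payload for a minimal after-action report."""
    return [
        {"type": "header", "text": {"type": "plain_text", "text": title}},
        {"type": "section", "fields": [
            {"type": "mrkdwn", "text": f"*Impact:*\n{impact}"},
            {"type": "mrkdwn", "text": f"*Root cause:*\n{root_cause}"},
        ]},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": f"*Remediation:*\n{remediation}"}},
    ]
```

Fixing a field set like this early is what makes the historical record queryable: every AAR answers the same questions in the same places.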
The Design Insight
The conventional response to infrastructure complexity is monitoring — dashboards, alerts, PagerDuty rotations. These systems tell you something is wrong. They don't fix it. They create a notification that a human has to triage, investigate, diagnose, and resolve. For a sole operator managing two infrastructure planes, the monitoring-to-human pipeline is the bottleneck.
Sysadmin Claude doesn't replace monitoring. It replaces the human execution that follows monitoring. The alert fires, the agent investigates, the agent executes the remediation, the agent documents what it found and what it did. The human reviews the documentation and approves the Slack post. The operational loop goes from "alert → human investigation → human execution → human documentation" to "alert → agent investigation → agent execution → agent documentation → human review."
The key insight isn't "AI can run infrastructure." It's "AI can run infrastructure safely if the governance is layered correctly." Pre-approved operations for known patterns. Staged approval for novel decisions. Structural enforcement via permission systems. Persistent memory so institutional knowledge accumulates. Self-correcting inputs so the agent doesn't act on bad information.
This took eight months of production refinement to get right. The scripts-to-agent evolution wasn't planned — it was driven by operational pressure. Each phase added governance because each phase expanded the blast radius. That incremental, pressure-tested development is why the governance model works: it wasn't designed theoretically, it was refined through daily use on live infrastructure.
Technical Stack