Autonomous Agent

Sysadmin Claude.
Production Infrastructure on Autopilot.

An autonomous infrastructure operations agent that SSHes into production servers, runs kubectl against a live EKS cluster, executes fleet-wide operations, and gets faster every session — because the institutional knowledge it accumulates makes each session more informed than the last.

Sequel to Emergency Kubernetes Migration

9 production servers

~70 WordPress sites

150+ K8s deployments

8 months of daily production use

The Short Version

The migration solved the crisis. It didn't solve operations.

After migrating 260 production deployments to EKS solo, I had a new problem: operating the resulting infrastructure. A fleet of nine WordPress servers hosting approximately 70 sites. A Kubernetes cluster running 150+ deployments and 25+ CronJobs. Two infrastructure planes that interact in ways that aren't obvious until something breaks.

Rather than manually operating this indefinitely, I built an autonomous agent to do it. Sysadmin Claude is a production infrastructure operations agent built on Claude Code. It gets faster every session — not because the model improves, but because the institutional knowledge it accumulates in the repo makes each session more informed than the last.

This isn't a chatbot that answers questions about infrastructure. It's an operator that executes real work on production systems.

The Problem

Two infrastructure planes. One person.

The infrastructure spans two distinct planes that aren't independent — they interact in ways that make debugging require understanding both layers simultaneously.

The Server Plane

Nine AWS Lightsail servers managed through GridPane

Approximately 70 WordPress sites

Plugin management, database operations, nginx/PHP-FPM config

WP-CLI operations, Cloudflare integration, security maintenance

The Kubernetes Plane

EKS cluster running 150+ deployments and 25+ CronJobs

React widget platform, observability bot, client-facing services

Deployment patching, ECR image management, CronJob monitoring

OOMKill investigation, service debugging

The cross-layer problem

A WordPress site embeds a widget served from EKS. If that widget's pod gets OOMKilled, the WordPress page shows a broken embed. If PHP-FPM on the server starts timing out, the widget iframe never loads — and the symptom looks like a Kubernetes problem until you check the server layer. Debugging requires understanding both layers and their interactions.
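Triage for these cross-layer failures starts from a site-to-service map. A minimal sketch, assuming a hypothetical whitespace-separated mapping file (the real knowledge base format isn't shown in this writeup, and the site, namespace, and deployment names are placeholders):

```shell
# Hypothetical site -> EKS service map: site, namespace, deployment
cat > /tmp/site-map.txt <<'EOF'
example-client.com widgets quote-widget
another-site.com widgets booking-widget
EOF

site="example-client.com"
# Resolve which EKS deployment backs the widget embedded on this site
set -- $(awk -v s="$site" '$1 == s { print $2, $3 }' /tmp/site-map.txt)
ns=$1; deploy=$2

# The two probes a cross-layer investigation always pairs:
echo "kubectl -n ${ns} get pods -l app=${deploy}"   # Kubernetes plane
echo "ssh ${site} 'systemctl status php8.2-fpm'"    # server plane (PHP version assumed)
```

Keeping the lookup mechanical is the point: neither plane gets checked in isolation, because the map says which deployment and which server belong to the same symptom.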

The Evolution

From disposable scripts to governed agent

This didn't start as a designed system. It grew — and each phase expanded the blast radius, which demanded corresponding governance.

Mid-2025

Phase 1: Scripts and APIs

During the Kubernetes migration, small scripts hooked into AI APIs to automate repetitive work: YAML generation, TLS validation, DNS record creation. Disposable — built for the migration, used during the migration. But the patterns stuck.

Late 2025

Phase 2: Consolidation

The loose scripts consolidated into a single Claude Code instance. One agent with growing context about the infrastructure. The CLAUDE.md started accumulating operational knowledge — server IPs, known failure modes, configuration gotchas. The shift from scripts to agent required a shift from "just run it" to "what are the boundaries?"

Early 2026

Phase 3: Governed Agent

The current system. A standalone Claude Code instance built around governance that makes it completely portable. Explicit boundaries, layered authority, persistent memory, and a self-maintaining knowledge base. The governance didn't come from theory — it was refined through daily use on live infrastructure.

The Governance Model

SSH access to production is either bold or reckless, depending entirely on the governance around it.

Not all operations carry the same risk. The governance reflects that with layered authority — the same principle behind Pass@1.

Autonomous

Pre-Approved Operations

Discovery, maintenance, and management within existing script stacks. Checking plugin versions across all servers, auditing cache configurations, verifying site health. Low risk, high payoff from speed.
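A pre-approved fleet sweep is essentially a loop over the server list. This sketch prints the plan instead of executing it; the host aliases and site paths are hypothetical, and `wp plugin list` is the standard WP-CLI discovery command:

```shell
SERVERS="web1 web2 web3"   # hypothetical aliases; the real fleet is 9 servers
DRY_RUN=1                  # pre-approved read-only sweep, shown here as a plan

count=0
for host in $SERVERS; do
  # Report name/status/version for every plugin on every site on the server
  cmd="for site in \$(ls /var/www); do wp plugin list --path=/var/www/\${site}/htdocs --fields=name,status,version; done"
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "[plan] ssh ${host} \"${cmd}\""
  else
    ssh "$host" "$cmd"
  fi
  count=$((count + 1))
done
echo "servers audited: ${count}"
```

Because the sweep only reads state, it sits in the autonomous tier: the agent can run it across the whole fleet without asking.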

Contextual

Context-Dependent Approval

Operations that touch production but where context determines risk. Restart PHP-FPM during a known remediation? Permitted. Restart it because the agent independently decides it might help? Requires approval. Same operation, different governance.

Always Approval

Human Judgment Required

Slack posts (including after-action reports), irreversible changes, and novel interventions. These are operations where human judgment matters more than speed.

Enforcement

Strict Mode

Claude Code's built-in permission system provides the enforcement layer. Every operation that touches a remote server requires permission, with carved-out exceptions for pre-approved script categories. This isn't trust-based — it's a structural constraint.
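Structurally, the carve-outs live in Claude Code's permission rules. An illustrative sketch, assuming a settings file of this shape (the tool patterns and paths here are hypothetical, and exact rule syntax depends on the Claude Code version):

```json
{
  "permissions": {
    "allow": [
      "Bash(./scripts/audit/*)"
    ],
    "deny": [
      "Read(./.env)"
    ]
  }
}
```

Anything not matched by an allow rule falls through to a permission prompt, which is what makes the boundary structural rather than trust-based.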

Security

Credential Isolation

The agent doesn't hold credentials directly. SSH keys, API tokens, and IAM credentials live in system environment variables — the agent invokes tools that use them, but credentials never enter the agent's context or get committed to the repository.

Memory Architecture

What makes this different from a chatbot with SSH access is persistence.

Working Directories

Every incident or task produces a structured working directory — summary, verified findings, after-action reports. When a new session starts, the agent reads the working directory and picks up exactly where the last session left off.
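A working directory might look like this (the file names are illustrative; the writeup doesn't specify the exact layout):

```
incidents/2026-02-widget-outage/
  summary.md       # current state of play, read first on session start
  findings.md      # verified facts only
  aar-draft.md     # Slack Block Kit AAR, awaiting human approval
```

The summary file is what makes sessions resumable: the next session reads it before touching anything live.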

Self-Maintaining Knowledge Base

The agent updates its own documentation as it learns. New server IPs from the Lightsail API, configuration gotchas from operations, known failure modes and resolutions, EKS topology changes — all recorded automatically.

Verify Before Writing

Never write a finding based on assumed facts. Every claim gets verified against live infrastructure via SSH or kubectl before being documented. This was learned the hard way — early incident notes sometimes contained stale information.
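The rule can be mechanized as a small guard: a finding is only recorded if the claim matches a live probe. A sketch, with `echo` standing in for the real probe (in production this would be something like `ssh host 'php -v'` or a kubectl query):

```shell
# Refuse to record a finding unless the claimed value matches live output.
verify_claim() {
  claim="$1"; shift
  live="$("$@")"          # in production: an ssh/kubectl probe; here a stub
  if [ "$live" = "$claim" ]; then
    echo "VERIFIED: $claim"
  else
    echo "STALE: claimed '$claim', live state is '$live'"
    return 1
  fi
}

verify_claim "php8.2" echo "php8.1" || true     # stale incident note caught
result="$(verify_claim "php8.1" echo "php8.1")" # matches live state
echo "$result"
```

The nonzero exit on mismatch matters: a stale claim blocks the write instead of silently flowing into the documentation.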

The compound effect

Session 1 might take thirty minutes to investigate a problem because the agent has to discover the server layout. Session 50 takes five minutes because the layout, the common failure modes, and the resolution patterns are all already in the knowledge base. The agent gets faster over time — not through model improvements, but through accumulated institutional knowledge.

A Typical Session

What one session actually looks like

Starting point: raw incident notes from a production issue, with explicit warnings that some facts were unverified. What the agent did, in sequence, with human input only at the designated approval gates:

01

Verified every factual claim via SSH

Connected to relevant servers, checked configurations, confirmed actual state against reported state. Corrected errors in the incident notes before writing anything.

02

Drafted and posted internal AARs

After-action reports in Slack Block Kit format — structured, readable incident summaries ready for team consumption. Posted after human approval.

03

Fleet-wide security operation

Removal of a CVE-affected plugin across 24 installations on 9 servers. During this operation, discovered 4 previously undocumented servers via the AWS Lightsail API — added them to the knowledge base.

04

Fleet-wide cache plugin audit

Audited 17 sites across 6 servers. Checked plugin versions, configurations, and known compatibility issues.

05

Post-change health verification

Hit origin servers directly (bypassing Cloudflare) to confirm sites were actually serving correctly — not just returning cached responses.
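That last verification step relies on pinning DNS resolution to the origin IP so the request cannot be answered from Cloudflare's edge. A sketch of the probe (site and IP are placeholders; the command is built but not executed here):

```shell
site="example-client.com"    # placeholder site
origin_ip="203.0.113.10"     # placeholder Lightsail origin (TEST-NET-3 range)

# --resolve forces ${site}:443 to the origin, bypassing Cloudflare's proxy;
# -w '%{http_code}' reduces the whole check to a status code.
probe="curl -sS -o /dev/null -w %{http_code} --resolve ${site}:443:${origin_ip} https://${site}/"
echo "$probe"
# status="$($probe)"   # against real infrastructure, expect 200
```

A 200 from the origin proves the server itself is healthy, which a cached edge response cannot.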

The Dual-Plane Problem

Most infrastructure agents operate on one layer. This one operates on two.

Without Dual-Plane Awareness

Three people. Multiple context switches. Hours.

Client reports slow loading. Check WordPress — looks fine. Escalate to Kubernetes team. They investigate the pod — find an OOMKill. Restart the pod. Mark resolved. Three people, multiple context switches, hours of elapsed time.

With Sysadmin Claude

One agent. One session. Continuous context.

Check WordPress (fine) → check the widget's pod on EKS (OOMKilled) → investigate memory pressure → identify root cause → remediate → verify the WordPress page loads correctly end-to-end. One agent, one session, continuous context.
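The pivotal check in that chain is the second one. A sketch of the OOMKill detection, with a canned `kubectl describe pod` fragment standing in for the live call:

```shell
# Sample of the container status section from `kubectl describe pod`;
# in production this text would come from the live cluster.
describe='    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137'

finding=""
if printf '%s\n' "$describe" | grep -q 'OOMKilled'; then
  finding="widget pod OOMKilled (exit 137): raise memory limit, restart, re-verify the embed end-to-end"
fi
echo "$finding"
```

Exit code 137 (SIGKILL from the kernel OOM killer) plus the `OOMKilled` reason is the signature that turns "client reports slow loading" into a memory-pressure investigation on the Kubernetes plane.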

The agent's knowledge base includes the mapping between WordPress sites and their corresponding EKS services. When it investigates a WordPress problem, it automatically checks the Kubernetes layer. When it investigates an EKS problem, it knows which WordPress sites are affected. This cross-layer awareness isn't intelligence — it's documentation. The agent maintains the relationship map and consults it during every investigation.

Retrospective

What I'd do differently

Build the knowledge base structure from the start

The current knowledge base grew organically — information added as discovered, without a predetermined schema. A designed schema from day one would make retrieval faster and more reliable.

Implement IAM scoping earlier

Running under my credentials works for the current scope, but a dedicated IAM role with least-privilege permissions should have been the starting configuration, not a planned improvement.

Establish dual-plane awareness architecture earlier

The mapping between WordPress sites and EKS services was learned incrementally as problems surfaced. Mapping those relationships proactively — before the first cross-layer incident — would have caught issues faster.

Formalize the after-action report template sooner

The Block Kit format for Slack AARs evolved through trial and error. A consistent, structured template from session one would have made the historical record more uniform and more useful.

The Design Insight

AI can run infrastructure safely — if the governance is layered correctly.

The key insight isn't "AI can run infrastructure." It's that the safety comes from the same governance-as-architecture pattern that makes every other system I build reliable. Pre-approved operations for known patterns. Staged approval for novel decisions. Structural enforcement via permission systems. Persistent memory so institutional knowledge accumulates.

Governance is the safety layer

Not trust. Not promises. Structural constraints that prevent the agent from executing unapproved operations.

Memory compounds over time

Every session leaves the agent more informed. Institutional knowledge accumulates in the repo, not in a context window that compacts.

The operator loop shifts

From "alert → human investigation → human execution" to "alert → agent execution → human review." The human reviews documentation, not logs.

Technical Stack

Agent Platform: Claude Code (standalone instance with strict mode)
Server Management: SSH, WP-CLI, GridPane CLI, nginx/PHP-FPM configuration
Container Operations: kubectl, ECR, AWS CLI (EKS cluster management)
Cloud Infrastructure: AWS Lightsail (server fleet), AWS EKS (Kubernetes)
CDN / DNS: Cloudflare API integration
Communication: Slack Block Kit format for AARs (posted with approval)
Memory: File-based working directories, self-maintaining knowledge base, archived session records
Credential Management: System environment variables and env files (credentials never in agent context)
Governance: Claude Code strict mode with carved-out exceptions for pre-approved operations
Evolution Timeline: ~8 months (loose scripts mid-2025 → governed agent early 2026)
Infrastructure Scope: 9 servers, ~70 WordPress sites, 150+ K8s deployments, 25+ CronJobs

Read the full story

The complete case study covers the evolution from disposable migration scripts to a governed production agent — including the governance model, memory architecture, and concrete session walkthroughs.