Infrastructure Governance

AI-Assisted AWS Governance.
Read-Only First. Document Everything.

An AWS account in continuous use for over a decade. Zero documentation. Multiple staff turnovers. Resources created ad-hoc and never decommissioned. Nobody knew what was in it, nobody had time to find out, and nobody could safely clean it up without understanding it first. An AI agent with read-only access broke the stalemate.

Part 3 of the Infrastructure Arc — sequel to Sysadmin Claude

$954

Monthly savings realized

30.9%

Cost reduction from baseline

5

Working days elapsed

100+

AWS resources remediated

0

Production incidents

The Short Version

You can't audit what isn't documented. You can't document without an audit.

The organization had an AWS account accumulated over years by multiple people across multiple staff turnovers. Services were provisioned for client projects, developers came and went, and nobody maintained a central record. Every previous attempt to impose governance stalled at the same paradox: auditing required documentation that didn't exist, and creating documentation required the audit that kept getting deferred.

An AI agent broke the stalemate by treating the entire AWS account as a read-only discovery surface — enumerating every service, cross-referencing findings across service boundaries, and building the documentation as a byproduct of the audit itself. What would have taken weeks of manual console-clicking happened in a single session.

The governance framework for this engagement was written by Sysadmin Claude — the autonomous infrastructure agent from the previous case study. One AI system bootstrapping the next.

The Problem

The information asymmetry stalemate

The documentation gap was self-reinforcing. Without documentation, every investigation started from zero. You couldn't delegate to someone junior because there was no context to hand them. You couldn't prioritize because you didn't know what was there. The only people who could do the audit were the same people who didn't have time.

And the blast radius of guessing was high. In an undocumented environment, you can't safely delete things without understanding what they're connected to. A Lambda function that looks dead might be triggered by a CloudWatch Events rule you didn't know about. An S3 bucket that looks empty might be referenced by a running application.

The stalemate

The audit was too large for the team to do manually, too risky to do without understanding, and too undocumented to understand without doing the audit. The cost of getting it wrong — breaking a client's production service — far outweighed the cost of leaving the waste in place. So the waste stayed.

■ IAM access keys from years ago still active, belonging to people who no longer work there
■ S3 buckets without lifecycle policies: unbounded storage growth with no expiration
■ Stopped EC2 instances and unattached volumes still accruing costs
■ Lambda functions on end-of-life runtimes, still being invoked on schedule
■ Orphaned resources across every service: the residue of projects decommissioned at the application layer but never at the infrastructure layer

How It Works

Five-step audit methodology

The approach is not "AI replaces the sysadmin." It's closer to "AI acts as an extremely thorough, infinitely patient junior engineer who can query every AWS service simultaneously, cross-reference the results, and surface findings for a human to evaluate."

Broad sweep

Query all relevant AWS services in parallel: IAM, S3, EC2, Lambda, RDS, CloudWatch, Route 53, API Gateway, EKS, EFS, Cost Explorer. Build the raw inventory in minutes instead of hours of console-clicking.
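A minimal sketch of the parallel sweep, assuming each service is wrapped in a small read-only "collector" function. The collector names and the stubbed return values here are hypothetical; in practice each collector would wrap a read-only AWS API call. A failing collector is recorded rather than raised, so one denied permission does not abort the whole inventory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical collector signature: no arguments in, a list of resource
# records (dicts) for one service out.
Collector = Callable[[], List[dict]]


def broad_sweep(collectors: Dict[str, Collector], max_workers: int = 8) -> Dict[str, dict]:
    """Run every collector concurrently and gather results per service."""
    inventory: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                inventory[name] = {"resources": future.result(), "error": None}
            except Exception as exc:  # keep sweeping; surface the failure
                inventory[name] = {"resources": [], "error": repr(exc)}
    return inventory


# Stub collectors standing in for real read-only API calls.
def _list_volumes() -> List[dict]:
    return [{"VolumeId": "vol-0001", "State": "available"}]


def _list_functions() -> List[dict]:
    raise PermissionError("lambda:ListFunctions denied")


inventory = broad_sweep({"ec2_volumes": _list_volumes, "lambda": _list_functions})
```

Recording errors per service keeps gaps in the inventory visible, which matters when the inventory itself becomes documentation.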

Cross-referencing

Map relationships across service boundaries. Which IAM keys belong to which services? Which Lambda functions are triggered by which API Gateways? Which S3 buckets are referenced by running applications? These cross-service correlations are where the real findings emerge.
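One way to picture this step is a join across three inventories: Lambda functions, schedule rules, and API Gateway integrations. The sketch below assumes the data has already been collected and normalized; the function name, rule name, and ARNs are hypothetical. It relies on the fact that API Gateway Lambda integration URIs embed the function ARN followed by `/invocations`.

```python
from typing import Dict, Iterable, List


def cross_reference(
    function_arns: Iterable[str],
    schedule_targets: Dict[str, List[str]],
    api_integration_uris: Iterable[str],
) -> List[dict]:
    """For each Lambda function, list what schedules it and whether an API serves it."""
    uris = list(api_integration_uris)
    report = []
    for arn in function_arns:
        rules = sorted(rule for rule, targets in schedule_targets.items() if arn in targets)
        # Match the full ARN plus the suffix so "foo" does not match "foo-v2".
        has_api = any(f"{arn}/invocations" in uri for uri in uris)
        report.append({
            "function": arn,
            "scheduled_by": rules,
            "has_api_integration": has_api,
            # Scheduled but serving no API: worth a closer look, not proof of waste.
            "needs_review": bool(rules) and not has_api,
        })
    return report


# Hypothetical sample data.
functions = [
    "arn:aws:lambda:us-east-1:111122223333:function:orders-api",
    "arn:aws:lambda:us-east-1:111122223333:function:legacy-app",
]
schedules = {"keep-warm-every-4-min": [functions[1]]}
integrations = [
    "arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/"
    + functions[0]
    + "/invocations"
]
report = cross_reference(functions, schedules, integrations)
```

The output is a list of candidates for investigation. A function flagged `needs_review` may still have a legitimate scheduled purpose, so the flag feeds the next step rather than triggering deletion.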

Thread-pulling

Investigate anomalies. A Lambda function still being invoked every four minutes with no API Gateway to serve? A stopped EC2 instance with an attached EBS volume accumulating charges? Follow each thread to resolution — understanding what something is before deciding what to do about it.

Action with permission

All remediation requires explicit human approval. Read-only by default. The agent recommends; the human decides. No credentials rotated, no resources deleted, no configurations changed without authorization.
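The approval rule can be sketched as a small gate that every call passes through. This is an illustration under stated assumptions, not the engagement's actual implementation: `ApprovalGate`, `ApprovalRequired`, and the prefix-based read/write classification are hypothetical, and classifying by verb prefix is a simplification (some `Get` calls can expose sensitive data).

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: actions starting with these verbs are treated as read-only.
READ_ONLY_PREFIXES = ("Describe", "List", "Get")


class ApprovalRequired(Exception):
    """Raised when a mutating action is attempted without operator approval."""


@dataclass
class ApprovalGate:
    approved: set = field(default_factory=set)
    audit_trail: List[str] = field(default_factory=list)

    def approve(self, action: str, resource: str) -> None:
        """Record the operator's approval for one action on one resource."""
        self.approved.add((action, resource))
        self.audit_trail.append(f"APPROVED {action} {resource}")

    def run(self, action: str, resource: str, call: Callable[[], object]) -> object:
        is_read = action.startswith(READ_ONLY_PREFIXES)
        if not is_read and (action, resource) not in self.approved:
            self.audit_trail.append(f"BLOCKED {action} {resource}")
            raise ApprovalRequired(f"{action} on {resource} needs explicit approval")
        if not is_read:
            self.approved.discard((action, resource))  # approvals are single-use
        self.audit_trail.append(f"RAN {action} {resource}")
        return call()


gate = ApprovalGate()
gate.run("DescribeVolumes", "*", lambda: ["vol-0001"])  # reads pass through

blocked = False
try:
    gate.run("DeleteVolume", "vol-0001", lambda: "deleted")
except ApprovalRequired:
    blocked = True

gate.approve("DeleteVolume", "vol-0001")
result = gate.run("DeleteVolume", "vol-0001", lambda: "deleted")
```

Making approvals single-use and per-resource mirrors the per-action permission described above: approving one deletion never authorizes the next.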

Documentation as output

Every finding, every recommendation, every action taken becomes part of a structured documentation system. The audit produces the documentation that makes future governance possible. The gap that caused the stalemate is closed as a byproduct of doing the work.

First Session Findings

What the first audit uncovered

A single broad-sweep session across the full AWS account revealed the accumulated drift of years of ungoverned provisioning.

IAM Sprawl

Dozens of active access keys, many with no recent usage. Some belonged to employees who had left years ago. No rotation policy. No documentation of which keys served which purpose.
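A check like this can be expressed as a filter over key metadata. The sketch assumes records that merge the fields IAM reports for each key (`UserName`, `AccessKeyId`, `Status`) with its last-used date, where a missing `LastUsedDate` means the key was never used. The 90-day idle threshold, the user names, and the key IDs are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def stale_access_keys(keys: List[dict], now: datetime, max_idle_days: int = 90) -> List[dict]:
    """Return active keys that were never used or not used within max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    findings = []
    for key in keys:
        if key["Status"] != "Active":
            continue  # inactive keys are already disabled
        last_used: Optional[datetime] = key.get("LastUsedDate")
        if last_used is None or last_used < cutoff:
            findings.append({
                "UserName": key["UserName"],
                "AccessKeyId": key["AccessKeyId"],
                "idle_days": None if last_used is None else (now - last_used).days,
            })
    return findings


# Hypothetical sample data.
utc = timezone.utc
now = datetime(2025, 3, 13, tzinfo=utc)
keys = [
    {"UserName": "alice", "AccessKeyId": "AKIAEXAMPLE0001", "Status": "Active",
     "LastUsedDate": datetime(2025, 3, 1, tzinfo=utc)},
    {"UserName": "bob", "AccessKeyId": "AKIAEXAMPLE0002", "Status": "Active",
     "LastUsedDate": datetime(2021, 6, 1, tzinfo=utc)},
    {"UserName": "carol", "AccessKeyId": "AKIAEXAMPLE0003", "Status": "Active"},
    {"UserName": "dave", "AccessKeyId": "AKIAEXAMPLE0004", "Status": "Inactive"},
]
stale = stale_access_keys(keys, now)
```

The result is a review list for the operator. Whether a stale key is deactivated remains a human decision.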

Storage Without Governance

The vast majority of S3 buckets had no lifecycle policies — data accumulating indefinitely with no expiration, no transition to cheaper storage tiers, and no visibility into what was critical versus disposable.

Zombie Infrastructure

Stopped EC2 instances with attached EBS volumes. Unattached Elastic IPs. Dead Lambda functions still invoked on schedule even though the services that consumed them had been deleted. Resources burning costs with no active purpose.

The Warm Corpse

An application framework from years ago had a Lambda function being invoked every four minutes by a CloudWatch rule, keeping it "warm" for an API Gateway that had been deleted long ago. At that interval a single rule fires 360 times a day, more than 10,000 invocations a month, to maintain readiness for a service that no longer existed.

Indefinite Log Retention

Every CloudWatch log group set to retain data indefinitely. Large cluster logs growing uncapped with no retention policy, no archival strategy, and no one reviewing them.
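Finding these groups is a simple filter, sketched here over records shaped like CloudWatch Logs `DescribeLogGroups` output, where a group that never expires has no `retentionInDays` key. The log group names and sizes are hypothetical.

```python
from typing import List


def log_groups_without_retention(log_groups: List[dict]) -> List[dict]:
    """Return log groups that never expire, largest first."""
    unbounded = [g for g in log_groups if "retentionInDays" not in g]
    return sorted(unbounded, key=lambda g: g.get("storedBytes", 0), reverse=True)


# Hypothetical sample data.
groups = [
    {"logGroupName": "/aws/lambda/orders-api", "retentionInDays": 30, "storedBytes": 5_000},
    {"logGroupName": "/cluster/application", "storedBytes": 90_000_000_000},
    {"logGroupName": "/aws/lambda/legacy-app", "storedBytes": 1_200_000},
]
unbounded = log_groups_without_retention(groups)
```

Sorting by stored size puts the groups with the most cost impact at the top of the review list.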

Silent Failures

Automated processes that had been failing for months — snapshot schedulers running daily and erroring every invocation, logging failures silently into the void. No alerts. No monitoring. No one knew.

Cost Reduction Impact

From $3,091/mo to $2,137/mo in one working week

The audit didn't just produce documentation — it produced $954/mo in realized savings ($11,453/year) with an additional $174–290/mo still realizable. No downtime. No data loss. No angry stakeholders. The savings didn't come from one big thing. They came from everywhere at once — which is exactly why a human alone struggles to find them.

43%

Extended Support Fees — $410/mo

Two RDS instances were running PostgreSQL 12 and MySQL 5.7, both past end-of-life, each silently billing ~$200/mo in AWS Extended Support fees. The PostgreSQL instance was upgraded to PG16 via Blue/Green deployment. The MySQL instance turned out to be a 49-database WordPress graveyard with zero connections; it was snapshotted and stopped.

33%

Old Kubernetes Cluster — $313/mo

A superseded Kubernetes cluster was still running: 3 r5a.large worker nodes, a master node, a jobs node, and a network load balancer. EKS had replaced it months earlier, but the old nodes were still billing. Teardown required auditing DNS records, archiving 3 client websites, and confirming every workload had been migrated.

8%

Backup Over-Retention — $80/mo

PostgreSQL backup retention was set to 35 days on a 30 GB database. Reduced to 7 days — still well within any reasonable recovery need.

16%

Storage & Compute Waste — $151/mo

Orphaned EBS volumes. Unused Elastic IPs. Stopped EC2 instances still billing for storage. A 9-year-old ELK stack running idle. A 195 GB EFS volume left on Standard storage after its cluster was torn down. Dead CloudWatch log groups growing forever. None large individually — $151/mo together.
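The headline numbers and the category shares follow from the figures stated in this section. A short arithmetic check, using only those figures:

```python
baseline = 3091  # $/mo before the audit
current = 2137   # $/mo after the audit

# Monthly savings per category, as stated above.
categories = {
    "Extended Support Fees": 410,
    "Old Kubernetes Cluster": 313,
    "Backup Over-Retention": 80,
    "Storage & Compute Waste": 151,
}

savings = baseline - current                         # 954
reduction_pct = round(100 * savings / baseline, 1)   # 30.9
shares = {name: round(100 * amount / savings) for name, amount in categories.items()}
```

The four categories sum to the full $954/mo, and their rounded shares come to 43%, 33%, 8%, and 16%.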

Five-day timeline

Mar 13: CloudWatch retention, dead Lambdas, orphaned EIPs + EBS ($12.56/mo)

Mar 16: S3 lifecycle policies (25 buckets), IAM cleanup (25 users, 44 policies) ($10.72/mo)

Mar 17: 7 EC2 instances terminated, old cluster nodes, RDS backup retention ($181/mo)

Mar 18: Old cluster big-nodes (3× r5a.large), NLB, EFS tiering, ELK terminated ($337/mo)

Mar 19: PG12→16 upgrade (extended support eliminated), MySQL graveyard stopped ($410/mo)

The Governance Model

Three layers of documentation, each serving a different purpose

The AI agent doesn't replace judgment — it makes informed judgment possible. The governance model that emerges has three layers, each building on the last.

Audit Reports

The forensic layer

Point-in-time snapshots of infrastructure state. What exists, what's configured, what's anomalous. Establishes the baseline and makes future drift measurable.

Remediation Logs

The accountability layer

Every action taken, why it was taken, and what changed. Each entry includes the finding, the recommendation, the approval, and the verification. Structured templates that support after-action review.
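A structured entry of this kind might look like the following sketch. The `RemediationEntry` class and its field names are hypothetical, chosen to match the four elements named above. The sample uses the backup-retention change from this case study, with a placeholder date, approver, and verification note.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass(frozen=True)
class RemediationEntry:
    logged_on: date
    resource: str
    finding: str
    recommendation: str
    approved_by: str
    action_taken: str
    verification: str

    def to_json(self) -> str:
        """Serialize the entry for a log file or knowledge base."""
        record = asdict(self)
        record["logged_on"] = self.logged_on.isoformat()
        return json.dumps(record, indent=2)


entry = RemediationEntry(
    logged_on=date(2025, 1, 15),  # placeholder date
    resource="PostgreSQL RDS instance",
    finding="Backup retention set to 35 days on a 30 GB database",
    recommendation="Reduce backup retention to 7 days",
    approved_by="operator",  # placeholder approver
    action_taken="Backup retention reduced to 7 days",
    verification="Retention setting re-read after the change",  # placeholder note
)
serialized = entry.to_json()
```

Freezing the dataclass keeps entries immutable once written, which suits a log meant for after-action review.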

Living Knowledge Base

The governance layer

A self-updating operational document. Service relationships, known configurations, resolved incidents, and governance rules that accumulate across sessions. Future audits start from a documented baseline instead of zero.

Read-only by default

The core operating principle, defined before the first query was ever run. The agent can query anything. It cannot modify, delete, or create any AWS resource without explicit per-action permission from the operator. Not a guardrail bolted on after the fact — the foundational constraint. The same governance-as-architecture pattern from every other engagement in this portfolio.

Agent Lineage

One AI system bootstrapping the next

The Sysadmin Claude agent — which had been operating the production infrastructure daily for months — wrote the initial governance framework for this audit system. It understood the environment, the constraints, and the risk model. It produced the operating instructions that define how the audit agent works.

An AI agent creating the operating instructions for the next AI engagement. That's not just a case study — it's a pattern for how AI-assisted organizations scale: each governed agent produces the governance artifacts for the next one.

The Infrastructure Arc

Act 1

Emergency Kubernetes Migration

The crisis — inherited undocumented infrastructure, dead control plane, expiring certs. Migrated 260 deployments to EKS solo. Zero extended downtime.

Act 2

Sysadmin Claude — Autonomous Infrastructure Agent

The operations — built an autonomous agent to operate the resulting infrastructure daily. Eight months of production use. Wrote the governance framework for Act 3.

Act 3

AI-Assisted AWS Infrastructure Governance

The audit — years of accumulated drift, documented and governed. Read-only-by-default. Documentation generated as a byproduct. You are here.

Retrospective

What I'd do differently

Start the governance framework earlier

This audit would have been dramatically easier if any governance had existed from the start — even a basic inventory spreadsheet. The AI agent is solving a problem that didn't need to exist. For any new AWS account or inherited infrastructure, the first thing I'd establish is the living documentation pattern.

Build the cross-reference map before thread-pulling

Some anomalies took longer to resolve because the relationship mapping happened concurrently with investigation. Building the full dependency map first — which IAM roles connect to which services, which Lambda functions serve which API Gateways — would have made anomaly resolution faster.

Scope the IAM role from day one

The audit agent uses the operator's IAM credentials, which are broader than strictly necessary. A dedicated audit IAM role with read-only permissions plus scoped write access for approved remediation would have been cleaner from the start.
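The read-only half of such a role could start from a policy like the sketch below. The action list is illustrative rather than exhaustive (AWS also ships a managed `ReadOnlyAccess` policy), and the `non_read_actions` lint is a hypothetical helper that flags any allowed action whose verb is not a read verb. Scoped write access for approved remediation would be a separate, narrower statement.

```python
# Hypothetical read-only policy for a dedicated audit role.
AUDIT_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AuditReadOnly",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*",
                "iam:Get*",
                "iam:List*",
                "lambda:Get*",
                "lambda:List*",
                "logs:Describe*",
                "rds:Describe*",
                "s3:GetBucket*",
                "s3:GetLifecycleConfiguration",
                "s3:ListAllMyBuckets",
            ],
            "Resource": "*",
        }
    ],
}

# Assumption: verbs starting with these prefixes are treated as read-only.
READ_VERBS = ("Describe", "Get", "List")


def non_read_actions(policy: dict) -> list:
    """Return allowed actions whose verb does not start with a read prefix."""
    flagged = []
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            verb = action.split(":", 1)[-1]
            if not verb.startswith(READ_VERBS):
                flagged.append(action)
    return flagged


too_broad = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:Describe*", "ec2:TerminateInstances"], "Resource": "*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}
clean_result = non_read_actions(AUDIT_READ_POLICY)
broad_result = non_read_actions(too_broad)
```

A lint like this is a cheap regression check: if someone later widens the audit role, the flagged list stops being empty.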

The Design Insight

AI breaks the information asymmetry stalemate that makes infrastructure governance impossible for small teams.

A solo infrastructure operator managing 200+ EKS deployments, 11 Lightsail instances, 60 S3 buckets, 200+ ECR repos, and 70+ ACM certificates does not have time to audit everything. The urgent always displaces the important. An AI agent doesn't have competing priorities — it can spend 20 minutes tracing every security group rule, querying CloudWatch metrics over 90 days, and cross-referencing Kubernetes manifests to answer "is anything actually talking to this?"
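The "is anything actually talking to this?" question can be reduced to a check over a metric window. The sketch assumes datapoints shaped like CloudWatch `GetMetricStatistics` output, for example the daily `Maximum` of an RDS instance's `DatabaseConnections` metric over 90 days. The `looks_idle` helper is hypothetical, and it deliberately treats missing data as unknown rather than idle.

```python
from typing import List


def looks_idle(datapoints: List[dict], stat: str = "Maximum", min_points: int = 1) -> bool:
    """True only when there is data and every datapoint in the window is zero."""
    if len(datapoints) < min_points:
        return False  # no data is not evidence that nothing is connected
    # A datapoint missing the statistic also counts as "not proven idle".
    return all(point.get(stat) == 0 for point in datapoints)


# Hypothetical 90-day windows of daily datapoints.
quiet = [{"Maximum": 0.0} for _ in range(90)]
busy = [{"Maximum": 0.0} for _ in range(89)] + [{"Maximum": 3.0}]

quiet_is_idle = looks_idle(quiet, min_points=90)
busy_is_idle = looks_idle(busy, min_points=90)
empty_is_idle = looks_idle([], min_points=90)
```

A `True` result is one piece of evidence among several. In this engagement, metrics were combined with connection queries, DNS audits, and human approval before anything destructive happened.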

Bandwidth, not intelligence

The savings came from everywhere at once — a 15-minute investigation across 6 AWS consoles per finding, multiplied by 100+ resources. A human can do each one. No human has the bandwidth to do all of them.

Safety through verification

Zero production incidents, because every destructive action was preceded by CloudWatch metrics, connection queries, DNS audits, and explicit human approval. The human and the agent together were safer than either working alone.

Documentation as byproduct

17 cleanup session logs, a running cost tracker, a remediation roadmap. The institutional memory that prevents the next person — or the next AI session — from re-investigating resolved findings.

Technical Stack

Agent Platform: Claude Code (read-only strict mode with explicit approval gates)

AWS Services Audited: IAM, S3, EC2, Lambda, API Gateway, RDS, CloudWatch, CloudTrail, EKS, EFS, Lightsail

Governance: Read-only-by-default; human-in-the-loop approval for all modifications

Documentation Outputs: Audit reports, remediation logs, living knowledge base

Parent System: Sysadmin Claude (governance framework authored by predecessor agent)

Methodology: Broad sweep → cross-referencing → thread-pulling → permissioned action → documentation-as-output

Credential Management: System environment variables (credentials never in agent context)

Engagement Scope: Single AWS account, multi-service, full-lifecycle audit and remediation

