Infrastructure Governance
AI-Assisted AWS Governance.
Read-Only First. Document Everything.
An AWS account in continuous use for over a decade. Zero documentation. Multiple staff turnovers. Resources created ad-hoc and never decommissioned. Nobody knew what was in it, nobody had time to find out, and nobody could safely clean it up without understanding it first. An AI agent with read-only access broke the stalemate.
Part 3 of the Infrastructure Arc — sequel to Sysadmin Claude
The Short Version
You can't audit what isn't documented. You can't document without an audit.
The organization had an AWS account accumulated over years by multiple people across multiple staff turnovers. Services were provisioned for client projects, developers came and went, and nobody maintained a central record. Every previous attempt to impose governance stalled at the same paradox: auditing required documentation that didn't exist, and creating documentation required the audit that kept getting deferred.
An AI agent broke the stalemate by treating the entire AWS account as a read-only discovery surface — enumerating every service, cross-referencing findings across service boundaries, and building the documentation as a byproduct of the audit itself. What would have taken weeks of manual console-clicking happened in a single session.
The governance framework for this engagement was written by Sysadmin Claude — the autonomous infrastructure agent from the previous case study. One AI system bootstrapping the next.
The Problem
The information asymmetry stalemate
The documentation gap was self-reinforcing. Without documentation, every investigation started from zero. You couldn't delegate to someone junior because there was no context to hand them. You couldn't prioritize because you didn't know what was there. The only people who could do the audit were the same people who didn't have time.
And the blast radius of guessing was high. In an undocumented environment, you can't safely delete things without understanding what they're connected to. A Lambda function that looks dead might be triggered by a CloudWatch Events rule you didn't know about. An S3 bucket that looks empty might be referenced by a running application.
The stalemate
The audit was too large for the team to do manually, too risky to do without understanding, and too undocumented to understand without doing the audit. The cost of getting it wrong — breaking a client's production service — far outweighed the cost of leaving the waste in place. So the waste stayed.
IAM access keys from years ago still active, belonging to people who no longer work there
S3 buckets without lifecycle policies — unbounded storage growth with no expiration
Stopped EC2 instances and unattached volumes still accruing costs
Lambda functions on end-of-life runtimes, still being invoked on schedule
Orphaned resources across every service: the residue of projects decommissioned at the application layer but never at the infrastructure layer
How It Works
Five-step audit methodology
The approach is not "AI replaces the sysadmin." It's closer to "AI acts as an extremely thorough, infinitely patient junior engineer who can query every AWS service simultaneously, cross-reference the results, and surface findings for a human to evaluate."
Broad sweep
Query all relevant AWS services in parallel: IAM, S3, EC2, Lambda, RDS, CloudWatch, Route 53, API Gateway, EKS, EFS, Cost Explorer. Build the raw inventory in minutes. Hours of console-clicking, compressed.
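A minimal sketch of the parallel sweep, assuming each service is wrapped in a read-only "collector" callable. The names `broad_sweep`, `_fake_s3`, and `_fake_rds` are hypothetical and stand in for real list/describe calls; this is an illustration of the fan-out pattern, not the agent's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

Collector = Callable[[], Any]


def broad_sweep(collectors: Dict[str, Collector],
                max_workers: int = 8) -> Dict[str, Dict[str, Any]]:
    """Run every read-only collector in parallel; record failures per service."""
    inventory: Dict[str, Dict[str, Any]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                inventory[name] = {"ok": True, "data": future.result()}
            except Exception as exc:
                # One failing service must not abort the rest of the sweep.
                inventory[name] = {"ok": False, "error": repr(exc)}
    return inventory


# Stand-in collectors (assumptions); real ones would wrap list/describe calls.
def _fake_s3():
    return ["assets-bucket", "logs-bucket"]


def _fake_rds():
    raise PermissionError("AccessDenied")


sweep = broad_sweep({"s3": _fake_s3, "rds": _fake_rds})
```

Recording errors per service rather than raising keeps a single permission gap from hiding the rest of the inventory, and the gap itself becomes a finding.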
Cross-referencing
Map relationships across service boundaries. Which IAM keys belong to which services? Which Lambda functions are triggered by which API Gateways? Which S3 buckets are referenced by running applications? These cross-service correlations are where the real findings emerge.
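One such correlation can be sketched as a join between three inventories. The input shapes here are simplified assumptions (a list of function ARNs, a rule-name-to-target-ARNs mapping, and the set of function ARNs that any API Gateway integration references), and `orphaned_scheduled_functions` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Set


def orphaned_scheduled_functions(
    function_arns: Iterable[str],
    rule_targets: Dict[str, List[str]],
    api_integration_arns: Set[str],
) -> Dict[str, List[str]]:
    """Functions that a schedule rule invokes but no API Gateway references."""
    triggers: Dict[str, List[str]] = {arn: [] for arn in function_arns}
    for rule_name, targets in rule_targets.items():
        for arn in targets:
            if arn in triggers:
                triggers[arn].append(rule_name)
    return {
        arn: sorted(rules)
        for arn, rules in triggers.items()
        if rules and arn not in api_integration_arns
    }


# Illustrative ARNs only; the account ID and names are placeholders.
LEGACY = "arn:aws:lambda:us-east-1:123456789012:function:legacy-api"
ORDERS = "arn:aws:lambda:us-east-1:123456789012:function:orders-api"

orphans = orphaned_scheduled_functions(
    function_arns=[LEGACY, ORDERS],
    rule_targets={"keep-warm-legacy": [LEGACY], "keep-warm-orders": [ORDERS]},
    api_integration_arns={ORDERS},
)
```

A function that appears in the result is a candidate for thread-pulling, not for deletion: it may still have a consumer the inventory did not capture.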
Thread-pulling
Investigate anomalies. A Lambda function still being invoked every four minutes with no API Gateway to serve? A stopped EC2 instance with an attached EBS volume accumulating charges? Follow each thread to resolution — understanding what something is before deciding what to do about it.
Action with permission
All remediation requires explicit human approval. Read-only by default. The agent recommends; the human decides. No credentials rotated, no resources deleted, no configurations changed without authorization.
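The permission rule can be sketched as a gate that every call passes through. This is an illustration under stated assumptions: classifying actions by the `Get`/`List`/`Describe` prefix is a simplification (it is not a complete definition of read-only in AWS), and `Approval`, `ApprovalRequired`, and `run_action` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Assumption: treat these operation prefixes as read-only.
READ_ONLY_PREFIXES = ("get", "list", "describe")


@dataclass(frozen=True)
class Approval:
    action: str       # exact action approved, e.g. "iam:DeleteAccessKey"
    target: str       # exact resource the approval covers
    approved_by: str  # the human operator who granted it


class ApprovalRequired(Exception):
    """Raised when a mutating action has no matching per-action approval."""


def run_action(action: str, target: str, call: Callable[[], Any],
               approval: Optional[Approval] = None) -> Any:
    operation = action.split(":", 1)[-1].lower()
    if operation.startswith(READ_ONLY_PREFIXES):
        return call()  # reads never need approval
    if approval is None or (approval.action, approval.target) != (action, target):
        raise ApprovalRequired(f"{action} on {target} needs explicit approval")
    return call()
```

Matching the approval against both the action and the target means an approval to delete one key cannot be reused to delete another.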
Documentation as output
Every finding, every recommendation, every action taken becomes part of a structured documentation system. The audit produces the documentation that makes future governance possible. The gap that caused the stalemate is closed as a byproduct of doing the work.
First Session Findings
What the first audit uncovered
A single broad-sweep session across the full AWS account revealed the accumulated drift of years of ungoverned provisioning.
IAM Sprawl
Dozens of active access keys, many with no recent usage. Some belonged to employees who had left years ago. No rotation policy. No documentation of which keys served which purpose.
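A sketch of how stale keys might be flagged once key metadata and last-used dates have been merged into one record per key. The merged record shape, the 90-day threshold, and the helper name `stale_keys` are assumptions for illustration; the field names follow those IAM returns for key metadata and last-used information.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def stale_keys(keys: List[Dict[str, Any]], now: datetime,
               max_idle_days: int = 90) -> List[Dict[str, Any]]:
    """Active keys unused (or never used) for longer than max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    findings = []
    for key in keys:
        if key["Status"] != "Active":
            continue
        last_used = key.get("LastUsedDate")  # missing when never used
        reference = last_used or key["CreateDate"]
        if reference < cutoff:
            findings.append({
                "UserName": key["UserName"],
                "AccessKeyId": key["AccessKeyId"],
                "reason": "never used" if last_used is None else "idle",
                "idle_days": (now - reference).days,
            })
    return sorted(findings, key=lambda f: f["idle_days"], reverse=True)


def _utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Placeholder users and key IDs.
sample_keys = [
    {"UserName": "alice", "AccessKeyId": "AKIAEXAMPLE1", "Status": "Active",
     "CreateDate": _utc(2019, 3, 1), "LastUsedDate": _utc(2020, 6, 1)},
    {"UserName": "bob", "AccessKeyId": "AKIAEXAMPLE2", "Status": "Active",
     "CreateDate": _utc(2018, 1, 15)},
    {"UserName": "carol", "AccessKeyId": "AKIAEXAMPLE3", "Status": "Active",
     "CreateDate": _utc(2023, 1, 1), "LastUsedDate": _utc(2023, 12, 20)},
    {"UserName": "dave", "AccessKeyId": "AKIAEXAMPLE4", "Status": "Inactive",
     "CreateDate": _utc(2017, 5, 5)},
]
key_findings = stale_keys(sample_keys, now=_utc(2024, 1, 1))
```

The output is a ranked list for a human to review; whether a key is safe to deactivate still depends on which service, if any, holds it.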
Storage Without Governance
The vast majority of S3 buckets had no lifecycle policies — data accumulating indefinitely with no expiration, no transition to cheaper storage tiers, and no visibility into what was critical versus disposable.
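A sketch of the coverage check, assuming the sweep has already produced a mapping from bucket name to its lifecycle rules (an empty list when the bucket has no configuration). The helper name `lifecycle_gaps` and the bucket names are hypothetical; the `Status` values follow the S3 lifecycle rule format.

```python
from typing import Any, Dict, List


def lifecycle_gaps(bucket_rules: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Buckets with no enabled lifecycle rule, plus overall coverage."""
    uncovered = sorted(
        name for name, rules in bucket_rules.items()
        # A bucket whose rules are all disabled is effectively ungoverned.
        if not any(rule.get("Status") == "Enabled" for rule in rules)
    )
    total = len(bucket_rules)
    coverage = 0.0 if total == 0 else round(100 * (total - len(uncovered)) / total, 1)
    return {"uncovered": uncovered, "coverage_pct": coverage}


gaps = lifecycle_gaps({
    "client-a-uploads": [],
    "client-b-exports": [{"ID": "old-rule", "Status": "Disabled"}],
    "build-artifacts": [{"ID": "expire-90d", "Status": "Enabled"}],
    "legacy-backups": [],
})
```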
Zombie Infrastructure
Stopped EC2 instances with attached EBS volumes. Unattached Elastic IPs. Dead Lambda functions still being invoked on schedule even though the services that consumed them had been deleted. Resources accruing costs with no active purpose.
The Warm Corpse
An application framework from years ago had a Lambda function being invoked every four minutes by a CloudWatch rule, keeping it "warm" for an API Gateway that had been deleted long ago. At that interval, roughly 360 invocations per day, more than 10,000 a month, to maintain readiness for a service that no longer existed.
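The arithmetic behind a warming schedule is easy to check. The sketch below parses a `rate(...)` schedule expression and converts it to invocations per day for one rule and one target; totals scale with the number of rules and functions involved. The helper name `invocations_per_day` is hypothetical, and cron-style expressions are deliberately not handled.

```python
import re

_UNIT_MINUTES = {"minute": 1, "hour": 60, "day": 1440}


def invocations_per_day(schedule_expression: str) -> float:
    """Daily invocations implied by a rate(...) schedule, per rule and target."""
    match = re.fullmatch(r"rate\((\d+) (minute|hour|day)s?\)",
                         schedule_expression.strip())
    if match is None:
        raise ValueError(f"not a rate expression: {schedule_expression!r}")
    value, unit = int(match.group(1)), match.group(2)
    if value == 0:
        raise ValueError("rate value must be positive")
    return 1440 / (value * _UNIT_MINUTES[unit])


daily = invocations_per_day("rate(4 minutes)")  # 1440 / 4 = 360.0
monthly = daily * 30                            # 10800.0 over a 30-day month
```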
Indefinite Log Retention
Every CloudWatch log group set to retain data indefinitely. Large cluster logs growing uncapped with no retention policy, no archival strategy, and no one reviewing them.
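A sketch of the retention check, assuming log group records shaped like the entries CloudWatch Logs returns when describing log groups, where `retentionInDays` is absent for groups that never expire. The helper name and the sample group names are hypothetical.

```python
from typing import Any, Dict, List


def groups_without_retention(log_groups: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Log groups that never expire, largest first."""
    unbounded = [g for g in log_groups if "retentionInDays" not in g]
    return sorted(unbounded, key=lambda g: g.get("storedBytes", 0), reverse=True)


sample_groups = [
    {"logGroupName": "/aws/lambda/legacy-api", "storedBytes": 5_000_000},
    {"logGroupName": "/aws/eks/cluster/cluster", "storedBytes": 900_000_000_000},
    {"logGroupName": "/aws/lambda/orders-api", "storedBytes": 2_000_000,
     "retentionInDays": 30},
]
unbounded_groups = groups_without_retention(sample_groups)
```

Sorting by stored bytes puts the groups where a retention policy would matter most at the top of the review list.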
Silent Failures
Automated processes that had been failing for months: snapshot schedulers running daily and erroring on every invocation, logging failures silently into the void. No alerts. No monitoring. No one knew.
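A sketch of how always-failing jobs might be surfaced from metrics already collected, assuming a per-function list of daily `(invocations, errors)` pairs such as sums of the Lambda `Invocations` and `Errors` metrics. The input shape, the seven-day minimum, and the helper name `always_failing` are assumptions.

```python
from typing import Dict, List, Tuple


def always_failing(daily_stats: Dict[str, List[Tuple[int, int]]],
                   min_days: int = 7) -> List[str]:
    """Functions that ran on at least min_days days and failed every time."""
    findings = []
    for name, days in daily_stats.items():
        active = [(inv, err) for inv, err in days if inv > 0]
        if len(active) >= min_days and all(err >= inv for inv, err in active):
            findings.append(name)
    return sorted(findings)


failing = always_failing({
    "snapshot-scheduler": [(1, 1)] * 30,           # fails on every daily run
    "report-job": [(24, 0)] * 29 + [(24, 1)],      # one isolated error
    "idle-fn": [(0, 0)] * 30,                      # never invoked
})
```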
The Governance Model
Three layers of documentation, each serving a different purpose
The AI agent doesn't replace judgment — it makes informed judgment possible. The governance model that emerges has three layers, each building on the last.
Audit Reports
The forensic layer
Point-in-time snapshots of infrastructure state. What exists, what's configured, what's anomalous. Establishes the baseline and makes future drift measurable.
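Measuring drift between two audit reports reduces to a keyed diff. The sketch below assumes each snapshot is a mapping from a resource identifier to its recorded configuration; the snapshot format, the identifiers, and the helper name `diff_snapshots` are hypothetical.

```python
from typing import Any, Dict, List


def diff_snapshots(baseline: Dict[str, Any],
                   current: Dict[str, Any]) -> Dict[str, List[str]]:
    """Resources added, removed, or changed since the baseline snapshot."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(k for k in shared if baseline[k] != current[k]),
    }


drift = diff_snapshots(
    baseline={
        "s3/client-a-uploads": {"lifecycle": False},
        "lambda/legacy-api": {"runtime": "nodejs8.10"},
        "ec2/i-0example": {"state": "stopped"},
    },
    current={
        "s3/client-a-uploads": {"lifecycle": True},
        "ec2/i-0example": {"state": "stopped"},
        "s3/new-bucket": {"lifecycle": False},
    },
)
```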
Remediation Logs
The accountability layer
Every action taken, why it was taken, and what changed. Each entry includes the finding, the recommendation, the approval, and the verification. Structured templates that support after-action review.
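One possible shape for such an entry, sketched as a frozen dataclass serialized to a JSON line. The field names mirror the four parts named above (finding, recommendation, approval, verification) plus the action itself, but the exact template used in the engagement is not shown in the source, so every field name and sample value here is an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RemediationEntry:
    finding: str          # what the audit observed
    recommendation: str   # what the agent proposed
    approved_by: str      # who authorized the change
    approved_at: str      # ISO 8601 timestamp of the approval
    action_taken: str     # what was actually changed
    verification: str     # how the result was confirmed


def to_log_line(entry: RemediationEntry) -> str:
    """Serialize an entry as one JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = RemediationEntry(
    finding="Schedule rule invokes a function with no remaining consumer",
    recommendation="Disable the schedule rule; keep the function for 30 days",
    approved_by="operator",
    approved_at="2024-01-01T12:00:00Z",
    action_taken="Disabled the schedule rule",
    verification="No invocations recorded over the following 24 hours",
)
line = to_log_line(entry)
```

Keeping entries immutable and append-only is a design choice that suits after-action review: the log records what was decided at the time, not a later revision of it.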
Living Knowledge Base
The governance layer
A self-updating operational document. Service relationships, known configurations, resolved incidents, and governance rules that accumulate across sessions. Future audits start from a documented baseline instead of zero.
Read-only by default
The core operating principle, defined before the first query was ever run. The agent can query anything. It cannot modify, delete, or create any AWS resource without explicit per-action permission from the operator. Not a guardrail bolted on after the fact — the foundational constraint. The same governance-as-architecture pattern from every other engagement in this portfolio.
Agent Lineage
One AI system bootstrapping the next
The Sysadmin Claude agent — which had been operating the production infrastructure daily for months — wrote the initial governance framework for this audit system. It understood the environment, the constraints, and the risk model. It produced the operating instructions that define how the audit agent works.
An AI agent creating the operating instructions for the next AI engagement. That's not just a case study — it's a pattern for how AI-assisted organizations scale: each governed agent produces the governance artifacts for the next one.
The Infrastructure Arc
Emergency Kubernetes Migration
The crisis — inherited undocumented infrastructure, dead control plane, expiring certs. Migrated 260 deployments to EKS solo. Zero extended downtime.
Sysadmin Claude — Autonomous Infrastructure Agent
The operations: built an autonomous agent to operate the resulting infrastructure daily. Eight months of production use. Wrote the governance framework for Part 3.
AI-Assisted AWS Infrastructure Governance
The audit — years of accumulated drift, documented and governed. Read-only-by-default. Documentation generated as a byproduct. You are here.
Retrospective
What I'd do differently
Start the governance framework earlier
This audit would have been dramatically easier if any governance had existed from the start — even a basic inventory spreadsheet. The AI agent is solving a problem that didn't need to exist. For any new AWS account or inherited infrastructure, the first thing I'd establish is the living documentation pattern.
Build the cross-reference map before thread-pulling
Some anomalies took longer to resolve because the relationship mapping happened concurrently with investigation. Building the full dependency map first — which IAM roles connect to which services, which Lambda functions serve which API Gateways — would have made anomaly resolution faster.
Scope the IAM role from day one
The audit agent uses the operator's IAM credentials, which are broader than strictly necessary. A dedicated audit IAM role with read-only permissions plus scoped write access for approved remediation would have been cleaner from the start.
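A sketch of how a candidate audit policy could be checked before it is attached, assuming a standard IAM policy document and treating only `Get`, `List`, and `Describe` operations as read-only. That prefix rule is a simplification, `NotAction` statements are not handled, and `write_capable_actions` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Assumption: operations with these prefixes are treated as read-only.
READ_PREFIXES = ("get", "list", "describe")


def write_capable_actions(policy: Dict[str, Any]) -> List[str]:
    """Allowed actions in a policy document that are not clearly read-only."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            operation = action.split(":", 1)[-1].lower()
            # Wildcards such as "*" or "s3:*" fall through and are flagged.
            if not operation.startswith(READ_PREFIXES):
                flagged.add(action)
    return sorted(flagged)


candidate_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Resource": "*",
         "Action": ["s3:GetObject", "s3:ListBucket", "ec2:Describe*"]},
        {"Effect": "Allow", "Resource": "*",
         "Action": ["logs:PutRetentionPolicy", "iam:DeleteAccessKey"]},
    ],
}
flagged_actions = write_capable_actions(candidate_policy)
```

Anything the check flags would belong in the separately scoped remediation permissions rather than in the audit role itself.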
The Design Insight
AI breaks the information asymmetry stalemate that makes infrastructure governance impossible for small teams.
Small teams defer governance because the cost of auditing undocumented infrastructure exceeds the available time. The documentation debt compounds. An AI agent changes the economics: it can enumerate, cross-reference, and document at a speed that makes the audit feasible in the time a small team actually has.
Read-only breaks the deadlock
Discovery and documentation happen at machine speed without risk. Governance becomes possible because understanding no longer requires a dedicated project.
Agent lineage compounds knowledge
Each engagement leaves the next one more informed. The governance framework written by one agent becomes the operating manual for the next. Institutional knowledge doesn't walk out the door.
Documentation is the product
The three-layer model produces the exact documentation that prevents future drift. The audit output is the governance infrastructure.
Technical Stack
The complete infrastructure arc
From crisis response through autonomous operations to governed infrastructure — three engagements that each build on the last.