Infrastructure Governance

AI-Assisted AWS Governance.
Read-Only First. Document Everything.

An AWS account in continuous use for over a decade. Zero documentation. Multiple staff turnovers. Resources created ad-hoc and never decommissioned. Nobody knew what was in it, nobody had time to find out, and nobody could safely clean it up without understanding it first. An AI agent with read-only access broke the stalemate.

Part 3 of the Infrastructure Arc — sequel to Sysadmin Claude

12+ AWS services audited

5 Remediation phases

0 Production disruptions

3-layer Documentation model

The Short Version

You can't audit what isn't documented. You can't document without an audit.

The organization had an AWS account that had been built up over years by many different people, across multiple staff turnovers. Services were provisioned for client projects, developers came and went, and nobody maintained a central record. Every previous attempt to impose governance stalled at the same paradox: auditing required documentation that didn't exist, and creating documentation required the audit that kept getting deferred.

An AI agent broke the stalemate by treating the entire AWS account as a read-only discovery surface — enumerating every service, cross-referencing findings across service boundaries, and building the documentation as a byproduct of the audit itself. What would have taken weeks of manual console-clicking happened in a single session.

The governance framework for this engagement was written by Sysadmin Claude — the autonomous infrastructure agent from the previous case study. One AI system bootstrapping the next.

The Problem

The information asymmetry stalemate

The documentation gap was self-reinforcing. Without documentation, every investigation started from zero. You couldn't delegate to someone junior because there was no context to hand them. You couldn't prioritize because you didn't know what was there. The only people who could do the audit were the same people who didn't have time.

And the blast radius of guessing was high. In an undocumented environment, you can't safely delete things without understanding what they're connected to. A Lambda function that looks dead might be triggered by a CloudWatch Events rule you didn't know about. An S3 bucket that looks empty might be referenced by a running application.

The stalemate

The audit was too large for the team to do manually, too risky to do without understanding, and too undocumented to understand without doing the audit. The cost of getting it wrong — breaking a client's production service — far outweighed the cost of leaving the waste in place. So the waste stayed.

IAM access keys from years ago still active, belonging to people who no longer work there

S3 buckets without lifecycle policies — unbounded storage growth with no expiration

Stopped EC2 instances and unattached volumes still accruing costs

Lambda functions on end-of-life runtimes, still being invoked on schedule

Orphaned resources across every service: the residue of projects decommissioned at the application layer but never at the infrastructure layer

How It Works

Five-step audit methodology

The approach is not "AI replaces the sysadmin." It's closer to "AI acts as an extremely thorough, infinitely patient junior engineer who can query every AWS service simultaneously, cross-reference the results, and surface findings for a human to evaluate."

01

Broad sweep

Query all relevant AWS services in parallel: IAM, S3, EC2, Lambda, RDS, CloudWatch, Route 53, API Gateway, EKS, EFS, Cost Explorer. Build the raw inventory in minutes. Hours of console-clicking, compressed.
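
A minimal sketch of the parallel sweep, assuming Python and boto3 (the tooling actually used is not stated here). `broad_sweep` and `default_collectors` are hypothetical names, and the boto3 wiring shows only three services and the first page of each result; a full inventory would use paginators.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def broad_sweep(collectors: Dict[str, Callable[[], List[dict]]]) -> Dict[str, dict]:
    """Run every read-only collector in parallel and gather a raw inventory.

    A failing collector is recorded as an error instead of aborting the sweep,
    so one denied permission does not hide the other services.
    """
    inventory: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                inventory[name] = {"items": future.result(), "error": None}
            except Exception as exc:  # keep sweeping; surface the failure
                inventory[name] = {"items": [], "error": str(exc)}
    return inventory


def default_collectors() -> Dict[str, Callable[[], List[dict]]]:
    """Assumed boto3 wiring (hypothetical; needs boto3 and credentials)."""
    import boto3  # imported lazily so the sweep logic runs without it

    iam = boto3.client("iam")
    s3 = boto3.client("s3")
    lam = boto3.client("lambda")
    # Only list calls: nothing here can modify the account.
    return {
        "iam_users": lambda: iam.list_users()["Users"],
        "s3_buckets": lambda: s3.list_buckets()["Buckets"],
        "lambda_functions": lambda: lam.list_functions()["Functions"],
    }
```

Passing the collectors in as plain callables keeps the sweep testable without touching a real account, and makes it obvious which calls the agent is allowed to make.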

02

Cross-referencing

Map relationships across service boundaries. Which IAM keys belong to which services? Which Lambda functions are triggered by which API Gateways? Which S3 buckets are referenced by running applications? These cross-service correlations are where the real findings emerge.
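
One way to sketch the cross-referencing step in Python, assuming the sweep has already been reduced to simple name lists. The input shapes and both function names are illustrative assumptions, not the audit's actual data model.

```python
from typing import Dict, List, Set


def map_triggers(functions: List[str],
                 schedule_targets: Dict[str, List[str]],
                 api_integrations: Dict[str, List[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Build function -> {"schedules": {...}, "apis": {...}} from two inventories.

    schedule_targets: rule name -> function names the rule invokes
    api_integrations: API name  -> function names the API integrates with
    """
    graph = {fn: {"schedules": set(), "apis": set()} for fn in functions}
    for rule, targets in schedule_targets.items():
        for fn in targets:
            if fn in graph:
                graph[fn]["schedules"].add(rule)
    for api, targets in api_integrations.items():
        for fn in targets:
            if fn in graph:
                graph[fn]["apis"].add(api)
    return graph


def scheduled_without_api(graph: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Functions invoked on a schedule but fronted by no API."""
    return sorted(fn for fn, edges in graph.items()
                  if edges["schedules"] and not edges["apis"])
```

A function on this list may be a legitimate scheduled job, so the output is a queue of threads for a human to pull, not a verdict.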

03

Thread-pulling

Investigate anomalies. A Lambda function still being invoked every four minutes with no API Gateway to serve? A stopped EC2 instance with an attached EBS volume accumulating charges? Follow each thread to resolution — understanding what something is before deciding what to do about it.

04

Action with permission

All remediation requires explicit human approval. Read-only by default. The agent recommends; the human decides. No credentials rotated, no resources deleted, no configurations changed without authorization.
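
The approval rule can be sketched as a small gate in Python. This is an illustration under assumptions: `ApprovalGate`, the action ids, and the prefix check for read operations are all hypothetical, and a name-based check like this would complement IAM-level restrictions, not replace them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

READ_PREFIXES = ("list", "describe", "get")  # assumed naming convention


class ApprovalRequired(Exception):
    """Raised when a mutating action has no recorded human approval."""


@dataclass
class ApprovalGate:
    approved: Set[str] = field(default_factory=set)      # single-use action ids
    audit_trail: List[str] = field(default_factory=list)

    def approve(self, action_id: str) -> None:
        """Called by the human operator, never by the agent."""
        self.approved.add(action_id)

    def run(self, action_id: str, operation: str, call: Callable[[], object]):
        """Run reads freely; run writes only with a matching approval."""
        is_read = operation.lower().startswith(READ_PREFIXES)
        if not is_read:
            if action_id not in self.approved:
                self.audit_trail.append(f"BLOCKED {operation} ({action_id})")
                raise ApprovalRequired(f"{operation} needs approval: {action_id}")
            self.approved.discard(action_id)  # approval is per action, not standing
        self.audit_trail.append(f"RAN {operation} ({action_id})")
        return call()
```

Consuming the approval on use means a second deletion needs a second decision, which matches the per-action permission described above.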

05

Documentation as output

Every finding, every recommendation, every action taken becomes part of a structured documentation system. The audit produces the documentation that makes future governance possible. The gap that caused the stalemate is closed as a byproduct of doing the work.

First Session Findings

What the first audit uncovered

A single broad-sweep session across the full AWS account revealed the accumulated drift of years of ungoverned provisioning.

IAM Sprawl

Dozens of active access keys, many with no recent usage. Some belonged to employees who had left years ago. No rotation policy. No documentation of which keys served which purpose.
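
A sketch of how stale keys might be flagged once key metadata is collected (for example from IAM's `list_access_keys` and `get_access_key_last_used` calls). The record shape, the function name, and the 90-day threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def stale_access_keys(keys: List[dict], now: datetime,
                      max_idle_days: int = 90) -> List[dict]:
    """Return active keys that were never used or idle past max_idle_days.

    Each key dict is assumed to look like:
      {"user": str, "key_id": str, "status": "Active" | "Inactive",
       "last_used": datetime | None}
    """
    cutoff = now - timedelta(days=max_idle_days)
    findings = []
    for key in keys:
        if key["status"] != "Active":
            continue  # already disabled; not part of this finding
        last_used: Optional[datetime] = key.get("last_used")
        if last_used is None or last_used < cutoff:
            idle = None if last_used is None else (now - last_used).days
            findings.append({"user": key["user"],
                             "key_id": key["key_id"],
                             "idle_days": idle})
    return findings


# Usage with made-up records: pass a timezone-aware "now".
# stale_access_keys(records, datetime.now(timezone.utc))
```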

Storage Without Governance

The vast majority of S3 buckets had no lifecycle policies — data accumulating indefinitely with no expiration, no transition to cheaper storage tiers, and no visibility into what was critical versus disposable.
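
A hedged sketch of the lifecycle check in Python. It assumes an object that behaves like `boto3.client("s3")`, where `get_bucket_lifecycle_configuration` raises an error with code `NoSuchLifecycleConfiguration` for a bucket that has no rules; the function name is hypothetical.

```python
from typing import Iterable, List


def buckets_without_lifecycle(s3_client, bucket_names: Iterable[str]) -> List[str]:
    """List buckets for which S3 reports no lifecycle configuration.

    Any error other than "NoSuchLifecycleConfiguration" is re-raised, so an
    access problem is never mistaken for a finding.
    """
    missing = []
    for name in bucket_names:
        try:
            s3_client.get_bucket_lifecycle_configuration(Bucket=name)
        except Exception as exc:
            # botocore's ClientError carries the error code in exc.response
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code != "NoSuchLifecycleConfiguration":
                raise
            missing.append(name)
    return missing
```

The client is passed in so the check stays a pure read and can be exercised against a stub before it is pointed at a real account.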

Zombie Infrastructure

Stopped EC2 instances with attached EBS volumes. Unattached Elastic IPs. Dead Lambda functions still being invoked on schedule even though the services that consumed them had been deleted. Resources burning costs with no active purpose.
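
A small Python sketch of the zombie filter, assuming the inputs are the instance, volume, and address records returned by EC2's `describe_instances`, `describe_volumes`, and `describe_addresses` (with instances already flattened out of their reservations). `find_zombies` is a hypothetical name.

```python
def find_zombies(instances, volumes, addresses):
    """Flag resources that accrue cost with no active use."""
    stopped = [i["InstanceId"] for i in instances
               if i["State"]["Name"] == "stopped"]
    # A volume in state "available" is not attached to any instance.
    unattached = [v["VolumeId"] for v in volumes if v["State"] == "available"]
    # An Elastic IP with no AssociationId is allocated but not in use.
    idle_ips = [a["PublicIp"] for a in addresses if "AssociationId" not in a]
    return {"stopped_instances": stopped,
            "unattached_volumes": unattached,
            "unassociated_ips": idle_ips}
```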

The Warm Corpse

An application framework from years ago had a Lambda function being invoked every four minutes by a CloudWatch rule, keeping it "warm" for an API Gateway that had been deleted long ago. At that interval, that is roughly 360 invocations per day to maintain readiness for a service that no longer existed.

Indefinite Log Retention

Every CloudWatch log group set to retain data indefinitely. Large cluster logs growing uncapped with no retention policy, no archival strategy, and no one reviewing them.
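
A minimal Python sketch of the retention check. CloudWatch Logs leaves `retentionInDays` out of the `describe_log_groups` output for a group that never expires, so a missing key is the signal; the function name is an assumption.

```python
def groups_without_retention(log_groups):
    """Return (name, stored_bytes) for log groups that never expire.

    Sorted largest first, so the uncapped cluster logs surface at the top.
    """
    unbounded = [(g["logGroupName"], g.get("storedBytes", 0))
                 for g in log_groups if "retentionInDays" not in g]
    return sorted(unbounded, key=lambda pair: pair[1], reverse=True)
```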

Silent Failures

Automated processes that had been failing for months: snapshot schedulers running daily, erroring on every invocation, and logging the failures silently into the void. No alerts. No monitoring. No one knew.

The Governance Model

Three layers of documentation, each serving a different purpose

The AI agent doesn't replace judgment — it makes informed judgment possible. The governance model that emerges has three layers, each building on the last.

Audit Reports

The forensic layer

Point-in-time snapshots of infrastructure state. What exists, what's configured, what's anomalous. Establishes the baseline and makes future drift measurable.
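
To show what "measurable drift" could mean in practice, here is a sketch in Python that compares two snapshots. It assumes each snapshot is a plain mapping of resource id to configuration; the report format itself is not specified in this case study.

```python
def diff_snapshots(baseline: dict, current: dict) -> dict:
    """Compare two {resource_id: config} snapshots and report drift."""
    added = sorted(current.keys() - baseline.keys())
    removed = sorted(baseline.keys() - current.keys())
    changed = sorted(rid for rid in baseline.keys() & current.keys()
                     if baseline[rid] != current[rid])
    return {"added": added, "removed": removed, "changed": changed}
```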

Remediation Logs

The accountability layer

Every action taken, why it was taken, and what changed. Each entry includes the finding, the recommendation, the approval, and the verification. Structured templates that support after-action review.
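
A possible shape for one log entry, sketched in Python. The field names follow the four elements listed above (finding, recommendation, approval, verification) plus the action itself; the class and its methods are hypothetical, not the actual template.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RemediationEntry:
    finding: str
    recommendation: str
    approved_by: Optional[str] = None    # filled in by the human operator
    action_taken: Optional[str] = None
    verification: Optional[str] = None

    def is_complete(self) -> bool:
        """True only when every field, including verification, is recorded."""
        return all(value is not None for value in asdict(self).values())

    def to_json(self) -> str:
        """Serialize with stable key order so entries diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Treating an entry as incomplete until verification is recorded keeps "we changed it" from being logged as "it worked".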

Living Knowledge Base

The governance layer

A self-updating operational document. Service relationships, known configurations, resolved incidents, and governance rules that accumulate across sessions. Future audits start from a documented baseline instead of zero.

Read-only by default

The core operating principle, defined before the first query was ever run. The agent can query anything. It cannot modify, delete, or create any AWS resource without explicit per-action permission from the operator. Not a guardrail bolted on after the fact — the foundational constraint. The same governance-as-architecture pattern from every other engagement in this portfolio.

Agent Lineage

One AI system bootstrapping the next

The Sysadmin Claude agent — which had been operating the production infrastructure daily for months — wrote the initial governance framework for this audit system. It understood the environment, the constraints, and the risk model. It produced the operating instructions that define how the audit agent works.

An AI agent creating the operating instructions for the next AI engagement. That's not just a case study — it's a pattern for how AI-assisted organizations scale: each governed agent produces the governance artifacts for the next one.

Retrospective

What I'd do differently

Start the governance framework earlier

This audit would have been dramatically easier if any governance had existed from the start — even a basic inventory spreadsheet. The AI agent is solving a problem that didn't need to exist. For any new AWS account or inherited infrastructure, the first thing I'd establish is the living documentation pattern.

Build the cross-reference map before thread-pulling

Some anomalies took longer to resolve because the relationship mapping happened concurrently with investigation. Building the full dependency map first — which IAM roles connect to which services, which Lambda functions serve which API Gateways — would have made anomaly resolution faster.

Scope the IAM role from day one

The audit agent uses the operator's IAM credentials, which are broader than strictly necessary. A dedicated audit IAM role with read-only permissions plus scoped write access for approved remediation would have been cleaner from the start.
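
A rough sketch of what such a role's policy might look like, built as a Python dict so it can be generated per engagement. The action list is a partial, illustrative selection and `build_audit_policy` is a hypothetical helper; AWS's managed `ReadOnlyAccess` policy would be another starting point.

```python
import json

# Read-only actions for a few of the audited services (illustrative subset).
READ_ONLY_ACTIONS = [
    "iam:Get*", "iam:List*",
    "s3:GetBucket*", "s3:GetLifecycleConfiguration",
    "s3:ListAllMyBuckets", "s3:ListBucket",
    "ec2:Describe*",
    "lambda:Get*", "lambda:List*",
    "logs:Describe*",
]


def build_audit_policy(approved_writes=None):
    """Return an IAM policy document: broad reads, writes only where listed.

    approved_writes maps one action (e.g. "iam:UpdateAccessKey") to the list
    of resource ARNs it may touch. Nothing is writable by default.
    """
    statements = [{"Sid": "AuditRead", "Effect": "Allow",
                   "Action": READ_ONLY_ACTIONS, "Resource": "*"}]
    writes = sorted((approved_writes or {}).items())
    for index, (action, arns) in enumerate(writes):
        statements.append({"Sid": f"ApprovedWrite{index}", "Effect": "Allow",
                           "Action": action, "Resource": list(arns)})
    return {"Version": "2012-10-17", "Statement": statements}


# Example: print(json.dumps(build_audit_policy(), indent=2))
```

Scoping each write to named ARNs means an approved remediation cannot spill over to resources that were never discussed.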

The Design Insight

AI breaks the information asymmetry stalemate that makes infrastructure governance impossible for small teams.

Small teams defer governance because the cost of auditing undocumented infrastructure exceeds the available time. The documentation debt compounds. An AI agent changes the economics: it can enumerate, cross-reference, and document at a speed that makes the audit feasible in the time a small team actually has.

Read-only breaks the deadlock

Discovery and documentation happen at machine speed without risk. Governance becomes possible because understanding no longer requires a dedicated project.

Agent lineage compounds knowledge

Each engagement leaves the next one more informed. The governance framework written by one agent becomes the operating manual for the next. Institutional knowledge doesn't walk out the door.

Documentation is the product

The three-layer model produces the exact documentation that prevents future drift. The audit output is the governance infrastructure.

Technical Stack

Agent Platform: Claude Code (read-only strict mode with explicit approval gates)
AWS Services Audited: IAM, S3, EC2, Lambda, API Gateway, RDS, CloudWatch, CloudTrail, EKS, EFS, Lightsail
Governance: Read-only-by-default; human-in-the-loop approval for all modifications
Documentation Outputs: Audit reports, remediation logs, living knowledge base
Parent System: Sysadmin Claude (governance framework authored by predecessor agent)
Methodology: Broad sweep → cross-referencing → thread-pulling → permissioned action → documentation-as-output
Credential Management: System environment variables (credentials never in agent context)
Engagement Scope: Single AWS account, multi-service, full-lifecycle audit and remediation

The complete infrastructure arc

From crisis response through autonomous operations to governed infrastructure — three engagements that each build on the last.