The Costume Change Problem
A repo hit 11,000 stars in its first week by solving a real problem: Claude Code in one generic mode produces mediocre output.
Garry Tan’s gstack formalizes “modes” for Claude Code — slash commands that switch the AI between named roles. To name a few:
- A CEO lens for product decisions
- A staff engineer for paranoid code review
- A QA lead for testing
- An engineering manager for retrospectives
The core insight is correct and worth calling out directly: forcing the AI into an explicit role with explicit constraints produces better output than letting it be a generalist.
The browse tool — a persistent Chromium binary that gives Claude Code eyes on a running app — is genuine engineering, not a prompt trick. The sequential workflow discipline (plan → engineering review → build → code review → ship) is better than what most people do with AI, which is nothing. This is a meaningful step up from ad-hoc prompting.
I also noticed a structural limitation within minutes of reading it, and it’s the same one I’ve been building against for months.
All the hats, one head
Every mode in gstack runs inside the same context window.
The “paranoid staff engineer” reviewing your code is the same Claude instance that helped architect it. It already knows why every decision was made — which means it’s primed to find those decisions reasonable.
This is a self-review wearing a different costume.
I don’t mean that dismissively, because self-assessment checklists have real value: a pilot running a preflight checklist catches mistakes that muscle memory alone won’t, and that’s worth doing every time. But there’s a categorical difference between a checklist and an independent review, and that distinction matters more than it sounds.
When the reviewer already has the builder’s reasoning in context, it’s not an evaluation of the output; it’s pattern-matching against the justifications that produced it. The same mechanism that makes LLMs coherent — self-consistency — makes them structurally blind to their own errors when asked to self-review. You’re not getting a second opinion; you’re getting the first opinion wearing a different hat.
This is the same reason you don’t ask the person who wrote a PR to also approve it. A different pair of eyes catches what the author is blind to — not because the author is bad, but because familiarity breeds pattern blindness. AI doesn’t change this principle; if anything, it amplifies it — an LLM’s self-consistency is more deterministic than a human’s.
Parallelism is not independence
Gstack can also use Conductor to spin up ten parallel Claude Code sessions. That sounds like separation until you realize it’s a performance optimization, not an epistemic one: ten copies of the same model, launched the same way, share the same priors and the same blind spots.
Genuine review requires what I’ll call epistemic separation: different priors, no access to the rationalization chain that produced the artifact, and independently accumulated judgment about what builders consistently miss. Without that separation, you get confirmation with extra steps.
Each of gstack’s modes starts fresh every invocation — no accumulated lessons, no pattern library built from previous reviews. The “paranoid staff engineer” is equally paranoid about everything, every time. That’s thorough but undirected. A reviewer who doesn’t learn which mistakes this builder tends to make hasn’t read the codebase’s history.
For organizations where bugs have real consequences — compliance failures, donor trust violations, limited technical staff to recover from incidents — the difference between costume-change review and independent review is operational risk.
What genuine separation looks like
I built the answer to this problem months before gstack existed. I call it The Adversary.
It’s a separate Claude Code project in its own repo with its own governance files, its own accumulated lessons-learned corpus, and zero shared context with the building agent. It receives a read-only symlink to the target codebase and produces a structured review report. It doesn’t know what decisions were made or why. It sees output, not reasoning — which is exactly how real external review works.
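The handoff can be sketched in a few lines. Everything here is illustrative, not the actual Adversary setup: the directory layout, file names, and report format are hypothetical, and the point is only the shape of the boundary — code flows one way through a symlink, a report flows back, and nothing else crosses.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical layout: builder and reviewer are separate projects
# that share nothing except a one-way link and a report file.
root = Path(tempfile.mkdtemp())
builder_src = root / "builder" / "src"
adversary = root / "adversary"
builder_src.mkdir(parents=True)
(adversary / "reports").mkdir(parents=True)

# Something the builder produced (stand-in for a real codebase).
(builder_src / "app.py").write_text('print("hello")\n')

# The reviewer sees output, not reasoning: a symlink to the code,
# with no access to the builder's session or justification chain.
os.symlink(builder_src, adversary / "target")

# The structured report is the only artifact that flows back.
report = adversary / "reports" / "review-001.md"
report.write_text("## Findings\n- (none yet)\n")

print(sorted(p.name for p in (adversary / "target").iterdir()))
```

The enforcement here is conventional rather than mechanical — a real setup would also need to keep the reviewer from wandering outside the link — but the principle is the same: the boundary is defined by what the reviewer can reach, not by what it is told.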
I’d been building and reviewing this codebase for months: manual human review and agentic self-review, the whole time. The Adversary’s first pass found 102 issues. Ten critical. Security vulnerabilities had been hiding in plain sight, missed not because the builder was bad, but because independent review catches what self-review structurally cannot.
The architecture makes it work, not the prompt:
- **Separate context.** Different project, different memory, different governance documents. The builder’s reasoning chain doesn’t exist in The Adversary’s world.
- **Different priors.** The Adversary accumulates its own pattern library over time (“here’s what builders consistently miss”), which makes it sharper with each review. A stateless skill file can’t do this.
- **Structured handoff.** Artifacts move through a defined channel (symlinks and reports), not a shared session. The reviewer can’t be influenced by the builder’s justifications because it never sees them. This is the same principle that keeps financial auditors separate from the accounting department.
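The “different priors” point can be made concrete with a tiny persistence sketch. The file name, schema, and lesson strings below are all hypothetical, but they show the mechanism a stateless skill file lacks: a corpus that survives across sessions, so each review starts from what previous reviews taught.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical lessons-learned file living in the reviewer's own repo.
lessons_file = Path(tempfile.mkdtemp()) / "lessons.json"

def record_lesson(pattern: str, seen_in: str) -> None:
    """Append a recurring-mistake pattern to the reviewer's corpus."""
    lessons = json.loads(lessons_file.read_text()) if lessons_file.exists() else []
    lessons.append({"pattern": pattern, "seen_in": seen_in})
    lessons_file.write_text(json.dumps(lessons, indent=2))

def priors() -> list[str]:
    """What this reviewer has learned builders consistently miss."""
    if not lessons_file.exists():
        return []
    return [entry["pattern"] for entry in json.loads(lessons_file.read_text())]

# Two reviews later, the corpus is no longer empty:
record_lesson("unvalidated webhook payloads", "review-001")
record_lesson("secrets in client-side config", "review-002")
print(priors())
```

Feeding `priors()` into the next review prompt is what turns “equally paranoid about everything” into paranoia aimed at this builder’s actual habits.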
An honest limitation
This can’t be fully productized today. The architectural requirement — genuinely independent agents with separate memory, separate accumulated judgment, and separate lesson histories — requires human orchestration: someone who understands where the boundaries need to be and maintains them. The tooling will get there. The architecture won’t design itself.
Anyone can fork a repo of markdown files. The judgment behind “here’s where the boundaries need to be and why” is the part that requires experience to get right.
The methodology is the deliverable, not the CLI tool. And that distinction matters for understanding where gstack fits.
What your AI shouldn’t know
Gstack represents where most people are in their thinking about AI-assisted development: “I need structured roles for different tasks.” That’s correct and necessary. The workflow discipline, the browser tooling, the explicit-gear metaphor — all genuinely valuable. The fact that it’s open source and spreading is good for the ecosystem.
But the harder question isn’t “what persona should your AI wear?”
It’s “what should the AI not know when it evaluates this work?”
The full Adversary architecture and first-run findings: Your AI Builds the Code. Who Reviews It? The governance methodology: What Is Pass@1? and The Governance Documents. The thesis connecting it all: Governance Is Architecture.
Sources: gstack (GitHub), Garry Tan’s Claude Code skill files (MIT) · Large Language Models Hallucination: Comprehensive Survey (arXiv; self-consistency and self-review blind spots) · Google Engineering Practices: Code Review (Google)