Auto-generated specs aren't reliable until you've signed off on them


Auto-extracted specs read the code, not the decisions behind it. They quietly promote placeholders and fallbacks as if settled. Sign-off is the missing step.

product-behavior-contract · ai-coding · product-development · developer-experience

TL;DR: Auto-extracted specs read the code, not the decisions behind it. They silently promote placeholders, fallbacks, and exploration scaffolding to product canon — adding a second layer of unreviewed commitments on top of the silent ones agents already leave in your codebase. A spec only becomes real when a human looks at each candidate behavior and assigns it a trust level.

I tried letting an AI agent generate product specs from my codebase. The pitch is honest — point an agent at the code, get back what the product does. Skip the spec-writing.

I wasn't trying to take a shortcut. I was trying to fix a problem I already had.

If you're running agents against production code and hitting the same gap, the short version is on Stewie for Engineering Teams.

The silent decisions piling up

About three months ago, I started noticing something across every coding session with frontier agents — Codex CLI, Claude Code, the usual stack. The agent would do good work, ship code, and then in the final session report, mention in passing the assumptions it had made and the defaults it had filled in.

If I missed that report — and I often did, because the session was over, the tests passed, and I had a meeting — those assumptions became silent decisions in the codebase.

The other half of the pattern was fallbacks. The agent would default to adding fallback code unless I explicitly told it not to. Sometimes a fallback is exactly what you want. More often, for me, a fallback hides a failure that should fail loud — and now you have a silently-degraded behavior that nobody decided to ship. I'd find myself in long back-and-forth threads with the agent: don't add a fallback, just let it throw, I need to see the error.
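Here's the shape of that pattern as a generic TypeScript sketch — not code from my codebase; the rate-service client and function names are invented for illustration:

```typescript
// Hypothetical rate-service client, declared so the sketch type-checks.
declare const rateApi: { fetch(currency: string): Promise<number> };

// The shape agents default to: a fallback that silently masks failure.
async function getExchangeRate(currency: string): Promise<number> {
  try {
    return await rateApi.fetch(currency);
  } catch {
    return 1.0; // a default nobody decided, and nothing ever surfaces it
  }
}

// The shape I usually want: no catch, so a broken dependency fails loud.
async function getExchangeRateStrict(currency: string): Promise<number> {
  return rateApi.fetch(currency);
}
```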

I was accumulating commitments I hadn't reviewed. Auto-filled defaults. Silent fallbacks. Edge-case handling I didn't ask for. They were all technically correct, and they were all things future-me would have to either honor or rip out.

A lot of these silent fills happen for a real reason: at the start of a feature, neither the agent nor I have full context. I haven't decided how the new billing tier interacts with the entitlement model. I haven't worked out what "pro" means versus "team." The agent picks a reasonable default — a sensible-looking fallback, a permissive guard, an exception swallowed quietly — and keeps moving. That's not unreasonable. Early-stage products genuinely don't have all the answers at the start, and pretending they do is its own failure mode.

The problem isn't the missing context. The problem is silently filling in the missing context, instead of marking the gap as a gap.
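Marking the gap can be as blunt as this — again a sketch, with the error class and function invented for illustration:

```typescript
// Mark the undecided thing as undecided, instead of filling it in.
class UndecidedError extends Error {}

function refundWindowHours(): number {
  // An agent would type `return 24;` here and keep moving.
  throw new UndecidedError("refund window not decided yet -- see billing thread");
}
```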

So I started asking the agent to generate a rough product spec from the code. Not as documentation. As a way to see what was in there.

What I got back

The output looked great. Clean structure, consistent sections, every feature catalogued. I would have approved it. I should have approved it.

I didn't, because something felt wrong, and I couldn't articulate it for a week.

The bandwidth problem hides the real problem

AI agents generate specs at the same speed they generate code — faster than I can think. I had meetings. I had interns to mentor. I had my cofounder pulling me into discussions about pricing and positioning. The agent could produce a 40-page product spec in the time it took me to drink coffee.

So when I needed to "review" the spec, the realistic option was: skim, vibe-check, accept.

I did this for a while. Eventually I asked my cofounder to review some for me. She had her own tower of work. The spec sat in her queue. I shipped against it anyway.

For a stretch, this felt like the modern way to work. Agents move fast, humans approve in bulk, productivity unlocked. But the specs that looked right kept generating a low-grade unease I couldn't shake.

The second silent layer

Eventually I figured out what was happening. The auto-extracted spec hadn't fixed the silent-decision problem. It had created a second layer of it.

The Stripe integration the agent extracted as a "core product behavior" was a placeholder — I was actively evaluating Polar. Half the auth surface marked as a "feature" was scaffolding I'd left in while I figured out the real permission model. A 24-hour refund window the agent captured as deliberate policy was a number I'd typed in to keep moving. Multiple fallbacks I'd never explicitly approved had been promoted to "behaviors."

These weren't bugs. The code worked. The agent's reading of the code was technically correct.

What the agent could not see — what no auto-extractor can see — is the difference between what I'd implemented and what I'd decided. Code captures the first. The second lives in three different conversations, a notebook, and a private intuition I haven't even written down yet.

The auto-spec was doing exactly what the agent that wrote the code was doing: silently promoting the current state to canon, with no marker telling me which parts I'd actually agreed to.

A spec is an agreement, not a description

The reframe that unlocked it for me: a real spec isn't a description of the system. It's a record of what the team has agreed is settled.

Auto-extraction can produce the first. It cannot produce the second — because "what's settled" lives in the gap between the code and the people who built it. The implicit decisions. The exploration paths still open. The fallbacks that snuck in. None of that shows up in the AST.

Four kinds of "spec":

| Source | What it captures | What it misses |
| --- | --- | --- |
| The codebase | What exists right now | Whether it was intended |
| The PRD | Initial strategy | Whether strategy survived contact with users |
| An auto-extracted spec | A guess at intent from code patterns | Which behaviors are settled vs. provisional vs. accidental |
| A reviewed behavior spec | What the team has confirmed, and at what trust level | (this is the missing artifact) |

The third row is what most "AI generates your specs" tools deliver today. Useful as raw material. Dangerous if used as truth.

The decision is the spec

The thing that makes a spec real is the act of a human looking at a candidate behavior and saying "yes, that one's settled," or "no, that's still open," or "that's a placeholder, don't build on it."

That act is not bureaucracy. It's the entire epistemic move that turns a code pattern into a product decision. Without it, you have a document that looks like a contract but binds nobody — including future-you.

This is why behavior specs (the PBC format) carry trust levels alongside the rules themselves. A behavior marked trusted is something the team has explicitly looked at and confirmed. A behavior marked provisional is "we're working with this for now, don't refactor on top of it." A behavior marked candidate is "the agent found this, no human has weighed in yet."
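The real schema lives in the pbc-spec repo; as a rough sketch of the idea only — these field names are illustrative, not the actual PBC format:

```typescript
// Illustrative only -- not the actual PBC schema; see the pbc-spec repo
// for the real format. This just models the trust-level idea.
type TrustLevel = "candidate" | "provisional" | "trusted";

interface Behavior {
  id: string;
  rule: string;        // the behavior, stated as a checkable claim
  trust: TrustLevel;   // the field no auto-extractor can fill in
  source: "extracted" | "authored";
  reviewedBy?: string; // present only once a human has weighed in
}

const refundWindow: Behavior = {
  id: "billing.refund-window",
  rule: "Refund requests are honored within 24 hours of purchase.",
  trust: "candidate",  // the agent found the constant; nobody confirmed it
  source: "extracted",
};
```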

The trust level is the part the codebase cannot generate.

An auto-extractor can populate candidates all day. The product decision happens when a human sits with one of them and changes its trust level — or rejects it entirely.
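In code terms — continuing the illustrative types above — the sign-off is a one-line state transition, which is exactly why it's easy to skip and exactly why it matters:

```typescript
// The sign-off itself: a human promotes the behavior or rejects it.
// A tiny function, but it's the only place intent enters the record.
function review(
  behavior: Behavior,
  decision: TrustLevel | "rejected",
  reviewer: string,
): Behavior | null {
  if (decision === "rejected") return null; // drop it from the spec
  return { ...behavior, trust: decision, reviewedBy: reviewer };
}

// Promoting the extracted refund window after actually deciding it:
const settled = review(refundWindow, "trusted", "vinh");
```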

That step is small. It's also the only step that produces something real.

Code is what is. The sign-off is what's been decided.

I kept the auto-extractor in my workflow. It's a useful starting point. It surfaces behaviors I would not have remembered to write down — including the silent assumptions and fallbacks I never explicitly approved. That's its real value: not as a spec, but as a surfacing tool for what's already in the code, awaiting review.

But the spec it produces is not the spec. The spec is what's left after I — or someone on the team — has looked at each candidate behavior and either confirmed it, flagged it as still open, or marked it as scaffolding to be replaced.

Without that step, an auto-generated spec is a confident-sounding hypothesis dressed as a contract. It will lie to your future self, your teammates, and your AI agents — not because the extractor is bad, but because nobody asked the question only a human can answer:

Did we actually decide this, or did the code just end up here?


The PBC format is open-source at github.com/stewie-sh/pbc-spec. If you're using auto-extraction tools and the resulting specs feel almost-right-but-not-quite, it's probably this gap.


Written by Vinh Nguyen — founder of Stewie, building the workspace for capturing product behavior contracts.


Stewie reads your codebase and helps you author a living product behavior spec. We're onboarding a small group of product and engineering teams before public launch. Request early access →
