Beyond CLAUDE.md and AGENTS.md: when agents need a behavior spec

TL;DR: CLAUDE.md and AGENTS.md are excellent at steering how agents write code. They were never designed to capture what the product promises to do. When agents refactor, extend, or integrate code, they need a behavior spec — not more workflow instructions. That's the layer above instruction files.

Your agent refactors the billing module and changes the grace period from 14 days to 30. The code is cleaner. The tests pass. The product promise is broken.

Your agent simplifies the auth flow and removes the step that forces email verification before first login. It looked like unnecessary friction — but you'd deliberately added it to prevent account abuse during your open beta.

You had CLAUDE.md or AGENTS.md in the repo. You had coding conventions written down. The agent followed those perfectly — and still broke the product.

This isn't a model capability problem. The models are better than they've ever been. It's a context architecture problem — instruction files tell agents how to write code, but not what the product promises to do.

What CLAUDE.md and AGENTS.md actually are

These files work. They solve a real problem. But it's worth being precise about which problem.

CLAUDE.md tells Claude Code how to operate in your repo. Use pnpm, not npm. Run tests before committing. Don't modify generated files. Keep responses concise. It's a workflow configuration — process knowledge that changes when your conventions change.

AGENTS.md does the same for Codex and other agents: coding conventions, build commands, architecture patterns, file organization rules. OpenAI's AGENTS.md spec has been adopted across 60,000+ repos because it solves the "how to work here" problem well.

Both are instruction files. They answer: "How should an agent behave in this repo?"

That's a useful question. But it's not the question that causes production incidents.

The question they don't answer

The question that causes production incidents is: "What does this product promise to do?"

When an agent refactors your billing module, it doesn't need to know whether to use pnpm or npm. It needs to know that the 14-day grace period is a confirmed product decision — not a magic number to clean up. It needs to know that the empty tax_id field is intentionally blank for compliance reasons — not a bug to fix. It needs to know that the auth flow is deliberately minimal because the permission model hasn't been decided yet — not because nobody got around to adding OAuth.

CLAUDE.md doesn't have this information. Neither does AGENTS.md. Not because they're badly designed — because they were designed for a different purpose.

Where instruction files hit their ceiling

The ceiling isn't one thing. It's a pattern that shows up in three ways:

1. No semantic structure. CLAUDE.md and AGENTS.md are freeform prose. A human reads them and infers what matters. An agent reads them and treats every line as equally weighted. "Use TypeScript" and "never change the grace period without product owner approval" have the same format — a bullet point. One is a preference. The other is load-bearing.

2. No trust signal. Everything in an instruction file has the same status: written down. There's no way to distinguish a confirmed product decision from a provisional assumption from an active experiment. An agent treats them all as current truth — and they aren't.

3. No verification path. After an agent runs, there's no way to check whether it honored the product constraints. You can lint code style. You can run tests. But "did the agent preserve the billing contract?" requires a human to review the diff line by line and remember every product decision in their head.

These aren't flaws in CLAUDE.md or AGENTS.md. They're the natural limits of instruction files trying to carry behavior specs they were never built for.

Recent research supports this: one study found that AGENTS.md files improve agent efficiency but not necessarily correctness, while another found they can actually reduce task success rates while increasing cost. Instruction files help agents operate — they don't guarantee correct outcomes.

What sits above instruction files

The layer above instruction files is a behavior spec — an artifact that captures what the product promises to do, in a format that's both human-reviewable and machine-readable.

A behavior spec (.pbc.md — formally a Product Behavior Contract) sits in your repo alongside CLAUDE.md and AGENTS.md, but it answers different questions:

	CLAUDE.md / AGENTS.md	.pbc.md
Answers	"How should the agent work here?"	"What does the product promise?"
Contains	Conventions, commands, patterns	Behaviors, rules, states, edge cases
Changes	When your workflow changes	When a product decision changes
Audience	Agents + new contributors	Everyone — product owner, eng, QA, agents
Structure	Freeform prose	Markdown with typed semantic blocks

Here's what the same knowledge looks like in each format:

In CLAUDE.md:

# Billing rules
- Grace period is 14 days
- Don't change billing logic without approval
- Tax ID is required for invoices

In a .pbc.md behavior spec:

## Grace period enforcement

### When
A subscription payment fails

### Then
- System enters a 14-day grace period
- User retains full access during grace period
- Daily retry attempts are made against the payment method
- On day 14, if no successful payment: downgrade to free tier

### Invariants
- Grace period duration must be exactly 14 days — not configurable per plan
- No data deletion occurs during grace period
- Grace period cannot be extended manually by support

### Edge cases
- If user upgrades plan during grace period: new payment attempt immediately
- If payment method is removed during grace period: grace period continues (retry stops)

The CLAUDE.md version tells an agent "don't touch this." The behavior spec tells the agent (and the product owner, and QA, and the next developer) exactly what the product promises — in enough detail to verify whether the promise is still being kept.

Notice the level of abstraction. A behavior spec doesn't describe the code ("line 42 checks daysSince > 14") — that changes with every refactor. It doesn't describe the strategy ("support billing grace periods") — that's too vague for agents to act on. It captures the decision ("14 days, not configurable, no data deletion") — the part that stays true even if you rewrite the module in a different language.

That's why product truth is the most durable layer in the stack. PRDs change when strategy changes. Code changes when implementation changes. Product truth changes only when someone explicitly decides to change a product policy.

The market knows something is missing

This isn't a theoretical gap. The pain is already showing up across the ecosystem:

Anthropic published hooks documentation — deterministic controls that run outside the model, because they recognized that prompt-level instructions aren't reliable enough for enforcement
OpenAI published a harness engineering series describing how they built their entire product with agents — structured documentation directories as the source of truth, architectural constraints enforced mechanically via linters and CI — then didn't productize the approach
Gartner started covering agentic AI governance as a distinct market category
Job postings at companies like Anthropic, Citizens Bank, and Synopsys now describe agent guardrails, safe execution paths, and governance controls as core responsibilities

The vocabulary is fragmenting — guardrails, governance, policies, behavior, intent — but the underlying need is converging: teams need a structured way to specify what agents are and aren't allowed to do, above the instruction file layer.

How to start

You don't need to spec your entire product on day one. Start with the module where an agent mistake would hurt most — usually billing, auth, or entitlements.

Create billing.pbc.md in your repo
Write the 3-5 behaviors that are non-negotiable (grace period, refund window, upgrade logic)
For each behavior: what must happen, what must not happen, edge cases
Point your CLAUDE.md or AGENTS.md at it: "Read *.pbc.md files before modifying any billing, auth, or entitlement logic"

That last step is the bridge — your existing instruction files become the pointer to the behavior spec. They work together, not against each other.

The stack, not the replacement

The right mental model isn't "PBC replaces CLAUDE.md." It's a stack:

Layer 4: Behavior specs (.pbc.md)                  ← product truth — what it promises (durable)
Layer 3: Feature specs / PRDs                      ← what we plan to build (temporary)
Layer 2: Session memory / context                  ← what we're doing now (session-scoped)
Layer 1: Instruction files (CLAUDE.md, AGENTS.md)  ← how to work here (operational)

Each layer is useful. None replaces the others. But they have very different lifespans.

Layer 3 is where product decisions are first articulated — a PRD says "build a 14-day grace period." But Layer 3 is temporary. It dies when the feature ships. The decision inside it should graduate into Layer 4, but most teams never make that graduation. The PRD closes, and the reasoning becomes implicit in the code.

For many AI-native teams, Layer 3 doesn't even exist — intent is verbal, ephemeral, gone when the conversation ends. That makes Layer 4 even more critical: it's not just the durable version of the spec, it's the only written record that a decision was ever made.

Most repos have layers 1-3 covered (or skip 3 entirely). Layer 4 is the one that prevents the production incident where an agent does exactly what it was told — and breaks a product promise nobody wrote down.

The PBC spec is open source at github.com/stewie-sh/pbc-spec. You can browse example contracts in the PBC viewer.

If your instruction files are working for code conventions but failing for product decisions — this is the layer that's missing.

Beyond CLAUDE.md and AGENTS.md: when agents need a behavior spec

What CLAUDE.md and AGENTS.md actually are

The question they don't answer

Where instruction files hit their ceiling

What sits above instruction files

The market knows something is missing

How to start

The stack, not the replacement

Related posts

AI agent context still misses the product layer

AGENTS.md, Memory Bank, and PBC solve different problems

Why AI agents keep violating your product rules