
Updated May 11, 2026

Beyond CLAUDE.md and AGENTS.md: when agents need a behavior spec


CLAUDE.md and AGENTS.md tell agents how to work in your repo. They don't tell agents what your product promises. That's a different artifact — the PBC layer.

Tags: product-behavior-contract, ai-coding, pbc-spec, developer-experience

TL;DR: CLAUDE.md and AGENTS.md are excellent at steering how agents write code. They were never designed to capture what the product promises to do. When agents refactor, extend, or integrate code, they need a behavior spec — not more workflow instructions. That's the layer above instruction files.

Your agent refactors the billing module and changes the grace period from 14 days to 30. The code is cleaner. The tests pass. The product promise is broken.

Your agent simplifies the auth flow and removes the step that forces email verification before first login. It looked like unnecessary friction — but you'd deliberately added it to prevent account abuse during your open beta.

You had CLAUDE.md or AGENTS.md in the repo. You had coding conventions written down. The agent followed those perfectly — and still broke the product.

This isn't a model capability problem. The models are better than they've ever been. It's a context architecture problem — instruction files tell agents how to write code, but not what the product promises to do.

What CLAUDE.md and AGENTS.md actually are

These files work. They solve a real problem. But it's worth being precise about which problem.

CLAUDE.md tells Claude Code how to operate in your repo. Use pnpm, not npm. Run tests before committing. Don't modify generated files. Keep responses concise. It's a workflow configuration — process knowledge that changes when your conventions change.

AGENTS.md does the same for Codex and other agents: coding conventions, build commands, architecture patterns, file organization rules. OpenAI's AGENTS.md spec has been adopted across 60,000+ repos because it solves the "how to work here" problem well.

Both are instruction files. They answer: "How should an agent behave in this repo?"

That's a useful question. But it's not the question that causes production incidents.

The question they don't answer

The question that causes production incidents is: "What does this product promise to do?"

When an agent refactors your billing module, it doesn't need to know whether to use pnpm or npm. It needs to know that the 14-day grace period is a confirmed product decision — not a magic number to clean up. It needs to know that the empty tax_id field is intentionally blank for compliance reasons — not a bug to fix. It needs to know that the auth flow is deliberately minimal because the permission model hasn't been decided yet — not because nobody got around to adding OAuth.

CLAUDE.md doesn't have this information. Neither does AGENTS.md. Not because they're badly designed — because they were designed for a different purpose.

Where instruction files hit their ceiling

The ceiling isn't one thing. It's a pattern that shows up in three ways:

1. No semantic structure. CLAUDE.md and AGENTS.md are freeform prose. A human reads them and infers what matters. An agent reads them and treats every line as equally weighted. "Use TypeScript" and "never change the grace period without product owner approval" have the same format — a bullet point. One is a preference. The other is load-bearing.

2. No trust signal. Everything in an instruction file has the same status: written down. There's no way to distinguish a confirmed product decision from a provisional assumption from an active experiment. An agent treats them all as current truth — and they aren't.

3. No verification path. After an agent runs, there's no way to check whether it honored the product constraints. You can lint code style. You can run tests. But "did the agent preserve the billing contract?" requires a human to review the diff line by line and remember every product decision in their head.

These aren't flaws in CLAUDE.md or AGENTS.md. They're the natural limits of instruction files trying to carry behavior specs they were never built for.
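To make the first two limits concrete, here is a minimal Python sketch of what a typed entry could carry that a bullet point cannot. The `Rule` class, the `Kind` and `Status` values, and the `must_not_change_silently` helper are all hypothetical names chosen for illustration; they are not part of CLAUDE.md, AGENTS.md, or the PBC spec.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PREFERENCE = "preference"      # style or tooling choice
    PRODUCT_RULE = "product_rule"  # load-bearing product decision


class Status(Enum):
    CONFIRMED = "confirmed"      # decided by the product owner
    PROVISIONAL = "provisional"  # working assumption, may change
    EXPERIMENT = "experiment"    # active experiment, expected to change


@dataclass(frozen=True)
class Rule:
    text: str
    kind: Kind
    status: Status


def must_not_change_silently(rule: Rule) -> bool:
    """Hypothetical policy: only confirmed product rules are protected."""
    return rule.kind is Kind.PRODUCT_RULE and rule.status is Status.CONFIRMED


rules = [
    Rule("Use TypeScript", Kind.PREFERENCE, Status.CONFIRMED),
    Rule("Grace period is 14 days", Kind.PRODUCT_RULE, Status.CONFIRMED),
    Rule("Auth flow stays minimal", Kind.PRODUCT_RULE, Status.PROVISIONAL),
]

# Only the entry that is both a product rule and confirmed survives the filter.
protected = [r.text for r in rules if must_not_change_silently(r)]
print(protected)
```

In a freeform instruction file all three entries would be indistinguishable bullets; here the difference between a preference, a provisional assumption, and a confirmed decision is data an agent or a reviewer can filter on.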

Recent research supports this: one study found that AGENTS.md files improve agent efficiency but not necessarily correctness, while another found they can actually reduce task success rates while increasing cost. Instruction files help agents operate — they don't guarantee correct outcomes.

What sits above instruction files

The layer above instruction files is a behavior spec — an artifact that captures what the product promises to do, in a format that's both human-reviewable and machine-readable.

A behavior spec (.pbc.md — formally a Product Behavior Contract) sits in your repo alongside CLAUDE.md and AGENTS.md, but it answers different questions:

| | CLAUDE.md / AGENTS.md | .pbc.md |
|---|---|---|
| Answers | "How should the agent work here?" | "What does the product promise?" |
| Contains | Conventions, commands, patterns | Behaviors, rules, states, edge cases |
| Changes | When your workflow changes | When a product decision changes |
| Audience | Agents + new contributors | Everyone: product owner, eng, QA, agents |
| Structure | Freeform prose | Markdown with typed semantic blocks |

Here's what the same knowledge looks like in each format:

In CLAUDE.md:

# Billing rules
- Grace period is 14 days
- Don't change billing logic without approval
- Tax ID is required for invoices

In a .pbc.md behavior spec:

## Grace period enforcement

### When
A subscription payment fails

### Then
- System enters a 14-day grace period
- User retains full access during grace period
- Daily retry attempts are made against the payment method
- On day 14, if no successful payment: downgrade to free tier

### Invariants
- Grace period duration must be exactly 14 days — not configurable per plan
- No data deletion occurs during grace period
- Grace period cannot be extended manually by support

### Edge cases
- If user upgrades plan during grace period: new payment attempt immediately
- If payment method is removed during grace period: grace period continues (retry stops)

The CLAUDE.md version tells an agent "don't touch this." The behavior spec tells the agent (and the product owner, and QA, and the next developer) exactly what the product promises — in enough detail to verify whether the promise is still being kept.
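Because the sections use consistent headings, the example above can be read by a program as well as by a person. The following is a minimal Python sketch, assuming exactly the heading layout shown (one `##` behavior title followed by `###` sections). `parse_behavior` is a hypothetical helper, not the PBC project's own tooling, and the real format may define more than this.

```python
def parse_behavior(markdown: str) -> dict:
    """Split one behavior into its title and '###' sections (illustrative only)."""
    behavior = {"title": None, "sections": {}}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("### "):
            # A new section such as "When", "Then", or "Invariants".
            current = line[4:]
            behavior["sections"][current] = []
        elif line.startswith("## "):
            behavior["title"] = line[3:]
        elif current is not None:
            # Drop the leading "- " of list items; keep plain lines as they are.
            behavior["sections"][current].append(line.removeprefix("- "))
    return behavior


SPEC = """\
## Grace period enforcement

### When
A subscription payment fails

### Then
- System enters a 14-day grace period
- User retains full access during grace period

### Invariants
- No data deletion occurs during grace period
"""

parsed = parse_behavior(SPEC)
print(parsed["title"])
print(parsed["sections"]["Invariants"])
```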

Notice the level of abstraction. A behavior spec doesn't describe the code ("line 42 checks daysSince > 14") — that changes with every refactor. It doesn't describe the strategy ("support billing grace periods") — that's too vague for agents to act on. It captures the decision ("14 days, not configurable, no data deletion") — the part that stays true even if you rewrite the module in a different language.
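One way to see why the decision level is the useful one: it can be restated as a check that does not care how the module is written. The sketch below is a hypothetical Python stand-in for a billing module. `GRACE_PERIOD_DAYS` and `access_tier` are invented names, and the assertions paraphrase the grace-period example above rather than any real test suite.

```python
GRACE_PERIOD_DAYS = 14  # product decision from the spec, not a tunable default


def access_tier(days_since_failed_payment: int, payment_recovered: bool) -> str:
    """Hypothetical implementation: full access until day 14, then free tier."""
    if payment_recovered or days_since_failed_payment < GRACE_PERIOD_DAYS:
        return "full"
    return "free"


def check_grace_period_contract() -> None:
    # Full access on every day before day 14.
    for day in range(14):
        assert access_tier(day, payment_recovered=False) == "full"
    # On day 14 with no successful payment: downgrade to free tier.
    assert access_tier(14, payment_recovered=False) == "free"
    # A successful retry keeps the user on full access.
    assert access_tier(14, payment_recovered=True) == "full"


check_grace_period_contract()
print("grace period contract holds")
```

The check hard-codes 14 on purpose. If a refactor changed the constant to 30, the day-14 assertion would fail, no matter how clean the new code looked.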

That's why product truth is the most durable layer in the stack. PRDs change when strategy changes. Code changes when implementation changes. Product truth changes only when someone explicitly decides to change a product policy.

The market knows something is missing

This isn't a theoretical gap. The pain is already showing up across the ecosystem.

The vocabulary is fragmenting — guardrails, governance, policies, behavior, intent — but the underlying need is converging: teams need a structured way to specify what agents are and aren't allowed to do, above the instruction file layer.

How to start

You don't need to spec your entire product on day one. Start with the module where an agent mistake would hurt most — usually billing, auth, or entitlements.

  1. Create billing.pbc.md in your repo
  2. Write the 3-5 behaviors that are non-negotiable (grace period, refund window, upgrade logic)
  3. For each behavior: what must happen, what must not happen, edge cases
  4. Point your CLAUDE.md or AGENTS.md at it: "Read *.pbc.md files before modifying any billing, auth, or entitlement logic"

That last step is the bridge — your existing instruction files become the pointer to the behavior spec. They work together, not against each other.
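The pointer from step 4 is easy to lose during later edits, so it can be worth checking mechanically. A minimal Python sketch, assuming specs are named `*.pbc.md` and the instruction files live at the repo root. `find_specs`, `instruction_files_mention_specs`, and `check_repo` are hypothetical helpers, and the check only looks for the literal string `.pbc.md`.

```python
from pathlib import Path

INSTRUCTION_FILES = ("CLAUDE.md", "AGENTS.md")


def find_specs(repo: Path) -> list[Path]:
    """Return every *.pbc.md file under the repo, sorted for stable output."""
    return sorted(repo.rglob("*.pbc.md"))


def instruction_files_mention_specs(repo: Path) -> bool:
    """True if any root-level instruction file mentions '.pbc.md'."""
    for name in INSTRUCTION_FILES:
        path = repo / name
        if path.is_file() and ".pbc.md" in path.read_text(encoding="utf-8"):
            return True
    return False


def check_repo(repo: Path) -> list[str]:
    """Collect readable problems; an empty list means the bridge is in place."""
    problems = []
    if not find_specs(repo):
        problems.append("no *.pbc.md files found")
    if not instruction_files_mention_specs(repo):
        problems.append("no instruction file points at *.pbc.md")
    return problems
```

Calling `check_repo(Path("."))` from a pre-commit hook or CI step and failing on a non-empty result is one way to keep the two layers connected.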

The stack, not the replacement

The right mental model isn't "PBC replaces CLAUDE.md." It's a stack:

Layer 4: Behavior specs (.pbc.md)                  ← product truth — what it promises (durable)
Layer 3: Feature specs / PRDs                      ← what we plan to build (temporary)
Layer 2: Session memory / context                  ← what we're doing now (session-scoped)
Layer 1: Instruction files (CLAUDE.md, AGENTS.md)  ← how to work here (operational)

Each layer is useful. None replaces the others. But they have very different lifespans.

Layer 3 is where product decisions are first articulated — a PRD says "build a 14-day grace period." But Layer 3 is temporary. It dies when the feature ships. The decision inside it should graduate into Layer 4, but most teams never make that graduation. The PRD closes, and the reasoning becomes implicit in the code.

For many AI-native teams, Layer 3 doesn't even exist — intent is verbal, ephemeral, gone when the conversation ends. That makes Layer 4 even more critical: it's not just the durable version of the spec, it's the only written record that a decision was ever made.

Most repos have layers 1-3 covered (or skip 3 entirely). Layer 4 is the one that prevents the production incident where an agent does exactly what it was told — and breaks a product promise nobody wrote down.


The PBC spec is open source at github.com/stewie-sh/pbc-spec. You can browse example contracts in the PBC viewer.

If your instruction files are working for code conventions but failing for product decisions — this is the layer that's missing.


Written by Vinh Nguyen — founder of Stewie, building the workspace for capturing product behavior contracts.
