TL;DR: Agents “violate product rules” mostly because they can’t see which behaviors are confirmed decisions vs. temporary implementation details. Modern harnesses (AGENTS.md, memory, specs, monitors) steer code execution, but they still don’t provide product truth. A behavior spec adds that missing layer with trust levels and explicit must/must-not behaviors.
I started noticing a pattern with coding agents in my codebase. The agent would make a change that looked reasonable — cleaner code, passing tests — but it broke a product decision I'd already made.
Most of the time, these weren't bugs. The agent was correct based on what it could see. It just didn't know what I knew — product decisions, trade-offs, things I'd explicitly confirmed or rejected that existed nowhere in the code.
(If you're running agents against production code, the short version of this post is on Stewie for Engineering Teams.)
The agent saw a Stripe integration. It didn't know I was actively evaluating switching to Polar. It saw a simple email/password auth flow and didn't know that was intentionally minimal because I hadn't decided on the permission model yet. It saw a 24-hour refund window and couldn't tell whether that was a deliberate product decision or a temporary hack.
These aren't coding mistakes. They're product context failures.
The state of the art is good — and it's not enough
The ecosystem for giving agents code-level context has gotten genuinely impressive in 2026:
- AGENTS.md / CLAUDE.md — Claude Code, Codex CLI, and others read these for conventions, build commands, and architecture patterns
- Cline's memory bank — structured markdown files persisted across sessions for project knowledge
- Spec-driven development — GitHub's spec-kit, Kiro, and others turning specs into agent instructions
- Agent discovery — both Claude Code and Codex CLI now do initial codebase exploration before they start working
All of these solve the "how should I write code in this repo" problem. And they're getting really good at it.
What none of them solve: what should this product do and why.
The harness engineering gap
Both OpenAI and Anthropic published engineering posts in early 2026 saying the same thing: the model isn't the bottleneck — the harness is.
OpenAI's harness engineering series describes building their entire product with zero hand-written code. Their approach relies on structured documentation directories that serve as the single source of truth for agents, with architectural constraints enforced mechanically via custom linters and CI — not just prompt instructions.
Anthropic's post on effective harnesses for long-running agents describes the same pattern — every new agent session starts with no memory, so you need persistent structured context that agents can quickly load and reason about.
Both companies built these harnesses by hand for their own teams. Neither productized it.
This validates the problem from the two biggest AI companies — but it also reveals the gap. They're describing code-level harnesses. Neither addresses the product truth layer: the decisions, rules, and constraints that define what the product promises to do.
Four layers of agent context
When you look at the tools available today, there are really four distinct layers of context that agents need:
| Layer | What it provides | Example tools |
|---|---|---|
| Code conventions | How to write code in this repo | AGENTS.md, .cursorrules |
| Session memory | What happened in prior sessions | Cline memory bank, claude-progress.txt |
| Implementation specs | What to build (feature level) | spec-kit, Kiro, SDD |
| Product truth | What the product promises — what's decided, what's uncertain, and why | Missing |
Each layer is useful. None replaces the others. But right now, most repos have the first three layers covered (or at least have tools for them), while the product truth layer is completely absent.
That's the layer where your product decisions live — and it's the layer agents violate most often.
What product violations look like
When an agent has AGENTS.md but no product truth layer, it knows how to work but not what to protect. So it:
- Refactors billing logic and silently changes the grace period from 14 days to 30
- "Simplifies" an auth check that was handling a deliberate edge case
- Auto-fills a field that the product intentionally leaves empty for compliance reasons
- Removes a validation rule because it looks redundant — but it was load-bearing
- Infers behavior from code patterns and builds confidently on wrong assumptions
The agent isn't being careless. It's following the instructions it has. The problem is that nobody wrote down which behaviors are off-limits.
Product truth as a behavior spec
The fix isn't more documentation. It's a different kind of artifact — one that captures product-level decisions in a format both humans and agents can reason about.
That's what a behavior spec is — formally a Product Behavior Contract (PBC). It's Markdown-first, machine-readable, and designed to sit alongside your existing agent context.
The key is the level of abstraction. A behavior spec doesn't describe the code — "line 42 checks daysSince > 14" changes with every refactor. It doesn't describe the strategy — "support billing grace periods" is too vague for agents. It captures the decision: "Grace period is 14 days. Not configurable per plan. No data deletion during grace." The part that stays true even if you rewrite the module from scratch.
That's why product truth is the most durable layer in the stack. PRDs change when strategy changes. Code changes when implementation changes. Product truth changes only when someone explicitly decides to change a product policy.
A behavior spec captures:
- What must happen — required outcomes for each behavior
- What must not happen — forbidden actions and states
- Edge cases — the exceptions that look like bugs but are deliberate
- Trust levels — which behaviors are confirmed vs. provisional vs. still being explored
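To make this concrete, here's what one behavior might look like in such a spec. This is an illustrative sketch, not the actual PBC schema — the section names, field labels, and trust-level vocabulary here are my own shorthand; see the open-source spec for the real format.

```markdown
## Billing: grace period

Trust level: confirmed

Must:
- Grace period is 14 days after a failed payment.
- The user keeps read access to their data for the full grace period.

Must not:
- Delete or archive any user data during the grace period.
- Make the grace period configurable per plan.

Edge cases:
- Annual plans get the same 14 days. This looks inconsistent with
  their billing cycle, but it is deliberate.
```

Note that nothing here points at code. A refactor can move the 14-day check anywhere it likes — the contract survives because it describes the decision, not the implementation.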
A PBC doesn't replace AGENTS.md or memory banks or feature specs. It fills the specific gap those tools leave open — the product truth layer.
When your agent can read a behavior spec that says "the 24-hour refund window is a confirmed product decision, not a temporary value," it stops treating that number as refactorable. When it sees "auth is intentionally minimal — permission model is still in exploration," it stops building assumptions on top of an undecided foundation.
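The gating logic this enables is simple enough to sketch. The snippet below is a hypothetical illustration, not the real PBC tooling: the inline spec text, the `## Behavior` heading convention, and the `Trust level:` field are all assumptions made up for this example. It shows the shape of the check — confirmed behaviors are off-limits to the agent, everything else is fair game (or at least flagged for a human).

```python
# Hypothetical sketch: gate agent edits on trust levels from a behavior spec.
# The heading style and trust-level values below are illustrative only —
# they are not the actual PBC schema.

SPEC = """\
## Refunds: 24-hour window
Trust level: confirmed

## Auth: email/password only
Trust level: exploring
"""

def parse_trust_levels(spec_text: str) -> dict[str, str]:
    """Map each '## Behavior' heading to its declared trust level."""
    levels = {}
    current = None
    for line in spec_text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
        elif current and line.lower().startswith("trust level:"):
            levels[current] = line.split(":", 1)[1].strip()
    return levels

def may_refactor(behavior: str, levels: dict[str, str]) -> bool:
    """Confirmed behaviors are protected; anything else needs a human check."""
    return levels.get(behavior) != "confirmed"

levels = parse_trust_levels(SPEC)
print(may_refactor("Refunds: 24-hour window", levels))    # False — confirmed decision
print(may_refactor("Auth: email/password only", levels))  # True — still exploring
```

The point isn't the parsing — it's that the agent now has a machine-readable answer to "is this value refactorable?" instead of inferring intent from the code alone.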
Code tells you what is. Product truth tells you what should be.
The distinction matters because code is descriptive — it shows the current state. Product truth is prescriptive — it says what the product promises and where those promises are still uncertain.
An agent that only reads code will always assume the current implementation is intentional. An agent that also reads a behavior spec can distinguish between "this is a deliberate design decision" and "this is a temporary hack that should be replaced."
That's the difference between an agent that writes correct code and an agent that builds the right product.
The PBC spec is open source at github.com/stewie-sh/pbc-spec. You can browse example contracts in the PBC viewer.
If you're using AGENTS.md, memory banks, or spec-driven development and still finding that your agents miss the product layer — this is the artifact that's missing.
