TL;DR: The agent ecosystem is rapidly improving harnesses, memory, and workflow rules — but agents still lack product truth: which behaviors are protected promises vs. incidental implementation. This post argues that missing layer is a durable behavior spec (PBC) with trust levels, edge cases, and “must/must-not” constraints so agents can refactor safely without drifting the product.
If you've been following AI coding tools closely, you've probably noticed the conversation changing.
A year ago, most of the discourse was about prompts, context windows, and whether agents could reliably finish tasks at all.
Now the serious work is happening one layer above the model:
- OpenAI is writing about harness engineering, internal monitoring for coding agents, and benchmarks like EVMbench.
- Anthropic is writing about effective harnesses for long-running agents and even parallel agent teams building a C compiler.
- Every tool ecosystem now has some version of workflow rules, repo instructions, memory files, and spec-driven coding.
Taken together, these point to the same conclusion: the model is only part of the system. Reliable agentic coding depends on the surrounding stack.
That's real progress. But it still leaves one missing layer.
The modern agent stack is getting better at telling agents how to work. It still does a poor job telling them what the product must continue to do.
## What AI agent context solves today
Most serious AI-native repos are already building some version of the same stack:
- Repo instructions like AGENTS.md, CLAUDE.md, or tool-specific rules tell the agent how to work in this codebase.
- Memory files preserve what happened in prior sessions so the next run doesn't start cold.
- Harnesses manage long-running work, handoffs, tool access, and task decomposition.
- Evals and monitors check whether the agent stayed within technical or safety boundaries.
Each layer solves a real problem.
Repo instructions reduce workflow mistakes. Memory reduces repeated exploration. Harnesses help agents make progress across long tasks. Evals and monitors catch bad outputs and suspicious behavior.
If your goal is better software engineering execution, this stack makes sense.
## Why better harnesses still don't protect product decisions
Here's the problem: an agent can follow every repo rule, use the right harness, pass the tests, and still break the product.
Not by writing obviously bad code. By changing something that looked reasonable from the code alone.
That happens because most product decisions are not explicit in the repo:
- Is the 14-day refund window a confirmed policy or a placeholder?
- Is the current permission model intentional or just the minimal thing that shipped first?
- Is an empty field a bug, a compliance requirement, or a deliberate product choice?
- Is this validation rule load-bearing, or leftover code that should be removed?
The agent sees implementation. It does not automatically see product intent, trust level, or business significance.
That is why teams end up saying the same thing after an agent makes a "wrong" change: the code was plausible, but it violated something the team had already decided.
This is not a prompt quality problem. It is a missing artifact problem.
## What is missing from AI agent context today?
The missing layer is product truth.
Not a PRD. Not a sprint spec. Not a memory log. Not a test suite.
Product truth answers a narrower and more durable question:
What does this product actually promise to do right now, and which behaviors are confirmed enough that agents should treat them as protected?
That layer needs to capture things like:
- confirmed product behaviors
- forbidden states and actions
- deliberate edge cases
- areas that are still provisional
- decisions that are actively being explored and should not be treated as settled
Without that layer, every agent is forced to infer product meaning from implementation details.
Sometimes that works. Sometimes it silently introduces product drift.
## Why AGENTS.md and memory banks are not enough
This is where teams get confused, because all of these artifacts look similar from the outside. They're usually text files in the repo. They're all readable by both humans and agents. They all seem like "context."
But they operate at different levels:
- AGENTS.md tells the agent how to behave as a contributor.
- Memory banks preserve what happened across sessions.
- Feature specs describe what the team plans to build next.
- Tests verify implementation behavior in specific scenarios.
- Monitors look for dangerous or misaligned actions.
None of those directly answer: which product behaviors are intentional, protected, and safe to build on top of?
Feature specs come closest — a PRD might say "build a 14-day grace period." But feature specs are temporary. They die when the feature ships. The PRD gets closed, the ticket gets archived, and the decision inside it becomes implicit in the code. Feature specs are how decisions are born. They're not how decisions survive.
For many AI-native teams, feature specs don't even exist. Intent lives in standups, Slack messages, two-line Linear tickets, and conversations with coding agents. There's nothing to graduate into a durable product artifact because nothing was written down in the first place.
You can have all five layers and still leave the core product truth implicit.
That's why a repo can feel "well-instrumented" for agents and still be fragile when they touch billing, entitlements, onboarding logic, permissions, or compliance-sensitive flows.
## What a behavior spec adds
A behavior spec — formally a Product Behavior Contract (PBC) — adds the missing product layer without replacing the rest of the stack.
But it's important to be precise about what level of abstraction it operates at. Product truth sits in the middle — between the high-level intent of PRDs and the low-level reality of code:
| Too abstract (PRD) | Right level (product truth) | Too concrete (code description) |
|---|---|---|
| "Support billing grace periods" | "Grace period is exactly 14 days, not configurable per plan, no data deletion during grace" | "Line 42 of billing.ts checks daysSince > 14 and calls downgrade()" |
| "Auth should be secure" | "Max 3 login attempts before lockout. Lockout lasts 30 minutes. No admin override." | "The loginAttempts counter in Redis increments on each failed /api/auth call" |
| "Handle plan upgrades" | "Mid-cycle upgrade charges prorated difference immediately. No credit for unused time." | "upgradeSubscription() calls Stripe's subscription.update() with proration_behavior: create_prorations" |
If it stays too abstract, it's useless to agents — "users should have a smooth checkout experience" tells the agent nothing about what to protect. If it gets too concrete, it becomes a description of the current implementation, which changes with every refactor and costs more to maintain than it's worth.
The test: if you rewrote this module from scratch in a different language, would this rule still need to be true? "Grace period is 14 days" — yes, that's product truth. "Call Stripe API for payments" — no, that's an implementation choice.
That's also why product truth is the most durable layer. PRDs change when strategy changes. Code changes when implementation changes. Product truth changes only when someone explicitly decides to change a product policy — which is rare.
A behavior spec makes this middle layer explicit:
- What must happen — required outcomes for each behavior
- What must not happen — forbidden actions and states
- Which edge cases are deliberate — exceptions that look like bugs but aren't
- Which behaviors are confirmed, provisional, or still being explored — trust levels
- Which source files provide evidence — traceability back to code
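To make the shape of this concrete, here is a minimal sketch of what one contract entry might look like as structured data. The type and field names are illustrative assumptions, not a published PBC schema; the point is that each of the five elements above becomes an explicit, machine-readable field.

```typescript
// Illustrative sketch of one Product Behavior Contract entry.
// Type and field names are assumptions, not a published PBC schema.
type TrustLevel = "confirmed" | "provisional" | "exploring";

interface BehaviorEntry {
  id: string;
  trust: TrustLevel;             // how settled this decision is
  must: string[];                // required outcomes
  mustNot: string[];             // forbidden actions and states
  deliberateEdgeCases: string[]; // exceptions that look like bugs but aren't
  evidence: string[];            // source files that implement the behavior
}

const billingGracePeriod: BehaviorEntry = {
  id: "billing.grace-period",
  trust: "confirmed",
  must: [
    "Grace period is exactly 14 days",
    "Mid-cycle upgrades charge the prorated difference immediately",
  ],
  mustNot: [
    "Grace period must not be configurable per plan",
    "No data deletion during the grace period",
  ],
  deliberateEdgeCases: [
    "Accounts already in grace keep their original expiry after a plan change",
  ],
  evidence: ["src/billing.ts"],
};
```

An agent reading this entry knows the 14-day window is a protected promise with evidence in a specific file, not an arbitrary literal it can refactor away.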
That changes the quality of agent decisions.
When the contract says a billing limit is confirmed, the agent stops treating it as an arbitrary number it can refactor freely. When the contract says the permission model is still under exploration, the agent stops extending it as if the design were settled. When a behavior is marked provisional, humans and agents both know not to overfit around it.
This is the difference between code context and product context.
Code context tells the agent what exists. Product context tells the agent what must remain true.
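One way a harness could consume trust levels is as a simple policy gate. This is a sketch under the assumption that the contract is available as structured data; the level names and the policy mapping are illustrative, not part of any published spec.

```typescript
// Sketch: map a behavior's trust level to what an agent may do with it.
// Level names and the policy mapping are illustrative assumptions.
type TrustLevel = "confirmed" | "provisional" | "exploring";

type AgentPolicy =
  | "protect"         // preserve behavior exactly; refactor structure only
  | "flag-for-human"  // changes allowed, but surface them in review
  | "do-not-extend";  // don't build new features on top of this yet

function policyFor(trust: TrustLevel): AgentPolicy {
  switch (trust) {
    case "confirmed":
      return "protect";
    case "provisional":
      return "flag-for-human";
    case "exploring":
      return "do-not-extend";
  }
}
```

The mapping itself is trivial; the value is that the decision about how carefully to treat a behavior is made once, explicitly, instead of being re-inferred from code on every agent run.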
## Where this hits hardest: refactoring
The most dangerous moment for missing product truth isn't when an agent builds something new — it's when it refactors something that already works.
Refactoring's explicit goal is to change code structure without changing behavior. But the agent doesn't know which behaviors are intentional product decisions and which are incidental implementation details.
You ask the agent to "simplify the billing module." It sees retry logic with edge case handling that looks over-engineered and removes it. But that edge case was handling a deliberate grace period rule that exists nowhere except in the original author's head.
You ask it to "extract shared logic." Billing and notifications both have retry logic. The agent creates a shared utility — but billing had a deliberate 24-hour delay between retries (a product decision, not a technical one), while notifications used 5-minute retries. The shared utility defaults to 5 minutes. Billing retries just got aggressive and nobody noticed.
You ask it to "make billing configurable per plan." The agent makes everything configurable. The 14-day grace period becomes a per-plan setting with a default of 14. Technically correct — but the product decision was "14 days, not configurable per plan." That constraint is gone now.
These aren't hypothetical edge cases. They're what happens every time you ask an agent to refactor a module with implicit business rules. Refactoring PRs are reviewed with the assumption "behavior didn't change" — so reviewers skim them. And the agent genuinely believes behavior didn't change. It just doesn't know which behaviors mattered.
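The retry consolidation above can be made concrete. In this sketch (all names hypothetical), the shared utility's technical default silently overrides a product decision that existed only as a bare constant in the billing module:

```typescript
// Before: billing encoded a product decision as a bare constant.
// Nothing marks 24h as a product rule rather than a tuning knob.
const BILLING_RETRY_DELAY_MS = 24 * 60 * 60 * 1000; // 24 hours

// After "extract shared logic": a generic utility with a technical default.
const DEFAULT_RETRY_DELAY_MS = 5 * 60 * 1000; // 5 minutes, fine for notifications

function retryDelayMs(options?: { delayMs?: number }): number {
  return options?.delayMs ?? DEFAULT_RETRY_DELAY_MS;
}

// Notifications: behavior unchanged.
const notificationDelay = retryDelayMs();

// Billing: the migration dropped the old constant, and no test encoded
// "24 hours between billing retries" as a product requirement.
const billingDelay = retryDelayMs(); // now 5 minutes, not 24 hours
```

Every test still passes, the diff looks like a clean extraction, and billing now retries 288 times more often than the product decision allowed.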
## The agent stack is converging. The product layer is next.
My read of the current ecosystem is that OpenAI, Anthropic, and the broader tool market are all converging on the same architecture:
- give agents better maps of the repo
- help them work across long time horizons
- break work into clearer sub-tasks
- evaluate and monitor their behavior more rigorously
That is the right direction.
But as agents get better at execution, the cost of missing product truth goes up, not down.
A stronger coding agent can now move faster, touch more files, and refactor more confidently. If the product layer is still implicit, that extra capability just lets it make bigger product mistakes more efficiently.
The next mature AI-native repo will not stop at workflow rules, harnesses, and evals. It will also include a durable product artifact that says what the software is actually supposed to do.
That's the role of a behavior spec.
The format is open source. The PBC viewer lets you browse structured contracts in the browser. And Stewie is the product built to help teams generate and maintain that contract from real code.