If you've been following AI coding tools closely, you've probably noticed the conversation changing.
A year ago, most of the discourse was about prompts, context windows, and whether agents could reliably finish tasks at all.
Now the serious work is happening one layer above the model:
- OpenAI is writing about harness engineering, internal monitoring for coding agents, and benchmarks like EVMbench.
- Anthropic is writing about effective harnesses for long-running agents and even parallel agent teams building a C compiler.
- Every tool ecosystem now has some version of workflow rules, repo instructions, memory files, and spec-driven coding.
Taken together, these point to the same conclusion: the model is only part of the system. Reliable agentic coding depends on the surrounding stack.
That's real progress. But it still leaves one missing layer.
The modern agent stack is getting better at telling agents how to work. It still does a poor job telling them what the product must continue to do.
What AI agent context solves today
Most serious AI-native repos are already building some version of the same stack:
- Repo instructions like AGENTS.md, CLAUDE.md, or tool-specific rules tell the agent how to work in this codebase.
- Memory files preserve what happened in prior sessions so the next run doesn't start cold.
- Harnesses manage long-running work, handoffs, tool access, and task decomposition.
- Evals and monitors check whether the agent stayed within technical or safety boundaries.
Each layer solves a real problem.
Repo instructions reduce workflow mistakes. Memory reduces repeated exploration. Harnesses help agents make progress across long tasks. Evals and monitors catch bad outputs and suspicious behavior.
If your goal is better software engineering execution, this stack makes sense.
Why better harnesses still don't protect product decisions
Here's the problem: an agent can follow every repo rule, use the right harness, pass the tests, and still break the product.
Not by writing obviously bad code. By changing something that looked reasonable from the code alone.
That happens because most product decisions are not explicit in the repo:
- Is the 14-day refund window a confirmed policy or a placeholder?
- Is the current permission model intentional or just the minimal thing that shipped first?
- Is an empty field a bug, a compliance requirement, or a deliberate product choice?
- Is this validation rule load-bearing, or leftover code that should be removed?
The agent sees implementation. It does not automatically see product intent, trust level, or business significance.
That is why teams end up saying the same thing after an agent makes a "wrong" change: the code is plausible, but it violates something the team had already decided.
This is not a prompt quality problem. It is a missing artifact problem.
What is missing from AI agent context today?
The missing layer is product truth.
Not a PRD. Not a sprint spec. Not a memory log. Not a test suite.
Product truth answers a narrower and more durable question:
What does this product actually promise to do right now, and which behaviors are confirmed enough that agents should treat them as protected?
That layer needs to capture things like:
- confirmed product behaviors
- forbidden states and actions
- deliberate edge cases
- areas that are still provisional
- decisions that are actively being explored and should not be treated as settled
Without that layer, every agent is forced to infer product meaning from implementation details.
Sometimes that works. Sometimes it silently introduces product drift.
Why AGENTS.md and memory banks are not enough
This is where teams get confused, because all of these artifacts look similar from the outside. They're usually text files in the repo. They're all readable by both humans and agents. They all seem like "context."
But they operate at different levels:
- AGENTS.md tells the agent how to behave as a contributor.
- Memory banks preserve what happened across sessions.
- Feature specs describe what the team plans to build.
- Tests verify implementation behavior in specific scenarios.
- Monitors look for dangerous or misaligned actions.
None of those directly answer: which product behaviors are intentional, protected, and safe to build on top of?
You can have all five and still leave the core product layer implicit.
That's why a repo can feel "well-instrumented" for agents and still be fragile when they touch billing, entitlements, onboarding logic, permissions, or compliance-sensitive flows.
What a Product Behavior Contract adds
A Product Behavior Contract adds the missing product layer without replacing the rest of the stack.
It sits alongside your existing agent context and makes the behavioral contract explicit:
- What must happen
- What must not happen
- Which edge cases are deliberate
- Which behaviors are confirmed, provisional, or still being explored
- Which source files provide evidence
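As a concrete illustration, a contract entry might be structured like the sketch below. This is a hypothetical shape in Python, not the actual open-source PBC format; the field names, status values, and file paths are all assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    CONFIRMED = "confirmed"      # protected: agents must not change this behavior
    PROVISIONAL = "provisional"  # shipped, but not yet a settled product decision
    EXPLORING = "exploring"      # actively being redesigned; do not build on it

@dataclass
class BehaviorEntry:
    behavior: str                                        # what must happen
    forbidden: list[str] = field(default_factory=list)   # what must not happen
    edge_cases: list[str] = field(default_factory=list)  # deliberate edge cases
    status: Status = Status.PROVISIONAL
    evidence: list[str] = field(default_factory=list)    # source files that implement it

# Hypothetical entry for the refund example used earlier in this article.
refund_window = BehaviorEntry(
    behavior="Customers can request a refund within 14 days of purchase.",
    forbidden=["Issuing refunds after the window without manual approval"],
    edge_cases=["Annual plans prorate instead of refunding in full"],
    status=Status.CONFIRMED,
    evidence=["billing/refunds.py", "billing/policy.py"],
)
```

The point is not the exact schema. It is that status and evidence are explicit fields an agent can read, rather than guesses it has to make from implementation details.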
That changes the quality of agent decisions.
When the contract says a billing limit is confirmed, the agent stops treating it as an arbitrary number it can refactor freely. When the contract says the permission model is still under exploration, the agent stops extending it as if the design were settled. When a behavior is marked provisional, humans and agents both know not to overfit around it.
This is the difference between code context and product context.
Code context tells the agent what exists. Product context tells the agent what must remain true.
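To make that difference tangible, a harness could consult such a contract as a simple pre-edit guard, flagging any change set that touches evidence files of confirmed behaviors. This is a minimal sketch under assumed names; the contract shape and file paths are illustrative, not a real PBC schema or harness API:

```python
def files_needing_review(changed_files, contract):
    """Return confirmed behaviors whose evidence files overlap a change set.

    `contract` is a list of dicts with "behavior", "status", and "evidence"
    keys (hypothetical shape for illustration).
    """
    changed = set(changed_files)
    return [
        entry["behavior"]
        for entry in contract
        if entry["status"] == "confirmed" and changed & set(entry["evidence"])
    ]

contract = [
    {"behavior": "14-day refund window", "status": "confirmed",
     "evidence": ["billing/refunds.py"]},
    {"behavior": "Minimal permission model", "status": "exploring",
     "evidence": ["auth/permissions.py"]},
]

# A diff touching refund logic trips the guard; exploratory areas do not.
flagged = files_needing_review(["billing/refunds.py", "auth/permissions.py"], contract)
# → ["14-day refund window"]
```

A guard like this doesn't block the agent; it surfaces the protected behavior so a human (or a stricter policy) can decide whether the change is intentional.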
The agent stack is converging. The product layer is next.
My read of the current ecosystem is that OpenAI, Anthropic, and the broader tool market are all converging on the same architecture:
- give agents better maps of the repo
- help them work across long time horizons
- break work into clearer sub-tasks
- evaluate and monitor their behavior more rigorously
That is the right direction.
But as agents get better at execution, the cost of missing product truth goes up, not down.
A stronger coding agent can now move faster, touch more files, and refactor more confidently. If the product layer is still implicit, that extra capability just lets it make bigger product mistakes more efficiently.
The next mature AI-native repo will not stop at workflow rules, harnesses, and evals. It will also include a durable product artifact that says what the software is actually supposed to do.
That's the role of a product behavior contract.
The format is open source. The PBC viewer lets you browse structured contracts in the browser. And Stewie is the product built to help teams generate and maintain that contract from real code.