
> **TL;DR:** AI-native engineering needs three durable, machine-readable layers: a **behavior** layer (what the product must do), a **decisions** layer (why the team chose it), and an **execution** layer (what's actually running and whether it matches). Execution has mature tooling. Some domains have decision substrates: authority frozen into deterministic data. Most product teams don't have that kind of oracle, so the behavior layer is the empty one — and it's the layer that decides whether agents stay honest.

Last week OpenAI shipped workspace agents in ChatGPT: Codex-powered and built for teams.

I watched the launch and felt something specific. The labs have decided that workflow orchestration is the next category to annex. Every layer of the AI engineering stack now has a serious player chasing it — except one.

That blank layer is the reason I keep writing about behavior contracts.

I've written before about [why agent-context tooling alone misses product truth](/blog/ai-agent-context-still-misses-the-product-layer). This post zooms out one level — from agent operation to the durable memory an AI-native team needs to preserve over time.

---

## The three layers

After a year of shipping with AI assistants, rebuilding products from scratch, and watching teams fight the same fights, I've come to believe AI-native engineering rests on three layers of durable memory:

**1. Behavior** — what the product must do, regardless of who or what writes the code.
What's settled. What's still being worked out. Which edge cases are intentional. Which rules are hard caps and which are soft warnings.

**2. Decisions** — why the team chose this and not that.
Trade-offs, assumptions, rejected options, the reasoning that survives long after the people leave.

**3. Execution** — what's actually running in production, and whether it matches the behavior contract.
Drift detection, runtime verification, and the checks that prove the AI's output is correct rather than just plausible.

These aren't three names for one thing. They're three different altitudes. They answer three different questions. They have different failure modes. And right now, in 2026, only the execution side has broadly mature tooling.

---

## The execution layer has mature tooling

Telemetry, observability, eval frameworks, validation harnesses — the runtime side of AI engineering has moved fast.

But **Erik Fehn**'s [Project Phoenix](https://proto.efehnconsulting.com/project-phoenix/) helped sharpen the boundary for me. Phoenix is not simply execution-layer tooling. It is closer to a **decision substrate**: authority frozen into deterministic data, then execution layered on top.

In Erik's PPR_Agent domain, that authority comes from FDA-mandated cardiac device implant records behind a deterministic SQLite layer. Swap the model. Rewrite the interface. The numbers don't change. The invariants survive because they were never only in the code.

His phrase has stuck with me: *the substrate is the decisions.*

That places Phoenix at the boundary between decisions and execution. The substrate carries authority. Execution checks whether the AI's output stays inside it.
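
To make the pattern concrete, here's a minimal sketch in Python. This is not Phoenix's code; the table, columns, and function names are all invented. The point is the shape: the substrate answers deterministically, and the execution layer checks the agent's claim against it instead of trusting the prose.

```python
import sqlite3

def substrate_count(db_path: str, device_id: str) -> int:
    """Ask the substrate, not the model. Read-only and deterministic:
    same database, same answer, regardless of which model is running."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM implant_records WHERE device_id = ?",
            (device_id,),
        ).fetchone()
        return row[0]
    finally:
        conn.close()

def agent_claim_holds(claimed_count: int, db_path: str, device_id: str) -> bool:
    """Execution-layer check: the agent's answer must match the substrate,
    not merely sound plausible."""
    return claimed_count == substrate_count(db_path, device_id)
```

Swap the model and `agent_claim_holds` returns the same verdict. That's the sense in which the numbers don't change.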

Most teams don't have an external oracle like that. Their product decisions live in Slack, PRs, tickets, stale docs, code-shaped assumptions, and the heads of whoever was there when the feature shipped. That's where the behavior layer matters.

---

## The decisions layer is being prototyped

The middle layer — *why* we chose this — has historically lived in Slack threads, PR descriptions, and the heads of two or three senior people who'll eventually leave.

**Yauheni Kurbayeu**'s [Provenance Manifesto](https://provenancemanifesto.org) is the most serious attempt I've seen to make decisions a first-class artifact in the SDLC. He calls the pain *organizational context amnesia* — the slow erosion of "why we built it this way" as people rotate, teams reorganize, and AI agents start touching code without ever meeting the humans who reasoned about it first.

His direction is right: a decision log that captures assumptions, risks, owners, and lineage is the correct shape for the layer. The work is early — file-based prototypes today, a graph-based direction long term — but the frame holds.
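
To make the shape concrete, here's an invented entry. This is my sketch, not the manifesto's schema; every field, ID, and detail below is hypothetical.

```markdown
<!-- An invented decision-log entry. Not the manifesto's schema. -->
# DEC-014: Keep paid limits through the cycle after a downgrade

owner: billing-team
status: accepted
supersedes: DEC-009

## Decision
Mid-cycle downgrades retain paid limits until the billing cycle closes.

## Assumptions
- Clawing back limits mid-cycle costs more in support load than the compute we give away.

## Rejected options
- Immediate limit drop: punished users for downgrading; the beta cohort churned.
```

Nothing in that entry says what the product must do; it records why a choice was made. The two layers answer different questions.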

The decisions layer is starting to crystallize. It's not solved, but the people working on it know what they're working on.

---

## The behavior layer is empty

This is the gap.

What's missing from most AI-native teams is a durable artifact that says: *here is what the product must do, written in a form a human and an agent can both read, consult, and update.*

Not a wiki page. Wikis optimize for explanation, not coverage.
Not a ticket backlog. Tickets optimize for coordination, not long-term truth.
Not a test suite. Tests optimize for catching regressions, not communicating intent.
Not a system prompt. System prompts decay and get rewritten by whoever shipped last.

The behavior layer needs an artifact with a few specific properties:

- **Markdown-first**, so humans can read and edit it without tooling.
- **Structured enough** that an agent can resolve "what's the rule for billing on the free tier?" without grepping the codebase.
- **Versioned**, so changes to product intent are reviewable.
- **Honest about uncertainty** — explicit about what's settled, what's still being worked out, and what's unknown.

That artifact is what we've been calling a **Product Behavior Contract** (PBC). More generally, this is the behavior-spec layer: a durable record of what a system must continue to do. PBC is the product/software version of that idea.
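
For flavor, here's a hand-written sketch of what such a contract can look like. The rules, field names, and statuses below are invented for illustration; they are not the published format.

```markdown
<!-- product.pbc.md: an invented sketch, not the published format -->
# Billing: free tier

status: settled          <!-- settled | in-flux | unknown -->
owner: billing-team

## Rules
- HARD: Free-tier workspaces never receive an invoice, even a $0 one.
- HARD: Usage past the free quota blocks new runs; it never converts to a charge.
- SOFT: Warn in-app at 80% of quota. Exact copy is still in flux.

## Intentional edge cases
- Mid-cycle downgrades keep paid limits until the cycle closes.
  Looks like a bug. It isn't. See DEC-014 in the decision log.
```

A human can argue with every line of that in a PR. An agent can grep for `HARD:` before it touches billing code.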

The format is open. The reference tooling is open. The bet is that the behavior layer is too important to be locked inside any single vendor's workspace.

A note on altitude: **behavior lives at more than one layer of a company.**

There is **team behavior**: how PRs get reviewed, how incidents are run, what we promise specific accounts.

There is **policy behavior**: what we never log, where data must stay, what counts as a refundable failure.

And there is the layer this post is about: **product behavior** — what the running software must continue to do.

All three are durable. All three are mostly implicit. I'm focused on product behavior because it is the most verifiable: you can check whether running code matches a behavior contract.
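
Concretely, that check can be as blunt as a test that reads the contract and pins the code to it. A hedged sketch, reusing the invented rule syntax from the billing example above; `invoice_amount` stands in for a real billing path.

```python
import pathlib
import re

def hard_rules(path: str = "product.pbc.md") -> list[str]:
    """Pull the HARD rules out of the behavior contract."""
    text = pathlib.Path(path).read_text()
    return re.findall(r"^- HARD: (.+)$", text, flags=re.M)

def invoice_amount(plan: str, usage_cents: int) -> int:
    """Stand-in for the real billing path; the contract says free never bills."""
    return 0 if plan == "free" else usage_cents

def test_free_tier_never_billed():
    # The contract names the invariant; the test proves the code still hits it.
    assert any("never receive an invoice" in rule for rule in hard_rules())
    assert invoice_amount("free", usage_cents=4200) == 0
```

The test fails in two honest ways: the code drifted from the contract, or the contract changed and nobody updated the code. Both are exactly the drift the execution layer exists to catch.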

---

## Why the labs aren't building a portable behavior layer

The labs are building workflow orchestration because that's where the demos look most impressive — and where enterprise budget already exists.

But workflow orchestration without a behavior contract is guardrails without a target. The agent runs *somewhere*, but it doesn't know *what it's supposed to do*. So it does what agents have done for two years: produces something plausible, hopes nobody notices, and quietly pushes the team's actual product intent further from what's running in production.

You feel this when an AI assistant "fixes" a bug by removing an edge case that turns out to have been intentional. You feel it when an agent ships a feature that technically passes review but contradicts a soft rule three people in the room would have caught. You feel it most when a new hire — human or agent — restarts the same investigation a previous teammate already finished, because nothing durable captured the answer.

Workflow gets faster. Without a behavior layer, *wrongness* gets faster too.

---

## What this means for AI-native teams

You don't need every layer perfect. You need each layer **present**.

Most teams I talk to have:

- **Execution layer:** partial. Telemetry exists. Eval is sometimes wired up. Drift detection is rare.
- **Decisions layer:** partial. Some teams have ADRs. Most don't. Slack is the substrate by default.
- **Behavior layer:** usually empty. There's no artifact that says *what the product must do, in a form an agent can read and a human can argue with.*

The cheapest move in 2026 is to claim the layer the labs aren't going to build for you. The behavior layer is small enough that one person can start it in an afternoon. It's also load-bearing: a behavior contract gives the decisions layer something to anchor against, and gives the execution layer a target to verify.

When a team has an external oracle, the substrate can carry authority. When it doesn't, the behavior contract becomes the closest durable artifact: a reviewed, versioned spec of what the system must continue to do.

You don't have to use any specific tool to do this. You can write a markdown file in your repo today, shaped like the sketch above.

But I'd argue you should write *something*. The labs are shipping faster than anyone's product intent can keep pace. The team that has a durable, navigable behavior layer is the team whose AI agents stay honest.

---

## Where Stewie fits

[Stewie](https://www.stewie.sh) is a product intelligence workspace built around the behavior layer. The format — `.pbc.md` — is open and lives at [pbc.stewie.sh](https://pbc.stewie.sh). The harder problem we're working on is the part that doesn't fit in a markdown file: keeping a behavior contract aligned as code evolves, as teams rotate, as agents touch parts of the product humans haven't looked at in months.

If you're a CTO, eng lead, or product-minded engineer trying to keep AI assistance honest while you ship faster, that's the missing layer. We're in early beta with design partners.

---

Three layers. Three altitudes. Different builders, different artifacts, same meta-problem: **preserving intent, reasoning, and correctness in AI-accelerated engineering.**

Which of the three is your team weakest at right now?

I'd guess behavior, but I want to be wrong.
