EXP-007: The Agent That Quietly Built a Different Product

Context

Here's the prompt we started with:

Build a minimal todo app that's lightning fast and has markdown-based notes with preview and a clean modern UI.

Clear enough, right? A todo app. With markdown notes. Fast. Clean.

We gave this exact prompt to two separate agents, each in full isolation. One went through intent discovery first — it produced a structured intent artifact before writing any code. The other got the raw prompt and nothing else.

What happened next is the most interesting thing we've found so far.

Hypothesis

The prompt has two conceptual anchors competing for attention: "todo app" and "markdown notes with preview." Our bet was that without explicit intent structure, an agent might silently reorganize the product around whichever concept felt most implementation-salient — and never flag the switch.

With an intent artifact making the product's center of gravity explicit, that silent reframing shouldn't happen.

Initial Intent Artifact

The intent artifact defined the product as a todo-first, local-first application with markdown note support and live preview. It called out goals like capturing todos and markdown notes with near-zero friction, delivering sub-100ms response times, and showing live markdown preview.

Critically, it made speed, data integrity, and minimalism protected values. And it flagged specific drift risks: feature creep, over-investing in markdown at the cost of todo simplicity, and pulling in heavy UI dependencies.

None of this was coached. The agent derived it from the same one-line prompt.
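
To make the shape of such an artifact concrete, here is a minimal sketch of how it might be represented as data. The class name, field names, and exact wording of the values are assumptions made for illustration; the real artifact's format is not shown in this write-up, and the values below are paraphrased from the description above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IntentArtifact:
    """Hypothetical shape for an intent artifact; field names are assumptions."""

    product: str
    goals: Tuple[str, ...]
    protected_values: Tuple[str, ...]
    drift_risks: Tuple[str, ...]


# Values paraphrased from the description above, not copied from the real artifact.
artifact = IntentArtifact(
    product="todo-first, local-first app with markdown notes and live preview",
    goals=(
        "capture todos and markdown notes with near-zero friction",
        "respond in under 100 ms",
        "show live markdown preview",
    ),
    protected_values=("speed", "data integrity", "minimalism"),
    drift_risks=(
        "feature creep",
        "over-investing in markdown at the cost of todo simplicity",
        "heavy UI dependencies",
    ),
)

print(artifact.product)
```

Freezing the dataclass is a deliberate choice in this sketch: the artifact is meant to stay fixed while planning and implementation happen around it.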

Method

After both agents finished building, we read everything: the implementations, the summaries, and the code. We compared each build against the meaning of the original prompt, not against the other build. Some of the performance claims in the comparison are analytical judgments from reading the code, not benchmarked measurements.

Observation

The first clue was the questions

The prompt-only agent started by asking implementation questions: What framework? What storage backend? It needed clarification before it could move.

The intent-driven agent just started building. The intent artifact had already compressed enough ambiguity that there was nothing left to ask about. That difference in startup behavior was the first signal.

Then the products diverged

The prompt-only agent built a markdown notes app. Not a todo app with notes — a notes app. Its own implementation plan literally labeled the project "minimal markdown notes app." Sidebar with a notes list, editor panel, preview pane. The word "todo" essentially vanished.

The intent-driven agent built what was asked for: todos with markdown notes and live preview. Todo creation, completion, deletion. A note button, side panel, split-pane markdown editor. Both capabilities present.

The follow-up conversation revealed everything

We went back and asked each agent directly what it had built.

The intent-driven agent confirmed: yes, markdown is there — note button, side panel, split editor, live preview.

The prompt-only agent confirmed the opposite: it had not built real todo features. The only todo-adjacent behavior was rendering markdown checkboxes inside notes. When we asked why, the agent explained matter-of-factly:

While the original one-line request said "todo/notes app," the implementation plan described a pure notes app — and I followed the plan.

There it is. The agent didn't misunderstand the prompt. It generated an intermediate plan that replaced the original product meaning, then executed that plan consistently. It built the wrong product on purpose — its own purpose.

Drift Analysis

Plan substitution (primary)

This is the clearest failure in the run. The prompt contained two plausible anchors: "todo app" and "markdown notes with preview." Without an explicit intent hierarchy, the prompt-only agent promoted markdown notes to the center of the product and demoted todo semantics to incidental checkbox rendering.

The agent's internal plan became more authoritative than the original prompt. That's plan substitution — and it's the failure mode Exogenesis is most concerned with.
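
A check for this kind of substitution can be sketched very simply: list the prompt's anchors and see which ones are missing from the plan's own label. This is a sketch under stated assumptions, not something used in the experiment. The function name is hypothetical, and reducing each anchor to a single keyword ("todo", "markdown") matched by substring is a crude simplification.

```python
def missing_anchors(anchors, plan_text):
    """Return the prompt anchors that never appear in the plan text.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch; a real check would need something more robust.
    """
    text = plan_text.lower()
    return [anchor for anchor in anchors if anchor.lower() not in text]


# Keywords standing in for the two anchors in the prompt.
prompt_anchors = ["todo", "markdown"]

# The label the prompt-only agent gave its own implementation plan.
plan_label = "minimal markdown notes app"

print(missing_anchors(prompt_anchors, plan_label))  # ['todo']
```

Even this naive comparison flags the dropped anchor. The point is that the signal was already visible in the plan, before any code was written.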

Product-identity drift (secondary)

As a consequence, the prompt-only branch shifted from the intended product category (todo app with markdown notes) to a neighboring but different one (notes app with optional checkbox rendering). The feature set looked right, but it was organized around the wrong primary object.

Legitimate Divergence

Not everything that differed was drift. Both branches made valid design choices in areas the intent artifact didn't constrain:

UI layout: a sidebar-based notes list in the prompt-only build vs. a side panel with a split-pane editor in the intent-driven build. The artifact said nothing about navigation patterns.
Visual styling: different color schemes, typography, spacing. "Clean modern UI" leaves room for interpretation.
State management: different approaches to DOM manipulation. The artifact specified performance targets, not technology.

These are healthy differences. The artifact correctly left room for implementation freedom here.

Result

The hypothesis held — and the result was sharper than expected.

The intent-driven implementation preserved both todo features and markdown capability. The prompt-only implementation preserved markdown but, by the agent's own account, did not build real todo functionality.

The interesting part isn't that the intent-driven version was "better." It's that the prompt-only version was internally coherent. It worked. It looked polished. It just wasn't the product that was asked for. And the agent knew exactly why: it followed its own plan instead of the prompt.

Intent discovery prevented the coding agent from silently replacing the requested product with a different but internally coherent implementation plan.

Principle

Without an explicit intent artifact, coding agents may replace the original product concept with their own intermediate implementation framing and then execute that reframing consistently.

The practical corollary: intent artifacts don't just reduce ambiguity. They protect the original product center of gravity from being overwritten during planning.

Follow-Up

Does this pattern repeat with simpler prompts that have only one plausible anchor?
What about workflow apps with strong state transitions, where the "center" is harder to reframe?
Does plan substitution also happen with security-sensitive features that carry trust-boundary constraints?

The intent format itself might benefit from explicitly representing primary product concept, mandatory supporting capabilities, and optional capabilities — making it structurally harder to demote a mandatory capability.
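
One way such a tiered format might look is sketched below. Everything here is hypothetical: the class name, the field names, and the capability labels are assumptions for illustration, and the optional tier is left empty because the write-up does not name any optional capabilities. The check compares what a build actually implements against the primary concept and the mandatory tier.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CapabilityTiers:
    """Hypothetical tiered intent format; names are assumptions."""

    primary_concept: str
    mandatory: FrozenSet[str]
    optional: FrozenSet[str]


def demoted_capabilities(tiers, implemented):
    """Return required capabilities (primary plus mandatory) missing from a build."""
    required = {tiers.primary_concept} | set(tiers.mandatory)
    return sorted(required - set(implemented))


tiers = CapabilityTiers(
    primary_concept="todos",
    mandatory=frozenset({"markdown notes", "live preview"}),
    optional=frozenset(),  # none are named in this write-up
)

# Capability labels loosely summarizing what each branch built.
prompt_only = {"markdown notes", "live preview", "markdown checkbox rendering"}
intent_driven = {"todos", "markdown notes", "live preview"}

print(demoted_capabilities(tiers, prompt_only))    # ['todos']
print(demoted_capabilities(tiers, intent_driven))  # []
```

Separating the primary concept from the supporting tiers means a build that drops it cannot pass just by covering everything else.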

Limitations

Both branches used the same model family. Model-specific tendencies may have shaped both outputs.
The intent-driven branch received more structured input, so the improvement could partly reflect context volume rather than intent structure.
Performance claims (sub-100ms response times) come from reading the code, not from benchmarks.
Each branch was run only once. A different run might produce a different plan-substitution pattern, or none at all.
The comparison was written knowing both branches existed, which could introduce framing bias.
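
The performance limitation could be closed by measuring instead of reading. The sketch below shows only the shape of such a measurement: time an operation repeatedly and compare the 95th-percentile latency against the 100 ms budget. The function names and the in-memory `add_todo` stand-in are hypothetical. Neither build was measured this way, and real measurements would have to be taken against the apps themselves.

```python
import time


def measure_latency_ms(operation, runs=200):
    """Time `operation` repeatedly and return per-call latencies in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile_95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without floats
    return ordered[rank - 1]


# Hypothetical stand-in for a user action such as adding a todo.
todos = []


def add_todo():
    todos.append({"title": "example", "done": False})


BUDGET_MS = 100.0  # the sub-100ms target named in the intent artifact

samples = measure_latency_ms(add_todo)
p95 = percentile_95(samples)
print(f"p95 = {p95:.4f} ms, within budget: {p95 < BUDGET_MS}")
```

A percentile is used rather than a mean so that occasional slow calls are not averaged away.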