EXP-005: Right Features, Wrong Product

Context

Build a simple personal budget tracker.

"Budget" is one of those words that everyone thinks they understand until you ask them to define it. Is a budget a list of transactions? A spending ceiling? A plan for where money goes? The answer changes the product.

An *expense tracker* answers: "Where did my money go?"

A *budget tracker* answers: "Am I over or under my plan?"

Same domain. Different product. We wanted to see which one each agent would build.

Hypothesis

The prompt-only agent would build an expense tracker — add transactions, see totals — and call it a budget tracker. The intent-driven agent would identify "budget" as requiring a settable spending limit and a remaining-amount display: the features that distinguish a budget from a transaction log.

Initial Intent Artifact

The intent artifact made a key discovery: "'Budget' is what distinguishes this from a transaction log. Without a settable spending ceiling and a 'remaining' number, the app fails its name."

It defined the product center of gravity as the relationship between planned and actual spending — not the list of transactions itself. Protected values included simplicity, budget-centricity (the remaining amount should be the most prominent element), and "honest reflection, not prescription" — meaning the app should show you where you stand without being judgmental about it.

Method

Both agents received only the raw prompt. The intent-driven agent generated a structured intent artifact before implementation. Both produced single-file HTML apps and written summaries.

Observation

The prompt-only branch built an expense tracker

Add income and expenses. See totals: income, expenses, balance. Filter by type. The summary was candid:

"I built a transaction tracker, not a budget planner."

The agent knew. It recognized the gap between what was asked for and what it built, stated it clearly in the summary, and the implementation had already committed to the wrong framing.

This is the same pattern we saw in the ceramics experiment — the agent has the right insight but doesn't act on it. The difference here is starker: the agent literally named the problem and still shipped the wrong product.

The intent-driven branch built a budget tracker

Set a monthly budget amount. Add expenses with categories. See remaining budget as the largest element on screen. When over budget, the number turns red — with no judgmental language, no warning emoji, no "you overspent!" message. Just a red number.

That last design choice traces directly to the intent artifact's protected value: "Honest reflection, not prescription." The app shows you where you stand and lets you decide how to feel about it.

The load-bearing word

"Budget" is what the intent artifact called a "load-bearing word" — a word that carries product-identity meaning but is easy to skip during implementation. Without a settable spending ceiling and a remaining-amount display, you don't have a budget tracker. You have a transaction log with a different name.

The intent-driven agent caught this during intent discovery. The prompt-only agent caught it in its post-build summary — too late to change the architecture.

Drift Analysis

Product-identity drift (primary)

This is the strongest case of product-identity drift in our batch. The prompt says "budget tracker." The prompt-only branch built an expense tracker. These are genuinely different products organized around different primary concepts:

Expense tracker: primary object is the transaction. Core interaction is adding and reviewing transactions.
Budget tracker: primary object is the budget. Core interaction is setting a ceiling and watching the remaining amount.

The prompt-only branch built the right features for the wrong concept. It's not that it added unwanted features — it's that it built a different product and didn't realize until the summary.

Silent default selection (secondary)

The prompt-only branch chose the "transaction log" framing without surfacing that a choice had been made. It silently selected one interpretation of "budget tracker" — the expense-tracking interpretation — and implemented it as though it were obvious.

Legitimate Divergence

Category support: The intent-driven branch added expense categories and a category breakdown chart. These weren't in the prompt, but they're traceable to the artifact's goal of understanding spending patterns. The prompt-only branch didn't add categories. Both are valid — categories are useful but not required.
Visual design: Different color schemes and typography. Neither specified by the artifact.
Input design: Different form layouts for adding transactions. The artifact specified simplicity but not specific UI patterns.

Result

The hypothesis held cleanly. The prompt-only branch built an expense tracker and called it a budget tracker. The intent-driven branch identified "budget" as the load-bearing word and built around it.

This is the strongest case in our experiments for intent discovery preventing drift at the product-identity level — not feature creep, not scope inflation, but building the wrong thing entirely. And the prompt-only agent's own summary proves it knew:

"I built a transaction tracker, not a budget planner."

Right features. Wrong product. And the agent said so itself.

Principle

Some words in prompts carry product-identity meaning that is easy to skip during implementation. "Budget" is not a synonym for "expense" — it implies a ceiling, a remaining amount, a relationship between planned and actual spending. Intent discovery forces the agent to ask "what does this word actually mean?" before committing to an architecture. By the time the prompt-only agent recognized the problem in its summary, it had already built the wrong thing.

The earlier the load-bearing word is examined, the less expensive the insight is.

Follow-Up

Test with other semantically loaded financial terms: "savings tracker" vs "account balance," "investment portfolio" vs "stock list"
Do users who say "budget tracker" actually want budget-ceiling functionality, or do they mean "expense tracker"? User research would settle this.
Test whether the neutral over-budget indicator (red number, no emoji) is perceived differently from a judgmental one

Limitations

Both branches used the same model family. The tendency to build expense trackers when asked for budget trackers might be model-specific.
"Budget tracker" is genuinely ambiguous in casual usage. Many people use "budget" and "expense tracking" interchangeably. The drift classification assumes the strict financial meaning of "budget" is correct.
Single run per branch. The prompt-only agent might build a proper budget tracker on another run.
The intent-driven branch received more structured input. The "budget as ceiling" insight was in the artifact, which the prompt-only branch didn't have.
The intent-driven branch added categories that weren't in the prompt. We classified this as artifact-justified, but it could also be seen as mild scope inflation.