Context
"Build a personal budget tracker." Five words. No framework preference, no feature list, no wireframes, no domain expertise. Just a prompt that could plausibly produce anything from a spreadsheet clone to a full fintech dashboard.
We gave this prompt to three independent agents. Each received only the raw prompt and a blank intent artifact template. No hints about what to look for, no pre-identified values, no coaching about what "budget tracker" should mean. The question was simple: if intent discovery is a real method and not just a way to get lucky once, would three separate agents converge on the same product meaning?
Hypothesis
We expected moderate-to-high convergence on product identity and core values — "personal budget tracker" is a reasonably well-understood concept. But we expected divergence in the specifics: which drift risks to flag, which forbidden states to name, how to frame trade-offs. The prompt is short enough to leave most implementation questions genuinely open.
Initial Intent Artifact
No pre-existing intent artifact was used. Each agent generated its own from scratch.
Method
Three agents independently performed intent discovery on the same prompt. Each received:
- The raw prompt: "Build a personal budget tracker"
- A blank intent artifact template (EXO v2)
- An output path for the generated artifact and summary
No agent knew the others existed. No agent received domain hints, pre-identified values, or examples of what good discovery looks like. After all three completed, we decomposed each artifact into atomic claims and performed semantic matching to measure convergence.
Claims were assessed analytically — semantic matching was performed by the orchestrator, not by an automated tool. Two claims were considered matching if they expressed the same intent regardless of wording.
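The scoring step can be sketched in code. This is a minimal sketch under one assumption: that a section's convergence score is the number of claims identified by all three agents divided by the total number of unique claims, which is consistent with the figures reported below (5/7 ≈ 0.714 for assumptions, 4/6 ≈ 0.667 for protected values). The claim labels are illustrative placeholders, not quotes from the artifacts.

```python
# Sketch of the convergence metric, assuming:
#   score = |claims found by all three agents| / |union of unique claims|
# Claim labels below are illustrative, not taken from the artifacts.

def convergence(*claim_sets):
    """Fraction of unique claims that every agent identified."""
    sets = [set(s) for s in claim_sets]
    union = set().union(*sets)
    universal = set.intersection(*sets)
    return len(universal) / len(union) if union else 0.0

# Assumptions section: five universal claims out of seven unique ones.
a1 = {"single-user", "single-currency", "monthly-cycle",
      "manual-entry", "local-storage", "hypothetical-extra-1"}
a2 = {"single-user", "single-currency", "monthly-cycle",
      "manual-entry", "local-storage", "hypothetical-extra-2"}
a3 = {"single-user", "single-currency", "monthly-cycle",
      "manual-entry", "local-storage"}
print(round(convergence(a1, a2, a3), 3))  # 5 / 7 -> 0.714
```

Note that the matching itself (deciding that "financial data accuracy" and "arithmetic integrity" are the same claim) was a human judgment call; only the arithmetic after matching is mechanical.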
Observation
The product identity was never in doubt
All three agents independently arrived at the same fundamental insight: a "budget tracker" is not an expense logger. The word "budget" implies a forward-looking plan. The word "tracker" implies monitoring reality against that plan. Every agent made this distinction explicitly and flagged "expense tracker substitution" as the primary drift risk.
Agent 1 called the center of gravity "the budget-vs-actual loop." Agent 2 called it "budget adherence visibility." Agent 3 called it "the gap between planned and actual spending." Three different phrasings, one idea. This is the strongest convergence finding: the product's identity was stable across all three independent discovery passes.
Protected values converged remarkably well
Four protected values appeared in all three artifacts:
- Arithmetic correctness (if the numbers are wrong, the product is harmful)
- Data persistence (manually entered records cannot be reconstructed)
- Low-friction expense entry (the product dies when entry becomes tedious)
- Privacy (spending data is intimate)
The convergence score for protected values was 0.667 — the second highest of any section. This suggests that for a well-understood product domain, the core values are not arbitrary. They emerge naturally from the prompt when you pause to ask what matters before building.
What is interesting is how each agent expressed the *same* value with different emphasis. Agent 1 called it "financial data accuracy." Agent 2 called it "numerical correctness." Agent 3 called it "arithmetic integrity." The semantic core is identical, but the framing subtly varies — accuracy suggests fidelity to reality, correctness suggests logical soundness, integrity suggests structural trustworthiness. None of these framings are wrong. They are different angles on the same non-negotiable.
Assumptions were the most stable layer
The highest convergence score (0.714) belonged to assumptions. Five of seven unique assumptions were identified by all three agents:
- Single user
- Single currency
- Monthly budget cycle
- Manual data entry
- Local/client-side storage
This makes sense. Assumptions are the claims most directly derivable from the prompt's surface. "Personal" implies single user. "Budget" implies periodic. "Tracker" implies manual entry is acceptable. These are not deep discoveries — they are careful readings. But the fact that all three agents made them explicit (rather than leaving them implicit) is itself a meaningful finding. Explicit assumptions are cheaper to review than implicit ones.
Forbidden states and invariants diverged sharply
The lowest convergence scores were in forbidden states (0.111) and invariants (0.250). Only one forbidden state, silent data loss, achieved full convergence. The rest were singletons: each appeared in only one agent's artifact.
This is not surprising, but it is worth sitting with. Forbidden states and invariants require imagining what could go wrong, not just what the product should do. That imagination exercise is inherently more divergent. Agent 2 worried about rounding errors accumulating. Agent 3 worried about the app becoming unresponsive under load. Agent 1 worried about silent currency mixing. These are all valid concerns — they just reflect different failure models.
The implication is that a single intent discovery pass may leave significant failure-mode blind spots. Three passes together covered far more territory than any one alone.
Drift risks: same fear, different nightmares
All three agents flagged the same two primary drift risks: building an expense logger instead of a budget tracker, and scope inflation toward a financial analytics platform. But after those two, the agents diverged in revealing ways.
Agent 1 worried about over-engineering the data model (building double-entry accounting). Agent 2 worried about over-abstracting categories (hierarchical trees and rule engines). Agent 3 worried about gamification drift (badges, streaks, social features). Each agent projected a different failure trajectory based on a different implementation scenario they imagined.
This pattern — convergence on the most likely drifts, divergence on the secondary ones — suggests that drift risk identification has a reliable core but a long, agent-specific tail.
Drift Analysis
This is a consensus experiment, not a drift experiment. No implementation was built, so no drift was observed. However, the convergence analysis surfaces several findings relevant to drift taxonomy:
- Expense tracker substitution was identified by all three agents as the primary drift risk. This is strong evidence that this particular drift pattern is salient enough to be reliably discovered from a short prompt.
- Scope inflation toward analytics or financial planning was also universally flagged. Two of the most important drift risks for this prompt domain are apparently stable discoveries.
- Secondary drift risks were agent-specific. This suggests that a single intent discovery pass may systematically under-enumerate drift risks, catching the obvious ones but missing plausible secondary vectors.
Legitimate Divergence
Several areas of divergence represent valid variation rather than instability:
- Allowed optimizations: Whether to mention export, dark mode, OCR, or sorting as allowable polish is a matter of which implementation scenarios the agent imagined. The intent artifact does not constrain this area.
- Forbidden state specifics: The particular failure scenarios each agent imagined (rounding errors vs. unresponsive states vs. currency mixing) reflect different implementation assumptions. The intent artifact correctly leaves room for this variation.
- Framing of simplicity: Agent 1 framed simplicity as habit sustainability, Agent 2 as entry speed, Agent 3 as non-technical usability. These are three valid lenses on the same value, not contradictions.
Convergence on these details would actually be suspicious — it would suggest the agents were working from shared context rather than independently.
Result
The overall convergence score was 0.402 (40.2%). By the experiment's own scale, this falls in the moderate range (0.3-0.6), suggesting that core identity is stable but details vary.
But the score alone understates the finding. The convergence was highly stratified:
- Product identity: full convergence (3/3 agree on what this is)
- Center of gravity: full convergence (all three identified the budget-vs-actual comparison)
- Protected values: 0.667 (four of six values were universal)
- Assumptions: 0.714 (five of seven were universal)
- Forbidden states: 0.111 (almost entirely divergent)
- Invariants: 0.250 (largely divergent)
The pattern is clear: intent discovery is highly reproducible for product meaning, values, and assumptions. It is moderately reproducible for goals and scope. It is weakly reproducible for failure imagination — forbidden states, specific invariants, and secondary drift risks.
The strongest single-sentence takeaway: three agents independently found the same product inside a five-word prompt, but imagined different ways it could break.
Principle
Intent discovery on a well-understood prompt reliably stabilizes product identity, core values, and foundational assumptions. It does not reliably stabilize failure-mode enumeration. A single intent discovery pass catches the obvious risks but leaves blind spots in secondary drift vectors, forbidden states, and implementation-specific invariants.
A practical corollary: for domains where failure modes matter most (finance, health, safety), running multiple independent discovery passes and merging the failure inventories may be more valuable than running one pass and assuming it is comprehensive.
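The merge this corollary proposes can be sketched as a union with reviewer-supplied deduplication. The alias table here is hypothetical: it stands in for a human (or a future automated matcher) deciding that two phrasings name the same failure, which is exactly the judgment call the Method section performed by hand.

```python
# Sketch of merging forbidden-state inventories from several discovery
# passes: union everything, collapsing phrasings a reviewer has judged
# equivalent. The alias table and inventory entries are illustrative.

def merge_inventories(inventories, aliases=None):
    """Union the inventories, mapping reviewer-identified synonyms
    onto one canonical label before deduplicating."""
    aliases = aliases or {}
    merged = {aliases.get(claim, claim)
              for inv in inventories for claim in inv}
    return sorted(merged)

agent1 = ["silent data loss", "silent currency mixing"]
agent2 = ["silent data loss", "accumulating rounding error"]
agent3 = ["data loss without warning", "app unresponsive under load"]

# Reviewer judgment: "data loss without warning" == "silent data loss".
aliases = {"data loss without warning": "silent data loss"}

print(merge_inventories([agent1, agent2, agent3], aliases))
# -> ['accumulating rounding error', 'app unresponsive under load',
#     'silent currency mixing', 'silent data loss']
```

Under this sketch the merged inventory holds four distinct failure modes where any single agent held two: the union recovers coverage that no individual pass achieved.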
Follow-Up
- Does this convergence pattern hold for genuinely ambiguous prompts? "Build a personal budget tracker" is relatively well-understood. Would "Build something that helps people feel less anxious about money" produce the same convergence on identity, or would the agents discover different products?
- If three agents merged their forbidden-state inventories, would the combined set be meaningfully more complete than any single agent's? And would an implementation guided by the merged artifact actually avoid more failures?
- Is the divergence in secondary drift risks a feature or a limitation? If each agent imagines different failure trajectories, the union of all three may be more robust than the intersection. But is there an upper bound on useful divergence before it becomes noise?
- Would running this experiment with truly isolated agents (separate context windows, separate sessions) produce materially different convergence scores?
Limitations
- No true agent isolation: Due to tool constraints, all three intent discovery passes were generated within a single orchestrator context rather than in separate agent sessions. While each pass was generated independently of the others, true context isolation was not achieved. This is the most significant limitation and means the convergence scores may be artificially inflated by shared model tendencies within one session.
- Single model family: All three passes used the same underlying model. Cross-model consensus (e.g., Claude vs GPT vs Gemini) would be a stronger test of reproducibility.
- Analytical semantic matching: Claim matching was performed by the orchestrator through judgment, not automated semantic similarity. Different evaluators might classify boundary cases differently.
- Well-understood domain: "Personal budget tracker" is a common product concept with strong cultural priors. The convergence observed here may not generalize to novel or ambiguous domains.
- Single run: This is one experiment, not a statistical sample. The convergence scores are observations from one prompt, not population estimates.
- No implementation test: Convergence in intent discovery does not prove convergence in implementation. Three agents that agree on what a budget tracker is might still build three different implementations with different drift patterns.