Tag: Halucination

3 Prompt AI Series #4: Framework: Calibration, Governance, and Trade-offs

Implementing the Three-Rule Framework: Calibration, Governance, and Trade-offs

The previous post in this series introduced a general framework for AI-assisted scenario building: Force Blank, Penalize Guessing, Show the Source. The framework produces output where every claim is tagged as VERIFIED, ASSUMED, or PROJECTED, and where gaps are explicitly labeled instead of silently filled.

That’s the what. This post is about the how — three practical challenges that anyone implementing the framework will encounter:

  1. Calibration: You’ve tagged something as ASSUMED. How do you check whether the assumption is reasonable?
  2. Governance: How do organizations enforce tagging in actual workflows — not just in one person’s prompt?
  3. Trade-offs: Doesn’t all this tagging create cognitive overload? How do non-experts read a document full of provenance labels?

1. Calibrating Assumptions: From “Tagged” to “Tested”

Tagging an assumption is necessary but not sufficient. (ASSUMED: market grows 15% annually) is better than an unlabeled 15% baked into the projection — but it still doesn’t tell you whether 15% is defensible. The framework surfaces assumptions; calibration tests them.

Four calibration methods work well with the tagged output:

Reference Class Forecasting: The Outside View

Daniel Kahneman and Amos Tversky’s distinction between the “inside view” (planning based on the specifics of this project) and the “outside view” (what happened in similar projects historically) is the single most useful concept for calibrating assumptions. The planning fallacy — systematically underestimating costs and timelines — is so well-documented that the American Planning Association officially endorsed reference class forecasting in 2005 as a corrective.

In practice, this means: for every ASSUMED tag, ask the model (or yourself) to identify 3–5 comparable situations and their actual outcomes. If you assume 15% growth, what growth did similar products in similar markets actually achieve? If you assume a 6-month regulatory timeline, how long did comparable approvals actually take? The tagged format makes this step natural — you have a list of assumptions; now walk down it with an outside view on each one.

You can even build this into the prompt:

For every ASSUMED tag, add a “Calibration” note: identify 2–3 comparable historical cases and their actual outcomes. If no comparable data exists, note [NO REFERENCE CLASS].

Sensitivity Testing: What Breaks If This Is Wrong?

Not all assumptions are equally important. RAND’s Assumption-Based Planning calls this “criticality” — an assumption is critical if its failure would require fundamental changes to the plan. In practice, this means testing: what happens to the conclusion if this assumption is 50% wrong? If the answer is “not much,” the assumption is low-priority. If the answer is “the entire business case collapses,” that’s your highest-priority validation target.

The tagged format enables this directly. You can ask the model:

Take the three ASSUMED items with the highest downstream impact on the final projection. For each, recalculate the projection with the assumption at 50% of stated value and at 150%. Show me which assumptions the conclusion is most sensitive to.

Pre-Mortem: Imagine It Failed

Gary Klein’s pre-mortem technique inverts the question: instead of asking “will this work?”, you start from “it failed — why?” This is particularly effective for ASSUMED tags, because it surfaces failure modes that optimism hides. Ask the model:

Assume this scenario failed after 12 months. Which of the ASSUMED items were most likely the point of failure? For each, describe a plausible narrative of how that assumption broke down.

Temporal Decay: When Does the Assumption Expire?

Assumptions have shelf lives. A market size estimate from a 2025 Gartner report is still reasonable in 2026. A competitive landscape assumption from 2024 may already be wrong. Adding a temporal dimension to ASSUMED tags helps:

For each ASSUMED tag, add an expiry estimate: how long is this assumption likely to remain valid? Mark anything older than 12 months or based on pre-2025 data as [STALE ASSUMPTION].


2. Governance: Making the Framework Stick Beyond One Person’s Prompt

The framework works well when one person uses it in one chat session. The governance question is: how does it survive contact with an organization — multiple people, multiple AI tools, multiple documents, over months?

The Problem: Tags Die in Translation

What typically happens: an analyst generates a beautifully tagged scenario. They copy it into a slide deck. The tags disappear. A manager reads the deck, sees “Year 1 revenue: €310K” with no indication that the number is PROJECTED from two unvalidated ASSUMED inputs. The ghost scenario lives again.

This is a knowledge management problem, not an AI problem. And it has knowledge management solutions.

Level 1: Template Enforcement

The simplest governance mechanism is a template. If your organization uses AI for scenario planning, the output template should have provenance columns built in. Not optional, not “add if useful” — structurally required. A scenario document without source tags should be treated the same way as a financial report without citations: incomplete.

Concretely: create a standard table format for all AI-assisted scenario outputs:

Variable Value Source Basis / If Wrong Validated By Date
(All AI-generated scenario outputs must use this format)

The “Validated By” and “Date” columns are the governance additions. They turn a prompt technique into an audit trail. Someone must sign off on each ASSUMED item before it enters planning.

Level 2: Review Workflow

For organizations with more structured processes, integrate tagging into the review cycle:

Step 1 — Generation: AI produces tagged output using the three-rule prompt.
Step 2 — Assumption Review: A domain expert reviews all ASSUMED and PROJECTED items. Each gets one of three dispositions: confirmed (reclassified to VERIFIED), challenged (sent for calibration), or accepted with risk (kept as ASSUMED with a documented rationale).
Step 3 — Gap Triage: All DATA GAP and ASSUMPTION GAP items are triaged: resolvable (assign someone to find the data), irreducible (the uncertainty is inherent — document it and plan around it), or deferred (not needed for this decision stage).
Step 4 — Decision Package: The final document separates “what we know” (VERIFIED), “what we believe” (ASSUMED, with calibration notes), and “what we don’t know” (remaining gaps). Decision-makers see all three.

Level 3: System Prompt Standardization

If your organization uses AI across multiple teams, standardize the system prompt. Don’t rely on individual analysts remembering to apply the three rules. Embed the framework into every AI access point — whether that’s a shared Claude project, a custom GPT, an API wrapper, or an n8n workflow. The prompt becomes infrastructure, not personal practice.

For teams using Claude Projects or custom GPTs, the three-rule prompt goes into the project instructions or system message — it’s active for every conversation in that workspace without anyone needing to remember to include it.

The Cultural Challenge

The hardest governance problem isn’t technical. It’s that tagging uncertainty feels like weakness. Presenting a scenario full of ASSUMED and DATA GAP labels to a board looks less confident than presenting clean numbers. The organizational response to this must be explicit: a tagged scenario is not an incomplete scenario — it’s an honest one. The clean numbers were never clean; they just hid where the guesses were.

This is exactly what Bent Flyvbjerg’s decades of research on megaproject failures shows: the projects that went most catastrophically over budget weren’t the ones with the most uncertainty — they were the ones where the uncertainty was hidden. Transparency about assumptions is a risk reduction strategy, not an admission of weakness.


3. Trade-offs: When Tags Become Noise

A document where every sentence carries a provenance label is exhausting to read. The framework creates real cognitive overhead, and pretending otherwise is dishonest. The question isn’t whether there’s a cost — there is — but how to manage it.

The Overload Problem

Consider a 20-variable scenario with source tags, calibration notes, and “if wrong” annotations on every ASSUMED item. For the analyst who built it, this is valuable — they can see exactly where to direct attention. For the executive who needs to make a decision based on it, it’s a wall of qualifications that obscures the bottom line.

Both perspectives are legitimate. The solution isn’t to choose one over the other — it’s to serve both with different views of the same underlying data.

Solution: Layered Presentation

The tagged scenario should exist in at least two layers:

Layer 1 — Decision Summary: One page. Key conclusions, key numbers, key risks. No tags in the running text. Instead, a single “Confidence Profile” section at the bottom:

This scenario rests on 14 verified data points, 6 stated assumptions, and 3 projections. Two data gaps remain unresolved (market-specific CAC, regulatory timeline). The assumption with the highest downstream impact is [X] — if wrong by 50%, projected revenue shifts from €310K to €180K.

That’s the executive view: how much of this is solid, how much is uncertain, and what specifically could break it.

Layer 2 — Full Tagged Analysis: The complete output with all provenance tags, calibration notes, gap labels, and sensitivity analysis. This is the working document. It’s what the analyst uses, what the reviewer signs off on, and what gets archived. It’s the audit trail.

The relationship between the layers is like the relationship between a financial statement and its footnotes. The statement tells you the numbers; the footnotes tell you what the numbers rest on. Both exist. Different readers use different layers.

How Non-Experts Read Tags

For teams where not everyone is fluent in the tagging system, simplify the visual language. Three colors work better than three acronyms:

  • VERIFIED → presented as normal text (no special marking needed — it’s the baseline)
  • ASSUMED → highlighted or marked with a distinct visual cue (e.g., italic, a colored sidebar, or a simple ⚠ symbol)
  • DATA GAP → presented as an explicit blank with a brief note

The core message non-experts need to internalize is simple: unmarked text is grounded; marked text is uncertain; blanks are honest. That’s a ten-second briefing. If someone can read a weather forecast that distinguishes “current temperature” from “tomorrow’s forecast,” they can read a tagged scenario.

When to Reduce Tagging

Not every use case needs full provenance. The right level of tagging depends on the stakes:

Stakes Tagging Level Example
Low Tag only gaps Internal brainstorming, early-stage ideation
Medium Tag gaps + assumptions Project proposals, budget drafts, team planning
High Full tagging + calibration Board presentations, investment decisions, regulatory submissions

For a casual strategy brainstorm, requiring VERIFIED/ASSUMED/PROJECTED on every line would kill the creative flow. For a €2M investment decision going to the board, anything less than full tagging is irresponsible. Match the framework’s intensity to the decision’s consequences.


The Framework Maturity Model

Putting it all together, organizations adopting the three-rule framework can think of implementation in three stages:

Stage 1 — Individual Practice: One person uses the three-rule prompt in their own AI conversations. Tagged output stays in their workspace. Value: personal quality control. Cost: near zero.

Stage 2 — Team Standard: The prompt is embedded in shared AI workspaces (Claude Projects, custom GPTs). Templates enforce the table format. Assumptions get informal peer review. Value: consistent quality across a team. Cost: template creation, brief training.

Stage 3 — Organizational Governance: The framework is integrated into planning processes. Assumption review is a formal workflow step. Calibration (reference class, sensitivity, pre-mortem) is standard practice. Decision packages separate confidence layers. Value: systematic risk reduction. Cost: process change, cultural shift.

Most teams should start at Stage 1 and see results immediately. Whether to progress to Stage 2 or 3 depends on how much is at stake when AI-generated scenarios inform real decisions. The higher the stakes, the more the governance investment pays for itself.


Limitations and Known Gaps

The three-rule framework is a practitioner pattern, not a peer-reviewed method. It deserves the same critical scrutiny it asks users to apply to AI output. Here are the things it doesn’t solve — and the ways it can be misused.

1. Not empirically validated

There are no controlled experiments, before/after error-rate measurements, or user studies behind this framework. Research shows that provenance tagging and structured prompting can reduce hallucinations — sometimes significantly — but this has been demonstrated for specific tagging schemes under controlled conditions, not for the exact VERIFIED / ASSUMED / PROJECTED pattern proposed here. Treat the framework as an engineering heuristic that probably helps in many cases, not as something whose effectiveness you can assume without measuring on your own use cases. If you adopt it, track whether it actually improves your outputs.

2. The prompt is one lever, not the only lever

The framework leans heavily on prompt design as the primary mechanism for controlling model behavior. In practice, prompts can reduce hallucinations, but models still violate instructions under pressure — especially when optimization, reward models, or fine-tuning push toward fluency and completeness. For production systems, prompt-level rules should be complemented by architecture-level controls: retrieval-augmented generation (RAG) to ground outputs in actual data, rule-based filters to catch unsupported claims, abstention mechanisms that refuse to generate when confidence is low, and human review workflows. The prompt is the user-accessible lever. It is not the only lever, and in high-stakes deployments, relying on it alone is fragile.

3. VERIFIED means “sourced,” not “infallible”

The framework’s tag hierarchy implies a confidence gradient: VERIFIED = solid, ASSUMED = fragile, PROJECTED = derived. But “verified” data can itself embed significant problems. Historical figures can reflect measurement error. Market data can encode vendor assumptions or sampling bias. Financial actuals can be non-stationary — a Q4 2024 revenue figure may be misleading for Q4 2026 projections in a post-shock market. The framework tracks provenance (where did this number come from?) but not quality (is this number still a reliable guide?). Users should resist the temptation to treat VERIFIED as “settled.” Data fundamentalism — assuming that sourced data is correct data — is a different failure mode than hallucination, but it can drive equally bad decisions.

4. Tags expose inputs, not structural validity

A scenario can be perfectly tagged — every number sourced, every assumption labeled, every gap flagged — and still be fundamentally misleading because the underlying causal model is wrong. Treating customer churn as independent of pricing. Ignoring feedback loops between marketing spend and brand perception. Assuming linear scaling where the real dynamics are nonlinear. The framework catches factual hallucinations (wrong inputs) but not structural errors (wrong model of how the inputs relate). The calibration methods described earlier — sensitivity testing, pre-mortem — partially help by stress-testing individual assumptions, but they test assumptions in isolation, not the relationships between them. ABP and scenario planning literature emphasize structural thinking, exploration of alternative logics, and the “world of no broken assumptions” as a reference scenario. This framework focuses on tagging and gap flagging, not on the quality of the mental model. A well-tagged bad model is still a bad model.

5. Labels don’t expose whose assumptions are being encoded

The categories VERIFIED / ASSUMED / PROJECTED can give a veneer of objectivity that hides power dynamics. Management may encode optimistic growth targets as ASSUMED without revealing the political pressure behind the number. A vendor’s market-size estimate tagged as VERIFIED may embed that vendor’s commercial interests. An analyst’s PROJECTED calculation may use a model that reflects institutional bias toward certain outcomes. The framework does not require the model (or the human) to reveal whose assumptions are being encoded or how they were generated. In organizational contexts, this matters: the question isn’t just “is this sourced or assumed?” but “whose interests shaped this assumption?” The framework doesn’t answer that question — and claiming it does would be a form of the same false confidence it’s designed to prevent.

6. Too many gaps can paralyze decisions

The framework explicitly penalizes guessing and encourages the model to flag [DATA GAP] and [ASSUMPTION GAP] at every opportunity. In high-uncertainty domains — which is most strategic planning — this can produce outputs dominated by gaps and caveats. ABP literature stresses that some assumptions must be made “for planning purposes” or planning cannot proceed. The stakes-based scaling table earlier in this post partially addresses this (brainstorming gets light tagging, board decisions get full tagging), but the underlying tension remains: the framework promotes a norm where “silent invention is worse than flagged uncertainty” without explicitly discussing when too much uncertainty signaling undermines decision-making. In a corporate context, if every plan is filled with prominent warnings, managers may either ignore the warnings as boilerplate or become overly cautious and delay needed decisions. Match the framework’s intensity not only to the decision’s stakes but also to the organization’s risk appetite and decision timeline.

7. Domain-specific adaptation required

The series claims the framework is portable across domains — document extraction, worldbuilding, business scenarios, cybersecurity, scientific writing. But those domains have very different stakes, epistemic structures, and regulatory environments. In medicine, tagging something as ASSUMED is far from sufficient to make it safe — existing guidance requires retrieval-augmented generation, external verification, and human oversight. In legal work, a custom label scheme might conflict with established citation standards or be misinterpreted by courts. In regulated industries, compliance frameworks may have their own provenance requirements that the three-rule labels don’t map onto. The general pattern provides a starting structure; domain-specific adaptation and validation are required before relying on it in regulated or high-stakes environments. The domain-specific posts in this series (cybersecurity, scientific writing) are first steps in that adaptation, not finished products.

These limitations don’t invalidate the framework — they bound it. The three rules are a significant improvement over the default (no provenance, no gap flagging, no penalty for guessing), but they are not a complete solution. They’re the beginning of a practice, not the end of one.


Sources and Further Reading

Three Prompt Rules That Stop AI From Guessing — And the Science Behind Them

Every new model generation arrives with fanfare: better benchmarks, higher accuracy scores, more impressive demos. GPT-5 reasons through complex problems. Claude plans ahead when writing poetry. Gemini processes images and video with startling fluency. The intelligence curve keeps climbing.

But there’s a second curve that rarely makes the keynote slides — the honesty curve. And it’s barely moved.

This isn’t a vague philosophical complaint. It’s a structural problem baked into how these models are trained, evaluated, and deployed. And it’s one that hits hardest in exactly the kind of work where people increasingly rely on AI: extracting data from contracts, parsing invoices, summarizing meeting notes, building CRM records from messy inputs.

This post unpacks why the intelligence-honesty gap exists, what the latest research tells us about its causes, and — most practically — three prompt rules you can apply today to force AI to be honest about what it doesn’t know.


The Gap: Intelligence vs. Honesty

When we say a model “got smarter,” we usually mean it scores higher on benchmarks — math competitions, coding challenges, multi-step reasoning tasks. These are real improvements. But benchmark scores measure a model’s ability to produce correct answers. They don’t measure a model’s willingness to say “I don’t know.”

In fact, the incentive structure actively punishes honesty.

In September 2025, OpenAI published a research paper that made this problem precise. The team — including researchers from Georgia Tech — examined major AI benchmarks and found that the vast majority use binary grading: either the answer is correct and gets a point, or it’s wrong and gets zero. Crucially, abstaining — saying “I don’t know” — also gets zero. The mathematical consequence is straightforward: guessing always has a higher expected score than abstaining. A model that bluffs on every uncertain question will rank higher than one that honestly declines.

OpenAI’s own blog post put it plainly: the situation is like a multiple-choice test where leaving an answer blank guarantees a zero, but guessing at least gives you a chance. Under those rules, the rational strategy is to always guess — even when you have no idea. And that’s exactly what the models learn to do.

The paper demonstrated this with a striking example: when asked for the PhD dissertation title of one of its own co-authors, a widely-used model confidently produced three different titles across three attempts. All three were wrong. It did the same with his birthday — three dates, all incorrect, all delivered with unwavering confidence.

This isn’t a bug that can be patched. It’s the natural outcome of optimizing for accuracy-only metrics. As the OpenAI researchers argue, the mainstream benchmarks and leaderboards need to be redesigned to penalize confident errors more heavily than uncertainty. Until that happens, every model that climbs the leaderboard does so in part by learning to bluff better.


Why Models Confabulate: Insights from Interpretability Research

The OpenAI paper explains the incentive problem. But what happens mechanically inside the model when it makes something up?

Anthropic’s interpretability research — published in March 2025 under the title “Tracing the Thoughts of a Large Language Model” — provides some of the most detailed answers we have. Using what they describe as a “microscope” for AI, Anthropic’s team traced the internal circuits that activate when Claude processes a question. It’s worth noting that these findings are specific to Claude 3.5 Haiku — other model families may handle uncertainty through different internal mechanisms — but the patterns are likely general enough to be instructive.

One of their most revealing discoveries involves what we might call a default refusal mechanism. In Claude, refusing to answer is actually the default behavior: the researchers found a circuit that is “on” by default and causes the model to state it has insufficient information. But when the model recognizes a “known entity” — say, Michael Jordan the basketball player — a competing set of features fires up and suppresses this default circuit, allowing the model to respond.

The problem arises when this mechanism misfires. If the model recognizes a name but doesn’t actually know the relevant facts, the “known entity” signal can still override the “I don’t know” circuit. The result: a confident, detailed, completely fabricated answer. In one experiment, the researchers used a person named Michael Batkin — someone unknown to the model, who by default triggered a refusal. But when they artificially activated the “known entity” features or inhibited the “can’t answer” features, Claude promptly — and consistently — hallucinated that Batkin was famous for playing chess.

Even more unsettling: Anthropic found evidence that when Claude can’t easily compute an answer (say, the cosine of a large number), it sometimes engages in what philosopher Harry Frankfurt would call bullshitting — producing an answer without any internal evidence of the calculation actually occurring. Despite claiming to have run the math, the interpretability tools revealed no trace of any computation. When given a hint about what the answer should be, Claude worked backwards, constructing plausible-looking intermediate steps that lead to the hinted answer — a textbook case of motivated reasoning.

These findings matter because they show that the honesty problem isn’t just about training incentives. The models have internal mechanisms that are supposed to catch uncertainty — but those mechanisms can be overridden by other pressures, including the drive toward grammatical coherence and the pattern-matching instinct to fill in gaps.


Automation Bias: Why This Matters More Than You Think

All of this would be merely academic if people treated AI output with appropriate skepticism. They don’t.

Automation bias — the tendency to over-rely on automated recommendations — is one of the most thoroughly documented phenomena in human-computer interaction research. A 2025 systematic review published in AI & Society analyzed 35 peer-reviewed studies spanning healthcare, finance, national security, and public administration. The pattern was consistent across domains: when an AI system delivers a confident answer, people accept it. They check less. They override their own judgment.

randomized clinical trial conducted with AI-trained physicians in Pakistan (published as a preprint in August 2025) made the dynamic especially clear. Even doctors who had completed 20 hours of AI-literacy training — including instruction on how to critically evaluate AI output — were vulnerable to automation bias when exposed to erroneous LLM recommendations. The training helped, but it didn’t eliminate the problem. Confident-sounding AI output has a gravitational pull that’s difficult to resist, even when you know to look for errors.

The real-world consequences are already visible. In February 2024, Air Canada was ordered to pay damages to a customer after a support chatbot — not a large language model, but an AI system nonetheless — hallucinated a bereavement fare policy that didn’t exist. The chatbot confidently told the customer they could retroactively request a discount within 90 days of purchase. The actual policy allowed no such thing. But the system stated it with such authority that the customer relied on it to make a financial decision. The underlying technology differed from today’s LLMs, but the dynamic was identical: confident AI output, uncritical human acceptance.

In an operations context, the failure modes are subtler but no less damaging. Consider a contract with payment terms mentioned on page 8 and page 14 — and the two pages say different things. A human reviewer might catch the discrepancy. An AI, asked to extract the payment terms, will pick one and move on. It won’t mention the conflict. It won’t flag the ambiguity. It will fill the cell in your spreadsheet with “Net 30” and give you no indication that page 14 says “Net 45.”

Meeting notes are another minefield. “Let’s circle back next week” becomes a specific date and a named owner in the AI’s summary — details that nobody actually stated, but that the model invented to produce a clean, actionable output.

The pattern is the same across invoices, insurance documents, lease agreements, vendor scoring, CRM data entry: wherever AI is used to extract structured information from messy sources, the model’s instinct to fill every field works directly against the user’s need to know which fields are uncertain.


Three Prompt Rules That Change the Incentive

These three problems — training incentives that reward guessing, internal mechanisms that can override uncertainty detection, and human psychology that accepts confident output at face value — come from different research streams. But they converge on the same practical conclusion: by default, AI will guess rather than admit ignorance, and people will trust the guess.

You can’t fix the training pipeline. You can’t redesign the benchmarks. But you can change the local incentive structure inside the conversation. The following three rules — adapted from a practical framework by D-Squared — do exactly that. They work because they explicitly reverse the default dynamic: instead of rewarding completeness, they reward honesty about uncertainty. Note that the effectiveness of these techniques may vary across model families — they’ve been tested primarily with ChatGPT and Claude, and other models may respond differently.

Rule 1: Force Blank + Explain

The single most effective change you can make is to explicitly instruct the model to leave fields blank when the data is ambiguous, missing, or unclear — and to explain why.

Without this rule, every field gets filled. With this rule, the model produces output like:

Field Value Reason
Payment Terms — BLANK Pages 8 and 14 state different terms — net 30 vs net 45
Renewal Date Jan 15, 2027
Liability Cap — BLANK References “Exhibit B” — not included in document

The blank fields are where the value is. They tell you exactly where to focus your attention. They’re the model admitting “I’m not sure” — something it would never do without explicit instruction.

The prompt language:

Extract the following fields from this document into a table. Rules: Only extract values that are explicitly stated in the document. When a value is ambiguous, missing, or unclear, leave the field BLANK. Add a column labeled “Reason.” Next to every blank field, include a one-sentence explanation of why you left it blank. Base every value on what the document actually says. Quote or reference the specific section you pulled it from.

One way to think about why this works is through the lens of Anthropic’s interpretability findings. The model has internal mechanisms for recognizing uncertainty — the default refusal behavior described above. But those mechanisms get overridden by the pressure to produce complete, coherent output. The “Force Blank” instruction may effectively give the uncertainty pathway permission to activate, rather than being suppressed by the completion instinct. We don’t know for certain that this is the internal mechanism at work — but the practical result is consistent and reliable.

Rule 2: Penalize Guessing

By default, from the model’s perspective, a wrong answer and a blank answer carry equal weight — neither earns praise, neither triggers correction. The model has no reason to prefer one over the other, so it defaults to guessing (which at least has a chance of being right).

Rule 2 changes this calculus with a single sentence:

A wrong answer is 3× worse than a blank. When in doubt, leave it blank.

This mirrors the scoring reform that OpenAI’s September 2025 paper advocates at the benchmark level. The researchers propose that evaluation systems should award points for correct answers, penalize wrong answers more heavily than abstentions, and give partial credit for appropriate expressions of uncertainty. They note that some standardized human exams have used this approach for decades — penalizing wrong guesses more heavily than skipped questions — precisely to discourage blind guessing.

You can’t change the benchmark. But you can embed the same incentive structure in your prompt. The 3× multiplier is arbitrary — pick any number that makes the model understand that silence is preferable to fabrication. The key insight is that you need to say it explicitly. The model won’t infer this preference on its own.

Rule 3: Show the Source

Even models that are told to “extract only” will drift toward inference. They’ll compute a renewal date from a start date and term length. They’ll estimate a total from line items. They’ll infer a contact person from an email signature. These aren’t necessarily wrong — but they’re not extraction, and the user needs to know the difference.

Rule 3 requires the model to label every value as EXTRACTED (directly stated in the document) or INFERRED (derived, calculated, or interpreted), with an explanation for every inferred value.

The prompt language:

For each field, add a column called “Source.” Mark each value as one of: EXTRACTED — directly stated in the document, exact match. INFERRED — derived from context, calculated, or interpreted. For every INFERRED field, include a one-sentence explanation of what you based it on.

The output looks like this:

Field Value Source Evidence
Start Date Jan 15, 2025 EXTRACTED Section 2.1, paragraph 1
Term Length 24 months EXTRACTED Section 2.1, paragraph 2
Renewal Date Jan 15, 2027 INFERRED Calculated 24 months from start date. Check Section 8 — early termination clause may alter this.

The EXTRACTED/INFERRED distinction is a practical implementation of what hallucination researchers call “provenance tracking” — tying every claim back to its source. The model is perfectly capable of making this distinction; it just doesn’t bother unless you ask.


The Combined Prompt

All three rules work together. Here’s the complete version:

Extract the following fields from this document into a table.

Rules:

– Only extract values explicitly stated in the document.

– When a value is ambiguous, missing, or unclear, leave the field BLANK.

– A wrong answer is 3× worse than a blank. When in doubt, leave it blank.

– For each field with a value, add a “Source” column: EXTRACTED = directly stated, exact match. INFERRED = derived, calculated, or interpreted.

– For every INFERRED field, add a one-sentence explanation.

– For every BLANK field, add a row to a separate “Flags” table explaining why the value could not be extracted.

The workflow change this enables is significant. Instead of reviewing every extracted value (which nobody actually does), you review only the blanks and the inferred fields. Everything marked EXTRACTED with a section reference can be trusted at a higher confidence level. Your attention goes where it matters.


The Bigger Picture

These three rules are a stopgap. They work — sometimes remarkably well — but they’re fighting against the grain of how models are trained. The deeper fix requires changes at the infrastructure level.

OpenAI’s hallucination paper calls for benchmark reform: scoring systems that reward calibrated uncertainty instead of confident guessing. Anthropic’s interpretability work points toward architectural insights — understanding the internal circuits well enough to strengthen the “I don’t know” pathway rather than relying on prompt-level patches.

Perhaps the most structurally promising direction is OpenAI’s “Confessions” research (2025). Instead of relying on users to prompt honesty, the Confessions approach separates the honesty objective from the performance objective during training itself. After producing a main answer — optimized for all the usual factors like correctness, style, and helpfulness — the model generates a separate “confession” report. This report is scored exclusively on honesty: Did the model flag its uncertainties? Did it acknowledge where it took shortcuts? Crucially, nothing in the confession is held against the main answer’s score, so the model has no incentive to hide its doubts. If this approach scales, it could move the honesty problem from something users have to prompt-engineer around to something the model handles natively.

These are promising directions, but none of them are available to you today. What is available is the ability to change the local incentive structure in your prompts. Force blanks. Penalize guessing. Require source labels. These three rules won’t make AI honest by nature, but they create an environment where honesty is the path of least resistance — and that turns out to be surprisingly effective.

The models are smart enough to know when they’re guessing. They just need permission to say so.


Sources and Further Reading