A €5k hardware-purchase research exercise. I drafted a verdict using a strong frontier model with full session context. The verdict read internally consistent.
Two fresh-session reviewers found fifteen issues in it, with near-zero overlap between them. One was a small same-family model running a mechanical checklist. One was a large same-family model in a fresh session running an argument-shape critique.
Two of those issues are useful here.
The first was a number nobody could source. A trade-press article had stated that a recent open-weight model needed only 5.48 GB of GPU memory to hold a million-token context, framed as a hardware-does-not-matter-anymore argument. The figure was credited to an unnamed analyst using unspecified modeling benchmarks. The drafting context took the number as authoritative. Four independent technical sources, when checked, gave figures from 9.6 GB up. The vendor's own model card gave only a relative figure.
The second was a category swap the drafter never flagged. An argument about demand for AI tools that assist with code by completing lines and suggesting edits was using usage data from a different product class: AI tools that run as long-running autonomous agents executing multi-step plans. The two shared the "self-hosted AI agent" framing but did materially different jobs. A reviewer with no investment in the drafting context, asked directly whether the cited platform's stated primary purpose matched the use case being argued, surfaced the swap immediately.
Notice what would not have caught either of these. Most advice on catching GenAI mistakes points at the model. Retrieval-augmented generation, citation checkers, grounded generation, judge models. The implicit picture is that if the model had more information, or the right kind of supervision, the output would be more reliable. Neither error here lives in the model's output. Both live in the drafting context, which the model cannot see from inside.
RAG retrieves what you ask for. The drafter had access to the trade-press article and to vendor documentation; what was missing was a prompt to compare them against each other. Citation checkers confirm that referenced sources exist. The 5.48 number had a citation, vaguely, with no name and no methodology. That is exactly the pattern model-level checkers cannot flag. Grounded generation pins outputs to a corpus. It does not check whether the corpus supports the kind of claim the output makes. All three operate on the model's output. None of them operate on the context that produced it.
What did catch the errors was a workflow with three properties: a registry that named which claims supported which conclusions and required source-tier metadata on each, a fresh-session reviewer with no investment in the conclusion, and a multi-pass structure that escapes the bias of the drafting context at two different scales. The registry is the verification artefact. The reviewer is the verifier. The multi-pass structure is the redundancy. None of this lives inside the model.
The framework I have been building is agent-ready-papers. The name is paper-specific; the discipline is not. Typed claim registry. Source-tier metadata on every load-bearing claim. Anti-hallucination checklist applied before commit. Multi-pass review with reviewer-bias escapes at two different scales. The paper-specific bits, page budgets, LaTeX, journal style guides, are about twenty percent of it. The other eighty percent is "GenAI failure modes need verification infrastructure because the failures are not noise, they are structural."
What I have come to see is that this generalises cleanly to research and decision-making. If you are about to commit money or reputation based on GenAI-synthesised claims, the typed registry, source tiering, anti-hallucination checklist, and multi-pass review apply identically. None of them require the artefact to be a paper.
The honest counter-argument is overhead. A V&V battery costs real time. For low-stakes work the model-level interventions are the right tool: cheap, fast, no infrastructure. The framework starts earning its keep where the cost of being wrong exceeds the cost of running the verification. Hardware purchase decisions cross that threshold. Most strategic decisions do. Most personal coding sessions do not.
The discourse will catch up to this distinction slowly. Every GenAI-engineering thread right now defaults to model-level: which retriever, which evaluator, which prompt template. These are real, and they matter. But they will not catch the failures whose cause lives in the workflow that wraps the model. Category-shape and source-shape are not properties the model has access to from inside its own context. The verifier has to be outside.
Where have you seen verification at the wrong layer catch up with you?