Two AI reviews in assessment mode caught zero errors in a 68-equation theory document. A third pass with the same model family but a reproduction instruction, caught three.

Some context. A theory document I had been working on for several months. Ten chapters, AI-assisted throughout, covering mechanics, synchronisation, electromagnetism, control, power analysis, and the error budget. Hardware was about to be built on top of it. I needed to know whether the equations were right.

First, I ran two LLM reviews in the usual assessment mode: Read this, check it for physical soundness, and flag anything that looks wrong. One pass used Gemini, the other Claude. Both came back clean. No errors found. Dimensions checked out, magnitudes were in plausible ranges, and the prose was internally consistent.

Then I ran a third pass with a fundamentally different instruction. This time, the prompt forced reproduction rather than judgment: Substitute the stated input values into every equation. Compute the result step by step using code and calculation tools. Compare the output against the value in the document. Never skip a calculation; never paraphrase a table.

That pass caught the errors.

Coupling estimates came out optimistic by a factor; timing values were nearly an order of magnitude too small. Two were arithmetic bugs of the same kind; neither survived recomputation. The third error was sharper because it self-contradicted on the page: the text said one thing, but the numbers said another.

All three errors produced individually plausible numbers. They had the right units, reasonable magnitudes, and cohered perfectly with the surrounding prose. An assessment-mode reviewer checks whether an artifact looks right. These errors looked right.

Reproduction strips out the degree of freedom plausibility lives in. A calculator does not have an opinion about whether 3.3 times 3,600 equals 36 seconds per hour. The instruction to compute rather than to judge removes the move where "this reads as reasonable" gets to substitute for "this is correct." Same model. Different mode. Different result.

AI review is plausibility review, unless you force it not to be.

This mechanism isn't specific to arithmetic. Assessment checks the surface an artifact presents; reproduction operates on a reality the artifact cannot fake.

Citations are a perfect example. They are entirely plausibility-shaped: an invented DOI looks like a real DOI, and a hallucinated author sounds like a real author. An LLM asked "does this citation support the claim?" can answer fluently while referencing a paper that doesn't exist. The escape velocity is forcing the AI to verify that the source actually exists and explicitly matches the text, rather than judging whether the citation reads plausibly.

Code follows the exact same pattern. Functions that compile, pass their tests, but solve entirely the wrong problem are "plausibility-passing." When the same developer writes both the implementation and the test, the test inherits the implementer's blind spots. The only check that escapes this loop is reproduction in the runtime sense: executing the function against edge cases the implementer failed to anticipate.

In every instance, the pattern is identical: the artifact reads as right, so the reviewer accepts it. The escape is a completely different kind of check, not a sharper version of the same judgment.

These patterns form the backbone of a framework called agent-ready-papers, which includes the equation-checker for arithmetic, the anti-hallucination checklist for citations, and reproduction tests for code.

Rigorous verification costs real time and compute, and not every artifact warrants it. A weekend prototype probably doesn't. But a theory document with hardware waiting on it easily crosses the threshold. So does a €5k procurement exercise, or a production deployment whose failure mode is being quietly wrong for months.

The framework starts earning its keep the moment the cost of being wrong exceeds the cost of running the verification.