Agreement is not verification. · dev.jeroenveen.nl

A plan arrived from another working session, and it looked finished. One GPU workstation, around €8,000, with the graphics card alone at €4,500 to €5,500. It came with a rationale, a tax treatment, a line in a budget spreadsheet, and the quiet authority of something already decided. The word attached to it was "locked."

It was also wrong, and the thing that showed it was wrong cost about ten minutes.

I want to be precise about what "wrong" means here, because the plan was not sloppy. It was internally consistent in the way that makes you stop checking. The 48GB of video memory was justified by the size of the model it would run. The model size was justified by the workload. The cost was justified by a Dutch investment-deduction argument that genuinely does reward buying more in a given tax year. Each step followed from the one before. A confident plan and a correct plan look identical from the outside. Self-consistency reads exactly like correctness, and it is not the same thing.

The crack was a single question the plan never grounded: did the workload actually need 48GB? The plan asserted it. It did not show it. So I read the actual specification, the source document, not the plan's summary of it. The picture inverted. The large model in this system runs as a frozen oracle: it does inference to generate training labels. It is never itself trained. The thing that does get trained is a small classifier, a fraction of the size, cheap to fit almost anywhere. Running a 27-billion-parameter model for inference at 4-bit precision needs roughly 17GB. A 24GB card covers it with room to spare. The 48GB figure had come from a full-precision number that the actual workload never used, multiplied by a tax argument that happened to reward the larger purchase. The tell was sitting in the same folder: an earlier version of the budget had specced a 24GB card. The expensive version drifted in later and picked up the word "locked" along the way.

That is the boring kind of error, the kind a careful read catches. The other kind is more interesting, because models are most confident exactly where they are most checkable. A month earlier, a different plan had recommended a specific laptop: a plausible model name, a plausible price, a plausible spec. The machine did not exist. The generation it named had never shipped in that configuration. It was a composite, real parts from three different laptops fused into one model number that no shop had ever sold. What surfaced it was not a smarter prompt. It was asking a second and a third engine the same narrow question independently, and watching them fail to confirm. A part number, a price, a date: that is where fluent output lies with the straightest face, because those are the details that make a recommendation feel researched.

So for the GPU decision I built the checking in. One model did the primary research. Then two others, different families with different training data and different failure modes, were each handed the same narrow questions: does this card exist, is this price plausible for this market, right now. Where three independent passes agree, you have something close to evidence. Where they diverge, you have a flag, and a flag is more useful than a smoothed-over average. One card in the set came back confirmed by only a single outside model, because the other hit a usage limit mid-run. That gap is worth recording honestly rather than rounding up to "verified." A check you did not actually complete is not a check.

Then the part I did not expect. Verification has a reputation as the function that reverses decisions, the skeptic in the room who says no. This time it did the opposite. I reopened the entire question across the whole market, thirty-five cards, new and used, consumer and datacenter and even the odd memory-modified import. The sweep did not overturn the cheap option. It confirmed it. A used 24GB card was the only strong single-card fit at a sensible price, and the search surfaced exactly one genuinely new alternative worth filing away for later. Verification that confirms is not wasted motion. It turns a hunch into a decision you can defend, which is a more durable thing to own than a guess that happened to be right.

Underneath all three episodes is the same confusion, and it is an expensive one. Agreement is cheap. Verification is not. They feel the same from the inside. Two models agreeing, or one session announcing that something is locked, has the texture of having been verified. It is usually just correlation of priors: models trained on overlapping data, or one context agreeing with itself. Real verification is independent or it is nothing. A different model family, though even that is only partly independent: models trained on overlapping slices of the same internet can share a blind spot, so two of them agreeing is weaker than it looks. A fresh session carrying none of the drafting context. Best of all, the primary source read directly, rather than through a fluent summary that has already decided what it says. I have argued before that verification is a workflow problem rather than a model problem. This is the same claim with a receipt attached.

The arithmetic is the part that should bother anyone shipping AI-assisted work. The unverified plan was about €8,000. The verified one is the right card plus a sane build around it, roughly €2,500 to €3,000, and almost all of the gap between that and a cheaper figure I would have quoted a year ago is a memory-chip shortage, not the card. What found the original error was a specification file I had not opened and two short queries to other models. Minutes, against thousands of euros. The costly mistakes in agentic work are rarely the obvious hallucinations, the made-up function or the garbled fact you catch on sight. They are the confident, well-formatted, internally consistent recommendations that no second source ever looked at, because they never asked to be looked at.

The plan that arrives looking finished is the one to check first. It has stopped inviting questions, which is exactly the moment the question is worth the most.

When a model hands you a clean, confident answer, what makes you check it anyway, and what makes you let it through?