smallbox

← All articles

Working with AI agents safely

Part of the system is a language model I can't fully inspect. How do I trust what it says?

Auditing a system that thinks

Most of a system, you can audit by reading it. You step through the code, recompute the number by hand, find the line where the logic went wrong. A language model is not like that. It hands you a fluent, confident answer — a score, a classification, a paragraph of analysis — and there is no line to step through. The reasoning that produced it isn't written down anywhere you can open.

That would be tolerable if a model were most reliable on the things that matter most. It is often the opposite. A model is trained on the public world, which does not always tell the truth, and it is least precise on exactly the things people act on: a specific number, a named fact, a date. It will give you "$2.3 billion" in the same calm tone whether the figure is right, rounded, or invented. The danger isn't that it writes badly. It's that it writes confidently about things that are not so — the same failure that makes AI coding tools risky on a real codebase, pointed at facts instead of code. To a hurried reader, confidence reads as correctness.

So if you put a model in front of any decision, you inherit a specific problem: how do you trust output you cannot inspect?

The check has to come from outside the model

The wrong move is to try harder to inspect the model. You can't, and a cleverer prompt doesn't change that. The useful question points outward instead of inward: what did we check against something the model did not produce?

Everything that works comes back to that one line. A check has to anchor on ground truth the model never touched. If the only evidence that an answer is right is the model itself, you have a feeling, not a check.

This is why one tempting move isn't a check at all: asking the model the same question again. If it agrees with itself, that feels like confirmation. It isn't. The same question asked twice is one witness repeated, not two witnesses agreeing — and depending on how the call is made, the second answer may simply be the first one returned from a cache. Repetition raises your confidence without adding any evidence, which is the most expensive kind of false comfort. Real independent evidence comes from a different direction: a different question that should only line up if the answer is true, or the same fact recomputed a way the model never saw. Diversity of angle is what makes two checks worth more than one. Sameness just echoes.

Three rules that keep generation honest

Here is what that looks like in a system we built that reads the structure of public companies. Most of what it says is generated, so most of it has to be checked against something that isn't.

The model never grades its own homework. When a claim reduces to arithmetic — has this company earned a positive income every year for five years? — we don't ask the model whether it's true. We recompute it in plain code from the stored financial rows, then check the model's name for the thing against the calculation. Most of the catalogue is verified this way, with no model in the loop for the math at all. The model's job is to describe; the check's job is to disagree when the description is wrong.

What gets generated and what gets trusted are kept apart. A finished report is assembled only from claims that have already passed a check — the part of the system that writes the prose never sees a raw signal, only verified evidence. A fluent sentence can't smuggle in an unverified claim, because by the time the writer runs, the unverified claims are already gone.

Every claim carries two labels. One says where its backing comes from — a computed fact, a structural reading, an outside source. The other says how strongly it may be stated — verified, supported, or speculation. The backing caps the strength: a structural reading may not be written as a verified fact. The single forbidden move is promoting one into the other, and we treat it as the cardinal sin precisely because fluent prose makes it more dangerous, not less. A fabricated specific rides the same confident tone as the sound mechanism beside it, and the reader can't hear the seam.

The part we print instead of hiding

Then there is the part most systems bury, which we decided to print.

Some things we cannot audit automatically yet, and we say so. A reading about the shape of a price chart can't be confirmed by recomputation — there's no number to recompute, and a model asked to read the chart can simply misread it. A claim about a company's named suppliers, or how concentrated its customers are, can't be checked at all, because we don't hold that data. Asked anyway, a model will happily produce a confident, specific answer it has no right to give. In testing, ours did: it wrote that a company "cannot operate without TSMC," that a reservoir's inflows "fall 40–60% below design levels," that an operation "sits inside one building in Shanghai." Plausible, specific, and unfounded. The skeleton of the reasoning was sound; the details were invented, and fluent enough to pass for fact.

The fix there isn't a better prompt. It's a rule about what the model is allowed to keep: where the only backing is a general pattern plus the model's own inference, the claim is abstracted back to the mechanism, and the gap is named out loud. The system has a section whose entire job is to state, plainly, what it cannot yet see. "Insufficient evidence" is treated as a correct answer, not a failure to paper over. A short honest result beats a long confident wrong one.

A generator, not a source of truth

None of this makes the model less useful. It makes it useful in the right place. A language model can be a remarkable generator and a poor source of truth at the same time — and for anything quantitative, financial data most of all, it cannot be the only source, because it is not precise enough to trust alone. The way to build with one is not to make it more certain. It's to build the system so that the model's certainty is never the thing you lean on.

If part of your system is a model you can't fully inspect, the move is the same as ours: separate what it generates from what you can independently check, anchor every claim that matters on something the model didn't produce, and make "we can't see this yet" an answer the system is allowed to give. The rest is keeping the record — the version of the output you can point to later, and a way to check it still holds six months from now.

A system that thinks can't be audited line by line. It can be built so that the parts you trust are the parts you checked.

Articles describe the Foundation. The Foundation Map is the thing itself — accounts, admin, email, logging, and deployment, with one real workflow running through them.

← All articles