smallbox


What you could build

Can GitHub plus an LLM make a review assistant teams actually keep on?

An AI code reviewer dies of noise, not of being switched off

Every automated code reviewer dies the same death, and it is rarely the death of being switched off in anger. It is being quietly muted. A team installs it; in the first week it posts forty comments, a third of them noise; and by the third week everyone has trained themselves to scroll past its avatar without reading. It is still running. It is still posting. Nobody is looking. That is the death: not deletion but irrelevance. The entire product is the work of avoiding it.

The part everyone pictures when they imagine building this is the easy part. A pull request opens, a webhook fires, you fetch the diff, you hand it to a language model with instructions to review it, you post what comes back as comments. That is a weekend's work on a foundation that already handles the accounts, the webhooks, and the jobs. It will work in a demo on the first try, and the demo will be misleading, because a demo is one carefully chosen pull request and the product is ten thousand of them landing on tired people who have the power to ignore you.
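To make the "easy part" concrete, here is a minimal sketch of that naive pipeline, written so it runs without a network. The names `Finding`, `handle_pull_request_opened`, `fetch_diff`, `ask_model`, and `post_comment` are all hypothetical; the three callables stand in for whatever code host client and model provider you would actually use. The only fields read from the event are `action`, `repository.full_name`, and `pull_request.number`, which is where GitHub's pull request webhook payload carries them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    path: str  # file the comment refers to
    line: int  # line in the new version of the file
    body: str  # comment text proposed by the model


def handle_pull_request_opened(
    event: dict,
    fetch_diff: Callable[[str, int], str],               # hypothetical: (repo, pr number) -> diff text
    ask_model: Callable[[str], Iterable[Finding]],       # hypothetical: diff text -> candidate findings
    post_comment: Callable[[str, int, Finding], None],   # hypothetical: posts one review comment
) -> int:
    """Naive pipeline: diff in, every model finding out as a comment.

    Returns the number of comments posted.
    """
    # Only react when a pull request is opened; ignore every other action.
    if event.get("action") != "opened":
        return 0

    repo = event["repository"]["full_name"]
    number = event["pull_request"]["number"]

    diff = fetch_diff(repo, number)

    posted = 0
    for finding in ask_model(diff):
        # No filtering at all: whatever the model says reaches the developer.
        post_comment(repo, number, finding)
        posted += 1
    return posted
```

Note what is missing: nothing sits between `ask_model` and `post_comment`. Every candidate the model produces becomes an interruption, which is exactly why this version demos well and ages badly.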

The one metric the whole thing lives or dies on

Strip away everything else and a code reviewer is judged on a single question, asked unconsciously by every developer who sees one of its comments: was that worth reading? Not "was it correct" — worth reading. A comment can be technically defensible and still be a waste of the developer's attention, and the developer does not separate the two. They feel the cost of stopping, and they remember it.

This is why false findings are not a quality issue you tune later. They are existential, and the reason is that they teach. One confidently wrong comment costs far more than one good comment earns, because the good comment helps with a single line and the bad comment teaches the reader a lesson about the tool — that its avatar is not worth the interruption. Once learned, that lesson applies to every comment the tool will ever post, including the right ones. A reviewer that is correct nine times and noise once does not net out ahead; it nets out muted, because the tenth comment is the one that sets the habit. The product is not "find issues in code." It is "find the issues that are worth interrupting a human for, and post nothing else," and the second clause is the hard one.

This failure has a name outside code review: alert fatigue, the same force that drowns a market-data alerting product when it pages users too often. Cry wolf and people stop running toward the sound — not because they decided you are useless, but because their attention is finite and you spent it carelessly.

The discipline that fights it

If there is a real product here, it is not in making the model find more. It is in making the system post less — specifically, in refusing to surface a finding until it has tried hard to prove the finding wrong and failed.

That is the move: before a comment ever reaches a developer, the system argues against its own finding. Is this actually a bug, or does it just look like one out of context? Would this fire on correct code? Is there a reason the author wrote it this way that the model cannot see in a diff? Only the findings that survive a genuine attempt to refute them get posted; the rest are killed silently, before they ever cost anyone a read. The default is silence. A finding has to earn its way past a skeptic to become a comment.
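One way to picture that gate is as a filter in which each candidate finding faces a list of skeptics, and any single refutation kills it. The sketch below is an illustration under stated assumptions, not the studio's implementation: `Candidate`, `Skeptic`, `surviving_findings`, and the two example skeptics are hypothetical names, and the skeptics here are toy heuristics standing in for what would more plausibly be a second model call or a static check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    path: str
    line: int
    claim: str


# A skeptic tries to refute a finding. It returns a reason if it succeeds,
# or None if it could not argue the finding away.
Skeptic = Callable[[Candidate, str], Optional[str]]


def is_refuted(candidate: Candidate, diff: str, skeptics: Sequence[Skeptic]) -> bool:
    """True as soon as any skeptic produces a refutation."""
    for skeptic in skeptics:
        if skeptic(candidate, diff) is not None:
            return True
    return False


def surviving_findings(
    candidates: Sequence[Candidate], diff: str, skeptics: Sequence[Skeptic]
) -> List[Candidate]:
    """Return only the findings that every skeptic failed to refute."""
    # Default is silence: with no skeptics, nothing has been challenged,
    # so nothing has earned the right to be posted.
    if not skeptics:
        return []
    return [c for c in candidates if not is_refuted(c, diff, skeptics)]


# --- Toy skeptics, purely illustrative assumptions ---

def generated_file_skeptic(candidate: Candidate, diff: str) -> Optional[str]:
    if candidate.path.endswith(".lock") or "/generated/" in candidate.path:
        return "generated file; nobody wrote this by hand"
    return None


def hedged_claim_skeptic(candidate: Candidate, diff: str) -> Optional[str]:
    hedges = ("might", "consider", "could potentially")
    if any(h in candidate.claim.lower() for h in hedges):
        return "claim is hedged; not certain enough to interrupt anyone"
    return None
```

The design choice worth noticing is the empty-skeptics branch. A gate that passes everything when it has no checks configured fails open, and the failure looks like forty comments in the first week. Failing closed costs a missed finding; failing open costs the reader's trust.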

This is not a theoretical discipline borrowed for the article — it is close to how the studio does its own reviewing, where a finding is treated as a claim to be verified and the false ones are killed before they reach a person, precisely because a review full of plausible-but-wrong notes trains the reader to stop reading. The instinct transfers directly: the value of an automated reviewer is not its recall, it is its restraint, and restraint is the part that is genuinely hard to build and genuinely hard to copy.

What you own, and what you rent

The model is rented, and swappable — the diff goes in, the candidate findings come back, and which provider does that is a detail. What you would own is the layer that decides what is worth posting: the verification step that kills the weak findings, the record of what was flagged and what the team actually did with each comment, and the slow tuning of signal-to-noise that record makes possible. That record is the real asset, because it is the only way to know whether you are earning reads or losing them — and a reviewer that cannot measure its own signal-to-noise is flying blind toward the muted state. Rent the reading; own the judgment about what deserves to be said.
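As a rough sketch of what that record might look like, the snippet below stores one row per posted comment along with what the team did with it, then computes the share of comments that were acted on. Everything here is an assumption made for illustration: the three outcome labels, the `category` field, and the choice to define signal as "acted on divided by posted" are one plausible framing, not a standard metric.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Sequence


class Outcome(Enum):
    # Hypothetical outcome labels; a real system would need to decide
    # how each one is detected (follow-up commit, resolved thread, etc.).
    ACTED_ON = "acted_on"    # the author changed the code in response
    DISMISSED = "dismissed"  # the author explicitly rejected the comment
    IGNORED = "ignored"      # nobody responded at all


@dataclass(frozen=True)
class PostedComment:
    comment_id: int
    category: str  # hypothetical label for the class of finding
    outcome: Outcome


def signal_ratio(comments: Sequence[PostedComment]) -> Optional[float]:
    """Share of posted comments that were acted on, or None with no data."""
    if not comments:
        return None
    acted = sum(1 for c in comments if c.outcome is Outcome.ACTED_ON)
    return acted / len(comments)


def signal_ratio_by_category(comments: Sequence[PostedComment]) -> Dict[str, float]:
    """Same ratio, split by category, to show which findings earn their reads."""
    grouped: Dict[str, List[PostedComment]] = {}
    for c in comments:
        grouped.setdefault(c.category, []).append(c)
    # Every group has at least one comment, so the ratio is never None here.
    return {
        category: sum(1 for c in items if c.outcome is Outcome.ACTED_ON) / len(items)
        for category, items in grouped.items()
    }
```

The per-category split is the part that makes tuning possible: a category whose ratio sits near zero is a candidate for being silenced entirely, while the overall number only tells you that something is wrong.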

The hard part

Two, and they compound. The first is the signal-to-noise problem itself, which has no finished answer — keeping false findings near zero while still catching the real ones is a permanent tuning problem, not a feature you complete, and it is the kind of work that is never quite done because the cost of getting it wrong never stops mattering. The second is the market. This is tooling that sits next to where developers already work, and that neighbourhood is crowded with capabilities bundled into platforms teams already pay for — the code host's own review features, the assistant already in the editor. A standalone "AI reviews your pull requests" product walks into that room competing with things that come free in the tools alongside it.

The verdict

This reads as a module or a feature more than a standalone SaaS, and the honest reason is the shape of the value. The technology is easy and getting easier; the discipline — the restraint that keeps the tool worth reading — is the entire differentiator, and a pure discipline is a hard thing to sell as a separate product when adjacent platforms fold in a passable version for free. The version that could work is narrow and opinionated: a reviewer for a specific kind of codebase or a specific class of mistake, where the verification is genuinely sharper than the bundled tools and the team feels the difference in their muting behaviour rather than in a feature list. The general reviewer competes on breadth and loses; the sharp one competes on restraint and might not.

If you build it, the foundation does the bounded thing it always does — it carries the webhooks, the jobs, the accounts, and the logging so the work goes where the product actually is, which here is the verification layer that decides what is worth a developer's attention. That is the part no API hands you and no platform's bundled feature is incentivised to do well, and it is the only part of this idea that was ever going to be the company.

Articles describe the Foundation. The Foundation Map is the thing itself — accounts, admin, email, logging, and deployment, with one real workflow running through them.
