Why a green test suite is not the same as safety

Two backends. Both have green CI. Both have a test suite the team calls comprehensive. One catches the next regression on the PR before it merges. The other ships the regression straight to production. The difference is not test count. It is what each test actually anchors against.

This is the hardest property of an inherited system to read from the outside, and the one that most often gets misread. A green run feels like permission. The System Report does not let it count as permission until one question is answered for each test that matters: what independent reality does this test agree with?

Three ways a green test means almost nothing

Three failure modes hide in plain sight inside a passing suite. Each one is common; on inherited systems running for five or more years, all three are usually present.

The author wrote both sides. The same engineer wrote the implementation and the assertion at the same time, against an invented expected value, with no independent source for the answer. The test agrees with the code; neither agrees with reality. When the code is wrong, the test is wrong in the same way.

The boundary that mattered is mocked. The thing the test was supposed to verify — that the entity actually persists, that the vendor really returns the shape the parser expects, that the email body renders the way the user reads it — is replaced by a mock that returns whatever the test wants. A unit test for BillingService that mocks the database tells you BillingService calls _db.Save. It does not tell you the row lands.
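A minimal sketch of that gap, written in Python for brevity rather than in the .NET stack the names above suggest. Everything here is a hypothetical stand-in: the `BillingService` shape, the `BrokenRepository` with its deliberate bug (it never commits), and the `invoices` table. The mocked test passes; only the round trip through a real SQLite file notices that nothing was stored.

```python
import os
import sqlite3
import tempfile
from unittest.mock import Mock


def create_schema(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE invoices (id TEXT PRIMARY KEY, amount_cents INTEGER)")
    conn.commit()
    conn.close()


class BrokenRepository:
    """Hypothetical repository with a deliberate bug: it never commits."""

    def __init__(self, path):
        self.path = path

    def save(self, invoice_id, amount_cents):
        conn = sqlite3.connect(self.path)
        conn.execute("INSERT INTO invoices VALUES (?, ?)", (invoice_id, amount_cents))
        conn.close()  # bug: the pending insert is discarded on close


class BillingService:
    def __init__(self, repo):
        self._repo = repo

    def charge(self, invoice_id, amount_cents):
        self._repo.save(invoice_id, amount_cents)


def mocked_test_passes():
    repo = Mock()
    BillingService(repo).charge("inv-1", 1200)
    # Proves the call happened. Says nothing about whether a row exists.
    repo.save.assert_called_once_with("inv-1", 1200)
    return True


def rows_after_real_round_trip():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "billing.db")
        create_schema(path)
        BillingService(BrokenRepository(path)).charge("inv-1", 1200)
        conn = sqlite3.connect(path)  # fresh connection, like a reload
        try:
            return conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
        finally:
            conn.close()
```

Under these assumptions `mocked_test_passes()` succeeds while `rows_after_real_round_trip()` finds no rows, which is the difference between "the service called save" and "the row landed".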

The assertion is on implementation, not on behaviour. Service.Foo was called once with X. The handler logged "completed". These are checks on the shape of the implementation. A correct refactor breaks them. A broken implementation that preserves the shape passes them. The test is not protecting behaviour; it is freezing structure.
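One way to see both halves of that claim side by side, sketched in Python. The three `total_*` functions and the `FlatTax` calculator are invented for illustration: one original, one correct refactor, one broken version that keeps the original call pattern.

```python
from unittest.mock import Mock


class FlatTax:
    """Hypothetical 20% tax on integer cents."""

    def tax_for(self, amount_cents):
        return amount_cents * 20 // 100


def total_v1(prices, tax_calc):
    # Original: asks for tax once per line item.
    return sum(p + tax_calc.tax_for(p) for p in prices)


def total_v2(prices, tax_calc):
    # Correct refactor: one tax call on the subtotal.
    subtotal = sum(prices)
    return subtotal + tax_calc.tax_for(subtotal)


def total_broken(prices, tax_calc):
    # Broken: same call pattern as v1, but the tax is never added.
    for p in prices:
        tax_calc.tax_for(p)
    return sum(prices)


def structure_test(total_fn):
    """Assertion on implementation: tax_for was called once per item."""
    calc = Mock()
    calc.tax_for.return_value = 0
    total_fn([1000, 500], calc)
    return calc.tax_for.call_count == 2


def behaviour_test(total_fn):
    """Assertion on behaviour: 1500 cents plus 20% tax is 1800 cents."""
    return total_fn([1000, 500], FlatTax()) == 1800
```

In this sketch `structure_test` rejects the correct refactor (`total_v2`) and accepts `total_broken`, while `behaviour_test` does the opposite on both.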

A suite where most tests fail one of these three checks is what the report calls false-green. It is not safety. It is a green checkmark.

The trust classification

The report tags each test that matters by the reality it actually anchors against. Four levels. The same test type can be all four depending on how it is wired.

High-trust — the assertion comes from somewhere the test author did not invent. Build an entity with ten known fields, save it through the real DbContext, reload it from the database, assert all ten fields persisted. Send a real HTTP request to the running system and assert the response shape matches the JSON the frontend already consumes. Run a known input through the document-generation path and diff against an approved file checked into source control. Re-evaluate a signal against a captured fixture and assert the score the product owner confirmed. The test is making the system meet reality, not meet the author.
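The first of those examples, the persistence round trip, might look roughly like the Python sketch below. It is an illustration under stated assumptions: the `Student` entity and its five fields are hypothetical (the text above uses ten), and in-memory SQLite stands in for the real database, which makes it weaker evidence than a run against the production engine through the real data-access layer.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Student:
    # Hypothetical entity; field names are invented for this sketch.
    id: int
    name: str
    email: str
    xp: int
    cash_cents: int


def save(conn, student):
    conn.execute("INSERT INTO students VALUES (?, ?, ?, ?, ?)", astuple(student))
    conn.commit()


def load(conn, student_id):
    row = conn.execute(
        "SELECT id, name, email, xp, cash_cents FROM students WHERE id = ?",
        (student_id,),
    ).fetchone()
    return Student(*row) if row else None


def round_trip(student):
    """Save through real SQL, reload, and hand back what the database kept."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE students ("
            "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
            "xp INTEGER, cash_cents INTEGER)"
        )
        save(conn, student)
        return load(conn, student.id)
    finally:
        conn.close()
```

A test built on this would assert `round_trip(known) == known` for a student with known field values, so a dropped column or a swapped parameter fails on the field it affects rather than hiding behind a mock.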

Medium-trust — the test exercises the real code paths with one boundary deliberately stubbed. The vendor is faked; the clock is frozen; identity is mocked. Useful when the stub matches the real boundary. The test is evidence to the degree that the stub is honest.
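A small Python sketch of the frozen-clock case. The seven-day quest rule and all names are assumptions made for illustration. The rule itself is the real code under test; only the clock is replaced, and the stub is honest to the extent that it returns the same kind of value (a timezone-aware UTC datetime) as the real clock.

```python
from datetime import datetime, timedelta, timezone


class SystemClock:
    """The real boundary: current time as an aware UTC datetime."""

    def now(self):
        return datetime.now(timezone.utc)


class FrozenClock:
    """The stub: returns one fixed instant, in the same shape as SystemClock."""

    def __init__(self, instant):
        self._instant = instant

    def now(self):
        return self._instant


def quest_is_open(started_at, clock, length=timedelta(days=7)):
    # Hypothetical rule: a weekly quest is open from its start
    # up to, but not including, start + 7 days.
    return started_at <= clock.now() < started_at + length
```

Freezing the clock just before and exactly at the seven-day mark exercises the boundary of the real rule. If the stub returned a naive datetime while production used aware ones, the test would stop being evidence even though it stayed green.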

Low-trust — heavy mocking, narrow assertions, behaviour invented by the author. May be worth keeping as a regression alarm — something changed — but cannot be cited as evidence the system is correct.

Dangerous — the test mirrors the implementation so closely that it will pass for any implementation that compiles. Mock-call assertions where the setup and the assertion say the same thing in two places. These are not weaker tests; they are negative-value tests, because they punish honest refactoring while protecting nothing. The recommendation in the report is usually to cut them.
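A deliberately bad example of the mirror pattern, sketched in Python with hypothetical names. The expected value in the assertion is the same value the setup fed into the mock, so the test holds no information about what a correct discount is.

```python
from unittest.mock import Mock


def apply_discount(price_cents, pricing):
    # Hypothetical code under test: delegates entirely to a collaborator.
    return pricing.discounted(price_cents)


def mirror_test(stubbed_value):
    pricing = Mock()
    pricing.discounted.return_value = stubbed_value    # setup says X
    result = apply_discount(1000, pricing)
    pricing.discounted.assert_called_once_with(1000)   # freezes the call shape
    return result == stubbed_value                     # assertion says X again
```

`mirror_test` passes for 900, for a negative price, and for a billion cents alike. It would fail only if someone changed how `apply_discount` calls its collaborator, which is the refactoring it punishes.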

The classification is per use, not per category in the abstract. An integration test against a real database is high-trust for the persistence behaviour and low-trust for the rendered HTML. The same test can be evidence and not-evidence in the same run.

The AI amplifier

This used to be a slow problem. AI coding tools have made it a fast one.

A model can generate a hundred green tests in an afternoon. Most of them mock everything they touch, assert on shape rather than behaviour, and produce expected values invented by the same model that wrote the implementation. The CI page lights up. The diff looks productive. The pull request is approved.

What was actually added to the system is coverage with no anchor. A future refactor breaks the tests for cosmetic reasons. A future regression slides through because the assertions never knew what behaviour to defend. The team's confidence rises while the actual safety net stays roughly where it was.

This is one of the central traps the report names when AI is used in implementation. The fix is structural, not stylistic: AI-generated tests are subject to the same trust classification as human ones. A high-trust AI test is fine. A low-trust AI test that nobody read carefully is worse than no test, because it carries the appearance of safety into the next decision.

What "anchored against reality" looks like in production

The pattern is in production on a gamified financial-literacy platform. The artefacts are real, named, and committable. A 45-day simulation runs every action (quizzes, stock buys, XP allocations, weekly quests) through the production Facade and BusinessServices. The resulting Postgres state is captured as a pg_dump and checked into source control. Every test starts from that frozen real state and runs through the same code that runs in production. Time is mocked and vendors are stubbed at the boundary, but the application layers are real.

What the test asserts is what the system did to that state. A test that says "buying a stock reduces cash" runs the real BuyAsync against a real student in a real portfolio inside a real database. If the cascade — fee calculation, portfolio weight, ledger entry, parent notification — is broken anywhere, the assertion fails. The author cannot quietly mock the broken thing into correctness.
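The shape of such a test, reduced to a Python sketch. None of this is the platform's code: the schema, the flat fee, and the `buy` function are invented, and an SQL script restored into in-memory SQLite stands in for the checked-in pg_dump and Postgres.

```python
import sqlite3

# Stand-in for the frozen dump. The rows are invented for illustration.
FROZEN_STATE = """
CREATE TABLE students (id INTEGER PRIMARY KEY, cash_cents INTEGER NOT NULL);
CREATE TABLE holdings (student_id INTEGER, symbol TEXT, quantity INTEGER);
CREATE TABLE ledger (student_id INTEGER, kind TEXT, amount_cents INTEGER);
INSERT INTO students VALUES (1, 100000);
"""

FEE_CENTS = 99  # hypothetical flat trading fee


def restore():
    conn = sqlite3.connect(":memory:")
    conn.executescript(FROZEN_STATE)
    return conn


def buy(conn, student_id, symbol, quantity, price_cents):
    """Hypothetical stand-in for the production buy path."""
    cost = quantity * price_cents + FEE_CENTS
    with conn:  # one transaction: commit on success, roll back on error
        (cash,) = conn.execute(
            "SELECT cash_cents FROM students WHERE id = ?", (student_id,)
        ).fetchone()
        if cash < cost:
            raise ValueError("insufficient cash")
        conn.execute(
            "UPDATE students SET cash_cents = cash_cents - ? WHERE id = ?",
            (cost, student_id),
        )
        conn.execute(
            "INSERT INTO holdings VALUES (?, ?, ?)", (student_id, symbol, quantity)
        )
        conn.execute(
            "INSERT INTO ledger VALUES (?, 'buy', ?)", (student_id, -cost)
        )


def test_buying_a_stock_reduces_cash():
    conn = restore()
    try:
        buy(conn, 1, "ACME", 3, 2500)
        (cash,) = conn.execute(
            "SELECT cash_cents FROM students WHERE id = 1"
        ).fetchone()
        (ledger_total,) = conn.execute(
            "SELECT SUM(amount_cents) FROM ledger WHERE student_id = 1"
        ).fetchone()
        # Both numbers are read back from the database, not from a mock.
        assert cash == 100000 - (3 * 2500 + 99)
        assert ledger_total == -(3 * 2500 + 99)
    finally:
        conn.close()
```

If the fee were skipped or the ledger insert dropped, one of the two assertions would fail, because each checks a row the code had to write.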

That is what the report means by high-trust. It is not a virtuous label. It is a structural claim: the test cannot lie because the system has to.

What this tells you

Three moves, in order.

Stop counting tests. Coverage percentage and test count are the wrong axes. The right axis is anchored-against-reality. A team with 200 high-trust tests is in much better shape than a team with 2,000 low-trust ones, and the ratio is not visible from CI.

Find the false-green areas first. The areas of the codebase where green CI is not yet evidence are the areas where new work is most dangerous. The cheapest improvement is rarely "more tests"; it is replacing a handful of low-trust tests with one or two high-trust ones for the same flow. A characterization test on the existing behaviour is usually where this starts.
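A characterization test can start as small as the Python sketch below. The `legacy_fee` function and its rules are hypothetical. The idea is to record what the code does today for a fixed set of inputs, check that record into source control as the approved snapshot, and fail when a later change produces different answers.

```python
import json


def legacy_fee(amount_cents, tier):
    # Hypothetical inherited function whose rules nobody fully remembers.
    if tier == "gold":
        return 0
    fee = amount_cents * 15 // 1000
    return max(fee, 50) if amount_cents > 0 else 0


CASES = [(0, "basic"), (1000, "basic"), (100000, "basic"), (100000, "gold")]


def snapshot(fn):
    """Serialize what fn does today for the fixed input set."""
    rows = [{"amount_cents": a, "tier": t, "fee": fn(a, t)} for a, t in CASES]
    return json.dumps(rows, indent=2)


def differences(approved_json, fn):
    """Return (approved, current) pairs where behaviour has changed."""
    approved = json.loads(approved_json)
    current = json.loads(snapshot(fn))
    return [(old, new) for old, new in zip(approved, current) if old != new]
```

A snapshot pins current behaviour, which is not necessarily correct behaviour. It moves toward high-trust once someone who knows the domain has reviewed and approved the recorded values.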

Treat the four-property gate as the standard. A change is not testable in the sense the report uses until at least one high-trust or honest medium-trust test exists for the affected behaviour. If none exists, the recommendation is to build it before the change, not to ship and watch.
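The rule in the second sentence can be written down as a check. This Python sketch covers only that rule, not the full gate, and its names are assumptions: `TrustRecord`, the `stub_verified` flag (used here to mean someone has confirmed the stub matches the real boundary), and the example records.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class TrustRecord:
    # One record per (test, behaviour) pair: the same test can be
    # high-trust for one behaviour and low-trust for another.
    test_name: str
    behaviour: str
    trust: Trust
    stub_verified: bool = False


def counts_as_evidence(record):
    if record.trust is Trust.HIGH:
        return True
    return record.trust is Trust.MEDIUM and record.stub_verified


def safe_to_change(behaviour, records):
    """True if at least one qualifying test covers the affected behaviour."""
    return any(
        r.behaviour == behaviour and counts_as_evidence(r) for r in records
    )


RECORDS = [
    TrustRecord("test_invoice_round_trip", "invoice persistence", Trust.HIGH),
    TrustRecord("test_invoice_round_trip", "invoice rendering", Trust.LOW),
    TrustRecord("test_vendor_quote", "price lookup", Trust.MEDIUM, stub_verified=True),
    TrustRecord("test_vendor_retry", "retry policy", Trust.MEDIUM),
]
```

With these records, a change to invoice persistence or price lookup clears the check, while a change to invoice rendering or the retry policy does not until a better test exists.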

Where this fits in a System Report

The trust classification is part of how the System Report thinks — specifically the Change-Safety and Test Trust Plan that the report produces in week two. The plan does not say add more tests. It says: for each behaviour your next implementation package will touch, what trust level is the safety net at today, and which one or two high-trust additions would move it from green-checkmark to evidence.

Test volume is not the goal. Confidence before change is.

Articles describe the lens. The questions a System Report asks are how that lens is applied to your system.
