What characterization tests are, and when you actually need them

The hardest code to refactor is the code nobody understands. Every team has it. The function that has been there for seven years, that nobody on the current roster wrote, that has no test, that handles the partner export, and that the team has agreed not to touch on principle.

The trick is that you do not need to understand the function to make it safe to change. You need a test that captures what the function does today — not what it should do — and then you can refactor freely until that test breaks. That kind of test has a name: a characterization test.

This is one of the most under-used patterns in inherited-system work. It is also one of the cheapest things the System Report recommends, in raw effort. The reason it is under-used is not that it is hard; it is that engineers are trained to write tests against the correct behaviour, and a characterization test is written against the current behaviour, which is a different posture.

The wrong question

The intuitive question, when you sit down to test a mystery function, is what is this function supposed to do? That question is unanswerable on inherited code. The original author is gone, the spec is missing, the comments are wrong, and the variable names predate the team's current product.

The right question is the smaller one: what does this function do, on the inputs the system actually feeds it? That question always has an answer. You can find it by running the function. You can find it by replaying production traffic. You can find it by reading rows out of the database and feeding them through. The output the function produces today is the spec, by definition, until somebody who knows the business says it should not be.

A characterization test makes that observation reproducible. It captures inputs and outputs in source control, then asserts the same outputs come out next time. The function can be refactored, rewritten, replaced — the test stays the gate.

The shape of a characterization test

The pattern is simple, and the simplicity is the point.

Pick a function or a flow whose behaviour you need to preserve.
Capture inputs that exercise the behaviour. On a function with a single argument, that may be a list of values. On a flow, it may be a database state and a sequence of calls.
Run the existing system against those inputs. Capture the outputs verbatim. Save them as approved files in source control.
Write a test that re-runs the same inputs through the same code and asserts the outputs match the approved files.
From this point, any refactor that changes the output fails the test. Any refactor that preserves the output passes.

That is the whole pattern. Most characterization tests fit on a screen. The work is not in the test; it is in choosing the inputs.

A pricing function, for example. The team is afraid to refactor it. The team takes 200 historical orders out of the production database — masked if needed, in the way the production-shaped-data article describes — and runs them through the pricing function. The 200 outputs are captured, exact to the cent, and committed. From now on, any change to the pricing function that does not produce those exact 200 outputs fails CI. The team can refactor freely until the moment a real difference appears, and when it does, the difference is visible row by row.

This is a high-trust test by construction. The expected output was not invented by the test author; it was produced by the same system that runs in production. The test cannot lie because the system has to.

When a characterization test is the right move

The default mistake on inherited systems is to write characterization tests for everything. That is not what the report recommends. The pattern is targeted. The right move when:

You need to refactor the function but you cannot describe what it does. The classic case. The test captures the behaviour you cannot describe, and the refactor has a target.

You need to extract a piece of logic into its own service. The original logic continues to run; the extracted service is wrapped in a characterization test against the same inputs; when the outputs match for a long enough sample, the cutover is reduced to a routing change.

You suspect dead code but you are not sure. A characterization test on the suspected-dead function records what it does on the inputs it currently sees. If the function is genuinely dead, the test never runs. If it is sometimes dead and sometimes called, the test will show you when. This is a softer move than deletion, and it converts an unknown into evidence — see the five kinds of weird code.

You are about to let an AI agent rewrite the function. This is the AI version of the pattern. A characterization test is the contract the AI is rewriting against. Without it, the AI invents the contract from the implementation, and you get false-green tests that pass because the model wrote both sides. With it, the AI's rewrite either matches reality or fails — see AI coding tools and false confidence.

When it is the wrong move

A characterization test is not a substitute for a unit test on a known specification. If the function is a pure calculation with a documented formula — XP at level N is 200 + (N − 1) × 50 — a unit test against the formula is the right shape. A characterization test against the existing output would freeze the implementation against itself, which is the dangerous-trust trap the report warns against.

The line is whether the spec exists outside the code. The XP formula is documented in product spec → unit test. Nobody can describe the discount cascade end to end → characterization test. The first defends a known truth. The second defends an observed reality until the truth becomes known.

The data half

The piece teams underestimate is that the inputs matter as much as the outputs. A characterization test against five hand-written inputs will pass for almost any implementation that compiles. A characterization test against 200 real production-shaped inputs catches the difference the team did not know was there.

This is where the DB-per-test pattern earns its weight. Tests run against a frozen pg_dump of a 45-day simulation, through the same production code, with assertions on real outputs. A characterization test on top of that infrastructure is much stronger than the same test against synthetic data, because the simulation reaches edge cases that nobody thought to write.

For systems without that scaffolding, the lighter version still works. A few hundred recent inputs from production logs, masked, replayed against the function. Captured outputs in source control. Done.

What this tells you

Three moves, in order.

Stop trying to derive the spec. On code nobody understands, the spec is what it does. That is sufficient for safe refactoring, and the test is how you write it down.

Pick the inputs honestly. Five inputs is not a characterization test; it is a smoke test. The inputs need to be enough, and they need to be drawn from the shape the system actually sees, not from the shape a test author imagines.

Use the test as a contract for refactoring or for AI rewriting. The most powerful application of the pattern is the one that lets you change the code freely. A function with a real characterization test is a function the team is no longer afraid to touch.

Where this fits in a System Report

Characterization tests are recommended in the Change-Safety and Test Trust Plan — the artefact the report produces in week two. The plan picks specific behaviours that need to be preserved before implementation begins and writes the test recommendation per case. Add a unit test and add a characterization test are not the same recommendation, and the report is explicit about which one applies.

If a piece of code in your system is on the do not touch list, the cheapest move that takes it off the list is almost always the same one: capture what it does today, commit the inputs and outputs, and let the team refactor against the gate.

Articles describe the lens. The questions a System Report asks are how that lens is applied to your system.

Other articles in this cluster →See the System Report →Send your system context →

← All articles