Refactor or rewrite: how to know which is safer

The question always arrives the same way. A founder, a new CTO, or a technical owner inherits a backend that works but has accumulated cruft. Logs that nobody watches. Tests that pass but nobody trusts. Files that the team avoids on Fridays. At some point, usually when a feature request reveals a hidden coupling or when a candidate engineer turns down the role after seeing the codebase, the question lands on the desk: do we refactor this, or do we rewrite it from scratch?

The honest answer is that "refactor or rewrite" is the wrong frame. Both are paths. Both are reversible or irreversible depending on how they are run. The real question is which of the two can be made safe, in this system, with the team and the budget you actually have.

This article walks through the choice the way the System Report does.

Why "rewrite" feels obvious, and why that feeling is misleading

Rewrites get proposed for emotional reasons more often than technical ones. The codebase looks ugly. A new framework looks clean. The original authors are gone. A junior engineer says "we could do this in three months in $LANGUAGE". The CTO does not yet know which parts of the existing system are accidental and which are load-bearing, so it is not yet possible to know what would be lost in a rewrite.

The cost of being wrong is asymmetric.

A refactor that fails costs a sprint. A rewrite that fails costs a year.

A rewrite also has a hidden second cost. While you are rewriting, the old system stops getting fixes. The team that was on the old system either resents the new one or quietly leaves. Customers feel a slow degradation in support. By month nine, the rewrite is competing with itself: every week of reality drift in the old system makes the new one less of a replacement and more of a parallel codebase. Some rewrites are abandoned. Some ship eighteen months late, missing features that nobody documented. A few succeed — and those are the rewrites that turn out, in the post-mortem, to have been done in stages anyway.

Saying this out loud is not the same as ruling a rewrite out. There are systems that should be rewritten. The point is that a rewrite is the larger decision, so it carries the larger evidence requirement.

The four properties of safe change, applied to both options

The System Report does not ask "is this code clean?" It asks whether a change to it can be made observable, testable, reversible, and confirmable. Anything missing one of those is not safe to recommend. The full definition lives in the four properties of safe change; applied to the refactor-or-rewrite question, it stops being abstract and becomes a comparison.

Observable. Refactors of a single area can be watched: the affected logs, the affected admin views, the affected metric. A rewrite running in parallel is much harder to observe, because the old system and the new system rarely emit the same signals. A team running both has to invent a shared telemetry layer before they can know whether the new one is behaving correctly. Most teams do not do this until the cutover is already in trouble.

Testable. A refactor can be exercised against the same database, the same vendor responses, and the same admin workflows that the old code already runs against. A rewrite on a clean stack has nothing to test against except itself, until something is wired into reality. The rewrite usually compensates with mock-heavy tests that pass at every commit and prove almost nothing — exactly the false-green problem that the System Report names explicitly.
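As a rough illustration of testing against what the existing code already does, here is a minimal characterization-test sketch. Everything named in it is hypothetical: `legacy_discount` stands in for whatever production function is being refactored, and the discount rule and sample inputs are invented. The expected values are recorded from the current behaviour, not derived from a specification.

```python
from decimal import Decimal, ROUND_HALF_UP


def legacy_discount(total: Decimal, customer_tier: str) -> Decimal:
    """Hypothetical stand-in for the existing production code."""
    rates = {"gold": Decimal("0.10"), "silver": Decimal("0.05")}
    rate = rates.get(customer_tier, Decimal("0"))
    return (total * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def record_behaviour(func, inputs):
    """Run the existing code over sample inputs and capture what it returns today."""
    return {args: func(*args) for args in inputs}


def check_against_recording(func, recording):
    """Return the inputs for which func no longer matches the recorded behaviour."""
    return [args for args, expected in recording.items() if func(*args) != expected]


# Invented sample inputs; in practice these would be drawn from real traffic or data.
SAMPLE_INPUTS = [
    (Decimal("100.00"), "gold"),
    (Decimal("19.99"), "silver"),
    (Decimal("0.05"), "gold"),
    (Decimal("50.00"), "unknown"),
]
```

A refactored function is then run through `check_against_recording` with the recording taken from the old one. An empty list means the sampled behaviour is preserved; it says nothing about inputs that were never sampled.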

Reversible. A refactor is reversible per change: the old function is still in the file, the new one is behind a flag, the old route still exists, the migration has a down path. A rewrite is reversible only at the level of "abandon it and go back". Two months in, that is already an expensive reversal. A year in, it is a board-level conversation.
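A minimal sketch of per-change reversibility, assuming a flag read from an environment variable. The flag name `USE_NEW_INVOICE_TOTAL`, the `flag_enabled` helper, and both invoice functions are hypothetical; a real system would likely use its own flag service.

```python
import os


def invoice_total_old(lines):
    """Existing implementation, left in the file untouched."""
    total = 0
    for quantity, unit_price_cents in lines:
        total += quantity * unit_price_cents
    return total


def invoice_total_new(lines):
    """Refactored implementation, reachable only when the flag is on."""
    return sum(quantity * unit_price_cents for quantity, unit_price_cents in lines)


def flag_enabled(name, env=None):
    """Hypothetical flag lookup: an environment variable, off by default."""
    env = os.environ if env is None else env
    return env.get(name, "off") == "on"


def invoice_total(lines, env=None):
    """Single entry point; turning the flag off restores the old path."""
    if flag_enabled("USE_NEW_INVOICE_TOTAL", env):
        return invoice_total_new(lines)
    return invoice_total_old(lines)
```

Reversal here is a configuration change, not a deployment. The old function is deleted only after the new path has been confirmed.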

Confirmable. Refactors usually preserve enough of the surface that a product owner or a long-tenured operator can confirm that nothing important changed. Rewrites change everything at once. A team rewriting a system rarely realises how much small business behaviour they are reinventing — the rounding rule on invoices, the off-hours email template, the partner CSV that has a column nobody remembers naming. A rewrite without a long, detailed confirmation phase ships with a tail of "we used to do that" bugs.

A rewrite can satisfy the four properties. It usually does not, because teams underinvest in the shared telemetry, the parallel-run window, the characterization tests, and the confirmation phase that would make it safe. The System Report's job is to say which of the four is missing in this specific engagement, and to refuse to recommend the option until it is filled in.
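The gate described above can be sketched as a small data structure. This is an illustration only, not the System Report's actual format; the class and function names are invented.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SafetyCheck:
    """One answer per property, for one candidate path."""
    observable: bool
    testable: bool
    reversible: bool
    confirmable: bool

    def missing(self):
        """Names of the properties that are not yet satisfied."""
        return [name for name, ok in asdict(self).items() if not ok]


def recommendation(path_name, check):
    """Refuse to recommend a path while any property is missing."""
    gaps = check.missing()
    if gaps:
        return f"{path_name}: not recommendable yet, missing {', '.join(gaps)}"
    return f"{path_name}: recommendable"
```

The useful output is the list of gaps, because it names the preparation work that would have to be funded before the path becomes an option.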

What the system itself is telling you

Before deciding, look at what the existing codebase actually does for the business. The System Report builds a small artefact for this — the Business-Use Map. It lists the system areas that carry revenue, the ones that carry compliance, the ones used rarely but at high stakes, and the ones that are technically large but business-light. The map is built from interviews and from reading, not from speculation.
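One way to picture such a map is as a list of areas tagged with the business role they carry. The sketch below is an assumption about shape only: the type names, the `source` field, and the example entries are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class BusinessRole(Enum):
    REVENUE = "carries revenue"
    COMPLIANCE = "carries compliance"
    RARE_HIGH_STAKES = "used rarely, high stakes"
    BUSINESS_LIGHT = "technically large, business-light"


@dataclass(frozen=True)
class SystemArea:
    name: str
    role: BusinessRole
    source: str  # the interview or code reading this entry is based on


def group_by_role(areas):
    """Group area names by business role, preserving input order."""
    grouped = defaultdict(list)
    for area in areas:
        grouped[area.role].append(area.name)
    return dict(grouped)


# Invented example entries; a real map comes from interviews and reading.
EXAMPLE_MAP = [
    SystemArea("checkout", BusinessRole.REVENUE, "interview: finance lead"),
    SystemArea("audit export", BusinessRole.COMPLIANCE, "code reading"),
    SystemArea("year-end close", BusinessRole.RARE_HIGH_STAKES, "interview: operations"),
    SystemArea("legacy report builder", BusinessRole.BUSINESS_LIGHT, "code reading"),
]
```

Requiring a `source` on every entry is a small way to enforce that the map is built from evidence.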

Two patterns recur:

Patterns that survive into the new system. Authentication, payment, stateful workflows, tenant isolation, integration with vendor APIs the business depends on. These are the parts where the original team made a thousand quiet decisions, only some of which are documented. Rewriting them means rediscovering each decision under deadline.

Patterns that the new system can leave behind. Internal admin tooling that nobody uses. Features built for a customer who churned three years ago. Job scheduling that the cron equivalent in your new platform handles natively. Reporting screens superseded by a separate analytics tool.

The shape of the answer is rarely "refactor everything" or "rewrite everything". It is usually: keep the part that holds the most business meaning, replace the part that has the least, and run both in parallel during the transition. The discipline behind that approach is described in the page-by-page MVC modernisation case, where each page is moved end-to-end behind a routing flip, with the old system still running until the new one is verified. There is never a moment where the system is broken.
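The routing flip can be sketched as a table that maps each page to the system serving it. This is a simplified assumption about how such a flip might work, not the implementation from the case mentioned above; the paths and function names are hypothetical.

```python
def serving_system(path, routing):
    """Pages not listed in the routing table stay on the old system."""
    return routing.get(path, "old")


def flip_page(routing, path, verified):
    """Return a new routing table; the page moves only once it is verified."""
    if not verified:
        return dict(routing)
    return {**routing, path: "new"}


def roll_back(routing, path):
    """Point a page back at the old system, which is still running."""
    return {**routing, path: "old"}
```

Because every function returns a new table and never mutates the old one, each flip is a single, reviewable change that can be undone page by page.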

What the comparison looks like in practice

A System Report does not write a one-line answer. It writes a comparison that looks roughly like this for each candidate path:

Path A — refactor in place. Affected areas: three. Sequence: extract OrderService ownership, add characterization tests around the discount calculation, move the partner CSV behind an explicit format contract. Total effort range: 6–10 weeks. Reversibility: per change, by feature flag. Confirmation: existing operators sign off the partner CSV behaviour before the old path is removed.

Path B — rewrite to a new stack. Affected areas: all. Sequence: build a parallel system, mirror traffic for 8 weeks, run a 2-week confirmation window with operations, cut over. Total effort range: 8–14 months. Reversibility: only at abandonment. Confirmation: a long, expensive parallel-run period the team has not yet committed to staffing.

Recommendation. Path A. The four properties are answerable today. Path B is recommendable only after the revenue-bearing parts of the system have been characterized with high-trust tests and the parallel-run telemetry has been built. That preparation is the same work as Path A.

That last sentence is the quiet truth of the comparison. The work that would make a rewrite safe is the same work that makes a refactor possible. If you can do the safe work, you usually do not need the rewrite.
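To make the mirrored-traffic step in Path B concrete, here is a minimal sketch of a parallel-run comparison. It assumes both systems can be called as plain functions with the same request, which is a simplification; `mirrored_call` and the shape of the mismatch records are invented for illustration.

```python
def mirrored_call(request, old_handler, new_handler, mismatches):
    """Serve from the old system and compare the new system's answer on the side."""
    response = old_handler(request)  # the old system stays authoritative
    try:
        shadow = new_handler(request)
        if shadow != response:
            mismatches.append({"request": request, "old": response, "new": shadow})
    except Exception as exc:
        # A failure in the new system is recorded, never shown to the user.
        mismatches.append({"request": request, "old": response, "error": repr(exc)})
    return response
```

The `mismatches` list is the shared signal that both systems otherwise lack. Each entry is a candidate "we used to do that" behaviour to confirm with operations before cutover.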

What this tells you to do next

Three moves, in order.

Stop deciding in the abstract. Most refactor-or-rewrite decisions are made before anybody has read the codebase end-to-end with the four properties as the gate. The decision quality jumps once the reading exists. It is not necessary to know every file. It is necessary to know which areas are revenue-bearing, which are unconfirmed, and which are dead.

Treat the rewrite as the option with the larger evidence requirement. A refactor needs a small safety net per change. A rewrite needs a parallel-run telemetry layer, a confirmation phase, a characterization test estate, and an organisational commitment to running two systems at once. If those are not yet funded, the rewrite is not yet a real option.

Ask which path can be reversed cheaply. If you cannot answer that, the uncertainty is the answer. Refactor first. Rewrite later, against a system you now understand.

Where this fits in a System Report

The first deliverable of a Smallbox engagement is the System Report. It is the artefact that turns "refactor or rewrite?" from a feeling into a comparison the four properties of safe change can be applied to. It contains the Business-Use Map that ranks system areas by relevance, the test-trust classification that tells you how much your existing safety net is worth, and one recommended next implementation package: sometimes a refactor, sometimes a phased rewrite, sometimes a pause to build a safety net before either is responsible.

If the question on your desk is "do I refactor this, or do I rewrite it?", the System Report is the cheapest way to find out which question you should actually be asking.

Articles describe the lens. The questions a System Report asks are how that lens is applied to your system.
