smallbox

System Report

How the System Report thinks.

The full process is internal. What is on this page is the orientation map: the questions the report asks, and the standards it holds itself to, before any implementation work is recommended.

Why inherited systems are hard.

Most inherited systems are not difficult because of one bad file. They are difficult because important behaviour is spread across code, database shape, deployment assumptions, old business rules, admin tools, background jobs, vendor integrations, and team memory.

A safe change has to answer more than “where is the code?” It has to answer which behaviour is intentional, which is accidental but relied upon, which is dead, and which is unsafe. It has to answer what the system actually does in production, not what the README says it does.

Engineering budget is scarce.

Some tests buy safety; some only buy a green checkmark. Some refactors unlock value; some only move code around. The report exists to tell the difference.

Findings are ranked by business relevance, not technical ugliness. A messy file with no business consequence ranks below a clean-looking flow that quietly burns staff time or blocks the next valuable change. The relevance axis the report uses is concrete:

  • Revenue paths and customer experience.
  • Operational risk and support burden.
  • Delivery speed and reliability.
  • Compliance and onboarding.
  • The ability to safely ship the next valuable change.

The recommended next move is the one most justified by the evidence available, against the outcome you actually care about — not the one that would be most satisfying to refactor.

Safe change has four properties.

Before recommending any change, the report checks whether the change can be made:

  • Observable. We can see whether it worked — in logs, metrics, or an admin view.
  • Testable. There is a way to exercise the new behaviour against something other than itself: real database, real endpoint, real existing output, real product-confirmed example.
  • Reversible. There is a kill-switch, a flag, an old code path, or a rollback procedure. If there isn't, the report says so.
  • Confirmable. Weird or surprising behaviour is signed off by someone who knows the business, before it is changed.

Anything that cannot be made observable, testable, reversible, and confirmable is something the report cannot recommend changing safely — and it says so out loud rather than dressing it up.
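
One way to picture the four-property check is as a small record per proposed change. The sketch below is illustrative only: `ProposedChange`, its field names, and the two helper functions are hypothetical names chosen for this example, not the report's internal tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    """Hypothetical record of one proposed change and its four safety properties."""
    name: str
    observable: bool   # can we see whether it worked (logs, metrics, admin view)?
    testable: bool     # can it be exercised against something other than itself?
    reversible: bool   # is there a flag, kill-switch, old code path, or rollback?
    confirmable: bool  # has someone who knows the business signed off?


def missing_properties(change: ProposedChange) -> list[str]:
    """Return the safety properties the change does not yet have."""
    checks = {
        "observable": change.observable,
        "testable": change.testable,
        "reversible": change.reversible,
        "confirmable": change.confirmable,
    }
    return [prop for prop, ok in checks.items() if not ok]


def can_recommend(change: ProposedChange) -> bool:
    """A change is recommendable only when all four properties hold."""
    return not missing_properties(change)
```

Returning the list of missing properties, rather than a bare yes or no, mirrors the text above: when a change cannot be recommended, the gap is named.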

Test trust is a spectrum.

A green test is not, on its own, evidence that behaviour is protected. The report tags each test that matters by the independent reality it actually anchors against:

  • High-trust. Round-trips an entity through the real database. Sends a real HTTP request to the running system. Compares a generated document against an approved output. Anchored against reality the test author did not invent.
  • Medium-trust. Real code paths with one important boundary stubbed (vendor faked, time frozen, identity mocked). Useful when the stub matches the real boundary.
  • Low-trust. Heavy mocking. Behaviour invented by the test author. May be worth keeping as a regression alarm but cannot be treated as evidence the system is correct.
  • Dangerous. The test mirrors the implementation so closely that it passes whatever the code does, correct or not. Cut, not kept.

A hundred green tests that mock everything they touch protect almost nothing. The report says when a test estate is large but trust is low, instead of pretending that a large suite is the same thing as safety.
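
As a rough illustration, the tagging can be thought of as a mapping from test name to trust level, summarised per level. The names below (`Trust`, `summarise`, `high_trust_share`) are assumptions made for this sketch; the tags themselves still come from a person reading each test.

```python
from collections import Counter
from enum import Enum


class Trust(Enum):
    HIGH = "high"            # anchored against a real database, endpoint, or approved output
    MEDIUM = "medium"        # real code paths, one important boundary stubbed
    LOW = "low"              # heavy mocking, behaviour invented by the test author
    DANGEROUS = "dangerous"  # passes regardless of whether the behaviour is correct


def summarise(tags: dict[str, Trust]) -> dict[str, int]:
    """Count tagged tests per trust level, including levels with zero tests."""
    counts = Counter(tag.value for tag in tags.values())
    return {level.value: counts.get(level.value, 0) for level in Trust}


def high_trust_share(tags: dict[str, Trust]) -> float:
    """Fraction of tagged tests that are high-trust (0.0 for an empty estate)."""
    if not tags:
        return 0.0
    return sum(1 for tag in tags.values() if tag is Trust.HIGH) / len(tags)
```

A large suite with a small high-trust share is the "large estate, low trust" case described above.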

Access and certainty are linked.

You define the access boundary. The report works inside it. More context increases precision; less context means the report is more explicit about uncertainty.

  • Repository-only access. Reading-based recommendations.
  • Local build and test access. Executable safety checks become possible.
  • Staging or test-tenant access. Runtime verification becomes possible.
  • Anonymised or production-shaped data. Behaviour-preservation checks become possible.
  • Production monitoring access. Rollout observation becomes possible.

The report never claims that limited access produces full certainty. Each step down the ladder is named so you can choose to widen the boundary or accept the lower-confidence path.
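
The ladder can be sketched as an ordered enumeration. This example assumes the levels are cumulative (each level includes everything below it), which the list above suggests but does not state; `Access`, `UNLOCKS`, `available`, and `unavailable` are hypothetical names.

```python
from enum import IntEnum


class Access(IntEnum):
    REPOSITORY_ONLY = 1
    LOCAL_BUILD_AND_TEST = 2
    STAGING_OR_TEST_TENANT = 3
    PRODUCTION_SHAPED_DATA = 4
    PRODUCTION_MONITORING = 5


# What each level makes possible, taken from the list above.
UNLOCKS = {
    Access.REPOSITORY_ONLY: "reading-based recommendations",
    Access.LOCAL_BUILD_AND_TEST: "executable safety checks",
    Access.STAGING_OR_TEST_TENANT: "runtime verification",
    Access.PRODUCTION_SHAPED_DATA: "behaviour-preservation checks",
    Access.PRODUCTION_MONITORING: "rollout observation",
}


def available(granted: Access) -> list[str]:
    """Capabilities possible at or below the granted level (assumes cumulative levels)."""
    return [UNLOCKS[level] for level in Access if level <= granted]


def unavailable(granted: Access) -> list[str]:
    """Capabilities above the granted level, to be named as uncertainty."""
    return [UNLOCKS[level] for level in Access if level > granted]
```

The second function is the important one: whatever sits above the granted boundary is listed explicitly instead of being silently skipped.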

Production-shaped data, made safe.

When realistic data is needed, the report may recommend a masked, production-shaped copy: relationships and state distributions preserved, names and tokens replaced, password hashes removed.

Masking and pseudonymisation reduce identity risk; they do not automatically equal anonymisation under GDPR or any other regime. The report describes what was actually done to the data and leaves the compliance label to you.
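
A minimal sketch of what masking one record might look like, assuming a hypothetical user row with `name`, `email`, `api_token`, and `password_hash` columns. Keyed hashing (HMAC) is used here so the same input always maps to the same pseudonym, which keeps joins and duplicates intact; this is one possible technique, not a statement of how any particular dataset was processed, and it is pseudonymisation, not anonymisation.

```python
import hashlib
import hmac


def pseudonym(value: str, key: bytes, prefix: str) -> str:
    """Derive a stable, keyed pseudonym so equal inputs stay equal after masking."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def mask_user(row: dict, key: bytes) -> dict:
    """Return a masked copy of a hypothetical user row; the input is not modified."""
    masked = dict(row)
    masked["name"] = pseudonym(row["name"], key, "user")
    masked["email"] = pseudonym(row["email"], key, "mail") + "@example.invalid"
    masked["api_token"] = pseudonym(row["api_token"], key, "tok")
    masked.pop("password_hash", None)  # password hashes are removed, not replaced
    return masked
```

Identifiers and state columns such as `id` or `status` pass through untouched, which is what preserves relationships and state distributions. The key must be kept out of the masked environment, since anyone holding it can re-derive pseudonyms from known inputs.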

A test environment must not silently mutate the real world. Before any test or rollout exercise runs, the report confirms that emails are routed to a sink, payments are in sandbox, outbound webhooks point to test endpoints, cron jobs are disabled by default, and no production credentials remain in the non-production environment.
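
The pre-flight confirmation can be sketched as a function that returns every reason an environment might still touch the real world. All configuration keys and the test-host allowlist below are hypothetical; a real system would read these from its own settings.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts that count as test webhook endpoints.
TEST_WEBHOOK_HOSTS = {"localhost", "webhooks.test.invalid"}


def unsafe_settings(config: dict) -> list[str]:
    """Return reasons a non-production environment could still cause real side effects.

    Missing email, payment, cron, or credential keys are treated as unsafe,
    so the check fails closed. A missing webhook list means nothing is sent.
    """
    problems = []
    if config.get("email_transport") != "sink":
        problems.append("emails are not routed to a sink")
    if config.get("payment_mode") != "sandbox":
        problems.append("payments are not in sandbox")
    for url in config.get("webhook_urls", []):
        if urlparse(url).hostname not in TEST_WEBHOOK_HOSTS:
            problems.append(f"webhook points outside test endpoints: {url}")
    if config.get("cron_enabled", True):
        problems.append("cron jobs are not disabled by default")
    if config.get("production_credentials", True):
        problems.append("production credentials may be present")
    return problems
```

An empty list is the only result that lets a test or rollout exercise proceed; anything else is reported back before the exercise runs.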

Implementation strategy is shaped by your constraints.

The report does not recommend feature flags or canary deploys to a system that cannot run them. The strategy is shaped by your actual access, your actual environments, and your actual deployment shape.

For non-trivial refactors, the canonical sequence is:

  1. Prepare the safe environment.
  2. Load masked data.
  3. Add high-trust characterisation tests.
  4. Refactor internal structure.
  5. Compare old and new outputs where possible.
  6. Deploy in a small reversible step.
  7. Let your team exercise the system.
  8. Observe logs and the workflows that matter.
  9. Hold a stabilisation window.
  10. Only then begin the next feature.

The point of the sequence is that the refactor itself is the cheapest step. The other steps are what make it safe.
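
The "compare old and new outputs" step can be sketched as a small harness that runs both implementations over the same cases and reports every divergence. The pricing rule and function names here are invented purely for illustration.

```python
def compare_outputs(old_impl, new_impl, cases):
    """Return (case, old_output, new_output) for every case where the two differ."""
    mismatches = []
    for case in cases:
        expected = old_impl(case)  # the behaviour currently relied on
        actual = new_impl(case)
        if expected != actual:
            mismatches.append((case, expected, actual))
    return mismatches


# Hypothetical legacy rule: orders of 10 or more items get 5% off, in whole cents.
def legacy_total_cents(quantity: int, unit_cents: int = 1999) -> int:
    total = quantity * unit_cents
    if quantity >= 10:
        total = total - total * 5 // 100
    return total


# Hypothetical refactor of the same rule with clearer structure.
def refactored_total_cents(quantity: int, unit_cents: int = 1999) -> int:
    subtotal = quantity * unit_cents
    discount = subtotal * 5 // 100 if quantity >= 10 else 0
    return subtotal - discount
```

Calling `compare_outputs(legacy_total_cents, refactored_total_cents, range(0, 50))` returns an empty list when the refactor preserves behaviour over those cases. The old implementation is treated as the reference even where its behaviour looks odd, since odd behaviour is changed only after product confirmation.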

If AI is used, it is controlled.

The investigation phase uses AI as an inspection and compression tool. AI helps read large codebases more systematically — tracing dependencies across many files, summarising modules, finding repeated patterns, and surfacing likely business rules and areas that deserve closer human review. It is treated as an accelerated reading aid, not as an authority on what any of it means.

Implementation may use Claude or another AI coding agent. When it does, the human owns intent, business rules, structural decisions, and final review. The AI works inside a documented context (architecture, forbidden actions, business rules, working notes), in named operating modes (discovery, plan, scaffold, test-first, implementation, verification), with bounded sessions, branch isolation, per-edit safety gates, and the same test-trust classification described above.

The AI does not invent business rules, decide what is fragile, rank findings, or write the report. If a rule is unclear, the question goes back to you, not into the code. The deliverable is a human-led technical report.

Work advances by gates, not by calendar.

The work is not one continuous block. It runs as a sequence of small stages: orientation, flow discovery, safety discovery, strategy, and synthesis, followed, if you continue, by preparation, refactor and stabilisation, feature build, and rollout and observation.

Each stage has a goal, entry criteria, evidence it produces, and exit criteria. Stages do not advance because time passed. They advance because the exit criteria are met, or because you explicitly accept the risk in writing.

  • No refactor begins without a safety net around the touched flow.
  • No deploy without a clear rollback.
  • No removal of weird behaviour without product confirmation.
  • No AI implementation without a clear plan, assumptions, and test basis.
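
The gate rule can be expressed in a few lines. This is a sketch under assumed names (`Stage`, `may_advance`, `unmet`); it models only the exit side of a stage and says nothing about how criteria are evidenced.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Hypothetical stage record: each exit criterion maps to whether it is met."""
    name: str
    exit_criteria: dict[str, bool]
    risk_accepted_in_writing: bool = False


def unmet(stage: Stage) -> list[str]:
    """Exit criteria that are not yet met."""
    return [criterion for criterion, met in stage.exit_criteria.items() if not met]


def may_advance(stage: Stage) -> bool:
    """Advance only when all criteria are met, or the risk is explicitly accepted.

    A stage with no criteria defined does not advance on its own:
    an empty checklist is not the same as a passed one.
    """
    all_met = bool(stage.exit_criteria) and not unmet(stage)
    return all_met or stage.risk_accepted_in_writing
```

Elapsed time does not appear anywhere in the model, which is the point: nothing in `may_advance` can be satisfied by waiting.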

The 20 checks behind every report.

A compressed orientation map. The full report explains the evidence. If a check cannot be answered, the report names it as an uncertainty rather than pretending the answer exists.

Scope and access

  1. What is your primary outcome?
  2. What is in scope and out of scope?
  3. What access boundary did you grant?
  4. What was inspected with high confidence, with reading-only confidence, or not at all?

Build, run, test

  1. Can the system build locally?
  2. Can the system run safely in any environment we can reach?
  3. Can the existing tests run, and are they trustworthy?

Environments and data

  1. What environments exist: local, test, staging, production-like, production?
  2. What database, schema, and data access exists?
  3. Can masked or production-shaped data be created safely?
  4. Are side effects neutralised in non-production: emails, SMS, payments, webhooks, cron jobs, external writes?

Code and behaviour

  1. What are the representative request, job, and admin flows?
  2. What are the core entities and their mutation paths?
  3. Where do business rules live: code, database, config, stored procedures, admin tools, or people?
  4. Which weird behaviours are known business rules, accidental-but-relied-upon, dead, unknown, or unsafe?

From findings to recommendation

  1. Which findings are backed by evidence, not opinion?
  2. For each recommendation, is the change observable, testable, reversible, and confirmable?
  3. What minimum safety net is needed before implementation starts?
  4. What implementation strategy fits the real constraints: environment, data, deployment, rollback, monitoring, stabilisation?
  5. If Claude or another AI agent is used, what context, branches, operating mode, and safety gates control it?

The method, applied to our own code.

Before we apply this kind of discipline to a system you inherited, we apply it to one of our own. Two companion pages walk through pieces of CompanyGraph’s production code and operating method. Each shows a rule, the rule kept, a place the rule is leaking right now, and the gate that catches or fails to catch the leak — verifiable in the live system at the cited file, line, or admin surface.

Further reading.

Each of these articles unfolds one of the ideas above into the buyer’s decision it is supposed to inform. They are written for the founder, CTO, or technical owner who has to decide what to do with an inherited backend — not for the engineer who will eventually do the work.

Send your system context.

The report works inside whatever access boundary you set. The conversation can start with as much or as little context as you are comfortable sharing — the report becomes more precise as the boundary widens, and stays honest about what it cannot see.