TDD Buddy - Characterization Tests Are How Agents Enter Legacy Code

Related reading: Your Test Suite Is Your API for Agents names the interface; The Bar for TDD Just Moved names the floor. This post is about codebases that have neither yet, and how an agent gets a foothold anyway.

Legacy code is code without a specification. An agent changing legacy code is changing code with no specification.

Those two sentences are the entire problem. The first is a fact about the codebase. The second is what happens when the first is true and the team hands the code to an agent anyway. The agent reads what is there, infers what the code “probably” does, makes the requested change, runs whatever tests exist, and ships. The change is correct against the agent’s inference. The inference was incomplete. The behavior the team did not know was load-bearing has now silently shifted, and the regression will surface in production, three releases later, when a customer notices that an invoice rounds to the wrong cent.

Characterization tests are the move that closes this gap. They are old technique, named decades ago in Working Effectively with Legacy Code, and the agent era did not invent them. The agent era made them mandatory. The discipline that used to be optional, the thing experienced developers did when they had time, has become the precondition for letting an agent operate in any part of a codebase that was never properly tested. Which is most of every codebase.

The Agent’s Hardest Problem Is Code With No Spec

Sort the work an agent might be asked to do into three buckets.

Green-field code is easy. The agent and the team start from nothing, write tests first, build the implementation against the tests. The discipline is well-understood. The vocabulary is whatever the team chooses to establish. The agent has rails the entire time.

Well-tested legacy code is also tractable. There is a spec, even if the spec is implicit in the existing tests. The agent reads the suite, learns what the team has decided matters, and produces changes that respect the existing contracts. The tests catch regressions. The vocabulary already exists.

Legacy code with no tests is the hard case. It is also the dominant case. Most teams asking “can an agent help us ship faster” are asking that question about systems written by people who left the company three years ago, in frameworks that are two major versions out of date, with behavior nobody fully understands and a deployment cadence that depends on hoping the smoke tests cover the important paths.

The agent reading this kind of code has no way to know what is contract and what is accident. The function that returns null for the empty case might be returning null because that is the contract callers depend on, or it might be returning null because someone forgot to handle that branch and no caller has hit it yet in production. The agent cannot tell. It will pick whichever interpretation makes the prompted change easier and move on. Half the time it picks correctly. Half the time it ships a regression.

The damage is not loud. It is quiet. A small behavioral shift on an edge case, a rounding direction that flipped, an error message that no longer includes the field name, an off-by-one in pagination at the very last page. Each one survives review because the diff looks reasonable and the tests pass. Each one is found weeks later, in production, by a user, and traced back through three deployments to the change nobody questioned.

The fix is not “be more careful.” The fix is to give the agent a specification before letting it touch the code.

Characterization Tests Pin “Is,” Not “Ought”

Characterization is the act of capturing what the code currently does, exactly, before deciding what it should do. The captured behavior includes the bugs. That is the point.

A characterization test is not a correctness test. It does not claim that the system behaves the way the team wants. It claims that the system behaves the way it currently behaves, and any change to that behavior will register as a test failure. The failure is the prompt for a human to ask: did we want this to change. If yes, update the test. If no, revert the change. Either way, the change is no longer silent.

The discipline matters because the default for untested legacy code is “any change is silent.” A characterized legacy module shifts that default. Every change becomes either an intended change (which the human accepts and the test is updated) or an unintended change (which the test fails and the change gets reverted). The third category, the silent regression, stops being possible.

[Fact]
public void Pricing_engine_currently_rounds_half_to_even_for_wholesale_orders()
{
    var input = new PricingInput(
        unitPrice: 0.125m,
        quantity: 100,
        customerType: CustomerType.Wholesale
    );

    var result = pricing.Calculate(input);

    result.Total.Should().Be(12.50m);
}

That test does not claim wholesale rounding should behave this way. It claims wholesale rounding currently behaves this way. The test name reflects the claim: currently_rounds_half_to_even. The “currently” is load-bearing. A future reader, human or agent, sees that the behavior is pinned but the pin is descriptive, not prescriptive. The next decision (do we want half-to-even or half-up) is a separate decision the team can make explicitly, with a failing test as the conversation prompt, instead of discovering it three quarters later when finance asks why the invoices are short.

The naming convention separates characterization from specification. Tests named currently_does_X are descriptive pins. Tests named should_do_X or named in domain language (Wholesale_orders_round_half_up_to_the_nearest_cent) are specifications. The same suite can hold both, but the distinction has to be explicit, because the two kinds carry different obligations.

Agents Are Good at Capturing, Bad at Judging

Characterization is the rare task where the agent does the bulk of the work and the human does the judgment.

The mechanical part of characterization is exhaustive: enumerate inputs, run the function, record the output, write the assertion. An agent can do this at a scale a human would not finish in a week. Pick a hundred input combinations, exercise the function, capture the outputs, generate a test for each. The agent does not get tired. The agent does not miss the obvious-but-tedious cases (empty string, null collection, max int, negative quantity, leap year date). For mechanical coverage of behavior, the agent is the right tool.

The judgment part is everything else. Reading a hundred captured outputs and asking, for each one: is this the contract, or is this an accident. Is the function returning empty list because that is what callers depend on, or because the implementation happened to take that branch when the input was malformed. Is the rounding behavior intended, or is it a leftover from an old framework’s default that nobody ever questioned. Is the timestamp in UTC because that is the team’s contract, or because the server’s locale was UTC in 2014 and the code never specified a timezone.

The agent cannot answer those questions by reading the code. The answers live outside the code, in the team’s history, in conversations with customers, in the original ticket that justified the function in the first place. The agent has no access to those. It has only what is in front of it, which is the captured behavior. The captured behavior cannot distinguish contract from accident on its own.

If the agent is asked to generate the captures and then “approve” them, it will approve all of them. From the agent’s perspective they are all equally valid pins on current behavior. The agent’s approval ratifies the accidents alongside the contracts, and the resulting test suite is a frozen monument to whatever the code happens to do, including the parts that are wrong.

The human’s job is promotion, not approval.

The Human Move Is Promotion, Not Approval

Promotion is the act of reading captured behavior and deciding, per row, what to do with it.

Some captures get promoted into domain vocabulary. The test Pricing_engine_currently_does_X_for_input_Y becomes the test Wholesale_orders_round_half_up_to_the_nearest_cent, with a builder-based setup, a domain-typed assertion, and a name that reflects what the team has decided the system does. The captured row was a description of behavior. The promoted row is a specification of behavior. The vocabulary of the codebase is one entry richer.

Some captures get flagged. The test Pricing_engine_currently_returns_null_for_negative_quantity stays as-is, with a comment: // Captured behavior. Suspected-incidental. Decide before next pricing change. The team has not yet decided what the right answer is, but the wrong answer (silently shifting it) is no longer possible. The capture pins the current behavior until someone decides to move it deliberately.

Some captures get deleted. A row that captures behavior on an input no caller could ever produce is documenting a code path that exists only because the agent’s enumeration found it. Pinning it as a test commits the team to maintaining a behavior nobody wants. Better to delete the test and let that branch evolve naturally with the rest of the code.

The three moves (promote, flag, delete) are the actual work of characterization. The agent generated the raw material. The human turned the raw material into a specification. A team that skips this step has a test suite that contains accidents and contracts in equal measure, with no way to tell them apart. A team that does this step has a specification that grows organically out of what the code already does, anchored in the team’s vocabulary, defensible under future change.

Rubber-stamping the agent’s captures is the legacy-code version of an agent grading its own homework.

Characterization Is Where the Vocabulary Starts in a Legacy Codebase

The other thing that happens during promotion is more important than the tests themselves.

The first promoted test in a legacy codebase is also the first named scenario, the first builder, the first domain type. Before promotion, the codebase has primitives and incidents. After promotion, it has the beginning of a controlled vocabulary. The act of turning Pricing.Calculate(0.125m, 100, "Wholesale") into Wholesale_orders_round_half_up_to_the_nearest_cent() is the act of naming the concepts the system has been operating on all along without naming them.

CustomerType.Wholesale is a new type that did not exist before. aWholesaleOrder() is a new builder that did not exist before. 0.125m * 100 becoming anItemPriced(0.125m).orderedAt(100.units()) is the beginning of a fluent grammar that did not exist before. The legacy codebase had none of this. The characterized codebase has the seeds of it. Each promotion is a seed. Each seed is a vocabulary entry.

This is where the legacy on-ramp connects to the larger arc about test suites as the team’s controlled vocabulary. Characterization is the first stretch of road. It is also the road that everything else gets built on. A team that promotes characterizations into named scenarios with builder-based setup is bootstrapping a vocabulary out of an untested system. A team that captures and rubber-stamps is creating a frozen snapshot with no vocabulary at all.

The on-ramp is not just about getting agents safely into legacy code. It is about building the vocabulary the agent will need to do anything useful in that code beyond the first task.

Golden Masters Are Characterization’s Blunt Instrument

A golden master is the simplest characterization technique. Capture the entire output of a function or system for a representative set of inputs. Compare against the captured baseline on every test run. Any change to any output fails the test.

Golden masters are cheap to write, exhaustive in coverage, and brutal under refactoring. They catch every change, intended or not. They speak implementation, not domain. The test Module_X_output_matches_baseline tells the next reader nothing about what Module_X does, only that it does the same thing it used to. A failing golden master tells the developer something changed without telling them what was supposed to change.

For the very first stretch of road, that is fine. Golden masters are scaffolding. They keep the agent honest while the team builds toward something better. The trade-off is explicit: the team gets coverage today, without spending the time to write behavioral assertions, and they accept that the tests will be noisy under refactoring and silent on intent.

The discipline is to refine. As the team promotes captures into named scenarios with domain-typed assertions, the golden master shrinks. The behaviors that matter get pulled out into specific tests with specific names. The golden master eventually covers only the long tail of weird inputs nobody has gotten around to naming, and at that point it is fine for the long tail to live in a snapshot file. The named scenarios carry the contract. The snapshot carries the residue.

Treating golden masters as the final answer is the same mistake as treating captures as approvals. Both leave the team with a frozen system that resists change. The point of characterization is not to freeze the system. The point is to get the system to the place where deliberate change is possible. Golden masters are scaffolding. Scaffolding is meant to come down.

No Characterization, No Autonomy on Legacy

Pull the argument back to the original question.

A team that wants agents to do real work on a legacy codebase is implicitly asking: can an agent operate against code with no specification. The answer is no. Not “no with conditions” or “no until the model gets better.” Structurally no. An agent without a specification produces changes that are correct against its inference and incorrect against the team’s contract, and the gap between those two is invisible until production catches it.

The fix is to produce the specification before the work starts. Characterization is how that specification gets produced for code that has none. The agent generates the captures. The human does the promotion. The test suite that results is not a comprehensive specification of the legacy system, but it is a specification of the parts of the legacy system the agent is about to touch, which is the part that matters for the next task. The next task adds more characterized regions. The road extends.

A year of this discipline, applied to the parts of the codebase that get touched, produces a codebase where every recently-touched region has a specification, every promoted scenario carries domain vocabulary, and every agent contribution lands against a pinned reference that fails loudly on unintended change. The codebase that was legacy at the start of the year is, by the end, a codebase with an interface for agents. The interface was not written all at once. It was promoted, scenario by scenario, out of what the code already did.

The question “can an agent work in our codebase” is mostly “is our codebase characterized.” For most legacy systems, the honest answer is no. That is the work. Characterization is the on-ramp, and the on-ramp is the first stretch of every road agents are going to drive on in your code.