Agents Should Do TDD: For the Same Reason You Should Have

TDD's value was never that a human ran the loop. The value was the loop itself: a constrained search procedure that converges on code that is correct, named for intent, and shaped to change. Strip the human out and the discipline still works for the same reasons.

By Travis Frisinger · April 26, 2026 · 15 min read
TDD · AI Agents · Software Craft · Test Design

TDD was never about you.

The book had your name on the cover. The conferences had your face on the program. The training market had your seat in the room. For twenty years the discipline was framed as a thing developers do, and the developer was assumed to be a human at a keyboard. So the question of whether agents should do TDD has, for most teams, never come up. Agents generate code. TDD is what humans do. Two different worlds.

That framing was incidental, not essential. The discipline-bearer was a placeholder. The mechanism (write a tiny test that names intent, write the smallest code that satisfies it, then reshape what you have for change) works for reasons that have nothing to do with who’s running it. The loop is convergent. It pulls a chaotic search through implementation space toward code that’s correct, named, and shaped to outlive the next requirement change. That’s true whether the search is being run by a senior with twenty years of habits or by an agent with none.

So agents should do TDD. For the same reason you should have. The reasons haven’t changed; only the question of who’s at the keyboard has.

The Old Story Was About You

Read the canon. Beck’s Test-Driven Development: By Example is a book of dialogue between a developer and his keyboard. Growing Object-Oriented Software is structured around what Steve and Nat did when they got stuck. The kata literature, which is the closest thing we have to a TDD curriculum, is built around the idea of a person at a workstation building muscle memory rep by rep. The pedagogy assumed a body in the chair.

That assumption was load-bearing for the era. It was also distorting. It made the discipline look like a practice in the craftsmanship sense: something a developer adopts as part of their identity, signals through their commits, gets tattooed on their conference badge. Once a discipline becomes identity, the conversation shifts from does it produce better outcomes to who counts as a real developer. We spent fifteen years arguing about the latter. The former was settled before the argument started.

The argument also produced its own backlash. Developers who didn’t TDD watched the identity politics, decided they didn’t want any part of it, and concluded the discipline itself was overrated. The framing made TDD a tribal marker, and the tribal marker made TDD easy to reject without engaging with what it actually did. By 2020 most teams “did TDD” the way most teams “do agile”. They used the words and skipped the loop.

Strip the identity off the discipline and the picture is different. TDD is a procedure for arriving at code that does what you intended, named in terms a future reader will recognize, shaped so the next change doesn’t require a rewrite. None of that is a developer-identity claim. None of it is a tribal marker. It’s a procedure. Procedures work for the same reasons regardless of who runs them.

The procedure is what we should have been talking about all along.

The Loop Was Always the Point

What does the loop actually do?

Step one, red, forces you to name what you’re trying to accomplish before you accomplish it. You have to write down, in the form of an executable expectation, what success looks like for this small slice of work. That act of naming is where intent gets pinned to a word the rest of the system can recognize. Without it, intent lives only in the implementer’s head, where it cannot be reviewed, refactored, or reused.

Step two, green, forces you to satisfy that named intent with the smallest implementation that does the job. No speculative generality, no scaffolding for a future requirement, no decorative complexity. The constraint is brutal and clarifying. It produces code whose only justification is the test that demanded it.

Step three, refactor, gives you an explicit window, with a green test as a safety net, to reshape what you just wrote so it fits with what’s already there. New code joins the system; the system absorbs the new code. The vocabulary harmonizes. Duplication that wasn’t visible until this commit becomes visible. You delete it. The codebase is left in a state where the next small slice will be cheaper to write.
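The three steps read as a single control loop, and it pays to see them that way. This is pseudocode in C# syntax, not runnable code: the backlog, the test run, and the step functions are stand-ins for whatever actually drives the work, human or agent.

```csharp
// Pseudocode: the loop as a procedure rather than a habit.
while (backlog.HasNextSlice())
{
    var slice = backlog.NextSlice();

    WriteFailingTest(slice);        // red: pin the intent to a name
    Assert(RunTests().HasFailure);  // the new expectation must fail first

    WriteMinimalCode(slice);        // green: smallest code that satisfies it
    Assert(RunTests().AllGreen);

    Refactor();                     // reshape what exists, under safety
    Assert(RunTests().AllGreen);    // structure changed; behavior didn't
}
```

Everything the rest of this piece argues follows from the invariants asserted here, not from who executes the body of the loop.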

Run that loop a thousand times and you don’t just have a passing test suite. You have a codebase whose vocabulary was shaped by a thousand small acts of naming. You have an implementation whose surface area was disciplined by a thousand small acts of restraint. You have a structure that was reshaped by a thousand small acts of refactoring under safety. The compounding is real and it’s the entire point.

None of that depends on who ran the loop. The naming forces clarity regardless of whose hands are on the keyboard. The restraint disciplines surface area regardless of whose intuition wanted to add a fifth abstraction. The refactor pulls structure toward coherence regardless of whose taste decided what coherence meant. The procedure does the work. The procedure-runner is incidental.

That’s why agents should do TDD. The thing the loop produces (well-named, minimal, structured code) is exactly the thing you wanted from your developers, and it’s the thing that doesn’t appear without the loop.

What Vibes-Driven Generation Actually Costs

Most code, including most code shipped by good teams, is vibes-driven. The developer (or the agent) reads the ticket, forms a sense of what’s needed, types until something compiles, runs it against an ad-hoc smoke test, and ships it. Coverage tools may report numbers later, but those numbers attach to whatever tests got written after the implementation was already in its final shape. The implementation was not pulled toward correctness, naming, or structure. It was just typed.

Vibes-driven code has a characteristic shape. It compiles. It usually works for the first case its author had in mind. The names are whatever came up in the moment: processOrder, handleResult, doTheThing. Domain concepts are encoded as primitives because primitives were the easiest thing to type. Edge cases are handled by whichever defensive pattern the author last reached for, regardless of whether the defense is appropriate to the current context. Each function is reasonable; the codebase as a whole is incoherent.

This is not a moral failing. It’s what happens when there’s no loop pulling the work toward coherence. Generation without convergence accumulates. The accumulation feels productive. There’s more code, more features ship, the velocity chart goes up. But the codebase is becoming harder to change, not easier, because each new slice is layered on top of the previous slice without ever being shaped by it.

The cost is invisible until it isn’t. It shows up as the third quarter where every “small” feature takes longer than the last. It shows up as the on-call who can’t find where a behavior is implemented because the behavior is implemented in three places under three different names. It shows up as the migration that should have taken two weeks and takes six months because the vocabulary in the database doesn’t match the vocabulary in the API doesn’t match the vocabulary in the tests.

Vibes-driven code was the dominant mode of human development. It is now also the dominant mode of agent development, for the same reason: nothing is pulling the work toward convergence. An agent given a task, no test, and no constraint to TDD will produce vibes-driven code. The vibes will be more grammatically correct than a tired junior’s at 4 p.m. on Friday. They will still be vibes.

Without the Loop, Agents Are Just Faster Vibes

Watch what an agent does with an unconstrained task. “Add a discount for loyalty members on orders over fifty dollars.”

The agent reads the surrounding code, infers a plausible shape, and produces sixty lines. There’s a calculateDiscount method on the order service. It takes the order, reads customer.type, branches on a string comparison, computes a percentage, returns a new total. Maybe the percentage is hardcoded. Maybe it’s pulled from a config. The implementation looks like the surrounding code because the agent sampled the surrounding code. It compiles. It runs. The smoke test passes. Ship it.

Tomorrow, the same agent gets “Add a Black Friday promotion that stacks with the loyalty discount, but only for first-year members.”

The agent reads the surrounding code again, finds yesterday’s calculateDiscount method, and faces a choice. It can extend yesterday’s method with another branch. It can add a new method, calculateBlackFridayDiscount, that wraps the first one. It can write a small dispatcher. There is no test specifying which choice is right, so the agent picks based on whatever pattern is most prevalent in the surrounding code. Often it picks the wrap-and-call approach because that’s the path of least resistance. Now there are two methods. They share state through a shared field on the order. The interaction between them is implied, not specified.

A month in, the codebase has a calculateDiscount, a calculateBlackFridayDiscount, a calculateNewMemberPromotion, and a calculateLoyaltyAdjustment. Three of them call each other. One of them is dead code from a feature that got rolled back. The vocabulary fragments (discount, promotion, adjustment) without anyone having decided what each word means in this domain. The model fragments along with the vocabulary. There is no single place that says here is what loyalty means in our system. The concept exists only as a consensus across four methods, and the consensus is shifting.

This is not an agent failure. A junior developer left to ship the same three tickets with the same constraints would produce the same outcome, only slower. The failure is the absence of the loop. Nothing in the workflow forced the question what is the actual concept here before code got written. Nothing forced the smallest implementation. Nothing forced a refactor pass that would have collapsed the four methods into one named-correctly representation.

The agent didn’t make the codebase worse than a human would have. It made the codebase worse faster.

With the Loop, Agents Produce Different Code

Run the same first task with the loop enforced. The prompt isn’t add a discount; it’s add a discount via TDD: start with a failing scenario test using the existing builders; write the minimum implementation that passes; then a refactor pass to align with the patterns in OrderService.

The agent’s first move is no longer to type sixty lines of implementation. Its first move is to write a test:

[Fact]
public void Loyalty_members_get_a_ten_percent_discount_on_orders_over_fifty_dollars()
{
    var order = anOrder()
        .forCustomer(aLoyaltyMember())
        .containing(aBookCosting(60.dollars()));

    var receipt = checkout.process(order);

    receipt.discount.Should().Be(6.dollars());
}

That single act has done a lot of work. The agent had to choose a name (Loyalty_members_get_a_ten_percent_discount_on_orders_over_fifty_dollars) that pins the intent in the team’s vocabulary. It had to use the existing builders, which pulled it into the existing nouns: aLoyaltyMember, aBookCosting, Money. It had to write down what success looks like as a typed assertion: receipt.discount is a Money, not a decimal. The slice of work is now anchored to a named, typed, composable specification.

The implementation that follows is constrained. The agent writes the smallest code that turns this test green. There’s no speculative method for promotions, no defensive scaffolding for membership tiers that haven’t been specified. Just enough logic to produce 6.dollars() from this scenario.

Then the refactor pass. The agent looks at the implementation it just wrote and at the surrounding code. It notices the discount calculation lives inline in OrderService.process, while related calculations (shipping, tax) live in their own collaborators. It extracts a DiscountCalculator, gives it a Calculate(Order order) : Money signature, and wires OrderService.process through it; the scenario test stays green the whole time, because the refactor reshapes structure without changing behavior. The structure of the new code now matches the structure of the old code. The vocabulary aligns: DiscountCalculator sits beside ShippingCalculator and TaxCalculator in the same namespace. A future reader sees the family.
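The extracted collaborator might look like the following. This is a minimal sketch under stated assumptions: the Order and Money shapes are invented stand-ins for the codebase's real domain types, and the ten-percent-over-fifty rule comes from the scenario test above.

```csharp
using System;

// Assumed stand-ins for the codebase's real domain types.
public record Money(decimal Amount)
{
    public static Money Dollars(decimal amount) => new(amount);
}

public record Order(bool CustomerIsLoyaltyMember, Money Total);

// The extracted collaborator: just enough logic to keep the scenario green.
// In the real codebase it sits beside ShippingCalculator and TaxCalculator.
public class DiscountCalculator
{
    public Money Calculate(Order order) =>
        order.CustomerIsLoyaltyMember && order.Total.Amount > 50m
            ? Money.Dollars(order.Total.Amount * 0.10m)
            : Money.Dollars(0m);
}

public static class Program
{
    public static void Main()
    {
        var calc = new DiscountCalculator();
        var order = new Order(CustomerIsLoyaltyMember: true, Money.Dollars(60m));
        Console.WriteLine(calc.Calculate(order).Amount); // ten percent of 60
    }
}
```

Note that the class contains nothing the test didn't demand: no promotion hooks, no tier scaffolding. That restraint is the green step's contribution, and it's what leaves tomorrow's refactor cheap.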

Tomorrow’s task, add a Black Friday promotion that stacks with the loyalty discount, but only for first-year members, starts with another test. The test forces the agent to ask, before writing implementation, what is the relationship between Black Friday promotions and loyalty discounts. The answer goes into the test name. The answer goes into the builder vocabulary: aLoyaltyMember().inTheirFirstYear(), duringBlackFriday(). The answer constrains the implementation. The refactor pass reshapes the DiscountCalculator to handle stacking explicitly, with the shape made visible by a series of tests that name each interaction.

A month in, the codebase has one DiscountCalculator with a clear contract, a vocabulary of well-named scenarios that document every interaction the team has ever specified, and builders that compose new scenarios in a single line. The same agent. The same tasks. A different codebase.

The difference wasn’t the model. It was the loop.

Agents Are Better at the Loop Than You Were

Here’s the part that’s uncomfortable for those of us who came up before agents existed.

Humans cheated on the loop constantly. We skipped the test “for now” because the deadline was Friday. We skipped the refactor because the green bar was right there and the next ticket was beckoning. We conflated green with done because the dopamine hit on green was sufficient and the discipline of looking at the code afterward was not. Most “TDD shops” I’ve worked in did red-green and called it red-green-refactor because the third step is the part that doesn’t feel productive in the moment.

The loop only works if all three steps run. The third step is where structure comes from. Skip it and you get a codebase with a passing test suite and a tangled implementation. Worse, in some ways, than a codebase without tests at all, because the tests now lock the tangle in place.

Agents do not have any of the human reasons to skip steps. They do not get bored. They do not feel that the refactor step is unsexy. They do not have a Friday deadline that makes them want to merge and go home. They do not need the dopamine hit of moving on to the next thing. If you tell an agent to run red-green-refactor, it runs red-green-refactor. It runs it the same at 9 a.m. on Monday and 4 p.m. on Friday. It runs it for the trivial change and the boring change and the change nobody wants to be touching.

The discipline that was always rare in humans (the willingness to do the un-fun part of the loop, every time, without negotiation) is default in agents. This is not a small thing. The reason most TDD efforts in most companies decay is not that the developers don’t know how. It’s that the third step gets cheated on, and the codebase that results doesn’t deliver the benefits, so the team gives up on the discipline. An agent that runs the third step every time, on every change, accumulates the benefits the discipline was always supposed to deliver and almost never did.

This is a feature of working with agents, not a bug. It’s also the inversion most teams haven’t internalized: the discipline you struggled to maintain in your team is cheaper to maintain in your agent. The constraint that humans found exhausting is, for an agent, just another instruction.

We all cheated on the loop. The agents won’t.

The Team’s Job Changed, Not Disappeared

If the agent runs the loop, you are not doing TDD anymore. That can sound like a loss. It isn’t. The discipline didn’t go away; the discipline-bearer changed, and your job moved up a level.

You’re now doing meta-TDD. You shape the constraints the agent’s tests must satisfy. Tests must use the existing builders. Tests must be named in domain language. New nouns require an explicit decision before they enter the vocabulary. Those rules used to live in the heads of the senior developers on the team, transmitted by example to the juniors over years. Now they live in the prompt, the lint rule, and the test-naming convention enforced in CI. They are explicit, version-controlled, and applied uniformly.
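As one concrete example of a rule moving out of heads and into CI, a naming convention can become an executable check. The regex below is an illustrative assumption, not a standard; a real gate would scan the compiled test assembly for [Fact] methods and fail the build on violations.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Program
{
    // Illustrative rule: a test name must read as an underscore-separated
    // domain sentence -- one capitalized word followed by at least three more.
    public static readonly Regex DomainSentence =
        new(@"^[A-Z][a-z]+(_[a-z0-9]+){3,}$");

    public static void Main()
    {
        // A scenario named for intent passes the check.
        Console.WriteLine(DomainSentence.IsMatch(
            "Loyalty_members_get_a_ten_percent_discount_on_orders_over_fifty_dollars")); // True

        // A name that says nothing about the domain fails it.
        Console.WriteLine(DomainSentence.IsMatch("TestDiscount1")); // False
    }
}
```

The point isn't this particular regex; it's that a standard which once lived in a senior's review comments is now a versioned artifact the agent can't negotiate with.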

You’re reviewing the test suite the agent produces, not for syntax but for vocabulary fitness. Did it introduce a near-synonym for an existing concept? Did it name the scenario in a way that will still make sense in a year? Did it use the builder API in a way that suggests the API itself needs a missing method? These are the design judgments senior developers were always supposed to be making during PR review and rarely actually made, because the diff was full of syntax to scrutinize and the design questions were buried under the line-by-line reading.

You’re refactoring the test API itself when it drifts. The builders, factories, and domain types are now the most-used API in your codebase, because every test the agent writes consumes them. Investments in that API compound across every future agent-generated test. A small improvement in aLoyaltyMember() makes every loyalty-related test the agent will ever write slightly better. The leverage on test-API craft just multiplied.
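That test API is the classic test-data-builder pattern. A minimal sketch, assuming shapes like the article's aLoyaltyMember(); the names, defaults, and Customer fields here are hypothetical, and the lowercase factory names mirror the essay's DSL rather than idiomatic C#.

```csharp
using System;

public record Customer(bool IsLoyaltyMember, int MembershipYears);

// Test-data builder in the style the scenarios above consume.
public class CustomerBuilder
{
    private bool _loyalty = false;
    private int _years = 5; // a sensible default, so most tests don't have to care

    public static CustomerBuilder aLoyaltyMember()
    {
        var builder = new CustomerBuilder();
        builder._loyalty = true;
        return builder;
    }

    public CustomerBuilder inTheirFirstYear()
    {
        _years = 1;
        return this; // fluent: scenarios compose in a single line
    }

    public Customer Build() => new(_loyalty, _years);
}

public static class Program
{
    public static void Main()
    {
        var customer = CustomerBuilder.aLoyaltyMember().inTheirFirstYear().Build();
        Console.WriteLine($"{customer.IsLoyaltyMember} {customer.MembershipYears}"); // True 1
    }
}
```

Every default you choose here is a decision every future test inherits silently, which is exactly why this API deserves more craft attention than almost any production module.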

You’re writing the prompt that says do TDD with these rules in this codebase. That prompt is a piece of code in its own right. It encodes the team’s standards. It evolves. It deserves the same care as a CI configuration or a linter setup. Most teams haven’t started thinking about it that way yet. They will.

None of this is less engineering than what you were doing before. It’s actually closer to the engineering work that mattered all along (design judgment, vocabulary curation, structural integrity) and further from the typing that was always a means to an end. The loop runs faster. The wins compound across more changes. The discipline gets applied more uniformly than any team of humans ever managed.

You didn’t lose TDD. You promoted it.

The Reframe

TDD was never about you.

It was about a procedure that pulls code toward correctness, naming, and structure through a tight loop of constrained generation and reflective revision. The procedure works because of what it does to the code, not because of what it does to the developer. The developer was a vehicle for the procedure, and any vehicle could run it. Only the ones that ran it faithfully got the results.

For twenty years, humans were the only vehicles available. Most of them ran the procedure unfaithfully. They skipped the third step. They wrote the tests after the implementation. They confused the practice with the identity and the identity with the tribe. The discipline got a reputation for being slow, for being dogma, for being something only certain kinds of developers wanted any part of. None of that was about the procedure. All of it was about the vehicle.

Now there’s a second vehicle. It does not get bored. It does not skip the un-fun part. It does not have a Friday deadline. It runs the procedure exactly the same way every time, on every change, with no negotiation. Give it the loop and it produces the kind of codebase the discipline was always supposed to produce. Withhold the loop and it produces faster vibes.

So agents should do TDD. For the same reason you should have. The procedure is what works. The vehicle was always incidental. The reasons the loop produces better code haven’t changed in twenty years. Only the question of whether anyone will reliably run it has changed, and that question now has a different answer than it ever did before.

The procedure doesn’t care who’s at the keyboard. Now, finally, neither do we.