This is the third and final post in a short arc on TDD in the agentic era. The first, “TDD Already Does BDD, Without the Gherkin”, made the craft case. The second, “BDD Was a Coordination Tax. AI Just Repriced It”, made the org case. This post names the new expected level and explains why agentic coding makes it non-negotiable.
For twenty years the bar was: do you have tests, and do they pass?
That bar is now a floor so low it’s almost invisible. If you want AI agents to do real work in your codebase, with autonomy, the bar has moved. And most teams haven’t noticed.
The earlier two posts set up the pieces. “TDD Already Does BDD, Without the Gherkin” argued the craft case: disciplined TDD with test data builders, factory hierarchies, mock-driven collaboration tests, and domain types already delivers what BDD frameworks promised (scenarios, ubiquitous language, behavior framing) with none of the translation tax. “BDD Was a Coordination Tax. AI Just Repriced It” argued the org case: BDD tooling was coordinating specialized roles in slow organizations, and role compression plus AI is collapsing that coordination cost, which means the tooling goes away though the practice survives.
This post is what both arguments were pointing at: the new expected level of TDD for teams that want to work with agents.
The Old Bar Was “Tests Exist”
The old test-maturity ladder looked like this:
- No tests
- A few tests, run sometimes
- Tests run in CI
- High coverage
- TDD practiced on new code
Most teams aimed for rung 3, claimed rung 4, and pretended rung 5 was happening when it wasn’t. The bar was binary (“tests pass”), and the ceiling was “enough tests that we catch the worst regressions.”
That bar was calibrated for a world where tests served one primary reader: humans. Humans who already knew the codebase, had context from standup, remembered the last incident, and could fill in the gaps a thin test left behind.
Agents don’t do that.
An agent reads what’s in front of it. It has no standup, no tribal knowledge, no instinct to distrust the wiki. The test suite isn’t supplementary context for it. The test suite is the context.
The bar isn’t “do tests exist and pass” anymore. The bar is “do tests communicate the system well enough for an agent to operate autonomously against it.” Those are radically different bars.
The New Floor
Here’s what the floor looks like now. Call it the minimum for teams that want real leverage from agents, not just faster typing.
Tests should be named as scenarios in domain language. Not test_1, not should_work, not handles_edge_case. Something like Loyalty_members_get_free_shipping_over_fifty_dollars. Agents read test names as a specification index, and bad names give them nothing.
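A minimal sketch of what that looks like in practice. The free-shipping rule, the `Order` and `Customer` types, and the fifty-dollar threshold are all hypothetical, invented purely to show the naming discipline:

```python
from dataclasses import dataclass

FREE_SHIPPING_THRESHOLD_DOLLARS = 50  # illustrative business rule, not a real spec

@dataclass
class Customer:
    loyalty_member: bool

@dataclass
class Order:
    customer: Customer
    subtotal_dollars: int

    def shipping_is_free(self) -> bool:
        return (self.customer.loyalty_member
                and self.subtotal_dollars > FREE_SHIPPING_THRESHOLD_DOLLARS)

# The test names ARE the specification index an agent browses.
def test_loyalty_members_get_free_shipping_over_fifty_dollars():
    order = Order(customer=Customer(loyalty_member=True), subtotal_dollars=60)
    assert order.shipping_is_free()

def test_guests_pay_shipping_regardless_of_order_size():
    order = Order(customer=Customer(loyalty_member=False), subtotal_dollars=200)
    assert not order.shipping_is_free()
```

An agent asked “when is shipping free?” can answer from the names alone, before reading a single assertion.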
Test setup should use composable builders and factories, not primitive bags. anOrder().forCustomer(aLoyaltyMember()).containing(twoBooks()).shippedTo(california()) instead of ten lines of hand-assembled dictionaries and magic strings. Builders give agents nouns; factories give them worlds. Agents compose new tests by composing those, not by inventing primitive-heavy setup from scratch each time.
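Here is one way that builder vocabulary might be implemented. Every name below (`an_order`, `a_loyalty_member`, `california`, `two_books`) is illustrative, echoing the example in the text rather than any real codebase:

```python
from dataclasses import dataclass, field

@dataclass
class Customer:
    name: str = "any customer"
    loyalty_member: bool = False

@dataclass
class Address:
    state: str = "NY"  # a safe default; tests override only what matters

@dataclass
class Order:
    customer: Customer = field(default_factory=Customer)
    items: list = field(default_factory=list)
    ship_to: Address = field(default_factory=Address)

# Factories: the nouns of the domain.
def a_loyalty_member() -> Customer:
    return Customer(loyalty_member=True)

def california() -> Address:
    return Address(state="CA")

def two_books() -> list:
    return ["book", "book"]

# Builder: the grammar that composes them.
class OrderBuilder:
    def __init__(self):
        self._order = Order()

    def for_customer(self, customer: Customer) -> "OrderBuilder":
        self._order.customer = customer
        return self

    def containing(self, items: list) -> "OrderBuilder":
        self._order.items = items
        return self

    def shipped_to(self, address: Address) -> "OrderBuilder":
        self._order.ship_to = address
        return self

    def build(self) -> Order:
        return self._order

def an_order() -> OrderBuilder:
    return OrderBuilder()

# A test now reads as a sentence, not a bag of primitives:
order = (an_order()
         .for_customer(a_loyalty_member())
         .containing(two_books())
         .shipped_to(california())
         .build())
assert order.ship_to.state == "CA"
```

Each builder method sets one fact and returns the builder, so new scenarios are composed by recombining existing vocabulary instead of hand-assembling fields.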
Collaboration should be expressed, not hidden. When an object’s job is to coordinate with another, a test should say so explicitly, through a mock expectation, a contract test, or an interaction assertion. Agents need to see the collaboration graph. If it’s buried inside integration tests that only verify end-state, the agent has to guess at the shape.
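A sketch of an interaction test using the standard library’s `unittest.mock`. `Checkout` and the payment gateway are hypothetical collaborators, chosen only to make the collaboration visible:

```python
from unittest.mock import Mock

class Checkout:
    """Its job is coordination: it exists to tell the gateway to charge."""
    def __init__(self, gateway):
        self._gateway = gateway

    def complete(self, order_total: int) -> None:
        self._gateway.charge(order_total)

def test_completing_checkout_charges_the_payment_gateway_once():
    gateway = Mock()
    Checkout(gateway).complete(42)
    # The assertion names the collaboration explicitly. An agent reading this
    # learns Checkout's contract without reverse-engineering any end-state.
    gateway.charge.assert_called_once_with(42)
```

Contrast this with an integration test that only checks a database row afterward: the end-state is verified, but the collaboration graph stays invisible.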
Domain types should live under the tests. Money, Address, SKU, Reservation, Confirmation, rather than decimal, string, string, Dictionary<string,object>. Primitive obsession in tests gives the agent no rails. Domain types turn the test suite into a typed API the agent can navigate.
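A sketch of what those rails look like. In a compiled language the compiler holds the line; in Python, a type checker such as mypy plays the same role. All type names here are illustrative:

```python
from dataclasses import dataclass
from typing import NewType

# A distinct type, so a random string can't silently stand in for a SKU.
Sku = NewType("Sku", str)

@dataclass(frozen=True)
class Money:
    amount_cents: int
    currency: str = "USD"

    def plus(self, other: "Money") -> "Money":
        # The domain rule lives in the type, not in every caller.
        assert self.currency == other.currency, "no cross-currency arithmetic"
        return Money(self.amount_cents + other.amount_cents, self.currency)

def reserve(sku: Sku, price: Money) -> str:
    # A checker rejects reserve("ABC-123", 19.99): a bare str and a float
    # don't satisfy Sku and Money, so the boundary mistake never compiles/passes.
    return f"reserved {sku} at {price.amount_cents} {price.currency}"

total = Money(1999).plus(Money(500))
assert total == Money(2499)
```

The point isn’t the specific types; it’s that every primitive carrying semantic weight gets a name the agent can’t misuse.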
The test API should be treated as a product surface. Someone owns it. Someone notices when it gets ugly. Someone refactors the builders when they drift from how the domain is actually spoken. Someone writes the README for the test helpers. Agents working against a product-quality test API generate product-quality tests; agents working against a neglected test API generate the same mess the humans produce.
Ubiquitous language should flow end to end. The words product uses, the words the code uses, the words the tests use, and the words appearing in agent prompts should all be the same words. When they drift, the agent’s output drifts. When they align, the agent’s output aligns.
That’s the floor. It’s the minimum bar for a codebase where agents can meaningfully participate, not a stretch goal or a maturity model.
Why Agents Specifically Need This
This isn’t aesthetics. There’s a mechanism, and it has three pieces.
First, agents compose by recombining the vocabulary they find in the codebase. Give them anOrder() and aLoyaltyMember() and shippedTo(california()) and they will compose new scenarios by combining those primitives in ways that stay inside the domain. The output reads like the rest of the test suite because it uses the rest of the test suite. Give them a codebase of var x = new Order(); followed by twenty lines of field-setting and they will generate more of the same. The output won’t be wrong, exactly. It just won’t compose. Every test stands alone, nothing is reusable, and the vocabulary doesn’t grow. It just accumulates.
Second, agents reason about behavior by reading test names and assertions. A suite of well-named scenarios is a specification they can browse. They can answer “what does this system do when X?” by finding the test whose name says X and reading the assertion. A suite of test_1, test_2, test_3 is opaque: they have to read every body to learn anything, and even then they can’t tell which cases are intentional specifications versus incidental regression tests.
Third, agents respect domain boundaries when domain types enforce them. A method that takes Money can’t be accidentally called with a raw decimal for SKU count. A method that takes Address can’t be given a city string. The compiler holds the line. Agents operating in typed codebases make fewer boundary-violation mistakes because the types make the mistakes uncompilable. Agents operating in primitive-heavy codebases make constant boundary mistakes because there’s nothing stopping them.
Put together: rich TDD is pre-built ontology for the agent. It’s the vocabulary, the grammar, and the rails. Without it, the agent has to invent those from scratch on every task, and the invention drifts from what the team actually uses. Every agent-generated PR arrives with slightly wrong vocabulary that humans then have to translate back.
The translation cost you thought you were saving by skipping “test quality” work reappears on every PR review.
This Is Why the Coordination Stack Is Fading
The previous post in this arc argued that BDD tooling (feature files, step definitions, Cucumber runners) is being repriced by role compression and AI. That’s half the story. The other half is what absorbs the load those artifacts used to carry.
Feature files were carrying the load of being a shared specification readable by non-developers. Step definitions were carrying the load of mapping business language to code. Living-documentation generators were carrying the load of producing human-readable reports.
All three of those loads collapse into one place when the test suite is rich enough: the test suite itself.
- Scenario-named tests are the readable specification. Agents and humans both read them.
- Builders and factories in domain language are the mapping from business vocabulary to code. No separate glue needed.
- The test suite’s own output, rendered clearly, is the living documentation. It’s the same artifact, viewed differently.
This is why rich TDD and the coordination-tax repricing are two sides of the same shift. The artifacts around the test suite are fading because the test suite got strong enough to absorb their function. You can’t get rid of feature files if your tests are test_1, test_2, test_3; you need the feature files to specify behavior. You can get rid of feature files if your tests are scenario-named, builder-driven, and domain-typed, because they’re already doing the job the feature files were doing, more accurately and without the drift.
The consolidation only works if the test suite is good enough. That’s the floor this post is naming.
The Audience Split
To be clear: this bar applies to teams that want agents to operate with real autonomy.
If you use AI as autocomplete (inline suggestions, whole-function generation, boilerplate), the old bar still works. Mediocre tests plus Copilot is fine. You’re using AI to type faster, and the tests are still for humans. Every non-trivial task will need human steering, but that’s acceptable when steering is the workflow.
If you want agents to take a task, plan it, write the tests, implement the code, run the suite, iterate on failures, and open the PR, you need the new floor. Without it, the agent can’t close the loop without constant human steering, and “agentic” becomes a word you say while still doing most of the work yourself.
Teams at the old bar will not feel this yet. They’ll look at their current workflow, their current AI tooling, their current test suite, and conclude that things are fine. And for what they’re doing, they are.
They just won’t notice when the teams at the new bar start shipping three to five times faster without losing quality. The gap won’t announce itself. It’ll show up as “how are they doing that?” and the answer will be invisible in the diff, because the answer is the test suite and the discipline around it, built over years, finally paying off in a way it never could before.
What to Do About It
If you want to get to the new floor, there’s no shortcut.
Start with one area of your codebase. Introduce builders. Name tests as scenarios. Add domain types where primitives are carrying semantic weight. Use mocks to express collaboration. Refactor the test helpers when they drift. Treat the test API as a product. Do this for a quarter and the difference in what agents can do against that area versus the rest of the codebase will be obvious.
Then widen.
This isn’t a tooling change or a framework adoption or something you can bolt on. It’s the same discipline good TDDers have been practicing for twenty years, applied with the knowledge that the audience is no longer just humans who will fill in the gaps. The audience includes agents who won’t.
The practices haven’t changed. The stakes have.
The Steel Thread
Pull the three posts together and the argument is one sentence:
Agents made rich TDD non-optional, and rich TDD is what lets the old coordination stack fall away, so teams that invest in the craft get both the safety and the speed.
- The craft case (post 1): builders, factories, mocks, domain types, product-thinking devs. This is what tests should look like.
- The org case (post 2): the coordination artifacts around tests are being repriced because roles are compressing and AI is translating.
- The capstone (this post): the reason both are happening at once is that agents shifted the audience for tests from humans-with-context to agents-without, which raised the bar on test quality, which then absorbed the load the coordination artifacts were carrying.
Rich TDD isn’t a style preference anymore. It’s infrastructure for how compressed-role, AI-augmented teams work. The bar moved. The teams that notice first will own the speed advantage. The ones that don’t will keep wondering why their agent workflows feel shallow.
The tests were always the specification. Now they’re the specification, the interface, the coordination surface, and the onboarding document, for humans, for agents, and for the next version of your team.
That’s a lot of weight for a test suite to carry. Only a good one can.