World-Building Is the Test Discipline Agents Need

A scenario factory is not setup convenience. It is a world-modeling language, and the layered grammar of object mothers, builders, and factories is the composition primitive agents recombine. Teams that hand-assemble setup hand agents nothing to build worlds from.

By Travis Frisinger · June 8, 2026 · 11 min read
TDD · Test Design · AI Agents · Software Craft

Related reading: “The Bar for TDD Just Moved” names the floor; “TDD Already Does BDD, Without the Gherkin” names the craft underneath; “Your Test Suite Is Your API for Agents” names the audience. This post is the construction discipline those three assume.

Setup is world-building. Most teams treat it as plumbing.

That confusion is upstream of a dozen problems that get other names. Flaky tests. Brittle fixtures. Suites that resist refactoring. Setup blocks that have to be read line by line to figure out what scenario is even running. None of those are root causes. They are downstream of a single design failure: treating the arrange step as a matter of getting data into shape. It was never that. It was always a claim about which world the behavior under test lives in. The team that hides that claim inside hand-assembled object construction has not removed the claim from the test. The team has only made the claim illegible.

The reason this matters now more than it ever did is the same reason every test-quality argument is louder now than it was five years ago. Agents recombine what they find. They do not bring a world model to the codebase. They sample one out of what is already there. A codebase with a coherent layered grammar for worlds gives an agent worlds to compose against. A codebase with hand-assembled setup gives an agent a pile of unrelated nouns and no way to form a sentence. The expressiveness of the world layer is the ceiling on what the agent can write.

Setup Was Always World-Building in Disguise

Look at any arrange step in any test and ask the question most teams skip. Not “what does this setup do,” but “what world does this setup claim is true at the start of the scenario.” Every test answers that question whether the author intended to or not.

A test that begins with three lines of customer construction, four lines of order assembly, and two lines of shipping fields is making a claim. The claim is: this customer exists, this order is in this state, this address is associated with this order, and the rest of the system can be treated as if those facts are stable for the duration of the assertion. The test passes if the assertion holds within that claimed world. The test is meaningful only if the world claimed is coherent.

The data and the claim are not the same artifact. Data can be assembled in ways that produce a contradiction the test never notices. A customer marked as suspended attached to an order in checkout is a contradiction in most real domains. The test that assembles those two facts side by side and asks a question about discount calculation gets an answer. The answer is meaningless. The world claimed by the setup did not exist, so the behavior measured under that world is incoherent. The bug ships because the test passed.

Naming the world fixes the problem in one move. aCartReadyForCheckout() is a name. The name has obligations. The factory must return a customer who can check out, a cart that can be paid for, an address that can ship. The claim is now explicit and the factory is responsible for honoring it. A test that begins aCartReadyForCheckout() is a test of behavior under a coherent claimed world.

The world layer carries the contract. The test carries the question.

Object Mothers, Builders, and Factories Do Three Different Jobs

The vocabulary collapses these three into a single concept in most codebases. That collapse is the design loss.

An object mother produces a canonical instance. Customers.loyaltyMember() returns the loyalty member everyone on the team agrees is the default loyalty member. No variation. No configuration. One instance, one name, one canonical answer to “what does a loyalty member look like in this system.” Mothers are the team’s reference set of nouns.

A builder varies one axis. aLoyaltyMember().withExpiredCard() starts from the canonical loyalty member and changes the card. The builder is a way to ask: what happens under this scenario, all else being equal. The “all else being equal” part is where most hand-assembled setup fails. Twenty lines of construction does not start from a canonical baseline and vary one thing. It starts from nothing and assembles a guess. The guess is rarely coherent.

A scenario factory composes a whole world. aCartReadyForCheckout() is not a customer and not an order. It is a world in which checkout can proceed. The factory’s job is to assemble the customer, the order, the address, the payment method, the inventory state, and any other facts the scenario requires, into a consistent set. The world is named. The world is the unit of reuse.

The three jobs do not substitute for each other. A team with only mothers gets canonical instances and nothing to vary them with. A team with only builders gets variation with no canonical baseline. A team with only factories cannot vary the axis of interest because the factory is opaque. The three jobs compose. None of them is the same job as the others.
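A minimal sketch of the three layers side by side may make the division of labor concrete. Everything here is an assumption for illustration: the Card, Customer, and Cart records, their fields, the Worlds class, and the build() method are hypothetical, and C# 9 records are assumed. Only the names Customers.loyaltyMember(), aLoyaltyMember().withExpiredCard(), and aCartReadyForCheckout() come from the text above.

```csharp
using System.Collections.Generic;

// Hypothetical domain types, kept tiny for illustration (C# 9 records assumed).
public record Card(bool Expired);
public record Customer(string Name, bool IsLoyaltyMember, Card Card);
public record Cart(Customer Customer, IReadOnlyList<string> Items);

// Object mother: one canonical instance per name, no configuration.
public static class Customers
{
    public static Customer loyaltyMember() =>
        new Customer("Canonical Loyalty Member", IsLoyaltyMember: true, Card: new Card(Expired: false));
}

// Builder: starts from the canonical instance and varies one axis.
public sealed class LoyaltyMemberBuilder
{
    private Customer _customer = Customers.loyaltyMember();

    public LoyaltyMemberBuilder withExpiredCard()
    {
        // Only the card changes; every other fact stays canonical.
        _customer = _customer with { Card = new Card(Expired: true) };
        return this;
    }

    public Customer build() => _customer;
}

// Scenario factory: composes a whole named world out of the layers below it.
public static class Worlds
{
    public static LoyaltyMemberBuilder aLoyaltyMember() => new LoyaltyMemberBuilder();

    public static Cart aCartReadyForCheckout() =>
        new Cart(Customers.loyaltyMember(), new[] { "book" });
}
```

With a using static import of Worlds, a test can call aLoyaltyMember().withExpiredCard().build() and get the canonical member with exactly one fact changed. The mother supplies the baseline, the builder supplies the variation, and the factory supplies the world.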

When the codebase collapses all three into one helper, what gets lost is the grammar. The team has tools, not a language. Tests get longer. Setup gets duplicated. The relationship between a canonical instance, a one-axis variation, and a whole-world composition disappears, and with it the discipline of stating what each test is actually about.

Three jobs. Three names. Three roles. Lose the distinction and the suite loses its grammar.

A Coherent World Is a Design Constraint

The factory’s most important job is the one it does silently: refusing to construct contradictions.

A world is coherent when the facts it claims do not contradict each other. A loyalty member with an expired card is coherent. A suspended customer with an in-progress checkout is not. A cart of zero items ready for checkout is not. The factory that returns either of the contradictory worlds without complaint is encoding the contradiction into the system’s vocabulary. Tests written against that factory inherit the incoherence. Agents reading those tests learn to generate more of the same.

Coherence is enforced by what the factory refuses. A factory that exposes a withSuspendedMember() method and a withCheckoutInProgress() method, and then allows them to be combined, has not enforced coherence. The factory has parameterized the contradiction.

The discipline is to make coherence a precondition of construction. The factory aCartReadyForCheckout() either returns a world in which checkout can succeed or it does not return at all. If a test needs a suspended-customer scenario, that scenario gets its own factory: aSuspendedCustomerAttemptingCheckout(). The name carries the claim. The factory enforces the world. The two factories cannot be conflated because they do not share a contract.
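One way to make coherence a precondition of construction is to give each named world its own type and let the constructor refuse contradictory facts. The sketch below assumes that approach. The CheckoutReadyWorld and SuspendedCheckoutAttempt types, the AccountStatus enum, and the choice of ArgumentException are hypothetical; only the two factory names come from the text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical domain types for illustration (C# 9 records assumed).
public enum AccountStatus { Active, Suspended }
public record Customer(string Name, AccountStatus Status);

// A world in which checkout can succeed. Construction fails on a contradiction.
public sealed class CheckoutReadyWorld
{
    public Customer Customer { get; }
    public IReadOnlyList<string> Items { get; }

    public CheckoutReadyWorld(Customer customer, IReadOnlyList<string> items)
    {
        if (customer.Status != AccountStatus.Active)
            throw new ArgumentException("A checkout-ready world requires an active customer.");
        if (items.Count == 0)
            throw new ArgumentException("A checkout-ready world requires at least one item.");
        Customer = customer;
        Items = items;
    }
}

// A different claim gets a different type, so the two contracts cannot be conflated.
public sealed class SuspendedCheckoutAttempt
{
    public Customer Customer { get; }
    public IReadOnlyList<string> Items { get; }

    public SuspendedCheckoutAttempt(Customer customer, IReadOnlyList<string> items)
    {
        if (customer.Status != AccountStatus.Suspended)
            throw new ArgumentException("This world claims a suspended customer.");
        Customer = customer;
        Items = items;
    }
}

public static class Worlds
{
    public static CheckoutReadyWorld aCartReadyForCheckout() =>
        new CheckoutReadyWorld(new Customer("Active Customer", AccountStatus.Active), new[] { "book" });

    public static SuspendedCheckoutAttempt aSuspendedCustomerAttemptingCheckout() =>
        new SuspendedCheckoutAttempt(new Customer("Suspended Customer", AccountStatus.Suspended), new[] { "book" });
}
```

Because the two factories return different types, code that expects a checkout-ready world will not compile against the suspended one. The runtime checks catch what the type system cannot, such as an empty cart.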

This is a design constraint, not a coding convention. Teams that get it right end up with a small number of named worlds, each coherent, each named in domain language. Teams that get it wrong end up with one mega-builder that takes forty optional parameters and produces whatever the caller asks for. The mega-builder is configuration soup with a fluent face. The world it produces is whatever the caller assembled, which is to say no world at all.

Worlds Compose Through Deltas, Not Duplication

The mega-builder problem has a shape. The alternative has a shape too.

aCartReadyForCheckout().withCustomer(aLoyaltyMember().inTheirFirstYear()) says something specific. The base world is aCartReadyForCheckout(). Everything that world claims is still claimed. The test states only the delta: in this scenario, the customer is a loyalty member in their first year. The other facts (the cart is in checkout, the address can ship, the payment is valid) are unchanged.

The delta-composition pattern keeps tests short and worlds coherent. A test that needs three deviations from the base world says exactly three things. The reader does not have to reconstruct the full state of the world from the assembly. The reader sees the base world by name, sees the deltas, and understands the claim.

The alternative is duplication. A test that wants a first-year loyalty member in checkout assembles the whole scenario from scratch, restating the customer, the order, the address, the payment, the inventory. Most of the assembly repeats what every other checkout test does. Each repetition is a chance for drift. One test sets shippingAddress.country = "US". Another sets shippingAddress.country = "USA". A third does not set it at all. Three tests claim three subtly different worlds. None of them notices.

The delta pattern removes the drift by removing the chance to introduce it. The base world is the only place country gets set. Tests that need a non-US address say .shippingTo(germany()) and the delta is explicit. Tests that do not need a different country never see the field. The vocabulary forces the world’s stability.

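[Fact]
public void First_year_loyalty_members_get_double_points_during_black_friday()
{
    // Arrange: name the base world, then state only the deltas.
    // Assumes the world factories are in scope (for example via "using static")
    // and that checkout is the system under test.
    var world = aCartReadyForCheckout()
        .withCustomer(aLoyaltyMember().inTheirFirstYear())
        .duringBlackFriday();

    // Act
    var receipt = checkout.Process(world.Order);

    // Assert
    receipt.LoyaltyPoints.Should().Be(120);
}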

Reading that test is reading a sentence in the system’s language. The base world is named. The deltas are named. The assertion is stated in domain terms: loyalty points on a receipt. The test could be moved to any other codebase that shared the vocabulary and still mean the same thing. That portability is what a controlled vocabulary buys.
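One way the world behind a test like this might be built is with immutable records, where each delta returns a copy that changes exactly one fact. The sketch below is an assumption, not this codebase’s implementation. The CheckoutWorld, Customer, and Address types, their fields, and the country codes are hypothetical, and aLoyaltyMember() is simplified to return a customer directly rather than a builder.

```csharp
using System.Collections.Generic;

// Hypothetical domain types for illustration (C# 9 records assumed).
public record Address(string Country);

public record Customer(string Name, bool IsLoyaltyMember, int YearsAsMember)
{
    // One-axis variation: only the membership age changes.
    public Customer inTheirFirstYear() => this with { YearsAsMember = 0 };
}

public record CheckoutWorld(Customer Customer, Address ShippingAddress, IReadOnlyList<string> Items)
{
    // Each delta copies the world and replaces a single fact.
    public CheckoutWorld withCustomer(Customer customer) => this with { Customer = customer };
    public CheckoutWorld shippingTo(Address address) => this with { ShippingAddress = address };
}

public static class Worlds
{
    public static Address germany() => new Address("DE");

    public static Customer aLoyaltyMember() =>
        new Customer("Canonical Loyalty Member", IsLoyaltyMember: true, YearsAsMember: 3);

    // The base world is the only place the default country is set.
    public static CheckoutWorld aCartReadyForCheckout() =>
        new CheckoutWorld(
            new Customer("Canonical Customer", IsLoyaltyMember: false, YearsAsMember: 0),
            new Address("US"),
            new[] { "book" });
}
```

Because every delta returns a new record, the base world is never mutated. Two tests that start from aCartReadyForCheckout() cannot leak state into each other, and a test that never calls shippingTo() never touches the country field.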

Agents Recombine Worlds, Not Primitives

This is where the discipline pays off in the agent era.

An agent generating a new test in a world-rich codebase composes worlds. Given a prompt about a new promotion scenario, it reaches for aCartReadyForCheckout(), applies the deltas the prompt implies, and writes a test that reads like the rest of the suite because it uses the same vocabulary as the rest of the suite. The reviewer sees a familiar shape. The vocabulary stays stable. The world layer absorbs the new scenario without growing incoherent.

The same agent in a flat-setup codebase has nothing to compose. It reaches for what is there: new Order { ... }, twenty lines of field-setting, magic strings for status codes. It produces another field-bag. The vocabulary does not grow because the codebase has none for it to draw on. The test passes. The codebase is now one test longer and zero vocabulary entries richer.

The difference is not the agent’s capability. The difference is the ceiling. An agent recombining worlds can only express what the world layer supports. A world layer with three coherent worlds, each parameterizable by named deltas, gives the agent a grammar that scales by combination. A flat codebase gives the agent nothing to combine, and the agent generates more flatness, faster.

The same mechanism is at the heart of every test-quality argument the agent era has produced. The vocabulary the team built for craft reasons just became leverage. Worlds are vocabulary too: the kind that lives in factory names and delta methods rather than in domain types. The investment in world design compounds across every future agent-generated test the way the investment in domain types compounds across every future agent-generated implementation. Two compounding curves, same shape.

A team that invests in worlds is investing in what the agent can do tomorrow.

The God-Builder Is the Anti-Pattern

The shape to watch for is the one builder that does everything. Forty optional methods. Twenty boolean flags. A .build() call that returns whatever combination the caller assembled. The fluent face hides what the helper has become: a configuration object with method chaining.

The god-builder is what teams reach for when the world layer is missing and they want the readability of builders without the design work of named worlds. It looks like a builder. It scales by parameter, not by composition. Every new scenario adds another method to the builder rather than another named world to the codebase. Over time the builder accumulates a parameter for every variation any test ever needed. The result is a helper that can construct any state, including contradictions, with no enforcement.
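A deliberately small sketch of the anti-pattern shows how little stands between the caller and a contradiction. It is illustrative only. The Scenario and ScenarioBuilder types and the withItemCount() method are hypothetical; withSuspendedMember() and withCheckoutInProgress() reuse the method names from the coherence discussion earlier.

```csharp
// Anti-pattern sketch: a fluent builder that is really a bag of flags.
public sealed class Scenario
{
    public bool CustomerSuspended { get; init; }
    public bool CheckoutInProgress { get; init; }
    public int ItemCount { get; init; }
}

public sealed class ScenarioBuilder
{
    private bool _suspended;
    private bool _checkoutInProgress;
    private int _itemCount;

    public ScenarioBuilder withSuspendedMember() { _suspended = true; return this; }
    public ScenarioBuilder withCheckoutInProgress() { _checkoutInProgress = true; return this; }
    public ScenarioBuilder withItemCount(int count) { _itemCount = count; return this; }

    // Returns whatever the caller assembled. Nothing checks the facts against each other.
    public Scenario build() => new Scenario
    {
        CustomerSuspended = _suspended,
        CheckoutInProgress = _checkoutInProgress,
        ItemCount = _itemCount,
    };
}
```

The chain new ScenarioBuilder().withSuspendedMember().withCheckoutInProgress().withItemCount(0).build() compiles and returns a suspended customer in the middle of checking out an empty cart. Three methods, and this sketch already permits the contradictions the named worlds refuse.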

The pattern fails in a specific way under agent authorship. The agent reading the god-builder sees one helper with forty methods and treats it as a flat menu. It picks methods that match the prompt’s keywords, calls them in some order, and constructs whatever falls out. The constructed world is no longer coherent because nothing was responsible for coherence. The test passes, sometimes, and silently encodes a contradiction the rest of the time.

Named worlds prevent this by making the contradiction unconstructable. aCartReadyForCheckout() cannot return an empty cart because that would violate the world’s claim. The factory will not let that combination exist. The agent reading the factory cannot construct an empty checkout-ready cart, because the API does not offer the path. The constraint is structural.

The discipline is to resist the god-builder when a new scenario does not fit an existing world. The right move is to name a new world. aCartReadyForCheckoutWithDigitalGoodsOnly(). aCartAfterPromotionStacking(). The catalogue grows by addition, not by parameterization. The vocabulary stays specific. The coherence stays enforced.

A god-builder can do anything. A named world can do exactly one thing. Only one of those scales.

World-Building Is the Skill, Not the Helper

The thing the team is doing when they design the world layer is domain modeling. They are deciding what worlds exist in this system, what each world claims, and what the deltas between them are. The artifacts (factories, builders, mothers) are the residue of that thinking. The thinking is the skill.

Most discussion of test setup focuses on the artifacts. Pick the right helper library. Use builders. Avoid global fixtures. All true. None of it is the work. The work is the modeling: what is a coherent world in this system, how do worlds relate, what deltas are interesting enough to name. That work is product modeling under another name. It produces the same artifacts product modeling produces: a vocabulary of named concepts and the relationships between them.

The fact that the artifacts live in tests/ rather than src/ is incidental. The artifacts are the system’s domain language, expressed in a form that can be exercised. They are the most-used domain language in the codebase because every test composes from them and every agent reads them while writing more tests. Investments in that vocabulary compound at the highest rate of anything in the codebase.

Teams that treat world design as a first-class engineering activity end up with codebases that are pleasant to work in, easy to extend, and stable under agent authorship. Teams that treat setup as plumbing end up with brittle suites, drifting vocabulary, and agent contributions that grow the codebase without growing its coherence.

Setup is world-building. The team that builds worlds builds a system. The team that hand-assembles setup builds a pile of facts and asks the agent to make a system out of them. The agent cannot. Nothing can. The system was always going to come from the design, and the design lives in the world layer or it lives nowhere.