TDD Buddy - Just-in-Time Tests Are What You Get When You Gave Up

Just-in-time tests are what you get when you gave up on the suite.

That sentence will feel harsh to teams that have adopted just-in-time test generation as a genuinely useful tool, and that’s fine. The distinction this post is drawing is not between JIT tests and no tests. It is between JIT tests as a supplement and JIT tests as a substitute. The former is a reasonable workflow choice. The latter is a way of paying the vibes-driven-codebase tax with extra steps and a press release.

To make that argument cleanly, the post has to be honest about what JIT tests actually solve before dismantling what they forfeit.

The Test Suite Was Never Just a Regression Net

The most common frame for a test suite is the regression net: a mesh of assertions that catches breakage before it ships. This is accurate and incomplete. It names one of the suite’s functions and leaves the others unnamed, which is how a reductive frame becomes dangerous. A team that only thinks of its suite as a regression net will evaluate tests only by whether they catch breakage. By that criterion alone, JIT tests look fine.

They are not fine as a substitute, because the suite was never just a regression net.

A well-maintained test suite is at least four things at once. It is a vocabulary registry: the set of named concepts that the team has agreed to recognize, expressed as builder method names, test scenario names, and domain types. Every time a developer writes aLoyaltyMember() or anOrderOver(50.dollars()), they are not just constructing test data. They are referencing an entry in the team’s domain vocabulary, confirming that the concept exists, that it has a canonical name, and that it composes with other concepts the team has already recognized. That registry is the living record of what the team knows.

The suite is also a scenario index: a searchable record of every behavior the team has ever specified. Not what the code happens to do, but what the team decided it should do. These are different things. The code reflects the most recent implementation decision. The scenario index reflects the intention that preceded and constrained each implementation decision. When a future developer (or a future agent) asks “what does the system do when a loyalty member’s order is exactly at the threshold?”, the scenario index has the answer, precisely named and still executable.
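A minimal sketch of what one corner of that index might look like for the threshold question. Everything here is an assumption made for illustration: the LoyaltyDiscountRule class is hypothetical, the fifty-dollar threshold and ten-percent rate are borrowed from the loyalty example this post uses, and "over" is read as strictly greater than. Plain throwing checks stand in for a test framework so the sketch runs on its own.

```csharp
using System;

// Hypothetical rule under specification; not taken from any real codebase.
public static class LoyaltyDiscountRule
{
    public const decimal Threshold = 50m; // assumed threshold
    public const decimal Rate = 0.10m;    // assumed ten-percent rate

    // Assumption: "over the threshold" means strictly greater than it.
    public static decimal For(bool isLoyaltyMember, decimal orderTotal) =>
        isLoyaltyMember && orderTotal > Threshold ? orderTotal * Rate : 0m;
}

// Each method name is an index entry: it records what the team decided,
// and it stays searchable and executable for as long as the test is kept.
public static class LoyaltyThresholdScenarios
{
    public static void Loyalty_member_order_exactly_at_the_threshold_receives_no_discount() =>
        Expect(0m, LoyaltyDiscountRule.For(isLoyaltyMember: true, orderTotal: 50m));

    public static void Loyalty_member_order_over_the_threshold_receives_ten_percent_off() =>
        Expect(6m, LoyaltyDiscountRule.For(isLoyaltyMember: true, orderTotal: 60m));

    public static void Non_member_order_over_the_threshold_receives_no_discount() =>
        Expect(0m, LoyaltyDiscountRule.For(isLoyaltyMember: false, orderTotal: 60m));

    private static void Expect(decimal expected, decimal actual)
    {
        if (expected != actual)
            throw new InvalidOperationException($"Expected {expected} but got {actual}.");
    }
}
```

The at-threshold scenario is the one most likely to be lost when tests are discarded, because the code alone cannot say whether a greater-than comparison was a decision or an accident.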

The suite is also a contract layer: the explicit commitments the system makes to its callers. Collaboration tests and interaction assertions say, in executable terms, what one component promises to another. These commitments outlive the developers who made them. They survive refactors, turnover, and re-platforms, because they are specified rather than implied.
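One way such a commitment might be written down, sketched with a hand-rolled spy instead of a mocking library so it stays self-contained. The ILoyaltyLedger collaborator, the Checkout class, and its Process signature are all hypothetical names invented for this illustration, and primitives are used only to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical collaborator: what Checkout promises to tell the loyalty ledger.
public interface ILoyaltyLedger
{
    void RecordPurchase(string customerId, decimal total);
}

// Hand-rolled spy: records every call so a test can assert on the interaction.
public sealed class SpyLoyaltyLedger : ILoyaltyLedger
{
    public List<(string CustomerId, decimal Total)> Calls { get; } = new();

    public void RecordPurchase(string customerId, decimal total) =>
        Calls.Add((customerId, total));
}

// Hypothetical component that makes the promise.
public sealed class Checkout
{
    private readonly ILoyaltyLedger _ledger;

    public Checkout(ILoyaltyLedger ledger) => _ledger = ledger;

    public void Process(string customerId, decimal total)
    {
        // The commitment under test: each processed order is reported exactly once.
        _ledger.RecordPurchase(customerId, total);
    }
}

public static class CheckoutContract
{
    // Named after the promise between components, not after the method called.
    public static void Checkout_reports_every_processed_order_to_the_loyalty_ledger()
    {
        var ledger = new SpyLoyaltyLedger();

        new Checkout(ledger).Process("customer-42", 60m);

        if (ledger.Calls.Count != 1 || ledger.Calls[0] != ("customer-42", 60m))
            throw new InvalidOperationException("Checkout broke its promise to the loyalty ledger.");
    }
}
```

The assertion is on the interaction rather than on a return value, which is what lets it survive a rewrite of Checkout's internals.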

And the suite is an onboarding document. A new developer dropped into a well-maintained codebase can read the test names, find the builders, and understand the domain in an afternoon. A new agent dropped into the same codebase can do the same thing, faster. The suite teaches the vocabulary and demonstrates the idioms, not as documentation that may have drifted, but as executable specification that cannot drift unnoticed, because a specification that no longer matches the code fails when it runs.

None of those four functions is served by a test that is generated against a diff, run once, and discarded. The catch-the-regression function is served. The other four are not.

Treating the test suite as nothing more than a regression net was always reductive. The industry is now selling that reduction back as innovation, and the teams most likely to buy it are the ones whose suites were already failing to do the other four jobs.

What JIT Tests Actually Solve

Be precise here, because precision is what separates a useful critique from a sweeping dismissal.

A coding agent reads a diff, infers the intent of the change, generates a test that would have caught a regression had it been in the suite before the change, runs it, and discards it. That workflow solves a real and specific problem: the gap between the tests that exist before a change and the tests that would have been needed to catch the regression that the change introduces.

That gap is real. In any non-trivial codebase, the existing suite does not cover every behavior. Changes introduce behaviors that nothing was testing. A JIT test, generated against the diff, closes that gap for the PR at hand. It catches the regression. The PR ships cleaner.

This is useful. It is narrow. And it is honest: the value proposition is exactly “we caught this PR’s regression,” no more.

There are also workflows where JIT tests are the right tool even in a well-maintained suite. A hotfix that touches an area the team has not reached yet. A one-off migration script that will never run again. A data-pipeline job that exists for a single event. In those cases, the expected lifecycle of the behavior is genuinely short, and a short-lived test is the proportionate response.

The argument against JIT testing is not that the tool is wrong. The argument is that the tool is being sold as a replacement for disciplined test maintenance, and it cannot do that job, because the thing it generates and discards is not the thing that maintained suites accrete.

The gap JIT tests close is real. The gap they cannot close is larger.

There is also a second-order effect worth naming. Because JIT tests are targeted at a diff, they tend to be written in the vocabulary of the implementation rather than the vocabulary of the domain. The test knows what the code did; it may not know what the behavior means. That is not a criticism of the generation quality. It is a structural property of diff-targeting: the diff is the signal, and the diff speaks implementation.

What JIT Tests Forfeit

Every JIT test is generated, run, and discarded. That sentence contains the whole argument.

When the test is discarded, the vocabulary does not accrete. The scenario is not named. The concept does not enter the builder hierarchy. The commitment is not added to the contract layer. The onboarding document does not grow. The suite tomorrow looks exactly like the suite today, except one more PR shipped without a regression. That is a fine outcome. It is not a compounding outcome.

Consider what happens when a team maintains tests instead of discarding them. Each commit that adds a scenario to the suite adds an entry to the vocabulary registry. aLoyaltyMember() gets added once. From that point forward, every test that needs a loyalty member can use it, every developer who reads those tests learns the canonical name, and every agent working in that codebase can recombine it with other builders to generate new scenarios. The builder becomes a word in the team’s language. Its value compounds because every subsequent use is cheaper than the first.
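A minimal sketch of what adding those entries once might look like. The Money, Customer, Order, and OrderBuilder types, the TestVocabulary class, the aGuest() builder, and the choice to model "over" as one cent above the threshold are all assumptions made for illustration; only the names aLoyaltyMember(), anOrderOver(), dollars(), and forCustomer() come from the examples in this post.

```csharp
// Hypothetical value type so an amount cannot be confused with a bare decimal.
public sealed record Money(decimal Amount);

public static class MoneyExtensions
{
    // Lets tests write 50.dollars() instead of new Money(50m).
    public static Money dollars(this int amount) => new Money(amount);
}

public sealed record Customer(string Name, bool IsLoyaltyMember);

public sealed record Order(Customer Customer, Money Total);

// The registry: one canonical, named entry per concept the team recognizes.
public static class TestVocabulary
{
    public static Customer aLoyaltyMember() =>
        new Customer("Loyal Lee", IsLoyaltyMember: true);

    public static Customer aGuest() =>
        new Customer("Guest Gale", IsLoyaltyMember: false);

    // Assumption: "over" is modelled as one cent above the given threshold.
    public static OrderBuilder anOrderOver(Money threshold) =>
        new OrderBuilder(new Money(threshold.Amount + 0.01m));
}

public sealed class OrderBuilder
{
    private readonly Money _total;
    private Customer _customer = TestVocabulary.aGuest(); // default until a test says otherwise

    public OrderBuilder(Money total) => _total = total;

    public OrderBuilder forCustomer(Customer customer)
    {
        _customer = customer;
        return this;
    }

    public Order Build() => new Order(_customer, _total);
}
```

With `using static TestVocabulary;` at the top of a test file, the call site shrinks to anOrderOver(50.dollars()).forCustomer(aLoyaltyMember()).Build(), and every later test that needs either concept pays nothing to reuse it.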

JIT tests do not do this. A JIT test for a DiscountCalculator change might assert something like:

[Fact]
public void DiscountCalculator_returns_correct_value_for_input()
{
    var calculator = new DiscountCalculator();
    var result = calculator.Calculate(new Order { Total = 60m, CustomerType = "loyalty" });
    Assert.Equal(6m, result);
}

The test is correct. It catches the regression. Notice what it does not contain. It does not contain aLoyaltyMember(). It does not contain anOrderOver(50.dollars()). It does not name the threshold as a domain concept. It does not name the relationship between loyalty status and the discount tier. It does not reference the promotional model that governs when loyalty discounts apply and when they stack with other promotions. It is a diff-targeted regression test, written in implementation vocabulary rather than domain vocabulary, and when it is discarded those absences are discarded with it.

Now consider the same change covered by a persistent scenario test:

[Fact]
public void Loyalty_members_receive_ten_percent_off_orders_over_the_loyalty_threshold()
{
    var order = aCartReadyForCheckout()
        .forCustomer(aLoyaltyMember())
        .containing(anItemCosting(60.dollars()));

    // checkout is the system under test, assumed here to be a field on the test class.
    var receipt = checkout.Process(order);

    receipt.Discount.Should().Be(6.dollars());
    receipt.AppliedPromotions.Should().Contain(promotion => promotion.Is(LoyaltyDiscount));
}

This test asserts the same regression-catching fact. It also names the loyalty concept in the team’s builder vocabulary. It names the threshold relationship. It names the promotional model and anchors it to a domain type. It makes visible the connection between aLoyaltyMember() and LoyaltyDiscount, which means that tomorrow, when someone adds a stacking promotion, they will find this test, read its name, and understand what they are stacking against.

The persistent test compounds. The JIT test does not.

This is not a small difference dressed up in large words. It is the difference between an asset that grows and a tool that consumes itself. After a hundred JIT-tested PRs, the suite is in the same position it was in before the first one. After a hundred persistently tested commits, the suite is a richer document, a broader vocabulary, and a more powerful seed corpus for the next agent task.

The compounding line is the whole point, and JIT tests leave it flat.

Why the Pitch Lands Anyway

If JIT tests forfeit so much, why do teams adopt them so readily?

The answer is precise and uncomfortable: most teams’ suites were already not compounding.

Picture a team whose tests are named test_1 through test_847, whose setup is composed of bare new Order() calls and twenty lines of field-setting, whose domain concepts live as string comparisons and magic decimals, and whose suite is a collection of regression snapshots rather than named scenarios. That team is not losing a vocabulary registry when they adopt JIT testing. They never had one. They are not losing a compounding asset. What they had was already flat.

For that team, JIT testing is a genuine improvement on their current workflow. The tests are machine-generated rather than human-generated, but the machine-generated tests are no less expressive than the hand-written ones, because “not expressive” was already the condition. The suite was already a regression net and nothing more. JIT testing makes the regression net cheaper to maintain.

This is why the pitch lands. It lands because it meets the team where they actually are, which is with a suite that was already disposable in everything but name. The JIT pitch says: “your suite is not working for you the way you hoped it would, so stop maintaining it and let the machine handle it.” And for a team whose suite was not doing vocabulary work, not doing scenario indexing, not doing contract layer work, the pitch is accurate.

The problem is not the pitch. The problem is the implication that the condition the pitch is treating (a disposable suite) is the inevitable condition, and that disciplined test maintenance was always aspirational mythology. It was not. Teams that maintained well-named, builder-driven, scenario-rich suites got real compounding, and the compounding is now worth more than it was before agents existed, not less.

JIT testing makes a weak suite tolerable. It does so by making the weak suite look like the right answer, which delays the decision to fix it.

That is not a vendor critique. That is an observation about what happens when a tool is designed to address the symptom of a degraded practice and gets sold as a replacement for the practice itself.

The Persistent Suite Is the Asset Agents Need

There is a mechanical reason why persistent, vocabulary-rich suites compound differently in an agentic context, and it lives in how agents generate.

Agents do not generate from nothing. They generate by recombining patterns they find in the surrounding codebase. Give an agent a codebase with aLoyaltyMember(), aCartReadyForCheckout(), anOrderOver(50.dollars()), Money, and Should().Be(), and it will generate new tests by composing those patterns in novel combinations. The output reads like the existing suite because it uses the existing suite. The vocabulary stays consistent. The new test slots into the existing index rather than standing beside it as a foreign object.

Give the same agent a codebase of new Order { CustomerType = "loyalty", Total = 60m } and it will generate more new Order { CustomerType = "loyalty", Total = 60m }. The output is consistent, but consistent with a suite that was never composable. The vocabulary does not grow because there was no vocabulary to grow. Each generated test is standalone, primitive-heavy, and opaque to the next agent that reads it.

The persistent suite is the seed corpus for agent-generated tests. The richer the seed, the more the agent’s output inherits the domain language. The richer the domain language in the tests, the better the agent understands the system’s concepts and their relationships. The agent reading anOrderOver(50.dollars()) has already learned that there is a threshold, that it is fifty dollars, and that the threshold matters enough to have a named builder method. That is information the agent carries into its next task.

JIT tests contribute nothing to this corpus. They are generated from the seed and discarded without enriching it. The seed stays the same, which means the next agent task starts from the same position as the last one, with no accumulated vocabulary, no new named scenarios, no broader index to search.

This is the argument this blog has been developing across the “Your Test Suite Is Your API for Agents” and “The Bar for TDD Just Moved” arc. The test suite is not a safety net for agents. It is the interface they compose against. An interface that does not grow does not compound. An interface that does not compound provides no leverage.

JIT tests give agents nothing to build on. A maintained suite gives them a richer foundation after every commit.

The Test Suite Is a 10-Year Asset

The codebase a team maintains in 2036 is downstream of the test suite they maintain in 2026.

This claim is worth sitting with, because it is the frame that makes the JIT question concrete rather than abstract. Tests written and discarded in 2026 are gone. Not just absent from the 2036 suite: they never shaped the vocabulary, never named the scenarios, never constrained the implementations that were built on top of them. The codebase in 2036 reflects the vocabulary and concepts that survived, and tests are where vocabulary gets named and retained.

Tests written and kept in 2026 are still steering generation in 2036. aLoyaltyMember() written today becomes the canonical noun for loyalty-customer test setup across every agent task run against this codebase for the next decade. The threshold concept named as anOrderOver(50.dollars()) stays visible in the suite as a reference point when the threshold changes, when a stacking promotion gets added, when a new tier gets introduced. The scenario named Loyalty_members_receive_ten_percent_off_orders_over_the_loyalty_threshold is still findable in 2036, still executable, still accurate about what the team intended when that behavior was first specified.

None of that is dramatic. It is just what compounding looks like when the investment horizon is ten years instead of ten sprints.

The teams that JIT-tested their way through the late 2020s will face the same codebase problems in 2036 that vibes-driven development produced in the 2010s. Fragmented vocabulary. Duplicate concepts under different names. Behaviors that were specified once and lost when the team turned over. An implementation landscape that reflects accumulated generation rather than accumulated intention. The tool that produced the regression-catching tests changed. The outcome of not maintaining a suite did not.

The teams that built and maintained vocabulary-rich suites through the same period will have a codebase in 2036 where every agent task starts from a richer corpus. The builders compose more scenarios. The scenario index covers more behaviors. The contract layer documents more commitments. The onboarding surface is broader. The seed corpus for every future agent task is denser and more expressive.

That difference was set in motion by a decision made in 2026 about whether tests were worth keeping.

The ten-year asset is not built from JIT tests. It is built from the decision, made commit by commit, that each new behavior deserves a name and a permanent place in the suite.

JIT Tests Are a Supplement or a Hospice Plan

The frame that resolves this is not “JIT tests are bad.” The frame is: for what is this tool the right answer?

JIT tests are a useful supplement to a strong suite. When the suite is already vocabulary-rich, scenario-named, and builder-driven, a JIT test that closes a gap on a specific PR adds the regression-catching value without requiring that the team manually maintain every one-off case. The strong suite provides the corpus. The JIT test patches the specific gap. The combination is effective.

JIT tests are a hospice plan for a weak suite. When the suite was already not compounding, already not doing vocabulary work, already not building the scenario index, JIT testing makes the condition comfortable without treating it. The regression coverage stays adequate. The vibes-driven-codebase tax keeps accumulating, now with automation managing the surface presentation rather than humans doing it by hand.

The hospice plan is not contemptible. Hospice is appropriate when the patient cannot recover and comfort is the honest goal. Some teams may legitimately be in hospice territory: teams whose suites have decayed past the point where investment is feasible, teams with ten years of legacy primitives and no builders, teams with three months to ship and no runway for infrastructure work. For them, JIT testing is the right tool applied in the right context.

The error is adopting JIT testing when recovery was still possible. When the suite could still be built into an asset. When the vocabulary was still young enough to be named and extended. When the team had the runway and the will to invest in the compounding. In that situation, JIT testing is not a supplement. It is the decision to stop compounding before the compounding had a chance to pay off.

Adopt JIT tests as a supplement to a strong suite. Refuse them as a substitute for building one.

The persistent suite is not infrastructure cost. It is the team’s vocabulary, its scenario index, its contract layer, and its onboarding document, accreted commit by commit, compounding use by use, steering generation for every agent that will ever work in the codebase, including the ones running tasks in 2036 that have not been written yet.

JIT tests catch this PR’s regression. The suite you build and keep catches everything that comes after.