Evals Are Tests Wearing a Lab Coat

An eval is a test: a named scenario, an assertion, a verdict. The eval-tooling wave rebuilt the BDD coordination tax as golden datasets owned by a separate role. The eval suite should be the test suite, in domain language, with builders.

By Travis Frisinger · June 29, 2026 · 11 min read
TDD · AI Agents · Test Design · Software Delivery

Related reading: “BDD Was a Coordination Tax. AI Just Repriced It” is the prior post that this one extends; “TDD Already Does BDD, Without the Gherkin” and “The Bar for TDD Just Moved” name the test discipline this argument assumes.

An eval is a test. The industry just put a lab coat on it.

That sentence is the entire argument. Everything that follows is structural detail. The eval-tooling wave arrived with new vocabulary (golden datasets, judges, scorers, rubrics), new file formats, new dashboards, new job titles, and a confident announcement that “TDD does not work for AI” because there is no single correct output. Both moves are mistakes. The first is a vocabulary reinvention that hides what the discipline already knew how to do. The second is a strawman of TDD that good TDDers abandoned twenty years ago when they started asserting on properties and invariants instead of exact strings.

The cost of the lab coat is not aesthetic. It is structural. Treating evals as a separate discipline produces a separate specification language, owned by a separate role, maintained in a separate tool, drifting from the code it claims to describe. That sentence should sound familiar. It is the exact failure mode of feature files in the BDD era, restated with model-shaped nouns. The industry deleted Gherkin and rebuilt it, line by line, under a new name. The coordination tax that disciplined TDD finally killed is being charged again.

This post is about taking the lab coat off.

The Industry Reinvented the Test and Renamed It

Strip away the surface and look at the structure of an eval.

An eval has a named scenario: “respond to a churn-risk customer.” It has an assertion: “the response should include the retention offer.” It has a verdict on every run: pass or fail. It is run automatically in a pipeline. It blocks deployment if it fails. It is part of how the team knows whether the system works.

Everything in that paragraph is also true of a test. The vocabulary is different. The structure is identical. The “scenario” is what a test has always been. The “assertion” is what a test has always done. The “verdict” is what a test has always produced. The pipeline integration, the deployment gating, the source-of-truth role: all of it is what tests have done for two decades.

The pieces that look new under closer inspection are not.

“Golden datasets” are example tables. Tests have had example tables in parameterized tests, table-driven tests, and theory tests for as long as testing frameworks have existed. The “judge” or “scorer” is an assertion library. The “rubric” is an assertion list. The “trace viewer” is a test report. The “eval harness” is a test runner. Every piece of the eval-tooling stack maps to a piece of the testing stack that already exists in every mature codebase. The mapping is not approximate. It is exact.

The lab coat is the framing that the work is novel. The work is not novel. The work is testing.

“TDD Does Not Work for AI” Misreads What a Test Is

The standard objection is that TDD assumes one correct output. The model produces variable output. Therefore TDD does not apply. The argument sounds airtight to anyone whose mental model of TDD ends with Assert.Equal(expected, actual).

That mental model is not what disciplined TDD looks like.

A disciplined test asserts on properties of the output, not on exact equality of the output. The test of a discount calculator does not assert that the receipt is one specific dollar value. It asserts that the receipt has a discount line, that the discount is in a valid range, that the total matches the subtotal minus the discount. The test of a search ranker does not assert that the top result is one specific document. It asserts that results are ordered by relevance, that no result has a negative score, that the requested filter is respected.
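As a minimal sketch of the discount-calculator case, the properties can be written as plain predicates. The Receipt record and the "at most half the subtotal" rule are assumptions made for illustration, not taken from a real codebase.

```csharp
using System;

// Hypothetical receipt type, for illustration only.
public sealed record Receipt(decimal Subtotal, decimal Discount, decimal Total);

public static class ReceiptProperties
{
    // Assumed business rule: a discount is never negative
    // and never more than half of the subtotal.
    public static bool DiscountIsInValidRange(Receipt r) =>
        r.Discount >= 0m && r.Discount <= r.Subtotal * 0.5m;

    // The total must equal the subtotal minus the discount.
    public static bool TotalIsConsistent(Receipt r) =>
        r.Total == r.Subtotal - r.Discount;

    // True when every property holds, whatever the exact amounts are.
    public static bool Holds(Receipt r) =>
        DiscountIsInValidRange(r) && TotalIsConsistent(r);
}
```

No exact dollar value appears anywhere in the check. Any receipt that respects the rules passes.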

The shift from exact assertions to property assertions happened years before models entered the suite. It happened because real systems have sources of run-to-run variation everywhere. Floating-point rounding that differs with platform and operation order. Timestamp generation. Set iteration order. Concurrent execution. Database row ordering. Sort stability. Any of those will produce output that varies run to run while still being correct, and assertion-on-exact-equality breaks under all of them. The test design that survived the encounter is the test design that asserts on what must hold, not on what happened to appear.

A model is one more non-deterministic source. Apply the same discipline. Assert on the shape, length, presence of required content, absence of forbidden content, satisfaction of domain constraints. Skip the exact-string comparison that was never the right tool. The “TDD does not work for AI” claim is a critique of a strawman. The version of TDD it critiques was wrong for everything, not just for model-backed features.

[Fact]
public void Draft_to_a_churn_risk_customer_stays_under_two_hundred_words_and_includes_the_retention_offer()
{
    // One live draft from the model-backed drafter; the exact wording varies run to run.
    var draft = drafter.Compose(anEmail().to(aChurnRiskCustomer()));

    // Properties that must hold for any acceptable draft.
    draft.WordCount.Should().BeLessThan(200);
    draft.Should().MentionThe(retentionOffer);
    draft.Tone.Should().Be(Tone.Empathetic);
    draft.Should().NotMention(internalCodename);
}

That is a property assertion. The model can produce a thousand different drafts that satisfy it. The test does not care which one comes back today. The test cares that whatever comes back is short enough, mentions the offer, lands in the right tone, and does not leak the codename. Those are the properties the team has decided matter. The test is the encoding of that decision.

No lab coat required.

Non-Determinism Was Always a Test-Design Question

The deeper version of the objection is that model output is not just variable, it is statistical. A test that passes on one run might fail on the next, not because the system changed, but because the model sampled a different token at temperature 0.7. Aggregating across many runs is the standard answer: run the eval N times, measure pass rate, set a threshold.

Aggregate scoring is a real technique. It belongs in the suite. It does not require a parallel specification artifact.

A test that runs N times and asserts on a pass rate is still a test. The framework needs to support running a scenario repeatedly and aggregating the result. Some frameworks offer repeat or parameterized-test features that get close, and where they do not, a few lines of helper code cover it. The assertion changes from “this property holds” to “this property holds in at least 95% of runs.” The structure is unchanged.

[Fact]
public void Retention_offer_appears_in_at_least_ninety_five_percent_of_churn_drafts()
{
    // Compose 100 independent drafts and count how many include the offer.
    var draftsContainingOffer = Enumerable.Range(0, 100)
        .Select(_ => drafter.Compose(anEmail().to(aChurnRiskCustomer())))
        .Count(d => d.Includes(retentionOffer));

    // 95 of 100 runs is the 95% threshold.
    draftsContainingOffer.Should().BeGreaterThanOrEqualTo(95);
}

That test is the eval. It is also the test. The lab coat would have called this a “rubric pass rate” and put it in a separate dashboard. The discipline calls it an assertion on a probability and puts it in the suite. The verdict is the same. The location is not.
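The "few lines of helper code" can be sketched as one generic method. PassRate and its Of method are hypothetical names introduced here, not part of xUnit or any other framework.

```csharp
using System;
using System.Linq;

// Hypothetical helper: not part of any test framework.
public static class PassRate
{
    // Runs the scenario `runs` times and returns the fraction of outputs
    // (between 0.0 and 1.0) that satisfy the property.
    public static double Of<T>(int runs, Func<T> scenario, Func<T, bool> property)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

        int passes = Enumerable.Range(0, runs)
            .Select(_ => scenario())
            .Count(property);

        return (double)passes / runs;
    }
}
```

With a helper like this, the test body above shrinks to a single call that composes the draft and checks the property, followed by an assertion that the returned rate is at least 0.95.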

The same applies to model-as-judge patterns. If the assertion uses another model to score the output (does this draft sound empathetic?), the judge is just an assertion implementation. Wrapping a model call inside IsEmpathetic(string text) is the same move as wrapping a floating-point comparison inside IsApproximately(double expected, double actual, double tolerance). The assertion library grew an entry. The test structure did not change.
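Side by side, the two wrappers might look like the sketch below. The judge is passed in as a delegate so the example stays self-contained. The delegate, the 0-to-1 score scale, and the 0.8 default threshold are all assumptions, and a real suite would wrap whatever scoring model the team uses.

```csharp
using System;

public static class DraftAssertions
{
    // Tolerance-based check: hides floating-point noise behind a named assertion.
    public static bool IsApproximately(double expected, double actual, double tolerance) =>
        Math.Abs(expected - actual) <= tolerance;

    // Judge-based check: hides a model call behind a named assertion.
    // `judge` is a hypothetical delegate that returns an empathy score in [0, 1].
    public static bool IsEmpathetic(string text, Func<string, double> judge, double threshold = 0.8) =>
        judge(text) >= threshold;
}
```

Both methods take a fuzzy comparison and give it a name the test can read. Only the implementation behind the name differs.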

The Golden Dataset Is a Feature File

This is where the BDD echo gets loudest, and the warning is the same.

A golden dataset is a separate file, in a separate format, often owned by a separate role, containing input and expected-output pairs that the system is supposed to match. Engineers do not write it directly. They consume it through a harness. The dataset evolves on its own schedule. The code evolves on its own. Six months later, the two have drifted: features added to the code that the dataset does not cover, examples in the dataset for features that were removed, expected outputs that reflect an older version of the prompt.

Replace “golden dataset” with “feature file” and that paragraph could have come verbatim from the post this series published two arcs ago about why BDD tooling died.

The pattern is the same. A second specification language, separated from the code, owned by a different role, drifting at a different rate. The justifications are the same: non-engineers need to read it, the format is more accessible, the discipline of writing it is “different” from writing tests. The failure modes are the same: drift, ownership confusion, ceremony, a coordination tax paid every time anyone changes anything.

The industry already learned what to do with this pattern. Delete the separate artifact. Put the specification in the test suite, in domain language, expressed with builders and assertions the engineers actually maintain. Let non-engineers read the tests directly, because the tests are now named in language the business uses and structured around scenarios the business recognizes. The coordination cost collapses because the artifact collapses.

The same move applies to evals. The golden dataset belongs in the suite, expressed as parameterized scenarios with property assertions, named in the team’s vocabulary. Anyone who reads the test sees what the team has decided the model-backed feature should do. Anyone who edits the test edits the specification directly. There is no second file to keep in sync. There is no second role to coordinate with. The drift becomes structurally impossible because the artifact does not exist as a separate thing to drift from.
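One way to picture the dataset living inside the suite is a scenario table written in code, where each row pairs a named scenario with the properties its output must satisfy. Everything in this sketch is hypothetical: the Scenario record, the customer kinds, the second scenario, and the string-in, string-out drafter delegate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One row of what would otherwise be a golden-dataset file.
public sealed record Scenario(string Name, string CustomerKind, Func<string, bool>[] Properties);

public static class ChurnDraftScenarios
{
    public static IEnumerable<Scenario> All() => new[]
    {
        new Scenario("churn-risk customer gets the retention offer", "churn-risk",
            new Func<string, bool>[]
            {
                draft => draft.Contains("retention offer", StringComparison.OrdinalIgnoreCase),
                draft => draft.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length < 200,
            }),
        new Scenario("loyal customer is not offered a discount", "loyal",
            new Func<string, bool>[]
            {
                draft => !draft.Contains("discount", StringComparison.OrdinalIgnoreCase),
            }),
    };

    // Returns the names of scenarios whose properties do not all hold.
    public static List<string> Failures(Func<string, string> drafter) =>
        All().Where(s =>
             {
                 var draft = drafter(s.CustomerKind); // one drafter call per scenario
                 return !s.Properties.All(p => p(draft));
             })
             .Select(s => s.Name)
             .ToList();
}
```

In an xUnit suite the same rows could instead feed a [Theory] through [MemberData], so each scenario reports as its own named test. Either way, the rows sit in the repo next to the code they specify.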

The lab coat sells the parallel artifact. The discipline deletes it.

Evals Belong in the Suite, in the Vocabulary

The right home for model-backed feature behavior is the same home as every other feature’s behavior: the test suite, in the team’s domain language, using the same builders and domain types the rest of the suite uses.

A model-backed feature is, structurally, a collaborator. It takes input, produces output, has a contract the team cares about. From the suite’s perspective it is no different from a third-party API, a database, a calculation engine, or any other collaborator the system depends on. The test pattern for collaborators is well-established. Mock the collaborator at the seam when you want to test how the surrounding code uses it. Hit the real collaborator in contract tests when you want to verify the collaborator itself holds up. Both kinds of test live in the same suite, in the same vocabulary, in the same repo, in the same commit.
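The seam can be sketched as an interface with a test double behind it. IDraftModel, EmailDrafter, and FakeDraftModel are hypothetical names, and the prompt wording is invented for the example.

```csharp
using System;

// The seam: the rest of the system depends on this interface, not on a model SDK.
public interface IDraftModel
{
    string Complete(string prompt);
}

// Surrounding code under test: builds the prompt and tidies the output.
public sealed class EmailDrafter
{
    private readonly IDraftModel _model;

    public EmailDrafter(IDraftModel model) => _model = model;

    public string Compose(string customerName, bool churnRisk)
    {
        var prompt = churnRisk
            ? $"Write a short empathetic email to {customerName}. Include the retention offer."
            : $"Write a short email to {customerName}.";
        return _model.Complete(prompt).Trim();
    }
}

// Test double for the seam: records the prompt and returns a canned reply.
public sealed class FakeDraftModel : IDraftModel
{
    public string LastPrompt { get; private set; } = "";

    public string Complete(string prompt)
    {
        LastPrompt = prompt;
        return "  canned draft  ";
    }
}
```

Tests that use the fake check how the surrounding code builds prompts and handles output. Contract tests would swap in a real implementation of the same interface and assert on properties of what comes back.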

The vocabulary is the lever. When the test reads anEmailDraft().forA(churnRiskCustomer()) asserting Should().MentionThe(retentionOffer), the team’s domain language is doing the work. The test names a concept (churn-risk customer) that exists in the business. It asserts on a concept (retention offer) that exists in the business. It uses builders that compose with the rest of the suite’s builders. A reviewer reading this test does not have to switch contexts to a separate eval framework’s vocabulary. The reviewer reads a test in the team’s language and understands what the model is being asked to do.
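A builder behind that kind of phrase can be small. The sketch below assumes hypothetical Customer and EmailRequest types and invented default values. The lowercase method names are deliberate, matching the test vocabulary rather than C# naming convention.

```csharp
using System;

public sealed record Customer(string Name, bool IsChurnRisk);
public sealed record EmailRequest(Customer Recipient, string Purpose);

// Fluent builder with defaults, so a test states only what matters to it.
public sealed class EmailDraftBuilder
{
    private Customer _recipient = new("Default Customer", false);
    private string _purpose = "general update";

    public EmailDraftBuilder forA(Customer recipient) { _recipient = recipient; return this; }
    public EmailDraftBuilder about(string purpose) { _purpose = purpose; return this; }

    public EmailRequest Build() => new(_recipient, _purpose);
}

// Entry points named in the domain's language.
public static class Builders
{
    public static EmailDraftBuilder anEmailDraft() => new();
    public static Customer churnRiskCustomer() => new("Test Customer", IsChurnRisk: true);
}
```

With a using static directive for Builders, a test can write anEmailDraft().forA(churnRiskCustomer()).Build() and read as a sentence about the business.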

A new team member reading the codebase finds the model-backed feature alongside every other feature, specified the same way, runnable the same way, refactorable the same way. The discoverability problem the eval-tooling stack solves with a dashboard is solved here by the file system: the tests live next to the code they specify, named in language anyone on the team recognizes.

This is what the disciplined-TDD position has been saying for years, applied to a new audience. The eval-tooling wave is the latest group to arrive at the realization that specifying behavior is the work. The discipline that already specifies behavior has nothing new to learn from the wave, but the wave has a great deal to learn from the discipline.

The Split Is a Role Artifact, Not a Technical One

The reason evals live outside the suite in most organizations is not technical. It is organizational. A separate team owns “the AI part” of the product. That team needs a way to specify behavior. The existing test suite belongs, in the team’s mind, to a different team (the engineering team that owns the rest of the product). So the AI team builds its own specification artifact, in its own tooling, in its own repo. The split exists because the role split existed first.

That is the same dynamic the BDD post named, with the cast updated.

Feature files lived outside the test suite because a separate role (the analyst, the QA, the BA, depending on the era) needed an artifact to own. The split was a treaty between roles, not a technical decision about where specifications should live. When the role boundary compressed (analyst and engineer became the same person, or worked closely enough that the handoff disappeared), the artifact compressed with it. The feature file evaporated into well-named scenario tests in the same suite as everything else. The treaty was no longer needed, so the protocol stopped being maintained.

The role boundary around AI features is in the same place feature-file ownership was a decade ago. There is a separate team. They need an artifact to own. They build evals. The boundary is real today and produces real artifacts. The boundary is also compressing, fast, as model-backed features stop being “the AI part” and start being “the part that uses a model among five other collaborators.” When that compression finishes, the eval suite will collapse into the test suite the same way feature files collapsed into scenario tests. The artifact will follow the role.

Teams that read the trajectory now and merge the suites early get to skip the coordination tax that the rest of the industry is going to pay for the next three years. The merge is not hard. The merge is structurally available the day the team decides to do it. The barriers are organizational, not technical.

One Suite, One Vocabulary, One Source of Truth

Pull the argument back to where it started.

The lab coat on evals hides three things at once. It hides that evals are tests with new vocabulary. It hides that the “TDD does not work for AI” objection critiques a version of TDD nobody disciplined was practicing. And it hides that the separate eval suite is the BDD coordination tax wearing a different costume.

Strip the lab coat off and the work is testing. The same discipline. The same artifacts. The same vocabulary. The same suite. Model-backed features are specified alongside every other feature, in the team’s domain language, using the same builders that every other test uses. The agent reads the suite and finds a unified specification. The reviewer reads the suite and finds one place where intent lives. The team maintains one artifact, not two.

The eval suite should be the test suite. The lab coat comes off. The work that was always testing gets called what it was.

There is no eval suite and test suite. There is the suite. Everything the system does, including the parts a model does, is specified there, in the words the team uses for everything else.