TDD Buddy - Where the Review Point Moved

Reviewing the diff is now harmful, not just insufficient.

This is the first post in a two-part series, “The PR Is Eating Itself.” The second, What ‘Senior’ Means When Typing Is Free, follows from where this one lands. Read them together if you can.

Code Review Was a Compromise

Code review never solved the actual problem. It solved the closest problem that was solvable given the constraints of the moment.

The actual problem was always intent verification: does this change do the thing the team agreed it should do, in a way that fits the system’s vocabulary, at a boundary that’s understood by everyone downstream? That problem lives upstream of code. It lives in the specification, the scenario, the domain model, the naming choices the team made before anyone typed an implementation.

But the specification, the scenario, the domain model, these lived in someone’s head, or in a Jira ticket, or in a Confluence page that was already out of date. None of them were executable. None of them were version-controlled in a way that could be diffed. So review landed on the closest legible artifact: the diff.

The diff was the compromise. It was a shadow of intent, downstream of the decisions that actually mattered, but readable, diffable, line-commentable. Review tooling grew up around it because review tooling had to grow up around something. The pull request UI, inline comments, approval gates, CODEOWNERS files: all of it is archaeology built for the diff, because the diff was the thing you could hold.

Consider what reviewing a diff actually asks of a developer. It asks them to read an implementation and reconstruct, from code, the decision that produced it. Why this abstraction? Why this name? Why this boundary? The code doesn’t say. The developer reading it infers. A senior developer is good at this inference because they’ve spent years building up a mental model of how intent maps to implementation in a given codebase. The inference isn’t perfect, but it’s good enough. Most serious mistakes get caught. The rest get caught in production and fixed.

The inference was always the load-bearing work, not the line-reading. The diff was just the medium through which the inference happened.

The compromise was load-bearing. It wasn’t wrong to build on it. In a world where humans wrote every line, the diff was the tightest approximation of intent available. A senior developer reading a diff could reconstruct intent from the implementation well enough to catch serious mistakes. The information loss was tolerable because the reader could compensate for it.

That compensation was invisible. You couldn’t see what you were filling in. You just did it.

The compromise worked precisely because it was invisible.

The Diff Stopped Being the Specification

When an agent writes the diff, the game changes entirely.

A human-authored diff carries intent in its bones, even when that intent is poorly expressed. The variable names reflect what the developer was thinking. The structural choices carry fingerprints of the problem the developer was trying to solve. There’s a human behind the diff, and a skilled reviewer can often reconstruct the human’s reasoning from the artifact. The inference worked because there was a reasoning process to infer from.

An agent-authored diff is downstream of three things: a prompt, a test suite, and whatever contracts or mock expectations bounded the task. The diff is the output of those three inputs. It is not an approximation of the agent’s intent; it is the rendering of constraints into code. The intent lives in the constraints, not in the rendering.

There is no reasoning to infer from the diff, because the diff is not where the reasoning happened.

The contrast is stark when you hold the two artifacts next to each other. A two-hundred-line agent-generated diff, with a service class, a collaborator update, and four new test cases, tells a reviewer: here is what changed. A twelve-line scenario block, showing what situations are covered, what situations are absent, and what vocabulary was used to name each one, tells a reviewer: here is what was decided. Those are not the same document. The first is the artifact most teams review. The second is the artifact that determines whether the work is correct.

Reviewing the diff without inspecting the constraints is reading the shadow instead of the object. You can spend forty-five minutes reading two hundred lines of generated code and conclude that it looks reasonable, is formatted correctly, and passes CI. All of that is true and none of it tells you whether the task was specified correctly, whether the scenarios cover the cases that matter, whether the vocabulary introduced in the new tests is consistent with the vocabulary the team has spent two years building.

The diff looks fine. The specification was wrong. The review found nothing.

This is not a corner case. It is the default failure mode of teams that adopted agentic workflows without rethinking where they review. The volume of code generation rises. The quality of diff-level review holds steady or improves. The quality of what gets built degrades, quietly, scenario by missing scenario, vocabulary drift by vocabulary drift, until the test suite stops being a specification and becomes a collection of assertions that pass but mean nothing.

Reading the diff to verify intent is now the wrong unit of analysis.

Where Intent Now Lives

Intent migrated. It didn’t disappear; it moved upstream, into three surfaces that most teams treat as incidental rather than as primary review artifacts.

The first surface is test names. A scenario index is what a well-named test suite is. Every test name is a claim the team is making about what the system does under a specific set of conditions. The name Loyalty_members_get_a_ten_percent_discount_on_orders_over_fifty_dollars is not decoration. It is a specification. The absence of a test named Loyalty_members_with_expired_tier_status_do_not_get_the_discount is a gap in the specification, and that gap will not appear anywhere in the diff. The diff will contain the code that handles expired tiers or it won’t. The diff will not contain the claim that the team thought about expired tiers and made a deliberate choice.

The second surface is builder and factory APIs, what is usefully called the noun layer. The fluent builders your team maintains, aLoyaltyMember(), aCartReadyForCheckout(), anOrder(), are a vocabulary index for the domain. When a test reads:

var order = aCartReadyForCheckout()
    .forCustomer(aLoyaltyMember().withExpiredTier())
    .containing(aProduct().costing(60.dollars()))
    .build();

that line is doing vocabulary work. The builder API names the domain concepts your team has decided are real. When a new test introduces aLoyaltyMember().withDiscountOverride(0.15m) without a corresponding builder method, using a raw decimal instead, that’s a vocabulary drift: a concept that exists in the implementation but has no named, composed home in the test language. It will be invisible in the diff, which will show a test that passes.

The third surface is contract and interaction expectations, the verb layer. When a collaboration test specifies that the checkout service calls the discount calculator with a particular shape of request, that expectation is a claim about the system’s boundary. It defines what crossing the boundary means. A reviewer who reads that expectation can tell whether the agent understood the collaboration correctly: whether the request carries the right data, whether the response is consumed through the right abstraction, whether the mock was set up in a way that actually constrains the implementation or just greenlights any call.

Three surfaces. None of them live in the diff. All of them carry more intent than the diff does.

What a Review Looks Like Now

The sequence matters. Most teams do it backwards.

Start with test names. Skim the full list of new and modified scenario names. You are looking for gaps, not for correctness. What scenarios are present? What scenarios should be present but aren’t? Does the naming stay inside the team’s vocabulary, or has a synonym crept in that will start a split in how the team talks about the domain? A five-minute pass over test names tells you more about whether the task was specified correctly than a forty-five-minute line-by-line diff read.

A useful discipline here: before looking at any test body, write down the scenarios you would expect to see, given the task description. Then compare that list to the scenario names in the PR. The gaps are your first review comment. They are more valuable than any line-level observation, because they identify what the agent was not asked to verify, which is where the real risk lives.

Move to the builder API. Look at how existing builders were used and whether any new test setup was written by hand instead of through the builder layer. Raw object construction in test setup is almost always a signal: either the builder is missing a needed method, or the test is exploring a scenario the builder vocabulary hasn’t named yet. Both are design conversations, not implementation conversations. A raw new Customer { TierStatus = "expired" } in a test is a reviewer flag: the team hasn’t decided what to call this concept, and the agent picked the path of least resistance.

Compare that flag to what the test would look like if the vocabulary were right:

[Fact]
public void Expired_tier_members_do_not_receive_the_loyalty_discount()
{
    var order = aCartReadyForCheckout()
        .forCustomer(aLoyaltyMember().withExpiredTier())
        .containing(aProduct().costing(60.dollars()))
        .build();

    var receipt = checkout.process(order);

    receipt.discount.Should().Be(Money.Zero);
}

The builder call .withExpiredTier() is not implementation detail. It is a named claim that the team recognizes expired tier status as a first-class domain concept. The raw alternative, TierStatus = "expired" as a string literal, is the absence of that claim. The reviewer’s job is to notice the difference and ask which one the team actually wants.

Look at the contract and interaction expectations next. For each mock expectation or contract assertion introduced in the new tests, ask whether the boundary being described matches the system’s existing collaboration graph. Does the expectation reveal the right collaborator being called at the right time with the right shape? A misspecified mock expectation is one of the most expensive bugs a review can miss: it will pass CI, it will read as reasonable code, and it will leave the system’s actual boundary behavior untested.

Then, and only then, look at the diff.

At this point the diff is a confirmation, not an investigation. You are checking that the implementation plausibly corresponds to the contracts you already reviewed. You are not trying to reconstruct intent from implementation, because you already found the intent in the three surfaces that came before. The diff review is minutes, not hours.

The checklist, in order: test names first, builder API second, contract expectations third, diff last.

Everything else is archaeology.

Diff-First Tooling Is the Wrong Granularity

The pull request as a review artifact was designed for a specific workflow: a human writes code, a human reviews it. The granularity, line-by-line diff with inline comments, was calibrated for that pair. You can comment on line 47 because line 47 was where the human made a decision that can be questioned.

When an agent generates the diff, line 47 is not where the decision was made. Line 47 is where the decision was rendered. The decision was made in the test name, the builder call, the contract expectation. Those artifacts don’t have a canonical location in a diff-first review tool. They’re buried in the same flat list of changed lines as the implementation. A reviewer who finds a problematic test name has to comment on the line where the test name appears, in a file that also contains the assertions and the setup, in a review thread that will be displayed in the same UI as a comment about a typo in a docstring.

The tooling cannot distinguish between intent-bearing artifacts and implementation artifacts, because it wasn’t built to. It sees lines.

Test-suite-aware review tooling is the next frontier, and the shape of it is visible even if the implementations aren’t mature yet. The right granularity is not the line; it’s the scenario. A review surface that groups changed tests by coverage area, surfaces missing scenarios relative to the builder API’s vocabulary, and flags builder API calls that don’t use the team’s established composition patterns would give reviewers something the diff never could: a structured view of what the agent decided to specify and what it left unspecified.

The diff view would still exist. It would just be the last tab you open, not the first.

Teams that wait for the tooling to catch up will spend the intervening period doing the right review by discipline, not by structure. That’s fine. The discipline is what the tooling will eventually automate. Understanding the right review sequence now is the prerequisite for adopting the tooling when it arrives.

The Habit Lag Is the Real Bottleneck

Here is the real situation in most teams: the developers who read this post will agree. They will nod through the argument. They will leave the post with genuine conviction that test-name-first, builder-second, contract-third, diff-last is the right review sequence. Then they will open the next PR and start at the diff.

Habits are faster than convictions. The PR opens, the diff tab is there, the eyes go to the lines. The pattern was laid in over years of reviewing human-written code where the diff was, in fact, the right place to look. The hands don’t wait for the mind to remember that the situation has changed.

The stickiness of the habit is not a character flaw. It is a calibration problem. The habit was calibrated to a signal that has changed. Recalibration takes repetition in the new context.

Two practices dislodge it faster than reading alone.

The first is pairing on test review. When two people sit down to review a PR together, explicitly sequencing: test names first, here’s what we’re looking for, before the diff opens, the sequence becomes a social norm, not just a personal discipline. Social norms recalibrate habits faster than individual intentions. After a few paired reviews done in the right order, the sequence begins to feel natural. After a dozen, opening a diff before reading test names starts to feel incomplete, the same way a developer who has internalized TDD feels vaguely unsettled merging code that doesn’t have a test.

The second is explicit reviewer rotation by surface. Assign one reviewer to the test names and builder API and a different reviewer to the implementation. This structurally prevents the diff-first habit, because the person whose job is to review intent doesn’t have a diff to default to. They have to engage with the scenario index because that’s the artifact in front of them. Over time, the separation collapses back into a single reviewer who does both, but in the right order, because the separation taught them what it feels like to evaluate each surface independently.

Neither practice requires new tooling. Both require a team willing to be deliberate for long enough that the deliberateness becomes reflex.

The habit lag is the bottleneck. The argument is not the bottleneck. Most teams already know, at some level, that reviewing generated code at the line level is incomplete. They keep doing it anyway because the habit is faster than the knowledge.

Break the habit first. The knowledge will catch up.

The Review Point Moved. The Process Hasn’t.

Pull the argument back to where it started.

Reviewing the diff was a compromise. It was the best approximation of intent verification available in a world where intent lived in people’s heads and code was the closest thing to a legible record of what those people had decided. The compromise was invisible because it worked well enough, and it worked well enough because experienced developers were doing the intent-reconstruction work in their heads, invisibly, as part of reading every diff.

Agents broke the approximation. When an agent generates the diff, the intent is not reconstructable from the implementation by reading carefully. The intent is in the prompt, in the test suite, in the builder API, in the contract expectations. Those are the artifacts that constrained the agent. The diff is what the agent produced under those constraints. Reviewing the diff is reviewing the output of the constraints without inspecting the constraints.

That’s not a rigorous review. That’s a compliance check.

The posts that preceded this one in the broader series established that the test suite is your API for agents, that the bar for that API moved when agents became the primary consumers, and that rich TDD is the mechanism for keeping the API coherent. Your Test Suite Is Your API for Agents named the mechanism. The Bar for TDD Just Moved set the floor. BDD Was a Coordination Tax showed what happens when teams align on intent before implementation rather than after. This post names the consequence for review: if the test suite is the specification, then reviewing the specification is the job.

The diff is a downstream artifact. Reading it first, or reading it exclusively, is reading the shadow instead of the object.

There is a broader claim underneath this one. A team that reviews test names and builder APIs before diffs is a team that has accepted a change in what “senior” review judgment means. It no longer means the ability to reconstruct intent from implementation at high speed. It means the ability to evaluate whether a scenario index is complete, whether a vocabulary is coherent, and whether a collaboration boundary is specified precisely enough to hold under pressure. Those skills overlap with the old ones but they are not identical to them. The teams that develop them first will have a significant advantage, because those skills apply to every PR the agent generates, at a leverage that line-by-line reading can never match.

The review point moved upstream. The test names, the builder API, the contract expectations, these are where intent lives now. That’s where the review should land first.

Reviewing the diff to understand intent is now the same category of mistake as consulting the wiki to understand current behavior. It used to be the right call. The signal moved. The habit hasn’t.