TDD in the Agentic Era · Part 2 / 3

BDD Was a Coordination Tax. AI Just Repriced It

Gherkin was never really a test format. It was a coordination protocol between specialized roles. In the AI era, where roles compress and translation is cheap, the protocol becomes friction instead of leverage.

By Travis Frisinger · April 7, 2026 · 11 min read
BDDAITeam TopologiesSoftware Delivery

Gherkin was never really a test format. It was a coordination format.

The testing part was almost incidental, a side effect of needing the artifact to be executable so it couldn’t silently drift. The actual purpose was something else: getting product, QA, and engineering to agree, in writing, on what the software was supposed to do. The feature file was a treaty document between three roles with different vocabularies, different calendars, and different incentives.

Once you see that, the decline makes sense. Treaties live and die with the coordination cost they were built to reduce. When the cost changes, the treaty becomes ceremony.

This is the second post in a three-part arc on TDD in the agentic era. The first, “TDD Already Does BDD, Without the Gherkin”, made the craft case at the code level: disciplined TDD with builders, factories, and mock-driven collaboration tests already delivers what BDD promised. This post makes the org case. Even teams that legitimately needed BDD for coordination reasons are watching that need get repriced. The third, “The Bar for TDD Just Moved”, names the new expected level and why agentic coding makes it non-negotiable. If you were paying the tax for craft reasons, the first post says stop. If you were paying it for coordination reasons, this one says notice that the coordination just got cheaper. The third says what you now need in its place.

What BDD Was Actually Solving

Go back to why BDD emerged. The problem wasn’t “how do we test behavior.” TDD had already answered that. Disciplined TDDers were writing scenario-named tests, using ubiquitous language, and framing behavior over implementation, years before Cucumber existed.

The problem BDD was actually solving was organizational.

  • Product wrote requirements in one vocabulary, usually a Word doc or a Jira epic.
  • QA wrote acceptance criteria in another, usually a test plan in a separate tool.
  • Engineering wrote code in a third, usually with variable names and domain types that didn’t match either of the other two.

Three artifacts. Three vocabularies. Three owners. No shared source of truth.

The feature file solved that by being a single document that all three roles could nominally read, edit, and sign off on. The Given-When-Then format wasn’t brilliant prose. It was a lowest-common-denominator grammar that a PM could write, a tester could extend, and an engineer could bind to code.

It was Esperanto for cross-functional teams.

Why the Treaty Worked (When It Worked)

In the right context, the overhead was worth it.

Slow-moving orgs. Specialized roles. Async handoffs. Expensive meetings that had to be scheduled two weeks out. When the coordination cost of re-aligning PM, QA, and Dev was measured in calendar days, maintaining a feature file repo was the cheap option.

Gherkin saved meeting cycles, not test cycles.

That framing reveals what the treaty was actually trading. Teams absorbed:

  • A second specification language
  • A step-definition codebase
  • Feature-file drift and maintenance
  • Runtime regex matching instead of compile-time checks
  • Slow end-to-end test suites

And in exchange they got:

  • Fewer meetings
  • Fewer handoff clarifications
  • A document product could point at when something shipped wrong
  • A document QA could extend without needing a developer
  • A coordination surface with three signatures on it

For an org where the signatures were genuinely hard to get, that trade made sense. The tests were a side benefit. The treaty was the point.

PRDs Had the Same Arc

This pattern isn’t unique to BDD.

Product Requirements Documents made sense when product sat in a different building from engineering, shipped quarterly, and handed specs over a wall. The PRD was the treaty, the thing that captured intent in a format both sides could reference when disagreements surfaced six weeks later.

Then teams co-located. Iteration cadence dropped from quarters to weeks. Product sat in the standup. The PRD didn’t disappear, it compressed. It became a Linear doc, a Notion page, a one-pager, a short RFC, an Amazon six-pager. The function persisted: capture intent so distributed readers (adjacent teams, new hires, your future self, legal, an AI agent) can reconstruct it without asking. But its role shrank dramatically. It stopped being the central treaty and became a lightweight reference, consulted occasionally, not edited constantly, often assembled and updated by an agent on demand.

Gherkin is walking the same path, and likely walking it faster. Not a sudden disappearance, but a steady compression of its role. The ritual of maintaining feature files as a living specification fades long before the last .feature file gets deleted. Teams stop extending them, stop writing new ones, and eventually keep them around as legacy artifacts the way old PRDs still live in Confluence under “Archive / 2019.”

Design docs, ticket templates, release notes, handoff docs, runbooks: every coordination artifact follows some version of the same curve. Sometimes it vanishes. More often it compresses into something lighter, faster, cheaper to maintain. It was valuable when the coordination cost it reduced was high. It becomes overhead when the cost drops, and the team reshapes the artifact to match the new cost structure.

The artifact doesn’t know the cost changed. The team has to notice.

Role Compression Changes the Math

And the cost is dropping fast.

Modern product engineers carry stories from discovery to production. Full-stack engineers absorb what used to be separate frontend, backend, and QA roles. Nimble teams of one to five engineers, often AI-augmented, ship features that would have required a cross-functional team of ten a decade ago.

This isn’t just a startup pattern anymore. Enterprises are actively restructuring toward it. The shift from large, role-specialized delivery groups to small, cross-functional squads, with cross-repo and cross-service collaboration happening at the engineering level rather than mediated through product and QA tiers, is visible across banks, insurers, logistics, and telco. The “two-pizza team” rhetoric of the last decade is becoming a “three-to-five-person squad with an agent” reality.

When a 1-5 person team carries a story from “what should this do” through “is it done” in the same working session, there is no wall to throw the treaty over. The roles that needed the lowest-common-denominator grammar are compressed into a handful of people who sit in the same conversation. The three signatures become a quick verbal check. The three vocabularies converge into one shared internal vocabulary, usually pulled directly from the domain and reflected in the code.

You don’t write treaties with the person sitting next to you.

In a compressed-role team, every round-trip through Gherkin is a re-encoding cost with almost no counterparty. You translate your shared understanding into a formal grammar so that a later version of the same team can translate it back. That’s ritual, not communication.

AI Is the Second Compression

There’s a second role compression happening on top of the first, and it matters even more.

The gap between human intent and executable code used to be a translation problem, which is exactly what Gherkin was designed to sit inside of. The PM’s words became the developer’s code through a chain of intermediate artifacts, each trying to preserve meaning as it crossed a role boundary.

AI collapses that chain.

You can tell an LLM, in the domain language you share with the business, what behavior you want, and it will write the tests, in the host language, against real domain types, wired to the actual code. The translation layer that Gherkin occupied is the model itself now. There is no step definition because there is no step to translate. The AI reads the intent, the code, and the domain model directly.

What used to require a feature file plus a step-definition library plus a Cucumber runner plus a CI job to glue it all together now requires one sentence and a test file.

That isn’t an incremental improvement. It’s the market disappearing.

BDD Is Not Gherkin

It’s worth being precise, because the strongest pushback on this post will be: “You’re conflating BDD the practice with Cucumber the tooling.”

Fair critique. They aren’t the same thing, and the separation is what lets this argument land cleanly.

BDD the practice covers discovery conversations, specification by example, Example Mapping, three-amigos refinement, ubiquitous language, and behavior framing. It survives. Maybe thrives. These describe how a team thinks together about what software should do, and thinking together matters more in a compressed-role world, not less. It just looks different now: a lean squad doing fifteen-minute refinement against a live doc, a product engineer writing concrete examples directly into tests, a solo operator example-mapping with an AI as the other amigo.

BDD the tooling covers Gherkin as a sanctioned specification language, step-definition codebases, Cucumber-style runners, living-documentation generators, and the whole artifact stack sold on the premise that non-developers will read and edit the scenarios. It doesn’t survive. Or survives in compressed form, the way PRDs survive as Linear docs. It was built to support a specific role shape with specific coordination costs, and both of those are changing.

So when this post says “BDD was a coordination tax,” the tax being repriced is the tooling and artifact overhead, not the practice. The practice never was a tax. The practice is just good product-thinking, and good product-thinking travels well across team shapes, tool chains, and AI capabilities.

The mistake a lot of BDD shops made was treating the tooling as the practice, insisting that if you weren’t writing feature files, you weren’t “doing BDD.” That conflation is what turned a valuable set of ideas into a ritual. Strip the ritual, keep the practice, and you’re left with something that composes cleanly with compressed-role teams, AI augmentation, and disciplined TDD.

What’s ending is the ritual, not the practice.

Where It Still Fits

Be honest about the exceptions.

Regulated industries (finance, medical devices, aerospace, defense) often have hard requirements that specifications be human-readable in a format non-developers can sign off on. Audit trails demand tool-neutral artifacts. Compliance regimes sometimes mandate exactly the kind of feature-file output that BDD tooling produces as a byproduct.

Genuinely siloed enterprises where role boundaries are enforced by org charts, security clearances, or regulatory structure will continue to pay the coordination tax because the alternative is worse. Even there, the direction of travel is toward smaller squads and engineer-level collaboration across repos and services, which eats away at the treaty’s premise over time.

Large cross-functional programs where product, QA, and engineering operate on different cadences and the treaty document is genuinely the cheapest synchronization mechanism available.

In those contexts, BDD tooling earns its keep. The piece that changes is the default assumption. It shouldn’t be “teams practice BDD unless they have a reason not to.” It should be “teams stop paying coordination taxes unless they have a reason to keep paying them.”

The Friction in Compressed-Role Teams

Here’s the lived experience in a 1-5 person, AI-augmented squad.

Someone brings an idea into the team channel. Two engineers and a product-minded lead talk it through in ten minutes, covering examples, edge cases, and what “done” looks like. Someone describes the behavior to an agent in plain domain language. The agent drafts tests using the team’s existing builders and factories. The tests get reviewed in the same PR as the implementation. They read like the scenarios the team would have written by hand, because they are the scenarios the team would have written by hand. Tests pass. Ship.

Now imagine inserting Gherkin into that loop.

Someone brings the idea in. The team talks it through. Then somebody opens a feature file. Translates the conversation into Given-When-Then. Writes or updates step definitions. Binds the steps to code. Runs Cucumber. Maintains the feature file as the code evolves. Keeps the step-definition library in sync. Generates living documentation that the team already covered in standup.

Every one of those steps is a translation the team didn’t need and no longer has a distant counterparty to translate for. The team is already in the same conversation. They’ve already aligned on vocabulary. They review each other’s code. They pair and mob when something’s unclear. The grammar designed for three signatures across three calendars now gets three signatures across one standup, and asks the team to maintain a parallel specification language to prove it.

That’s larping as a cross-functional program.

What This Means for Team Shape

The deeper point isn’t about Gherkin. It’s about every artifact encoded for a role boundary that no longer exists.

Jira ticket schemas with fifteen required fields. PR templates that demand a test plan, a rollout plan, and a risk assessment. Release checklists designed for a separate ops team. Sprint retros for teams that don’t work in sprints. Design review forums for organizations where every engineer does design. Architecture decision record templates built for five-person architecture committees.

Every coordination artifact deserves the question: who is this for, now that the team shape has changed?

If the answer is “nobody, but we’ve always done it this way,” that’s dead weight. In a slow-moving era it’s tolerable dead weight. In an era where a compressed team plus an AI can out-ship a ten-person org, it’s fatal.

Team shape is downstream of coordination cost. Artifacts are downstream of team shape. When the cost collapses, the shape compresses, and the artifacts become overhead.

The Repricing

BDD frameworks weren’t wrong. They were priced for a market that’s going away.

In slow, siloed orgs the math still works. Coordination is expensive. Treaties are cheaper than meetings. Gherkin earns its seat. In compressed, AI-augmented orgs the coordination it was pricing has collapsed to near-zero, and the overhead buys you nothing. The grammar becomes ritual. The step definitions become a second codebase to maintain for an audience of one.

The craft of specifying behavior survives all of this. Test data builders, factories, scenario-named tests, ubiquitous language, mock-driven collaboration tests: those are independent of org shape. They work in a cross-functional team of twelve and they work in a solo operator with an agent. They compose, they refactor, they evolve with the code.

The treaty documents don’t.

BDD’s contribution was real. Its reasons-for-being were specific, local, and tied to a moment in how teams were organized. That moment is ending for a lot of us. When it ends for you, notice, and stop paying the tax.


Next in this arc: The Bar for TDD Just Moved. What agentic coding expects from your tests, and why the coordination artifacts could only fall away because the test suite got strong enough to absorb their load.