TDD Buddy - Examples Pin Intent. Properties Pin the Invariants.

Examples pin intent. Properties pin the invariants.

That sentence contains two distinct ideas that most teams collapse into one. The collapsing is the mistake. Example-based tests and property-based tests are not two flavors of the same activity. They are two axes of specification, and a codebase with only one axis is incompletely specified regardless of how many examples exist or how well-named they are. Agents do not compensate for incompleteness. They compose against what is there, and they fail quietly where there is nothing.

The Bar for TDD Just Moved established the floor: scenario-rich, builder-driven, domain-typed tests written in the team’s vocabulary. That floor is real and it matters. This post argues that the floor is not a ceiling. The floor describes one axis. There is a second axis that the floor does not address and that nearly every test suite, however well-crafted on the first axis, is missing. The second axis is not harder than the first. It is just different. Different enough that most teams have never named it, never budgeted for it, and never noticed it was absent until an agent made the absence visible by failing in a way no example caught.

This post names the two axes, explains why agents need both, and shows what the vocabulary looks like when both are present.

Examples Are for Naming Intent

A scenario test names a concrete case in the team’s language. That act of naming is load-bearing. Before the test exists, the behavior lives only in someone’s head or in a ticket that nobody will read in six months. After the test exists, the behavior is executable, refactorable, and discoverable by anyone who reads the suite.

[Fact]
public void Loyalty_members_get_a_ten_percent_discount_on_orders_over_fifty_dollars()
{
    var order = anOrderContaining(twoBooks())
        .forCustomer(aLoyaltyMember());

    var receipt = checkout.process(order);

    receipt.discount.Should().Be(order.subtotal * 0.10m);
}

Read that name. An agent browsing the test suite can answer “what does the system do for loyalty members on qualifying orders?” without reading any implementation. A new team member can find the canonical definition of the loyalty discount rule in one search. The builder vocabulary (aLoyaltyMember, anOrderContaining, twoBooks) pulls every piece of setup into the domain’s own nouns. The assertion is typed: receipt.discount is a Money, not a decimal, not a double, not a magic number floating in context.

The scenario is a named point in the input space. “This input configuration produces this output.” That is the full semantic content of an example-based test, and that semantic content is worth having. Without named scenarios, agents compose against nameless cases and the vocabulary never settles. The first post in the TDD arc made the craft case in detail: builders, factories, and domain types are the grammar of behavioral specification. The claim stands. Scenarios are the vocabulary contract the team uses to speak about behavior across disciplines and across time.

Named scenarios do more than verify outcomes. They document decisions. The test name Loyalty_members_get_a_ten_percent_discount_on_orders_over_fifty_dollars encodes three decisions: who qualifies (loyalty members), what the threshold is (fifty dollars), and what the rate is (ten percent). When any of those decisions changes, the test name changes. When the test name changes, the diff tells the story. The history of the test suite is the history of the system’s behavioral decisions, and that history is searchable, compilable, and verifiable.

Examples are not optional. They are the named entries in the specification index.

They are also, by construction, finite. The input space is not.

Examples Cannot Verify the Manifold

A function is not a list of cases. It is a mapping over a space. The space for even a simple discount calculation includes every combination of member tier, order total, promotion state, item count, currency, and rounding mode that could plausibly arrive at the checkout service. The scenario above pins one point in that space. The scenario that covers the edge at exactly fifty dollars pins a second point. A thorough example suite pins twenty, fifty, a hundred points.

The space has dimensions that a hundred named points cannot cover.

This is not a complaint about test quantity. More scenarios help, and every scenario named in domain language is worth writing. The point is structural: examples verify what the author thought to enumerate. They do not verify what the author did not think of. The gap between what was enumerated and what is possible is exactly where silent failures live, and it is the gap that agents most reliably exploit.

Consider the failure mode. An agent asked to add a promotional stacking rule modifies the discount calculation. The modification is plausible: it reads the surrounding tests, infers the shape of the calculation, extends it with the new promotion logic, and verifies that the existing scenarios still pass. The agent has no particular reason to probe the interaction between the new promotion and the edge cases the original author did not name, because those cases do not appear anywhere in the surrounding tests. The build passes. The PR opens. The regression ships to production.

This is not a failure of agent capability. It is a failure of specification coverage. The example suite defined what mattered by what it included, and the agent composed against that definition faithfully. The problem is that the definition was incomplete: it named the cases the author thought to enumerate, and left the rest of the space to chance.

Human reviewers used to partially compensate for this incompleteness. A senior engineer reviewing the PR would notice the stacking logic and ask “what happens if both discounts are at maximum values?” That question came from experience: the reviewer had seen discount stacking bugs before. The question is not in the test suite. It lives in the reviewer’s head, gets asked in a comment, gets answered with a manual check, and then disappears from the record when the PR merges. The next engineer to touch the calculation has no evidence that the question was ever asked.

An agent reviewing the same PR does not have the senior engineer’s pattern memory. It checks the tests. The tests pass. The calculation looks structurally similar to the calculations around it. The PR is fine.

The unbounded region of the input space is not covered by more examples. It is covered by a different kind of claim.

A Property Is a Constraint on the Search Space

A property-based test does not say “this input produces this output.” It says “across all valid inputs, this relationship holds.” The quantifier moved from existential to universal. That shift is the entire difference between the two axes, and it changes what the test can catch.

Consider the discount calculation. An example says “a loyalty member with a sixty-dollar order gets a six-dollar discount.” A property says something more durable:

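[Property]
public Property Discount_is_always_non_negative_and_never_exceeds_order_subtotal()
{
    // Return the Property so the runner executes it; a Prop.ForAll whose
    // result is discarded inside a void method is never evaluated.
    return Prop.ForAll(
        anyLoyaltyMember(),
        anyOrderWithSubtotalAbove(50.dollars()),
        (member, order) =>
        {
            var receipt = checkout.process(
                anOrder().forCustomer(member).containing(order.items));

            // Assumes Money implements IComparable<Money>.
            receipt.discount.Should().BeGreaterOrEqualTo(Money.zero)
                .And.BeLessOrEqualTo(order.subtotal);
        });
}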

The generator anyLoyaltyMember() does not return one member. It returns a source over the space of valid members, varying tier, tenure, promotional state, and any other dimension the type can carry. The generator anyOrderWithSubtotalAbove(50.dollars()) does the same for orders. The property-based runner feeds many generated combinations through the assertion on every run: one hundred by default in FsCheck, the library whose Prop.ForAll API these examples use, and configurable into the thousands. When any combination violates the assertion, the framework reports the failing inputs and, where the generator's Arbitrary supplies a shrinker, shrinks them toward the smallest inputs that still cause the failure. The counterexample is concrete.

The assertion does not check a specific value. It checks a relationship: the discount is at least zero and at most the order subtotal. That relationship is an invariant: a fact about the function that must hold regardless of which specific inputs arrive. Encoding the invariant as a test means every generated combination that violates it becomes a failing test with a reproducible counterexample.

Other invariants live in every codebase. Decoding an encoded value should recover the original. Sorting a list should not change its length. Merging two permission sets should produce a result that is a superset of both. The running total across line items should equal the sum of individual totals. Applying a discount before tax and applying it after tax should produce the same final total, if the domain says so. These are not scenarios. They are structural claims about the function’s geometry. Writing them as properties makes them executable specifications that no enumeration could have covered.
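Two of those invariants can be sketched without any property-testing library. The block below is a minimal illustration, not the FsCheck style used elsewhere in this post: the Invariants class, its method names, the seed, and the input sizes are all hypothetical, and Base64 stands in for whatever codec the domain actually uses. A plain loop over System.Random plays the role of the property runner.

```csharp
using System;
using System.Linq;

// Check both invariants against 1,000 randomly generated byte arrays.
var rng = new Random(42);
for (var i = 0; i < 1000; i++)
{
    var input = new byte[rng.Next(0, 64)]; // length 0..63, an arbitrary choice
    rng.NextBytes(input);

    if (!Invariants.RoundTrips(input))
        throw new InvalidOperationException(
            "Round trip failed for " + BitConverter.ToString(input));

    if (!Invariants.SortPreservesLength(input))
        throw new InvalidOperationException(
            "Sorting changed the length of " + BitConverter.ToString(input));
}
Console.WriteLine("Both invariants held for 1000 random inputs.");

// Hypothetical helper class; each method states one invariant as a predicate.
public static class Invariants
{
    // decode(encode(x)) equals x, with Base64 as the stand-in codec.
    public static bool RoundTrips(byte[] input) =>
        Convert.FromBase64String(Convert.ToBase64String(input)).SequenceEqual(input);

    // Sorting a sequence does not change how many elements it has.
    public static bool SortPreservesLength(byte[] input) =>
        input.OrderBy(b => b).Count() == input.Length;
}
```

Neither predicate mentions a specific input. That is the difference in kind: the loop supplies the inputs, and the predicate only states what must hold for all of them.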

The design work in writing a property is naming the invariant. That naming is where most of the value lives. The act of asking “what does this function owe to the domain regardless of input?” is a different kind of question than “what should this function produce for this specific input?” It is a question about the function’s contract with the system, not its behavior in a named scenario. Teams that have written properties report that the design question is the hard part and the code is easy. The code is a generator and an assertion. The question is a claim about the domain.

The right vocabulary for properties is: constraints, invariants, relationships, and laws. The right vocabulary for examples is: scenarios, cases, and named intents. The two vocabularies are not interchangeable because the underlying things are not interchangeable. A property is not a generalized example. An example is not a special case of a property. They are answers to different questions.

Agents Need Both Axes

An agent working in a codebase with only example-based tests will generate example-based tests. This is not a criticism of agents. The agent composes against the vocabulary and patterns it finds. If the surrounding tests are all scenarios, the agent writes scenarios. Scenario thinking produces more scenarios. The agent is doing exactly what it should.

Property thinking requires something different: knowing the invariants the domain owes to itself. That knowledge is not present in an example-based test suite. The examples show what happens in specific named cases. They do not say what must always be true regardless of case. An agent reading a suite of well-named scenarios can infer a great deal about behavior. It can learn the vocabulary. It can learn the canonical cases. It cannot infer the universal quantifiers that were never written down, because those quantifiers require a claim about the entire input space, and that claim is not present anywhere in a suite of named points.

The practical consequence is asymmetric. Agents are reliable at extending the named-cases axis and unreliable at covering the invariant axis. They will add a new scenario for the new feature because the pattern for scenario tests is legible in the surrounding code. They will not add a property for the new invariant because there is no property-test pattern in the surrounding code to follow, and there is no expressed invariant to follow even if the pattern were there.

This is where the agent’s blind spot lives: in the gap between what the named cases assert and what the input space permits. The agent does not know that discounts must be non-negative unless that invariant is expressed somewhere. The agent does not know that the stacking order of promotions must be commutative unless that invariant is expressed somewhere. The agent does not know that totaling across items must equal the aggregate total unless that invariant is expressed somewhere. Without properties, those invariants are invisible, and invisible invariants get violated by code that passes every named test.

This was always true for human authors too. The senior engineer who knew to ask about stacking interactions was compensating for the missing property with experience and intuition. The problem is that the compensating knowledge never made it into the test suite, so the next author, human or agent, started from scratch. The senior engineer’s question lived in a PR comment for thirty days and then became invisible. The property lives in the test suite forever.

A two-axis specification closes the blind spot at the source rather than patching it at review time. The example tests tell any reader what the canonical cases are named and what they produce. The property tests tell any reader what must hold across the full input space. Together they describe the function completely: its named points and its global constraints. A function described that way can be composed against, extended, and refactored with confidence. A function described only by its named points can be extended in ways that satisfy all the names and violate the space, and no reviewer, human or agent, is likely to catch the violation.

Builders Compose With Property Generators

The builder vocabulary that example-based tests already use does not need to be replaced to support properties. The grammar is the same. The quantifier moves.

aLoyaltyMember() returns a single instance configured for the canonical loyalty member scenario. It is used in scenario tests where the specific configuration matters. anyLoyaltyMember() returns a generator over the space of valid loyalty members. It is used in property tests where the relationship must hold across the entire space. Both are defined in the same test-support layer, next to each other, sharing the same understanding of what a valid loyalty member is.

// Scenario builder (example-based axis)
public static LoyaltyMember aLoyaltyMember() =>
    new LoyaltyMember(
        tier: MemberTier.Standard,
        tenureMonths: 18,
        promotionState: PromotionState.None);

// Property generator (invariant axis)
// Note: Arb.From(gen) attaches no shrinker, so failures are reported unshrunk
// unless a shrink function is passed as a second argument.
public static Arbitrary<LoyaltyMember> anyLoyaltyMember() =>
    Arb.From(
        Gen.Elements(
            MemberTier.Standard,
            MemberTier.Gold,
            MemberTier.Platinum)
        .SelectMany(tier =>
            Gen.Choose(1, 120).SelectMany(months =>   // tenure in months, inclusive
                Gen.Elements(
                    PromotionState.None,
                    PromotionState.Active,
                    PromotionState.Suspended)
                .Select(state =>
                    new LoyaltyMember(tier, months, state)))));

The method names follow a consistent a/any prefix convention. a means “give me the canonical one.” any means “give me a generator over the space of valid ones.” A developer reading the test layer knows immediately which axis each method serves. An agent reading the test layer inherits the same convention: when writing a scenario, reach for a; when writing a property, reach for any.

This is not a new API. It is an extension of the existing API into the second axis. The domain nouns are identical. The generators compose the same way the builders do. anyOrderWithSubtotalAbove(50.dollars()) is built from anyOrder() filtered by subtotal constraint, the same way anOrderContaining(twoBooks()) is built from anOrder() configured with items. The grammar the team learned for scenario composition transfers directly to property composition. The vocabulary investment compounds.
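The a/any pairing can be sketched without a library to show how a constrained generator is built from an unconstrained one. Everything in this block is an assumption made for illustration: the Sampler delegate is a stand-in for a real generator type (it is not FsCheck's Gen), the Order record carries only item prices, and the price and item-count ranges are arbitrary. The filter is plain rejection sampling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Draw 100 orders from the constrained sampler and confirm the constraint holds.
var rng = new Random(7);
var sampler = TestSupport.anyOrderWithSubtotalAbove(50m);
for (var i = 0; i < 100; i++)
{
    var order = sampler(rng);
    if (order.Subtotal <= 50m)
        throw new InvalidOperationException("Filter let through " + order.Subtotal);
}
Console.WriteLine("All sampled orders had a subtotal above 50.");

// Hypothetical stand-in for a generator: a function from a random source to a value.
public delegate T Sampler<T>(Random rng);

// Hypothetical, simplified order: just a list of item prices.
public sealed record Order(IReadOnlyList<decimal> ItemPrices)
{
    public decimal Subtotal => ItemPrices.Sum();
}

public static class TestSupport
{
    // Scenario axis: one specific order, configured by the caller.
    public static Order anOrderContaining(params decimal[] prices) => new Order(prices);

    // Invariant axis: any order with 1..5 items priced 1.00..100.00 (assumed ranges).
    public static Sampler<Order> anyOrder() => rng =>
        new Order(Enumerable.Range(0, rng.Next(1, 6))
            .Select(_ => rng.Next(100, 10001) / 100m)
            .ToArray());

    // Built from anyOrder() by rejection: resample until the constraint holds.
    public static Sampler<Order> anyOrderWithSubtotalAbove(decimal threshold) => rng =>
    {
        var source = anyOrder();
        while (true)
        {
            var candidate = source(rng);
            if (candidate.Subtotal > threshold) return candidate;
        }
    };
}
```

Rejection sampling is fine when the constraint accepts a large share of candidates, as it does here. For a narrow constraint, constructing valid values directly is the better design, because a filter that rejects almost everything makes each run slow.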

The anyLoyaltyMember() generator is also the most authoritative definition of what a valid loyalty member is in the test context. The scenario builder aLoyaltyMember() returns a canonical example. The generator returns the space: all member tiers, any tenure between one and one hundred and twenty months, any promotion state the type can carry. If the domain adds a new member tier, the generator needs the new tier and the scenario layer needs a new named case. The generator makes the update visible in a different way: once the new tier is in the generator, if it has a distinct discount behavior that violates the invariant, the property test will find a counterexample. The scenario suite will not, because nobody thought to name the new-tier scenario yet.

The vocabulary investment in the scenario layer is not redundant with the generator layer. The scenario layer names the cases. The generator layer covers the space. Both are built from the same domain nouns, maintained in the same file, and read by the same agents.

The Bug That Examples Miss Every Time

The claim that properties catch bugs examples miss is not hypothetical. Here is the pattern, precise enough to reproduce.

A discount calculation is written to apply the loyalty discount first, then any promotional discount stacked on top. The implementation is a simple sequential reduction: compute the loyalty discount, subtract it from the subtotal, compute the promotional discount on the reduced subtotal, subtract that. The example suite covers loyalty-only orders, promotional-only orders, and loyalty-and-promotion orders with specific dollar amounts. All named tests pass. The code ships.

Three months later, a marketing campaign introduces a high-value promotional discount for new members who are also in the loyalty program. The discount can reach sixty percent. The calculation is extended to apply the promotional discount first, then the loyalty discount, and in the rewrite each discount is computed against the original subtotal rather than the running total. The named tests still pass because none of them pairs a promotional rate and a loyalty rate that sum to more than one hundred percent of the subtotal. The code ships.

Four weeks after that, a subset of orders is producing receipts where the combined discount exceeds the order subtotal, resulting in negative totals. Negative totals are impossible by definition: the checkout flow cannot owe a customer money for buying books. The bug is the stacking logic: two percentage discounts, each computed on the original subtotal and then summed, exceed the subtotal whenever the rates add up to more than one hundred percent. Had each discount been taken from the running total, the combined discount on a subtotal S with rates a and b would be S(1 - (1 - a)(1 - b)), which cannot exceed S for rates between zero and one. Neither discount is wrong in isolation. The combination is wrong.

A property test would very likely have caught this the first time the extended stacking logic ran under test, provided its generators reach the high promotional rates:

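[Property]
public Property Total_discount_never_exceeds_order_subtotal_regardless_of_promotion_stack()
{
    // Return the Property so the runner executes it; a Prop.ForAll whose
    // result is discarded inside a void method is never evaluated.
    return Prop.ForAll(
        anyLoyaltyMember(),
        anyOrderWithActivePromotion(),
        (m, o) =>
        {
            var receipt = checkout.process(
                anOrder().forCustomer(m).containing(o.items).withPromotion(o.promotion));

            // Assumes Money implements IComparable<Money>.
            receipt.discount.Should().BeLessOrEqualTo(o.subtotal);
            receipt.total.Should().BeGreaterOrEqualTo(Money.zero);
        });
}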

The generator anyOrderWithActivePromotion() generates orders with active promotions across the full range of configured discount rates, including the high-value rates the marketing campaign introduced. The first time the stacking logic produces a combined discount exceeding the subtotal, the framework surfaces a concrete counterexample, shrunk toward a minimal case where the generators support shrinking. It might be, for instance, a gold-tier member with eighteen months tenure and an active promotion at sixty percent, on an order with a fifty-five dollar subtotal. The assertion fails. The developer sees the counterexample. The bug is fixed before the code ships.

The example suite did not enumerate this combination because nobody on the team thought to name it when writing the original tests. The property did not need to enumerate it because the property expressed the constraint. The generator found the violation by sampling the space at random, reaching combinations nobody thought to name, without the exhaustive enumeration that is not possible.
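The arithmetic behind this class of bug can be checked with a dependency-free sketch. The block below is an illustration under stated assumptions, not the checkout implementation: the Stacking class, both discount functions, the seed, and the rate ranges (loyalty up to fifty percent, promotion up to eighty percent) are hypothetical. It contrasts summing two discounts that are each taken from the original subtotal with applying the second discount to the running total, and runs the same random search for violations against both.

```csharp
using System;

// Run the same random search against both stacking strategies.
var additive = Stacking.FindViolation(Stacking.AdditiveDiscount, seed: 1, trials: 1000);
var sequential = Stacking.FindViolation(Stacking.SequentialDiscount, seed: 1, trials: 1000);

Console.WriteLine(additive is null
    ? "additive: no violation found"
    : $"additive: violated at {additive.Value}");
Console.WriteLine(sequential is null
    ? "sequential: no violation found"
    : $"sequential: violated at {sequential.Value}");

// Hypothetical stand-ins for two ways of stacking percentage discounts.
public static class Stacking
{
    // Both rates apply to the original subtotal and the discounts are summed.
    public static decimal AdditiveDiscount(decimal subtotal, decimal loyaltyRate, decimal promoRate) =>
        subtotal * loyaltyRate + subtotal * promoRate;

    // The second rate applies to what is left after the first discount.
    public static decimal SequentialDiscount(decimal subtotal, decimal loyaltyRate, decimal promoRate)
    {
        var afterLoyalty = subtotal - subtotal * loyaltyRate;
        var afterPromo = afterLoyalty - afterLoyalty * promoRate;
        return subtotal - afterPromo;
    }

    // Returns the first sampled input where the discount is negative or exceeds
    // the subtotal, or null if none of the sampled inputs violates the invariant.
    public static (decimal Subtotal, decimal LoyaltyRate, decimal PromoRate)? FindViolation(
        Func<decimal, decimal, decimal, decimal> discount, int seed, int trials)
    {
        var rng = new Random(seed);
        for (var i = 0; i < trials; i++)
        {
            var subtotal = rng.Next(5001, 50001) / 100m; // 50.01 to 500.00
            var loyaltyRate = rng.Next(0, 51) / 100m;    // 0% to 50% (assumed range)
            var promoRate = rng.Next(0, 81) / 100m;      // 0% to 80% (assumed range)

            var d = discount(subtotal, loyaltyRate, promoRate);
            if (d < 0m || d > subtotal)
                return (subtotal, loyaltyRate, promoRate);
        }
        return null;
    }
}
```

With a fifty-five dollar subtotal and hypothetical rates of fifty and sixty percent, the additive version yields a discount of 60.50, which is more than the order is worth, while the sequential version yields 44.00. The sequential search never reports a violation for rates between zero and one. The additive search reports one as soon as it samples two rates that sum past one hundred percent.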

The property also survives future promotions. When the next marketing campaign introduces an eighty-percent discount for platinum members, the property runs again against the new logic without any modification to the test, as long as the generator draws from the configured rates. The constraint is expressed once. The verification reruns automatically against whatever inputs the generators can produce.

This is the second axis earning its keep.

Two-Axis Specs Are What Agents Compose Against

A function is fully specified when two conditions hold. Its canonical cases are named in domain language as executable scenarios, so any reader can understand what the function does on the important days and verify that it does that for those inputs. Its invariants are pinned as executable properties, so any reader can understand what the function promises across the full input space and verify that the promise is never violated.

Examples without properties leave the search space unverified. Properties without examples leave the vocabulary nameless. Both conditions must hold. Neither is sufficient alone, and the order does not matter: a codebase with rich properties but no named scenarios is navigable by a verifier and opaque to anyone trying to understand what the system is for. A codebase with rich named scenarios but no properties is intelligible and fragile at the boundaries.

The reason this matters for agents is not that agents are careless. It is that agents compose against explicit specifications. An example-based suite makes the named cases explicit. A property-based suite makes the invariants explicit. An agent working against both has the complete specification. An agent working against only examples has the vocabulary but not the rails. The rails are where agentic boundary mistakes happen, because the boundary is exactly the region the examples did not cover.

The Bar for TDD Just Moved named the floor: scenario-rich, builder-driven, domain-typed tests. That floor is necessary. The argument here is that the floor is not the ceiling. Above the floor is the second axis. Property tests are not an advanced technique for library authors or specialists. They are the invariant layer of every specification that includes a calculation, a state transition, a codec, a constraint, or any function where a relationship matters more than a specific output.

Every calculation has invariants. Totals are non-negative. Rates are bounded between zero and one. Reversible transformations reverse. Every state machine has invariants. Valid transitions produce valid states. Terminal states do not produce transitions. Every encoder has invariants. decode(encode(x)) equals x. Every merge operation has invariants: the result contains everything from both inputs, in the right precedence order. These invariants do not appear in example suites unless someone wrote a property for them. Nobody wrote a property because the property-test pattern was not present in the codebase to follow. The agent did not write a property because agents extend the patterns they find.

Two axes break the cycle. When anyLoyaltyMember() lives beside aLoyaltyMember() in the test-support layer, the pattern is present. When the codebase has properties alongside scenarios, agents encountering new calculations have a model to follow. When the vocabulary includes both the a-prefix canonical-case convention and the any-prefix generator convention, the grammar is teachable in a single prompt and learnable from a single glance at the test-support module.

A team that builds both axes is not doing more work than necessary. It is building the specification the system actually requires. The scenarios carry the vocabulary. The properties carry the constraints. Together they define what a function is: not just what it produces on named days, but what it owes the domain on every day, for every input the domain can send.

A specification with both axes is what an agent can compose against without a human standing guard at the boundary.

Examples pin intent. Properties pin the invariants. The system with only one axis is half-specified, and the half that is missing is the half that fails quietly.