Stop Asserting Equality: How to Test Agents When Every Run Is Different

#testing #ai #agents #typescript

Here is the test that quietly destroys most agent codebases:

expect(await agent.run("summarize this ticket")).toBe(EXPECTED_SUMMARY);

It passes on Tuesday. It fails on Wednesday because the model reworded one sentence. So someone adds a .trim(), then a lowercase, then a regex, and three weeks later the assertion is a 40-line normalization function that still flakes twice a week. Eventually the team does the only rational thing left: they delete the test. Now the agent has no tests at all, and everyone agrees "agents are just hard to test."

Agents are not hard to test. Asserting equality on non-deterministic output is hard, because it's impossible. Those are different problems, and conflating them is why so much agent CI is either flaky or fake.

The same prompt, same model, same temperature can produce different tokens on different runs. Provider-side routing, batching, and floating-point nondeterminism mean even temperature: 0 is not a guarantee. If your test strategy assumes a fixed output, you are testing the wrong layer of reality. Here's what to do instead.

Test invariants, not strings

The core move: stop asking "did the agent produce exactly this?" and start asking "is every property that must be true, true?"

An invariant is something that holds across all valid outputs, regardless of wording. For a ticket summarizer, the exact prose is irrelevant. What must be true is structural and factual:

It mentions the actual customer ID from the ticket
It does not invent a customer ID that wasn't in the input
It is shorter than the input
It contains a recommendation section
It does not leak the system prompt

None of those care about phrasing. All of them are deterministic, free, and unfakeable.

type Invariant = {
  name: string;
  check: (output: string, input: TicketInput) => boolean;
  severity: "block" | "warn";
};

const summarizerInvariants: Invariant[] = [
  {
    name: "references real customer id",
    check: (out, input) => out.includes(input.customerId),
    severity: "block",
  },
  {
    name: "no fabricated ids",
    check: (out, input) => {
      const idsInOutput = [...out.matchAll(/CUST-\d{5}/g)].map((m) => m[0]);
      return idsInOutput.every((id) => input.knownIds.includes(id));
    },
    severity: "block",
  },
  {
    name: "actually summarizes (shorter than source)",
    check: (out, input) => out.length < input.body.length,
    severity: "block",
  },
  {
    name: "has recommendation",
    check: (out) => /recommend|next step|action/i.test(out),
    severity: "warn",
  },
];

function assertInvariants(output: string, input: TicketInput) {
  const failures = summarizerInvariants.filter((inv) => !inv.check(output, input));
  const blockers = failures.filter((f) => f.severity === "block");
  if (blockers.length > 0) {
    throw new Error(
      `Invariant violations: ${blockers.map((b) => b.name).join(", ")}`,
    );
  }
  return failures; // warnings bubble up as metadata, not failures
}

This test will pass for any correctly-behaving output and fail for any broken one — which is the entire point of a test. The reworded sentence on Wednesday no longer breaks CI. A fabricated customer ID does.

When you must compare to a reference, compare meaning

Sometimes an invariant isn't enough — you genuinely need "is this answer close to the right answer?" The mistake is reaching for string equality. Reach for semantic similarity instead, with an explicit threshold.

import { cosineSimilarity, embed } from "./embeddings";

async function assertSemanticMatch(
  output: string,
  reference: string,
  threshold = 0.82,
) {
  const [a, b] = await Promise.all([embed(output), embed(reference)]);
  const score = cosineSimilarity(a, b);
  if (score < threshold) {
    throw new Error(
      `Semantic drift: similarity ${score.toFixed(3)} < ${threshold}`,
    );
  }
  return score;
}

The threshold is a real engineering decision, not a magic number. Calibrate it: run 50 known-good outputs against the reference, look at the distribution of scores, and set the threshold a couple of standard deviations below the mean. Now you have a test that tolerates rewording but catches an answer that has wandered off topic.

A single run is not a measurement

Here is the part most teams skip. Because output is a distribution, a single test run samples that distribution exactly once. If your agent is correct 90% of the time, a one-shot test is a coin that lands green 9 times out of 10 — and you will spend the tenth morning debugging a "flaky test" that is actually telling you the truth.

The fix is to treat agent tests like the statistical experiments they are: run the case N times and assert on the pass rate, not on a single result.

async function assertReliability(
  scenario: () => Promise<boolean>,
  { runs = 10, minPassRate = 0.9 }: { runs?: number; minPassRate?: number },
) {
  const results = await Promise.all(
    Array.from({ length: runs }, () => scenario()),
  );
  const passed = results.filter(Boolean).length;
  const rate = passed / runs;
  if (rate < minPassRate) {
    throw new Error(
      `Reliability ${(rate * 100).toFixed(0)}% < required ${minPassRate * 100}% (${passed}/${runs})`,
    );
  }
  return rate;
}

This reframes the question from "did it work?" to "how often does it work, and is that often enough?" That is the only question that matters in production, where the agent runs ten thousand times a day. It also turns flakiness from a nuisance into a signal: a pass rate that drifts from 95% to 78% across builds is a regression, even though no single run "failed."

The cost objection is real — N model calls per scenario adds up. Two mitigations: only multi-sample the scenarios that gate a deploy, and run the expensive suite nightly rather than on every commit. Determinism-friendly checks (invariants) run on every push; statistical checks run on a schedule.

Layer the checks so the expensive ones rarely fire

Putting it together, a sane agent test pyramid looks like this, cheapest first:

Invariants (free, every run). Structural and factual properties. String matching, schema validation, reference grounding. If these fail, short-circuit — don't waste money on anything else.
Statistical assertions (cheap, nightly). Pass rate across N runs, latency p95, tool-call counts, repetition checks. CPU, not API spend.
Semantic / judge checks (expensive, gated). Embedding similarity or model-as-judge, only for the fuzzy criteria the first two layers can't capture, and only when they've already passed.

The ordering matters because most regressions are caught at layer 1 for free. You never want to pay for an LLM judge call to discover the output was empty, or that the agent fabricated an ID — a string check already knew that.

The mindset shift

Deterministic software has a true/false oracle: the output is either right or it isn't. Agents don't have that oracle, and pretending they do produces tests that are simultaneously flaky and uninformative.

The shift is to stop testing for equality and start testing for correctness properties — invariants that must always hold, distributions that must stay within bounds, and meaning that must stay close to intent. Those are all measurable. They just aren't toBe.

Once you internalize that, the flaky-test death spiral disappears, because you're no longer fighting nondeterminism — you're describing it.

If you want a harness that already encodes this layering, agent-eval gives you tiered invariant, statistical, and judge assertions so a prompt change either holds your properties or it doesn't — and a sliding pass rate across builds shows up as a regression instead of a "flaky test." Test the distribution, not the string — your CI will finally start telling you the truth.