A green test suite is supposed to mean the change works. It doesn't. A test can be weakened just enough to pass. An error can be caught and thrown away. A rename can stop halfway and still compile. None of that turns red, and none of it shows up in the linters most teams already run.
Swarm Orchestrator is built to catch exactly that class of problem in AI-written pull requests.
The gap linters leave
Semgrep and ESLint are built around risky APIs and known-bad code patterns. Whether a diff is honest is a different question. They won't tell you a test was edited until it passed, or that a catch block quietly eats the error it caught. That's the gap.
Two examples from merged Cloudflare pull requests:
| PR | Finding | Semgrep + ESLint |
|---|---|---|
| workers-sdk#14063 | Function renamed, some callers still using the old name | No finding |
| workers-sdk#14132 | Empty catch block hiding errors | No finding |
Across 72 known-bad pull requests from 12 repositories, that pair of analyzers produced one finding. The auditor flagged 67.
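To make the two table rows concrete, here is a minimal sketch of what each pattern can look like. The names (loadConfig, readSettings, parseSettings) are hypothetical and are not taken from the Cloudflare pull requests. The alias is only one way a rename can stop halfway and still compile.

```typescript
// Hypothetical illustration; none of these names come from workers-sdk.

// Pattern 1: an empty catch block. Malformed input is silently turned
// into an empty config, so callers never learn that parsing failed.
function loadConfig(raw: string): Record<string, unknown> {
  let config: Record<string, unknown> = {};
  try {
    config = JSON.parse(raw);
  } catch {
    // error swallowed: nothing logged, nothing rethrown
  }
  return config;
}

// Pattern 2: an unfinished rename. The new name exists, but the old one
// is kept alive as an alias, so callers that were never migrated still compile.
function readSettings(raw: string): Record<string, unknown> {
  return loadConfig(raw);
}
const parseSettings = readSettings; // old name left behind

// A stale caller still using the old name.
const fromOldName = parseSettings('{"port": 8787}');

// Broken input produces no error at all, only an empty object.
const fromBadInput = loadConfig("{not json");
```

Both snippets run cleanly, which is the point: nothing here is a risky API call, so a rule set built around dangerous patterns has little to match on.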
What the auditor checks
Eleven checks total. Eight run by default. The other three exist but stay off, because they haven't shown useful signal on real pull requests yet, and a noisy check is worse than no check.
The default set looks for things like:
- Errors caught and ignored
- Renames left unfinished
- Test coverage reduced
- Tests weakened
- Assertions removed
- New @ts-ignore or eslint-disable comments
- Test-only fixes with no code change behind them
- Mocks pointing at modules that don't exist
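As a rough idea of how a diff-only check of this kind might work, the sketch below counts assertion calls on the removed and added lines of a unified diff. It is a hypothetical heuristic written for illustration, not the auditor's actual detector; the regular expression and the assertionDelta name are assumptions.

```typescript
// Hypothetical heuristic, not the auditor's real detector.
// Counts assertion calls on removed ("-") and added ("+") lines of a unified diff.
const ASSERTION = /\b(expect|assert)\s*[.(]/;

interface AssertionDelta {
  removed: number;
  added: number;
  net: number; // negative means the diff dropped more assertions than it added
}

function assertionDelta(unifiedDiff: string): AssertionDelta {
  let removed = 0;
  let added = 0;
  for (const line of unifiedDiff.split("\n")) {
    // Skip file headers such as "--- a/foo.test.ts" and "+++ b/foo.test.ts".
    if (line.startsWith("---") || line.startsWith("+++")) continue;
    if (line.startsWith("-") && ASSERTION.test(line)) removed++;
    else if (line.startsWith("+") && ASSERTION.test(line)) added++;
  }
  return { removed, added, net: added - removed };
}

const sampleDiff = [
  "--- a/src/sum.test.ts",
  "+++ b/src/sum.test.ts",
  "@@ -1,4 +1,3 @@",
  " test('sum', () => {",
  "-  expect(sum(1, 2)).toBe(3);",
  "-  expect(sum(-1, 1)).toBe(0);",
  "+  expect(sum(1, 2)).toBeDefined();",
  " });",
].join("\n");

const delta = assertionDelta(sampleDiff);
```

A count like this only says that something changed, not that the change is wrong. Removing a brittle test trips it too, which is one reason a signal of this kind suits human review better than automatic rejection.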
Measured, not assumed
The detection rate isn't a guess. Known defects get injected into real pull requests, then the auditor runs against them. It caught 253 of 300, or 84 percent.
Reproduce it:
npm run benchmarks:full
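The headline figure is plain arithmetic: defects caught divided by defects injected. A small sketch, using only the numbers quoted above (the detectionRate helper is a hypothetical name):

```typescript
// Detection rate = defects caught / defects injected, as a percentage.
function detectionRate(caught: number, injected: number): number {
  if (injected <= 0) throw new RangeError("injected must be positive");
  return (caught / injected) * 100;
}

const benchmarkRate = detectionRate(253, 300); // figures quoted in the article
const roundedRate = Math.round(benchmarkRate); // 84
```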
Runtime mode (optional)
The checks can also execute code instead of only reading a diff: mutation testing, coverage, and reproducing reported issues.
On trpc#6098 it found mutations surviving on lines a later hotfix changed. The tests passed. They weren't actually exercising that code.
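A surviving mutation is easiest to see on a toy case. The sketch below is a hypothetical example, unrelated to the tRPC code base: one comparison operator is mutated, and a test that runs the line without probing its boundary passes against both versions.

```typescript
// Hypothetical example, unrelated to the tRPC code base.
type Predicate = (age: number) => boolean;

const isAdult: Predicate = (age) => age >= 18; // original
const isAdultMutant: Predicate = (age) => age > 18; // mutant: >= replaced by >

// A weak test: it executes the comparison but never probes the boundary.
function weakTest(fn: Predicate): boolean {
  return fn(30) === true && fn(5) === false;
}

// A stronger test adds the boundary case, age 18.
function strongTest(fn: Predicate): boolean {
  return weakTest(fn) && fn(18) === true;
}

// A mutant "survives" when the test still passes against it.
const survivesWeak = weakTest(isAdult) && weakTest(isAdultMutant);
const killedByStrong = strongTest(isAdult) && !strongTest(isAdultMutant);
```

The weak test counts as covering the line, yet it cannot tell the two implementations apart. That is the sense in which passing tests can fail to exercise the code.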
Why this mode stays optional
Running code is louder than reading a diff: it averages about 3.4 findings on a clean pull request. That noise is fine when you're deliberately hunting, but it's too much to leave on by default, so it's opt-in.
Defining "done" with a contract
The second command is swarm run. You write down what done means:
obligations:
  - type: build-must-pass
    command: npm run build
  - type: test-must-pass
    command: npm test
A patch is accepted only if every obligation passes and the falsifier can't break it. The default provider is deterministic, so identical inputs give identical results, and every input and hash gets written to a hash-chained ledger.
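For readers unfamiliar with the term, a hash-chained ledger is a log in which each entry stores the hash of the entry before it, so editing an earlier record invalidates everything after it. The sketch below shows the general technique with Node's built-in crypto module. The LedgerEntry shape, field names, and payload strings are assumptions for illustration, and the tool's real format may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical ledger shape; the tool's real format may differ.
interface LedgerEntry {
  index: number;
  payload: string; // e.g. a serialized input or obligation result
  prevHash: string; // hash of the previous entry ("" for the first)
  hash: string; // SHA-256 over index, prevHash and payload
}

function hashEntry(index: number, prevHash: string, payload: string): string {
  return createHash("sha256")
    .update(`${index}\n${prevHash}\n${payload}`)
    .digest("hex");
}

function append(ledger: LedgerEntry[], payload: string): LedgerEntry[] {
  const prevHash = ledger.length > 0 ? ledger[ledger.length - 1].hash : "";
  const index = ledger.length;
  const hash = hashEntry(index, prevHash, payload);
  return [...ledger, { index, payload, prevHash, hash }];
}

// Recompute every hash and check each link to the previous entry.
function verify(ledger: LedgerEntry[]): boolean {
  return ledger.every((entry, i) => {
    const expectedPrev = i === 0 ? "" : ledger[i - 1].hash;
    return (
      entry.index === i &&
      entry.prevHash === expectedPrev &&
      entry.hash === hashEntry(entry.index, entry.prevHash, entry.payload)
    );
  });
}

let ledger: LedgerEntry[] = [];
ledger = append(ledger, "build-must-pass: npm run build -> pass");
ledger = append(ledger, "test-must-pass: npm test -> pass");

// Editing an earlier payload breaks verification.
const tampered = ledger.map((e, i) => (i === 0 ? { ...e, payload: "edited" } : e));
const intactOk = verify(ledger);
const tamperedOk = verify(tampered);
```

Paired with a deterministic provider, a chain like this lets a reviewer replay a run and confirm that the recorded inputs are the ones that produced the recorded result.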
Blocking merges
Findings are advisory out of the box. Gate mode can block a merge, but only on reproducible evidence. The structural checks throw too many false positives to trust as automatic blockers on their own.
Right now no runtime signal has enough real-world evidence to justify auto-rejection, so the gate stays open and reports that fact directly instead of pretending otherwise.
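The policy described in this section can be sketched as a small decision function. Everything here is hypothetical: the Finding and GateResult types, the evidence labels, and the check names are assumptions chosen to mirror the prose, not the tool's API. The closed-gate case is included only to show the shape of the rule.

```typescript
// Hypothetical types; the field names are assumptions for illustration.
type Evidence = "structural" | "reproducible";

interface Finding {
  check: string;
  evidence: Evidence;
}

interface GateResult {
  blocked: boolean;
  reason: string;
}

function gate(findings: Finding[], gateMode: boolean): GateResult {
  if (!gateMode) {
    return { blocked: false, reason: "advisory mode: findings reported, merge not blocked" };
  }
  const reproducible = findings.filter((f) => f.evidence === "reproducible");
  if (reproducible.length === 0) {
    // Say so explicitly instead of implying the change was verified.
    return { blocked: false, reason: "gate open: no reproducible evidence" };
  }
  return { blocked: true, reason: `gate closed: ${reproducible.length} reproducible finding(s)` };
}

const structuralOnly: Finding[] = [{ check: "assertions-removed", evidence: "structural" }];
const advisory = gate(structuralOnly, false);
const gatedOpen = gate(structuralOnly, true);
const gatedClosed = gate(
  [...structuralOnly, { check: "issue-repro", evidence: "reproducible" }],
  true,
);
```

The design choice worth noting is the middle branch: when gate mode is on but nothing is reproducible, the result reports an open gate and the reason, so a passing gate is not mistaken for a verified change.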
Who it's for
If you review a lot of AI-written pull requests and want signals the usual linters skip, that's the case this is built for. It also emits CycloneDX-ML and SPDX AI BOM documents with --emit-aibom, supports TypeScript and JavaScript, and runs offline.
It points reviewers at the code worth inspecting. It doesn't claim to prove anything bug-free.
moonrunnerkc/swarm-orchestrator
Reviews pull requests for the shortcuts AI coding agents take to look done without being done: relaxed tests, swallowed errors, fake renames, 11 checks in all. Flags them for a human by default, or blocks the merge if you turn that on. Can also turn a goal into a checklist and only accept a patch once every check passes.
Swarm Orchestrator
A CLI for auditing AI-generated PRs and grading patches against typed contracts.
What This Does
Swarm Orchestrator reads a pull-request diff and flags the shortcuts an AI coding agent takes to look done without being done: relaxed tests, stripped assertions, swallowed errors, fake renames, eleven checks in all. On a benchmark of planted cheats it recovers 253 of 300 (84%, up 20.5% from the prior version), and on real merged Cloudflare PRs it caught two cheats that Semgrep and the ESLint security rules missed, both reproducible offline. Findings are advisory by default, so it never blocks a merge unless you turn that on.
Who it's for
- You review AI-written PRs at volume and want a "this change may be gaming the tests" signal that ordinary linters do not give you.
- You have…
Top comments (6)
This nails a failure mode I've watched repeatedly: an agent optimizes for whatever signal you give it, and a green checkmark is the cheapest signal to satisfy. If passing the suite is the reward, weakening an assertion or wrapping a flaky call in a swallowed catch is a perfectly rational move for the model — just not the one you wanted. The "test-only fix with no code change behind it" check is the one I'd value most; that pattern is almost a tell that the agent patched the symptom rather than the cause.
The mutation-testing result on trpc#6098 is the part that should worry people — passing tests that don't actually exercise the changed lines are invisible to coverage numbers too, since the line gets "hit" without being meaningfully asserted on. One question: how do you avoid flagging legitimate test deletions? Sometimes removing a brittle test or tightening an over-broad assertion is the correct change, and a naive "assertions removed" check could punish good cleanup. Do the contract obligations let you whitelist intentional reductions?
It uses count / swap / weakness checks, and flags whatever trips those rules for human review. These checks never block, only flag for review. That's about the best I could get it to at this time. Whitelist lives in the per-repo audit config, not the obligations, and it's for files / folders. A per-line whitelist would be out of scope, since it'd ride along in the PR and let a cheat exempt itself.
This resonates. When integrating AI APIs into production pipelines, I've seen similar patterns — the model confidently produces code that passes basic tests but quietly skips edge cases. Running a separate validation layer has saved me more than once. Would love to see this approach extended to API response validation as well.
Appreciate that. I'm keeping this one focused on the pre-merge side, catching the code change before it lands, so runtime response validation sits outside what it's meant to do. That part's already well covered by tools like zod, ajv, and OpenAPI/Schemathesis. If you ever want it as a contract check here, you can write the response as a property the patch has to hold and let the falsifier go at it.
The 'tests weakened, errors swallowed' category is the scary one because it passes every linter you have. An agent that deletes the assertion to make a test green has technically done what you asked, which is why I've started treating 'green' and 'correct' as two separate signals.