Sunil Prakash

Posted on May 23 • Originally published at jamjet.dev

I tested 4 AI agent-governance tools against an open spec - here's the matrix

#security #ai #agents #opensource

The scenario

Your AI agent just deleted a customer record. Three months later, an auditor asks you to prove:

What tool actually ran (not "the agent made a deletion call" — the precise tool, version, and capability)
With what arguments (the exact customer ID, scoped fields, options — byte-for-byte)
Who approved it (which human, or which automated policy rule)
Against which version of which policy (the literal policy bundle the runtime evaluated, not "the policy at the time, probably")
Whether it actually succeeded (not "we said allow", but "the downstream system confirmed the row is gone")

You open your audit log.

It says: delete_customer approved, run_id=xyz, decision=allow. The arguments are in a different table. The policy version isn't recorded anywhere — you'd have to git log your settings file. The execution outcome lives in your application logs, which roll over after 30 days. And the auditor has no way to verify any of this without an engineer walking them through every join.

This gap shows up the moment an agent does something consequential and a non-engineer needs to understand what happened. It's the same gap regardless of which framework you used. Approval is not proof.

What's actually missing

The pattern across every agent-governance tool I looked at is the same: they're built around the decision (allow / deny / require-approval) and treat the action itself as an implementation detail. So the audit log records "the policy fired" but not a single record carrying everything a third party needs to reconstruct what actually happened.

A useful audit artifact has to survive the following:

It can be verified without trusting the runtime that produced it. If your auditor has to call your engineers to interpret the log, the log is testimony, not evidence.
The arguments and the decision are cryptographically bound. If args mutate between approval and execution, the audit must show it.
The policy version is in the record. Not "the policy at the time" — the literal bundle identifier.
The execution outcome is in the record. Approval ≠ execution. Both belong in the same artifact.
The chain of receipts is tamper-evident. Deleting a row from history must break something a verifier can detect.

A receipt that does all five becomes a single evidence record you can hand to an auditor, regulator, insurer, or a compliance team six months later — without them needing access to your database, your cloud creds, or your engineering team.

What I built

AgentBoundary is an open spec for that kind of receipt. v0.1 is stable; v0.2-alpha (draft) adds the optional provenance block and the singly-linked chain (the prior_receipt field in the example below). Both versions use the same JSON document shape with a deterministic schema, and each receipt is hash-bound to its arguments.

Here's one a Discord agent I run in production emitted on 2026-05-21 — it files GitHub issues on behalf of users:

{
  "version":      "agentboundary/v0.2-alpha",
  "receipt_id":   "f04df972-f9fc-4624-83cb-0ed3682297cf",
  "issued_at":    "2026-05-21T06:54:39.251Z",

  "actor": {
    "type":         "agent",
    "id":           "agent:jambot:discord:user:aa74fa40751b528f"
  },

  "tool":   { "name": "github-rest", "version": "2022-11-28", "capability": "github.issues.create" },
  "target": { "system": "github.com/jamjet-labs/jamjet-discord-bot", "environment": "prod" },

  "arguments_hash":  "2d257d4e72f62afa112766154b9b5ac0dd98ae79ee7c2758563a4363a0fb4bdf",
  "policy":          { "name": "jambot.file-issue.v1", "version": "1", "decision": "allow" },
  "execution":       { "status": "success", "completed_at": "2026-05-21T06:54:40.103Z", "result_ref": "github://issues/1" },

  "prior_receipt":      { "receipt_id": "cab5eff7-…", "receipt_hash": "3e7f5a93…" },
  "completeness_score": 0.913,
  "receipt_hash":       "..."
}

A verifier with only this JSON — no database, no Fly.io credentials, no GitHub token, no Discord session — can run six independent checks:

Tamper-evidence. Re-canonicalise the body without receipt_hash, take SHA-256, confirm it matches the stored hash.
Argument binding. Re-canonicalise the original arguments separately, take SHA-256, confirm it matches arguments_hash. If anything mutated between approval and execution, this fails.
Spec compliance. Fetch the public JSON Schema, validate the receipt structurally.
Chain integrity. Fetch the receipt at prior_receipt.receipt_id and confirm its hash matches the link.
Emitter honesty. Recompute completeness_score from the provenance block using the deterministic formula in the spec. Catches an emitter that lies about how confident it was in each field.
Execution proof. Follow execution.result_ref to a real downstream artifact (in this case, a public GitHub issue) and read it.
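
The first two checks above can be sketched in a few lines of Python. This is a minimal illustration, not the reference verifier: the helper names (canonicalize, sha256_hex, seal) are hypothetical, and the canonical form used here (sorted keys, compact separators, UTF-8) is an assumption that only approximates a real canonicalization scheme. The spec's own canonicalization rules take precedence.

```python
import hashlib
import json
from typing import Any, Dict


def canonicalize(value: Any) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    # This does not handle number-formatting or key-ordering edge cases that
    # a full canonicalization standard covers.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sha256_hex(value: Any) -> str:
    return hashlib.sha256(canonicalize(value)).hexdigest()


def seal(body: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical emitter-side helper: hash the body without receipt_hash,
    # then attach the result as receipt_hash.
    body = {k: v for k, v in body.items() if k != "receipt_hash"}
    return {**body, "receipt_hash": sha256_hex(body)}


def check_tamper_evidence(receipt: Dict[str, Any]) -> bool:
    # Check 1: recompute the hash over everything except receipt_hash.
    body = {k: v for k, v in receipt.items() if k != "receipt_hash"}
    return sha256_hex(body) == receipt.get("receipt_hash")


def check_argument_binding(receipt: Dict[str, Any], arguments: Any) -> bool:
    # Check 2: the arguments are hashed on their own and compared to the
    # arguments_hash stored in the receipt.
    return sha256_hex(arguments) == receipt.get("arguments_hash")


if __name__ == "__main__":
    # Made-up arguments and a cut-down receipt, for illustration only.
    arguments = {"title": "Crash on start", "labels": ["bug"]}
    receipt = seal({
        "version": "agentboundary/v0.2-alpha",
        "tool": {"name": "github-rest", "capability": "github.issues.create"},
        "arguments_hash": sha256_hex(arguments),
    })
    print(check_tamper_evidence(receipt))              # True
    print(check_argument_binding(receipt, arguments))  # True

    # Arguments that changed after approval no longer match the stored hash.
    mutated = {"title": "Crash on start", "labels": []}
    print(check_argument_binding(receipt, mutated))    # False
```

Two verifiers only agree on these checks if they canonicalize identically, so a real implementation should use whatever algorithm the spec pins rather than the shortcut above.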

How existing tools do against the bar

I built one adapter per vendor — translating their normative artifact (or, where they don't have one, the developer-recommended capture shape) into an AgentBoundary v0.2-alpha receipt. Then I ran all 40 conformance scenarios against the adapter-produced receipts.

Vendor	PASS	PARTIAL	DOCS-ONLY	NOT COVERED	N/A
JamJet reference	40	0	0	0	0
Anthropic permission_policy	12	9	3	14	2
Cloudflare HITL Agents	5	7	1	25	2
LangSmith Gateway	15	14	1	8	2
Microsoft AGT	17	5	1	15	2

Reference implementation first; vendors alphabetical. Not ranked. The PASS counts collapse meaningful categorical differences. Each vendor is solving for a different layer of the stack:

Anthropic's permission_policy is the richest runtime evaluation pipeline of the four — layered hooks, scoped tool patterns, permission modes, the canUseTool callback. But the audit log from Anthropic's Managed Agents Console isn't a published schema, so there's no portable artifact a third party can verify. That's why 3 DOCS-ONLY (highest of any vendor) and 14 NOT COVERED.
Cloudflare HITL is a workflow primitive — durable approval gates with multi-day windows and external notifications. It's deliberately not an emitted-artifact format. The 25 NOT COVERED reflects that their recommended audit table is 6 columns and doesn't model the things conformance is asking about.
LangSmith is an observability platform. The Run object captures the data, but where it lives in the Run varies by team convention: one team puts the decision in tags, another in feedback_stats. A cross-team auditor can't reliably extract it. That's why 14 PARTIAL.
Microsoft AGT is the closest peer — also an artifact format, also designed for verifiable evidence, with a Merkle-chained audit log that's structurally stronger than AgentBoundary's current singly-linked design. The 15 NOT COVERED rows are deliberate scoping decisions, not bugs.

Per-vendor breakdowns with structural reasoning live in adapters/<vendor>/results.md in the public repo.
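
To make the adapter idea concrete, here is a rough sketch of what translating an approval-log row into a receipt looks like, and why fields end up uncovered. Everything about the input row is invented for illustration (the column names are not any vendor's actual schema), and to_receipt and unmapped_fields are hypothetical helpers, not code from the repo. The field list is simply the top-level keys of the example receipt shown earlier.

```python
from typing import Any, Dict, List

# Top-level fields that appear in the example receipt earlier in this post.
EXAMPLE_FIELDS = [
    "version", "receipt_id", "issued_at", "actor", "tool", "target",
    "arguments_hash", "policy", "execution", "prior_receipt",
    "completeness_score", "receipt_hash",
]


def to_receipt(row: Dict[str, Any]) -> Dict[str, Any]:
    # Map only what the source row actually records. Anything the row does
    # not carry is left out rather than guessed.
    return {
        "version": "agentboundary/v0.2-alpha",
        "receipt_id": row["id"],
        "issued_at": row["created_at"],
        "tool": {"name": row["tool_name"]},
        "policy": {"name": row["policy_name"], "decision": row["decision"]},
    }


def unmapped_fields(receipt: Dict[str, Any]) -> List[str]:
    # Fields from the example receipt that the adapter could not populate.
    return [field for field in EXAMPLE_FIELDS if field not in receipt]


if __name__ == "__main__":
    # A made-up approval-log row, not taken from any real product.
    row = {
        "id": "row-42",
        "created_at": "2026-05-21T06:54:39Z",
        "tool_name": "delete_customer",
        "policy_name": "example-policy",
        "decision": "allow",
        "run_id": "xyz",
    }
    print(unmapped_fields(to_receipt(row)))
```

A source record built around the decision alone leaves the argument hash, the execution outcome and the chain link with nothing to map from, which is the shape behind many of the NOT COVERED cells in the matrix.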

Where AgentBoundary itself currently falls short

The reference implementation scoring 40/40 against its own spec is the implementation grading itself. That's meaningful but not sufficient.

JamBot's emitter mutates receipts on approval-finalize. When a maintainer approves a held action, the existing row's execution.status is updated in place and receipt_hash is recomputed — which breaks chain links from any later receipt whose prior_receipt.receipt_hash was captured before the mutation. Fix queued for v0.2.
The chain is singly-linked, not Merkle. AGT's design (every entry commits to every preceding one) catches arbitrary-entry-reordering attacks that v0.2-alpha would miss. v0.3 candidate.
provenance is a 3-value enum where AGT uses a float in [0.0, 1.0]. Simpler to reason about, coarser in practice. v0.3 candidate if practitioner feedback warrants it.
No second non-reference implementation yet. Only one production deployment (JamBot). A second emitter in Rust, Go, or Java would validate the spec is implementation-portable.
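
The in-place mutation problem in the first item is easy to reproduce in miniature. The sketch below is a simplified model, not JamBot's emitter: the emit and broken_links helpers, the "held" status value, and the sorted-keys hashing are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def receipt_hash(receipt: Dict[str, Any]) -> str:
    # Hash the receipt body without its own receipt_hash field.
    # Assumed canonical form: sorted keys, compact separators.
    body = {k: v for k, v in receipt.items() if k != "receipt_hash"}
    data = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def emit(receipt_id: str, status: str, prior: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # Hypothetical emitter: link to the prior receipt by id and hash.
    receipt: Dict[str, Any] = {
        "receipt_id": receipt_id,
        "execution": {"status": status},
    }
    if prior is not None:
        receipt["prior_receipt"] = {
            "receipt_id": prior["receipt_id"],
            "receipt_hash": prior["receipt_hash"],
        }
    receipt["receipt_hash"] = receipt_hash(receipt)
    return receipt


def broken_links(chain: List[Dict[str, Any]]) -> List[str]:
    # Return the ids of receipts whose prior_receipt link no longer matches
    # the receipt it points at (or points at a missing receipt).
    by_id = {r["receipt_id"]: r for r in chain}
    broken = []
    for r in chain:
        link = r.get("prior_receipt")
        if link is None:
            continue
        prior = by_id.get(link["receipt_id"])
        if prior is None or receipt_hash(prior) != link["receipt_hash"]:
            broken.append(r["receipt_id"])
    return broken


if __name__ == "__main__":
    first = emit("r1", "held", None)       # "held" is a made-up status value
    second = emit("r2", "success", first)  # captures first's hash at this moment
    print(broken_links([first, second]))   # []

    # Approval-finalize updates the earlier receipt in place and re-hashes it.
    first["execution"]["status"] = "success"
    first["receipt_hash"] = receipt_hash(first)
    print(broken_links([first, second]))   # ['r2']
```

The mutated receipt still passes its own tamper-evidence check, because its hash was recomputed. Only the later receipt's link exposes the change, which is why the failure shows up as a broken chain rather than a bad receipt.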

These are also in the report's §8.

Run the suite yourself

npx agentboundary run scenarios/
# or
uvx agentboundary run scenarios/

60 seconds on a clean machine. No signup, no Docker, no account. Scenarios are at jamjet-labs/agentboundary/scenarios. If your results disagree, open an issue with the exact command and your environment — the suite is reproducible; if it isn't on your machine, that's a bug.

What I want from this post

If you maintain an agent-governance product and any of the per-scenario mappings are wrong: open a PR against adapters/<your-product>/. Right-to-respond issues are filed against all four vendors; windows close 2026-05-28 to 2026-05-30 and corrections are folded in inline.
If you're integrating agents into a regulated stack (finance, healthcare, infrastructure ops): try the suite against your own runtime. Emitting an AgentBoundary receipt from your existing audit log is usually a few hundred lines.
If you already have an audit format: map one of your real audit rows to the conformance scenarios and open an issue where the suite misrepresents your model. Concrete corrections are far more useful than general feedback. AGT and AgentBoundary's design centres are complementary; the two specs could reasonably converge.

Full report with the per-vendor deep-dives at jamjet.dev/blog/agent-action-control-40-tests. Canonical archive on the spec microsite at agentboundary.jamjet.dev/reports/2026-05-comparative.

Spec is Apache 2.0. Implementations welcome.

Top comments (4)

TxDesk • May 25 • Edited

Solid work, and the matrix-against-conformance-scenarios approach is the only credible way to evaluate this space. Most agent-governance writing stays at the principles layer because grounding principles in tests is hard, and you've done the hard part.

Two technical observations from reading the spec, not the suite (I haven't run it yet, will do this week and file issues if I find anything concrete):

The completeness_score check is interesting but limited in what it actually catches. Re-deriving the score from the provenance block detects inconsistency between the score and the block, but an emitter that lies consistently (a low-confidence field marked high in the block AND high in the score) passes the check. The score is internally-derived from data the same emitter controls. The honest version is that completeness is a self-reported quality signal, useful for filtering but not adversarially robust. If that's already acknowledged in the spec wording, fine. If not, worth being explicit because a regulator reading the spec might assume the check is stronger than it is.

Argument binding via re-canonicalize-and-hash is the right primitive, but the spec needs to pin the canonicalization algorithm by name and version, not "re-canonicalise". JSON canonicalization is a footgun zone (RFC 8785 vs ad-hoc implementations diverge on number representation, Unicode normalization, key ordering for non-ASCII keys, escape behavior). Two spec-compliant verifiers using different canonicalization libraries can disagree on whether the same receipt is valid. Worth naming JCS / RFC 8785 explicitly if that's what you're using, or whichever variant the reference implementation enforces.

The singly-linked vs Merkle tradeoff in §8 is the honest framing. For most regulated stacks the singly-linked design is sufficient because tampering requires editing a contiguous range, but adversaries who can pause receipt emission and reorder entries within a window can exploit the gap. AGT's design closes it. Worth keeping as a v0.3 candidate.

Sunil Prakash • May 30

Thanks, this was the most useful feedback I got. You're right on both.

Good catch - and it was a real bug. The spec recommended RFC 8785, but the code was actually using plain json dump, which isn't the same thing. This is now fixed.

One nice surprise while fixing it: the JS side (JamBot) was already correct, because JSON.stringify happens to match RFC 8785. Python was the only one off, so every existing public receipt still verifies fine.

On completeness_score: agreed. If an emitter lies consistently (marks a made-up field as observed), it still passes the check. It's a self-reported signal, not a real guarantee. I'll open an issue to brainstorm.

If you run the suite, tag me on anything you open.

TxDesk • May 31

Glad the RFC 8785 fix was clean. The JSON.stringify-happens-to-match-RFC-8785 thing is a fun coincidence and a small landmine someone will rediscover later.

On completeness_score: the self-report failure mode is the one I think eats the field eventually. A lying emitter passes today; an emitter that omits a field it doesn't want scored also passes. Two failure shapes, same root cause. The fix is probably orthogonal to your conformance matrix - needs an external observer that can independently verify the field was actually present in the source before the score is meaningful. Hard problem.

I'll bookmark the suite and ping you if I run it.

ancilis • Jun 2

The completeness_score self-report failure mode raised in the comments is the structural version of the whole problem: producer and attester are the same party. An agent writing its own receipt can lie or omit. The fix the field will eventually reach is to generate the receipt outside the agent, at the call boundary, by something the agent doesn't control.

Awesome work, though. The whole idea of "show me the receipts" is absolutely critical in regulated spaces; those receipts just need to have unquestioned integrity.