Arqam Waheed

Posted on May 30

I Made My AI Models Argue, Then Let Hermes Be the Judge

#hermesagentchallenge #devchallenge #agents #ai

This is a submission for the Hermes Agent Challenge: Build With Hermes Agent

TL;DR — Ask any judgment call and three different AI models argue it out, then Hermes hands down one verdict, a confidence score, and exactly why they split. Every verdict, dissent, and mind-changed-in-debate is written into Hermes' own memory, so the next question re-weights the jurors before they ever vote. The judging is a pure function over that memory: no memory, no weights, no verdict. Three models, one verdict, $0.

What I Built

An LLM once talked me into the wrong database with total confidence. One smooth, authoritative answer. I shipped it. It cost me a weekend and a migration I'm still not over.

The villain here is single-model overconfidence: you get one polished reply, and the disagreement that should have warned you is invisible. You never see the other opinions, because you only asked one model.

So I stopped trusting one model. I convened a jury.

Council takes any judgment call ("Postgres or Mongo?", "is this PR safe to merge?", "is this clause risky?") and asks three different models, lets them disagree, then has Hermes deliver one verdict, a confidence score, and exactly why they split. Three models, one verdict, $0.

You ask a question. Council fans it out to three jurors (two free OpenRouter models from different families and one local model via Ollama), and each takes a position with reasons. Then, if they disagree, a second deliberation round runs: each juror sees the others' answers and either holds or changes its mind, so the council debates instead of just voting once. Hermes then judges the deliberated opinions: a single verdict, a confidence score (high when they agree, low when they split 2-1), and a "why they disagreed" panel. Every verdict is remembered, a council skill learns which juror to trust for which kind of question, and the agent can even propose its own trust adjustments for you to approve.

The whole product is one question box. Everything interesting happens behind it, and the rest of this post is mostly pictures of that "behind."

Demo

Repo: https://github.com/ArqamWaheed/council

Live demo: https://council-jet-kappa.vercel.app/
Hermes orchestration is local-only (no Hermes binary on serverless); the hosted demo runs the same UI via OpenRouter/mock. Run locally for the real hermes -z path.

Try "Should a 3-person startup use microservices?" and open the dissent panel.

Local, one command (runs at $0 in offline mock mode, no key needed):

git clone https://github.com/ArqamWaheed/council && cd council && ./setup_hermes.sh && python server.py

Architecture, in pictures

The design is easiest to take in visually, so here's the system as a sequence of images. Each caption is the explanation.

The core loop. One question, three independent Hermes subagents (2 hosted + 1 local) fanned out in parallel, then a fourth Hermes run (the foreman) synthesizes one verdict. Every arrow is the same hermes -z interface; nothing talks to a model directly.

The bet. A hosted model and an on-device model sit on the same jury, swapped with a single --provider/--model flag, no code change. This model-agnosticism is the one Hermes property the whole project is built on.

The UX surface. Confidence is high when jurors agree and drops on a 2-1 split. The dissent panel is collapsed by default, and you expand it exactly when the confidence number makes you nervous.
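As a rough illustration of that confidence rule, the sketch below scores a verdict by the share of jurors in the largest agreeing group. The function name and the exact mapping from agreement to a number are assumptions for illustration, not Council's actual scoring code.

```python
from collections import Counter

def verdict_and_confidence(positions):
    """Return (winning position, confidence in [0, 1]) for a list of juror positions.

    Assumption: confidence is simply the fraction of jurors who back the winner.
    """
    if not positions:
        raise ValueError("need at least one juror position")
    counts = Counter(positions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(positions)

if __name__ == "__main__":
    print(verdict_and_confidence(["postgres", "postgres", "postgres"]))
    print(verdict_and_confidence(["postgres", "mongo", "postgres"]))
```

Under this assumption a unanimous panel scores 1.0 and a 2-1 split scores about 0.67, which is the drop the dial is meant to make visible.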

The actual product. A confident single answer hides this; Council makes the disagreement the headline. Getting the clustering right here was subtle (see "What I learned" below).

The headline feature: a council that deliberates, not just votes. After round 1, disagreeing jurors get a second Hermes pass where they read each other's arguments and may hold or change their vote. A "⇄ changed" badge marks the ones that moved, and the confidence dial actually climbs when a 2-1 split is talked into agreement.

The agentic learning loop, human-in-the-loop. Hermes proposes; you approve or dismiss. Approved rules persist client-side and ride along with the next convene call.

Persistence the judge can verify. Verdicts are mirrored into Hermes' own memory, so recall is Hermes doing the work; proof lives in docs/hermes-proof/04-memory-recall.txt.

Code

Repo: https://github.com/ArqamWaheed/council

Interesting files:

hermes_run.py (the Hermes CLI driver every juror/judge call goes through)
run_council.py (orchestration + the deterministic judge + Hermes foreman + the --reflect loop)
skills/council/SKILL.md (the juror-weighting brain Hermes edits)
server.py (the /api/reflect + /api/learn endpoints)
index.html (the designed verdict UI with the foreman TTS readout and localStorage persistence).

Proof that Hermes is genuinely in the loop (subagent transcripts, skill diff, memory recall) is in docs/hermes-proof/.

# hermes_run.py: every juror/judge call is a real Hermes run
import subprocess

def ask(prompt, provider, model, skills=None, timeout=120):
    # binary() is not shown in this excerpt; it supplies the first element of the command
    cmd = [binary(), "--provider", provider, "--model", model]
    if skills:
        cmd += ["--skills", skills]
    cmd += ["-z", prompt]                       # -z = one-shot, final answer on stdout
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout

# jurors.py: fan out one Hermes subagent per juror, in parallel
from concurrent.futures import ThreadPoolExecutor

# roster() and ask_juror() are not shown in this excerpt
with ThreadPoolExecutor(max_workers=len(roster())) as pool:
    opinions = list(pool.map(lambda c: ask_juror(*c), enumerate(roster())))

How I Used Hermes Agent

Why Hermes at all: the model-agnostic core. Hermes lets you point at any provider and swap with a flag, no code change. Council is built on top of that one property: the jurors are different models, and Hermes is the only piece that makes "different models" cheap. The clearest proof is the third juror: it runs locally via Ollama while the other two are hosted on OpenRouter, and all three answer through the exact same hermes -z interface (the model-agnostic diagram above). A hosted model and an on-device model, sitting on the same jury, no code change: that's model-agnosticism you can see. I genuinely didn't see another entry in this challenge exploit it; everyone picked one model and moved on. That's the whole bet.

Subagents: one real Hermes run per juror. Each juror is a genuine, isolated Hermes invocation on a different provider+model (hermes -z --provider openrouter --model … for the two hosted jurors, --provider ollama-local … for the on-device one), fanned out in parallel so no model's reasoning anchors another's (the convene-flow diagram above). Hermes does the inference; my Python (jurors.py calling into hermes_run.py) is just the fan-out plumbing, and every juror in the output JSON is tagged "via": "hermes". The gotcha worth flagging: Hermes enforces a 64K-context floor, which for the local model meant setting both ollama_num_ctx and a named custom_providers entry; without the named provider, --provider ollama silently routed to the wrong base URL. setup_hermes.sh encodes the working config so a judge can reproduce it in one command.

A true debate, not just a vote (round 2 is real Hermes work). This is the feature I'm proudest of. After round 1, if the jurors disagree, each one gets a second Hermes run that shows it the others' positions and lead reasons and asks it to hold or change its mind. Real jurors reconsider through the same hermes -z path as round 1, so the debate is genuine extra agentic work, not a UI flourish; mock jurors reconsider deterministically so the offline demo stays reproducible. The judge then synthesizes the verdict from the deliberated opinions, so a juror that's talked round actually moves the outcome (the deliberation diagram above). It's gated on disagreement (a unanimous round 1 skips it) and toggled with COUNCIL_DEBATE=0.
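The gating described above can be sketched in a few lines. This is a minimal illustration, not the code in run_council.py: the `deliberate` function, its return shape, and the `reconsider` callback (standing in for the second Hermes run) are all hypothetical. Only the COUNCIL_DEBATE=0 toggle and the skip-on-unanimity rule come from the post.

```python
import os

def debate_enabled(env=None):
    """Round 2 is on unless COUNCIL_DEBATE is set to "0"."""
    env = os.environ if env is None else env
    return env.get("COUNCIL_DEBATE", "1") != "0"

def deliberate(opinions, reconsider, env=None):
    """Run one reconsideration pass when jurors disagree.

    opinions:   dict of juror name -> round-1 position
    reconsider: hypothetical callable(juror, own_position, others) -> position,
                standing in for the second Hermes run
    Returns (round-2 opinions, list of jurors who changed their vote).
    """
    unanimous = len(set(opinions.values())) <= 1
    if unanimous or not debate_enabled(env):
        return dict(opinions), []
    revised = {}
    for juror, position in opinions.items():
        # Each juror sees everyone's round-1 position except its own.
        others = {j: p for j, p in opinions.items() if j != juror}
        revised[juror] = reconsider(juror, position, others)
    changed = [j for j in opinions if revised[j] != opinions[j]]
    return revised, changed
```

The `changed` list is what a "⇄ changed" badge would be driven by in this sketch; the judge then works from `revised`, not from the round-1 opinions.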

Why a skill, not a prompt, for judging. The foreman's verdict is itself a Hermes run (hermes -z --skills council) grounded in skills/council/SKILL.md, which is installed into Hermes (hermes skills list shows it). The weighting logic lives in a machine-readable weights block.

The judging brain is data, not a buried prompt. --learn and --reflect both edit this block, and the installed Hermes copy is kept in sync.

After a string of security questions, --learn appended a rule to upweight the local model on that topic (and synced the installed Hermes copy) because it had caught issues the hosted models missed:

python run_council.py --learn "Local Juror | security | 1.5"

On the next security question that juror's vote counts 1.5×, read straight back by the judge. Counterfactual: a static synthesis prompt can't get better; this does. (The before/after skill diff is in docs/hermes-proof/03-skill-learning.txt.)
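To make the 1.5× concrete, here is a small sketch of a weighted tally. The "Juror | topic | weight" rule string matches the --learn argument above; everything else (function names, the default weight of 1.0, how rules are stored and looked up) is an assumption for illustration, since the actual weights block in SKILL.md isn't reproduced in this post.

```python
from collections import defaultdict

def parse_rule(rule):
    """Split a 'Juror | topic | weight' string into (juror, topic, weight)."""
    juror, topic, weight = (part.strip() for part in rule.split("|"))
    return juror, topic, float(weight)

def weighted_verdict(votes, topic, rules):
    """Tally votes, scaling each juror by its rule for this topic (default 1.0).

    votes: dict of juror name -> position
    rules: list of 'Juror | topic | weight' strings
    Returns (winning position, dict of position -> total weight).
    """
    weights = {(j, t): w for j, t, w in map(parse_rule, rules)}
    totals = defaultdict(float)
    for juror, position in votes.items():
        totals[position] += weights.get((juror, topic), 1.0)
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

if __name__ == "__main__":
    votes = {"Hosted A": "ship", "Hosted B": "ship", "Local Juror": "hold"}
    rules = ["Local Juror | security | 1.5"]
    print(weighted_verdict(votes, "security", rules))
```

With three jurors, a lone dissenter at 1.5× narrows the margin (2.0 vs 1.5) without overturning the other two on its own, so in this sketch the rule shows up mainly as lower confidence on that topic.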

Letting the agent propose its own learning, now on the web and grounded in evidence. python run_council.py --reflect (and the "Should the council reweight itself?" button in the UI) hands Hermes its own memory of past verdicts and asks it to propose one weight change, e.g. "the local juror has dissented on three database calls; upweight it." The key fix this round: the proposal is evidence-grounded, since Hermes is fed the actual dissent tally and any rule backed by fewer than two real dissents is rejected, so it can't just parrot the example baked into the skill. You then Approve or Dismiss it (the reflect-flow diagram above). That's the agentic loop done honestly: a single verdict has no ground truth, so the agent surfaces a pattern and a human confirms it's signal, not overfitting (the exact tension this post closes on). (Offline, it falls back to a deterministic heuristic so it never breaks.)
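The evidence gate is simple enough to sketch. The two-dissent threshold comes from the description above; the record shape (`topic`, `dissenters`) and the function names are hypothetical stand-ins for whatever the verdict log actually stores.

```python
from collections import Counter

MIN_DISSENTS = 2  # from the post: fewer than two real dissents means rejection

def dissent_tally(history):
    """Count dissents per (juror, topic) across past verdicts.

    history: list of dicts with 'topic' and 'dissenters' (assumed shape).
    """
    tally = Counter()
    for verdict in history:
        for juror in verdict["dissenters"]:
            tally[(juror, verdict["topic"])] += 1
    return tally

def accept_proposal(proposal, history):
    """Accept a proposed reweighting only if the log backs it with enough dissents."""
    key = (proposal["juror"], proposal["topic"])
    return dissent_tally(history)[key] >= MIN_DISSENTS
```

A proposal that merely echoes the example in the skill fails this check unless the log really contains the dissents, which is the point of feeding Hermes the tally.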

Making learning survive a stateless deploy. On a hosted demo the filesystem is read-only, so an approved rule can't be written back to SKILL.md. Council handles this honestly: approved rules are stored in the browser's localStorage and re-sent with every /api/convene call, where they're merged into the judge's weights for that request. Locally you get a persistent SKILL.md; on the web you get per-browser persistence, and either way the learning sticks.
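A sketch of that per-request merge, under stated assumptions: the payload shape (a list of `{"juror", "topic", "weight"}` objects), the function name, and the clamp range are all invented here for illustration. Because the rules arrive from the browser, the sketch validates each one rather than trusting it.

```python
def merge_weights(base, client_rules, max_weight=3.0):
    """Overlay browser-supplied rules on the server's base weights for one request.

    base:         dict of (juror, topic) -> weight
    client_rules: list of dicts with 'juror', 'topic', 'weight' (assumed shape)
    max_weight:   assumed upper bound so one stored rule cannot swamp the panel
    """
    merged = dict(base)  # never mutate the shared base weights
    for rule in client_rules:
        try:
            key = (str(rule["juror"]), str(rule["topic"]))
            weight = float(rule["weight"])
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed entries instead of failing the verdict
        if 0 < weight <= max_weight:
            merged[key] = weight
    return merged
```

Returning a fresh dict keeps each request isolated, which matters when nothing can be written back to disk.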

Why memory. Each verdict is appended to a log and mirrored into Hermes' own MEMORY.md, so I can ask hermes -z "what did the council decide about auth?" and Hermes recalls it from its memory, not from my code (the memory-recall image above). Proof: docs/hermes-proof/04-memory-recall.txt.

The foreman reads the verdict aloud. The verdict card has a "the foreman reads the verdict" button (browser SpeechSynthesis, $0); Hermes also ships native TTS via hermes setup tts. On-theme and memorable: a jury foreman announcing the decision.

The build itself was agent-run. I kept a memory.md the coding agent read before each task and updated after (so context stayed cheap), committed every increment with Conventional Commits, and built the verdict UI with the frontend-design skill, which is why the confidence dial and colour-coded juror chips read as designed, not default-template AI slop. The repo's AGENTS.md + commit history show the process, not just the result.

Why these models, and the concessions. Two free OpenRouter models from different families (≥64K context, since Hermes rejects smaller at startup) plus a local Ollama juror. Two honest concessions: (1) free models are slower and three calls add latency (~10-20s/verdict); (2) the free tier is aggressively rate-limited, so I hit 429s constantly while building. Council retries and, if a juror still won't answer, falls back (Hermes, then the direct API, then a deterministic stand-in) rather than crashing the verdict, which also means the demo runs fully offline at $0. For a once-a-decision tool, I'll take it. Cost: $0.
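The retry-then-fall-back behaviour can be sketched as an ordered chain of backends. This is an illustration under assumptions, not the repo's code: `ask_with_fallback`, the `(name, callable)` backend shape, the retry count, and the exponential delay are all hypothetical.

```python
import time

def ask_with_fallback(prompt, backends, retries=2, base_delay=1.0):
    """Try each backend in order, retrying failures before moving to the next.

    backends: ordered list of (name, callable(prompt) -> str), for example
              Hermes first, then a direct API call, then a deterministic stand-in.
    Returns (name of the backend that answered, its answer).
    """
    for name, call in backends:
        for attempt in range(retries + 1):
            try:
                answer = call(prompt)
                if answer and answer.strip():
                    return name, answer.strip()
            except Exception:
                pass  # e.g. a rate-limit error; treated the same as an empty answer
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError("every backend failed to answer")
```

Returning the backend name alongside the answer lets the caller tag where each opinion came from, so a stand-in answer is never mistaken for a real model's.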

License. MIT. Fork it, add your own jurors.

What I learned (and what's next)

The disagreement is the product. A 2-1 split is more useful than a confident single answer, so the clustering that decides "who actually disagreed" has to be right. A small local model once wrote a vague position ("to facilitate efficient integration…") whose reasons clearly endorsed Postgres; the first version mis-filed it as a dissenter. The fix: when a juror's stated position is ambiguous, fall back to reading its reasons, and ignore options only mentioned in a comparison ("better than Mongo" isn't a vote for Mongo). Now agreeing jurors cluster together, and the split count is honest.
Grounded beats glib. Letting the agent propose its own weighting only works if the proposal is tied to real evidence; an ungrounded "reflect" just echoes whatever example is in the skill.
Hermes' 64K-context floor caught a model that would've quietly underperformed.
A council should deliberate, not just vote. The round-2 debate above was the turning point: letting jurors read each other and reconsider means a juror that's genuinely persuaded moves the verdict, and you watch the confidence dial climb as a 2-1 split becomes unanimous. A one-shot vote can't do that.
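The clustering fix from the first lesson can be sketched roughly as follows. It is a deliberately crude illustration: the comparison-word list, the regex, and the function names are assumptions, and the real classifier in the repo may work quite differently.

```python
import re

# Words that introduce the option being compared against, not voted for (assumed list).
COMPARISON = re.compile(r"\b(?:than|over|versus|vs\.?|instead of|unlike)\s+(\w+)")

def mentioned(text, options):
    """Options named in text, skipping any that appear as a comparison target.

    Simplification: an option that is a comparison target anywhere in the
    text is dropped entirely, even if it is also praised elsewhere.
    """
    lowered = text.lower()
    compared = set(COMPARISON.findall(lowered))
    found = []
    for option in options:
        name = option.lower()
        if re.search(rf"\b{re.escape(name)}\b", lowered) and name not in compared:
            found.append(option)
    return found

def classify(position, reasons, options):
    """Use the stated position if it names exactly one option, else read the reasons."""
    hits = mentioned(position, options)
    if len(hits) == 1:
        return hits[0]
    endorsed = {opt for reason in reasons for opt in mentioned(reason, options)}
    return endorsed.pop() if len(endorsed) == 1 else None  # None = still ambiguous

if __name__ == "__main__":
    print(classify(
        "to facilitate efficient integration",
        ["Postgres gives ACID guarantees",
         "Relational joins are better than Mongo aggregation"],
        ["Postgres", "Mongo"],
    ))
```

Returning None for a juror whose reasons endorse several options keeps it out of every cluster instead of guessing, which is one way to keep the split count honest.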

Top comments (42)

Mykola Kondratiuk • Jun 1

curious how the memory weighting handles confident-but-wrong consensus. if all three models agree on a bad call, does re-weighting just make them more confident on the next similar question?

Arqam Waheed • Jun 2

nah it doesn't make them more confident, confidence isn't stored it's just how much they agree right now, and re-weighting only fires on dissent so when all three agree there's nothing to learn from, no weights move, the real problem is there's no ground-truth anywhere so a unanimous wrong call makes zero disagreement and nothing can walk it back unless you feed in whether it actually panned out, which nothing does yet.

Mykola Kondratiuk • Jun 2

unanimous agreement with no ground-truth is actually the riskier signal, not the safer one. i added a human-checkpoint gate on high-consensus critical-path calls - only thing that consistently catches the confident-but-wrong case before it lands.

Arqam Waheed • Jun 2

yeah agreed, unanimous + no ground-truth is the scary one, reads as max confidence but it's really just zero signal. the human-checkpoint gate on high consensus is exactly right, dissent already self-corrects, it's the silent agreement that needs a human. been thinking of flagging unanimity on critical-path as its own warning state instead of a pass, basically invert the trust.

Mykola Kondratiuk • Jun 2

flagging unanimity separately is where I landed — >=90% consensus queues for human review. what looks like strong agreement is often just correlated context windows. dissent catches itself; silent consensus doesn't.

Aditya • Jun 1

Love this approach. I built something very similar for the Notion MCP Challenge called "The Council," where multiple AI agents debate engineering decisions from different perspectives (security, performance, cost, and scalability) before an Arbiter produces a final verdict.

One difference is that my agents conduct the debate directly inside Notion using MCP, with every argument, counterargument, decision, and action item written back to the workspace as persistent organizational knowledge. The goal was solving the "why did we make this decision?" problem months later.

Really enjoyed seeing how you used Hermes for model-agnostic orchestration, deliberation rounds, and juror re-weighting. It's interesting how we both arrived at the idea that disagreement between agents is often more valuable than a single confident answer.

Andrii Krugliak • May 30

The second round is the part most setups skip, and it's the whole game. A juror that actually changes its vote after reading the others is doing real work instead of just voting. Did they flip more on factual splits or on judgment calls?

Arqam Waheed • May 31

Yesssss the second round is the whole game, a juror that flips after reading the others is actually deliberating instead of just voting once and dipping.

On ur question, way more flips on judgment calls than factual splits. Facts kinda resolve themselves, once one juror cites the right thing the rest just fall in line, not much of a debate. judgment calls are where it gets messy cuz there's no ground truth to point at, so they actually argue and move each other. Honestly the factual flips are the boring ones, the judgment flips are where u see the panel doing real thinking

Theo Valmis • Jun 1

The "argue + judge" pattern is the right architecture for any verification problem where you can't trust the writer's self-assessment. Two key design properties: the arguers shouldn't share the same blind spots (different models, different prompts, sometimes different families), and the judge has to have actual authority — its verdict has to gate the next step, not just generate a recommendation that gets ignored.

That's the structural shape we've been writing about at Mneme for code generation specifically: external verification contracts the agent has to pass before its output is considered complete, evaluated by something other than the agent itself. Same underlying primitive as Hermes-as-judge, applied to architectural constraints rather than text-quality judgments.

mnemehq.com/concepts/verification-...

Lloyd-Jackman-UKPL • Jun 3

Love this. If I had a penny for every time ChatGPT disagreed with CoPilot, or Claude and Qwen didn't see eye to eye.

It'd be interesting to know what effect rotating jurors from a wider pool would have. Like jury service 😄

Arqam Waheed • Jun 4

Love this framing, "jury service for LLMs". Right now the panel's fixed (3 jurors), but a rotating pool across model families is exactly the next step: less correlated bias, better per-topic trust signals. Empanel the juror that's earned it for the case.

VoltageGPU • Jun 4

Interesting approach to leveraging model diversity! I’ve experimented with ensemble decisions in confidential computing environments, and having a "judge" model like Hermes adds an extra layer of reasoning control. It’s similar to how we validate outputs in secure GPU setups—just with an LLM as the arbiter.

xulingfeng • May 30

This is brilliantly executed. The two-round debate (vote → see dissent → revote) is exactly what singles out the real answer from polished confidence. I've been running Hermes locally for our test automation stack and the same problem shows up — one model gives you a clean answer, you ship it, never seeing the debate that should've warned you.

The trust-weighting across question types is where this gets really powerful. Have you noticed patterns in which juror configuration yields the tightest confidence scores? Followed you 👀

Arqam Waheed • May 30

Appreciate it, honestly the trust-weighting is still more art than science rn but the clearest pattern: jurors with different base models (not just diff prompts) give tightest scores. Homogeneous panels agree too easily = false confidence. Diversity > raw capability. Lmk what configs ur hermes stack throws at it 👀

Alan Voren (PlayServ) • Jun 4

The single-model overconfidence framing is the real product here, not the jury mechanic. Every engineering team I know has a story about shipping the wrong thing because one LLM sounded certain. Making dissent the headline instead of burying it is the unusual call - most "second opinion" tools just average the answers, which is the exact opposite of what's useful.

Dhruv Joshi • Jun 4

This setup brilliantly reveals that a unanimous AI agreement is often just false confidence, making the dissent panel the most valuable feature here.

Michael Holding • Jun 3

A single answer gives certainty; a council reveals uncertainty. The real innovation here isn't the verdict, it's making disagreement visible and letting the system learn from it.

Valentin Monteiro • Jun 4

Disagreement visibility is the insight, agreed. But the operational question nobody's asking: how many tokens and how much latency does a multi-model debate add per decision? In production you end up choosing between confidence and cost, not between right and wrong.
