DEV Community

I Made My AI Models Argue, Then Let Hermes Be the Judge

Arqam Waheed on May 30, 2026

This is a submission for the Hermes Agent Challenge: Build With Hermes Agent TL;DR — Ask any judgment call and three different AI models argue it...
itskondrat profile image
Mykola Kondratiuk

curious how the memory weighting handles confident-but-wrong consensus. if all three models agree on a bad call, does re-weighting just make them more confident on the next similar question?

arqamwd profile image
Arqam Waheed

nah it doesn't make them more confident, confidence isn't stored it's just how much they agree right now, and re-weighting only fires on dissent so when all three agree there's nothing to learn from, no weights move, the real problem is there's no ground-truth anywhere so a unanimous wrong call makes zero disagreement and nothing can walk it back unless you feed in whether it actually panned out, which nothing does yet.
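
A minimal sketch of the behaviour described in this reply, not the project's actual code. The function names, the vote and weight dictionaries, the step size, and the human-supplied `correct` label are all assumptions made for illustration.

```python
from collections import Counter

def agreement(votes: dict) -> float:
    """Share of jurors backing the most common answer (computed fresh, never stored)."""
    top_count = Counter(votes.values()).most_common(1)[0][1]
    return top_count / len(votes)

def reweight(weights: dict, votes: dict, correct: str, step: float = 0.1) -> dict:
    """Move weights only when the panel split; a unanimous vote changes nothing."""
    if len(set(votes.values())) == 1:
        return dict(weights)  # no dissent, so no learning signal
    return {
        juror: w + step if votes[juror] == correct else max(0.0, w - step)
        for juror, w in weights.items()
    }

weights = {"a": 1.0, "b": 1.0, "c": 1.0}
unanimous = {"a": "postgres", "b": "postgres", "c": "postgres"}
split = {"a": "postgres", "b": "postgres", "c": "sqlite"}

# Unanimous but wrong: the weights come back unchanged.
print(agreement(unanimous), reweight(weights, unanimous, correct="sqlite"))
# Split: the dissenter that was right gains weight, the others lose some.
print(agreement(split), reweight(weights, split, correct="sqlite"))
```

The first call shows the gap being discussed: with no dissent, even a known-wrong verdict leaves every weight where it was.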

itskondrat profile image
Mykola Kondratiuk

unanimous agreement with no ground-truth is actually the riskier signal, not the safer one. i added a human-checkpoint gate on high-consensus critical-path calls - only thing that consistently catches the confident-but-wrong case before it lands.

arqamwd profile image
Arqam Waheed

yeah agreed, unanimous + no ground-truth is the scary one, reads as max confidence but it's really just zero signal. the human-checkpoint gate on high consensus is exactly right, dissent already self-corrects, it's the silent agreement that needs a human. been thinking of flagging unanimity on critical-path as its own warning state instead of a pass, basically invert the trust.

itskondrat profile image
Mykola Kondratiuk

flagging unanimity separately is where I landed — >=90% consensus queues for human review. what looks like strong agreement is often just correlated context windows. dissent catches itself; silent consensus doesn't.
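
One way such a gate could look, as a rough sketch. The 90% threshold comes from the comment above; the `Route` enum, the `critical_path` flag, and the function name are hypothetical.

```python
from collections import Counter
from enum import Enum

class Route(Enum):
    PASS = "pass"
    HUMAN_REVIEW = "human_review"
    SPLIT = "split"

def route_verdict(votes: list, critical_path: bool,
                  review_threshold: float = 0.9) -> Route:
    """Treat very high consensus on critical-path calls as a warning, not a pass."""
    consensus = Counter(votes).most_common(1)[0][1] / len(votes)
    if critical_path and consensus >= review_threshold:
        return Route.HUMAN_REVIEW  # silent agreement gets a human look
    if consensus < 1.0:
        return Route.SPLIT  # visible dissent is surfaced as-is
    return Route.PASS

print(route_verdict(["use_pg", "use_pg", "use_pg"], critical_path=True))
print(route_verdict(["use_pg", "use_pg", "use_sqlite"], critical_path=True))
```

The inversion is in the first branch: on critical-path questions the highest-consensus verdicts are the ones queued for review.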

aditya_007 profile image
Aditya

Love this approach. I built something very similar for the Notion MCP Challenge called "The Council," where multiple AI agents debate engineering decisions from different perspectives (security, performance, cost, and scalability) before an Arbiter produces a final verdict.

One difference is that my agents conduct the debate directly inside Notion using MCP, with every argument, counterargument, decision, and action item written back to the workspace as persistent organizational knowledge. The goal was solving the "why did we make this decision?" problem months later.

Really enjoyed seeing how you used Hermes for model-agnostic orchestration, deliberation rounds, and juror re-weighting. It's interesting how we both arrived at the idea that disagreement between agents is often more valuable than a single confident answer.

theuniverseson profile image
Andrii Krugliak

The second round is the part most setups skip, and it's the whole game. A juror that actually changes its vote after reading the others is doing real work instead of just voting. Did they flip more on factual splits or on judgment calls?

arqamwd profile image
Arqam Waheed

Yesssss the second round is the whole game, a juror that flips after reading the others is actually deliberating instead of just voting once and dipping.

On ur question, way more flips on judgment calls than factual splits. Facts kinda resolve themselves, once one juror cites the right thing the rest just fall in line, not much of a debate. judgment calls are where it gets messy cuz there's no ground truth to point at, so they actually argue and move each other. Honestly the factual flips are the boring ones, the judgment flips are where u see the panel doing real thinking

mnemehq profile image
Theo Valmis

The "argue + judge" pattern is the right architecture for any verification problem where you can't trust the writer's self-assessment. Two key design properties: the arguers shouldn't share the same blind spots (different models, different prompts, sometimes different families), and the judge has to have actual authority — its verdict has to gate the next step, not just generate a recommendation that gets ignored.

That's the structural shape we've been writing about at Mneme for code generation specifically: external verification contracts the agent has to pass before its output is considered complete, evaluated by something other than the agent itself. Same underlying primitive as Hermes-as-judge, applied to architectural constraints rather than text-quality judgments.

mnemehq.com/concepts/verification-...

lloydjackmanukpl profile image
Lloyd-Jackman-UKPL

Love this. If I had a penny for every time ChatGPT disagreed with CoPilot, or Claude and Qwen didn't see eye to eye.

It'd be interesting to know what effect rotating jurors from a wider pool would have. Like jury service 😄

arqamwd profile image
Arqam Waheed

Love this framing, "jury service for LLMs". Right now the panel's fixed (3 jurors), but a rotating pool across model families is exactly the next step: less correlated bias, better per-topic trust signals. Empanel the juror that's earned it for the case.

voltagegpu profile image
VoltageGPU

Interesting approach to leveraging model diversity! I’ve experimented with ensemble decisions in confidential computing environments, and having a "judge" model like Hermes adds an extra layer of reasoning control. It’s similar to how we validate outputs in secure GPU setups—just with an LLM as the arbiter.

xulingfeng profile image
xulingfeng

This is brilliantly executed. The two-round debate (vote → see dissent → revote) is exactly what singles out the real answer from polished confidence. I've been running Hermes locally for our test automation stack and the same problem shows up — one model gives you a clean answer, you ship it, never seeing the debate that should've warned you.

The trust-weighting across question types is where this gets really powerful. Have you noticed patterns in which juror configuration yields the tightest confidence scores? Followed you 👀

arqamwd profile image
Arqam Waheed

Appreciate it, honestly the trust-weighting is still more art than science rn but the clearest pattern: jurors with different base models (not just diff prompts) give tightest scores. Homogeneous panels agree too easily = false confidence. Diversity > raw capability. Lmk what configs ur hermes stack throws at it 👀

playserv profile image
Alan Voren (PlayServ)

The single-model overconfidence framing is the real product here, not the jury mechanic. Every engineering team I know has a story about shipping the wrong thing because one LLM sounded certain. Making dissent the headline instead of burying it is the unusual call - most "second opinion" tools just average the answers, which is the exact opposite of what's useful.

dhruvjoshi9 profile image
Dhruv Joshi

This setup brilliantly reveals that a unanimous AI agreement is often just false confidence, making the dissent panel the most valuable feature here.

michael_holding profile image
Michael Holding

A single answer gives certainty; a council reveals uncertainty. The real innovation here isn't the verdict, it's making disagreement visible and letting the system learn from it.

valentin_monteiro profile image
Valentin Monteiro

Disagreement visibility is the insight, agreed. But the operational question nobody's asking: how many tokens and how much latency does a multi-model debate add per decision? In production you end up choosing between confidence and cost, not between right and wrong.

gitgem profile image
Gabriel Bachmann

Loved it. Thx, Arqam. Followed you for more.

sahil_webmok_9997721ce837 profile image
sahil webmok

Nice post! 😊 In today's competitive market, an online presence is very important. Agencies like Webmok are really helping businesses grow. Thanks for sharing this information!

harjjotsinghh profile image
Harjot Singh

Making models argue and using a judge is the adversarial-verification pattern done right. You get diversity (different models surface different errors) plus a referee, which beats trusting any single model's confident answer. The two things that decide if it works: the judge has to be genuinely independent (not the same model grading its own argument) and the debate has to be substantive, not two models politely agreeing. When it works you catch the confident-but-wrong answers a single pass would happily ship. I lean on exactly this argue-then-judge structure in Moonshift's verify layer. Does the judge ever get it wrong, and if so do you have a tiebreak or a human gate?

arqamwd profile image
Arqam Waheed

Ya it gets it wrong, almost always the same way, when the debate stays polite and nobody presses the judge just rubber-stamps whichever juror sounds most confident. So I don't tiebreak by adding another voting model, that just stacks more confident guesses, instead the confidence dial gates it, low agreement or a juror that won't flip comes out as a split verdict instead of a fake-clean answer, and that low-confidence split is the human gate basically, the panel tapping u to come look instead of averaging the dissent into a tidy wrong answer. Curious how Moonshift handles the polite-agreement case, do u force the disagreement or just detect it after?
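
A rough sketch of the "split instead of a fake-clean answer" idea, under stated assumptions: jurors are plain callables, round 2 only runs when round 1 disagrees, and anything short of full agreement after round 2 is reported as a split. The `deliberate` function, the `min_agreement` default, and the two toy jurors are hypothetical.

```python
from collections import Counter

def deliberate(question: str, jurors: dict, min_agreement: float = 1.0) -> dict:
    """Round 1 is blind; round 2 runs only if round 1 split."""
    votes = {name: juror(question, []) for name, juror in jurors.items()}
    if len(set(votes.values())) > 1:
        round1 = dict(votes)
        # Each juror revotes after seeing the other jurors' round-1 votes.
        votes = {
            name: juror(question, [v for other, v in round1.items() if other != name])
            for name, juror in jurors.items()
        }
    top, count = Counter(votes.values()).most_common(1)[0]
    confidence = count / len(votes)
    if confidence < min_agreement:
        # No majority answer is reported; the split itself is the output.
        return {"status": "split", "verdict": None, "confidence": confidence, "votes": votes}
    return {"status": "agreed", "verdict": top, "confidence": confidence, "votes": votes}

def follower(question: str, others: list) -> str:
    """Toy juror: says yes when blind, then sides with the others' majority."""
    return Counter(others).most_common(1)[0][0] if others else "yes"

def stubborn(question: str, others: list) -> str:
    """Toy juror: votes no and never flips."""
    return "no"

result = deliberate("Migrate now?", {"a": follower, "b": follower, "c": stubborn})
print(result["status"], result["votes"])
```

Because the holdout never flips, the result is a split handed to a human rather than a 2-1 majority presented as a clean verdict.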

harjjotsinghh profile image
Harjot Singh

Detect it, don't force it, that's the same conclusion I landed on. Forcing disagreement (assigning a mandatory contrarian) just manufactures objections that sound real and gives the judge fake signal to rubber-stamp, which is the exact failure you're avoiding. So Moonshift does what you described: the panel runs, and the signal that matters is the shape of the agreement, not the verdict. Polite consensus where nobody actually pressed is treated as low-confidence, not high, because unanimous-but-unexamined is indistinguishable from groupthink. The confidence dial is the gate: genuine independent agreement raises it, a juror that won't engage or a too-clean unanimous answer drops it, and below the line it surfaces as a split for a human instead of averaging the dissent into a tidy wrong answer. The one thing I'd add to keep it honest is diversity of the jurors themselves, different model families and framings, so agreement actually means they failed-differently-and-still-converged rather than three instances of the same prior nodding along. Your low-agreement-becomes-the-human-gate is exactly right. Do you vary the juror models, or is it the same model in different roles?

arqamwd profile image
Arqam Waheed

Different models, that's the whole bet. Roster is gpt-oss, GLM, and a local qwen2.5, three different families not one model wearing three hats. Same model in different roles was the first thing I ruled out cause then agreement just means one prior nodding at itself, that's the unanimous but unexamined thing u can't tell from real signal.

Failing differently is a property of the model not the role, two instances of the same model fail in correlated ways no matter how u frame the prompt, so different families converging is the only consensus the dial treats as high. One catch tho, free tier variety is shallower than it looks, lotta open models share ancestry so different name isn't always different prior.

p0rt profile image
Sergei Parfenov • Edited

deliberation round + the diversity thing + the anchoring fix are all covered above so lemme poke at the one case the whole thing is kinda blind to: confident-but-wrong consensus. @itskondrat asked this and i don't think it got answered.

everything here is tuned to catch disagreement — dissent panel, the 2-1 confidence drop, round 2, all of it only fires when jurors split. so the scary case isn't the split, it's the unanimous-and-wrong one. three jurors agree on the bad db, dial reads high, deliberation skips (unanimous round 1 = no round 2), human ships it. thats literally ur opening story, and the council waves it through with a green dial.

the diversity bet helps (diff families fail differently so real agreement means smth) but doesnt close it, and like u said urself free models share ancestry so the independence is thinner than the roster looks.
where it actually bites is --learn. u upweight a juror cuz it "caught what the others missed" — but u only know it caught it cuz a human noticed. so the weights only ever train on errors that were already detected, aka the cases u didnt really need help on. the confident-wrong-consensus ones give zero learning signal cuz nobody flagged them. memory gets better at what it already solves and stays blind to the stuff that hurts.
fix isnt another juror (just stacks more correlated priors). its an outcome signal from outside the panel: log the verdict, and when reality contradicts it later (the migration ur still not over lol) feed that back, separate from whether they agreed. that also turns the dial from a consensus meter into a calibration one — rn "67%" means "2 of 3 agreed", not "verdicts like this are right ~67% of the time". once u log outcomes u can check if high-confidence ones actually were, and a reliability curve tells u if the dial is honest or just measuring how chummy the jurors are.

sharp project tho, the SKILL.md-as-data weighting and the async-collect/sync-reveal anchor fix are the details most ppl skip.
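
The consensus-versus-calibration distinction can be made concrete with a small reliability table. This is a sketch that assumes outcomes have already been logged as `(confidence, was_correct)` pairs; the function name, the bin count, and the sample log are made up for illustration and are not measured results.

```python
from collections import defaultdict

def reliability_curve(log: list, bins: int = 10) -> list:
    """Bin logged (confidence, was_correct) pairs and compare the mean stated
    confidence with the observed hit rate in each bin."""
    grouped = defaultdict(list)
    for confidence, was_correct in log:
        index = min(int(confidence * bins), bins - 1)
        grouped[index].append((confidence, was_correct))
    curve = []
    for index in sorted(grouped):
        rows = grouped[index]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        hit_rate = sum(ok for _, ok in rows) / len(rows)
        curve.append((round(mean_conf, 2), round(hit_rate, 2), len(rows)))
    return curve

# Made-up log entries for illustration only.
log = [(1.0, True), (1.0, False), (1.0, True), (1.0, True),
       (0.67, True), (0.67, False)]
for mean_conf, hit_rate, n in reliability_curve(log):
    print(f"stated {mean_conf:.2f}  observed {hit_rate:.2f}  n={n}")
```

If the dial were calibrated, the stated and observed columns would roughly match in every bin; a large gap in the high-confidence bin is the unanimous-and-wrong problem showing up in the data.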

arqamwd profile image
Arqam Waheed

yeah ok this is the actual hole and i'm not gonna pretend it isn't. everything fires on disagreement, so the unanimous-and-wrong case is exactly the one that skips round 2 and ships green. and ur dead right about --learn, it only ever upweights a juror cuz a human caught the miss, so the weights train purely on errors we already detected. the confident-wrong-consensus ones leave no signal, memory gets sharper at what it already solves and stays blind to what actually bites. real blind spot, not a tuning thing.

and adding a juror doesn't fix it, agreed... just stacks more correlated priors, especially when the free models share ancestry like u said.

the outcome-signal idea is the right move. log every verdict, and when reality contradicts it later feed that back as its own channel, fully separate from whether they agreed. that flips the dial from a consensus meter into a calibration one — "67%" should mean "verdicts like this land right ~67% of the time", not "2 of 3 nodded". once outcomes are logged i can plot reliability and actually see if the dial's honest or just measuring how chummy the jurors are.

gonna spec this as an outcome-log + calibration layer. good catch, this is the part that mattered, thank you sm

p0rt profile image
Sergei Parfenov

love that ur running with it. the consensus-meter → calibration-meter thing is the way to put it — "67%" should mean "verdicts like this land right ~67% of the time", thats the only version that survives prod.
two things to bake into the outcome-log when u spec it, both learned the annoying way:
attribution lag. reality contradicts a verdict on its own schedule — the migration bites weeks later (ur still not over it lol). by then the verdict + its inputs + the juror states are long gone from context. so the log only becomes a calibration signal if every verdict gets written with a frozen snapshot at decision time: inputs, per-juror votes, the confidence. cheap on write, basically impossible to reconstruct after. log just verdict+score and ull have outcomes u cant tie back to anything.

the outcome channel can inherit the exact same blind spot. "reality contradicted it" is only observable for failures that actually surface. a confident-wrong-consensus that ships and nobody ever notices never lands in the log as a miss — so ur reliability curve flatters itself in the one region u built this to fix. prob worth sampling some green-shipped verdicts for a ground-truth audit, not just waiting for failures to announce themselves.

the reliability diagram over a few hundred logged outcomes is gonna be the most useful thing here — tells u if the whole ensemble earns its keep way better than any single eval. would genuinely wanna see where it lands if u do a follow-up.
sharp project tho, the async-collect/sync-reveal anchor fix is the kind of detail most ppl skip.
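
A sketch of what a frozen decision-time snapshot might look like, assuming an append-only JSON Lines log. The record fields follow the list in this comment (inputs, per-juror votes, confidence); the class and function names and the `verdict_id` join key are hypothetical.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class VerdictSnapshot:
    question: str
    inputs: dict        # whatever context the jurors were given
    votes: dict         # per-juror votes at decision time
    confidence: float
    verdict: str
    verdict_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    decided_at: float = field(default_factory=time.time)

def snapshot_line(snap: VerdictSnapshot) -> str:
    """Serialize the full decision context as one JSON line."""
    return json.dumps(asdict(snap))

def outcome_line(verdict_id: str, was_correct: bool) -> str:
    """Serialize a later outcome, keyed by the id written at decision time."""
    return json.dumps({"verdict_id": verdict_id, "was_correct": was_correct,
                       "observed_at": time.time()})

snap = VerdictSnapshot(
    question="Migrate to the new DB?",
    inputs={"context": "example inputs"},
    votes={"juror_a": "yes", "juror_b": "yes", "juror_c": "yes"},
    confidence=1.0,
    verdict="yes",
)
print(snapshot_line(snap))
print(outcome_line(snap.verdict_id, was_correct=False))
```

Writing the outcome as a separate line keyed by `verdict_id` means a result that arrives weeks later can still be tied back to the exact votes and inputs that produced the verdict.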

arqamwd profile image
Arqam Waheed

both of these are going straight into the spec, these are exactly the traps i'd have walked into.

attribution lag — yeah, the snapshot-at-decision-time thing is the whole ballgame. logging just verdict+score is useless if i cant tie an outcome weeks later back to the inputs + per-juror votes + confidence that produced it. cheap on write, impossible to reconstruct after, so every verdict gets frozen with its full context or the calibration signal never materializes. easy to get lazy here and i wont.

and the outcome channel inheriting the same blind spot is the sharp one, "reality contradicted it" only fires for failures that actually surface, so a confident-wrong-consensus that ships clean and nobody notices never logs as a miss. which means the reliability curve flatters itself in the exact region i built this to fix. so yeah, sampling green-shipped verdicts for a ground-truth audit instead of just waiting for failures to announce themselves. passive logging alone would quietly lie to me.
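
The audit idea could be as simple as random sampling over unanimous verdicts that have no recorded outcome yet. A minimal sketch, assuming verdicts are dicts with `confidence` and an optional `outcome` key; the function name and the 10% rate are placeholders.

```python
import random
from typing import Optional

def pick_audit_sample(verdicts: list, rate: float = 0.1,
                      seed: Optional[int] = None) -> list:
    """Randomly pick unanimous verdicts with no recorded outcome for a manual check."""
    rng = random.Random(seed)
    candidates = [v for v in verdicts
                  if v["confidence"] >= 1.0 and v.get("outcome") is None]
    if not candidates:
        return []
    k = min(len(candidates), max(1, round(len(candidates) * rate)))
    return rng.sample(candidates, k)

# Illustrative data: 20 unanimous verdicts nobody has checked, plus 5 splits.
verdicts = [{"id": i, "confidence": 1.0} for i in range(20)]
verdicts += [{"id": 100 + i, "confidence": 0.67} for i in range(5)]
sample = pick_audit_sample(verdicts, rate=0.1, seed=7)
print([v["id"] for v in sample])
```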

the reliability diagram is the artifact i actually want out of this... couple hundred logged outcomes and it tells me if the ensemble earns its keep way better than any single eval. ill do the follow-up and post where it lands, genuinely curious myself.

appreciate u poking at the part that mattered instead of the surface!

xulingfeng profile image
xulingfeng

Great point on homogeneous panels. I have been running 5 different model families (Mistral, Qwen, Yi, DeepSeek, Llama) and the spread in outputs is wild — unanimous votes correlate way more with correctness than any single model's standalone confidence.

Do you track which juror changes its vote most often in the deliberation round? That alone might be a better trust signal than any static weight.

arqamwd profile image
Arqam Waheed

The 5 family spread is doing the heavy lifting there, nice. on vote-changes as a trust signal, raw flip count alone is misleading tho. The real principle is whether a juror responds to arguments or just to the room. One that flips cuz a real counterpoint surfaced is updating correctly, one that flips just cuz everyone else caved is the unstable one. Same for holding, holding against social pressure is good, holding against a real argument is just stubborn. So dont track who changes most, track who reacts to evidence vs who reacts to the crowd, thats the signal.

xulingfeng profile image
xulingfeng

We're running DeepSeek V4 Flash as the main worker + local qwen2.5:7b for offline tasks + Mistral for code review. The jury diversity def helps — our unanimous votes correlate with correctness way more than majority ones. One thing I'd love to see: a history-based weight decay for jurors that keep flip-flopping between rounds. You track that?

arqamwd profile image
Arqam Waheed

Yeah unanimous beating majority is the whole point. when a diverse panel actually converges that means something, vs majority where someone's just outvoted.

Don't track flip-flop decay yet but ur right that i should. a juror swinging vote to dissent to vote isn't really deliberating, it's just unstable. Only catch is changing ur mind after real dissent is good, flipping with no new info is noise. Adding it to the list
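
Flip-flop decay is not implemented in the project per this reply, so the following is only one possible shape for it. It assumes a per-juror vote history for a single question and counts A, B, A reversals; the names and the 0.8 decay factor are hypothetical. It does not tell a flip caused by a real argument from one caused by nothing, which is the caveat raised above.

```python
def count_reversals(history: list) -> int:
    """Count A -> B -> A patterns in one juror's vote history."""
    return sum(
        1
        for i in range(2, len(history))
        if history[i] == history[i - 2] and history[i] != history[i - 1]
    )

def decayed_weight(weight: float, history: list, decay: float = 0.8) -> float:
    """Shrink a juror's weight once per reversal; a single flip costs nothing."""
    return weight * decay ** count_reversals(history)

print(decayed_weight(1.0, ["yes", "no", "no"]))    # changed its mind once and held
print(decayed_weight(1.0, ["yes", "no", "yes"]))   # swung back: one reversal
```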

xulingfeng profile image
xulingfeng

That diversity > raw capability finding makes total sense — homogeneous panels converge way too fast. I've been running 5 jurors (Mistral, Qwen, Yi, DeepSeek, Llama) and the spread is wild. The outlier takes often surface edge cases the majority glosses over. Do you track which model's dissent correlates with eventual correctness?

arqamwd profile image
Arqam Waheed

5 jurors is a lot, past 3 you kinda pay a consensus tax where the majority drowns the outlier. Sounds like ur already seeing that.

On tracking: it's not which model dissents, it's which model dissents against its own baseline. A cautious model flagging risk = noise, that's just its prior. An agreeable model suddenly breaking ranks = that's the part worth watching. Track the delta, not the absolute.
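
"Track the delta" can be sketched as a dissent measured against each juror's own history. This assumes a per-juror list of past dissent flags; the function name and the 0.5 fallback for a juror with no history are assumptions.

```python
def dissent_delta(past_dissents: list, dissented_now: bool) -> float:
    """Current behaviour minus the juror's own baseline dissent rate."""
    baseline = sum(past_dissents) / len(past_dissents) if past_dissents else 0.5
    return float(dissented_now) - baseline

cautious = [True] * 6 + [False] * 4     # dissents 60% of the time
agreeable = [False] * 19 + [True]       # dissents 5% of the time

print(dissent_delta(cautious, dissented_now=True))
print(dissent_delta(agreeable, dissented_now=True))
```

The same dissent scores much higher for the usually agreeable juror, which matches the point that a cautious model flagging risk is mostly its prior talking.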

xulingfeng profile image
xulingfeng

The diversity > raw capability point matches what we have seen too — homogeneous panels giving false confidence is exactly right.

Our stack runs DeepSeek V4 Flash as the main worker with a local Qwen for cross-validation on test automation decisions. The disagreements (Flash passes, Qwen flags a boundary case) are consistently the most useful signals.

Curious: have you noticed whether the latency gap between local vs API jurors affects the deliberation dynamics? Speed might subtly bias which voice carries more weight.

arqamwd profile image
Arqam Waheed

Yeah that Flash-passes-Qwen-flags split is the main thing right there, the disagreement IS the signal, the agreement is just noise you already knew.

And good catch on latency, been chewing on this exactly. Short answer: yes it biases, but not how you'd think. The slow juror doesnt lose weight but it gets anchored out. Fast voice lands first and frames the question then the late juror ends up arguing against a framing instead of the raw problem. Speed = agenda-setting power, not just confidence. So I force all jurors to commit blind before anyone sees timing. Async collection, sync reveal. Kills the anchor.

Local-vs-API gap is brutal for this btw, your Qwen's probs eating a 2-3x latency tax vs Flash.
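
Async collection with a synchronous reveal maps naturally onto `asyncio.gather`. A minimal sketch, assuming each juror is an async call; `ask_juror` is a stand-in that simulates latency with `asyncio.sleep` instead of calling a real model, and the juror names are placeholders.

```python
import asyncio

async def ask_juror(name: str, question: str, delay: float, vote: str) -> tuple:
    """Stand-in for a model call; `delay` simulates local vs API latency."""
    await asyncio.sleep(delay)
    return name, vote

async def collect_blind(question: str) -> dict:
    # Every juror gets the same raw question and nothing else. gather()
    # returns only once all of them have answered, so the fast juror's
    # answer cannot frame the question for the slow one.
    results = await asyncio.gather(
        ask_juror("fast_api_model", question, 0.01, "ship"),
        ask_juror("slow_local_model", question, 0.05, "hold"),
    )
    return dict(results)

votes = asyncio.run(collect_blind("Ship the migration?"))
print(votes)
```

`gather` returns results in the order the calls were passed in, not the order they finished, so arrival time leaks nothing into the reveal.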

nickmeinhold profile image
Nick Meinhold

Really like this, and "single-model overconfidence is the villain" nails it. The detail I'd point at is that your jurors are cross-family (two OpenRouter models plus a local Ollama one) rather than three of the same family. We ran an experiment on exactly that question and the result surprised us: a weaker, different-family local judge (Qwen 7B) tracked ground truth better than a same-family judge, 94% vs 81%. Independence beats capability for this job, so keeping that Ollama juror in the room is doing more work than it looks.

One twist you might enjoy: we used the same disagreement signal you do, but pointed it at the bill instead of accuracy. Call the cheap model twice with two different personas; if they agree, ship the cheap answer; if they disagree, only then pay for the expensive model. Same "low confidence when they split 2-1" idea, used to decide when to spend. Write-up if it's useful: enspyr.co/blog/echo-results-so-far . The memory-reweighting part of your build is the bit most people skip, nice.
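
The spend-on-disagreement idea in this comment can be sketched as a two-step cascade. The function name, the persona labels, and the stub models are hypothetical, and exact string equality stands in for whatever agreement check a real setup would use.

```python
def answer_with_escalation(question: str, cheap, expensive,
                           personas=("skeptic", "optimist")) -> tuple:
    """Ask the cheap model under two personas; pay for the expensive one only on a split."""
    first = cheap(personas[0], question)
    second = cheap(personas[1], question)
    if first == second:
        return first, "cheap"
    return expensive("neutral", question), "expensive"

# Stub models for illustration: each takes (persona, question) and returns an answer.
def cheap_agrees(persona: str, question: str) -> str:
    return "postgres"

def cheap_splits(persona: str, question: str) -> str:
    return "postgres" if persona == "skeptic" else "sqlite"

def expensive_model(persona: str, question: str) -> str:
    return "postgres"

print(answer_with_escalation("Which DB?", cheap_agrees, expensive_model))
print(answer_with_escalation("Which DB?", cheap_splits, expensive_model))
```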