DtoTHEmoon

Posted on May 27

Why Your AI Agent Keeps Making the Same Mistakes (It's Not the Model)

#ai #claude #agentaichallenge #chatgpt

Does this sound familiar?

Your AI just fixed a bug. Two weeks later, the exact same bug is back.

You deploy something, and you have no idea if it actually worked — so you manually test it.

You've written 100 lines of rules in your config file, but the AI still ignores half of them.

Every new chat session, you re-explain the same context from scratch.

I ran into all four of these problems while building an internal AI quoting system for a healthcare company, with no technical background of my own. After months of debugging, I realized that none of these were model problems. They were Harness problems.

What is Harness Engineering?

Harness Engineering is the discipline of building the scaffolding around your AI — the rules, constraints, verification scripts, and knowledge structures that make it produce consistent, reliable output.

Without a Harness, even the best model will drift, forget, and repeat the same mistakes.

The data backs this up: research shows that 80% of Agent quality failures come from Harness gaps, not model limitations. And in one benchmark, all 15 models improved significantly when only the Harness changed and the models themselves stayed the same.

The problem is: most people don't know what their Harness is missing. They just know something feels broken.

The framework: two dimensions, not six steps

After studying real production failures and building my own system from scratch, I organized Harness Engineering into two dimensions.

Vertical Quality Layers (Q) — required for every project

Layer	Name	What it solves
Q1	SPEC	AI knows what to build, what not to, and how to verify
Q2	Rules + Security	Hard business limits + security red lines, equally mandatory
Q3	Skills	Repetitive workflows standardized with counter-examples
Q4	Scripts (unified gate)	Nothing is "done" until scripts pass

Horizontal Scale Layers (S) — enable only when needed

Layer	Name	When to enable
S1	Context	Sessions losing coherence after ~20 turns
S2	dev-map + Memory	Project iterating 2+ months, AI re-inventing solutions
S3	Multi-Agent	Single agent consistently failing on long task chains

The key insight: Q4 is not step four. It's the exit gate for every layer. Code changes, doc updates, multi-agent outputs — all must pass Q4 before anything counts as done.

Most people skip Q4 entirely. That's why the same bug keeps coming back.
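
To make the "exit gate" idea concrete, here is a minimal sketch in Python. It is an illustration, not Rein's implementation: the names `run_gate` and `gate_exit_code` are hypothetical, and each check is assumed to be a plain function that returns True on success. The point is that "done" is decided by one place that runs every check, and a check that crashes counts as a failure.

```python
from typing import Callable, Dict, List


def run_gate(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed.

    A check that raises is treated as failed, so a crashing check
    can never be mistaken for a passing one.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def gate_exit_code(checks: Dict[str, Callable[[], bool]]) -> int:
    """Return 0 only when every check passed, 1 otherwise."""
    return 0 if not run_gate(checks) else 1
```

A wrapper script could pass the result of `gate_exit_code(...)` to `sys.exit`, so that any failing check stops the pipeline with a non-zero status.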

What I built: Rein

Rein is an open-source Skill for Claude Code (and any agent supporting the SKILL.md standard) that acts as a silent Harness Engineering advisor throughout your project.

It watches your conversations for patterns — not keywords — and speaks up only when it detects a real gap. When everything's fine, it stays silent. Silence is a feature.

What it detects automatically:

Repeated failures (same bug fixed twice → missing Rule or regression test)
Context loss (re-explaining background every session → incomplete project docs)
Scale shifts (internal tool going external → time to harden your Harness)
Cost spikes (API bill climbing → identifies token waste sources)
Over-engineering (more config, slower shipping → tells you what to delete)

Test results: 97% pass rate across 16 scenarios with Rein vs 52% without.

The biggest gap was in root cause diagnosis: 92% accuracy with Rein, 24% without.

A real example from my project

My verify.sh only checked if the service started. It didn't check if the business logic was correct.

So when the AI "fixed" a pricing calculation bug, it passed my verification — service was running — but the actual calculation was still wrong. Same bug, two weeks later.

After adding a business baseline check (call a known correct quote request, compare against expected output), that class of bug disappeared entirely.

This is Q4. Not just "is the service alive?" but "is the output actually correct?"
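
A business baseline check of this kind might look roughly like the sketch below. Everything in it is assumed for illustration: `check_baseline`, `example_quote`, the request shape, and the price are made up, and `example_quote` stands in for a call to the running quoting service. The technique is simply to replay requests whose correct answers are already known and report every mismatch.

```python
from decimal import Decimal
from typing import Callable, Dict, List, Tuple

# Each baseline case pairs a request with its hand-verified expected total.
Baseline = List[Tuple[Dict[str, int], Decimal]]


def check_baseline(
    quote: Callable[[Dict[str, int]], Decimal], baseline: Baseline
) -> List[str]:
    """Return a description of every baseline case whose output is wrong."""
    mismatches = []
    for request, expected in baseline:
        actual = quote(request)
        if actual != expected:
            mismatches.append(f"{request}: expected {expected}, got {actual}")
    return mismatches


def example_quote(request: Dict[str, int]) -> Decimal:
    """Hypothetical stand-in for the real quoting service."""
    return Decimal(request["units"]) * Decimal("12.50")


# Hypothetical baseline data: 4 units at 12.50 each should total 50.00.
EXAMPLE_BASELINE: Baseline = [({"units": 4}, Decimal("50.00"))]
```

An empty list from `check_baseline` means the business logic still matches the baseline. Anything else should fail verification, even if the service itself started cleanly.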

Install

git clone https://github.com/DtoTHEmoon/rein-skill.git ~/.claude/skills/rein

Restart your agent. Rein activates automatically — no commands needed.

Also works with: OpenClaw, Codex CLI, Gemini CLI, Cursor, and any agent supporting SKILL.md.

The core philosophy

Start minimal. Add only when you have a real pain point. And know when to subtract — Rein will tell you when your Harness is getting in your own way.

If your scaffolding is slowing you down, it's time to cut.

GitHub: github.com/DtoTHEmoon/rein-skill

Top comments (2)

Harjot Singh • May 31

"It's not the model" is the single most important realization in agent building, and most people resist it because upgrading the model feels like progress while fixing the harness feels like chores. But repeated mistakes are almost never a model-capability problem - they're a memory/context problem: the agent has no persistent record of "we tried this and it failed," so it cheerfully re-walks into the same wall every session. Same model, same blind spot, forever, because nothing taught it otherwise.

The fix is structural, not a bigger brain: capture failures as durable lessons the agent is forced to see (a learnings file, a check that blocks the known-bad path), so the system improves even though the model doesn't. That "encode the lesson so it can't repeat the mistake" loop is core to how I build with Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - the reliability comes from the harness learning, not from hoping the next model release is smarter. Excellent, contrarian-but-correct post. When you fixed the repeated mistakes, was it mostly persistent memory of failures, or tighter guardrails preventing the bad path? Curious which did more of the work.

DtoTHEmoon • Jun 1

Great question, and you're right that memory alone isn't enough. If you only record "we tried X and it failed," the agent will still walk into the same wall the moment the surface changes slightly.

What actually worked for me was structural blocking: moving the lesson from a rule the agent reads to a check the agent cannot pass without satisfying. In practice, that meant adding the known-bad pattern as a verification step in verify.sh: something that exits with status 1 and stops the process cold, not a note in CLAUDE.md that can be silently ignored.

The mental model I use now: if the lesson lives in text the agent interprets, it's memory (fragile). If it lives in a script that gates completion, it's mechanism (durable). Rein is built around that distinction: Q2 (rules) is where you write the lesson, Q4 (verification scripts) is where you make it physically impossible to skip.

So to directly answer your question: mechanism did almost all of the work. Memory got me partway there, but the mistake only truly stopped recurring when the bad path had a hard gate in front of it.

Curious how Moonshift handles this: do you encode lessons at the prompt level or at the pipeline gate level?
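
The memory-versus-mechanism distinction in the reply above can be sketched as a small check. This is an assumption-laden illustration, not the author's verify.sh: the `KNOWN_BAD` entry, `find_known_bad`, and `known_bad_gate` are invented here. Each lesson is stored as a pattern plus the reason it was banned, and the gate returns 1 whenever a banned pattern reappears.

```python
import re
from typing import Dict, List

# Hypothetical lesson, invented for illustration: a known-bad pattern
# mapped to the reason it was banned.
KNOWN_BAD: Dict[str, str] = {
    r"\bfloat\(": "prices must not pass through float",
}


def find_known_bad(source: str, known_bad: Dict[str, str]) -> List[str]:
    """Return one message per known-bad pattern found in the source text."""
    hits = []
    for pattern, reason in known_bad.items():
        if re.search(pattern, source):
            hits.append(f"{pattern}: {reason}")
    return hits


def known_bad_gate(source: str, known_bad: Dict[str, str]) -> int:
    """Return 1 (fail) if any known-bad pattern is present, else 0."""
    return 1 if find_known_bad(source, known_bad) else 0
```

Because the result is an exit status rather than a note, the agent cannot declare the task done while the banned pattern is still in the code.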