Does this sound familiar?
- Your AI just fixed a bug. Two weeks later, the exact same bug is back.
- You deploy something, and you have no idea if it actually worked, so you manually test it.
- You've written 100 lines of rules in your config file, but the AI still ignores half of them.
- Every new chat session, you re-explain the same context from scratch.
I ran into all four of these problems while building an internal AI quoting system for a healthcare company, despite having no technical background myself. And after months of debugging, I realized: none of these were model problems. They were Harness problems.
What is Harness Engineering?
Harness Engineering is the discipline of building the scaffolding around your AI — the rules, constraints, verification scripts, and knowledge structures that make it produce consistent, reliable output.
Without Harness, even the best model will drift, forget, and repeat the same mistakes.
The data backs this up: research shows that 80% of Agent quality failures come from Harness gaps, not model limitations. And in one benchmark, the same 15 models all improved significantly when only the Harness changed — not the models themselves.
The problem is: most people don't know what their Harness is missing. They just know something feels broken.
The framework: two dimensions, not six steps
After studying real production failures and building my own system from scratch, I organized Harness Engineering into two dimensions.
Vertical Quality Layers (Q) — required for every project
| Layer | Name | What it solves |
|---|---|---|
| Q1 | SPEC | AI knows what to build, what not to, and how to verify |
| Q2 | Rules + Security | Hard business limits + security red lines, equally mandatory |
| Q3 | Skills | Repetitive workflows standardized with counter-examples |
| Q4 | Scripts (unified gate) | Nothing is "done" until scripts pass |
Horizontal Scale Layers (S) — enable only when needed
| Layer | Name | When to enable |
|---|---|---|
| S1 | Context | Sessions losing coherence after ~20 turns |
| S2 | dev-map + Memory | Project iterating 2+ months, AI re-inventing solutions |
| S3 | Multi-Agent | Single agent consistently failing on long task chains |
The key insight: Q4 is not step four. It's the exit gate for every layer. Code changes, doc updates, multi-agent outputs — all must pass Q4 before anything counts as done.
Most people skip Q4 entirely. That's why the same bug keeps coming back.
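To make the "exit gate" idea concrete, here is a minimal sketch of what a unified gate could look like in a verify.sh-style script. The check names are hypothetical placeholders, not part of Rein; the only point is that every check runs and a single failure makes the whole gate fail.

```shell
#!/bin/sh
# Sketch of a unified Q4 gate. All check names below are hypothetical.

# Stand-in checks: replace the bodies with real commands
# (unit tests, linters, a business baseline call, and so on).
check_service_starts() { true; }
check_pricing_baseline() { true; }

# Run every named check, report each result, and count failures.
run_gate() {
  gate_failures=0
  for check in "$@"; do
    if "$check"; then
      echo "PASS $check"
    else
      echo "FAIL $check"
      gate_failures=$((gate_failures + 1))
    fi
  done
  # Exit status is zero only when every check passed.
  [ "$gate_failures" -eq 0 ]
}

run_gate check_service_starts check_pricing_baseline
```

In a real script the last line would typically end with `|| exit 1`, so that a failing gate stops whatever comes next and the work cannot be reported as done.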
What I built: Rein
Rein is an open-source Skill for Claude Code (and any agent supporting the SKILL.md standard) that acts as a silent Harness Engineering advisor throughout your project.
It watches your conversations for patterns — not keywords — and speaks up only when it detects a real gap. When everything's fine, it stays silent. Silence is a feature.
What it detects automatically:
- Repeated failures (same bug fixed twice → missing Rule or regression test)
- Context loss (re-explaining background every session → incomplete project docs)
- Scale shifts (internal tool going external → time to harden your Harness)
- Cost spikes (API bill climbing → identifies token waste sources)
- Over-engineering (more config, slower shipping → tells you what to delete)
Test results: 97% pass rate across 16 scenarios with Rein vs 52% without.
The biggest gap was in root cause diagnosis: 92% accuracy with Rein, 24% without.
A real example from my project
My verify.sh only checked if the service started. It didn't check if the business logic was correct.
So when the AI "fixed" a pricing calculation bug, it passed my verification — service was running — but the actual calculation was still wrong. Same bug, two weeks later.
After adding a business baseline check (call a known correct quote request, compare against expected output), that class of bug disappeared entirely.
This is Q4. Not just "is the service alive?" but "is the output actually correct?"
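A business baseline check of this kind might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: `fetch_quote` stands in for the real call to the quoting service with a fixed, known-correct request, and `1250.00` is a made-up expected total.

```shell
#!/bin/sh
# Sketch of a business baseline check.
# fetch_quote and the value 1250.00 are hypothetical.

# Stand-in for the real request to the quoting service.
# Here it only echoes a canned total so the sketch runs on its own.
fetch_quote() {
  echo "1250.00"
}

# Compare the service's answer with the expected total for the known request.
check_baseline() {
  expected="$1"
  actual="$(fetch_quote)"
  if [ "$actual" = "$expected" ]; then
    echo "baseline OK: $actual"
    return 0
  fi
  echo "baseline MISMATCH: expected $expected, got $actual" >&2
  return 1
}

check_baseline "1250.00"
```

The important property is that a running service with a wrong calculation now produces a non-zero exit status, which a liveness check alone never does.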
Install
git clone https://github.com/DtoTHEmoon/rein-skill.git ~/.claude/skills/rein
Restart your agent. Rein activates automatically — no commands needed.
Also works with: OpenClaw, Codex CLI, Gemini CLI, Cursor, and any agent supporting SKILL.md.
The core philosophy
Start minimal. Add only when you have a real pain point. And know when to subtract — Rein will tell you when your Harness is getting in your own way.
If your scaffolding is slowing you down, it's time to cut.
GitHub: github.com/DtoTHEmoon/rein-skill
Top comments (2)
"It's not the model" is the single most important realization in agent building, and most people resist it because upgrading the model feels like progress while fixing the harness feels like chores. But repeated mistakes are almost never a model-capability problem - they're a memory/context problem: the agent has no persistent record of "we tried this and it failed," so it cheerfully re-walks into the same wall every session. Same model, same blind spot, forever, because nothing taught it otherwise.
The fix is structural, not a bigger brain: capture failures as durable lessons the agent is forced to see (a learnings file, a check that blocks the known-bad path), so the system improves even though the model doesn't. That "encode the lesson so it can't repeat the mistake" loop is core to how I build with Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - the reliability comes from the harness learning, not from hoping the next model release is smarter. Excellent, contrarian-but-correct post. When you fixed the repeated mistakes, was it mostly persistent memory of failures, or tighter guardrails preventing the bad path? Curious which did more of the work.
Great question — and you're right that memory alone isn't enough. If you only record "we tried X and it failed," the agent will still walk into the same wall the moment the surface changes slightly.
What actually worked for me was structural blocking: moving the lesson from a rule the agent reads to a check the agent cannot pass without satisfying. In practice, that meant adding a check for the known-bad pattern as a verification step in verify.sh: something that exits with status 1 and stops the process cold, not a note in CLAUDE.md that can be silently ignored.
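As a rough sketch of that kind of gate (not the exact check from my project), the function below fails whenever a known-bad pattern shows up anywhere under a directory. The function name, the pattern, and the directory are all hypothetical.

```shell
#!/bin/sh
# Sketch: turn a lesson into a hard gate.
# check_no_known_bad, the pattern, and the directory are hypothetical.

# Fail when any file under the directory contains the known-bad pattern.
# -F treats the pattern as a fixed string, -r searches recursively.
check_no_known_bad() {
  pattern="$1"
  dir="$2"
  if matches="$(grep -rnF -- "$pattern" "$dir")"; then
    # grep exits 0 when it finds a match: the bad pattern is back.
    echo "$matches" >&2
    echo "BLOCKED: known-bad pattern '$pattern' found" >&2
    return 1
  else
    status=$?
    # grep exits 1 for "no match"; a higher status is an error,
    # which also fails the gate instead of passing silently.
    [ "$status" -eq 1 ]
  fi
}
```

Called as something like `check_no_known_bad 'round(price)' src || exit 1` inside verify.sh, the lesson stops being text the agent may or may not follow and becomes a condition for finishing at all.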
The mental model I use now: if the lesson lives in text the agent interprets, it's memory (fragile). If it lives in a script that gates completion, it's mechanism (durable). Rein is built around that distinction — Q2 (rules) is where you write the lesson, Q4 (verification scripts) is where you make it physically impossible to skip.
So to directly answer your question: mechanism did almost all of the work. Memory got me partway there, but the mistake only truly stopped recurring when the bad path had a hard gate in front of it.
Curious how Moonshift handles this — do you encode lessons at the prompt level or at the pipeline gate level?