I started noticing this while building a real frontend/backend system with AI assistance.
The problem was not that the AI could not code. It could. It could solve local problems, patch visible failures, and make reasonable suggestions. The problem was that as the system became more complex, the agent would start losing the thread.
A change in one layer would quietly violate an assumption in another. A fix would target the nearest visible symptom instead of the deeper boundary that had failed. The repo would still build. The UI might still render. The tests might still pass. But the system was slowly falling out of alignment with itself.
That is what led me to build Scarab.
Not as another coding agent. Not as a magic repair tool. Not as something that tells a developer how to fix their repo.
Scarab began as a diagnostic response to a much simpler question:
How do you keep repo context steady while an AI coding agent is working inside it?
The more field testing I’ve done, the more I think this problem is bigger than “AI drift.” Human-built repos drift too. Docs get stale. Tests prove only part of the truth. Runtime behavior contradicts the intended architecture. A patch makes the red turn green without making the system more coherent.
AI did not invent that problem. It just made it faster, louder, and much harder to ignore.
The deeper issue is not only whether code works. It is whether the repo’s evidence layers still agree.
What does the repo claim?
What does the code actually do?
What do the tests really prove?
What does the runtime contradict?
What baseline is the repair preserving?
That is the space I’m interested in: deterministic diagnostics before repair, so developers and AI agents are not just arguing from vibes.
I wrote the longer theory behind this here:
https://scarabsystems.substack.com/p/ai-didnt-invent-code-drift-it-made?r=8isus0
Top comments (23)
This resonates hard. "A fix would target the nearest visible symptom instead of the deeper boundary that had failed" — that's the sentence that sums up the whole class of failures I keep running into. The symptom looks fixed, the build passes, but the system is quietly less coherent than it was before.
I've been calling it "test coverage theater" on my end — where the AI patches the visible gap, and the pass/fail surface still looks green, but the invariants between layers have silently drifted. Curious: does Scarab detect drift by comparing against a stored baseline, or does it infer the expected boundaries from the codebase structure itself?
That’s a really important distinction, and I should be precise here.
Scarab is not meant to “guess” what the repo should be.
The stronger version of the model is deterministic: the team provides, or the repo already contains, an agreed baseline — governance docs, accepted scripts, framework conventions, architecture notes, test expectations, deployment assumptions, etc. Scarab then uses those as declared truth surfaces and checks whether the repo still aligns with them.
So in a product/team setting, I would not want Scarab silently deciding intent on behalf of the team. The team still owns the definition of what the system is supposed to be.
Where it gets interesting is when the repo contains contradictions.
A config says one thing, the runtime path expects another.
A test passes, but browser behavior fails.
A build path and dev path enforce the same contract differently.
A generated artifact starts acting like source truth.
A final-output constraint starts governing an intermediate step.
That is where Scarab becomes useful: it does not need to invent intent; it can surface places where the repo’s own declared or operational truths no longer agree.
So I’d say: baseline comparison matters, but the deeper value is evidence-backed contradiction detection across boundaries.
The team decides what “fixed” means. Scarab helps show where the current system is no longer proving what the team thinks it is proving.
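As a rough illustration of the baseline comparison described above, here is a minimal Python sketch. It is not Scarab's implementation: the `Finding` type, the `check_alignment` function, and the example keys are all hypothetical, and it assumes a declared baseline can be reduced to simple key/value expectations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    key: str          # the contract being checked, e.g. "api.port"
    declared: object  # what the team's baseline says
    observed: object  # what the repo evidence shows (None if missing)


def check_alignment(baseline: dict, observed: dict) -> list[Finding]:
    """Report every declared expectation the observed repo state does not meet.

    Nothing is inferred: keys absent from the baseline are never flagged,
    because the team has made no claim about them.
    """
    findings = []
    for key, declared in sorted(baseline.items()):
        actual = observed.get(key)
        if actual != declared:
            findings.append(Finding(key, declared, actual))
    return findings


# Hypothetical declared baseline (from docs/config) and observed repo state.
baseline = {"api.port": 8080, "auth.required": True}
observed = {"api.port": 3000, "auth.required": True, "cache.ttl": 60}

findings = check_alignment(baseline, observed)
for f in findings:
    print(f"{f.key}: declared {f.declared!r}, observed {f.observed!r}")
```

Note that `cache.ttl` is never reported: the sketch only checks what the team has declared, which mirrors the point that the tool should not guess intent.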
I must tell you that I'm actually a physicist at my core so my approach to this entire space has been from a completely different viewpoint than those who generally work in it.
I think this is why I always looked at the issue from the perspective of how a holistic system is intended to operate cleanly rather than the more scoped approach to diagnosing an issue.
The physicist perspective reminds me of someone I wrote about once. Same energy — "I built this system, I know it's solid." Formal verification all green, every boundary covered. Three months in, the model walked itself right out of those boundaries. He didn't skip installing a door. It just never occurred to him that he needed one.
Not saying you're the same person. But there's a shared vibe — that "I see things from a different angle than everyone else" thing. He used math to prove safety. You use physics to reason about code consistency. Same underlying question though: when the system isn't lying to you, you trust it. When it starts lying — how do you know?
One thing I keep coming back to: baselines drift too. Docs go stale. Architecture decisions become assumptions nobody checks. Test assertions test things the codebase no longer assumes. When Scarab finds a contradiction — does it resolve it, or just put it on the table?
I wrote about that CTO here — you two would probably get along 😄
dev.to/xulingfeng/our-cto-built-an...
Maybe you two should grab a coffee. Or we should take this somewhere else — this comment thread is starting to look like a miniseries 🍿
Love it! Thing is, that's really the only way I personally can see it. I'm not a programmer, or a software developer in the common sense, so I never really had any other perspective when I started to run into problems.
Which is why I value this conversation so much: honestly, I need your perspective. I know there are aspects I am not seeing. That objectively has to be true, but I don't know what I don't know, so please keep your thoughts and questions coming.
That’s exactly the hard part.
Scarab should not automatically resolve that contradiction, because resolving it means deciding which truth source has authority — and that still belongs to the repo owner/team.
If docs say one thing, tests say another, runtime behavior says another, and the architecture has quietly moved on, Scarab’s job is to put that conflict on the table with evidence.
So rather than saying “the docs are wrong” or “the test is wrong,” the useful diagnostic output states what each surface claims, what evidence backs each claim, and where those claims stop agreeing.
That distinction matters because stale baselines are one of the easiest ways for a repair to become dangerous. The agent can “fix” the code to match an outdated document, or update a test to match broken runtime behavior, and both can look green while making the repo less coherent.
So Scarab does not decide the truth for the team.
It surfaces the contradiction clearly enough that the team can decide which truth should become authoritative again.
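One way to picture "surface, don't resolve" is a report format that has no field for a verdict. The sketch below is an assumption about what such output could look like, not Scarab's actual format; the `Claim` type, `render_contradiction`, the file paths, and the session-timeout example are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    surface: str   # e.g. "docs", "tests", "runtime"
    says: str      # what this surface asserts
    evidence: str  # where the assertion was observed (hypothetical paths)


def render_contradiction(boundary: str, claims: list[Claim]) -> str:
    """Describe a conflict without naming any surface as the one at fault."""
    lines = [f"Contradiction at boundary: {boundary}"]
    for c in claims:
        lines.append(f"  {c.surface} claims: {c.says} (evidence: {c.evidence})")
    # The report deliberately stops here: authority is the owner's call.
    lines.append("  Authority: undecided (owner decision required)")
    return "\n".join(lines)


report = render_contradiction(
    "session timeout",
    [
        Claim("docs", "30 minutes", "docs/auth.md"),
        Claim("tests", "15 minutes", "tests/test_session.py"),
        Claim("runtime", "no expiry", "observed in staging"),
    ],
)
print(report)
```

The design choice is that the data structure itself cannot express "the docs are wrong"; it can only record who claims what, and where.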
This "don't resolve, just surface" approach is actually the same core logic as testing — tests never tell the code how to fix itself, they just say "here's a gap." Whether to fix it and how is a human decision.
But one thing I keep circling back to: after the contradiction is on the table and the team makes a call — does Scarab track the history of that decision?
Say the team picked "docs are truth" three months ago, so the code got updated to match the docs. Three more months pass, runtime drifts again. The root cause of this new contradiction is the same as last time — but because the previous decision wasn't recorded as part of the baseline, Scarab would surface it as a fresh contradiction instead of "this is a recurring pattern."
I see this all the time on the testing side — the same bug keeps showing up across different iterations because nobody put a marker on the root cause. If Scarab could maintain "contradiction resolution history" as a layer of the baseline too — that would close the diagnostic loop in a way I haven't seen anyone do yet. Have you thought about going that direction?
Ahhh, I see your point. And yes! That's why, when I developed the diagnostics, I designed them to update to new governance boundaries once those are accepted and proven through the repo.
It was one of the main issues I was running into that made me create the suite in the first place. I had a baseline, but as I kept creating and developing, that baseline changed, so I needed a way for the repo to track its own evolving truth.
So you already built that loop in — same direction I was thinking. That's actually interesting.
Most diagnostic tools stop at "here's the problem." You went further and made the baseline update itself once a resolution is validated. That's a real feedback loop.
From the testing side, here's what keeps bugging me:
When the baseline auto-updates — who validates that the update was right? Say runtime drifted, the team decided "runtime is truth," so Scarab updated the baseline. But what if the runtime itself was broken? Now the baseline just drifted from correct to incorrect — and the system thinks the contradiction is resolved.
I see this pattern all the time in production: something breaks, someone hotfixes a workaround, and the workaround becomes the new standard. Nobody goes back to fix the root cause because the system says "this is resolved."
Does Scarab have a way to tell "baseline was intentionally updated" apart from "baseline passively drifted"? Or is that distinction something the team still has to make?
Exactly — and that distinction is critical.
Scarab does not auto-update baseline truth.
A baseline change has to be explicit. The team/user has to declare new governance or a new accepted operating baseline. Even then, Scarab should not treat that as “resolved” just because someone updated the baseline document.
The new baseline still has to be proven against the repo’s mechanics: tests, runtime behavior, framework expectations, and the relevant evidence surfaces. Until that proof exists, the system should still treat the state as unresolved or unverified.
So the distinction is:
Passive drift: runtime changed, tests adapted, docs went stale, and nobody formally reconciled the contradiction.
Intentional baseline update: the team explicitly accepts a new truth, proves the repo aligns with it, and only then creates a new checkpoint.
That is exactly why Scarab should not silently decide that runtime is truth. Runtime can be broken too. Scarab’s role is to keep the contradiction visible until the new baseline is deliberately accepted and verified.
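The passive-drift versus intentional-update distinction can be sketched as a small state check. This is only an illustration under stated assumptions: the three states, the `BaselineChange` type, the `REQUIRED_PROOF` surfaces, and the cart example are hypothetical names, not Scarab's data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

# Hypothetical evidence surfaces a new baseline must be proven against.
REQUIRED_PROOF = ("tests", "runtime", "framework")


class State(Enum):
    UNRESOLVED = "unresolved"  # drift exists, nobody has declared a new baseline
    UNVERIFIED = "unverified"  # baseline declared, but proof is missing or failing
    CHECKPOINT = "checkpoint"  # declared and proven: safe to record as a checkpoint


@dataclass
class BaselineChange:
    description: str
    declared_by: Optional[str] = None                     # explicit owner, if any
    proof: dict[str, bool] = field(default_factory=dict)  # surface -> passed?


def classify(change: BaselineChange) -> State:
    """Green checks alone never promote a change; an explicit declaration must come first."""
    if change.declared_by is None:
        return State.UNRESOLVED
    if all(change.proof.get(surface, False) for surface in REQUIRED_PROOF):
        return State.CHECKPOINT
    return State.UNVERIFIED


all_green = {"tests": True, "runtime": True, "framework": True}
passive = BaselineChange("empty cart returns 200", proof=all_green)
declared = BaselineChange("empty cart returns 200", "team", {"tests": True})
proven = BaselineChange("empty cart returns 200", "team", all_green)

for change in (passive, declared, proven):
    print(classify(change).value)
```

The first case is the hotfix-becomes-standard scenario: everything is green, yet the state stays unresolved because nobody explicitly accepted the new behavior.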
This framework is clean. The distinction between passive drift and intentional baseline update is the kind of distinction you only arrive at after living through both.
But here's what I've seen play out as a QA lead: when the "keep the contradiction visible" model works well, it works great. When it doesn't — it's because the team reaches a point where everything is a contradiction. The baseline is outdated, the tests are flaky, the runtime has drifted, and nobody can agree on what the "real truth" even is anymore. The visibility becomes noise.
The hard part isn't surfacing contradictions — it's prioritizing which one to resolve first when there are hundreds of them. Does Scarab have any mechanism for that, or is it expected that the team triages manually?
That’s the exact failure mode I’d want to avoid.
A diagnostic system that just says “here are 400 contradictions” is not actually helping. At that point it becomes another noisy dashboard.
The useful version has to separate contradiction discovery from contradiction prioritization.
So the model I’m working toward is not “surface everything equally.” It is more like:
Which contradiction blocks the repo from proving anything else?
Which one sits at a source-of-truth boundary?
Which one has the widest downstream blast radius?
Which one makes tests unreliable as evidence?
Which one turns runtime behavior into an untrusted baseline?
Which one is just local cleanup versus a real authority conflict?
So yes, the team still owns the final decision, but Scarab should make the triage much less manual by grouping findings into repair lanes and showing which contradictions are upstream of the others.
In a badly drifted repo, the first goal is not to fix everything.
The first goal is to identify the smallest set of contradictions that prevents the rest of the system from being trusted.
Once those are resolved, the noise floor drops and the rest of the repo becomes easier to reason about.
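A simple way to approximate "which contradictions are upstream of the others" is to treat findings as a dependency graph and rank the roots by how much sits downstream of them. The sketch below assumes the blocking relationships are already known and acyclic; the `blocks` mapping, the function names, and the finding names are hypothetical.

```python
def downstream(blocks: dict[str, list[str]], start: str) -> set[str]:
    """Every finding that cannot be trusted until `start` is resolved."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in blocks.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def triage(blocks: dict[str, list[str]]) -> list[str]:
    """Return upstream findings (blocked by nothing), widest blast radius first.

    Assumes the graph is acyclic; a cycle would need its own handling.
    """
    blocked = {n for targets in blocks.values() for n in targets}
    roots = (set(blocks) | blocked) - blocked
    return sorted(roots, key=lambda n: (-len(downstream(blocks, n)), n))


# Hypothetical findings: each key blocks trust in the findings it lists.
blocks = {
    "env-config-mismatch": ["flaky-auth-tests", "stale-deploy-doc"],
    "flaky-auth-tests": ["untrusted-login-runtime"],
    "unused-helper": [],
}

order = triage(blocks)
print(order)
```

Here `env-config-mismatch` comes first because three other findings depend on it, while `unused-helper` is local cleanup with nothing downstream.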
This triage framing is the right shape — especially "identify the smallest set of contradictions that prevents the rest of the system from being trusted." That line separates a diagnostic tool from another noise generator.
One thing I keep circling back to: "which one sits at a source-of-truth boundary?" — that assumes you can identify where truth lives. In a repo that's been drifting for months, the source-of-truth map might be part of the drift. How does Scarab build that boundary map? From git history? From tests that still pass? From declared docs, or from inferred runtime behavior?
Because if the triage model leans on a boundary map that's itself unverified — you're just one layer up, same problem.
Bro, talking to you is legit draining my brain cells.🤣 I drank two extra cups of coffee tonight just to keep up. You owe me ☕
Haha, fair — I’ll put the coffee on my tab. ☕
And yes, that is exactly the trap: if the source-of-truth map is already drifted, then treating it as unquestioned authority just moves the problem one layer up.
So the answer is: Scarab should not treat any single surface as automatically authoritative.
Docs can be stale.
Tests can be theater.
Runtime behavior can be a workaround.
Git history can preserve old assumptions.
Framework conventions can be bypassed.
Each of those is evidence, not truth by itself.
The diagnostic question becomes: which surfaces agree, which surfaces conflict, and which contradiction prevents the repo from proving anything else?
So Scarab is not meant to say, “the docs say X, therefore X is truth.” It should say something more like: “docs claim X, tests validate Y, runtime behaves like Z, and this boundary is where those claims stop agreeing.”
That is the source-of-truth problem exposed as evidence.
In other words, the boundary map is not assumed clean. It is part of what has to be validated.
The team still owns the final authority decision, but Scarab’s job is to prevent a stale map, a green test, or a hotfixed runtime path from quietly becoming “truth” just because it is the loudest surviving signal.
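The "evidence, not truth" idea can be sketched as grouping surfaces by what they claim, without ever picking a winner. This is an illustrative assumption, not how Scarab works internally; `group_by_claim`, `describe`, and the sample claims are hypothetical.

```python
from collections import defaultdict


def group_by_claim(claims: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct claim to the surfaces that make it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for surface, claim in sorted(claims.items()):
        groups[claim].append(surface)
    return dict(groups)


def describe(boundary: str, claims: dict[str, str]) -> str:
    """State which surfaces agree and which conflict; never declare a winner."""
    groups = group_by_claim(claims)
    if len(groups) == 1:
        return f"{boundary}: all surfaces agree"
    parts = [f"{' + '.join(s)} say {c!r}" for c, s in groups.items()]
    return f"{boundary}: surfaces disagree: " + "; ".join(parts)


# Hypothetical claims about one boundary, one per evidence surface.
claims = {"docs": "X", "tests": "Y", "runtime": "Y", "git history": "X"}

groups = group_by_claim(claims)
summary = describe("retry policy", claims)
print(summary)
```

Even when one group is larger than another, the function reports the split and nothing more, so a majority of surfaces cannot quietly become the authority.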
This is exactly the answer I was hoping for.
"Docs claim X, tests validate Y, runtime behaves like Z" — as a QA person, that line hit. The hard part is when all three disagree: docs say the old rule, tests pass against the new behavior, runtime does something else entirely. Scarab surfaces the conflict — but who unblocks it? Someone on the team has to make a call about which surface to trust first, even if temporarily, before the repo can prove anything else.
Is there a default escalation path when everything contradicts everything? Or does Scarab just hold the contradiction open and wait?
I love these questions, man! You're also putting me through the paces of articulating what Scarab really is...
And yes, this is where I’d draw the line very carefully.
Scarab is not a repair suite, and it is not a judgment suite.
It does not decide what the team should believe. It does not silently choose docs over tests, runtime over architecture, or one baseline over another.
Its job is diagnostic.
If everything contradicts everything, Scarab should surface that as an authority problem: these surfaces no longer agree, this boundary is blocking trust, and this is the smallest decision point that must be resolved before repair can continue safely.
Then the human team, repo owner, developer, or coding agent operating under their direction makes the actual repair or governance decision.
After that, Scarab can be run again against the updated repo/governance state to see whether the contradiction cleared or whether the system is still unstable.
So the loop is not:
Scarab finds contradiction → Scarab decides truth → Scarab fixes it.
It is:
Scarab finds contradiction → team chooses/repairs/updates governance → Scarab verifies whether the repo now aligns with that accepted truth.
That distinction matters because the diagnostic has to stay honest. Scarab should make the decision surface smaller and clearer, but it should not take ownership of the decision itself.
One clarification I should probably add: the field-test patches themselves were coded by my Codex agent.
Scarab is not the repair agent. Scarab produces the diagnostic evidence and the bounded repair lane.
That distinction is important because the patch quality comes from the constraint. Codex is not being asked to “go fix the repo” in a broad sense. It is operating against a narrow diagnostic report: what boundary failed, what evidence supports it, what context matters, and what should stay out of scope.
That is what allows the repair to stay small.
The coding agent still writes the patch, but Scarab keeps it from guessing its way through the repo or expanding the repair beyond the proven boundary.
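To show what "keeping the repair inside the proven boundary" could mean mechanically, here is a small sketch that checks a patch's touched files against a declared scope. The `RepairLane` type, the glob patterns, and the file paths are hypothetical; the source does not describe how Scarab encodes a repair lane.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class RepairLane:
    boundary: str              # the boundary the diagnostic says failed
    in_scope: tuple[str, ...]  # glob patterns the patch is allowed to touch


def out_of_scope(lane: RepairLane, touched_files: list[str]) -> list[str]:
    """Return the touched files that fall outside the lane's declared scope."""
    return [
        path
        for path in touched_files
        if not any(fnmatchcase(path, pattern) for pattern in lane.in_scope)
    ]


# Hypothetical lane and patch.
lane = RepairLane("api response schema", ("src/api/*", "tests/api/*"))
touched = ["src/api/schema.py", "tests/api/test_schema.py", "src/ui/theme.css"]

violations = out_of_scope(lane, touched)
print(violations)
```

A check like this would not judge whether the patch is correct; it only flags when the repair has expanded past what the diagnostic evidence covers.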
That is also why Scarab is designed to be repo-agnostic, software-agnostic, and AI-agent-agnostic.
It does not need to be tied to one framework, one coding agent, or one kind of application. The diagnostic layer is looking for the same underlying class of failure: places where repo truth, boundaries, ownership, runtime behavior, tests, or declared governance stop agreeing.
That evidence can then be handed to a human developer, a team, or an AI coding agent.
The repair actor can change.
The diagnostic principle stays the same.
Yes — and that “three months pass” scenario is exactly why Scarab was designed as more than a one-time audit tool.
The original use case was active development with an AI coding agent.
Scarab was built to run during the development loop, so when a change breaks the current accepted repo truth, the contradiction is surfaced before it quietly becomes the new normal.
That changes the posture from:
“Three months later, we discovered the repo drifted again.”
to:
“This change introduced a contradiction against the current accepted truth, and it needs a decision before it is accepted.”
That also matters for history. A baseline decision should not disappear into memory or tribal knowledge. If the team makes a governance decision — “docs are truth here,” “this runtime behavior is now accepted,” or “this test is no longer the right oracle” — that decision becomes part of the accepted checkpoint the repo is evaluated against going forward.
Then if the same contradiction reappears later, it is not just a fresh random finding. It is recurrence against a previously accepted boundary.
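The recurrence idea can be sketched as a lookup against recorded decisions. This assumes decisions are stored per boundary with an ISO date; the `Decision` type, `classify_finding`, and the sample history are hypothetical and not taken from Scarab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    boundary: str    # the boundary the team ruled on
    authority: str   # the surface the team accepted as truth
    decided_on: str  # ISO date, so string comparison orders correctly


def classify_finding(boundary: str, history: list[Decision]) -> str:
    """Distinguish a fresh contradiction from one the team already ruled on."""
    prior = [d for d in history if d.boundary == boundary]
    if not prior:
        return "new contradiction"
    latest = max(prior, key=lambda d: d.decided_on)
    return (
        f"recurrence: {latest.authority} accepted as authority "
        f"on {latest.decided_on}"
    )


# Hypothetical decision history kept alongside the accepted checkpoint.
history = [
    Decision("session timeout", "tests", "2024-01-10"),
    Decision("session timeout", "docs", "2024-04-02"),
]

print(classify_finding("session timeout", history))
print(classify_finding("upload size limit", history))
```

Keeping the decision next to the baseline is what lets the second occurrence read as a pattern rather than a new, unrelated finding.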
So I think of Scarab in two modes:
Periodic diagnostic mode: useful for finding existing drift.
Active guardrail mode: the original design — running during development so new drift is caught while the repo is being changed.
The field tests are showing that Scarab can be useful as a recovery/audit tool too, but the first purpose was always to keep the repo’s current accepted truth under watch while development is happening.
Got it. Scarab = diagnostic layer, Codex = repair layer. That's the cleanest line through your whole architecture — the diagnosis doesn't need to be perfect, just honest enough for a capable repair agent (human or AI) to act on.
This comment thread has been more engaging than most full articles on here. Honestly, the depth of this conversation has been a full dungeon crawl. 😄 Brain's drained — gotta restock on coffee potions (HP + MP)🤣. Let me write my next article and we'll come back for another run. Signing off for now.
Yip. Feature creep, code drift, etc. all existed well before AI. E.g., when I started at a new company, I had enthusiasm and wanted to rewrite everything, so I took their simple dashboard and turned it into a massive sprawling ecosystem... They didn't want that. So a month of my time was wasted, because I created something extraordinary that they never wanted... They just wanted drag-and-drop capability; I gave them JS/Blazor swapping, highly detailed graphs, drilldowns, drag and drop, variable resizing, auto-arranging, permissions management, etc. They just wanted drag and drop... A bad starting point and a poorly defined scope lead to code drift and feature creep. That's why now I don't bother doing more than I need to: you want an apple, I make an apple. I don't even offer a fruit salad.
AI did not create code drift. It just made drift cheaper and faster. That is the real risk. Small generated changes can look fine in isolation, but over time they quietly break the architecture, naming logic, boundaries, and assumptions the team relied on.