AI didn’t invent code drift. It made it harder to ignore.

This resonates hard. "A fix would target the nearest visible symptom instead of the deeper boundary that had failed" — that's the sentence that sums up the whole class of failures I keep running into. The symptom looks fixed, the build passes, but the system is quietly less coherent than it was before.

I've been calling it "test coverage theater" on my end — where the AI patches the visible gap, and the pass/fail surface still looks green, but the invariants between layers have silently drifted. Curious: does Scarab detect drift by comparing against a stored baseline, or does it infer the expected boundaries from the codebase structure itself?

That’s a really important distinction, and I should be precise here.

Scarab is not meant to “guess” what the repo should be.

The stronger version of the model is deterministic: the team provides, or the repo already contains, an agreed baseline — governance docs, accepted scripts, framework conventions, architecture notes, test expectations, deployment assumptions, etc. Scarab then uses those as declared truth surfaces and checks whether the repo still aligns with them.

So in a product/team setting, I would not want Scarab silently deciding intent on behalf of the team. The team still owns the definition of what the system is supposed to be.

Where it gets interesting is when the repo contains contradictions.

A config says one thing, the runtime path expects another.

A test passes, but browser behavior fails.

A build path and dev path enforce the same contract differently.

A generated artifact starts acting like source truth.

A final-output constraint starts governing an intermediate step.

That is where Scarab becomes useful: it does not need to invent intent; it can surface places where the repo’s own declared or operational truths no longer agree.

So I’d say: baseline comparison matters, but the deeper value is evidence-backed contradiction detection across boundaries.

The team decides what “fixed” means. Scarab helps show where the current system is no longer proving what the team thinks it is proving.
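To make "contradiction detection across boundaries" a bit more concrete, here is a minimal sketch of the idea. It is not Scarab's actual implementation; the `Claim` type, the `find_contradictions` helper, and every file path and value below are hypothetical and exist only for illustration. Each truth surface contributes claims about the same key, and anything the surfaces disagree on is reported with its evidence, without picking a winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    surface: str   # where the claim was found, e.g. "config", "runtime", "docs"
    key: str       # what the claim is about
    value: str     # what that surface says
    evidence: str  # pointer a human can follow to verify the claim


def find_contradictions(claims):
    """Group claims by key and keep only the keys where surfaces disagree."""
    by_key = {}
    for claim in claims:
        by_key.setdefault(claim.key, []).append(claim)
    return {
        key: group
        for key, group in by_key.items()
        if len({c.value for c in group}) > 1
    }


# Hypothetical claims gathered from three truth surfaces.
claims = [
    Claim("config", "api_port", "8080", "config/app.yaml"),
    Claim("runtime", "api_port", "3000", "server startup log"),
    Claim("docs", "api_port", "8080", "docs/deploy.md"),
    Claim("config", "node_version", "20", ".nvmrc"),
    Claim("docs", "node_version", "20", "README.md"),
]

contradictions = find_contradictions(claims)
for key, group in contradictions.items():
    print(f"contradiction on {key}:")
    for c in group:
        print(f"  {c.surface} says {c.value} ({c.evidence})")
```

Note that the sketch never decides whether the config or the runtime is right. It only shows that the repo's own surfaces no longer agree.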

I must tell you that I'm actually a physicist at my core, so my approach to this entire space has come from a completely different viewpoint than that of the people who generally work in it.

I think this is why I have always looked at the issue in terms of how the whole system is intended to operate cleanly, rather than taking the more narrowly scoped approach of diagnosing a single issue.

The physicist perspective reminds me of someone I wrote about once. Same energy — "I built this system, I know it's solid." Formal verification all green, every boundary covered. Three months in, the model walked itself right out of those boundaries. He didn't skip installing a door. It just never occurred to him that he needed one.
Not saying you're the same person. But there's a shared vibe — that "I see things from a different angle than everyone else" thing. He used math to prove safety. You use physics to reason about code consistency. Same underlying question though: when the system isn't lying to you, you trust it. When it starts lying — how do you know?
One thing I keep coming back to: baselines drift too. Docs go stale. Architecture decisions become assumptions nobody checks. Test assertions test things the codebase no longer assumes. When Scarab finds a contradiction — does it resolve it, or just put it on the table?
I wrote about that CTO here — you two would probably get along 😄
dev.to/xulingfeng/our-cto-built-an...
Maybe you two should grab a coffee. Or we should take this somewhere else — this comment thread is starting to look like a miniseries 🍿

Love it! Thing is, that's really the only way I personally can see it. I'm not a programmer or a software developer in the common sense, so I never really had any other perspective when I started to run into problems.

Which is why I value this conversation so much, because honestly I need your perspective. I know there are aspects I am not seeing. That objectively has to be true, but I don't know what I don't know, so please keep your thoughts and questions coming.

That’s exactly the hard part.

Scarab should not automatically resolve that contradiction, because resolving it means deciding which truth source has authority — and that still belongs to the repo owner/team.

If docs say one thing, tests say another, runtime behavior says another, and the architecture has quietly moved on, Scarab’s job is to put that conflict on the table with evidence.

So rather than saying “the docs are wrong” or “the test is wrong,” the useful diagnostic output is more like:

this baseline is declared here
this behavior contradicts it here
this test is proving a different assumption
this runtime path appears to have become the actual operating truth
these are the repair/update lanes depending on which authority the team chooses

That distinction matters because stale baselines are one of the easiest ways for a repair to become dangerous. The agent can “fix” the code to match an outdated document, or update a test to match broken runtime behavior, and both can look green while making the repo less coherent.

So Scarab does not decide the truth for the team.

It surfaces the contradiction clearly enough that the team can decide which truth should become authoritative again.

This "don't resolve, just surface" approach is actually the same core logic as testing — tests never tell the code how to fix itself, they just say "here's a gap." Whether to fix it and how is a human decision.
But one thing I keep circling back to: after the contradiction is on the table and the team makes a call — does Scarab track the history of that decision?
Say the team picked "docs are truth" three months ago, so the code got updated to match the docs. Three more months pass, runtime drifts again. The root cause of this new contradiction is the same as last time — but because the previous decision wasn't recorded as part of the baseline, Scarab would surface it as a fresh contradiction instead of "this is a recurring pattern."
I see this all the time on the testing side — the same bug keeps showing up across different iterations because nobody put a marker on the root cause. If Scarab could maintain "contradiction resolution history" as a layer of the baseline too — that would close the diagnostic loop in a way I haven't seen anyone do yet. Have you thought about going that direction?
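As a rough sketch of what a "contradiction resolution history" layer might look like (the `Resolution` type and `label` function are hypothetical names I am making up here, and matching on an exact root-cause string is a simplifying assumption):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolution:
    root_cause: str        # marker placed on the root cause when it was resolved
    chosen_authority: str  # which truth source the team picked at the time


def label(root_cause, history):
    """Say whether a new contradiction is fresh or a repeat of a past one."""
    prior = [r for r in history if r.root_cause == root_cause]
    if not prior:
        return "fresh contradiction"
    return (
        f"recurring pattern (seen {len(prior)} time(s) before, "
        f"last authority: {prior[-1].chosen_authority})"
    )


# Hypothetical history: one earlier decision where the docs won.
history = [Resolution("port defined in two places", "docs")]

repeat = label("port defined in two places", history)
fresh = label("generated file edited by hand", history)
print(repeat)
print(fresh)
```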

Ahhh, I see your point, and yes! That's why, when I developed the diagnostics, I designed them to update the governance boundaries once new ones are accepted and proven through the repo.

It was one of the main issues I was running into that made me create the suite in the first place. I had a baseline, but as I was creating and developing, that baseline changed, so I needed a way for the repo to track its own evolving truth.

So you already built that loop in — same direction I was thinking. That's actually interesting.
Most diagnostic tools stop at "here's the problem." You went further and made the baseline update itself once a resolution is validated. That's a real feedback loop.
From the testing side, here's what keeps bugging me:
When the baseline auto-updates — who validates that the update was right? Say runtime drifted, the team decided "runtime is truth," so Scarab updated the baseline. But what if the runtime itself was broken? Now the baseline just drifted from correct to incorrect — and the system thinks the contradiction is resolved.
I see this pattern all the time in production: something breaks, someone hotfixes a workaround, and the workaround becomes the new standard. Nobody goes back to fix the root cause because the system says "this is resolved."
Does Scarab have a way to tell "baseline was intentionally updated" apart from "baseline passively drifted"? Or is that distinction something the team still has to make?

Exactly — and that distinction is critical.

Scarab does not auto-update baseline truth.

A baseline change has to be explicit. The team/user has to declare new governance or a new accepted operating baseline. Even then, Scarab should not treat that as “resolved” just because someone updated the baseline document.

The new baseline still has to be proven against the repo’s mechanics: tests, runtime behavior, framework expectations, and the relevant evidence surfaces. Until that proof exists, the system should still treat the state as unresolved or unverified.

So the distinction is:

Passive drift: runtime changed, tests adapted, docs went stale, and nobody formally reconciled the contradiction.

Intentional baseline update: the team explicitly accepts a new truth, proves the repo aligns with it, and only then creates a new checkpoint.

That is exactly why Scarab should not silently decide that runtime is truth. Runtime can be broken too. Scarab’s role is to keep the contradiction visible until the new baseline is deliberately accepted and verified.
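One way to picture the distinction is as a small classification rule. This is a sketch under my own assumptions, not how Scarab actually stores state: `BaselineChange`, `classify`, and the evidence surface names are hypothetical. A change only becomes a checkpoint when it was explicitly declared and every evidence surface backs it up.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BaselineChange:
    description: str
    declared_by_team: bool
    # Evidence surface name -> whether the repo was proven to align with the change.
    proofs: Dict[str, bool] = field(default_factory=dict)


def classify(change: BaselineChange) -> str:
    """Separate passive drift from declared-but-unproven and proven updates."""
    if not change.declared_by_team:
        return "passive_drift"
    if not change.proofs or not all(change.proofs.values()):
        return "unverified"
    return "checkpoint"


# Nobody declared anything; the runtime simply changed.
drifted = BaselineChange("runtime now binds port 3000", declared_by_team=False)

# The team declared a new baseline, but runtime behavior has not been proven yet.
declared = BaselineChange(
    "port 3000 is the accepted baseline",
    declared_by_team=True,
    proofs={"tests": True, "runtime": False},
)

# Declared and proven on every evidence surface.
proven = BaselineChange(
    "port 3000 is the accepted baseline",
    declared_by_team=True,
    proofs={"tests": True, "runtime": True, "framework": True},
)

for change in (drifted, declared, proven):
    print(classify(change), "-", change.description)
```

Editing the baseline document alone lands in the "unverified" state here, which matches the point above: a declaration without proof is not a resolution.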

This framework is clean. The distinction between passive drift and intentional baseline update is the kind of distinction you only arrive at after living through both.

But here's what I've seen play out as a QA lead: when the "keep the contradiction visible" model works well, it works great. When it doesn't — it's because the team reaches a point where everything is a contradiction. The baseline is outdated, the tests are flaky, the runtime has drifted, and nobody can agree on what the "real truth" even is anymore. The visibility becomes noise.

The hard part isn't surfacing contradictions — it's prioritizing which one to resolve first when there are hundreds of them. Does Scarab have any mechanism for that, or is it expected that the team triages manually?

That’s the exact failure mode I’d want to avoid.

A diagnostic system that just says “here are 400 contradictions” is not actually helping. At that point it becomes another noisy dashboard.

The useful version has to separate contradiction discovery from contradiction prioritization.

So the model I’m working toward is not “surface everything equally.” It is more like:

Which contradiction blocks the repo from proving anything else?

Which one sits at a source-of-truth boundary?

Which one has the widest downstream blast radius?

Which one makes tests unreliable as evidence?

Which one turns runtime behavior into an untrusted baseline?

Which one is just local cleanup versus a real authority conflict?

So yes, the team still owns the final decision, but Scarab should make the triage much less manual by grouping findings into repair lanes and showing which contradictions are upstream of the others.

In a badly drifted repo, the first goal is not to fix everything.

The first goal is to identify the smallest set of contradictions that prevents the rest of the system from being trusted.

Once those are resolved, the noise floor drops and the rest of the repo becomes easier to reason about.
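A simple way to sketch "which contradictions are upstream of the others" is a dependency graph. Everything below is an illustrative assumption on my part: the graph, the contradiction names, and the `triage` heuristic are hypothetical, and ranking roots by downstream count is only an approximation of the "smallest set" idea, not a proof of minimality.

```python
def downstream(graph, start):
    """Return every contradiction reachable from start (its blast radius)."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def triage(graph):
    """List contradictions with nothing upstream, widest blast radius first."""
    children = {c for targets in graph.values() for c in targets}
    roots = (set(graph) | children) - children
    # Sort by blast radius (descending), then by name for a stable order.
    return sorted(roots, key=lambda n: (-len(downstream(graph, n)), n))


# Hypothetical graph: an edge A -> B means B cannot be trusted until A is resolved.
graph = {
    "config_vs_runtime_port": ["e2e_tests_hit_wrong_port", "deploy_doc_stale"],
    "e2e_tests_hit_wrong_port": ["coverage_report_untrusted"],
    "generated_client_edited_by_hand": ["api_types_mismatch"],
    "unused_lint_rule": [],
}

order = triage(graph)
for name in order:
    print(name, "blocks", len(downstream(graph, name)), "other finding(s)")
```

In this toy graph, the port conflict comes first because three other findings cannot be trusted until it is settled, while the lint rule is local cleanup that can wait.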

This triage framing is the right shape — especially "identify the smallest set of contradictions that prevents the rest of the system from being trusted." That line separates a diagnostic tool from another noise generator.
One thing I keep circling back to: "which one sits at a source-of-truth boundary?" — that assumes you can identify where truth lives. In a repo that's been drifting for months, the source-of-truth map might be part of the drift. How does Scarab build that boundary map? From git history? From tests that still pass? From declared docs, or from inferred runtime behavior?
Because if the triage model leans on a boundary map that's itself unverified — you're just one layer up, same problem.
Bro, talking to you is legit draining my brain cells. 🤣 I drank two extra cups of coffee tonight just to keep up. You owe me ☕