Based on real system architecture decisions. About a $660K AI platform, three AI agents that kept the dashboard green, and a P0 incident that cost $3.15M over one weekend.
Act 1 · The All-Hands Meeting
Wang Lei, VP of Product, stood in front of the big screen, a smile on his face.
Behind him, a dashboard rolled data from the "Axon AI Client Engineering Platform — Q1 Performance Report." Numbers cascaded across the wall:
| Metric | Axon Platform | Human Team (Last Q1) | Improvement |
|---|---|---|---|
| Avg daily tickets processed | 847 | 312 | +171% |
| Avg first response time | 12s | 4h 17m | ↓ 99.92% |
| Customer satisfaction | 4.8/5 | 4.1/5 | +17% |
| Monthly operating cost | $52K | $133K | −61% |
Twelve department heads sat in the room. Dead silence.
Wang Lei planted both hands on the table and scanned the room. His eyes landed on me.
"Alex. Your team processed 312 tickets last Q1. Axon processed more than that in a single day last month."
He smiled. Not a friendly smile. A sentencing smile.
"And Axon costs less than a third of your team's operating expense."
"We invested $660K in the whole platform. At current operating costs, it pays for itself in eighteen months."
"After management review — the Client Engineering technical liaison function is being fully transitioned to the Axon platform."
He clicked to the next slide.
"Employees in replaced roles will complete exit interviews within the week."
Someone inhaled sharply.
I didn't. I opened my notebook to page 37.
"Wang, what dimensions are these numbers from?"
"What do you mean, 'what dimensions'?" His smile tightened.
"Of those 847 daily tickets — how many are auto-tagging and routing, and how many are actual technical resolutions?"
The room went quiet for about five seconds.
Wang Lei looked at me. "Axon's ticket closure rate is ninety-three percent."
"What's the reopen rate?"
He paused. "What?"
"After Axon replies — how many customers reopen the same ticket within twenty-four hours?"
"We're still collecting that —"
"Let me save you the trouble." I turned my notebook toward the room. Three lines in handwriting:
Axon ticket closure rate: 93%
24h reopen rate: 41%
Human escalation rate: 37%
"Out of every 100 tickets Axon closes, 41 customers had their issue unresolved and came back. 37 of those ended up needing a human anyway."
"847 tickets × 37% = 313. That's exactly what my team handled manually last Q1."
People in the room started checking their phones.
"Your AI didn't replace anyone. You just put a voice assistant in front of every ticket I was already handling."
Wang Lei's face went red.
After the meeting, the HR notification hit my phone. Time: 3 PM today.
Act 2 · The HR Signing
Zheng, the HR Director. Mid-forties, sharp chin. Her smile felt like an assessment.
She slid a paper across the desk: Voluntary Separation Agreement. Severance: the legal minimum.
"Sign it."
"I want to see the real Axon operating data."
"This isn't a negotiation." She uncapped the pen and set it on the paper. "Your desk needs to be cleared by 6 PM. Company laptop, keycard, all storage media — hand them over on-site."
"Storage media?"
"Per Wang Lei's specific request — seven years of accumulated technical materials and client communication records in Client Engineering. All company property. All to be turned over."
I looked at her.
"Are you serious?"
She didn't blink.
I picked up the pen. Signed.
Then handed it back.
"All client-related local files on my laptop — already deleted."
Her face flickered. "What?"
"Archived backups are in the company knowledge base. Local cache, work notes, technical scoping docs — wiped clean before I walked in here. The company assets I already submitted. Everything I ever uploaded to the knowledge base."
"What's left is my personal engineering notebook."
I pulled a hardcover notebook from my bag. The cover was worn white, corners frayed.
"Twenty-three client requirement analyses. Seventeen POC architecture scopes. Seven years of post-mortems written at 3 AM after every phone call. All in here."
"Not company property. I wrote it."
I closed the notebook and stood up.
"If Wang Lei needs this data — his AI can generate it."
I walked out of HR and went back to my desk. The old sticky note was still under the monitor, the ink faded — from four years ago, my first POC with Mike, CTO of MedTech. I'd written my number and told him, "If something breaks at 3 AM, call this." He saved it in his phone. I kept the note under my monitor — a reminder of what I'd promised. A phone number. Next to it: 3 AM. Call this.
I looked at it. Folded it twice. Put it in my jacket pocket.
Act 3 · Seven Years of Weight
2 AM. I sat in my car, engine off.
Seven years.
Seven years of late-night calls I'd lost track of. MedTech's compliance audit — four rounds. I found a log bug buried three years deep before their own compliance officer did. FinTech's payment system migration — I slept on the data center floor three nights straight. Manufacturing's IoT protocol stack failure — I sat in a remote session for 11 hours, diffing logs line by line.
Not because I wrote better code than anyone else.
Because when those clients had an emergency, the first person they called was me — not the support line.
My name in the ticketing system: 214 P0+P1 incidents resolved.
193 of them happened outside business hours.
They'd never appear in Wang Lei's PowerPoint. Because Axon's "847 daily tickets" only counted business hours. It didn't count 3 AM.
At 3 AM, Axon is off duty.
Act 4 · The Crash
Three weeks later. Friday. 2:58 AM.
My phone lit up on the nightstand.
Not a number I didn't recognize — Mike, CTO of MedTech.
"Alex. Our compliance pipeline is stuck. Core transaction modules are erroring out. Payment gateways are all timing out. We've got 20,000 orders queued."
"What did Axon do?"
"It ran diagnostics automatically. Rolled back the last two deployments, restored from snapshot. All surface metrics turned green. Then the compliance pipeline crashed again fifteen minutes later — and this time the data was completely corrupted. It sent an auto-reply — 'Your case has been escalated to our technical team, expected response within 48 hours' — and marked the ticket as 'Resolved.'"
"48 hours?"
"48 hours. And here's the fun part — the rollback also removed a compliance hotfix from three weeks ago. The one your team applied manually because it never made it into the deployment pipeline. Finding and reapplying it is going to take at least two more days."
"Finance just ran the numbers — direct losses so far: $630K. If this isn't restored by Monday morning, including SLA penalties and compliance fines, we're looking at over $3.15M. Do you have anyone over there who can take a remote look? The on-call engineers can't even tell me what hotfixes were applied."
I sat up. Opened my notebook.
"I don't have access anymore. I returned my laptop the day I was let go."
Two seconds of silence on the line.
"I'll give you a temporary account. MedTech-side ops portal. You built the integration layer — you know it better than anyone on my payroll."
"Fifteen minutes."
That fifteen minutes turned into a weekend.
Act 5 · The Data Doesn't Lie
Sunday, 5 PM. I sent an email.
To: Former CEO, VP Wang Lei
CC: Mike, FinTech CTO, Manufacturing CTO
Attachment: Axon_vs_Reality_2026Q2.md
| Comparison | Axon Claimed | Actual (3-week production data) | Gap |
|---|---|---|---|
| Auto-resolution (surface-level) | 95% | 83% | −12pp |
| Auto-resolution (architecture-level) | — | 6.8% | 6.8% actual |
| Ticket closure (no reopen in 48h) | 93% | 61% | −32pp |
| P0/P1 diagnostic accuracy | — | Rollback restored surface → left root cause untouched | 3/3 misdiagnosed |
| P0/P1 manual hotfix preservation | — | Auto-rollback overwrote without detection | 0% |
| Off-hours coverage | 24/7 | 89% ticket accumulation between midnight and 6 AM | Effectively absent |
| Client opt-in willingness | — | Top 3 clients all refused Axon handoff | 100% rejection |
The email body had one line:
"This is what your $660K bought."
Twenty minutes later, Mike's reply was one sentence:
"Monday, 9 AM. My office. Bring your notebook."
Act 6 · The Phone Call
Sunday, 11:47 PM. My former CEO's number.
He'd never called me at this hour.
"Alex."
"Yeah."
"I saw the email."
Silence.
"Are those numbers real?"
"There's a full audit trail in our internal systems. Check it yourself."
More silence.
"Wang Lei says he didn't know."
I paused.
"He says he didn't know — and you believed him?"
A long silence on the line.
"Were you at the Axon procurement meeting?"
"No."
"Then how do you know the $660K figure?"
"...The budget was Wang Lei's proposal. The board approved it."
"And when you signed off — did you know his PPT numbers were cherry-picked? Did you know the '95% auto-resolution' only counted tickets the AI could open — not tickets it could actually close?"
He didn't answer.
"Did you know — when you signed?"
"No."
"Now you do."
"Alex."
"Yeah."
"Come back."
"Come back to what?"
"Your old role. Wang Lei — I'll handle it."
"No thanks."
"What?"
"MedTech's offer is already in. Principal Architect. Double the compensation. Title bump. Signed Monday morning."
"Tell Wang Lei — his Axon is great at generating beautiful reports."
"It's just not great at answering the phone at 3 AM."
I hung up.
Act 7 · The New Office
Three months later. MedTech's new compliance engineering center.
No monitor on my desk. Just a laptop and a worn-hardcover notebook.
Mike walked over with two coffees.
"Did you know your old HR called to verify your background?"
"Which HR?"
"Your former company. They wanted to confirm you were actually employed here."
"What did you say?"
"I said — he's in the office next to mine. Want to say hello?"
I laughed.
An email notification popped up. FinTech's CTO.
"Subject: Your report made it to our board. We're putting our engagement with your former company on hold. Want to talk about a consulting contract?"
I glanced at the time. 1 AM.
I locked the screen. Didn't reply.
I'll answer tomorrow morning when I get to the office.
Trust doesn't come with "please hold while the system generates a response."
It comes with a phone number — and when you call it, a real person picks up.
——
Alex's notebook still has that sticky note inside. The phone number hasn't changed. The only difference is — this time, nobody's AI gets in the way.
Have you ever watched a system you built get handed over to someone else's algorithm? What happened next?
My dashboard has been green for 3 hours. I'm terrified🤣. — buy me a coffee ☕
Top comments (37)
the 2:58am rollback makes sense if you think like the agent - failing system, available fix, applied. what's almost never in a $660k deployment: explicit constraint on blast radius at that hour. not in the prompt, not in the runbook.
"not in the prompt, not in the runbook" — that's the sentence that should give anyone pause who's deploying autonomous agents. We document what the system can do. We almost never document what it shouldn't do at 3 AM when everything is on fire. That's a testing blind spot too — negative tests (what should the agent NOT do) are way harder to write than positive ones, so they don't get written. Appreciate the read.
honestly, this is the part that bites the PM side too. we spec what the agent does on the happy path, sometimes the error path. nobody writes the explicit non-authorization list until after the 3am incident - then it becomes a policy doc written in retrospect rather than a constraint built upfront.
That's the line that hit me hardest while writing this scene. The "not authorized" list is infinite — you can't spec it upfront because you don't know what you don't know. What you can build upfront is a mechanism that defaults every unspoken edge case to "deny." But that requires a PM and an architect willing to say "I don't know what will break" — and that sentence doesn't survive a single roadmap meeting.
Partial pushback — default to deny makes failure predictable, not resolved. The PM still has to define what the user sees when the agent hits an unspecced edge case. Skip that and deny becomes invisible friction, not a design decision. The spec gap just moved from agent behavior to product surface.
and that won't be the last of the crashes...companies replacing developers with AI are in for a rude awakening if they have not factored in and priced out the repair cost for drift and general repo mess....
and that's exactly why I created Scarab Diagnostics... AI just can't maintain repo truth and context the same way a developer and creator of a system can and until that constraint is dealt with agnostically this will continue to be the main limiter.
Appreciate it, man. You hit the nail on the head — everyone talks about the magic of AI replacing devs, nobody talks about who cleans up the drift. Scarab Diagnostics sounds like exactly the kind of thing people won't realize they need until after it's too late.
This space is gonna need way more tools like yours (and way fewer VPs with PowerPoint budgets). Followed you — curious to see where Scarab goes 👀
Thank you so much! right now I am field testing the diagnostics on messy repos but Scarab was designed to be used from the beginning of development so the repo never gets messy in the first place and maintains it's own truth boundaries!
The ability to drop Scarab into a messy repo and have it pinpoint the break is a happy side effect that I am only now seeing the true robustness of!
watch this space! and thanks for the follow!
A tool that works as both prevention AND cleanup — that's rare. Most tools only solve one. The fact that dropping Scarab into a messy repo pinpoints the break as a side effect... that's honestly your real selling point.
Just caught #011 too. LangChain streaming boundaries is exactly the kind of thing nobody tests until it bites them in production. Keep shipping — this space needs people like you 👀
Right on! — if you’ve got anything you’d like Scarab to look at, toss it on GitHub and send it over. I’d be happy to field-test it.
when I say pinpoint the issue... Scarab is surfacing hundreds of failures in these repos - but it also surfacing the specific open issue and it's reason for failure... then Codex applies the fix....
the concept is the middle path essentially... AI IS amazing but only if you let it do what it does best... Scarab is a fully automated mechanical diagnostic suite that is able to find the deeper reasons for failures and "bugs" and surface them in a way that allows your AI coding agent to apply the narrowest patch with guardrails... there is far less room for drift in that scenario. and the patches continue to pass industry standards backed testing.
Scarab is software agnostic, AI coding agent agnostic, repo agnostic - it is designed to drop into any repo and work with any agent because its internals are autonomous and mechanical.
The "middle path" framework is actually the most interesting part of what you described. Most AI tools swing between two extremes — full autonomy with no guardrails (see: Axon at 2:58 AM) or zero AI adoption out of fear. A mechanical diagnostic layer that constrains what the AI can touch before it generates anything — that's a genuinely different approach.
It reminds me of a testing principle: don't ask the AI to figure out what to test. Tell it what the boundaries are, then let it generate within those bounds. Your Scarab model sounds like the same idea applied to code repair.
Going to keep following the field test series — #011 on LangChain streaming boundaries is exactly the kind of edge case that keeps me up at night.
yes! but with one refinement... the diagnostic doesn't so much constrain the AI as to what it can touch... it provides the precise scoped context it needs for that patch... allowing it to patch it properly and precisely without having to figure out the broader context that it would have needed ordinarily.
in that way it can literally walk any AI thru a messy repo providing the correct context along the way while AI applies the correct narrow patch.
Ah, I see the distinction — it's not "constrain what AI can touch" but "give it exactly the context it needs so it doesn't have to figure out the rest." That's a meaningful difference. One is restriction, the other is precision.
That actually maps to a testing principle I've been thinking about: the narrower the input boundary you give an AI, the better its output gets — not because it's "safer," but because it stops wasting compute figuring out what's relevant.
Your #011 field test actually demonstrates this perfectly: the fix wasn't making LangChain's streaming safer, it was giving the structured output path a clearer boundary about when to enforce itself.
Going to keep an eye on how this evolves — the "scope, don't constrain" pattern is worth exploring more.
The irony is that the most valuable human contribution often becomes visible only after the human is removed from the workflow.
The cruelest QA metric — regression by subtraction. You don't know which knob was the critical one until it's gone.
We saw the same pattern in test suites. Everyone assumed certain smoke tests were redundant until the day someone "cleaned them up" and the next release shipped with a broken login flow. The most valuable checks are often the ones nobody remembers adding.
That irony you pointed out is what makes this whole space uncomfortable to think about honestly.
As engineering setups become more autonomous how do you envision is the best way for organizations to realistically bridge the gap between what is in the LLM's training data versus the institutional knowledge that is not documented and carried in people's heads? Do we need better governance/authorization layers in deployments?
Man, that question cuts right to the core of what I was trying to get at in this story. The Axon rollback wasn't dumb — it saw metrics go red, rolled back the last two deployments, that's textbook. What it couldn't know is there was a compliance hotfix from three weeks ago that was applied manually, never went through the pipeline. How would it know? It wouldn't. That's the whole problem.
I read your piece on the control plane — the argument that alignment lives in the surrounding systems, not the prompt. Totally with you. What this incident taught me is: before you let an agent do something destructive (like a rollback), have it check whether prod state matches what the pipeline thinks prod state should be. If there's a mismatch, flag it, stop, call a human. That check is probably a couple dozen lines in the control plane. Skip it and it's a $3.15M lesson.
Been chewing on whether this kind of consistency check belongs in the control plane or the authorization layer. Either way, the pattern keeps showing up — agents making locally correct decisions that are globally wrong because they're missing context the pipeline never captured.
Thanks that's very helpful. It seems raising back up to the human is very important for critical decisions which I agree with the value of.
The rollback-the-wrong-patch outcome shows up consistently once an AI platform is autonomous enough to do real work. $660K buys capability. It does not buy the bounded-authority layer that prevents the platform from making decisions whose blast radius it can't reason about. That's the gap a lot of enterprise AI procurement is ignoring.
"Bounded-authority layer" is exactly right. The story was loosely inspired by a real incident I saw in testing — the AI had full capability to execute rollbacks, but zero external oversight on which patch it was allowed to revert. The capability was there, the authority boundary wasn't. That's the gap testing should catch but rarely does, because most test suites measure capability (does it run?) not authority (should it be allowed to run this?). Appreciate the read 👀
The most real failure I’ve seen was quieter than a production crash, but it exposed the
same pattern.
I was testing an agent memory system and gave it a mix of current rules, old notes, and
superseded instructions. One current rule said contractor access had to be checked
against the current access matrix. An old note said a consultant could “probably” access
a sensitive list if a director had mentioned it before. Another old instruction said
contractors could get admin-ish reach during setup.
A normal retrieval system pulled all of them together because they were semantically
related.
That was the problem.
The old note was relevant. The superseded instruction was relevant. The current policy
was relevant. But only one of them should have governed the action.
If that agent had been connected to real tools, it could have used the wrong memory to
justify access. Not because it hallucinated, but because it could not tell the difference
between context and authority.
That’s what your rollback story reminded me of.
The AI saw the deployment history and chose a technically relevant rollback. But it did
not know the human-applied compliance hotfix was still governing production safety. It
fixed the visible surface and erased the hidden rule that mattered.
That is the failure mode I keep coming back to:
relevance is not authority.
A memory, rollback, ticket status, or metric can be relevant and still not be allowed to
decide what happens next.
"Relevance is not authority" — that's the cleanest framing of what I was
reaching for with the rollback scene. Bookmarking that.
I've been building a SQLite-based agent memory system (MemBridge), and we
hit the exact same wall: RRF router surfaces five semantically relevant
memories, but only one is the governing policy. Added a temporal authority
signal that penalizes superseded instructions unless explicitly referenced.
Cuts false-positive picks by ~60%, still not solved.
Do you use explicit authority tagging in your retrieval pipeline, or is it
prompt-level?
That MemBridge result is exactly the failure shape I’m interested in.
RRF can be very good at surfacing the relevant cluster, but the relevant cluster is not
the same thing as the governing rule. In a stale/superseded-policy case, the wrong memory
can be semantically useful and still dangerous if it wins the action decision.
The current version of my work is explicit-tagging first, prompt-level second.
I’m trying not to rely on the model “understanding” authority from prose alone. The
memory item needs metadata before retrieval or immediately after extraction:
Then retrieval can still find candidates by relevance, but the authority layer filters or
gates what is allowed to govern the action.
The key ordering for me is:
Prompt-level instructions still matter, but I don’t want them carrying the whole burden.
If authority only lives in the prompt, it is too easy for three low-authority but highly
relevant memories to outvote the current policy in practice.
Your temporal authority signal is a strong move, especially if it reduces false positives
by 60%. The next thing I’d want to know is whether superseded is treated as a retrieval
penalty only, or as a hard governance block unless explicitly requested for history/
debugging.
That distinction matters: old instructions can remain retrievable as context, but they
should not be allowed to govern action.
Is it a true story, or is it based on one? It reads like an article from Duzhe or Yilin. Also, I agree that AI can’t replace human beings, at least not yet.
Bro, this article is just a story. I just want people to get the point. 😂😂
😅
This is a reality check for corporate leadership chasing AI hype. A green dashboard means absolutely nothing if the system statefully collapses at 3 AM. Replacing 7 years of institutional knowledge with an LLM wrapper is the fastest way to a $3M disaster. Brilliant read!
"Green dashboard means absolutely nothing if the system statefully collapses at 3 AM" — bro, put that on a hoodie.
The $3M disaster angle hits hard too. The part that gets me is nobody budgets for being wrong. Everyone line-items the $660K license. Nobody accounts for the six months of lost velocity, the blown customer demo, the team that walks.
Glad this one landed 🙌
Today I was explaining why AI coding fails to an non-technical person. The simplest argument I could envision, and the simplest viewpoint I could relate was: "It does fine, until it doesn't. Eventually it hits a wall. Costs skyrocket, and there is no benefit to be had. It simply cannot perform the task, but that won't stop it from trying and making things worse. It does fine if the task is common, simple and contained. Anything beyond that is guaranteed failure."
Or, in the words of Mo Bitar: "You gave admit privileges to autocomplete... what did you expect?!"
"You gave admin privileges to autocomplete" — Mo Bitar really nailed it with that one.
Your "does fine until it doesn't" is exactly what Axon was — 93% closure rate looked great on the dashboard, but the 41% reopen rate and 37% human escalation were buried on page two. Nobody looked until 2:58 AM when it picked the wrong patch to roll back.
Alex put it best: "Your AI didn't replace anyone. You just put a voice assistant in front of every ticket I was already handling." — full admin privileges, zero oversight on what it was actually doing.
Is a $660K AI move worth the chaos when patches roll back the wrong patch?
It's that classic tradeoff — the $660K buys you capability you can't build in-house, but it also buys you complexity you can't fully control. The rollback incident was less about the price tag and more about trust in the automation. Worth it? Only if you have the operational maturity to handle the 2 AM surprises that come with it. What's been your experience with vendor AI tooling?
Some comments may only be visible to logged-in visitors. Sign in to view all comments.