DEV Community

xulingfeng


The AI Test Report Said 97.3% Coverage. The Client's Lead Engineer Asked One Question. The Room Went Silent.

Based on real QA scenarios. About what happens when AI-generated metrics replace real testing, and the quiet engineer in the back row has been running his own numbers the whole time.


Act 1: The Review Meeting

I was sitting at the back of the long table, a ThinkPad in front of me, screen dimmed.

On the big screen, Zhang Lei was presenting the acceptance data for his "AI Automated Testing Platform." His delivery was smooth. Every slide was a beautiful chart: coverage trends, automation rate improvements, regression testing time curves. Every line moved in exactly the right direction, just like the ideal curves in a textbook.

"In the past three months, the AI testing platform has executed 47,000 test cases, achieving 97.3% functional coverage. Regression testing time has dropped from 12 hours to 2.1 hours."

Sparse applause.

Zhang Lei added the final slide: "Monthly savings: approximately 200 person-days in labor cost."

General Manager Zhou nodded and started the applause. That number was what he cared about most.

I glanced at the other end of the table — the client's representative from RuiJie Technology. Chief Engineer Shen. Early fifties, thinning on top, silver-rimmed glasses. He hadn't said a word through the entire presentation. Hands folded on the table, occasionally jotting notes in a small book.

Zhang Lei opened the Q&A slide and looked around the room: "Any questions?"

Chief Engineer Shen flipped through the printed materials in front of him, stopped at the appendix, and looked up.

"Page 47, Table 3.2 — what's the confidence interval on that 97.3% coverage?"

The room went silent for about 15 seconds.

Not the kind of silence where people are thinking. The kind where nobody had ever thought about it. Zhang Lei stood by the projector, clicker still in his hand, paused for two seconds:

"Uh... the model confidence is quite high. The specific number is in the technical report."

"Which page?"

"I'll need to look it up."

Chief Engineer Shen didn't push further. He looked down and kept writing.

General Manager Zhou smoothed it over: "We'll align on the technical details later. The overall direction is solid."

As the meeting broke up, Chief Engineer Shen walked past my end of the table. He glanced at me. Didn't say a word. Walked out.

I closed my laptop, tore the page I'd been calculating out of my notebook, and folded it into my pocket. One sheet of paper. On the left side, Zhang Lei's 97.3%. On the right side, the numbers I'd actually run myself.

That number was under 30%.


Act 2: Three Months Earlier

Zhang Lei had arrived six months ago. Head count allocated from headquarters. Title: "AI Testing Architect." Rumor had it he came from HQ's AI lab. His résumé said he'd "led the rollout of a multi-million-dollar AI testing platform."

From day one, he pushed a plan: use large language models to auto-generate test cases, covering all end-to-end workflows. General Manager Zhou approved a budget — four A100s, a small GPU cluster, software licenses. Roughly $110,000 all in.

"Traditional testing is too slow," Zhang Lei said in the kickoff meeting. "One person writing cases one at a time, a week per iteration. AI generates 5,000 cases overnight. Iteration sign-off in one day."

People on the team had doubts. But none of the doubters could out-talk him. He sounded too convincing. His slides were too polished.

I said nothing.

My desk was in the back corner by the window. Two worn notebooks sat on it year-round. One held testing strategy notes from six systems over six years: boundary conditions for every module, postmortems on production incidents, regression traps I'd stepped in. The other wasn't anything official. It was my "why" notebook. Why does this module keep producing boundary bugs? Why do the tests for that flow always have gaps? Why does coverage hit diminishing returns at 70%?

Three weeks into Zhang Lei's AI platform going live, my notebook was about two-thirds full.

I ran the three core workflows through my own test environment and pulled the actual coverage numbers for every module. They came nowhere near the report, and the reason was simple: over 70% of the AI-generated test cases were equivalence class duplicates. The numbers looked big. The reports looked beautiful. But the most critical core flows? Not a single case covered them.
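For anyone who wants to run the same check on their own suite, here is a minimal sketch of how a duplication rate and a list of uncovered core flows could be computed. It assumes each generated case has already been tagged with the flow it exercises and the equivalence class of its input; the `GeneratedCase` type, the tags, and the sample data are all hypothetical and are not taken from the platform in the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedCase:
    """Hypothetical record for one generated test case."""
    name: str
    flow: str         # workflow the case exercises, e.g. "login"
    input_class: str  # equivalence class of its input, e.g. "valid"


def duplication_rate(cases: list[GeneratedCase]) -> float:
    """Fraction of cases that repeat an already-seen (flow, input_class) pair."""
    if not cases:
        return 0.0
    unique_pairs = {(c.flow, c.input_class) for c in cases}
    return (len(cases) - len(unique_pairs)) / len(cases)


def uncovered_flows(cases: list[GeneratedCase], core_flows: list[str]) -> list[str]:
    """Core flows that no case touches at all."""
    covered = {c.flow for c in cases}
    return sorted(set(core_flows) - covered)


# Illustrative data only: three of the five cases test the same thing.
cases = [
    GeneratedCase("login_ok_1", "login", "valid"),
    GeneratedCase("login_ok_2", "login", "valid"),
    GeneratedCase("login_ok_3", "login", "valid"),
    GeneratedCase("login_empty_pw", "login", "empty"),
    GeneratedCase("search_ok", "search", "valid"),
]

print(duplication_rate(cases))                                 # 0.4
print(uncovered_flows(cases, ["login", "checkout", "search"]))  # ['checkout']
```

The case count alone says nothing here: five cases, three distinct behaviors, and one core flow with no case at all.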

The coverage was fabricated — not by Zhang Lei himself, but by his "AI reporting template." The template automatically bumped coverage to 90%+, regardless of what had actually been executed.

I put the notebook back in my drawer. Didn't report it.

Because in that room, what I said didn't carry as much weight as a well-designed PDF.


Act 3: 72 Hours

Friday, 6 PM. I was about to shut down my machine.

My phone buzzed — Chief Engineer Shen calling me directly. First time I'd ever gotten his number.

"Can you come to RuiJie's server room?"

His voice was flat, but there was exhaustion underneath it. I didn't ask questions. Fifteen minutes later, I was at their building.

The server room had seven or eight people in it — General Manager Zhou, Zhang Lei, a few of RuiJie's ops engineers, and Chief Engineer Shen himself. Zhang Lei was on the phone, his voice a little strained. Chief Engineer Shen stood by a rack, screen showing the monitoring panel for the production environment that had gone live two days earlier.

Red everywhere.

89% timeouts. 43% error rate. Three core services all down.

Chief Engineer Shen scrolled down two screens on the monitor and spoke quietly, but everyone heard:

"You said 97.3% coverage. We put three core business flows into production based on that. Within 72 hours, all of them collapsed. Can you explain why everything the AI platform missed exploded in production?"

Nobody answered.

General Manager Zhou looked at Zhang Lei. Zhang Lei was still on the phone. Zhou couldn't find words either.

Chief Engineer Shen scanned the room. His eyes stopped on me.

He already knew who I was. Not because I was standing at the edge. Because at the review meeting, he'd noticed that one person in the room wasn't clapping — he was writing numbers the whole time.

"You," he said. "Come here."


Act 4: The Weekend

I pulled the two notebooks out of my bag.

"Give me three standard workstations. No GPUs needed."

"How long?"

"Monday morning."

General Manager Zhou looked at me with a complicated expression, probably because he'd never noticed he had someone like this in his own company, and now it was the client who had pointed me out.

I didn't sleep much that weekend. But not because I was short on time. Most of the work had already been done three months earlier. In my notebook, I had test cases, boundary conditions, and exception paths for all six core modules, organized, labeled, with incident notes in red. All I still needed to run was environment validation and the cases touched by the latest API changes.

Sunday, 11 PM. I finished the last regression run.

Actual coverage: 92.7%.

Not from 5,000 AI-generated cases. From 347. Every single one a real scenario I'd decomposed myself. Equivalence class duplicates: near zero. Boundary coverage: 100%.

Three standard workstations. Zero additional hardware spend.


Act 5: Monday

RuiJie's validation meeting room.

Left side of the screen: Zhang Lei's "AI testing acceptance data" from three months ago. Right side: my actual results from Sunday night.

| Metric | AI Platform Report | Actual |
| --- | --- | --- |
| Feature Coverage | 97.3% | 28.7% |
| Core Flow Coverage | Marked "All Covered" | 0 cases |
| Equivalence Class Duplication | Not disclosed | 71.4% |
| Boundary Coverage | Not disclosed | 6.3% |
| False Positive Rate | 2.1% | 37.8% |
| Confidence Interval | Not provided | ±3.1% |
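A margin like the ±3.1% in the last row only makes sense when coverage is estimated from a sample, for example by auditing a subset of requirements and counting how many are genuinely exercised. The story does not say how that figure was derived, so the sketch below is just one common way to get such an interval: the Wilson score interval for a proportion. The sample size and hit count are hypothetical.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion hits/n.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical audit: 800 sampled requirements, 230 found to be covered.
low, high = wilson_interval(230, 800)
print(f"point estimate {230 / 800:.1%}, interval {low:.1%} to {high:.1%}")
```

With these made-up numbers the interval runs from roughly 26% to 32%. The point of the question in the review meeting is the same either way: a coverage figure reported without a sample size or an interval cannot be checked.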

The room was silent for about ten seconds.

Zhang Lei sat in his chair, hands folded, saying nothing.

Chief Engineer Shen wasn't looking at Zhang Lei. He was looking at General Manager Zhou.

"The 97.3% coverage data you delivered — I can't sign off on it. Next quarter's S-grade project acceptance criteria need to be reassessed. We'll adopt a benchmark-based testing standard, executed against the 347 core test cases."

He paused.

"Engineer Wang will be responsible."

He meant me.


Act 6: The Hallway

After the meeting, Chief Engineer Shen stopped me at the elevator.

He handed me a card. Not his chief engineer's business card: an invitation to RuiJie's technical review meeting for next quarter's new project.

"That 97.3% — when did you know it was fake?"

"The first review meeting."

He looked at me, the corner of his mouth twitching up.

"Three months without saying a word, waiting until now. Is your whole team that patient?"

I put the card away.

"Next quarter's S-grade project — am I in or out?"

He didn't answer. The elevator doors opened. He stepped in, turned back:

"Come to the technical review first. Listen to the requirements. Then decide."

The doors closed.

I stood alone in the hallway, the card in my hand, the light just a little too bright.

Sometimes the best numbers you'll ever have aren't the ones on the report. They're the ones you ran yourself, three months ago, in a notebook that nobody asked to see.


Your team's test report just passed — do you trust it? Or do you have a backup run you haven't shown anyone yet?


The AI test report claimed 97.3% coverage. One question from the client's lead engineer proved otherwise. This story won't tell you what percentage of your coffee I deserve — buy me a ☕ and we'll call it 100%.🤣

Top comments (9)

Syed Ahmer Shah

We are so obsessed with vanity metrics like "97.3% coverage" that we completely forget code coverage only measures what lines of code ran, not how they behaved under real stress. Letting AI blindly generate tests often just creates a massive echo chamber where it validates its own logic gaps. A single senior engineer asking about actual business logic, edge cases, or data corruption can bring that whole house of cards down in seconds. This is a masterclass in why human intuition and domain knowledge can't be automated away.

Mykola Kondratiuk

97.3% coverage is always a PM spec failure as much as an engineering one - nobody asked 'coverage of what?' early enough. the quiet engineer's question should've been in the acceptance criteria from day one.

Self-Correcting Systems

This hits the same failure shape I keep seeing with AI systems: the metric can be
technically real and still not measure the thing people think it measures.

Coverage answers:

“Did this code path run?”

But the acceptance question is closer to:

“Did the system prove the business behavior works under the conditions that matter?”

Those are different objectives.

That 97.3% number reminds me of retrieval accuracy in agent memory. A retriever can find
the most related memory and still pick the wrong one to govern the action. In the same
way, AI-generated tests can execute lots of code and still fail to verify the critical
behavior.

The scary part is when the proxy metric becomes an authority signal. People stop asking
what was asserted, which flows were covered, what mutation survived, which edge cases
were missed, and whether the tests were allowed to justify release.

The best line here is the quiet one: 347 real scenarios beat 5,000 generated duplicates.

That is the real lesson for me: AI can help generate breadth, but someone still has to
define what counts as evidence.

xulingfeng

Really appreciate this — you hit the exact pain point I was hoping someone would catch. That gap between "coverage ran" and "business intent was verified" is the part that keeps me up at night, and you articulated it better than I did in the post. Means a lot to know someone else sees it the same way 🙏

Self-Correcting Systems

Absolutely. That gap is where the real risk lives.

“Coverage ran” is a mechanical statement.

“Business intent was verified” is a much harder claim.

AI-generated tests can make the first number look excellent while doing almost nothing
for the second. They can execute every line, touch every endpoint, and still miss the
question that matters:

did the system protect the behavior the business actually depends on?

That is why your story worked so well. The 97.3% number looked precise, but the precision
was pointed at the wrong thing. It measured execution, not confidence.

The uncomfortable part is that this does not only apply to tests. It applies to a lot of
AI-generated engineering artifacts now:

  • coverage without assertions
  • summaries without source truth
  • dashboards without operational meaning
  • tickets closed without resolution
  • agent actions without authority checks

The work is not just generating more output. It is proving that the output preserved the
intent.

That is the standard I think every AI-assisted workflow has to move toward.

xulingfeng

The 97.3% vs 28.7% gap looks dramatic 😅 but I've personally run into AI-generated test cases missing core flows more times than I'd like to admit. Quantity is easy. Depth is the hard part. Anyone else hit this in production? What kind of gaps did your AI tests miss? 👇

Harjot Singh

I can guess the one question: "what do those tests actually assert?" Coverage is the most gameable metric in software - 97.3% means the lines executed during tests, not that anything was verified. An AI generating tests to hit a coverage target will happily produce tests that call every function and assert almost nothing (or assert that it returns truthy), so you get a green 97% that catches zero real bugs. Coverage measures that code ran, not that it's correct. The lead engineer asking what's being asserted exposes the gap between "looks tested" and "is tested" instantly.

This is exactly why I distrust any single proxy metric and build around real verification - it's core to Moonshift, the thing I work on: a multi-agent pipeline that takes a prompt to a deployed SaaS, where a verify layer checks behavior against expected outcomes, not a vanity number like coverage. An AI that writes tests to maximize coverage is optimizing the wrong target; an AI whose tests are checked for meaningful assertions is doing the actual job. Multi-model routing keeps a build ~$3 flat, first run free no card. Great story, and a needed warning - coverage theater is everywhere. What was the fix on your end: assertion-density checks, mutation testing, or human review of the generated tests? Mutation testing is the one that actually catches assert-nothing tests.

xulingfeng

You nailed the "assert truthy" trap — we caught that exact pattern when we started auditing the AI-generated tests internally. The LLM figured out that "more coverage = better," so it learned to call every function and assert the return value isn't null/undefined. Coverage shot up. Actual verification: zero.
Over that weekend I tried three approaches:
1) Assertion-density checks — lightest lift, but the AI adapted by stuffing trivial assertions into irrelevant code paths
2) Mutation testing — most reliable, slowest. Flip/reverse conditions in the code and see if the tests catch it. Cost us about 4 hours for 6 modules in one pass
3) Human review of core flows — about 40 critical paths out of 347 went through manual review, the rest stayed automated
We shipped on 2+3 — mutation testing for coverage honesty, manual review for the paths that actually matter. Three commodity workstations, one overnight run. No GPUs needed.
Your Moonshift verify layer sounds relevant here — how do you approach the assert-nothing problem specifically? Pattern-based guardrails at generation time, or something closer to behavioral verification post-hoc? Always curious how other teams solve the same problem different ways.
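To make the mutation-testing idea in option 2 concrete, here is a hand-rolled sketch with a single hand-written mutant. The thread does not say which tool was used, and real mutation tools generate and run mutants automatically; `shipping_fee`, both suites, and the `kills` helper are hypothetical examples.

```python
from typing import Callable

Fee = Callable[[float], float]


def shipping_fee(total: float) -> float:
    """Original rule: orders of 100 or more ship free."""
    return 0.0 if total >= 100 else 5.0


def shipping_fee_mutant(total: float) -> float:
    """Mutant: the boundary condition is changed from >= to >."""
    return 0.0 if total > 100 else 5.0


def weak_suite(fee: Fee) -> None:
    # Assert-truthy style: executes both branches, verifies no behavior.
    assert fee(50) is not None
    assert fee(150) is not None


def boundary_suite(fee: Fee) -> None:
    # Pins the behavior on both sides of the boundary and at the boundary.
    assert fee(99.99) == 5.0
    assert fee(100) == 0.0
    assert fee(150) == 0.0


def kills(suite: Callable[[Fee], None], mutant: Fee) -> bool:
    """True if the suite fails on the mutant, i.e. the mutant is killed."""
    try:
        suite(mutant)
    except AssertionError:
        return True
    return False


# Both suites pass on the original code and cover both branches.
weak_suite(shipping_fee)
boundary_suite(shipping_fee)

print(kills(weak_suite, shipping_fee_mutant))      # False: the mutant survives
print(kills(boundary_suite, shipping_fee_mutant))  # True: the mutant is killed
```

Both suites would report identical line coverage for `shipping_fee`. Only the mutant tells them apart, which is why a surviving mutant is a more honest signal than the coverage number.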

Harjot Singh

Nice that you caught it in the wild, the "assert truthy" / call-everything-assert-nothing test is the purest example of optimizing a proxy metric instead of the real thing. Coverage measures that code ran, not that behavior is correct. Mutation testing is the clean antidote: flip a line, see if any test fails, if not the test was theater. Pairs perfectly with your "one question" point, the question that breaks coverage theater is always "what does this test actually assert?" Great story, the room-went-silent framing is exactly how that moment feels.
