Why Your AI Agent Needs a Quality Gate (Not Just Tests)

Your AI agent can write code, deploy it, and even test it. But who decides if the output is actually good?

I ran into this problem while building Spell Cascade — a Vampire Survivors-like action game built entirely with AI. I'm not an engineer. I use Claude Code (Anthropic's AI coding assistant) and Godot 4.3 to ship real software, and the whole point is that the AI handles development autonomously while I sleep.

The problem? My AI agent would make a change, run the tests, see green checkmarks, commit, and move on. The tests passed. The code compiled. The game launched.

And the game was unplayable.

Zero damage taken in 60 seconds. Level-ups every 3.9 seconds (the "fun" range for Vampire Survivors-style games is 10-30 seconds). A difficulty rating the automated evaluator scored as "TOO_EASY."

All tests passing. All quality gone.

That's when I realized: tests verify correctness. Quality Gates verify value.


The Gap Between "Working" and "Good"

Here's a concrete example of the difference:

| Check | What It Asks | Type |
| --- | --- | --- |
| Unit test | "Does the fire spell deal the right damage?" | Correctness |
| Integration test | "Does the spell hit enemies and trigger XP drops?" | Correctness |
| Quality Gate | "Is the game actually fun to play for 60 seconds?" | Value |

The first two are binary. Pass or fail. The third one is a judgment call — and that's exactly why most CI/CD pipelines don't have one.

When a human developer ships code, there's an implicit quality gate running in their head. They play the game. They feel the pacing. They notice when something is off. When an AI agent ships code at 3 AM while you're asleep, that implicit gate doesn't exist.

You need to make it explicit.


The Setup: An Autonomous Game Testing Pipeline

Before I explain the Quality Gate, here's the pipeline it lives in.

Spell Cascade is a top-down action game where players survive waves of enemies while collecting spells and upgrades. Think Vampire Survivors, but built by someone who can't write code.

The autonomous testing pipeline:

  1. xvfb (X Virtual Framebuffer) runs a headless display — no monitor needed
  2. SpellCascadeAutoTest.gd — a GDScript bot that auto-plays for 60 seconds, navigates menus, picks random upgrades, presses WASD to move, and collects telemetry
  3. results.json — structured output: fires, level-ups, damage taken, HP timeline, enemy density samples, level-up timestamps
  4. quality-gate.sh — reads results.json, compares against thresholds, outputs GO/CONDITIONAL/NO-GO

The bot isn't smart. It mashes buttons and picks random upgrades. That's the point. If a random bot can't have a reasonable experience in 60 seconds, a real player won't either.
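
For reference, here's roughly what the results.json telemetry looks like. This is a trimmed, illustrative sample rather than the exact schema: the field names mirror the checks described below, and the numbers echo the "best run" covered later.

{
  "pass": true,
  "total_fires": 142,
  "level_ups": 4,
  "damage_taken": 16,
  "lowest_hp_pct": 0.66,
  "avg_levelup_interval": 16.8,
  "levelup_timestamps": [8.0, 24.8, 41.6, 58.4],
  "peak_enemies": 33,
  "avg_enemies": 16.3,
  "enemy_density_samples": [2, 6, 11, 19, 27, 33]
}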

The whole thing runs with one command:

quality-gate.sh

And it exits with code 0 (ship it) or code 1 (don't ship it).
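
That exit code is the whole integration surface. A minimal wrapper for the autonomous loop might look like this (illustrative; the project's actual orchestration may differ):

# Commit only when the gate exits 0 (GO or CONDITIONAL).
if ./quality-gate.sh; then
    git add -A && git commit -m "autonomous balance pass (quality gate: OK)"
else
    echo "Quality gate says NO-GO, leaving the change uncommitted for review" >&2
fi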


The 3-Tier Architecture

I didn't start with 3 tiers. I started with 20 candidate checks, narrowed to 6, then grouped them into 3 tiers. The grouping matters because not all failures are equal.

Tier 1: Stability (Hard Gate)

Question: "Did the game even work?"

This tier is non-negotiable. If any check fails, the verdict is NO-GO immediately. No point evaluating balance if the game didn't boot.

| Check | Threshold | Why |
| --- | --- | --- |
| Game pass | pass == true | AutoTest completed without fatal errors |
| Spells fired | total_fires >= 1 | Core combat loop is functioning |
| Level-ups | level_ups >= 1 | Progression system is working |

If total_fires is 0, it means the player couldn't use abilities. That's not a balance issue — that's a broken game. Tier 1 catches this and stops the pipeline cold.
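
In script form, Tier 1 is just a few jq lookups and a hard exit. A minimal sketch, assuming the field names from the sample telemetry above:

# Pull the stability values out of results.json and bail out on any failure.
PASS_VAL=$(jq -r '.pass' results.json)
TOTAL_FIRES=$(jq -r '.total_fires' results.json)
LEVEL_UPS=$(jq -r '.level_ups' results.json)

if [[ "$PASS_VAL" != "true" || "$TOTAL_FIRES" -lt 1 || "$LEVEL_UPS" -lt 1 ]]; then
    echo "TIER1: FAIL (broken build), skipping balance and regression checks"
    exit 1
fi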

Tier 2: Balance Band (Soft Gate)

Question: "Is the game worth playing?"

This is where it gets interesting. Tier 2 has four sub-checks, and the build needs to pass 3 out of 4 to get a GO. Passing 2 out of 4 gives a CONDITIONAL — the AI can commit but should flag the issue.

One exception: if the Difficulty Ceiling check fails (player died), it's an automatic NO-GO regardless of the other three. A player dying in the first 60 seconds of a Vampire Survivors-like is a hard dealbreaker.

Sub-check 1: Difficulty Floor

"Is the game too easy?"

min_damage_taken: 1

If the player takes zero damage in 60 seconds, the enemies might as well not exist. This was exactly the problem with my early builds — the quality evaluator flagged "TOO_EASY" but nothing stopped the AI from committing.

Sub-check 2: Difficulty Ceiling

"Is the game too hard?"

min_lowest_hp_pct: 0.10
must_survive_60s: true

The player's HP should never drop below 10% in the first minute. If it does, new players will quit. If the player actually dies (HP = 0%), the build is NO-GO no matter what else looks good.

Sub-check 3: Pacing

"Does progression feel right?"

min_avg_interval: 8.0s
max_avg_interval: 35.0s
min_gap_between_levelups: 2.0s

This one caught my biggest "tests pass, game sucks" moment. Average level-up interval was 3.9 seconds. That means the player was getting an upgrade menu every 4 seconds — constant interruption, no flow state possible. The pacing check enforces a band: not too frequent (menu fatigue), not too rare (boredom).

The burst check (min_gap_between_levelups: 2.0s) catches a subtler issue: even if the average is fine, two level-ups within 2 seconds of each other feels broken.
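
Both numbers fall out of the level-up timestamps. A sketch of the computation, assuming a sorted levelup_timestamps array with at least two entries (field name as in the sample telemetry, not necessarily the real one):

# Derive the average interval and the tightest gap between consecutive level-ups.
jq '.levelup_timestamps as $t
    | [range(1; $t|length) as $i | $t[$i] - $t[$i-1]] as $gaps
    | {avg_interval: ($gaps | add / length),
       min_gap:      ($gaps | min),
       pacing_pass:  (($gaps | add / length) >= 8.0
                      and ($gaps | add / length) <= 35.0
                      and ($gaps | min) >= 2.0)}' results.json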

Sub-check 4: Density

"Are there enough enemies on screen?"

min_peak_enemies: 5
min_avg_enemies: 3

A Vampire Survivors-like with 2 enemies on screen is a walking simulator. The density check ensures the screen feels alive. These thresholds are intentionally low — the early game should ramp up gradually, not overwhelm the player from the very first second.
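
If the bot's periodic enemy counts land in an array like enemy_density_samples (again, an assumed field name), the check reduces to a max and a mean:

# Peak and average on-screen enemies from the sampled counts.
jq '.enemy_density_samples
    | {peak: max,
       avg: (add / length),
       density_pass: (max >= 5 and (add / length) >= 3)}' results.json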

The 3/4 Rule

Why 3 out of 4 instead of 4 out of 4?

Because game balance is messy. A run where the bot happens to dodge everything (damage = 0) but has great pacing, density, and ceiling is probably fine. Demanding perfection would create false negatives and slow down the autonomous loop.

But 2 out of 4 is a yellow flag. Something is meaningfully off.

Tier 3: Regression (Baseline Comparison)

Question: "Is this build worse than the last known good one?"

Every time the gate says GO, it saves the current results.json as the new baseline. The next run compares against it.

warn_threshold_pct: 25
nogo_threshold_pct: 50

If peak enemy count drops by more than 25% compared to baseline, the gate warns. More than 50%? NO-GO.

This catches the sneaky regressions. Your AI agent "fixes" a bug in the spawn system. Tests pass. But peak enemies dropped from 33 to 7. Without Tier 3, that ships.
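
In the gate script this comparison lives in a small compare_metric helper (it shows up in the core-logic excerpt below). Its internals aren't shown here, but a percentage-drop version could look like this (a guess, not the project's implementation):

# Percent drop of the current value relative to the baseline (integer math).
compare_metric() {
    local current=$1 baseline=$2
    [[ "$baseline" -eq 0 ]] && { echo 0; return; }
    echo $(( (baseline - current) * 100 / baseline ))
}

compare_metric 7 33   # peak enemies fell from 33 to 7 -> 78, far past the 50% NO-GO line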


Real Results: 26 Runs Over a Single Day

Here's what the gate produced across 26 unique runs (deduplicated from the raw log — some runs were replayed against cached results for testing):

| Verdict | Count | Percentage |
| --- | --- | --- |
| GO | 18 | 69% |
| CONDITIONAL | 4 | 15% |
| NO-GO | 4 | 15% |

The Failures Were Real

The 4 NO-GO verdicts weren't false alarms:

  • 2 stability failures: The game didn't start properly. total_fires=0, level_ups=0, peak_enemies=0. These were broken builds that would have shipped as "tests pass" in a naive pipeline.
  • 1 regression NO-GO: After a balance change that spiked enemy count to 153 (a spawn system bug), the next run with normal values (peak=7) showed a >50% regression against that inflated baseline. The gate correctly flagged it.
  • 1 ceiling failure: lowest_hp_pct=0 — the player died. damage_taken=39 in 60 seconds. The AI had overcorrected from "too easy" to "impossibly hard."

The Best Run

tier2: 4/4
damage_taken: 16
lowest_hp_pct: 0.66 (player took real damage but survived comfortably)
avg_levelup_interval: 16.8s (right in the sweet spot)
peak_enemies: 33
verdict: GO, reasons: (none)

This was a build where the AI had iterated through several balance passes. The gate validated what "good" looks like numerically.

The Worst GO

tier2: 3/4
damage_taken: 0
avg_levelup_interval: 18.3s
peak_enemies: 21
reasons: difficulty_floor_warn

Damage was 0 — too easy — but pacing and density were solid. The gate let it through as 3/4, which is the right call. A run where the bot happens to dodge everything isn't necessarily a broken build. But the difficulty_floor_warn gets logged, and if it shows up in 3 consecutive runs, that's a pattern the AI should address.

The CONDITIONAL Cases

All 4 CONDITIONAL verdicts had the same pattern: difficulty_floor_warn + pacing_warn. The game was too easy and level-ups were too fast (2/4 tier2 checks). These builds work but need improvement — exactly the signal CONDITIONAL is designed to send.


Beyond Games: Generalizing the Pattern

This 3-tier architecture isn't game-specific. The core insight works anywhere an AI agent produces output that needs to be "good enough to ship."

Content Agent (Blog Posts, Documentation)

| Tier | Checks |
| --- | --- |
| Stability | Spell check passes, no broken links, all images load |
| Balance | Reading level in target range, section length variance < 2x, CTA present |
| Regression | Word count not >30% shorter than previous, readability score stable |

Code Agent (Pull Requests, Refactors)

| Tier | Checks |
| --- | --- |
| Stability | Compiles, all tests pass, no new lint errors |
| Balance | Cyclomatic complexity < threshold, test coverage > floor, no files > 500 lines |
| Regression | Performance benchmarks within 25% of baseline, bundle size stable |

Data Pipeline Agent

| Tier | Checks |
| --- | --- |
| Stability | Schema validates, no null primary keys, row count > 0 |
| Balance | Column distributions within expected ranges, no single-value columns in output |
| Regression | Row count within 25% of previous run, new nulls < 5% |
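
As a concrete illustration of that last table, here is a hypothetical Tier 1 check for a data pipeline whose output is a JSONL export (file and field names invented for the example):

# Stability: the export exists, has rows, and has no null primary keys.
ROWS=$(wc -l < export.jsonl)
NULL_KEYS=$(jq -c 'select(.id == null)' export.jsonl | wc -l)
if [[ "$ROWS" -eq 0 || "$NULL_KEYS" -gt 0 ]]; then
    echo "TIER1: FAIL (empty export or null primary keys)"
    exit 1
fi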

The pattern is always the same:

  1. Tier 1: Did it work at all? (binary)
  2. Tier 2: Is the output within acceptable quality bands? (multi-check, majority rule)
  3. Tier 3: Is it worse than what we had before? (baseline delta)

Implementation: It's Just a Bash Script

The entire quality gate is a ~220-line bash script with one dependency: jq. No frameworks. No SaaS. No SDK.

The Threshold File

All the magic numbers live in a single JSON file. Tune them without touching code:

{
  "tier1_stability": {
    "max_exit_code": 0,
    "max_script_errors": 0,
    "min_total_fires": 1,
    "min_level_ups": 1
  },
  "tier2_balance": {
    "difficulty_floor": { "min_damage_taken": 1 },
    "difficulty_ceiling": {
      "min_lowest_hp_pct": 0.10,
      "must_survive_60s": true
    },
    "pacing": {
      "min_avg_interval": 8.0,
      "max_avg_interval": 35.0,
      "min_gap_between_levelups": 2.0
    },
    "density": {
      "min_peak_enemies": 5,
      "min_avg_enemies": 3
    },
    "pass_threshold": 3,
    "nogo_on_ceiling_fail": true
  },
  "tier3_regression": {
    "warn_threshold_pct": 25,
    "nogo_threshold_pct": 50
  }
}
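
Reading those thresholds back in the gate script is a handful of jq lookups, something like this (the filename and variable names are my guesses, not necessarily the script's):

THRESHOLDS="quality-thresholds.json"   # hypothetical filename
MIN_FIRES=$(jq '.tier1_stability.min_total_fires' "$THRESHOLDS")
MIN_LU=$(jq '.tier1_stability.min_level_ups' "$THRESHOLDS")
PASS_NEEDED=$(jq '.tier2_balance.pass_threshold' "$THRESHOLDS")
NOGO_PCT=$(jq '.tier3_regression.nogo_threshold_pct' "$THRESHOLDS")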

The Core Logic

The gate script follows a dead-simple flow:

#!/usr/bin/env bash
# Exit 0 = GO or CONDITIONAL, Exit 1 = NO-GO

VERDICT="GO"

# TIER 1: STABILITY (any fail = NO-GO)
if [[ "$PASS_VAL" != "true" ]]; then VERDICT="NO-GO"; fi
if [[ "$TOTAL_FIRES" -lt "$MIN_FIRES" ]]; then VERDICT="NO-GO"; fi
if [[ "$LEVEL_UPS" -lt "$MIN_LU" ]]; then VERDICT="NO-GO"; fi

# TIER 2: BALANCE BAND (3/4 sub-checks to pass)
# ... run 4 sub-checks, count passes ...
if [[ "$CEILING_PASS" == false ]]; then
    VERDICT="NO-GO"                       # dying is fatal, even if the other three pass
elif [[ "$TIER2_PASSES" -ge 3 ]]; then
    echo "TIER2: PASS"
elif [[ "$VERDICT" == "GO" ]]; then
    VERDICT="CONDITIONAL"                 # 2/4 or less: yellow flag (never upgrades a NO-GO)
fi

# TIER 3: REGRESSION (compare vs saved baseline)
if [[ -f "$LATEST_BASELINE" ]]; then
    DELTA_PCT=$(compare_metric "$PEAK_ENEMIES" "$BL_PEAK")
    if [[ "$DELTA_PCT" -gt 50 ]]; then
        VERDICT="NO-GO"
    elif [[ "$DELTA_PCT" -gt 25 && "$VERDICT" == "GO" ]]; then
        VERDICT="CONDITIONAL"
    fi
fi

# Save baseline on GO
if [[ "$VERDICT" == "GO" ]]; then
    cp "$RESULTS_PATH" "$BASELINE_DIR/latest.json"
fi

# Log everything to JSONL for trend analysis
echo "$LOG_ENTRY" >> gate-log.jsonl

# Exit code drives the pipeline
[[ "$VERDICT" == "NO-GO" ]] && exit 1 || exit 0

Everything gets appended to a gate-log.jsonl file — one JSON object per run. This gives you trend analysis for free. When peak_enemies shows a slow downward trend across 10 runs, you catch it before it becomes a regression.
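
Pulling a trend out of that log is a one-liner, assuming each entry records the same metric names as results.json:

# Eyeball the last 10 runs' peak enemy counts, oldest to newest.
tail -n 10 gate-log.jsonl | jq -s '[.[].peak_enemies]'

# Or compare the average of the five most recent runs against the five before them.
tail -n 10 gate-log.jsonl | jq -s '{prev_avg: ([.[:5][].peak_enemies] | add / length),
                                    last_avg: ([.[5:][].peak_enemies] | add / length)}'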

Running It

# Full pipeline: run game + evaluate
./quality-gate.sh

# Skip the game run, evaluate existing results
./quality-gate.sh --skip-run --results /path/to/results.json

# Use a specific baseline
./quality-gate.sh --baseline /path/to/baselines/

The full source is on GitHub: github.com/yurukusa/spell-cascade


What the Gate Doesn't Catch

I'd be dishonest if I didn't mention the gaps.

The gate can't evaluate "feel." A game can pass all 4 tier2 checks and still feel lifeless — bad animations, no screen shake, boring sound effects. I've started building a separate "Feel Scorecard" that measures action density (events/second), dead time (longest gap with no events), and reward frequency, but it's early.

The gate is only as good as the bot. The AutoTest bot moves randomly and picks upgrades randomly. It can't test "is the dodge mechanic satisfying?" or "does the boss fight have good telegraphing?" Those require human playtesting.

Baseline drift is a real problem. If the AI makes a series of small-but-negative changes (each under the 25% warn threshold), the baseline slowly degrades. The JSONL log helps here — you can chart trends — but the gate doesn't do it automatically yet.

One of my "best" runs had a data anomaly. Peak enemies hit 153 in a single run due to a spawn system bug. That became the baseline, which then made every subsequent normal run look like a massive regression. I had to manually reset the baseline. The system needs an outlier filter.
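
One possible shape for that filter, sketched here but not implemented in the project, and assuming the JSONL log records each run's verdict and peak_enemies: refuse to promote a baseline that dwarfs the median of recent GO runs.

# Guard the baseline promotion step against freak runs like the 153-enemy spike.
MEDIAN=$(tail -n 20 gate-log.jsonl \
  | jq -s '[.[] | select(.verdict == "GO") | .peak_enemies] | sort | (length/2 | floor) as $m | .[$m]')
if (( PEAK_ENEMIES > MEDIAN * 3 )); then
    echo "Outlier run (peak=$PEAK_ENEMIES, recent median=$MEDIAN); keeping the old baseline" >&2
else
    cp "$RESULTS_PATH" "$BASELINE_DIR/latest.json"
fi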


The Honest Scorecard

After implementing the Quality Gate, I asked myself: did it actually help?

Yes, with caveats.

It caught 4 builds that would have shipped broken. Two of those were stability failures the AI didn't notice (the game booted but core systems weren't initializing). One was the "overcorrected to impossible difficulty" build. One was a legit regression.

It also correctly let through builds that a stricter gate would have rejected. The 0-damage runs with good pacing were fine — the bot just happened to dodge everything. A 4/4 requirement would have created noise.

But the gate said GO on builds that a human player would flag in 30 seconds. Stiff animations. Boring enemy patterns. No visual feedback on hits. The gap between "numerically balanced" and "fun" is still a human judgment.

That's the next frontier: encoding "feel" into automated metrics. But even without that, having a GO/NO-GO gate between the AI and the commit history has already prevented the worst outcomes.


Key Takeaways

  1. Tests are necessary but not sufficient. Passing tests means your code is correct. It doesn't mean your output is good.

  2. The 3-tier pattern works everywhere. Stability (did it work?), Balance (is it good enough?), Regression (is it worse?). Apply it to content, code, data, or anything an AI agent produces.

  3. Use majority voting for quality bands. Demanding 4/4 perfect creates false negatives. 3/4 with a hard veto on critical failures is the right balance for autonomous systems.

  4. Log everything to JSONL. Individual gate verdicts are useful. The trend across 26 runs is where the real insights are.

  5. Externalize thresholds. Put them in a JSON file, not in code. You'll tune them constantly, and your AI agent can modify them without touching the gate logic.

  6. Be honest about the gaps. A quality gate doesn't replace human judgment. It catches the bottom 15% — the builds that should never ship — and that alone is worth the ~220 lines of bash.


This concept was born from building an autonomous game testing pipeline. I wrote a deeper dive into the Feel Scorecard — the metrics behind "does this game feel good?" — on Zenn (Japanese).

Curious what happens when the AI says GO but a human finds 3 bugs in 5 minutes? I wrote about that honest reckoning on Hatena Blog (Japanese).

Spell Cascade is playable now: yurukusa.itch.io/spell-cascade


Built by a non-engineer using Claude Code (Anthropic's AI coding assistant) + Godot 4.3. The quality gate, the game, the AutoTest bot — all of it written by AI, reviewed by a human who can't read most of the code.

"I want AI to work while I sleep" → CC-Codex Ops Kit. The guards from this article enabled 88-task overnight runs.
