DEV Community

Human-on-the-Loop: AI Reviewing AI PRs at cortex -- 769 PRs/month while raising the quality bar (Series Part 3)

Ryosuke Tsuji on May 26, 2026

Hi, I'm Ryan, CTO at airCloset. Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It ...

Mykola Kondratiuk • May 29

running at scale requires betting humans will notice when to intervene - that's the real tradeoff. curious what signals surface a PR for actual human review in practice.

Ryosuke Tsuji • May 29

Great question — and I think the framing already points at the right concern. "Humans noticing when to intervene" is exactly the model that stops being reliable at this scale; once you're past ~700 PRs/month, attention-based oversight breaks.

What we do instead is make escalation structural, not attentional. The concrete signals that surface a PR for human review:

Quality-bar relaxation PRs: anything that lowers a lint rule, coverage threshold, or guideline binding is classified Critical in severity.md. The AI is forbidden from approving it; a human reviewer's approval is required. This is the meta-safety valve described in the post.
Repeated auto-fix failures — if the author-mode AI can't pass CI after a few attempts, the PR gets a needs-human label.
Alert-Fix "can't fix" verdicts — when AI investigation can't pin down a root cause, it posts the investigation log to Slack and explicitly hands off (Part 4 of the series goes deeper on this).
PR-author opt-out on auto-merge — authors can disable auto-merge per PR for changes they want to land carefully.

None of these depend on a human happening to notice — they're explicit triggers built into the pipeline.
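
The four triggers above can be sketched as a pure function over PR state. This is only an illustration, not cortex's actual code: the `PullRequest` fields, the `Escalation` names, and the retry limit of 3 are all assumptions (the thread only says "a few attempts").

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Escalation(Enum):
    QUALITY_BAR_RELAXATION = "quality-bar relaxation (Critical, AI cannot approve)"
    AUTO_FIX_EXHAUSTED = "repeated auto-fix failures (needs-human label)"
    ALERT_FIX_HANDOFF = "Alert-Fix could not pin down a root cause"
    AUTHOR_OPT_OUT = "author disabled auto-merge"


@dataclass
class PullRequest:
    # Hypothetical stand-ins for whatever state the pipeline really tracks.
    lowers_quality_bar: bool = False
    ci_fix_attempts: int = 0
    alert_fix_verdict: Optional[str] = None  # assumed values: "fixed" / "cant_fix"
    auto_merge_opt_out: bool = False


MAX_CI_FIX_ATTEMPTS = 3  # assumed limit; not stated in the thread


def escalation_triggers(pr: PullRequest) -> List[Escalation]:
    """Return every structural trigger that applies; no human has to notice anything."""
    triggers = []
    if pr.lowers_quality_bar:
        triggers.append(Escalation.QUALITY_BAR_RELAXATION)
    if pr.ci_fix_attempts >= MAX_CI_FIX_ATTEMPTS:
        triggers.append(Escalation.AUTO_FIX_EXHAUSTED)
    if pr.alert_fix_verdict == "cant_fix":
        triggers.append(Escalation.ALERT_FIX_HANDOFF)
    if pr.auto_merge_opt_out:
        triggers.append(Escalation.AUTHOR_OPT_OUT)
    return triggers


def requires_human(pr: PullRequest) -> bool:
    """A PR is blocked on a human as soon as any trigger fires."""
    return bool(escalation_triggers(pr))
```

The point of the shape is that escalation is computed from pipeline facts rather than from someone reading the PR list.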

The deeper bet is that humans aren't watching individual PRs at all. They're watching aggregate AI behavior and tuning the guidelines (the policy layer). When the AI misjudges a class of issue, the right intervention is rewriting the rule, not stepping into the PR. That's what human-on-the-loop means in practice here.

Mykola Kondratiuk • May 30

structural escalation is the right frame - what signals break the glass for you? my attempt was blast-radius tags at design time, which helps but just shifts the attention problem upstream

Ryosuke Tsuji • May 30

Right — that's exactly the meta-problem. Any tag-based mechanism just relocates "who attends to the tag."

Our break-the-glass signals are deliberately action-coupled rather than attention-coupled:

Explicit AI handoff — Alert-Fix posts its investigation log to Slack and says "I couldn't fix this safely" when it can't pin down a root cause. No flag to scan; the system tells you it's stuck, with everything it tried.
Limit violations — repeated CI-fix failures, unresolvable conflicts, etc. get a needs-human label. That's literally the queue.
Categories pre-defined as human-only: quality-bar relaxation (lint/coverage downgrades), credential handling, production data ops. These are classified Critical in severity.md, so the AI is forbidden from approving them. It's not "AI flags for human review"; it's "AI cannot approve, so the PR sits until a human does."

The shape we landed on: the signal doesn't say "pay attention," it says "this is blocked until you act," and the context the human needs (diff, severity rationale, investigation log) is delivered with the signal. The queue is short and each item arrives with its own briefing.

To address the "attention shifts upstream" worry directly: the same problem applies at design time too. Blast-radius tags help only if a human is going to read every tag. The way we keep it bounded is the [Recurrence] loop: when a class of issue trips the human-only gate twice, the next PR is required to add a generation-time constraint (lint / type / CI guard) in the same PR that fixes the symptom. So the "human-only" categories shrink in absolute terms over time, even while throughput grows.

Your blast-radius tagging would slot into this nicely as one more pre-classified category — the question is whether the tag triggers attention or triggers action.

Mykola Kondratiuk • May 30

action-coupled is cleaner - context and signal arrive together. what happens when the handoff message itself gets buried in a busy channel?

Ryosuke Tsuji • May 30

Right — that's the natural next concern. The answer: the Slack message isn't the source of truth, it's just the notification. The actual queue lives on the PR itself.

When a PR breaks the glass:

A needs-human label is applied (or it sits in REQUEST_CHANGES state). Auto-merge is disabled.
The relevant Slack channel gets a thread with the investigation log / verdict / diff.
Optionally, the assigned reviewer gets a Slack DM via our notification relay.

The label + PR state is the durable queue. We treat "open PRs with needs-human label" (and "PRs awaiting my review") as the canonical pending-attention surface — the same way you'd treat a bug tracker. It doesn't depend on anyone happening to see a chat message in the moment.

So Slack is the push layer; the PR list is the pull layer. Both exist on purpose: push catches you when you're online; pull catches you when you come back from PTO or finally clear the queue. Slack channels are also severity-routed (#cortex-fatal is kept intentionally low-volume so critical items don't drown in the warning channel).
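
A minimal sketch of the push/pull split, assuming an in-memory list of PR records. The `PendingPR` type and its fields are hypothetical, `#cortex-warning` is an invented channel name (only `#cortex-fatal` appears in the thread), and mapping Critical severity to the fatal channel is an assumption about how the routing works.

```python
from dataclasses import dataclass, field
from typing import List

FATAL_CHANNEL = "#cortex-fatal"      # named in the thread
WARNING_CHANNEL = "#cortex-warning"  # hypothetical name for the higher-volume channel


@dataclass
class PendingPR:
    number: int
    state: str  # "open" or "closed"
    labels: List[str] = field(default_factory=list)
    review_requested_from: List[str] = field(default_factory=list)


def pending_for(reviewer: str, prs: List[PendingPR]) -> List[PendingPR]:
    """Pull layer: the durable queue is derived from PR state, not from chat history."""
    return [
        pr
        for pr in prs
        if pr.state == "open"
        and ("needs-human" in pr.labels or reviewer in pr.review_requested_from)
    ]


def slack_channel(severity: str) -> str:
    """Push layer: route by severity so critical items stay in a low-volume channel."""
    return FATAL_CHANNEL if severity == "Critical" else WARNING_CHANNEL
```

If a Slack message is missed, nothing is lost: `pending_for` returns the same items whenever the reviewer next looks.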

That said, this pattern is tuned for "will be picked up within hours / a day," not "must be picked up in 60 seconds." Anything that time-sensitive probably shouldn't sit behind an AI handoff in the first place; it belongs in pager-style escalation. The break-the-glass pattern handles the "AI couldn't, but it's not on fire" band, which turns out to be most of what shows up.

Theo Valmis • May 29

769 PRs/month is the scale at which review-time controls start to bend. The interesting design question is which constraints belong at generation time (so the agent can never produce a violating PR) vs. which belong at review time (where humans still adjudicate judgment calls). The mix is what determines whether the quality bar actually rises.

Ryosuke Tsuji • May 29

Yes, this is the right design question. The generation-time vs review-time split is essentially Martin Fowler's Guides / Sensors taxonomy, and you're right that the mix is what determines whether the bar rises or just holds.

Cortex's bet is to push as much as possible to generation time — but with an explicit mechanism for the mix to shift over time.

Generation time (proactive):

Lint, type checks, a 90% coverage gate, and a 500-lines-per-file cap — all enforced pre-commit / in CI.
cpg context fed to the author AI (same graph the reviewer would query), so the author already has the context the reviewer is going to evaluate against.
Author-side CLAUDE.md encodes architectural rules so violations rarely get written in the first place.

Review time (reactive):

9-dimension AI review with a severity hierarchy and strict no-downgrade rules.
Severity gates merge based on the verdict marker.
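
One possible reading of the review-time rules above, as a sketch. Only "Critical" is named in the thread, so the other severity levels, the blocking threshold, and the interpretation of "no-downgrade" (a re-review may not lower the severity of a finding it still reports) are all assumptions.

```python
from enum import IntEnum
from typing import Dict


class Severity(IntEnum):
    # Only CRITICAL comes from the thread; the lower levels are assumed.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def apply_no_downgrade(
    previous: Dict[str, Severity], current: Dict[str, Severity]
) -> Dict[str, Severity]:
    """Keep the higher of the old and new severity for findings still reported.

    Findings absent from `current` are treated as resolved and dropped.
    """
    return {
        finding: max(severity, previous.get(finding, severity))
        for finding, severity in current.items()
    }


def can_auto_merge(
    findings: Dict[str, Severity], block_at: Severity = Severity.HIGH
) -> bool:
    """Merge gate: any finding at or above `block_at` blocks auto-merge."""
    return all(severity < block_at for severity in findings.values())
```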

The piece that I think actually moves the bar over time is what we call the [Recurrence] loop: bug-fix PRs are required by the review guideline (recurrence-prevention.md) to pick one of {add lint rule, horizontal rollout to other call sites, add guideline item, or "single occurrence — nothing"}. When a class of issue trips up review twice, the rule is required to migrate from review-time to generation-time in the same PR that fixes the symptom.

So "whether the quality bar rises" isn't a static-mix question — every review-time catch becomes a candidate for promotion into a generation-time constraint, and the loop keeps converting Sensors into Guides over time. Part 4 of the series gets into this in more detail.
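
The [Recurrence] loop described above could be checked mechanically along these lines. This is a sketch under assumptions: the function and parameter names are hypothetical, and the rule that a second occurrence rejects the "single occurrence" choice and demands a generation-time guard is my reading of the comment, not a quote from recurrence-prevention.md.

```python
from collections import Counter
from enum import Enum
from typing import List, Optional


class Prevention(Enum):
    # The four choices listed in the comment.
    ADD_LINT_RULE = "add lint rule"
    HORIZONTAL_ROLLOUT = "horizontal rollout to other call sites"
    ADD_GUIDELINE_ITEM = "add guideline item"
    SINGLE_OCCURRENCE = "single occurrence (nothing)"


PROMOTION_THRESHOLD = 2  # "trips up review twice"


def check_bugfix_pr(
    issue_class: str,
    choice: Optional[Prevention],
    adds_generation_time_guard: bool,
    history: Counter,
) -> List[str]:
    """Record one occurrence of `issue_class` and return any policy violations."""
    history[issue_class] += 1
    problems = []
    if choice is None:
        problems.append("bug-fix PR must pick a [Recurrence] option")
    if history[issue_class] >= PROMOTION_THRESHOLD:
        if choice is Prevention.SINGLE_OCCURRENCE:
            problems.append("issue class has recurred; 'single occurrence' is not valid")
        if not adds_generation_time_guard:
            problems.append("recurring issue class needs a lint / type / CI guard in this PR")
    return problems
```

The first occurrence passes with any choice; from the second occurrence on, the same fix is rejected unless the PR also ships the guard, which is what converts a review-time catch into a generation-time constraint.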

Thanks for the framing — this is exactly the lens that surfaces what's actually load-bearing.