Hi, I'm Ryan, CTO at airCloset.
Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It ...
running at scale requires betting humans will notice when to intervene - that's the real tradeoff. curious what signals surface a PR for actual human review in practice.
Great question — and I think the framing already points at the right concern. "Humans noticing when to intervene" is exactly the model that stops being reliable at this scale; once you're past ~700 PRs/month, attention-based oversight breaks.
What we do instead is make escalation structural, not attentional. The concrete signals that surface a PR for human review:
None of these depend on a human happening to notice — they're explicit triggers built into the pipeline.
The deeper bet is that humans aren't watching individual PRs at all. They're watching aggregate AI behavior and tuning the guidelines (the policy layer). When the AI misjudges a class of issue, the right intervention is rewriting the rule, not stepping into the PR. That's what human-on-the-loop means in practice here.
structural escalation is the right frame - what signals break the glass for you? my attempt was blast-radius tags at design time, which helps but just shifts the attention problem upstream
Right — that's exactly the meta-problem. Any tag-based mechanism just relocates "who attends to the tag."
Our break-the-glass signals are deliberately action-coupled rather than attention-coupled:
The shape we landed on: the signal doesn't say "pay attention," it says "this is blocked until you act," and the context the human needs (diff, severity rationale, investigation log) is delivered with the signal. The queue is short and each item arrives with its own briefing.
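As a rough illustration of that shape, here is a minimal Python sketch of a signal that carries its own briefing and keeps the PR blocked until someone acts. The `Escalation` class, its field names, and `can_merge` are hypothetical names chosen for this example, not cortex's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Escalation:
    """A break-the-glass signal that ships with its own briefing (hypothetical shape)."""
    pr_number: int
    diff_summary: str
    severity_rationale: str
    investigation_log: list[str] = field(default_factory=list)
    resolved_by: Optional[str] = None  # stays None until a human acts

    def resolve(self, reviewer: str) -> None:
        # Record who acted; this is what unblocks the PR.
        self.resolved_by = reviewer


def can_merge(escalation: Optional[Escalation]) -> bool:
    """A PR with an open escalation is blocked; no escalation means no human gate."""
    return escalation is None or escalation.resolved_by is not None
```

The point of the sketch is that the gate is a state on the PR, not a message: until `resolve` is called, `can_merge` stays false regardless of whether anyone saw a notification.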
To address the "attention shifts upstream" worry directly: the same problem applies at design time too. Blast-radius tags help only if a human reads every tag. We keep it bounded with the [Recurrence] loop: when a class of issue trips the human-only gate twice, the next PR must add a generation-time constraint (lint / type / CI guard) in the same PR that fixes the symptom. So the "human-only" categories shrink in absolute terms over time, even while throughput grows.
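A minimal sketch of the counting side of that loop, in Python. The threshold of two comes from the description above; the `RecurrenceTracker` class, its method names, and the idea of keying on an issue-class string are assumptions made for illustration.

```python
from collections import Counter

RECURRENCE_THRESHOLD = 2  # "trips the human-only gate twice"


class RecurrenceTracker:
    """Counts human-only gate trips per issue class (hypothetical helper)."""

    def __init__(self) -> None:
        self._trips = Counter()  # issue class -> number of gate trips

    def record_trip(self, issue_class: str) -> None:
        self._trips[issue_class] += 1

    def guard_required(self, issue_class: str) -> bool:
        # Once a class has tripped the gate twice, the fixing PR must also
        # ship a generation-time constraint (lint / type / CI guard).
        return self._trips[issue_class] >= RECURRENCE_THRESHOLD
```

For example, after two calls to `record_trip("missing-null-check")`, `guard_required("missing-null-check")` becomes true, while a class that has never tripped the gate stays false.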
Your blast-radius tagging would slot into this nicely as one more pre-classified category — the question is whether the tag triggers attention or triggers action.
action-coupled is cleaner - context and signal arrive together. what happens when the handoff message itself gets buried in a busy channel?
Right — that's the natural next concern. The answer: the Slack message isn't the source of truth, it's just the notification. The actual queue lives on the PR itself.
When a PR breaks the glass:
The label + PR state is the durable queue. We treat "open PRs with needs-human label" (and "PRs awaiting my review") as the canonical pending-attention surface — the same way you'd treat a bug tracker. It doesn't depend on anyone happening to see a chat message in the moment.
So Slack is the push layer; the PR list is the pull layer. Both exist on purpose: push catches you when you're online; pull catches you when you come back from PTO or finally clear the queue. Slack channels are also severity-routed (#cortex-fatal is kept intentionally low-volume so critical items don't drown in the warning channel).
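The push/pull split can be sketched in a few lines of Python. The `needs-human` label and the `#cortex-fatal` channel are taken from the description above; the `PullRequest` class, the function names, and the `#cortex-warning` channel name are placeholders invented for this example.

```python
from dataclasses import dataclass

NEEDS_HUMAN = "needs-human"  # label named in the description above

# Only #cortex-fatal is named in the thread; the warning channel name is a placeholder.
CHANNEL_BY_SEVERITY = {"fatal": "#cortex-fatal", "warning": "#cortex-warning"}


@dataclass
class PullRequest:
    number: int
    is_open: bool
    labels: frozenset[str]


def pending_queue(prs: list[PullRequest]) -> list[PullRequest]:
    """Pull layer: the durable queue is derived from PR state, not from chat history."""
    return [pr for pr in prs if pr.is_open and NEEDS_HUMAN in pr.labels]


def channel_for(severity: str) -> str:
    """Push layer: route the notification by severity so fatal items stay low-volume."""
    return CHANNEL_BY_SEVERITY[severity]
```

Because `pending_queue` is computed from labels and open state, a notification that scrolls away in chat does not remove anything from the queue.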
That said, this pattern is tuned for "will be picked up within hours or a day," not "must be picked up in 60 seconds." Anything that time-sensitive probably shouldn't sit behind an AI handoff in the first place; it belongs in pager-style escalation. The break-the-glass pattern handles the "AI couldn't, but it's not on fire" band, which turns out to be most of what shows up.
769 PRs/month is the scale at which review-time controls start to bend. The interesting design question is which constraints belong at generation time (so the agent can never produce a violating PR) vs. which belong at review time (where humans still adjudicate judgment calls). The mix is what determines whether the quality bar actually rises.
Yes, this is the right design question. The generation-time vs review-time split is essentially Martin Fowler's Guides / Sensors taxonomy, and you're right that the mix is what determines whether the bar rises or just holds.
Cortex's bet is to push as much as possible to generation time — but with an explicit mechanism for the mix to shift over time.
Generation time (proactive):
Review time (reactive):
The piece that I think actually moves the bar over time is what we call the [Recurrence] loop: bug-fix PRs are required by the review guideline (recurrence-prevention.md) to pick one of {add lint rule, horizontal rollout to other call sites, add guideline item, or "single occurrence — nothing"}. When a class of issue trips up review twice, the rule is required to migrate from review-time to generation-time in the same PR that fixes the symptom.
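One way such a rule could be checked mechanically is sketched below in Python. The four options mirror the list above; the enum values, the `check_bugfix_pr` function, and the choice to reject "single occurrence" once a class has been caught twice are assumptions for illustration, not the contents of recurrence-prevention.md.

```python
from enum import Enum
from typing import Optional


class RecurrenceAction(Enum):
    ADD_LINT_RULE = "add-lint-rule"
    HORIZONTAL_ROLLOUT = "horizontal-rollout"
    ADD_GUIDELINE_ITEM = "add-guideline-item"
    SINGLE_OCCURRENCE = "single-occurrence"


def check_bugfix_pr(declared: Optional[str], times_caught_in_review: int) -> list[str]:
    """Return a list of problems with a bug-fix PR's recurrence declaration."""
    try:
        action = RecurrenceAction(declared)
    except ValueError:
        # Missing or unknown declaration: the PR must pick exactly one option.
        options = ", ".join(a.value for a in RecurrenceAction)
        return [f"declare one of: {options}"]
    errors = []
    # Assumption: a class caught in review twice can no longer claim to be a one-off.
    if times_caught_in_review >= 2 and action is RecurrenceAction.SINGLE_OCCURRENCE:
        errors.append("issue class has recurred; a preventive action is required")
    return errors
```

Here `check_bugfix_pr("add-lint-rule", 2)` returns an empty list, while `check_bugfix_pr("single-occurrence", 2)` returns one error. Which of the remaining options counts as a generation-time constraint is left open, since the thread does not spell it out.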
So "whether the quality bar rises" isn't a static-mix question — every review-time catch becomes a candidate for promotion into a generation-time constraint, and the loop keeps converting Sensors into Guides over time. Part 4 of the series gets into this in more detail.
Thanks for the framing — this is exactly the lens that surfaces what's actually load-bearing.