Hi, I'm Ryan, CTO at airCloset.
Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.
In Part 1 (intro) I covered the high level -- AI driving both PR reviews and incident response on top of cortex. In Part 2 (Product Graph) I went deep on cpg, the unified knowledge graph that fuses code, docs, DB schemas and infra into a single business-aware index.
This post is about the automated PR review pipeline -- AI reviews the PR, a separate AI applies the fixes, and the system merges automatically once policy gates pass. The usual critiques of AI-assisted development ("the reviewer becomes the bottleneck" and "AI code drops the quality bar") don't really apply here. The rest of this post unpacks why.
Series
| # | Theme | Key scene | Article |
|---|---|---|---|
| 1 | Series intro: cortex harness | PRs merging unattended / incidents fixed before anyone notices | ai-harness-intro |
| 2 | Product Graph (cpg) | Code / docs / DB / infra unified into one graph | cortex-product-graph |
| 3 | Auto PR review | webhook -> AI review -> auto-fix -> squash merge | This article ← you are here |
| 4 | Self-Healing + observability + auto-added guardrails | Alert -> AI investigates -> fix PR + new lint/type gate -> auto redeploy + same-pattern writes get auto-rejected | cortex-self-healing |
| 5 | Democratizing the maintenance phase | Domain experts open PRs to production; the harness owns the quality gate | cortex-non-engineer-prs |
| 6 | Series wrap-up | The underlying philosophy (what was given up, what was kept, why this design) plus a retrospective on the failures and lessons | Coming soon |
Start with last month's numbers
769 PRs merged.
Median time to merge: 31 minutes.
Human review involvement per PR: near-zero.
That's a typical 30 days on cortex (Apr 21 -- May 21).
Every one of those 769 PRs had an AI reviewer as the first reviewer, with an average of 10.8 review-fix loop iterations per PR (max 56). 1 in 5 merged within 10 minutes, roughly half within 30 minutes. What humans do now is look at review outcomes and tune the review prompt and the guidelines themselves -- this is human-on-the-loop, not human-in-the-loop. Humans operate on the policy layer, not the execution layer.
| Past 30 days | |
|---|---|
| PRs merged | 769 |
| AI reviewer coverage | 100% |
| Avg review iterations / PR | 10.8 |
| Max review iterations | 56 |
| Per-PR human review | ~0% |
| Median time-to-merge | 31 min |
| Merged within 10 min | 20% |
| Merged within 30 min | 49% |
This is a typical month on cortex now.
The common refrain -- "AI speeds up writing but reviews still bottleneck" and "AI-written code lowers quality" -- is something cortex absorbs through a pipeline where neither failure mode can take hold. Let me break it down.
How the review bottleneck stops forming
The conventional wisdom: the reviewer becomes the bottleneck
As AI writes faster, the load on whoever reviews the output grows proportionally. Anthropic's internal blog (How Anthropic teams use Claude Code) reports the same pattern -- the bottleneck has shifted from writing to reviewing, and senior engineers' work has moved from writing code toward integrating and reviewing AI output.
cortex hit exactly this. The moment we ran Claude Code at full throttle, writing speed jumped by an order of magnitude or more. Meanwhile the human time available to read and approve PRs only grew linearly. If the reviewer (=me) took a day off, the whole org stalled -- a classic single point of failure.
cortex's answer: move the reviewer role to AI as well
Part 1 and Part 2 kept asking the same recurring question: "how far do you push the harness?" cortex went all-in: the AI writes the code, the AI reviews the code. What humans keep their hands on is "tuning the prompts and guidelines themselves" -- not making decisions inside each individual PR, but watching the system from above and adjusting.
Three conditions had to hold for this to work:
- The AI reviewer has enough context
  A generic AI reviewer only sees the PR diff. The diff alone hides business meaning, upstream/downstream dependencies, and prior incident history. cortex feeds the Product Graph (cpg) from Part 2 -- a knowledge graph that fuses code, docs, DB schemas, and infra into one structure, with each node carrying business role and upstream/downstream dependencies -- into the AI reviewer, so it can trace impact into code that the PR didn't even touch. It catches:
  - Missed upstream/downstream fixes
  - Missed doc updates
  - Tests that should have been updated but weren't

  Diff-only AI review can never reach this territory.
- Reviews are not improvisational
  If reviews shift day to day, the team gets confused, and the AI can't be told what "correct" looks like. We enforce consistency by passing an explicit review-guideline document as the mandatory citation source for every review (we open-sourced a snapshot, see below).
- False positives don't blanket-block merges
  Treating every false positive as Critical breaks the workflow. We control this with a severity hierarchy (Critical / Major / Minor / Nit) plus strict no-downgrade rules.
So: the cpg from Part 2 solves "what context the AI sees," the review guidelines solve "what the AI should do" as Guides (pre-execution control), and the severity ladder + no-downgrade rules solve "what the AI must not do" as Sensors (post-execution control). This maps cleanly onto Martin Fowler's Guides / Sensors taxonomy (introduced back in Part 1).
One more upstream layer: before any of those three kicks in, a 500-lines-per-file lint keeps every file in any PR small enough to fit in a single AI session. That alone keeps AI review from breaking down, and unlike a human reviewer, the AI doesn't lose focus. There are plenty of other lints in front of the AI reviewer too, but the full picture belongs to Part 4 (Self-Healing + observability + auto-added guardrails).
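As a rough illustration of that upstream gate, here is a minimal sketch of a lines-per-file check. The 500 limit comes from the text above; the function names, the `SourceFile` shape, and the line-counting convention are assumptions for illustration, not cortex's actual lint.

```typescript
// Hypothetical sketch of a "max lines per file" gate; names and shapes are assumptions.
const MAX_LINES_PER_FILE = 500; // the limit quoted in the article

interface SourceFile {
  path: string;
  content: string;
}

interface LineLimitViolation {
  path: string;
  lines: number;
}

// Count lines the way most editors do: a trailing newline does not add a line.
function countLines(content: string): number {
  if (content.length === 0) return 0;
  const parts = content.split("\n");
  return content.endsWith("\n") ? parts.length - 1 : parts.length;
}

// Return every file in the PR that exceeds the limit.
function findOversizedFiles(
  files: SourceFile[],
  maxLines: number = MAX_LINES_PER_FILE,
): LineLimitViolation[] {
  return files
    .map((f) => ({ path: f.path, lines: countLines(f.content) }))
    .filter((v) => v.lines > maxLines);
}
```

A gate like this would run before the AI reviewer is spawned, so an oversized file is rejected mechanically rather than spending review tokens on it.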
How the auto-review system is wired
The implementation is a script running on each developer's machine. GitHub webhooks land on an in-house Event Relay server, get persisted to Firestore, and each developer's machine subscribes as an SSE client. On reconnect, Last-Event-ID replays anything missed -- zero event loss, single webhook registration. Reviewer-mode machines stay always-on, so any incoming review fires immediately. Author mode runs in the background on the PR author's own machine, alongside their normal dev work.
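The replay behaviour can be sketched in a few lines. This is a minimal sketch, not the relay's code: an in-memory array stands in for Firestore, event IDs are assumed to be monotonically increasing integers, payloads are assumed to be single-line JSON, and `EventLog`, `replayAfter`, and `toSseFrame` are hypothetical names.

```typescript
// In-memory stand-in for the relay's persisted event log (the article uses Firestore).
interface RelayEvent {
  id: number;      // monotonically increasing; sent to clients as the SSE "id:" field
  source: string;  // e.g. "github"
  payload: string; // raw webhook body, assumed to be single-line JSON
}

class EventLog {
  private events: RelayEvent[] = [];
  private nextId = 1;

  // Persist first, then fan out; persistence is what makes replay possible.
  append(source: string, payload: string): RelayEvent {
    const event: RelayEvent = { id: this.nextId++, source, payload };
    this.events.push(event);
    return event;
  }

  // Everything the client has not seen yet. `lastEventId` is the value of the
  // Last-Event-ID header a client sends on reconnect (undefined on first connect).
  replayAfter(lastEventId?: number): RelayEvent[] {
    if (lastEventId === undefined) return [...this.events];
    return this.events.filter((e) => e.id > lastEventId);
  }
}

// Serialize one event in the SSE wire format: "id:" and "data:" lines, then a blank line.
function toSseFrame(event: RelayEvent): string {
  return `id: ${event.id}\ndata: ${event.payload}\n\n`;
}
```

Because every frame carries an `id:`, a reconnecting SSE client reports the last ID it processed, and the server only has to answer "what came after that?" from its persisted log.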
How we ended up with Event Relay
The current setup wasn't the original design.
- First: GitHub webhook → smee.io → each machine
- Then: GitHub webhook → Cloudflare Tunnel → each machine
- Now: GitHub webhook → in-house Event Relay with Firestore persistence → SSE to each machine
Both smee.io and Cloudflare Tunnel ran into connection drops and missed deliveries, which caused real misses for us. Switching to the in-house Event Relay brought event loss to zero (Firestore persistence + Last-Event-ID replay), and the relay turned into a general-purpose layer we could reuse.
The webhook ingestion for Self-Healing (covered in Part 4) actually goes through the exact same Event Relay. GitHub, Grafana, and other webhook sources get consolidated through one relay, and each machine's SSE client subscribes to whichever events it cares about. Having a single general-purpose webhook relay is a piece of infra that keeps paying off in unexpected ways -- worth investing in early.
When the reviewer's machine receives an event, the script spawns claude -p and walks through 9 dimensions (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) sequentially, then reads the verdict marker the AI emitted at the end and posts APPROVE or REQUEST_CHANGES via gh pr review.
A few notes:
- Modes split the role -- the same script started with `--mode reviewer` becomes the reviewer process; with `--mode author` it becomes the PR-author response process. The machine of whoever is assigned as reviewer runs reviewer mode; the machine of whoever opened the PR runs author mode. Event Relay multicasts the events, and each machine reacts in a distributed way.
- Per-PR worktree isolation -- author mode merges `origin/main` into a fresh worktree before spawning the AI. Multiple PRs can be handled in parallel without file state contaminating across them.
- 9 dimensions checked sequentially in one session -- not parallel sub-agents. A single `claude -p` session walks the 9 dimensions while keeping context shared, which also catches cross-dimension contradictions.
- Review guidelines: public snapshot -- air-closet/cortex-review-guidelines (JP/EN). The live guidelines are inside cortex (private repo) and evolve daily; the public repo is a snapshot extracted for reference.
:::message alert
Guidelines alone scale only to projects in the tens-of-thousands-of-lines range. At cortex's scale (over 1M lines of code), the knowledge graph from Part 2 (cpg) is a hard prerequisite. Porting the guidelines without cpg won't reproduce the same review quality -- the AI reviewer simply can't navigate the codebase fast enough to reason about impact.
:::
Why sequential single-session review, not parallel sub-agents
We initially tried splitting the 9 dimensions across parallel sub-agents. Three problems emerged: cpg / guidelines / PR diff got injected 9 times (token cost balloons), cross-dimension findings couldn't reference each other (a [Test] issue rooted in a [Graph] violation gets dropped in isolation), and aggregating 9 outputs into a single verdict required its own machinery.
A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context for later dimensions (cross-dimension consistency comes for free), and one verdict marker at the end is the entire aggregation step.
We also swap CLAUDE.md to a review-specific version at startup. The default CLAUDE.md is dense with development-time context (Product Graph ops, prod-data safety, MCP ordering) -- noise for a reviewer. The review-specific version centers on severity, no-downgrade, and the verdict marker spec, keeping AI attention on the review task.
Cutting wasted context improves judgment precision and lowers token cost at the same time.
Operational knobs
A few filters and toggles we apply in actual use:
- Draft (WIP) PRs are excluded. GitHub Draft state is received but skipped; review starts firing once the author flips it to Ready for Review.
- Specific PRs can be targeted manually. The webhook is the normal trigger, but you can also kick off a review against a specific PR number from the CLI -- useful after a CI failure or for re-checking a single PR.
- Auto-merge is the PR author's call. Whether the pipeline runs through to auto-merge after APPROVE + CI green is set by the PR author. Default is on; for changes that go directly to prod, the author can flip it off and hit merge themselves.
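Putting the mode split and the Draft filter together, the per-machine decision "should this process react to this event?" might look like the sketch below. The `PrEvent` shape is a simplified, hypothetical view of the webhook payload, and `shouldHandle` is an illustrative name rather than the script's real function.

```typescript
type Mode = "reviewer" | "author";

// Simplified, hypothetical view of the relayed webhook events.
interface PrEvent {
  kind: "pr_updated" | "review_submitted";
  prNumber: number;
  draft: boolean;
  author: string;
  requestedReviewers: string[];
  reviewState?: "approved" | "changes_requested";
}

interface LocalConfig {
  mode: Mode;
  user: string; // GitHub login of the person whose machine this is
}

function shouldHandle(event: PrEvent, config: LocalConfig): boolean {
  // Draft PRs are received but skipped until they are marked Ready for Review.
  if (event.draft) return false;

  if (config.mode === "reviewer") {
    // Reviewer mode reacts to new or updated PRs assigned to this user.
    return (
      event.kind === "pr_updated" &&
      event.requestedReviewers.indexOf(config.user) !== -1
    );
  }

  // Author mode reacts to REQUEST_CHANGES on PRs this user opened.
  return (
    event.kind === "review_submitted" &&
    event.reviewState === "changes_requested" &&
    event.author === config.user
  );
}
```

Since the relay multicasts every event, each machine can apply a filter like this locally; no central dispatcher needs to know who is reviewing what.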
Output structure: tags and severity
Every auto-review comment is structured as tag + severity + concrete example.
Tags (dimensions)
| Tag | Dimension | Primary target |
|---|---|---|
| `[Graph]` | Product Graph integrity | `@graph-*` JSDoc, node dependencies, doc consistency |
| `[Doc]` | Doc consistency | Doc updates that should follow code changes, doc placement |
| `[Impact]` | Impact analysis | Missed upstream/downstream fixes, `via:` field inconsistency |
| `[Security]` | Security | Auth, input validation, secrets |
| `[Architecture]` | Composable Architecture | app/package boundaries, dependency direction |
| `[Test]` | Test quality | Coverage, matchers, naming |
| `[Observability]` | Observability | Structured logging, no-truncate rules |
| `[AI-Antipattern]` | AI-generated code traps | Hallucinated APIs, fallback overuse, dead code |
| `[Recurrence]` | Recurrence prevention | Bug-fix triage (lint / horizontal rollout / new guideline) |
Severity
| Severity | Criteria | Action |
|---|---|---|
| Critical | Security, data corruption, prod-risk, doc inconsistency, missing `@graph-*`, quality-bar relaxation | REQUEST_CHANGES |
| Major | Spec violation, Composable Architecture violation, missing tests | REQUEST_CHANGES |
| Minor | Naming, maintainability, light refactor | REQUEST_CHANGES (must be resolved) |
| Nit | Style preference, minor inconsistency | APPROVE (comment only) |
The single most important rule is "no downgrade":
- "Following existing patterns" is not a valid reason to downgrade (existing violations are a separate problem to fix, not justification for new violations).
- "Will be addressed in a separate PR" or "incrementally" is not a valid reason to drop Critical/Major to Nit.
- "Leave a TODO/FIXME" is not a valid deferral path.
This is spelled out in severity.md, and the AI literally cites it back when it issues REQUEST_CHANGES.
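The severity table reduces to a very small decision rule: anything above Nit blocks. A minimal sketch, assuming findings have already been extracted into a list (the `Finding` shape and `verdictFor` name are hypothetical; in cortex the AI emits the verdict itself):

```typescript
type Severity = "Critical" | "Major" | "Minor" | "Nit";
type Verdict = "APPROVE" | "REQUEST_CHANGES";

interface Finding {
  tag: string; // e.g. "[Graph]"
  severity: Severity;
  message: string;
}

// Only Nit is non-blocking, per the severity table.
const BLOCKING: ReadonlyArray<Severity> = ["Critical", "Major", "Minor"];

function verdictFor(findings: Finding[]): Verdict {
  const blocked = findings.some((f) => BLOCKING.indexOf(f.severity) !== -1);
  return blocked ? "REQUEST_CHANGES" : "APPROVE";
}
```

Note that Minor sits on the blocking side here. That is what makes the no-downgrade rules matter: the only way to turn a blocking finding into an approval is to relabel it as Nit, and that is exactly the move the rules forbid.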
Example: PR migrating the embedding model from gemini-embedding-001 to gemini-embedding-2
A real review comment is more useful than abstract description. Here's a typical one. This is a feature PR (feat(meet): dual-write embeddings to new 'embedding' column (v2)) that was merged on 2026-05-19, after 6 review-fix iterations across 1.5 hours.
First review (07:35:25):
> The dual-write approach, the idempotent migration script, and the `Promise.all` parallelization all look sound. Please address the 3 Critical and 2 Minor findings below before merging.
>
> Critical
>
> [Graph] Missing `@graph-business` tag (x3)
>
> `graph-integrity.md` requires `@graph-business` on every top-level declaration under the app layer (severity: Critical). All 3 new declarations below are missing it (detailed comments in separate threads). Existing code is also missing it, but "following existing patterns" is not a valid downgrade reason (`severity.md`).
>
> - `generateEmbeddingV2` (v2 embedding generation function)
> - `EMBEDDING_MODEL_V2` (model name config)
> - `EMBEDDING_LOCATION_V2` (region config)
>
> [Graph] `embedMeetContent`'s `@graph-connects` doesn't reflect `generateEmbeddingV2`
>
> The JSDoc on `embedMeetContent` has `@graph-connects generateEmbedding [calls] Generate embedding`, but no corresponding `@graph-connects` line has been added for the newly introduced `generateEmbeddingV2` call. The graph will be missing an edge to `generateEmbeddingV2`.
>
>       * @graph-connects generateEmbedding [calls] Generate embedding
>     + * @graph-connects generateEmbeddingV2 [calls] v2 embedding generation (dual-write)
>       * @graph-connects insertMeetChunks [calls] Insert chunks into BQ
>
> [Doc] Corresponding BigQuery schema doc is not updated
>
> The "BigQuery schema" section in the related doc is missing the new `embedding` column. Both `graph-integrity.md` and `severity.md` define doc inconsistency as Critical.
>
>      | `created_at` | TIMESTAMP | Created at |
>     +| `embedding` | FLOAT64[] | Embedding vector (v2: gemini-embedding-2) |
>
> Minor
>
> [Test] `textEmbeddingV2` value is not asserted
>
> `objectContaining` allows extra fields, so the test still passes even when the v2 value is never set.
>
>       textEmbedding: [0.1, 0.2, 0.3],
>     + textEmbeddingV2: [0.1, 0.2, 0.3],
>
> [Test] No isolated scenario for "v2 returns null"
>
> `generateEmbeddingV2: mockGenerateEmbedding` reuses the v1 mock, so the case "v2 returns null while v1 succeeds" is not independently verified.
>
> `<!-- VERDICT:REQUEST_CHANGES -->`
The takeaway is the precision of the details.
- File + line numbers are concrete.
- Suggested fixes are in diff format (copy-paste ready).
- Source guideline (`graph-integrity.md` / `severity.md`) is cited explicitly.
- The typical excuse ("existing code has the same problem") is pre-emptively closed.
- The trailing `<!-- VERDICT:REQUEST_CHANGES -->` is a machine-readable verdict marker -- the trigger that moves the PR into `REQUEST_CHANGES` state.
After this, the PR author (= usually another AI running on the author's machine) pushes a fix, the reviewer re-reviews. The next review confirms all 3 Criticals are actually resolved, raises the next Major / Critical, and so on. 6 iterations in 1.5 hours, finally APPROVE, auto-merge.
Plotted on a timeline:
With a human reviewer, this is "Critical x3 -> wait until tomorrow for the fix -> re-review the day after" -- 2 to 3 days per PR. cortex closes it in 90 minutes.
The difference between human review and auto review is not just speed. A single AI session walks all 9 dimensions in order and cites the guideline each time, which makes it much harder to miss the "deep" findings humans drop because their attention drifted -- doc consistency, recurrence-prevention judgments, weak matchers. Side-by-side comparison:
This is why the review bottleneck never forms here.
Evolving the guidelines: catching the moments AI gets it wrong, then fixing the rules
The review guidelines I've been referring to are not a static document. Running this in production surfaces recurring patterns where the AI mis-judges a specific class of issue. Each time that happens, we don't add a comment to the individual PR; we rewrite the guideline so the AI behaves correctly next time -- this is the meta-layer humans actually operate on.
A few concrete failures we hit on cortex, and how we closed each one by changing the rule, not the PR.
1. AI was downgrading because "existing code has the same issue"
Early on, immediately after flagging a violation the AI would add "however, since existing code has the same violation, I'm downgrading this to Nit" and self-downgrade. The result: violations on newly added code kept dropping to Nit, and the system kept emitting Approve.
We closed this by adding the no-downgrade rule to severity.md:
"Following existing patterns" is not a valid downgrade reason: if existing code violates a guideline, new code following that pattern still gets flagged at the same severity. Deferral language like "consider during the next refactor" is not accepted.
That wasn't enough on its own. Over time other excuse patterns surfaced -- "will be addressed in a separate PR," "will be addressed in the next session," "out of scope," "incrementally" -- so we added those as forbidden downgrade categories too. We also explicitly forbade deferring via TODO/FIXME comments in code. The mindset is: close every typical excuse path preemptively.
2. The final verdict had 3 options, and "comment-only" left PRs in limbo
The final verdict at the end of every review was originally APPROVE / REQUEST_CHANGES / COMMENT (approve / request changes / comment-only). When the AI picked COMMENT -- for example when only Minor issues existed -- the script took no action, the PR sat in review-pending forever, and ultimately someone had to manually pick it up. Classic anti-pattern, and it kept happening.
We collapsed the verdict to 2 options. Anything Minor or above is REQUEST_CHANGES, a missing verdict marker defaults to REQUEST_CHANGES (safe side), and only Nit-only or no findings (with CI passing) yields APPROVE. The principle: "if the judgment is ambiguous, fail-safe by defaulting to the blocking side (REQUEST_CHANGES)." Going all-in on that design eliminated the stuck-PR class entirely.
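The fail-safe default is easy to express in code. The sketch below parses the verdict marker and maps it onto `gh pr review` arguments. The marker regex is extrapolated from the single `<!-- VERDICT:REQUEST_CHANGES -->` example shown earlier, so the exact spec is an assumption, and `parseVerdict` / `ghReviewArgs` are hypothetical names; `--approve`, `--request-changes`, and `--body` are existing `gh pr review` flags.

```typescript
type ReviewVerdict = "APPROVE" | "REQUEST_CHANGES";

// Matches markers like <!-- VERDICT:APPROVE -->. The exact marker spec is an
// assumption extrapolated from the one example in the article.
const VERDICT_MARKER = /<!--\s*VERDICT:([A-Z_]+)\s*-->/g;

function parseVerdict(reviewBody: string): ReviewVerdict {
  let last: string | undefined;
  let match: RegExpExecArray | null;
  VERDICT_MARKER.lastIndex = 0; // the regex is global, so reset state between calls
  // Use the last marker in the body, since the verdict is emitted at the end.
  while ((match = VERDICT_MARKER.exec(reviewBody)) !== null) {
    last = match[1];
  }
  // Missing, unknown, or legacy values (e.g. COMMENT) all fall to the blocking side.
  return last === "APPROVE" ? "APPROVE" : "REQUEST_CHANGES";
}

// Build the argument list for `gh pr review`; nothing is executed here.
function ghReviewArgs(prNumber: number, verdict: ReviewVerdict, body: string): string[] {
  const flag = verdict === "APPROVE" ? "--approve" : "--request-changes";
  return ["pr", "review", String(prNumber), flag, "--body", body];
}
```

The key property is that `APPROVE` is the only value that has to be matched exactly; every other outcome, including a crashed or truncated session that never emits a marker, blocks the merge.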
3. Checklist items had no severity, so the AI's judgment kept drifting
Originally, each guideline (graph-integrity.md, testing.md, etc.) was just a bulleted checklist. Items like "Is the test name descriptive?" or "Are mocks minimized?" were listed, but without per-item severity. As a result, the same violation could land as Major in one PR and Nit in another, depending on the session.
We converted every guideline's checklist into a severity / scope / criterion table:
| Severity | Scope | Criterion |
|---|---|---|
| Critical | All PRs | Missing `@graph-business` |
| Major | App layer only | Missing tests |
| Minor | Shared packages only | More than 3 function args |
| Nit | All PRs | Naming inconsistency |
The scope column is a machine-decidable filter for which paths a check applies to, so the AI reviewer doesn't trigger irrelevant items on PRs outside that scope. Simply moving the checklist into a table made the AI's judgments noticeably more reproducible.
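To make "machine-decidable" concrete, here is a minimal sketch of scope filtering over the four example rows. The directory layout (`apps/` for the app layer, `packages/` for shared packages) is an assumption for illustration, as are the type and function names; in cortex the table lives in the guideline documents and the AI applies it.

```typescript
type RuleSeverity = "Critical" | "Major" | "Minor" | "Nit";
type Scope = "all" | "app-layer" | "shared-packages";

interface Rule {
  severity: RuleSeverity;
  scope: Scope;
  criterion: string;
}

// Rows from the example table in the article.
const RULES: Rule[] = [
  { severity: "Critical", scope: "all", criterion: "Missing @graph-business" },
  { severity: "Major", scope: "app-layer", criterion: "Missing tests" },
  { severity: "Minor", scope: "shared-packages", criterion: "More than 3 function args" },
  { severity: "Nit", scope: "all", criterion: "Naming inconsistency" },
];

// ASSUMPTION: app code lives under apps/ and shared code under packages/.
function scopeMatches(scope: Scope, path: string): boolean {
  if (scope === "app-layer") return path.startsWith("apps/");
  if (scope === "shared-packages") return path.startsWith("packages/");
  return true; // "all"
}

// Rules that apply to at least one changed file in the PR.
function applicableRules(changedPaths: string[], rules: Rule[] = RULES): Rule[] {
  return rules.filter((r) => changedPaths.some((p) => scopeMatches(r.scope, p)));
}
```

A PR that only touches shared packages never sees the "Missing tests" row in this sketch, so the reviewer cannot raise it at a session-dependent severity.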
4. The existing guidelines didn't catch AI-specific traps
After running this for a while we noticed AI-generated code has its own cluster of antipatterns:
- Calling APIs that don't exist (hallucinated APIs -- something like `user.findOrCreate()` that looks plausible but isn't actually defined)
- Swallowing errors and returning fallback values (e.g., silently returning an empty array when an upstream API fails)
- Leaving unused functions (a refactor adds the new function but doesn't delete the old one, leaving dead code)
- Expanding the modification scope beyond what was asked (you ask it to change one function and it reformats the whole file)
- Adding unnecessary backward-compatibility code (creating a deprecated alias for an internal-only function)

`security.md` / `testing.md` couldn't catch these. There's a distinct class of "mistakes only AIs make."
We added a dedicated ai-antipattern.md for this. Reviews now pick these up explicitly under the [AI-Antipattern] tag. Reviewing AI output requires designing around AI-specific traps -- you don't get there just by porting human review heuristics onto an AI.
5. The AI tries to relax "the standard itself"
The last and most important pattern. When the AI was writing fix PRs, occasionally instead of fixing the guideline violation it would write a PR that relaxes the guideline. For example:
- Lower the test coverage threshold to avoid writing more tests
- Narrow the in-house lint rule's scope to make the violation go away
- Soften the guideline doc language from "recommended" to "preferred" to weaken the binding constraint
And the AI builds a formally-coherent justification: "existing code already violates this, so let's adjust the standard to match the implementation." Left unchecked, the AI gradually walks the quality bar down.
We closed this by adding "quality-bar relaxation" as a Critical in severity.md:
A PR that relaxes the quality bar -- guideline doc, lint rule, coverage threshold -- must not be Approved by the AI reviewer. It is sent back with `REQUEST_CHANGES`. A human reviewer's approval is required. "Existing code already violates this" is not a valid justification for relaxation.
This is the one explicit boundary where we deliberately do not give the AI autonomous Approve authority. Whether the standard itself moves is a human decision. It's the meta-level safety valve for the "AI reviewing AI" architecture.
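For the coverage-threshold case, "relaxation" has a simple mechanical definition: some metric went down. The sketch below is a hypothetical pre-check, not cortex's mechanism (per the text above, the AI reviewer makes this call by citing severity.md). The four metric names follow the convention of common coverage tools, and every identifier is an assumption.

```typescript
// Hypothetical threshold shape, modeled on common coverage tools.
interface CoverageThresholds {
  branches: number;
  functions: number;
  lines: number;
  statements: number;
}

type Metric = keyof CoverageThresholds;
const METRICS: Metric[] = ["branches", "functions", "lines", "statements"];

// Metrics whose required percentage is lower after the PR than before it.
function loweredMetrics(before: CoverageThresholds, after: CoverageThresholds): Metric[] {
  return METRICS.filter((m) => after[m] < before[m]);
}

type Routing = "ai-review" | "human-approval-required";

// Any lowered metric counts as quality-bar relaxation and leaves the AI's authority.
function routeThresholdChange(before: CoverageThresholds, after: CoverageThresholds): Routing {
  return loweredMetrics(before, after).length > 0 ? "human-approval-required" : "ai-review";
}
```

Raising a threshold stays on the normal path in this sketch; only a move in the relaxing direction is routed to a human.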
Evolving the guidelines is the meta-layer humans actually operate on
The common thread: "when the AI gets it wrong, don't override the individual PR -- rewrite the guideline so the fix propagates forward."
- AI escapes via "existing code has the same issue" -> add no-downgrade rule
- AI picks "comment-only" and PR stalls -> collapse to 2-option verdict
- AI's judgment drifts -> add severity / scope columns to every item
- AI falls into its own traps -> add the AI-Antipattern category
- AI tries to relax the standard -> classify standard-relaxation as Critical, require human Approve
As long as this loop turns, the guideline is a living document that absorbs the failure patterns AI produces in production. Don't try to write the perfect guideline up front. Catch the moment AI gets it wrong, and write the rule for that moment. That's the actual mechanism behind "quality doesn't drop even when humans aren't inside the loop."
And one more thread. Right now, the trigger for "AI got it wrong, time to rewrite the guideline" is still mostly a human judgment, but parts of that maintenance are gradually becoming automatable too. Self-Healing (Part 4 next time) -- where AI investigates production incidents, opens a fix PR, runs it through auto-review, and auto-redeploys -- requires every fix PR to write one of {add lint, add guideline, horizontal rollout} under the [Recurrence] lens. So the AI is increasingly participating in the maintenance of its own review criteria, with humans still in the loop on adoption. I'll come back to this in Part 4.
Auto-fix: a separate AI applies the changes and pushes
Once REQUEST_CHANGES lands, the same script running on the PR author's machine, but in author mode, picks up the event and starts working.
[REQUEST_CHANGES detected]
| SSE push via Event Relay
[Author mode boots on PR author's machine]
| Merge origin/main into a worktree
| (lockfile resolved up front, remaining conflicts handled by AI)
| Read the auto-review comment as context
| Run claude -p inside the worktree
| Commit + push the changes
| New SHA is delivered back to the reviewer's machine via Event Relay -> re-review
Two design choices matter here.
- Reviewer and author run on different machines in different sessions -- reviewer mode and author mode are the same script, but they run on different machines in different processes. "Is the original critique correct?" is judged independently. Unlike a single AI fixing its own complaints, the judgment passes between two separate sessions.
- All iteration stays inside the same PR -- we don't spawn a new PR. The "fix the root cause, no deferrals" rule from Part 2 and the review guidelines kicks in here: if the AI tries to escape via `TODO`/`FIXME` or by splitting work out into a separate PR, the next review rejects it.
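The author-mode flow above can be written down as an ordered command plan. This is a sketch under stated assumptions: the prompt text, commit message, and `FixJob` / `planAuthorRun` names are invented for illustration, lockfile resolution is omitted, and nothing is executed. The `git` subcommands and `claude -p` are real, but the exact invocations cortex uses are not shown in the article.

```typescript
interface FixJob {
  prNumber: number;
  branch: string;        // PR head branch
  repoDir: string;       // main checkout on the author's machine
  worktreeDir: string;   // isolated checkout for this PR
  reviewComment: string; // body of the REQUEST_CHANGES review
}

interface Step {
  cwd: string;    // directory the command runs in
  argv: string[]; // program followed by its arguments
}

// Ordered plan for one author-mode run; the caller decides how to execute it.
function planAuthorRun(job: FixJob): Step[] {
  const wt = job.worktreeDir;
  // ASSUMPTION: prompt and commit message wording are placeholders.
  const prompt = `Address this review on PR #${job.prNumber}:\n${job.reviewComment}`;
  return [
    { cwd: job.repoDir, argv: ["git", "fetch", "origin"] },
    { cwd: job.repoDir, argv: ["git", "worktree", "add", wt, job.branch] },
    { cwd: wt, argv: ["git", "merge", "origin/main"] },
    { cwd: wt, argv: ["claude", "-p", prompt] },
    { cwd: wt, argv: ["git", "add", "-A"] },
    { cwd: wt, argv: ["git", "commit", "-m", `fix: address review on #${job.prNumber}`] },
    { cwd: wt, argv: ["git", "push", "origin", `HEAD:${job.branch}`] },
  ];
}
```

Everything after the worktree is created runs with `cwd` set to that worktree, which is what lets several PRs be fixed in parallel without touching the developer's main checkout.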
Auto-merge + parallel deploy
Once auto-review returns APPROVE and CI is fully green, the auto-merge script runs and squash-merges the PR.
[Auto review APPROVE + CI green]
|
auto-merge script
| squash merge to main
|
[main updated]
|
Turborepo build (affected packages only)
|
Pulumi up (multiple stacks in parallel)
|- API services
|- pipeline services
|- MCP servers
`- infra
|
[Deploy complete]
|
cpg index rebuilt (only changed nodes regenerate embeddings -- see Part 2)
pulumi up <stack1> <stack2> ... runs in parallel, so deploying 9 stacks at once finishes in about 8-12 minutes. End to end, merge-to-production is averaging 10-15 minutes.
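The reason parallel deploys compress the timeline is that wall-clock time approaches the slowest stack rather than the sum of all stacks. A minimal sketch of that fan-out, with the actual deploy step injected as a function; `deployAll` and `DeployFn` are hypothetical names, and in practice `deploy` might wrap something like `pulumi up --stack <name> --yes`, which is an assumption about how one could wire it, not a description of cortex's script.

```typescript
interface DeployResult {
  stack: string;
  ok: boolean;
  error?: string;
}

// Injected so the sketch stays independent of any particular deploy tool.
type DeployFn = (stack: string) => Promise<void>;

// Start every stack deploy at once and wait for all of them to settle.
// A failure in one stack is recorded instead of aborting the others.
async function deployAll(stacks: string[], deploy: DeployFn): Promise<DeployResult[]> {
  return Promise.all(
    stacks.map(async (stack): Promise<DeployResult> => {
      try {
        await deploy(stack);
        return { stack, ok: true };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { stack, ok: false, error: message };
      }
    }),
  );
}
```

All deploys are kicked off before any of them is awaited, so nine stacks that each take several minutes finish in roughly the time of the slowest one.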
This compounds nicely with Self-Healing PRs. Incident alert -> Self-Healing identifies root cause -> opens a fix PR -> auto review pass -> auto merge -> auto deploy runs as a single closed loop without human involvement (covered in Part 4).
The numbers, in more detail
Unpacking the headline numbers a bit further.
Depth of the review-fix loop
Across 769 PRs in 30 days, the average per PR was 10.8 review iterations, max 56. The fact that the average is past 10 means the first review almost always surfaces at least one finding.
The embedding-model migration PR shown earlier needed 6 iterations to merge, somewhat below the 10.8 average, so it is not an outlier. What would take a human reviewer days, cortex resolved in about 90 minutes.
What the auto reviewer typically flags
The most common findings out of the first review:
- [Graph] Missing `@graph-business` -- a prerequisite cpg leans on (from Part 2). The classic finding on newly added declarations.
- [Doc] Doc inconsistency -- code changed but the corresponding `docs/` section was not updated.
- [Test] Weak matchers -- `objectContaining` weakening value assertions, single-property checks via `toBe`.
- [Observability] Unstructured error logs -- `event` field or required keys deviating from the structured-log spec.
- [Recurrence] No recurrence-prevention action -- a bug-fix PR description not declaring which of {lint / horizontal rollout / add guideline / nothing} applies.
These are categories human reviewers frequently miss in practice, especially doc consistency and recurrence-prevention checks. The AI reviewer applies them mechanically on every PR.
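The weak-matcher finding is the least obvious of these, so here is a small illustration of why it matters. `containsSubset` is a hand-rolled stand-in for a subset matcher such as Jest's `expect.objectContaining`, written so the example runs without a test framework; the field names come from the embedding PR above, and the `id` value is made up.

```typescript
// Minimal stand-in for a subset matcher: it only checks the keys listed in
// `expected` and ignores everything else on `actual`.
function containsSubset(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>,
): boolean {
  return Object.keys(expected).every(
    (key) => JSON.stringify(actual[key]) === JSON.stringify(expected[key]),
  );
}

// A row where the v2 embedding was never written (the bug the test should catch).
const rowMissingV2: Record<string, unknown> = {
  id: "m1", // hypothetical value
  textEmbedding: [0.1, 0.2, 0.3],
};

// Weak: only v1 is listed, so the missing v2 value goes unnoticed.
const weakPasses = containsSubset(rowMissingV2, {
  textEmbedding: [0.1, 0.2, 0.3],
});

// Stronger: listing v2 makes the same row fail the check.
const strictPasses = containsSubset(rowMissingV2, {
  textEmbedding: [0.1, 0.2, 0.3],
  textEmbeddingV2: [0.1, 0.2, 0.3],
});
```

Here `weakPasses` is `true` and `strictPasses` is `false`: the same broken row passes or fails depending only on whether the test names the field it claims to cover.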
Actual false-positive rate
It's not zero. A few times a month we get "this is Nit, not Major" type misjudgments. The fix path is the one described above -- not a comment on the individual PR, but a guideline edit that corrects the judgment for all subsequent reviews.
What changed / Bridge to Part 4
Over the past six months, the engineer's role on cortex shifted from "writer" and "reviewer" to "operator" -- the human running the system, not acting inside each individual decision.
- AI writes the code (Claude Code)
- AI reviews the code (auto review)
- A different AI applies the fixes (author mode running on the PR author's machine)
- AI decides when to merge (auto-merge script)
- Deploys go in parallel (Turborepo + Pulumi)
What stays in human hands: "what to build at all (product / requirements)," "is this direction actually right (architectural judgment)," "which guideline to add and where," and "look at the reviews and adjust prompts and guidelines accordingly." High-abstraction work -- not individual decisions, but watching the whole system from above and steering. From human-in-the-loop to human-on-the-loop, you could say.
The widely-reported phenomena -- "AI lowers quality," "the reviewer becomes the bottleneck" -- happen when the harness is extended on the writer side only, and the reviewer side is left to humans. If writing speeds up and reviewing doesn't, of course it bottlenecks. Of course things get missed.
cortex is the opposite. We extended the harness on the reviewer side first, before fully extending it on the writer side. Anthropic's observation that the bottleneck shifts from writing to reviewing is exactly right -- which is precisely why "move the reviewer role to AI as well" is the answer cortex chose.
"The AI writes the code, the AI reviews the code." That's the core of cortex's auto-review pipeline. Quality drop and review bottleneck are functions of how far you extend the harness -- they are not inherent to AI-assisted development.
Up next in Part 4 — Self-Healing + Recurrence Prevention: a pipeline where a production alert (observed via OTel/Loki/Mimir/Tempo/Faro) triggers AI investigation, an AI-authored fix PR plus a new lint/type gate, auto-review, auto-merge, and auto-redeploy. The fix and a recurrence-prevention guardrail land together, so the same class of incident structurally can't fire again. If auto review protects quality at PR time, Part 4 protects it at production time, while growing the quality gates themselves.
The headline number above includes Self-Healing PRs (production alerts that AI investigates, fixes, and auto-deploys). For certain classes of incidents, the fix is already merged before anyone has time to react — that's where cortex sits today.



Top comments (15)
running at scale requires betting humans will notice when to intervene - that's the real tradeoff. curious what signals surface a PR for actual human review in practice.
Great question — and I think the framing already points at the right concern. "Humans noticing when to intervene" is exactly the model that stops being reliable at this scale; once you're past ~700 PRs/month, attention-based oversight breaks.
What we do instead is make escalation structural, not attentional. The concrete signals that surface a PR for human review:
None of these depend on a human happening to notice — they're explicit triggers built into the pipeline.
The deeper bet is that humans aren't watching individual PRs at all. They're watching aggregate AI behavior and tuning the guidelines (the policy layer). When the AI misjudges a class of issue, the right intervention is rewriting the rule, not stepping into the PR. That's what human-on-the-loop means in practice here.
structural escalation is the right frame - what signals break the glass for you? my attempt was blast-radius tags at design time, which helps but just shifts the attention problem upstream
Right — that's exactly the meta-problem. Any tag-based mechanism just relocates "who attends to the tag."
Our break-the-glass signals are deliberately action-coupled rather than attention-coupled:
The shape we landed on: the signal doesn't say "pay attention," it says "this is blocked until you act," and the context the human needs (diff, severity rationale, investigation log) is delivered with the signal. The queue is short and each item arrives with its own briefing.
To address the "attention shifts upstream" worry directly: the same problem applies at design-time too — blast-radius tags help only if a human is going to read every tag. The way we keep it bounded is the [Recurrence] loop: when a class of issue trips the human-only gate twice, the next PR is required to add a generation-time constraint (lint / type / CI guard) in the same PR that fixes the symptom. So the "human-only" categories shrink in absolute terms over time, even while throughput grows.
Your blast-radius tagging would slot into this nicely as one more pre-classified category — the question is whether the tag triggers attention or triggers action.
action-coupled is cleaner - context and signal arrive together. what happens when the handoff message itself gets buried in a busy channel?
Right — that's the natural next concern. The answer: the Slack message isn't the source of truth, it's just the notification. The actual queue lives on the PR itself.
When a PR breaks the glass:
The label + PR state is the durable queue. We treat "open PRs with needs-human label" (and "PRs awaiting my review") as the canonical pending-attention surface — the same way you'd treat a bug tracker. It doesn't depend on anyone happening to see a chat message in the moment.
So Slack is the push layer; the PR list is the pull layer. Both exist on purpose: push catches you when you're online; pull catches you when you come back from PTO or finally clear the queue. Slack channels are also severity-routed (#cortex-fatal is kept intentionally low-volume so critical items don't drown in the warning channel).
That said, this pattern is tuned for "will be picked up within hours / a day," not "must be picked up in 60 seconds." Anything that time-sensitive probably shouldn't sit behind an AI handoff in the first place; it belongs in pager-style escalation. The break-the-glass pattern handles the "AI couldn't, but it's not on fire" band, which turns out to be most of what shows up.
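To make the pull layer concrete, here is a minimal sketch of "open PRs with needs-human label, plus PRs awaiting my review" as a queue query. It is an illustration, not cortex's actual code: the `PullRequest` shape and the `pending_attention` helper are hypothetical names, and the data is filtered in memory rather than fetched from any API. Only the `needs-human` label comes from the description above.

```python
from dataclasses import dataclass

NEEDS_HUMAN = "needs-human"  # label name taken from the text above


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical, minimal view of a PR; not a real API response shape."""
    number: int
    state: str                              # "open" or "closed"
    labels: frozenset = frozenset()
    requested_reviewers: frozenset = frozenset()


def pending_attention(prs, me):
    """Return open PRs that carry the needs-human label or await `me` as reviewer.

    The result depends only on PR state, so it is the same whether or not
    anyone saw the chat notification.
    """
    return [
        pr for pr in prs
        if pr.state == "open"
        and (NEEDS_HUMAN in pr.labels or me in pr.requested_reviewers)
    ]


prs = [
    PullRequest(101, "open", frozenset({NEEDS_HUMAN})),
    PullRequest(102, "open", frozenset({"auto-merge"})),
    PullRequest(103, "closed", frozenset({NEEDS_HUMAN})),
    PullRequest(104, "open", requested_reviewers=frozenset({"ryan"})),
]

queue = [pr.number for pr in pending_attention(prs, me="ryan")]
```

On GitHub, roughly the same view can be had without any code from the search box, using `is:pr is:open label:needs-human` and `is:pr is:open review-requested:@me`. The point of the sketch is only that the queue is derived from durable PR state, not from message delivery.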
the PR-as-queue pattern is solid when the agent’s whole world is the repo. the one that broke this assumption for me: a contract validation agent that ran pre-PR. needs-human had nowhere to land until we added a draft PR just to carry the label. works but feels like fighting the abstraction.
Yeah, that's a real gap — agreed. PR-as-queue works when the unit of work is the PR; once you have agents whose unit of work is something else (pre-PR validation, long-lived branches, sandbox-style sessions), you have to either pick a different queue substrate or wrap the output in a PR shape just to give it somewhere to live.
We hit something similar with an annotation agent that runs on a long-lived branch that never merges. Ended up with a single permanently-open PR acting as the carrier for review comments and labels — same vibe as your draft-PR workaround. Works, but clearly fighting the abstraction.
Good exchange — useful pressure-test on how far the framing actually generalizes. Thanks for the back-and-forth.
wrapping in PR shape is a real workaround until you need rollback that doesn't map cleanly to git. what substrate did your annotation agent end up on? sounds like you found the edge of the same pattern.
Good direct question. The annotation agent ended up on:
So git stays the source of truth and side effects converge on it. "Rollback" for annotations is just git revert on the annotation branch; the indexing pipeline catches up. PR-as-queue still carries the human attention, label, and conversation; the indexing handles the state convergence.
But your "rollback doesn't map cleanly to git" point is real, and I think it's exactly the edge of this pattern. It works as long as:
Once any of those fail — an already-mutated third-party state, an event already emitted to a fanout, a destructive DDL applied to prod data — git alone can't roll you back and PR-as-queue stops carrying enough state. At that point you need event sourcing, separate state stores, or compensating-action graphs. The pattern is great for "work that converges on git"; weaker for "work whose effects escape git."
So yeah — same edge, approached from a different direction. Useful to name it explicitly.
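One way to name that edge in code is to classify a unit of work by the kinds of side effects it produces. This is a rough sketch under stated assumptions: the `Effect` enum, the `ESCAPES_GIT` set and `rollback_plan` are hypothetical names invented for illustration, and the three "escaping" categories are just the examples mentioned above (third-party state, emitted events, destructive DDL), not an exhaustive list.

```python
from enum import Enum


class Effect(Enum):
    """Hypothetical side-effect categories for one unit of agent work."""
    GIT_COMMIT = "git_commit"                    # state lives in the repo
    DERIVED_INDEX = "derived_index"              # rebuilt from git by indexing
    THIRD_PARTY_MUTATION = "third_party_mutation"
    EVENT_EMITTED = "event_emitted"
    DESTRUCTIVE_DDL = "destructive_ddl"


# Effects that a git revert cannot undo (assumption: the examples from the text).
ESCAPES_GIT = {
    Effect.THIRD_PARTY_MUTATION,
    Effect.EVENT_EMITTED,
    Effect.DESTRUCTIVE_DDL,
}


def rollback_plan(effects):
    """Return "git-revert" if every effect converges on git, else list what escaped."""
    escaped = sorted(e.value for e in effects if e in ESCAPES_GIT)
    if escaped:
        return "needs-compensation: " + ", ".join(escaped)
    return "git-revert"


annotation_work = [Effect.GIT_COMMIT, Effect.DERIVED_INDEX]
migration_work = [Effect.GIT_COMMIT, Effect.EVENT_EMITTED, Effect.DESTRUCTIVE_DDL]
```

Here `rollback_plan(annotation_work)` returns `"git-revert"`, while `rollback_plan(migration_work)` flags the escaped effects, which is the signal that PR-as-queue alone is no longer carrying enough state.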
makes sense once the branch is a mutable store rather than a log. trying to serialize annotation state back into commits would get noisy fast.
Yeah, exactly that. We rebase / squash that branch liberally and let the PR's review threads carry the conversational state — commits aren't trying to be a record. Trade-off is losing commit-level audit on the branch, which works for annotations and wouldn't for something like financial state. "Mutable store vs log" is the cleaner framing.
that recoverable/not line is the whole call - annotations tolerate lossy history fine, financial state can't.
769 PRs/month is the scale at which review-time controls start to bend. The interesting design question is which constraints belong at generation time (so the agent can never produce a violating PR) vs. which belong at review time (where humans still adjudicate judgment calls). The mix is what determines whether the quality bar actually rises.
Yes, this is the right design question. The generation-time vs review-time split is essentially Martin Fowler's Guides / Sensors taxonomy, and you're right that the mix is what determines whether the bar rises or just holds.
Cortex's bet is to push as much as possible to generation time — but with an explicit mechanism for the mix to shift over time.
Generation time (proactive):
Review time (reactive):
The piece that I think actually moves the bar over time is what we call the [Recurrence] loop: bug-fix PRs are required by the review guideline (recurrence-prevention.md) to pick one of {add lint rule, horizontal rollout to other call sites, add guideline item, or "single occurrence — nothing"}. When a class of issue trips up review twice, the rule is required to migrate from review-time to generation-time in the same PR that fixes the symptom.
So "whether the quality bar rises" isn't a static-mix question — every review-time catch becomes a candidate for promotion into a generation-time constraint, and the loop keeps converting Sensors into Guides over time. Part 4 of the series gets into this in more detail.
Thanks for the framing — this is exactly the lens that surfaces what's actually load-bearing.
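For readers who want the [Recurrence] rule in a more mechanical form, here is a minimal sketch of the check as described above. It is not the actual cortex implementation. The action identifiers, the `check_recurrence` function, and the choice to count the current catch toward the threshold of two are all assumptions for illustration. So is treating only the lint-rule option as a generation-time constraint; a type or CI guard would presumably count as well.

```python
# The four choices a bug-fix PR must pick from (identifiers are hypothetical
# spellings of the options listed in the guideline).
ACTIONS = {
    "lint-rule",
    "horizontal-rollout",
    "guideline-item",
    "single-occurrence",
}

# Assumption: of the four, only a lint rule blocks the issue at generation time.
GENERATION_TIME = {"lint-rule"}

PROMOTION_THRESHOLD = 2  # "trips up review twice"


def check_recurrence(issue_class, declared_action, times_caught):
    """Return a list of violations; an empty list means the PR passes.

    times_caught counts review-time catches of this issue class,
    including the one this PR fixes (an assumption of this sketch).
    """
    if declared_action not in ACTIONS:
        return [f"unknown recurrence action: {declared_action!r}"]
    if times_caught >= PROMOTION_THRESHOLD and declared_action not in GENERATION_TIME:
        return [
            f"{issue_class} caught {times_caught} times at review; "
            "this PR must add a generation-time constraint"
        ]
    return []


first_time = check_recurrence("missing-null-check", "single-occurrence", times_caught=1)
second_time = check_recurrence("missing-null-check", "single-occurrence", times_caught=2)
promoted = check_recurrence("missing-null-check", "lint-rule", times_caught=2)
```

The first catch may legitimately be declared a single occurrence. The second catch of the same class fails the check unless the fix PR also ships the constraint, which is what converts a Sensor into a Guide.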