Ryosuke Tsuji

Posted on May 26 • Edited on Jun 8 • Originally published at ryantsuji.dev

Human-on-the-Loop: AI Reviewing AI PRs at cortex -- 769 PRs/month while raising the quality bar (Series Part 3)

#ai #codereview #devops #productivity

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

In Part 1 (intro) I covered the high level -- AI driving both PR reviews and incident response on top of cortex. In Part 2 (Product Graph) I went deep on cpg, the unified knowledge graph that fuses code, docs, DB schemas and infra into a single business-aware index.

This post is about the automated PR review pipeline -- AI reviews the PR, a separate AI applies the fixes, and the system merges automatically once policy gates pass. The usual critiques of AI-assisted development ("the reviewer becomes the bottleneck" and "AI code drops the quality bar") don't really apply here. The rest of this post unpacks why.

Series

#	Theme	Key scene	Article
1	Series intro: cortex harness	PRs merging unattended / incidents fixed before anyone notices	ai-harness-intro
2	Product Graph (cpg)	Code / docs / DB / infra unified into one graph	cortex-product-graph
3	Auto PR review	webhook -> AI review -> auto-fix -> squash merge	This article ← you are here
4	Self-Healing + observability + auto-added guardrails	Alert -> AI investigates -> fix PR + new lint/type gate -> auto redeploy + same-pattern writes get auto-rejected	cortex-self-healing
5	Democratizing the maintenance phase	Domain experts open PRs to production; the harness owns the quality gate	cortex-non-engineer-prs
6	Series wrap-up	The underlying philosophy (what was given up, what was kept, why this design) plus a retrospective on the failures and lessons	Coming soon

Start with last month's numbers

769 PRs merged.

Median time to merge: 31 minutes.

Human review involvement per PR: near-zero.

That's a typical 30 days on cortex (Apr 21 -- May 21).

Every one of those 769 PRs had an AI reviewer as the first reviewer, with an average of 10.8 review-fix loop iterations per PR (max 56). 1 in 5 merged within 10 minutes, roughly half within 30 minutes. What humans do now is look at review outcomes and tune the review prompt and the guidelines themselves -- this is human-on-the-loop, not human-in-the-loop. Humans operate on the policy layer, not the execution layer.

Past 30 days
PRs merged	769
AI reviewer coverage	100%
Avg review iterations / PR	10.8
Max review iterations	56
Per-PR human review	~0%
Median time-to-merge	31 min
Merged within 10 min	20%
Merged within 30 min	49%

This is a typical month on cortex now.

The common refrains -- "AI speeds up writing but reviews still bottleneck" and "AI-written code lowers quality" -- describe two failure modes. cortex's pipeline is built so that neither can take hold. Let me break it down.

How the review bottleneck stops forming

The conventional wisdom: the reviewer becomes the bottleneck

As AI writes faster, the load on whoever reviews the output grows proportionally. Anthropic's write-up on its internal usage (How Anthropic teams use Claude Code) reports the same pattern -- the bottleneck has shifted from writing to reviewing, and senior engineers' work has moved from writing code toward integrating and reviewing AI output.

cortex hit exactly this. The moment we ran Claude Code at full throttle, writing speed jumped by an order of magnitude or more. Meanwhile the human time available to read and approve PRs only grew linearly. If the reviewer (=me) took a day off, the whole org stalled -- a classic single point of failure.

cortex's answer: move the reviewer role to AI as well

Part 1 and Part 2 kept asking the same recurring question: "how far do you push the harness?" cortex went all-in: the AI writes the code, the AI reviews the code. What humans keep their hands on is "tuning the prompts and guidelines themselves" -- not making decisions inside each individual PR, but watching the system from above and adjusting.

Three conditions had to hold for this to work:

The AI reviewer has enough context

A generic AI reviewer only sees the PR diff. The diff alone hides business meaning, upstream/downstream dependencies, and prior incident history. cortex feeds the Product Graph (cpg) from Part 2 -- a knowledge graph that fuses code, docs, DB schemas, and infra into one structure, with each node carrying business role and upstream/downstream dependencies -- into the AI reviewer, so it can trace impact into code that the PR didn't even touch. It catches:

- Missed upstream/downstream fixes
- Missed doc updates
- Tests that should have been updated but weren't

Diff-only AI review can never reach this territory.

Reviews are not improvisational

If reviews shift day to day, the team gets confused, and the AI can't be told what "correct" looks like. We keep reviews consistent by passing an explicit review-guideline document as the mandatory citation source for every review (we open-sourced a snapshot, see below).

False positives don't blanket-block merges

Treating every false positive as Critical breaks the workflow. We control this with a severity hierarchy (Critical / Major / Minor / Nit) plus strict no-downgrade rules.

So: the cpg from Part 2 solves "what context the AI sees," the review guidelines solve "what the AI should do" as Guides (pre-execution control), and the severity ladder + no-downgrade rules solve "what the AI must not do" as Sensors (post-execution control). This maps cleanly onto Martin Fowler's Guides / Sensors taxonomy (introduced back in Part 1).

One more upstream layer: before any of those three kicks in, a 500-lines-per-file lint keeps every file in any PR small enough to fit in a single AI session. That alone keeps AI review from breaking down, and unlike a human reviewer, the AI doesn't lose focus. There are plenty of other lints in front of the AI reviewer too, but the full picture belongs to Part 4 (Self-Healing + observability + auto-added guardrails).

How the auto-review system is wired

The implementation is a script running on each developer's machine. GitHub webhooks land on an in-house Event Relay server, get persisted to Firestore, and each developer's machine subscribes as an SSE client. On reconnect, Last-Event-ID replays anything missed -- zero event loss, single webhook registration. Reviewer-mode machines stay always-on, so any incoming review fires immediately. Author mode runs in the background on the PR author's own machine, alongside their normal dev work.
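The replay behavior can be sketched in a few lines. This is a minimal in-memory illustration, not the actual Event Relay: the names RelayEvent, EventLog, and toSseFrame are hypothetical, the array stands in for Firestore, and monotonically increasing numeric ids are an assumption made for the sketch.

```typescript
// One persisted webhook event. ASSUMPTION: ids increase monotonically;
// the article only says events are persisted and replayed.
interface RelayEvent {
  id: number;
  source: string; // e.g. "github" or "grafana"
  payload: string; // single-line JSON body
}

// In-memory stand-in for the persisted event log.
class EventLog {
  private events: RelayEvent[] = [];
  private nextId = 1;

  append(source: string, payload: string): RelayEvent {
    const event: RelayEvent = { id: this.nextId++, source, payload };
    this.events.push(event);
    return event;
  }

  // Everything the client has not seen yet. A client that connects for the
  // first time sends no Last-Event-ID (null here) and receives the full log.
  replayAfter(lastEventId: number | null): RelayEvent[] {
    if (lastEventId === null) return [...this.events];
    return this.events.filter((e) => e.id > lastEventId);
  }
}

// Serialize one event in the text/event-stream wire format. The `id:` field
// is what the client echoes back as Last-Event-ID when it reconnects.
function toSseFrame(event: RelayEvent): string {
  return `id: ${event.id}\nevent: ${event.source}\ndata: ${event.payload}\n\n`;
}
```

A machine that dropped off after event 1 reconnects with Last-Event-ID: 1 and gets events 2 onward, in order, before live delivery resumes. Persisting before fan-out is what makes the replay possible at all.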

How we ended up with Event Relay

The current setup wasn't the original design.

First: GitHub webhook → smee.io → each machine
Then: GitHub webhook → Cloudflare Tunnel → each machine
Now: GitHub webhook → in-house Event Relay with Firestore persistence → SSE to each machine

Both smee.io and Cloudflare Tunnel suffered connection drops, and deliveries were actually missed as a result. Switching to the in-house Event Relay brought event loss to zero (Firestore persistence + Last-Event-ID replay), and the relay turned into a general-purpose layer we could reuse.

The webhook ingestion for Self-Healing (covered in Part 4) actually goes through the exact same Event Relay. GitHub, Grafana, and other webhook sources get consolidated through one relay, and each machine's SSE client subscribes to whichever events it cares about. Having a single general-purpose webhook relay is a piece of infra that keeps paying off in unexpected ways -- worth investing in early.

When the reviewer's machine receives an event, the script spawns claude -p and walks through 9 dimensions (Graph / Architecture / Security / Test / Doc / Impact / Observability / AI-Antipattern / Recurrence) sequentially, then reads the verdict marker the AI emitted at the end and posts APPROVE or REQUEST_CHANGES via gh pr review.

A few notes:

- Modes split the role -- the same script started with --mode reviewer becomes the reviewer process; with --mode author it becomes the PR-author response process. The machine of whoever is assigned as reviewer runs reviewer mode; the machine of whoever opened the PR runs author mode. Event Relay multicasts the events, and each machine reacts in a distributed way.
- Per-PR worktree isolation -- author mode merges origin/main into a fresh worktree before spawning the AI. Multiple PRs can be handled in parallel without file state contaminating across them.
- 9 dimensions checked sequentially in one session -- not parallel sub-agents. A single claude -p session walks the 9 dimensions while keeping context shared, which also catches cross-dimension contradictions.
- Review guidelines: public snapshot -- air-closet/cortex-review-guidelines (JP/EN). The live guidelines are inside cortex (private repo) and evolve daily; the public repo is a snapshot extracted for reference.
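The last step of reviewer mode, posting the verdict via gh pr review, might look roughly like this. The Verdict type and the ghReviewArgs helper are hypothetical names for illustration; the --approve, --request-changes, and --body flags are part of the GitHub CLI. How the verdict is read out of the AI output is deliberately left out, since the marker format is not shown in this article.

```typescript
type Verdict = "APPROVE" | "REQUEST_CHANGES";

// Build the argument list for `gh pr review`. Returning an array (rather
// than a shell string) avoids quoting problems in a multi-line review body.
function ghReviewArgs(prNumber: number, verdict: Verdict, body: string): string[] {
  const flag = verdict === "APPROVE" ? "--approve" : "--request-changes";
  return ["pr", "review", String(prNumber), flag, "--body", body];
}
```

The resulting array can be handed to a process spawner with "gh" as the command, so the review text never passes through a shell.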

:::message alert
Guidelines alone scale only to projects in the tens-of-thousands-of-lines range. At cortex's scale (over 1M lines of code), the knowledge graph from Part 2 (cpg) is a hard prerequisite. Porting the guidelines without cpg won't reproduce the same review quality -- the AI reviewer simply can't navigate the codebase fast enough to reason about impact.
:::

Why sequential single-session review, not parallel sub-agents

We initially tried splitting the 9 dimensions across parallel sub-agents. Three problems emerged: cpg / guidelines / PR diff got injected 9 times (token cost balloons), cross-dimension findings couldn't reference each other (a [Test] issue rooted in a [Graph] violation gets dropped in isolation), and aggregating 9 outputs into a single verdict required its own machinery.

A single sequential session fixes all three: one cpg/guideline load, earlier findings stay in context for later dimensions (cross-dimension consistency comes for free), and one verdict marker at the end is the entire aggregation step.
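The token-cost argument is simple arithmetic. The sketch below is a deliberately simplified model: it only counts how many times the shared context (cpg excerpt, guidelines, PR diff) is loaded, and ignores prompt caching and per-turn re-reads. All names and the example numbers are hypothetical.

```typescript
interface ReviewCostInput {
  sharedContextTokens: number; // cpg excerpt + guidelines + PR diff
  perDimensionTokens: number; // extra tokens spent on each dimension
  dimensions: number;
}

// Parallel sub-agents: every agent loads the shared context on its own.
function parallelInputTokens(c: ReviewCostInput): number {
  return c.dimensions * (c.sharedContextTokens + c.perDimensionTokens);
}

// One sequential session: the shared context is loaded once.
function sequentialInputTokens(c: ReviewCostInput): number {
  return c.sharedContextTokens + c.dimensions * c.perDimensionTokens;
}

// Hypothetical numbers, chosen only to show the shape of the difference.
const example: ReviewCostInput = {
  sharedContextTokens: 40_000,
  perDimensionTokens: 2_000,
  dimensions: 9,
};
const parallelCost = parallelInputTokens(example); // 9 * 42,000 = 378,000
const sequentialCost = sequentialInputTokens(example); // 40,000 + 18,000 = 58,000
```

The larger the shared context is relative to the per-dimension work, the more the parallel layout pays for re-injection.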

We also swap CLAUDE.md to a review-specific version at startup. The default CLAUDE.md is dense with development-time context (Product Graph ops, prod-data safety, MCP ordering) -- noise for a reviewer. The review-specific version centers on severity, no-downgrade, and the verdict marker spec, keeping AI attention on the review task.

Cutting wasted context improves judgment precision and lowers token cost at the same time.

Operational knobs

A few filters and toggles we apply in actual use:

- Draft (WIP) PRs are excluded. GitHub Draft state is received but skipped; review starts firing once the author flips it to Ready for Review.
- Specific PRs can be targeted manually. The webhook is the normal trigger, but you can also kick off a review against a specific PR number from the CLI -- useful after a CI failure or for re-checking a single PR.
- Auto-merge is the PR author's call. Whether the pipeline runs through to auto-merge after APPROVE + CI green is set by the PR author. Default is on; for changes that go directly to prod, the author can flip it off and hit merge themselves.
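The draft filter could be as small as the following sketch. The action and pull_request.draft fields are part of GitHub's pull_request webhook payload; the shouldReview helper and the exact set of actions that trigger a review are assumptions, since the article does not list them.

```typescript
// Subset of GitHub's pull_request webhook payload used by this sketch.
interface PullRequestEvent {
  action: string;
  pull_request: { number: number; draft: boolean };
}

// ASSUMPTION: these are the actions that should start a review.
const REVIEW_ACTIONS = new Set(["opened", "reopened", "synchronize", "ready_for_review"]);

// Drafts are received but skipped; the review fires once the PR is
// marked Ready for Review (action "ready_for_review", draft false).
function shouldReview(event: PullRequestEvent): boolean {
  if (event.pull_request.draft) return false;
  return REVIEW_ACTIONS.has(event.action);
}
```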

Output structure: tags and severity

Every auto-review comment is structured as tag + severity + concrete example.

Tags (dimensions)

Tag	Dimension	Primary target
`[Graph]`	Product Graph integrity	`@graph-*` JSDoc, node dependencies, doc consistency
`[Doc]`	Doc consistency	Doc updates that should follow code changes, doc placement
`[Impact]`	Impact analysis	Missed upstream/downstream fixes, `via:` field inconsistency
`[Security]`	Security	Auth, input validation, secrets
`[Architecture]`	Composable Architecture	app/package boundaries, dependency direction
`[Test]`	Test quality	Coverage, matchers, naming
`[Observability]`	Observability	Structured logging, no-truncate rules
`[AI-Antipattern]`	AI-generated code traps	Hallucinated APIs, fallback overuse, dead code
`[Recurrence]`	Recurrence prevention	Bug-fix triage (lint / horizontal rollout / new guideline)

Severity

Severity	Criteria	Action
Critical	Security, data corruption, prod-risk, doc inconsistency, missing `@graph-*`, quality-bar relaxation	`REQUEST_CHANGES`
Major	Spec violation, Composable Architecture violation, missing tests	`REQUEST_CHANGES`
Minor	Naming, maintainability, light refactor	`REQUEST_CHANGES` (must be resolved)
Nit	Style preference, minor inconsistency	`APPROVE` (comment only)

The single most important rule is "no downgrade":

- "Following existing patterns" is not a valid reason to downgrade (existing violations are a separate problem to fix, not justification for new violations).
- "Will be addressed in a separate PR" or "incrementally" is not a valid reason to drop Critical/Major to Nit.
- "Leave a TODO/FIXME" is not a valid deferral path.

This is spelled out in severity.md, and the AI literally cites it back when it issues REQUEST_CHANGES.

Example: PR migrating the embedding model from gemini-embedding-001 to gemini-embedding-2

A real review comment is more useful than abstract description. Here's a typical one. This is a feature PR (feat(meet): dual-write embeddings to new 'embedding' column (v2)) that was merged on 2026-05-19, after 6 review-fix iterations across 1.5 hours.

First review (07:35:25):

The dual-write approach, the idempotent migration script, and the Promise.all parallelization all look sound. Please address the 3 Critical and 2 Minor findings below before merging.

Critical

[Graph] Missing @graph-business tag (x3)

graph-integrity.md requires @graph-business on every top-level declaration under the app layer (severity: Critical).
All 3 new declarations below are missing it (detailed comments in separate threads).
Existing code is also missing it, but "following existing patterns" is not a valid downgrade reason (severity.md).

generateEmbeddingV2 (v2 embedding generation function)

EMBEDDING_MODEL_V2 (model name config)

EMBEDDING_LOCATION_V2 (region config)

[Graph] embedMeetContent's @graph-connects doesn't reflect generateEmbeddingV2

The JSDoc on embedMeetContent has @graph-connects generateEmbedding [calls] Generate embedding, but no corresponding @graph-connects line has been added for the newly introduced generateEmbeddingV2 call.
The graph will be missing an edge to generateEmbeddingV2.
   * @graph-connects generateEmbedding [calls] Generate embedding
+  * @graph-connects generateEmbeddingV2 [calls] v2 embedding generation (dual-write)
   * @graph-connects insertMeetChunks [calls] Insert chunks into BQ

[Doc] Corresponding BigQuery schema doc is not updated

The "BigQuery schema" section in the related doc is missing the new embedding column.
Both graph-integrity.md and severity.md define doc inconsistency as Critical.
 | `created_at`  | TIMESTAMP   | Created at                              |
+| `embedding`   | FLOAT64[]   | Embedding vector (v2: gemini-embedding-2) |

Minor

[Test] textEmbeddingV2 value is not asserted

objectContaining allows extra fields, so the test still passes even when the v2 value is never set.
         textEmbedding: [0.1, 0.2, 0.3],
+        textEmbeddingV2: [0.1, 0.2, 0.3],

[Test] No isolated scenario for "v2 returns null"

generateEmbeddingV2: mockGenerateEmbedding reuses the v1 mock, so the case "v2 returns null while v1 succeeds" is not independently verified.



The takeaway is the precision of the details.

- File + line numbers are concrete.
- Suggested fixes are in diff format (copy-paste ready).
- Source guideline (graph-integrity.md / severity.md) is cited explicitly.
- The typical excuse ("existing code has the same problem") is pre-emptively closed.
- The review ends with a machine-readable verdict marker -- the trigger that moves the PR into REQUEST_CHANGES state.

After this, the PR author (= usually another AI running on the author's machine) pushes a fix, the reviewer re-reviews. The next review confirms all 3 Criticals are actually resolved, raises the next Major / Critical, and so on. 6 iterations in 1.5 hours, finally APPROVE, auto-merge.

[Figure: timeline of the review-fix iterations for this PR]

With a human reviewer, this is "Critical x3 -> wait until tomorrow for the fix -> re-review the day after" -- 2 to 3 days per PR. cortex closes it in 90 minutes.

The difference between human review and auto review is not just speed. A single AI session walks all 9 dimensions in order and cites the guideline each time, which makes it much harder to miss the "deep" findings humans drop because their attention drifted -- doc consistency, recurrence-prevention judgments, weak matchers.

[Figure: side-by-side comparison of human review and auto review]

This is why the review bottleneck never forms here.

Evolving the guidelines: catching the moments AI gets it wrong, then fixing the rules

The review guidelines I've been referring to are not a static document. Running this in production surfaces recurring patterns where the AI mis-judges a specific class of issue. Each time that happens, we don't add a comment to the individual PR; we rewrite the guideline so the AI behaves correctly next time -- this is the meta-layer humans actually operate on.

A few concrete failures we hit on cortex, and how we closed each one by changing the rule, not the PR.

1. AI was downgrading because "existing code has the same issue"

Early on, immediately after flagging a violation the AI would add "however, since existing code has the same violation, I'm downgrading this to Nit" and self-downgrade. The result: violations on newly added code kept dropping to Nit, and the system kept emitting Approve.

We closed this by adding the no-downgrade rule to severity.md:

"Following existing patterns" is not a valid downgrade reason: if existing code violates a guideline, new code following that pattern still gets flagged at the same severity. Deferral language like "consider during the next refactor" is not accepted.

That wasn't enough on its own. Over time other excuse patterns surfaced -- "will be addressed in a separate PR," "will be addressed in the next session," "out of scope," "incrementally" -- so we added those as forbidden downgrade categories too. We also explicitly forbade deferring via TODO/FIXME comments in code. The mindset is: close every typical excuse path preemptively.

2. The final verdict had 3 options, and "comment-only" left PRs in limbo

The final verdict at the end of every review was originally APPROVE / REQUEST_CHANGES / COMMENT (approve / request changes / comment-only). When the AI picked COMMENT -- for example when only Minor issues existed -- the script took no action, the PR sat in review-pending forever, and ultimately someone had to manually pick it up. Classic anti-pattern, and it kept happening.

We collapsed the verdict to 2 options. Anything Minor or above is REQUEST_CHANGES, a missing verdict marker defaults to REQUEST_CHANGES (safe side), and only Nit-only or no findings (with CI passing) yields APPROVE. The principle: "if the judgment is ambiguous, fail-safe by defaulting to the blocking side (REQUEST_CHANGES)." Going all-in on that design eliminated the stuck-PR class entirely.
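As a minimal sketch, the two-option rule fits in one function. The ReviewOutcome shape and decideVerdict are hypothetical; in particular, cross-checking the marker against the listed severities is an assumption of this sketch (the article only says the script reads the marker the AI emitted).

```typescript
type Severity = "Critical" | "Major" | "Minor" | "Nit";
type Verdict = "APPROVE" | "REQUEST_CHANGES";

interface ReviewOutcome {
  markerVerdict: Verdict | null; // null = no verdict marker found in the output
  severities: Severity[]; // severity of every finding in the review
  ciPassing: boolean;
}

function decideVerdict(outcome: ReviewOutcome): Verdict {
  // A missing marker is ambiguous, so fail toward the blocking side.
  if (outcome.markerVerdict === null) return "REQUEST_CHANGES";
  // Anything Minor or above blocks; only Nit-only or empty reviews can pass.
  const hasBlockingFinding = outcome.severities.some((s) => s !== "Nit");
  if (hasBlockingFinding || !outcome.ciPassing) return "REQUEST_CHANGES";
  // If the AI itself asked for changes, that is respected as well.
  return outcome.markerVerdict;
}
```

There is no third return value, so every review ends in a state the script can act on.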

3. Checklist items had no severity, so the AI's judgment kept drifting

Originally, each guideline (graph-integrity.md, testing.md, etc.) was just a bulleted checklist. Items like "Is the test name descriptive?" or "Are mocks minimized?" were listed, but without per-item severity. As a result, the same violation could land as Major in one PR and Nit in another, depending on the session.

We converted every guideline's checklist into a severity / scope / criterion table:

Severity	Scope	Criterion
Critical	All PRs	Missing `@graph-business`
Major	App layer only	Missing tests
Minor	Shared packages only	More than 3 function args
Nit	All PRs	Naming inconsistency

The scope column is a machine-decidable filter for which paths a check applies to, so the AI reviewer doesn't trigger irrelevant items on PRs outside that scope. Simply moving the checklist into a table made the AI's judgments significantly more reproducible.
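To show what "machine-decidable" can mean for the scope column, here is a rough sketch that filters the example rows by the paths a PR touches. The apps/ and packages/ prefixes are an assumed repo layout, and Rule, scopeOfPath, and applicableRules are hypothetical names; in cortex the table lives in the guideline documents, not in code like this.

```typescript
type Severity = "Critical" | "Major" | "Minor" | "Nit";
type Scope = "all" | "app" | "shared";

interface Rule {
  severity: Severity;
  scope: Scope;
  criterion: string;
}

// Rows taken from the example table above.
const RULES: Rule[] = [
  { severity: "Critical", scope: "all", criterion: "Missing @graph-business" },
  { severity: "Major", scope: "app", criterion: "Missing tests" },
  { severity: "Minor", scope: "shared", criterion: "More than 3 function args" },
  { severity: "Nit", scope: "all", criterion: "Naming inconsistency" },
];

// ASSUMPTION: apps/ holds the app layer and packages/ the shared packages.
function scopeOfPath(path: string): Scope {
  if (path.startsWith("apps/")) return "app";
  if (path.startsWith("packages/")) return "shared";
  return "all";
}

// Rules that apply to a PR, given the paths it changes.
function applicableRules(changedPaths: string[], rules: Rule[] = RULES): Rule[] {
  const active = new Set<Scope>(["all"]); // "All PRs" rules always apply
  for (const path of changedPaths) active.add(scopeOfPath(path));
  return rules.filter((rule) => active.has(rule.scope));
}
```

A PR that only touches shared packages never sees the "App layer only" row, which removes one source of session-to-session drift.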

4. The existing guidelines didn't catch AI-specific traps

After running this for a while we noticed AI-generated code has its own cluster of antipatterns:

- Calling APIs that don't exist (hallucinated APIs -- something like user.findOrCreate() that looks plausible but isn't actually defined)
- Swallowing errors and returning fallback values (e.g., silently returning an empty array when an upstream API fails)
- Leaving unused functions (a refactor adds the new function but doesn't delete the old one, leaving dead code)
- Expanding the modification scope beyond what was asked (you ask it to change one function and it reformats the whole file)
- Adding unnecessary backward-compatibility code (creating a deprecated alias for an internal-only function)

security.md / testing.md couldn't catch these. There's a distinct class of "mistakes only AIs make."

We added a dedicated ai-antipattern.md for this. Reviews now pick these up explicitly under the [AI-Antipattern] tag. Reviewing AI output requires designing around AI-specific traps -- you don't get there just by porting human review heuristics onto an AI.
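The fallback-overuse trap is easiest to see side by side. The Order type and both functions below are made up for illustration and are not taken from cortex; the upstream call is modeled as a plain function that may throw.

```typescript
interface Order {
  id: string;
}

// Hypothetical upstream call that may throw.
type LoadOrders = () => Order[];

// Antipattern: the failure is swallowed and replaced by a fallback value.
function listOrdersSwallowing(load: LoadOrders): Order[] {
  try {
    return load();
  } catch {
    return []; // caller cannot tell "upstream is down" from "no orders"
  }
}

// Alternative: add context and rethrow, so the failure stays visible
// to callers, logs, and alerts.
function listOrders(load: LoadOrders): Order[] {
  try {
    return load();
  } catch (err) {
    throw new Error(`loading orders failed: ${String(err)}`);
  }
}
```

The first version looks defensive and passes most happy-path tests, which is why it tends to slip through a review that is not explicitly looking for it.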

5. The AI tries to relax "the standard itself"

The last and most important pattern. When the AI was writing fix PRs, occasionally instead of fixing the guideline violation it would write a PR that relaxes the guideline. For example:

- Lower the test coverage threshold to avoid writing more tests
- Narrow the in-house lint rule's scope to make the violation go away
- Soften the guideline doc language from "recommended" to "preferred" to weaken the binding constraint

And the AI builds a formally-coherent justification: "existing code already violates this, so let's adjust the standard to match the implementation." Left unchecked, the AI gradually walks the quality bar down.

We closed this by adding "quality-bar relaxation" as a Critical in severity.md:

A PR that relaxes the quality bar -- guideline doc, lint rule, coverage threshold -- must not be Approved by the AI reviewer. It is sent back with REQUEST_CHANGES. A human reviewer's approval is required. "Existing code already violates this" is not a valid justification for relaxation.

This is the one explicit boundary where we deliberately do not give the AI autonomous Approve authority. Whether the standard itself moves is a human decision. It's the meta-level safety valve for the "AI reviewing AI" architecture.
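In cortex this rule is enforced by the AI reviewer citing severity.md. Purely as an illustration of the coverage-threshold case, a mechanical check could compare the thresholds on the base branch with those in the PR. The Thresholds shape and both function names are hypothetical, and this covers only one of the three relaxation types named above.

```typescript
// Coverage thresholds by metric, e.g. { lines: 80, branches: 70 }.
type Thresholds = Record<string, number>;

// Names of thresholds that were lowered or removed between base and head.
function relaxedThresholds(base: Thresholds, head: Thresholds): string[] {
  return Object.keys(base).filter((key) => {
    const before = base[key];
    const after = head[key];
    if (before === undefined) return false;
    return after === undefined || after < before;
  });
}

// Any relaxation means the AI reviewer must not approve on its own.
function requiresHumanApproval(base: Thresholds, head: Thresholds): boolean {
  return relaxedThresholds(base, head).length > 0;
}
```

Raising a threshold or adding a new one passes through untouched; only movement in the weaker direction is routed to a human.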

Evolving the guidelines is the meta-layer humans actually operate on

The common thread: "when the AI gets it wrong, don't override the individual PR -- rewrite the guideline so the fix propagates forward."

- AI escapes via "existing code has the same issue" -> add no-downgrade rule
- AI picks "comment-only" and PR stalls -> collapse to 2-option verdict
- AI's judgment drifts -> add severity / scope columns to every item
- AI falls into its own traps -> add the AI-Antipattern category
- AI tries to relax the standard -> classify standard-relaxation as Critical, require human Approve

As long as this loop turns, the guideline is a living document that absorbs the failure patterns AI produces in production. Don't try to write the perfect guideline up front. Catch the moment AI gets it wrong, and write the rule for that moment. That's the actual mechanism behind "quality doesn't drop even when humans aren't inside the loop."

And one more thread. Right now, the trigger for "AI got it wrong, time to rewrite the guideline" is still mostly a human judgment, but parts of that maintenance are gradually becoming automatable too. Self-Healing (Part 4 next time) -- where AI investigates production incidents, opens a fix PR, runs it through auto-review, and auto-redeploys -- requires every fix PR to write one of {add lint, add guideline, horizontal rollout} under the [Recurrence] lens. So the AI is increasingly participating in the maintenance of its own review criteria, with humans still in the loop on adoption. I'll come back to this in Part 4.

Auto-fix: a separate AI applies the changes and pushes

Once REQUEST_CHANGES lands, the same script running on the PR author's machine, but in author mode, picks up the event and starts working.

[REQUEST_CHANGES detected]
   | SSE push via Event Relay
[Author mode boots on PR author's machine]
   | Merge origin/main into a worktree
   |  (lockfile resolved up front, remaining conflicts handled by AI)
   | Read the auto-review comment as context
   | Run claude -p inside the worktree
   | Commit + push the changes
   | New SHA is delivered back to the reviewer's machine via Event Relay -> re-review

Two design choices matter here.

- Reviewer and author run on different machines in different sessions -- reviewer mode and author mode are the same script, but each runs as its own process on its own machine, so "is the original critique correct?" is judged independently. Unlike a single AI fixing its own complaints, the judgment passes between two separate sessions.
- All iteration stays inside the same PR -- we don't spawn a new PR. The "fix the root cause, no deferrals" rule from Part 2 and the review guidelines kicks in here: if the AI tries to escape via TODO/FIXME or by splitting work out into a separate PR, the next review rejects it.
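The author-mode flow above can be written down as a command plan. This is a sketch under assumptions, not the real script: Step and authorModePlan are hypothetical names, the commit message is invented, lockfile and conflict handling are left out, and the permission settings that claude -p would need to edit files unattended are omitted.

```typescript
interface Step {
  cmd: string;
  args: string[];
  cwd?: string; // directory to run in; omitted = the main checkout
}

function authorModePlan(prBranch: string, worktreeDir: string, prompt: string): Step[] {
  return [
    { cmd: "git", args: ["fetch", "origin"] },
    // Detached worktree, so the author's own checkout of the branch is untouched.
    { cmd: "git", args: ["worktree", "add", "--detach", worktreeDir, `origin/${prBranch}`] },
    { cmd: "git", args: ["merge", "--no-edit", "origin/main"], cwd: worktreeDir },
    // `prompt` would carry the auto-review comment as context.
    { cmd: "claude", args: ["-p", prompt], cwd: worktreeDir },
    { cmd: "git", args: ["add", "--all"], cwd: worktreeDir },
    { cmd: "git", args: ["commit", "--message", "fix: address review findings"], cwd: worktreeDir },
    // Push the new commit back onto the same PR branch.
    { cmd: "git", args: ["push", "origin", `HEAD:refs/heads/${prBranch}`], cwd: worktreeDir },
  ];
}
```

Because every step after the worktree is created runs inside worktreeDir, several PRs can be worked on at once without touching each other's files.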

Auto-merge + parallel deploy

Once auto-review returns APPROVE and CI is fully green, the auto-merge script runs and squash-merges the PR.

[Auto review APPROVE + CI green]
   |
auto-merge script
   | squash merge to main
   |
[main updated]
   |
Turborepo build (affected packages only)
   |
Pulumi up (multiple stacks in parallel)
   |- API services
   |- pipeline services
   |- MCP servers
   `- infra
   |
[Deploy complete]
   |
cpg index rebuilt (only changed nodes regenerate embeddings -- see Part 2)

pulumi up runs for multiple stacks in parallel, so deploying 9 stacks at once finishes in about 8-12 minutes. End to end, merge-to-production averages 10-15 minutes.
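The parallel fan-out is the part that keeps wall-clock time close to the slowest stack rather than the sum of all stacks. A minimal sketch, assuming a hypothetical DeployFn that wraps whatever actually runs the deploy for one stack:

```typescript
// Hypothetical: resolves when one stack has finished deploying.
type DeployFn = (stack: string) => Promise<void>;

interface StackResult {
  stack: string;
  ok: boolean;
  error?: string;
}

// Start every deploy at once and wait for all of them. Failures are
// collected per stack instead of aborting the whole batch on the first one.
async function deployAll(stacks: string[], deploy: DeployFn): Promise<StackResult[]> {
  return Promise.all(
    stacks.map(async (stack): Promise<StackResult> => {
      try {
        await deploy(stack);
        return { stack, ok: true };
      } catch (err) {
        return { stack, ok: false, error: String(err) };
      }
    }),
  );
}
```

Results come back in the same order as the input list, so a failed stack is easy to report by name.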

This compounds nicely with Self-Healing PRs. The chain of incident alert -> Self-Healing identifies root cause -> opens a fix PR -> auto review pass -> auto merge -> auto deploy runs as a single closed loop without human involvement (covered in Part 4).

The numbers, in more detail

Unpacking the headline numbers a bit further.

Depth of the review-fix loop

Across 769 PRs in 30 days, the average per PR was 10.8 review iterations, max 56. An average past 10 suggests the first review almost always surfaces at least one finding.

The embedding-model migration PR shown earlier needed 6 iterations to merge, below the 10.8 average, so it is not an outlier. What would take a human reviewer days, cortex typically resolves in tens of minutes.

What the auto reviewer typically flags

The most common findings out of the first review:

- [Graph] Missing @graph-business -- a prerequisite cpg leans on (from Part 2). The classic finding on newly added declarations.
- [Doc] Doc inconsistency -- code changed but the corresponding docs/ section was not updated.
- [Test] Weak matchers -- objectContaining weakening value assertions, single-property checks via toBe.
- [Observability] Unstructured error logs -- event field or required keys deviating from the structured-log spec.
- [Recurrence] No recurrence-prevention action -- a bug-fix PR description not declaring which of {lint / horizontal rollout / add guideline / nothing} applies.

These are categories human reviewers frequently miss in practice, especially doc consistency and recurrence-prevention checks. The AI reviewer applies them mechanically on every PR.
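Why a partial matcher counts as "weak" is easy to demonstrate. The containsSubset function below is a hand-rolled stand-in that mimics the subset semantics of a matcher like objectContaining; it is not Jest's implementation, and the row objects are invented for the example.

```typescript
// True when every key in `expected` is present in `actual` with an equal
// value. Extra keys in `actual` are ignored, which is the whole point.
function containsSubset(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>,
): boolean {
  return Object.keys(expected).every(
    (key) => JSON.stringify(actual[key]) === JSON.stringify(expected[key]),
  );
}

// A row where the v2 embedding was never written.
const rowMissingV2 = { textEmbedding: [0.1, 0.2, 0.3] };

// Weak expectation: only v1 is listed, so the missing v2 goes unnoticed.
const weakPasses = containsSubset(rowMissingV2, { textEmbedding: [0.1, 0.2, 0.3] });

// Stronger expectation: v2 is listed explicitly, so the same row now fails.
const strictPasses = containsSubset(rowMissingV2, {
  textEmbedding: [0.1, 0.2, 0.3],
  textEmbeddingV2: [0.1, 0.2, 0.3],
});
```

Here weakPasses is true and strictPasses is false: the test only starts protecting the v2 write once the expected object names it.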

Actual false-positive rate

It's not zero. A few times a month we get "this is Nit, not Major" type misjudgments. The fix path is the one described above -- not a comment on the individual PR, but a guideline edit that corrects the judgment for all subsequent reviews.

What changed / Bridge to Part 4

Over the past six months, the engineer's role on cortex shifted from "writer" and "reviewer" to "operator" -- the human running the system, not acting inside each individual decision.

- AI writes the code (Claude Code)
- AI reviews the code (auto review)
- A different AI applies the fixes (author mode running on the PR author's machine)
- AI decides when to merge (auto-merge script)
- Deploys go in parallel (Turborepo + Pulumi)

What stays in human hands: "what to build at all (product / requirements)," "is this direction actually right (architectural judgment)," "which guideline to add and where," and "look at the reviews and adjust prompts and guidelines accordingly." High-abstraction work -- not individual decisions, but watching the whole system from above and steering. From human-in-the-loop to human-on-the-loop, you could say.

The widely-reported phenomena -- "AI lowers quality," "the reviewer becomes the bottleneck" -- happen when the harness is extended on the writer side only, and the reviewer side is left to humans. If writing speeds up and reviewing doesn't, of course it bottlenecks. Of course things get missed.

cortex is the opposite. We extended the harness on the reviewer side first, before fully extending it on the writer side. Anthropic's observation that the bottleneck shifts from writing to reviewing is exactly right -- which is precisely why "move the reviewer role to AI as well" is the answer cortex chose.

"The AI writes the code, the AI reviews the code." That's the core of cortex's auto-review pipeline. Quality drop and review bottleneck are functions of how far you extend the harness -- they are not inherent to AI-assisted development.

Up next in Part 4 — Self-Healing + Recurrence Prevention: a pipeline where a production alert (observed via OTel/Loki/Mimir/Tempo/Faro) triggers AI investigation, an AI-authored fix PR plus a new lint/type gate, auto-review, auto-merge, and auto-redeploy. The fix and a recurrence-prevention guardrail land together, so the same class of incident structurally can't fire again. If auto review protects quality at PR time, Part 4 protects it at production time, while growing the quality gates themselves.

The headline number above includes Self-Healing PRs (production alerts that AI investigates, fixes, and auto-deploys). For certain classes of incidents, the fix is already merged before anyone has time to react — that's where cortex sits today.

Top comments (15)

Mykola Kondratiuk • May 29

running at scale requires betting humans will notice when to intervene - that's the real tradeoff. curious what signals surface a PR for actual human review in practice.

Ryosuke Tsuji • May 29

Great question — and I think the framing already points at the right concern. "Humans noticing when to intervene" is exactly the model that stops being reliable at this scale; once you're past ~700 PRs/month, attention-based oversight breaks.

What we do instead is make escalation structural, not attentional. The concrete signals that surface a PR for human review:

Quality-bar relaxation PRs — anything that lowers a lint rule, coverage threshold, or guideline binding is classified Critical in severity.md. The AI is forbidden from approving it; a human reviewer's approve is required. This is the meta-safety valve in the post.
Repeated auto-fix failures — if the author-mode AI can't pass CI after a few attempts, the PR gets a needs-human label.
Alert-Fix "can't fix" verdicts — when AI investigation can't pin down a root cause, it posts the investigation log to Slack and explicitly hands off (Part 4 of the series goes deeper on this).
PR-author opt-out on auto-merge — authors can disable auto-merge per PR for changes they want to land carefully.

None of these depend on a human happening to notice — they're explicit triggers built into the pipeline.

The deeper bet is that humans aren't watching individual PRs at all. They're watching aggregate AI behavior and tuning the guidelines (the policy layer). When the AI misjudges a class of issue, the right intervention is rewriting the rule, not stepping into the PR. That's what human-on-the-loop means in practice here.

Mykola Kondratiuk • May 30

structural escalation is the right frame - what signals break the glass for you? my attempt was blast-radius tags at design time, which helps but just shifts the attention problem upstream

Ryosuke Tsuji • May 30

Right — that's exactly the meta-problem. Any tag-based mechanism just relocates "who attends to the tag."

Our break-the-glass signals are deliberately action-coupled rather than attention-coupled:

Explicit AI handoff — Alert-Fix posts its investigation log to Slack and says "I couldn't fix this safely" when it can't pin down a root cause. No flag to scan; the system tells you it's stuck, with everything it tried.
Limit violations — repeated CI-fix failures, unresolvable conflicts, etc. get a needs-human label. That's literally the queue.
Categories pre-defined as human-only — quality-bar relaxation (lint/coverage downgrades), credential handling, production data ops. Classified Critical in severity.md such that the AI is forbidden from approving them. Not "AI flags for human review" — "AI cannot approve, so the PR sits until a human does."

The shape we landed on: the signal doesn't say "pay attention," it says "this is blocked until you act," and the context the human needs (diff, severity rationale, investigation log) is delivered with the signal. The queue is short and each item arrives with its own briefing.

To address the "attention shifts upstream" worry directly: the same problem applies at design-time too — blast-radius tags help only if a human is going to read every tag. The way we keep it bounded is the [Recurrence] loop: when a class of issue trips the human-only gate twice, the next PR is required to add a generation-time constraint (lint / type / CI guard) in the same PR that fixes the symptom. So the "human-only" categories shrink in absolute terms over time, even while throughput grows.

Your blast-radius tagging would slot into this nicely as one more pre-classified category — the question is whether the tag triggers attention or triggers action.

Mykola Kondratiuk • May 30

action-coupled is cleaner - context and signal arrive together. what happens when the handoff message itself gets buried in a busy channel?

Ryosuke Tsuji • May 30

Right — that's the natural next concern. The answer: the Slack message isn't the source of truth, it's just the notification. The actual queue lives on the PR itself.

When a PR breaks the glass:

A needs-human label is applied (or it sits in REQUEST_CHANGES state). Auto-merge is disabled.
The relevant Slack channel gets a thread with the investigation log / verdict / diff.
Optionally, the assigned reviewer gets a Slack DM via our notification relay.

The label + PR state is the durable queue. We treat "open PRs with needs-human label" (and "PRs awaiting my review") as the canonical pending-attention surface — the same way you'd treat a bug tracker. It doesn't depend on anyone happening to see a chat message in the moment.

So Slack is the push layer; the PR list is the pull layer. Both exist on purpose: push catches you when you're online; pull catches you when you come back from PTO or finally clear the queue. Slack channels are also severity-routed (#cortex-fatal is kept intentionally low-volume so critical items don't drown in the warning channel).
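As a rough sketch of the pull layer, the pending-attention surface can be derived purely from PR state. The `OpenPR` type and `pending_attention` function below are hypothetical stand-ins for whatever the PR host's API returns; only the `needs-human` label name comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class OpenPR:
    # Hypothetical, simplified view of an open PR.
    number: int
    labels: Set[str] = field(default_factory=set)
    requested_reviewers: Set[str] = field(default_factory=set)


def pending_attention(prs: List[OpenPR], me: str) -> List[int]:
    """PR numbers blocked on a human: labelled needs-human or awaiting `me`."""
    return sorted(
        pr.number
        for pr in prs
        if "needs-human" in pr.labels or me in pr.requested_reviewers
    )
```

Because the list is recomputed from labels and review requests each time, a missed Slack thread changes nothing about what shows up in it.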

That said — this pattern is tuned for "will be picked up within hours / a day," not "must be picked up in 60 seconds." Anything that time-sensitive probably shouldn't sit behind an AI handoff in the first place; it belongs in pager-style escalation. The break-the-glass pattern handles the 'AI couldn't, but it's not on fire' band, which turns out to be most of what shows up.

Mykola Kondratiuk • May 30

the PR-as-queue pattern is solid when the agent’s whole world is the repo. the one that broke this assumption for me: a contract validation agent that ran pre-PR. needs-human had nowhere to land until we added a draft PR just to carry the label. works but feels like fighting the abstraction.

Ryosuke Tsuji • May 30

Yeah, that's a real gap — agreed. PR-as-queue works when the unit of work is the PR; once you have agents whose unit of work is something else (pre-PR validation, long-lived branches, sandbox-style sessions), you have to either pick a different queue substrate or wrap the output in a PR shape just to give it somewhere to live.

We hit something similar with an annotation agent that runs on a long-lived branch that never merges. Ended up with a single permanently-open PR acting as the carrier for review comments and labels — same vibe as your draft-PR workaround. Works, but clearly fighting the abstraction.

Good exchange — useful pressure-test on how far the framing actually generalizes. Thanks for the back-and-forth.

Mykola Kondratiuk • May 30

wrapping in PR shape is a real workaround until you need rollback that doesn't map cleanly to git. what substrate did your annotation agent end up on? sounds like you found the edge of the same pattern.

Ryosuke Tsuji • May 30

Good direct question. The annotation agent ended up on:

A long-lived branch that never merges, holding the annotation state
A permanently-open PR from that branch as the review / label surface
An indexing pipeline (BigQuery + embeddings) that auto-converges on whatever the git state of the branch is

So git stays the source of truth and side effects converge on it. "Rollback" for annotations is just git revert on the annotation branch; the indexing pipeline catches up. PR-as-queue still carries the human attention, label, and conversation; the indexing handles the state convergence.
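A minimal sketch of what "side effects converge on git" could mean in code, assuming the branch and the index can each be summarized as a mapping from file path to content hash. The function names and the dictionary representation are assumptions for illustration; the real pipeline (BigQuery + embeddings) is not shown in the thread.

```python
from typing import Dict, List, Tuple


def plan_reconcile(
    git_state: Dict[str, str], index_state: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (paths to upsert, paths to delete) so the index matches the branch."""
    upserts = sorted(p for p, h in git_state.items() if index_state.get(p) != h)
    deletes = sorted(p for p in index_state if p not in git_state)
    return upserts, deletes


def apply_reconcile(
    git_state: Dict[str, str], index_state: Dict[str, str]
) -> Dict[str, str]:
    """Apply the plan and return the new index; running it again is a no-op."""
    upserts, deletes = plan_reconcile(git_state, index_state)
    new_index = {p: h for p, h in index_state.items() if p not in deletes}
    for path in upserts:
        new_index[path] = git_state[path]
    return new_index
```

Under this model a `git revert` only changes `git_state`; the next reconcile pass removes or rewrites whatever the reverted commit had added, which is why rollback stays a pure git operation.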

But your "rollback doesn't map cleanly to git" point is real, and I think it's exactly the edge of this pattern. It works as long as:

Side effects are idempotent (re-index on revert recovers state)
External systems can replay from git (deploy + redeploy = rollback)
There's a compensating action for the truly irreversible parts

Once any of those fail — an already-mutated third-party state, an event already emitted to a fanout, a destructive DDL applied to prod data — git alone can't roll you back and PR-as-queue stops carrying enough state. At that point you need event sourcing, separate state stores, or compensating-action graphs. The pattern is great for "work that converges on git"; weaker for "work whose effects escape git."

So yeah — same edge, approached from a different direction. Useful to name it explicitly.

Mykola Kondratiuk • May 30

makes sense once the branch is a mutable store rather than a log. trying to serialize annotation state back into commits would get noisy fast.

Ryosuke Tsuji • May 30

Yeah, exactly that. We rebase / squash that branch liberally and let the PR's review threads carry the conversational state — commits aren't trying to be a record. Trade-off is losing commit-level audit on the branch, which works for annotations and wouldn't for something like financial state. "Mutable store vs log" is the cleaner framing.

Mykola Kondratiuk • May 30

that recoverable/not line is the whole call - annotations tolerate lossy history fine, financial state can't.

Theo Valmis • May 29

769 PRs/month is the scale at which review-time controls start to bend. The interesting design question is which constraints belong at generation time (so the agent can never produce a violating PR) vs. which belong at review time (where humans still adjudicate judgment calls). The mix is what determines whether the quality bar actually rises.

Ryosuke Tsuji • May 29

Yes, this is the right design question. The generation-time vs review-time split is essentially Martin Fowler's Guides / Sensors taxonomy, and you're right that the mix is what determines whether the bar rises or just holds.

Cortex's bet is to push as much as possible to generation time — but with an explicit mechanism for the mix to shift over time.

Generation time (proactive):

Lint, type checks, a 90% coverage gate, and a 500-lines-per-file cap — all enforced pre-commit / in CI.
cpg context fed to the author AI (same graph the reviewer would query), so the author already has the context the reviewer is going to evaluate against.
Author-side CLAUDE.md encodes architectural rules so violations rarely get written in the first place.

Review time (reactive):

9-dimension AI review with a severity hierarchy and strict no-downgrade rules.
Severity gates merge based on the verdict marker.

The piece that I think actually moves the bar over time is what we call the [Recurrence] loop: bug-fix PRs are required by the review guideline (recurrence-prevention.md) to pick one of {add lint rule, horizontal rollout to other call sites, add guideline item, or "single occurrence — nothing"}. When a class of issue trips up review twice, the rule is required to migrate from review-time to generation-time in the same PR that fixes the symptom.

So "whether the quality bar rises" isn't a static-mix question — every review-time catch becomes a candidate for promotion into a generation-time constraint, and the loop keeps converting Sensors into Guides over time. Part 4 of the series gets into this in more detail.
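The [Recurrence] rule can be sketched as a small check run against each bug-fix PR. This is a hedged illustration only: the `Prevention` enum mirrors the four options listed above, but `recurrence_check`, the `prior_occurrences` counter, and the choice to treat only the lint-rule option as a generation-time constraint are assumptions, not the contents of recurrence-prevention.md.

```python
from enum import Enum
from typing import Optional


class Prevention(Enum):
    ADD_LINT_RULE = "add lint rule"
    HORIZONTAL_ROLLOUT = "horizontal rollout to other call sites"
    ADD_GUIDELINE_ITEM = "add guideline item"
    SINGLE_OCCURRENCE = "single occurrence, nothing"


# Assumption: of the four options, only a lint rule counts as generation-time.
GENERATION_TIME = {Prevention.ADD_LINT_RULE}


def recurrence_check(choice: Optional[Prevention], prior_occurrences: int) -> Optional[str]:
    """Return a blocking message for a bug-fix PR, or None if the choice is acceptable.

    prior_occurrences is how many times this class of issue was caught in review before.
    """
    if choice is None:
        return "bug-fix PR must declare a recurrence-prevention choice"
    if prior_occurrences >= 1 and choice not in GENERATION_TIME:
        return "repeat issue class: add a generation-time constraint in this PR"
    return None
```

For example, `recurrence_check(Prevention.ADD_GUIDELINE_ITEM, 1)` returns the blocking message, while the same choice on a first occurrence passes.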

Thanks for the framing — this is exactly the lens that surfaces what's actually load-bearing.
