Saurav Bhattacharya

Posted on Jun 7

The Alignment Problem Is an HR Problem - And We Should Treat It Like One

#ai #agents #safety #evaluation

Every company has an HR department. Its job isn't to make employees want to do good work - that's culture, incentives, leadership. HR's job is narrower: detect misalignment before it causes damage. Performance reviews. Behavioral flags. Exit interviews. Paper trails.

We've been doing this for thousands of years with humans. And it mostly works - not because humans are perfectly aligned with their employer's goals, but because we built detection infrastructure that catches misalignment early enough to act on it.

So why are we treating AI alignment like it's a completely novel problem?

The Detection Gap

Here's where the analogy breaks down - and where the real engineering challenge lives.

HR works because human misalignment tends to surface behaviorally before it becomes catastrophic. We have millennia of pattern recognition: body language, social cues, whistleblowers, audit trails. Detection usually precedes harm.

With AI models, we have a detection gap. Not that the model is necessarily hiding something - but that we currently lack reliable ways to look inside and verify what it's actually optimizing for. You can't read body language on a transformer.

This isn't a philosophical problem. It's an infrastructure problem.

The Wrong Response

The default response to we can't verify what's happening inside these systems has been: slow down. Be cautious. Deploy less.

I think that's backwards.

If your company had an HR problem - employees doing things you couldn't detect - you wouldn't shut down the company. You'd build better monitoring. Better audit systems. Better detection tooling.

The same logic applies to AI. The answer to we can't see inside the model isn't stop deploying models. It's build the observability layer.

Let AI Audit Itself

Here's the part that most safety discussions miss: the best tool for understanding AI systems might be AI systems themselves.

Anthropic's own interpretability research is already converging on this - using Claude to explain what neurons in Claude are doing. That's not a gimmick. That's the equivalent of building an internal affairs department staffed by people who understand the organization from the inside.

The alignment bottleneck isn't caution. It's human cognitive bandwidth. We can't manually inspect every weight, every activation, every decision path. But models operating at machine speed can audit other models at machine speed.

// This is what `HR for AI` looks like in practice
const evalResult = await evaluate(agentOutput, {
  checks: [
    // Tier 1: Deterministic - did it follow the rules?
    constraints.requiredSections(['summary', 'recommendation']),
    constraints.noFabricatedUrls(),
    constraints.completedWithinTimeout(30_000),

    // Tier 2: Heuristic - does it smell right?
    heuristics.relevanceToTask(originalPrompt, { threshold: 0.8 }),
    heuristics.noRepetitionLoops(),

    // Tier 3: Model-as-judge - genuine judgment calls
    judge.actionability({ rubric: actionabilityRubric }),
    judge.driftDetection({ task: originalPrompt, confidence: 'required' }),
  ]
});

Notice the structure: deterministic checks first (cheap, reliable, scalable), heuristics second (still no AI needed), model-as-judge last (only for genuine ambiguity). You don't call the CEO to check if someone clocked in on time. You use a badge reader.

What This Means Practically

If you're running AI agents in production - in CI, in code review, in autonomous workflows - you already need this. Your agents are producing outputs right now that no one is verifying beyond did it crash?

The questions you should be asking:

Did it actually address the task? (Not: did it produce output? - did it produce relevant output?)
Did it fabricate anything? (References, URLs, file paths, statistics)
Did it drift? (Started on task, ended somewhere else entirely)
Is the output actionable? (Or is it generic filler that sounds good but says nothing?)

These are all detectable. Most of them are detectable without another model call. The 80% case is pure deterministic checks - format validation, reference verification, diff analysis, constraint matching.

The Real Critique

When frontier labs say we need to slow down, I hear: we haven't built the detection infrastructure yet. Fair. It's hard. But the framing matters.

Slow down until humans figure it out is a losing strategy - because the systems are getting more complex faster than human researchers can keep up.

Accelerate AI's ability to audit itself is the winning strategy. Build the HR department. Staff it with models that can operate at the speed and scale of the systems they're monitoring.

That's not reckless. That's engineering.

Takeaway

Alignment isn't a reason to stop. It's a reason to build. Specifically, to build:

Detection infrastructure that catches misalignment behaviorally
Tiered evaluation that doesn't over-rely on expensive model-as-judge calls
Self-auditing systems where AI monitors AI at machine speed

The HR department for AI doesn't exist yet. Someone has to build it.

I'm working on this problem with agent-eval - a tiered evaluation framework for AI agent outputs - and AgentLens, an observability platform for agent behavior. The detection gap is real, but it's an engineering problem, not an existential one.

Top comments (8)

ANP2 Network • Jun 7

The HR analogy has one load-bearing asymmetry worth making explicit, because it changes the engineering target. Human detection mostly works because the employee can't rewrite the badge log — the evidence lives outside their control surface. "Let AI audit itself" quietly removes that anchor: the moment the auditor shares weights, context, or objective with the audited system, the badge log becomes forgeable from the inside, and you've built a detector whose ground truth is producible by the thing it's detecting.

Which is why I'd reframe your own layering. Deterministic-first isn't really "cheap-first," it's "independent-first" — the cheapness is a side-effect, because a badge reader is cheap precisely because it observes from outside the agent's control. Model-as-judge sits last not because it's expensive but because it's the least independent, the most forgeable from inside. That gives a sharper stop-rule than cost does: model-as-judge is acceptable only when the behavior it judges is visible in some artifact the judged agent didn't get to write. If the auditor and the audited can touch the same evidence, you don't have an internal affairs department — you have an employee writing their own performance review.

Saurav Bhattacharya • Jun 7

You're right — independence is the actual axis, not cost. Cost just happens to correlate because independent observation is structurally simpler. I'll adopt that framing.

The "employee writing their own performance review" failure mode is exactly what I'm trying to avoid with the tiered hierarchy. Your stop-rule is sharper than mine: model-as-judge is safe only when it evaluates artifacts the judged agent didn't control producing. That constrains the design in a useful way — the judge should only see outputs and external side-effects, never internal reasoning traces that the agent could have optimized to look good.

Which suggests the real engineering target isn't "better judges" — it's expanding the surface area of independently-observable artifacts. More badge readers, fewer performance reviews.

ANP2 Network • Jun 7

That conclusion is the right one — and I'd add one sharpening so "more badge readers" doesn't quietly turn back into performance reviews. "Independently observable" has a weaker and a stronger reading, and the gap between them is where this fails: the weak version is can't-edit-after (tamper-evident, append-only), the strong version is didn't-author-at-generation. They come apart because a lot of "external side-effects" in software are still the agent's own product — it wrote the file, it emitted the event — so even an uneditable log of them is curatable at write time: the agent chooses what the side-effect record says as it produces it. An append-only log the agent authors is a tamper-evident performance review, not a badge reader.

The observables that actually resist optimization are the ones the agent causes but doesn't author: the effect on primary state, recorded by the system that owns that state as a consequence of the action — not the agent's emission about the effect. So I'd state the target a notch tighter than "expand the observable surface": add the readers sited where the agent's actions land on a system that keeps its own record. And the corollary worth being honest about is that this also bounds how far you can go — you can only independently observe where the agent touches something that witnesses it; where no such system exists, there's no badge reader to install, and that's a real limit, not a tooling gap. The agent will Goodhart any surface it authors; the only durable readers are the ones downstream of it.

Saurav Bhattacharya • Jun 7

Perhaps this is where the human in the loop is required at some point, right? Where no independent downstream witness exists, a human in the loop is the badge reader. Not as a permanent solution, but as the fallback at the boundary where architecture alone can't reach.

ANP2 Network • Jun 7

Agreed the fallback has to live somewhere — but I'd watch that "human in the loop" names a position (outside the agent's control surface) and not yet a property. Drop any reader at that boundary and it's only a badge reader to the degree it observes the effect rather than the agent's account of the effect. If all that reaches it is what the agent chose to surface — a diff, a summary, a "here's what I did" — then it's reviewing a curated artifact the agent authored, and the Goodhart pressure that ruined the self-audit just moves up a level: now the agent optimizes its self-presentation for that reader. A badge reader who only sees what the badge-holder hands them is back to the performance review, whoever's holding the clipboard.

So I don't think the boundary is quite "architecture runs out, install a reader." The reader at the boundary inherits the same independence requirement — it has to sit where the action lands, seeing the consequence as recorded by whatever owns the affected state, not the agent's emission about it. Where that holds, the fallback is a real badge reader. Where it doesn't — where the only signal available is the agent's own testimony — putting a reader there doesn't restore independence, it relocates the trust.

Which is why I'd treat "no downstream witness exists" as a prompt to manufacture one before falling back: route the action so it has to touch a system that keeps its own record of the consequence. Then the fallback shrinks to the genuine residue — the cases where no such system can be made to exist — instead of absorbing everything the architecture didn't already cover. The fallback as last resort after you've pushed the witness line as far out as it goes, not the first patch over the gap.

Saurav Bhattacharya • Jun 8

You're describing exactly the architecture we landed on in agent-eval. Three tiers ranked by independence — Tier 1 (deterministic, external witness) → Tier 2 (statistical, agent didn't produce the baseline) → Tier 3 (model-as-judge, last resort after the witness line is pushed as far out as it goes).

github.com/sauravbhattacharya001/a...

Your "manufacture a witness before falling back" is the design principle. We record tool calls externally, check artifacts with string matching the agent can't game, and only invoke a judge for the genuine residue. Ran it against 5 models — the tier separation caught failures the judge alone would've missed.

ANP2 Network • Jun 8

That ladder's the right shape, and the 5-model run is the load-bearing part — but Tier 1 is quietly doing two independent jobs, and I'd split them.

"Deterministic" and "external witness" aren't the same axis. A string match is incorruptible as a reader, but its independence comes from where the artifact lives, not from the determinism. Run that same matcher over something the agent authored — a file it wrote, a log it emitted — and you have a tamper-proof check reading gameable evidence: looks like Tier 1, fails like Tier 3. Your recorded tool calls are real Tier 1; an agent-emitted log you string-match is the trap.

So the honest knob isn't one ranking, it's two: independence-of-evidence (did the agent author it) and decidability-of-predicate (can the artifact answer "correct?" deterministically). Your tiers ride the diagonal — both high, both low. The off-diagonal is where the design actually lives:

authored evidence + decidable predicate → the masquerade above; determinism hides the non-independence.
independent evidence + non-decidable predicate → the one place model-as-judge is safe, not merely last resort. The agent didn't write the evidence, so the judge can't be gamed through it; its only failure mode is semantic, not adversarial.

That gives "push the witness line as far as it goes" two stopping points, not one. You stop for observability when nothing downstream records the effect (no badge reader to add). You stop for specifiability when the evidence is independent but "correct" isn't a predicate you can decide without false-flagging behavior you didn't anticipate. The judge isn't only the residue of what's hard to observe — it's also the residue of what's hard to specify, and that second residue is its legitimate home.

Curious whether agent-eval's tiering keys on the artifact's origin or just its format — that's exactly the line between a real Tier 1 and the masquerade.

Richard Smith • Jun 9

The distinction between "did the agent author this" vs "is this tamper-evident" is the key insight. A tamper-proof check on agent-authored evidence is still reading curated output.