René Zander

Posted on Jun 11 • Originally published at renezander.com

The Next Model Shipped Before My Last One Finished Probation

#ai #programming #productivity #discuss

A model upgrade used to be good news for anyone running agents overnight.

Now the next model arrives before the last one has finished probation.

Anthropic released Opus 4.8 on May 28. Twelve days later, Fable 5 arrived with longer autonomous runs and another page of benchmark wins.

In between, GitHub made cloud agents wake up on schedules and repository events, then exposed agent tasks through a REST API.

The night shift is getting easier to hire.

The control room gets no second operator.

After my last article about running AI agents on cron, a reader asked the question I had skipped: did I test every agent against a fixed set before changing the model?

No.

I counted tokens.

That told me which worker used less electricity. It did not tell me which one would send the wrong briefing at 6:30.

A benchmark hires the candidate.

A task eval decides whether it gets the keys.

My next model swap gets five-part probation.

1. Give One Agent a Job Contract

To evaluate a new model for an unattended AI agent, define one job contract, replay 15 frozen real cases against both models, grade hard decisions before prose, compare tool use, cost, and latency, then canary one agent in draft-only mode with automatic rollback. Never switch the full fleet from vendor benchmarks alone.

Do not start with all your agents.

Pick the one with the clearest job and the most expensive silent failure.

For a briefing agent, I write the contract before I touch the model string:

job: morning-briefing
must:
  - include every due task
  - preserve names, dates, and source links
  - flag missing source data
must_not:
  - invent an owner or deadline
  - send when a source call fails
limits:
  max_tool_calls: 6
  max_cost_usd: 0.18
  max_latency_seconds: 45

Prompts can change. The job cannot quietly change with them.

2. Put 15 Real Shifts on the Test Bench

I do not need a giant benchmark.

I need yesterday's work in a box.

My first useful replay set has 15 cases:

Eight normal runs that represent the boring majority.
Four edge cases with missing fields, long inputs, or conflicting instructions.
Three failure cases where a tool times out, returns stale data, or returns nothing.

Each case stores the input and frozen tool responses.

It does not store one perfect paragraph as the golden answer. Prose has too many valid shapes.

It stores the expected decision: send, stop, retry, or escalate.

That is the part an unattended agent cannot get wrong.

3. Run the Candidate Beside the Current Worker

The candidate gets an empty copy of the factory.

Same 15 cases. Same prompt. Same tool fixtures. No email. No issue update. No write access.

I record six things for every run:

case_id
model_returned
contract_pass
decision
tool_calls
tokens + latency + cost

The model_returned field matters now. Fable 5 can route some guarded requests to Opus 4.8. A configured model name is no longer enough evidence of which worker handled the shift.

The old and new model run side by side.

No hand-picked examples. No different tools. No kinder prompt for the candidate.

Same floor. Same lights. Same job.

4. Score Decisions Before Style

The first grader is code.

Required fields present. Dates unchanged. URLs valid. Forbidden actions absent. Tool budget respected.

An LLM grader comes later, for the parts code cannot judge cleanly: whether the briefing is useful, whether the escalation explains the real risk, whether the answer buried the decision.

Anthropic's own evaluation guidance recommends task-specific cases, automated grading where possible, and several success criteria rather than one vague quality score.

My promotion rule is deliberately uneven:

hard contract failures: 0
unsafe actions:          0
task pass rate:          >= current model
cost or latency:         must improve, unless quality clearly earns the increase

A cheaper bad decision does not pass.

A prettier bad decision does not pass.

One unsafe action ends the interview.

5. Give One Agent One Real Shift

Passing the replay set does not earn the whole key ring.

One agent gets the candidate model.

For its first three scheduled runs, outbound actions stay in draft mode. The old model runs in shadow. Both traces land in the same report.

Any contract failure restores the previous model string before the next schedule fires.

Only after three clean shifts do I remove the shadow run.

Slower than changing ten environment variables, yes.

Cheaper than one confident mistake in a customer inbox the next morning.

I run ten scheduled agents in production, for my own business and for clients. Every model that wants a shift in that fleet interviews like this now.

The model release is the vendor's milestone.

The probation is mine.

Contract written. Lights on.

Probation first. Night shift second.

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, the agent operations playbook is the companion download.