If your LLM-as-judge calibration kappa moves around week to week and you cannot explain it from labeller behavior, the usual cause is the marginal distribution of your calibration set, not the labellers.
Quick refresher. Cohen's kappa is:
kappa = (Po - Pe) / (1 - Pe)
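where Po is the observed agreement and Pe is the agreement expected by chance. Pe is built from the two raters' marginal label distributions: for each class, multiply the share of items rater A gave that class by the share rater B gave it, then sum over classes. So Pe depends on the label mix in your set, not only on how well the raters agree.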
If labeller A marked 70% of last week's traces "acceptable", 25% "good", and 5% "bad", Pe takes one value. If this week's mix is 50/40/10, Pe takes another. The labellers can be doing exactly the same thing and your kappa still moves.
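To make the arithmetic concrete, here is a minimal sketch using the 70/25/5 and 50/40/10 mixes above. The `CONDITIONAL` table is an assumption invented for illustration: it fixes how the judge responds to each human label (85% agreement for every class), so observed agreement cannot change between weeks and any movement in kappa comes from Pe alone.

```python
# Illustrative only: CONDITIONAL is made-up behavior, not measured data.
CLASSES = ["acceptable", "good", "bad"]

# Hypothetical fixed behavior: P(judge label | human label), one row per
# human label, columns in CLASSES order. Every class is agreed on 85% of
# the time, so Po is 0.85 whatever the label mix is.
CONDITIONAL = [
    [0.85, 0.10, 0.05],  # human said "acceptable"
    [0.10, 0.85, 0.05],  # human said "good"
    [0.10, 0.05, 0.85],  # human said "bad"
]


def kappa_for_mix(human_marginals, conditional=CONDITIONAL):
    """Return (Po, Pe, kappa) for a given human label mix."""
    n = len(human_marginals)
    # Joint distribution over (human label, judge label).
    joint = [
        [human_marginals[i] * conditional[i][j] for j in range(n)]
        for i in range(n)
    ]
    po = sum(joint[k][k] for k in range(n))
    # Judge marginals are the column sums of the joint table.
    judge_marginals = [sum(joint[i][j] for i in range(n)) for j in range(n)]
    pe = sum(human_marginals[k] * judge_marginals[k] for k in range(n))
    return po, pe, (po - pe) / (1 - pe)


last_week = kappa_for_mix([0.70, 0.25, 0.05])
this_week = kappa_for_mix([0.50, 0.40, 0.10])

for name, (po, pe, kappa) in [("last week", last_week), ("this week", this_week)]:
    print(f"{name}: Po={po:.3f}  Pe={pe:.3f}  kappa={kappa:.3f}")
```

Under these assumed numbers Po stays at 0.85 in both weeks, while kappa goes from roughly 0.69 to roughly 0.75 purely because the mix changed.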
Three things that help:
Sample your calibration set across multiple time windows, for example a rolling 4-week window stratified by time bucket. This reduces the chance that one week's traffic pattern dominates Pe.
Report per-class precision and recall alongside kappa. Kappa is one summary number; the per-class metrics tell you where the labeller-LLM disagreement actually sits.
For very small calibration sets (under 100 traces), report Wilson confidence intervals around the per-class precision rather than relying on kappa as a point estimate. The Wilson interval behaves sensibly at small n and for proportions near 0 or 1; the normal-approximation (Wald) interval does not, and can extend outside [0, 1].
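The per-class metrics and the Wilson interval can be sketched together as below. This assumes the human labels are treated as the reference and the judge labels as the prediction; the function names and the toy label lists are hypothetical, and z = 1.96 corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def per_class_report(human, judge, classes):
    """Per-class precision and recall of judge labels against human labels."""
    report = {}
    for c in classes:
        tp = sum(1 for h, j in zip(human, judge) if h == c and j == c)
        predicted = sum(1 for j in judge if j == c)  # judge said c
        actual = sum(1 for h in human if h == c)     # human said c
        report[c] = {
            "precision": tp / predicted if predicted else None,
            "precision_ci": wilson_interval(tp, predicted),
            "recall": tp / actual if actual else None,
            "recall_ci": wilson_interval(tp, actual),
        }
    return report


# Toy labels, made up for illustration.
human = ["good", "good", "bad", "acceptable", "acceptable", "acceptable", "good", "bad"]
judge = ["good", "acceptable", "bad", "acceptable", "acceptable", "good", "good", "bad"]

report = per_class_report(human, judge, ["acceptable", "good", "bad"])
for cls, row in report.items():
    lo, hi = row["precision_ci"]
    print(f"{cls}: precision={row['precision']:.2f}  95% Wilson CI=({lo:.2f}, {hi:.2f})")
```

In this toy data the "bad" class is 2 for 2, so its precision reads 1.0, but the Wilson lower bound is only about 0.34. That gap is the reason to show the interval next to the point estimate on small sets.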
The kappa definition and the small-sample interval math are in Cohen (1960), "A coefficient of agreement for nominal scales", and Wilson (1927), "Probable inference, the law of succession, and statistical inference". Both are short reads.
Top comments (1)
This is the trap that eats LLM-as-judge calibration and almost nobody attributes it correctly — they go hunting for labeller behavior changes when it's just Pe moving under them. The base-rate sensitivity of kappa is brutal precisely when your traffic mix is shifting, which is always.
One addition to your three fixes: alongside rolling stratified sampling, it's worth holding a fixed stratified calibration set so the marginals are stable by construction — then weekly movement actually means something instead of being an artifact of this week's traffic. You lose some freshness but gain a metric you can reason about. (PABAK, or just raw per-class agreement, is a handy sanity cross-check against kappa too.)
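A rough sketch of that cross-check, assuming two plain label lists of equal length. PABAK is usually stated for two classes as 2*Po - 1; the version below uses the k-class form (k*Po - 1) / (k - 1), often attributed to Brennan and Prediger, which replaces Pe with a fixed 1/k. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def observed_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement comes from the raters' marginals."""
    n = len(a)
    po = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    pe = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if pe == 1.0:
        return float("nan")  # both raters used a single identical label
    return (po - pe) / (1 - pe)


def pabak(a, b, num_classes):
    """Prevalence- and bias-adjusted kappa: chance agreement fixed at 1/k."""
    po = observed_agreement(a, b)
    return (num_classes * po - 1) / (num_classes - 1)


# Toy skewed set: 8 of 10 items agree, and both disagreements involve the rare class.
human = ["ok"] * 8 + ["ok", "bad"]
judge = ["ok"] * 8 + ["bad", "ok"]

print(f"Po    = {observed_agreement(human, judge):.2f}")
print(f"kappa = {cohen_kappa(human, judge):.2f}")
print(f"PABAK = {pabak(human, judge, num_classes=2):.2f}")
```

With these toy labels raw agreement is 0.80, kappa comes out slightly negative (about -0.11) because the skewed marginals push Pe to 0.82, and PABAK is 0.60. When the two numbers diverge like this, the marginals are probably doing most of the work.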
The other source of "drift you can't explain from labellers" that bit us: the judge model silently updating. If your kappa steps on a date that matches a provider model rev, it's not your set or your labellers — it's the judge. Are you pinning the judge model version across calibration windows, or letting it float?