DEV Community

How Model Distillation Actually Works (and What the 'China Distilled Our Model' Headlines Really Mean)

Sergei Parfenov on May 29, 2026

Every few weeks a headline drops: "Chinese lab distilled a frontier model from OpenAI / Anthropic." Cue the comments — half the thread thinks disti...

xulingfeng • May 31

The bit about "China distilled our model" headlines is spot on: most people don't realize distillation is just a training technique, not theft. We use distilled models (DeepSeek V4 Flash) as our daily driver, and the cost difference vs the full-fat version is roughly 20x.

One thing I would add: distillation doesn't just shrink the model, it also forces you to confront which capabilities you actually need. We found that our test automation workflows only need about 60% of the teacher model's capability space. Have you seen a systematic way to figure out the minimum viable capability set before starting the distillation?

Sergei Parfenov • Jun 2

Great point about the 60% — that reframing (distillation as a forcing function for "what do we actually need") is honestly the part most write-ups miss, so thanks for adding it.
On your question: I haven't seen a clean canonical method for nailing the minimum viable capability set up front, and I'm a little skeptical one exists, because "capability space" isn't something you can measure directly before you have a student to test. What does work in practice is flipping it from a design problem into an eval problem:

Build the eval set before the dataset. Pull real production traces (your test-automation workflows in this case) and turn them into a graded eval suite — ideally bucketed by capability (reasoning, tool-calling, format adherence, edge cases). This becomes your definition of "60%" instead of a guess.
Bootstrap a student fast and measure the gap per bucket. A cheap first SFT pass on teacher outputs tells you where the student already clears the bar vs where it collapses. The buckets it passes are capabilities you didn't need to over-invest in; the failures are your real target set.
Close the gap with weakness-driven data, not more data. This is where active-learning-style distillation helps — analyze the student's failures, then have the teacher synthesize examples specifically targeting those, rather than generating a giant undifferentiated corpus. There's a line of work (EvoKD and similar) formalizing exactly this loop: evaluate student → identify weaknesses → teacher generates targeted samples → repeat.
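
The first two steps can be sketched as a small per-bucket report. This is a minimal illustration, not a standard tool: the bucket names, the `(bucket, passed)` result format, and the 0.8 bar are all assumptions made up for the example.

```python
from collections import defaultdict


def pass_rate_by_bucket(results):
    """results: iterable of (bucket, passed) pairs from a graded eval run."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for bucket, passed in results:
        totals[bucket] += 1
        passes[bucket] += int(passed)
    return {bucket: passes[bucket] / totals[bucket] for bucket in totals}


def weak_buckets(results, bar):
    """Buckets whose pass rate is below the bar, worst first."""
    rates = pass_rate_by_bucket(results)
    failing = [(bucket, rate) for bucket, rate in rates.items() if rate < bar]
    return sorted(failing, key=lambda item: item[1])


# Hypothetical graded results for a quickly bootstrapped student.
student_results = [
    ("reasoning", True), ("reasoning", True),
    ("reasoning", True), ("reasoning", False),
    ("tool_calling", True), ("tool_calling", True),
    ("format_adherence", True), ("format_adherence", False),
    ("format_adherence", False), ("format_adherence", False),
]

for bucket, rate in weak_buckets(student_results, bar=0.8):
    print(f"{bucket}: {rate:.0%} pass rate, below the bar")
```

Buckets that clear the bar drop out of the list; what remains is the target set for the next round of teacher-generated data.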

So the MVC set isn't something you derive in advance — it kind of emerges from the eval buckets your tasks actually exercise. The discipline is front-loading a good, capability-bucketed eval; everything downstream falls out of it.
Curious how you arrived at your 60% number — was that from an eval suite, or more from observing which teacher behaviors never fired in production? The "never fired" signal is underrated for this.

Harjot Singh • May 31

The "separate the engineering from the geopolitics" framing is the public service here, because the headline panic obscures how mundane and useful distillation is. The part worth amplifying for builders: distillation isn't just a frontier-lab arms-race thing, it's one of the highest-leverage cost moves available to a regular product team. Once you have a stable task running on a big expensive model, that model's outputs are a labeled dataset, and distilling a small student for that specific task turns a recurring frontier bill into near-zero inference. You don't need to distill a whole frontier model; you distill the one capability you actually use. The narrower controversy you point at (terms-of-service on training against another model's outputs) is the real story, and it's a legal/contract question, not a "this is magic theft" one. Practically: distill your own traffic, not someone else's model, and the whole controversy evaporates. This expensive-teacher-to-cheap-specialist economics is exactly how I think about cost in Moonshift. In your experience, where does the student start failing: on the long tail of rare cases the teacher handled and the small model never saw enough of?

Sergei Parfenov • Jun 2

Exactly — "distill your own traffic, not someone else's model" is the whole thing in one line.
On where the student starts failing: in my experience it's rarely a smooth degradation, it's three fairly distinct failure modes, and the long tail is only one of them.

The long tail you named — rare intents the student saw too few times. This one's the least scary because it's measurable: it shows up as a frequency cliff in your eval buckets, and you can buy it back by having the teacher over-generate synthetic examples for the sparse intents. It's a data-coverage problem, not a capability problem.
Compositional / multi-step reasoning — this is the one that bites hardest. The student often handles each step fine in isolation but falls apart when a task chains 4-5 of them, because it learned the surface form of the teacher's answers without the latent reasoning that produced them. Black-box distillation makes this worse: you're training on the teacher's output text, not the reasoning trace, so the student mimics the destination without the path. Distilling CoT traces instead of just final answers helps a lot here.
Calibration on the boundary — the student gets overconfident exactly where the teacher would have hedged or said "I'm not sure." The teacher's uncertainty lived in its soft distribution, which you never saw through the API. So the student fails silently — wrong but confident — which in production is more dangerous than the long-tail misses you can at least detect.

Rough rule I've landed on: the long tail you fix with data, compositional failures you fix with better targets (traces, not answers), and calibration you mostly can't fix from a black-box teacher — you manage it with a confidence threshold and a fallback to the teacher for low-confidence cases. That hybrid (cheap student for the 90%, expensive teacher as fallback) tends to beat trying to distill the last 10% into the student.
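
A minimal sketch of that hybrid routing, under stated assumptions: the student is assumed to return an `(answer, confidence)` pair, the 0.8 threshold is arbitrary, and `toy_student` / `toy_teacher` are hypothetical stand-ins for real model calls.

```python
from typing import Callable, Tuple

# Assumption: the student returns (answer, confidence in [0, 1]). How that
# confidence is computed (for example from the student's own token
# probabilities) is left open here.
Student = Callable[[str], Tuple[str, float]]
Teacher = Callable[[str], str]


def route(prompt: str, student: Student, teacher: Teacher,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, source); fall back to the teacher below the threshold."""
    answer, confidence = student(prompt)
    if confidence >= threshold:
        return answer, "student"
    return teacher(prompt), "teacher"


# Hypothetical stand-ins for real model calls.
def toy_student(prompt: str) -> Tuple[str, float]:
    known = {"ping": ("pong", 0.97)}
    return known.get(prompt, ("not sure", 0.30))


def toy_teacher(prompt: str) -> str:
    return f"teacher answer for: {prompt}"


print(route("ping", toy_student, toy_teacher))
print(route("rare case", toy_student, toy_teacher))
```

The threshold only helps if the student's confidence is reasonably calibrated, so it is worth picking it from eval data rather than by feel.
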
Where does it break for you in Moonshift — long tail, or more the compositional stuff?

VoltageGPU • Jun 1

Interesting take on distillation — it's reassuring to see a clear breakdown of the practical limits when you can't access the teacher model's internal states. From an infrastructure perspective, when working with VoltageGPU, we've seen how distillation can help reduce inference costs without sacrificing too much accuracy, but it's definitely a trade-off. The China controversy highlights how hard it is to prove or disprove model provenance when training data and architecture are opaque.

Sergei Parfenov • Jun 2

Thanks! Provenance is the genuinely hard part — when weights and training data are closed, behavioral fingerprinting (self-identification slips, shared quirks) is about all you've got, and it's circumstantial at best. That's exactly why the China cases stay in "allegation" territory rather than getting settled.

xulingfeng • Jun 2

This is incredibly helpful — the "flip it to an eval problem" framing clicked immediately. We've been doing something similar informally (grabbing production traces, running them against candidate models) but never bucketed by capability. That bucket approach would have saved us from a few wrong turns where we optimized for reasoning the student already had while ignoring format-adherence gaps.

The EvoKD loop you mentioned (evaluate → identify weaknesses → teacher synthesizes targeted examples) is exactly what I want to try next. We're stuck at the "undifferentiated corpus" phase and feeling the diminishing returns. Have you seen any practical EvoKD implementations that work well with black-box API teachers where you don't have logit access? That's our constraint — using DeepSeek/Claude APIs as the teacher.

Sergei Parfenov • Jun 2

glad the bucketing landed — the format-adherence-vs-reasoning split is exactly the kind of thing that hides inside an aggregate score, so good that it surfaced for u.

on EvoKD-style loops with a black-box teacher — short answer, the classic EvoKD framing assumes u can probe the teacher freely, but the weakness-targeting half of the loop works fine black-box, u just lose the logit-level signal and do everything at the text level. the part that doesnt transfer is soft-label matching. what u keep: eval student → cluster the failures → prompt the teacher to synthesize examples targeting those clusters → SFT → repeat. no logits needed anywhere, all sequence-level.
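
a rough sketch of one round of that text-level loop. this is not EvoKD's actual code: every callable here (`grade`, `cluster`, `synthesize`, `fine_tune`) is a hypothetical hook u would implement urself, and the toy versions below only exist to make the control flow runnable.

```python
from collections import defaultdict


def distillation_round(student, eval_items, grade, cluster, synthesize,
                       fine_tune, per_cluster=20):
    """One pass: evaluate, cluster failures, synthesize targeted data, SFT."""
    failures = [item for item in eval_items
                if not grade(item, student(item["prompt"]))]
    if not failures:
        return student, []
    new_examples = []
    for label, items in cluster(failures).items():
        new_examples.extend(synthesize(label, items, per_cluster))
    return fine_tune(student, new_examples), new_examples


# --- Toy stand-ins, all hypothetical ---
memory = {"2+2": "4"}  # pretend this dict is the student's weights


def toy_student(prompt):
    return memory.get(prompt, "")


def toy_grade(item, output):
    return output == item["answer"]


def toy_cluster(failures):
    groups = defaultdict(list)
    for item in failures:
        groups[item["bucket"]].append(item)
    return dict(groups)


def toy_synthesize(label, items, n):
    # A real version would prompt the teacher API for fresh examples that
    # resemble these failures; here the failures are simply echoed back.
    return [{"prompt": it["prompt"], "answer": it["answer"]} for it in items[:n]]


def toy_fine_tune(student, examples):
    # A real version would run an SFT job; here we just update the dict.
    memory.update({ex["prompt"]: ex["answer"] for ex in examples})
    return student


eval_items = [
    {"prompt": "2+2", "answer": "4", "bucket": "arithmetic"},
    {"prompt": "capital of France", "answer": "Paris", "bucket": "facts"},
]
```

in a real setup the synthesized examples must not be the eval items themselves, otherwise the next eval round just measures memorization.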

the thing actually worth ur time tho: theres a recent paper from microsoft, Generative Adversarial Distillation (GAD), nov 2025, built specifically for the black-box/API-teacher case with no logit access. instead of treating teacher outputs as fixed SFT labels (ur "undifferentiated corpus" problem), it trains a discriminator to tell student outputs apart from teacher outputs, and that discriminator becomes an on-policy reward model that co-evolves with the student. thats basically a learned, automatic version of "find the weaknesses": the discriminator is the weakness-finder, and it adapts as the student improves instead of u hand-bucketing every round. they report a Qwen2.5-14B student ending up comparable to its teacher, GPT-5-Chat. worth a read for ur exact constraint.

one caveat that matters for ur setup specifically: plain SeqKD students show higher n-gram overlap with the teacher but lower task scores — ur memorizing surface form, not capability. thats the diminishing-returns wall ur hitting. the adversarial/on-policy approaches exist precisely to break past it. so ur instinct that the undifferentiated corpus is the problem is dead on — its not that u need more data, its that flat SFT caps out.
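
one cheap way to watch for this is to track surface overlap next to ur task score. a minimal sketch, assuming whitespace tokenization and bigrams; this is an illustrative metric, not necessarily the one used in the paper.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(student_text, teacher_text, n=2):
    """Fraction of the student's n-grams that also occur in the teacher text."""
    student_grams = ngrams(student_text.split(), n)
    teacher_grams = set(ngrams(teacher_text.split(), n))
    if not student_grams:
        return 0.0
    shared = sum(gram in teacher_grams for gram in student_grams)
    return shared / len(student_grams)


print(ngram_overlap("the cat sat", "the cat ran"))
```

if overlap keeps climbing across SFT rounds while the bucketed task score stays flat, that is the surface-form pattern described above.
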
(and the obvious one — ur using DeepSeek/Claude APIs as teacher, so just double-check the ToS on training competing models before u scale it, given the whole topic of the post lol.)

Mudassir Khan • Jun 2

the 'student is bounded by the teacher' framing is right for general capability but undersells the narrow task case. we've seen task-specific students outperform the teacher on the exact thing they were distilled for, because you're training on curated, filtered teacher outputs for your domain, not random samples. the ceiling moves.

the failure mode Harjot named is real too. the part we've found hardest: the student doesn't fail loudly, it fails confidently. same hallucination pattern as any undertrained model, except you didn't expect it because the teacher made the task look easy.

how are you evaluating student coverage before shipping to prod?

Sergei Parfenov • Jun 2

You're right, and that's a real correction — I overstated it. "Bounded by the teacher" holds for general capability, but on a narrow task it breaks, exactly for the reason you give: you're training on curated, domain-filtered teacher outputs, not the teacher's full noisy distribution. Strip the teacher's mistakes and off-domain hedging out of the training set and the student's ceiling on that slice can sit above the teacher's average behavior there. There's a line of work formalizing this — student beats teacher when its gain on the student-favored subdomain outweighs its deficit on the teacher-favored one. So "bounded" should really be "bounded in aggregate, not per-slice."
And "fails confidently, not loudly" is a better phrasing of the calibration problem than mine — that's the one that actually hurts in prod.
On evaluating coverage before shipping, what's worked for me:

Bucketed eval, not a single aggregate score. A 92% average hides the cliff. I split the eval set by intent/capability and look for the buckets where the student drops well below its own mean — that's where the silent failures live. The aggregate number is almost useless for ship/no-ship.
Disagreement sampling against the teacher. Run student and teacher on a large unlabeled production sample and surface where they diverge. You don't need labels for the whole thing — the disagreement set is small and is exactly where you should spend human review. Cheap way to find the confident-wrong cases before users do.
Confidence calibration check. Plot student confidence vs actual correctness on the eval set. If the high-confidence band isn't also high-accuracy, that's the "fails confidently" pattern showing up quantitatively — and it tells you where to set a fallback threshold.
Ship with a teacher fallback, not as all-or-nothing. Route low-confidence (or known-weak-bucket) cases to the teacher and let the student handle the rest. Lets you ship at lower coverage and ratchet up as you close gaps, instead of waiting for the student to clear 100%.
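
The calibration check can be done without a plot by binning. A minimal sketch, assuming each eval record is a `(confidence, correct)` pair and that five equal-width bins are fine for the purpose; the records below are made up.

```python
def calibration_table(records, n_bins=5):
    """records: iterable of (confidence in [0, 1], correct: bool).

    Returns rows of (bin_low, bin_high, count, accuracy) for non-empty bins.
    """
    bins = [[0, 0] for _ in range(n_bins)]  # [count, number correct]
    for confidence, correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index][0] += 1
        bins[index][1] += int(correct)
    table = []
    for i, (count, right) in enumerate(bins):
        if count:
            table.append((i / n_bins, (i + 1) / n_bins, count, right / count))
    return table


def overconfident_bins(table):
    """Bins whose accuracy is lower than the bin's own lower confidence edge."""
    return [row for row in table if row[3] < row[0]]


# Hypothetical eval records: (student confidence, was the answer correct).
records = [
    (0.95, True), (0.92, False), (0.91, False), (0.97, True),
    (0.65, True), (0.70, True),
]

for low, high, count, accuracy in calibration_table(records):
    print(f"confidence {low:.1f}-{high:.1f}: n={count}, accuracy={accuracy:.0%}")
```

In this made-up data the top confidence band is only half right, which is the "fails confidently" pattern; a fallback threshold set inside that band would not protect you.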

The disagreement-sampling one is the highest-leverage if you only do one — it finds the failures you didn't think to write an eval for.
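
A minimal sketch of disagreement sampling, assuming both models are plain callables from prompt to text and that a normalized exact match is a good enough notion of agreement. For free-form outputs you would swap in your own `same` function; the toy models are hypothetical.

```python
def disagreement_set(prompts, student, teacher, same=None):
    """Collect prompts where student and teacher outputs diverge."""
    if same is None:
        same = lambda a, b: a.strip().lower() == b.strip().lower()
    diverging = []
    for prompt in prompts:
        student_out = student(prompt)
        teacher_out = teacher(prompt)
        if not same(student_out, teacher_out):
            diverging.append(
                {"prompt": prompt, "student": student_out, "teacher": teacher_out}
            )
    return diverging


# Hypothetical stand-ins for real model calls on unlabeled traffic.
teacher_answers = {"q1": "approve", "q2": "reject", "q3": "needs review"}
student_answers = {"q1": "Approve", "q2": "approve", "q3": "needs review"}

review_queue = disagreement_set(
    ["q1", "q2", "q3"], student_answers.get, teacher_answers.get
)
print(review_queue)
```

Only the diverging prompts go to human review. Cases where both models agree and are both wrong stay invisible to this check, so it complements the labeled eval rather than replacing it.
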
Are you evaluating on a held-out slice of real traffic, or a synthetic eval set generated by the teacher? I've found teacher-generated evals flatter the student, since they share the same blind spots.

Mykola Kondratiuk • Jun 7

yeah this matters beyond headlines. when a vendor trains on your API output logs, that’s a contract/licensing question, not an ML technique story. conflating the two lets real vendor risk hide behind tech confusion.

Iuliia Fokina • May 30

Thank you for this!

xulingfeng • Jun 2

GAD paper recommendation is gold — the discriminator-as-weakness-finder framing clicks immediately. That Qwen2.5-14B ≈ GPT-5-Chat result is striking. Been reading it since your last reply and the on-policy co-evolution is exactly what our "undifferentiated corpus" setup is missing.

On the 60%: it came from the "never fired" signal, not an eval suite. We ran our test automation suite against both DeepSeek V4 Flash and a bigger teacher, then tracked which capabilities the extra budget never touched across ~2 weeks of real PR traffic. About 40% of the teacher's capability space was dead code for our workflow. The signal is noisy (sample window luck), but the headroom it freed up was real.

And noted on the ToS — we're distilling for internal test automation, not shipping a competing model, so the compliance angle should be clean. Thanks for the heads-up though.

Sergei Parfenov • Jun 3

nice, the "never fired" signal is a great way to derive it empirically — way better than guessing the capability set up front. and yeah GAD's discriminator is basically that same instinct automated: instead of u eyeballing what never fired, it learns the gap on-policy and keeps moving the target as the student closes it. fits ur setup almost too well.

one thing id watch on the 40% "dead code" though — "never fired in 2 weeks of PR traffic" is a coverage signal, not a capability signal, and those two can look identical. some of that 40% is genuinely dead for ur workflow (drop it, free win). but some of it is the long tail — capabilities that fire rarely but are expensive when u need them and dont have them. both show up as "never fired" in a 2-week window. the way i tell them apart isnt frequency, its cost of being wrong: a capability that fires 0.1% of the time but causes a bad merge when missing is not the same as one thats truly unused, even though the sample says they're equal.

cheap insurance: before u cut a capability from the student, ask "if this fires next month and the student cant do it, what breaks?" if the answer is "nothing much," drop it. if its "we ship a regression," keep it in the teacher-fallback path even if it never fired in ur window. costs u almost nothing and saves the one incident that wipes out all the headroom u gained.
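
that check fits in a few lines. a minimal sketch, assuming u can tag each capability with a fire count from ur window and a yes/no answer to "does something break if its missing"; the capability names are made up.

```python
def triage(capabilities):
    """capabilities: dict of name -> {"fired": int, "breaks_if_missing": bool}.

    Returns name -> "distill", "teacher_fallback", or "drop".
    """
    decisions = {}
    for name, info in capabilities.items():
        if info["fired"] > 0:
            decisions[name] = "distill"           # used in the window: keep in student
        elif info["breaks_if_missing"]:
            decisions[name] = "teacher_fallback"  # never fired, but costly when absent
        else:
            decisions[name] = "drop"              # never fired and harmless: free win
    return decisions


# Hypothetical capabilities observed over a traffic window.
observed = {
    "diff_summarization": {"fired": 412, "breaks_if_missing": True},
    "merge_conflict_reasoning": {"fired": 0, "breaks_if_missing": True},
    "poetry_generation": {"fired": 0, "breaks_if_missing": False},
}

print(triage(observed))
```
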
(and yeah, internal test automation vs shipping a competing model is a totally different ToS posture — agreed thats clean.)
good thread btw, this is the kind of back-and-forth i started the blog for. lmk how the GAD experiment goes.

xulingfeng • Jun 3

Good distinction. Frequency and cost-of-being-wrong aren't the same thing — we ran into this building MemBridge too. In testing, "never fired" is 90% dead code. But in a memory system there's a third category: stored but never retrieved yet. It's not dead, it just hasn't had its turn. Different shape from test coverage.

What we ended up doing: a hit counter on each memory entry. If it sits for N days with zero retrieval hits, then flag it as cleanable. Window-based cuts alone miss this — something stored yesterday has the same "never fired" profile as something dead for months.
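
A minimal sketch of that hit-counter rule. This is not MemBridge's actual code: the `MemoryEntry` class, its field names, and the 30-day value for N are hypothetical and only illustrate the logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryEntry:
    key: str
    stored_at: datetime
    hits: int = 0  # number of times this entry has been retrieved

    def record_hit(self):
        self.hits += 1


def cleanable(entries, now, max_idle_days):
    """Keys of entries with zero hits that have sat for at least N days."""
    cutoff = timedelta(days=max_idle_days)
    return [
        entry.key
        for entry in entries
        if entry.hits == 0 and now - entry.stored_at >= cutoff
    ]


now = datetime(2026, 6, 3)
entries = [
    MemoryEntry("old_unused", now - timedelta(days=40)),
    MemoryEntry("stored_yesterday", now - timedelta(days=1)),
    MemoryEntry("old_but_used", now - timedelta(days=40)),
]
entries[2].record_hit()

print(cleanable(entries, now, max_idle_days=30))
```

The entry stored yesterday has zero hits too, but it is not flagged because it has not yet sat for N days, which is the distinction a window-based cut alone misses.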

GAD is on the list. Will report back once I run it. Appreciate the pointer.